Data Intensive Computing for General Relational Data Learning

Sponsor: National Science Foundation

This project is a three-year integrated research and education program that develops novel parallel frameworks for a wide spectrum of state-of-the-art solutions to a series of fundamental problems in relational data learning. The PIs focus on unsupervised relational data community discovery and analysis, building on their existing strength in relational data mining and in parallel computation and scheduling. The technologies developed in this project are expected to have immediate applications with broad societal impact, such as social network analysis, biological information discovery, financial and economic development analysis and prediction, natural disaster prediction, and military intelligence analysis.

The world is full of data that are highly related and that involve diverse types of data objects, such as people, organizations, and events. Many applications aim to discover the hidden structures formed by relationships among different types of data objects, in addition to "clusters" of objects of the same type. At the same time, training data with ground truth are often unavailable for knowledge discovery. Consequently, these applications call for unsupervised relational data community discovery.

Unsupervised relational data learning typically involves a large collection of data objects, so its algorithms are computation-intensive. Making these algorithms scale to large collections of data calls for massively parallel solutions. Advances in data center technology make it possible and cost-effective to use hundreds of thousands of commodity machines for massively parallel, data-intensive computation. Yet the system architecture and the emerging parallel programming paradigms of data centers pose many challenges in designing parallel solutions.

The intellectual merit of this project includes a new understanding of the distributed implementation of a wide spectrum of state-of-the-art solutions to fundamental problems in relational data learning, as well as expected advances across the interdisciplinary research communities involved, including parallel computation and scheduling, data mining and machine learning, and pattern analysis.

The broader impacts include the societal benefit of the expected advances in parallel computing paradigms for general relational data learning, which can be deployed immediately in important applications such as social network analysis, biological information discovery, financial and economic development analysis and prediction, natural disaster prediction, and military intelligence analysis. The integrated community outreach component is intended to contribute to high school curricula specifically and to K-12 education nationwide in general.

With these motivations, this project pursues four objectives synergistically: (1) to develop novel theory and methodologies for a series of challenging, open problems in unsupervised general relational data community discovery; (2) to develop novel parallel computing paradigms tailored to that theory and those methodologies; (3) to evaluate the developed parallel computing paradigms extensively, in close collaboration with industrial collaborators, and to showcase the technology in real-world applications; and (4) to develop and evaluate an innovative community outreach component, in close collaboration with local high schools, that develops a curriculum for high school students' independent scientific research.

NSF Project Manager: Dr. Jie Yang

Project Personnel:

PI: Prof. Zhongfei (Mark) Zhang

PhD students:

NSF REU Students:

Partners:


Publications:

Code Release:

Data Release:

This material is based upon work supported by the National Science Foundation under Award No. 1017828.

Any opinions, findings, and conclusions or recommendations expressed in this material are those of the author(s) and do not necessarily reflect the views of the National Science Foundation.
