Research Interests

  • Data Mining,
  • Machine Learning,
  • Ranking and Search Relevance,
  • Relational Data Clustering and Comunity Generation,
  • Text Analysis ,
  • Bioinformatics,
  • Social Network Analysis,
  • Collaborative Filtering. 

Thesis Research

In many important applications such as text mining, biological data processing, collaborative filtering, and social network analysis, it is typical that data objects do not exist in isolation; instead, it is ubiquitous that they exist through relations. More importantly, it is the relations among objects that are of crucial importance to pattern discovery. On the other hand, there is often no such luxury to have any training data in a learning task. Consequently, learning cluster structures from those interrelated objects (multi-type or single type), relational data clustering, has become one of the most important data mining and machine learning topics in both industry and academia, though it is still a fairly new topic. In general, relational data contain three types of information, heterogeneous relations between objects of different types, homogeneous relations between objects of the same type, and attributes for individual objects. My thesis work focuses on developing a unified theoretical framework for relational data clustering and effective algorithms for different types of relational data from a wide range of applications.

- Bi-partite heterogeneous relational data clustering

Clustering on bi-partite heterogeneous relational data is also known as co-clustering in the literature. We proposed a new co-clustering model, Block Value Decomposition (BVD), for Bi-partite heterogeneous relational data clustering. Under this model, we developed a novel co-clustering algorithm for non-negative dyadic data, that iteratively computes the three decomposition matrices based on the multiplicative updating rules. The proposed algorithm can be applied to simultaneously clustering genes and conditions in microarray data, clustering documents and words in text mining, and clustering users and items in collaborative filtering.

- K-partite heterogeneous relational data clustering

In many applications, multi-type heterogeneous relational data formulates k-partite graphs with various structures. For example, Web pages, search queries, and Web users in a Web search system form a tri-partite graph. We proposed a general model, Relation Summary Network (RSN), to find the hidden structures (the local cluster structures and the global community structures) from a k-partite graph. The model provides a principal framework for unsupervised learning on k-partite graphs of various structures. Under this model, based on the matrix representation of a k-partite graph We have reformulated the graph approximation as an optimization problem of matrix approximation and have derived an iterative algorithm to find the hidden structures from a k-partite graph under a broad range of distortion measures. Using the similar model formulation, we have also derived a novel spectral clustering algorithm for multi-type relational data, Spectral Relational Clustering (SRC), which has the simplicity and effectiveness of the traditional spectral approaches to the attribute data clustering but at the same time applicable to k-partite relational data with various structures.

- Homogeneous relational data clustering

Learning clusters (communities) from homogeneous relational graphs has been of significant interests in the literature. Existing efforts mainly focus on strongly intra-connected communities in which the nodes are intra-community close and inter-community loose. In addition to the strongly intra-connected communities, other types of communities also attract intensive attention in many important applications. For example, in Web mining, we are also interested in the communities of Web pages that sparsely link to each other but all densely link to the same Web pages, such as a community of music "fans" Web pages which share the same taste on music and are densely linked to the same set of music Web pages but sparsely linked to each other. These various types of communities can be unified into a general concept, a link-pattern based community, which is a group of nodes with similar link patterns. We have developed a general model based on the Symmetric Convex Coding (SCC) of a graph affinity matrix for learning link-pattern based communities.

- General relational data clustering

The most general case is relational data with all the three types of information. In practice, general relational data clustering is expected from a wide spectrum of applications involving various relational data; in theory, it generalizes a number of important existing clustering problems such as the traditional attributes-based clustering, semi-supervised clustering, co-clustering, and graph clustering. I have developed a probabilistic model, Mixed Membership Relational Clustering (MMRC), for general relational data clustering, which provides a principal framework to unify various important clustering tasks including traditional attributes-based clustering, semi-supervised clustering, co-clustering, and graph clustering. The proposed model seeks to identify cluster structures for each type of data objects and interaction patterns between different types of objects. It is applicable to relational data of various structures. Under this model, I have derived parametric hard and soft relational clustering algorithms under a large number of exponential family distributions. The algorithms are applicable to various relational data from various applications and at the same time unify a number of stat-of-the-art clustering algorithms: co-clustering algorithms, the k-partite graph clustering, Bregman k-means, and semi-supervised clustering based on hidden Markov random fields.

Back to Bo Long's Homepage