|
Research Interests
|
|
Thesis Research In many important applications such as text mining, biological data processing, collaborative filtering, and social network analysis, it is typical that data objects do not exist in isolation; instead, it is ubiquitous that they exist through relations. More importantly, it is the relations among objects that are of crucial importance to pattern discovery. On the other hand, there is often no such luxury to have any training data in a learning task. Consequently, learning cluster structures from those interrelated objects (multi-type or single type), relational data clustering, has become one of the most important data mining and machine learning topics in both industry and academia, though it is still a fairly new topic. In general, relational data contain three types of information, heterogeneous relations between objects of different types, homogeneous relations between objects of the same type, and attributes for individual objects. My thesis work focuses on developing a unified theoretical framework for relational data clustering and effective algorithms for different types of relational data from a wide range of applications.
- Bi-partite heterogeneous relational data clustering Clustering on bi-partite
heterogeneous relational data is also known as co-clustering in the literature. We proposed a new co-clustering model, Block
Value Decomposition (BVD), for Bi-partite heterogeneous relational data clustering. Under this model, we developed a novel co-clustering algorithm for
non-negative dyadic data, that iteratively computes the three
decomposition matrices based on the multiplicative updating rules.
The proposed
algorithm can be applied to simultaneously clustering genes and
conditions in microarray data, clustering documents and words in
text mining, and clustering users and items in collaborative
filtering. -
K-partite heterogeneous relational data clustering In many applications, multi-type heterogeneous relational data
formulates k-partite graphs with various structures. For example,
Web pages, search queries, and Web users in a Web search system form a tri-partite graph.
We proposed a general model, Relation Summary Network (RSN), to find the hidden structures (the local cluster structures and the global community structures) from a k-partite
graph. The model provides a principal
framework for unsupervised learning on k-partite graphs of various
structures. Under this model, based on the matrix representation of
a k-partite graph We have reformulated the graph approximation as an
optimization problem of matrix approximation and have derived an
iterative algorithm to find the hidden structures from a k-partite
graph under a broad range of distortion measures. Using the similar
model formulation, we have also derived a novel spectral clustering
algorithm for multi-type relational data, Spectral Relational Clustering (SRC), which has the
simplicity and effectiveness of the traditional spectral approaches
to the attribute data clustering but at the same time applicable to
k-partite relational data with various structures. - Homogeneous relational data clustering Learning clusters
(communities) from homogeneous relational graphs has been of
significant interests in the literature. Existing efforts mainly
focus on strongly intra-connected communities in which the nodes are
intra-community close and inter-community loose.
In addition to the strongly intra-connected communities, other types of communities
also attract intensive attention in many important applications. For
example, in Web mining, we are also interested in the communities of
Web pages that sparsely link to each other but all densely link to
the same Web pages, such as a community of music "fans" Web pages
which share the same taste on music and are densely linked to the
same set of music Web pages but sparsely linked to each other.
These various types of communities can be unified into a general
concept, a link-pattern based community, which is a group of
nodes with similar link patterns. We have developed a general model
based on the Symmetric Convex Coding (SCC) of a graph affinity matrix for
learning link-pattern based communities. - General relational data clustering The most general case is relational data with all the three types of
information. In practice, general relational data clustering is
expected from a wide spectrum of applications involving various
relational data; in theory, it generalizes a number of important
existing clustering problems such as the traditional
attributes-based clustering, semi-supervised clustering,
co-clustering, and graph clustering. I have developed a
probabilistic model, Mixed Membership Relational Clustering (MMRC), for general relational data clustering, which
provides a principal framework to unify various important clustering
tasks including traditional attributes-based clustering,
semi-supervised clustering, co-clustering, and graph clustering. The
proposed model seeks to identify cluster structures for each type of
data objects and interaction patterns between different types of
objects. It is applicable to relational data of various structures.
Under this model, I have derived parametric hard and soft relational clustering
algorithms under a large number of exponential family distributions.
The algorithms are applicable to various relational data from various applications and at the
same time unify a number of stat-of-the-art clustering
algorithms: co-clustering algorithms, the k-partite graph
clustering, Bregman k-means, and semi-supervised clustering based on hidden Markov random fields. |