other information

  • Convex Approximation to Mixture Models
  • Problem Description

    Mixture models are a family of widely used statistic models in data mining, especially in the problem of clustering. Mixture model inference has been a challenging task for decades. The majority of the current solutions are non-convex ones: The objective functional to be optimized in most cases contains multiple local minimizers. Therefore finding the global optimal solution is infeasible.

    The mixture model inference procedure we introduced here has a natural convex formulation and therefore a global optimal solution can be garenteed. The corresponding algorithm outperforms the non-convex algorithms, such as the EM methods and the variational Bayesian methods, in terms of both accuracy and speed. The core part of the algorithm programmed in matlab is no more than 10 lines. We call the algorithm CAMS

    CAMS also provides theoretical complement to the affinity propagation technique discovered by Frey and Dueck.

    An example of clustering by CAMS

    Fig.1 (a)     3,000 samples of Gaussian mixtures Fig.1 (b)    20 clusters are automatically determined by CAMS. Data samples in each cluster form a tree structure during the clustering. The red dot in each cluster is the root sample or the 'exemplar' of that cluster.

    CAMS not only clusters data, but also produces a hierarchical relation among data within each cluster. The data samples within each cluster form a natural tree structure during the clustering process. This feature provides meaningful information especially for partitioning social networks.

    Graph partitioning examples by CAMS

  • Partitioning the collaboration network
  • (a) Fig.1     The top-10 sub-graphs with hierarchical relations by partitioning the collaboration network in computational geometry. The root vertex in each cluster is labeled with the corresponding scientist's name.

  • Partitioning the web linkage network
  • Fig.2 (a)     The network of political blogosphere of US 2004 election Fig.2 (b) Three clusters automatically determined by CAMS, corresponding to Democratic, Republican and neutral, respectively. The root blog for each cluster is labeled with the name of the corresponding blog

  • Partitioning the social network
  • Fig. 3 (a)  Co-appearance network of characters in the novel Les Miserables. Fig. 3 (b)  Who-support-who character relation trees obtained via CAMS

    Resource download

  • Matlab code download (tested under Matlab 2012a)
  • This matlab code is suitable for clustering vector data.

  • C++ code download (tested under linux)
  • This C++ code is suitable for large sparse network/graph data.

  • Methodology description download
  • This pdf file contians problem formulation, theoretical proofs, and various experimental results.