Exploiting Multimodal Synergy for Large Scale and Diverse Image Retrieval in Digital Archives

Sponsor: National Science Foundation

This project is a four-year research and education program (2006 - 2010) focused on developing a revolutionary approach to large scale and diverse image retrieval in digital archives. Today almost all digital archives contain not only traditional structured data but also multimedia data, and with rapid advances in technology, multimedia data has become increasingly dominant in digital archives. Among the modalities of multimedia data, imagery is probably the most popular next to text. Consequently, image retrieval has become an important research area and is the focus of this project's effort toward effective and efficient Multimedia Information Retrieval (MIR) technologies in digital archives.

For this reason, image retrieval has been studied for over a decade as an emerging area called Content Based Image Retrieval (CBIR) and has become a major focus of MIR research. The current state of image retrieval research exhibits two notorious bottlenecks. (1) The semantic gap: the majority of existing methods retrieve images using low-level image features, yet it is well established that image features alone are usually insufficient for finding similar images, because of the gap between those features and the semantic concepts an image carries; directly representing and using semantic concepts in image retrieval has proven very difficult. (2) Scalability: existing methods are demonstrated only on very clean data sets (e.g., the Corel data) and very small data sets (typically below 10,000 images), for three reasons: (a) most proposed methods are not scalable in nature (e.g., linear search complexity); (b) beyond their complexity, many existing methods are sensitive to the diversity of image content and quality, which leads to experiments being reported only on very clean data such as the Corel collection; and (c) the image retrieval community does not yet have a standard benchmark collection comparable to those in the text retrieval community, so each research group typically uses data sets it has collected itself or shared with other groups, which are usually small in scale. Note that scalability here refers both to the diversity of image content and quality and to the size of the image databases.
This observation is supported by recent research in this area: it has been noted that the data sets used in most recent automatic image annotation and/or image retrieval systems fail to capture the difficulties inherent in many real image databases.
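To make the linear-search bottleneck concrete, the following sketch (an illustration only, not a method from this project) contrasts a naive O(n) scan over all feature vectors with a coarse-quantization index that narrows the candidate set before scoring. The grid-code scheme, the `cell` parameter, and all function names are assumptions made for the example.

```python
# Illustrative sketch: linear scan vs. a coarse grid index.
# Each image is reduced to a feature vector (a plain tuple here);
# `cell` controls the granularity of the coarse grid code.
from collections import defaultdict

def coarse_code(vec, cell=1.0):
    """Quantize a feature vector to a coarse grid cell identifier."""
    return tuple(int(x // cell) for x in vec)

def build_index(items, cell=1.0):
    """Map each grid code to the (id, vector) pairs falling in that cell."""
    index = defaultdict(list)
    for image_id, vec in items:
        index[coarse_code(vec, cell)].append((image_id, vec))
    return index

def sq_dist(a, b):
    """Squared Euclidean distance between two equal-length vectors."""
    return sum((x - y) ** 2 for x, y in zip(a, b))

def query_linear(items, qvec):
    """O(n) baseline: compare the query against every archived vector."""
    return min(items, key=lambda item: sq_dist(item[1], qvec))[0]

def query_bucketed(index, qvec, cell=1.0):
    """Scan only the query's own grid cell (may miss near neighbors in
    adjacent cells -- a deliberate simplification of the idea)."""
    candidates = index.get(coarse_code(qvec, cell), [])
    if not candidates:
        return None
    return min(candidates, key=lambda item: sq_dist(item[1], qvec))[0]
```

The bucketed query touches only the vectors sharing the query's coarse code, so its cost grows with the bucket size rather than the archive size; real systems refine this simple idea with overlapping or probed buckets to avoid missing neighbors near cell boundaries.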

On the other hand, imagery data often does not exist in isolation; typically, rich collateral information co-exists with the image data in many applications. Examples include the Web, many domain-archived image databases (in which images carry annotations), and even consumer photo collections. To reduce the semantic gap, multimodal approaches to image retrieval have recently been proposed in the literature that explicitly exploit the redundancy between the images and their collateral information. Besides improved retrieval accuracy, another benefit of the multimodal approaches is support for multiple query modalities: users may query image databases by image, by a collateral information modality (e.g., text), or by any combination of the two.
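As a hypothetical illustration of the multiple query modalities just described (not the method developed in this project), the sketch below fuses per-modality cosine similarities with a weighted sum; the weight `alpha`, the vector representations, and the `retrieve` helper are all assumptions made for the example.

```python
# Illustrative sketch: late-fusion multimodal retrieval. Each archived
# image carries a low-level visual feature vector and a bag-of-words
# annotation vector; a query may supply either modality or both.
import math

def cosine(a, b):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    na = math.sqrt(sum(x * x for x in a))
    nb = math.sqrt(sum(y * y for y in b))
    return dot / (na * nb) if na and nb else 0.0

def multimodal_score(query_visual, query_text, item_visual, item_text,
                     alpha=0.5):
    """Weighted late fusion; either query modality may be None."""
    scores, weights = [], []
    if query_visual is not None:
        scores.append(cosine(query_visual, item_visual))
        weights.append(alpha)
    if query_text is not None:
        scores.append(cosine(query_text, item_text))
        weights.append(1 - alpha)
    total = sum(weights)
    return sum(w * s for w, s in zip(weights, scores)) / total if total else 0.0

def retrieve(query_visual, query_text, archive, alpha=0.5, k=3):
    """Rank archive items (id, visual_vec, text_vec) by fused score."""
    ranked = sorted(
        archive,
        key=lambda it: multimodal_score(query_visual, query_text,
                                        it[1], it[2], alpha),
        reverse=True,
    )
    return [it[0] for it in ranked[:k]]
```

Because missing modalities simply drop out of the weighted sum, the same scoring function serves image-only, text-only, and combined queries.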

This project focuses on developing a novel multimodal approach to image retrieval that explicitly exploits the synergy among the multimodal data to address the two bottlenecks simultaneously. Ultimately, the project aims to revolutionize image retrieval research and to develop and advance proven, working technologies for large scale and diverse image retrieval in digital archives.

Specifically, as an integrated research and education program, this project pursues three objectives synergistically: (1) to develop a revolutionary theory and the related methodology as a multimodal approach to large scale and diverse image retrieval that addresses the semantic gap and scalability issues simultaneously; (2) to evaluate the theory and methodology extensively on truly large scale and diverse multimodal data; and (3) to develop and evaluate innovative community outreach activities through the project's existing research partnerships to further promote knowledge dissemination.

The intellectual merit of this project includes a fundamentally new understanding of image retrieval in the multimodal context, as well as the expected breakthrough in effective and efficient image retrieval, which shall advance the literature of CBIR and MIR and generate profound impact in related areas including pattern recognition, data mining, and computer vision.

The broader impact of this project is twofold. Educationally, the development, implementation, and evaluation of the innovative community outreach activities shall promote timely and effective dissemination of knowledge related to multimodal image retrieval and further enrich the pedagogical literature; the knowledge disseminated to the collaborating organizations, especially the non-profit ones, shall further advance and enhance their research and services to society. Technologically, the expected breakthrough in image retrieval shall usher in a new era of technological revolution in a wide range of applications, notably including Web search engines, digital libraries, and K-12 learning tools.

NSF Project Manager: Dr. Maria Zemankova

Project Personnel:

PI: Prof. Zhongfei (Mark) Zhang

PhD student:

Master student:


Code Release:

·         EMML code (based on the paper: Zhen Guo, Zhongfei (Mark) Zhang, Eric P. Xing, and Christos Faloutsos, Enhanced Max Margin Learning on Multimodal Data Mining in a Multimedia Database, Proc. of the 13th ACM International Conference on Knowledge Discovery and Data Mining, San Jose, CA, USA, August 2007)


This material is based upon work supported by the National Science Foundation under Award No. 0535162.

Any opinions, findings, and conclusions or recommendations expressed in this material are those of the author(s) and do not necessarily reflect the views of the National Science Foundation.

Go back to the Multimedia Computing Research Lab homepage