Determining Text Databases to Search in the Internet


Clement Yu*               Weiyi Meng**

*Department of EECS       **Department of Computer Science

University of Illinois    State University of New York

at Chicago                at Binghamton


Contact Information

Clement Yu                           Weiyi Meng

Department of EECS                   Department of Computer Science

University of Illinois at Chicago    State University of New York at Binghamton

Chicago, IL 60607                    Binghamton, NY 13902

Phone: (312) 996-2318                Phone: (607) 777-4311

Fax: (312) 413-0024                  Fax: (607) 777-4729

Email:                               Email:

URL:                                 URL:




List of Supported Students:

Zonghuan Wu, Fang Liu, Shuang Liu: Research Assistants


Project Award Information:

Award numbers: IIS-9902792, IIS-9902872

Duration: 10/1/1999 - 9/30/2002

Title: Determining Text Databases to Search in the Internet


Keywords: Internet resource discovery, distributed information retrieval, collection fusion, metasearch engine


Project Summary:

The main objective of this collaborative project is to develop theoretically rigorous yet practically applicable techniques for next-generation metasearch engines and to apply these techniques to build an operational metasearch engine. A metasearch engine is a system that provides unified access to multiple existing search engines. Upon receiving a query, the metasearch engine determines the appropriate search engines to invoke, the documents to retrieve from each invoked search engine, and finally the set of documents to be shown to the user. Methods that scale to a large number of search engines will be developed to accurately predict the usefulness of a search engine with respect to each query. A necessary and sufficient condition is developed for optimally ranking search engines in descending order of desirability with respect to a given query. Efficient solutions are proposed to satisfy this condition, including the incorporation of the link-derived importance of a page into the desirability measure. A method based on a dynamic concept hierarchy, which has the potential to improve the retrieval of actually useful documents (not just similar documents), is studied. Techniques will be investigated that guarantee the retrieval of all potentially useful documents from a local search engine while minimizing the retrieval of useless documents. Such a guarantee is highly desirable in applications that need to retrieve all potentially useful documents. The problem of merging results from multiple sources into a single ranked list will also be studied. The techniques to be developed are likely to require certain knowledge about each search engine. As such knowledge may be proprietary, a method that uses probing queries to discover the needed knowledge about each local search engine is explored.
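
As an illustration of ranking search engines in descending order of desirability for a query, the following minimal Python sketch may help. It is not the project's method: the per-engine summary (the largest normalized weight of each term in any document of the engine), the function names, and the sample data are all assumptions made for this example. The score it computes is only a simple upper-bound style estimate of how similar the best document of an engine could be to the query.

```python
from typing import Dict, List, Tuple

# Hypothetical per-engine summary (an assumption of this sketch): for each
# term, the largest normalized weight the term has in any document of the engine.
EngineSummary = Dict[str, float]


def estimate_desirability(query: Dict[str, float], summary: EngineSummary) -> float:
    """Illustrative estimate: sum of query weight times the engine's best term weight."""
    return sum(weight * summary.get(term, 0.0) for term, weight in query.items())


def rank_engines(query: Dict[str, float],
                 summaries: Dict[str, EngineSummary]) -> List[Tuple[str, float]]:
    """Return (engine name, estimate) pairs in descending order of the estimate."""
    scored = [(name, estimate_desirability(query, s)) for name, s in summaries.items()]
    return sorted(scored, key=lambda pair: pair[1], reverse=True)


# Made-up summaries for three hypothetical engines.
summaries = {
    "engine_a": {"metasearch": 0.9, "engine": 0.4},
    "engine_b": {"metasearch": 0.2, "ranking": 0.8},
    "engine_c": {"cooking": 0.7},
}
query = {"metasearch": 1.0, "ranking": 0.5}
ranking = rank_engines(query, summaries)
for name, score in ranking:
    print(name, round(score, 3))
```

A metasearch engine could invoke the engines in this order and stop once enough documents have been retrieved; how the estimate itself should be computed is exactly the kind of question the project studies.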


Publications and Products:

1.     W. Meng, C. Yu, and K. Liu. Detection of Heterogeneities in a Multiple Text Database Environment. Fourth IFCIS Conference on Cooperative Information Systems (CoopIS'99), Edinburgh, Scotland, September 1999, pp.22-33.

2.     C. Yu, W. Meng, K. Liu, W. Wu, and N. Rishe. Efficient and Effective Metasearch for a Large Number of Text Databases. ACM International Conference on Information and Knowledge Management, 1999, pp.217-224.

3.     W. Wang, W. Meng, and C. Yu. Concept Hierarchy Based Text Database Categorization in a Metasearch Engine Environment. 1st Int'l Conference on Web Information Systems Engineering, Hong Kong, 2000, pp.283-290.

4.     K. Liu, W. Meng, C. Yu, C. Zhang, and N. Rishe. Discovery of Similarity Computations of Search Engines. ACM International Conference on Information and Knowledge Management, Washington, D.C., November 2000, pp.290-297.

5.     Z. Wu, W. Meng, C. Yu, and Z. Li. Towards a Highly-Scalable and Effective Metasearch Engine. 10th World Wide Web Conference, May 2001 (to appear).

6.     C. Yu, W. Meng, W. Wu, and K. Liu. Efficient and Effective Metasearch for Text Databases Incorporating Linkages among Documents. ACM SIGMOD Conference, May 2001 (to appear).

7.     K. Liu, C. Yu, W. Meng, W. Wu, and N. Rishe. A Statistical Method for Estimating the Usefulness of Text Databases. IEEE Transactions on Knowledge and Data Engineering (to appear).

8.     C. Yu, K. Liu, W. Meng, Z. Wu, and N. Rishe. A Methodology to Retrieve Text Documents from Multiple Databases. IEEE Transactions on Knowledge and Data Engineering (to appear).

9. C. Yu, P. Sharma, W. Meng, and Y. Qin. Database Selection for Processing k Nearest Neighbors Queries in Distributed Environments. First ACM/IEEE Joint Conference on Digital Libraries, Roanoke, VA, June 2001 (to appear).

10. A prototype demo metasearch engine is created (


Project Impact:

Human Resources: One Ph.D. student and five M.S. students (two female) graduated; four Ph.D. students and several M.S. students, including some female students, are currently working on the project.

Education and Curriculum Development: Course materials based on this project have been used in two graduate courses (CS534: Web Data Management and EECS582: Information Retrieval).

Infrastructure: Four computers were purchased and installed in the database laboratories of the two investigators.

Tutorials and Seminars: Three tutorials were given at three international conferences (VLDB'99, ACM DL'99, WISE'00). Twelve seminars were given to companies and universities (NEC, Lockheed-Martin, SUNY-Binghamton, Hong Kong Univ., Chinese Univ. of Hong Kong, Hong Kong Univ. of Science and Technology, Database Society of China, Northwestern Univ., Int'l Conf. on Informational Society (Japan), Univ. of Tokyo, Univ. of Tsukuba (Japan), Univ. of Minnesota).


Goals, Objectives, and Targeted Activities

This three-year project is in the middle of its second year. We are on schedule to achieve all the goals planned for the project. In the last year, we made significant progress in the following areas.

1.    Developed highly scalable and effective methods to perform database selection and collection fusion.

2.    Developed a practical database categorization algorithm to help improve the retrieval of useful documents.

3.    Developed a method to incorporate the link-derived importance (PageRank) of a page into the solution.
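
To make the third item concrete, here is a minimal Python sketch of one way a link-derived importance value could be folded into a document's score. The weighted-sum form, the parameter `w`, and the sample values are assumptions for illustration only; the source does not state the formula the project uses. Both inputs are assumed to be normalized to the range [0, 1].

```python
from typing import List, Tuple


def combined_score(similarity: float, importance: float, w: float = 0.5) -> float:
    """Blend query similarity with link-derived importance (assumed weighted sum)."""
    if not 0.0 <= w <= 1.0:
        raise ValueError("w must lie in [0, 1]")
    return w * similarity + (1.0 - w) * importance


def rank_documents(docs: List[Tuple[str, float, float]], w: float = 0.5) -> List[str]:
    """Order document ids by the blended score, highest first."""
    ordered = sorted(docs, key=lambda d: combined_score(d[1], d[2], w), reverse=True)
    return [doc_id for doc_id, _, _ in ordered]


# Made-up (document id, similarity, normalized importance) triples.
docs = [("d1", 0.9, 0.1), ("d2", 0.6, 0.8), ("d3", 0.3, 0.2)]
by_blend = rank_documents(docs, w=0.5)       # importance counts as much as similarity
by_similarity = rank_documents(docs, w=1.0)  # importance ignored
print(by_blend, by_similarity)
```

With `w = 1.0` the ranking reduces to plain similarity ranking, so the parameter gives a simple way to compare the two behaviors on the same data.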


The following are the objectives during the final year of the project.

1.    Continue to improve the solutions developed in the first two years.

2.    Study the feedback process in distributed database environments and compare the performance of different strategies, including the use of pseudo relevance feedback.

3.    Implement a metasearch engine for over 100 searchable major US universities.

4.    Write a book on metasearch engines.


Project References (See


Area Background:

Many text sources are available on the Internet. Each text source usually has an associated search engine. These widely distributed search engines are highly heterogeneous. They may employ different techniques to represent and rank documents, and they usually provide access to different sets of documents of diverse interest. Frequently, the documents that satisfy a user's information need are stored in the databases of multiple local search engines. It is inconvenient and inefficient for an ordinary user to query multiple search engines and to identify useful documents among the results each one returns. As the number of search engines grows, so does the need for automatic search brokers (metasearch engines) that can invoke multiple search engines on the user's behalf. With a search broker, a user needs to submit only a single query to retrieve the desired documents.


There are a number of challenges in building an efficient and effective metasearch engine. Among them, the database selection problem is to identify text sources that are likely to return useful documents for a given query. The document selection problem is to determine which documents should be retrieved from each identified source. The result merging problem is to combine the documents returned from all identified sources into a single ranked list. This project aims to find good solutions to these problems and to build a prototype metasearch engine.
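
The three problems named above can be pictured as stages of one pipeline. The Python sketch below is a deliberately simple illustration under stated assumptions, not the project's solution: the usefulness estimates are supplied as given numbers, each search engine is a hypothetical in-memory function, and merging rescales each engine's local scores by that engine's top score before sorting.

```python
from typing import Callable, Dict, List, Tuple

Result = Tuple[str, float]                # (document id, local score)
SearchFn = Callable[[str], List[Result]]  # stand-in for a remote search engine


def select_databases(usefulness: Dict[str, float], k: int) -> List[str]:
    """Database selection: keep the k engines with the highest estimated usefulness."""
    return sorted(usefulness, key=lambda name: usefulness[name], reverse=True)[:k]


def select_documents(results: List[Result], m: int) -> List[Result]:
    """Document selection: keep the m highest-scoring documents of one engine."""
    return sorted(results, key=lambda r: r[1], reverse=True)[:m]


def merge_results(per_engine: Dict[str, List[Result]]) -> List[Result]:
    """Result merging: rescale each engine's scores by its top score, then sort."""
    merged: List[Result] = []
    for results in per_engine.values():
        top = max((score for _, score in results), default=0.0)
        for doc_id, score in results:
            merged.append((doc_id, score / top if top > 0 else 0.0))
    return sorted(merged, key=lambda r: r[1], reverse=True)


def metasearch(query: str, engines: Dict[str, SearchFn],
               usefulness: Dict[str, float], k: int, m: int) -> List[Result]:
    """Run the three stages in order for a single query."""
    chosen = select_databases(usefulness, k)
    per_engine = {name: select_documents(engines[name](query), m) for name in chosen}
    return merge_results(per_engine)


# Hypothetical engines whose local scores use different scales.
engines: Dict[str, SearchFn] = {
    "a": lambda q: [("a1", 10.0), ("a2", 5.0), ("a3", 1.0)],
    "b": lambda q: [("b1", 0.8), ("b2", 0.6)],
    "c": lambda q: [("c1", 3.0)],
}
usefulness = {"a": 0.9, "b": 0.7, "c": 0.1}  # assumed to come from a selection method
merged = metasearch("text databases", engines, usefulness, k=2, m=2)
print([doc_id for doc_id, _ in merged])
```

Rescaling by the top score is only one naive way to make scores from heterogeneous engines comparable; it ignores differences in how engines compute similarity, which is part of what makes result merging a research problem.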


Area References:

1.    J. Callan, Z. Lu, and W. Croft. Searching Distributed Collections with Inference Networks. ACM SIGIR Conference, 1995, pp.21-28.

2.    Digital Library Collaborative Working Groups. Resource Discovery in a Globally-Distributed Digital Library. Working Group Report, 1999 (

3.    D. Dreilinger, and A. Howe. Experiences with Selecting Search Engines Using Metasearch. ACM TOIS, 15(3), July 1997, pp.195-222.

4.    Y. Fan, and S. Gauch. Adaptive Agents for Information Gathering from Multiple, Distributed Information Sources. 1999 AAAI Symposium on Intelligent Agents in Cyberspace, Stanford University, March 1999.

5.    J. French, A. Powell, and C. Viles. Evaluating Database Selection Techniques: A Testbed and Experiment. ACM SIGIR Conference, 1998, pp.121-129.

6.    L. Gravano, C. Chang, H. Garcia-Molina, and A. Paepcke. STARTS: Stanford Proposal for Internet Meta-Searching. ACM SIGMOD, May 1997, pp.207-218.

7.    L. Gravano, and H. Garcia-Molina. Generalizing GlOSS to Vector-Space Databases and Broker Hierarchies. VLDB Conference, 1995.

8.    L. Gravano, and H. Garcia-Molina. Merging Ranks from Heterogeneous Internet Sources. VLDB Conference, 1997.

9.    E. Voorhees, N. Gupta, and B. Johnson-Laird. Learning Collection Fusion Strategy. ACM SIGIR Conference, Seattle, 1995, pp.172-179.

10.   J. Xu, and J. Callan. Effective Retrieval with Distributed Collections. ACM SIGIR Conference, Melbourne, Australia, 1998, pp.112-120.