Determining Text Databases to Search in the Internet


Clement Yu*                         Weiyi Meng**

*Department of Computer Science     **Department of Computer Science

University of Illinois              State University of New York

at Chicago                          at Binghamton


Contact Information

Clement Yu                          Weiyi Meng

Department of Computer Science      Department of Computer Science

University of Illinois at Chicago   State University of New York at Binghamton

Chicago, IL 60607                   Binghamton, NY 13902

Phone: (312) 996-2318               Phone: (607) 777-4311

Fax: (312) 413-0024                 Fax: (607) 777-4729

Email:                              Email:

URL:                                URL:




List of Supported Students:

Zonghuan Wu, Fang Liu, Shuang Liu: Research Assistants


Project Award Information:

Award numbers: IIS-9902792, IIS-9902872

Duration: 10/1/1999 - 9/30/2002

Title: Determining Text Databases to Search in the Internet


Keywords: Internet resource discovery, distributed information retrieval, collection fusion, metasearch engine


Project Summary:

The main objective of this collaborative project is to develop theoretically rigorous yet practically applicable techniques for building next-generation metasearch engines. A metasearch engine is a system that provides unified access to multiple existing search engines. Upon receiving a query, the metasearch engine determines the appropriate search engines to invoke, the documents to retrieve from each invoked search engine, and finally the set of documents to be shown to the user. Methods that scale to a large number of search engines will be developed to accurately predict the usefulness of a search engine with respect to each query. A necessary and sufficient condition is developed for optimally ranking search engines in descending order of desirability with respect to a given query. Efficient solutions are proposed to satisfy this condition, including the incorporation of the link-derived importance of a page into the desirability measure. A method based on a dynamic concept hierarchy, which has the potential to improve the retrieval of actually useful documents (not just similar documents), is studied. Techniques will be investigated that guarantee the retrieval of all potentially useful documents from a local search engine while minimizing the retrieval of useless documents. The problem of merging results from multiple sources into a single ranked list has also been studied. The proposed techniques require certain knowledge about each search engine, so a method that uses probing queries to discover the needed knowledge about each local search engine is explored.
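As a rough illustration of ranking search engines in descending order of desirability, the following Python sketch scores each engine from a small per-engine summary. The summary format (a map from each term to the largest weight of that term in any document of the engine), the scoring formula, and all names are assumptions made for this example; they are not the condition or the estimation methods developed in the project.

```python
from typing import Dict, List, Tuple

# Hypothetical per-engine summary: term -> largest weight of that term in any
# document indexed by the engine. This representation is an assumption made
# for illustration only.
Summary = Dict[str, float]


def estimate_desirability(query_terms: List[str], summary: Summary) -> float:
    """Crude estimate: sum the best available weight of each query term."""
    return sum(summary.get(term, 0.0) for term in query_terms)


def rank_engines(
    query_terms: List[str], summaries: Dict[str, Summary]
) -> List[Tuple[str, float]]:
    """Return (engine, score) pairs in descending order of estimated desirability."""
    scored = [
        (name, estimate_desirability(query_terms, summary))
        for name, summary in summaries.items()
    ]
    # Sort by score, highest first; break ties by name for a deterministic order.
    return sorted(scored, key=lambda pair: (-pair[1], pair[0]))


# Made-up summaries for three hypothetical engines.
summaries = {
    "engine_a": {"metasearch": 0.9, "engine": 0.4},
    "engine_b": {"metasearch": 0.2, "retrieval": 0.8},
    "engine_c": {"cooking": 0.7},
}
ranking = rank_engines(["metasearch", "engine"], summaries)
for name, score in ranking:
    print(name, round(score, 2))
```

A metasearch engine built along these lines would invoke only the top few engines in the ranking, which is what lets the approach scale as the number of engines grows.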


Publications and Products:

1.     W. Meng, C. Yu, and K. Liu. Detection of Heterogeneities in a Multiple Text Database Environment. Fourth IFCIS Conference on Cooperative Information Systems (CoopIS'99), Edinburgh, Scotland, September 1999, pp.22-33.

2.     C. Yu, W. Meng, K. Liu, W. Wu, and N. Rishe. Efficient and Effective Metasearch for a Large Number of Text Databases. ACM International Conference on Information and Knowledge Management, 1999, pp.217-224.

3.     W. Wang, W. Meng, and C. Yu. Concept Hierarchy Based Text Database Categorization in a Metasearch Engine Environment. 1st Int'l Conference on Web Information Systems Engineering, Hong Kong, 2000, pp.283-290.

4.     K. Liu, W. Meng, C. Yu, C. Zhang, and N. Rishe. Discovery of Similarity Computations of Search Engines. ACM International Conference on Information and Knowledge Management, Washington, D.C., November 2000, pp.290-297.

5.     Z. Wu, W. Meng, C. Yu, and Z. Li. Towards a Highly Scalable and Effective Metasearch Engine. 10th WWW Conference, Hong Kong, May 2001, pp.386-395.

6.     C. Yu, W. Meng, W. Wu, and K. Liu. Efficient and Effective Metasearch for Text Databases Incorporating Linkages among Documents. ACM SIGMOD Conference, May 2001, pp.187-198.

7.     C. Yu, P. Sharma, W. Meng, and Y. Qin. Database Selection for Processing k Nearest Neighbors Queries in Distributed Environments. First ACM/IEEE Joint Conference on Digital Libraries, Roanoke, VA, June 2001, pp.215-222.

8.     W. Meng, Z. Wu, C. Yu, and Z. Li. A Highly Scalable and Effective Method for Metasearch. ACM Transactions on Information Systems, 19(3), July 2001, pp.310-335.

9.     K. Liu, C. Yu, W. Meng, A. Santoso, and C. Zhang. Discovering the Representative of a Search Engine. Tenth ACM International Conference on Information and Knowledge Management, (poster paper), Atlanta, Georgia, November 2001, pp.577-579.

10. W. Meng, W. Wang, H. Sun, and C. Yu. Concept Hierarchy Based Text Database Categorization. International Journal on Knowledge and Information Systems, 4(2), March 2002, pp.132-150.

11. W. Meng, C. Yu, and K. Liu. Building Efficient and Effective Metasearch Engines. ACM Computing Surveys, 34(1), March 2002, pp.48-89.

12. K. Liu, C. Yu, W. Meng, W. Wu, and N. Rishe. A Statistical Method for Estimating the Usefulness of Text Databases. IEEE Transactions on Knowledge and Data Engineering (to appear).

13. C. Yu, K. Liu, W. Meng, Z. Wu, and N. Rishe. A Methodology to Retrieve Text Documents from Multiple Databases. IEEE Transactions on Knowledge and Data Engineering (to appear).

14. A prototype demo metasearch engine is created (


Project Impact:

Human Resources: One Ph.D. student and seven M.S. students (four female) have graduated; four Ph.D. students and several M.S. students, including some female and minority students, are currently working on the project.

Education and Curriculum Development: Course material based on this project has been used in two graduate courses (CS534: Web Data Management, CS582: Information Retrieval).

Infrastructure: Four computers were purchased and installed in the two investigators' database laboratories.

Tutorials and Seminars: Three tutorials were given at three international conferences (VLDB'99, ACM DL'99, WISE'00). Over ten seminars were given to companies and universities.


Goals, Objectives, and Targeted Activities

This three-year project is in the middle of its third year. We are on schedule to achieve all the goals planned for the project. Over the past year, we conducted research in the following areas:

1.    Improve database selection and collection fusion methods for longer queries.

2.    Study the impact of using user profiles for personalized text retrieval.

3.    Study database selection solutions for distributed top-N queries.

4.    Develop programs for implementing large-scale metasearch engines.


For the last six months of the project, we will continue our research on personalized search and distributed top-N query processing.


Project References (See


Area Background:

Many text sources are available on the Internet. Each text source usually has an associated search engine. These widely distributed search engines are highly heterogeneous: they may employ different techniques to represent and rank documents, and they usually provide access to different sets of documents of diverse interest. Frequently, the documents that satisfy a user's information needs are stored in the databases of multiple local search engines. As the number of search engines grows, so does the need for automatic search brokers (metasearch engines) that can invoke multiple search engines, because it is inconvenient and inefficient for an ordinary user to query multiple search engines and identify useful documents among the results they return. Through a search broker, a user needs to submit only a single query to retrieve the desired documents.


There are a number of challenges in building an efficient and effective metasearch engine. Among them, the database selection problem is to identify the text sources that are likely to return useful documents for a given query. The document selection problem is to determine which documents should be retrieved from each identified source. The result merging problem is to combine the documents returned from all identified sources into a single ranked list. This project aims to find good solutions to these problems and to build a prototype metasearch engine.
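To make the result merging problem concrete, here is a minimal Python sketch of one generic approach: rescale each engine's local scores to a common range, keep the best score for any document returned by more than one engine, and sort. The min-max normalization, the duplicate handling, and all names and scores are assumptions chosen for illustration; this is not the merging method developed in the project.

```python
from typing import Dict, List, Tuple

Result = Tuple[str, float]  # (document id, local score from one engine)


def normalize(results: List[Result]) -> List[Result]:
    """Min-max rescale one engine's scores to [0, 1] so engines are comparable."""
    if not results:
        return []
    scores = [score for _, score in results]
    lo, hi = min(scores), max(scores)
    if hi == lo:
        # All scores are equal; treat every document as equally good.
        return [(doc, 1.0) for doc, _ in results]
    return [(doc, (score - lo) / (hi - lo)) for doc, score in results]


def merge(per_engine: Dict[str, List[Result]], top_n: int) -> List[Result]:
    """Merge per-engine result lists into one ranked list of at most top_n documents."""
    best: Dict[str, float] = {}
    for results in per_engine.values():
        for doc, score in normalize(results):
            # A document returned by several engines keeps its highest score.
            if score > best.get(doc, float("-inf")):
                best[doc] = score
    # Highest score first; break ties by document id for a deterministic order.
    ranked = sorted(best.items(), key=lambda pair: (-pair[1], pair[0]))
    return ranked[:top_n]


# Made-up results: the two hypothetical engines use incompatible score scales.
per_engine = {
    "engine_a": [("doc1", 12.0), ("doc2", 9.0), ("doc3", 6.0)],
    "engine_b": [("doc2", 0.8), ("doc4", 0.5)],
}
merged = merge(per_engine, top_n=3)
print(merged)
```

Normalizing per engine is the simplest way to compare heterogeneous scores, but it ignores how good each engine is for the query; a real broker would likely also weight results by the engine's estimated usefulness.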


Area References:

1.    J. Callan, Z. Lu, and W. Croft. Searching Distributed Collections with Inference Networks. ACM SIGIR Conference, 1995, pp.21-28.

2.    Digital Library Collaborative Working Groups. Resource Discovery in a Globally-Distributed Digital Library. Working Group Report, 1999 (

3.    D. Dreilinger, and A. Howe. Experiences with Selecting Search Engines Using Metasearch. ACM TOIS, 15(3), July 1997, pp.195-222.

4.    Y. Fan, and S. Gauch. Adaptive Agents for Information Gathering from Multiple, Distributed Information Sources. 1999 AAAI Symposium on Intelligent Agents in Cyberspace, Stanford University, March 1999.

5.    J. French, A. Powell, and C. Viles. Evaluating Database Selection Techniques: A Testbed and Experiment. ACM SIGIR Conference, 1998, pp.121-129.

6.    L. Gravano, C. Chang, H. Garcia-Molina, and A. Paepcke. STARTS: Stanford Proposal for Internet Meta-Searching. ACM SIGMOD, May 1997, pp.207-218.

7.    L. Gravano, and H. Garcia-Molina. Generalizing GlOSS to Vector-Space Databases and Broker Hierarchies. VLDB Conference, 1995.

8.    L. Gravano, and H. Garcia-Molina. Merging Ranks from Heterogeneous Internet Sources. VLDB Conference, 1997.

9.    E. Voorhees, N. Gupta, and B. Johnson-Laird. Learning Collection Fusion Strategies. ACM SIGIR Conference, Seattle, 1995, pp.172-179.

10.   J. Xu, and J. Callan. Effective Retrieval with Distributed Collections. ACM SIGIR Conference, Melbourne, Australia, 1998, pp.112-120.