Determining Text Databases to Search in the Internet

 

Clement Yu, University of Illinois at Chicago

Weiyi Meng, State University of New York at Binghamton

Contact Information

 

Clement Yu
Department of EECS
University of Illinois at Chicago
Chicago, IL 60607
Phone: (312) 996-2318
Fax: (312) 413-0024
Email: yu@eecs.uic.edu
URL: http://www.eecs.uic.edu/~yu

Weiyi Meng
Department of Computer Science
State University of New York at Binghamton
Binghamton, NY 13902
Phone: (607) 777-4311
Fax: (607) 777-4729
Email: meng@cs.binghamton.edu
URL: http://panda.cs.binghamton.edu/~meng

 

WWW Page: http://panda.cs.binghamton.edu/~meng/metasearch.html

 

Project Award Information

 

Award numbers: IIS-9902792, IIS-9902872

Duration: 10/1/1999 - 9/30/2002

Title: Determining Text Databases to Search in the Internet

 

Keywords

 

Internet resource discovery, distributed information retrieval, collection fusion

 

Project Summary

 

The main objective of this collaborative project is to develop theoretically rigorous yet practical techniques for next-generation metasearch engines and to apply them in building an operational metasearch engine. A metasearch engine is a layer of software on top of existing search engines. Upon receiving a query, the metasearch engine determines which search engines to invoke, which documents to retrieve from each invoked search engine, which documents to transmit from the invoked search engines to the metasearch engine, and finally which documents to show to the user. We will develop methods that scale to a large number of search engines and accurately predict the usefulness of a search engine with respect to each query. In this project, we develop necessary and sufficient conditions for optimally ranking search engines in descending order of desirability with respect to a given query. We will study a method based on a dynamic concept hierarchy that has the potential to improve the retrieval of documents that are actually useful, not just similar. We will also investigate techniques that guarantee the retrieval of all potentially useful documents from a local search engine while minimizing the retrieval of useless documents. Such a guarantee is highly desirable in applications, such as legal and medical ones, that need to retrieve all potentially useful documents. The problem of merging results from multiple sources into a single ranked list will also be studied. The techniques to be developed are likely to require certain knowledge about each search engine. Because such knowledge may be proprietary, we will explore methods that use probing queries to discover the needed knowledge about each local search engine. Testbeds for our metasearch engine will be developed, and extensive experiments will be carried out.
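As a rough illustration of ranking search engines in descending order of desirability for a query, the sketch below assumes one possible definition of usefulness: the number of documents in a database whose cosine similarity with the query exceeds a threshold. The toy databases, the function names (cosine, usefulness, rank_databases), and the threshold value are all assumptions made for illustration. The sketch scans every document directly, so it shows only the ranking being sought, not an estimation method that would scale to many search engines.

```python
import math
from collections import Counter

def cosine(query_terms, doc_terms):
    """Cosine similarity between two bags of words (raw term frequencies)."""
    q, d = Counter(query_terms), Counter(doc_terms)
    dot = sum(q[t] * d[t] for t in q)
    norm = math.sqrt(sum(v * v for v in q.values())) * \
           math.sqrt(sum(v * v for v in d.values()))
    return dot / norm if norm else 0.0

def usefulness(query, documents, threshold):
    """Assumed measure: count of documents with similarity above the threshold."""
    q = query.lower().split()
    return sum(1 for doc in documents
               if cosine(q, doc.lower().split()) > threshold)

def rank_databases(query, databases, threshold=0.3):
    """Return database names in descending order of the assumed measure."""
    scores = {name: usefulness(query, docs, threshold)
              for name, docs in databases.items()}
    ranking = sorted(scores, key=lambda name: scores[name], reverse=True)
    return ranking, scores

# Hypothetical toy collections standing in for local search engines.
databases = {
    "medical": ["heart disease treatment",
                "drug treatment for heart failure",
                "skin care"],
    "legal": ["contract law cases", "patent law and drug patents"],
    "sports": ["football scores", "tennis results"],
}

ranking, scores = rank_databases("heart treatment", databases)
print(ranking, scores)
```

Under this measure, a metasearch engine would invoke databases from the front of the ranking and could skip those whose score is zero.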

 

Publications and Products

 

1.        W. Meng, C. Yu, and K. Liu. Detection of Heterogeneities in a Multiple Text Database Environment. Fourth IFCIS Conference on Cooperative Information Systems (CoopIS'99), Edinburgh, Scotland, September 1999.

2.        C. Yu, W. Meng, K. Liu, W. Wu, and N. Rishe. Efficient and Effective Metasearch for a Large Number of Text Databases. ACM International Conference on Information and Knowledge Management (CIKM'99), Kansas City, November 1999.

3.        K. Liu, C. Yu, W. Meng, W. Wu, and N. Rishe. A Statistical Method for Estimating the Usefulness of Text Databases. IEEE Transactions on Knowledge and Data Engineering (to appear).

 

Project Impact

 

         Human Resources: One Ph.D. student recently graduated; one female M.S. student graduated; one Ph.D. student and several M.S. students, including some female students, are currently working on the project.

         Education and Curriculum Development: Course material based on this project is being used in two graduate courses (CS534: Web Data Management, EECS582: Information Retrieval).

         Infrastructure: Four computers were purchased and installed in the database laboratories of the two investigators.

         Tutorials and Seminars: Two tutorials were given at international conferences (VLDB'99 and ACM DL'99). Six seminars were given at companies and universities (NEC, Lockheed-Martin, SUNY-Binghamton, Hong Kong University, Chinese University of Hong Kong, Hong Kong University of Science and Technology).

 

Goals, Objectives, and Targeted Activities

 

This three-year project is in the middle of its first year. We are working on several aspects of developing a more effective and efficient metasearch engine. The following are the first year's objectives and activities.

 

1.        Study methods whose usefulness estimates come close to the ideal situation. Performance studies will be carried out to measure the accuracy of our solutions against the ideal situation. Our goal is to achieve estimation accuracy in the range of 95% - 98% of the ideal situation.

2.        Provide a database usefulness estimation method that is much more efficient than our current estimation method, which has exponential complexity. We aim for a method with low polynomial time complexity.

3.        Develop new database usefulness estimation methods that utilize database representatives of smaller sizes. The new methods should reduce the size of database representatives by about 50%.

4.        Experiment with different types of queries and larger databases. This includes using the TREC collections to evaluate the effectiveness of our approach based on the retrieval of relevant documents.

5.        Implement a metasearch engine for over 100 major CS departments in the US. This will be housed in our laboratories so that we can perform experiments with ease.

6.        Materials related to this project will be taught in courses by the PIs from both universities. A number of students will be recruited to participate in various parts of this project. This activity will be continued in subsequent years.

 

Project References

 

See http://panda.cs.binghamton.edu/~meng/metasearch.html

 

1.        W. Meng, K. Liu, C. Yu, X. Wang, Y. Chang, and N. Rishe. Determining Text Databases to Search in the Internet. International Conference on Very Large Data Bases (VLDB'98), New York City, August 1998. (http://panda.cs.binghamton.edu/~meng/pub.d/vldb98fi.ps.gz)

2.        W. Meng, K. Liu, C. Yu, W. Wu, and N. Rishe. Estimating the Usefulness of Search Engines. International Conference on Data Engineering (ICDE'99), Sydney, Australia, March 1999. (http://panda.cs.binghamton.edu/~meng/pub.d/icde99.ps.gz)

3.        C. Yu, K. Liu, W. Wu, W. Meng, and N. Rishe. Finding the Most Similar Documents across Multiple Text Databases. IEEE Conference on Advances in Digital Libraries (ADL'99), Baltimore, Maryland, May 1999. (http://panda.cs.binghamton.edu/~meng/pub.d/adl99.ps.gz)

4.        K. Liu, C. Yu, W. Meng, and N. Rishe. Discovery of Similarity Computation on the Internet. ACM Conference on Digital Libraries (DL'99) (poster paper), University of California, Berkeley, August 1999.

 

5.        W. Meng, C. Yu, and K. Liu. Detection of Heterogeneities in a Multiple Text Database Environment. IFCIS Conference on Cooperative Information Systems (CoopIS'99), Edinburgh, Scotland, September 1999. (http://panda.cs.binghamton.edu/~meng/pub.d/coopis99.ps.gz)

6.        C. Yu, W. Meng, K. Liu, W. Wu, and N. Rishe. Efficient and Effective Metasearch for a Large Number of Text Databases. ACM International Conference on Information and Knowledge Management (CIKM'99), Kansas City, November 1999. (http://panda.cs.binghamton.edu/~meng/pub.d/cikm99.ps.gz)

7.        K. Liu, C. Yu, W. Meng, W. Wu, and N. Rishe. A Statistical Method for Estimating the Usefulness of Text Databases. IEEE Transactions on Knowledge and Data Engineering (to appear).

 

Area Background

 

Many text sources are available on the Internet. Each text source usually has an associated search engine. These widely distributed search engines are highly heterogeneous. They may employ different techniques to represent and rank documents, and they usually provide access to different sets of documents of diverse interest. Frequently, the information a user needs is stored in the databases of multiple local search engines. It is inconvenient and inefficient for an ordinary user to query multiple search engines and identify useful documents among the results they return. As the number of search engines increases, there is therefore a growing need for automatic search brokers (metasearch engines) that can invoke multiple search engines. Through a search broker, a user needs to submit only a single query to retrieve the desired documents.

 

There are a number of challenges in building an efficient and effective metasearch engine. Among them, the database selection problem is to identify text sources that are likely to return useful documents for a given query. The document selection problem is to determine which documents should be retrieved from each identified source. The result merging problem is to combine the documents returned from all identified sources. This project aims to find good solutions to these problems and to build a prototype metasearch engine.
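The three problems above can be sketched as stages of one small pipeline. The following is a minimal sketch under stated assumptions, not the project's method: ToyEngine, Hit, select_databases, and metasearch are hypothetical names, each engine's "representative" is assumed to be a table of per-term document frequencies, and merging assumes that local scores are directly comparable across engines, which heterogeneous real engines generally do not guarantee.

```python
from collections import Counter
from dataclasses import dataclass

@dataclass
class Hit:
    doc_id: str
    score: float  # local score reported by the engine
    source: str   # name of the engine that returned the document

class ToyEngine:
    """Stand-in for a local search engine (hypothetical, for illustration)."""
    def __init__(self, name, docs):
        self.name = name
        self.docs = docs  # doc_id -> text
        # Assumed representative: number of documents containing each term.
        self.doc_freq = Counter(
            t for text in docs.values() for t in set(text.lower().split()))

    def search(self, query, limit):
        """Score each document by the fraction of query terms it contains."""
        terms = set(query.lower().split())
        hits = []
        for doc_id, text in self.docs.items():
            matched = len(terms & set(text.lower().split()))
            if matched:
                hits.append(Hit(doc_id, matched / len(terms), self.name))
        hits.sort(key=lambda h: h.score, reverse=True)
        return hits[:limit]

def select_databases(query, engines, max_engines):
    """Database selection: rank engines using only their representatives."""
    terms = set(query.lower().split())
    def estimate(engine):
        return sum(engine.doc_freq[t] for t in terms)
    ranked = sorted(engines, key=estimate, reverse=True)
    return [e for e in ranked[:max_engines] if estimate(e) > 0]

def metasearch(query, engines, max_engines=2, per_engine=2, top_k=3):
    selected = select_databases(query, engines, max_engines)
    results = []
    for engine in selected:
        # Document selection: ask each chosen engine for a few top documents.
        results.extend(engine.search(query, per_engine))
    # Result merging: a single list ordered by (assumed comparable) scores.
    results.sort(key=lambda h: h.score, reverse=True)
    return results[:top_k]

engines = [
    ToyEngine("cs", {"cs1": "distributed information retrieval",
                     "cs2": "database query optimization"}),
    ToyEngine("web", {"w1": "web information retrieval systems",
                      "w2": "metasearch engine design"}),
    ToyEngine("bio", {"b1": "protein folding"}),
]

results = metasearch("information retrieval", engines)
print([(h.doc_id, h.source, h.score) for h in results])
```

Only the selection stage avoids contacting the engines; it relies on the representative alone, so unpromising sources such as the "bio" engine here are never queried.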

 

Area References

 

1.        J. Callan, Z. Lu, and W. Croft. Searching Distributed Collections with Inference Networks. ACM SIGIR Conference, 1995, pp.21-28.

2.        Digital Library Collaborative Working Groups. Resource Discovery in a Globally-Distributed Digital Library. Working Group Report, 1999 (http://www.iei.pi.cnr.it/DELOS/NSF/resourcediscovery.htm)

3.        D. Dreilinger, and A. Howe. Experiences with Selecting Search Engines Using Metasearch. ACM TOIS, 15(3), July 1997, pp.195-222.

4.        Y. Fan, and S. Gauch. Adaptive Agents for Information Gathering from Multiple, Distributed Information Sources. 1999 AAAI Symposium on Intelligent Agents in Cyberspace, Stanford University, March 1999.

5.        J. French, A. Powell, and C. Viles. Evaluating Database Selection Techniques: A Testbed and Experiment. ACM SIGIR Conference, pp.121-129, Melbourne, Australia, 1998.

6.        L. Gravano, C. Chang, H. Garcia-Molina, and A. Paepcke. STARTS: Stanford Proposal for Internet Meta-Searching. ACM SIGMOD, Tucson, May 1997, pp.207-218.

7.        L. Gravano, and H. Garcia-Molina. Generalizing GlOSS to Vector-Space Databases and Broker Hierarchies. VLDB Conference, 1995.

8.        L. Gravano, and H. Garcia-Molina. Merging Ranks from Heterogeneous Internet Sources. VLDB Conference, 1997.

9.        K. Liu, C. Yu, W. Meng, W. Wu and N. Rishe. A Statistical Method for Estimating the Usefulness of Text Databases. IEEE Transactions on Knowledge and Data Engineering (to appear).

10.     W. Meng, K. Liu, C. Yu, X. Wang, Y. Chang, and N. Rishe. Determining Text Databases to Search in the Internet. VLDB Conference, New York City, August 1998, pp.14-25.

11.     E. Voorhees, N. Gupta, and B. Johnson-Laird. Learning Collection Fusion Strategies. ACM SIGIR Conference, Seattle, 1995, pp.172-179.

12.     J. Xu, and J. Callan. Effective Retrieval with Distributed Collections. ACM SIGIR Conference, pp.112-120, Melbourne, Australia, 1998.

13.     C. Yu, K. Liu, W. Wu, W. Meng, and N. Rishe. Finding the Most Similar Documents across Multiple Text Databases. IEEE Conference on Advances in Digital Libraries, pp.150-162, Baltimore, Maryland, May 1999.

14.     C. Yu, W. Meng, K. Liu, W. Wu, and N. Rishe. Efficient and Effective Metasearch for a Large Number of Text Databases. ACM International Conference on Information and Knowledge Management (CIKM'99), 1999.