WebScales: Towards a Highly Scalable Metasearch Engine

Vijay Raghavan, Zonghuan Wu	King-Lup Liu
Center for Advanced Computer Studies	Webscalers, LLC.
University of Louisiana at Lafayette	121 Conque Drive
Lafayette, LA 70504	Lafayette, LA 70506
Phone: (337) 482-6603; (337) 482-5243	Email: kliu2002@yahoo.com
Fax: (337) 482-5791
Email: {raghavan, zwu}@cacs.louisiana.edu

List of Supported Students:

Fang Liu, Shuang Liu, Hai He: Research Assistants

Project Award Information:

Duration: 8/15/2002 - 8/14/2005

Title: WebScales: Towards a Highly Scalable Metasearch Engine

Keywords: Large-scale metasearch engine, distributed information retrieval, search engine discovery, search engine wrapper, database selection

Project Summary:

The main objective of this collaborative project is to develop enabling techniques for a large-scale metasearch engine that aims at covering a much larger portion of the Web and at the same time retrieving more up-to-date and more useful documents than existing search engines and metasearch engines. A metasearch engine is a system that provides unified access to multiple existing search engines. Upon receiving a query, the metasearch engine determines the appropriate search engines to invoke, the documents to retrieve from each invoked search engine and finally the set of documents to be shown to the user. The main problems to be studied in this project include (1) how to automatically discover useful search engines on the Web; (2) how to automatically and accurately categorize search engines into a concept hierarchy and how to use user profiles to map user queries to appropriate concept(s) in the hierarchy; (3) how to automatically incorporate search engines into a metasearch engine; (4) how to perform accurate database selection for longer queries; and (5) how to merge results returned from multiple search engines.
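The query flow described above (select search engines, retrieve from each, merge) can be illustrated with a minimal sketch. It is not the project's algorithm: the representatives, engine names, and scoring rules below are hypothetical and chosen only for illustration. Each engine is assumed to be summarized by a term-to-document-frequency table, engines are ranked by the summed frequencies of the query terms, and results are merged by normalizing each engine's local scores against that engine's top score.

```python
# Hypothetical database representatives: engine name -> {term: document frequency}.
REPRESENTATIVES = {
    "news_engine": {"election": 120, "budget": 80, "storm": 40},
    "sports_engine": {"football": 200, "playoff": 90, "storm": 5},
    "science_engine": {"storm": 60, "climate": 150, "genome": 70},
}


def select_engines(query_terms, representatives, k=2):
    """Return up to k engines ranked by summed document frequency of the query terms."""
    scores = {
        name: sum(rep.get(term, 0) for term in query_terms)
        for name, rep in representatives.items()
    }
    # Sort by descending score; break ties by name so the order is deterministic.
    ranked = sorted(scores, key=lambda name: (-scores[name], name))
    # Engines that match no query term are never invoked.
    return [name for name in ranked if scores[name] > 0][:k]


def merge_results(results_by_engine, top_n=5):
    """Merge (url, local_score) lists from several engines into one ranked list.

    Local scores are divided by the engine's maximum score so that they are
    comparable across engines; a URL returned by several engines keeps its
    best normalized score.
    """
    best = {}
    for results in results_by_engine.values():
        if not results:
            continue
        max_score = max(score for _, score in results)
        for url, score in results:
            normalized = score / max_score if max_score > 0 else 0.0
            if normalized > best.get(url, -1.0):
                best[url] = normalized
    return sorted(best.items(), key=lambda item: (-item[1], item[0]))[:top_n]
```

For example, select_engines(["storm", "climate"], REPRESENTATIVES) picks the science and news engines and skips the sports engine, whose representative barely matches the query. A real system would replace both scoring rules with the selection and merging methods studied in this project.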

Publications and Products:

1. F. Liu, C. Yu, W. Meng. Personalized Web Search by Mapping User Queries to Categories. ACM International Conference on Information and Knowledge Management (CIKM'02), pp.558-565, McLean, Virginia, 2002.

2. K. Liu, C. Yu, W. Meng. Discovering the Representative of a Search Engine. ACM International Conference on Information and Knowledge Management (CIKM'02), poster paper, pp.652-654, McLean, Virginia, 2002.

3. Z. Wu, V. Raghavan, D. Chun, W. Meng, and C. Yu. SE-LEGO: A System to Create Metasearch Engines on Demand. ACM SIGIR Conference, Demo paper, p.464, Toronto, Canada, July 2003.

4. Z. Wu, V. Raghavan, C. Du, W. Meng, H. He, C. Yu. Creating Customized Metasearch Engines on Demand Using SE-LEGO. International Conference on Web-Age Information Management (WAIM'03), Chengdu, China, Demo paper, p.503, August 2003.

5. C. Yu, G. Philip, W. Meng. Distributed Top-N Query Processing with Possibly Uncooperative Local Systems. International Conference on Very Large Data Bases (VLDB'03), pp.117-128, Berlin, Germany, 2003.

6. H. He, W. Meng, C. Yu, Z. Wu. WISE-Integrator: An Automatic Integrator of Web Search Interfaces for E-Commerce. International Conference on Very Large Data Bases (VLDB'03), pp.357-368, Berlin, 2003.

7. Z. Wu, V. Raghavan, H. Qian, V. Rama K, W. Meng, H. He, C. Yu. Towards Automatic Incorporation of Search Engines into a Large-Scale Metasearch Engine. IEEE/WIC International Conference on Web Intelligence, pp.658-661, Halifax, Canada, 2003.

8. C. Yu, W. Meng. Web Search Technology. In The Internet Encyclopedia edited by Hossein Bidgoli, Wiley Publishers, pp.738-753, 2003.

9. F. Liu, C. Yu, W. Meng. Personalized Web Search for Improving Retrieval Effectiveness. IEEE Transactions on Knowledge and Data Engineering, Vol.16, No.1, pp.28-40, January 2004.

10. W. Wu, C. Yu, W. Meng. Database Selection for Longer Queries. Proceedings of the 2004 Meeting of the International Federation of Classification Societies, Chicago, July 2004.

11. W. Wu, C. Yu, A. Doan, W. Meng. An Interactive Clustering-based Approach to Integrating Source Query interfaces on the Deep Web. ACM SIGMOD Conference, pp.95-106, Paris, France, June 2004.

12. H. He, W. Meng, C. Yu, Z. Wu. Automatic Extraction of Web Search Interfaces for Interface Schema Integration. World Wide Web Conference (WWW2004), poster paper, pp.414-415, New York City, May 2004.

13. Q. Peng, W. Meng, H. He, C. Yu. Clustering E-Commerce Search Engines. World Wide Web Conference (WWW2004), poster paper, pp.416-417, New York City, May 2004.

14. S. Liu, F. Liu, C. Yu, W. Meng. An Effective Approach to Document Retrieval via Utilizing WordNet and Recognizing Phrases. ACM SIGIR Conference, pp.266-272, Sheffield, UK, July 2004.

15. H. He, W. Meng, C. Yu, Z. Wu. Automatic Integration of Web Search Interfaces with WISE-Integrator. VLDB Journal (to appear).

16. Q. Peng, W. Meng, H. He, C. Yu. WISE-Cluster: Clustering E-Commerce Search Engines Automatically. 6th ACM International Workshop on Web Information and Data Management (WIDM 2004), Washington, DC, November 2004 (to appear).

Project Impact:

Human Resources: One Ph.D. student and seven M.S. students have graduated; five Ph.D. students and several M.S. students are currently working on the project.

Education and Curriculum Development: Course materials based on this project have been used in three graduate courses: CS632 – Advanced Database Systems, CS634 – Web Data Management, CS582 – Information Retrieval.

Goals, Objectives, and Targeted Activities:

This three-year project is at the end of its second year. During the second year, we carried out research in the following areas.

1. Designed and implemented a new method to automatically extract retrieved results from returned result pages.

2. Improved our prototype system (SE-LEGO) for automatically discovering search engines and incorporating them into a metasearch engine.

3. Designed and implemented a new result merging algorithm.

4. Developed a newspaper metasearch engine prototype system with 50 local newspaper search engines.

5. Designed and implemented an algorithm to generate database representatives.

6. Improved the method for automatically integrating the interfaces of multiple e-commerce search engines.

7. Developed a method to cluster e-commerce search engines.

The following are the objectives of the next year:

1. Implement a metasearch engine with at least 1,000 search engines to test the scalability of our approach.

2. Implement an algorithm for generating database representatives for deep Web search engines.

3. Further improve SE-LEGO, especially the search engine connection and result extraction components.

4. Develop a better database categorization algorithm and test its effectiveness.

5. Study practical solutions for identifying redundant search engines in a metasearch engine context.

Area Background:

Currently there are hundreds of thousands of search engines on the Web, and each of them covers a small portion of the Web (either the deep Web or the surface Web). Creating a metasearch engine on top of all useful search engines is an effective way to combine the coverage of these search engines and to reach a large portion of the deep Web. Because of the large number of search engines involved, highly scalable and automated techniques are needed to create and maintain such a metasearch engine. This project aims to solve the technical problems in building one. Automatically discovering useful search engines on the Web requires a specialized Web crawler that can recognize Web search engine pages. Automatically incorporating search engines into a metasearch engine requires methods that analyze search engine pages to extract connection information and analyze result pages to extract the correct result information. Identifying potentially useful search engines for each user query requires techniques that collect characteristic information about each search engine efficiently and accurately, organize this information for all search engines in a scalable manner, and use it for efficient and accurate search engine selection. This project is related to distributed information retrieval. Projects on metasearching techniques for Web sources of structured data are also related.
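As an illustration of what recognizing a search engine page and extracting its connection information might involve, the following sketch uses a deliberately simple heuristic that is an assumption of this example, not the method developed in the project: a form counts as a search interface if it has exactly one named text input and no password field. The class and function names are hypothetical; only the Python standard library is used.

```python
from html.parser import HTMLParser
from urllib.parse import urljoin


class SearchFormFinder(HTMLParser):
    """Collect forms that look like keyword search interfaces (heuristic)."""

    def __init__(self, base_url):
        super().__init__()
        self.base_url = base_url
        self.forms = []       # connection info for each accepted form
        self._current = None  # form currently being parsed, if any

    def handle_starttag(self, tag, attrs):
        attributes = dict(attrs)
        if tag == "form":
            self._current = {
                # Resolve a relative action against the page URL.
                "action": urljoin(self.base_url, attributes.get("action") or ""),
                "method": (attributes.get("method") or "get").lower(),
                "text_fields": [],
                "has_password": False,
            }
        elif tag == "input" and self._current is not None:
            # An input without a type attribute defaults to a text field.
            kind = (attributes.get("type") or "text").lower()
            if kind in ("text", "search") and attributes.get("name"):
                self._current["text_fields"].append(attributes["name"])
            elif kind == "password":
                self._current["has_password"] = True

    def handle_endtag(self, tag):
        if tag == "form" and self._current is not None:
            form, self._current = self._current, None
            # Heuristic: one text box and no password field suggests search, not login.
            if len(form["text_fields"]) == 1 and not form["has_password"]:
                self.forms.append({
                    "action": form["action"],
                    "method": form["method"],
                    "query_field": form["text_fields"][0],
                })


def find_search_forms(html, base_url):
    """Return connection information for each candidate search form in the page."""
    finder = SearchFormFinder(base_url)
    finder.feed(html)
    finder.close()
    return finder.forms
```

The returned action URL, HTTP method, and query field name are the minimum needed to send a query to the engine. Real pages need considerably more care (scripts, hidden fields, multi-field interfaces), which is why automated connection and result extraction are research problems in their own right.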

Area References:

1. V. Crescenzi, G. Mecca, and P. Merialdo. RoadRunner: Towards Automatic Data Extraction from Large Web Sites. International Conference on Very Large Data Bases, Rome, Italy, 2001, pp.109-118.

2. D. Dreilinger, and A. Howe. Experiences with Selecting Search Engines Using Metasearch. ACM TOIS, 15(3), July 1997, pp.195-222.

3. Y. Fan, and S. Gauch. Adaptive Agents for Information Gathering from Multiple, Distributed Information Sources. 1999 AAAI Symposium on Intelligent Agents in Cyberspace, Stanford University, March 1999.

4. J. French, A. Powell, C. Viles. Evaluating Database Selection Techniques: A Testbed and Experiment. ACM SIGIR Conference, pp.121-129, 1998.

5. L. Gravano, and H. Garcia-Molina. Generalizing GlOSS to Vector-Space databases and Broker Hierarchies. VLDB Conference, 1995.

6. L. Gravano, H. Garcia-Molina. Merging Ranks from Heterogeneous Internet Sources. VLDB Conference, 1997.

7. B. He, K. Chang. Statistical Schema Integration Across the Deep Web. ACM SIGMOD Conference, 2003.

8. P. Ipeirotis, L. Gravano, and M. Sahami. Probe, Count, and Classify. ACM SIGMOD Conference, 2001.

9. P. Ipeirotis, and L. Gravano. Distributed Search over the Hidden Web: Hierarchical Database Sampling and Selection. VLDB Conference, Hong Kong, 2002.

10. W. Meng, C. Yu, K. Liu. Building Efficient and Effective Metasearch Engines. ACM Computing Surveys, Vol. 34, No. 1, March 2002, pp.48-89.

11. W. Meng, Z. Wu, C. Yu, Z. Li. A Highly-Scalable and Effective Method for Metasearch. ACM Transactions on Information Systems 19(3), pp.310-335, July 2001.

12. C. Yu, K. Liu, W. Meng, Z. Wu, N. Rishe. A Methodology to Retrieve Text Documents from Multiple Databases. IEEE Transactions on Knowledge and Data Engineering, 14:6, Nov./Dec. 2002, pp.1347-1361.

13. E. Voorhees, N. Gupta, and B. Johnson-Laird. Learning Collection Fusion Strategies. ACM SIGIR Conference, Seattle, 1995, pp.172-179.

14. J. Xu, and J. Callan. Effective Retrieval with Distributed Collections. ACM SIGIR Conference, pp.112-120, Melbourne, Australia, 1998.

Project Website:

Project URL: http://www.cs.binghamton.edu/~meng/metasearch.html

This site lists all publications related to this project, including all annual IDM workshop reports. PostScript or PDF files of these publications are also available at this site.

Clement Yu	Weiyi Meng
Department of Computer Science	Department of Computer Science
University of Illinois at Chicago	State University of New York at Binghamton
Chicago, IL 60607	Binghamton, NY 13902
Phone: (312) 996-2318	Phone: (607) 777-4311
Fax: (312) 413-0024	Fax: (607) 777-4729
Email: yu@cs.uic.edu	Email: meng@cs.binghamton.edu
URL: http://www.cs.uic.edu/~yu	URL: http://www.cs.binghamton.edu/~meng