WebScales: Towards a Highly Scalable Metasearch Engine


Project Award Numbers: IIS-0208434, IIS-0208574


Principal Investigators:


Clement Yu
Department of Computer Science
University of Illinois at Chicago
Chicago, IL 60607
Phone: (312) 996-2318
Fax: (312) 413-0024
Email: yu@cs.uic.edu
URL: http://www.cs.uic.edu/~yu

Weiyi Meng
Department of Computer Science
State University of New York at Binghamton
Binghamton, NY 13902
Phone: (607) 777-4311
Fax: (607) 777-4729
Email: meng@cs.binghamton.edu
URL: http://www.cs.binghamton.edu/~meng




Vijay Raghavan
Center for Advanced Computer Studies
University of Louisiana at Lafayette
Lafayette, LA 70504
Phone: (337) 482-6603
Fax: (337) 482-5791
Email: raghavan@cacs.louisiana.edu

Zonghuan Wu
Center for Advanced Computer Studies
University of Louisiana at Lafayette
Lafayette, LA 70504
Phone: (337) 482-5243
Fax: (337) 482-5791
Email: zwu@cacs.louisiana.edu


List of Supported Students:

     Fang Liu, Shuang Liu, Hai He: Research Assistants


Project Award Information:

     Duration: 8/15/2002 - 8/14/2005

     Title: WebScales: Towards a Highly Scalable Metasearch Engine


Keywords: Large-scale metasearch engine, distributed information retrieval, search engine discovery, search engine wrapper, database selection


Project Summary:


The main objective of this collaborative project is to develop enabling techniques for a large-scale metasearch engine that aims at covering a much larger portion of the Web and at the same time retrieving more up-to-date and more useful documents than existing search engines and metasearch engines. A metasearch engine is a system that provides unified access to multiple existing search engines. Upon receiving a query, the metasearch engine determines the appropriate search engines to invoke, the documents to retrieve from each invoked search engine and finally the set of documents to be shown to the user. The main problems to be studied in this project include (1) how to automatically discover useful search engines on the Web; (2) how to automatically and accurately categorize search engines into a concept hierarchy and how to use user profiles to map user queries to appropriate concept(s) in the hierarchy; (3) how to automatically incorporate search engines into a metasearch engine; (4) how to perform accurate database selection for longer queries; and (5) how to merge results returned from multiple search engines.
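The query flow described above (select engines, retrieve from each, merge) can be illustrated with a minimal sketch. Everything in it is an assumption made for illustration rather than the project's method: `ToyEngine` is a hypothetical in-memory stand-in for a remote search engine, the engine representative is a simple table of term document frequencies, and the merge step weights each local score by the engine's estimated usefulness.

```python
from collections import Counter


class ToyEngine:
    """Hypothetical in-memory stand-in for a remote search engine."""

    def __init__(self, name, docs):
        self.name = name
        self.docs = docs
        # Assumed representative: document frequency of each term.
        self.df = Counter(t for d in docs for t in set(d.lower().split()))

    def search(self, query, k):
        """Return up to k (local_score, document) pairs, best first."""
        terms = query.lower().split()
        scored = []
        for d in self.docs:
            words = d.lower().split()
            score = sum(words.count(t) for t in terms)
            if score > 0:
                scored.append((score, d))
        scored.sort(key=lambda p: (-p[0], p[1]))
        return scored[:k]


def select_engines(engines, query, max_engines):
    """Database selection: rank engines by an estimated usefulness."""
    terms = query.lower().split()
    ranked = []
    for e in engines:
        # Fraction of the engine's documents containing each query term.
        usefulness = sum(e.df[t] / len(e.docs) for t in terms)
        if usefulness > 0:
            ranked.append((usefulness, e))
    ranked.sort(key=lambda p: (-p[0], p[1].name))
    return ranked[:max_engines]


def metasearch(engines, query, max_engines=2, k=3):
    """Invoke the selected engines and merge their results."""
    merged = {}
    for usefulness, e in select_engines(engines, query, max_engines):
        for local_score, doc in e.search(query, k):
            # Assumed merge rule: weight local scores by engine usefulness,
            # keeping the best score for documents returned more than once.
            global_score = usefulness * local_score
            merged[doc] = max(merged.get(doc, 0.0), global_score)
    return sorted(merged.items(), key=lambda p: (-p[1], p[0]))[:k]


engines = [
    ToyEngine("sports", ["football match report", "tennis match results"]),
    ToyEngine("tech", ["python search engine", "web search engine design",
                       "database systems"]),
    ToyEngine("cooking", ["pasta recipe", "bread recipe"]),
]
results = metasearch(engines, "search engine")
```

In this toy setup only the `tech` engine contains the query terms, so it is the only engine invoked; a real system would replace the term-frequency table with a learned or sampled representative.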


Publications and Products:


1.      F. Liu, C. Yu, W. Meng. Personalize Web Search by Mapping User Queries to Categories. ACM International Conference on Information and Knowledge Management (CIKM'02), McLean, Virginia, November 2002, pp.558-565.

2.      K. Liu, C. Yu, W. Meng. Discovering the Representative of a Search Engine. ACM International Conference on Information and Knowledge Management (CIKM'02), (poster paper), pp.652-654, McLean, Virginia, November 2002.

3.      Z. Wu, V. Raghavan, D. Chun, W. Meng, and C. Yu. SE-LEGO: A System to Create Metasearch Engines on Demand. ACM SIGIR Conference, Demo paper, Toronto, Canada, July 2003 (to appear).

4.      Z. Wu, V. Raghavan, C. Du, W. Meng, H. He and C. Yu. Creating Customized Metasearch Engines on Demand Using SE-LEGO. International Conference on Web-Age Information Management (WAIM'03), Chengdu, China, Demo paper, August 2003 (to appear).

5.      C. Yu, G. Philip, W. Meng. Distributed Top-N Query Processing with Possibly Uncooperative Local Systems. International Conference on Very Large Data Bases (VLDB'03), Berlin, Germany, September 2003 (to appear).

6.      H. He, W. Meng, C. Yu, and Z. Wu. WISE-Integrator: An Automatic Integrator of Web Search Interfaces for E-Commerce. International Conference on Very Large Data Bases (VLDB'03), Berlin, Germany, September 2003 (to appear).

7.      Z. Wu, V. Raghavan, H. Qian, V. Rama K, W. Meng, H. He, C. Yu. Towards Automatic Incorporation of Search Engines into a Large-Scale Metasearch Engine. 2003 IEEE/WIC International Conference on Web Intelligence, Halifax, Canada, October 2003 (to appear).

8.      F. Liu, C. Yu, W. Meng. Personalized Web Search for Improving Retrieval Effectiveness. IEEE Transactions on Knowledge and Data Engineering (to appear).

9.      C. Yu, and W. Meng. Web Search Technology. In The Internet Encyclopedia edited by Hossein Bidgoli, Wiley Publishers (to appear).


Project Impact:


Human Resources: One Ph.D. student and five M.S. students have graduated; four Ph.D. students and several M.S. students are currently working on the project.


Education and Curriculum Development: Course material based on this project has been used in two graduate courses (CS632: Advanced Database Systems and CS582: Information Retrieval).


Goals, Objectives, and Targeted Activities:


This three-year project is at the end of its first year. We are on schedule to achieve the goals planned for the project. In the first year, we did research in the following areas.

1.      Develop methods to automatically discover search engines, connect to them and extract retrieved results from returned result pages.

2.      Improve database selection and collection fusion methods for longer queries.

3.      Improve the use of user profiles to map user queries to appropriate categories in a concept hierarchy, and thereby improve the retrieval effectiveness of documents.

4.      Study some important issues in extending our metasearch engine techniques to more structured data (such as the data in e-commerce search engines). These issues include the top-N query problem in distributed relational databases and the automatic integration of the interfaces of multiple e-commerce search engines.

5.      Implement a preliminary prototype system (SE-LEGO) for automatically discovering search engines and incorporating them into a metasearch engine.


The following are the objectives of the next year:

1.      Continue to improve our methods to automatically connect to search engines and to automatically extract search results. Make the prototype system SE-LEGO more robust.

2.      Develop a better database categorization algorithm and test its effectiveness.

3.      Study practical solutions for removing redundant and peculiar search engines. We aim at methods with low polynomial time complexity.

4.      Implement the representative collection component of the WebScales system.


Area Background:


Currently there are hundreds of thousands of search engines on the Web, and each of them covers a small portion of the Web (either the deep Web or the surface Web). Creating a metasearch engine on top of all useful search engines is an effective way to combine the coverage of these search engines and to reach a large portion of the deep Web. Due to the large number of search engines involved, highly scalable and automated techniques are needed to create and maintain such a metasearch engine. This project aims to solve the technical problems involved in building such a metasearch engine. To automatically discover useful search engines on the Web, a specialized Web crawler that can recognize Web search engine pages is needed. To automatically incorporate search engines into a metasearch engine, methods are needed that can analyze search engine pages to extract connection information and analyze result pages to extract correct result information. To identify potentially useful search engines for each user query efficiently and accurately, techniques are needed that can collect characteristic information about each search engine efficiently and accurately, organize such information for all search engines in a scalable manner, and use the information for efficient and accurate search engine selection. This project is related to distributed information retrieval. Projects on metasearching techniques for Web sources of structured data are also related.
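As one illustration of recognizing a search engine page and extracting its connection information, the sketch below scans an HTML page for forms that look like search interfaces. The heuristic (exactly one text input and no password field) and the names `SearchFormFinder` and `find_search_forms` are assumptions chosen for illustration, not the techniques developed in this project; it uses only Python's standard `html.parser` module.

```python
from html.parser import HTMLParser


class SearchFormFinder(HTMLParser):
    """Collects basic connection information for every form on a page."""

    def __init__(self):
        super().__init__()
        self.forms = []
        self._current = None

    def handle_starttag(self, tag, attrs):
        a = dict(attrs)
        if tag == "form":
            self._current = {
                "action": a.get("action") or "",
                "method": (a.get("method") or "get").lower(),
                "text_inputs": [],
                "has_password": False,
            }
        elif tag == "input" and self._current is not None:
            # An input with no type attribute defaults to a text field.
            kind = (a.get("type") or "text").lower()
            if kind in ("text", "search"):
                self._current["text_inputs"].append(a.get("name") or "")
            elif kind == "password":
                self._current["has_password"] = True

    def handle_endtag(self, tag):
        if tag == "form" and self._current is not None:
            self.forms.append(self._current)
            self._current = None


def find_search_forms(html):
    """Assumed heuristic: one text box and no password field."""
    finder = SearchFormFinder()
    finder.feed(html)
    finder.close()
    return [f for f in finder.forms
            if len(f["text_inputs"]) == 1 and not f["has_password"]]


page = (
    '<html><body>'
    '<form action="/login" method="post">'
    '<input type="text" name="user"><input type="password" name="pw">'
    '</form>'
    '<form action="/search">'
    '<input type="text" name="q"><input type="submit" value="Go">'
    '</form>'
    '</body></html>'
)
candidates = find_search_forms(page)
```

Here the login form is rejected because of its password field, while the second form is kept along with its action URL, HTTP method, and query parameter name, which is the kind of connection information a metasearch engine would need in order to submit queries.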


Area References:


1.      V. Crescenzi, G. Mecca, and P. Merialdo. RoadRunner: Towards Automatic Data Extraction from Large Web Sites. In Proceedings of the 27th International Conference on Very Large Data Bases, Rome, Italy, 2001, pp.109-118.

2.      D. Dreilinger, and A. Howe. Experiences with Selecting Search Engines Using Metasearch. ACM TOIS, 15(3), July 1997, pp.195-222.

3.      Y. Fan, and S. Gauch. Adaptive Agents for Information Gathering from Multiple, Distributed Information Sources. 1999 AAAI Symposium on Intelligent Agents in Cyberspace, Stanford University, March 1999.

4.      J. French, A. Powell, C. Viles. Evaluating Database Selection Techniques: A Testbed and Experiment. ACM SIGIR Conference, pp.121-129, 1998.

5.      L. Gravano, and H. Garcia-Molina. Generalizing GlOSS to Vector-Space databases and Broker Hierarchies. VLDB Conference, 1995.

6.      L. Gravano, and H. Garcia-Molina. Merging Ranks from Heterogeneous Internet Sources. VLDB Conference, 1997.

7.      B. He, K. Chang. Statistical Schema Integration Across the Deep Web. ACM SIGMOD Conference, 2003.

8.      P. Ipeirotis, L. Gravano, and M. Sahami. Probe, Count, and Classify. ACM SIGMOD Conference, 2001.

9.      P. Ipeirotis, and L. Gravano. Distributed Search over the Hidden Web: Hierarchical Database Sampling and Selection. VLDB Conference, Hong Kong, 2002.

10.   W. Meng, C. Yu, K. Liu. Building Efficient and Effective Metasearch Engines. ACM Computing Surveys, Vol. 34, No. 1, March 2002, pp.48-89.

11.   W. Meng, Z. Wu, C. Yu, Z. Li. A Highly-Scalable and Effective Method for Metasearch. ACM Transactions on Information Systems 19(3), pp.310-335, July 2001.

12.   C. Yu, K. Liu, W. Meng, Z. Wu, N. Rishe. A Methodology to Retrieve Text Documents from Multiple Databases. IEEE Transactions on Knowledge and Data Engineering, Vol.14, No.6, November/December 2002, pp.1347-1361.

13.   E. Voorhees, N. Gupta, and B. Johnson-Laird. Learning Collection Fusion Strategy. ACM SIGIR Conference, Seattle, 1995, pp.172-179.

14.   J. Xu, and J. Callan. Effective Retrieval with Distributed Collections. ACM SIGIR Conference, pp.112-120, Melbourne, Australia, 1998.


Project Website:


Project URL: http://www.cs.binghamton.edu/~meng/metasearch.html


This site lists all publications (including some technical reports and all annual IDM workshop reports) related to this project. PostScript or PDF files of these publications are also available at this site.