WebScales: Towards a Highly Scalable Metasearch Engine
Project Award Numbers: IIS-0208434, IIS-0208574
Principal Investigators:
Clement Yu | Weiyi Meng
Department of Computer Science | Department of Computer Science
University of Illinois at Chicago | State University of New York at Binghamton
Chicago, IL 60607 | Binghamton, NY 13902
Phone: (312) 996-2318 | Phone: (607) 777-4311
Fax: (312) 413-0024 | Fax: (607) 777-4729
Email: yu@cs.uic.edu | Email: meng@cs.binghamton.edu
URL: http://www.cs.uic.edu/~yu | URL: http://www.cs.binghamton.edu/~meng
Collaborators:
Vijay Raghavan | Zonghuan Wu
Center for Advanced Computer Studies | Center for Advanced Computer Studies
University of Louisiana at Lafayette | University of Louisiana at Lafayette
Lafayette, LA 70504 | Lafayette, LA 70504
Phone: (337) 482-6603 | Phone: (337) 482-5243
Fax: (337) 482-5791 | Fax: (337) 482-5791
Email: raghavan@cacs.louisiana.edu | Email: zwu@cacs.louisiana.edu
List of Supported Students:
Fang Liu, Shuang Liu, Hai He: Research Assistants
Project Award Information:
Duration: 8/15/2002 - 8/14/2005
Title: WebScales: Towards a Highly Scalable Metasearch Engine
Keywords: Large-scale metasearch engine, distributed information retrieval,
search engine discovery, search engine wrapper, database selection
Project Summary:
The main objective of this collaborative project is to develop
enabling techniques for a large-scale metasearch engine that aims at covering a
much larger portion of the Web and at the same time retrieving more up-to-date
and more useful documents than existing search engines and metasearch engines.
A metasearch engine is a system that provides unified access to multiple
existing search engines. Upon receiving a query, the metasearch engine
determines the appropriate search engines to invoke, the documents to retrieve
from each invoked search engine and finally the set of documents to be shown to
the user. The main problems to be studied in this project include (1) how to
automatically discover useful search engines on the Web; (2) how to
automatically and accurately categorize search engines into a concept hierarchy
and how to use user profiles to map user queries to appropriate concept(s) in
the hierarchy; (3) how to automatically incorporate search engines into a
metasearch engine; (4) how to perform accurate database selection for longer
queries; and (5) how to merge results returned from multiple search engines.
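The query flow described above (choose engines, retrieve from each, merge) can be illustrated with a minimal sketch. It is not the project's method: the engine names, the per-engine term statistics, the summed-weight selection score, and the min-max score normalization are all assumptions made for illustration only.

```python
from collections import defaultdict

# Hypothetical engine "representatives": term -> fraction of the engine's
# documents that contain the term. All names and values are made up.
REPRESENTATIVES = {
    "med_search": {"heart": 0.30, "disease": 0.40, "engine": 0.01},
    "car_search": {"engine": 0.50, "repair": 0.35, "heart": 0.01},
    "news_search": {"heart": 0.05, "disease": 0.05, "engine": 0.05},
}


def select_engines(query_terms, representatives, k=2):
    """Rank engines by the summed weights of the query terms; keep the top k
    engines whose score is nonzero (a simple stand-in for database selection)."""
    scores = {
        name: sum(rep.get(term, 0.0) for term in query_terms)
        for name, rep in representatives.items()
    }
    ranked = sorted(scores, key=lambda name: (-scores[name], name))
    return [name for name in ranked[:k] if scores[name] > 0]


def merge_results(result_lists):
    """Merge per-engine lists of (url, score) pairs into one ranked URL list.

    Each engine's scores are min-max normalized to [0, 1] so that engines
    with different scoring scales become roughly comparable."""
    merged = defaultdict(float)
    for results in result_lists.values():
        if not results:
            continue
        lo = min(score for _, score in results)
        hi = max(score for _, score in results)
        span = (hi - lo) or 1.0  # avoid division by zero when all scores tie
        for url, score in results:
            # A URL returned by several engines keeps its best normalized score.
            merged[url] = max(merged[url], (score - lo) / span)
    return sorted(merged, key=lambda url: (-merged[url], url))
```

For example, `select_engines(["heart", "disease"], REPRESENTATIVES)` keeps the two engines whose statistics best match the query, and `merge_results` then interleaves whatever those engines return. A real system would need richer representatives and a merging scheme that accounts for how much each engine can be trusted.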
Publications and Products:
1. F. Liu, C. Yu, W. Meng. Personalize Web Search by Mapping User Queries to
Categories. ACM International Conference on Information and Knowledge
Management (CIKM'02), McLean, Virginia, November 2002, pp.558-565.
2. K. Liu, C. Yu, W. Meng. Discovering the Representative of a Search Engine.
ACM International Conference on Information and Knowledge Management
(CIKM'02), poster paper, McLean, Virginia, November 2002, pp.652-654.
3. Z. Wu, V. Raghavan, D. Chun, W. Meng, and C. Yu. SE-LEGO: A System to Create
Metasearch Engines on Demand. ACM SIGIR Conference, Demo paper, Toronto,
Canada, July 2003 (to appear).
4. Z. Wu, V. Raghavan, C. Du, W. Meng, H. He and C. Yu. Creating Customized
Metasearch Engines on Demand Using SE-LEGO. International Conference on Web-Age
Information Management (WAIM'03), Chengdu, China, Demo paper, August 2003 (to
appear).
5. C. Yu, G. Philip, W. Meng. Distributed Top-N Query Processing with Possibly
Uncooperative Local Systems. International Conference on Very Large Data Bases
(VLDB'03), Berlin, Germany, September 2003 (to appear).
6. H. He, W. Meng, C. Yu, and Z. Wu. WISE-Integrator: An Automatic Integrator of
Web Search Interfaces for E-Commerce. International Conference on Very Large
Data Bases (VLDB'03), Berlin, Germany, September 2003 (to appear).
7. Z. Wu, V. Raghavan, H. Qian, V. Rama K, W. Meng, H. He, C. Yu. Towards
Automatic Incorporation of Search Engines into a Large-Scale Metasearch Engine.
2003 IEEE/WIC International Conference on Web Intelligence, Halifax, Canada,
October 2003 (to appear).
8. F. Liu, C. Yu, W. Meng. Personalized Web Search for Improving Retrieval
Effectiveness. IEEE Transactions on Knowledge and Data Engineering (to appear).
9. C. Yu, and W. Meng. Web Search Technology. In The Internet Encyclopedia,
edited by Hossein Bidgoli, Wiley Publishers (to appear).
Project Impact:
Human Resources: One Ph.D. student and five M.S. students have graduated;
four Ph.D. students and several M.S. students are currently working on
the project.
Education and Curriculum Development:
Course material based on this project has been used in two graduate courses
(CS632: Advanced Database Systems, CS582: Information Retrieval).
Goals, Objectives, and Targeted Activities:
This three-year project is at the end of its first year. We are on
schedule to achieve the goals planned for the project. In the first year, we
conducted research in the following areas.
1. Develop methods to automatically discover search engines, connect to them,
and extract retrieved results from returned result pages.
2. Improve database selection and collection fusion methods for longer queries.
3. Improve the use of user profiles to map user queries to appropriate
categories in a concept hierarchy, and thereby improve the retrieval
effectiveness of documents.
4. Study some important issues in extending our metasearch engine techniques to
more structured data (such as the data in e-commerce search engines). These
issues include the top-N query problem in distributed relational databases and
the automatic integration of the interfaces of multiple e-commerce search
engines.
5. Implement a preliminary prototype system (SE-LEGO) for automatically
discovering search engines and incorporating them into a metasearch engine.
The following are the objectives for the next year:
1. Continue to improve our methods to automatically connect to search engines
and to automatically extract search results. Make the prototype system SE-LEGO
more robust.
2. Develop a better database categorization algorithm and test its
effectiveness.
3. Study practical solutions for removing redundant and peculiar search
engines. We aim at methods with low polynomial time complexity.
4. Implement the representative collection component of the WebScales system.
Area Background:
Currently there are hundreds of thousands of search engines on the
Web and each of them covers a small portion of the Web (either the deep Web or
the surface Web). Creating a metasearch engine on top of all useful search
engines is an effective way to combine the coverage of these search engines
and to reach a large portion of the deep Web. Due to the large number of search
engines involved, highly scalable and automated techniques are needed to create
and maintain such a metasearch engine. This project aims to solve the technical
problems towards building such a metasearch engine. To automatically discover
useful search engines on the Web, a specialized Web crawler that can recognize
Web search engine pages is needed. To automatically incorporate search engines
into a metasearch engine, methods that can analyze search engine pages to
extract connection information and that can analyze result pages to extract
correct result information are needed. To identify potentially useful search
engines for each user query efficiently and accurately, techniques that can
collect characteristic information of each search engine efficiently and
accurately, that can organize such information of all search engines in a scalable
manner, and that can utilize the information for efficient and accurate search
engine selection are needed. This project is related to distributed information
retrieval. In addition, projects on metasearching techniques for Web sources on
structured data are also related.
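As a rough illustration of what "recognizing a Web search engine page" might involve, the sketch below flags a page as a search-interface candidate when it contains a form with exactly one text box and no password field. This heuristic, the class name `SearchFormDetector`, and the function `looks_like_search_page` are assumptions for illustration; the project's actual crawler may use entirely different evidence.

```python
from html.parser import HTMLParser


class SearchFormDetector(HTMLParser):
    """Record, for each <form>, how many text and password inputs it contains."""

    def __init__(self):
        super().__init__()
        self.forms = []       # one summary dict per completed form
        self._current = None  # summary of the form currently being parsed

    def handle_starttag(self, tag, attrs):
        attrs = dict(attrs)
        if tag == "form":
            self._current = {"action": attrs.get("action") or "",
                             "text": 0, "password": 0}
        elif tag == "input" and self._current is not None:
            # An <input> without a type attribute defaults to a text box.
            kind = (attrs.get("type") or "text").lower()
            if kind in ("text", "search"):
                self._current["text"] += 1
            elif kind == "password":
                self._current["password"] += 1

    def handle_endtag(self, tag):
        if tag == "form" and self._current is not None:
            self.forms.append(self._current)
            self._current = None


def looks_like_search_page(html):
    """Heuristic (an assumption, not the project's classifier): a page is a
    candidate if some form has exactly one text box and no password field."""
    parser = SearchFormDetector()
    parser.feed(html)
    parser.close()
    return any(f["text"] == 1 and f["password"] == 0 for f in parser.forms)
```

A crawler could call `looks_like_search_page` on each fetched page and pass the candidates to a stricter check, such as submitting a probe query and inspecting the result page, before treating the site as a search engine.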
Area References:
1. V. Crescenzi, G. Mecca, and P. Merialdo. RoadRunner: Towards Automatic Data
Extraction from Large Web Sites. In Proceedings of the 27th International
Conference on Very Large Data Bases, Rome, Italy, 2001, pp.109-118.
2. D. Dreilinger, and A. Howe. Experiences with Selecting Search Engines Using
Metasearch. ACM TOIS, 15(3), July 1997, pp.195-222.
3. Y. Fan, and S. Gauch. Adaptive Agents for Information Gathering from
Multiple, Distributed Information Sources. 1999 AAAI Symposium on Intelligent
Agents in Cyberspace, Stanford University, March 1999.
4. J. French, A. Powell, C. Viles. Evaluating Database Selection Techniques: A
Testbed and Experiment. ACM SIGIR Conference, 1998, pp.121-129.
5. L. Gravano, and H. Garcia-Molina. Generalizing GlOSS to Vector-Space
Databases and Broker Hierarchies. VLDB Conference, 1995.
6. L. Gravano, and H. Garcia-Molina. Merging Ranks from Heterogeneous Internet
Sources. VLDB Conference, 1997.
7. B. He, K. Chang. Statistical Schema Integration Across the Deep Web. ACM
SIGMOD Conference, 2003.
8. P. Ipeirotis, L. Gravano, and M. Sahami. Probe, Count, and Classify. ACM
SIGMOD Conference, 2001.
9. P. Ipeirotis, and L. Gravano. Distributed Search over the Hidden Web:
Hierarchical Database Sampling and Selection. VLDB Conference, Hong Kong, 2002.
10. W. Meng, C. Yu, K. Liu. Building Efficient and Effective Metasearch Engines.
ACM Computing Surveys, Vol. 34, No. 1, March 2002, pp.48-89.
11. W. Meng, Z. Wu, C. Yu, Z. Li. A Highly-Scalable and Effective Method for
Metasearch. ACM Transactions on Information Systems 19(3), July 2001,
pp.310-335.
12. C. Yu, K. Liu, W. Meng, Z. Wu, N. Rishe. A Methodology to Retrieve Text
Documents from Multiple Databases. IEEE Transactions on Knowledge and Data
Engineering, Vol.14, No.6, November/December 2002, pp.1347-1361.
13. E. Voorhees, N. Gupta, and B. Johnson-Laird. Learning Collection Fusion
Strategies. ACM SIGIR Conference, Seattle, 1995, pp.172-179.
14. J. Xu, and J. Callan. Effective Retrieval with Distributed Collections. ACM
SIGIR Conference, Melbourne, Australia, 1998, pp.112-120.
Project Website:
Project URL: http://www.cs.binghamton.edu/~meng/metasearch.html
This site lists all publications (including some technical reports
and all annual IDM workshop reports) related to this project. The PostScript
or PDF files of these publications are also available at this site.