WebScales: Towards a Highly Scalable Metasearch Engine
Project Award Numbers: IIS-0208434, IIS-0208574
Principal Investigators:
| Clement Yu | Weiyi Meng |
| Department of Computer Science | Department of Computer Science |
| University of Illinois at Chicago | State University of New York at Binghamton |
| Chicago, IL 60607 | Binghamton, NY 13902 |
| Phone: (312) 996-2318 | Phone: (607) 777-4311 |
| Fax: (312) 413-0024 | Fax: (607) 777-4729 |
| Email: yu@cs.uic.edu | Email: meng@cs.binghamton.edu |
| URL: http://www.cs.uic.edu/~yu | URL: http://www.cs.binghamton.edu/~meng |
Collaborators:
| Vijay Raghavan, Zonghuan Wu | King-Lup Liu |
| Center for Advanced Computer Studies | Webscalers, LLC. |
| University of Louisiana at Lafayette | 121 Conque Drive |
| Lafayette, LA 70504 | Lafayette, LA 70506 |
| Phone: (337) 482-6603; (337) 482-5243 | Email: kliu2002@yahoo.com |
| Fax: (337) 482-5791 | |
| Email: {raghavan, zwu}@cacs.louisiana.edu | |
List of Supported Students:
Fang Liu, Shuang Liu, Hai He: Research Assistants
Project Award Information:
Duration: 8/15/2002 - 8/14/2005
Title: WebScales: Towards a Highly Scalable Metasearch Engine
Keywords: Large-scale metasearch engine, distributed information retrieval,
search engine discovery, search engine wrapper, database selection
Project Summary:
The main objective of this collaborative project is to develop
enabling techniques for a large-scale metasearch engine that aims at covering a
much larger portion of the Web and at the same time retrieving more up-to-date
and more useful documents than existing search engines and metasearch engines.
A metasearch engine is a system that provides unified access to multiple
existing search engines. Upon receiving a query, the metasearch engine
determines the appropriate search engines to invoke, the documents to retrieve
from each invoked search engine and finally the set of documents to be shown to
the user. The main problems to be studied in this project include (1) how to
automatically discover useful search engines on the Web; (2) how to
automatically and accurately categorize search engines into a concept hierarchy
and how to use user profiles to map user queries to appropriate concept(s) in
the hierarchy; (3) how to automatically incorporate search engines into a
metasearch engine; (4) how to perform accurate database selection for longer
queries; and (5) how to merge results returned from multiple search engines.
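The query flow described above (select search engines, retrieve from each, merge the results) can be illustrated with a minimal sketch. Everything in it is an assumption made for illustration and is not the project's actual method: the toy in-memory engines in `ENGINES`, the document-frequency "representative" of each engine, the sum-of-frequencies selection score, and the max-score merge are all hypothetical stand-ins.

```python
from collections import Counter
from typing import Dict, List, Tuple

# Hypothetical toy "search engines": name -> list of (doc_id, text).
ENGINES: Dict[str, List[Tuple[str, str]]] = {
    "news": [("n1", "city council election results"),
             ("n2", "local election news coverage")],
    "sports": [("s1", "football match results"),
               ("s2", "basketball season schedule")],
    "tech": [("t1", "search engine ranking algorithms"),
             ("t2", "web crawler design")],
}


def build_representative(docs: List[Tuple[str, str]]) -> Counter:
    """Summarize an engine's collection as term -> document frequency."""
    rep: Counter = Counter()
    for _, text in docs:
        rep.update(set(text.lower().split()))
    return rep


def select_engines(query: str, reps: Dict[str, Counter], k: int) -> List[str]:
    """Keep the k engines whose representatives best match the query terms."""
    terms = query.lower().split()
    scored = [(sum(rep[t] for t in terms), name) for name, rep in reps.items()]
    scored = [(score, name) for score, name in scored if score > 0]
    scored.sort(key=lambda pair: (-pair[0], pair[1]))
    return [name for _, name in scored[:k]]


def local_search(name: str, query: str) -> List[Tuple[str, int]]:
    """Simulate one engine: score each document by matched query terms."""
    terms = set(query.lower().split())
    hits = []
    for doc_id, text in ENGINES[name]:
        score = len(terms & set(text.lower().split()))
        if score > 0:
            hits.append((doc_id, score))
    return hits


def metasearch(query: str, k: int = 2, n: int = 5) -> List[str]:
    """Select engines, query each one, and merge results into one ranked list."""
    reps = {name: build_representative(docs) for name, docs in ENGINES.items()}
    merged: Dict[str, int] = {}
    for name in select_engines(query, reps, k):
        for doc_id, score in local_search(name, query):
            merged[doc_id] = max(score, merged.get(doc_id, 0))
    ranked = sorted(merged.items(), key=lambda pair: (-pair[1], pair[0]))
    return [doc_id for doc_id, _ in ranked[:n]]


if __name__ == "__main__":
    print(metasearch("election results"))
```

In this sketch the local scores are directly comparable because every toy engine uses the same scoring function. Real search engines rank with different, often undisclosed functions, which is why result merging is listed as a research problem in its own right.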
Publications and Products:
1. F. Liu, C. Yu, W. Meng. Personalize Web Search by Mapping User Queries to
Categories. ACM International Conference on Information and Knowledge
Management (CIKM'02), pp.558-565, McLean, Virginia, 2002.
2. K. Liu, C. Yu, W. Meng. Discovering the Representative of a Search Engine. ACM
International Conference on Information and Knowledge Management (CIKM'02),
poster paper, pp.652-654, McLean, Virginia, 2002.
3. Z. Wu, V. Raghavan, D. Chun, W. Meng, and C. Yu. SE-LEGO: A System to Create
Metasearch Engines on Demand. ACM SIGIR Conference, Demo paper, pp.464,
Toronto, Canada, July 2003.
4. Z. Wu, V. Raghavan, C. Du, W. Meng, H. He, C. Yu. Creating Customized Metasearch
Engines on Demand Using SE-LEGO. International Conference on Web-Age
Information Management (WAIM'03), Chengdu, China, Demo paper, pp.503-503,
August 2003.
5. C. Yu, G. Philip, W. Meng. Distributed Top-N Query Processing with Possibly
Uncooperative Local Systems. International Conference on Very Large Data Bases
(VLDB'03), pp.117-128, Berlin, Germany, 2003.
6. H. He, W. Meng, C. Yu, Z. Wu. WISE-Integrator: An Automatic Integrator of Web
Search Interfaces for E-Commerce. International Conference on Very Large Data
Bases (VLDB'03), pp.357-368, Berlin, 2003.
7. Z. Wu, V. Raghavan, H. Qian, V. Rama K, W. Meng, H. He, C. Yu. Towards Automatic
Incorporation of Search Engines into a Large-Scale Metasearch Engine. IEEE/WIC
International Conference on Web Intelligence, pp.658-661, Halifax, Canada,
2003.
8. C. Yu, W. Meng. Web Search Technology. In The Internet Encyclopedia edited by
Hossein Bidgoli, Wiley Publishers, pp.738-753, 2003.
9. F. Liu, C. Yu, W. Meng. Personalized Web Search for Improving Retrieval
Effectiveness. IEEE Transactions on Knowledge and Data Engineering, Vol.16,
No.1, pp.28-40, January 2004.
10. W. Wu, C. Yu, W. Meng. Database Selection for Longer Queries. Proceedings of
the 2004 Meeting of the International Federation of Classification Societies,
Chicago, July 2004.
11. W. Wu, C. Yu, A. Doan, W. Meng. An Interactive Clustering-based Approach to
Integrating Source Query Interfaces on the Deep Web. ACM SIGMOD Conference,
pp.95-106, Paris, France, June 2004.
12. H. He, W. Meng, C. Yu, Z. Wu. Automatic Extraction of Web Search Interfaces
for Interface Schema Integration. World Wide Web Conference (WWW2004), poster
paper, pp.414-415, New York City, May 2004.
13. Q. Peng, W. Meng, H. He, C. Yu. Clustering E-Commerce Search Engines. World
Wide Web Conference (WWW2004), poster paper, pp.416-417, New York City, May
2004.
14. S. Liu, F. Liu, C. Yu, W. Meng. An Effective Approach to Document Retrieval
via Utilizing WordNet and Recognizing Phrases. ACM SIGIR Conference,
pp.266-272, Sheffield, UK, July 2004.
15. H. He, W. Meng, C. Yu, Z. Wu. Automatic Integration of Web Search Interfaces
with WISE-Integrator. VLDB Journal (to appear).
16. Q. Peng, W. Meng, H. He, C. Yu. WISE-Cluster: Clustering E-Commerce Search
Engines Automatically. 6th ACM International Workshop on Web Information and
Data Management (WIDM 2004), Washington, DC, November 2004 (to appear).
Project Impact:
Human Resources: One Ph.D. student and seven M.S. students have
graduated; five Ph.D. students and several M.S. students are currently working
on the project.
Education and Curriculum Development:
Course materials based on this project have been used in three graduate
courses: CS632 – Advanced Database Systems, CS634 – Web Data Management, and
CS582 – Information Retrieval.
Goals, Objectives, and Targeted Activities:
This three-year project is at the end of the second year. In the
second year, we did research in the following areas.
1. Designed and implemented a new method to automatically extract retrieved
results from returned result pages.
2. Improved our prototype system (SE-LEGO) for automatically discovering search
engines and incorporating them into a metasearch engine.
3. Designed and implemented a new result merging algorithm.
4. Developed a newspaper metasearch engine prototype system with 50 local
newspaper search engines.
5. Designed and implemented an algorithm to generate database representatives.
6. Improved the method for automatically integrating the interfaces of multiple
e-commerce search engines.
7. Developed a method to cluster e-commerce search engines.
The objectives for the next year are as follows:
1. Implement a metasearch engine with at least 1,000 search engines to test the
scalability of our approach.
2. Implement an algorithm for generating database representatives for deep Web
search engines.
3. Further improve SE-LEGO, especially the search engine connection and result
extraction components.
4. Develop a better database categorization algorithm and test its
effectiveness.
5. Study practical solutions for identifying redundant search engines in a
metasearch engine context.
Area Background:
Currently there are hundreds of thousands of search engines on the
Web, and each of them covers a small portion of the Web (either the deep Web or
the surface Web). Creating a metasearch engine on top of all useful search
engines is an effective way to combine the coverage of these search engines
and to reach a large portion of the deep Web. Because of the large number of
search engines involved, highly scalable and automated techniques are needed to
create and maintain such a metasearch engine. This project aims to solve the
technical problems in building such a metasearch engine. To automatically
discover useful search engines on the Web, a specialized Web crawler that can
recognize Web search engine pages is needed. To automatically incorporate
search engines into a metasearch engine, methods are needed that can analyze
search engine pages to extract connection information and analyze result pages
to extract correct result information. To identify potentially useful search
engines for each user query efficiently and accurately, techniques are needed
that can collect characteristic information about each search engine, organize
this information for all search engines in a scalable manner, and use it for
search engine selection. This project is related to distributed information
retrieval. In addition, projects on metasearching techniques for Web sources of
structured data are also related.
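As an illustration of what recognizing a search engine page and extracting its connection information might involve, the following is a minimal sketch built on Python's standard `html.parser` module. The heuristic is an assumption made purely for illustration (a form with exactly one free-text input and no password field is treated as a search form), and the names `SearchFormFinder` and `find_search_forms` are hypothetical. It is not the crawler or wrapper technique developed in this project.

```python
from html.parser import HTMLParser
from typing import Dict, List, Optional


class SearchFormFinder(HTMLParser):
    """Collect, for each <form>, its action, method, and input fields."""

    def __init__(self) -> None:
        super().__init__()
        self.forms: List[Dict] = []
        self._current: Optional[Dict] = None

    def handle_starttag(self, tag, attrs):
        attr = dict(attrs)
        if tag == "form":
            self._current = {
                "action": attr.get("action") or "",
                "method": (attr.get("method") or "get").lower(),
                "text_inputs": [],
                "has_password": False,
            }
        elif tag == "input" and self._current is not None:
            input_type = (attr.get("type") or "text").lower()
            if input_type in ("text", "search"):
                self._current["text_inputs"].append(attr.get("name") or "")
            elif input_type == "password":
                self._current["has_password"] = True

    def handle_endtag(self, tag):
        if tag == "form" and self._current is not None:
            self.forms.append(self._current)
            self._current = None


def find_search_forms(html: str) -> List[Dict]:
    """Return forms that look like search interfaces under the assumed heuristic:
    exactly one free-text input and no password field."""
    finder = SearchFormFinder()
    finder.feed(html)
    finder.close()
    return [
        form for form in finder.forms
        if len(form["text_inputs"]) == 1 and not form["has_password"]
    ]


if __name__ == "__main__":
    page = """
    <form action="/search" method="GET">
      <input type="text" name="q"><input type="submit" value="Go">
    </form>
    <form action="/login" method="post">
      <input type="text" name="user"><input type="password" name="pw">
    </form>
    """
    for form in find_search_forms(page):
        print(form["action"], form["method"], form["text_inputs"])
```

The returned action URL, HTTP method, and query parameter name are the kind of connection information a metasearch engine would need in order to forward a query. A real system would also have to handle JavaScript-driven forms, hidden fields, and many other cases that this sketch ignores.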
Area References:
1. V. Crescenzi, G. Mecca, and P. Merialdo. RoadRunner: Towards Automatic Data
Extraction from Large Web Sites. International Conference on Very Large Data
Bases, Rome, Italy, 2001, pp.109-118.
2. D. Dreilinger and A. Howe. Experiences with Selecting Search Engines Using
Metasearch. ACM TOIS, 15(3), July 1997, pp.195-222.
3. Y. Fan and S. Gauch. Adaptive Agents for Information Gathering from Multiple,
Distributed Information Sources. 1999 AAAI Symposium on Intelligent Agents in
Cyberspace, Stanford University, March 1999.
4. J. French, A. Powell, C. Viles. Evaluating Database Selection Techniques: A
Testbed and Experiment. ACM SIGIR Conference, pp.121-129, 1998.
5. L. Gravano and H. Garcia-Molina. Generalizing GlOSS to Vector-Space Databases
and Broker Hierarchies. VLDB Conference, 1995.
6. L. Gravano, H. Garcia-Molina. Merging Ranks from Heterogeneous Internet
Sources. VLDB Conference, 1997.
7. B. He, K. Chang. Statistical Schema Integration Across the Deep Web. ACM
SIGMOD Conference, 2003.
8. P. Ipeirotis, L. Gravano, and M. Sahami. Probe, Count, and Classify. ACM
SIGMOD Conference, 2001.
9. P. Ipeirotis and L. Gravano. Distributed Search over the Hidden Web:
Hierarchical Database Sampling and Selection. VLDB Conference, Hong Kong, 2002.
10. W. Meng, C. Yu, K. Liu. Building Efficient and Effective Metasearch Engines.
ACM Computing Surveys, Vol. 34, No. 1, March 2002, pp.48-89.
11. W. Meng, Z. Wu, C. Yu, Z. Li. A Highly-Scalable and Effective Method for
Metasearch. ACM Transactions on Information Systems 19(3), pp.310-335, July
2001.
12. C. Yu, K. Liu, W. Meng, Z. Wu, N. Rishe. A Methodology to Retrieve Text
Documents from Multiple Databases. IEEE Transactions on Knowledge and Data
Engineering, 14:6, Nov./Dec. 2002, pp.1347-1361.
13. E. Voorhees, N. Gupta, and B. Johnson-Laird. Learning Collection Fusion
Strategy. ACM SIGIR Conference, Seattle, 1995, pp.172-179.
14. J. Xu and J. Callan. Effective Retrieval with Distributed Collections. ACM
SIGIR Conference, pp.112-120, Melbourne, Australia, 1998.
Project Website:
Project URL: http://www.cs.binghamton.edu/~meng/metasearch.html
This site lists all publications related to this project, including all annual
IDM workshop reports. PostScript or PDF files of these publications are also
available at this site.