Determining Text Databases to Search in the Internet
Clement Yu*, Weiyi Meng**
*Department of Computer Science, University of Illinois at Chicago
**Department of Computer Science, State University of New York at Binghamton
Contact Information
Clement Yu
Department of Computer Science
University of Illinois at Chicago
Chicago, IL 60607
Phone: (312) 996-2318
Fax: (312) 413-0024
Email: yu@cs.uic.edu
URL: http://www.cs.uic.edu/~yu

Weiyi Meng
Department of Computer Science
State University of New York at Binghamton
Binghamton, NY 13902
Phone: (607) 777-4311
Fax: (607) 777-4729
Email: meng@cs.binghamton.edu
URL: http://www.cs.binghamton.edu/~meng
WWW PAGE: http://www.cs.binghamton.edu/~meng/metasearch.html
List of Supported Students:
Zonghuan Wu, Fang Liu, Shuang Liu (Research Assistants)
Project Award Information:
Award numbers: IIS-9902792, IIS-9902872
Duration: 10/1/1999 - 9/30/2002
Title: Determining Text Databases to Search in the Internet
Keywords: Internet resource discovery, distributed information retrieval, collection fusion, metasearch engine
Project Summary:
The main objective of this collaborative project is to develop
theoretically rigorous yet practically applicable techniques needed for the
development of next generation metasearch engines. A metasearch engine is a
system that provides unified access to multiple existing search engines. Upon
receiving a query, the metasearch engine determines the appropriate search
engines to invoke, the documents to retrieve from each invoked search engine
and finally the set of documents to be shown to the user. Methods that can
scale to a large number of search engines will be developed to accurately
predict the usefulness of a search engine with respect to each query. A
necessary and sufficient condition is developed to optimally rank search
engines in descending order of desirability with respect to a given query.
Efficient solutions are proposed to satisfy this condition, including the
incorporation of the link-derived importance of a page into the desirability
measure. A dynamic concept hierarchy based method that has the potential to
improve the retrieval of actually useful documents (not just similar documents)
is studied. Techniques that guarantee the retrieval of all potentially
useful documents from a local search engine while minimizing the retrieval of
useless documents will be investigated. The problem of merging results from
multiple sources into a single ranked list has also been studied. The proposed
techniques require certain knowledge about each search engine, and a method that
uses probing queries to discover this knowledge about each local search
engine is explored.
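As a rough illustration of ranking search engines in descending order of desirability, the sketch below takes one possible reading of "desirability": the similarity of the best-matching document an engine holds. That reading, the cosine measure, the function names, and the toy collections are all assumptions made for this example. It scores every document by brute force, whereas a method that scales to many engines would have to estimate the score from a compact summary of each engine.

```python
import math
from collections import Counter


def cosine(query_terms, doc_terms):
    """Cosine similarity between two bags of words (raw term counts)."""
    q, d = Counter(query_terms), Counter(doc_terms)
    dot = sum(q[t] * d[t] for t in q)
    norm = (math.sqrt(sum(v * v for v in q.values()))
            * math.sqrt(sum(v * v for v in d.values())))
    return dot / norm if norm else 0.0


def rank_engines(query, engines):
    """Rank engines by the similarity of their most similar document.

    `engines` maps a (hypothetical) engine name to its list of documents.
    Returns (name, score) pairs, best engine first.
    """
    q = query.lower().split()
    scores = {
        name: max((cosine(q, doc.lower().split()) for doc in docs), default=0.0)
        for name, docs in engines.items()
    }
    return sorted(scores.items(), key=lambda kv: kv[1], reverse=True)


# Toy collections, invented for illustration only.
TOY_ENGINES = {
    "sports": ["football match results", "tennis tournament schedule"],
    "computing": ["metasearch engine design",
                  "text database selection for metasearch"],
    "cooking": ["pasta recipes", "bread baking tips"],
}

ranking = rank_engines("metasearch engine", TOY_ENGINES)
print(ranking[0][0])  # the engine judged most desirable for this query
```

With such a ranking, a metasearch engine could invoke only the first few engines and skip those whose best document is unlikely to be useful.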
Publications and Products:
1. W. Meng, C. Yu, and K. Liu. Detection of Heterogeneities in a Multiple Text Database Environment. Fourth IFCIS Conference on Cooperative Information Systems (CoopIS'99), Edinburgh, Scotland, September 1999, pp.22-33.
2. C. Yu, W. Meng, K. Liu, W. Wu, and N. Rishe. Efficient and Effective Metasearch for a Large Number of Text Databases. ACM International Conference on Information and Knowledge Management, 1999, pp.217-224.
3. W. Wang, W. Meng, and C. Yu. Concept Hierarchy Based Text Database Categorization in a Metasearch Engine Environment. 1st Int’l Conference on Web Information Systems Engineering, Hong Kong, 2000, pp.283-290.
4. K. Liu, W. Meng, C. Yu, C. Zhang, and N. Rishe. Discovery of Similarity Computations of Search Engines. ACM International Conference on Information and Knowledge Management, Washington, D.C., November 2000, pp.290-297.
5. Z. Wu, W. Meng, C. Yu, and Z. Li. Towards a Highly Scalable and Effective Metasearch Engine. 10th WWW Conference, Hong Kong, May 2001, pp.386-395.
6. C. Yu, W. Meng, W. Wu, and K. Liu. Efficient and Effective Metasearch for Text Databases Incorporating Linkages among Documents. ACM SIGMOD Conference, May 2001, pp.187-198.
7. C. Yu, P. Sharma, W. Meng, and Y. Qin. Database Selection for Processing k Nearest Neighbors Queries in Distributed Environments. First ACM/IEEE Joint Conference on Digital Libraries, Roanoke, VA, June 2001, pp.215-222.
8. W. Meng, Z. Wu, C. Yu, Z. Li. A Highly Scalable and Effective Method for Metasearch. ACM Transactions on Information Systems, 19(3), July 2001, pp.310-335.
9. K. Liu, C. Yu, W. Meng, A. Santoso, and C. Zhang. Discovering the Representative of a Search Engine. Tenth ACM International Conference on Information and Knowledge Management (poster paper), Atlanta, Georgia, November 2001, pp.577-579.
10. W. Meng, W. Wang, H. Sun, and C. Yu. Concept Hierarchy Based Text Database Categorization. International Journal on Knowledge and Information Systems, 4(2), March 2002, pp.132-150.
11. W. Meng, C. Yu, K. Liu. Building Efficient and Effective Metasearch Engines. ACM Computing Surveys, 34(1), March 2002, pp.48-89.
12. K. Liu, C. Yu, W. Meng, W. Wu, and N. Rishe. A Statistical Method for Estimating the Usefulness of Text Databases. IEEE Transactions on Knowledge and Data Engineering (to appear).
13. C. Yu, K. Liu, W. Meng, Z. Wu, and N. Rishe. A Methodology to Retrieve Text Documents from Multiple Databases. IEEE Transactions on Knowledge and Data Engineering (to appear).
14. A prototype demo metasearch engine is created (http://www.data.binghamton.edu:8080/CSams/index.html).
Project Impact:
Human Resources: One Ph.D. student and seven M.S. students (four female) have graduated;
four Ph.D. students and several M.S. students, including some female and minority
students, are currently working on the project.
Education and Curriculum Development: Course material based on this
project has been used in two graduate courses (CS534: Web Data Management and
CS582: Information Retrieval).
Infrastructure: Four computers were purchased and installed in the two
database laboratories of the two investigators.
Tutorials and Seminars: Three tutorials were given at three international conferences
(VLDB'99, ACM DL'99, WISE'00). Over ten seminars were given to companies and
universities.
Goals, Objectives, and Targeted Activities
This three-year project is in the middle of its third year. We are
on schedule to achieve all the goals planned for the project. In the last year,
we did research in the following areas.
1. Improve database selection and collection fusion methods for longer queries.
2. Study the impact of using user profiles for personalized text retrieval.
3. Study database selection solutions for distributed top-N queries.
4. Develop programs for implementing large-scale metasearch engines.
For the last six months of the project, we will continue our research on personalized
search and distributed top-N query processing.
Project References (See http://www.cs.binghamton.edu/~meng/metasearch.html)
Area Background:
Many text sources are available on the Internet. Each text source
usually has an associated search engine. These widely distributed search
engines are highly heterogeneous. They may employ different techniques to
represent and rank documents, and they usually provide access to different sets
of documents of diverse interest. Frequently, the documents that satisfy a
user's information need are spread across the databases of multiple local
search engines. It is inconvenient and inefficient for an ordinary user to
query multiple search engines and identify useful documents among the results
they return. As the number of search engines increases, there is therefore a
growing need for automatic search brokers (metasearch engines) that can invoke
multiple search engines on the user's behalf. Through a search broker, a user
needs to submit only a single query to retrieve the desired documents.
There are a number of challenges in building an efficient and
effective metasearch engine. The database selection
problem is to identify text sources that are likely to return useful documents
for a given query. The document selection problem is to determine which documents
should be retrieved from each identified source. The result merging problem is
to combine the documents returned from all identified sources into a single
ranked list. This project aims to find good solutions to these problems and to
build a prototype metasearch engine.
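The three problems named above fit together as a pipeline, which the following minimal sketch tries to make concrete. Everything in it is hypothetical and chosen for brevity: the term-overlap score, the `make_engine` and `metasearch` helpers, the use of concatenated documents as engine summaries, and the toy data. It does not represent the project's prototype. In particular, it merges results by rescoring every returned document with one global function, which is only one of several possible merging strategies.

```python
from typing import Callable, Dict, List, Tuple

# Hypothetical local search interface: (query, k) -> up to k matching documents.
Engine = Callable[[str, int], List[str]]


def overlap(query: str, text: str) -> int:
    """Toy global score: number of distinct query terms that occur in the text."""
    return len(set(query.lower().split()) & set(text.lower().split()))


def make_engine(docs: List[str]) -> Engine:
    """Wrap an in-memory document list as a local search engine."""
    def search(query: str, k: int) -> List[str]:
        ranked = sorted(docs, key=lambda d: overlap(query, d), reverse=True)
        return [d for d in ranked[:k] if overlap(query, d) > 0]
    return search


def metasearch(query: str, engines: Dict[str, Engine], summaries: Dict[str, str],
               n_engines: int = 2, k_per_engine: int = 2,
               top_n: int = 3) -> List[Tuple[str, str]]:
    # 1. Database selection: keep the engines whose summary best matches the query.
    selected = sorted(engines, key=lambda name: overlap(query, summaries[name]),
                      reverse=True)[:n_engines]
    # 2. Document selection: ask each selected engine for a few documents.
    results = [(doc, name) for name in selected
               for doc in engines[name](query, k_per_engine)]
    # 3. Result merging: rescore with one global function and sort into one list.
    results.sort(key=lambda r: overlap(query, r[0]), reverse=True)
    return results[:top_n]


# Toy data, invented for illustration only.
DOCS = {
    "ir": ["distributed information retrieval survey",
           "collection fusion for information retrieval"],
    "db": ["query processing in distributed databases", "transaction logs"],
    "news": ["local weather report", "election results"],
}
ENGINES = {name: make_engine(docs) for name, docs in DOCS.items()}
SUMMARIES = {name: " ".join(docs) for name, docs in DOCS.items()}

for doc, source in metasearch("distributed information retrieval", ENGINES, SUMMARIES):
    print(source, "|", doc)
```

Rescoring at the broker sidesteps the fact that scores from heterogeneous engines are not directly comparable, at the cost of needing the document text (or a surrogate for it) at merge time.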
Area References:
1. J. Callan, Z. Lu, and W. Croft. Searching Distributed Collections with Inference Networks. ACM SIGIR Conference, 1995, pp.21-28.
2. Digital Library Collaborative Working Groups. Resource Discovery in a Globally-Distributed Digital Library. Working Group Report, 1999 (http://www.iei.pi.cnr.it/DELOS/NSF/resourcediscovery.htm).
3. D. Dreilinger and A. Howe. Experiences with Selecting Search Engines Using Metasearch. ACM TOIS, 15(3), July 1997, pp.195-222.
4. Y. Fan and S. Gauch. Adaptive Agents for Information Gathering from Multiple, Distributed Information Sources. 1999 AAAI Symposium on Intelligent Agents in Cyberspace, Stanford University, March 1999.
5. J. French, A. Powell, and C. Viles. Evaluating Database Selection Techniques: A Testbed and Experiment. ACM SIGIR Conference, 1998, pp.121-129.
6. L. Gravano, C. Chang, H. Garcia-Molina, and A. Paepcke. STARTS: Stanford Proposal for Internet Meta-Searching. ACM SIGMOD, May 1997, pp.207-218.
7. L. Gravano and H. Garcia-Molina. Generalizing GlOSS to Vector-Space databases and Broker Hierarchies. VLDB Conference, 1995.
8. L. Gravano and H. Garcia-Molina. Merging Ranks from Heterogeneous Internet Sources. VLDB Conference, 1997.
9. E. Voorhees, N. Gupta, and B. Johnson-Laird. Learning Collection Fusion Strategy. ACM SIGIR Conference, Seattle, 1995, pp.172-179.
10. J. Xu and J. Callan. Effective Retrieval with Distributed Collections. ACM SIGIR Conference, Melbourne, Australia, 1998, pp.112-120.