Determining Text Databases to Search in the Internet
Clement Yu*, Weiyi Meng**
*Department of EECS, University of Illinois at Chicago
**Department of Computer Science, State University of New York at Binghamton
Contact Information

Clement Yu
Department of EECS
University of Illinois at Chicago
Chicago, IL 60607
Phone: (312) 996-2318
Fax: (312) 413-0024
Email: yu@eecs.uic.edu
URL: http://www.eecs.uic.edu/~yu

Weiyi Meng
Department of Computer Science
State University of New York at Binghamton
Binghamton, NY 13902
Phone: (607) 777-4311
Fax: (607) 777-4729
Email: meng@cs.binghamton.edu
URL: http://opal.cs.binghamton.edu/~meng
WWW PAGE: http://opal.cs.binghamton.edu/~meng/metasearch.html
List of Supported Students:
Zonghuan Wu, Fang Liu, Shuang Liu: Research Assistants
Project Award Information:
Award numbers: IIS-9902792, IIS-9902872
Duration: 10/1/1999 - 9/30/2002
Title: Determining Text Databases to Search in the Internet
Keywords: Internet resource discovery, distributed information retrieval,
collection fusion, metasearch engine
Project Summary:
The main objective of this collaborative project is to develop
theoretically rigorous yet practically applicable techniques needed for the
development of next generation metasearch engines and apply these techniques to
build an operational metasearch engine. A metasearch engine is a system that
provides unified access to multiple existing search engines. Upon receiving a
query, the metasearch engine determines the appropriate search engines to
invoke, the documents to retrieve from each invoked search engine and finally
the set of documents to be shown to the user. Methods that can scale to a large
number of search engines will be developed to accurately predict the usefulness
of a search engine with respect to each query. A necessary and sufficient
condition is developed to optimally rank search engines in descending order of
desirability with respect to a given query. Efficient solutions are proposed to
satisfy this condition, including the incorporation of the link-derived
importance of a page into the desirability measure. A dynamic concept hierarchy
based method that has the potential to improve the retrieval of actually useful
documents (not just similar documents) is studied. Techniques that guarantee
the retrieval of all potentially useful documents from a local search engine
while minimizing the retrieval of useless documents will be investigated. Such
a guarantee is highly desired in applications that need to
retrieve all potentially useful documents. The problem of merging results from
multiple sources into a single ranked list will also be studied. The techniques
to be developed are likely to require certain knowledge about each search
engine. As such knowledge or information may be proprietary, a method that uses
probing queries to discover needed knowledge about each local search engine is
explored.
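
As a rough illustration of ranking search engines in descending order of
desirability, the sketch below scores each engine by blending an estimated
similarity with a link-derived importance value. It is not the project's
ranking condition: the EngineSummary fields, the averaging of per-term weights,
and the mixing weight alpha are all assumptions made for this example.

```python
from dataclasses import dataclass


@dataclass
class EngineSummary:
    """Hypothetical per-engine statistics kept by the metasearch engine."""
    name: str
    # Assumed statistic: largest normalized weight of each term in any document.
    max_term_weight: dict[str, float]
    # Assumed link-derived importance of the engine's best page, scaled to [0, 1].
    max_importance: float


def desirability(query_terms: list[str], engine: EngineSummary,
                 alpha: float = 0.8) -> float:
    """Blend estimated similarity and importance (illustrative formula only)."""
    if query_terms:
        # Average the per-term weights as a crude similarity estimate.
        sim = sum(engine.max_term_weight.get(t, 0.0) for t in query_terms)
        sim /= len(query_terms)
    else:
        sim = 0.0
    return alpha * sim + (1.0 - alpha) * engine.max_importance


def rank_engines(query_terms: list[str], engines: list[EngineSummary],
                 alpha: float = 0.8) -> list[EngineSummary]:
    """Return engines sorted in descending order of desirability."""
    return sorted(engines,
                  key=lambda e: desirability(query_terms, e, alpha),
                  reverse=True)


if __name__ == "__main__":
    engines = [
        EngineSummary("A", {"metasearch": 0.9, "engine": 0.5}, 0.2),
        EngineSummary("B", {"metasearch": 0.3}, 0.9),
    ]
    for e in rank_engines(["metasearch", "engine"], engines):
        print(e.name, round(desirability(["metasearch", "engine"], e), 3))
```

Lowering alpha shifts the ranking toward engines whose pages have high
link-derived importance; raising it favors engines whose documents look most
similar to the query.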
Publications and Products:
1. W. Meng, C. Yu, and K. Liu. Detection of Heterogeneities in a Multiple Text
Database Environment. Fourth IFCIS Conference on Cooperative Information
Systems (CoopIS'99), Edinburgh, Scotland, September 1999, pp.22-33.
2. C. Yu, W. Meng, K. Liu, W. Wu, and N. Rishe. Efficient and Effective
Metasearch for a Large Number of Text Databases. ACM International Conference
on Information and Knowledge Management, 1999, pp.217-224.
3. W. Wang, W. Meng, and C. Yu. Concept Hierarchy Based Text Database
Categorization in a Metasearch Engine Environment. 1st Int’l Conference on Web
Information Systems Engineering, Hong Kong, 2000, pp.283-290.
4. K. Liu, W. Meng, C. Yu, C. Zhang, and N. Rishe. Discovery of Similarity
Computations of Search Engines. ACM International Conference on Information and
Knowledge Management, Washington, D.C., November 2000, pp.290-297.
5. Z. Wu, W. Meng, C. Yu, and Z. Li. Towards a Highly-Scalable and Effective
Metasearch Engine. 10th World Wide Web Conference, May 2001 (to appear).
6. C. Yu, W. Meng, W. Wu, and K. Liu. Efficient and Effective Metasearch for
Text Databases Incorporating Linkages among Documents. ACM SIGMOD Conference,
May 2001 (to appear).
7. K. Liu, C. Yu, W. Meng, W. Wu, and N. Rishe. A Statistical Method for
Estimating the Usefulness of Text Databases. IEEE Transactions on Knowledge and
Data Engineering (to appear).
8. C. Yu, K. Liu, W. Meng, Z. Wu, and N. Rishe. A Methodology to Retrieve Text
Documents from Multiple Databases. IEEE Transactions on Knowledge and Data
Engineering (to appear).
9. C. Yu, P. Sharma, W. Meng, and Y. Qin. Database Selection for Processing k
Nearest Neighbors Queries in Distributed Environments. First ACM/IEEE Joint
Conference on Digital Libraries, Roanoke, VA, June 2001 (to appear).
10. A prototype demo metasearch engine is created
(http://slate.cs.binghamton.edu:8080/CSams/).
Project Impact:
Human Resources: One Ph.D. student and five M.S. students (two female) have
graduated; four Ph.D. students and several M.S. students, including some female
students, are currently working on the project.
Education and Curriculum Development: Course material based on this project
has been used in two graduate courses (CS534: Web Data Management, EECS582:
Information Retrieval).
Infrastructure: Four computers were purchased and installed in the two
investigators' database laboratories.
Tutorials and Seminars: Three tutorials were given at three international conferences (VLDB'99, ACM
DL'99, WISE'00). Twelve seminars were given to companies and universities (NEC,
Lockheed-Martin, SUNY-Binghamton, Hong Kong Univ., Chinese Univ. of Hong Kong,
Hong Kong Univ. of Science and Technology, Database Society of China,
Northwestern Univ., Int’l Conf. on Informational Society (Japan), Univ. of
Tokyo, Univ. of Tsukuba (Japan), Univ. of Minnesota).
Goals, Objectives, and Targeted Activities
This three-year project is in the middle of its second year. We
are on schedule to achieve all the goals planned for the project. In the last
year, we made significant progress in the following areas.
1. Developed highly scalable and effective methods to perform database
selection and collection fusion.
2. Developed a practical database categorization algorithm to help improve the
retrieval of useful documents.
3. Developed a method to incorporate the link-derived importance (PageRank) of
a page into the solution.
The following are the objectives for the final year of the project.
1. Continue to improve the solutions developed in the first two years.
2. Study the feedback process in distributed database environments and compare
the performance of different strategies, including the use of pseudo relevance
feedback.
3. Implement a metasearch engine for over 100 searchable major US universities.
4. Write a book on metasearch engines.
Project References (see http://opal.cs.binghamton.edu/~meng/metasearch.html)
Area Background:
Many text sources are available on the Internet. Each text source
usually has an associated search engine. These widely distributed search
engines are highly heterogeneous. They may employ different techniques to
represent and rank documents, and they usually provide access to different sets
of documents of diverse interest. Frequently, the information a user needs is
spread across the databases of multiple local search engines. It is
inconvenient and inefficient for an ordinary user to query multiple search
engines and identify useful documents in the results each one returns. As the
number of search engines increases, so does the need for automatic search
brokers (metasearch engines) that can invoke multiple search engines on the
user's behalf. Through a search broker, a user needs to submit only a single
query to retrieve the desired documents.
There are a number of challenges in building an efficient and
effective metasearch engine. Among the challenges, the database selection
problem is to identify text sources that are likely to return useful documents
for a given query. The document selection problem is to determine what documents
should be retrieved from each identified source. The result merging problem is
to combine the documents returned from all identified sources into a single
ranked list. This project aims to find good solutions to these problems and to
build a prototype metasearch engine.
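
The three problems above can be sketched as three small functions chained into
one pipeline. This is a minimal illustration under stated assumptions, not the
project's method: local search engines are modeled as plain functions, the
usefulness estimators are hypothetical stand-ins for whatever statistics a real
broker keeps, and result merging uses simple min-max score normalization, which
is only one of many possible strategies.

```python
from typing import Callable

# A local search engine is modeled as: query -> list of (doc_id, local_score).
Engine = Callable[[str], list[tuple[str, float]]]
# A usefulness estimator is modeled as: query -> estimated usefulness score.
Estimator = Callable[[str], float]


def select_databases(query: str, estimators: dict[str, Estimator],
                     k: int) -> list[str]:
    """Database selection: keep the k engines with the highest estimate."""
    ranked = sorted(estimators, key=lambda name: estimators[name](query),
                    reverse=True)
    return ranked[:k]


def select_documents(engine: Engine, query: str,
                     n: int) -> list[tuple[str, float]]:
    """Document selection: keep the n best results by local score."""
    results = sorted(engine(query), key=lambda r: r[1], reverse=True)
    return results[:n]


def merge_results(per_engine: dict[str, list[tuple[str, float]]]
                  ) -> list[tuple[str, str, float]]:
    """Result merging: min-max normalize local scores, then sort globally."""
    merged = []
    for name, results in per_engine.items():
        if not results:
            continue
        scores = [score for _, score in results]
        low, span = min(scores), max(scores) - min(scores)
        for doc_id, score in results:
            # If all scores are equal, treat every document as top-ranked.
            norm = (score - low) / span if span else 1.0
            merged.append((doc_id, name, norm))
    return sorted(merged, key=lambda r: r[2], reverse=True)


def metasearch(query: str, engines: dict[str, Engine],
               estimators: dict[str, Estimator],
               k: int = 2, n: int = 3) -> list[tuple[str, str, float]]:
    """Run selection, retrieval, and merging for a single user query."""
    chosen = select_databases(query, estimators, k)
    per_engine = {name: select_documents(engines[name], query, n)
                  for name in chosen}
    return merge_results(per_engine)
```

Min-max normalization is used here only because local engines may score on
incomparable scales; it ignores differences in collection quality, which is one
reason result merging is treated as a research problem in its own right.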
Area References:
1. J. Callan, Z. Lu, and W. Croft. Searching Distributed Collections with
Inference Networks. ACM SIGIR Conference, 1995, pp.21-28.
2. Digital Library Collaborative Working Groups. Resource Discovery in a
Globally-Distributed Digital Library. Working Group Report, 1999
(http://www.iei.pi.cnr.it/DELOS/NSF/resourcediscovery.htm).
3. D. Dreilinger and A. Howe. Experiences with Selecting Search Engines Using
Metasearch. ACM TOIS, 15(3), July 1997, pp.195-222.
4. Y. Fan and S. Gauch. Adaptive Agents for Information Gathering from
Multiple, Distributed Information Sources. 1999 AAAI Symposium on Intelligent
Agents in Cyberspace, Stanford University, March 1999.
5. J. French, A. Powell, and C. Viles. Evaluating Database Selection
Techniques: A Testbed and Experiment. ACM SIGIR Conference, 1998, pp.121-129.
6. L. Gravano, C. Chang, H. Garcia-Molina, and A. Paepcke. STARTS: Stanford
Proposal for Internet Meta-Searching. ACM SIGMOD, May 1997, pp.207-218.
7. L. Gravano and H. Garcia-Molina. Generalizing GlOSS to Vector-Space
databases and Broker Hierarchies. VLDB Conference, 1995.
8. L. Gravano and H. Garcia-Molina. Merging Ranks from Heterogeneous Internet
Sources. VLDB Conference, 1997.
9. E. Voorhees, N. Gupta, and B. Johnson-Laird. Learning Collection Fusion
Strategy. ACM SIGIR Conference, Seattle, 1995, pp.172-179.
10. J. Xu and J. Callan. Effective Retrieval with Distributed Collections. ACM
SIGIR Conference, Melbourne, Australia, 1998, pp.112-120.