Research on Location-Sensitive Search Engine

(Final report of CS534X project)

 

Ruofei Zhang

Abstract

This paper proposes research on a location-sensitive search engine. In addition to the problem definition and requirement description of the location-sensitive search engine, I focus on analyzing solutions to three problems: the formula for measuring neighborhood distance, the design of a data structure that supports multiple types of neighborhood definition, and effective and efficient approaches for representing the web graph and for calculating the minimum distance between two pages. These problems must be researched and solved in the design phase of the proposed location-sensitive search engine. Some potential solutions are provided, and I expect to evaluate them in future work.

 

1.   Introduction

With the growth of the Internet, Intranets and Virtual Private Networks (VPNs) are emerging in large companies and institutions at a comparable speed. To find relevant information in these Intranets and VPNs, in addition to conventional content similarity, the neighborhood distance from a user-submitted document (URL) should also be considered when deciding the ranking score of a document for a query. How to take the neighborhood distance between a target document and the starting point (provided by the user) into consideration when designing and searching the resource database, as well as when ranking relevant documents, is therefore a new research direction in the field of information retrieval and search engines.

The proposed location-sensitive search engine is, to some degree, an extension of traditional search engines. It not only calculates the content similarity between documents in the resource database and the query but also computes the distance (or the intimacy of neighborhood) between the documents and the user-provided starting point (URL). By adding this distance as a factor in the formula that calculates the similarity score of each document, we obtain a comprehensive ranking of the document collection. This ranking reflects both the relevance and the neighborhood proximity of documents to the user's request. Adding distance information to document ranking helps users find the documents they need conveniently. For instance, a user may only want to browse relevant documents within a certain neighborhood (distance). Location-sensitive searching is especially useful for employees accessing an Intranet, MAN, or VPN who do not want to be flooded by the vast results of the outer cyberspace, but instead only need relevant documents within a customized scope.

 

2. Clarification of Problem Definition

The distance between two documents (URLs) is defined as the minimum number of hyperlinks connecting them. Documents on the Web can be regarded as nodes and hyperlinks as directed edges, so we can construct a directed graph to represent the pool of web pages as well as their relationships. For example, as Figure 1 shows, document D1 has a hyperlink pointing to document D2, and D2 has a hyperlink pointing to D3. D1 may also reach D3 through other intermediate documents, but the length of any such path (the number of hyperlinks constituting it) is greater than 2, so the distance between D1 and D3 is 2. Similarly, the distance between D1 and D2 and between D2 and D3 is 1, respectively.
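As a small illustration of this definition, the following Python sketch computes the hop distance between two pages by breadth-first search over a toy adjacency structure; the graph literal is hypothetical and simply mirrors the Figure 1 example (D1 -> D2 -> D3, plus a longer detour).

from collections import deque

def hop_distance(graph, source, target):
    # Minimum number of hyperlinks from source to target (BFS); None if unreachable.
    if source == target:
        return 0
    dist = {source: 0}
    queue = deque([source])
    while queue:
        page = queue.popleft()
        for neighbor in graph.get(page, []):
            if neighbor not in dist:
                dist[neighbor] = dist[page] + 1
                if neighbor == target:
                    return dist[neighbor]
                queue.append(neighbor)
    return None

# Hypothetical graph mirroring Figure 1: D1 -> D2 -> D3, plus a detour D1 -> D4 -> D5 -> D3.
graph = {"D1": ["D2", "D4"], "D2": ["D3"], "D4": ["D5"], "D5": ["D3"]}
print(hop_distance(graph, "D1", "D3"))   # 2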

 

Figure 1 Illustration of neighborhoods

 

       As I described in my first report, we can conceive four types of location-sensitive search engine based on the distance concept defined above and on the direction of the paths constituted by hyperlinks between documents:

-        The first type considers only paths constituted by forward hyperlinks, i.e., the edges in the graph of WWW documents are unidirectional and formed by hyperlinks. For example, in Figure 1, D3 is at distance 2 from D1, but we do not consider D1 to be at distance 2 from D3 unless there is also a path from D3 to D1 in the graph.

-        The second type considers only paths constituted by backward hyperlinks, i.e., the edges in the graph of WWW documents are also unidirectional but formed by reversed hyperlinks. That means D1 is at distance 2 from D3, but we do not consider D3 to be at distance 2 from D1 if there is no path from D3 to D1.

-        The third type considers both of the above cases together, i.e., the edges between WWW documents are bi-directional. The distance relation between two documents is therefore symmetric: if document i is at distance M from document j, then document j is at distance M from document i as well.

-        Based on the hierarchical structure of domain names, we can construct a fourth type of location-sensitive search engine. In this type, the distance between two documents is the number of hierarchy levels between the domains on which the two documents reside. For instance, we can define the distance between panda.cs.binghamton.edu/index.html and www.cs.binghamton.edu/index.html, as well as the distance between www.cs.binghamton.edu/index.html and www.binghamton.edu/index.html, to be 1 because of their hierarchical relationship. We employ the hierarchical structure of the domain name system in the WWW. Each domain name may have many "child" (sub) domains, so we can also construct a graph to represent such relationships:

 Figure 2 Illustration of hierarchy distance

 

       We can then calculate such hierarchy distance using the same principle as computing the distance between nodes in a graph. This type of distance can also be captured by the following definition:

1)      If the two pages are on the same server, their distance is defined as 0.

2)      If the two pages are not on the same server but in the same organization, their distance is defined as 1.

3)      Otherwise the distance between the two pages is defined as 2.

This definition of distance is only one option; we can come up with other definitions, but the principle is similar. A small code sketch of this particular definition follows.
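As a minimal sketch of the 0/1/2 definition above, the following Python function compares the host names of two URLs; treating the last two labels of the host name (e.g. binghamton.edu) as the "organization" is an assumption made only for illustration.

from urllib.parse import urlparse

def hierarchy_distance(url_a, url_b, org_levels=2):
    # 0 = same server, 1 = same organization, 2 = otherwise (per the definition above).
    # org_levels is an assumed cutoff: the last two host labels stand in for the organization.
    host_a = urlparse(url_a).hostname or url_a.split("/")[0]
    host_b = urlparse(url_b).hostname or url_b.split("/")[0]
    if host_a == host_b:
        return 0
    org_a = ".".join(host_a.split(".")[-org_levels:])
    org_b = ".".join(host_b.split(".")[-org_levels:])
    return 1 if org_a == org_b else 2

print(hierarchy_distance("http://panda.cs.binghamton.edu/index.html",
                         "http://www.cs.binghamton.edu/index.html"))    # 1
print(hierarchy_distance("http://www.cs.binghamton.edu/index.html",
                         "http://www.example.com/index.html"))          # 2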

       Because the first three types of location-sensitive search engine are essentially similar in their distance concept between documents, we consider them together as type I. The fourth type is quite different, so we treat it separately as type II.

 

3. Analysis of Problems

In a location-sensitive search engine, the typical form of a user's query is a list of one or more keywords and a starting URL, from which we calculate the distance to pertinent documents. The proposed query preprocessing involves stop-word removal, word stemming, and phrase identification. The resulting keywords are then looked up in the inverted file index to create a pool of potentially relevant URLs. If phrases are specified in the query, URLs that do not contain the phrases are removed from the pool. The pool is then handed over to the ranking algorithm.
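A minimal sketch of this preprocessing pipeline is shown below; the stop-word list and the suffix-stripping stemmer are toy stand-ins for the real components (e.g. a Porter stemmer), and the inverted index is assumed to map each term to the set of URLs containing it.

STOP_WORDS = {"a", "an", "and", "in", "of", "on", "or", "the", "to"}   # illustrative subset

def naive_stem(word):
    # Placeholder for a real stemmer; strips a few common suffixes.
    for suffix in ("ing", "ed", "es", "s"):
        if word.endswith(suffix) and len(word) > len(suffix) + 2:
            return word[:-len(suffix)]
    return word

def preprocess_query(query):
    # Stop-word removal + stemming; returns the keyword list for the index lookup.
    terms = [t.lower() for t in query.split()]
    return [naive_stem(t) for t in terms if t not in STOP_WORDS]

def candidate_pool(keywords, inverted_index):
    # Union of the posting lists of the query keywords (inverted_index: term -> set of URLs).
    pool = set()
    for kw in keywords:
        pool |= inverted_index.get(kw, set())
    return pool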

 

3.1   Distance Computation and Ranking Algorithm for Type I

In the proposed location-sensitive search engine of type I, we will employ an improved ranking algorithm based on conventional methods. The base algorithm I am going to use is the TF*IDF algorithm based on the vector space model. To take the distance from the starting URL into consideration when computing ranking scores, I will modify the ranking score formula as follows.

First, we calculate the ranking score of documents by the conventional vector space similarity function.

$w_{ij} = tf_{ij} \times idf_j$        (1)

where:

       $w_{ij}$:      the weight of term $t_j$ (the j-th query term) in $D_i$ (the i-th document in the database)

       $tf_{ij}$:     the term frequency of $t_j$ in $D_i$ (the i-th document in the database)

       $idf_j$:    $\log(N / df_j)$, the IDF of $t_j$, where $df_j$ is the number of pool documents containing $t_j$ and N is the number of documents in the potentially relevant document pool

(2) We use the cosine function to calculate the similarity of each document in the pool to the query:

$sim(D_i, q) = \dfrac{\sum_{j=1}^{M} w_{ij}\, w_{qj}}{\sqrt{\sum_{j=1}^{M} w_{ij}^{2}} \cdot \sqrt{\sum_{j=1}^{M} w_{qj}^{2}}}$

where:

       $sim(D_i, q)$:  the content similarity score of $D_i$ with respect to query q

       $w_{qj}$:  the weight of term $t_j$ in the query q, computed as in (1)

       M: the number of terms in the query

       After we obtain $sim(D_i, q)$, we use the breadth-first search (BFS) algorithm to find the minimum path between the starting URL and each document $D_i$ in the potentially relevant pool. Supposing that by the BFS we get $dist_i$ (the distance from the starting URL) for each $D_i$ reachable in the graph, we can then calculate an inverse distance weight $dw_i$ for each $D_i$:

$dw_i = 1 - dist_i / dist_{max}$      if $dist_i \le dist_{max}$

$dw_i = 0$                            if $dist_i > dist_{max}$

where:

       $dist_{max}$:   the maximum distance of documents in the pool to the starting URL.

If we cannot reach $D_i$ in the BFS search, then $dw_i$ will be 0.

With this formula the weight $dw_i$ of $D_i$ decreases as $dist_i$ increases, and $dw_i$ is normalized: if $dist_i = dist_{max}$, then $dw_i = 0$; if $dist_i = 0$ (the minimum possible value of $dist_i$), then $dw_i = 1$.

The final ranking score of $D_i$ is:

$score_i = c_1 \cdot sim(D_i, q) + c_2 \cdot dw_i$

where:

       $c_1$, $c_2$ are two weighting factors; I suppose $c_1 = 0.9$ and $c_2 = 0.1$.

       Because both $sim(D_i, q)$ and $dw_i$ are normalized (and $c_1 + c_2 = 1$), the final ranking score of each potentially relevant document $D_i$ is also normalized, and we can keep its range between 0 and 1.
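The following Python sketch puts the formulas of this section together under simplifying assumptions: documents are given as term lists, df maps each term to its document frequency in the pool, dist_max is positive, and the hop_distance helper from the Section 2 sketch supplies the BFS distance.

import math
from collections import Counter

def tfidf_vector(terms, df, n_docs):
    # w_ij = tf_ij * log(N / df_j); terms missing from df are skipped.
    tf = Counter(terms)
    return {t: tf[t] * math.log(n_docs / df[t]) for t in tf if df.get(t)}

def cosine_sim(vec_a, vec_b):
    dot = sum(w * vec_b[t] for t, w in vec_a.items() if t in vec_b)
    norm = math.sqrt(sum(w * w for w in vec_a.values())) * math.sqrt(sum(w * w for w in vec_b.values()))
    return dot / norm if norm else 0.0

def distance_weight(dist, dist_max):
    # dw_i = 1 - dist_i / dist_max if reachable and within dist_max, else 0 (dist_max > 0 assumed).
    if dist is None or dist > dist_max:
        return 0.0
    return 1.0 - dist / dist_max

def rank(pool, query_terms, df, graph, start_url, dist_max, c1=0.9, c2=0.1):
    # pool: {url: list of terms}; returns (url, score) pairs sorted by the combined score.
    n_docs = len(pool)
    q_vec = tfidf_vector(query_terms, df, n_docs)
    scored = []
    for url, terms in pool.items():
        sim = cosine_sim(tfidf_vector(terms, df, n_docs), q_vec)
        dw = distance_weight(hop_distance(graph, start_url, url), dist_max)
        scored.append((url, c1 * sim + c2 * dw))
    return sorted(scored, key=lambda pair: pair[1], reverse=True)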

 

3.2   Distance Computation and Ranking Algorithm for Type II

In type II, we do not need to store the graph structure or employ the BFS algorithm to compute the hierarchy distance, because we can calculate this distance solely from domain name resolution. By parsing the domain names we obtain the hierarchical relationships among the potentially relevant documents and the distance of each document to the starting URL.

We can use the same formula as for type I to calculate the comprehensive ranking score.

The distance weight $dw_i$ derived from the hierarchy distance is in [0, 1], and the final ranking score of $D_i$ is:

$score_i = c_1 \cdot sim(D_i, q) + c_2 \cdot dw_i$

$score_i$ is also normalized, and we can keep its range between 0 and 1.
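Under the assumption that the 0/1/2 hierarchy distance of Section 2 is normalized by its maximum value 2, the type II distance weight can be plugged into the same ranking function, as in this small sketch (hierarchy_distance is the helper defined earlier):

def hierarchy_weight(url, start_url):
    # Assumed normalization: map the 0/1/2 hierarchy distance into [0, 1].
    return 1.0 - hierarchy_distance(url, start_url) / 2.0

# In the rank() sketch above, dw = hierarchy_weight(url, start_url) would replace
# the BFS-based distance weight for the type II engine.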

 

3.3   Approaches to Support Multiple Types of Neighborhood

We have two choices for implementing the proposed location-sensitive search engine. One is to use data structures designed by ourselves to represent web graphs and to calculate the shortest distance between two pages. The other is to apply a relational database to support the representation of the web's link structure and the calculation of distances. We describe these two approaches in detail in the following sections.

 

3.3.1    Balanced Data Structure Used to Represent the Web Graph

One of our design goals is to support multiple types of neighborhood, so it is very important to design a well-balanced data structure to represent the graph for all neighborhood definitions of the type I search engine.

There are three types of neighborhood to consider for this problem. The first type considers only paths constituted by forward hyperlinks; the second type considers only paths constituted by backward hyperlinks; the third type considers both forward and backward edges, i.e., the edges between WWW documents are bi-directional.

To support all three types of neighborhood at the same time, the data structure representing the graph should efficiently identify all forward and backward links of a document as well as the source and destination nodes of an edge (hyperlink). Because the number of documents changes dynamically, the adjacency matrix is clearly not suitable here. If we use an adjacency list to represent the graph, it is easy to get all forward links of a document, but very time consuming to find all backward links, because we would have to traverse the entire adjacency list. So this data structure is not good enough either. I propose to construct an orthogonal list to store the graph. An orthogonal list can be thought of as the combination of an adjacency list and an inverse adjacency list. There is one record for each directed edge and one record for each node in the orthogonal list.

There are four fields in an edge record: tailvex and headvex are the tail node and head node of the edge, respectively. The hlink field points to the next edge that has the same headvex as this edge, and the tlink field points to the next edge that has the same tailvex. So all edges sharing the same headvex form one linked list, and all edges sharing the same tailvex form another linked list. The heads of these linked lists are the node records of the graph. A node record contains the corresponding document information and two fields: one is firstin, which points to the first edge record pointing to this node; the other is firstout, which points to the first edge record pointing from this node to other nodes. The data structure is described as follows:

TYPE arclink = ^arctype;

     arctype = RECORD
                 tailvex, headvex: vtxptr;
                 hlink, tlink: arclink;
               END;

     vnode = RECORD
               data: vertex;
               firstin, firstout: arclink;
             END;

     ortholist = ARRAY[vtxptr] OF vnode;
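For readers who prefer a running example, the following Python sketch mirrors the Pascal-style declaration above; the field names follow the text, while the integer node indices and the add_arc helper are assumptions of this sketch.

from dataclasses import dataclass
from typing import Optional

@dataclass
class Arc:
    # Edge record: one per hyperlink.
    tailvex: int                     # index of the source (tail) node
    headvex: int                     # index of the destination (head) node
    hlink: Optional["Arc"] = None    # next edge with the same headvex (incoming list)
    tlink: Optional["Arc"] = None    # next edge with the same tailvex (outgoing list)

@dataclass
class VNode:
    # Node record: one per document.
    data: str                        # document information, e.g. the URL
    firstin: Optional[Arc] = None    # head of the incoming-edge list
    firstout: Optional[Arc] = None   # head of the outgoing-edge list

def add_arc(nodes, tail, head):
    # Insert a new edge at the front of the outgoing list of `tail`
    # and the incoming list of `head`.
    arc = Arc(tail, head, hlink=nodes[head].firstin, tlink=nodes[tail].firstout)
    nodes[head].firstin = arc
    nodes[tail].firstout = arc
    return arc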

 

Edge record:   tailvex | headvex | hlink | tlink

Node record:   data | firstin | firstout

 

       For example, the orthogonal list of an example web page graph is illustrated in the figure below.

[Figure: orthogonal list of an example web page graph]
We can construct this orthogonal list in advance, based on the crawled documents, and store it on disk so that it is available when users submit a query. Each time a new collection of documents is crawled, we should reconstruct the orthogonal list to represent the updated graph structure.

With this structure it is efficient to find all forward links and backward links of a document, so we can calculate the distance in affordable time. In addition, the time complexity of constructing an orthogonal list for a graph is the same as that of constructing an adjacency list for that graph. By storing the graph in this data structure we can support all three variants of the type I search engine at the same time, and it is feasible to perform BFS on it to calculate the distance between documents.
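A minimal BFS sketch over this structure is given below; it assumes the Arc/VNode classes from the previous sketch and selects forward links, backward links, or both, so one stored graph can serve all three type I neighborhood definitions.

from collections import deque

def neighbors(nodes, v, direction):
    # Yield neighbor indices of node v following forward links, backward links, or both.
    if direction in ("forward", "both"):
        arc = nodes[v].firstout
        while arc is not None:
            yield arc.headvex
            arc = arc.tlink
    if direction in ("backward", "both"):
        arc = nodes[v].firstin
        while arc is not None:
            yield arc.tailvex
            arc = arc.hlink

def bfs_distance(nodes, source, target, direction="forward"):
    # Minimum number of links from source to target under the chosen neighborhood type.
    dist = {source: 0}
    queue = deque([source])
    while queue:
        v = queue.popleft()
        if v == target:
            return dist[v]
        for u in neighbors(nodes, v, direction):
            if u not in dist:
                dist[u] = dist[v] + 1
                queue.append(u)
    return None   # target not reachable under this neighborhood type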

 

3.3.2    The Approach of Applying Database to Represent Web Graph

Another approach to representing the graph is to use a database table. We can store the graph as a table in a DBMS and use SQL queries to find documents within a certain neighborhood. Here the distance is based on the number of edges (links) on the shortest path between two web pages. We can have a table in the database that contains the link structure of the web pages. This table is created when we crawl and parse the web pages, and it has the following schema:

 

Parent URL | Child URL | Label
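A small sketch of creating and filling such a link table with SQLite is shown below; the table name links and its column names are assumptions chosen to match the schema above.

import sqlite3

conn = sqlite3.connect("webgraph.db")   # hypothetical database file
conn.execute("""
    CREATE TABLE IF NOT EXISTS links (
        parent_url TEXT NOT NULL,   -- Parent URL
        child_url  TEXT NOT NULL,   -- Child URL
        label      TEXT             -- e.g. the anchor text of the hyperlink
    )
""")
conn.execute("CREATE INDEX IF NOT EXISTS idx_links_child  ON links(child_url)")
conn.execute("CREATE INDEX IF NOT EXISTS idx_links_parent ON links(parent_url)")

def add_link(parent, child, label=None):
    # Called while crawling and parsing a page, once per hyperlink found.
    conn.execute("INSERT INTO links VALUES (?, ?, ?)", (parent, child, label))
    conn.commit()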

 

We calculate the distance from a user-provided URL (SU) to a particular URL (TU), where TU is a URL obtained from the traditional search engine by similarity comparison with the query. We can write a program using embedded SQL to perform this computation. The steps for calculating the distance between these two web pages are as follows:

1.      Construct a queue of URLs, L, and initialize it to contain only TU (associated with a number: distance = 0).

2.      Remove the head URL, HU, from the queue L.

3.      Perform a query to get the set of parent URLs of HU; each result URL is associated with a distance value equal to HU.distance + 1.

4.      If the set contains SU, return that URL's distance value.

5.      Otherwise, put the result set into the queue L.

6.      If min(distance value of each URL in the queue) <= distmax (which is specified by the user), go to step 2.

7.      Return distmax.

By applying such an algorithm with database SQL queries, we can calculate the distance between two web pages. In addition, this approach can also handle distance computation over mixed forward and backward links. This kind of neighborhood can be illustrated as follows:

[Figure: a neighborhood defined by mixed forward and backward links]

Based on mixed links, a and b can be at a distance of 2.

 

 The algorithm for calculating the distance within this kind of neighborhood is similar to the algorithm above and is described as follows:

1.      Construct a queue of URLs, L, and initialize it to contain only TU (associated with a number: distance = 0).

2.      Remove the head URL, HU, from the queue L.

3.      Perform a query to get the set of parent URLs of HU; each result URL is associated with a distance value equal to HU.distance + 1.

4.      If the set contains SU, return that URL's distance value.

5.      Otherwise, put the result set into the queue L.

6.      Perform a query to get the set of child URLs of HU; each result URL is associated with a distance value equal to HU.distance + 1.

7.      If the set contains SU, return that URL's distance value.

8.      Otherwise, put the result set into the queue L.

9.      If min(distance value of each URL in the queue) <= distmax (which is specified by the user), go to step 2.

10.  Return distmax.

 

So we can also handle neighborhoods defined by mixed forward and backward links; a sketch of this database-backed distance computation is given below.
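A minimal embedded-SQL sketch of both algorithms, assuming the links table from the earlier sketch: with mixed=False it follows only parent links (the first algorithm), and with mixed=True it also follows child links (the mixed-link case).

from collections import deque

def sql_distance(conn, target_url, start_url, dist_max, mixed=True):
    # BFS from TU toward SU using SQL lookups; returns the hop distance,
    # or dist_max if SU is not found within dist_max hops.
    queue = deque([(target_url, 0)])
    seen = {target_url}
    while queue:
        hu, d = queue.popleft()
        if d >= dist_max:
            continue
        rows = conn.execute(
            "SELECT parent_url FROM links WHERE child_url = ?", (hu,)).fetchall()
        if mixed:
            rows += conn.execute(
                "SELECT child_url FROM links WHERE parent_url = ?", (hu,)).fetchall()
        for (url,) in rows:
            if url == start_url:
                return d + 1
            if url not in seen:
                seen.add(url)
                queue.append((url, d + 1))
    return dist_max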

       For the type II search engine, since we need not store the graph or use BFS to calculate the minimum distance, it is straightforward, to some extent, to obtain the hierarchy distance measurement using the domain name parsing method.

 

4. Ongoing Work

       In the ongoing research we will study efficient data structures for multiple types of neighborhood in greater depth and try to improve our formulas for calculating the distance weight and the final ranking score. At the same time, the design of an optimal disk I/O access structure and the format of the index file are the next topics of our research. The implementation issues, approaches, and algorithms for the database representation of the graph also deserve further discussion.

 
