*(Final report of CS534X project)*


Ruofei Zhang

**Abstract**

This paper presents a proposal for research on a location-sensitive search
engine. In addition to the problem definition and requirements of the
location-sensitive search engine, I focus on analyzing solutions to three
problems: the formula for measuring neighborhood distance, the design of a
data structure that supports multiple types of neighborhood definition, and
effective and efficient approaches for representing the web graph and for
calculating the minimum distance between two pages. These problems must be
researched and solved in the design phase of the proposed location-sensitive
search engine. Some potential solutions to these problems are provided, and I
expect to evaluate them in future work.

**1. Introduction**

With the growth of the Internet, intranets and Virtual Private Networks
(VPNs) are emerging in large companies and institutions at a comparable
speed. To find relevant information in these intranets and VPNs, the
neighborhood distance from a user-submitted document (URL) can be used, in
addition to conventional content similarity, to decide the ranking score of a
document for a query. How to take the neighborhood distance between a target
document and a user-provided starting point into account when designing and
searching the resource database, as well as when ranking relevant documents,
is therefore a new research direction in the field of information retrieval
and search engines.

The proposed location-sensitive search engine is, to some degree, an
extension of traditional search engines. It not only calculates the content
similarity between documents in the resource database and the query but also
computes the distance (the closeness of neighborhood) between the documents
and a user-provided starting point (URL). By adding this distance as a factor
to the formula that calculates the similarity score of each document, we get
a comprehensive ranking of the document collection, one that reflects both
the relevance and the neighborhood proximity of documents to the user's
request. Adding distance information to document ranking helps users find the
documents they need conveniently. For instance, a user may only want to
browse relevant documents within a certain neighborhood (distance).
Location-sensitive searching is especially useful for employees on an
intranet, MAN, or VPN who do not want to be flooded by the vast results of
the outer cyberspace and instead only need relevant documents within a
customized scope.

**2. Clarification of Problem Definition**

The distance between two documents (URLs) is defined as the minimum number of
hyperlinks connecting them. The documents on the Web can be viewed as nodes
and the hyperlinks as directed edges, so we can construct a directed graph to
represent the pool of web pages and their relationships. For example, as
Figure 1 shows, document D1 has a hyperlink pointing to document D2, and D2
has a hyperlink pointing to D3. D1 also has paths to D3 through other
intermediate documents, but the length of each such path (the number of
hyperlinks constituting it) is greater than 2, so the distance between D1 and
D3 is 2. Similarly, the distances between D1 and D2 and between D2 and D3 are
each 1.
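The distance definition above amounts to a breadth-first search over a
directed link graph. The following Python sketch is illustrative only: the
graph and document names (D1, D2, ...) are a toy stand-in for a real crawled
link structure.

```python
from collections import deque

def hyperlink_distance(graph, source, target):
    """Minimum number of hyperlinks from source to target, via BFS.

    graph: dict mapping each document (URL) to the list of documents it
    links to. Returns None when target is unreachable from source.
    """
    if source == target:
        return 0
    dist = {source: 0}
    queue = deque([source])
    while queue:
        doc = queue.popleft()
        for neighbor in graph.get(doc, []):
            if neighbor not in dist:
                dist[neighbor] = dist[doc] + 1
                if neighbor == target:
                    return dist[neighbor]
                queue.append(neighbor)
    return None

# The situation of Figure 1: D1 -> D2 -> D3, plus a longer detour
# D1 -> D4 -> D5 -> D3. BFS finds the shorter path of length 2.
web = {"D1": ["D2", "D4"], "D2": ["D3"], "D4": ["D5"], "D5": ["D3"]}
print(hyperlink_distance(web, "D1", "D3"))  # prints 2
```

Note that with only forward hyperlinks the relation is asymmetric: here D3
has no outgoing links, so the distance from D3 back to D1 is undefined.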

Figure 1: Illustration of neighborhoods

As I described in my first report, we can conceive four types of
location-sensitive search engine based on the distance concept defined above
and the direction of the paths constituted by hyperlinks between documents:

- The first type considers only paths constituted by forward hyperlinks,
i.e. the edges in the graph of WWW documents are unidirectional and formed by
hyperlinks. For example, in Figure 1, D3 is at distance 2 from D1, but we do
not consider D1 to be at distance 2 from D3 unless there is also a path from
D3 to D1 in the graph.

- The second type considers only paths constituted by backward hyperlinks,
i.e. the edges in the graph of WWW documents are again unidirectional but
formed by reversed hyperlinks. That means D1 is at distance 2 from D3, but we
do not consider D3 to be at distance 2 from D1 if there is no path from D3 to
D1.

- The third type considers both of the above cases together, i.e. the edges
between WWW documents are bi-directional, so the distance relation between
two documents is symmetric: if document i is at distance M from document j,
then document j is at distance M from document i as well.

- Based on the hierarchical structure of domain names, we can construct a
fourth type of location-sensitive search engine. In this type, the distance
between two documents is derived from the hierarchy levels of the domains on
which they reside. For instance, we can define the distance between
panda.cs.binghamton.edu/index.html and www.cs.binghamton.edu/index.html, as
well as the distance between www.cs.binghamton.edu/index.html and
www.binghamton.edu/index.html, to be 1 because of their hierarchical
relationship. We employ the hierarchical structure of the domain name system
in the WWW: each domain may have many "child" (sub-)domains, so we can also
construct a graph to represent this relationship:

Figure 2: Illustration of hierarchy distance

We can then calculate this hierarchy distance on the same principle as
computing the distance between nodes in a graph. One concrete definition of
this type of distance is the following:

1) If the two pages are on the same server, their distance is defined as 0.

2) If the two pages are not on the same server but belong to the same
organization, their distance is defined as 1.

3) Otherwise, the distance between the two pages is defined as 2.

This definition of distance is only one option; we can come up with other
definitions, but the principle is similar.
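As an illustration of the 0/1/2 definition above, here is a minimal Python
sketch. The notion of "same organization" is approximated as sharing the last
two host-name labels, which is a deliberate simplification (it would
mishandle suffixes such as .co.uk); a real system would need a proper
registrable-domain list.

```python
from urllib.parse import urlparse

def hierarchy_distance(url_a, url_b):
    """Type II distance under the 0/1/2 definition.

    Organization is approximated as the last two domain labels
    (e.g. binghamton.edu) -- an assumption of this sketch.
    """
    host_a = urlparse(url_a).netloc.lower()
    host_b = urlparse(url_b).netloc.lower()
    if host_a == host_b:
        return 0                       # same server
    org_a = ".".join(host_a.split(".")[-2:])
    org_b = ".".join(host_b.split(".")[-2:])
    if org_a == org_b:
        return 1                       # same organization, different servers
    return 2                           # different organizations

# The examples from the text:
print(hierarchy_distance("http://www.cs.binghamton.edu/index.html",
                         "http://www.binghamton.edu/index.html"))   # prints 1
```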

Because the first three types of location-sensitive search engine share
essentially the same distance concept between documents, we group them
together as type I. The fourth type is quite different, so we treat it
separately as type II.

**3. Analysis of Problems**

In a location-sensitive search engine, the typical form of a user's query is
a list of one or more keywords plus a starting URL, from which we calculate
the distance of pertinent documents. The proposed query preprocessing
involves stop-word removal, word stemming, and phrase identification. The
resulting keywords are then looked up in the inverted file index to create a
pool of potentially relevant URLs. If phrases are specified in the query, the
URLs that do not contain those phrases are removed from the pool. The pool is
then handed over to the ranking algorithm.
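The preprocessing-and-lookup step described above can be sketched as follows.
The stop-word list, the crude suffix stemmer, and the tiny inverted index are
all illustrative stand-ins for the real components (phrase handling is
omitted).

```python
# Illustrative stand-ins: a real engine would use a full stop-word list
# and a proper stemmer (e.g. Porter's algorithm).
STOP_WORDS = {"the", "a", "of", "in", "on", "for"}

def stem(word):
    """Very crude suffix stripping, for illustration only."""
    for suffix in ("ing", "ed", "s"):
        if word.endswith(suffix) and len(word) > len(suffix) + 2:
            return word[: -len(suffix)]
    return word

def candidate_pool(query, inverted_index):
    """Return the set of URLs containing at least one query keyword."""
    keywords = [stem(w) for w in query.lower().split()
                if w not in STOP_WORDS]
    pool = set()
    for kw in keywords:
        pool |= inverted_index.get(kw, set())
    return pool

# Toy inverted file: term -> set of URLs containing it.
index = {
    "search": {"u1", "u2"},
    "engine": {"u2", "u3"},
}
print(candidate_pool("the search engines", index))
```

The resulting pool is then passed to the ranking algorithm described in the
following subsections.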

**3.1 Distance Computation and Ranking Algorithm for Type I**

In the proposed type I location-sensitive search engine, we will employ a
ranking algorithm improved over conventional methods. The base algorithm is
the TF*IDF algorithm in the vector space model. To take the distance from the
starting URL into account when computing ranking scores, I modify the ranking
score formula as follows.

First we calculate the ranking score of documents by the conventional vector
space similarity function.

(1) The weight of each query term in each document is:

$$w_{ij} = tf_{ij} \times idf_j$$

where:

$w_{ij}$: the weight of term $t_j$ (the j-th query term) in $D_i$ (the i-th
document in the database)

$tf_{ij}$: the term frequency of $t_j$ in $D_i$ (the i-th document in the
database)

$idf_j = \log(N/df_j)$: the IDF of $t_j$, where $df_j$ is the number of pool
documents containing $t_j$ and $N$ is the number of documents in the
potentially relevant document pool

(2) We use the cosine function to calculate the similarity of each document
in the pool to the query:

$$Sim_i(q) = \frac{\sum_{j=1}^{M} w_{qj}\, w_{ij}}
{\sqrt{\sum_{j=1}^{M} w_{qj}^2}\,\sqrt{\sum_{j=1}^{M} w_{ij}^2}}$$

where:

$Sim_i(q)$: the content similarity score of $D_i$ with respect to query $q$

$w_{qj}$: the weight of term $t_j$ in the query

$M$: the number of terms in the query

After we obtain $Sim_i(q)$, we use the breadth-first search (BFS) algorithm
to find the minimum-length path between the starting URL and each document
$D_i$ in the potentially relevant pool. Supposing the BFS yields $d_i$ (the
distance from the starting URL) for each $D_i$ reachable in the graph, we
then calculate an inverse distance variable $s_i$ for each $D_i$:

$$s_i = 1 - \frac{d_i}{d_{max}} \quad \text{if } d_i \le d_{max}$$

$$s_i = 0 \quad \text{if } d_i > d_{max}$$

where:

$d_{max}$: the maximum distance from documents in the pool to the starting
URL.

If $D_i$ cannot be reached by the BFS, then $s_i$ is 0.

By this formula the weight $s_i$ of $D_i$ decreases as $d_i$ increases, and
$s_i$ is normalized: if $d_i = d_{max}$, then $s_i = 0$; if $d_i = 0$ (the
minimum possible value of $d_i$), then $s_i = 1$.

The final ranking score of $D_i$ is:

$$R_i = c_1 \cdot Sim_i(q) + c_2 \cdot s_i$$

where:

$c_1$, $c_2$ are two weighting factors; I suppose $c_1 = 0.9$ and
$c_2 = 0.1$.

Because both $Sim_i(q)$ and $s_i$ are normalized, the final ranking score of
each potentially relevant document $D_i$ is normalized as well, and its range
is between 0 and 1.
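Assuming the linear inverse-distance form implied above (weight 1 at distance
0, weight 0 at the maximum distance or when the document is unreachable), the
score combination can be sketched as follows; the similarity values and
distances in the example are made up for illustration.

```python
def inverse_distance(d, d_max):
    """s = 1 - d/d_max when reachable and within d_max, else 0.

    d is None when BFS could not reach the document at all.
    """
    if d is None or d > d_max:
        return 0.0
    return 1.0 - d / d_max

def final_score(sim, d, d_max, c1=0.9, c2=0.1):
    """R = c1*Sim + c2*s; stays in [0, 1] when sim and s are normalized."""
    return c1 * sim + c2 * inverse_distance(d, d_max)

# Two documents with equal content similarity: the closer one ranks higher.
print(final_score(0.8, 1, 4))   # close to the starting URL, about 0.795
print(final_score(0.8, 3, 4))   # farther away, about 0.745
```

With c1 = 0.9 and c2 = 0.1, content similarity dominates and the distance
term acts as a tie-breaker, which matches the intent of the formula.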

**3.2 Distance Computation and Ranking Algorithm for Type II**

In type II, we do not need to store the graph structure or employ the BFS
algorithm, because the hierarchy distance can be calculated solely from
domain name resolution. By parsing the domain names we obtain the
hierarchical relationships among the potentially relevant documents and hence
the distance of each document to the starting URL.

We can reuse the formulas from type I to calculate the comprehensive ranking
score. Let $h_i$ be the hierarchy distance of $D_i$ from the starting URL and
$h_{max}$ the maximum hierarchy distance in the pool; then

$$s_i = 1 - \frac{h_i}{h_{max}}$$

$s_i$ is in $[0, 1]$, and the final ranking score of $D_i$ is:

$$R_i = c_1 \cdot Sim_i(q) + c_2 \cdot s_i$$

$R_i$ is also normalized, and its range is between 0 and 1.

**3.3 Approaches to Support Multiple Types of Neighborhood**

We have two choices for implementing the proposed location-sensitive search
engine. One is to use data structures designed by ourselves to represent the
web graph and to calculate the shortest distance between two pages. The other
is to apply a relational database to represent the link structure of the web
and to calculate distances. We describe these two approaches in detail in the
following sections.

**3.3.1 Balanced Data Structure Used to Represent the Web Graph**

One of our design goals is to support multiple types of neighborhood, so it
is very important to design a well-balanced data structure that can represent
the graph for all the neighborhood definitions of the type I search engine.

There are three types of neighborhood to consider for this problem. The first
considers only paths constituted by forward hyperlinks; the second considers
only paths constituted by backward hyperlinks; the third considers both
forward and backward edges, i.e. the edges between WWW documents are treated
as bi-directional.

To support all three types of neighborhood at the same time, the data
structure representing the graph should efficiently identify all forward and
backward links of a document, as well as the source and destination nodes of
an edge (hyperlink). Because the number of documents changes dynamically, an
adjacency matrix is clearly unsuitable here. An adjacency list makes it easy
to get all forward links of a document, but finding all backward links is
very time-consuming because the entire adjacency list must be traversed, so
this data structure is not good enough either. I propose to construct an
orthogonal list to store the graph. An orthogonal list can be thought of as
the combination of an adjacency list and an inverse adjacency list: it
contains one record for each directed edge and one record for each node.

There are four fields in each edge record: __tailvex__ and __headvex__ are
the tail (source) node and head (destination) node of the edge, respectively.
The __hlink__ field points to the next edge that has the same __headvex__ as
this edge, and the __tlink__ field points to the next edge that has the same
__tailvex__. Thus all edges sharing a __headvex__ form one linked list, and
all edges sharing a __tailvex__ form another; the heads of these linked lists
are the node records of the graph. Each node record contains the
corresponding document information plus two fields: __firstin__, which points
to the first edge record pointing to this node, and __firstout__, which
points to the first edge record leaving this node. The data structure is
declared as follows:

    TYPE
      arclink = ^arctype;
      arctype = RECORD
        tailvex, headvex: vtxptr;
        hlink, tlink: arclink;
      END;
      vnode = RECORD
        data: vertex;
        firstin, firstout: arclink;
      END;
      ortholist = ARRAY[vtxptr] OF vnode;

Edge record:

| tailvex | headvex | hlink | tlink |
|---------|---------|-------|-------|

Node record:

| data | firstin | firstout |
|------|---------|----------|

For
example, the orthogonal list of the web page graph below is shown as follows:

We can construct this orthogonal list in advance from the crawled documents
and store it on disk, so that it is ready when users submit a query. Each
time a new collection of documents is crawled, we reconstruct the orthogonal
list to represent the updated graph structure.

With this structure it is efficient to find all forward links and backward
links of a document, so we can calculate distances at affordable cost. In
addition, the time complexity of constructing an orthogonal list for a graph
is the same as that of constructing an adjacency list for it. By storing the
graph in this data structure we can support all three variants of the type I
search engine at the same time, and it is feasible to perform BFS on it to
calculate the distances among documents.
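The orthogonal list described above can be sketched in Python. The field
names follow the record declaration, and the three-node graph at the end is
illustrative; the point is that both forward and backward links of a node are
reachable without scanning all edges.

```python
class Edge:
    """One record per directed edge (hyperlink)."""
    def __init__(self, tailvex, headvex):
        self.tailvex = tailvex    # source node index
        self.headvex = headvex    # destination node index
        self.hlink = None         # next edge sharing the same headvex
        self.tlink = None         # next edge sharing the same tailvex

class Node:
    """One record per document; heads of the two edge lists."""
    def __init__(self, data):
        self.data = data
        self.firstin = None       # first edge pointing to this node
        self.firstout = None      # first edge leaving this node

def add_edge(nodes, tail, head):
    e = Edge(tail, head)
    e.tlink = nodes[tail].firstout
    nodes[tail].firstout = e      # prepend to the tail's out-list
    e.hlink = nodes[head].firstin
    nodes[head].firstin = e       # prepend to the head's in-list

def forward_links(nodes, v):
    e, out = nodes[v].firstout, []
    while e:
        out.append(e.headvex)
        e = e.tlink
    return out

def backward_links(nodes, v):
    e, out = nodes[v].firstin, []
    while e:
        out.append(e.tailvex)
        e = e.hlink
    return out

# Toy graph: D0 -> D1, D1 -> D2, D0 -> D2.
nodes = [Node(f"D{i}") for i in range(3)]
add_edge(nodes, 0, 1)
add_edge(nodes, 1, 2)
add_edge(nodes, 0, 2)
```

A BFS for any of the three type I neighborhood definitions can then expand a
node through `forward_links`, `backward_links`, or both.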

**3.3.2 The Approach of Applying a Database to Represent the Web Graph**

Another approach to representing the graph is to use a database table. We can
store the graph as a table in a DBMS and use SQL queries to find documents in
a given neighborhood. Here the distance is again the number of edges (links)
on the shortest path between two web pages. We keep a table in the database
containing the link structure of the web pages; this table is created when we
crawl and parse the pages. The table has the following schema:

| Parent URL | Child URL | Label |
|------------|-----------|-------|

Suppose we want to calculate the distance from a user-provided URL SU to a
particular URL TU (TU being a URL returned by the traditional search engine
based on similarity to the query). We can write a program using embedded SQL
to carry out this computation. The steps for calculating the distance between
these two web pages are as follows:

1. Construct a queue of URLs, L, and initialize it to contain only TU
(associated with a number: distance = 0).
2. Take the head URL, HU, out of the queue L.
3. Perform a query to get the set of parent URLs of HU; each result URL is
associated with a distance value equal to HU.distance + 1.
4. If the set contains SU, return that URL's distance value.
5. Else put the result set into the queue L.
6. If Min(distance value of each URL in the queue) <= distmax (which is
specified by the user), go to step 2.
7. Return distmax.

By applying this algorithm through database SQL queries, we achieve the
objective of calculating the distance between two web pages. In addition,
this approach can also handle distance computation over mixed forward and
backward links. In such a neighborhood, based on mixed links, pages a and b
can be at a distance of 2 (for example, when both link to a common page).

The algorithm to calculate distances in this kind of neighborhood is similar
to the one above and is described as follows:

1. Construct a queue of URLs, L, and initialize it to contain only TU
(associated with a number: distance = 0).
2. Take the head URL, HU, out of the queue L.
3. Perform a query to get the set of parent URLs of HU; each result URL is
associated with a distance value equal to HU.distance + 1.
4. If the set contains SU, return that URL's distance value.
5. Else put the result set into the queue L.
6. Perform a query to get the set of child URLs of HU; each result URL is
associated with a distance value equal to HU.distance + 1.
7. If the set contains SU, return that URL's distance value.
8. Else put the result set into the queue L.
9. If Min(distance value of each URL in the queue) <= distmax (which is
specified by the user), go to step 2.
10. Return distmax.
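Both step lists above can be sketched with embedded SQL, here using SQLite
for self-containment. A `mixed` flag switches between the parent-only and the
mixed parent/child expansion; the table and column names follow the schema
given earlier, while the visited-set and the per-hop distance cutoff are
implementation details of this sketch.

```python
import sqlite3
from collections import deque

def sql_distance(conn, tu, su, distmax, mixed=False):
    """BFS from TU toward SU over the links table, one SQL query per hop."""
    cur = conn.cursor()
    queue = deque([(tu, 0)])
    seen = {tu}
    while queue:
        hu, d = queue.popleft()
        if d >= distmax:
            continue                  # stop expanding beyond distmax hops
        rows = cur.execute(
            "SELECT parent_url FROM links WHERE child_url = ?", (hu,)
        ).fetchall()
        if mixed:                     # also expand child links (steps 6-8)
            rows += cur.execute(
                "SELECT child_url FROM links WHERE parent_url = ?", (hu,)
            ).fetchall()
        for (url,) in rows:
            if url == su:
                return d + 1
            if url not in seen:
                seen.add(url)
                queue.append((url, d + 1))
    return distmax                    # SU not found within the bound

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE links (parent_url TEXT, child_url TEXT, label TEXT)")
conn.executemany(
    "INSERT INTO links VALUES (?, ?, '')",
    [("a", "c"), ("b", "c")],         # a -> c <- b: a and b meet at c
)
print(sql_distance(conn, "c", "a", 5))              # prints 1
print(sql_distance(conn, "b", "a", 5, mixed=True))  # prints 2
```

With parent links only, a and b are unreachable from each other; with mixed
links their distance becomes 2, as in the example in the text.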

In this way we can also handle neighborhoods defined by mixed forward and
backward links.

For the type II search engine, since we need neither to store the graph nor
to run BFS to compute minimum distances, it is comparatively straightforward
to obtain the hierarchy distance measurement by parsing domain names.


**4. Ongoing Work**

In the ongoing research we will develop the efficient data structures for
multiple types of neighborhood in more depth and try to improve our formulas
for the distance weight and the final ranking score. At the same time, the
design of an optimal disk I/O access structure and the format of the index
file are the next topics of our research. The implementation issues,
approaches, and algorithms for the database representation of the graph also
deserve more discussion.
