
Online edition (c)2009 Cambridge UP
20.4 Connectivity servers
Row 1: 1, 2, 4, 8, 16, 32, 64
Row 2: 1, 4, 9, 16, 25, 36, 49, 64
Row 3: 1, 2, 3, 5, 8, 13, 21, 34, 55, 89, 144
Row 4: 1, 4, 8, 16, 25, 36, 49, 64

Figure 20.6  A four-row segment of the table of links.
as its copyright notice, terms of use, and so on). In this case, the rows corresponding to pages in a website will have many table entries in common. Moreover, under the lexicographic ordering of URLs, it is very likely that the pages from a website appear as contiguous rows in the table.
We adopt the following strategy: we walk down the table, encoding each table row in terms of the seven preceding rows. In the example of Figure 20.6, we could encode the fourth row as "the same as the row at offset 2 (meaning, two rows earlier in the table), with 9 replaced by 8". This requires the specification of the offset, the integer(s) dropped (in this case 9) and the integer(s) added (in this case 8). The use of only the seven preceding rows has two advantages: (i) the offset can be expressed with only 3 bits; this choice is optimized empirically (the reason for seven and not eight preceding rows is the subject of Exercise 20.4) and (ii) fixing the maximum offset to a small value like seven avoids having to perform an expensive search among many candidate prototypes in terms of which to express the current row.
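A small sketch may make the bookkeeping concrete. The function below is a hypothetical helper (not the Connectivity Server's actual code) that, given a sorted row and a list of preceding rows, picks the cheapest of the last seven rows as a prototype and returns the offset together with the integers dropped and added; offset 0 stands for "no prototype".

```python
def encode_row(row, preceding, max_offset=7):
    """Encode a sorted adjacency row against up to max_offset
    preceding rows. Returns (offset, dropped, added), where
    offset 0 means the row is listed from the empty set."""
    cur = set(row)
    best = None
    for offset in range(1, min(max_offset, len(preceding)) + 1):
        proto = set(preceding[-offset])
        dropped = sorted(proto - cur)   # integers to remove from the prototype
        added = sorted(cur - proto)     # integers to add to it
        cost = len(dropped) + len(added)
        if best is None or cost < best[3]:
            best = (offset, dropped, added, cost)
    # Fall back to "no prototype" when no preceding row helps.
    if best is None or best[3] >= len(row):
        return (0, [], sorted(row))
    offset, dropped, added, _ = best
    return (offset, dropped, added)
```

Run on the rows of Figure 20.6, the fourth row indeed comes out as "offset 2, drop 9, add 8".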
What if none of the preceding seven rows is a good prototype for expressing the current row? This would happen, for instance, at each boundary between different websites as we walk down the rows of the table. In this case we simply express the row as starting from the empty set and "adding in" each integer in that row. By using gap encodings to store the gaps (rather than the actual integers) in each row, and encoding these gaps tightly based on the distribution of their values, we obtain further space reduction. In experiments mentioned in Section 20.5, the series of techniques outlined here appears to use as few as 3 bits per link, on average – a dramatic reduction from the 64 required in the naive representation.
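As a sketch, the gap encoding of a single sorted row might look as follows (hypothetical helpers; a real implementation would then pack the gaps with a variable-length code rather than keeping them as Python integers):

```python
def to_gaps(row):
    """Replace a sorted row of integers by its first element
    followed by the successive differences (gaps)."""
    return [row[0]] + [b - a for a, b in zip(row, row[1:])]

def from_gaps(gaps):
    """Invert to_gaps by accumulating a running sum."""
    out, total = [], 0
    for g in gaps:
        total += g
        out.append(total)
    return out
```

For the second row of Figure 20.6 this yields the gaps 1, 3, 5, 7, 9, 11, 13, 15, all of which fit comfortably in a short variable-length code.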
While these ideas give us a representation of sizable web graphs that comfortably fits in memory, we still need to support connectivity queries. What is entailed in retrieving from this representation the set of links from a page? First, we need an index lookup from (a hash of) the URL to its row number in the table. Next, we need to reconstruct these entries, which may be encoded in terms of entries in other rows. This entails following the offsets to reconstruct these other rows – a process that in principle could lead through many levels of indirection. In practice, however, this does not happen very
often. A heuristic for controlling this can be introduced into the construc-