
Online edition (c)2009 Cambridge UP
11.3 The Binary Independence Model 223
Model is exactly the same as the multivariate Bernoulli Naive Bayes model presented in Section 13.3 (page 263). In a sense this assumption is equivalent to an assumption of the vector space model, where each term is a dimension that is orthogonal to all other terms.
We will first present a model which assumes that the user has a single-step information need. As discussed in Chapter 9, seeing a range of results might let the user refine their information need. Fortunately, as mentioned there, it is straightforward to extend the Binary Independence Model so as to provide a framework for relevance feedback, and we present this model in Section 11.3.4.
To make a probabilistic retrieval strategy precise, we need to estimate how terms in documents contribute to relevance. Specifically, we wish to know how term frequency, document frequency, document length, and other statistics that we can compute influence judgments about document relevance, and how they can be reasonably combined to estimate the probability of document relevance. We then order documents by decreasing estimated probability of relevance.
We assume here that the relevance of each document is independent of the relevance of other documents. As we noted in Section 8.5.1 (page 166), this is incorrect: the assumption is especially harmful in practice if it allows a system to return duplicate or near-duplicate documents. Under the BIM, we model the probability P(R|d, q) that a document is relevant via the probability P(R|~x,~q) in terms of term incidence vectors. Then, using Bayes' rule, we have:
\[
P(R=1\mid\vec{x},\vec{q}) \;=\; \frac{P(\vec{x}\mid R=1,\vec{q})\,P(R=1\mid\vec{q})}{P(\vec{x}\mid\vec{q})} \tag{11.8}
\]
\[
P(R=0\mid\vec{x},\vec{q}) \;=\; \frac{P(\vec{x}\mid R=0,\vec{q})\,P(R=0\mid\vec{q})}{P(\vec{x}\mid\vec{q})}
\]
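As a minimal sketch (not from the text), Bayes' rule in (11.8) can be applied directly to empirical counts over a toy "space of possible documents". Each document below is a binary term incidence vector paired with a relevance judgment for one fixed query; all of the data and the function name are hypothetical illustrations.

```python
# Toy collection: (term incidence vector x, relevance judgment R)
# for a single fixed query. All figures here are made up.
documents = [
    ((1, 1), 1), ((1, 0), 1), ((1, 1), 1),   # relevant (R = 1)
    ((0, 1), 0), ((0, 0), 0), ((1, 0), 0),   # nonrelevant (R = 0)
]

def posterior_relevance(x, docs):
    """Compute P(R=1 | x, q) from empirical counts via Bayes' rule."""
    n = len(docs)
    n_rel = sum(1 for _, r in docs if r == 1)
    p_r1 = n_rel / n                         # prior P(R=1|q)
    p_r0 = 1 - p_r1                          # prior P(R=0|q)
    # Likelihoods P(x|R=1,q) and P(x|R=0,q), estimated by counting.
    p_x_r1 = sum(1 for v, r in docs if r == 1 and v == x) / n_rel
    p_x_r0 = sum(1 for v, r in docs if r == 0 and v == x) / (n - n_rel)
    p_x = p_x_r1 * p_r1 + p_x_r0 * p_r0      # evidence P(x|q)
    return p_x_r1 * p_r1 / p_x

print(posterior_relevance((1, 1), documents))  # seen only among relevant docs
print(posterior_relevance((1, 0), documents))  # equally common in both classes
```

Note that P(R=0|~x,~q) is simply one minus this value, consistent with (11.9) below.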
Here, P(~x|R = 1,~q) and P(~x|R = 0,~q) are the probabilities that, if a relevant or nonrelevant document (respectively) is retrieved, then that document's representation is ~x. You should think of these quantities as defined with respect to a space of possible documents in a domain. How do we compute all these probabilities? We never know the exact probabilities, so we have to use estimates: statistics about the actual document collection are used to estimate these probabilities. P(R = 1|~q) and P(R = 0|~q) indicate the prior probability of retrieving a relevant or nonrelevant document, respectively, for a query ~q. Again, if we knew the percentage of relevant documents in the collection, then we could use this number to estimate P(R = 1|~q) and P(R = 0|~q). Since a document is either relevant or nonrelevant to a query, we must have that:
\[
P(R=1\mid\vec{x},\vec{q}) + P(R=0\mid\vec{x},\vec{q}) = 1 \tag{11.9}
\]
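As a small worked illustration of these priors (the numbers are hypothetical, not from the text): if we knew that 100 of the 10,000 documents in a collection were relevant to ~q, we could estimate

\[
P(R=1\mid\vec{q}) \approx \frac{100}{10{,}000} = 0.01, \qquad
P(R=0\mid\vec{q}) = 1 - P(R=1\mid\vec{q}) = 0.99 ,
\]

and for any ~x the two posteriors computed from (11.8) then sum to 1, as (11.9) requires.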