13.7 References and further reading
General introductions to statistical classification and machine learning can be
found in (Hastie et al. 2001), (Mitchell 1997), and (Duda et al. 2000), including
many important methods (e.g., decision trees and boosting) that we do not
cover. A comprehensive review of text classification methods and results is
(Sebastiani 2002). Manning and Schütze (1999, Chapter 16) give an accessible
introduction to text classification with coverage of decision trees, perceptrons
and maximum entropy models. More information on the superlinear time
complexity of learning methods that are more accurate than Naive Bayes can
be found in (Perkins et al. 2003) and (Joachims 2006a).
Maron and Kuhns (1960) described one of the first NB text classifiers. Lewis
(1998) focuses on the history of NB classification. Bernoulli and multinomial
models and their accuracy for different collections are discussed by McCal-
lum and Nigam (1998). Eyheramendy et al. (2003) present additional NB
models. Domingos and Pazzani (1997), Friedman (1997), and Hand and Yu
(2001) analyze why NB performs well although its probability estimates are
poor. The first paper also discusses NB’s optimality when the independence
assumptions are true of the data. Pavlov et al. (2004) propose a modified
document representation that partially addresses the inappropriateness of
the independence assumptions. Bennett (2000) attributes the tendency of NB
probability estimates to be close to either 0 or 1 to the effect of document
length. Ng and Jordan (2001) show that NB is sometimes (although rarely)
superior to discriminative methods because it more quickly reaches its opti-
mal error rate. The basic NB model presented in this chapter can be tuned for
better effectiveness (Rennie et al. 2003; Kołcz and Yih 2007). The problem of
concept drift and other reasons why state-of-the-art classifiers do not always
excel in practice are discussed by Forman (2006) and Hand (2006).
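To make the contrast between the two event models concrete, here is a minimal Python sketch of training and scoring under the chapter's multinomial and Bernoulli NB models with add-one smoothing. The function names and the list-of-tokens document representation are illustrative assumptions, not code from the cited papers.

```python
# Minimal sketch (illustrative, with assumed data layout): the chapter's two
# NB event models with add-one smoothing. Documents are lists of tokens.
from collections import Counter
import math

def train(docs_by_class, vocab):
    """Estimate priors and smoothed conditionals P(t|c) for both models."""
    n_docs = sum(len(docs) for docs in docs_by_class.values())
    prior, p_multi, p_bern = {}, {}, {}
    for c, docs in docs_by_class.items():
        prior[c] = len(docs) / n_docs
        tf = Counter(t for d in docs for t in d)       # term frequencies
        df = Counter(t for d in docs for t in set(d))  # document frequencies
        total = sum(tf.values())
        p_multi[c] = {t: (tf[t] + 1) / (total + len(vocab)) for t in vocab}
        p_bern[c] = {t: (df[t] + 1) / (len(docs) + 2) for t in vocab}
    return prior, p_multi, p_bern

def score_multinomial(doc, prior, p_multi, vocab):
    """Multinomial model: every token occurrence contributes evidence."""
    return {c: math.log(prior[c]) +
               sum(math.log(p_multi[c][t]) for t in doc if t in vocab)
            for c in prior}

def score_bernoulli(doc, prior, p_bern, vocab):
    """Bernoulli model: absent vocabulary terms also contribute evidence."""
    present = set(doc)
    return {c: math.log(prior[c]) +
               sum(math.log(p_bern[c][t]) if t in present
                   else math.log(1 - p_bern[c][t]) for t in vocab)
            for c in prior}
```

Note the key modeling difference the sketch exposes: the Bernoulli scorer iterates over the entire vocabulary, so terms that do not occur in a document also contribute evidence, whereas the multinomial scorer only accumulates evidence from tokens that occur.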
Early uses of mutual information and χ² for feature selection in text classification are Lewis and Ringuette (1994) and Schütze et al. (1995), respectively. Yang and Pedersen (1997) review feature selection methods and their impact on classification effectiveness. They find that pointwise mutual information is not competitive with other methods. Yang and Pedersen refer to expected mutual information (Equation (13.16)) as information gain (see Exercise 13.13, page 285). (Snedecor and Cochran 1989) is a good reference for the χ² test in statistics, including the Yates' correction for continuity for 2 × 2 tables. Dunning (1993) discusses problems of the χ² test when counts are
small. Nongreedy feature selection techniques are described by Hastie et al.
(2001). Cohen (1995) discusses the pitfalls of using multiple significance tests
and methods to avoid them. Forman (2004) evaluates different methods for
feature selection for multiple classifiers.
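As a concrete illustration of the statistic discussed above, here is a small Python sketch of the χ² test with Yates' correction for continuity on a 2 × 2 term-class contingency table. The cell naming follows the convention that the first index refers to term occurrence and the second to class membership, but the helper chi2_yates and the example counts are assumptions for illustration, not code from the cited references.

```python
# Illustrative sketch: chi-square with Yates' continuity correction for a
# 2x2 term-class table. Assumed cell layout:
#   n11: term present, in class;   n10: term present, not in class;
#   n01: term absent, in class;    n00: term absent, not in class.

def chi2_yates(n11, n10, n01, n00):
    n = n11 + n10 + n01 + n00
    cells = ((n11, n11 + n10, n11 + n01), (n10, n11 + n10, n10 + n00),
             (n01, n01 + n00, n11 + n01), (n00, n01 + n00, n10 + n00))
    stat = 0.0
    for observed, row, col in cells:
        expected = row * col / n  # expected count under independence
        stat += (abs(observed - expected) - 0.5) ** 2 / expected
    return stat

# Made-up example counts; compare against chi-square critical values
# (e.g., 3.84 at the 0.05 level for one degree of freedom).
print(chi2_yates(30, 10, 20, 140))
```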
David D. Lewis defines the ModApte split at www.daviddlewis.com/resources/testcollections/reuters21578/readme
based on Apté et al. (1994). Lewis (1995) describes utility measures for theUTILITY MEASURE