Online edition (c)2009 Cambridge UP
4.7 References and further reading 83
Exercise 4.10
For optimal load balancing, the inverters in MapReduce must get segmented postings
files of similar sizes. For a new collection, the distribution of key-value pairs may not
be known in advance. How would you solve this problem?
Exercise 4.11
Apply MapReduce to the problem of counting how often each term occurs in a set of
files. Specify map and reduce operations for this task. Write down an example along
the lines of Figure
4.6.
Exercise 4.12
We claimed (on page 80) that an auxiliary index can impair the quality of collec-
tion statistics. An example is the term weighting method idf, which is defined as
log(N/df
i
) where N is the total number of documents and df
i
is the number of docu-
ments that term i occurs in (Section
6.2.1, page 117). Show that even a small auxiliary
index can cause significant error in idf when it is computed on the main index only.
Consider a rare term that suddenly occurs frequently (e.g., Flossie as in Tropical Storm
Flossie).
4.7 Refe rences and further reading
Witten et al. (1999, Chapter 5) present an extensive treatment of the subject of
index construction and additional indexing algorithms with different trade-
offs of memory, disk space, and time. In general, blocked sort-based indexing
does well on all three counts. However, if conserving memory or disk space
is the main criterion, then other algorithms may be a better choice. See Wit-
ten et al. (1999), Tables 5.4 and 5.5; BSBI is closest to “sort-based multiway
merge,” but the two algorithms differ in dictionary structure and use of com-
pression.
Moffat and Bell (1995) show how to construct an index “in situ,” that
is, with disk space usage close to what is needed for the final index and
with a minimum of additional temporary files (cf. also Harman and Candela
(1990)). They give Lesk (1988) and Somogyi (1990) credit for being among
the first to employ sorting for index construction.
The SPIMI method in Section
4.3 is from (Heinz and Zobel 2003). We have
simplified several aspects of the algorithm, including compression and the
fact that each term’s data structure also contains, in addition to the postings
list, its document frequency and house keeping information. We recommend
Heinz and Zobel (2003) and Zobel and Moffat (2006) as up-do-date, in-depth
treatments of index construction. Other algorithms with good scaling prop-
erties with respect to vocabulary size require several passes through the data,
e.g., FAST-INV (Fox and Lee 1991, Harman et al. 1992).
The MapReduce architecture was introduced by Dean and Ghemawat (2004).
An open source implementation of MapReduce is available at http://lucene.apache.org/hadoop/.
Ribeiro-Neto et al. (1999) and Melnik et al. (2001) describe other approaches