
Appendix C: Of Stems, Leaves, Boxes, Whiskers, and Smoothies
Another of Tukey’s techniques is called hanning. This is a running weighted
mean. You replace a data point with the sum of one fourth the previous data
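point plus half the data point plus one fourth the next data point. The formula,
where $x_i'$ is the smoothed value that replaces the data point $x_i$:
$$x_i' = \tfrac{1}{4}x_{i-1} + \tfrac{1}{2}x_i + \tfrac{1}{4}x_{i+1}$$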
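The weighted mean described above can be sketched in a few lines of Python. This is a minimal sketch, not the worksheet version: the function name `hann` and the sample values are made up for illustration, and leaving the first and last points unchanged is an assumption (they have no neighbor on one side).

```python
def hann(values):
    """Hanning: replace each interior point with
    1/4 * previous + 1/2 * current + 1/4 * next."""
    if len(values) < 3:
        return list(values)  # nothing to smooth
    smoothed = [values[0]]   # assumption: keep the first point as-is
    for i in range(1, len(values) - 1):
        smoothed.append(values[i - 1] / 4 + values[i] / 2 + values[i + 1] / 4)
    smoothed.append(values[-1])  # assumption: keep the last point as-is
    return smoothed


data = [4, 8, 4, 12, 8]  # made-up values, not home-run counts
print(hann(data))        # [4, 6.0, 7.0, 9.0, 8]
```

Because the weights add up to 1, a stretch of equal values stays the same after hanning, and only the bumps get flattened.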
Still another technique is the skip mean. For this one, I let the formula tell the
story:
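Assuming the usual definition of the skip mean (replace a data point with the mean of its two neighbors and skip the point itself), the formula would read:

$$x_i' = \frac{x_{i-1} + x_{i+1}}{2}$$

A minimal Python sketch under that assumption follows. The function name `skip_mean` and the sample values are illustrative only, and the endpoints are left unchanged because they have only one neighbor.

```python
def skip_mean(values):
    """Skip mean (assumed definition): replace each interior point
    with the mean of the points on either side of it."""
    if len(values) < 3:
        return list(values)  # nothing to smooth
    smoothed = [values[0]]   # assumption: keep the first point as-is
    for i in range(1, len(values) - 1):
        smoothed.append((values[i - 1] + values[i + 1]) / 2)
    smoothed.append(values[-1])  # assumption: keep the last point as-is
    return smoothed


data = [4, 8, 4, 12, 8]  # made-up values, not home-run counts
print(skip_mean(data))   # [4, 4.0, 10.0, 6.0, 8]
```

Note that the point being replaced plays no part in its own smoothed value, which is where the "skip" comes from.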
Tukey provides a number of others, but I confine the discussion to these
three.
In EDA, you don’t just use one technique on a set of data. Often, you start
with a median smooth, repeat it several times, and then try one or two
others.
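To make "repeat it several times" concrete, here is a hedged Python sketch. It assumes the three-median smooth replaces each point with the median of that point and its two neighbors, and it keeps repeating until another pass changes nothing. The names `median3` and `repeat_median3`, the `max_passes` cap, and the sample values are all illustrative.

```python
from statistics import median


def median3(values):
    """One pass of a three-median smooth (assumed definition):
    each interior point becomes the median of itself and its neighbors."""
    if len(values) < 3:
        return list(values)
    inner = [median(values[i - 1:i + 2]) for i in range(1, len(values) - 1)]
    return [values[0], *inner, values[-1]]  # endpoints left unchanged


def repeat_median3(values, max_passes=10):
    """Repeat the smooth until a pass changes nothing (or the cap is hit)."""
    current = list(values)
    for _ in range(max_passes):
        smoothed = median3(current)
        if smoothed == current:
            break
        current = smoothed
    return current


data = [5, 9, 6, 20, 7, 8]   # made-up values with one spike (20)
print(median3(data))         # [5, 6, 9, 7, 8, 8]
print(repeat_median3(data))  # [5, 6, 7, 8, 8, 8]
```

In this toy series the first pass removes the spike and the second pass tidies up what remains. After that, further passes make no difference.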
For the data in the scatterplot in Figure C-13, I applied the three-median
smooth, repeated it (that is, I applied it to the newly smoothed data), hanned
the smoothed data, and then applied the skip mean. Again, no technique (or
order of techniques) is right or wrong. You apply what you think illuminates
meaningful features of the data.
Figure C-14 shows part of a worksheet for all of this. I obviously couldn’t fit
all 108 years in one screenshot, but this gives you the idea. Column A shows
the year, column B the number of home runs hit that year in the American
League. The remaining columns show successive smooths of the data.
Column C applies the three-median smooth to column B, and column D applies
the three-median smooth to column C. A quick look at the numbers shows
that the repetition didn’t make much difference. Column E applies hanning to
column D, and column F applies the skip mean to column E. In columns C–F,
I used the actual number of home runs for the first value (for the year 1901)
and for the final value (for the year 2008).
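The column-by-column layout can be mirrored outside the worksheet. The Python sketch below is one way to do it under stated assumptions: the helper `smooth` and the three rule functions are hypothetical names, the skip mean is taken to be the mean of the two neighbors, and the values in `col_b` are made up rather than the actual home-run counts. As in columns C–F, the first and last values are copied through unchanged.

```python
from statistics import median


def smooth(values, rule):
    """Apply rule(prev, cur, nxt) to every interior point and
    copy the first and last values through unchanged."""
    if len(values) < 3:
        return list(values)
    inner = [rule(values[i - 1], values[i], values[i + 1])
             for i in range(1, len(values) - 1)]
    return [values[0], *inner, values[-1]]


def median3(prev, cur, nxt):
    return median((prev, cur, nxt))


def hanning(prev, cur, nxt):
    return prev / 4 + cur / 2 + nxt / 4


def skip_mean(prev, cur, nxt):
    # Assumed definition: mean of the neighbors, skipping the point itself.
    return (prev + nxt) / 2


col_b = [5, 9, 6, 20, 7, 8]        # made-up raw data
col_c = smooth(col_b, median3)     # three-median smooth of B
col_d = smooth(col_c, median3)     # three-median smooth of C
col_e = smooth(col_d, hanning)     # hanning of D
col_f = smooth(col_e, skip_mean)   # skip mean of E

for name, col in zip("BCDEF", (col_b, col_c, col_d, col_e, col_f)):
    print(name, col)
```

Passing the rule in as a function keeps the endpoint handling in one place, so each new column is a single line, much like filling a formula down a worksheet column.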
Just to clue you in on how I arrived at the smoothed values, here are the
worksheet formulas for a typical cell in each column.