Schmuller J. Statistical Analysis with Excel For Dummies


Chapter 18

A Career in Modeling

In This Chapter

▶ What is a model?

▶ Modeling and fitting

▶ Working with the Monte Carlo method

Model is a term that gets thrown around a lot these days. Simply put,

a model is something you know and can work with that helps you

understand something you know little about. A model is supposed to mimic,

in some way, the thing it’s modeling. A globe, for example, is a model of the

earth. A street map is a model of a neighborhood. A blueprint is a model of

a house.

Researchers use models to help them understand natural processes and

phenomena. Business analysts use models to help them understand business

processes. The models these people use might include concepts from mathematics

and statistics, concepts that are so well known they can shed light

on the unknown. The idea is to create a model that consists of concepts you

understand, put the model through its paces, and see if the results look like

real-world results.

In this chapter, I discuss modeling. My goal is to show how you can harness

Excel’s statistical capabilities to help you understand processes in your world.

Modeling a Distribution

In one approach to modeling, you gather data and group them into a distribution.

Next, you try to figure out a process that results in that kind of a

distribution. Restate that process in statistical terms so that it can generate a

distribution, and then see how well the generated distribution matches up to

the real one. This “process you figure out and restate in statistical terms” is

the model.


Part IV: Working with Probability

If the distribution you generate matches up well with the real data, does this

mean your model is “right”? Does it mean the process you guessed is the process

that produces the data?

Unfortunately, no. The logic doesn’t work that way. You can show that a

model is wrong, but you can’t prove that it’s right.

Plunging into the Poisson distribution

In this section, I go through an example of modeling with the Poisson distribution.

I introduced this distribution in Chapter 17, and I told you it seems to

characterize an array of processes in the real world. By characterize a process,

I mean that a distribution of real-world data looks a lot like a Poisson

distribution. When this happens, it’s possible that the kind of process that

produces a Poisson distribution is also responsible for producing the data.

What is that process? Start with a random variable x that tracks the number

of occurrences of a specific event in an interval. In Chapter 17, the “interval”

was a sample of 1,000 universal joints, and the specific event was “defective

joint.” Poisson distributions are also appropriate for events occurring in

intervals of time, and the event can be something like “arrival at a toll booth.”

Next, I outline the conditions for a Poisson process, and use both defective

joints and toll booth arrivals to illustrate:

✓ The numbers of occurrences of the event in two nonoverlapping intervals

are independent.

The number of defective joints in one sample is independent of the number

of defective joints in another. The number of arrivals at a toll booth during

one hour is independent of the number of arrivals during another.

✓ The probability of an occurrence of the event is proportional to the size

of the interval.

The chance that you’ll find a defective joint is larger in a sample of

10,000 than it is in a sample of 1,000. The chance of an arrival at a toll

booth is greater for one hour than it is for a half hour.

✓ The probability of more than one occurrence of the event in a small

interval is 0 or close to 0.

In a sample of 1,000 universal joints, you have an extremely low probability

of finding two defective ones right next to one another. At any

time, two vehicles don’t arrive at a toll booth simultaneously.
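The three conditions above can be acted out in a short simulation. The sketch below is illustrative only: it assumes a toll booth averaging 3 arrivals per hour and chops each hour into 1,000 tiny slices, each of which independently holds at most one arrival. The function name, the slice count, and the seed are assumptions made for this example, not anything from the chapter.

```python
import random
from collections import Counter

def simulate_hourly_counts(rate_per_hour, hours, slices_per_hour=1000, seed=18):
    """Count arrivals per hour when each tiny slice holds at most one arrival."""
    rng = random.Random(seed)
    # Condition 2: the chance of an arrival is proportional to the slice size.
    p_slice = rate_per_hour / slices_per_hour
    counts = []
    for _ in range(hours):
        # Conditions 1 and 3: slices are independent, and no slice holds two arrivals.
        arrivals = sum(1 for _ in range(slices_per_hour) if rng.random() < p_slice)
        counts.append(arrivals)
    return counts

counts = simulate_hourly_counts(rate_per_hour=3, hours=1000)
mean_count = sum(counts) / len(counts)
tally = Counter(counts)

print(f"Average arrivals per hour: {mean_count:.2f}")
for arrivals in sorted(tally):
    print(f"{arrivals} arrivals: {tally[arrivals]} hours")
```

The tally of hours by number of arrivals should look roughly like a Poisson distribution with a mean near 3, although any single run will wobble a bit.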

As I show you in Chapter 17, the formula for the Poisson distribution is

$$P(x) = \frac{\mu^{x} e^{-\mu}}{x!}$$


In this equation, μ represents the average number of occurrences of the event in the interval you’re looking at, and e is the constant 2.71828 (followed by infinitely many more decimal places).

Time to use the Poisson in a model. At the FarBlonJet Corporation, web designers track the number of hits per hour on the intranet home page. They monitor the page for 200 consecutive hours, and group the data as in Table 18-1.

Table 18-1 Hits Per Hour on the FarBlonJet Intranet Home Page

Hits/Hour    Observed Hours    Hits/Hour × Observed Hours
0            10                0
1            30                30
2            44                88
3            44                132
4            36                144
5            18                90
6            10                60
7            8                 56
Total        200               600

The first column shows the variable Hits/Hour. The second column, Observed

Hours, shows the number of hours in which each value of Hits/Hour occurred.

In the 200 hours observed, 10 of those hours went by with no hits, 30 hours

had one hit, 44 had two hits, and so on. These data lead the web designers

to use a Poisson distribution to model Hits/Hour. Another way to say this:

They believe a Poisson process produces the number of hits per hour on the

Web page.

Multiplying the first column by the second column results in the third

column. Summing the third column shows that in the 200 observed hours the

intranet page received 600 hits. So the average number of hits/hour is 3.00.

Applying the Poisson distribution to this example,

$$P(x) = \frac{3^{x} e^{-3}}{x!}$$

From here on, I pick it up in Excel.


Using POISSON

Figure 18-1 shows each value of x (hits/hour), the probability of each x if the

average number of hits per hour is 3, the predicted number of hours, and the

observed number of hours (taken from the second column in Table 18-1). I

selected cell B3 so that the formula box shows how I used the POISSON worksheet

function. I autofilled Column B down to cell B10. (For the details on

using POISSON, see Chapter 17.)

Figure 18-1: Web-page hits/hour, Poisson-predicted (μ=3) and observed.

To get the predicted number of hours, I multiplied each probability in

Column B by 200 (the total number of observed hours). I used Excel’s graphics

capabilities (see Chapter 3) to show you how close the predicted hours

are to the observed hours. They look pretty close, don’t they?
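For readers who want to check the worksheet outside Excel, here is a minimal Python sketch of the same arithmetic: the mean from Table 18-1, the Poisson probability of each value of hits/hour, and the predicted hours. The variable and function names are my own and simply stand in for the worksheet columns.

```python
import math

# Observed hours for 0 through 7 hits/hour, taken from Table 18-1.
observed_hours = [10, 30, 44, 44, 36, 18, 10, 8]
hits = range(len(observed_hours))

total_hours = sum(observed_hours)                               # 200
total_hits = sum(x * n for x, n in zip(hits, observed_hours))   # 600
mean_hits = total_hits / total_hours                            # 3.0

def poisson_pmf(x, mu):
    """Probability of exactly x occurrences when the average is mu."""
    return mu ** x * math.exp(-mu) / math.factorial(x)

# Counterpart of the probability column, then probability times total hours.
probabilities = [poisson_pmf(x, mean_hits) for x in hits]
predicted_hours = [p * total_hours for p in probabilities]

for x, p, pred, obs in zip(hits, probabilities, predicted_hours, observed_hours):
    print(f"{x} hits/hour: P = {p:.4f}, predicted = {pred:.2f}, observed = {obs}")
```

Note that the eight probabilities do not add up to exactly 1, because a Poisson variable can exceed 7; the predicted hours therefore total a little less than 200.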

Testing the model’s fit

Well, “looking pretty close” isn’t enough for a statistician. A statistical test is

a necessity. As is the case with all statistical tests, this one starts with a null

hypothesis and an alternative hypothesis. Here they are:

H₀: The distribution of observed hits/hour follows a Poisson distribution.

H₁: Not H₀

The appropriate statistical test involves an extension of the binomial distribution.

It’s called the multinomial distribution (“multi” because it encompasses

more categories than just “success” and “failure”). It’s difficult to work

with, and Excel has no worksheet function to handle the computations.

Fortunately, pioneering statistician Karl Pearson (inventor of the correlation coefficient) noticed that χ² (“chi-square”), a distribution I show you in


Chapter 11, approximates the multinomial. Originally intended for one-sample hypothesis tests about variances, χ² has become much better known for applications like the one I’m about to show you.

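Pearson’s big idea was this. If you want to know how well a hypothesized distribution (like the Poisson) fits a sample (like the observed hours), use the distribution to generate a hypothesized sample (our predicted hours, for instance), and work with this formula:

$$\chi^2 = \sum \frac{(\text{Observed} - \text{Predicted})^2}{\text{Predicted}}$$

Usually, this is written with Expected rather than Predicted, and both Observed and Expected are abbreviated. The usual form of this formula is:

$$\chi^2 = \sum \frac{(O - E)^2}{E}$$

For this example

$$\chi^2 = \frac{(10 - 9.9574)^2}{9.9574} + \frac{(30 - 29.8722)^2}{29.8722} + \cdots + \frac{(8 - 4.3208)^2}{4.3208}$$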

What does that total up to? Excel figures it out for us. Figure 18-2 shows the same columns as before, with column F holding the values for (O-E)²/E. I could have used this formula

=((D3-C3)^2)/C3

to calculate the value in F3 and then autofill down to F10.

I chose a different route. First I assigned the name Predicted_Hrs to C3:C10

and the name Observed_Hrs to D3:D10. Then I used an array formula (see

Chapter 2). I selected F3:F10 and created this formula

=(Observed_Hrs-Predicted_Hrs)^2/Predicted_Hrs

Pressing CTRL+Shift+Enter puts the values into F3:F10. That key combination

also puts the curly brackets into the formula in the Formula Bar.

The sum of the values in column F is in cell F11, and that’s χ². If you’re trying to show that the Poisson distribution is a good fit to the data, you’re looking for a low value of χ².
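The column of (O-E)²/E values and its sum can be reproduced in a few lines of Python. This is only a sketch of the worksheet arithmetic: the observed hours come from Table 18-1, the expected hours are recomputed from a Poisson distribution with μ = 3, and the variable names are assumptions of this example.

```python
import math

observed = [10, 30, 44, 44, 36, 18, 10, 8]   # Table 18-1, 0 through 7 hits/hour
mu = 3.0                                     # average hits/hour
total_hours = 200

# Expected (predicted) hours: Poisson probability times total observed hours.
expected = [total_hours * mu ** x * math.exp(-mu) / math.factorial(x)
            for x in range(len(observed))]

# One (O-E)^2/E term per category, then the sum, which is chi-square.
terms = [(o - e) ** 2 / e for o, e in zip(observed, expected)]
chi_square = sum(terms)

for x, term in enumerate(terms):
    print(f"{x} hits/hour: (O-E)^2/E = {term:.4f}")
print(f"chi-square = {chi_square:.4f}")
```

Most of the total comes from the last category (7 hits/hour), where 8 hours were observed against roughly 4.3 predicted.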


Figure 18-2: Web page hits/hour, Poisson-predicted (μ=3) and observed, along with the calculations needed to compute χ².

OK. Now what? Is 3.5661 high or is it low?

To find out, you evaluate the calculated value of χ² against the χ² distribution. The goal is to find the probability of getting a value at least as high as the calculated value, 3.5661. The trick is to know how many degrees of freedom (df) you have. For a goodness-of-fit application like this one

$$df = k - m - 1$$

where k = the number of categories and m = the number of parameters estimated from the data. The number of categories is 8 (0 Hits/Hour through 7 Hits/Hour). The number of parameters? I used the observed hours to estimate the parameter μ, so m in this example is 1. That means df = 8 - 1 - 1 = 6.

Use the worksheet function CHIDIST on the value in F11, with 6 df. CHIDIST returns .73515, the probability of getting a χ² of at least 3.5661 if H₀ is true. (See Chapter 10 for more on CHIDIST.) Figure 18-3 shows the χ² distribution with 6 df and the area to the right of 3.5661.
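As a cross-check on that probability, here is a small Python sketch. It relies on a standard closed form that holds only when df is even: the upper-tail probability equals e^(-x/2) times the first df/2 terms of the series for e^(x/2). The function name is hypothetical, and this is not a general replacement for CHIDIST.

```python
import math

def chi_square_upper_tail_even_df(x, df):
    """Upper-tail chi-square probability; valid only for even, positive df."""
    if df <= 0 or df % 2 != 0:
        raise ValueError("this closed form needs an even, positive df")
    half_x = x / 2
    # e^(-x/2) * sum over j = 0 .. df/2 - 1 of (x/2)^j / j!
    series = sum(half_x ** j / math.factorial(j) for j in range(df // 2))
    return math.exp(-half_x) * series

k = 8          # categories: 0 through 7 hits/hour
m = 1          # parameters estimated from the data (the mean)
df = k - m - 1

p_value = chi_square_upper_tail_even_df(3.5661, df)
print(f"df = {df}, probability = {p_value:.5f}")
```

The result should agree with the .73515 that the text reports for CHIDIST to about five decimal places.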

If α = .05, the decision is to not reject H₀, meaning you can’t reject the hypothesis that the observed data come from a Poisson distribution.

This is one of those infrequent times when it’s beneficial to not reject H₀, if you want to make the case that a Poisson process is producing the data. If the probability had been just a little greater than .05, not rejecting H₀ would look suspicious. The large probability, however, makes nonrejection of H₀ (and an underlying Poisson process) seem more reasonable. (For more on this see the sidebar in Chapter 10.)


Figure 18-3: The χ² distribution, df = 6. The shaded area is the probability of getting a χ² of at least 3.5661 if H₀ is true.

A word about CHITEST

Excel provides CHITEST, a worksheet function that on first look appears

to carry out the test I showed you with about one tenth the work I did on

the worksheet. Its Function Arguments dialog box provides one box for the

observed values and another for the expected values.

The catch is that CHITEST does not return a value for χ². It skips that step and returns the probability that you’ll get a χ² at least as high as the one you calculate from the observed values and the predicted values.

The problem is that CHITEST’s degrees of freedom are wrong for this case. CHITEST goes ahead and assumes that df = k-1 (7) rather than k-m-1 (6). You lose a degree of freedom because you estimate μ from the data. In other kinds of modeling, you lose more than one degree of freedom. Suppose, for example, you believe that a normal distribution characterizes the underlying process. In that case, you estimate μ and σ from the data, and you lose two degrees of freedom.

By basing its answer on more than the correct df, CHITEST gives you an inappropriately large (and misleading) value for the probability.


CHITEST would be perfect if it had an option for entering df, or if it returned a value for χ² (which you could then evaluate via CHIDIST and the correct df).

When you don’t lose any degrees of freedom, CHITEST works as advertised.

Does that ever happen? In the next section, it does.
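To see what such an option might look like, here is a hedged Python sketch of a CHITEST-style helper that returns χ², df, and the probability, and accepts the number of estimated parameters. The function names and the params_estimated argument are inventions for this example, not part of Excel. The tail probability uses standard closed-form series for whole-number df.

```python
import math

def chi_square_upper_tail(x, df):
    """Upper-tail chi-square probability for a positive whole-number df."""
    if df < 1:
        raise ValueError("df must be a positive integer")
    root = math.sqrt(x)
    density = math.exp(-x / 2) / math.sqrt(2 * math.pi)  # normal density at sqrt(x)
    if df % 2 == 0:
        # Even df: e^(-x/2) * [1 + sum of (x/2)^r / r! for r = 1 .. df/2 - 1]
        total, term = 1.0, 1.0
        for r in range(1, df // 2):
            term *= x / (2 * r)
            total += term
        return math.sqrt(2 * math.pi) * density * total
    # Odd df: two-tailed normal area plus a finite series in odd powers of sqrt(x)
    total, term = 0.0, 0.0
    for r in range(1, (df - 1) // 2 + 1):
        term = root if r == 1 else term * x / (2 * r - 1)
        total += term
    return math.erfc(root / math.sqrt(2)) + 2 * density * total

def goodness_of_fit(observed, expected, params_estimated=0):
    """Return (chi-square, df, probability) with df = k - m - 1."""
    chi_square = sum((o - e) ** 2 / e for o, e in zip(observed, expected))
    df = len(observed) - params_estimated - 1
    return chi_square, df, chi_square_upper_tail(chi_square, df)

observed = [10, 30, 44, 44, 36, 18, 10, 8]   # Table 18-1
expected = [200 * 3.0 ** x * math.exp(-3.0) / math.factorial(x) for x in range(8)]

# df = k - 1, which is what the text says CHITEST assumes
chi_sq, df_naive, p_naive = goodness_of_fit(observed, expected)
# df = k - m - 1, with one parameter (the mean) estimated from the data
_, df_adjusted, p_adjusted = goodness_of_fit(observed, expected, params_estimated=1)

print(f"chi-square = {chi_sq:.4f}")
print(f"df = {df_naive}: probability = {p_naive:.5f}")
print(f"df = {df_adjusted}: probability = {p_adjusted:.5f}")
```

With the hits/hour data, the probability based on 7 df comes out larger than the one based on 6 df, which is the misleading inflation described above.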

Playing ball with a model

Baseball is a game that generates huge amounts of statistics — and many

study these statistics closely. SABR, the Society for American Baseball

Research, has sprung from the efforts of a band of dedicated fan-statisticians

(fantasticians?) who delve into the statistical nooks and crannies of the Great

American Pastime. They call their work sabermetrics. (I made up “fantasticians.”

They call themselves “sabermetricians.”)

The reason I mention this is that sabermetrics supplies a nice example of

modeling. It’s based on the obvious idea that during a game a baseball team’s

objective is to score runs, and to keep its opponent from scoring runs. The

better a team does at both, the more games it wins. Bill James, who gave sabermetrics its name and is its leading exponent, discovered a neat relationship between the number of runs a team scores, the number of runs the team allows, and its winning percentage. He calls it the Pythagorean percentage:

$$\text{Pythagorean percentage} = \frac{(\text{Runs Scored})^2}{(\text{Runs Scored})^2 + (\text{Runs Allowed})^2}$$

Think of it as a model for predicting games won. Calculate this percentage

and multiply it by the number of games a team plays. Then compare the

answer to the team’s wins. How well does the model predict the number of

games each team won during the 2008 season?

To find out, I found all the relevant data for every Major League team for

2008. (Thank you, www.baseball-reference.com.) I put the data into the

worksheet in Figure 18-4.

As Figure 18-4 shows, I used an array formula to calculate the Pythagorean

percentage in Column D. First, I assigned the name Runs_Scored to the data

in Column B, and the name Runs_Allowed to the data in Column C. Then I

selected D2:D31 and created the formula

=Runs_Scored^2/(Runs_Scored^2 + Runs_Allowed^2)

Next, I pressed CTRL+Shift+Enter to put the values into D2:D31 and the curly

brackets into the formula in the Formula Bar.


Figure 18-4: Runs scored, runs allowed, predicted wins, and wins for each major league baseball team in 2008.

Had I wanted to do it another way, I’d have put this formula in Cell D2:

=B2^2/((B2^2)+(C2^2))

Then I would have autofilled the remaining cells in Column D.

Finally, I multiplied each Pythagorean percentage in Column D by the number

of games each team played (24 teams played 162 games, 6 played 161) to get

the predicted wins in Column E. Because the number of wins can only be a

whole number, I used the ROUND function to round off the predicted wins.

For example, the formula that supplies the value in E3 is:

=ROUND(D3*162,0)

The zero in the parentheses indicates that I wanted no decimal places.
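The same two worksheet steps (Pythagorean percentage, then rounded predicted wins) can be sketched in Python. The run totals below are made-up numbers chosen for illustration, not figures from the 2008 season, and the function names are assumptions of this example.

```python
def pythagorean_percentage(runs_scored, runs_allowed):
    """Runs scored squared over the sum of squared runs scored and allowed."""
    return runs_scored ** 2 / (runs_scored ** 2 + runs_allowed ** 2)

def predicted_wins(runs_scored, runs_allowed, games_played=162):
    """Pythagorean percentage times games played, rounded to a whole number."""
    return round(pythagorean_percentage(runs_scored, runs_allowed) * games_played)

# Hypothetical team: 800 runs scored, 700 runs allowed, 162 games played.
example_pct = pythagorean_percentage(800, 700)
example_wins = predicted_wins(800, 700)

print(f"Pythagorean percentage: {example_pct:.3f}")
print(f"Predicted wins: {example_wins}")
```

One caution: Python’s round sends exact halves to the nearest even number, while Excel’s ROUND sends them away from zero, so the two can differ by one win in the rare case of a value ending in exactly .5.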

Before proceeding, I assigned the name Predicted_Wins to the data in

Column E, and the name Wins to the data in Column F.


How well does the model fit with reality? This time, CHITEST can supply the

answer. I don’t lose any degrees of freedom here: I didn’t use the Wins data

in Column F to estimate any parameters, like a mean or a variance, and then

apply those parameters to calculate Predicted Wins. Instead, the predictions

came from other data — the Runs Scored and the Runs Allowed. For this

reason, df = k-m-1= 30-0-1 = 29.

Here’s how to use CHITEST (when it’s appropriate!):

1. With the data entered, select a cell for CHITEST’s answer.

2. From the Statistical Functions menu, select CHITEST and click OK to

open the Function Arguments dialog box for CHITEST. (See Figure 18-5.)

Figure 18-5: The CHITEST Function Arguments dialog box.

3. In the Function Arguments dialog box, type the appropriate values for

the arguments.

In the Actual_range box, type the cell range that holds the scores for the observed values. For this example, that’s Wins (the name for F2:F31).

In the Expected_range box, type the cell range that holds the predicted values. For this example, it’s Predicted_Wins (the name for E2:E31).

With the cursor in the Expected_range box, the dialog box mentions a

product of row totals and column totals. Don’t let that confuse you. That

has to do with a slightly different application of this function (which I

cover in Chapter 20).

With values entered for Actual_range and for Expected_range, the answer appears in the dialog box. The answer here is .99999518, which means that with 29 degrees of freedom you have a huge chance of finding a value of χ² at least as high as the one you’d calculate from these observed values and these predicted values. Bottom line: The model fits the data extremely well.

4. Click OK to put the answer into the selected cell.
