Let us assume that an outlier is not due to a data entry error (e.g., 99 was entered
instead of 9) or the failure to specify a missing data code (e.g., –9) in the data editor of
a statistics computer tool; that is, the outlier is a valid score. One possibility is that the
case does not belong to the population from which you intended to sample. Suppose
that a senior graduate student audits a lower-level undergraduate class in which a
questionnaire is distributed. The auditing student is from a different population, and his or
her questionnaire responses may be extreme compared with those of classmates. If it is
determined that a case with outlier scores is not from the same population as the rest,
then it is best to remove that case from the sample. Otherwise, there are ways to reduce
the influence of extreme scores if they are retained. One option is to convert extreme
scores to a value that equals the next most extreme score that is within three standard
deviations of the mean. Another is to apply a mathematical transformation to a variable
with outliers. Transformations are discussed later in this chapter.
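
The first option above can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the function name `pull_in_outliers`, the `z_limit` parameter, and the example scores are hypothetical, and the mean and standard deviation are computed from all scores, including the outlier, using the sample (n - 1) standard deviation.

```python
from statistics import mean, stdev

def pull_in_outliers(scores, z_limit=3.0):
    """Replace scores more than z_limit SDs from the mean with the most
    extreme observed score that lies within that limit."""
    m = mean(scores)
    sd = stdev(scores)  # sample SD (n - 1 denominator), outliers included
    lower, upper = m - z_limit * sd, m + z_limit * sd
    # Scores that fall within the limits; their min and max are the
    # "next most extreme" values used as replacements.
    inside = [x for x in scores if lower <= x <= upper]
    lo, hi = min(inside), max(inside)
    return [hi if x > upper else lo if x < lower else x for x in scores]

# Hypothetical data: 20 ordinary scores and one valid but extreme score.
scores = [8, 9, 10, 11, 12] * 4 + [60]
adjusted = pull_in_outliers(scores)
print(adjusted[-1])  # the 60 is replaced by 12, the highest score within 3 SDs
```

Because an extreme score inflates the standard deviation used to detect it, a single outlier can exceed three standard deviations only when the sample is reasonably large; in very small samples this rule may flag nothing.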
Missing data
The topic of how to analyze data sets with missing observations is complicated. Entire
books and special sections of journals (Allison, 2001; Little & Rubin, 2002; McKnight,
McKnight, Sidani, & Figueredo, 2007; West, 2001) are devoted to it. This is fortunate
because it is not possible here to give a comprehensive account of the topic. The goal
instead is to acquaint you with basic analysis options, explain the relevance of these
options to SEM, and provide references for further study.
Ideally, researchers would always work with complete data sets, ones with no
missing values. Otherwise, prevention is the best approach. For example, questionnaire items
that are clear and unambiguous may prevent missing responses, and completed forms
should be reviewed for missing responses before research participants leave the
laboratory. In the real world, missing values occur in many (if not most) data sets, despite the
best efforts at prevention. Missing data occur for many reasons, including hardware
failure, software bugs, missed appointments, and case attrition. A few missing values,
such as less than 5% on a single variable, in a large sample may be of little concern. This
is especially true if the reason for data loss is ignorable, which means accidental or not
systematic. In this case the choice among methods for dealing with the missing observations
is largely arbitrary because the method used tends to make little difference.
A systematic data loss pattern, on the other hand, means that incomplete cases differ
from cases with complete records for some reason, rather than randomly. Thus, results
based only on the cases with complete records may not generalize to the whole population.
This situation is more difficult because the use of different methods for handling
missing data could yield different results, perhaps all biased.
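
As a rough screening aid for the two situations just described, the sketch below reports the percentage of missing values on a variable and compares incomplete with complete cases on another variable. The data, the variable names (`age`, `income`), the function names, and the use of `None` as the missing-data code are all hypothetical. A large difference between the two group means hints at a systematic pattern, but it is not a formal test.

```python
from statistics import mean

MISSING = None  # assumed missing-data code for this sketch

def percent_missing(rows, variable):
    """Percentage of cases with a missing value on `variable`."""
    n_missing = sum(1 for row in rows if row[variable] is MISSING)
    return 100.0 * n_missing / len(rows)

def compare_by_missingness(rows, target, other):
    """Mean of `other` for cases missing `target` versus cases observed
    on `target`. Assumes both groups contain at least one case."""
    missing = [r[other] for r in rows
               if r[target] is MISSING and r[other] is not MISSING]
    observed = [r[other] for r in rows
                if r[target] is not MISSING and r[other] is not MISSING]
    return mean(missing), mean(observed)

# Hypothetical data: income is missing only for the two oldest cases.
ages = [22, 25, 28, 31, 34, 37, 40, 43, 61, 67]
incomes = [30, 32, 35, 36, 40, 41, 44, 47, MISSING, MISSING]
rows = [{"age": a, "income": i} for a, i in zip(ages, incomes)]

pct = percent_missing(rows, "income")
age_if_missing, age_if_observed = compare_by_missingness(rows, "income", "age")
print(pct, age_if_missing, age_if_observed)
```

In this made-up example, 20% of the income scores are missing and the incomplete cases are much older on average than the complete ones, so the data loss does not look accidental.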
Most methods that deal with missing observations assume that the data loss
pattern is ignorable. There are two general kinds of ignorable patterns: missing at random
(MAR) and missing completely at random (MCAR). If the missing observations on
some variable X differ from the observed scores on that variable only by chance, the data
loss pattern is MAR. If, in addition to the property just mentioned, the presence versus