Quality Progress
March 2003
Smart Project Selection
Narrow your list of improvement projects with outlier techniques
by Joseph D. Conklin
Say your company has identified 13 quality improvement projects as part of its Six Sigma program. Now it's time to pick which ones to do. Assume the projects have been looked at by all the departments involved. Each project has been assessed on a variety of factors and given a summary score.
The scores are on a scale of 0 to 100%. The higher the score, the better a project is. The scores for the 13 projects, ranked from lowest to highest, appear in Table 1. (From here on, I won't write the % sign after the data values; it is understood.)
The Problem
You want to pick the best set of projects from the list, but you can't assume they are all equally good. How do you decide between the best and not quite best?
Setting an arbitrary cutoff score has some drawbacks. First, any measurement system has variation. Two projects might differ by a few points in their scores but be equally good in reality. Second, any measurement system that uses scores to rate projects is bound to have some element of subjective judgment wrapped up in it. This adds another level of noise to the measurement system. The problem with an arbitrary cutoff is that the projects on either side of the cutoff point might really be of the same importance.
What you have is an outlier identification problem. Outliers are values in a data set with a high probability of being in a truly different group from the others. There are many statistical techniques for identifying outliers, but the two I will discuss here are a graphical tool known as a box and whisker plot and a version of a test called Dixon's outlier test.
Box and whisker plots can be constructed in a spreadsheet.1 They are also an option in some statistical computer packages, such as Minitab. Dixon's test requires only a calculator and access to a table of critical values. We provide a table in this article that covers four to 20 data points.
Box and Whisker Plot
Figure 1 shows an illustration of a box and whisker plot. (Note: Box and whisker plots are often drawn to show two categories of outliers. The first category lies between 1.5 and 3.0 times the interquartile range from the median. The second is all the outliers beyond the first category. This distinction, however, is not necessary for the purpose of this article.)
To identify outliers based on the box and whisker plot technique, you need to determine the first quartile (Q1), median (M) and third quartile (Q3). The first quartile is the point below which the lowest 25% of the data is found. The median is the point that has half the data below it and half the data above it. The third quartile is the point below which 75% of the data is found.
For our data set:
- Q1 is 88.25. (See Table 2 to learn how to figure out the Q1.)
- M is 91. (See Table 3 to learn how to figure out the M.)
- Q3 is 94.75. (See Table 4 to learn how to figure out the Q3.)
After determining these three values, you can determine the outliers:
1. Subtract Q1 from Q3 to obtain the interquartile range (IQR). 94.75 - 88.25 = 6.50.
2. Take the IQR, 6.50, and multiply it by 1.5. You end up with 9.75.
3. Subtract 9.75 from the median, 91 (91 - 9.75 = 81.25).
4. Add 9.75 to the median, 91 (91 + 9.75 = 100.75).
5. Look for any values below 81.25. There is one, 74.
6. Look for any values above 100.75. There are none. 100 is the highest possible score.
Using the box and whisker technique, you end up with one outlier, 74. You can then drop the project with this score from your list and consider the remaining 12 as coming from the same group.
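The six steps above can be sketched in a few lines of Python. This is only an illustration of the arithmetic, not a replacement for a spreadsheet or statistical package. The function names are made up for this example, and because Table 1 is not reproduced here, the check uses only the four scores quoted in the text. Note that the sketch follows the steps as written and measures the 1.5 x IQR distance from the median; many software packages measure it from Q1 and Q3 instead.

```python
def median_based_limits(q1, median, q3, multiplier=1.5):
    """Return (lower, upper) limits following steps 1 to 4 above."""
    iqr = q3 - q1               # step 1: interquartile range
    spread = multiplier * iqr   # step 2: 1.5 times the IQR
    return median - spread, median + spread  # steps 3 and 4


def find_outliers(scores, lower, upper):
    """Return the scores that fall outside the limits (steps 5 and 6)."""
    return [s for s in scores if s < lower or s > upper]


# Quartiles and median as stated in the article.
lower, upper = median_based_limits(q1=88.25, median=91, q3=94.75)
print(lower, upper)  # 81.25 100.75

# Only the scores quoted in the text; the full list is in Table 1.
quoted_scores = [74, 86, 96, 97]
print(find_outliers(quoted_scores, lower, upper))  # [74]
```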
The box and whisker plot calculations don't assume your data come from any particular distribution, such as normal or exponential. This is what makes the technique flexible. Also, the calculations identify all the outliers in one pass. If there is more than one outlier, this plot makes it easy to find them all.
The price you pay for this convenience is a loss in what statisticians call sensitivity. Compared to other techniques tailored to a particular distribution, such as a normal distribution, the box and whisker approach requires the outliers to be more extreme before it can pick them up. If you aren't sure whether your data are approximately normal, exponential or of some other distribution, the box and whisker approach will let you err on the side of being conservative in what you call an outlier.2
Dixon's Outlier Test
For convenience, I'll use the same project scores to illustrate a version of Dixon's outlier test. This would not be a good idea in practice because Dixon's test is tailored to a normal distribution, and our project scores have natural upper and lower boundaries, 0% and 100%. In general, data in the form of percentages tend not to follow a normal distribution.
Dixon's test requires four numbers from your data set: the two largest and the two smallest (97, 96, 86 and 74). The calculations are as follows:
1. Subtract 96 from 97 (97 - 96 = 1).
2. Subtract 74 from 86 (86 - 74 = 12).
3. Subtract the smallest value from the largest value (97 - 74 = 23).
4. Divide 1 by 23 and round the result to two decimal places (1/23 = 0.04).
5. Divide 12 by 23 and round the result to two decimal places (12/23 = 0.52).
6. Take the larger of 0.04 and 0.52 and compare it to the critical values in Table 5. Because there are 13 scores, you need to compare 0.52 to the critical values for a 13-point data set.
7. The table lists only critical values for even-numbered sample sizes from 10 to 20. To obtain the critical values for a sample size of 13, it is reasonable to interpolate between the values for 12 and 14. To do this, average the 5% values for 12 and 14: (0.429 + 0.397)/2 = 0.826/2 = 0.413. The 1% value for a sample size of 13 is found the same way: (0.520 + 0.485)/2 = 1.005/2 = 0.5025, which rounds to 0.503. A similar calculation would apply to sample sizes of 11, 15, 17 and 19.
8. The results of the preceding step tell you that if you accept a 5% chance of being wrong, the larger ratio from step 6 has to exceed 0.413 before you call the corresponding score an outlier. For a 1% chance of being wrong, the ratio has to exceed 0.503.
With a calculated value of 0.52, which exceeds the 1% critical value of 0.503, you can conclude with a 1% chance of being wrong that you have an outlier. Since 0.52 originated with the calculation involving 74, the lowest project score, you classify this score as the outlier and treat the remaining 12 scores as being part of the same group.
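For readers who would rather check the arithmetic in code than on a calculator, the following Python sketch walks through the same steps. The function names are hypothetical, chosen only for this illustration. The critical values are the four Table 5 entries quoted in step 7, and the sketch implements only the version of the test described here (gap at each end divided by the full range).

```python
def dixon_ratio(two_smallest, two_largest):
    """Return (ratio, suspect score) for the version of the test described above."""
    low, next_low = sorted(two_smallest)
    next_high, high = sorted(two_largest)
    data_range = high - low                        # step 3
    high_ratio = (high - next_high) / data_range   # steps 1 and 4
    low_ratio = (next_low - low) / data_range      # steps 2 and 5
    if low_ratio >= high_ratio:                    # step 6: keep the larger ratio
        return low_ratio, low
    return high_ratio, high


def interpolate_critical(value_below, value_above):
    """Average the critical values of the neighboring sample sizes (step 7)."""
    return (value_below + value_above) / 2


ratio, suspect = dixon_ratio(two_smallest=(74, 86), two_largest=(96, 97))
crit_5 = interpolate_critical(0.429, 0.397)  # sample sizes 12 and 14, 5% level
crit_1 = interpolate_critical(0.520, 0.485)  # sample sizes 12 and 14, 1% level

print(round(ratio, 2), suspect)        # 0.52 74
print(round(crit_5, 3))                # 0.413
print(ratio > crit_5, ratio > crit_1)  # True True
```

The comparison uses the unrounded ratio (12/23 is about 0.5217), so the conclusion does not depend on how the intermediate values are rounded.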
If you repeat Dixon's outlier test on the remaining 12 scores because you suspect there might be more outliers, you run into a problem. Dixon's test is designed to identify a single outlier. When tests for single outliers are repeated on the same data set, the risk levels in Table 5 no longer apply exactly.
What's Next?
If you prefer some other approach to the box and whisker technique for finding multiple outliers, your best option is to look at the other tests specifically designed to detect them. The test you choose will depend on what assumptions you are willing to make.3
Your next best option is to repeat Dixon's test no more than three or four times on the same data set, but only use the values for the 5% point. This partly compensates for the loss in the precision of Table 5 when the test is repeated.
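A rough sketch of this repeated procedure in Python might look like the following. It is an illustration under stated assumptions, not a definitive implementation: the function name is made up, and critical_5pct is a hypothetical dictionary the reader would fill in from the 5% column of Table 5 (sample size mapped to critical value). The ratio is the same one used in the worked example above.

```python
def repeated_dixon(scores, critical_5pct, max_rounds=3):
    """Remove at most max_rounds outliers, one per round, using 5% critical values.

    critical_5pct maps a sample size to its 5% critical value. It is a
    hypothetical stand-in for the 5% column of Table 5.
    """
    remaining = sorted(scores)
    removed = []
    for _ in range(max_rounds):
        n = len(remaining)
        if n < 4 or n not in critical_5pct:
            break  # no critical value available for this sample size
        data_range = remaining[-1] - remaining[0]
        if data_range == 0:
            break  # all scores identical, nothing to test
        low_ratio = (remaining[1] - remaining[0]) / data_range
        high_ratio = (remaining[-1] - remaining[-2]) / data_range
        if max(low_ratio, high_ratio) <= critical_5pct[n]:
            break  # larger ratio does not exceed the critical value, so stop
        if low_ratio >= high_ratio:
            removed.append(remaining.pop(0))   # lowest score is the outlier
        else:
            removed.append(remaining.pop())    # highest score is the outlier
    return removed, remaining
```

The max_rounds cap reflects the advice to repeat the test no more than three or four times; the loop also stops as soon as a round finds no outlier.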
Having dropped the project with a score of 74 from the list, what should be your next step? Since the 12 remaining projects are considered to be in the same group statistically, you should consider doing all of them if resources permit.
If there are not enough resources for all 12, then the answer depends on whether the scores take all the relevant factors into account. If they don't, you have two options:
1. Recompute the scores to take into account all the factors and rerun the outlier tests. See if this reduces the list to a manageable number.
2. Pick one of the factors you left out and rerank the projects on the basis of that factor. Choose the projects in order from highest to lowest until all the available resources are spoken for.
If you believe the scores account for all the relevant factors, you can then start with the project that needs the least amount of resources. From there, add projects in order of increasing need for resources and stop when all available resources are spoken for. With your final choice of projects in hand, you are ready to employ Six Sigma principles to make the projects happen.
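The least-resources-first rule described above can be sketched as a short Python function. The project names, resource figures and units below are made up purely for illustration; the article gives no resource data. The sketch stops at the first project that no longer fits, which is one reasonable reading of "stop when all available resources are spoken for."

```python
def select_by_least_resources(resource_needs, available):
    """Pick projects in order of increasing resource need until the budget runs out.

    resource_needs maps a project name to the resources it requires, in any
    consistent unit (for example staff-weeks). Names and units are hypothetical.
    """
    chosen = []
    remaining = available
    for name, need in sorted(resource_needs.items(), key=lambda item: item[1]):
        if need > remaining:
            break  # the next smallest project no longer fits, so stop here
        chosen.append(name)
        remaining -= need
    return chosen, remaining


# Made-up numbers for illustration only.
needs = {"Project A": 6, "Project B": 2, "Project C": 4, "Project D": 9}
chosen, left_over = select_by_least_resources(needs, available=13)
print(chosen, left_over)  # ['Project B', 'Project C', 'Project A'] 1
```

The same loop handles option 2 above if the sort key is changed to the left-out factor in descending order.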
REFERENCES
1. John W. Tukey, Exploratory Data Analysis, Addison-Wesley, 1977.
2. Vic Barnett and Toby Lewis, Outliers in Statistical Data, third edition, John Wiley & Sons, 1994.
3. Ibid.
JOSEPH D. CONKLIN is a mathematical statistician at the U.S. Bureau of the Census in Suitland, MD. He earned a master's degree in statistics from Virginia Tech and is a Senior Member of ASQ. Conklin is also an ASQ certified quality manager, quality engineer, quality auditor and reliability engineer.