### Assessment of Statistical Performance

Session Chair: Pete Hovey, University of Dayton

**Variance Components - A Visual and Hands-on Approach to Comparing Computational Methods**

Janis Dugle

Abbott Laboratories

Abstract: To assess a chemical assay for a given product, two replicate values are produced on 10 different days. The model for these data is an example of "rep(lication) nested within day," and as such, a variance component calculation can be used to extract estimates of the day-to-day and rep-within-day variances. Three variance component methods available in the SAS® MIXED procedure are explored: Type1, MIVQUE0, and REML. Each method is applied to a model with an assigned proportion of day and rep variance, in several increments from 1% to 99%. For each combination of method and proportion, 1000 data sets are simulated. The variability and relationships among the estimates are illustrated graphically. The methods are then compared to show when and how they differ and whether the differences merit concern. Interesting results include 1) the relationships between methods when the Type1 day component is negative, 2) the effect of unbalanced data (missing reps), and 3) a shortcut for quickly obtaining the Satterthwaite approximation used for the degrees-of-freedom estimate for the sum of the components; some interesting generalizations are also suggested. This introduction is meant to demystify the variance component calculations and to inspire further exploration of them.
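The Type1 (ANOVA, method-of-moments) calculation the abstract explores can be sketched for the balanced case — in Python rather than SAS, and with illustrative variance values (25% day, 75% rep) that are assumptions, not figures from the talk. Note that the day component estimate can come out negative, the situation behind result 1:

```python
import random
import statistics

random.seed(1)

def simulate(var_day, var_rep, n_days=10, n_reps=2):
    """One balanced data set: rep(lication) nested within day."""
    data = []
    for _ in range(n_days):
        d = random.gauss(0.0, var_day ** 0.5)  # shared day effect
        data.append([d + random.gauss(0.0, var_rep ** 0.5) for _ in range(n_reps)])
    return data

def type1_components(data):
    """ANOVA (Type1) variance component estimates for the balanced nested model."""
    n_reps = len(data[0])
    day_means = [statistics.fmean(day) for day in data]
    msb = n_reps * statistics.variance(day_means)                     # between-day mean square
    mse = statistics.fmean(statistics.variance(day) for day in data)  # within-day mean square
    return (msb - mse) / n_reps, mse  # (day component, rep component)

# Averaging 1000 simulated estimates recovers the assigned variances,
# though individual day components can be negative.
est = [type1_components(simulate(0.25, 0.75)) for _ in range(1000)]
day_avg = statistics.fmean(e[0] for e in est)
rep_avg = statistics.fmean(e[1] for e in est)
print(round(day_avg, 3), round(rep_avg, 3))
```

The day component is estimated by differencing two mean squares, which is why it can go negative even though a variance cannot; REML and MIVQUE0 handle that situation differently, which is what the graphical comparison in the talk examines.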

**Don’t Use Rank Sum Tests to Analyze Factorial Designs**

Xin (Lucy) Lu

University of Tennessee - Knoxville

Rank sum tests are common for comparing the central tendency of two or more independent samples, as illustrated in Natrella's Experimental Statistics (1963, pp. 169ff.) and Montgomery's Design and Analysis of Experiments (2008, pp. 112-114). While some use of rank-based tests for replicated factorial designs is justifiable, the rank sum test proposed by Besseris (2009, Quality Engineering) for eight-run, fractional factorial designs is ill-conceived. This talk identifies two concerns regarding such tests. First, the backward elimination procedure mentioned by Besseris leads to excessive overfitting of the model; that procedure is virtually guaranteed to include inactive terms even when there are no active effects. Second, while one could correct for this overfitting problem, the lack of independence of the rank sum statistics makes the null distribution for the rank sums relevant only in the case of no real effects. It is for this second reason that we do not recommend rank sum tests for unreplicated factorial designs.
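The dependence concern can be illustrated with a small simulation — a sketch under assumed conditions, not a reproduction of Besseris's procedure. In an unreplicated eight-run design, every contrast's rank sum is built from the same eight ranks, so the null reference distribution of an inactive effect's rank sum is only valid when no effect is active; here a single dominant effect on column A visibly shrinks the spread of orthogonal column B's rank sum:

```python
import random
import statistics

random.seed(2)

# Two orthogonal contrast columns of a 2^3 full factorial (8 runs).
# The factor names A and B are illustrative.
A = [-1, -1, -1, -1, 1, 1, 1, 1]
B = [-1, -1, 1, 1, -1, -1, 1, 1]

def rank_sum(col, y):
    """Sum of the ranks (1..8) of y at the +1 level of a contrast column."""
    ranks = {i: r + 1 for r, i in enumerate(sorted(range(8), key=lambda i: y[i]))}
    return sum(ranks[i] for i in range(8) if col[i] == 1)

def simulate_b_rank_sums(effect_a, n=5000):
    """Distribution of B's rank sum when A has a given effect size."""
    sums = []
    for _ in range(n):
        y = [effect_a * a + random.gauss(0.0, 1.0) for a in A]
        sums.append(rank_sum(B, y))
    return sums

null_sd = statistics.pstdev(simulate_b_rank_sums(0.0))    # no real effects
active_sd = statistics.pstdev(simulate_b_rank_sums(10.0)) # huge A effect
print(round(null_sd, 2), round(active_sd, 2))
```

With a dominant A effect, the four largest ranks are forced onto the A=+1 runs, so B's rank sum can no longer reach its extreme values; a test that refers it to the all-null distribution is therefore miscalibrated whenever any other effect is real.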

**Using Actual Repairs and Repair Parts Use to Project Future Needs**

Carolyn Carroll

Stat Tech Inc.

Documentation of repairs and of the parts used in those repairs provides an invaluable source of information for projecting future repairs. When the age of the equipment, the types of operators, and the operating environment are known, this actual historical information allows engineers to evaluate new designs. The historical information, when analyzed under assumptions the engineer considers reasonable, also yields an accurate forecast of future spare-parts demands. This presentation will discuss one example of the use of historical records for projecting future parts needs.

**Should (T1-T2) Have Larger Uncertainty Than T1?**

Will Guthrie

NIST

In interlaboratory comparisons, laboratories sometimes calibrate a transfer instrument to compare the relative biases of their measurement processes and standards. One summary of interest from such comparisons is the pairwise difference between two laboratories' results. Since the labs have unequal variances, the uncertainty of this difference is usually computed by the Welch-Satterthwaite (WS) procedure. In the analysis of data from a comparison of temperature calibrations, a counterintuitive property of the WS procedure was observed: the uncertainty interval for a between-lab difference was found to be narrower than the corresponding interval for one of the component results. The typical reaction to this situation is to suspect the WS procedure of failing to achieve its nominal confidence level. However, this is not necessarily the correct explanation. In fact, situations exist where the confidence intervals for each laboratory's mean and for their pairwise difference all achieve the stated level of confidence even though the uncertainty of the difference is smaller than the uncertainty of at least one of its component results.
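The mechanism can be sketched numerically with a hypothetical two-lab example (the numbers are assumptions, not the temperature data from the talk). The combined standard uncertainty of the difference is necessarily larger than lab 1's, but when lab 1 has few degrees of freedom and lab 2 contributes a small, well-determined variance, the WS effective degrees of freedom exceed lab 1's, and the smaller t coverage factor can make the expanded uncertainty of the difference smaller:

```python
import math

# Standard two-sided 95 % Student-t critical values at small integer df;
# linear interpolation is used for the fractional WS degrees of freedom.
T_975 = {1: 12.706, 2: 4.303, 3: 3.182, 4: 2.776, 5: 2.571}

def t_crit(nu):
    """Approximate 97.5th Student-t percentile by linear interpolation (1 <= nu < 5)."""
    lo = math.floor(nu)
    return T_975[lo] + (nu - lo) * (T_975[lo + 1] - T_975[lo])

def ws_df(s1, nu1, s2, nu2):
    """Welch-Satterthwaite effective degrees of freedom for sqrt(s1^2 + s2^2)."""
    return (s1**2 + s2**2) ** 2 / (s1**4 / nu1 + s2**4 / nu2)

s1, nu1 = 1.0, 2    # lab 1: large scatter, very few repeat measurements
s2, nu2 = 0.3, 50   # lab 2: small scatter, many repeat measurements

u1_expanded = t_crit(nu1) * s1
nu_d = ws_df(s1, nu1, s2, nu2)
u_diff_expanded = t_crit(nu_d) * math.sqrt(s1**2 + s2**2)
print(round(u1_expanded, 3), round(nu_d, 3), round(u_diff_expanded, 3))
```

The standard uncertainty of the difference (about 1.044) exceeds s1, yet the effective degrees of freedom rise from 2 to about 2.38, and the drop in the t factor more than compensates — exactly the counterintuitive behavior the talk examines, with no failure of the WS procedure required.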