
2003 Quality & Productivity Research Conference

IBM T. J. Watson Research Ctr., Yorktown Heights, NY

May 21-23, 2003


Invited Paper Sessions (with Abstracts and Papers)



Opening Plenary Session

Session Organizer: Emmanuel Yashchin, IBM Research

Session Chair: Emmanuel Yashchin, IBM Research
1. Conference kick-off: Paul Horn, IBM Senior VP, Research.

2. "Quality Mangement and Role of Statistics in IBM", Mike Jesrani, Director of the IBM Quality Management Process.

3. TBD, Anil Menon, IBM VP, Corporate Brand Strategy and Worldwide Market Intelligence.


1. Experimental Design

Session Organizers: Tim Robinson, University of Wyoming, and Douglas Montgomery, Arizona State University

Session Chair: Tim Robinson, University of Wyoming

1. "Graphical Methods to Assess the Prediction Capability of Mixture and Mixture-Process Designs", Heidi Goldfarb, Dial Corporation, Connie Borror, Arizona State University, Douglas Montgomery, Arizona State University, Christine Anderson-Cook, Virginia Polytechnic Institute and State University. Paper

Abstract: Mixture and mixture-process variable experiments are commonly encountered in the chemical, food, pharmaceutical, and consumer products industries as well as in other industrial settings. Very often, a goal of these experiments is to make predictions about various properties of the formulations. Therefore it is helpful to be able to examine the prediction variance properties of experimental designs for these types of experiments and select those that minimize these values. Variance Dispersion Graphs (VDGs) have been used to evaluate prediction variance for standard designs as well as mixture designs. They show the prediction variance as a function of distance from the design space centroid. We expand the VDG technique to handle mixture-process experiments with a Three Dimensional VDG. These graphs show the prediction variance as a function of distance from the centroids of both the process and mixture spaces, giving the experimenter a view of the entire combined mixture-process space. We will demonstrate the usefulness of these types of plots by evaluating various published and computer-generated designs for some mixture-process examples.
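
For readers who want the prediction-variance quantity behind these graphs spelled out, the scaled prediction variance (SPV) at a point x in the design region is the standard quantity (notation ours, not the authors'):

SPV(x) = N f(x)' (X'X)^{-1} f(x),

where f(x) is the model expansion of x (intercept, linear, interaction, and any quadratic terms), X is the model matrix of the N-run design, and N is the number of runs. A VDG plots the minimum, average, and maximum of SPV(x) over all points at a given distance from the centroid, as a function of that distance; the three-dimensional extension described above adds a second distance axis for the process-variable space.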

2. "Fraction of Design Space Graphs for Assessing Robustness of GLM Designs", Christine Anderson-Cook, Virginia Polytechnic Institute and State University.

Abstract: The Fraction of Design Space (FDS) graph (Zahran, Anderson-Cook and Myers (2003)) is a new design evaluation tool that has been used to compare competing designs in several Response Surface Methodology and Mixture experiment applications (Goldfarb, Anderson-Cook, Borror and Montgomery (2003)). It characterizes the fraction of the total design space at or below every scaled predicted variance value in a single curve, while allowing great flexibility to examine regular and irregular design spaces. A review of the FDS plot will be given, and a new application of the FDS plot is considered for Generalized Linear Models, to assess the robustness of designs to initial parameter estimates. Examples will be used to illustrate the new method.
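
In the same notation, the FDS curve summarizes SPV over the entire region rather than shell by shell. A hedged sketch of the definition:

FDS(v) = Vol{ x in R : SPV(x) <= v } / Vol(R),

so FDS(v) is the fraction of the design region R whose scaled prediction variance does not exceed v; a design whose curve climbs to 1 at small v predicts uniformly well over R. In practice the curve is estimated by sampling points uniformly in R and computing the empirical fraction below each v.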

3. "Using a Genetic Algorithm to Generate Small Exact Response Surface Designs", John Borkowski, Montana State University.

Abstract: A genetic algorithm (GA) is an evolutionary search strategy based on simplified rules of biological population genetics and theories of evolution. A GA maintains a population of candidate solutions for a problem, and then selects those candidates most fit to solve the problem. The most fit candidates are combined and/or altered by reproduction operators to produce new solutions for the next generation. The process continues, with each generation evolving more fit solutions until an acceptable solution has evolved. In this research, a GA is developed to generate near-optimal D, A, G, and IV exact N-point response surface designs in the hypercube. A catalog of designs for 1, 2, and 3 design factors has been generated. Efficiencies for classical response surface designs are calculated relative to exact optimal designs of the same design size.
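
As an illustration of the general GA recipe described in the abstract (a sketch in Python, not the author's implementation), the fragment below searches for a D-optimal 6-point design in the square [-1, 1]^2 for a full quadratic model; the population size, mutation scale, and number of generations are arbitrary choices for the example.

import numpy as np

rng = np.random.default_rng(0)
k, N, pop_size, gens = 2, 6, 40, 200          # factors, design points, GA settings (illustrative)

def model_matrix(design):
    # Full second-order model in two factors: intercept, linear, interaction, quadratic terms.
    x1, x2 = design[:, 0], design[:, 1]
    return np.column_stack([np.ones(len(design)), x1, x2, x1 * x2, x1**2, x2**2])

def d_criterion(design):
    # log det(X'X); larger is better (D-optimality).
    X = model_matrix(design)
    sign, logdet = np.linalg.slogdet(X.T @ X)
    return logdet if sign > 0 else -np.inf

population = [rng.uniform(-1, 1, size=(N, k)) for _ in range(pop_size)]
for _ in range(gens):
    scores = np.array([d_criterion(d) for d in population])
    parents = [population[i] for i in np.argsort(scores)[-pop_size // 2:]]   # keep the fittest half
    children = []
    while len(parents) + len(children) < pop_size:
        a, b = rng.choice(len(parents), size=2, replace=False)
        mask = rng.random((N, 1)) < 0.5                    # uniform crossover on design rows
        child = np.where(mask, parents[a], parents[b])
        child += rng.normal(0.0, 0.05, size=child.shape)   # mutation
        children.append(np.clip(child, -1.0, 1.0))
    population = parents + children

best = max(population, key=d_criterion)
print(np.round(best, 3))                                   # near-optimal 6-point design

The same skeleton generalizes to A-, G-, and IV-criteria by swapping the fitness function.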


2. Use of Bayesian Methods for Metrology at NIST

Session Organizer: William Guthrie, NIST
Session Chair: William Guthrie, NIST
1. "Bayesian Estimate of the Uncertainty of a Measurand Recorded with Finite Resolution", Blaza Toman, Statistical Engineering Division, NIST

Abstract: There are often cases when a measuring device records data in a digitized form, that is, a data point will be recorded as k whenever the quantity being measured has a value between k-c and k+c for some finite resolution c. An estimate of the "uncertainty" in the quantity being measured is generally required. In the metrological literature, there are two main competing recommended solutions. One is a general solution according to the "Guide to the Expression of Uncertainty in Measurement", the other is a more specific solution written for dimensional and geometric measurements and published as ISO 14253-2. These two solutions lead to different answers in most cases. This talk will address this subject from a Bayesian perspective, giving an alternate solution, which clarifies the situation and sheds some light on the appropriateness of the two classical solutions.
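
To make the setup concrete, here is the likelihood under an assumed normal measurement model (a sketch in our notation; the talk need not adopt exactly this model). If the digitized readings are k_1, ..., k_n, the resolution half-width is c, and the underlying measurements are N(mu, sigma^2), then

L(mu, sigma) = prod_{i=1}^{n} [ Phi((k_i + c - mu)/sigma) - Phi((k_i - c - mu)/sigma) ],

where Phi is the standard normal distribution function. A Bayesian analysis combines this interval-censored likelihood with a prior on (mu, sigma), and the reported uncertainty is the posterior standard deviation of the measurand mu.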

2. "MCMC in StRD", Hung-kung Liu, Will Guthrie, and Grace Yang
Statistical Engineering Division, NIST

Abstract: In the Statistical Reference Datasets (StRD) project, NIST provided datasets on the web (http://www.itl.nist.gov/div898/strd/index.html) with certified values for assessing the accuracy of software for univariate statistics, linear regression, nonlinear regression, and analysis of variance. A new area in statistical computing is Bayesian analysis using Markov chain Monte Carlo (MCMC). Despite its importance, the numerical accuracy of the software for MCMC is largely unknown. In this talk, we discuss our recent addition of MCMC to the StRD project.

3. "Parameter Design for Measurement Protocols by Latent Variable Methods", Walter Liggett, Statistical Engineering Division, NIST

Abstract: We present an approach to measurement system parameter design that does not require that the values of the experimental units be known. The approach does require that experimental units be grouped in classes, a necessity when protocol execution alters the unit. A consequence of these classes is that the approach admits replication. This paper presents maximum likelihood estimates with a comparison to similar estimates in factor analysis, strategies for noise factors including those connected with secondary properties of the experimental units, and Bayesian inference on experimental contrasts through Markov chain Monte Carlo. The approach is illustrated by solderability measurements made with a wetting balance.

3. Statistical Modeling

Session Organizer: Emmanuel Yashchin, IBM Research
Session Chair: Emmanuel Yashchin, IBM Research
1. "Bayesian Modeling of Dynamic Software Reliability Growth Curve Models", Bonnie Ray, IBM Research

Abstract: We present an extended reliability growth curve model that allows model parameters to vary as a function of covariate information. In the software reliability framework, these may include such things as the number of lines of code for a product throughout the development cycle, or the number of customer licenses sold over the field life of a product. We then describe a Bayesian method for estimating the extended model. The use of Bayesian methods allows the incorporation of historical information and expert opinion into the modeling strategy, in the form of prior distributions on the parameters. The use of power priors to combine data-based and qualitative prior information is also introduced in the growth curve modeling context. Markov chain Monte Carlo methods are employed to obtain the complete posterior distributions of the model parameters, as well as other values of interest, such as the expected time to discover a specified percentage of total defects for a product. Examples using both simulated and real data are provided.
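
One plausible concrete form of such an extended model (our illustration; the exact specification in the talk may differ) starts from a nonhomogeneous Poisson growth curve and lets a parameter depend on the covariates through a log link:

N(t) ~ Poisson( m(t) ),   m(t) = a_t (1 - e^{-b t}),   log a_t = alpha_0 + alpha_1 z_t,

where N(t) is the cumulative number of defects found by time t and z_t is a covariate such as lines of code or licenses sold. Priors on (alpha_0, alpha_1, b), possibly power priors that down-weight historical data, complete the Bayesian specification, and MCMC yields the joint posterior and derived quantities such as the time to reach a given fraction of total defects.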

2. "Wafer Yield Modeling", Asya Takken, IBM Microelectronics, Mary Wisniewski and Emmanuel Yashchin, IBM Research

Abstract: This paper discusses the problem of yield modeling in multi-layer structures. Defects observed at various layers are categorized by type, size and other characteristics. At the end of the process, the structure is viewed as a collection of chips that are subject individually to a final test. Given the above information about individual defects and final test results, we address a number of statistical issues, such as (a) estimation of defect rates based on partial data, (b) estimation, for every defect, of the probability that it turns out to be the "chip-killer", and (c) final yield estimation. A number of examples are given to illustrate the proposed approach.

3. "Multivariate system monitoring using nonlinear structural equation analysis", Yasuo Amemiya, IBM Research


Abstract: Performance and reliability monitoring of a complex system in field operation is considered. A typical modern system, such as a storage server, has a hierarchical structure with subsystems and components, and can produce a large number of measurements related to performance and degradation. Longitudinal structural equation modeling can be useful in representing and exploring the multivariate data in a way consistent with the physical structure. Since such measurements are often in the form of summary counts or rates, nonlinear models in the multi-response generalized linear model framework are appropriate. Model fitting and inference procedures are given. The approach is motivated and illustrated using a problem of monitoring storage systems.


4. Advances in Statistical Process Control 1

Session Organizer: Zachary Stoumbos, Rutgers University
Session Chair: Zachary Stoumbos, Rutgers University
1. "The Effects of Process Variation on Multivariate Control Procedures", Robert L. Mason, Southwest Research Institute, Texas, You-Min Chou, The University of Texas at San Antonio, and John C. Young, McNeese State University, Lake Charles, LA

Abstract: An important feature of a multivariate process is the type of variation that it contains. Two general categories are stationary and non-stationary variation. This presentation discusses the effects of these two types of process variation on three multivariate statistical process control procedures. The methods include the multivariate exponentially weighted moving average (MEWMA), the multivariate cumulative sum (MCUSUM) and Hotelling's T2. Of particular interest is the effect of uncontrolled components of variation on these multivariate procedures, as these can impose on the process data characteristics such as ramp changes, mean shifts, variable drift, and autocorrelation. Comparisons of the performance of the above three multivariate procedures in the presence of these varying types of variation are presented. A key finding is that the type of process variation can have a significant influence on the performance of the control procedures, especially in the presence of non-stationary variation.

2. "Comparison of Individual Multivariate Control Charts - A Hybrid Approach for Process Monitoring and Diagnosis", Wei Jiang, INSIGHT, AT&T Labs and Kwok-Leung Tsui, Georgia Institute of Technology, Atlanta, GA Paper

Abstract: Hotelling's T2 chart is one of the most popular multivariate control charts; it is based on the generalized likelihood ratio test (GLRT) and assumes no prior information about potential mean deviations. Accordingly, variable diagnosis from a signaled T2 chart has received considerable attention in recent research. By highlighting the intrinsic relationship between multivariate control charts and statistical hypothesis testing, this paper compares the efficiencies of three widely used control charts: the T2 chart, the regression-adjusted chart, and the M chart under different correlation structures and mean deviations. A hybrid approach is proposed based on the GLRT and the union-intersection test (UIT). The proposal serves as a complementary tool to the T2 chart and is shown to be effective in detecting special mean deviation structures. The proposed chart is further simplified in an adaptive version which extracts shift information from the realized data to construct the test statistic.
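
For reference, all of the charts being compared monitor functions of a p-variate observation x_t with in-control mean mu_0 and covariance matrix Sigma; the Hotelling statistic, for example, is

T^2_t = (x_t - mu_0)' Sigma^{-1} (x_t - mu_0),

with a signal when T^2_t exceeds a chi-square-based limit, while the regression-adjusted and M charts monitor transformed components of x_t chosen to be more sensitive to particular mean-deviation structures (a simplified description, not the paper's exact definitions).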


3. "The Mean Changepoint Formulation for SPC", Douglas M. Hawkins, University of Minnesota, Minneapolis, MN, and Peihua Qiu, University of Minnesota, Minneapolis, MN Paper

Abstract: Statistical process control depends on various charting approaches (or equivalents implemented without the graphics) for detecting departures from an in-control state. One common model for the in-control state is that process readings appear to be independent realizations of some statistical distribution with known parameters. The most common example is a normal distribution with known mean and variance.
In reality, even if there is reason to believe the normality part of this model, the parameters are seldom known with high precision. More commonly the parameter values are estimates obtained from a Phase I sample of measurements taken while the process appeared to be running smoothly.
Over the last 20 years, it has become clear that the random variability in estimates, even those based on quite large Phase I samples, leads to substantial uncertainty about the run length behavior that will result from a chart based on any particular Phase I data set.
An alternative approach, which has been used extensively in Phase I settings, is the formulation of a changepoint with unknown parameters both before and after any suspected shift. The talk sets out an adaptation of the unknown-parameter changepoint formulation for Phase II, where it must be applied repeatedly. This creates some computational issues, which turn out to be much less daunting than they seem, and some deeper issues of control of run behavior.
We conclude that the resulting method has the potential to supplant both Shewhart and cusum or EWMA charts since its performance is highly robust.
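
A minimal sketch of the unknown-parameter changepoint statistic for normal data (notation ours): after x_1, ..., x_n have been observed in Phase II, compute for each candidate split tau

T_{tau,n} = sqrt( tau (n - tau) / n ) * ( xbar_{1:tau} - xbar_{tau+1:n} ) / s_{tau,n},

where s_{tau,n} pools the within-segment standard deviations, and signal a mean change when max_tau |T_{tau,n}| exceeds a limit h_n chosen to control in-control run-length behavior; the maximization is repeated each time a new observation arrives, which is the computational issue the talk addresses.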


5. Statistical Methods for Software Engineering

Session Organizer: Bonnie Ray, IBM Research Division
Session Chair: Bonnie Ray, IBM Research Division

1. "Some Successful Approaches to Software Reliability Modeling in Industry", Daniel R. Jeske and Xuemei Zhang, Bell Laboratories, Lucent Technologies Paper

Abstract: Over the past three years, we have been actively engaged in both software reliability growth modeling and architecture-based software reliability modeling at Lucent Technologies. Our goal has been to include software into the overall reliability evaluation of a product design using either or both of these two fundamentally different approaches.

During the course of our application efforts to real projects, we have identified practical difficulties with each approach. The application of software reliability growth models is plagued by ad-hoc test environments, and the use of architecture-based software reliability models is plagued by a large number of unknown parameters. In this paper, we discuss our methods for overcoming these difficulties. For the first approach, we show how calibration factors can be defined and used to adjust for the mismatch between the test and operational profiles of the software. For the second approach, we present two useful ways to do sensitivity analyses that help alleviate the problem of too many uncertainties. We illustrate our methods with case studies, and offer comments on further work that is required to more satisfactorily bridge the gap between theory and applications in this research area.


2. "Analyzing the Impact of the Development Activity Allocation on Software Team Productivity", I. Robert Chiang and Lynn Kuo, University of Connecticut

Abstract: The difficulty in achieving high productivity in software development when following sequential process models such as Waterfall suggests that overlapping tasks (design, coding, and testing) may become desirable. In this paper, we analyze NASA Software Engineering Laboratory project data with time series analysis and functional data analysis. Specifically, we study the effort distribution among design, coding, and testing activities as well as the degree of overlapping among them. We investigate their impact on software productivity.

3. "Hierarchical models for software testing and reliability modeling", Todd Graves, Los Alamos National Laboratory

6. Statistics in Setting Standards, Dealing with Shocks, and Text Mining

Session Organizer: Regina Liu, Rutgers University
Session Chair: Regina Liu, Rutgers University

1. "Key Comparisons for International Standards", David Banks, U.S. Food and Drug Administration

Abstract: Efficient world trade requires that manufacturers in one country have confidence that their product will meet specifications that are verified by purchasers in another country. But these trading partners rely upon different national metrology laboratories to calibrate their equipment and there is detectable divergence between their measurement systems. This paper describes statistical methods that enable one to use data from a network of artifact comparisons to estimate the measurement functions at each participating laboratory, at both the national and lower levels. These methods cannot determine which laboratory is most accurate, but do allow one to predict the value that a given laboratory would obtain on a product from the value measured at another laboratory in the comparison network.

2. "Shock Models", Juerg Huesler, University of Bern, Switzerland

Abstract: In shock models it is generally assumed that the failure (of the system) is caused by a cumulative effect of a (large) number of non-fatal shocks or by a shock that exceeds a certain critical level. We present work on both types but with variations of the scheme. Specifically, we consider the case that non-fatal shocks can partly harm the system by weakening its critical load. For the cumulative model, we consider the case in which the system does 'recover' from most of the non-fatal shocks after a certain period of time. We also address various statistical issues which arise from applications of such a model.
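
In symbols, the two classical schemes that these variations build on are (a hedged summary, not the speaker's exact formulation): with shocks of magnitudes Y_1, Y_2, ... arriving at times t_1, t_2, ... and a critical load z,

extreme-shock model:    T = min{ t_i : Y_i > z },
cumulative-shock model: T = min{ t : sum over {i : t_i <= t} of Y_i > z }.

The variations discussed let non-fatal shocks lower z (weakening of the critical load) or let accumulated damage decay between shocks (recovery).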

3. "Text Mining of Massive Report Data and Risk Management", Dan Jeske, Lucent Technology & Regina Liu, Rutgers University

Abstract: We develop a systematic data mining procedure for exploring large free-style text databases by automatic means, with the purpose of discovering useful features and constructing tracking statistics for measuring performance or risk. The procedure includes text analysis, risk analysis, and nonparametric inference. We use some aviation safety report repositories from the FAA and the NTSB to demonstrate our procedure in the setting of general risk management and decision-support systems. Some specific text analysis methodologies and tracking statistics are discussed.


7. Operating window experiments and extensions

Session Organizer: C. Jeff Wu, University of Michigan
Session Chair: John A. Cornell, University of Florida

1. "Operating Window - An Engineering Measure for Robustness",

Don Clausing, MIT.

2. "Failure amplication method (FAMe): an extension of the operating window method", Roshan Joseph, Georgia Tech.

3. Discussants: W. Notz, Ohio State University, Dan Frey, MIT

8. Examples of Adapting Academic Reliability Theory to Messy Industrial Problems
Session Organizer: Michael Tortorella, Rutgers University

Session Chair: Michael Tortorella
1. "Product or Process?", David A. Hoeflin (AT&T Labs)

Abstract: A well-known and widely used software reliability model, the Goel-Okumoto model [1,2,3], was derived from a software development process point of view. As such, the parameters of this model are parameters of the development process. However, this process model is commonly used to make inferences about a specific output product of the development process. As such, this model's parameters are not specifically for the product under test. We derive the appropriate product (conditional) model and methods, e.g., MLE normal equations, for use when examining a specific product. This product model corrects shortcomings that occur when using the process model in making inferences about the product under test, as noted in [4]. The difference between the two models is clearly shown in the time-limiting behavior and in the confidence regions for the models' parameters. Notably, as testing time increases to infinity, the variance in the number of defects in the product goes to zero for the product model, as one would naturally expect. As this model uses assumptions similar to those of the Goel-Okumoto model and in some ways is simpler, the product (conditional) model can be readily applied in place of the process (unconditional) model.

1. Jelinski, Z., and P. B. Moranda, "Software Reliability Research," in Statistical Computer Performance Evaluation, ed. Freiberger, NY, Academic Press, pp. 465-497.
2. Goel, A. L., and K. Okumoto, "Time-Dependent Error Detection Rate Model for Software Reliability and Other Performance Measures," IEEE Transactions on Reliability, 28, pp. 206-211.
3. Musa, J., "A Theory of Software Reliability and its Application," IEEE Transactions on Software Engineering, SE-1(3), pp. 312-327.
4. Jeske, D. R. and H. Pham (2001), "On the Maximum Likelihood Estimates for the Goel-Okumoto Software Reliability Model", The American Statistician, 55 (3), 219-222.
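
To make the process-versus-product contrast in the abstract above concrete (a standard summary of the Goel-Okumoto setup, not the speaker's derivation): the process model treats the cumulative number of defects found by test time t as

N(t) ~ Poisson( m(t) ),   m(t) = a (1 - e^{-b t}),

so Var[N(t)] tends to a as t grows. The product (conditional) model conditions on the actual number of defects N in the unit under test, which gives N(t) | N ~ Binomial(N, 1 - e^{-b t}); the variance of the number of remaining defects is then N e^{-b t}(1 - e^{-b t}), which goes to zero as testing continues, matching the limiting behavior noted in the abstract.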

2. "Practical experiment design for protecting against masked failure mechanisms in accelerated testing", Michael LuValle (OFS Laboratories)

Abstract: When dealing with a new technology, the biggest fear of a reliability engineer is that there is a "masked" failure mechanism. That is, that observed failures result from a mechanism that is easily accelerable, and the failures from the easily accelerable mechanism censor or "mask" the failures from a less accelerable mechanism. We've observed such masking in the degradation of optical components in specially doped glasses. In the IC community, stress voiding comes to mind (a failure mechanism that masked itself). Some simple theory and tools for attacking this problem are described in this talk.

3. "System Reliability Redundancy Allocation with Risk-Averse Decision Makers", David W. Coit (Rutgers)

Abstract: Problem formulations are presented when there are multiple design objectives: (1) to maximize an estimate of system reliability, and (2) to minimize the variance of the reliability estimate. This is a sound and risk-averse formulation that reflects the actual needs of system designers and users. Designs with the highest estimate of system reliability may actually be a poor choice for risk-averse applications because of an associated high variance. More realistic approaches to the problem are presented to incorporate the risk associated with the reliability estimate. System design optimization then becomes a multi-criterion optimization problem. Instead of a unique optimal solution, a Pareto optimal set of solutions is obtained to consider as a final preferred design. An example is solved using nonlinear integer programming methods. The results indicate that very different designs are obtained when the formulation is changed to incorporate minimization of reliability estimation variance as a design objective.


9. Statistics in the Semiconductor Industry

Session Organizer: Asya Takken, IBM Microelectronics Division
Session Chair: Asya Takken, IBM Microelectronics Division
1. "Case Studies of Batch Processing Experiments", Diane Michelson, Sematech Paper

Abstract: Experimentation in the semiconductor industry requires clever design and clever analysis. In this paper, we look at two recent experiments performed at ISMT. The first is a strip plot design of 3 factors over 3 process steps. The second is a split plot design at a polish operation. The importance of using the correct error terms in testing the model will be discussed.

2. "Using Supersaturated Experiment Designs for Factor Screening and Robustness Analysis in the Design of a Semiconductor Clock Distribution Circuit", Duangporn Jearkpaporn, Arizona State University, Steven A. Eastman, Intel Corporation, Gerardo Gonzalez-Altamirano, Intel Corporation, Don R. Holcomb, Honeywell Engines and Systems, Alejandro Heredia-Langner, Pacific Northwest National Laboratory, Connie M. Borror, Arizona State University, Douglas C. Montgomery, Arizona State University

Abstract: Electrical signals traveling through complex semiconductor microprocessor circuits need to be synchronized to minimize idle time and optimize process performance. The global clock distribution topology and its physical characteristics are utilized to ensure that delays in the transmission of clock signals are kept to a minimum. Estimation of clock skew, the delay between clock signals that are intended to be synchronized, is often done by computer simulation of the circuit of interest. This is not an easy task: the number of factors involved is too large to be studied exhaustively, the simulation time can be quite long, and the number of simulations required would be too large to be practical. Traditional experiment designs and analysis of variance models cannot always be utilized successfully to solve this problem because they tend to be too simplistic or too resource consumptive for all but the smallest circuit designs. This paper demonstrates the relative efficiency of using various sizes of supersaturated experiment designs (SSDs) to gain an understanding of circuit performance. Methods of analyzing SSD experiment data are compared to determine which method is more reliable at determining factors that affect circuit performance. The SSD data can also be used to identify which circuit performance measures are least robust to variation in the experiment factors.

3. "Spatial Mixture Models and Detecting Clusters with Application to IC Fabrication", Vijay Nair and Lian Xu, University of Michigan / Bristol-Myers and Squibbs.

Abstract: It is well-known that wafer map data in IC fabrication exhibit a considerable degree of spatial clustering. The spatial patterns of defects can be usefully exploited for process improvement. In this talk, we consider methods for modeling and detecting spatial patterns from data on lattices. The data are viewed as being generated from spatial mixture models, and a Bayesian hierarchical approach with priors given by suitable Markov random field models is described. The ICM algorithm is used to recover the defect-clustering patterns. Gibbs sampling is used to compute posterior distributions of parameters of interest. The results are illustrated on some real data sets.

10. Advances in Statistical Process Control 2

Session Organizer: Zachary Stoumbos, Rutgers University
Session Chair: Zachary Stoumbos, Rutgers University
1. "A General Approach to the Monitoring of Process and Product Profiles", William Woodall, Virginia Polytechnic Institute and State University, Blacksburg, VA, and Douglas M. Montgomery, Arizona State University, Tempe, AZ Paper

Abstract: In many practical situations, the quality of a process or product is characterized by a relationship between two variables instead of by the distribution of a univariate quality characteristic or by the distribution of a vector consisting of several quality characteristics. Thus, each observation consists of a collection of data points that can be represented by a curve (or profile). In some calibration applications, the profile is simply a straight line while in other applications the relationship is much more complicated. In this presentation, we discuss some of the general issues involved in monitoring process and product profiles. We relate this application to functional data analysis and review applications involving linear profiles, nonlinear profiles, and the use of splines and wavelets.
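
As a simple instance of the setting described above (our example, not the presenters'): for a linear calibration profile, sample i yields responses

y_{ij} = A_i + B_i x_j + eps_{ij},   j = 1, ..., k,

and monitoring proceeds by charting the estimated coefficient pair (A_i, B_i) across samples, for example with a T^2 chart or with separate EWMA charts for intercept and slope; nonlinear profiles replace the straight line with a nonlinear, spline, or wavelet representation and monitor its fitted coefficients instead.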

2. "An Approach to Detection of Changes in Attribute Data", Emmanuel Yashchin, IBM Research Paper

Abstract: We consider situations where the observed data is in the form of a contingency table and the underlying table parameters are subject to abrupt changes of unpredictable magnitude at some unknown points in time. We derive change-point detection schemes based on the Regenerative Likelihood Ratio (RLR) approach and develop procedures for their design and analysis. We also discuss problems related to estimation of the contingency table characteristics in the presence of abrupt changes and give a number of examples related to the semiconductor industry.


3. TBD


11. Industrial Applications of Six Sigma at GE

Session Organizer: Christopher Stanard, GE Global Research
Session Chair: Christopher Stanard, GE Global Research
1. "Designing Six Sigma Assemblies", Narendra Soman, GE Global Research

Abstract: Prediction and control of assembly alignments & clearances are critical to product performance and fit-up. We will introduce QPATS - Quantitative Producibility Analysis Tool Set - developed in GE to store and retrieve statistical models of manufacturing process capability. We present a case study using QPATS in combination with Monte Carlo analysis to simulate the assembly of an aircraft engine module. The assembly model contains over 50 parts & subassemblies, with more than 300 critical assembly features, and also models the real-life assembly sequence. 3-D Monte Carlo assembly simulation is used to "build" thousands of "virtual" aircraft engines, and predict the statistical variation of the clearances. The 3-D analysis provides better accuracy than traditional 1-D tolerance "stack-up" analysis, and allows designers to create multiple what-if scenarios to study the impact of tolerances and datum schemes. Linking the analysis to QPATS helps ensure that design tolerances are consistent with manufacturing capability.

2. "Incremental performance improvement during development of diagnostic algorithms", Kai Gobel, GE Global Research

Abstract: We examine the design of classifier fusion algorithm development where both the classifiers and the fusion algorithm are unknown. To properly guide algorithm development, it is desirable to evaluate algorithm performance throughout its design. Because the design space may be of considerable complexity, exhaustive simulation of the entire design space may not be practical. Another issue is the unknown classifier output distribution. Using the wrong distribution may lead to fusion tools that are overly optimistic or that otherwise distort the outcome. Either case may lead to a fuser that performs sub-optimally in practice. We study an algorithm development that is guided by a two-stroke design of experiment setup and evaluate different classifier distributions on the fused outcome. We show results from an application of diagnostic classifier fusion algorithm development for diagnosing gas path faults on aircraft engines.

3. "Development of Laser Welding of Electrodes for Ceramic Metal Halide Arc Tubes", Marshall Jones, GE Global Research


Abstract: Ceramic Metal Halide (CMH) Lamps have attractive properties that proved useful in a number of applications. The properties of the lamp, the quality of the light produced, and reliability and durability of the product are all impacted by the process of manufacture. The DFSS and Six Sigma approach was used in developing a new laser welding process for CMH Arc Tubes which has resulted in an overall improved product.


12. Industrial Statistics at Bell Labs Today

Session Organizer: Diane Lambert, Bell Labs
Session Chair: Diane Lambert, Bell Labs

1. "Statistical Technology Transfer through Software Components", David James, Duncan Temple Lang and Scott Vander Wiel, Bell-Labs, Lucent Technologies

Abstract: For more than 30 years, the S language (in its various implementations) has been the main vehicle for statistical technology transfer from Bell Labs to the various engineering and business communities in the corporation. In the last few years, however, corporate end-user computing has become almost exclusively Microsoft-based thus shifting users' preferences towards Excel and Excel-like tools for data analysis. Transferring new statistical methodology through S to such users has become increasingly difficult.

In this paper we describe an early implementation of software components for distributed statistical computing implemented in the S language. These software components provide a transparent ActiveX (COM) interface between traditional S objects and methods and Microsoft's ActiveX clients, such as Excel, Access, and corporate applications such as Business Objects (www.businessObjects.com), Oracle, etc.

As an example, we describe how R (www.r-project.org) is being used through an ActiveX interface in a manufacturing setting to orchestrate data extraction from corporate databases, data visualization, modeling, and final reporting into Excel workbooks.

2. "Setting Specifications and Adjustment Rules in Multistage Production", Suresh Goyal and Scott Vander Wiel, Bell-Labs, Lucent Technologies

Abstract: Optical-electronic circuit packs are assembled and tested in multiple stages. Test engineers set adjustment rules and test specifications for the initial production stages with the goal of meeting specs set by design engineers at the final stage of testing. These adjustment rules and initial-stage specs affect yield at the final stage. I will present a case where variation in optical power at the final inspection was too high and needed to be reduced. We formulated a random effects model that accounts for several sources of power variation and fit the model to six months of historical production data. The analysis decomposes variation into components including measurement error, adjustment error, test-set calibration and pack-to-pack differences. Follow-up experiments verified an unexpectedly large estimate of measurement error. The fitted model was used to formulate a better measurement procedure at one production stage and better adjustment rules and specs at another stage so as to improve yield at the final stage of testing.

3. "Reliability Estimation From Accelerated Degradation and Failure Data", Diane Lambert, Chuanhai Liu, and Scott Vander Wiel, Bell-Labs, Lucent Technologies

Abstract: Leading edge components with high reliability requirements may have to be deployed after limited testing. Their reliability is usually estimated by exposing components to high stress conditions and modeling failure rate as a function of stress. But the standard failure rate estimates can be misleading if few components fail, uncertainty in the parameters of the stress model is ignored, or component variability across manufacturing lots is high. This talk describes a model for estimating reliability that accommodates these sources of uncertainty when degradation measurements as well as failure counts can be obtained. An application to laser diodes is given.

13. Bayesian Reliability

Session Organizer: Ramon Leon, University of Tennessee
Session Chair: Ramon Leon, University of Tennessee
1. "Bayesian Reliability Testing for New Generation Semiconductor Processing Equipment", Michael Pore and Paul Tobias (Sematech, Ret.) Paper

Abstract: New multi-million dollar semiconductor processing tools need to be developed and deployed rapidly to match the aggressive industry roadmap for chip technologies and manufacturing productivity. In particular, a completely new set of tools was recently introduced for use in factories designed to handle 300 mm wafers (the previous largest wafer size was 200 mm). International SEMATECH (ISMT) was asked by its member companies to work with vendors to qualify many of these new tools and provide assurance that they could meet factory reliability (i.e. Mean Time Between Failure) goals. Since time was short, and many of these tools existed only as single prototype models, Bayesian reliability testing techniques were chosen to provide the most information in the shortest time. This talk will review the theory used to plan and analyze Bayesian reliability tests and test data. The practical problems of making use of prior information of different types and quality levels when specifying the Bayesian Gamma priors will also be discussed, along with some of the solutions that were implemented by ISMT statisticians and engineers.
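
The core calculation behind such a test is the standard Gamma-exponential update, sketched here in our notation: with a constant failure rate lambda = 1/MTBF and a Gamma(a, b) prior (shape a, rate b),

lambda | data ~ Gamma( a + r, b + T )

after r failures are observed in total test time T, so the posterior for the MTBF, and the probability of meeting the factory MTBF goal, are available in closed form. The practical difficulty discussed in the talk is the choice of a and b to reflect prior information of varying type and quality.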

2. "Physical Science Case Studies in Bayesian Analysis Using WinBUGS",William Guthrie, Statistical Engineering Division, NIST

Abstract: High-level software for Markov Chain Monte Carlo (MCMC) Simulation has made Bayesian analysis increasingly accessible to applied statisticians and other scientists over the last several years. BUGS, Bayesian inference Using Gibbs Sampling, and its GUI-driven, Windows counterpart, WinBUGS, are two popular computer packages for this type of analysis available on the Web. Statisticians or other scientists who wish to explore the use of BUGS or WinBUGS for their applications are encouraged by the software's authors to follow the many examples included in the documentation to get started. However, of the thirty-seven case studies included in the documentation only three are physical science applications, making the examples less convenient for statisticians working in the physical sciences than for those in biostatistics. To help make these software packages as accessible as possible, several physical science case studies in Bayesian analysis using WinBUGS will be demonstrated in this talk.


3. "Bayesian Modeling of Accelerated Life Tests with Heterogeneous Test Units", Avery J. Ashby, Ramon V. Leon, Jayanth Thyagarajan, University of Tennessee, Knoxville Paper

Abstract: WinBUGS is a software program for Bayesian analysis of complex statistical models using Markov chain Monte Carlo techniques (MCMC). We show how the models supported by the program can be used to model data obtained from accelerated life tests where there are both random and fixed effects. We illustrate the approach by predicting life of Kevlar fiber based on an accelerated life test where in addition to the stress there is a random spool effect. The talk demonstrates that Bayesian modeling using MCMC can be used to fit more realistic models for accelerated life tests than those that have been traditionally considered.
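
One plausible hierarchical form for such a model (our sketch; the WinBUGS specification used in the talk may differ in detail): for specimen j wound from spool s(j) and tested at stress x_j,

log T_j = beta_0 + beta_1 log x_j + u_{s(j)} + sigma eps_j,   u_s ~ N(0, tau^2),

where u_s is the random spool effect and eps_j is a standard error term (normal for a lognormal life model, smallest-extreme-value for Weibull). Priors on (beta_0, beta_1, sigma, tau) complete the model, and MCMC yields predictive life distributions at use stress that propagate the spool-to-spool variation.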

14. Invited Tutorial Session: Algorithms for Efficient Statistical Data Mining

Session Organizer: Susan Albin, Rutgers University
Session Chair: Susan Albin, Rutgers University
Speaker: Andrew Moore, Carnegie Mellon University

Abstract: How can we exploit massive data without resorting to statistically dubious computational compromises? How can we make routine statistical analysis sufficiently autonomous that it can be used safely by non-statisticians who have to make very important decisions?

We believe that the importance of these questions for applied statistics is increasing rapidly. In this talk we will fall well short of answering them adequately, but we will show a number of examples where steps in the right direction are due to new ways of geometrically preprocessing the data and subsequently asking it questions.

The examples will include biological weapons surveillance, high throughput drug screening, cosmological matter distribution, sky survey anomaly detection, self-tuning engines and accomplice detection. We will then discuss the unifying algorithms which allowed these systems to be deployed rapidly and with relatively large autonomy. These involve a blend of geometric data structures such as Omohundro's Ball Trees, Callahan's Well-Separated Pairwise Decomposition, and Greengard's multipole in conjunction with new search algorithms based on "racing" samples of data, All-Dimensions Search over aggregates of records and a kind of higher-order divide and conquer over datasets and query sets.

BIO: Andrew Moore is the A. Nico Habermann associate professor of Robotics and Computer Science at the School of Computer Science, Carnegie Mellon University. His main research interests are reinforcement learning and data mining, especially data structures and algorithms to allow them to scale to large domains. The Auton Lab, co-directed by Andrew Moore and Jeff Schneider, works with Astrophysicists, Biologists, Marketing Groups, Bioinformaticists, Manufacturers and Chemical Engineers. He is funded partly from industry, and also thanks to research grants from the National Science Foundation, NASA, and more recently from the Defense Advanced Research Projects Agency to work on data mining for Biosurveillance and for helping intelligence analysts.

Andrew began his career writing video-games for an obscure British personal computer. He rapidly became a thousandaire and retired to academia, where he received a PhD from the University of Cambridge in 1991. He researched robot learning as a Post-doc working with Chris Atkeson, and then moved to a faculty position at Carnegie Mellon.

15. Sequential Methods

Session Organizer: Michael Baron, University of Texas, Dallas and IBM Research
Session Chair: Michael Baron
1. "On sequential determination of the number of computer simulations", Nitis Mukhopadhyay, University of Connecticut Paper

Abstract: The size of replication in many reported computer simulations is predominantly fixed in advance and quite often fairly arbitrarily for that matter. But, since computer simulations normally proceed in a sequential manner, it makes good sense to expect that the rich repertoire from the area of sequential analysis should play a central role in "deciding" the "appropriate" number of computer simulations that one must run in a given scenario. Obviously there cannot be one unique way of determining the number of computer simulations that will perhaps be satisfactory in every conceivable scenario. Any useful "optimal" determination must instead take into account a particular simulation problem on hand, how "accuracy" of a process is measured, how much accuracy a consumer demands in the final output derived via simulations, how much unit "error" would "cost", as well as a myriad of other practical considerations and concerns.
We focus on a specific statistical problem and partially walk through this maze. We explore how we may zero in on some "appropriate" number of computer simulations in order to achieve a pre-specified "quality" of the final output derived from associated simulations. We will highlight some plausible loss functions and explore a few very reasonable sequential strategies that would lead to approximate "optimal" determination of the number of simulations. The goal of this presentation is to emphasize how the theory and methodologies of sequential analysis can potentially contribute and indeed enrich significantly any field of investigation that may use computer simulations.


2. "Balanced Randomization Designs and Classical Probability Distributions", Andrew L. Rukhin, University of Maryland Paper

Abstract: The talk compares the properties of the two most commonly used balanced randomization schemes with several treatments. Such sequential schemes are common in clinical trials, load balancing in computer file storage, etc. According to the first scheme, the so-called truncated multinomial randomization design, the allocation process starts with the uniform distribution, until a treatment receives the prescribed number of subjects, after which this uniform distribution switches to the remaining treatments, and so on. The limiting joint distribution of the moments at which a treatment receives the given number of subjects is found. The second scheme, the random allocation rule, selects any one of the equally likely assignments of the given number of subjects per treatment. Its limiting behavior is shown to be quite different from that of the truncated multinomial design. Formulas for the accidental bias and for the selection bias of both procedures are derived, and the large sample distribution of standard permutation tests is obtained. The relationship to classical probability theory is discussed.

3. "Asymptotic Analysis of Bayesian Quickest Change Detection Procedures", Venugopal V. Veeravalli, University of Illinois at Urbana-Champaign, Alexander Tartakovsky, University of Southern California and Michael Baron, University of Texas, Dallas and IBM Research Paper

Abstract: The optimal detection procedure for detecting changes in sequences of independent and identically distributed (i.i.d.) observations in a Bayesian setting was derived by Shiryaev in the nineteen sixties. However, the analysis of the performance of this procedure in terms of the average detection delay and false alarm probability has been an open problem. We investigate the performance of Shiryaev's procedure in an asymptotic setting where the false alarm probability goes to zero. The asymptotic study is performed not only in the i.i.d. case, where Shiryaev's procedure is optimal, but also in general, non-i.i.d. cases and for continuous-time observations. In the latter cases, we show that Shiryaev's procedure is asymptotically optimal under mild conditions. The results of this study are shown to be especially important in studying the asymptotics of decentralized quickest change detection procedures for distributed sensor systems.
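
For reference, Shiryaev's procedure in the i.i.d. Bayesian setting stops at the first time the posterior probability that a change has already occurred exceeds a threshold:

tau = min{ n : P( changepoint <= n | X_1, ..., X_n ) >= 1 - alpha },

where alpha controls the false alarm probability; the contribution described above is the asymptotic (alpha tending to zero) analysis of the average detection delay of this rule in i.i.d., non-i.i.d., and continuous-time settings.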


16. Institute of Operations Research and the Management Sciences (INFORMS) Invited Session

Session Organizer: Susan Albin, Rutgers University
Session Chair: Susan Albin, Rutgers University
1. "Optimal Adjustment Of A Process Subject To Unknown Setup Errors Under Quadratic Off-Target And Fixed Adjustment Costs", Zilong Lian and Enrique Del Castillo, The Pennsylvania State University

Abstract: Consider a machine that can start production off-target, where the initial offset is unknown and unobservable. The goal is to determine the optimal series of machine adjustments {U_t} that minimize the expected value of the sum of quadratic off-target costs and fixed adjustment costs. Apart from the unknown initial offset, the process is assumed to be in a state of statistical control, and the underlying production process is assumed to be of a discrete-part nature. We show, using a dynamic programming formulation based on the Bayesian estimation of all unknown process parameters, how the optimal policy is of a deadband form where the width of the deadband is time-varying. Computational results and implementation are presented.
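
A deadband policy of the kind described has the following generic form (notation ours): with d_t denoting the Bayes estimate of the remaining offset after the first t parts,

U_t = -d_t  if |d_t| > c_t,   U_t = 0  otherwise,

so the machine is left alone while the estimated offset is small relative to the fixed adjustment cost, and the time-varying width c_t reflects how the posterior uncertainty about the offset shrinks as more parts are observed.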

2. "Process-oriented Tolerancing for Quality Improvement in Multi-station Assembly Systems", Dariusz Ceglarek, Wenzhen Huang, University of Wisconsin - Madison

Abstract: In multi-station manufacturing systems, the quality of final products is significantly affected by both product design and process variables. However, tolerance research has historically focused primarily on allocating tolerances based on the product design characteristics of each component. Currently, there are no analytical approaches to optimally allocate tolerances that integrate product and process variables in multi-station manufacturing processes at minimum cost. The concept of process-oriented tolerancing expands current tolerancing practices, which bound errors related to product variables, to explicitly include process variables. The resulting methodology extends the concept of "part interchangeability" into "process interchangeability," which is critical given increasing requirements related to supplier selection and benchmarking. The proposed methodology is based on the development and integration of three models: tolerance-variation relation, variation propagation, and process degradation. The tolerance-variation model is based on a pin-hole fixture mechanism in multi-station assembly processes. The variation propagation model utilizes a state space representation but uses a station index instead of a time index. Dynamic process effects such as tool wear are also incorporated into the framework of process-oriented tolerancing, which provides the capability to design tolerances for the whole life-cycle of a production system. Tolerances of process variables are optimally allocated through solving a nonlinear constrained optimization problem. An industry case study is used to illustrate the proposed approach.

3. "Robust Optimization of Experimentally Derived Objective Functions", Susan Albin, Rutgers University and Di Xu, American Express Paper

Abstract: In the design or improvement of systems and processes, the objective function is often a performance response surface estimated from experiments. A standard approach is to identify the levels of the design variables that optimize the estimated model. However, if the estimated model varies from the true model due to random error in the experiment, the resulting solution may be quite far from optimal. We consider all points in the confidence intervals associated with the estimated model and construct a minimax deviation model to find a robust solution that is resistant to the error in the estimated empirical model. We prove a reduction theorem to reduce the optimization model to a tractable, finite, mathematical program. The proposed approach is applied to solve for a robust order policy in an inventory problem and is compared with the canonical approach using a Monte Carlo simulation.
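
The minimax deviation idea can be written compactly (a sketch in our notation, for a minimization objective): with fitted response model f(x; beta) and confidence region B for the unknown coefficients, choose

x* = argmin over x of  max over beta in B of  [ f(x; beta) - min over x' of f(x'; beta) ],

that is, the setting whose worst-case shortfall from the true optimum, over all coefficient vectors consistent with the experiment, is smallest; the reduction theorem mentioned above is what turns this nested problem into a finite mathematical program.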

17. Statistics in Corporate Finance

Session Organizer: Radu Neagu, GE Research
Session Chair: Radu Neagu, GE Research
1. "Application of Six Sigma to Corporate Finance", Roger Hoerl, GE Global Research Paper

Abstract: Numerous authors, such as Dusharme* (2003) have noted that the vast majority of Six Sigma applications have occurred in manufacturing and engineering. However, Six Sigma is a generic improvement methodology, and in theory should be applicable to improve any activity. This talk will discuss the application of Six Sigma to corporate finance. It is based on the author's experience as Quality Leader of the Corporate Audit Staff of GE, a division of corporate finance. A brief introduction to corporate finance will be provided. Next, some of the unique aspects of corporate finance, relative to manufacturing and engineering will be discussed. Real Six Sigma applications, and general strategies for applying Six Sigma in corporate finance will then be reviewed. The intent is to lay the groundwork for the remaining talks in this session, which focus on detailed case studies in finance.
*Dusharme, D., "Six Sigma Survey", Quality Digest, February 2003, 24-32.


2. "Fair Valuation of Employee Stock Options", Antonio Possolo and Brock Osborn, GE Global Research Paper

Abstract: Current and proposed accounting standards, both national and international, suggest that the value of employee stock options should be estimated and recorded as expense in corporate financial statements. Since these options are rather different from stock options that are traded on exchanges (for example, they are subject to vesting and forfeiture rules, may remain alive for as long as fifteen years, and cannot be traded), their valuation calls for non-standard methodology. In addition, the patterns of forfeiture and early exercise that are observed empirically also should play a role in their fair valuation. In this presentation, we review the approach that we have been developing at General Electric to value options awarded to employees, with emphasis on its components that involve probabilistic modeling and statistical data analysis.

3. "From Corporate Default Prediction to Market Efficiency: a case study orporate default prediction and market efficiency", Radu Neagu and Roger Hoerl, GE Global Research. Paper

Abstract: The progression of a corporation from a status of financial stability into the status of financial distress usually happens over relatively large periods of time, raising the opportunity of identifying these "falling" corporations ahead of time. Consequently, our goal is to provide portfolio managers with an early notice of deteriorating financial status for a given corporation so that business decisions can be taken to mitigate loss. There are different techniques for estimating the likelihood that a corporation will go into financial default. We consider a model built using equity inferred probability of default (PD) metrics (Merton* 1996) where we construct a 2-dimensional risk space for estimating the likelihood that a company will experience financial default in the near future. For those companies whose financial risk seems to be improving over time in the 2-dimensional space, we analyze the effect of past company PD behavior on future PD status. In an efficient market, this past behavior would have no effect on future status (beyond the information conveyed by the current PD level). Our findings will, for this particular case, disprove the efficient market hypothesis. We work on a dataset formed of North American non-financial publicly traded companies and we use a publicly available definition of corporations in financial default. We conduct our study using CART analysis, logistic regression and Markov-Chain type transition probabilities analysis.
*R.C. Merton, Continuous-Time Finance; Blackwell Publishers, Revised Edition, 1996


18. Data Mining Applications

Session Organizer: Chid Apte, IBM Research Division
Session Chair: Chid Apte, IBM Research Division
1. "The Challenges in Improving Business Processes with Data Mining", Vasant Dhar (NYU Stern School of Business)

Abstract: Data Mining is a compelling proposition for business. There is enough evidence that the core technology for finding patterns in data works. Dozens of case studies across industries document the fact that sensible problem formulations coupled with analyses from real data are valuable to decision makers. However, the adoption of data mining within businesses is still relatively low. In this talk, I describe the hurdles that must be overcome in making data mining an inherent part of doing business. In particular, I describe the obstacles in each part of the data knowledge value chain, and the challenges in making data mining an inherent aspect of business processes.

2. "Algorithms for Efficient Statistical Data Mining", Andrew Moore, Carnegie Mellon University.

Abstract: How can we exploit massive data without resorting to statistically dubious computational compromises? How can we make routine statistical analysis sufficiently autonomous that it can be used safely by non-statisticians who have to make very important decisions? We believe that the importance of these questions for applied statistics is increasing rapidly. In this talk we will fall well short of answering them adequately, but we will show a number of examples where steps in the right direction are due to new ways of geometrically preprocessing the data and subsequently asking it questions.

The examples will include biological weapons surveillance, high throughput drug screening, cosmological matter distribution, sky survey anomaly detection, self-tuning engines and accomplice detection. We will then discuss the unifying algorithms which allowed these systems to be deployed rapidly and with relatively large autonomy. These involve a blend of geometric data structures such as Omohundro's Ball Trees, Callahan's Well-Separated Pairwise Decomposition, and Greengard's multipole in conjunction with new search algorithms based on "racing" samples of data, All-Dimensions Search over aggregates of records and a kind of higher-order divide and conquer over datasets and query sets.


3. "Case Studies of Machine Learning for Manufacturing and Help Desks", Sholom Weiss, IBM Research.


Abstract: We describe two applications covering the opposite extremes of structured and unstructured data. One application is for a fabrication process used in the manufacture of laptop displays. The induced decision rules can have a significant impact on display yield and manufacturing costs. The task can be described as a regression problem. With the appropriate transformation of raw data, a solution is readily obtained. The second application is for unstructured data: a document database of customers' descriptions of their problems with products and the vendor's descriptions of their resolution. These records are ill-formed, containing redundant and poorly organized documents. The objective is to try to automate the procedures for reducing a database to its essential components, much like FAQs (frequently asked questions) that can be offered as self-help to customers. Any solution will require more than the simple application of a known model to data prepared in a spreadsheet format. We review this self-help application to obtain insight into complex data mining tasks for unstructured data.

19. Supply Chain Analytics

Session Organizer: Grace Lin, Senior Manager, Supply Chain and e-business Optimization, IBM Research Division
Session Chair: Roger Tsai, IBM Research Division
1. "Sales and Operations Optimization for a Supply Chain", Roger Tsai, Young M. Lee, Markus Ettl, Feng Cheng, Tom Ervolina (IBM Research), John Konopka, Shannon Liu (IBM Integrated Supply Chain)

Abstract: Several IBM divisions make quarterly Sales and Operations Planning (SOP) decisions, updated each month. The supply quantity, or SOP decision, is based on many factors: the revenue target, the sales forecast, the profit margin, the life cycle of each computer, the commonality of computer components, and manufacturing flexibility. The SOP decision had typically been made using rules of thumb or heuristic methods. We have recently developed a Newsvendor model that computes an optimal SOP decision. The model is designed to optimize expected financial performance, and its performance was estimated using historical data and simulation of the stochastic nature of demand. We find that a variety of objective functions, corresponding to different definitions of profit, can all be cast into a common solution framework. Even supply flexibility can be handled within this framework with slight modification. A new concept, the process window, is introduced, and we demonstrate why it is beneficial to consider the process window in this setting. Our study indicates that substantial financial benefit can result from the optimization model.
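
As a rough illustration of the newsvendor idea that underlies this kind of supply optimization (the normal-demand assumption and the cost figures below are mine, not the authors' model), the optimal supply quantity balances the cost of under-supplying demand against the cost of over-supplying it.

```python
# Classical newsvendor quantity: supply Q* is the demand quantile at the
# critical ratio of underage cost to (underage + overage) cost.
from scipy.stats import norm

def newsvendor_quantity(mu, sigma, unit_margin, unit_overage_cost):
    """Optimal supply for normally distributed demand N(mu, sigma^2)."""
    critical_ratio = unit_margin / (unit_margin + unit_overage_cost)
    return norm.ppf(critical_ratio, loc=mu, scale=sigma)

# Example (illustrative numbers): forecast demand 10,000 units with sd 2,000,
# margin $120/unit, overage (write-down) cost $40/unit -> stock above the mean forecast.
print(round(newsvendor_quantity(10_000, 2_000, 120, 40)))
```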

2. "A Hybrid and Distributed Control Model for Supply Chain", Mohsen A. Jafari and Tayfur Altiok, Dept. of Industrial & Systems Engineering, Rutgers University

Abstract: One of the main challenges encountered by existing supply chain networks is lack of visibility over the network. As a result, companies make critical business decisions using estimates and projections based mainly on historical data, resulting in inefficiencies such as insufficient or overstocked inventories, improperly filled orders, late deliveries, and overly optimistic partner performance demands. At the same time, information transfer and collaboration between the enterprises within a supply chain are mostly data driven and lack direct, pro-active responsiveness. In a truly collaborative supply chain environment with some degree of visibility, one would expect that a change in demand forecast at one enterprise leads to changes in production, priorities, and even inventory thresholds at its suppliers in a pro-active manner. The impact of a true collaborative model will be immense in the presence of real-time information on logistics, distribution, and production, among others. Such a model, however, requires a different view of supply chain networks, where information flow is associated with appropriate events, and proper event propagation models with feedback mechanisms are defined and implemented. A similar concept has also recently been promoted by IBM and SAP AG. For instance, in IBM's Sense and Respond model, a control/monitoring layer with feedback is embedded between supply chain planning and execution. In the SAP AG model, event-based collaborations are defined and predictive measures are established for event propagation and pro-active control across a network of enterprises.

In this talk we treat a supply chain network as a distributed/hybrid system in which interactions between different nodes (enterprises) are event based, with events triggered by some threshold-based mechanism or governed by one or more continuous processes (e.g., inventory depletion or replenishment). We present our preliminary results, in which underlying business processes (which rarely change) are separated from business rules (which can change dynamically). Control layers are defined within each enterprise and between enterprises across the supply chain network. We also present a preliminary prototype that is under construction at Rutgers University.
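
The snippet below is a highly simplified sketch of the threshold-triggered event idea, entirely my own construction and not the Rutgers prototype: a continuous inventory-depletion process at one node crosses a reorder threshold, which raises an event that a supplier node reacts to under an easily changed business rule.

```python
# Toy hybrid model: continuous inventory depletion plus threshold-triggered
# events that propagate from a retailer node to a supplier node.
import random

class Supplier:
    def __init__(self):
        self.priority_queue = []

    def on_event(self, event):
        # Business rule (separable from the process): expedite the lowest-stock partners first.
        self.priority_queue.append(event)
        self.priority_queue.sort(key=lambda e: e["inventory"])

class Retailer:
    def __init__(self, supplier, reorder_point=20.0, inventory=100.0):
        self.supplier, self.reorder_point, self.inventory = supplier, reorder_point, inventory

    def step(self, demand):
        self.inventory -= demand                               # continuous process: depletion
        if self.inventory <= self.reorder_point:               # threshold-based trigger
            self.supplier.on_event({"type": "reorder", "inventory": self.inventory})
            self.inventory += 80.0                             # replenishment (instantaneous here)

supplier = Supplier()
retailer = Retailer(supplier)
random.seed(0)
for _ in range(50):
    retailer.step(demand=random.uniform(2, 8))
print(len(supplier.priority_queue), "reorder events propagated to the supplier")
```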

3. "Cross-enterprise Data Analysis: Methodologies, Challenges, Opportunities", James Ding, McMaster University Paper

Abstract: Recent developments in IT and globalization provide many research opportunities and challenges for statistical quality control in industry. Some interesting topics involve the integration of traditional statistical quality control methods with other mathematical tools for practical problems in this new environment. Important aspects include applications of data analysis under ERP systems, SCM, e-business, and the heterogeneous social and economic structures of emerging economies. I plan to address in detail the role of statistics in post-ERP solutions and the integration of statistics and system dynamics in enterprise decision making.


20. Wavelets in Statistical Process Control

Session Organizer: Galit Shmueli, University of Maryland
Session Chair: Galit Shmueli, University of Maryland
1. "Wavelets for Change Point Problem and Non-Stationary Time Series", Yazhin Wang, University of Connecticut

Abstract: Because of their localization property and their ability to decompose processes into different frequency components, wavelets have been successfully employed to study sudden structural changes in functions or signals and to model non-stationary time series. This talk will review recent developments and present some current work on locally self-similar processes.
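
As a toy illustration of how a wavelet detail coefficient localizes a sudden change (my own example, not the speaker's method; the undecimated Haar detail at one scale is computed directly with a moving filter rather than a wavelet library), the position of the largest coefficient estimates the change point.

```python
# Undecimated Haar detail at scale 16: difference of the means of the next 8
# and the previous 8 observations, evaluated at every time point. The filter
# output is large in magnitude only where the level of the series jumps.
import numpy as np

rng = np.random.default_rng(1)
x = np.concatenate([rng.normal(0, 1, 256), rng.normal(3, 1, 256)])   # mean shift at t = 256

half = 8
kernel = np.concatenate([np.full(half, 1.0 / half), np.full(half, -1.0 / half)])
detail = np.convolve(x, kernel, mode='valid')

est = np.argmax(np.abs(detail)) + half      # undo the filter's offset
print("estimated change point near", est)   # close to the true change at 256
```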

2. "Real-time Monitoring of Daily Sales Using Wavelets", Galit Shmueli, University of Maryland

Abstract: Daily grocery and pharmacy sales tend to form dependent and non-stationary series. We suggest a method for real-time monitoring of medication sales using a combination of wavelets and autoregressive models. We illustrate the method by applying it to real sales data from a large grocery chain.
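
A simplified sketch of such a monitoring scheme is given below; the simulated data, the db4 wavelet, the choice of decomposition level, and the AR(1) residual model are my assumptions for illustration, not details taken from the paper.

```python
# Combine a wavelet trend estimate with an autoregressive residual model:
# remove the slowly varying component, fit AR(1) to what remains, and flag
# days whose one-step-ahead forecast error is unusually large.
import numpy as np
import pywt  # PyWavelets, assumed installed

rng = np.random.default_rng(2)
days = np.arange(365)
sales = 100 + 0.15 * days + rng.normal(0, 5, 365)   # slow growth plus noise
sales[300] += 40                                     # injected outbreak-like spike

# Wavelet trend: keep only the coarse approximation, zero out the details.
coeffs = pywt.wavedec(sales, 'db4', level=4)
trend = pywt.waverec([coeffs[0]] + [np.zeros_like(c) for c in coeffs[1:]], 'db4')[:len(sales)]
resid = sales - trend

# AR(1) fit by least squares, then monitor one-step forecast errors.
phi = np.dot(resid[:-1], resid[1:]) / np.dot(resid[:-1], resid[:-1])
errors = resid[1:] - phi * resid[:-1]
limit = 3 * errors.std()
print("flagged days:", np.where(np.abs(errors) > limit)[0] + 1)   # +1 maps error index to day
```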

3. "Multiscale Statistical Process Control Using Libraries of Basis Functions", Bhavik Bakshi, Ohio State University


Abstract: This presentation will provide an overview of multiscale statistical process control (MSSPC) and its variations, including MSSPC based on principal component analysis (MSPCA), on clustering by adaptive resonance theory (MSART), and on libraries of basis functions. MSSPC with orthonormal wavelets is good for detecting mean shifts, while MSSPC with a library of wavelet packet basis functions can detect a broad range of changes, such as mean, autocorrelation, oscillatory, and spectral changes. Average run length analysis and industrial case studies demonstrate that MSSPC is an excellent general method for detecting abnormal situations when the nature and magnitude of the change can vary over a wide range and are not known a priori. However, MSSPC does not outperform methods that are tailored to detect specific types and magnitudes of changes. In industrial practice, the ability of MSSPC to provide better average performance across different types of data and changes is usually an important advantage, since prior knowledge about the types of changes is rarely available, particularly in large multivariate systems. These properties will be illustrated with data from petrochemical processes.
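
A bare-bones sketch of the multiscale idea appears below. It is my simplification, assuming the PyWavelets package, and does not reproduce the MSSPC variants discussed in the talk: control limits are learned for the wavelet coefficients at each scale from in-control data, and a new window raises an alarm if any of its coefficients violates the limit for its scale.

```python
# Multiscale monitoring sketch: Shewhart-style 3-sigma limits applied to the
# wavelet coefficients (approximation and details) at each scale.
import numpy as np
import pywt

def scale_limits(in_control, wavelet='haar', level=3):
    """3-sigma limits for the coefficients at each scale, from in-control data."""
    coeffs = pywt.wavedec(in_control, wavelet, level=level)
    return [3 * np.std(c) for c in coeffs]          # approximation first, then each detail level

def msspc_alarm(window, limits, wavelet='haar', level=3):
    """True if any coefficient of the new window exceeds its scale's limit."""
    coeffs = pywt.wavedec(window, wavelet, level=level)
    return any(np.any(np.abs(c) > lim) for c, lim in zip(coeffs, limits))

rng = np.random.default_rng(3)
limits = scale_limits(rng.normal(0, 1, 1024))       # learn limits from in-control data
shifted = rng.normal(1.5, 1, 64)                    # window with a sustained mean shift
print(msspc_alarm(shifted, limits))                 # the coarse-scale coefficients trip the alarm
```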

21. Data Quality

Session Organizer: Tamraparni Dasu, AT&T Labs - Research
Session Chair: Tamraparni Dasu, AT&T Labs - Research
1. "Data Cleaning: The Good, The Bad, and The Ugly", Ronald K. Pearson, Daniel Baugh Institute / Thomas Jefferson University

Abstract: This talk gives a broad overview of the data cleaning problem, focusing on three general areas: "good" techniques that perform well in practice, "bad" techniques that seem reasonable but frequently fail in practice, and "ugly" phenomena that we see in real datasets but would rather not. The Martin-Thomson data cleaning filter illustrates the good, since it often does an excellent job in dynamic analysis applications like spectral estimation and harmonic analysis, although at the price of some complexity. The venerable "3-sigma edit rule" illustrates the bad, since it often fails to find any outliers at all in contaminated datasets, due to masking effects. As a useful alternative, I will discuss the Hampel identifier, which has the advantage that it also extends to a very simple nonlinear data cleaning filter that sometimes performs as well as or better than the Martin-Thomson data cleaner in dynamic analysis applications. Finally, as examples of the ugly, I will consider the problems of common-mode outliers that appear simultaneously in several variables, the subtle outliers that arise as the ironic consequence of idempotent form-based data entry systems designed to enforce data quality, and the problem of outlier detection in asymmetrically distributed data sequences. All of these ideas will be illustrated with real datasets, drawn primarily from the areas of industrial process monitoring and bioinformatics.
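
A small numerical illustration of the masking effect (mine, not the speaker's code) shows why the 3-sigma edit rule can miss gross errors that a median/MAD-based Hampel identifier still finds.

```python
# Masking: gross errors inflate the mean and standard deviation, so the
# 3-sigma rule may flag nothing, while the median and MAD are barely affected.
import numpy as np

def three_sigma_outliers(x, t=3.0):
    return np.abs(x - x.mean()) > t * x.std()

def hampel_outliers(x, t=3.0):
    med = np.median(x)
    mad = 1.4826 * np.median(np.abs(x - med))    # MAD scaled to estimate sigma
    return np.abs(x - med) > t * mad

rng = np.random.default_rng(4)
data = rng.normal(0, 1, 100)
data[:20] = 50.0                                  # 20% gross contamination
print(three_sigma_outliers(data).sum())           # 0: the outliers have masked themselves
print(hampel_outliers(data).sum())                # 20: all contaminated points flagged
```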

2. "Database Technology and Data Quality", Theodore Johnson, Database Research Department, AT&T Labs - Research

Abstract: Modern (relational) databases have extensive facilities to ensure data quality: integrity constraints (on field values, uniqueness, relationships between tables, and so on), triggers, metadata support, data modeling tools, data loading tools, and simple but powerful query languages. So why is it that every database that I've encountered is filled with data quality problems? In this talk, I will outline some of the most common causes of data quality problems, and how these problems can be mitigated by the use of existing and new database related techniques. Because database complexity is a common factor in data quality problems, I will emphasize recent research in database summarization and exploration.
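
As a tiny demonstration of the facilities the abstract lists (my own example, not the speaker's, using Python's built-in sqlite3 module), field-value and uniqueness constraints reject bad rows at load time rather than leaving them to be discovered downstream.

```python
# Integrity constraints in action: UNIQUE and CHECK violations raise errors
# instead of silently loading dirty records.
import sqlite3

con = sqlite3.connect(":memory:")
con.execute("""
    CREATE TABLE customer (
        id    INTEGER PRIMARY KEY,
        email TEXT UNIQUE NOT NULL,
        age   INTEGER CHECK (age BETWEEN 0 AND 120)
    )
""")
con.execute("INSERT INTO customer (email, age) VALUES ('a@example.com', 34)")

for row in [("a@example.com", 40),      # violates UNIQUE(email)
            ("b@example.com", 999)]:    # violates the CHECK on age
    try:
        con.execute("INSERT INTO customer (email, age) VALUES (?, ?)", row)
    except sqlite3.IntegrityError as err:
        print("rejected:", row, "->", err)
```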

3. "Case Study in Data Quality Implementation", Tamraparni Dasu, Statistics Research Department, AT&T Labs - Research

Abstract: I will present a case study to illustrate the implementation of a data quality program to improve the accuracy of data flows in a provisioning process. The case study is based on a real application. As part of the case study, I will discuss data quality metrics, outlining conventional ones as well as proposing updated, dynamic metrics that capture the complexity of data quality and the highly application-specific nature of solutions and metrics.

22. Special Invited Session 1. Data Gathering

Session Organizer: Paul Tobias, Sematech (Ret)
Session Chair: Paul Tobias, Sematech (Ret)
"Data Gathering: Focusing on the Challenge", Gerald J. Hahn and Necip Doganaksoy, Adjunct Faculty, RPI and GE Global Research Center Paper

Abstract: Getting the right data is a critical part of our job, and one that is given insufficient emphasis in training practitioners and statisticians. Although this is hardly news to this audience, we feel it merits our attention. We propose a formal process for data gathering, and provide some examples.

We also recommend a comprehensive course on data gathering. Ideally, this would be the second semester of a one-year introduction to applied statistics for both practitioners and aspiring statisticians. A condensed version could also be an important part of short courses in industry.

The proposed course illustrates the inadequacy of historical data, and describes the suggested process for getting the right data. It includes an overview of key concepts and practical considerations in the design of experiments and survey sampling, as well as in the development of data gathering systems, in general. Other course topics include sample size determination, planning analytic (as opposed to enumerative) studies, and the data cleaning process. The course de-emphasizes formal statistical analyses, such as the analysis of variance, focusing, instead on simple graphical evaluations.

We emphasize the criticality of the data gathering process in a forthcoming book, tentatively entitled "Statistics in the Corporate World," planned for publication in 2004, and invite colleagues' comments.

23. Special Invited Session 2. Sequential Experimentation

Session Organizer: Emmanuel Yashchin, IBM Research
Session Chair: Emmanuel Yashchin, IBM Research
"Optimizing Sequential Design of Experiments", Michael Baron, University of Texas at Dallas and IBM Research Division and Claudia Schmegner, University of Texas at Dallas Paper

Abstract: It is often impractical or expensive to sample according to the classical sequential scheme, that is, one observation at a time. Sequential planning extends and generalizes "pure" sequential procedures by allowing observations to be sampled in groups. At any moment, all the collected data are used to determine the size of the next group and to decide whether or not sampling should be terminated. We discuss the optimality of sequential designs, taking into account both the variable and the fixed cost of experiments, and establish some general guidelines for optimal sequential planning. It is shown that the total cost of standard sequential procedures can be reduced significantly without increasing the loss. Specific types of sequential plans are introduced and compared, and some existing plans are modified and improved.
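
The trade-off between fixed and variable sampling costs can be illustrated with a toy group-sequential plan; the plan below is my own construction under simple assumptions, not the authors' optimal design. Each group incurs a fixed setup cost plus a per-observation cost, and after each group all data collected so far determine the next group size and the stopping decision, so a large fixed cost pushes the plan toward fewer, larger groups than one-at-a-time sampling.

```python
# Toy group-sequential plan for estimating a mean: stop once the confidence
# half-width reaches the target, choosing each group size from the data so far.
import numpy as np

def sequential_plan(draw, target_halfwidth, fixed_cost, unit_cost, z=1.96, first_group=10):
    data, cost = [], 0.0
    group = first_group
    while True:
        data.extend(draw(group))
        cost += fixed_cost + unit_cost * group
        x = np.asarray(data)
        halfwidth = z * x.std(ddof=1) / np.sqrt(len(x))
        if halfwidth <= target_halfwidth:
            return x.mean(), len(x), cost
        # Additional sample size the current variance estimate says is still needed,
        # taken in one further group to avoid paying the fixed cost repeatedly.
        needed = int(np.ceil((z * x.std(ddof=1) / target_halfwidth) ** 2)) - len(x)
        group = max(needed, 1)

rng = np.random.default_rng(5)
mean, n, cost = sequential_plan(lambda m: rng.normal(10, 4, m),
                                target_halfwidth=0.5, fixed_cost=50.0, unit_cost=1.0)
print(mean, n, cost)
```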