

2009 Quality & Productivity Research Conference

IBM T. J. Watson Research Ctr., Yorktown Heights, NY

June 3-5, 2009



Contributed Paper Sessions (session dates and times can be found in the Program)

1. Data Mining
Session Chair: Daniel R. Jeske, University of California - Riverside
1. "Random Forest Analysis for Vaccine Process Understanding," Matthew Wiener & Julia O'Neill, Merck & Co. Paper

Abstract: Identifying key process variables that influence vaccine potency is challenging because of the large number of variables involved in manufacturing and the inherent variability in potency measurements. This talk will demonstrate how random forest analysis has been used to illuminate relations in vaccine manufacturing data which were difficult to identify using traditional univariate statistical methods.
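
As a rough, hypothetical illustration of the idea (not the authors' code or data), a random forest can be fit to simulated process data and its variable-importance ranking used to surface candidate drivers of potency; the variable names below are invented.

    import numpy as np
    import pandas as pd
    from sklearn.ensemble import RandomForestRegressor

    # Hypothetical process data: many manufacturing variables, a noisy potency response.
    rng = np.random.default_rng(0)
    X = pd.DataFrame(rng.normal(size=(200, 10)),
                     columns=[f"process_var_{i}" for i in range(10)])
    potency = 2.0 * X["process_var_3"] - 1.5 * X["process_var_7"] + rng.normal(scale=1.0, size=200)

    forest = RandomForestRegressor(n_estimators=500, random_state=0)
    forest.fit(X, potency)

    # Rank candidate process variables; the two "true" drivers should rise to the top.
    importances = pd.Series(forest.feature_importances_, index=X.columns)
    print(importances.sort_values(ascending=False).head())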

2. "Performance Analysis of Alternative Combinations of Classification and Clustering Algorithms with Applications to Microbial Community Profiling," Rebecca Le, University of California - Riverside Paper

Abstract: To profile large microbial community samples cost-effectively, classification of hybridization signal intensities is a crucial step for improving the accuracy of the resulting clusters. Among the various proposed classification-clustering algorithms applied to hybridization fingerprints, the best-performing combination is hypothesized to most closely reproduce the “ground-truth” clusters obtained by clustering full nucleotide sequences.

3. "Sparse Model Recovery Methods for Enhancing System Identification," Avishy Carmi (Cambridge University, UK), Gurfil Pini (Technion, Israel), Dimitri Kanevsky (IBM, TJW Research Center) Paper

Abstract: We present a number of linear and nonlinear regression approaches for model selection, aimed at improving model identification. The proposed methods have broad applicability, ranging from identification of flight vehicle parameters to classification of fMRI data. The underlying assumption is that the independent variables, which may be represented as a sensing matrix, obey a restricted isometric property (RIP). This, in turn, facilitates the application of compressed sensing (CS) algorithms for recovering low-complexity (sparse) models with overwhelming probability. The proposed method is generalized for arbitrary sensing matrices by introducing an RIP-based random projection. This transformation is shown to improve the estimation accuracy of LASSO, the Dantzig selector and a recently-introduced Kalman filtering-based CS algorithm. We discuss possible applications of this methodology in the area of quality improvement.
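
For readers unfamiliar with this style of sparse recovery, the sketch below shows the LASSO piece only, with a random Gaussian sensing matrix (which satisfies the RIP with high probability); the dimensions and penalty level are arbitrary assumptions, and exact support recovery depends on them.

    import numpy as np
    from sklearn.linear_model import Lasso

    rng = np.random.default_rng(1)
    n, p, k = 80, 200, 5                        # n observations, p candidate terms, k nonzero
    A = rng.normal(size=(n, p)) / np.sqrt(n)    # random sensing matrix (RIP with high prob.)
    beta = np.zeros(p)
    beta[rng.choice(p, k, replace=False)] = rng.normal(scale=3.0, size=k)
    y = A @ beta + 0.01 * rng.normal(size=n)

    # l1-penalized regression recovers a sparse model even though n << p.
    fit = Lasso(alpha=0.01).fit(A, y)
    print("true support:     ", np.flatnonzero(beta))
    print("recovered support:", np.flatnonzero(np.abs(fit.coef_) > 1e-3))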

4. "Bayesian Variable Selection and Grouping," Fei Liu, University of Missouri-Columbia

Abstract: Variable selection, or subset selection, plays a fundamental role in high-dimensional statistical modeling. In many applications, the variables have a grouping structure, and it is therefore important to find the clusters of variables in the course of selection. In this paper, we propose a general Bayesian framework which allows selection, shrinkage estimation and grouping simultaneously. In addition, connections are established between the proposed approach and existing methods such as LASSO, Adaptive LASSO, Elastic Net and OSCAR. We investigate the performance of the proposed approach through extensive simulation studies and benchmark data sets from the literature.
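
Purely as a hypothetical illustration of the grouping idea mentioned above (not the proposed Bayesian method), an Elastic Net tends to spread similar coefficients across highly correlated predictors where the LASSO picks among them arbitrarily:

    import numpy as np
    from sklearn.linear_model import Lasso, ElasticNet

    rng = np.random.default_rng(2)
    n = 100
    z = rng.normal(size=n)
    # x1 and x2 are near-copies of the same latent signal, i.e., a correlated group.
    x1 = z + 0.01 * rng.normal(size=n)
    x2 = z + 0.01 * rng.normal(size=n)
    x3 = rng.normal(size=n)
    X = np.column_stack([x1, x2, x3])
    y = x1 + x2 + rng.normal(scale=0.5, size=n)

    print("lasso coefficients:      ", Lasso(alpha=0.1).fit(X, y).coef_.round(2))
    print("elastic net coefficients:", ElasticNet(alpha=0.1, l1_ratio=0.3).fit(X, y).coef_.round(2))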

2. Design of Experiments
Session Chair: Bradley Jones, JMP SAS Institute
1. "Analysis of Computer Experiments with Functional Response," Ying Hung, Rutgers University Paper

Abstract: The use of computer experiments is becoming more and more popular, and with new technology enabling enhanced data collection, many computer experiment outputs are observed in functional form. The literature on modeling computer experiments with functional outputs is scant, as most modeling techniques focus on single outputs. Because of the high dimensionality, modeling methods generally used for single outputs cannot be extended to functional outputs. In this paper, the difficulty is overcome by a sequential procedure combined with an iterative correlation matrix updating technique. The proposed method is illustrated by a computer experiment in machining.
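
The sequential, correlation-updating procedure itself is not reproduced here, but the scalar-output kriging surrogate it builds on can be sketched as follows (a minimal example assuming scikit-learn; the test function and design are hypothetical):

    import numpy as np
    from sklearn.gaussian_process import GaussianProcessRegressor
    from sklearn.gaussian_process.kernels import RBF

    # Hypothetical deterministic computer model on the unit square.
    def simulator(x):
        return np.sin(6 * x[:, 0]) + 0.5 * np.cos(4 * x[:, 1])

    rng = np.random.default_rng(3)
    X_train = rng.uniform(size=(30, 2))                 # design points
    y_train = simulator(X_train)

    gp = GaussianProcessRegressor(kernel=RBF(length_scale=[0.2, 0.2]), normalize_y=True)
    gp.fit(X_train, y_train)

    X_new = rng.uniform(size=(5, 2))
    mean, sd = gp.predict(X_new, return_std=True)       # surrogate prediction with uncertainty
    print(np.column_stack([mean, sd]).round(3))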

2. "Repeated Measurement Designs under Subject Dropout" Shi Zhao, Xerox Innovation Group, Dibyen Majumdar, University of Illinois at Chicago Paper

Abstract: Two major questions raised by crossover designs under subject dropout are addressed. 1) What statistical properties does the design retain at the end of the experiment? We answer this by proposing UBRMDs balanced for loss and establishing their properties. 2) Can the goodness of a planned design be measured in terms of the precision lost to subject dropout, and if so, how? We suggest two quantities, the expected precision loss and the maximum precision loss, which can serve as guidance for design selection. Several realistic models are considered.

3. "New Three-Level Designs for Factor Screening and Response Surface Exploration," David J. Edwards (Virginia Commonwealth University) and Robert W. Mee, University of Tennessee Paper

Abstract: In contrast to the usual sequential nature of response surface methods (RSM), recent literature has proposed conducting both screening and response surface exploration using a single three-level design. This approach is named “one-step RSM”. We discuss and illustrate two shortcomings of current one-step RSM designs and analysis. Subsequently, we propose a new class of three-level designs, together with an analysis strategy unique to these designs, that addresses these shortcomings and helps the user correctly assess factor importance. We illustrate the designs and analysis with simulated and real data.

4. "Constructing Two-Level Trend-Robust Designs," Robert Mee, University of Tennessee, Knoxville Paper

Abstract: Blocking is one means of shielding comparisons of interest from unwanted sources of variation. When the experimental units are time (or spatially) ordered and are suspected of producing a trend in the errors, certain run orders provide an alternative means of doing so. Beginning with Daniel and Wilcoxon (1966), statisticians have proposed run orders that are robust to linear and quadratic trend. This presentation discusses the rationale for using such designs and concludes with a simple approach to constructing trend-robust designs from factorial designs arranged in blocks.
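
As a quick, hypothetical check of what "trend-robust" means (not the construction presented in the talk): in a trend-free run order, each main-effect contrast column is orthogonal to a linear time trend, so a linear drift in the errors does not bias those effect estimates.

    import numpy as np

    # A 2^3 full factorial in standard (Yates) order versus one reordered run sequence.
    levels = np.array([[-1, -1, -1], [1, -1, -1], [-1, 1, -1], [1, 1, -1],
                       [-1, -1, 1], [1, -1, 1], [-1, 1, 1], [1, 1, 1]])
    time_trend = np.arange(1, 9) - 4.5                  # centered linear time trend

    def trend_correlation(design):
        # Correlation of each main-effect column with the time trend; 0 means trend-free.
        return np.array([np.corrcoef(col, time_trend)[0, 1] for col in design.T])

    print("standard order:", trend_correlation(levels).round(3))
    reorder = [0, 7, 3, 4, 5, 2, 6, 1]                  # one illustrative alternative run order
    print("reordered:     ", trend_correlation(levels[reorder]).round(3))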

3. Lifetime Data Analysis and Reliability
Session Chair: David C. Trindade, Sun Microsystems
1. "Misuses of MTBF?" Art Degenholtz and Fred Schenkelberg, Ops A La Carte, LLC Paper

Abstract: Mean Time Between Failures, MTBF, is the worst four-letter acronym. It leads to misunderstanding, misinterpretation and misinformed decisions, and yet MTBF is widely used: it is embedded in countless industries as ‘the way’ of discussing reliability. As you already know, MTBF is the parameter of the exponential life distribution, and it has common estimation techniques. MTBF is a key element of modeling, planning, test development and vendor selection, among many other activities. In this presentation, let’s review the common issues around MTBF and how these problems have led to significant errors. Then let’s explore what to do, given that organizations and industries will continue to use MTBF. What questions should you ask? How should you clearly explain the issues and proper use of MTBF and related probabilities? And what basic calculation should you conduct every time you run across MTBF? Within the world of reliability engineering and statistics there has rarely been a more widely used yet widely misunderstood metric. Many engineers and managers do not take the time to really understand basic statistics. With very simple formulas, arguments, and definitions, we can help our respective industries advance product reliability. Used properly, there is nothing wrong with MTBF. Let’s talk about how to use MTBF properly and encourage others to do the same.
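
A hedged guess at the kind of "basic calculation" meant here: under the exponential model implied by quoting a bare MTBF, only about 37% of units survive to the MTBF, and a wear-out (Weibull) population with the same mean behaves quite differently early in life. The numbers below are hypothetical.

    import math

    mtbf = 50_000.0                  # hours (hypothetical)
    mission = 8_760.0                # one year of continuous operation

    # Exponential model implied by a bare MTBF:
    print(f"P(survive 1 year)  = {math.exp(-mission / mtbf):.3f}")
    print(f"P(survive to MTBF) = {math.exp(-1.0):.3f}   # ~63% of units fail before the MTBF")

    # A Weibull (shape 2) population with the same mean tells a different story:
    beta = 2.0
    eta = mtbf / math.gamma(1 + 1 / beta)
    print(f"Weibull P(survive 1 year) = {math.exp(-(mission / eta) ** beta):.3f}")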

2. "Maintenance Planning of Continuous Production Systems, Illustration in the Case of an Off-shore Platform," Lea Deleris, Andrew Conn and Jonathan Hosking, IBM Research Paper

Abstract: We present a simulation-based optimization model for maintenance planning of systems where maintenance activities require shutdown. In such cases, maintenance activities should be concentrated in as few periods as possible rather than spread over the time horizon. We illustrate our approach with the case of an off-shore oil-platform.

3. "Optimal Design of Accelerated Life Tests with Multiple Stresses," Y. Zhu and E. A. Elsayed, Rutgers University Paper

Abstract: As product reliability increases, it becomes more difficult to perform accelerated life testing that induces failures in a short time using a single stress. Since product life may depend on several stresses operating simultaneously, ALT with multiple stresses is a realistic and feasible way to overcome this difficulty. The associated challenges are to reduce the number of combinations of stress levels and to spread the combinations across the design region. In this work, the Latin hypercube (LH) concept is integrated with ALT design, and D-optimality and maximum inter-site distance are used to find optimum ALT test plans. The results show that D-optimality may not lead to maximum inter-site distances, and a multi-objective optimization that considers both D-optimality and inter-site distance should be used to find optimal LH-ALT designs. We develop an exchange algorithm for efficiently generating such designs and demonstrate its performance and efficiency.
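
A simplified sketch of the maximin idea only (not the authors' exchange algorithm), assuming scipy's QMC module is available: generate several random Latin hypercube candidates, keep the one with the largest minimum inter-site distance, and scale it to hypothetical stress ranges.

    import numpy as np
    from scipy.stats import qmc
    from scipy.spatial.distance import pdist

    n_runs, n_stresses, n_candidates = 12, 3, 200
    sampler = qmc.LatinHypercube(d=n_stresses, seed=4)

    best_design, best_min_dist = None, -np.inf
    for _ in range(n_candidates):
        design = sampler.random(n_runs)                 # LH sample on the unit cube
        min_dist = pdist(design).min()                  # smallest inter-site distance
        if min_dist > best_min_dist:
            best_design, best_min_dist = design, min_dist

    # Scale to hypothetical stress limits (e.g., temperature, voltage, humidity).
    lower, upper = [40, 5, 20], [120, 15, 95]
    print(qmc.scale(best_design, lower, upper).round(1))
    print("minimum inter-site distance:", round(best_min_dist, 3))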

4. "An Imputation Based Approach for Parameter Estimation in the Presence of Ambiguous Censoring with Application in Industrial Supply Chain Data," Samiran Ghosh, Indiana University-Purdue University, Indianapolis Paper

Abstract: This paper describes a novel approach based on "proportional imputation" for identical units produced in a batch that have random but independent installation and failure times. The problem is motivated by a real-life industrial production-delivery system in which identical units are shipped after production to a third-party warehouse and then sold at a future date for possible installation. Due to practical limitations, at any given time point, exact installation and failure times are known only for those units that have failed by that time after installation. Hence, in-house reliability engineers have only very limited and partial data with which to estimate the model parameters of the installation and failure distributions. In practice, the other units in the batch are generally ignored for lack of available statistical methodology, giving rise to gross misspecification. In this paper we introduce a likelihood-based, parametric and computationally efficient solution that overcomes this problem with optimal use of the available information. The proposed methodology is supported by simulation and a real-life data example.

4. Statistical Inference
Session Chair: Rebecca Le, University of California - Riverside
1. "Tolerance Intervals: Time to Meet or Re-visit, with Simulations and Derivations," Janis Dugle, Abbott Nutrition Paper

Abstract: Can you be 95% confident that 99% of a distribution will be within some given interval? That is the statement a ‘tolerance interval’ makes. Whenever sample data are used to estimate a distribution, tolerance intervals should be considered. But in my 20 years of statistical consulting, I have found that tolerance intervals are relatively unknown to quality practitioners who are otherwise comfortable with other statistical concepts. So if this is new or ‘magic formula’ territory for you, come and walk through it with me. We’ll look at a logical derivation of a tolerance coefficient formula and then, through simulations, quantify the risk of the naïve approaches that are sometimes used.
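
For readers who want to experiment before the talk, a minimal sketch of a widely used approximation (Howe-type) to the two-sided normal tolerance factor is shown below; it is not the exact derivation walked through in the presentation, and the data are simulated.

    import numpy as np
    from scipy import stats

    def approx_tolerance_interval(x, coverage=0.99, confidence=0.95):
        """Approximate two-sided (confidence, coverage) tolerance interval for normal data."""
        n = len(x)
        nu = n - 1
        z = stats.norm.ppf((1 + coverage) / 2)
        chi2 = stats.chi2.ppf(1 - confidence, nu)       # lower chi-square quantile
        k = z * np.sqrt(nu * (1 + 1 / n) / chi2)        # approximate tolerance factor
        xbar, s = np.mean(x), np.std(x, ddof=1)
        return xbar - k * s, xbar + k * s

    rng = np.random.default_rng(5)
    sample = rng.normal(loc=100, scale=2, size=30)      # hypothetical assay results
    print(approx_tolerance_interval(sample))            # "95% confident it covers 99%"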

2. "On exact two-sided statistical tolerance intervals for normal distributions with unknown means and unknown common variability," Ivan Janiga and Ivan Garaj, Slovak University of Technology in Bratislava, Slovak Republic Paper

Abstract: In process quality control there is often a need to know an interval covering a given part of the population at a given confidence level. This contribution gives an exact equation, as well as an algorithm, for calculating the coverage interval for a given proportion of values from a normal distribution, sample size and confidence level. A generalization of the procedure to more than one distribution is also suggested.

3. "Bartlett and Wald Sequential Hypothesis Testing with Correlated Data," Judy X. Li and Daniel R. Jeske, University of California - Riverside Paper

Abstract: In most cases, correlated spatial data are analyzed based on a fixed-sample-size design. To reduce study costs, sequential analysis is blended with a spatial linear mixed model to account for the correlation that likely exists in spatial data. Both Wald and Bartlett sequential hypothesis testing procedures are considered.
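
To fix ideas, here is a bare-bones sketch of Wald's SPRT for independent normal observations; the correlated spatial mixed-model version studied in the talk changes the likelihood and is not attempted here. All parameter values are illustrative.

    import numpy as np
    from scipy import stats

    def wald_sprt(data, mu0=0.0, mu1=1.0, sigma=1.0, alpha=0.05, beta=0.10):
        """Sequentially test H0: mu = mu0 vs H1: mu = mu1 for i.i.d. N(mu, sigma^2) data."""
        upper = np.log((1 - beta) / alpha)              # decide H1 when the LLR crosses this
        lower = np.log(beta / (1 - alpha))              # decide H0 when the LLR crosses this
        llr = 0.0
        for n, x in enumerate(data, start=1):
            llr += stats.norm.logpdf(x, mu1, sigma) - stats.norm.logpdf(x, mu0, sigma)
            if llr >= upper:
                return "decide H1", n
            if llr <= lower:
                return "decide H0", n
        return "no decision", len(data)

    rng = np.random.default_rng(6)
    print(wald_sprt(rng.normal(loc=1.0, size=200)))     # data generated under H1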

4. "Change-Point Analysis of Survival Data for Clinical Trials," Xuan Chen and Michael Baron, University of Texas at Dallas Paper

Abstract: Survival data with a change point are described by two models for the failure rate, one before the change point and the other after it. This paper focuses on the detection and estimation of such changes, which is important for the evaluation and comparison of treatments and the prediction of their effects. Unlike the classical change-point model, measurements may still be identically distributed, and the change point is a parameter of their common survival function. Some of the classical change-point detection techniques can still be used, but the results are different. Contrary to the classical model, the maximum likelihood estimator of a change point remains strongly consistent, even in the presence of nuisance parameters. However, it is no longer asymptotically efficient. We find a more accurate procedure based on Kaplan-Meier estimation of the survival function followed by least-squares estimation of the change point. Confidence sets for the change-point parameter are proposed, and their asymptotic properties are derived. The proposed methods are applied to a recent clinical trial of a treatment program for strong drug dependence.
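
As a much-simplified sketch of the change-point idea (complete, uncensored data and a piecewise-constant hazard; the Kaplan-Meier/least-squares refinement in the abstract is not reproduced), the change point can be estimated by maximizing a profile log-likelihood over a grid:

    import numpy as np

    def profile_loglik(times, tau):
        """Profile log-likelihood of an exponential hazard that changes at tau."""
        d1 = np.sum(times <= tau)                       # failures before tau
        d2 = len(times) - d1                            # failures after tau
        T1 = np.sum(np.minimum(times, tau))             # exposure before tau
        T2 = np.sum(np.maximum(times - tau, 0.0))       # exposure after tau
        if d1 == 0 or d2 == 0:
            return -np.inf
        return d1 * np.log(d1 / T1) - d1 + d2 * np.log(d2 / T2) - d2

    rng = np.random.default_rng(7)
    # Hypothetical failure times: the hazard drops from 1.0 to 0.2 at t = 1.5.
    l1, l2, tau_true = 1.0, 0.2, 1.5
    e = rng.exponential(size=500)
    times = np.where(e <= l1 * tau_true, e / l1, tau_true + (e - l1 * tau_true) / l2)

    grid = np.linspace(0.2, 5.0, 481)
    tau_hat = grid[np.argmax([profile_loglik(times, tau) for tau in grid])]
    print("estimated change point:", round(float(tau_hat), 2))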

5. "Symbolic Algebra Applications in Quality and Productivity," Andrew Glen, United States Military Academy, Lawrence Leemis, The College of William & Mary Paper

Abstract: The APPL (A Probability Programming Language) package has been developed to perform symbolic operations on random variables. This talk will briefly introduce the language and overview applications in quality assurance (e.g., control chart constant calculation, system survival distributions, goodness-of-fit test statistic distributions) and in productivity (e.g., project management).
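
APPL itself is a Maple-based package; purely as an analogous taste of symbolic manipulation of random variables (not APPL), the survival function of a two-component series system can be derived symbolically in Python:

    import sympy as sp
    from sympy.stats import Exponential, cdf

    t = sp.symbols('t', positive=True)
    lam1, lam2 = sp.symbols('lambda1 lambda2', positive=True)
    X1, X2 = Exponential('X1', lam1), Exponential('X2', lam2)

    # A series system survives only if every component survives.
    S_series = (1 - cdf(X1)(t)) * (1 - cdf(X2)(t))
    print(sp.simplify(S_series))                        # exp(-t*(lambda1 + lambda2))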

5. Statistical Challenges and Methods
Session Chair: William F. Guthrie, NIST
1. "Bootstrap Inference on the Characterization of a Response Surface," Robert Parody, Rochester Institute of Technology Paper

Abstract: The eigenvalues of the matrix of pure and mixed quadratic regression coefficients are critical in assessing the nature of higher dimensional second-order response surface models. Confidence intervals are needed to assess the variability in these estimates. Simulation is a technique that has been proven useful in creating critical points on linear combinations of these coefficients. However, the given linear combinations have always been prespecified, which is not the case for these eigenvalues. The approach given in this article will generalize this technique using a nonparametric bootstrap based on a pivot to create simultaneous confidence intervals for these eigenvalues.
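
The pivot-based simultaneous intervals proposed in the paper are not reproduced here; as a rough, hypothetical sketch of the general bootstrap idea, one can residual-bootstrap a fitted second-order model and read off percentile intervals for the eigenvalues of the quadratic coefficient matrix:

    import numpy as np

    def fit_quadratic(X, y):
        """Least-squares fit of a full second-order model in two factors."""
        x1, x2 = X[:, 0], X[:, 1]
        D = np.column_stack([np.ones_like(x1), x1, x2, x1**2, x2**2, x1 * x2])
        return np.linalg.lstsq(D, y, rcond=None)[0], D

    def eigenvalues(coef):
        b11, b22, b12 = coef[3], coef[4], coef[5]
        B = np.array([[b11, b12 / 2], [b12 / 2, b22]])  # matrix of quadratic coefficients
        return np.linalg.eigvalsh(B)

    rng = np.random.default_rng(8)
    # Hypothetical data from a saddle-shaped true surface.
    X = rng.uniform(-1, 1, size=(40, 2))
    y = 5 + X[:, 0] - 2 * X[:, 0]**2 + 1.5 * X[:, 1]**2 + rng.normal(scale=0.3, size=40)

    coef, D = fit_quadratic(X, y)
    fitted, resid = D @ coef, y - D @ coef

    boot_eigs = []
    for _ in range(2000):
        y_star = fitted + rng.choice(resid, size=len(resid), replace=True)
        boot_eigs.append(eigenvalues(fit_quadratic(X, y_star)[0]))
    lo, hi = np.percentile(boot_eigs, [2.5, 97.5], axis=0)
    print("point estimates:", eigenvalues(coef).round(2))
    print("95% percentile intervals:", list(zip(lo.round(2), hi.round(2))))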

2. "Statistical Challenges in Cyber Security," Joanne Wendelberger, Los Alamos National Laboratory Paper

Abstract: Cyber Security is a growing area of concern for government, industry, and academia. Statistics can play an important role in addressing cyber threats by providing insight into cyber data to support informed decision-making. Statistical challenges include characterization, modeling, and analysis for diverse, dynamic, and sometimes massive streams of data.

3. "Fitting Multiple Change-Point Models to a Multivariate Gaussian Model," Edgard M. Maboudou (University of Central Florida) and Douglas M. Hawkins, University of Minnesota Paper

Abstract: Statistical analysis of change-point detection and estimation has received much attention recently. A time point such that observations follow a certain statistical distribution up to that point and a different distribution - commonly of the same functional form but with different parameters - after that point is called a change-point. Multiple change-point problems arise when there is more than one change-point. This paper develops an exact algorithm for multivariate data that detects and estimates the change-points and the within-segment parameters using maximum likelihood estimation.
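
The paper's exact algorithm is not reproduced here; as a hypothetical, stripped-down sketch of the same dynamic-programming idea for mean changes only (fixed number of segments, common within-segment scatter), one can minimize the total within-segment sum of squares exactly:

    import numpy as np

    def segment_cost(cumsum, cumsq, s, e):
        """Within-segment sum of squared deviations from the segment mean (rows s..e-1)."""
        n = e - s
        seg_sum = cumsum[e] - cumsum[s]
        return float(np.sum(cumsq[e] - cumsq[s] - seg_sum**2 / n))

    def best_segmentation(X, n_segments):
        """Exact dynamic program minimizing total within-segment cost."""
        n = len(X)
        cumsum = np.vstack([np.zeros(X.shape[1]), np.cumsum(X, axis=0)])
        cumsq = np.vstack([np.zeros(X.shape[1]), np.cumsum(X**2, axis=0)])
        F = {(1, e): segment_cost(cumsum, cumsq, 0, e) for e in range(1, n + 1)}
        back = {}
        for k in range(2, n_segments + 1):
            for e in range(k, n + 1):
                F[(k, e)], back[(k, e)] = min(
                    (F[(k - 1, s)] + segment_cost(cumsum, cumsq, s, e), s)
                    for s in range(k - 1, e))
        cps, e = [], n
        for k in range(n_segments, 1, -1):              # trace back the change points
            e = back[(k, e)]
            cps.append(e)
        return sorted(cps)

    rng = np.random.default_rng(9)
    X = np.vstack([rng.normal(0, 1, (60, 3)), rng.normal(2, 1, (50, 3)), rng.normal(-1, 1, (40, 3))])
    print(best_segmentation(X, 3))                      # true change points at 60 and 110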

4. "Customer Segmentation Using Frequency Domain Analysis of Multivariate Categorical Time Series," Shan Hu and Nalini Ravishanker (University of Connecticut), Raj Venkatesan (University of Virginia) Paper

Abstract: Customer segmentation enables a company to divide its heterogeneous customer market into several homogeneous groups according to similar needs for products or services provided by this company. Since marketing policies and campaigns can be targeted towards those homogeneous groups more efficiently and effectively, accurate segmentation can enhance the productivity and profitability of the company. This research focuses on market segmentation using a frequency domain based cluster analysis of dependent categorical time series collected on various characteristics from a target market, viz., customers who are open to organic/natural grocery and personal products. The approach is verified by computing and comparing misclassification probabilities on data simulated under different scenarios dealing with independent and dependent multivariate categorical time series data. Application of the analysis to the customer database would help the company to understand important characteristics of the target customers in terms of their purchasing in other product categories, keeping in mind health concerns, product uniqueness, product price and quality.

6. Applications
Session Chair: Shan Hu, University of Connecticut
1. "Selection Bias in a Study of the Effect of Friction on Expected Crash Counts on Connecticut Roads," John Ivan, Nalini Ravishanker, Brien Aronov, Sizhen Guo and Eric Jackson, University of Connecticut Paper

Abstract: Selection bias is an often overlooked source of bias in statistical analyses for many practical applications. In this project, we look at the effects of selection bias and use a proposed matching technique to correct for it. The problem deals with understanding the effect of friction on the number of crashes on roads in Connecticut, a problem directly tied to the quality of transportation and its management in the state. The selection bias comes from the fact that friction testing is usually done on “high risk” roads, which may lead to skewed results compared to the “usual” Connecticut road. Our analysis consists of first determining an optimal number of “matched” random samples of segments on the Connecticut roadways, collecting data from friction tests on these segments, and then analyzing the combined data consisting of the “found data” and the “random data”. Quantification of the effect of friction on crash counts (total and categorized by type), in the presence of other roadway characteristics, is a useful outcome for roadway safety planning.

2. "Estimation and Confidence Intervals for Relative Offset Under Correlated Exponential Network Delays," Jeff Pettyjohn, University of California, Riverside Paper

Abstract: Accurate clock synchronization is essential in the operation of Wireless Sensor Networks (WSN). Algorithms for clock synchronization generally rely on estimation of the relative offset of two different clocks. At the present time, a significant amount of research has taken place in this area. In particular, under the assumption of independent exponential network delays, a maximum likelihood estimator (MLE) has been found, as well as a bias correction, and a generalized confidence interval. However, the assumption of independent network delays is not always realistic. For this reason a generalized linear mixed model (GLMM) for the network delays is proposed, which allows for correlated network delays. Under this new model, an MLE and an approximate confidence interval for relative offset are developed.

3. "A Bayesian Model for Assessment of Ultra-Low Levels of Sulfur in Diesel Fuel," William F. Guthrie, NIST and W. Robert Kelly, NIST

Abstract: New regulations limit the sulfur concentration in diesel fuel to less than 15 mg/kg and should provide $150 billion in health and environmental benefits annually. However, ensuring that the measurement systems for fuel production and regulation are in control requires reference standards with lower sulfur levels than ever before. One difficulty in developing such standards is the need to correct the measurements for relatively large and variable amounts of background sulfur. This talk describes a Bayesian model for the assessment of sulfur in diesel fuel via isotope-dilution mass spectrometry that uses the three-parameter lognormal distribution to describe background variation. Results from this model have lower uncertainty and more closely follow known physical constraints than the results from models previously used for standards with higher sulfur levels.

4. "Reducing Error by Increasing Focus: Multivariate Monitoring of Biosurveillance Data," Inbal Yahav, University of Maryland

Abstract: Statistical research on disease outbreak detection using syndromic data has focused on modeling background behavior and anomaly detection. Such data contain known patterns, and thus algorithms often consist of an initial preprocessing step before monitoring the resulting residuals. Most studies focus on univariate monitoring of such data. We study directionally-sensitive multivariate control charts and examine practical performance aspects of these methods using a large simulation study and authentic biosurveillance data.
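
The directionally sensitive charts studied in the talk are not reconstructed here; as a generic multivariate baseline for comparison, a Hotelling T-squared statistic can be monitored on the preprocessing residuals. The residuals and control limit below are hypothetical.

    import numpy as np
    from scipy import stats

    rng = np.random.default_rng(10)
    p, n_baseline = 4, 365
    # Hypothetical residuals (after removing day-of-week and seasonal patterns) for 4 syndromes.
    baseline = rng.normal(size=(n_baseline, p))
    mean_vec = baseline.mean(axis=0)
    cov_inv = np.linalg.inv(np.cov(baseline, rowvar=False))

    def t_squared(x):
        d = x - mean_vec
        return float(d @ cov_inv @ d)

    limit = stats.chi2.ppf(0.995, df=p)                 # approximate control limit
    new_day = np.array([0.5, 2.8, 3.1, 0.2])            # elevated counts in two syndromes
    print(round(t_squared(new_day), 2), "vs limit", round(limit, 2),
          "-> signal:", t_squared(new_day) > limit)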

5. "Modeling of Defect-Limited Yield in Semiconductor Manufacturing," Emmanuel Yashchin (IBM Research), Michael Baron (U. Texas - Dallas), Asya Takken (Cisco), Mary Lanzerotti (Springer Co.)) Paper

Abstract: We discuss inference problems related to yields of semiconductor processes based on characteristics of defects observed at various stages. Of special importance are cases involving incomplete information about defects. For example, for a given wafer, only data related to a selected subset of the operations may be available. Furthermore, many defects are only partially classified. We describe an approach based on the EM algorithm and illustrate its use for yield estimation and the development of process-improvement strategies.
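
The authors' yield model is not reproduced here; as a toy sketch of the E-step/M-step flavor for partially classified defects, counts known only to belong to a subset of defect classes can be allocated proportionally to the current class probabilities and the probabilities re-estimated. All counts and class names below are invented.

    import numpy as np

    classes = ["short", "open", "particle"]
    full_counts = np.array([40.0, 25.0, 10.0])          # fully classified defects
    partial = [({"short", "open"}, 30.0),               # known only to be short OR open
               ({"open", "particle"}, 15.0)]            # known only to be open OR particle

    p = np.ones(3) / 3                                  # initial class probabilities
    for _ in range(200):
        expected = full_counts.copy()
        for subset, count in partial:                   # E-step: allocate partial counts
            idx = [classes.index(c) for c in subset]
            expected[idx] += count * p[idx] / p[idx].sum()
        p = expected / expected.sum()                   # M-step: update class probabilities
    print(dict(zip(classes, p.round(3))))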