2003 Quality & Productivity Research Conference, IBM T. J. Watson Research Center, Yorktown Heights, NY, May 21-23, 2003
Invited Paper Sessions (with Abstracts and Papers)
Opening Plenary Session
Session Organizer: Emmanuel Yashchin, IBM Research
Session Chair: Emmanuel Yashchin, IBM Research
1. Conference kick-off: Paul Horn, IBM Senior VP, Research.
2. "Quality Management and Role of Statistics in IBM", Mike Jesrani, Director of the IBM Quality Management Process.
3. TBD, Anil Menon, IBM VP, Corporate Brand Strategy and Worldwide Market Intelligence.
1. Experimental Design
Session Organizers: Tim Robinson, University of Wyoming, and Doug Montgomery
Session Chair: Tim Robinson, University of Wyoming
1. "Graphical Methods to Assess the Prediction Capability of Mixture
and Mixture-Process Designs", Heidi Goldfarb, Dial Corporation, Connie Borror, Arizona State University, Douglas Montgomery, Arizona State University, Christine Anderson-Cook, Virginia Polytechnic Institute and State University. Paper
Abstract: Mixture and mixture-process variable experiments are commonly encountered
in the chemical, food, pharmaceutical, and consumer products industries
as well as in other industrial settings. Very often, a goal of these experiments
is to make predictions about various properties of the formulations. Therefore
it is helpful to be able to examine the prediction variance properties
of experimental designs for these types of experiments and select those
that minimize these values. Variance Dispersion Graphs (VDGs) have been
used to evaluate prediction variance for standard designs as well as mixture
designs. They show the prediction variance as a function of distance from
the design space centroid. We expand the VDG technique to handle mixture-process
experiments with a Three Dimensional VDG. These graphs show the prediction
variance as a function of distance from the centroids of both the process
and mixture spaces, giving the experimenter a view of the entire combined
mixture-process space. We will demonstrate the usefulness of these types
of plots by evaluating various published and computer-generated designs
for some mixture-process examples.
2. "Fraction of Design Space Graphs for Assessing Robustness of GLM
Designs", Christine Anderson-Cook, Virginia Polytechnic Institute
and State University.
Abstract: The Fraction of Design Space (FDS) graph (Zahran, Anderson-Cook and Myers
(2003)) is a new design evaluation tool that has been used to compare competing
designs in several Response Surface Methodology and Mixture experiment
applications (Goldfarb, Anderson-Cook, Borror and Montgomery (2003)). It
characterizes the fraction of the total design space at or below every
scaled predicted variance value in a single curve, while allowing great
flexibility to examine regular and irregular design spaces. A review of
the FDS plot will be given, and a new application of the FDS plot is considered
for Generalized Linear Models, to assess the robustness of designs to initial
parameter estimates. Examples will be used to illustrate the new method.
3. "Using a Genetic Algorithm to Generate Small Exact Response Surface
Designs", John Borkowski, Montana State University.
Abstract: A genetic algorithm (GA) is an evolutionary search strategy based on
simplified rules of biological population genetics and theories of evolution.
A GA maintains a population of candidate solutions for a problem, and then
selects those candidates most fit to solve the problem. The most fit candidates
are combined and/or altered by reproduction operators to produce new solutions
for the next generation. The process continues, with each generation evolving
more fit solutions until an acceptable solution has evolved. In this research,
a GA is developed to generate near-optimal D, A, G, and IV exact N-point
response surface designs in the hypercube. A catalog of designs for 1,
2, and 3 design factors has been generated. Efficiencies for classical
response surface designs are calculated relative to exact optimal designs
of the same design size.
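To make the abstract's evolutionary search concrete, here is a minimal, hypothetical sketch of a GA for a near-D-optimal exact design: candidate designs for a full quadratic model in two factors are scored by |X'X|, and selection, crossover, and mutation evolve the population. All settings (population size, mutation rate, model) are illustrative, not taken from the talk.

```python
import numpy as np

rng = np.random.default_rng(0)

def model_matrix(design):
    # Full quadratic model in two factors: 1, x1, x2, x1*x2, x1^2, x2^2
    x1, x2 = design[:, 0], design[:, 1]
    return np.column_stack([np.ones(len(design)), x1, x2, x1 * x2, x1**2, x2**2])

def d_criterion(design):
    # D-criterion |X'X|: larger is better
    X = model_matrix(design)
    return np.linalg.det(X.T @ X)

def ga_design(n_points=9, pop_size=40, generations=150):
    # Population of candidate N-point designs in the hypercube [-1, 1]^2
    pop = rng.uniform(-1.0, 1.0, size=(pop_size, n_points, 2))
    for _ in range(generations):
        fitness = np.array([d_criterion(d) for d in pop])
        elite = pop[np.argsort(fitness)[::-1][: pop_size // 2]]  # keep fittest half
        mates = elite[rng.permutation(len(elite))]
        cross = rng.random(elite.shape) < 0.5        # uniform crossover
        children = np.where(cross, elite, mates)
        mutate = rng.random(children.shape) < 0.1    # occasional coordinate mutation
        children = np.clip(children + mutate * rng.normal(0.0, 0.2, children.shape),
                           -1.0, 1.0)
        pop = np.concatenate([elite, children])
    fitness = np.array([d_criterion(d) for d in pop])
    return pop[np.argmax(fitness)], float(fitness.max())

baseline = rng.uniform(-1.0, 1.0, size=(9, 2))   # one random design for comparison
best, best_det = ga_design()
```

Because the elite half survives each generation, the best fitness is monotone nondecreasing, so the evolved design dominates a random one.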
Session Organizer: William Guthrie, NIST
Session Chair: William Guthrie, NIST
1. "Bayesian Estimate of the Uncertainty of a Measurand Recorded with
Finite Resolution", Blaza Toman, Statistical Engineering Division,
NIST
Abstract: There are often cases when a measuring device records data in a digitized
form, that is, a data point will be recorded as k whenever the quantity
being measured has value between k-c and k+c for some finite resolution
c. An estimate of the "uncertainty" in the quantity being measured is generally
required. In the metrological literature, there are two main competing
recommended solutions. One is a general solution according to the "Guide
to the Expression of Uncertainty in Measurement", the other is a more specific
solution written for dimensional and geometric measurements and published
as ISO 14253-2. These two solutions lead to different answers in most cases.
This talk will address this subject from a Bayesian perspective, giving
an alternate solution, which clarifies the situation and sheds some light
on the appropriateness of the two classical solutions.
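As a toy illustration of the Bayesian view described above (not the author's solution): if the device reports k whenever the true value lies in (k-c, k+c), the likelihood is flat on that interval, so with a normal prior the posterior is simply the prior restricted to the interval. A small grid computation, with made-up numbers:

```python
import numpy as np

def rounded_posterior(k, c, prior_mean, prior_sd, grid_n=20001):
    # The likelihood of reading k is nonzero only on (k - c, k + c),
    # so the posterior is the prior truncated to that interval.
    t = np.linspace(k - c, k + c, grid_n)
    dt = t[1] - t[0]
    prior = np.exp(-0.5 * ((t - prior_mean) / prior_sd) ** 2)
    post = prior / (prior.sum() * dt)            # normalize on the interval
    mean = (t * post).sum() * dt
    sd = np.sqrt(((t - mean) ** 2 * post).sum() * dt)
    return mean, sd

# hypothetical reading k = 10 with resolution c = 0.5 and a diffuse prior
post_mean, post_sd = rounded_posterior(k=10.0, c=0.5, prior_mean=9.8, prior_sd=2.0)
```

With a prior much wider than the resolution, the posterior is nearly uniform on the interval, so its standard deviation approaches c/sqrt(3), close to the GUM-style uniform treatment.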
2. "MCMC in StRD", Hung-kung Liu, Will Guthrie, and Grace Yang,
Statistical Engineering Division, NIST
Abstract: In the Statistical Reference Datasets (StRD) project, NIST provided datasets
on the web (http://www.itl.nist.gov/div898/strd/index.html) with certified
values for assessing the accuracy of software for univariate statistics,
linear regression, nonlinear regression, and analysis of variance. A new
area in statistical computing is the Bayesian analysis using Markov chain
Monte Carlo (MCMC). Despite its importance, the numerical accuracy of
the software for MCMC is largely unknown. In this talk, we discuss our
recent addition of MCMC to the StRD project.
3. "Parameter Design for Measurement Protocols by Latent Variable Methods", Walter Liggett, Statistical Engineering Division, NIST
Abstract: We present an approach to measurement system parameter design that does
not require the values of the experimental units to be known. The approach
does require the experimental units to be grouped in classes, a necessity when protocol
execution alters the unit. A consequence of these classes is that the approach
admits replication. This paper presents maximum likelihood estimates with
a comparison to similar estimates in factor analysis, strategies for noise
factors including those connected with secondary properties of the experimental
units, and Bayesian inference on experimental contrasts through Markov
chain Monte Carlo. The approach is illustrated by solderability measurements
made with a wetting balance.
Session Organizer: Emmanuel Yashchin, IBM Research
Session Chair: Emmanuel Yashchin, IBM Research
1. "Bayesian Modeling of Dynamic Software Reliability Growth Curve
Models", Bonnie Ray, IBM Research
Abstract: We present an extended reliability growth curve model that allows model
parameters to vary as a function of covariate information. In the software
reliability framework, these may include such things as the number of lines
of code for a product throughout the development cycle, or the number of
customer licenses sold over the field life of a product. We then describe
a Bayesian method for estimating the extended model. The use of Bayesian
methods allows the incorporation of historical information and expert opinion
into the modeling strategy, in the form of prior distributions on the parameters.
The use of power priors to combine data-based and qualitative prior information
is also introduced in the growth curve modeling context. Markov chain Monte
Carlo methods are employed to obtain the complete posterior distributions
of the model parameters, as well as other values of interest, such as the
expected time to discover a specified percentage of total defects for a
product. Examples using both simulated and real data are provided.
2. "Wafer Yield Modeling", Asya Takken, IBM Microelectronics,
Mary Wisniewski and Emmanuel Yashchin, IBM Research
Abstract: This paper discusses the problem of yield modeling in multi-layer structures.
Defects observed at various layers are categorized by type, size, and other
characteristics. At the end of the process, the structure is viewed as
a collection of chips that are subject individually to a final test. Given
the above information about individual defects and final test results,
we address a number of statistical issues, such as (a) estimation of defect
rates based on partial data, (b) estimation, for every defect, of the probability
that it turns out to be the "chip-killer", and (c) final yield
estimation. A number of examples are given to illustrate the proposed approach.
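A rough sketch of the kind of yield arithmetic involved (illustrative only; the paper's models are richer): if layer-l defect counts are Poisson with mean lambda_l and each defect independently kills the chip with probability p_l, Poisson thinning gives yield exp(-sum of lambda_l * p_l), which simulation confirms:

```python
import numpy as np

rng = np.random.default_rng(7)

lam = np.array([0.2, 0.1, 0.3])   # hypothetical mean defects per chip, per layer
kill = np.array([0.5, 0.8, 0.2])  # hypothetical "chip-killer" probability per layer

n_chips = 50_000
defects = rng.poisson(lam, size=(n_chips, 3))
# each defect independently kills with its layer's probability;
# a chip yields only if no defect on any layer is a killer
killers = rng.binomial(defects, kill)
empirical_yield = (killers.sum(axis=1) == 0).mean()

# thinning Poisson(lam) defects by kill probability p gives Poisson(lam * p), so
analytic_yield = np.exp(-(lam * kill).sum())
```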
3. "Multivariate system monitoring using nonlinear structural equation
analysis", Yasuo Amemiya, IBM Research
Abstract: Performance and reliability monitoring of a complex system in field operation
is considered. A typical modern system, such as a storage server, has a
hierarchical structure with subsystems and components, and can produce
a large number of measurements related to performance and degradation.
Longitudinal structural equation modeling can be useful in representing
and exploring the multivariate data in a way consistent with the physical
structure. Since such measurements are often in the form of summary counts
or rates, nonlinear models in the multi-response generalized linear model
framework are appropriate. Model fitting and inference procedures are given.
The approach is motivated and illustrated using a problem of monitoring
storage systems.
Session Organizer: Zachary Stoumbos, Rutgers University
Session Chair: Zachary Stoumbos, Rutgers University
1. "The Effects of Process Variation on Multivariate Control Procedures",
Robert L. Mason, Southwest Research Institute, Texas, You-Min Chou, The
University of Texas at San Antonio, and John C. Young, McNeese State University,
Lake Charles, LA
Abstract: An important feature of a multivariate process is the type of variation
that it contains. Two general categories are stationary and non-stationary
variation. This presentation discusses the effects of these two types of
process variation on three multivariate statistical process control procedures.
The methods include the multivariate exponential weighted moving average
(MEWMA), the multivariate cumulative sum (MCUSUM) and Hotelling's T2. Of
particular interest is the effect of uncontrolled components of variation
on these multivariate procedures as these impose on the process such data
characteristics as ramp changes, mean shifts, variable drift, and autocorrelation.
Comparisons of the performance of the above three multivariate procedures
in the presence of these varying types of variation are presented. A key
finding is that the type of process variation can have a significant influence
on the performance of the control procedures, especially in the presence
of non-stationary variation.
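For reference, the Hotelling T2 statistic discussed above is straightforward to compute from the sample mean and covariance; this small sketch uses simulated stationary data (illustrative only, not the authors' study):

```python
import numpy as np

rng = np.random.default_rng(3)

def hotelling_t2(X):
    # T2_i = (x_i - xbar)' S^{-1} (x_i - xbar) for each observation x_i
    xbar = X.mean(axis=0)
    S = np.cov(X, rowvar=False)          # sample covariance (n-1 denominator)
    d = X - xbar
    return np.einsum('ij,jk,ik->i', d, np.linalg.inv(S), d)

# stationary in-control data: 500 observations on p = 3 correlated variables
p = 3
A = rng.normal(size=(p, p))
X = rng.normal(size=(500, p)) @ A
t2 = hotelling_t2(X)
```

A useful sanity check: with estimated parameters the T2 values sum to (n-1)p exactly, so their mean is p(n-1)/n, close to the dimension p for stationary data.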
2. "Comparison of Individual Multivariate Control Charts - A Hybrid Approach
for Process Monitoring and Diagnosis", Wei Jiang, INSIGHT, AT&T
Labs and Kwok-Leung Tsui, Georgia Institute of Technology, Atlanta, GA
Paper
Abstract: Hotelling's T2 chart is one of the most popular multivariate control
charts; it is based on the generalized likelihood ratio test (GLRT) and
assumes no prior information about potential mean deviations. Accordingly,
variable diagnosis from a signaled T2 chart has received considerable attention
in recent research. By highlighting the intrinsic relationship between
multivariate control charts and statistical hypothesis testing, this paper
compares the efficiencies of three widely used control charts: T2 chart,
regression-adjusted chart, and the M chart under different correlation
structures and mean deviations. A hybrid approach is proposed based on
the GLRT and union-intersection test (UIT). The proposal serves as a complementary
tool of the T2 chart and is shown effective in detecting special mean deviation
structures. The proposed chart is further simplified in an adaptive version
which extracts shift information from the realized data to construct the
test statistic.
3. "The Mean Changepoint Formulation for SPC", Douglas M. Hawkins, University
of Minnesota, Minneapolis, MN, and Peihua Qiu, University of Minnesota, Minneapolis, MN Paper
Abstract: Statistical process control depends on various charting approaches (or
equivalents implemented without the graphics) for detecting departures
from an in-control state. One common model for the in-control state is
that process readings appear to be independent realizations of some statistical
distribution with known parameters. The most common example is a normal
distribution with known mean and variance.
In reality, even if there is reason to believe the normality part of this model, the parameters are seldom known with high precision. More commonly the parameter values are estimates obtained from a Phase I sample of measurements taken while the process appeared to be running smoothly.
Over the last 20 years, it has become clear that the random variability in estimates, even those based on quite large Phase I samples, leads to substantial uncertainty about the run length behavior that will result from a chart based on any particular Phase I data set.
An alternative approach which has been used extensively in Phase I settings is the formulation of a changepoint with unknown parameters both before and after any suspected shift. The talk sets out an adaptation of the unknown-parameter changepoint formulation for Phase II, where it must be applied repeatedly. This creates some computational issues, which turn out to be much less daunting than they seem, and some deeper issues of control of run behavior.
We conclude that the resulting method has the potential to supplant both Shewhart and CUSUM or EWMA charts, since its performance is highly robust.
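A bare-bones sketch of the unknown-parameter mean changepoint idea (an illustration, not the authors' Phase II procedure): at each candidate split k, compare the readings before and after k with a two-sample t statistic, neither mean assumed known, and take the maximal value over all splits:

```python
import numpy as np

def changepoint_stat(x):
    # Scan all splits k; both segment means are unknown parameters,
    # estimated from the data on each side of the candidate changepoint.
    x = np.asarray(x, dtype=float)
    n = len(x)
    best_t, best_k = 0.0, 0
    for k in range(2, n - 1):
        a, b = x[:k], x[k:]
        # pooled variance of the two segments
        sp2 = (((a - a.mean()) ** 2).sum() + ((b - b.mean()) ** 2).sum()) / (n - 2)
        if sp2 == 0:
            continue
        t = abs(a.mean() - b.mean()) / np.sqrt(sp2 * (1.0 / k + 1.0 / (n - k)))
        if t > best_t:
            best_t, best_k = t, k
    return best_t, best_k

rng = np.random.default_rng(1)
stable = rng.normal(0.0, 1.0, 60)        # in-control readings
shifted = stable.copy()
shifted[30:] += 2.0                      # 2-sigma mean shift after reading 30
t_stable, _ = changepoint_stat(stable)
t_shift, k_hat = changepoint_stat(shifted)
```

The statistic is both a detector (large maximal t signals a shift) and a diagnostic (the maximizing k estimates when the shift occurred).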
Session Organizer: Bonnie Ray, IBM Research Division
Session Chair: Bonnie Ray, IBM Research Division
1. "Some Successful Approaches to Software Reliability Modeling in Industry", Daniel R. Jeske and Xuemei Zhang, Bell Laboratories, Lucent Technologies
Paper
Abstract: Over the past three years, we have been actively engaged in both software
reliability growth modeling and architecture-based software reliability
modeling at Lucent Technologies. Our goal has been to include software
into the overall reliability evaluation of a product design using either
or both of these two fundamentally different approaches.
During the course of our application efforts to real projects, we have identified practical difficulties with each approach. The application of software reliability growth models is plagued by ad-hoc test environments, and the use of architecture-based software reliability models is plagued by a large number of unknown parameters. In this paper, we discuss our methods for overcoming these difficulties. For the first approach, we show how calibration factors can be defined and used to adjust for the mismatch between the test and operational profiles of the software. For the second approach, we present two useful ways to do sensitivity analyses that help alleviate the problem of too many uncertainties. We illustrate our methods with case studies, and offer comments on further work that is required to more satisfactorily bridge the gap between theory and applications in this research area.
2. "Analyzing the Impact of the Development Activity Allocation on
Software Team Productivity", I. Robert Chiang and Lynn Kuo, University
of Connecticut
Abstract: The difficulty in achieving high productivity in software development
when following sequential process models such as Waterfall suggests that
overlapping tasks (design, coding, and testing) may become desirable.
In this paper, we analyze NASA Software Engineering Laboratory project
data with time series analysis and functional data analysis. Specifically,
we study the effort distribution among design, coding, and testing activities
as well as the degree of overlapping among them. We investigate their impact
on software productivity.
3. "Hierarchical models for software testing and reliability modeling",
Todd Graves, Los Alamos National Laboratory
Session Organizer: Regina Liu, Rutgers University
Session Chair: Regina Liu, Rutgers University
1. "Key Comparisons for International Standards", David Banks,
U.S. Food and Drug Administration
Abstract: Efficient world trade requires that manufacturers in one country have
confidence that their product will meet specifications that are verified
by purchasers in another country. But these trading partners rely upon
different national metrology laboratories to calibrate their equipment
and there is detectable divergence between their measurement systems. This
paper describes statistical methods that enable one to use data from a
network of artifact comparisons to estimate the measurement functions at
each participating laboratory, at both the national and lower levels. These
methods cannot determine which laboratory is most accurate, but do allow
one to predict the value that a given laboratory would obtain on a product
from the value measured at another laboratory in the comparison network.
2. "Shock Models", Juerg Huesler, University of Bern, Switzerland
Abstract: In shock models it is generally assumed that the failure (of the system)
is caused by a cumulative effect of a (large) number of non-fatal shocks
or by a shock that exceeds a certain critical level. We present work on
both types but with variations of the scheme. Specifically, we consider
the case that non-fatal shocks can partly harm the system by weakening
its critical load. For the cumulative model, we consider the case in which
the system does 'recover' from most of the non-fatal shocks after a certain
period of time. We also address various statistical issues which arise
from applications of such a model.
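As a toy rendering of the cumulative model with recovery (purely illustrative; the talk's scheme is more general), let a fixed fraction of accumulated damage heal between shocks and couple the two cases on the same shock sequence:

```python
import numpy as np

def time_to_failure(shocks, threshold, recovery=0.0):
    # Damage accumulates shock by shock; between shocks a fraction
    # 'recovery' of the accumulated damage heals.
    damage = 0.0
    for i, s in enumerate(shocks, start=1):
        damage = damage * (1.0 - recovery) + s
        if damage > threshold:
            return i
    return len(shocks) + 1               # survived the whole sequence

rng = np.random.default_rng(5)
shocks = rng.exponential(1.0, size=1000)  # hypothetical non-fatal shock magnitudes

t_plain = time_to_failure(shocks, threshold=25.0, recovery=0.0)
t_heal = time_to_failure(shocks, threshold=25.0, recovery=0.2)
```

Coupling both runs to the same shocks makes the comparison deterministic: the recovering system's damage is never larger at any step, so it can never fail earlier.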
3. "Text Mining of Massive Report Data and Risk Management",
Dan Jeske, Lucent Technology & Regina Liu, Rutgers University
Abstract: We develop a systematic data mining procedure for exploring large free-style
text databases by automatic means, with the purpose of discovering useful
features and constructing tracking statistics for measuring performance
or risk. The procedure includes text analysis, risk analysis, and nonparametric
inference. We use some aviation safety report repositories from the FAA
and the NTSB to demonstrate our procedure in the setting of general risk
management and decision-support systems. Some specific text analysis methodologies
and tracking statistics are discussed.
Session Organizer: C. Jeff Wu, University of Michigan
Session Chair: John A. Cornell, University of Florida
1. "Operating Window - An Engineering Measure for Robustness",
Don Clausing, MIT.
2. "Failure amplification method (FAMe): an extension of the operating window method", Roshan Joseph, Georgia Tech.
3. Discussants: W. Notz, Ohio State University, Dan Frey, MIT
8. Examples of Adapting Academic Reliability Theory to Messy Industrial
Problems
Session Organizer: Michael Tortorella, Rutgers University
Session Chair: Michael Tortorella
1. "Product or Process?", David A. Hoeflin (AT&T Labs)
Abstract: A well-known and widely used software reliability model, the Goel-Okumoto
model [1,2,3], was derived from a software development process point of
view. As such, the parameters of this model are parameters of the development
process. However, this process model is commonly used to make inferences
about a specific output product of the development process. As such, this
model's parameters are not specifically for the product under test. We
derive the appropriate product (conditional) model and methods, e.g., MLE
normal equations, for use when examining a specific product. This product
model corrects shortcomings that occur when using the process model in
making inferences about the product under test as noted in [4]. The difference
between the two models is clearly shown in the time-limiting behavior and
in the confidence regions for the models' parameters. Notably, as
testing time increases to infinity, the variance in the number of defects
in the product goes to zero for the product model, as one would naturally
expect. As this model uses similar assumptions as the Goel-Okumoto model
and in some ways is simpler, the product (conditional) model can be readily
applied in place of the process (unconditional) model.
1. Jelinski, Z., and P. B. Moranda, "Software Reliability Research," in Statistical Computer Performance Evaluation, ed. Freiberger, New York: Academic Press, pp. 465-497.
2. Goel, A. L., and K. Okumoto, "Time-Dependent Error Detection Rate Model for Software Reliability and Other Performance Measures," IEEE Transactions on Reliability, 28, pp. 206-211.
3. Musa, J., "A Theory of Software Reliability and its Application," IEEE Transactions on Software Engineering, SE-1(3), pp. 312-327.
4. Jeske, D. R. and H. Pham (2001), "On the Maximum Likelihood Estimates for the Goel-Okumoto Software Reliability Model", The American Statistician, 55(3), pp. 219-222.
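For readers unfamiliar with the Goel-Okumoto model referenced above: its mean value function is m(t) = a(1 - exp(-bt)), and the MLE normal equations give a in closed form once b is found by a one-dimensional root search. A self-contained sketch on simulated data (not the paper's code; all numbers are invented):

```python
import numpy as np

def simulate_go(a, b, T, rng):
    # NHPP with mean value function m(t) = a * (1 - exp(-b t)):
    # N ~ Poisson(m(T)); given N, times are iid truncated exponentials on [0, T]
    m_T = a * (1.0 - np.exp(-b * T))
    n = rng.poisson(m_T)
    u = rng.random(n)
    return -np.log(1.0 - u * (1.0 - np.exp(-b * T))) / b

def goel_okumoto_mle(times, T):
    # Normal equations: a = n / (1 - exp(-bT)) in closed form, and b solves
    # n/b - n*T*exp(-bT)/(1 - exp(-bT)) - sum(t_i) = 0 (decreasing in b)
    times = np.asarray(times, dtype=float)
    n, s = len(times), times.sum()

    def g(b):
        e = np.exp(-b * T)
        return n / b - n * T * e / (1.0 - e) - s

    lo, hi = 1e-6, 50.0
    for _ in range(200):                 # bisection on g(b) = 0
        mid = 0.5 * (lo + hi)
        lo, hi = (mid, hi) if g(mid) > 0 else (lo, mid)
    b_hat = 0.5 * (lo + hi)
    a_hat = n / (1.0 - np.exp(-b_hat * T))   # expected total number of defects
    return a_hat, b_hat

rng = np.random.default_rng(2)
times = simulate_go(a=100.0, b=1.0, T=5.0, rng=rng)
a_hat, b_hat = goel_okumoto_mle(times, T=5.0)
```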
2. "Practical experiment design for protecting against masked failure mechanisms
in accelerated testing", Michael LuValle (OFS Laboratories)
Abstract: When dealing with a new technology, the biggest fear of a reliability engineer is that there is a "masked" failure mechanism. That is, that observed failures result from a mechanism that is easily accelerable, and the failures from the easily accelerable mechanism censor or "mask" the failures from a less accelerable mechanism. We've observed such masking in the degradation of optical components in specially doped glasses. In the IC community, stress voiding comes to mind (a failure mechanism that masked itself). Some simple theory and tools for attacking this problem are described in this talk.
3. "System Reliability Redundancy Allocation with Risk-Averse Decision
Makers", David W. Coit (Rutgers)
Abstract: Problem formulations are presented when there are multiple design objectives:
(1) to maximize an estimate of system reliability, and (2) to minimize
the variance of the reliability estimate. This is a sound and risk-averse
formulation that reflects the actual needs of system designers and users.
Designs with the highest estimate of system reliability may actually be
a poor choice for risk-averse applications because of an associated high
variance. More realistic approaches to the problem are presented to incorporate
the risk associated with the reliability estimate. System design optimization
then becomes a multi-criterion optimization problem. Instead of a unique
optimal solution, a Pareto optimal set of solutions is obtained to consider
as a final preferred design. An example is solved using nonlinear integer
programming methods. The results indicate that very different designs are
obtained when the formulation is changed to incorporate minimization of
reliability estimation variance as a design objective.
Session Organizer: Asya Takken, IBM Microelectronics Division
Session Chair: Asya Takken, IBM Microelectronics Division
1. "Case Studies of Batch Processing Experiments", Diane Michelson,
Sematech Paper
Abstract: Experimentation in the semiconductor industry requires clever design
and clever analysis. In this paper, we look at two recent experiments performed
at ISMT. The first is a strip plot design of 3 factors over 3 process steps.
The second is a split plot design at a polish operation. The importance
of using the correct error terms in testing the model will be discussed.
2. "Using Supersaturated Experiment Designs for Factor Screening and Robustness Analysis in the Design of a Semiconductor Clock Distribution Circuit", Duangporn Jearkpaporn, Arizona State University, Steven A. Eastman,
Intel Corporation, Gerardo Gonzalez-Altamirano, Intel Corporation, Don
R. Holcomb, Honeywell Engines and Systems, Alejandro Heredia-Langner, Pacific
Northwest National Laboratory, Connie M. Borror, Arizona State University,
Douglas C. Montgomery, Arizona State University
Abstract: Electrical signals traveling through complex semiconductor microprocessor
circuits need to be synchronized to minimize idle time and optimize process
performance. The global clock distribution topology and its physical characteristics
are utilized to ensure that delays in the transmission of clock signals
are kept to a minimum. Estimation of clock skew, the delay between clock
signals that are intended to be synchronized, is often done by computer
simulation of the circuit of interest. This is not an easy task: the
number of factors involved is too large to be studied exhaustively, because
each simulation can be quite long and the number of simulations required
would be too enormous to be practical. Traditional experiment designs and
analysis of variance models cannot always be utilized successfully to solve
this problem because they tend to be too simplistic or too resource consumptive
for all but the smallest circuit designs. This paper demonstrates the relative
efficiency of using various sizes of supersaturated experiment designs
(SSDs) to gain an understanding of circuit performance. Methods of analyzing
SSD experiment data are compared to determine which method is more reliable
at determining factors that affect circuit performance. The SSD data can
also be used to identify which circuit performance measures are least robust
to variation in the experiment factors.
3. "Spatial Mixture Models and Detecting Clusters with Application to IC Fabrication", Vijay Nair and Lian Xu, University of Michigan / Bristol-Myers Squibb.
Abstract: It is well-known that wafer map data in IC fabrication exhibit a considerable
degree of spatial clustering. The spatial patterns of defects can be usefully
exploited for process improvement. In this talk, we consider methods for
modeling and detecting spatial patterns from data on lattices. The data
are viewed as being generated from spatial mixture models, and a Bayesian
hierarchical approach with priors given by suitable Markov random field
models is described. The ICM algorithm is used to recover the defect-clustering
patterns. Gibbs sampling is used to compute posterior distributions of
parameters of interest. The results are illustrated on some real data sets.
Session Organizer: Zachary Stoumbos, Rutgers University
Session Chair: Zachary Stoumbos, Rutgers University
1. "A General Approach to the Monitoring of Process and Product Profiles",
William Woodall, Virginia Polytechnic Institute and State University, Blacksburg,
VA, and Douglas M. Montgomery, Arizona State University, Tempe, AZ Paper
Abstract: In many practical situations, the quality of a process or product is
characterized by a relationship between two variables instead of by the
distribution of a univariate quality characteristic or by the distribution
of a vector consisting of several quality characteristics. Thus, each observation
consists of a collection of data points that can be represented by a curve
(or profile). In some calibration applications, the profile is simply a
straight line while in other applications the relationship is much more
complicated. In this presentation, we discuss some of the general issues
involved in monitoring process and product profiles. We relate this application
to functional data analysis and review applications involving linear profiles,
nonlinear profiles, and the use of splines and wavelets.
2. "An Approach to Detection of Changes in Attribute Data", Emmanuel
Yashchin, IBM Research Paper
Abstract: We consider situations where the observed data is in the form of a contingency
table and the underlying table parameters are subject to abrupt changes
of unpredictable magnitude at some unknown points in time. We derive change-point
detection schemes based on the Regenerative Likelihood Ratio (RLR) approach
and develop procedures for their design and analysis. We also discuss problems
related to estimation of the contingency table characteristics in the presence
of abrupt changes and give a number of examples related to the semiconductor
industry.
3. TBD
Session Organizer: Christopher Stanard, GE Global Research
Session Chair: Christopher Stanard, GE Global Research
1. "Designing Six Sigma Assemblies", Narendra Soman, GE Global
Research
Abstract: Prediction and control of assembly alignments & clearances are critical
to product performance and fit-up. We will introduce QPATS - Quantitative
Producibility Analysis Tool Set - developed in GE to store and retrieve
statistical models of manufacturing process capability. We present a case
study using QPATS in combination with Monte Carlo analysis to simulate
the assembly of an aircraft engine module. The assembly model contains
over 50 parts & subassemblies, with more than 300 critical assembly
features, and also models the real-life assembly sequence. 3-D Monte Carlo
assembly simulation is used to "build" thousands of "virtual"
aircraft engines, and predict the statistical variation of the clearances.
The 3-D analysis provides better accuracy than traditional 1-D tolerance
"stack-up" analysis, and allows designers to create multiple
what-if scenarios to study the impact of tolerances and datum schemes.
Linking the analysis to QPATS helps ensure that design tolerances are consistent
with manufacturing capability.
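The kind of Monte Carlo clearance prediction described can be sketched in a few lines (a toy stack-up with invented dimensions, not QPATS or the engine model): draw each feature from its process-capability distribution, "build" many virtual assemblies, and read off the clearance distribution:

```python
import numpy as np

rng = np.random.default_rng(42)
n_builds = 100_000

# "build" 100,000 virtual assemblies: clearance is a housing gap
# minus the sum of three stacked part thicknesses (all values in mm)
gap = rng.normal(30.00, 0.05, n_builds)             # hypothetical housing gap
parts = rng.normal(9.90, 0.04, size=(n_builds, 3))  # three hypothetical parts
clearance = gap - parts.sum(axis=1)

mc_mean = clearance.mean()
mc_sd = clearance.std()
rss_sd = np.sqrt(0.05**2 + 3 * 0.04**2)   # root-sum-square (statistical) prediction
shortfall = (clearance < 0.0).mean()      # fraction of builds with interference
```

For a linear stack like this, the Monte Carlo spread matches the root-sum-square prediction; the simulation becomes indispensable once the assembly sequence and datum schemes make the response nonlinear.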
2. "Incremental performance improvement during development of diagnostic
algorithms", Kai Gobel, GE Global Research
Abstract: We examine the design of classifier fusion algorithm development where
both the classifiers and the fusion algorithm are unknown. To properly
guide algorithm development, it is desirable to evaluate algorithm performance
throughout its design. Because the design space may be of considerable
complexity, exhaustive simulation of the entire design space may not be
practical. Another issue is the unknown classifier output distribution.
Using the wrong distribution may lead to fusion tools that are either overly
optimistic or otherwise distort the outcome. Either case may lead to a
fuser that performs suboptimally in practice. We study an algorithm development
that is guided by a two-stroke design of experiment setup and evaluate
different classifier distributions on the fused outcome. We show results
from an application of diagnostic classifier fusion algorithm development
for diagnosing gas path faults on aircraft engines.
3. "Development of Laser Welding of Electrodes for Ceramic Metal Halide
Arc Tubes", Marshall Jones, GE Global Research
Abstract: Ceramic Metal Halide (CMH) Lamps have attractive properties that proved
useful in a number of applications. The properties of the lamp, the quality
of the light produced, and reliability and durability of the product are
all impacted by the process of manufacture. The DFSS and Six Sigma approach
was used in developing a new laser welding process for CMH Arc Tubes which
has resulted in an overall improved product.
Session Organizer: Diane Lambert, Bell Labs
Session Chair: Diane Lambert, Bell Labs
1. "Statistical Technology Transfer through Software Components",
David James, Duncan Temple Lang and Scott Vander Wiel, Bell-Labs, Lucent
Technologies
Abstract: For more than 30 years, the S language (in its various implementations)
has been the main vehicle for statistical technology transfer from Bell
Labs to the various engineering and business communities in the corporation.
In the last few years, however, corporate end-user computing has become
almost exclusively Microsoft-based thus shifting users' preferences towards
Excel and Excel-like tools for data analysis. Transferring new statistical
methodology through S to such users has become increasingly difficult.
In this paper we describe an early implementation of software components
for distributed statistical computing implemented in the S language. These
software components provide a transparent ActiveX (COM) interface between
traditional S objects and methods and Microsoft's ActiveX clients, such
as Excel, Access, and corporate applications such as Business Objects (www.businessObjects.com),
Oracle, etc.
As an example, we describe how R (www.r-project.org) is being used through
an ActiveX interface in a manufacturing setting to orchestrate data extraction
from corporate databases, data visualization, modeling, and final reporting
into Excel workbooks.
2. "Setting Specifications and Adjustment Rules in Multistage Production",
Suresh Goyal and Scott Vander Wiel, Bell-Labs, Lucent Technologies
Abstract: Optical-electronic circuit packs are assembled and tested in multiple
stages. Test engineers set adjustment rules and test specifications for
the initial production stages with the goal of meeting specs set by design
engineers at the final stage of testing. These adjustment rules and initial-stage
specs affect yield at the final stage. I will present a case where variation
in optical power at the final inspection was too high and needed to be
reduced. We formulated a random effects model that accounts for several
sources of power variation and fit the model to six months of historical
production data. The analysis decomposes variation into components including
measurement error, adjustment error, test-set calibration and pack-to-pack
differences. Follow-up experiments verified an unexpectedly large estimate
of measurement error. The fitted model was used to formulate a better measurement
procedure at one production stage and better adjustment rules and specs
at another stage so as to improve yield at the final stage of testing.
3. "Reliability Estimation From Accelerated Degradation and Failure
Data", Diane Lambert, Chuanhai Liu, and Scott Vander Wiel, Bell-Labs,
Lucent Technologies
Abstract: Leading edge components with high reliability requirements may have to
be deployed after limited testing. Their reliability is usually estimated
by exposing components to high stress conditions and modeling failure rate
as a function of stress. But the standard failure rate estimates can be
misleading if few components fail, uncertainty in the parameters of the
stress model is ignored, or component variability across manufacturing
lots is high. This talk describes a model for estimating reliability that
accommodates these sources of uncertainty when degradation measurements
as well as failure counts can be obtained. An application to laser diodes
is given.
Session Organizer: Ramon Leon, University of Tennessee
Session Chair: Ramon Leon, University of Tennessee
1. "Bayesian Reliability Testing for New Generation Semiconductor
Processing Equipment", Michael Pore and Paul Tobias (Sematech, Ret.)
Paper
Abstract: New multi-million dollar semiconductor processing tools need to be developed
and deployed rapidly to match the aggressive industry roadmap for chip
technologies and manufacturing productivity. In particular, a completely
new set of tools was recently introduced for use in factories designed
to handle 300 mm wafers (the previous largest wafer size was 200 mm). International
SEMATECH (ISMT) was asked by its member companies to work with vendors
to qualify many of these new tools and provide assurance that they could
meet factory reliability (i.e., Mean Time Between Failure) goals. Since
time was short, and many of these tools existed only as single prototype
models, Bayesian reliability testing techniques were chosen to provide
the most information in the shortest time. This talk will review the theory
used to plan and analyze Bayesian reliability tests and test data. The
practical problems of making use of prior information of different types
and quality levels when specifying the Bayesian Gamma priors will also
be discussed, along with some of the solutions that were implemented by
ISMT statisticians and engineers.
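The Gamma-prior updating the abstract alludes to can be sketched in a few lines. This is a toy illustration, not ISMT's procedure; the prior parameters, test hours, and MTBF goal below are invented for illustration:

```python
import math

def gamma_cdf(x, shape, rate):
    """Regularized lower incomplete gamma P(shape, rate*x) via its power series."""
    t = rate * x
    if t <= 0:
        return 0.0
    term = 1.0 / shape
    total = term
    n = shape
    for _ in range(500):
        n += 1.0
        term *= t / n
        total += term
        if term < total * 1e-12:
            break
    return total * math.exp(shape * math.log(t) - t - math.lgamma(shape))

def posterior_prob_mtbf_meets_goal(a0, b0, failures, test_hours, mtbf_goal):
    """Gamma(a0, b0) prior on the failure rate; Poisson failure count in test_hours.
    Returns P(MTBF >= goal | data) = P(lambda <= 1/goal | data)."""
    a_post, b_post = a0 + failures, b0 + test_hours
    return gamma_cdf(1.0 / mtbf_goal, a_post, b_post)

# Weak prior centered at the 500-hour MTBF goal; 1 failure observed in 2000 test hours.
p = posterior_prob_mtbf_meets_goal(a0=1.0, b0=500.0, failures=1,
                                   test_hours=2000.0, mtbf_goal=500.0)
```

Because the Gamma prior is conjugate to the Poisson failure count, the posterior is available in closed form, which is what makes short tests on single prototypes informative.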
2. "Physical Science Case Studies in Bayesian Analysis Using WinBUGS", William Guthrie, Statistical Engineering Division, NIST
Abstract: High-level software for Markov Chain Monte Carlo (MCMC) Simulation has
made Bayesian analysis increasingly accessible to applied statisticians
and other scientists over the last several years. BUGS, Bayesian inference
Using Gibbs Sampling, and its GUI-driven, Windows counterpart, WinBUGS,
are two popular computer packages for this type of analysis available on
the Web. Statisticians or other scientists who wish to explore the use
of BUGS or WinBUGS for their applications are encouraged by the software's
authors to follow the many examples included in the documentation to get
started. However, of the thirty-seven case studies included in the documentation,
only three are physical science applications, making the examples less
convenient for statisticians working in the physical sciences than for
those in biostatistics. To help make these software packages as accessible
as possible, several physical science case studies in Bayesian analysis
using WinBUGS will be demonstrated in this talk.
3. "Bayesian Modeling of Accelerated Life Tests with Heterogeneous
Test Units", Avery J. Ashby, Ramon V. Leon, Jayanth Thyagarajan, University
of Tennessee, Knoxville Paper
Abstract: WinBUGS is a software program for Bayesian analysis of complex statistical
models using Markov chain Monte Carlo (MCMC) techniques. We show how the
models supported by the program can be used to model data obtained from
accelerated life tests where there are both random and fixed effects. We
illustrate the approach by predicting life of Kevlar fiber based on an
accelerated life test where in addition to the stress there is a random
spool effect. The talk demonstrates that Bayesian modeling using MCMC can
be used to fit more realistic models for accelerated life tests than those
that have been traditionally considered.
Session Organizer: Susan Albin, Rutgers University
Session Chair: Susan Albin, Rutgers University
Speaker: Andrew Moore, Carnegie Mellon University
Abstract: How can we exploit massive data without resorting to statistically dubious
computational compromises? How can we make routine statistical analysis
sufficiently autonomous that it can be used safely by non-statisticians
who have to make very important decisions?
We believe that the importance of these questions for applied statistics
is increasing rapidly. In this talk we will fall well short of answering
them adequately, but we will show a number of examples where steps in the
right direction are due to new ways of geometrically preprocessing the
data and subsequently asking it questions.
The examples will include biological weapons surveillance, high throughput
drug screening, cosmological matter distribution, sky survey anomaly detection,
self-tuning engines and accomplice detection. We will then discuss the
unifying algorithms which allowed these systems to be deployed rapidly
and with relatively large autonomy. These involve a blend of geometric
data structures such as Omohundro's Ball Trees, Callahan's Well-Separated
Pairwise Decomposition, and Greengard's multipole methods in conjunction with new
search algorithms based on "racing" samples of data, All-Dimensions
Search over aggregates of records and a kind of higher-order divide and
conquer over datasets and query sets.
BIO: Andrew Moore is the A. Nico Habermann associate professor of Robotics
and Computer Science at the School of Computer Science, Carnegie Mellon
University. His main research interests are reinforcement learning and
data mining, especially data structures and algorithms to allow them to
scale to large domains. The Auton Lab, co-directed by Andrew Moore and
Jeff Schneider, works with Astrophysicists, Biologists, Marketing Groups,
Bioinformaticists, Manufacturers and Chemical Engineers. He is funded partly
from industry, and also thanks to research grants from the National Science
Foundation, NASA, and more recently from the Defense Advanced Research
Projects Agency to work on data mining for Biosurveillance and for helping
intelligence analysts.
Andrew began his career writing video-games for an obscure British personal
computer. He rapidly became a thousandaire and retired to academia, where
he received a PhD from the University of Cambridge in 1991. He researched
robot learning as a Post-doc working with Chris Atkeson, and then moved
to a faculty position at Carnegie Mellon.
Session Organizer: Michael Baron, University of Texas, Dallas and IBM Research
Session Chair: Michael Baron
1. "On sequential determination of the number of computer simulations", Nitis Mukhopadhyay, University of Connecticut Paper
Abstract: The size of replication in many reported computer simulations is predominantly
fixed in advance and quite often fairly arbitrarily for that matter. But,
since computer simulations normally proceed in a sequential manner, it
makes good sense to expect that the rich repertoire from the area of sequential
analysis should play a central role in "deciding" the "appropriate" number
of computer simulations that one must run in a given scenario. Obviously
there cannot be one unique way of determining the number of computer simulations
that will perhaps be satisfactory in every conceivable scenario. Any useful
"optimal" determination must instead take into account the particular simulation
problem at hand, how the "accuracy" of a process is measured, how much accuracy
a consumer demands in the final output derived via simulations, how much
unit "error" would "cost", as well as a myriad of other practical considerations
and concerns.
We focus on a specific statistical problem and partially walk through this maze. We explore how we may zero in on some "appropriate" number of computer simulations in order to achieve a pre-specified "quality" of the final output derived from associated simulations. We will highlight some plausible loss functions and explore a few very reasonable sequential strategies that would lead to approximately "optimal" determination of the number of simulations. The goal of this presentation is to emphasize how the theory and methodologies of sequential analysis can potentially contribute to, and indeed significantly enrich, any field of investigation that uses computer simulations.
2. "Balanced Randomization Designs and Classical Probability Distributions",
Andrew L. Rukhin, University of Maryland Paper
Abstract: The talk compares the properties of the two most commonly used balanced randomization
schemes with several treatments. Such sequential schemes are common in
clinical trials, load balancing in computer file storage, etc. According
to the first scheme, the so-called truncated multinomial randomization
design, the allocation process starts with the uniform distribution, until
a treatment receives the prescribed number of subjects, after which this
uniform distribution switches to the remaining treatments, and so on. The
limiting joint distribution of the moments at which a treatment receives
the given number of subjects is found. The second scheme, the random allocation
rule, selects one of any equally likely assignments of the given number
of subjects per treatment. Its limiting behavior is shown to be quite different
from that of the truncated multinomial design. Formulas for the accidental
bias and for the selection bias of both procedures are derived, and the
large sample distribution of standard permutation tests is obtained. The
relationship to classical probability theory is discussed.
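Both allocation schemes are easy to simulate side by side. This is a toy sketch; the number of arms and the per-arm quota below are illustrative:

```python
import random

def truncated_multinomial(n_per_arm, arms, rng):
    """Truncated multinomial design: assign uniformly among the arms that
    have not yet reached their quota; an arm drops out once it is full."""
    counts = [0] * arms
    seq = []
    while len(seq) < n_per_arm * arms:
        open_arms = [a for a in range(arms) if counts[a] < n_per_arm]
        a = rng.choice(open_arms)
        counts[a] += 1
        seq.append(a)
    return seq

def random_allocation(n_per_arm, arms, rng):
    """Random allocation rule: one uniformly chosen permutation of the
    fully balanced assignment vector."""
    seq = [a for a in range(arms) for _ in range(n_per_arm)]
    rng.shuffle(seq)
    return seq

rng = random.Random(1)
s1 = truncated_multinomial(5, 3, rng)   # 3 treatments, 5 subjects each
s2 = random_allocation(5, 3, rng)
```

Both sequences end up perfectly balanced; the talk's point is that the *order* in which arms fill, and the resulting accidental and selection biases, differ between the two schemes.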
3. "Asymptotic Analysis of Bayesian Quickest Change Detection Procedures", Venugopal V. Veeravalli, University of Illinois at Urbana-Champaign, Alexander Tartakovsky, University of Southern California and Michael Baron, University of Texas, Dallas and IBM Research Paper
Abstract: The optimal procedure for detecting changes in independent
and identically distributed (i.i.d.) observation sequences in a Bayesian
setting was derived by Shiryaev in the 1960s. However, the analysis
of the performance of this procedure in terms of the average detection
delay and false alarm probability has been an open problem. We investigate
the performance of Shiryaev's procedure in an asymptotic setting where
the false alarm probability goes to zero. The asymptotic study is performed
not only in the i.i.d. case where Shiryaev's procedure is optimal but
also in general, non-i.i.d. cases and for continuous-time observations.
In the latter cases, we show that Shiryaev's procedure is asymptotically
optimal under mild conditions. The results of this study are shown to be
especially important in studying the asymptotics of decentralized quickest
change detection procedures for distributed sensor systems.
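For the i.i.d. case the abstract refers to, Shiryaev's stopping rule can be sketched as follows. This is an illustrative toy with a Gaussian mean shift; the geometric prior parameter, shift size, and threshold are invented for illustration and are not the authors' setup:

```python
import math, random

def shiryaev_stop(xs, rho, mu0, mu1, sigma, alpha):
    """Shiryaev's Bayesian rule: geometric(rho) prior on the change time,
    N(mu0, sigma^2) -> N(mu1, sigma^2) mean shift. Stops the first time
    the posterior probability of a change exceeds 1 - alpha."""
    p = 0.0
    for n, x in enumerate(xs, start=1):
        # likelihood ratio f1(x)/f0(x) for a Gaussian mean shift
        lr = math.exp((x - (mu0 + mu1) / 2) * (mu1 - mu0) / sigma ** 2)
        num = (p + (1 - p) * rho) * lr
        p = num / (num + (1 - p) * (1 - rho))
        if p >= 1 - alpha:
            return n, p
    return None, p

random.seed(0)
data = [random.gauss(0, 1) for _ in range(50)] + [random.gauss(2, 1) for _ in range(50)]
stop, post = shiryaev_stop(data, rho=0.01, mu0=0.0, mu1=2.0, sigma=1.0, alpha=0.01)
```

The threshold 1 - alpha controls the false alarm probability, and the average delay after the true change time is the quantity whose asymptotics (as alpha goes to zero) the talk analyzes.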
Session Organizer: Susan Albin, Rutgers University
Session Chair: Susan Albin, Rutgers University
1. "Optimal Adjustment Of A Process Subject To Unknown Setup Errors
Under Quadratic Off-Target And Fixed Adjustment Costs", Zilong Lian
and Enrique Del Castillo, The Pennsylvania State University
Abstract: Consider a machine that can start production off-target where the initial
offset is unknown and unobservable. The goal is to determine the optimal
series of machine adjustments {U_t} that minimize the expected value of
the sum of quadratic off-target costs and fixed adjustment costs. Apart
from the unknown initial offset, the process is assumed to be in a state
of statistical control, and the underlying production process is assumed
to be of a discrete-part nature. We show, using a dynamic programming formulation
based on the Bayesian estimation of all unknown process parameters, that
the optimal policy is of a deadband form where the width of the deadband
is time-varying. Computational results and implementation issues are presented.
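A deliberately simplified policy of this flavor can be simulated as follows. This toy sketch uses a constant deadband width, whereas the talk derives a time-varying optimal width; all parameter values are illustrative:

```python
import random

def deadband_adjust(horizon, true_offset, sigma, prior_var, limit, seed=0):
    """Toy deadband adjustment policy: estimate the unknown setup offset by
    its Bayesian posterior mean (normal prior, known noise variance) and pay
    for an adjustment only when the estimate leaves the deadband
    [-limit, limit]. After an adjustment the residual offset is the previous
    estimation error, so the posterior variance carries over."""
    rng = random.Random(seed)
    offset = true_offset
    mean, var = 0.0, prior_var          # prior on the remaining offset
    adjustments, off_target = 0, 0.0
    for _ in range(horizon):
        y = offset + rng.gauss(0, sigma)    # observed deviation from target
        off_target += y * y                 # quadratic off-target cost
        k = var / (var + sigma ** 2)        # Kalman-style gain
        mean, var = mean + k * (y - mean), (1 - k) * var
        if abs(mean) > limit:               # outside deadband: adjust
            offset -= mean
            adjustments += 1
            mean = 0.0
    return adjustments, off_target

adj, cost = deadband_adjust(horizon=200, true_offset=3.0, sigma=1.0,
                            prior_var=4.0, limit=0.5)
```

The tension the optimal policy resolves is visible even in this sketch: a narrow deadband incurs many fixed adjustment costs, while a wide one accumulates quadratic off-target cost.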
2. "Process-oriented Tolerancing for Quality Improvement in Multi-station
Assembly Systems", Dariusz Ceglarek, Wenzhen Huang, University of
Wisconsin - Madison
Abstract: In multi-station manufacturing systems, the quality of final products
is significantly affected by both product design and process variables.
Historically, however, tolerance research has focused primarily on allocating
tolerances based on the product design characteristics of each component. Currently,
there are no analytical approaches to optimally allocate tolerances to
integrate product and process variables in multi-station manufacturing
processes at minimum costs. The concept of process-oriented tolerancing
expands the current tolerancing practices, which bound errors related to
product variables, to explicitly include process variables. The resulting
methodology extends the concept of "part interchangeability" into "process
interchangeability," which is critical given increasing requirements related
to supplier selection and benchmarking. The proposed methodology is
based on the development and integration of three models: tolerance-variation
relation, variation propagation, and process degradation. The tolerance-variation
model is based on a pin-hole fixture mechanism in multi-station assembly
processes. The variation propagation model utilizes a state space representation
but uses a station index instead of a time index. Dynamic process effects
such as tool wear are also incorporated into the framework of process-oriented
tolerancing, which provides the capability to design tolerances for the
whole life-cycle of a production system. Tolerances of process variables
are optimally allocated through solving a nonlinear constrained optimization
problem. An industry case study is used to illustrate the proposed approach.
3. "Robust Optimization of Experimentally Derived Objective Functions",
Susan Albin, Rutgers University and Di Xu, American Express Paper
Abstract: In the design or improvement of systems and processes, the objective
function is often a performance response surface estimated from experiments.
A standard approach is to identify the levels of the design variables that
optimize the estimated model. However, if the estimated model varies from
the true model due to random error in the experiment, the resulting solution
may be quite far from optimal. We consider all points in the confidence
intervals associated with the estimated model and construct a minimax deviation
model to find a robust solution that is resistant to the error in the estimated
empirical model. We prove a reduction theorem to reduce the optimization
model to a tractable, finite, mathematical program. The proposed approach
is applied to solve for a robust order policy in an inventory problem and
is compared with the canonical approach using a Monte Carlo simulation.
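The minimax-deviation idea can be sketched on a one-variable toy model. This is an illustration under invented coefficient intervals, not the authors' reduction theorem:

```python
import itertools

def minimax_deviation(candidates, coef_boxes, f):
    """Minimax-deviation sketch: the coefficients of an experimentally fitted
    model are only known up to confidence intervals (coef_boxes). For each
    corner scenario, a candidate's regret is its objective minus the best
    achievable objective under that scenario; the robust choice minimizes
    the worst-case regret over all corners."""
    corners = list(itertools.product(*coef_boxes))
    # best objective value attainable under each corner scenario
    best = {c: min(f(x, c) for x in candidates) for c in corners}

    def worst_regret(x):
        return max(f(x, c) - best[c] for c in corners)

    return min(candidates, key=worst_regret)

# toy fitted cost model f(x) = b0 + b1*x + b2*x^2 with interval-valued b1, b2
f = lambda x, c: c[0] + c[1] * x + c[2] * x * x
boxes = [(10.0, 10.0), (-4.0, -2.0), (0.8, 1.5)]   # (lo, hi) per coefficient
xs = [i / 10 for i in range(0, 41)]                # candidate settings 0..4
x_robust = minimax_deviation(xs, boxes, f)
```

Plugging in only the point estimates would chase one scenario's optimum; the robust setting hedges between the optima implied by the interval endpoints.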
Session Organizer: Radu Neagu, GE Research
Session Chair: Radu Neagu, GE Research
1. "Application of Six Sigma to Corporate Finance", Roger Hoerl, GE Global Research Paper
Abstract: Numerous authors, such as Dusharme* (2003), have noted that the vast majority
of Six Sigma applications have occurred in manufacturing and engineering.
However, Six Sigma is a generic improvement methodology, and in theory
should be applicable to improve any activity. This talk will discuss the
application of Six Sigma to corporate finance. It is based on the author's
experience as Quality Leader of the Corporate Audit Staff of GE, a division
of corporate finance. A brief introduction to corporate finance will be
provided. Next, some of the unique aspects of corporate finance, relative
to manufacturing and engineering, will be discussed. Real Six Sigma applications
and general strategies for applying Six Sigma in corporate finance will
then be reviewed. The intent is to lay the groundwork for the remaining
talks in this session, which focus on detailed case studies in finance.
*Dusharme, D., "Six Sigma Survey", Quality Digest, February 2003, 24-32.
2. "Fair Valuation of Employee Stock Options", Antonio Possolo and Brock Osborn, GE Global Research Paper
Abstract: Current and proposed accounting standards, both national and international,
suggest that the value of employee stock options should be estimated and
recorded as expense in corporate financial statements. Since these options
are rather different from stock options that are traded on exchanges (for
example, they are subject to vesting and forfeiture rules, may remain alive
for as long as fifteen years, and cannot be traded), their valuation calls
for non-standard methodology. In addition, the patterns of forfeiture and
early exercise that are observed empirically also should play a role in
their fair valuation. In this presentation, we review the approach that
we have been developing at General Electric to value options awarded to
employees, with emphasis on its components that involve probabilistic modeling
and statistical data analysis.
3. "From Corporate Default Prediction to Market Efficiency: A Case Study", Radu Neagu
and Roger Hoerl, GE Global Research. Paper
Abstract: The progression of a corporation from a status of financial stability
into the status of financial distress usually happens over relatively long
periods of time, raising the opportunity of identifying these "falling"
corporations ahead of time. Consequently, our goal is to provide portfolio
managers with an early notice of deteriorating financial status for a given
corporation so that business decisions can be taken to mitigate loss. There
are different techniques for estimating the likelihood that a corporation
will go into financial default. We consider a model built using equity
inferred probability of default (PD) metrics (Merton* 1996) where we construct
a 2-dimensional risk space for estimating the likelihood that a company
will experience financial default in the near future. For those companies
whose financial risk seems to be improving over time in the 2-dimensional
space, we analyze the effect of past company PD behavior on future PD status.
In an efficient market, this past behavior would have no effect on future
status (beyond the information conveyed by the current PD level). Our findings
will, for this particular case, disprove the efficient market hypothesis.
We work on a dataset formed of North American non-financial publicly traded
companies and we use a publicly available definition of corporations in
financial default. We conduct our study using CART analysis, logistic regression
and Markov-Chain type transition probabilities analysis.
*R.C. Merton, Continuous-Time Finance; Blackwell Publishers, Revised Edition,
1996
Session Organizer: Chid Apte, IBM Research Division
Session Chair: Chid Apte, IBM Research Division
1. "The Challenges in Improving Business Processes with Data Mining",
Vasant Dhar (NYU Stern School of Business)
Abstract: Data Mining is a compelling proposition for business. There is enough
evidence that the core technology for finding patterns in data works. Dozens
of case studies across industries document the fact that sensible problem
formulations coupled with analyses from real data are valuable to decision
makers. However, the adoption of data mining within businesses is still
relatively low. In this talk, I describe the hurdles that must be overcome
in making data mining an inherent part of doing business. In particular,
I describe the obstacles in each part of the data knowledge value chain,
and the challenges in making data mining an inherent aspect of business
processes.
2. "Algorithms for Efficient Statistical Data Mining", Andrew
Moore, Carnegie Mellon University.
Abstract: How can we exploit massive data without resorting to statistically dubious
computational compromises? How can we make routine statistical analysis
sufficiently autonomous that it can be used safely by non-statisticians
who have to make very important decisions? We believe that the importance
of these questions for applied statistics is increasing rapidly. In this
talk we will fall well short of answering them adequately, but we will
show a number of examples where steps in the right direction are due to
new ways of geometrically preprocessing the data and subsequently asking
it questions.
The examples will include biological weapons surveillance, high throughput
drug screening, cosmological matter distribution, sky survey anomaly detection,
self-tuning engines and accomplice detection. We will then discuss the
unifying algorithms which allowed these systems to be deployed rapidly
and with relatively large autonomy. These involve a blend of geometric
data structures such as Omohundro's Ball Trees, Callahan's Well-Separated
Pairwise Decomposition, and Greengard's multipole methods in conjunction with new
search algorithms based on "racing" samples of data, All-Dimensions
Search over aggregates of records and a kind of higher-order divide and
conquer over datasets and query sets.
3. "Case Studies of Machine Learning for Manufacturing and Help Desks", Sholom Weiss, IBM Research.
Abstract: We describe two applications covering the opposite extremes of structured
and unstructured data. One application is for a fabrication process used
in the manufacture of laptop displays. The induced decision rules can have
a significant impact on display yield and manufacturing costs. The task
can be described as a regression problem. With the appropriate transformation
of raw data, a solution is readily obtained. The second application is
for unstructured data: a document database of customers' descriptions of
their problems with products and the vendor's descriptions of their resolution.
These records are ill-formed, containing redundant and poorly organized
documents. The objective is to try to automate the procedures for reducing
a database to its essential components, much like FAQs (frequently asked
questions) that can be offered as self-help to customers. Any solution
will require more than the simple application of a known model to data
prepared in a spreadsheet format. We review this self-help application
to obtain insight into complex data mining tasks for unstructured data.
Session Organizer: Grace Lin, Senior Manager, Supply Chain and e-business Optimization,
IBM Research Division
Session Chair: Roger Tsai, IBM Research Division
1. "Sales and Operations Optimization for a Supply Chain", Roger
Tsai, Young M. Lee, Markus Ettl, Feng Cheng, Tom Ervolina (IBM Research),
John Konopka, Shannon Liu (IBM Integrated Supply Chain)
Abstract: Several IBM divisions make quarterly Sales and Operations Planning (SOP)
decisions each month. The supply quantity, or SOP decision, is based on
many factors: revenue target, sales forecast, profit margin, the life cycle
of each computer, the commonality of computer components, and manufacturing
flexibility. The SOP decision had typically been made using rules of thumb
or heuristic methods. We have recently developed a Newsvendor model that
computes an optimal SOP decision. The model is designed to optimize
expected financial performance, and its performance was estimated
using historical data and simulation of the stochastic nature of demand. We
find that objective functions based on a variety of profit definitions can
all be cast into a common solution framework, and even supply flexibility
can be handled within this framework with slight modification. A new concept,
the process window, is introduced, and we demonstrate why it is beneficial
in this setting. Our study indicates that substantial financial benefit
can result from the optimization model.
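The classical newsvendor fractile underlying such a model can be sketched as follows. This is a generic textbook version with invented numbers, not IBM's model:

```python
def newsvendor_quantity(demand_samples, unit_cost, price, salvage=0.0):
    """Classical newsvendor: choose supply at the critical fractile
    cu/(cu+co) of the demand distribution, here estimated by the empirical
    quantile of historical or simulated demand samples."""
    cu = price - unit_cost          # underage: margin lost per unit short
    co = unit_cost - salvage        # overage: cost sunk per unit unsold
    fractile = cu / (cu + co)
    xs = sorted(demand_samples)
    idx = min(int(fractile * len(xs)), len(xs) - 1)
    return xs[idx]

# illustrative demand scenarios (e.g., draws from a forecast simulation)
demand = [80, 95, 100, 105, 110, 120, 130, 140, 150, 170]
q = newsvendor_quantity(demand, unit_cost=60.0, price=100.0, salvage=20.0)
```

Different profit definitions change only cu and co, which is one way objective-function variants can collapse into a common solution framework.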
2. "A Hybrid and Distributed Control Model for Supply Chain",
Mohsen A. Jafari and Tayfur Altiok, Dept. of Industrial & Systems Engineering,
Rutgers University
Abstract: One of the main challenges encountered by existing supply chain networks
is lack of visibility over the network. As a result, companies make critical
business decisions using estimates and projections and mainly based on
historical data, resulting in inefficiencies such as insufficient or overstocked
inventories, improperly filled orders, late deliveries, and overly optimistic
partner performance demands. At the same time, the information transfer
and collaboration between the enterprises within a supply chain is mostly
data driven and lacks direct and pro-active responsiveness. In a truly
collaborative supply chain environment with some degree of visibility,
one would expect that change in demand forecast in one enterprise leads
to changes in production, priorities, and even inventory thresholds in
its suppliers in some pro-active manner. The impact of a true collaborative
model will be immense in the presence of real time information on logistics,
distribution, and production, among others. Such a model, however, requires
a different view of supply chain networks, where information flow is associated
with appropriate events, and proper event propagation models with feedback
mechanism are defined and implemented. A similar concept has recently been
also promoted by IBM and SAP AG. For instance, in IBM's Sense and Response
model, a control/monitoring layer with feedback is embedded between supply
chain planning and execution. In the SAP AG model, event-based collaborations
are defined and predictive measures are established for event propagation
and pro-active control across a network of enterprises.
In this talk we treat a supply chain network as a distributed/hybrid system where interactions between different nodes (enterprises) are event based, with events triggered by some threshold-based mechanism or governed by some continuous process(es) (e.g., inventory depletion or replenishment). We present our preliminary results, in which underlying business processes (which rarely change) are separated from business rules (which could change dynamically). Control layers are defined within each enterprise and between enterprises across the supply chain network. We also present a preliminary prototype that is under construction at Rutgers University.
3. "Cross-enterprise Data Analysis: Methodologies, Challenges, Opportunities", James Ding, McMaster University Paper
Abstract: Recent developments in IT and globalization provide many research opportunities
and challenges for statistical quality control in industry. Some interesting
topics involve integrating traditional statistical quality control
methods with other mathematical tools for practical problems in this
new environment. Important aspects include applications in
data analysis under ERP systems, SCM, e-business, and the heterogeneous social
and economic structures of emerging economies. I plan to address
in detail the role of statistics in post-ERP solutions and the integration
of statistics and system dynamics in enterprise decision making.
Session Organizer: Galit Shmueli, University of Maryland
Session Chair: Galit Shmueli, University of Maryland
1. "Wavelets for Change Point Problem and Non-Stationary Time Series",
Yazhen Wang, University of Connecticut
Abstract: Because of their localization property and ability to decompose processes
into different frequency components, wavelets have been successfully employed
to study sudden structural changes in functions or signals and to model non-stationary
time series. This talk will review their recent developments and present
some current work on locally self-similar processes.
2. "Real-time Monitoring of Daily Sales Using Wavelets", Galit
Shmueli, University of Maryland
Abstract: Daily grocery and pharmacy sales tend to form dependent and non-stationary
series. We suggest a method for real-time monitoring of medication sales
using a combination of wavelets and autoregressive models. We illustrate
the method by applying it to real sales data from a large grocery chain.
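In the spirit of the approach, though not the authors' algorithm, a Haar-based monitor for a daily series might look like the following; all data and thresholds are illustrative:

```python
import math, statistics

def haar_level1(xs):
    """One-level Haar wavelet transform (xs must have even length):
    approximations capture the slow trend, details the local changes."""
    s = math.sqrt(2.0)
    approx = [(xs[i] + xs[i + 1]) / s for i in range(0, len(xs), 2)]
    detail = [(xs[i] - xs[i + 1]) / s for i in range(0, len(xs), 2)]
    return approx, detail

def monitor_details(xs, nsigma=3.0):
    """Flag pairs of days whose Haar detail coefficient is unusually large.
    Details of a stationary series are centered at zero, so we compare |d|
    against a robust MAD-based scale -- a crude stand-in for the wavelet/AR
    scheme in the talk."""
    _, detail = haar_level1(xs)
    scale = statistics.median(abs(d) for d in detail) * 1.4826
    return [i for i, d in enumerate(detail) if abs(d) > nsigma * scale]

# illustrative daily sales: a flat baseline with one sudden one-day spike
sales = [100.0, 102.0, 99.0, 101.0, 100.0, 98.0, 101.0, 100.0,
         100.0, 160.0, 99.0, 101.0, 100.0, 102.0, 99.0, 100.0]
alarms = monitor_details(sales)
```

The detail coefficients are largely insensitive to the slow, non-stationary component of the series, which is why wavelet decompositions pair naturally with simple autoregressive or control-chart monitoring of the remaining scales.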
3. "Multiscale Statistical Process Control Using Libraries of Basis
Functions", Bhavik Bakshi, Ohio State University
Abstract: This presentation will provide an overview of multiscale statistical process
control (MSSPC) and its variations, including MSSPC based on principal component analysis (MSPCA),
clustering by adaptive resonance theory (MSART), and libraries of basis
functions. MSSPC with orthonormal wavelets is good for detecting mean shifts,
while MSSPC with a library of wavelet packet basis functions can detect
a broad range of changes, such as mean, autocorrelation, oscillatory, and
spectral changes. Average run length analysis and industrial case studies
demonstrate that MSSPC is an excellent general method for detecting abnormal
situations when the nature and magnitude of the change can vary over a
wide range and are not known a priori. However, MSSPC does not outperform
methods that are tailored to detect specific types and magnitudes of changes.
In industrial practice, the ability of MSSPC to provide better average
performance for different types of data and changes is usually an important
advantage, since prior knowledge about the types of changes is rarely available,
particularly in large multivariate systems. These properties will be illustrated
via data from petrochemical processes.
Session Organizer: Tamraparni Dasu, AT&T Labs - Research
Session Chair: Tamraparni Dasu, AT&T Labs - Research
1. "Data Cleaning: The Good, The Bad, and The Ugly", Ronald K.
Pearson, Daniel Baugh Institute / Thomas Jefferson University
Abstract: This talk gives a broad overview of the data cleaning problem, focusing
on three general areas: "good" techniques that perform well in
practice, "bad" techniques that seem reasonable but frequently fail
in practice, and "ugly" phenomena that we see in real datasets but would
rather not. The Martin-Thomson data cleaning filter illustrates the good,
since it often does an excellent job in dynamic analysis applications like
spectral estimation and harmonic analysis, although at the price of some
complexity. The venerable "3-sigma edit rule" illustrates the bad,
since it often fails to find any outliers at all in contaminated datasets,
due to masking effects. As a useful alternative, I will discuss the Hampel
identifier, which has the advantage that it also extends to a very simple
nonlinear data cleaning filter that sometimes performs as well as or better
than the Martin-Thomson data cleaner in dynamic analysis applications.
Finally, as examples of the ugly, I will consider the problems of common-mode
outliers that appear simultaneously in several variables, the subtle outliers
that arise as the ironic consequence of idempotent form-based data entry
systems designed to enforce data quality, and the problem of outlier detection
in asymmetrically distributed data sequences. All of these ideas will be
illustrated with real datasets, drawn primarily from the areas of industrial
process monitoring and bioinformatics.
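The masking failure of the 3-sigma edit rule, and the Hampel identifier's resistance to it, can be demonstrated in a few lines. The data below are invented for illustration; 1.4826 is the usual factor that makes the MAD a consistent estimator of the standard deviation for Gaussian data.

```python
import statistics

def three_sigma_outliers(data):
    """Classic 3-sigma edit rule: flag points beyond mean +/- 3 * stdev."""
    m = statistics.mean(data)
    s = statistics.stdev(data)
    return [x for x in data if abs(x - m) > 3 * s]

def hampel_outliers(data, k=3.0):
    """Hampel identifier: flag points beyond median +/- k * 1.4826 * MAD."""
    med = statistics.median(data)
    mad = statistics.median([abs(x - med) for x in data])
    return [x for x in data if abs(x - med) > k * 1.4826 * mad]

# Three gross outliers inflate the mean and standard deviation enough to
# mask themselves from the 3-sigma rule; the median and MAD are unaffected,
# so the Hampel identifier catches all three.
data = [9.8, 10.2, 9.9, 10.1, 10.0, 9.7, 10.3, 9.95, 10.05, 10.0] + [100.0] * 3
```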
2. "Database Technology and Data Quality", Theodore Johnson,
Database Research Department, AT&T Labs - Research
Abstract: Modern (relational) databases have extensive facilities to ensure data
quality: integrity constraints (on field values, uniqueness, relationships
between tables, and so on), triggers, metadata support, data modeling tools,
data loading tools, and simple but powerful query languages. So why is
it that every database that I've encountered is filled with data quality
problems? In this talk, I will outline some of the most common causes of
data quality problems, and how these problems can be mitigated by the use
of existing and new database related techniques. Because database complexity
is a common factor in data quality problems, I will emphasize recent research
in database summarization and exploration.
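The integrity-constraint facilities the abstract lists can be exercised even in an embedded engine like SQLite. The table and column names below are invented for illustration; note that SQLite leaves foreign-key enforcement off unless the pragma is set.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("PRAGMA foreign_keys = ON")  # FK checks are off by default in SQLite

conn.execute("""
    CREATE TABLE customer (
        id    INTEGER PRIMARY KEY,
        email TEXT NOT NULL UNIQUE
    )""")
conn.execute("""
    CREATE TABLE orders (
        id          INTEGER PRIMARY KEY,
        customer_id INTEGER NOT NULL REFERENCES customer(id),
        amount      REAL CHECK (amount > 0)
    )""")

conn.execute("INSERT INTO customer VALUES (1, 'a@example.com')")
conn.execute("INSERT INTO orders VALUES (1, 1, 19.99)")

# Each of these violates a declared constraint and is rejected outright,
# before it can become a data quality problem.
bad_rows = [
    "INSERT INTO customer VALUES (2, 'a@example.com')",  # duplicate email
    "INSERT INTO orders VALUES (2, 99, 5.0)",            # unknown customer
    "INSERT INTO orders VALUES (3, 1, -5.0)",            # negative amount
]
rejected = 0
for sql in bad_rows:
    try:
        conn.execute(sql)
    except sqlite3.IntegrityError:
        rejected += 1
```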
3. "Case Study in Data Quality Implementation", Tamraparni Dasu, Statistics Research Department, AT&T Labs - Research
Abstract: I will present a case study to illustrate the implementation of a data
quality program to improve the accuracy of data flows in a provisioning process.
The case study is based on a real application. As a part of the case study,
I will discuss data quality metrics, outlining conventional ones and
proposing updated, dynamic metrics that capture both the complexity of
data quality and the highly application-specific nature of solutions
and metrics.
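As one example of a conventional metric of the kind discussed, a field-completeness score can be computed directly from provisioning records; the record layout below is hypothetical.

```python
def completeness(records, fields):
    """Fraction of required fields that are populated across all records."""
    total = len(records) * len(fields)
    filled = sum(1 for r in records for f in fields
                 if r.get(f) not in (None, ""))
    return filled / total

# Hypothetical provisioning records with two missing field values.
records = [
    {"circuit_id": "C1", "port": "P7", "status": "active"},
    {"circuit_id": "C2", "port": "",   "status": "active"},
    {"circuit_id": "C3", "port": "P9", "status": None},
]
score = completeness(records, ["circuit_id", "port", "status"])
```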
Session Organizer: Paul Tobias, Sematech (Ret)
Session Chair: Paul Tobias, Sematech (Ret)
"Data Gathering: Focusing on the Challenge", Gerald J. Hahn and
Necip Doganaksoy, Adjunct Faculty, RPI and GE Global Research Center Paper
Abstract: Getting the right data is a critical part of our job, and one that is
given insufficient emphasis in training practitioners and statisticians.
Although this is hardly news to this audience, we feel it merits our attention.
We propose a formal process for data gathering, and provide some examples.
We also recommend a comprehensive course on data gathering. Ideally, this would be the second semester of a one-year introduction to applied statistics for both practitioners and aspiring statisticians. A condensed version could also be an important part of short courses in industry.
The proposed course illustrates the inadequacy of historical data and describes the suggested process for getting the right data. It includes an overview of key concepts and practical considerations in the design of experiments and survey sampling, as well as in the development of data gathering systems in general. Other course topics include sample size determination, planning analytic (as opposed to enumerative) studies, and the data cleaning process. The course de-emphasizes formal statistical analyses, such as the analysis of variance, focusing instead on simple graphical evaluations.
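For the sample-size-determination topic, the standard normal-approximation calculation for estimating a mean to within a given margin of error is representative of what such a course covers; the numbers below are illustrative, not from the paper.

```python
import math

def sample_size_for_mean(sigma, margin, z=1.96):
    """Smallest n such that z * sigma / sqrt(n) <= margin, i.e. a
    two-sided confidence interval (95% for the default z) of half-width
    at most `margin`, assuming a known standard deviation sigma."""
    return math.ceil((z * sigma / margin) ** 2)

# Estimating a mean to within +/- 1 unit when sigma is about 5:
n = sample_size_for_mean(sigma=5.0, margin=1.0)
```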
We emphasize the criticality of the data gathering process in a forthcoming book, tentatively entitled "Statistics in the Corporate World," planned for publication in 2004, and invite colleagues' comments.
Session Organizer: Emmanuel Yashchin, IBM Research
Session Chair: Emmanuel Yashchin, IBM Research
"Optimizing Sequential Design of Experiments", Michael Baron,
University of Texas at Dallas and IBM Research Division and Claudia Schmegner,
University of Texas at Dallas Paper
Abstract: It is often impractical or expensive to sample according to the classical
sequential scheme, that is, one observation at a time. Sequential planning
extends and generalizes the "pure" sequential procedures by allowing
observations to be sampled in groups. At any moment, all the collected data
are used to determine the size of the next group and to decide whether
or not sampling should be terminated. We discuss optimality of sequential
designs taking into account both the variable and the fixed cost of experiments.
Some general guidelines for optimal sequential planning are established.
It is shown that the total cost of standard sequential procedures can be
reduced significantly without increasing the loss. Specific types of sequential
plans are introduced and compared, and some existing plans are modified and
improved.
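The fixed-versus-variable cost trade-off that motivates group sampling can be sketched with an invented cost model: each group incurs a fixed setup charge on top of a per-observation charge, so collecting the same observations in fewer groups cuts the total cost.

```python
def total_cost(group_sizes, fixed_cost, unit_cost):
    """Cost of a sequential plan: a fixed setup charge per group plus a
    per-observation charge (an illustrative model, not the paper's)."""
    return fixed_cost * len(group_sizes) + unit_cost * sum(group_sizes)

# Same 100 observations, taken one at a time versus in four groups of 25.
one_at_a_time = total_cost([1] * 100, fixed_cost=50.0, unit_cost=2.0)
grouped = total_cost([25] * 4, fixed_cost=50.0, unit_cost=2.0)
```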