A Comparative Analysis of Data Mining Technique using Decision Tree Classifier and SPC
1. INTRODUCTION
- Acceptance sampling
- Statistical process control (SPC)
Statistical Process Control (SPC) is an effective method of monitoring a production process through the use of control charts. By collecting in-process data or random samples of the output at various stages of the production process, one can detect variations or trends in the quality of the materials or processes that may affect the quality of the end product. Because data are gathered during the production process, problems can be detected and prevented much earlier than methods that only look at the quality of the end product. Early detection of problems through SPC can reduce wasted time and resources and may detect defects that other methods would not. Additionally, production processes can be streamlined through the identification of bottlenecks, wait times, and other sources of delay by use of SPC.
These data are potentially useful for learning patterns and extracting knowledge for quality improvement in manufacturing processes. However, because of the large amount of data, it can be difficult to discover the knowledge hidden in the data without proper tools.
Data mining provides a set of techniques to study patterns in data "that can be sought automatically, identified, validated, and used for prediction." Since knowledge is another name for power, organizations attach great importance to knowledge, and to reach it they make use of the data in their databases. Data matter because they allow organizations to learn from the past and to predict future trends and behaviors. Today most organizations use the data collected in their databases when taking strategic decisions. The process of turning data into knowledge consists of two steps: collecting the data and analyzing the data. In the beginning, organizations had difficulties collecting data, so they did not have enough data to perform suitable analyses.
In the long run, with rapid computerization, they became able to store huge amounts of data easily. At that point, however, they faced another problem: analyzing and interpreting such large data sets. Traditional methods such as statistical techniques and data management tools are no longer sufficient. To deal with this problem, the technique called data mining (DM) was developed. DM is a useful and powerful technology that helps companies derive strategic information from their databases. It has been defined as "the process of exploration and analysis, by automatic or semiautomatic means, of large quantities of data in order to discover meaningful patterns and rules."
By meaningful patterns and rules we mean patterns that are easily understood by humans, valid on new data, potentially useful, and novel. Validating a hypothesis that the user wants to prove can also be accepted as a meaningful pattern or rule. In sum, the essence of DM is to derive patterns and rules that lead to strategic and previously unimagined information.
2. DATA MINING PROCESS, TECHNIQUES AND APPLICATIONS
2.1 DM PROCESS
According to Fayyad, DM refers to a set of integrated analytical techniques, divided into several phases, that aim to extrapolate previously unknown knowledge from massive sets of observed data that do not appear to have any obvious regularity or important relationships. Another definition is that DM is the process of selection, extrapolation and modeling of large quantities of data to discover regularities or relations that are at first unknown, with the aim of obtaining clear and useful results from the database. By combining and harmonizing these definitions, in this study DM is considered as a whole process consisting of different steps, where in each step different DM techniques can be used. The steps of the DM process and the possible DM techniques that might be used in each step are classified as follows and as shown in the figure below:
1. Data gathering
a. Feature determination
b. Database formation
2. Data preprocessing
a. Data cleaning
i. Missing data handling
ii. Outlier and inlier handling
iii. Inconsistent data handling
b. Data integration
c. Data transformation
i. Smoothing,
ii. Aggregation,
iii. Generalization,
iv. Normalization of data [For example: min-max normalization, z-score normalization, normalization by decimal scaling etc.]
v. Attribute construction
vi. Discretization and concept hierarchy generation [For example: S-based techniques (DA etc.), NN-based (SOM) etc.]
d. Data reduction
i. Data cube aggregation
ii. Dimension reduction (Feature selection)
1. Feature wrapper
2. Feature filter
iii. Data compression [For example: Wavelet transform, S-based techniques (PCA, FA etc.) etc.]
iv. Numerosity reduction [For example: S-based techniques(R, Histograms etc.), Clustering etc.]
v. Over sampling
3. Modeling
a. Predictive model
i. Classification [For example: S-based techniques (R, BC, LR etc.), DT-based (OC1, ID3, CHAID, ID5R, C4.5 and C5, CART, QUEST, Scalable DT techniques, Statistical batch-based DT learning etc.), R-based (Generating Rules from a DT, Generating Rules from an ANN {Rectangular basis function network}, Generating Rules without a DT and ANN {PRISM, RST, FST etc.}), Combining techniques (Integration of FST and RST, FAN, EN, CC etc.), SVM etc.]
ii. Prediction [For example: S-based techniques (Parametric {MLR as RSM, GLM as ANOVA, MANOVA, TM, NRM as Generalized Additive Models, RR, BR, TSA as exponential smoothing etc.}, Nonparametric {ANOVA as Kruskal-Wallis, R, TSA as Moving average etc.}), DT-based (ID3, C4.5 and C5, CART, CHAID, Scalable DT techniques etc.), NN-based (w.r.t. learning algorithms: feed-forward propagation, back propagation; w.r.t. architecture: RBF {RBF network as Gaussian RBF NN}, Perceptrons, BNN), R-based (Generating Rules from a DT, Generating Rules from an ANN, Generating Rules without a DT and ANN {PRISM}), CBR, FEMS, SVM, Combining techniques (Modular ANN {FNN, Fuzzy ARTMAP NN, ANFIS etc.}) etc.]
b. Descriptive model
i. Clustering [For example: Hierarchical methods (Agglomerative, Divisive), Partitional methods (Minimum spanning tree, Squared error, K-means, Nearest neighbor, PAM, Bond energy, GA, NN based {w.r.t. learning rule: Competitive as SOM, LVQ etc.}, Non competitive (Hebian)), Rule-based (Generating Rules from a ANN) etc.]
ii. Summarization (Visualization and Statistics)
1. Visualization [For example: S-based (Histograms, scatter plots, box plots, pie charts, 3_D plots etc.) etc.]
2. Statistics [For example: Descriptive statistics (mean, median, frequency count etc.), Density estimation etc.]
3. Tables
c. Association
i. Basic methods [For example: Apriori, Sampling, Partitioning etc.]
ii. Advanced association rules method [For example: Generalized association rules, Multiple-level association rules, Quantitative association rules, Using multiple minimum supports, Correlation rules etc.]
d. Optimization [For example: S-based (TM, RSM), NN-based, GA, SA, SQP, Levenberg-Marquardt method etc.]
Abbreviations: S-based: Statistical-based, DT-based: Decision tree-based, NN-based: Neural network-based, R-based: Rule-based, R: Regression, PCA: Principal component analysis, RBF: Radial basis function, SVM: Support vector machines, CBR: Case-based reasoning, GA: Genetic algorithms, SE: Subjective and empirical approach, BNN: Bayesian networks, FAN: Fuzzy adaptive network, EN: Entropy network, CC: Composite classifiers, TSA: Time series analysis, FEM: Finite element modeling, GSA: Grey superior analysis, SA: Simulated annealing, SQP: Sequential quadratic programming method, DA: Discriminant analysis, CA: Correlation analysis, BC: Bayesian classification, GLM: General linear models, NRM: Nonlinear regression models, RR: Robust regression, BR: Bayesian regression, FA: Factor analysis, LR: Logistic regression, MLP: Multilayer perceptron, LVQ: Learning vector quantization
2.2 MAIN DATA MINING TASKS
There are different ways of distinguishing interesting patterns or trends in a huge data set; these are called DM operations or tasks, and different DM task categorizations exist.
For instance, one categorization involves Prediction, Classification, Clustering, Affinity Grouping or Association Rules, and Visualization and Statistics; another involves Classification, Regression, Clustering, Summarization, Dependency Modeling, and Change and Deviation Detection. There are also other categorizations involving classes such as Outlier Analysis and Text Mining. We are mainly interested in the following DM tasks. Optimization does not exist as a category in the literature; we defined it here because, although the surveyed papers commonly used DM tools for optimization purposes, no existing DM task fit this case. The few remaining tasks that are out of the scope of this study (Text mining, Web mining, Spatial mining and Temporal mining), or that are in scope but not studied in the surveyed papers (Affinity Grouping or Association Rules, Visualization and Statistics), are listed in the others part. The analyst can apply one or several of these tasks during the analysis of a dataset.
2.2.1 Data gathering and preprocessing
The first step of DM applications is data gathering. The aim of this step is to obtain the right data. For this purpose, all available data sources are examined and the right data for the analysis at hand are selected. It includes two steps: feature determination and database formation. In feature determination, the variables whose data will be collected are determined, whereas in database formation the collected data are converted into a database format. The second step is data preprocessing. The goal of this step is to investigate the quality of the selected data and then transform them to make them suitable for further analysis. This part is important because real-life data are incomplete, noisy and inconsistent. Data preprocessing consists of data cleaning, data integration, data transformation and data reduction.
Data cleaning deals with filling in missing values, detecting outliers, smoothing noisy data and correcting inconsistencies in the data. Methods for handling missing values are listed in the DM Process part; each has its own advantages and disadvantages.
For example, ignoring a tuple is not an effective method unless the tuple contains many missing values. Similarly, filling in missing values manually is time consuming.
Although using a global constant to fill in missing values is a simple method, it is not recommended. Filling in missing values with the most probable value is the most commonly used technique; methods such as regression and decision tree induction can be used for this purpose. Noisy data are another important problem when real-life data are used. Noise is a random error or variance in a measured variable. Clustering techniques, scatter plots and box plots are helpful for detecting outliers, and smoothing techniques such as binning and regression are used to remove noise. Lastly, there may be inconsistencies in the data, due to errors made at data entry or during data integration. They may be corrected by performing a paper trace.
Data integration is the combination of the necessary data from multiple data sources such as multiple databases, data cubes or flat files. Some problems may occur during data integration. To illustrate, if an attribute can be derived from another table, this indicates a redundancy problem. Another problem is the detection and resolution of data value conflicts: since different representations, scalings or encodings can be used, attribute values for the same entity can differ across data sources.
In conclusion, we should be careful during data integration in order to avoid such problems. Data transformation is changing the data into a form convenient for DM analysis. It includes smoothing, aggregation, generalization, normalization of data and attribute construction. Aggregation is the summarization of data and is used when building a data cube.
An example of generalization is changing the numeric attribute age into the categories young, middle-aged and senior. Normalization is changing the scale of a value so that it falls within a desired range. Many methods are used for normalization; some of them are min-max normalization, z-score normalization and normalization by decimal scaling, as in the sketch below.
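As a brief illustration, the following is a minimal sketch in Python with NumPy of these three normalization methods; the attribute values are invented for the example.

import numpy as np

# Hypothetical sample of a numeric attribute (e.g., monthly income).
x = np.array([1200.0, 1500.0, 1800.0, 2500.0, 4000.0])

# Min-max normalization: rescale values into the range [0, 1].
x_minmax = (x - x.min()) / (x.max() - x.min())

# Z-score normalization: center on the mean and scale by the standard deviation.
x_zscore = (x - x.mean()) / x.std()

# Decimal scaling: divide by 10^j, where j is the smallest integer
# such that the largest absolute scaled value is below 1.
j = int(np.ceil(np.log10(np.abs(x).max())))
x_decimal = x / (10 ** j)

print(x_minmax, x_zscore, x_decimal, sep="\n")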
Attribute construction is building new, useful attributes by combining other attributes in the data. For instance, the ratio of weight to height squared (the obesity index) can be constructed as a new variable so that it may be more logical and beneficial to use in the analysis.
Data reduction is changing the representation of the data so that its volume becomes smaller while the information it contains remains almost equal to that of the original data. It is important because the datasets are huge, and analyzing all of the data is both time consuming and impractical. Methods for data reduction are listed in the DM Process part. Data gathering and data preprocessing together form data preparation: choosing the right data and then converting them into a form suitable for analysis. Data preparation is the most time-consuming part of DM applications; in fact, about half of the time in DM projects is spent on it. Much importance should be given to this part if we do not want to run into problems later in the process.
2.2.2 Classification
Classification is an operation that examines the features of objects and then assigns them to classes predefined by the analyst. For this reason it is called supervised learning. Its aim is to develop a classification or predictive model that increases the explanatory capability of the system. To achieve this, it searches for patterns that discriminate one class from the others.
To illustrate, a simple example of this analysis is classifying the visitors of a website as customers or non-customers. The most commonly used techniques for classification are DT and ANN, and classification is frequently used in the evaluation of credit applications, fraud detection and insurance risk analysis.
2.2.3 Prediction
Prediction is the construction of a model to estimate the value of a feature. In DM, the term classification is used for predicting class labels and discrete or nominal values, whereas the term prediction is mainly used for estimating continuous values. In fact, some books use the name value prediction instead of prediction.
Two traditional techniques, namely linear and nonlinear regression (R/NLR) and ANN, are commonly used for this operation. Moreover, RBF is a newly used technique for value prediction which is more robust than traditional regression techniques.
2.2.4 Clustering
Clustering is an operation that divides a dataset into small groups or segments of similar records according to some criterion or metric. Unlike classification, there are no predefined classes in this operation, so it is called unsupervised learning. It is an unbiased look at potential groupings within a dataset and is used when groupings are suspected in the data without any prior judgment about what the similarity may involve. It is often the first step in a DM analysis, because it is difficult to derive any single pattern or develop any meaningful single model from the entire dataset. Constructing clusters reduces the complexity of the dataset, so the other DM techniques are more likely to be successful. To illustrate, instead of running a new sales campaign for all customers, it is more meaningful to first create customer segments and then run suitable campaigns for the appropriate segments. Clustering often uses methods such as the K-means algorithm or a special form of NN called a Kohonen feature map network (SOM).
2.3 MAIN DATA MINING TECHNIQUES
In DM operations, well-known mathematical and statistical techniques are used. Some of these techniques are grouped under the headings S-based, DT-based, NN-based and Distance-based, and the rest, which are not covered by these four headings, are listed in the others part. Here we only mention the commonly used or well-known techniques under each heading.
2.3.1 Statistical-based techniques
One of the commonly used S-based techniques is R. "Regression analysis is a statistical technique for investigating and modeling the relationship between variables."
The general form of a simple linear regression is y_i = α + βx_i + ε_i, where α is the intercept, β is the slope and ε_i is the error term, which is the unpredictable part of the response variable y_i. α and β are the unknown parameters to be estimated.
The estimated values of α and β can be derived by the method of ordinary least squares as follows (see the sketch below):
β̂ = Σ(x_i − x̄)(y_i − ȳ) / Σ(x_i − x̄)²,  α̂ = ȳ − β̂x̄.
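A minimal sketch of these least-squares estimates in Python with NumPy; the paired x and y values are invented for illustration.

import numpy as np

# Hypothetical paired observations (x: predictor, y: response).
x = np.array([1.0, 2.0, 3.0, 4.0, 5.0])
y = np.array([2.1, 3.9, 6.2, 8.1, 9.8])

x_bar, y_bar = x.mean(), y.mean()

# Ordinary least squares estimates of the slope and intercept.
beta_hat = np.sum((x - x_bar) * (y - y_bar)) / np.sum((x - x_bar) ** 2)
alpha_hat = y_bar - beta_hat * x_bar

# Fitted values and residuals (the estimated error terms).
y_fit = alpha_hat + beta_hat * x
residuals = y - y_fit

print(alpha_hat, beta_hat)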
Regression analysis must satisfy certain assumptions: the predictors must be linearly independent, the error terms must be normally distributed and independent, and the variance of the error terms must be constant. If the distribution of the error term is not normal, then the GLM, which is a useful generalization of ordinary least squares regression, is used. The form of the right-hand side can also be determined from the data, which is called nonparametric regression; this form of regression analysis requires a large number of observations, since the data are used both to build the model structure and to estimate the model parameters. Robust regression is a form of regression analysis which circumvents some limitations of traditional parametric and nonparametric methods and is highly robust to outliers. If the response variable is not continuous, then the logistic regression approach is used.
ANOVA, which stands for analysis of variance, is another well-known S-based technique. It is a statistical procedure for assessing the influence of a categorical variable (or variables) on the variance of a dependent variable. It compares the difference of each subgroup mean from the overall mean with the difference of each observation from its subgroup mean. If the between-group variation is large relative to the within-group variation, then the categorical variable or factor is influential on the dependent variable.
One-way ANOVA measures the effects of one factor only, whereas two-way ANOVA measures both the effects of two factors and the interactions between them simultaneously. The F-test is used to measure the effects of the factors. ANOVA must satisfy certain assumptions: independence of cases, normal distributions within each of the groups, and equal variances of the data in the groups. When the normality assumption fails, the Kruskal-Wallis test, which is a nonparametric alternative, can be used, as in the sketch below.
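A small illustrative sketch, assuming SciPy is available; the group measurements are fabricated for the example.

from scipy import stats

# Hypothetical quality measurements from three machines (the factor).
machine_a = [5.1, 5.3, 5.0, 5.2, 5.4]
machine_b = [5.6, 5.8, 5.7, 5.9, 5.6]
machine_c = [5.2, 5.1, 5.3, 5.2, 5.0]

# One-way ANOVA F-test: does the machine affect the mean measurement?
f_stat, p_anova = stats.f_oneway(machine_a, machine_b, machine_c)

# Kruskal-Wallis test: nonparametric alternative when normality fails.
h_stat, p_kw = stats.kruskal(machine_a, machine_b, machine_c)

print(f_stat, p_anova, h_stat, p_kw)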
2.3.2 Decision tree-based techniques
Decision trees are tree-shaped structures and are among the most commonly used DM techniques. Constructing these trees is simple, and the results can easily be understood by users. In addition, they can practically solve most classification problems. In a DT model, each internal node denotes a test on an attribute and the branches show the outcomes of the test. At the end of the tree are the leaf nodes, which represent classes. During the construction of these trees, the data are split into smaller subsets iteratively. At each iteration, choosing the most suitable independent variable is an important issue: the split which creates the most homogeneous subsets with respect to the dependent variable should be chosen. While choosing the independent variable, attribute selection measures such as information gain and the Gini index are used. The splitting process then continues according to these measures until no more useful splits are found. In brief, the DT technique is useful for classification problems, and the most common types of decision tree algorithms are CHAID, CART and C5.0, as illustrated in the sketch below.
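For illustration, here is a minimal decision tree classification sketch assuming scikit-learn is available; the tiny training set, feature names and class labels are invented. The library builds CART-style trees, where criterion="entropy" corresponds to information gain and criterion="gini" to the Gini index.

from sklearn.tree import DecisionTreeClassifier, export_text

# Hypothetical process data: [temperature, pressure] -> quality class (0 = ok, 1 = defect).
X = [[200, 30], [210, 32], [190, 28], [250, 45], [260, 47], [255, 44]]
y = [0, 0, 0, 1, 1, 1]

# Fit a shallow tree using the entropy (information gain) splitting measure.
tree = DecisionTreeClassifier(criterion="entropy", max_depth=3)
tree.fit(X, y)

# Inspect the learned splits and classify a new observation.
print(export_text(tree, feature_names=["temperature", "pressure"]))
print(tree.predict([[245, 43]]))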
2.3.3 Neural network-based techniques
NN techniques allow us to develop models from historical data that are able to learn much as people do. They are quite capable of deriving meaning from complicated datasets that are difficult for humans or other techniques to interpret. To exemplify:
Figure 1. Example of a Neural Network Architecture
A NN simply combines the inputs (independent variables) with some weights to predict the outputs (dependent variables) based on prior experience. In the figure, A, B and C are input nodes and constitute the input layer. F is the output node and constitutes the output layer. Moreover, in most NNs there are one or more additional layers between the input and output layers, called hidden layers. In the figure, D and E are the hidden nodes and constitute a hidden layer. The weights are shown on the arrows between the nodes. Regarding the strengths and weaknesses of this technique: it is more robust than DT in noisy environments, and it can improve its performance by learning; however, the developed model is difficult to understand, the learning phase may fail to converge, and the input data must be numeric. As a result, NNs are useful for most prediction and classification operations when only the result of the model matters rather than how the model arrives at it. Back propagation is the most commonly used learning technique. It is easily understood and applicable; it adjusts the weights in the NN by propagating weight changes backward from the sink to the source nodes. The perceptron is the simplest NN: a single neuron with multiple inputs and one output. A network of perceptrons is called a multilayer perceptron (MLP). An MLP is a simple feed-forward NN with multiple layers.
A radial basis function network is a NN with three layers. In the hidden layer a Gaussian activation function is used, whereas in the output layer a linear activation function is used. The Gaussian activation function is an RBF with a central point of zero; an RBF is a class of functions whose value decreases (or increases) with the distance from a central point. A minimal forward-pass sketch for a small feed-forward network is shown below.
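The sketch below, in Python/NumPy, shows a single forward pass through a network shaped like the one in Figure 1 (three inputs A, B, C; hidden nodes D, E; output F). The weight values and input values are arbitrary placeholders; in practice, back propagation would adjust the weights from training data.

import numpy as np

def sigmoid(z):
    # Common activation function for feed-forward networks.
    return 1.0 / (1.0 + np.exp(-z))

# Inputs A, B, C (hypothetical, already normalized).
x = np.array([0.2, 0.7, 0.5])

# Arbitrary example weights: inputs -> hidden nodes D, E, and hidden -> output F.
w_hidden = np.array([[0.4, -0.6, 0.1],    # weights into D
                     [0.3, 0.8, -0.5]])   # weights into E
w_output = np.array([0.7, -0.2])          # weights from D, E into F

# Forward propagation: weighted sums passed through the activation function.
hidden = sigmoid(w_hidden @ x)
output = sigmoid(w_output @ hidden)
print(output)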
2.3.4 Hierarchical and Partitional techniques
Cluster analysis identifies the distinguishing characteristics of a dataset and then divides it into partitions so that records in the same group are as similar as possible and records in different groups are as dissimilar as possible. The basic operation is the same in all clustering algorithms. Each record is compared with the existing clusters and assigned to the cluster whose centroid is the closest. The centroids of the new clusters are then recalculated, and once again each record is assigned to the cluster with the closest centroid. At each iteration the class boundaries, which are the lines equidistant between each pair of centroids, are computed. This process continues until the cluster boundaries stop changing. As a distance measure, most clustering algorithms use the Euclidean distance formula; non-numeric variables must be transformed before this formula can be used.
Hierarchical clustering techniques can generate sets of clusters, whereas partitional techniques generate only one set of clusters, so in partitional techniques the user has to specify the number of clusters. In an agglomerative algorithm, which is one of the hierarchical clustering techniques, each observation initially forms its own cluster; the algorithm then combines these clusters iteratively until one cluster is obtained. On the other hand, in K-means clustering, which is one of the partitional clustering techniques, observations are moved among sets of clusters until the desired set is obtained, as in the sketch below.
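A minimal K-means sketch assuming scikit-learn is available; the two-dimensional customer records and the choice of three clusters are made up for illustration.

import numpy as np
from sklearn.cluster import KMeans

# Hypothetical customer records: [annual spending, number of visits].
X = np.array([[200, 5], [220, 6], [210, 4],
              [800, 20], [820, 22], [790, 19],
              [450, 12], [470, 11], [440, 13]])

# Partitional clustering: the user must specify the number of clusters.
kmeans = KMeans(n_clusters=3, n_init=10, random_state=0)
labels = kmeans.fit_predict(X)

# Cluster assignments and the final centroids (Euclidean distance is used).
print(labels)
print(kmeans.cluster_centers_)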
2.3.5 Others
Genetic algorithm is an optimization type algorithm. It can be used for classification, clustering and generating association rules. It has five steps:
1. Starting set of individuals, P.
2. Crossover technique.
3. Mutation algorithm.
4. Fitness function
5. Algorithm that applies the crossover and mutation techniques to P iteratively, using the fitness function to determine the best individuals in P to keep. The algorithm replaces a predefined number of individuals from the population with each iteration and terminates when some threshold is met.
This algorithm begins with an assumed starting model. Using crossover algorithms, it combines models to generate new models iteratively, and a fitness function selects the best models from these. At the end, it finds the fittest models from a set of models to represent the data. A compact sketch of these steps is given below.
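A compact sketch of the five steps in Python; the bit-string encoding, the toy fitness function and the parameter values are all illustrative assumptions rather than a specific published algorithm.

import random

random.seed(0)
BITS, POP, GENS = 10, 20, 30

def fitness(ind):
    # Toy fitness: number of 1-bits (replace with a problem-specific measure).
    return sum(ind)

def crossover(a, b):
    # Single-point crossover of two parent bit strings.
    point = random.randint(1, BITS - 1)
    return a[:point] + b[point:]

def mutate(ind, rate=0.05):
    # Flip each bit with a small probability.
    return [1 - bit if random.random() < rate else bit for bit in ind]

# 1. Starting set of individuals P.
population = [[random.randint(0, 1) for _ in range(BITS)] for _ in range(POP)]

# 5. Apply crossover and mutation iteratively, keeping the fittest individuals.
for _ in range(GENS):
    parents = sorted(population, key=fitness, reverse=True)[:POP // 2]
    children = [mutate(crossover(random.choice(parents), random.choice(parents)))
                for _ in range(POP)]
    population = sorted(parents + children, key=fitness, reverse=True)[:POP]

best = max(population, key=fitness)
print(best, fitness(best))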
2.4 APPLICATION AREAS
Today main application areas of DM are as follows:
Marketing:
· Customer segmentation
o Find clusters of "model" customers who share the same characteristics: interests, income level, spending habits, etc.
o Determining the correlations between the demographic properties of the customers
o Various marketing campaign
o Constructing marketing strategies for not losing present customers
o Market basket analysis
o Cross-market analysis
§ Associations/correlations between product sales
§ Prediction based on the association information
o Customer evaluation
o Different customer analysis
§ Determine customer purchasing patterns over time
§ What types of customers buy what products
§ Identifying the best products for different customers
o CRM
Banking:
· Finding the hidden correlations among the different financial indicators
· Fraud detection
· Customer segmentation
· Evaluation of the credit demands
· Risk analysis
· Risk management
Insurance:
· Estimating the customers who demand new insurance policy
· Fraud detection
· Determining the properties of risky customers
Retailing:
· Point of sale data analysis
· Buying and selling basket analysis
· Supply and store layout optimization
Bourse:
· Growth stock price prediction
· General market analysis
· Purchase and sale strategies optimization
Telecommunication:
· Quality analysis and improvement
· Allotment fixing
· Line busyness prediction
Health and Medicine:
· Test results prediction
· Product development
· Medical diagnosis
· Cure process determination
Industry:
· Quality control and improvement
o Product design
· Concept design
· Parameter design (design optimization)
· Tolerance design
· Manufacturing process design
o Concept design
o Parameter design (design optimization)
o Tolerance design
· Manufacturing
o Quality monitoring
o Process control
o Inspection / Screening
o Quality analysis
· Customer usage
· Warranty and repair / replacement
Science and Engineering:
· Analysis of scientific and technical problems by constructing models using empirical data
3. LITERATURE SURVEY
James C. Benneyan, Ph.D., Northeastern University: This paper provides an overview of statistical process control (SPC) charts, the different uses of these charts, the most common types of charts, when to use each type, and guidelines for determining an appropriate sample size. The intent is to provide an introduction to these methods and further insight into their design and performance beyond what exists in the current literature. The utility of control charts to help improve clinical and administrative processes has received growing interest in the healthcare community.
It is useful to distinguish between two types of data mining exercises. The first is data modeling, in which the aim is to produce some overall summary of a given data set, characterizing its main features. Thus, for example, we may produce a Bayesian belief network, a regression model, a neural network, a tree model, and so on. Clearly this aim is very similar to the aim of standard statistical modeling. Having said that, the large sizes of the data sets often analyzed in data mining can mean that there are differences. In particular, standard algorithms may be too slow, and standard statistical model-building procedures may lead to over-complex models since even small features will be highly significant. We return to these points below.
It is probably true to say that most statistical work is concerned with inference in one form or another. That is, the aim is to use the available data to make statements about the population from which it was drawn, values of future observations, and so on. Much data mining work is also of this kind. In such situations one conceptualizes the available data as a sample from some population of values which could have been chosen. However, in many data mining situations all possible data are available, and the aim is not to make an inference beyond these data but rather to describe them. In this case it is inappropriate to use inferential procedures such as hypothesis tests to decide whether or not some feature of the describing model should be retained; other criteria must be used.
The second type of data mining exercise is pattern detection. Here the aim is not to build an overall global descriptive model, but rather to detect peculiarities, anomalies, or simply unusual or interesting patterns in the data. Pattern detection has not been a central focus of activity for statisticians, where the (inferential) aim has rather been assessing the 'reality' of a pattern once detected. In data mining the aim is to locate the patterns in the first place, typically leaving the establishment of their reality, interest, or value to the database owner or a domain expert. Thus a data miner might locate clusters of people suffering from a particular disease, while an epidemiologist will assess whether the cluster would be expected to arise simply from random variation. Of course, most problems occur in data spaces of more than two variables (and with many points), which is why we must use formal analytic approaches.
Ronald E. Walpole and Raymond H. Myers, "Probability and Statistics for Engineers and Scientists", 8th edition, Prentice Hall: Statistics is a discipline which is concerned with:
· Summarizing information to aid understanding,
· Drawing conclusions from data,
· Estimating the present or predicting the future, and
· Designing experiments and other data collection.
In making predictions, statistics uses the companion subject of probability, which models chance mathematically and enables calculations of chance in complicated cases. Today, statistics has become an important tool in the work of many academic disciplines such as medicine, psychology, education, sociology, engineering and physics, just to name a few.
Statistics is also important in many aspects of society such as business, industry and government. Because of the increasing use of statistics in so many areas of our lives, it has become very desirable to understand and practice statistical thinking. This is important even if you do not use statistical methods directly.
Ian H. Witten and Eibe Frank, "Data Mining – Practical Machine Learning Tools and Techniques", second edition, Elsevier, 2005: Data mining is defined as the process of discovering patterns in data. The process must be automatic or (more usually) semiautomatic. The patterns discovered must be meaningful in that they lead to some advantage, usually an economic advantage. The data is invariably present in substantial quantities.
How are the patterns expressed? Useful patterns allow us to make nontrivial predictions on new data. There are two extremes for the expression of a pattern: as a black box whose innards are effectively incomprehensible and as a transparent box whose construction reveals the structure of the pattern. Both, we are assuming, make good predictions. The difference is whether or not the patterns that are mined are represented in terms of a structure that can be examined, reasoned about, and used to inform future decisions. Such patterns we call structural because they capture the decision structure in an explicit way. In other words, they help to explain something about the data.
Shu-guang He, Zhen He, G. Alan Wang and Li Li, "Quality Improvement using Data Mining in Manufacturing Processes": Data mining provides a set of techniques to study patterns in data "that can be sought automatically, identified, validated, and used for prediction."
It has become an emerging topic in the field of quality engineering. Andrew Kusiak (2001) used a decision tree algorithm to identify the cause of soldering defects on circuit boards. The rules derived from the decision tree greatly simplified the process of quality diagnosis. Shao-Chuang Hsu (2007) and Chen-Fu Chien (2006 and 2007) demonstrated the use of data mining for semiconductor yield improvement.
Data mining has also been applied to the product development process (Bakesh Menon, 2004) and to assembly lines (Sébastien Gebus, 2007). Some researchers have combined data mining with traditional statistical methods and applied them to quality improvement. Examples are the use of MSPC (multivariate statistical control charts) and neural networks in a detergent-making company (Seyed Taghi Akhavan Niaki, 2005; Tai-Yue Wang, 2002), the combination of an automated decision system and six sigma in the General Electric Financial Assurance businesses (Angie Patterson, 2005), the combined use of decision trees and SPC with data from Holmes and Mergen (Ruey-Shiang Guh, 2008), the use of SVR (support vector regression) and control charts (Ben Khediri ISSam, 2008), and the use of ANN (artificial neural network), SA (simulated annealing) and Taguchi experiment design (Hsu-Hwa Chang, 2008). Giovanni C. Porzio (2003) presented a method for visually mining off-line data with a combination of ANN and T2 control charts to identify assignable variation automatically.
4. PROBLEM STATEMENT
Control charts are valuable for analyzing and improving industrial process outcomes. Data mining, the extraction of hidden predictive information from large databases, is a powerful technology with great potential to help companies focus on the most important information in their data warehouses.
Data mining tools predict future trends and behaviors, allowing businesses to make proactive, knowledge-driven decisions. The automated, prospective analyses offered by data mining move beyond the analyses of past events provided by the retrospective tools typical of decision support systems. Data mining tools can answer business questions that traditionally were too time consuming to resolve. They scour databases for hidden patterns, finding predictive information that experts may miss because it lies outside their expectations.
Most companies already collect and refine massive quantities of data. The procedure used to perform data mining modeling and analysis has undergone a long transformation from the domain of academic research to a systematic industrial process performed by business and quantitative analysts. Several methodologies have been proposed to cast the steps of developing and deploying data mining models into a standardized process.
Data mining techniques can be implemented rapidly on existing software and hardware platforms to enhance the value of existing information resources, and can be integrated with new products and systems as they are brought on-line. SPC is both a data analysis method and a process management philosophy, with important implications for the use of data mining for continuous improvement and enhancement.
5. PROPOSED METHODOLOGIES
5.1 Knowledge-based continuous quality improvement in manufacturing processes
Continuous quality improvement is an important concept in the field of manufacturing quality management. DMAIC (Define-Measure-Analyze-Improve-Control) for six sigma is the most commonly used model of continuous quality improvement. Fig. 1 illustrates the processes of DMAIC.
Fig. 1. The processes of DMAIC
We can observe from Fig. 1 that DMAIC is a problem-driven approach. The entire process begins with locating a problem. However, in a complicated manufacturing process, such as semiconductor manufacturing or steelmaking, it may not be easy to identify and define a proper problem. Moreover, in high-speed manufacturing processes quality problems must be quickly identified and eliminated; otherwise they may lead to large losses in both cost and productivity. Therefore, we propose a knowledge-based quality improvement model (see Fig. 2). Different from DMAIC, this model is a goal-driven process. The central idea of the knowledge-based quality improvement model is to mine the mass of data collected from a manufacturing process using automated data mining techniques. The goal is to improve the quality performance of manufacturing processes by quickly identifying and eliminating quality problems.
In the knowledge-based quality improvement model, the first step is to define the goal. The goal here may be defect elimination, efficiency improvement, or yield improvement.
Data mining is used to analyze the quality-related data to find the relationships between the goal and factors such as machinery parameters, operators, and material vendors. After the knowledge has been verified, opportunities for quality improvement can be identified using the knowledge and patterns learned by the data mining techniques. The scope of the problem can be broad, across different phases of a manufacturing process. In the following sections, we explain how to apply the model to parameter optimization, quality diagnosis and service data analysis.
Fig. 2. The knowledge-based quality improvement model
5.2 Quality diagnosis with data mining
During a manufacturing process, product quality can be affected by two types of variation: random variations and assignable variations. Random variations are caused by the intrinsic characteristics of a manufacturing process and cannot be eliminated completely.
SPC (Statistical Process Control) and MSPC (Multivariate SPC) are the most widely used tools in manufacturing for finding assignable variations. Although they can effectively detect assignable variations in manufacturing processes, they give no clue for identifying the root causes of the assignable variations. Data mining techniques can again be employed in this case to provide insights for quality diagnosis.
Fig. 3. The combination of SPC and data mining
In the model shown in Fig. 3, the yield ratio of a product is defined as the index of the quality performance of a manufacturing process. The chart on the left is a control chart. When the chart shows alarming signals, i.e., points located beyond the control limits, the data mining process is engaged. Data related to quality are stored in a data warehouse. Data mining techniques such as decision trees and association rule mining can be applied to the data to identify the causes of the alarming signals, as in the sketch below.
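The sketch below illustrates the idea of Fig. 3 in Python, assuming NumPy and scikit-learn are available. The process data, the 3-sigma limits and the candidate cause attributes (machine, vendor) are all hypothetical: a control chart flags out-of-control points, and a decision tree is then trained on the alarm/no-alarm label against the process attributes to hint at possible root causes.

import numpy as np
from sklearn.tree import DecisionTreeClassifier, export_text

rng = np.random.default_rng(0)

# Phase I: hypothetical in-control data used to estimate the control limits.
baseline = rng.normal(10.0, 0.2, 100)
center, sigma = baseline.mean(), baseline.std(ddof=1)
ucl, lcl = center + 3 * sigma, center - 3 * sigma

# Phase II: new measurements with two candidate cause attributes; one
# machine/vendor combination is given an artificial mean shift.
n = 200
machine = rng.integers(0, 2, n)
vendor = rng.integers(0, 3, n)
shift = 1.0 * ((machine == 1) & (vendor == 2))
measurement = rng.normal(10.0, 0.2, n) + shift

# Control-chart step: alarms are points outside the 3-sigma limits.
alarm = ((measurement > ucl) | (measurement < lcl)).astype(int)

# Quality diagnosis step: a decision tree relates alarms to the candidate causes.
X = np.column_stack([machine, vendor])
tree = DecisionTreeClassifier(max_depth=3).fit(X, alarm)
print(export_text(tree, feature_names=["machine", "vendor"]))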
5.3 Quality management and tools for process improvement
In the early 1980s, U.S. business leaders began to realize the competitive importance of providing high-quality products and services, and a quality revolution was under way in the United States.
The numbers attached to each point do not indicate an order of importance; rather, the 14 points collectively are seen as necessary steps to becoming a world-class company. Deming's 14 points are:
- Create a constancy of purpose toward the improvement of products and services in order to become competitive, stay in business, and provide jobs.
- Adopt the new philosophy. Management must learn that it is in a new economic age and awaken to the challenge, learn their responsibilities, and take on leadership for change.
- Stop depending on inspection to achieve quality. Build in quality from the start.
- Stop awarding contracts on the basis of low bids.
- Improve continuously and forever the system of production and service to improve quality and productivity, and thus constantly reduce costs.
- Institute training on the job.
- Institute leadership. The purpose of leadership should be to help people and technology work better.
- Drive out fear so that everyone may work efficiently.
- Break down barriers between departments so that people can work as a team.
- Eliminate slogans, exhortations, and targets for the workforce. They create adversarial relationships.
- Eliminate quotas and management by objectives. Substitute leadership.
- Remove barriers that rob employees of their pride of workmanship.
- Institute a vigorous program for education and self-improvement.
- Make the transformation everyone's job and put everyone to work on it.
Other individuals have also played significant roles in the quality movement. Among these are Philip B. Crosby, who is probably best known for his book Quality Is Free, and Kaoru Ishikawa, who developed and popularized the application of the fishbone diagram. Finally, we must not overlook the contributions of many different managers at companies such as Hewlett-Packard, General Electric, and Motorola.
These leaders synthesized and applied many different quality ideas and concepts to their organizations in order to create world-class corporations. By sharing their success with other firms, they have inspired and motivated others to continually seek opportunities in which the tools of quality can be applied to improve business processes.
Statistical process control (SPC) charts can be used for quality and process improvement. SPC charts are a special type of trend chart. In addition to the data, the charts display the process average and the upper and lower control limits. These control limits define the range of random variation expected in the output of a process. SPC charts are used to provide early warnings when a process has gone out of control.
Statistical inference is the process of making conclusions using data that are subject to random variation, for example, observational errors or sampling variation. More substantially, the terms statistical inference, statistical induction and inferential statistics are used to describe systems of procedures that can be used to draw conclusions from datasets arising from systems affected by random variation. Statistical inference is concerned with drawing conclusions from numerical data about quantities that are not observed.
An important component of statistical inference is hypothesis testing. Information contained in a sample is subject to sampling error; the sample mean will almost certainly not equal the population mean. Therefore, in situations in which we need to test a claim about a population mean by using the sample mean, we cannot simply compare the sample mean to the claim and reject the claim if x bar and the claim are different. Instead, we need a testing procedure that incorporates the potential of sampling error.
Statistical hypothesis testing provides managers with a structured analytical method for making decisions of this type. It lets them make decisions in such a way that the probability of decision errors can be controlled, or at least measured. Even though statistical hypothesis testing does not eliminate the uncertainty in the managerial environment, the techniques involved often allow managers to identify and control the level of uncertainty. SPC is actually an application of hypothesis testing. The testing of statistical hypotheses is perhaps the most important area of decision theory.
A statistical hypothesis is an assumption or statement, which may or may not be true, concerning one or more populations. The truth or falsity of a statistical hypothesis is never known with certainty unless we examine the entire population, which of course would be impractical in most situations. Instead, we take a random sample from the population of interest and use the information contained in this sample to decide whether the hypothesis is likely to be true or false. Evidence from the sample that is inconsistent with the statistical hypothesis leads to the rejection of the hypothesis, whereas evidence supporting the hypothesis leads to its acceptance. We should make it clear at this point that the acceptance of a statistical hypothesis is a result of insufficient evidence to reject it and does not necessarily imply that it is true. Although we shall use the terms accept and reject frequently throughout this paper, it is important to understand that rejecting a hypothesis is to conclude that it is false, while accepting a hypothesis merely implies that we have no evidence to believe otherwise. Hypotheses that we formulate with the hope of rejecting are called null hypotheses and are denoted by H0. The rejection of H0 leads to the acceptance of an alternative hypothesis denoted by H1. Based on the sample data, we either reject H0 or we do not reject H0.
Types of statistical errors:
Because of the potential for extreme sampling errors, two possible errors can occur when a hypothesis is tested: Type I and Type II errors. These errors describe the relationship between what actually exists (a state of nature) and the decision based on the sample information.
There are three possible outcomes: no error (correct decision), Type I error, and Type II error. Only one of these outcomes will occur for a hypothesis test. If the null hypothesis is true and an error is made, it must be a Type I error. On the other hand, if the null hypothesis is false and an error is made, it must be a Type II error. We should never use the phrase "accept the null hypothesis"; instead we should use "do not reject the null hypothesis". Thus, the only two hypothesis testing decisions are "reject H0" and "do not reject H0". In most business applications, the purpose of the hypothesis test is to direct the decision maker to take one action or another based on the test results, as in the sketch below.
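A minimal sketch of such a test in Python, assuming SciPy is available; the sample values and the claimed population mean of 50 are invented. A one-sample t-test incorporates sampling error instead of comparing x bar to the claim directly.

from scipy import stats

# Hypothetical sample measurements; H0: population mean = 50, H1: mean != 50.
sample = [49.2, 50.8, 51.1, 49.7, 50.3, 52.0, 48.9, 50.6]

t_stat, p_value = stats.ttest_1samp(sample, popmean=50.0)

alpha = 0.05  # chosen probability of a Type I error
if p_value < alpha:
    print("Reject H0")           # evidence inconsistent with the claim
else:
    print("Do not reject H0")    # insufficient evidence to reject the claim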
Statistical process control charts:
In applications in which the value of the random variable is determined by measuring rather than counting, the random variable is said to be (approximately) continuous, and the probability distribution associated with the random variable (as in this article) is called a continuous probability distribution.
SPC can be used in business to define the boundaries that represent the amount of variation that can be considered normal.
The normal distribution is used extensively in statistical decision making such as hypothesis testing and estimation. The mathematical equation for the probability distribution of the continuous normal variable is f(x) = (1/(σ√(2π))) e^(−(x−μ)²/(2σ²)). The x bar chart is used to monitor a process average, and the R chart is used to monitor process variation by means of the subgroup ranges. They require that the variable of interest be quantitative. These control charts can be developed using the following steps (see the sketch after this list):
- Collect the initial sample data from which the control charts will be developed.
- Calculate the subgroup means x bar i and ranges Ri.
- Compute the average of the subgroup means and the average range value.
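A sketch of these steps in Python with NumPy; the subgroup data are invented, and the control-chart constants A2, D3, D4 for subgroups of size 5 are the standard tabulated values, quoted here from memory and worth verifying against a published table.

import numpy as np

# Step 1: initial sample data, organized as subgroups of size 5 (hypothetical values).
subgroups = np.array([
    [10.1, 10.3,  9.9, 10.0, 10.2],
    [10.4, 10.2, 10.1, 10.3, 10.0],
    [ 9.8, 10.0, 10.1,  9.9, 10.2],
    [10.2, 10.1, 10.3, 10.4, 10.0],
])

# Step 2: subgroup means (x bar i) and ranges (Ri).
xbar_i = subgroups.mean(axis=1)
r_i = subgroups.max(axis=1) - subgroups.min(axis=1)

# Step 3: grand average of the subgroup means and the average range.
xbar_bar, r_bar = xbar_i.mean(), r_i.mean()

# Control limits using tabulated constants for subgroup size n = 5.
A2, D3, D4 = 0.577, 0.0, 2.114
xbar_limits = (xbar_bar - A2 * r_bar, xbar_bar + A2 * r_bar)
r_limits = (D3 * r_bar, D4 * r_bar)

print("x-bar chart:", xbar_bar, xbar_limits)
print("R chart:", r_bar, r_limits)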
5.4 Quality Control Charts Overview and Interpretation
Statistical process control charts are chronological graphs of process data that are used to help understand, control, and improve processes - such as infection control or adverse event processes - and that, although based in statistical theory, are easy for practitioners to use and interpret. While there are several different types of control charts, the general format and interpretation of the most common and simplest type, called a Shewhart control chart, are shown in Figure 4. Some statistic of interest, such as the number of cases of ventilator-associated pneumonia per 100 device days, is plotted on the chart and interpreted on a monthly or weekly basis.
Fig. 4. Shewhart control chart
Uses of Control Charts
It is important to emphasize that control charts have several important, somewhat sequential, roles in quality improvement work:
- Understanding current and past process performance and its degree of consistency and predictability;
- Establishing a "state of statistical control" by identifying and removing causes of unnatural (or "special cause") variation so as to achieve a consistent and predictable level of process quality over time;
- Improving a process by identifying and removing causes of natural (or "common cause") variation and by testing whether interventions result in an improvement; and
- Monitoring for process deterioration and "holding the gains" by identifying special causes of unnatural variation when they arise in the future.
Establishing a State of Statistical Control
Note that while the latter two uses of control charts - testing and holding the gains - tend to be the most well known in many popular quality improvement models, the first two activities are very important but unfortunately often overlooked or misunderstood. In many applications, considerable value exists in "merely" achieving a state of statistical control. As in other industries, many healthcare processes will not be stable and consistent when first examined and will require significant effort to bring them into a state of statistically consistent behavior (i.e., statistical control). This activity is referred to as "trial control charting" because of the focus on testing whether the process is consistent and on attempting to bring it into a state of operation such that it produces consistent and predictable results. This iterative process occurs over a period of time and consists of:
- Constructing an initial trial control chart to test for statistical control,
- Searching for and removing assignable causes of unnatural variability,
- Removing all affected data and recalculating the center line and control limits from the remaining data (with the addition of new data if available or necessary),
- Searching a second time for causes of unnatural variability,
- Removing these data and reconstructing the control chart a second time as above, and
- Repeating this process as many times as is necessary until a state of statistical control is reached.
Monitoring and Improving
Once a stable process exists (i.e., a state of statistical control has been established), the control chart is used to monitor the process for signals that a change has occurred (a "special cause" or "unnatural" variability in SPC terminology) - points outside the control limits or violations of any of the within-limit rules. If these are changes for the worse, such as an increase in the ventilator-associated pneumonia rate, then an effort should be made to discover the causes so that they can be removed and prevented in the future.
Causes of changes for the better also should be investigated and understood so that they can be implemented on a regular basis. While this monitoring activity tends to be the most familiar use associated with control charts, it also is the most passive use from a process improvement perspective, as it is focused primarily on maintaining the status quo. It also is important to note that being in a state of statistical control does not necessarily imply that the process is performing at an acceptable level or that the outcome rate is good; either an increase or a decrease (i.e., an improvement) in the outcome rate represents an out-of-control process relative to the established limits. Statistical control is defined as all data being produced by the same constant process and probability model, which may or may not have an acceptable mean or variance.
Types of Control Charts
The most familiar types of control charts, called Shewhart control charts, were originally developed by Shewhart in 1924, one for each of several types of data that are commonly encountered in practice. Each of these types of data can be described by a statistical distribution that is used to determine the expected value, theoretical standard deviation, and natural variation of the data (i.e., the center line and control limits). Examples of the most common types of data distributions are the normal, binomial, Poisson, and geometric distributions. While many other types of data exist, these distributions will be familiar to many readers as very common and appropriate in many applications.
One of the most common difficulties that practitioners have in using SPC is determining which type of control chart they should construct. As shown in Table 1, the chart type to use in any particular situation is based on identifying which type of data is most appropriate.
6. CONCLUSION
In this report, DM application in quality improvement is presented. There is very little information about DM and its usage in a manufacturing environment. One contribution of this research is to explore in which industries DM has been used and in which quality improvement studies in those industries it has been applied so far.
Control charts are valuable for analyzing and improving clinical process outcomes. Different types of charts should be used in different applications, and sample size guidelines should be used to achieve the desired sensitivity and specificity. SPC is both a data analysis method and a process management philosophy, with important implications for the use of data for improvement rather than for blame, the frequency of data collection, and the type and format of data that should be collected. When dealing with low rates, it also can be advantageous to collect data on the number of cases or the amount of time between adverse events, rather than monthly rates.
In a competitive global market, manufacturing enterprises must stay agile when making quality improvement decisions. The development of IT and other related technologies makes the collection of quality-related data easy and cost-effective. However, it is still an open question how to leverage the large amount of quality data to improve manufacturing quality. This report has approached the problem of quality improvement in manufacturing processes using data mining techniques. The utility of statistical process control (SPC) methods has received growing interest in quality assurance as a way to help improve different processes. The objective of this paper is to provide an overview of the x bar chart as one of the SPC charts, of statistical data analysis, and of how to benefit from data mining concepts and techniques to gain further insight into SPC design and performance, to look for patterns in data, and to improve quality control in manufacturing processes.
References
[1] David F. Groebner, Patrick W. Shannon,
Phillip C. Fry, and Kent D. Smith, “Business Statistics – A Decision Making Approach”,
Seventh Edition, Prentice Hall, 2008.
[2] Hossein Arsham, "Topics in Statistical Data Analysis: Inferring From Data".
[3] Mamdouh Reffat, “Data preparation for
data mining using SAS”, Elsevier, 2007.
[4] Pang-Ning Tan, Michael Steinbach, and
Vipin Kumar, “Introduction to Data Mining”, Pearson Education, Inc, 2006.
[5] Andrew Gelman, John B. Carlin, Hal S.
Stern, and Donald B. Rubin., “Bayesian Data Analysis”, second edition, Chapman
& Hall/CRC, 2004.
[6] David J. Hand, “Why data mining is more
than statistics writ large”, Imperial College of Science, Technology, and
Medicine, Department of Mathematics, Huxley Building, 180 Queen’s Gate, London
SW7 2BZ, UK.
[7] James C. Benneyan, "Design, Use, and Performance of Statistical Control Charts for Clinical Process Improvement", Northeastern University.
[8] Mehmed Kantardzic, “Data Mining Concepts,
Models, Methods, and Algorithms”, IEEE Press, 2003.
[9] Ian H. Witten, & Eibe Frank, “Data
Mining – Practical Machine Learning Tools and Techniques”, second edition, Elsevier, 2005.
[10] Ronald E. Walpole and Raymond H. Myers, "Probability and Statistics for Engineers and Scientists", 8th edition, Prentice Hall.
[11] Agrawal R., Stolorz P., and
Piatetsky-Shapiro G. (eds.) (1998) Proceedings
of the Fourth “International Conference on Knowledge Discovery and Data Mining”.
[12] Elder J, IV, and Pregibon D. (1996) “A
statistical perspective on knowledge discovery in Databases”.
[13] In Fayyad U.M., Piatetsky-Shapiro G.,
Smyth P., and Uthurusamy R. (eds.) “Advances
in Knowledge Discovery and Data Mining”.
[14] Vapnik, V. N., "The Nature of Statistical Learning Theory".
[15] Glymour C., Madigan D., Pregibon D., and Smyth P. (1996), "Statistical inference and data mining", Communications of the ACM, 39, 35-41.
[16] Burr I. W., "The effect of non-normality on constants for X and R charts", Industrial Quality Control, 1967; 23:563-569.
[17] Quinlan J. R., “Induction of Decision
Trees”, Machine Learning, 1, pp. 81-106, 1986
[18] Breiman et al., "Classification and Regression Trees".
[19] Maureen Caudill and Charles Butler, "Understanding Neural Networks", Vol. 1, Basic Networks.
[20] Lu, H., Setiono. R, and Liu, H (1996),
“Effective Data Mining Using Neural Networks”, IEEE Transactions on Knowledge
and Data Engineering, Vol. 8 No. 6 pp 957-961.