A Comparative Analysis of Data Mining Technique using Decision Tree Classifier and SPC
1. INTRODUCTION
- Acceptance sampling
- Statistical process control (SPC)
Statistical Process Control (SPC) is an effective method of monitoring a production process through the use of control charts. By collecting in-process data or random samples of the output at various stages of the production process, one can detect variations or trends in the quality of the materials or processes that may affect the quality of the end product. Because data are gathered during the production process, problems can be detected and prevented much earlier than methods that only look at the quality of the end product. Early detection of problems through SPC can reduce wasted time and resources and may detect defects that other methods would not. Additionally, production processes can be streamlined through the identification of bottlenecks, wait times, and other sources of delay by use of SPC.
These data are potentially useful for learning patterns and extracting knowledge for quality improvement in manufacturing processes. However, because of the large amount of data, it can be difficult to discover the knowledge hidden in the data without proper tools.
Data mining provides a set of techniques to study patterns in data "that can be sought automatically, identified, validated, and used for prediction." Since knowledge is another name for power, organizations attach great importance to knowledge, and to reach it they make use of the data in their databases. Data matter because they allow organizations to learn from the past and to predict future trends and behaviors. Today most organizations use the data collected in their databases when taking strategic decisions. The process of turning data into knowledge consists of two steps: collecting the data and analyzing the data. In the beginning, organizations had difficulties collecting data, so they did not have enough data to perform suitable analyses.
In the long run, with rapid computerization, they became able to store huge amounts of data easily. At that point, however, they faced another problem: analyzing and interpreting such large data sets. Traditional methods such as statistical techniques and data management tools are no longer sufficient. To deal with this problem, the technique called data mining (DM) was developed. DM is a useful and powerful technology that helps companies derive strategic information from their databases. It has been defined as "the process of exploration and analysis, by automatic or semiautomatic means, of large quantities of data in order to discover meaningful patterns and rules."
By meaningful patterns and rules we mean patterns that are easily understood by humans, valid on new data, potentially useful, and novel. Validating a hypothesis that the user wants to prove can also be accepted as a meaningful pattern or rule. In sum, the essence of DM is to derive patterns and rules that lead to strategic and previously unimagined information.
2. DATA MINING PROCESS, TECHNIQUES AND APPLICATIONS
2.1 DM PROCESS
According to Fayyad, DM refers to a set of integrated analytical techniques, divided into several phases, that aim to extrapolate previously unknown knowledge from massive sets of observed data that do not appear to have any obvious regularity or important relationships. Another definition is that DM is the process of selection, extrapolation and modeling of large quantities of data to discover regularities or relations that are at first unknown, with the aim of obtaining clear and useful results from the database. By combining and harmonizing these definitions, in this study DM is considered as a whole process consisting of different steps, where in each step different DM techniques can be used. The steps of the DM process and the possible DM techniques that might be used in each step are classified as follows and as shown in the figure below:
1. Data gathering
a. Feature determination
b. Database formation
2. Data preprocessing
a. Data cleaning
i. Missing data handling
ii. Outlier and inlier handling
iii. Inconsistent data handling
b. Data integration
c. Data transformation
i. Smoothing,
ii. Aggregation,
iii. Generalization,
iv. Normalization of data [For example: min-max normalization, z-score normalization, normalization by decimal scaling etc.]
v. Attribute construction
vi. Discretization and concept hierarchy generation [For example: S-based techniques (DA etc.), NN-based (SOM) etc.]
d. Data reduction
i. Data cube aggregation
ii. Dimension reduction (Feature selection)
1. Feature wrapper
2. Feature filter
iii. Data compression [For example: Wavelet transform, S-based techniques (PCA, FA etc.) etc.]
iv. Numerosity reduction [For example: S-based techniques(R, Histograms etc.), Clustering etc.]
v. Over sampling
3. Modeling
a. Predictive model
i. Classification [For example: S-based techniques (R, BC, LR etc.), DT-based (OC1, ID3, CHAID, ID5R, C4.5 and C5, CART, QUEST, Scalable DT techniques, Statistical batch-based DT learning etc.), R-based (Generating Rules from a DT, Generating Rules from an ANN {Rectangular basis function network}, Generating Rules without a DT and ANN {PRISM, RST, FST etc.}), Combining techniques (Integration of FST and RST, FAN, EN, CC etc.), SVM etc.]
ii. Prediction [For example: S-based techniques (Parametric {MLR as RSM, GLM as ANOVA, MANOVA, TM, NRM as Generalized Additive Models, RR, BR, TSA as exponential smoothing etc.}, Nonparametric {ANOVA as Kruskal-Wallis, R, TSA as Moving average etc.}), DT-based (ID3, C4.5 and C5, CART, CHAID, Scalable DT techniques etc.), NN-based (w.r.t. learning algorithms: feed-forward propagation, back propagation; w.r.t. architecture: RBF {RBF network as Gaussian RBF NN}, Perceptrons, BNN), R-based (Generating Rules from a DT, Generating Rules from an ANN, Generating Rules without a DT and ANN {PRISM}), CBR, FEMS, SVM, Combining techniques (Modular ANN {FNN, Fuzzy ARTMAP NN, ANFIS etc.}) etc.]
b. Descriptive model
i. Clustering [For example: Hierarchical methods (Agglomerative, Divisive), Partitional methods (Minimum spanning tree, Squared error, K-means, Nearest neighbor, PAM, Bond energy, GA, NN based {w.r.t. learning rule: Competitive as SOM, LVQ etc.}, Non competitive (Hebian)), Rule-based (Generating Rules from a ANN) etc.]
ii. Summarization (Visualization and Statistics)
1. Visualization [For example: S-based (Histograms, scatter plots, box plots, pie charts, 3_D plots etc.) etc.]
2. Statistics [For example: Descriptive statistics (mean, median, frequency count etc.), Density estimation etc.]
3. Tables
c. Association
i. Basic methods [For example: Apriori, Sampling, Partitioning etc.]
ii. Advanced association rules method [For example: Generalized association rules, Multiple-level association rules, Quantitative association rules, Using multiple minimum supports, Correlation rules etc.]
d. Optimization [For example: S-based (TM, RSM), NN-based, GA, SA, SQP, Levenberg-Marquardt method etc.]
Abbreviations: S-based: Statistical-based, DT-based: Decision tree-based, NN-based: Neural network-based, R-based: Rule-based, R: Regression, PCA: Principal component analysis, RBF: Radial basis function, SVM: Support vector machines, CBR: Case-based reasoning, GA: Genetic algorithms, SE: Subjective and empirical approach, BNN: Bayesian networks, FAN: Fuzzy adaptive network, EN: Entropy network, CC: Composite classifiers, TSA: Time series analysis, FEM: Finite element modeling, GSA: Grey superior analysis, SA: Simulated annealing, SQP: Sequential quadratic programming method, DA: Discriminant analysis, CA: Correlation analysis, BC: Bayesian classification, GLM: General linear models, NRM: Nonlinear regression models, RR: Robust regression, BR: Bayesian regression, FA: Factor analysis, LR: Logistic regression, MLP: Multilayer perceptron, LVQ: Learning vector quantization
2.2 MAIN DATA MINING TASKS
There are different ways of distinguishing interesting patterns or trends in a huge data set; these are called DM operations or tasks, and different DM task categorizations exist.
For instance, one categorization involves Prediction, Classification, Clustering, Affinity Grouping or Association Rules, and Visualization and Statistics; another involves Classification, Regression, Clustering, Summarization, Dependency Modeling, and Change and Deviation Detection. There are also other categorizations involving classes such as Outlier Analysis and Text Mining. We are mainly interested in the following DM tasks. Optimization does not exist as a category in the literature; we defined it here because, although the surveyed papers commonly used DM tools for optimization purposes, no existing DM task fit this case. The few remaining tasks that are out of the scope of this study (Text mining, Web mining, Spatial mining and Temporal mining), or that are in scope but not studied in the surveyed papers (Affinity Grouping or Association Rules, Visualization and Statistics), are listed in the others part. The analyst can apply one or several of these tasks during the analysis of a dataset.
2.2.1 Data gathering and preprocessing
The first step of DM applications is data gathering. The aim of this step is to obtain the right data. For this purpose, all available data sources are examined and the right data for the analysis at hand are selected. It includes two steps: feature determination and database formation. In feature determination, the variables whose data will be collected are determined, whereas in database formation the collected data are converted into a database format. The second step is data preprocessing. The goal of this step is to investigate the quality of the selected data and then transform them to make them suitable for further analysis. This part is important because real-life data are incomplete, noisy and inconsistent. Data preprocessing consists of data cleaning, data integration, data transformation and data reduction.
Data cleaning deals with filling in missing values, detecting outliers, smoothing noisy data and correcting inconsistencies in the data. Methods for handling missing values are listed in the DM Process part; each has its own advantages and disadvantages.
For example, ignoring a tuple is not an effective method unless the tuple contains many missing values. Similarly, filling in missing values manually is time consuming.
Although using a global constant to fill in missing values is a simple method, it is not recommended. Filling in missing values with the most probable value is the most commonly used technique; methods such as regression and decision tree induction can be used for this purpose. Noisy data are another important problem when real-life data are used. Noise is a random error or variance in a measured variable. Clustering techniques, scatter plots and box plots are helpful for detecting outliers, and smoothing techniques such as binning and regression are used to remove noise. Lastly, there may be inconsistencies in the data, due to errors made at data entry or during data integration. They may be corrected by performing a paper trace.
Data integration is the combination of the necessary data from multiple data sources such as multiple databases, data cubes or flat files. Some problems may occur during data integration. To illustrate, if an attribute can be derived from another table, this indicates a redundancy problem. Another problem is the detection and resolution of data value conflicts: since different representations, scalings or encodings can be used, attribute values for the same entity can differ across data sources.
In conclusion, we should be careful during data integration in order to avoid such problems. Data transformation is changing the data into a form convenient for DM analysis. It includes smoothing, aggregation, generalization, normalization of data and attribute construction. Aggregation is the summarization of data and is used when building a data cube.
An example of generalization is changing the numeric attribute age into the categories young, middle-aged and senior. Normalization is changing the scale of a value so that it falls within a desired range. Many methods are used for normalization; some of them are min-max normalization, z-score normalization and normalization by decimal scaling, as in the sketch below.
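As a brief illustration, the following is a minimal sketch in Python with NumPy of these three normalization methods; the attribute values are invented for the example.

import numpy as np

# Hypothetical sample of a numeric attribute (e.g., monthly income).
x = np.array([1200.0, 1500.0, 1800.0, 2500.0, 4000.0])

# Min-max normalization: rescale values into the range [0, 1].
x_minmax = (x - x.min()) / (x.max() - x.min())

# Z-score normalization: center on the mean and scale by the standard deviation.
x_zscore = (x - x.mean()) / x.std()

# Decimal scaling: divide by 10^j, where j is the smallest integer
# such that the largest absolute scaled value is below 1.
j = int(np.ceil(np.log10(np.abs(x).max())))
x_decimal = x / (10 ** j)

print(x_minmax, x_zscore, x_decimal, sep="\n")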
Attribute construction is building new, useful attributes by combining other attributes in the data. For instance, the ratio of weight to height squared (the obesity index) can be constructed as a new variable so that it may be more logical and beneficial to use in the analysis.
Data reduction is changing the representation of the data so that its volume becomes smaller while the information it contains remains almost equal to that of the original data. It is important because the datasets are huge, and analyzing all of the data is both time consuming and impractical. Methods for data reduction are listed in the DM Process part. Data gathering and data preprocessing together form data preparation: choosing the right data and then converting them into a form suitable for analysis. Data preparation is the most time-consuming part of DM applications; in fact, about half of the time in DM projects is spent on it. Much importance should be given to this part if we do not want to run into problems later in the process.
2.2.2 Classification
Classification is an operation that examines the features of objects and then assigns them to classes predefined by the analyst. For this reason it is called supervised learning. Its aim is to develop a classification or predictive model that increases the explanatory capability of the system. To achieve this, it searches for patterns that discriminate one class from the others.
To illustrate, a simple example of this analysis is classifying the visitors of a website as customers or non-customers. The most commonly used techniques for classification are DT and ANN, and classification is frequently used in the evaluation of credit applications, fraud detection and insurance risk analysis.
2.2.3 Prediction
Prediction is the construction of a model to estimate the value of a feature. In DM, the term classification is used for predicting class labels and discrete or nominal values, whereas the term prediction is mainly used for estimating continuous values. In fact, some books use the name value prediction instead of prediction.
Two traditional techniques, namely linear and nonlinear regression (R/NLR) and ANN, are commonly used for this operation. Moreover, RBF is a newly used technique for value prediction which is more robust than traditional regression techniques.
2.2.4 Clustering
Clustering is an operation that divides a dataset into small groups or segments of similar records according to some criterion or metric. Unlike classification, there are no predefined classes in this operation, so it is called unsupervised learning. It is an unbiased look at potential groupings within a dataset and is used when groupings are suspected in the data without any prior judgment about what the similarity may involve. It is often the first step in a DM analysis, because it is difficult to derive any single pattern or develop any meaningful single model from the entire dataset. Constructing clusters reduces the complexity of the dataset, so the other DM techniques are more likely to be successful. To illustrate, instead of running a new sales campaign for all customers, it is more meaningful to first create customer segments and then run suitable campaigns for the appropriate segments. Clustering often uses methods such as the K-means algorithm or a special form of NN called a Kohonen feature map network (SOM).
2.3 MAIN DATA MINING TECHNIQUES
In DM operations, well-known mathematical and statistical techniques are used. Some of these techniques are grouped under the headings S-based, DT-based, NN-based and Distance-based, and the rest, which are not covered by these four headings, are listed in the others part. Here we only mention the commonly used or well-known techniques under each heading.
2.3.1 Statistical-based techniques
One of the commonly used S-based techniques is R. "Regression analysis is a statistical technique for investigating and modeling the relationship between variables."
The general form of a simple linear regression is y_i = α + βx_i + ε_i, where α is the intercept, β is the slope and ε_i is the error term, which is the unpredictable part of the response variable y_i. α and β are the unknown parameters to be estimated.
The estimated values of α and β can be derived by the method of ordinary least squares as follows (see the sketch below):
β̂ = Σ(x_i − x̄)(y_i − ȳ) / Σ(x_i − x̄)²,  α̂ = ȳ − β̂x̄.
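A minimal sketch of these least-squares estimates in Python with NumPy; the paired x and y values are invented for illustration.

import numpy as np

# Hypothetical paired observations (x: predictor, y: response).
x = np.array([1.0, 2.0, 3.0, 4.0, 5.0])
y = np.array([2.1, 3.9, 6.2, 8.1, 9.8])

x_bar, y_bar = x.mean(), y.mean()

# Ordinary least squares estimates of the slope and intercept.
beta_hat = np.sum((x - x_bar) * (y - y_bar)) / np.sum((x - x_bar) ** 2)
alpha_hat = y_bar - beta_hat * x_bar

# Fitted values and residuals (the estimated error terms).
y_fit = alpha_hat + beta_hat * x
residuals = y - y_fit

print(alpha_hat, beta_hat)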
Regression analysis must satisfy certain assumptions: the predictors must be linearly independent, the error terms must be normally distributed and independent, and the variance of the error terms must be constant. If the distribution of the error term is not normal, then the GLM, which is a useful generalization of ordinary least squares regression, is used. The form of the right-hand side can also be determined from the data, which is called nonparametric regression; this form of regression analysis requires a large number of observations, since the data are used both to build the model structure and to estimate the model parameters. Robust regression is a form of regression analysis which circumvents some limitations of traditional parametric and nonparametric methods and is highly robust to outliers. If the response variable is not continuous, then the logistic regression approach is used.
ANOVA, which stands for analysis of variance, is another well-known S-based technique. It is a statistical procedure for assessing the influence of a categorical variable (or variables) on the variance of a dependent variable. It compares the difference of each subgroup mean from the overall mean with the difference of each observation from its subgroup mean. If the between-group variation is large relative to the within-group variation, then the categorical variable or factor is influential on the dependent variable.
One-way ANOVA measures the effects of one factor only, whereas two-way ANOVA measures both the effects of two factors and the interactions between them simultaneously. The F-test is used to measure the effects of the factors. ANOVA must satisfy certain assumptions: independence of cases, normal distributions within each of the groups, and equal variances of the data in the groups. When the normality assumption fails, the Kruskal-Wallis test, which is a nonparametric alternative, can be used, as in the sketch below.
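A small illustrative sketch, assuming SciPy is available; the group measurements are fabricated for the example.

from scipy import stats

# Hypothetical quality measurements from three machines (the factor).
machine_a = [5.1, 5.3, 5.0, 5.2, 5.4]
machine_b = [5.6, 5.8, 5.7, 5.9, 5.6]
machine_c = [5.2, 5.1, 5.3, 5.2, 5.0]

# One-way ANOVA F-test: does the machine affect the mean measurement?
f_stat, p_anova = stats.f_oneway(machine_a, machine_b, machine_c)

# Kruskal-Wallis test: nonparametric alternative when normality fails.
h_stat, p_kw = stats.kruskal(machine_a, machine_b, machine_c)

print(f_stat, p_anova, h_stat, p_kw)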
2.3.2 Decision tree-based techniques
Decision trees are tree-shaped structures and are among the most commonly used DM techniques. Constructing these trees is simple, and the results can easily be understood by users. In addition, they can practically solve most classification problems. In a DT model, each internal node denotes a test on an attribute and the branches show the outcomes of the test. At the end of the tree are the leaf nodes, which represent classes. During the construction of these trees, the data are split into smaller subsets iteratively. At each iteration, choosing the most suitable independent variable is an important issue: the split which creates the most homogeneous subsets with respect to the dependent variable should be chosen. While choosing the independent variable, attribute selection measures such as information gain and the Gini index are used. The splitting process then continues according to these measures until no more useful splits are found. In brief, the DT technique is useful for classification problems, and the most common types of decision tree algorithms are CHAID, CART and C5.0, as illustrated in the sketch below.
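For illustration, here is a minimal decision tree classification sketch assuming scikit-learn is available; the tiny training set, feature names and class labels are invented. The library builds CART-style trees, where criterion="entropy" corresponds to information gain and criterion="gini" to the Gini index.

from sklearn.tree import DecisionTreeClassifier, export_text

# Hypothetical process data: [temperature, pressure] -> quality class (0 = ok, 1 = defect).
X = [[200, 30], [210, 32], [190, 28], [250, 45], [260, 47], [255, 44]]
y = [0, 0, 0, 1, 1, 1]

# Fit a shallow tree using the entropy (information gain) splitting measure.
tree = DecisionTreeClassifier(criterion="entropy", max_depth=3)
tree.fit(X, y)

# Inspect the learned splits and classify a new observation.
print(export_text(tree, feature_names=["temperature", "pressure"]))
print(tree.predict([[245, 43]]))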
2.3.3 Neural network-based techniques
NN techniques allow us to develop models from historical data that are able to learn much as people do. They are quite capable of deriving meaning from complicated datasets that are difficult for humans or other techniques to interpret. To exemplify:
Figure 1. Example of a Neural Network Architecture
A NN simply combines the inputs (independent variables) with some weights to predict the outputs (dependent variables) based on prior experience. In the figure, A, B and C are input nodes and constitute the input layer. F is the output node and constitutes the output layer. Moreover, in most NNs there are one or more additional layers between the input and output layers, called hidden layers. In the figure, D and E are the hidden nodes and constitute a hidden layer. The weights are shown on the arrows between the nodes. Regarding the strengths and weaknesses of this technique: it is more robust than DT in noisy environments, and it can improve its performance by learning; however, the developed model is difficult to understand, the learning phase may fail to converge, and the input data must be numeric. As a result, NNs are useful for most prediction and classification operations when only the result of the model matters rather than how the model arrives at it. Back propagation is the most commonly used learning technique. It is easily understood and applicable; it adjusts the weights in the NN by propagating weight changes backward from the sink to the source nodes. The perceptron is the simplest NN: a single neuron with multiple inputs and one output. A network of perceptrons is called a multilayer perceptron (MLP). An MLP is a simple feed-forward NN with multiple layers.
A radial basis function network is a NN with three layers. In the hidden layer a Gaussian activation function is used, whereas in the output layer a linear activation function is used. The Gaussian activation function is an RBF with a central point of zero; an RBF is a class of functions whose value decreases (or increases) with the distance from a central point. A minimal forward-pass sketch for a small feed-forward network is shown below.
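The sketch below, in Python/NumPy, shows a single forward pass through a network shaped like the one in Figure 1 (three inputs A, B, C; hidden nodes D, E; output F). The weight values and input values are arbitrary placeholders; in practice, back propagation would adjust the weights from training data.

import numpy as np

def sigmoid(z):
    # Common activation function for feed-forward networks.
    return 1.0 / (1.0 + np.exp(-z))

# Inputs A, B, C (hypothetical, already normalized).
x = np.array([0.2, 0.7, 0.5])

# Arbitrary example weights: inputs -> hidden nodes D, E, and hidden -> output F.
w_hidden = np.array([[0.4, -0.6, 0.1],    # weights into D
                     [0.3, 0.8, -0.5]])   # weights into E
w_output = np.array([0.7, -0.2])          # weights from D, E into F

# Forward propagation: weighted sums passed through the activation function.
hidden = sigmoid(w_hidden @ x)
output = sigmoid(w_output @ hidden)
print(output)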
2.3.4 Hierarchical and Partitional techniques
Cluster analysis identifies the distinguishing characteristics of a dataset and then divides it into partitions so that records in the same group are as similar as possible and records in different groups are as dissimilar as possible. The basic operation is the same in all clustering algorithms. Each record is compared with the existing clusters and assigned to the cluster whose centroid is the closest. The centroids of the new clusters are then recalculated, and once again each record is assigned to the cluster with the closest centroid. At each iteration the class boundaries, which are the lines equidistant between each pair of centroids, are computed. This process continues until the cluster boundaries stop changing. As a distance measure, most clustering algorithms use the Euclidean distance formula; non-numeric variables must be transformed before this formula can be used.
Hierarchical clustering techniques can generate sets of clusters, whereas partitional techniques generate only one set of clusters, so in partitional techniques the user has to specify the number of clusters. In an agglomerative algorithm, which is one of the hierarchical clustering techniques, each observation initially forms its own cluster; the algorithm then combines these clusters iteratively until one cluster is obtained. On the other hand, in K-means clustering, which is one of the partitional clustering techniques, observations are moved among sets of clusters until the desired set is obtained, as in the sketch below.
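A minimal K-means sketch assuming scikit-learn is available; the two-dimensional customer records and the choice of three clusters are made up for illustration.

import numpy as np
from sklearn.cluster import KMeans

# Hypothetical customer records: [annual spending, number of visits].
X = np.array([[200, 5], [220, 6], [210, 4],
              [800, 20], [820, 22], [790, 19],
              [450, 12], [470, 11], [440, 13]])

# Partitional clustering: the user must specify the number of clusters.
kmeans = KMeans(n_clusters=3, n_init=10, random_state=0)
labels = kmeans.fit_predict(X)

# Cluster assignments and the final centroids (Euclidean distance is used).
print(labels)
print(kmeans.cluster_centers_)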
2.3.5 Others
Genetic algorithm is an optimization type algorithm. It can be used for classification, clustering and generating association rules. It has five steps:
1. Starting set of individuals, P.
2. Crossover technique.
3. Mutation algorithm.
4. Fitness function
5. Algorithm that applies the crossover and mutation techniques to P iteratively, using the fitness function to determine the best individuals in P to keep. The algorithm replaces a predefined number of individuals from the population with each iteration and terminates when some threshold is met.
This algorithm begins with an assumed starting model. Using crossover algorithms, it combines models to generate new models iteratively, and a fitness function selects the best models from these. At the end, it finds the fittest models from a set of models to represent the data. A compact sketch of these steps is given below.
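A compact sketch of the five steps in Python; the bit-string encoding, the toy fitness function and the parameter values are all illustrative assumptions rather than a specific published algorithm.

import random

random.seed(0)
BITS, POP, GENS = 10, 20, 30

def fitness(ind):
    # Toy fitness: number of 1-bits (replace with a problem-specific measure).
    return sum(ind)

def crossover(a, b):
    # Single-point crossover of two parent bit strings.
    point = random.randint(1, BITS - 1)
    return a[:point] + b[point:]

def mutate(ind, rate=0.05):
    # Flip each bit with a small probability.
    return [1 - bit if random.random() < rate else bit for bit in ind]

# 1. Starting set of individuals P.
population = [[random.randint(0, 1) for _ in range(BITS)] for _ in range(POP)]

# 5. Apply crossover and mutation iteratively, keeping the fittest individuals.
for _ in range(GENS):
    parents = sorted(population, key=fitness, reverse=True)[:POP // 2]
    children = [mutate(crossover(random.choice(parents), random.choice(parents)))
                for _ in range(POP)]
    population = sorted(parents + children, key=fitness, reverse=True)[:POP]

best = max(population, key=fitness)
print(best, fitness(best))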
2.4 APPLICATION AREAS
Today main application areas of DM are as follows:
Marketing:
· Customer segmentation
o Find clusters of "model" customers who share the same characteristics: interests, income level, spending habits, etc.
o Determining the correlations between the demographic properties of the customers
o Various marketing campaign
o Constructing marketing strategies for not losing present customers
o Market basket analysis
o Cross-market analysis
§ Associations/correlations between product sales
§ Prediction based on the association information
o Customer evaluation
o Different customer analysis
§ Determine customer purchasing patterns over time
§ What types of customers buy what products
§ Identifying the best products for different customers
o CRM
Banking:
· Finding the hidden correlations among the different financial indicators
· Fraud detection
· Customer segmentation
· Evaluation of the credit demands
· Risk analysis
· Risk management
Insurance:
· Estimating the customers who demand new insurance policy
· Fraud detection
· Determining the properties of risky customers
Retailing:
· Point of sale data analysis
· Buying and selling basket analysis
· Supply and store layout optimization
Bourse:
· Growth stock price prediction
· General market analysis
· Purchase and sale strategies optimization
Telecommunication:
· Quality analysis and improvement
· Allotment fixing
· Line busyness prediction
Health and Medicine:
· Test results prediction
· Product development
· Medical diagnosis
· Cure process determination
Industry:
· Quality control and improvement
o Product design
· Concept design
· Parameter design (design optimization)
· Tolerance design
· Manufacturing process design
o Concept design
o Parameter design (design optimization)
o Tolerance design
· Manufacturing
o Quality monitoring
o Process control
o Inspection / Screening
o Quality analysis
· Customer usage
· Warranty and repair / replacement
Science and Engineering:
· Analysis of scientific and technical problems by constructing models using empirical data
3. LITERATURE SURVEY
James C. Benneyan, Ph.D., Northeastern University: This paper provides an overview of statistical process control (SPC) charts, the different uses of these charts, the most common types of charts, when to use each type, and guidelines for determining an appropriate sample size. The intent is to provide an introduction to these methods and further insight into their design and performance beyond what exists in the current literature. The utility of control charts to help improve clinical and administrative processes has received growing interest in the healthcare community.
It is useful to distinguish between two types of data mining exercises. The first is data modeling, in which the aim is to produce some overall summary of a given data set, characterizing its main features. Thus, for example, we may produce a Bayesian belief network, a regression model, a neural network, a tree model, and so on. Clearly this aim is very similar to the aim of standard statistical modeling. Having said that, the large sizes of the data sets often analyzed in data mining can mean that there are differences. In particular, standard algorithms may be too slow, and standard statistical model-building procedures may lead to over-complex models since even small features will be highly significant. We return to these points below.
It is probably true to say that most statistical work is concerned with inference in one form or another. That is, the aim is to use the available data to make statements about the population from which it was drawn, values of future observations, and so on. Much data mining work is also of this kind. In such situations one conceptualizes the available data as a sample from some population of values which could have been chosen. However, in many data mining situations all possible data are available, and the aim is not to make an inference beyond these data but rather to describe them. In this case it is inappropriate to use inferential procedures such as hypothesis tests to decide whether or not some feature of the describing model should be retained; other criteria must be used.
The second type of data mining exercise is pattern detection. Here the aim is not to build an overall global descriptive model, but rather to detect peculiarities, anomalies, or simply unusual or interesting patterns in the data. Pattern detection has not been a central focus of activity for statisticians, where the (inferential) aim has rather been assessing the 'reality' of a pattern once detected. In data mining the aim is to locate the patterns in the first place, typically leaving the establishment of their reality, interest, or value to the database owner or a domain expert. Thus a data miner might locate clusters of people suffering from a particular disease, while an epidemiologist will assess whether the cluster would be expected to arise simply from random variation. Of course, most problems occur in data spaces of more than two variables (and with many points), which is why we must use formal analytic approaches.
Ronald E. Walpole and Raymond H. Myers, "Probability and Statistics for Engineers and Scientists", 8th edition, Prentice Hall: Statistics is a discipline which is concerned with:
· Summarizing information to aid understanding,
· Drawing conclusions from data,
· Estimating the present or predicting the future, and
· Designing experiments and other data collection.
In making predictions, statistics uses the companion subject of probability, which models chance mathematically and enables calculations of chance in complicated cases. Today, statistics has become an important tool in the work of many academic disciplines such as medicine, psychology, education, sociology, engineering and physics, just to name a few.
Statistics is also important in many aspects of society such as business, industry and government. Because of the increasing use of statistics in so many areas of our lives, it has become very desirable to understand and practice statistical thinking. This is important even if you do not use statistical methods directly.
Ian H. Witten and Eibe Frank, "Data Mining – Practical Machine Learning Tools and Techniques", second edition, Elsevier, 2005: Data mining is defined as the process of discovering patterns in data. The process must be automatic or (more usually) semiautomatic. The patterns discovered must be meaningful in that they lead to some advantage, usually an economic advantage. The data is invariably present in substantial quantities.
How are the patterns expressed? Useful patterns allow us to make nontrivial predictions on new data. There are two extremes for the expression of a pattern: as a black box whose innards are effectively incomprehensible and as a transparent box whose construction reveals the structure of the pattern. Both, we are assuming, make good predictions. The difference is whether or not the patterns that are mined are represented in terms of a structure that can be examined, reasoned about, and used to inform future decisions. Such patterns we call structural because they capture the decision structure in an explicit way. In other words, they help to explain something about the data.
Shu-guang He, Zhen He, G. Alan Wang and Li Li, "Quality Improvement using Data Mining in Manufacturing Processes": Data mining provides a set of techniques to study patterns in data "that can be sought automatically, identified, validated, and used for prediction."
It has become an emerging topic in the field of quality engineering. Andrew Kusiak (2001) used a decision tree algorithm to identify the cause of soldering defects on circuit boards. The rules derived from the decision tree greatly simplified the process of quality diagnosis. Shao-Chuang Hsu (2007) and Chen-Fu Chien (2006 and 2007) demonstrated the use of data mining for semiconductor yield improvement.
Data mining has also been applied to the product development process (Bakesh Menon, 2004) and to assembly lines (Sébastien Gebus, 2007). Some researchers have combined data mining with traditional statistical methods and applied them to quality improvement. Examples are the use of MSPC (multivariate statistical control charts) and neural networks in a detergent-making company (Seyed Taghi Akhavan Niaki, 2005; Tai-Yue Wang, 2002), the combination of an automated decision system and six sigma in the General Electric Financial Assurance businesses (Angie Patterson, 2005), the combined use of decision trees and SPC with data from Holmes and Mergen (Ruey-Shiang Guh, 2008), the use of SVR (support vector regression) and control charts (Ben Khediri ISSam, 2008), and the use of ANN (artificial neural network), SA (simulated annealing) and Taguchi experiment design (Hsu-Hwa Chang, 2008). Giovanni C. Porzio (2003) presented a method for visually mining off-line data with a combination of ANN and T2 control charts to identify assignable variation automatically.
4. PROBLEM STATEMENT
Control charts are valuable for analyzing and improving industrial process outcomes. Data mining, the extraction of hidden predictive information from large databases, is a powerful technology with great potential to help companies focus on the most important information in their data warehouses.
Data mining tools predict future trends and behaviors, allowing businesses to make proactive, knowledge-driven decisions. The automated, prospective analyses offered by data mining move beyond the analyses of past events provided by the retrospective tools typical of decision support systems. Data mining tools can answer business questions that traditionally were too time consuming to resolve. They scour databases for hidden patterns, finding predictive information that experts may miss because it lies outside their expectations.
Most companies already collect and refine massive quantities of data. The procedure used to perform data mining modeling and analysis has undergone a long transformation from the domain of academic research to a systematic industrial process performed by business and quantitative analysts. Several methodologies have been proposed to cast the steps of developing and deploying data mining models into a standardized process.
Data mining techniques can be implemented rapidly on existing software and hardware platforms to enhance the value of existing information resources, and can be integrated with new products and systems as they are brought on-line. SPC is both a data analysis method and a process management philosophy, with important implications for the use of data mining for continuous improvement and enhancement.
5. PROPOSED METHODOLOGIES
5.1 Knowledge-based continuous quality improvement in manufacturing processes
Continuous quality improvement is an important concept in the field of manufacturing quality management. DMAIC (Define-Measure-Analyze-Improve-Control) for six sigma is the most commonly used model of continuous quality improvement. Fig. 1 illustrates the processes of DMAIC.
Fig. 1. The processes of DMAIC
We can observe from Fig. 1 that DMAIC is a problem-driven approach. The entire process begins with locating a problem. However, in a complicated manufacturing process, such as semiconductor manufacturing or steelmaking, it may not be easy to identify and define a proper problem. Moreover, in high-speed manufacturing processes quality problems must be quickly identified and eliminated; otherwise they may lead to large losses in both cost and productivity. Therefore, we propose a knowledge-based quality improvement model (see Fig. 2). Different from DMAIC, this model is a goal-driven process. The central idea of the knowledge-based quality improvement model is to mine the mass of data collected from a manufacturing process using automated data mining techniques. The goal is to improve the quality performance of manufacturing processes by quickly identifying and eliminating quality problems.
In the knowledge-based quality improvement model, the first step is to define the goal. The goal here may be defect elimination, efficiency improvement, or yield improvement.
Data mining is used to analyze the quality-related data to find the relationships between the goal and factors such as machinery parameters, operators, and material vendors. After the knowledge has been verified, opportunities for quality improvement can be identified using the knowledge and patterns learned by the data mining techniques. The scope of the problem can be broad, across different phases of a manufacturing process. In the following sections, we explain how to apply the model to parameter optimization, quality diagnosis and service data analysis.
Fig. 2. The knowledge-based quality improvement model
5.2 Quality diagnosis with data mining
During a manufacturing process, product quality can be affected by two types of variation: random variations and assignable variations. Random variations are caused by the intrinsic characteristics of a manufacturing process and cannot be eliminated completely.
SPC (Statistical Process Control) and MSPC (Multivariate SPC) are the most widely used tools in manufacturing for finding assignable variations. Although they can effectively detect assignable variations in manufacturing processes, they give no clue for identifying the root causes of the assignable variations. Data mining techniques can again be employed in this case to provide insights for quality diagnosis.
Fig. 3. The combination of SPC and data mining
In the model shown in Fig. 3, the yield ratio of a product is defined as the index of the quality performance of a manufacturing process. The chart on the left is a control chart. When the chart shows alarming signals, i.e., points located beyond the control limits, the data mining process is engaged. Data related to quality are stored in a data warehouse. Data mining techniques such as decision trees and association rule mining can be applied to the data to identify the causes of the alarming signals, as in the sketch below.
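The sketch below illustrates the idea of Fig. 3 in Python, assuming NumPy and scikit-learn are available. The process data, the 3-sigma limits and the candidate cause attributes (machine, vendor) are all hypothetical: a control chart flags out-of-control points, and a decision tree is then trained on the alarm/no-alarm label against the process attributes to hint at possible root causes.

import numpy as np
from sklearn.tree import DecisionTreeClassifier, export_text

rng = np.random.default_rng(0)

# Phase I: hypothetical in-control data used to estimate the control limits.
baseline = rng.normal(10.0, 0.2, 100)
center, sigma = baseline.mean(), baseline.std(ddof=1)
ucl, lcl = center + 3 * sigma, center - 3 * sigma

# Phase II: new measurements with two candidate cause attributes; one
# machine/vendor combination is given an artificial mean shift.
n = 200
machine = rng.integers(0, 2, n)
vendor = rng.integers(0, 3, n)
shift = 1.0 * ((machine == 1) & (vendor == 2))
measurement = rng.normal(10.0, 0.2, n) + shift

# Control-chart step: alarms are points outside the 3-sigma limits.
alarm = ((measurement > ucl) | (measurement < lcl)).astype(int)

# Quality diagnosis step: a decision tree relates alarms to the candidate causes.
X = np.column_stack([machine, vendor])
tree = DecisionTreeClassifier(max_depth=3).fit(X, alarm)
print(export_text(tree, feature_names=["machine", "vendor"]))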
5.3 Quality management and tools for process improvement
In the early 1980s, U.S. business leaders began to realize the competitive importance of providing high-quality products and services, and a quality revolution was under way in the United States.
The numbers attached to each point do not indicate an order of importance; rather, the 14 points collectively are seen as necessary steps to becoming a world-class company. Deming's 14 points are:
- Create a constancy of purpose toward the improvement of products and services in order to become competitive, stay in business, and provide jobs.
- Adopt the new philosophy. Management must learn that it is in a new economic age and awaken to the challenge, learn their responsibilities, and take on leadership for change.
- Stop depending on inspection to achieve quality. Build in quality from the start.
- Stop awarding contracts on the basis of low bids.
- Improve continuously and forever the system of production and service to improve quality and productivity, and thus constantly reduce costs.
- Institute training on the job.
- Institute leadership. The purpose of leadership should be to help people and technology work better.
- Drive out fear so that everyone may work efficiently.
- Break down barriers between departments so that people can work as a team.
- Eliminate slogans, exhortations, and targets for the workforce. They create adversarial relationships.
- Eliminate quotas and management by objectives. Substitute leadership.
- Remove barriers that rob employees of their pride of workmanship.
- Institute a vigorous program for education and self-improvement.
- Make the transformation everyone's job and put everyone to work on it.
Other individuals have also played significant roles in the quality movement. Among these are Philip B. Crosby, who is probably best known for his book Quality Is Free, and Kaoru Ishikawa, who developed and popularized the application of the fishbone diagram. Finally, we must not overlook the contributions of many different managers at companies such as Hewlett-Packard, General Electric, and Motorola.
These leaders synthesized and applied many different quality ideas and concepts to their organizations in order to create world-class corporations. By sharing their success with other firms, they have inspired and motivated others to continually seek opportunities in which the tools of quality can be applied to improve business processes.
Statistical process control (SPC) charts can be used for quality and process improvement. SPC charts are a special type of trend chart. In addition to the data, the charts display the process average and the upper and lower control limits. These control limits define the range of random variation expected in the output of a process. SPC charts are used to provide early warnings when a process has gone out of control.
Statistical inference is the process of making conclusions using data that are subject to random variation, for example, observational errors or sampling variation. More substantially, the terms statistical inference, statistical induction and inferential statistics are used to describe systems of procedures that can be used to draw conclusions from datasets arising from systems affected by random variation. Statistical inference is concerned with drawing conclusions from numerical data about quantities that are not observed.
An important component of statistical inference is hypothesis testing. Information contained in a sample is subject to sampling error; the sample mean will almost certainly not equal the population mean. Therefore, in situations in which we need to test a claim about a population mean by using the sample mean, we cannot simply compare the sample mean to the claim and reject the claim if x bar and the claim are different. Instead, we need a testing procedure that incorporates the potential of sampling error.
Statistical hypothesis testing provides managers with a structured analytical method for making decisions of this type. It lets them make decisions in such a way that the probability of decision errors can be controlled, or at least measured. Even though statistical hypothesis testing does not eliminate the uncertainty in the managerial environment, the techniques involved often allow managers to identify and control the level of uncertainty. SPC is actually an application of hypothesis testing. The testing of statistical hypotheses is perhaps the most important area of decision theory.
A statistical hypothesis is an assumption or statement, which may or may not be true, concerning one or more populations. The truth or falsity of a statistical hypothesis is never known with certainty unless we examine the entire population, which of course would be impractical in most situations. Instead, we take a random sample from the population of interest and use the information contained in this sample to decide whether the hypothesis is likely to be true or false. Evidence from the sample that is inconsistent with the statistical hypothesis leads to the rejection of the hypothesis, whereas evidence supporting the hypothesis leads to its acceptance. We should make it clear at this point that the acceptance of a statistical hypothesis is a result of insufficient evidence to reject it and does not necessarily imply that it is true. Although we shall use the terms accept and reject frequently throughout this paper, it is important to understand that rejecting a hypothesis is to conclude that it is false, while accepting a hypothesis merely implies that we have no evidence to believe otherwise. Hypotheses that we formulate with the hope of rejecting are called null hypotheses and are denoted by H0. The rejection of H0 leads to the acceptance of an alternative hypothesis denoted by H1. Based on the sample data, we either reject H0 or we do not reject H0.
Types of statistical errors:
Because of the potential for extreme sampling errors, two possible errors can occur when a hypothesis is tested: Type I and Type II errors. These errors describe the relationship between what actually exists (a state of nature) and the decision based on the sample information.
There are three possible outcomes: no error (correct decision), Type I error, and Type II error. Only one of these outcomes will occur for a hypothesis test. If the null hypothesis is true and an error is made, it must be a Type I error. On the other hand, if the null hypothesis is false and an error is made, it must be a Type II error. We should never use the phrase "accept the null hypothesis"; instead we should use "do not reject the null hypothesis". Thus, the only two hypothesis testing decisions are "reject H0" and "do not reject H0". In most business applications, the purpose of the hypothesis test is to direct the decision maker to take one action or another based on the test results, as in the sketch below.
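A minimal sketch of such a test in Python, assuming SciPy is available; the sample values and the claimed population mean of 50 are invented. A one-sample t-test incorporates sampling error instead of comparing x bar to the claim directly.

from scipy import stats

# Hypothetical sample measurements; H0: population mean = 50, H1: mean != 50.
sample = [49.2, 50.8, 51.1, 49.7, 50.3, 52.0, 48.9, 50.6]

t_stat, p_value = stats.ttest_1samp(sample, popmean=50.0)

alpha = 0.05  # chosen probability of a Type I error
if p_value < alpha:
    print("Reject H0")           # evidence inconsistent with the claim
else:
    print("Do not reject H0")    # insufficient evidence to reject the claim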
Statistical process control charts:
In applications in which the value of the random variable is determined by measuring rather than counting, the random variable is said to be (approximately) continuous, and the probability distribution associated with the random variable (as in this article) is called a continuous probability distribution.
SPC can be used in business to define the boundaries that represent the amount of variation that can be considered normal.
The normal distribution is used extensively in statistical decision making such as hypothesis testing and estimation. The mathematical equation for the probability distribution of the continuous normal variable is f(x) = (1/(σ√(2π))) e^(−(x−μ)²/(2σ²)). The x bar chart is used to monitor a process average, and the R chart is used to monitor process variation by means of the subgroup ranges. They require that the variable of interest be quantitative. These control charts can be developed using the following steps (see the sketch after this list):
- Collect the initial sample data from which the control charts will be developed.
- Calculate the subgroup means x bar i and ranges Ri.
- Compute the average of the subgroup means and the average range value.
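A sketch of these steps in Python with NumPy; the subgroup data are invented, and the control-chart constants A2, D3, D4 for subgroups of size 5 are the standard tabulated values, quoted here from memory and worth verifying against a published table.

import numpy as np

# Step 1: initial sample data, organized as subgroups of size 5 (hypothetical values).
subgroups = np.array([
    [10.1, 10.3,  9.9, 10.0, 10.2],
    [10.4, 10.2, 10.1, 10.3, 10.0],
    [ 9.8, 10.0, 10.1,  9.9, 10.2],
    [10.2, 10.1, 10.3, 10.4, 10.0],
])

# Step 2: subgroup means (x bar i) and ranges (Ri).
xbar_i = subgroups.mean(axis=1)
r_i = subgroups.max(axis=1) - subgroups.min(axis=1)

# Step 3: grand average of the subgroup means and the average range.
xbar_bar, r_bar = xbar_i.mean(), r_i.mean()

# Control limits using tabulated constants for subgroup size n = 5.
A2, D3, D4 = 0.577, 0.0, 2.114
xbar_limits = (xbar_bar - A2 * r_bar, xbar_bar + A2 * r_bar)
r_limits = (D3 * r_bar, D4 * r_bar)

print("x-bar chart:", xbar_bar, xbar_limits)
print("R chart:", r_bar, r_limits)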
5.4 Quality Control Charts Overview and Interpretation
Statistical process control charts are chronological graphs of process data that are used to help understand, control, and improve processes - such as infection control or adverse event processes - and that, although based in statistical theory, are easy for practitioners to use and interpret. While there are several different types of control charts, the general format and interpretation of the most common and simplest type, called a Shewhart control chart, are shown in Figure 4. Some statistic of interest, such as the number of cases of ventilator-associated pneumonia per 100 device days, is plotted on the chart and interpreted on a monthly or weekly basis.
Fig. 4. Shewhart control chart
Uses of Control Charts
It is important to emphasize that control charts have several important, somewhat sequential, roles in quality improvement work:
- Understanding current and past process performance and its degree of consistency and predictability;
- Establishing a "state of statistical control" by identifying and removing causes of unnatural (or "special cause") variation so as to achieve a consistent and predictable level of process quality over time;
- Improving a process by identifying and removing causes of natural (or "common cause") variation and by testing whether interventions result in an improvement; and
- Monitoring for process deterioration and "holding the gains" by identifying special causes of unnatural variation when they arise in the future.
Establishing a State of Statistical Control
Note that while the latter two uses of control charts - testing and holding the gains - tend to be the most well known in many popular quality improvement models, the first two activities are very important but unfortunately often overlooked or misunderstood. In many applications, considerable value exists in "merely" achieving a state of statistical control. As in other industries, many healthcare processes will not be stable and consistent when first examined and will require significant effort to bring them into a state of statistically consistent behavior (i.e., statistical control). This activity is referred to as "trial control charting" because of the focus on testing whether the process is consistent and on attempting to bring it into a state of operation such that it produces consistent and predictable results. This iterative process occurs over a period of time and consists of:
- Constructing an initial trial control chart to test for statistical control,
- Searching for and removing assignable causes of unnatural variability,
- Removing all affected data and recalculating the center line and control limits from the remaining data (with the addition of new data if available or necessary),
- Searching a second time for causes of unnatural variability,
- Removing these data and reconstructing the control chart a second time as above, and
- Repeating this process as many times as is necessary until a state of statistical control is reached.
Monitoring and Improving
Once a stable process exists (i.e., a state of statistical control has been established), the control chart is used to monitor the process for signals that a change has occurred (a "special cause" or "unnatural" variability in SPC terminology) - points outside the control limits or violations of any of the within-limit rules. If these are changes for the worse, such as an increase in the ventilator-associated pneumonia rate, then an effort should be made to discover the causes so that they can be removed and prevented in the future.
Causes of changes for the better also should be investigated and understood so that they can be implemented on a regular basis. While this monitoring activity tends to be the most familiar use associated with control charts, it also is the most passive use from a process improvement perspective, as it is focused primarily on maintaining the status quo. It also is important to note that being in a state of statistical control does not necessarily imply that the process is performing at an acceptable level or that the outcome rate is good; either an increase or a decrease (i.e., an improvement) in the outcome rate represents an out-of-control process relative to the established limits. Statistical control is defined as all data being produced by the same constant process and probability model, which may or may not have an acceptable mean or variance.
Types of Control Charts
The most familiar types of control charts, called Shewhart control charts, were originally developed by Shewhart in 1924, one for each of several types of data that are commonly encountered in practice. Each of these types of data can be described by a statistical distribution that is used to determine the expected value, theoretical standard deviation, and natural variation of the data (i.e., the center line and control limits). Examples of the most common types of data distributions are the normal, binomial, Poisson, and geometric distributions. While many other types of data exist, these distributions will be familiar to many readers as very common and appropriate in many applications.
One of the most common difficulties that practitioners have in using SPC is determining which type of control chart they should construct. As shown in Table 1, the chart type to use in any particular situation is based on identifying which type of data is most appropriate.
6. CONCLUSION
In this report, DM application in quality improvement is presented. There is very little information about DM and its usage in a manufacturing environment. One contribution of this research is to explore in which industries DM has been used and in which quality improvement studies in those industries it has been applied so far.
Control charts are valuable for analyzing and improving clinical process outcomes. Different types of charts should be used in different applications, and sample size guidelines should be used to achieve the desired sensitivity and specificity. SPC is both a data analysis method and a process management philosophy, with important implications for the use of data for improvement rather than for blame, the frequency of data collection, and the type and format of data that should be collected. When dealing with low rates, it also can be advantageous to collect data on the number of cases or the amount of time between adverse events, rather than monthly rates.
In a competitive global market, manufacturing enterprises must stay agile when making quality improvement decisions. The development of IT and other related technologies makes the collection of quality-related data easy and cost-effective. However, it is still an open question how to leverage the large amount of quality data to improve manufacturing quality. This report has approached the problem of quality improvement in manufacturing processes using data mining techniques. The utility of statistical process control (SPC) methods has received growing interest in quality assurance as a way to help improve different processes. The objective of this paper is to provide an overview of the x bar chart as one of the SPC charts, of statistical data analysis, and of how to benefit from data mining concepts and techniques to gain further insight into SPC design and performance, to look for patterns in data, and to improve quality control in manufacturing processes.
References
[1] David F. Groebner, Patrick W. Shannon,
Phillip C. Fry, and Kent D. Smith, “Business Statistics – A Decision Making Approach”,
Seventh Edition, Prentice Hall, 2008.
[2] Hossein Arsham, "Topics in Statistical Data Analysis: Inferring From Data".
[3] Mamdouh Reffat, “Data preparation for
data mining using SAS”, Elsevier, 2007.
[4] Pang-Ning Tan, Michael Steinbach, and
Vipin Kumar, “Introduction to Data Mining”, Pearson Education, Inc, 2006.
[5] Andrew Gelman, John B. Carlin, Hal S.
Stern, and Donald B. Rubin., “Bayesian Data Analysis”, second edition, Chapman
& Hall/CRC, 2004.
[6] David J. Hand, “Why data mining is more
than statistics writ large”, Imperial College of Science, Technology, and
Medicine, Department of Mathematics, Huxley Building, 180 Queen’s Gate, London
SW7 2BZ, UK.
[7] James C. Benneyan, "Design, Use, and Performance of Statistical Control Charts for Clinical Process Improvement", Northeastern University.
[8] Mehmed Kantardzic, “Data Mining Concepts,
Models, Methods, and Algorithms”, IEEE Press, 2003.
[9] Ian H. Witten, & Eibe Frank, “Data
Mining – Practical Machine Learning Tools and Techniques”, second edition, Elsevier, 2005.
[10] Ronald E. Walpole and Raymond H. Myers, "Probability and Statistics for Engineers and Scientists", 8th edition, Prentice Hall.
[11] Agrawal R., Stolorz P., and
Piatetsky-Shapiro G. (eds.) (1998) Proceedings
of the Fourth “International Conference on Knowledge Discovery and Data Mining”.
[12] Elder J, IV, and Pregibon D. (1996) “A
statistical perspective on knowledge discovery in Databases”.
[13] In Fayyad U.M., Piatetsky-Shapiro G.,
Smyth P., and Uthurusamy R. (eds.) “Advances
in Knowledge Discovery and Data Mining”.
[14] Vapnik, V. N., "The Nature of Statistical Learning Theory".
[15] Glymour C., Madigan D., Pregibon D., and Smyth P. (1996), "Statistical inference and data mining", Communications of the ACM, 39, 35-41.
[16] Burr I. W., "The effect of non-normality on constants for X and R charts", Industrial Quality Control, 1967; 23:563-569.
[17] Quinlan J. R., “Induction of Decision
Trees”, Machine Learning, 1, pp. 81-106, 1986
[18] Breiman et al., "Classification and Regression Trees".
[19] Maureen Caudill and Charles Butler, "Understanding Neural Networks", Vol. 1, Basic Networks.
[20] Lu, H., Setiono. R, and Liu, H (1996),
“Effective Data Mining Using Neural Networks”, IEEE Transactions on Knowledge
and Data Engineering, Vol. 8 No. 6 pp 957-961.