Search This Blog

Sunday, February 12, 2012

EM algorithm starting with E-step for parameterized Gaussian mixture models


em {mclust}R Documentation

EM algorithm starting with E-step for parameterized Gaussian mixture models.

Description

Implements the EM algorithm for parameterized Gaussian mixture models, starting with the expectation step.

Usage

em(modelName, data, parameters, prior = NULL, control = emControl(),
   warn = NULL, ...)

Arguments

modelName A character string indicating the model. The help file for mclustModelNames describes the available models.
data A numeric vector, matrix, or data frame of observations. Categorical variables are not allowed. If a matrix or data frame, rows correspond to observations and columns correspond to variables.
parameters A names list giving the parameters of the model. The components are as follows:
pro
Mixing proportions for the components of the mixture. If the model includes a Poisson term for noise, there should be one more mixing proportion than the number of Gaussian components.
mean
The mean for each component. If there is more than one component, this is a matrix whose kth column is the mean of the kth component of the mixture model.
variance
A list of variance parameters for the model. The components of this list depend on the model specification. See the help file for mclustVariance for details.
Vinv
An estimate of the reciprocal hypervolume of the data region. If set to NULL or a negative value, the default is determined by applying function hypvol to the data. Used only when pro includes an additional mixing proportion for a noise component.
prior Specification of a conjugate prior on the means and variances. The default assumes no prior.
control A list of control parameters for EM. The defaults are set by the call emControl().
warn A logical value indicating whether or not a warning should be issued when computations fail. The default is warn=FALSE.
... Catches unused arguments in indirect or list calls via do.call.

Value

A list including the following components:
modelName A character string identifying the model (same as the input argument).
z A matrix whose [i,k]th entry is the conditional probability of the ith observation belonging to the kth component of the mixture.
parameters
pro
A vector whose kth component is the mixing proportion for the kth component of the mixture model. If the model includes a Poisson term for noise, there should be one more mixing proportion than the number of Gaussian components.
mean
The mean for each component. If there is more than one component, this is a matrix whose kth column is the mean of the kth component of the mixture model.
variance
A list of variance parameters for the model. The components of this list depend on the model specification. See the help file for mclustVariance for details.
Vinv
The estimate of the reciprocal hypervolume of the data region used in the computation when the input indicates the addition of a noise component to the model.
loglik The log likelihood for the data in the mixture model.
Attributes:
    "info"
    Information on the iteration.
    "WARNING"
    An appropriate warning if problems are encountered in the computations.

References

C. Fraley and A. E. Raftery (2002). Model-based clustering, discriminant analysis, and density estimation. Journal of the American Statistical Association 97:611-631.
C. Fraley and A. E. Raftery (2005). Bayesian regularization for normal mixture estimation and model-based clustering. Technical Report, Department of Statistics, University of Washington.
C. Fraley and A. E. Raftery (2006). MCLUST Version 3 for R: Normal Mixture Modeling and Model-Based Clustering, Technical Report no. 504, Department of Statistics, University of Washington.

See Also

emE, ..., emVVV, estep, me, mstep, mclustOptions, do.call

Examples

msEst <- mstep(modelName = "EEE", data = iris[,-5], 
               z = unmap(iris[,5]))
names(msEst)

em(modelName = msEst$modelName, data = iris[,-5],
   parameters = msEst$parameters)
## Not run: 
do.call("em", c(list(data = iris[,-5]), msEst))   ## alternative call
## End(Not run)

[Package mclust version 3.1-1 Index]

EM algorithm for incomplete categorical data


em.cat {cat}R Documentation

EM algorithm for incomplete categorical data

Description

Finds ML estimate or posterior mode of cell probabilities under the saturated multinomial model.

Usage

em.cat(s, start, prior=1, showits=T, maxits=1000,
eps=0.0001)

Arguments

s summary list of an incomplete categorical dataset produced by the function prelim.cat.
start optional starting value of the parameter. This is an array with dimensions s$d whose elements sum to one. The default starting value is a uniform array (equal probabilities in all cells). If structural zeros appear in the table, start should contain zeros in those positions and nonzero (e.g. uniform) values elsewhere.
prior optional vector of hyperparameters for a Dirichlet prior distribution. The default is a uniform prior distribution (all hyperparameters = 1) on the cell probabilities, which will result in maximum likelihood estimation. If structural zeros appear in the table, a prior should be supplied with NAs in those cells.
showits if TRUE, reports the iterations of EM so the user can monitor the progress of the algorithm.
maxits maximum number of iterations performed. The algorithm will stop if the parameter still has not converged after this many iterations.
eps convergence criterion. This is the largest proportional change in an expected cell count from one iteration to the next. Any expected cell count that drops below 1E-07 times the average cell probability (1/number of non-structural zero cells) is set to zero during the iterations.

Value

array of dimension s$d containing the ML estimate or posterior mode, assuming that EM has converged by maxits iterations.

Note

If zero cell counts occur in the observed-data table, the maximum likelihood estimate may not be unique, and the algorithm may converge to different stationary values depending on the starting value. Also, if zero cell counts occur in the observed-data table, the ML estimate may lie on the boundary of the parameter space. Supplying a prior with hyperparameters greater than one will give a unique posterior mode in the interior of the parameter space. Estimated probabilities for structural zero cells will always be zero.

References

Schafer (1996) Analysis of Incomplete Multivariate Data. Chapman & Hall, Section 7.3.

See Also

prelim.cat, ecm.cat, logpost.cat

Examples

data(crimes)
crimes
s <- prelim.cat(crimes[,1:2],crimes[,3])     # preliminary manipulations
thetahat <- em.cat(s)                        # mle under saturated model
logpost.cat(s,thetahat)                      # loglikelihood at thetahat

Data Mining Algorithms In R/Clustering/Expectation Maximization (EM)

Data Mining Algorithms In R/Clustering/Expectation Maximization (EM)

< Data Mining Algorithms In R | Clustering
This chapter intends to give an overview of the technique Expectation Maximization (EM), proposed by [1] (although the technique was informally proposed in literature, as suggested by the author) in the context of R-Project environment. The first section gives an introduction of representative clustering and mixture models. The algorithm details and a case study will be presented on the second section.
The R package that will be used is the MCLUST-v3.3.2 developed by Chris Fraley and Adrian Raftery, available in CRAN repository. The MCLUST tool is a software that includes the following features: normal mixture modeling (EM); EM initialization through an hierarquical clustering approach; estimate the number of clusters based on the Bayesian Information Criteria (BIC); and displays, including uncertainty plots and dimension projections.
The information sources of this document were divided in two groups: (i) manual and guides of R and MCLUST, which includes the technical report [2] that is the base reference of this project and gives an overview of MCLUST with several examples and (ii) theoretical papers, surveys and books found in literature.

Contents

 [hide

[edit] Introduction

Clustering consists in identifying groups of entities that have characteristics in common and are cohesive and separated from each other. Interest in clustering has increased due to several applications in distinct knowledge areas. Highlighting the search for grouping of customers and products in massive datasets, document analysis in Web usage data, gene expression from microarrays and image analysis where clustering is used for segmentation.
The clustering methods can be grouped in classes. One widely used involves hierarchical clustering, which consider, initially, that each points represent one group and at each iteration it merged two groups chosen to optimize some criterion. A popular criteria, proposed by [3], include the sum of within group sum of squares and is given by the shortest distance between the groups (single-link method).
Another typical class is based on iterative relocation, which data are moved from one group to another at each iteration. Also called as representative clustering due the use of a model, created to each cluster, that summarize the characteristics of the group elements. The most popular method in this class is the K-Means, proposed by [4], which is based on iterative relocation with the sum of squares criterion.
In statistic and optimization problems is usual maximize or minimize a function, and its variables in a specific space. As these optimization problems may assume several different types, each one with its own characteristics, many techniques have been developed to solve them. This techniques are very important in data mining and knowledge discovery area as it can be used as basis for most complex and powerful methods.
One of these techniques is the Maximum Likelihood and its main goal is to adjust a statistic model with a specific data set, estimating its unknown parameters so the function that can describe all the parameters in the dataset. In other words, the method will adjust some variables of a statistical model from a dataset or a known distribution, so the model can “describe” each data sample and estimate others.
It was realized that clustering can be based on probability models to cover the missing values. This provides insights into when the data should conform to the model and has led to the development of new clustering methods such as Expectation Maximization (EM) that is based on the principle of Maximum Likelihood of unobserved variables in finite mixture models.

[edit] Technique to be discussed

The EM algorithm is an unsupervised clustering method, that is, don't require a training phase, based on mixture models. It follows an iterative approach, sub-optimal, which tries to find the parameters of the probability distribution that has the maximum likelihood of its attributes.
In general lines, the algorithm's input are the data set (x), the total number of clusters (M), the accepted error to converge (e) and the maximum number of iterations. For each iteration, first is executed the E-Step (E-xpectation), that estimates the probability of each point belongs to each cluster, followed by the M-step (M-aximization), that re-estimate the parameter vector of the probability distribution of each class. The algorithm finishes when the distribution parameters converges or reach the maximum number of iterations.

[edit] Algorithm

Initialization
Each classe j, of M classes (or clusters), is constituted by a parameter vector (θ), composed by the mean (μj) and by the covariance matrix (Pj), which represents the features of the Gaussian probability distribution (Normal) used to characterize the observed and unobserved entities of the data set x.

\theta(t) = { \mu_j(t),\ P_j(t) },\ j\ =\ 1..M
On the initial instant (t = 0) the implementation can generate randomly the initial values of mean (μj) and of covariance matrix (Pj). The EM algorithm aims to approximate the parameter vector (θ) of the real distribution. Another alternative offered by MCLUST is to initialize EM with the clusters obtained by a hierarquical clustering technique.
E-Step
This step is responsible to estimate the probability of each element belong to each cluster (P(Cj | xk) ). Each element is composed by an attribute vector (xk). The relevance degree of the points of each cluster is given by the likelihood of each element attribute in comparison with the attributes of the other elements of cluster Cj.

P(C_j|x) = \frac{|\sum_j(t)|^{-\frac{1}{2}}\ exp^{n_j}\ P_j(t)}{\sum_{k=1}^M\ |\sum_j(t)|^{-\frac{1}{2}}\ exp^{n_j}\ P_k(t)}
M-Step
This step is responsible to estimate the parametrs of the probability distribution of each class for the next step. First is computed the mean (μj) of classe j obtained through the mean of all points in function of the relevance degree of each point.

\mu_j(t+1) = \frac{\sum_{k=1}^N\ P(C_j|x_k)\ x_k}{\sum_{k=1}^N\ P(C_j|x_k)}
To compute the covariance matrix for the next iteration is applied the Bayes Theorem, which implies that P(A | B) = P(B | A) * P(A)P(B), based on the conditional probabilities of the class occurrence.
 
\sum_j(t+1) = \frac{\sum_{k=1}^N\ P(C_j|x_k)\ (x_k - \mu_j(t) )\ (x_k - \mu_j(t) )}{\sum_{k=1}^N\ P(C_j|x_k)}
The probability of occurrence of each class is computed through the mean of probabilities (Cj) in function of the relevance degree of each point from the class.
 
P_j(t+1) = \frac{1}{N} \sum_{k=1}^N P(C_j|x_k)
The attributes represents the parameter vector θ that characterize the probability distribution of each class that will be used in the next algorithm iteration.
Convergence Test
After each iteration is performed a convergence test which verifies if the difference of the attributes vector of an iteration to the previous iteration is smaller than an acceptable erro tolerance, given by parameter. Some implementations uses the difference between the averages of class distribution as the convergence criterion.
if(||θ(t + 1) − θ(t)|| < ǫ)
   stop
else
   call E-Step
end

The algorithm has the property of, at each step, estimate a new attribute vector that has the maximum local likelihood, not necessarily the global, what reduces the its complexity. However, depending on the dispersion of the data and on its volume, the algorithm can stop due the maximum number of iterations defined.

[edit] Implementation

[edit] Packages

The expectation-maximization in algorithm in R [5], proposed in [6], will use the package mclust. This package contains crucial methods for the execution of the clustering algorithm, including functions for the E-step and M-step calculation. The package manual explains all of its functions, including simple examples. This manual can be found in [2][6].
The mclust package also provides various models for EM and also hierarchical clustering(HC), which is defined by the covariance structures. These models are presented in Table 1 and are explained in detail in [7].
Table 1: Covariance matrix structures.
identifier Model HC EM Distribution Volume Shape Orientation
E
* * (univariate) equal

V
* * (univariate) variable

EII λI * * Spherical equal equal NA
VII λkI * * Spherical variable equal NA
EEI λA
* Diagonal equal equal coordinate axes
VEI λkA
* Diagonal variable equal coordinate axes
EVI λAk
* Diagonal equal variable coordinate axes
VVI λkAk
* Diagonal variable variable coordinate axes
EEE λDADT * * Ellipsoidal equal equal equal
EEV \lambda D_k A D_k^T
* Ellipsoidal equal equal variable
VEV \lambda_k D_k A D_k^T
* Ellipsoidal variable equal variable
VVV \lambda_k D_k A_k D_k^T * * Ellipsoidal variable variable variable

[edit] Executing the Algorithm

The function “em” can be used for the expectation-maximization method, as it implements the method for parameterized Gaussian Mixture Models (GMM), starting in the E-step. This function uses the following parameters:
  • model-name: the name of the model used;
  • data: all the collected data, which must be all numerical. If the data format is represent by a matrix, the rows will represent the samples (observations) and the columns the variables;
  • parameters: model parameters, which can assume the following values: pro, mean, variance and Vinv, corresponding to the mixture proportion for the components of mixture, mean of each component, parameters variance and the estimate hypervolume of the data region, respectively.
  • other: less relevant parameters which wont be described here. More details can be found in the package manual.
After the execution, the function will return:
  • model-name: the name of the model;
  • z: a matrix whose the element in position [I,k] presents the conditional probability of the ith sample belongs to the kth mixture component;
  • parameters: same as the input;
  • others: other metrics which wont be discussed here. More details can be found in the package manual.

[edit] A simple example

In order to demonstrate how to use the R to execute the expectation-Maximization method, the following algorithm presents a simple example for a test dataset. This example can also be found in the package manual.
> modelName = ``EEE''
> data = iris[,-5]
> z = unmap(iris[,5])
> msEst <- mstep(modelName, data, z)
> names(msEst)
> modelName = msEst$modelName
> parameters = msEst$parameters
> em(modelName, data, parameters)
The first line executes the M-step so the parameters used in the em function can be generated. This function is called mstep and its inputs are model name, as “EEE”, the dataseta the iris dataset and finally, the z matrix, which contains the conditional probability of each class contains each data sample. This z matrix is generated by the unmap function.
After the M-step, the algorithm will show (line 2) the attributes of the object returned by this function. The third line will start the clustering process using some of the result of the M-step method as input.
The clustering method will return the parameters estimated in the process and the conditional probability of each sample falls in each class. These parameters include mean and variance, and this last one corresponds to the use of the mclustVariance method.

[edit] View

This section will present some examples of visualization available in MCLUST package. First will be showed a simple example of the overall process of clustering from the choice of the number of clusters, the initialization and the partitioning. Then it will be explained a didactical example using two random mixtures in comparison with two gaussian mixtures.
Basic Example
This is a simple example to show the features offered by MCLUST package. It is applied to the faithful dataset (included in R project). First the cluster analysis estimates the number of clusters that best represents this data set and also the covariance structure of the spread points. This is performed through the technique called Bayesian Information Criterion (BIC) that varies the number of cluster from 1 to 9. The BIC is the value of the maximized loglikelihood measured with a penalty for the number of parameters in the model. Then it's executed the hierarchical clustering technique (HC), which doesn't require a initialization phase. The output of the HC, that is, the cluster that each element belongs, is used to initialize the Expectation-Maximization technique (EM). After the execution of EM clustering the charts are showed below:
### basic_example.R ###
# usage: R --no-save < basic_example.R

library(mclust)           # load mclust library
x = faithful[,1]          # get the first column of the faithful data set
y = faithful[,2]          # get the first column of the faithful data set
plot(x,y)                 # plot the spread points before the clustering
model <- Mclust(faithful) # estimate the number of cluster (BIC), initialize (HC) and clusterize (EM)
data = faithful           # get the data set 
plot(model, faithful)     # plot the clustering results

File:Basic points.pdf Basic bic.pdf Basic cluster.pdf Basic uncertainty.pdf Basic density.pdf
Didactical Example
A didactical example is developed to show two distinct scenarios: (a) one that the model doesn't represents the data and (b) another that the data is conformed to the model. This example intends to show how EM behaves with noised and clean data sets.
a) Uniform random mixtures
A noised data set is generated through a uniform random function. The points spread are showed in a chart. Then the clustering analysis is executed with default parameters (i.e. varies cluster from 1 to 9). The clustering tool shows a warning with a message that the best model occurs at the min or max, in this case is the min that all points are grouped in a single cluster.
### random_example.R ###
# usage: R --no-save < random_example.R

library(mclust)                  # load mclust library
x1 = runif(20)                   # generate 20 random random numbers for x axis (1st class)
y1 = runif(20)                   # generate 20 random random numbers for y axis (1st class)
x2 = runif(20)                   # generate 20 random random numbers for x axis (2nd class)
y2 = runif(20)                   # generate 20 random random numbers for y axis (2nd class)
rx = range(x1,x2)                # get the axis x range
ry = range(y1,y2)                # get the axis y range
plot(x1, y1, xlim=rx, ylim=ry)   # plot the first class points
points(x2, y2 )                  # plot the second class points
mix = matrix(nrow=40, ncol=2)    # create a dataframe matrix 
mix[,1] = c(x1, x2)              # insert first class points into the matrix
mix[,2] = c(y1, y2)              # insert second class points into the matrix
mixclust = Mclust(mix)           # initialize EM with hierarchical clustering, execute BIC and EM

# Warning messages:
# 1: In summary.mclustBIC(Bic, data, G = G, modelNames = modelNames) :
#    best model occurs at the min or max # of components considered
# 2: In Mclust(mix) : optimal number of clusters occurs at min choice


Points random.pdf
b) Two gaussian mixtures
This scenario is composed by two well separated data sets generated through a gaussian distribution function (Normal). The points are showed in the first chart. The EM clustering is applied and the results are also showed in the graphs below. As we can see, the EM clustering obtain two gaussian models that is in conformed to the data.
### gaussian_example.R ###
# usage: R --no-save < gaussian_example.R

library(mclust)                  # load mclust library
x1 = rnorm(n=20, mean=1, sd=1)   # get 20 normal distributed points for x axis with mean=1 and std=1 (1st class)
y1 = rnorm(n=20, mean=1, sd=1)   # get 20 normal distributed points for x axis with mean=1 and std=1 (2nd class)
x2 = rnorm(n=20, mean=5, sd=1)   # get 20 normal distributed points for x axis with mean=5 and std=1 (1st class)
y2 = rnorm(n=20, mean=5, sd=1)   # get 20 normal distributed points for x axis with mean=5 and std=1 (2nd class)
rx = range(x1,x2)                # get the axis x range
ry = range(y1,y2)                # get the axis y range
plot(x1, y1, xlim=rx, ylim=ry)   # plot the first class points
points(x2, y2)                   # plot the second class points
mix = matrix(nrow=40, ncol=2)    # create a dataframe matrix 
mix[,1] = c(x1, x2)              # insert first class points into the matrix
mix[,2] = c(y1, y2)              # insert second class points into the matrix
mixclust = Mclust(mix)           # initialize EM with hierarchical clustering, execute BIC and EM
plot(mixclust, data = mix)       # plot the two distinct clusters found 

Gaussian points.pdf Gaussian bic.pdf Gaussian cluster.pdf File:Gaussian density.pdf File:Gaussian uncertainty.pdf

[edit] Case Study

[edit] Scenario

The scenario to be analized is composed by a sample data set available in the MCLUST package named "wreath". A clustering analysis is performed with more details, applied to a scenario composed by 14 point groups, that exceeds the maximum number of clusters allowed by the default MCLUST parameters. The clustering technique is executed two times: (i) the first based on the default MCLUST, (ii) with customized parameters.

[edit] Input data

The input of the case study is the data set wreath provided by the MCLUST package. This data set consists in a 14 point group showed on the next figure, which can be modeled with Spherical or Ellipsoide that take into account the orientation of the data due its rotation.
### case_input.R ###
# usage: R --no-save < case_default.R

plot(wreath[,1],wreath[,2])


[edit] Execution

The clustering is executed two times. The first one is based on the default parameters given by the MCLUST tool, that varies the number of cluster from 1 to 9, which is smaller than the necessary to fit the case study data set. The estimation of the number of clusters is showed in a graphic that varies the number of clusters and compute the Bayesian Informatin Criterion (BIC) for each value. We can see that the BIC, using the default parameters, is divergent while is expected find a peak followed by a decrease that indicates the best number of clusters.
### case_default.R ###
# usage: R --no-save < case_default.R

library(mclust)

wreathBIC <- mclustBIC(wreath)
plot(wreathBIC, legendArgs = list(x = "topleft"))

Default bic.pdf
Then the number of cluster is customized, modified to varies from 1 to 20 as showed below. The BIC technique will give the best number of clusters, in this case 14 clusters and the coefficient structure that have the properties of this data set, that is the EEV, which means that the clusters has similar shape, similar volumes but variable orientation. After executes the BIC method again we can see that 14 clusters, indicated by the peak on the graphic, is the number of cluster that present the maximum likelihood for this data.
### case_customized.R ###
# usage: R --no-save < case_customized.R

library(mclust)
wreathDefault <- mclustBIC(wreath)
wreathCustomize <- mclustBIC(wreath, G = 1:20, x = wreathDefault)
plot(wreathBIC, G = 10:20, legendArgs = list(x = "bottomleft"))
summary(wreathBIC, wreath)

Customize bic.pdf

[edit] Output

The output of this clustering is analysed is obtained using the method mclust2Dplot depicted in the next figure. It was used the density visualization. The clusters found are characterized by 14 models wich have the distribution of an ellipsoides, with different orientation, beein in conformed with the data. The method summary can be executed later analyse other aspects of the clustering result.
### case_output.R ###
# usage: R --no-save < case_customized.R

library(mclust)
data(wreath)
wreathBIC <- mclustBIC(wreath)
wreathBIC <- mclustBIC(wreath, G = 1:20, x = wreathBIC)
wreathModel <- summary(wreathBIC, data = wreath)
mclust2Dplot(data = wreath, what = "density", identify = TRUE, parameters = wreathModel$parameters, z = wreathModel$z)

Cluster customize.pdf

[edit] Analysis

We can see that the mixture models created to represent the point are in conformation to the data set. On this case, the groups doesn't have an intersection between them, so all points were classified to the right group. The cluster orientation allows the method to find a better Ellipsoide to represent those points.
We conclude that the EM clustering technique, despite the dependence of the number of clusters and the initialization phase, is an efficient method that produces good results for several scenarios of data dispersion. The use of BIC to estimate the number of clusters and of the hierarchical clustering (HC) (which doesn't depend of the number of clusters) to initialize the clusters improves the quality of the results.

[edit] References

[8] [9] [1] [5] [7] [2] [6] [3] [4]
  1. a b Georey J. Mclachlan and Thriyambakam Krishnan. The EM Algorithm and Extensions. Wiley-Interscience, 1 edition, November 1996.
  2. a b c C. Fraley and A. E. Raftery. MCLUST version 3 for R: Normal mixture modeling and model-based clustering. Technical Report 504, University of Washington, Department of Statistics, September 2006.
  3. a b Ward, J. H., "Hierarquical Grouping to Optimize an Objective Function." Journal of the American Statistical Association. 58. 234-244. 1963.
  4. a b MacQueen. J. Some Methods for Classification and Analysis of Multivariate Observations. in Proceedings of the 5th Berkeley Symposium on Mathematical Statistics and Probability. Vol 1. eds. L. M. L. Cam and J. Neyman, Berkeley, CA: University of California Press, pp. 281-297. 1967.
  5. a b John M. Chambers. R language definition. http://cran.r-project.org/doc/manuals/R-lang.html.
  6. a b c Chris Fraley and Adrian E. Raftery. Bayesian regularization for normal mixture estimation and model-based clustering. J. Classif., 24(2):155-181, 2007.
  7. a b C. Fraley and A. E. Raftery. Model-based clustering, discriminant analysis and density estimation. Journal of the American Statistical Association, 97:611-631, 2002.
  8. Leslie Burkholder. Monty hall and bayes's theorem, 2000.
  9. A. P. Dempster, N. M. Laird, and D. B. Rubin. Maximum likelihood from incomplete data via the em algorithm. Journal of the Royal Statistical Society. Series B (Methodological), 39(1):1{38, 1977.