
Thursday, May 30, 2013

Nonlinear Data-Fitting with a specific function in MATLAB

Nonlinear Data-Fitting

This example shows how to fit a nonlinear function to data using several Optimization Toolbox™ algorithms.
Problem Setup
Consider the following data:
Data = ...
  [0.0000    5.8955
   0.1000    3.5639
   0.2000    2.5173
   0.3000    1.9790
   0.4000    1.8990
   0.5000    1.3938
   0.6000    1.1359
   0.7000    1.0096
   0.8000    1.0343
   0.9000    0.8435
   1.0000    0.6856
   1.1000    0.6100
   1.2000    0.5392
   1.3000    0.3946
   1.4000    0.3903
   1.5000    0.5474
   1.6000    0.3459
   1.7000    0.1370
   1.8000    0.2211
   1.9000    0.1704
   2.0000    0.2636];
Let's plot these data points.
t = Data(:,1);
y = Data(:,2);
plot(t,y,'ro')
title('Data points')

We would like to fit the function
   y =  c(1)*exp(-lam(1)*t) + c(2)*exp(-lam(2)*t)
to the data.
Solution Approach Using lsqcurvefit
The lsqcurvefit function solves this type of problem easily.
To begin, define the parameters in terms of one variable x:
x(1) = c(1)
x(2) = lam(1)
x(3) = c(2)
x(4) = lam(2)
Then define the curve as a function of the parameters x and the data t:
F = @(x,xdata)x(1)*exp(-x(2)*xdata) + x(3)*exp(-x(4)*xdata);
We arbitrarily set our initial point x0 as follows: c(1) = 1, lam(1) = 1, c(2) = 1, lam(2) = 0:
x0 = [1 1 1 0];
We run the solver and plot the resulting fit.
[x,resnorm,~,exitflag,output] = lsqcurvefit(F,x0,t,y)

hold on
plot(t,F(x,t))
hold off
Local minimum possible.

lsqcurvefit stopped because the final change in the sum of squares relative to 
its initial value is less than the default value of the function tolerance.




x =

  Columns 1 through 3

   3.006845730161635  10.586854247456271   2.889093127582852

  Column 4

   1.400341518260952


resnorm =

   0.147722572048259


exitflag =

     3


output = 

    firstorderopt: 7.893433102770797e-06
       iterations: 6
        funcCount: 35
     cgiterations: 0
        algorithm: 'trust-region-reflective'
          message: [1x459 char]
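For readers working outside MATLAB, the same fit can be sketched with SciPy, where curve_fit plays the role of lsqcurvefit (an illustrative analogue, not part of the original example):

```python
import numpy as np
from scipy.optimize import curve_fit

# Same data as above: t from 0 to 2 in steps of 0.1
t = np.linspace(0.0, 2.0, 21)
y = np.array([5.8955, 3.5639, 2.5173, 1.9790, 1.8990, 1.3938, 1.1359,
              1.0096, 1.0343, 0.8435, 0.6856, 0.6100, 0.5392, 0.3946,
              0.3903, 0.5474, 0.3459, 0.1370, 0.2211, 0.1704, 0.2636])

def F(t, c1, lam1, c2, lam2):
    # y = c(1)*exp(-lam(1)*t) + c(2)*exp(-lam(2)*t)
    return c1 * np.exp(-lam1 * t) + c2 * np.exp(-lam2 * t)

# Same initial point as the MATLAB example: c = [1 1], lam = [1 0]
popt, _ = curve_fit(F, t, y, p0=[1, 1, 1, 0])
resnorm = np.sum((F(t, *popt) - y) ** 2)
print(popt, resnorm)  # resnorm should be close to 0.1477
```

The residual norm agrees with the lsqcurvefit result above; the two exponential terms may come back in either order, since the labeling of the parameter pairs is arbitrary.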


Solution Approach Using fminunc
To solve the problem using fminunc, we set the objective function as the sum of squares of the residuals.
Fsumsquares = @(x)sum((F(x,t) - y).^2);
opts = optimoptions('fminunc','Algorithm','quasi-newton');
[xunc,ressquared,eflag,outputu] = ...
    fminunc(Fsumsquares,x0,opts)
Local minimum found.

Optimization completed because the size of the gradient is less than
the default value of the function tolerance.




xunc =

  Columns 1 through 3

   2.889015274844785   1.400313980001329   3.006919302957509

  Column 4

  10.586172160407736


ressquared =

   0.147722571779314


eflag =

     1


outputu = 

       iterations: 30
        funcCount: 185
         stepsize: 1
    firstorderopt: 2.947592962696121e-05
        algorithm: 'medium-scale: Quasi-Newton line search'
          message: [1x436 char]

Notice that fminunc found the same solution as lsqcurvefit, but took many more function evaluations to do so. The parameters fminunc returned are in the opposite order from those returned by lsqcurvefit; the larger lam is lam(2), not lam(1). This is not surprising: the ordering of the variables is arbitrary.
fprintf(['There were %d iterations using fminunc,' ...
    ' and %d using lsqcurvefit.\n'], ...
    outputu.iterations,output.iterations)
fprintf(['There were %d function evaluations using fminunc,' ...
    ' and %d using lsqcurvefit.'], ...
    outputu.funcCount,output.funcCount)
There were 30 iterations using fminunc, and 6 using lsqcurvefit.
There were 185 function evaluations using fminunc, and 35 using lsqcurvefit.
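The fminunc approach has a similar SciPy counterpart: minimizing the sum of squared residuals directly with a quasi-Newton method (a hedged sketch, not part of the original post; scipy.optimize.minimize with BFGS stands in for fminunc's quasi-newton algorithm):

```python
import numpy as np
from scipy.optimize import minimize

t = np.linspace(0.0, 2.0, 21)
y = np.array([5.8955, 3.5639, 2.5173, 1.9790, 1.8990, 1.3938, 1.1359,
              1.0096, 1.0343, 0.8435, 0.6856, 0.6100, 0.5392, 0.3946,
              0.3903, 0.5474, 0.3459, 0.1370, 0.2211, 0.1704, 0.2636])

def fsumsquares(x):
    # Objective: sum of squared residuals, as with Fsumsquares above
    r = x[0] * np.exp(-x[1] * t) + x[2] * np.exp(-x[3] * t) - y
    return np.sum(r ** 2)

res = minimize(fsumsquares, x0=[1, 1, 1, 0], method='BFGS')
print(res.fun, res.nfev)
```

As in the MATLAB comparison, the general-purpose quasi-Newton solver reaches the same minimum residual norm but typically needs many more function evaluations than the specialized least-squares solver.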
Splitting the Linear and Nonlinear Problems
Notice that the fitting problem is linear in the parameters c(1) and c(2). This means for any values of lam(1) and lam(2), we can use the backslash operator to find the values of c(1) and c(2) that solve the least-squares problem.
We now rework the problem as a two-dimensional problem, searching for the best values of lam(1) and lam(2). The values of c(1) and c(2) are calculated at each step using the backslash operator as described above.
type fitvector
function yEst = fitvector(lam,xdata,ydata)
%FITVECTOR  Used by DATDEMO to return value of fitting function.
%   yEst = FITVECTOR(lam,xdata) returns the value of the fitting function, y
%   (defined below), at the data points xdata with parameters set to lam.
%   yEst is returned as an N-by-1 column vector, where N is the number of
%   data points.
%
%   FITVECTOR assumes the fitting function, y, takes the form
%
%     y =  c(1)*exp(-lam(1)*t) + ... + c(n)*exp(-lam(n)*t)
%
%   with n linear parameters c, and n nonlinear parameters lam.
%
%   To solve for the linear parameters c, we build a matrix A
%   where the j-th column of A is exp(-lam(j)*xdata) (xdata is a vector).
%   Then we solve A*c = ydata for the linear least-squares solution c,
%   where ydata is the observed values of y.

A = zeros(length(xdata),length(lam));  % build A matrix
for j = 1:length(lam)
   A(:,j) = exp(-lam(j)*xdata);
end
c = A\ydata; % solve A*c = y for linear parameters c
yEst = A*c; % return the estimated response based on c
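The same separable ("variable projection") idea can be sketched in Python, with numpy.linalg.lstsq playing the role of the backslash operator (an illustrative analogue under the same model assumptions, not part of the original example):

```python
import numpy as np
from scipy.optimize import least_squares

t = np.linspace(0.0, 2.0, 21)
y = np.array([5.8955, 3.5639, 2.5173, 1.9790, 1.8990, 1.3938, 1.1359,
              1.0096, 1.0343, 0.8435, 0.6856, 0.6100, 0.5392, 0.3946,
              0.3903, 0.5474, 0.3459, 0.1370, 0.2211, 0.1704, 0.2636])

def residual(lam):
    # Build A with columns exp(-lam(j)*t), solve the linear least-squares
    # problem A*c = y for c, and return the residual of the combined fit.
    A = np.exp(-np.outer(t, lam))
    c, *_ = np.linalg.lstsq(A, y, rcond=None)
    return A @ c - y

# Two-dimensional search over lam only, from the same start point [1 0]
sol = least_squares(residual, x0=[1, 0])
resnorm2 = np.sum(sol.fun ** 2)
print(sol.x, resnorm2)
```

The outer solver now searches only over the two nonlinear parameters; the two linear coefficients are recomputed exactly at every step.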
Solve the problem using lsqcurvefit, starting from a two-dimensional initial point lam(1), lam(2):
x02 = [1 0];
F2 = @(x,t) fitvector(x,t,y);

[x2,resnorm2,~,exitflag2,output2] = lsqcurvefit(F2,x02,t,y)
Local minimum possible.

lsqcurvefit stopped because the final change in the sum of squares relative to 
its initial value is less than the default value of the function tolerance.




x2 =

  10.586092011163505   1.400299514349251


resnorm2 =

   0.147722571771357


exitflag2 =

     3


output2 = 

    firstorderopt: 4.399495381331858e-06
       iterations: 10
        funcCount: 33
     cgiterations: 0
        algorithm: 'trust-region-reflective'
          message: [1x459 char]

The efficiency of the two-dimensional solution is similar to that of the four-dimensional solution:
fprintf(['There were %d function evaluations using the 2-d ' ...
    'formulation, and %d using the 4-d formulation.'], ...
    output2.funcCount,output.funcCount)
There were 33 function evaluations using the 2-d formulation, and 35 using the 4-d formulation.
Split Problem is More Robust to Initial Guess
Choosing a bad starting point for the original four-parameter problem leads to a local solution that is not global. Choosing a starting point with the same bad lam(1) and lam(2) values for the split two-parameter problem leads to the global solution. To show this, we re-run the original problem with a start point that leads to a relatively bad local solution, and compare the resulting fit with the global solution.
x0bad = [5 1 1 0];
[xbad,resnormbad,~,exitflagbad,outputbad] = ...
    lsqcurvefit(F,x0bad,t,y)

hold on
plot(t,F(xbad,t),'g')
legend('Data','Global fit','Bad local fit','Location','NE')
hold off

fprintf(['The residual norm at the good ending point is %f,' ...
   ' and the residual norm at the bad ending point is %f.'], ...
   resnorm,resnormbad)
Local minimum possible.

lsqcurvefit stopped because the final change in the sum of squares relative to 
its initial value is less than the default value of the function tolerance.




xbad =

  Columns 1 through 3

 -22.903587482833103   2.479236586680044  28.027347454069787

  Column 4

   2.479103591273296


resnormbad =

   2.217300254426719


exitflagbad =

     3


outputbad = 

    firstorderopt: 0.005748278503129
       iterations: 32
        funcCount: 165
     cgiterations: 0
        algorithm: 'trust-region-reflective'
          message: [1x459 char]

The residual norm at the good ending point is 0.147723, and the residual norm at the bad ending point is 2.217300.

Curve Fitting and Distribution Fitting in MATLAB

Curve Fitting and Distribution Fitting

This example shows the difference between fitting a curve to a set of points, and fitting a probability distribution to a sample of data.
A common question is, "I have some data and I want to fit a Weibull distribution. What MATLAB functions can I use to do Weibull curve fitting?" Before answering that question, we need to figure out what kind of data analysis is really appropriate.
Curve Fitting
Consider an experiment where we measure the concentration of a compound in blood samples taken from several subjects at various times after taking an experimental medication.
time = [ 0.1   0.1   0.3   0.3   1.3   1.7   2.1   2.6   3.9   3.9 ...
         5.1   5.6   6.2   6.4   7.7   8.1   8.2   8.9   9.0   9.5 ...
         9.6  10.2  10.3  10.8  11.2  11.2  11.2  11.7  12.1  12.3 ...
        12.3  13.1  13.2  13.4  13.7  14.0  14.3  15.4  16.1  16.1 ...
        16.4  16.4  16.7  16.7  17.5  17.6  18.1  18.5  19.3  19.7];
conc = [0.01  0.08  0.13  0.16  0.55  0.90  1.11  1.62  1.79  1.59 ...
        1.83  1.68  2.09  2.17  2.66  2.08  2.26  1.65  1.70  2.39 ...
        2.08  2.02  1.65  1.96  1.91  1.30  1.62  1.57  1.32  1.56 ...
        1.36  1.05  1.29  1.32  1.20  1.10  0.88  0.63  0.69  0.69 ...
        0.49  0.53  0.42  0.48  0.41  0.27  0.36  0.33  0.17  0.20];
plot(time,conc,'o');
xlabel('Time'); ylabel('Blood Concentration');

Notice that we have one response variable, blood concentration, and one predictor variable, time after ingestion. The predictor data are assumed to be measured with little or no error, while the response data are assumed to be affected by experimental error. The main objective in analyzing data like these is often to define a model that predicts the response variable. That is, we are trying to describe the trend line, or the mean response of y (blood concentration), as a function of x (time). This is curve fitting with bivariate data.
Based on theoretical models of absorption into and breakdown in the bloodstream, we might, for example, decide that the concentrations ought to follow a Weibull curve as a function of time. The Weibull curve, which takes the form
$$y = c * (x/a)^{(b-1)} * exp(-(x/a)^b),$$
is defined with three parameters: the first scales the curve along the horizontal axis, the second defines the shape of the curve, and the third scales the curve along the vertical axis. Notice that while this curve has almost the same form as the Weibull probability density function, it is not a density because it includes the parameter c, which is necessary to allow the curve's height to adjust to data.
We can fit the Weibull model using nonlinear least squares.
modelFun =  @(p,x) p(3) .* (x ./ p(1)).^(p(2)-1) .* exp(-(x ./ p(1)).^p(2));
startingVals = [10 2 5];
coefEsts = nlinfit(time, conc, modelFun, startingVals);
xgrid = linspace(0,20,100);
line(xgrid, modelFun(coefEsts, xgrid), 'Color','r');

This scatterplot suggests that the measurement errors do not have equal variance, but rather, that their variance is proportional to the height of the mean curve. One way to address this problem would be to use weighted least squares. However, there is another potential problem with this fit.
The Weibull curve we're using, as with other similar models such as Gaussian, gamma, or exponential curves, is often used as a model when the response variable is nonnegative. Ordinary least squares curve fitting is appropriate when the experimental errors are additive and can be considered as independent draws from a symmetric distribution, with constant variance. However, if that were true, then in this example it would be possible to measure negative blood concentrations, which is clearly not reasonable.
It might be more realistic to assume multiplicative errors, symmetric on the log scale. We can fit a Weibull curve to the data under that assumption by logging both the concentrations and the original Weibull curve itself. That is, we can use nonlinear least squares to fit the curve
$$log(y) = log(c) + (b-1)*log(x/a) - (x/a)^b.$$
coefEsts2 = nlinfit(time, log(conc), @(p,x)log(modelFun(p,x)), startingVals);
line(xgrid, modelFun(coefEsts2, xgrid), 'Color',[0 .5 0]);
legend({'Raw Data' 'Additive Errors Model' 'Multiplicative Errors Model'});
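Both fits above have a SciPy counterpart (a hedged sketch, not part of the original example: curve_fit stands in for nlinfit, and the positivity bounds are an added assumption that keeps the optimizer away from parameter values where the model is undefined):

```python
import numpy as np
from scipy.optimize import curve_fit

time = np.array([0.1, 0.1, 0.3, 0.3, 1.3, 1.7, 2.1, 2.6, 3.9, 3.9,
                 5.1, 5.6, 6.2, 6.4, 7.7, 8.1, 8.2, 8.9, 9.0, 9.5,
                 9.6, 10.2, 10.3, 10.8, 11.2, 11.2, 11.2, 11.7, 12.1, 12.3,
                 12.3, 13.1, 13.2, 13.4, 13.7, 14.0, 14.3, 15.4, 16.1, 16.1,
                 16.4, 16.4, 16.7, 16.7, 17.5, 17.6, 18.1, 18.5, 19.3, 19.7])
conc = np.array([0.01, 0.08, 0.13, 0.16, 0.55, 0.90, 1.11, 1.62, 1.79, 1.59,
                 1.83, 1.68, 2.09, 2.17, 2.66, 2.08, 2.26, 1.65, 1.70, 2.39,
                 2.08, 2.02, 1.65, 1.96, 1.91, 1.30, 1.62, 1.57, 1.32, 1.56,
                 1.36, 1.05, 1.29, 1.32, 1.20, 1.10, 0.88, 0.63, 0.69, 0.69,
                 0.49, 0.53, 0.42, 0.48, 0.41, 0.27, 0.36, 0.33, 0.17, 0.20])

def weib_curve(x, a, b, c):
    # y = c * (x/a)^(b-1) * exp(-(x/a)^b)
    return c * (x / a) ** (b - 1) * np.exp(-(x / a) ** b)

# Additive-errors fit: least squares on the raw concentrations
p_add, _ = curve_fit(weib_curve, time, conc, p0=[10, 2, 5],
                     bounds=(1e-8, np.inf))

# Multiplicative-errors fit: least squares on the log scale
def log_weib_curve(x, a, b, c):
    return np.log(weib_curve(x, a, b, c))

p_mult, _ = curve_fit(log_weib_curve, time, np.log(conc), p0=[10, 2, 5],
                      bounds=(1e-8, np.inf))

ss_add = np.sum((weib_curve(time, *p_add) - conc) ** 2)
print(p_add, p_mult, ss_add)
```

Logging both sides works here because every measured concentration is strictly positive; with zeros in the data, the multiplicative-errors formulation would need a different treatment.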

This model fit should be accompanied by estimates of precision and followed by checks on the model's goodness of fit. For example, we should make residual plots on the log scale to check the assumption of constant variance for the multiplicative errors. For simplicity we'll leave that out here.
In this example, using the multiplicative errors model made little difference in the model's predictions, but that's not always the case. An example where it does matter is described in the Pitfalls in Fitting Nonlinear Models by Transforming to Linearity example.
Functions for Curve Fitting
MATLAB and several toolboxes contain functions that can be used to perform curve fitting. The Statistics Toolbox™ includes the functions nlinfit, for nonlinear least squares curve fitting, and glmfit, for fitting Generalized Linear Models. The Curve Fitting Toolbox™ provides command line and graphical tools that simplify many of the tasks in curve fitting, including automatic choice of starting coefficient values for many models, as well as providing robust and nonparametric fitting methods. Many complicated types of curve fitting analyses, including models with constraints on the coefficients, can be done using functions in the Optimization Toolbox™. The MATLAB function polyfit fits polynomial models, and fminsearch can be used in many other kinds of curve fitting.
Distribution Fitting
Now consider an experiment where we've measured the time to failure for 50 identical electrical components.
life = [ 6.2 16.1 16.3 19.0 12.2  8.1  8.8  5.9  7.3  8.2 ...
        16.1 12.8  9.8 11.3  5.1 10.8  6.7  1.2  8.3  2.3 ...
         4.3  2.9 14.8  4.6  3.1 13.6 14.5  5.2  5.7  6.5 ...
         5.3  6.4  3.5 11.4  9.3 12.4 18.3 15.9  4.0 10.4 ...
         8.7  3.0 12.1  3.9  6.5  3.4  8.5  0.9  9.9  7.9];
Notice that only one variable has been measured -- the components' lifetimes. There is no notion of response and predictor variables; rather, each observation consists of just a single measurement. The objective of an analysis for data like these is not to predict the lifetime of a new component given a value of some other variable, but rather to describe the full distribution of possible lifetimes. This is distribution fitting with univariate data.
One simple way to visualize these data is to make a histogram.
binWidth = 2;
binCtrs = 1:binWidth:19;
hist(life,binCtrs);
xlabel('Time to Failure'); ylabel('Frequency'); ylim([0 10]);

It might be tempting to think of this histogram as a set of (x,y) values, where each x is a bin center and each y is a bin height.
h = get(gca,'child');
set(h,'FaceColor',[.98 .98 .98],'EdgeColor',[.94 .94 .94]);
counts = hist(life,binCtrs);
hold on
plot(binCtrs,counts,'o');
hold off

We might then try to fit a curve through those points to model the data. Lifetime data often follow a Weibull distribution, so we might even think to use the Weibull curve from the curve fitting example above.
coefEsts = nlinfit(binCtrs,counts,modelFun,startingVals);
However, there are some potential pitfalls in fitting a curve to a histogram, and this simple fit is not appropriate. First, the bin counts are nonnegative, implying that measurement errors cannot be symmetric. Furthermore, the bin counts have different variability in the tails than in the center of the distribution. They also have a fixed sum, implying that they are not independent measurements. These all violate basic assumptions of least squares fitting.
It's also important to recognize that the histogram really represents a scaled version of an empirical probability density function (PDF). If we fit a Weibull curve to the bar heights, we would have to constrain the curve to be properly normalized.
These problems might be addressed by using a more appropriate least squares fit. But another concern is that for continuous data, fitting a model based on the histogram bin counts rather than the raw data throws away information. In addition, the bar heights in the histogram are very dependent on the choice of bin edges and bin widths. While it is possible to fit distributions in this way, it's usually not the best way.
For many parametric distributions, maximum likelihood is a much better way to estimate parameters. Maximum likelihood does not suffer from any of the problems mentioned above, and it is often the most efficient method in the sense that results are as precise as is possible for a given amount of data.
For example, the function wblfit uses maximum likelihood to fit a Weibull distribution to data. The Weibull PDF takes the form
$$y = (a/b) * (x/a)^{(b-1)} * exp(-(x/a)^b).$$
Notice that this is almost the same functional form as the Weibull curve used in the curve fitting example above. However, there is no separate parameter to independently scale the vertical height. Being a PDF, the function always integrates to 1.
We'll fit the data with a Weibull distribution, then plot a histogram of the data, scaled to integrate to 1, and superimpose the fitted PDF.
paramEsts = wblfit(life);
n = length(life);
prob = counts / (n * binWidth);
bar(binCtrs,prob,'hist');
h = get(gca,'child');
set(h,'FaceColor',[.9 .9 .9]);
xlabel('Time to Failure'); ylabel('Probability Density'); ylim([0 0.1]);
xgrid = linspace(0,20,100);
pdfEst = wblpdf(xgrid,paramEsts(1),paramEsts(2));
line(xgrid,pdfEst)
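An equivalent maximum-likelihood fit can be sketched with scipy.stats (an illustrative analogue: fixing the location parameter of weibull_min at zero corresponds to MATLAB's two-parameter wblfit, with the returned scale matching a and the shape matching b):

```python
import numpy as np
from scipy import stats

life = np.array([6.2, 16.1, 16.3, 19.0, 12.2, 8.1, 8.8, 5.9, 7.3, 8.2,
                 16.1, 12.8, 9.8, 11.3, 5.1, 10.8, 6.7, 1.2, 8.3, 2.3,
                 4.3, 2.9, 14.8, 4.6, 3.1, 13.6, 14.5, 5.2, 5.7, 6.5,
                 5.3, 6.4, 3.5, 11.4, 9.3, 12.4, 18.3, 15.9, 4.0, 10.4,
                 8.7, 3.0, 12.1, 3.9, 6.5, 3.4, 8.5, 0.9, 9.9, 7.9])

# Maximum-likelihood Weibull fit with the location fixed at 0,
# analogous to paramEsts = wblfit(life) above
shape, loc, scale = stats.weibull_min.fit(life, floc=0)
print(shape, scale)
```

No histogram or binning is involved: the likelihood is evaluated on the raw lifetimes, which is precisely why the estimate does not depend on any choice of bin edges.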

Maximum likelihood can, in a sense, be thought of as finding a Weibull PDF to best match the histogram. But it is not done by minimizing the sum of squared differences between the PDF and the bar heights.
As with the curve fitting example above, this model fit should be accompanied by estimates of precision and followed by checks on the model's goodness of fit; for simplicity we'll leave that out here.
Functions for Distribution Fitting
The Statistics Toolbox includes functions, such as wblfit, for fitting many different parametric distributions using maximum likelihood, and the function mle can be used to fit custom distributions for which dedicated fitting functions are not explicitly provided. The function ksdensity fits a nonparametric distribution model to data. The Statistics Toolbox also provides the GUI dfittool, which simplifies many of the tasks in distribution fitting, including generation of various visualizations and diagnostic plots. Many types of complicated distributions, including distributions with constraints on the parameters, can be fit using functions in the Optimization Toolbox. Finally, the MATLAB function fminsearch can be used in many kinds of maximum likelihood distribution fitting.
Although fitting a curve to a histogram is usually not optimal, there are sensible ways to apply special cases of curve fitting in certain distribution fitting contexts. One method, applied on the cumulative probability (CDF) scale instead of the PDF scale, is described in the Fitting a Univariate Distribution Using Cumulative Probabilities example.
Summary
It's not uncommon to do curve fitting with a model that is a scaled version of a common probability density function, such as the Weibull, Gaussian, gamma, or exponential. Curve fitting and distribution fitting can be easy to confuse in these cases, but the two are very different kinds of data analysis.
  • Curve fitting involves modelling the trend or mean of a response variable as a function of a predictor variable. The model usually must include a parameter to scale the height of the curve, and may also include an intercept term. The appropriate plot for the data is an x-y scatterplot.
  • Distribution fitting involves modelling the probability distribution of a single variable. The model is a normalized probability density function. The appropriate plot for the data is a histogram.