
Friday, May 31, 2013

MATLAB, R, and Julia: Languages for data analysis

Inside core features of specialized data analysis languages.



Big data frameworks like Hadoop have received a lot of attention recently, and with good reason: when you have terabytes of data to work with — and these days, who doesn’t? — it’s amazing to have affordable, reliable and ubiquitous tools that allow you to spread a computation over tens or hundreds of CPUs on commodity hardware. The dirty truth is, though, that many analysts and scientists spend as much time or more working with mere megabytes or gigabytes of data: a small sample pulled from a larger set, or the aggregated results of a Hadoop job, or just a dataset that isn’t all that big (like, say, all of Wikipedia, which can be squeezed into a few gigs without too much trouble).
At this scale, you don’t need a fancy distributed framework. You can just load the data into memory and explore it interactively in your favorite scripting language. Or, maybe, a different scripting language: data analysis is one of the few domains where special-purpose languages are very commonly used. Although in many respects these are similar to other dynamic languages like Ruby or JavaScript, these languages have syntax and built-in data structures that make common data analysis tasks both faster and more concise. This article will briefly cover some of these core features for two languages that have been popular for decades — MATLAB and R — and another, Julia, that was first announced in 2012.

MATLAB

MATLAB is one of the oldest programming languages designed specifically for data analysis, and it is still extremely popular today. MATLAB was conceived in the late ’70s as a simple scripting language wrapped around the FORTRAN libraries LINPACK and EISPACK, which at the time were the best way to efficiently work with large matrices of data — as they arguably still are, through their successor LAPACK. These libraries, and thus MATLAB, were solely concerned with one data type: the matrix, a two-dimensional array of numbers.
This may seem very limiting, but in fact, a very wide range of scientific and data-analysis problems can be represented as matrix problems, and often very efficiently. Image processing, for example, is an obvious fit for the 2D data structure; less obvious, perhaps, is that a directed graph (like Twitter’s follow graph, or the graph of all links on the web) can be expressed as an adjacency matrix, and that graph algorithms like Google’s PageRank can be easily implemented as a series of additions and multiplications of these matrices. Similarly, the winning entry to the Netflix Prize recommendation challenge relied, in part, on a matrix representation of everyone’s movie ratings (you can imagine every row representing a Netflix user, every column a movie, and every entry in the matrix a rating), and in particular on an operation called Singular Value Decomposition, one of those original LINPACK matrix routines that MATLAB was designed to make easy to use.

Its focus on matrices led to some important differences in MATLAB’s design compared to general-purpose programming languages. First, it has special syntax for matrix literals. For simple cases, this will look pretty familiar; here’s a 1×2 matrix (in other words, a row vector):
V = [1 2]
Here’s a 2×3 matrix (the semicolon is what breaks up the rows; the line break is ignored):
A = [4 5 6;
    7 8 9]
It gets more interesting when you take advantage of interpolation. As with strings in some other languages, in MATLAB you can mix matrix variables in with literal data, almost like a template. For example, given the above definitions, then this:
B = [V 3; A]
will produce this 3×3 matrix (I’ll use bold italic to show results throughout):
B =

   1   2   3
   4   5   6
   7   8   9
In two dimensions, it’s a little trickier to fit things together than with strings. If we hadn’t included the "3" to pad out the first row, the interpreter would have complained that the dimensions don’t match, and the literal would have been invalid.
More important than the literal syntax is that all of the core library functions and operators in MATLAB were designed to accept and return matrices rather than individual values. This can take some getting used to. It may not seem that strange that we can double every element of our matrix above by adding it to itself:
BB = B + B

BB =

    2    4    6
    8   10   12
   14   16   18
Or even that a function like sqrt will take the square root of every element:
BR = sqrt(B)

BR =

   1.0000   1.4142   1.7321
   2.0000   2.2361   2.4495
   2.6458   2.8284   3.0000
It’s a little bit more strange to pass it to a function like isprime and get a matrix back as a result, with a "1" in every matrix location that held a prime number:
BP = isprime(B)

BP =

   0   1   1
   0   1   0
   1   0   0
It’s also strangely powerful to be able to naturally extend algorithms into multiple dimensions. For example, in most programming languages, if we had a column of numbers and wanted to find the mean, it would be straightforward to sum up the column and divide by the number of rows:
C = [1; 2; 3; 4; 5; 6; 7]
mean = sum(C) / rows(C)

mean =  4
But let’s say that instead we had a matrix where each row represented an (x,y) point:
D = [3 4;
     2 0;
     6 1;
     1 3]
Plotted, these four points form a loose scatter around a common center.
Because sum() works on columns, no matter how many there are, we can use the exact same code to find the center — technically, the centroid — of these points, which has the mean of all the x values for its x, and the mean of all the y values for its y:
center = sum(D) / rows(D)

center =

    3   2

What if we wanted to find the distance from each point to that center? Again, we can operate on the matrix as a whole. Distance is the square root of the sum of the squares of the differences for each dimension, or in MATLAB:
distances = sqrt(sum(power(D - center, 2), 2))

distances =

  2.0000  
  2.2361
  3.1623
  2.2361
(If you’re wondering why we’re passing a 2 to sum(), that’s because we’re asking to sum the rows — the second dimension — rather than columns, which is the default).
All of this would also work unchanged for points with three (or more!) dimensions, simply by adding more columns to our original matrix.
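For comparison, here is the same centroid-and-distances computation sketched in plain Python (no NumPy), representing the matrix as a list of rows; the names `D`, `center`, and `distances` simply mirror the MATLAB example:

```python
import math

# Each row is an (x, y) point, mirroring the MATLAB matrix D.
D = [[3, 4],
     [2, 0],
     [6, 1],
     [1, 3]]

# Column-wise sums divided by the row count give the centroid,
# just like sum(D) / rows(D) above.
center = [sum(col) / len(D) for col in zip(*D)]

# Euclidean distance from each point to the centroid.
distances = [math.sqrt(sum((p - c) ** 2 for p, c in zip(point, center)))
             for point in D]
```

As in the MATLAB version, nothing here hard-codes two dimensions: adding a third column to each row of `D` would make both the centroid and the distances three-dimensional automatically.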
The other important feature to call out in MATLAB’s matrix syntax is the very flexible support for indexing into a matrix. You can, of course, pick out an individual element — for example, D(1,1) picks out row 1, column 1 of D — but you can also use ranges, or wildcards. For example, because a colon by itself acts as a wildcard, Dy = D(:,2) will pick out the second (y) column for each row in D:
Dy =

   4
   0
   1
   3
You can also use “logical indexing” to index a matrix by another matrix: M(I) will return only those elements in M that correspond to non-zero elements of I. To return to our earlier example, primes = B(BP) would return only those elements of B that are prime, and thus have corresponding 1s in the BP matrix:
primes =

   7
   2
   5
   3
Because logical and comparison operators also work on matrices, you can use this almost like a little query language. For example, this will return all elements of B that are both prime and greater than 5:
result = B(isprime(B) & (B > 5))

result = 7
Just to reiterate what’s happening here: isprime(B) and B > 5 each return matrices full of 0s and 1s; the & operator combines them; and the resulting matrix indexes into B, returning the (single, in this case) result of 7. And yet, it reads almost like SQL.
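In a general-purpose language, the same query becomes an explicit filter. A rough Python sketch, with a hand-rolled `is_prime` helper (Python has no built-in equivalent of MATLAB’s isprime):

```python
# B from the earlier MATLAB examples.
B = [[1, 2, 3],
     [4, 5, 6],
     [7, 8, 9]]

def is_prime(n):
    # Trial division: n is prime if no d in [2, sqrt(n)] divides it.
    return n > 1 and all(n % d for d in range(2, int(n ** 0.5) + 1))

# The equivalent of B(isprime(B) & (B > 5)): filter every element by
# the combined condition. (MATLAB scans column by column; we go row
# by row, which makes no difference for this single-result query.)
result = [x for row in B for x in row if is_prime(x) and x > 5]
```

The comprehension is explicit about iteration where the MATLAB version hides it inside matrix operations; which reads better is largely a matter of taste and of how much of your problem is naturally array-shaped.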
It’s worth mentioning that MATLAB is commercial software; throughout these examples, I’ve in fact been using GNU Octave, an open source language designed to be compatible in most respects.

R

The other widely used open source language for data analysis is R, a modern version of the S language for statistical computing that originally came out of Bell Labs around the same time MATLAB was being developed.
Although R has access to a similar set of matrix math routines as MATLAB — via the LAPACK library — it avoids any specialized syntax for numeric matrices. Where R shines, instead, is in datasets that are richer or messier than the pure numerics of MATLAB. R introduces the “data frame” as a core data structure, which is like a matrix with two key differences: first, its columns and rows can have names; second, each column can hold a different data type. Like MATLAB matrices, you can operate on them as a whole. For example, our centroid example from earlier would look like this in R:
D = data.frame(x = c(3,2,6,1), y=c(4,0,1,3))
center = colSums(D) / nrow(D)

center =
  x y
  3 2
Even in this simple example, and even without making use of multiple data types in the columns, it’s awfully convenient to see named columns rather than just numeric indices. R exploits the names for much more than just output, however. First, it has specialized syntax for referencing a column by name. Although we could index the second column with D[,2] (unlike MATLAB, instead of using a specific wildcard character, we just omit the row index that would go before the comma), we can also reference it more simply:
D$y

[1] 4 0 1 3
Even more simply, in the common case where we are working primarily with one dataset, we can use R’s attach() function to import the column names into our local variable scope. (Experienced R users would, however, warn you against using this in more complex cases where you had multiple datasets, or a large number of columns, since you can quickly and confusingly clutter up your scope). Once we’ve attached D, we can access its columns just as “x” and “y”. If we wanted to compute, say, a new column that was the product of D’s x and y columns, nothing could be easier:
attach(D)
x * y

[1] 12  0  6  3
Unlike matrices, R’s data frames can also be extended with new rows or columns, so that we could create a new column for this result if we liked:
D$xy = x * y
However, because this column didn’t exist when we performed the attach(), it won’t be available as a local variable unless we attach(D) again.
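A dict of named, equal-length columns is a crude but serviceable stand-in for a data frame in plain Python (real code would use Pandas, discussed later); here is the same derived-column computation:

```python
# A rough stand-in for an R data frame: named columns of equal length.
D = {"x": [3, 2, 6, 1],
     "y": [4, 0, 1, 3]}

# The equivalent of D$xy = x * y after attach(D): an elementwise
# product stored as a new named column.
D["xy"] = [x * y for x, y in zip(D["x"], D["y"])]
```

Note that the Python version has no analogue of attach(): columns are always referenced through the containing structure, which is more verbose but avoids the scope-cluttering pitfall mentioned above.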
R can do the same logical indexing tricks that MATLAB can, but they work even better with attached named columns. Let’s say we had a data frame with columns for height, weight, and gender:
M = data.frame(height = c(62, 70, 67), weight = c(120, 178, 180),
gender = c("m", "f", "m"))
attach(M)
We can use logical operations on the column variables to produce a vector of booleans, showing us which rows represent men taller than 65″:
height > 65 & gender == "m"

[1] FALSE FALSE  TRUE
But as with MATLAB, we can also use that vector to index into the rows of the original data frame, returning the (in this case) single matching result:
M[height > 65 & gender == "m",]

  height weight gender
    67    180     m
Note the comma at the end, which ensures that this is interpreted as a row index and that we get all columns. (For this specific case, it would also be idiomatic to use the subset() function, which operates similarly but doesn’t require attach()).
Finally, attaching named columns also makes it easy to use another piece of special purpose R syntax: model formula notation. This allows you to express a relationship between two or more variables and is used pervasively throughout R. Functions for plotting, regression and ANOVA, machine learning, and so on can all make use of models described in this form. For example, if we believed there was a linear relationship between the height values and the weight values of M, we might ask R to try to fit this model like so, using the lm() linear model function:
model = lm(weight ~ height)
Similarly, we could bring up a scatter plot with just:
plot(weight ~ height)
If we believed that weight depended on both height and gender, we could express that as weight ~ height + gender. More complicated relationships are possible, too, within limits; for example, the * operator can be used to relate two variables that interact rather than being independent. It would be unusual for a general-purpose language to have built-in syntax for defining these kinds of models; because R was designed for statistics, it’s both available and widely used, and helps greatly in allowing complex, highly configurable features like plotting or generalized linear modeling to be hidden behind a single function call.
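For intuition, here is what that fit computes in the single-predictor case, sketched in plain Python: the ordinary-least-squares slope cov(height, weight) / var(height) and its intercept. (This is a simplification of what lm() actually does; the real function handles multiple predictors via matrix algebra, and the data here are just the three rows of M above.)

```python
# Ordinary least squares for weight ~ height, one predictor.
height = [62, 70, 67]
weight = [120, 178, 180]

n = len(height)
mh = sum(height) / n          # mean height
mw = sum(weight) / n          # mean weight

# slope = sum of co-deviations / sum of squared height deviations
slope = (sum((h - mh) * (w - mw) for h, w in zip(height, weight))
         / sum((h - mh) ** 2 for h in height))
intercept = mw - slope * mh   # the fitted line passes through the means
```

The formula notation’s real value is that this same `weight ~ height` expression can be handed unchanged to plotting, ANOVA, or machine-learning functions, each of which interprets it appropriately.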
MATLAB and R both share this style of exposing very complex functionality through a single function, often with a huge number of optional parameters. For example, R’s plot function literally can take dozens of graphical parameters, and MATLAB’s isn’t far behind. Functions for optimization, clustering, regression, and so on are similar, if less extreme. This works very well for the common case of loading a research data set into an interactive session and exploring it using the standard library toolkit — especially when you have something like model formula notation to keep the default usage very simple. It can be daunting, however, to dive deeper and build larger programs that need to extend, tweak, or reuse parts of this toolkit because a function like plot() or lm() appears to be a black box; either you need exactly what it does, or you need to reimplement the whole thing.
This is certainly not universally true; most would probably agree that Hadley Wickham’s plyr and ggplot2 packages, for example, have elegantly composable APIs. But the underlying structures of these languages may not encourage this kind of careful design. It’s telling, for example, that MATLAB awkwardly requires a separate file for each and every public function or that John Myles White’s excellent R-focused blog describes object-oriented programming in R as “a hack that was put on top of the original S language,” and has trouble puzzling out what the right style is for defining setter methods; it’s also telling that while a tutorial for a general-purpose language like Python will cover defining functions and classes early on, many R and MATLAB tutorials never cover these topics at all.
Performance may also be a factor. Although they can do matrix math very fast, thanks to the underlying libraries, both MATLAB and R have notoriously slow language interpreters (and this goes double for the open-source Octave implementation). This discourages writing large libraries or complex abstractions in the languages themselves and tends to relegate the computational core of any new function to a C or FORTRAN extension, which makes the function even more of a daunting black box to the casual hacker.

Julia

Julia is a modern language for scientific computing, designed to address some of these concerns. Superficially, Julia strongly resembles MATLAB. For example, here’s how the MATLAB documentation says you should compute the density of a matrix — that is, what proportion of its values are non-zero:
x = [1 0 3 0; 4 3 0 1; 2 3 5 5]
density = nnz(x)/prod(size(x))

density = 0.75
To unpack this a little bit: the obscurely named nnz function returns the number of non-zero values in the matrix; size() returns a vector with the matrix dimensions (in this case, [3 4]); and prod() multiplies up all the values in that vector, giving us the total number of entries in the matrix. Simple division gives us the density.
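In plain Python, the same density computation is a pair of comprehensions over a list-of-rows matrix; the names just mirror the MATLAB snippet:

```python
# The matrix from the MATLAB documentation example.
x = [[1, 0, 3, 0],
     [4, 3, 0, 1],
     [2, 3, 5, 5]]

# nnz: count the non-zero entries; total: rows times columns.
nnz = sum(1 for row in x for v in row if v != 0)
total = len(x) * len(x[0])

density = nnz / total
```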
Here’s the equivalent code in Julia:
x = [1 0 3 0; 4 3 0 1; 2 3 5 5]
density = nnz(x)/prod(size(x))
Well, not just equivalent, but identical! Matrix literals and obscure function names and all. Most code won’t port over quite this easily, but Julia is clearly designed to make MATLAB users feel at home. Under the covers, however, things look extremely different. It’s instructive to look at Julia’s implementation of the prod function and compare it to Octave’s (since we don’t have access to MATLAB’s). Here’s a snippet of Octave’s prod:
DEFUN (prod, args, ,
  "prod (X): products")
{
  octave_value_list retval;

  int nargin = args.length ();
  if (nargin == 1) {
      octave_value arg = args(0);
      if (arg.is_real_type ()) {
        Matrix tmp = arg.matrix_value ();
        if (! error_state)
                retval(0) = tmp.prod ();
      } else if (arg.is_complex_type ()) { 
        …
      } else {
          gripe_wrong_type_arg ("prod", arg);
          return retval;
        }
   } else {
        …
   }
}
A few things to notice: first of all, Octave implements this and many of the standard library functions in C; second, there are hard-coded checks for the number of arguments, and for two possible argument types — a standard real-valued matrix and a complex matrix — with calls to separate implementations for each, where the actual computation happens.
Here’s the equivalent code in Julia:
function prod{T}(A::StridedArray{T})
    if isempty(A)
        return one(T)
    end
    v = A[1]
    for i=2:numel(A)
        v *= A[i]
    end
    v
end
It’s considerably shorter and easier to read, mostly because — even though it’s a core function — it’s implemented in Julia itself. It’s also generic, which is to say, this one piece of code will work for integer matrices, or complex, or double-precision, and so on. StridedArray is Julia’s type for dense (as opposed to sparse) n-dimensional arrays; the {T} in the function signature represents a type parameter and can take on any value, including a user-supplied type. An especially interesting thing here is the behavior when the array is empty: even though it doesn’t have any example values to work with, it can pass the type parameter to the one() function to get a "1" of the right type.
It’s also important to point out that even though these arrays are generic, they’re not boxed: an Int8 array will take up much less memory than an Int64 array, and both will be laid out as contiguous blocks of memory; Julia can deal seamlessly and generically with these different immediate types as well as pointer types like String.
In MATLAB, if you define a new data type, it’s possible to provide alternative implementations of functions like prod that operate on that type. The same is true of Julia: the implementation shown above is only used for StridedArray, and Julia provides entirely separate implementations for other types — like DArray, Julia’s distributed array implementation.
Unlike in MATLAB, Julia also lets you provide alternative implementations for more subtle variations. For example, the specific case of a dense array of booleans is overridden to error:
prod(A::StridedArray{Bool}) = error("use all() instead of prod() for boolean arrays")
Julia has full multiple dispatch, meaning that the implementation is chosen based on the specific type (and number) of all of the arguments to the function — which is why the Octave code above has an explicit check for the number of arguments but Julia doesn’t. Here’s a variation that allows you to pass an extra argument specifying which dimension to multiply on (like the extra 2 passed to sum() in the MATLAB example computing distances):
prod{T}(A::StridedArray{T}, region::Dimspec) = areduce(*,A,region,one(T))
This is extremely concise because it’s implemented with higher-order functions, passing the * function as a value to a generic array reduction function — if you passed in + instead, you’d have sum(). Julia makes extensive use of this functional programming style, allowing its core library to be implemented at a high level of abstraction. Abstraction can be costly, but it’s made possible in Julia by a very high-performance language implementation.
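The same higher-order-function trick is easy to sketch in Python: one generic reduction parameterized by a combining operator and its identity element (playing the role of Julia’s one(T)); the name `areduce` here simply mirrors Julia’s internal helper:

```python
from functools import reduce
import operator

def areduce(op, xs, identity):
    # Fold xs with op, starting from the identity element, so an
    # empty input returns the identity -- like one(T) in Julia's prod.
    return reduce(op, xs, identity)

def prod(xs):
    return areduce(operator.mul, xs, 1)

def total(xs):  # sum(), renamed to avoid shadowing the builtin
    return areduce(operator.add, xs, 0)
```

Swapping `operator.mul` for `operator.add` turns the product into a sum, exactly the point made above: the abstraction costs one indirection, which Julia’s compiler can specialize away.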
How high performance? The Julia site lists a handful of benchmarks comparing R, MATLAB, and Julia (as well as some others). For tasks that mostly exercise the underlying matrix libraries, like random matrix multiplication, they all do similarly well, as does C++; for tasks that exercise basic language features, like a simple recursive Fibonacci implementation, Julia is a few times slower than C++ but is around 100 times faster than R and nearly 1000 times faster than MATLAB or Octave. This is a stunning difference, and may sound too good to be true, but although microbenchmarks should certainly be taken with a grain of salt, and these come from an obviously biased source, there’s no reason to think that Julia can’t be that fast. As an existence proof, Google’s V8 JavaScript engine gets very similar performance from a language that’s even more dynamic and difficult to optimize; matching it is impressive but certainly not impossible. (Google has Lars Bak, a master VM implementer with decades of experience starting with the influential SELF project, but the Julia team, like anyone else, has access to those same influential papers).
Julia’s weakness, however, is its libraries. R has CRAN, certainly the most impressive collection of statistical libraries available anywhere. MATLAB also has a wide range of toolboxes available, for a price. Julia also lacks a rich development environment, like RStudio, and has only rudimentary support for plotting, which is a pretty critical part of most exploratory data analysis. Julia does, however, have a very active community, and I hope and believe that the rest will come with time; for now, it’s hard to compete with the decades of accumulated contributions that the older languages have.

… and Python

Reading this, you might get the impression that data analysis is all about numerics and filtering, and maybe plotting. Of course, that’s not true in the real world: data is messy, and in many cases, the majority of the work in a data analysis project is retrieving the data, parsing it, munging it, and so on. Here, it’s unfortunately hard to dispute that general-purpose scripting languages like Perl, Ruby, and Python have much better language and library support than any of the data-specific languages. For that reason, despite the obvious advantages of MATLAB, R, and Julia, it’s also always worth considering what a general-purpose language can bring to the table.
The leading contender here is almost certainly Python. The NumPy library provides a solid MATLAB-like matrix data structure, with efficient matrix and vector operations. That’s not unusual, however. For example, NArray provides a similar facility for Ruby; Java and the various JVM languages can use Colt, and so on. What makes Python stand out are two more libraries that have been built on top of NumPy:
  • SciPy includes a very large collection of numerical, statistical, and optimization algorithms.
  • Wes McKinney’s Pandas provides R-style Data Frame objects (using NumPy arrays underneath to ensure fast computation), along with a wealth of tools for manipulating them.
At the same time, Python has a huge number of well-known libraries for the messier parts of analysis. For example, Beautiful Soup is best-of-breed for quickly scraping and parsing real-world HTML. Together with Python’s strong community and innovative environments like IPython and Reinteract, these libraries make Python a compelling alternative: not as tuned to numerics as MATLAB, or to stats as R, or as fast or elegant as Julia, but a useful (and popular) tool for data analysis all the same.

R for MATLAB users

Help

R/S-Plus                       MATLAB/Octave                Description
help.start()                   doc or help -i               Browse help interactively
help()                         help help or doc doc         Help on using help
help(plot) or ?plot            help plot                    Help for a function
help(package='splines')        help splines or doc splines  Help for a toolbox/library package
demo()                         demo                         Demonstration examples
example(plot)                                               Example using a function

Searching available documentation

R/S-Plus             MATLAB/Octave  Description
help.search('plot')  lookfor plot   Search help files
apropos('plot')                     Find objects by partial name
library()            help           List available packages
find(plot)           which plot     Locate functions
methods(plot)                       List available methods for a function

Using interactively

R/S-Plus                       MATLAB/Octave            Description
Rgui                           octave -q                Start session
source('foo.R')                foo(.m)                  Run code from file
history()                      history                  Command history
savehistory(file=".Rhistory")  diary on [..] diary off  Save command history
q(save='no')                   exit or quit             End session

Operators

R/S-Plus      MATLAB/Octave  Description
help(Syntax)  help -         Help on operator syntax

Arithmetic operators

R/S-Plus        MATLAB/Octave  Description
a <- 1; b <- 2  a=1; b=2;      Assignment; defining a number
a + b           a + b          Addition
a - b           a - b          Subtraction
a * b           a * b          Multiplication
a / b           a / b          Division
a ^ b           a .^ b         Power, a^b
a %% b          rem(a,b)       Remainder
a %/% b                        Integer division
factorial(a)    factorial(a)   Factorial, n!

Relational operators

R/S-Plus  MATLAB/Octave  Description
a == b    a == b         Equal
a < b     a < b          Less than
a > b     a > b          Greater than
a <= b    a <= b         Less than or equal
a >= b    a >= b         Greater than or equal
a != b    a ~= b         Not equal

Logical operators

R/S-Plus   MATLAB/Octave      Description
a && b     a && b             Short-circuit logical AND
a || b     a || b             Short-circuit logical OR
a & b      a & b or and(a,b)  Element-wise logical AND
a | b      a | b or or(a,b)   Element-wise logical OR
xor(a, b)  xor(a, b)          Logical EXCLUSIVE OR
!a         ~a or !a           Logical NOT
           any(a)             True if any element is nonzero
           all(a)             True if all elements are nonzero

Root and logarithm

R/S-Plus  MATLAB/Octave  Description
sqrt(a)   sqrt(a)        Square root
log(a)    log(a)         Logarithm, base e (natural)
log10(a)  log10(a)       Logarithm, base 10
log2(a)   log2(a)        Logarithm, base 2 (binary)
exp(a)    exp(a)         Exponential function

Round off

R/S-Plus    MATLAB/Octave  Description
round(a)    round(a)       Round
ceiling(a)  ceil(a)        Round up
floor(a)    floor(a)       Round down
            fix(a)         Round towards zero

Mathematical constants

R/S-Plus  MATLAB/Octave  Description
pi        pi             pi = 3.141592...
exp(1)    exp(1)         e = 2.718281...

Missing values; IEEE-754 floating point status flags

R/S-Plus  MATLAB/Octave  Description
NaN       NaN            Not a Number
Inf       Inf            Infinity

Complex numbers

R/S-Plus                MATLAB/Octave  Description
1i                      i              Imaginary unit
z <- 3+4i               z = 3+4i       A complex number, 3+4i
abs(3+4i) or Mod(3+4i)  abs(z)         Absolute value (modulus)
Re(3+4i)                real(z)        Real part
Im(3+4i)                imag(z)        Imaginary part
Arg(3+4i)               arg(z)         Argument
Conj(3+4i)              conj(z)        Complex conjugate

Trigonometry

R/S-Plus    MATLAB/Octave  Description
atan2(b,a)  atan2(b,a)     Arctangent, arctan(b/a)

Generate random numbers

R/S-Plus                 MATLAB/Octave   Description
runif(10)                rand(1,10)      Uniform distribution
runif(10, min=2, max=7)  2+5*rand(1,10)  Uniform: numbers between 2 and 7
matrix(runif(36),6)      rand(6)         Uniform: 6x6 array
rnorm(10)                randn(1,10)     Normal distribution

Vectors

R/S-Plus                MATLAB/Octave      Description
a <- c(2,3,4,5)         a=[2 3 4 5];       Row vector, 1 x n matrix
adash <- t(c(2,3,4,5))  adash=[2 3 4 5]';  Column vector, m x 1 matrix

Sequences

R/S-Plus                 MATLAB/Octave     Description
seq(10) or 1:10          1:10              1,2,3, ... ,10
seq(0,length=10)         0:9               0.0,1.0,2.0, ... ,9.0
seq(1,10,by=3)           1:3:10            1,4,7,10
seq(10,1) or 10:1        10:-1:1           10,9,8, ... ,1
seq(from=10,to=1,by=-3)  10:-3:1           10,7,4,1
seq(1,10,length=7)       linspace(1,10,7)  Linearly spaced vector of n=7 points
rev(a)                   reverse(a)        Reverse
                         a(:) = 3          Set all values to same scalar value

Concatenation (vectors)

R/S-Plus  MATLAB/Octave  Description
c(a,a)    [a a]          Concatenate two vectors
c(1:4,a)  [1:4 a]

Repeating

R/S-Plus        MATLAB/Octave  Description
rep(a,times=2)  [a a]          1 2 3, 1 2 3
rep(a,each=3)                  1 1 1, 2 2 2, 3 3 3
rep(a,a)                       1, 2 2, 3 3 3

Miss those elements out

R/S-Plus         MATLAB/Octave  Description
a[-1]            a(2:end)       miss the first element
a[-10]           a([1:9])       miss the tenth element
a[-seq(1,50,3)]                 miss 1,4,7, ...
                 a(end)         last element
                 a(end-1:end)   last two elements

Maximum and minimum

R/S-Plus                        MATLAB/Octave  Description
pmax(a,b)                       max(a,b)       pairwise max
max(a,b)                        max([a b])     max of all values in two vectors
v <- max(a); i <- which.max(a)  [v,i] = max(a)

Vector multiplication

R/S-Plus  MATLAB/Octave  Description
a*a       a.*a           Multiply two vectors
          dot(u,v)       Vector dot product, u . v

Matrices

R/S-Plus                       MATLAB/Octave  Description
rbind(c(2,3),c(4,5)) or
array(c(2,3,4,5), dim=c(2,2))  a = [2 3;4 5]  Define a matrix

Concatenation (matrices); rbind and cbind

R/S-Plus        MATLAB/Octave  Description
rbind(a,b)      [a ; b]        Bind rows
cbind(a,b)      [a , b]        Bind columns
                [a(:), b(:)]   Concatenate matrices into one vector
rbind(1:4,1:4)  [1:4 ; 1:4]    Bind rows (from vectors)
cbind(1:4,1:4)  [1:4 ; 1:4]'   Bind columns (from vectors)

Array creation

R/S-Plus                          MATLAB/Octave  Description
matrix(0,3,5) or array(0,c(3,5))  zeros(3,5)     0-filled array
matrix(1,3,5) or array(1,c(3,5))  ones(3,5)      1-filled array
matrix(9,3,5) or array(9,c(3,5))  ones(3,5)*9    Any-number-filled array
diag(1,3)                         eye(3)         Identity matrix
diag(c(4,5,6))                    diag([4 5 6])  Diagonal
                                  magic(3)       Magic squares; Lo Shu

Reshape and flatten matrices

R/S-Plus                    MATLAB/Octave      Description
matrix(1:6,nrow=3,byrow=T)  reshape(1:6,3,2)'  Reshaping (rows first)
matrix(1:6,nrow=2) or
array(1:6,c(2,3))           reshape(1:6,2,3)   Reshaping (columns first)
as.vector(t(a))             a'(:)              Flatten to vector (by rows, like comics)
as.vector(a)                a(:)               Flatten to vector (by columns)
a[row(a) <= col(a)]         vech(a)            Flatten upper triangle (by columns)

Shared data (slicing)

R/S-Plus  MATLAB/Octave  Description
b = a     b = a          Copy of a

Indexing and accessing elements (Python: slicing)

R/S-Plus                       MATLAB/Octave       Description
a <- rbind(c(11, 12, 13, 14),  a = [ 11 12 13 14;
           c(21, 22, 23, 24),        21 22 23 24;
           c(31, 32, 33, 34))        31 32 33 34 ]  Input is a 3x4 array
a[2,3]                         a(2,3)              Element 2,3 (row,col)
a[1,]                          a(1,:)              First row
a[,1]                          a(:,1)              First column
                               a([1 3],[1 4])      Array as indices
a[-1,]                         a(2:end,:)          All, except first row
                               a(end-1:end,:)      Last two rows
                               a(1:2:end,:)        Strides: every other row
a[-2,-3]                                           All, except row,column (2,3)
a[,-2]                         a(:,[1 3 4])        Remove one column

Assignment

R/S-Plus              MATLAB/Octave         Description
a[,1] <- 99           a(:,1) = 99
a[,1] <- c(99,98,97)  a(:,1) = [99 98 97]'
a[a>90] <- 90         a(a>90) = 90;         Clipping: replace all elements over 90

Transpose and inverse

R/S-Plus          MATLAB/Octave        Description
t(a)              a'                   Transpose
                  a.' or transpose(a)  Non-conjugate transpose
det(a)            det(a)               Determinant
solve(a)          inv(a)               Inverse
ginv(a)           pinv(a)              Pseudo-inverse
                  norm(a)              Norms
eigen(a)$values   eig(a)               Eigenvalues
svd(a)$d          svd(a)               Singular values
                  chol(a)              Cholesky factorization
eigen(a)$vectors  [v,l] = eig(a)       Eigenvectors
rank(a)           rank(a)              Rank

Sum

R/S-Plus           MATLAB/Octave  Description
apply(a,2,sum)     sum(a)         Sum of each column
apply(a,1,sum)     sum(a')        Sum of each row
sum(a)             sum(sum(a))    Sum of all elements
apply(a,2,cumsum)  cumsum(a)      Cumulative sum (columns)

Sorting

R/S-Plus            MATLAB/Octave                  Description
                    a = [ 4 3 2 ; 2 8 6 ; 1 4 7 ]  Example data
t(sort(a))          sort(a(:))                     Flat and sorted
apply(a,2,sort)     sort(a)                        Sort each column
t(apply(a,1,sort))  sort(a')'                      Sort each row
                    sortrows(a,1)                  Sort rows (by first row)
order(a)                                           Sort, return indices

Maximum and minimum

R/S-Plus                   MATLAB/Octave   Description
apply(a,2,max)             max(a)          max in each column
apply(a,1,max)             max(a')         max in each row
max(a)                     max(max(a))     max in array
i <- apply(a,2,which.max)  [v i] = max(a)  return indices, i
pmax(b,c)                  max(b,c)        pairwise max
apply(a,2,cummax)          cummax(a)

Matrix manipulation

R/S-Plus                    MATLAB/Octave         Description
a[,4:1]                     fliplr(a)             Flip left-right
a[3:1,]                     flipud(a)             Flip up-down
                            rot90(a)              Rotate 90 degrees
kronecker(matrix(1,2,3),a)  repmat(a,2,3) or
                            kron(ones(2,3),a)     Repeat matrix: [ a a a ; a a a ]
a[lower.tri(a)] <- 0        triu(a)               Triangular, upper
a[upper.tri(a)] <- 0        tril(a)               Triangular, lower

Equivalents to "size"

R/S-Plus        MATLAB/Octave           Description
dim(a)          size(a)                 Matrix dimensions
ncol(a)         size(a,2) or length(a)  Number of columns
prod(dim(a))    length(a(:))            Number of elements
                ndims(a)                Number of dimensions
object.size(a)                          Number of bytes used in memory

Matrix- and elementwise multiplication

R/S-Plus                      MATLAB/Octave  Description
a * b                         a .* b         Elementwise operations
a %*% b                       a * b          Matrix product (dot product)
outer(a,b) or a %o% b                        Outer product
crossprod(a,b) or t(a) %*% b                 Cross product
kronecker(a,b)                kron(a,b)      Kronecker product
                              a / b          Matrix division, b*a^-1
solve(a,b)                    a \ b          Left matrix division, a^-1*b (solve linear equations)

Find; conditional indexing

R/S-Plus                                    MATLAB/Octave     Description
which(a != 0)                               find(a)           Non-zero elements, indices
which(a != 0, arr.ind=T)                    [i j] = find(a)   Non-zero elements, array indices
ij <- which(a != 0, arr.ind=T); v <- a[ij]  [i j v] = find(a) Vector of non-zero values
which(a>5.5)                                find(a>5.5)       Condition, indices
ij <- which(a>5.5, arr.ind=T); v <- a[ij]                     Return values
                                            a .* (a>5.5)      Zero out elements at or below 5.5

Multi-way arrays

R/S-Plus | MATLAB/Octave | Description
 | a = cat(3, [1 2; 1 2],[3 4; 3 4]); | Define a 3-way array
 | a(1,:,:) | Select the first slice

File input and output

R/S-Plus | MATLAB/Octave | Description
f <- read.table("data.txt") | f = load('data.txt') | Reading from a file (2d)
f <- read.table(file="data.csv", sep=";") | x = dlmread('data.csv', ';') | Reading from a CSV file (2d)
write(f, file="data.txt") | save -ascii data.txt f | Writing to a file (2d)

Plotting

Basic x-y plots

R/S-Plus | MATLAB/Octave | Description
plot(a, type="l") | plot(a) | 1d line plot
plot(x[,1],x[,2]) | plot(x(:,1),x(:,2),'o') | 2d scatter plot
 | plot(x1,y1, x2,y2) | Two graphs in one plot
plot(x1,y1); matplot(x2,y2,add=T) | plot(x1,y1); hold on; plot(x2,y2) | Overplotting: add new plots to current
 | subplot(211) | Subplots
plot(x,y,type="b",col="red") | plot(x,y,'ro-') | Plotting symbols and color

Axes and titles

R/S-Plus | MATLAB/Octave | Description
grid() | grid on | Turn on grid lines
plot(c(1:10,10:1), asp=1) | axis equal (Octave: axis('equal'); replot) | 1:1 aspect ratio
plot(x,y, xlim=c(0,10), ylim=c(0,5)) | axis([ 0 10 0 5 ]) | Set axes manually
plot(1:10, main="title", xlab="x-axis", ylab="y-axis") | title('title'); xlabel('x-axis'); ylabel('y-axis') | Axis labels and titles

Log plots

R/S-Plus | MATLAB/Octave | Description
plot(x,y, log="y") | semilogy(a) | Logarithmic y-axis
plot(x,y, log="x") | semilogx(a) | Logarithmic x-axis
plot(x,y, log="xy") | loglog(a) | Logarithmic x and y axes

Filled plots and bar plots

R/S-Plus | MATLAB/Octave | Description
plot(t,s, type="n", xlab="", ylab=""); polygon(t,s, col="lightblue"); polygon(t,c, col="lightgreen") | fill(t,s,'b', t,c,'g')  % fill has a bug? | Filled plot
stem(x[,3]) |  | Stem-and-leaf plot

Functions

R/S-Plus | MATLAB/Octave | Description
f <- function(x) sin(x/3) - cos(x/5) | f = inline('sin(x/3) - cos(x/5)') | Defining functions
plot(f, xlim=c(0,40), type='p') | ezplot(f,[0,40]) or fplot('sin(x/3) - cos(x/5)',[0,40]) (Octave has no ezplot) | Plot a function for given range

Polar plots

R/S-Plus | MATLAB/Octave | Description
 | theta = 0:.001:2*pi; r = sin(2*theta); polar(theta, r) | Polar plot

Histogram plots

R/S-Plus | MATLAB/Octave | Description
hist(rnorm(1000)) | hist(randn(1000,1)) | Histogram of 1000 normal samples
hist(rnorm(1000), breaks= -4:4) | hist(randn(1000,1), -4:4) | Histogram with specified bins
hist(rnorm(1000), breaks=c(seq(-5,0,0.25), seq(0.5,5,0.5)), freq=F) |  | Histogram with uneven bins, normalized
plot(apply(a,1,sort),type="l") | plot(sort(a)) | Plot of sorted values

3d data

Contour and image plots

R/S-Plus | MATLAB/Octave | Description
contour(z) | contour(z) | Contour plot
filled.contour(x,y,z, nlevels=7, color=gray.colors) | contourf(z); colormap(gray) | Filled contour plot
image(z, col=gray.colors(256)) | image(z); colormap(gray) | Plot image data
 | quiver() | Direction field vectors

Perspective plots of surfaces over the x-y plane

R/S-Plus | MATLAB/Octave | Description
f <- function(x,y) x*exp(-x^2-y^2); n <- seq(-2,2, length=40); z <- outer(n,n,f) | n=-2:.1:2; [x,y] = meshgrid(n,n); z=x.*exp(-x.^2-y.^2); | Example surface data
persp(x,y,z, theta=30, phi=30, expand=0.6, ticktype='detailed') | mesh(z) | Mesh plot
persp(x,y,z, theta=30, phi=30, expand=0.6, col='lightblue', shade=0.75, ltheta=120, ticktype='detailed') | surf(x,y,z) or surfl(x,y,z) (Octave has no surfl) | Surface plot

Scatter (cloud) plots

R/S-Plus | MATLAB/Octave | Description
cloud(z~x*y) | plot3(x,y,z,'k+') | 3d scatter plot

Save plot to a graphics file

R/S-Plus | MATLAB/Octave | Description
postscript(file="foo.eps"); plot(1:10); dev.off() | plot(1:10); print -depsc2 foo.eps (Octave: gset terminal postscript eps; gset output "foo.eps"; plot(1:10)) | PostScript
pdf(file='foo.pdf') |  | PDF
devSVG(file='foo.svg') |  | SVG (vector graphics for www)
png(filename = "Rplot%03d.png") | print -dpng foo.png | PNG (raster graphics)

Data analysis

Set membership operators

R/S-Plus | MATLAB/Octave | Description
a <- c(1,2,2,5,2); b <- c(2,3,4) | a = [ 1 2 2 5 2 ]; b = [ 2 3 4 ]; | Create sets
unique(a) | unique(a) | Set unique
union(a,b) | union(a,b) | Set union
intersect(a,b) | intersect(a,b) | Set intersection
setdiff(a,b) | setdiff(a,b) | Set difference
setdiff(union(a,b),intersect(a,b)) | setxor(a,b) | Set exclusion
is.element(2,a) or 2 %in% a | ismember(2,a) | True for set member
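NumPy mirrors these set operations almost name-for-name, so the same table translates directly to Python (an illustrative sketch using the example sets above):

```python
import numpy as np

a = np.array([1, 2, 2, 5, 2])
b = np.array([2, 3, 4])

print(np.unique(a))          # set unique       -> [1 2 5]
print(np.union1d(a, b))      # set union        -> [1 2 3 4 5]
print(np.intersect1d(a, b))  # set intersection -> [2]
print(np.setdiff1d(a, b))    # set difference   -> [1 5]
print(np.setxor1d(a, b))     # set exclusion    -> [1 3 4 5]
print(np.isin(2, a))         # set membership   -> True
```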

Statistics

R/S-Plus | MATLAB/Octave | Description
apply(a,2,mean) | mean(a) | Average
apply(a,2,median) | median(a) | Median
apply(a,2,sd) | std(a) | Standard deviation
apply(a,2,var) | var(a) | Variance
cor(x,y) | corr(x,y) | Correlation coefficient
cov(x,y) | cov(x,y) | Covariance

Interpolation and regression

R/S-Plus | MATLAB/Octave | Description
z <- lm(y~x); plot(x,y); abline(z) | z = polyval(polyfit(x,y,1),x); plot(x,y,'o', x,z,'-') | Straight line fit
solve(a,b) | a = x\y | Linear least squares $y = ax + b$
 | polyfit(x,y,3) | Polynomial fit
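The straight-line fit in the table has a direct NumPy counterpart (an illustrative sketch; the data here are synthetic samples of y = 2x + 1, invented for the example):

```python
import numpy as np

# Noisy samples of y = 2x + 1 (made up for this sketch)
rng = np.random.default_rng(0)
x = np.linspace(0, 10, 50)
y = 2.0 * x + 1.0 + rng.normal(scale=0.1, size=x.size)

slope, intercept = np.polyfit(x, y, 1)   # degree-1 least-squares fit
z = np.polyval([slope, intercept], x)    # fitted line evaluated at x

# With so little noise, the coefficients land near the true values:
assert abs(slope - 2.0) < 0.1 and abs(intercept - 1.0) < 0.1
```

Higher-degree fits work the same way: `np.polyfit(x, y, 3)` is the analogue of MATLAB's `polyfit(x,y,3)`.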

Non-linear methods

Polynomials, root finding

R/S-Plus | MATLAB/Octave | Description
polyroot(c(1,-1,-1)) | roots([1 -1 -1]) | Find zeros of polynomial
 | f = inline('1/x - (x-1)'); fzero(f,1) | Find a zero near $x = 1$
 | solve('1/x = x-1') | Solve symbolic equations
 | polyval([1 2 1 2],1:10) | Evaluate polynomial

Differential equations

R/S-Plus | MATLAB/Octave | Description
 | diff(a) | Discrete difference function and approximate derivative
 |  | Solve differential equations

Fourier analysis

R/S-Plus | MATLAB/Octave | Description
fft(a) | fft(a) | Fast Fourier transform
fft(a, inverse=TRUE) | ifft(a) | Inverse Fourier transform
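NumPy's `np.fft` module plays the same role (an illustrative sketch; note that, unlike R's unnormalized `fft(a, inverse=TRUE)`, both MATLAB's `ifft` and NumPy's `ifft` include the 1/N scaling, so the round trip recovers the signal directly):

```python
import numpy as np

a = np.array([1.0, 2.0, 0.0, -1.0])

A = np.fft.fft(a)        # forward transform
a_back = np.fft.ifft(A)  # inverse transform (includes the 1/N factor)

# The round trip recovers the original signal up to floating-point error;
# the imaginary parts are numerically zero for real input.
assert np.allclose(a_back.real, a)
```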

Symbolic algebra; calculus

R/S-Plus | MATLAB/Octave | Description
 | factor() | Factorization

Programming

R/S-Plus | MATLAB/Octave | Description
.R | .m | Script file extension
# | % (Octave: % or #) | Comment symbol (rest of line)
library(RSvgDevice) | % must be in MATLABPATH (Octave: % must be in LOADPATH) | Import library functions
string <- "a <- 234"; eval(parse(text=string)) | string='a=234'; eval(string) | Eval
Eval

Loops

R/S-Plus | MATLAB/Octave | Description
for(i in 1:5) print(i) | for i=1:5; disp(i); end | for-statement
for(i in 1:5) { print(i); print(i*2) } | for i=1:5; disp(i); disp(i*2); end | Multiline for statements

Conditionals

R/S-Plus | MATLAB/Octave | Description
if (1>0) a <- 100 | if 1>0 a=100; end | if-statement
 | if 1>0 a=100; else a=0; end | if-else-statement
ifelse(a>0,a,0) |  | Ternary operator (if?true:false)

Debugging

R/S-Plus | MATLAB/Octave | Description
.Last.value | ans | Most recent evaluated expression
objects() | whos or who | List variables loaded into memory
rm(x) | clear x or clear [all] | Clear variable $x$ from memory
print(a) | disp(a) | Print

Working directory and OS

R/S-Plus | MATLAB/Octave | Description
list.files() or dir() | dir or ls | List files in directory
list.files(pattern="\.r$") | what | List script files in directory
getwd() | pwd | Display the current working directory
setwd('foo') | cd foo | Change working directory
system("notepad") | !notepad (Octave: system("notepad")) | Invoke a system command

Thursday, May 30, 2013

Nonlinear Data-Fitting with a specific function in MATLAB

Nonlinear Data-Fitting

This example shows how to fit a nonlinear function to data using several Optimization Toolbox™ algorithms.
Problem Setup
Consider the following data:
Data = ...
  [0.0000    5.8955
   0.1000    3.5639
   0.2000    2.5173
   0.3000    1.9790
   0.4000    1.8990
   0.5000    1.3938
   0.6000    1.1359
   0.7000    1.0096
   0.8000    1.0343
   0.9000    0.8435
   1.0000    0.6856
   1.1000    0.6100
   1.2000    0.5392
   1.3000    0.3946
   1.4000    0.3903
   1.5000    0.5474
   1.6000    0.3459
   1.7000    0.1370
   1.8000    0.2211
   1.9000    0.1704
   2.0000    0.2636];
Let's plot these data points.
t = Data(:,1);
y = Data(:,2);
% axis([0 2 -0.5 6])
% hold on
plot(t,y,'ro')
title('Data points')
% hold off

We would like to fit the function
   y =  c(1)*exp(-lam(1)*t) + c(2)*exp(-lam(2)*t)
to the data.
Solution Approach Using lsqcurvefit
The lsqcurvefit function solves this type of problem easily.
To begin, define the parameters in terms of one variable x:
x(1) = c(1)
x(2) = lam(1)
x(3) = c(2)
x(4) = lam(2)
Then define the curve as a function of the parameters x and the data t:
F = @(x,xdata)x(1)*exp(-x(2)*xdata) + x(3)*exp(-x(4)*xdata);
We arbitrarily set our initial point x0 as follows: c(1) = 1, lam(1) = 1, c(2) = 1, lam(2) = 0:
x0 = [1 1 1 0];
We run the solver and plot the resulting fit.
[x,resnorm,~,exitflag,output] = lsqcurvefit(F,x0,t,y)

hold on
plot(t,F(x,t))
hold off
Local minimum possible.

lsqcurvefit stopped because the final change in the sum of squares relative to 
its initial value is less than the default value of the function tolerance.




x =

  Columns 1 through 3

   3.006845730161635  10.586854247456271   2.889093127582852

  Column 4

   1.400341518260952


resnorm =

   0.147722572048259


exitflag =

     3


output = 

    firstorderopt: 7.893433102770797e-06
       iterations: 6
        funcCount: 35
     cgiterations: 0
        algorithm: 'trust-region-reflective'
          message: [1x459 char]


Solution Approach Using fminunc
To solve the problem using fminunc, we set the objective function as the sum of squares of the residuals.
Fsumsquares = @(x)sum((F(x,t) - y).^2);
opts = optimoptions('fminunc','Algorithm','quasi-newton');
[xunc,ressquared,eflag,outputu] = ...
    fminunc(Fsumsquares,x0,opts)
Local minimum found.

Optimization completed because the size of the gradient is less than
the default value of the function tolerance.




xunc =

  Columns 1 through 3

   2.889015274844785   1.400313980001329   3.006919302957509

  Column 4

  10.586172160407736


ressquared =

   0.147722571779314


eflag =

     1


outputu = 

       iterations: 30
        funcCount: 185
         stepsize: 1
    firstorderopt: 2.947592962696121e-05
        algorithm: 'medium-scale: Quasi-Newton line search'
          message: [1x436 char]

Notice that fminunc found the same solution as lsqcurvefit, but took many more function evaluations to do so. The parameters for fminunc are in the opposite order from those for lsqcurvefit; the larger lam is lam(2), not lam(1). This is not surprising: the order of the variables is arbitrary.
fprintf(['There were %d iterations using fminunc,' ...
    ' and %d using lsqcurvefit.\n'], ...
    outputu.iterations,output.iterations)
fprintf(['There were %d function evaluations using fminunc,' ...
    ' and %d using lsqcurvefit.'], ...
    outputu.funcCount,output.funcCount)
There were 30 iterations using fminunc, and 6 using lsqcurvefit.
There were 185 function evaluations using fminunc, and 35 using lsqcurvefit.
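For readers without the Optimization Toolbox, the same two-exponential fit can be sketched in Python with SciPy's `curve_fit` (an illustrative analogue, run here on noise-free synthetic data generated from parameters similar to the solution above, rather than on the Data matrix):

```python
import numpy as np
from scipy.optimize import curve_fit

# Same model as F above: c1*exp(-lam1*t) + c2*exp(-lam2*t)
def model(t, c1, lam1, c2, lam2):
    return c1 * np.exp(-lam1 * t) + c2 * np.exp(-lam2 * t)

t = np.linspace(0.0, 2.0, 21)
y = model(t, 3.0, 10.0, 3.0, 1.4)   # synthetic, noise-free observations

# Same arbitrary starting point as the MATLAB example: [1 1 1 0]
popt, _ = curve_fit(model, t, y, p0=[1.0, 1.0, 1.0, 0.0])

# Residual norm, the quantity lsqcurvefit reports as resnorm
resnorm = np.sum((model(t, *popt) - y) ** 2)
```

`curve_fit` wraps the same kind of nonlinear least-squares machinery as `lsqcurvefit`, so the considerations about starting points discussed below apply to it as well.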
Splitting the Linear and Nonlinear Problems
Notice that the fitting problem is linear in the parameters c(1) and c(2). This means for any values of lam(1) and lam(2), we can use the backslash operator to find the values of c(1) and c(2) that solve the least-squares problem.
We now rework the problem as a two-dimensional problem, searching for the best values of lam(1) and lam(2). The values of c(1) and c(2) are calculated at each step using the backslash operator as described above.
type fitvector
function yEst = fitvector(lam,xdata,ydata)
%FITVECTOR  Used by DATDEMO to return value of fitting function.
%   yEst = FITVECTOR(lam,xdata) returns the value of the fitting function, y
%   (defined below), at the data points xdata with parameters set to lam.
%   yEst is returned as a N-by-1 column vector, where N is the number of
%   data points.
%
%   FITVECTOR assumes the fitting function, y, takes the form
%
%     y =  c(1)*exp(-lam(1)*t) + ... + c(n)*exp(-lam(n)*t)
%
%   with n linear parameters c, and n nonlinear parameters lam.
%
%   To solve for the linear parameters c, we build a matrix A
%   where the j-th column of A is exp(-lam(j)*xdata) (xdata is a vector).
%   Then we solve A*c = ydata for the linear least-squares solution c,
%   where ydata is the observed values of y.

A = zeros(length(xdata),length(lam));  % build A matrix
for j = 1:length(lam)
   A(:,j) = exp(-lam(j)*xdata);
end
c = A\ydata; % solve A*c = y for linear parameters c
yEst = A*c; % return the estimated response based on c
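The same variable-projection idea can be sketched in Python (an illustrative analogue of fitvector, not the toolbox code, using synthetic data in place of the Data matrix): for any candidate lam, the inner linear least-squares problem for c is solved with `lstsq`, and the outer solver searches only over the two nonlinear parameters.

```python
import numpy as np
from scipy.optimize import least_squares

def model(t, c, lam):
    return c[0] * np.exp(-lam[0] * t) + c[1] * np.exp(-lam[1] * t)

t = np.linspace(0.0, 2.0, 21)
y = model(t, [3.0, 3.0], [10.0, 1.4])    # synthetic observations

def residuals(lam):
    A = np.exp(-np.outer(t, lam))        # column j is exp(-lam[j]*t)
    c, *_ = np.linalg.lstsq(A, y, rcond=None)  # inner linear solve for c
    return A @ c - y                     # residual of the best linear fit

# Outer search over the two nonlinear parameters only,
# starting from the same point as the MATLAB example: lam = [1 0]
sol = least_squares(residuals, x0=[1.0, 0.0])
```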
Solve the problem using lsqcurvefit, starting from a two-dimensional initial point lam(1), lam(2):
x02 = [1 0];
F2 = @(x,t) fitvector(x,t,y);

[x2,resnorm2,~,exitflag2,output2] = lsqcurvefit(F2,x02,t,y)
Local minimum possible.

lsqcurvefit stopped because the final change in the sum of squares relative to 
its initial value is less than the default value of the function tolerance.




x2 =

  10.586092011163505   1.400299514349251


resnorm2 =

   0.147722571771357


exitflag2 =

     3


output2 = 

    firstorderopt: 4.399495381331858e-06
       iterations: 10
        funcCount: 33
     cgiterations: 0
        algorithm: 'trust-region-reflective'
          message: [1x459 char]

The efficiency of the two-dimensional solution is similar to that of the four-dimensional solution:
fprintf(['There were %d function evaluations using the 2-d ' ...
    'formulation, and %d using the 4-d formulation.'], ...
    output2.funcCount,output.funcCount)
There were 33 function evaluations using the 2-d formulation, and 35 using the 4-d formulation.
Split Problem is More Robust to Initial Guess
Choosing a bad starting point for the original four-parameter problem leads to a local solution that is not global. Choosing a starting point with the same bad lam(1) and lam(2) values for the split two-parameter problem leads to the global solution. To show this we re-run the original problem with a start point that leads to a relatively bad local solution, and compare the resulting fit with the global solution.
x0bad = [5 1 1 0];
[xbad,resnormbad,~,exitflagbad,outputbad] = ...
    lsqcurvefit(F,x0bad,t,y)

hold on
plot(t,F(xbad,t),'g')
legend('Data','Global fit','Bad local fit','Location','NE')
hold off

fprintf(['The residual norm at the good ending point is %f,' ...
   ' and the residual norm at the bad ending point is %f.'], ...
   resnorm,resnormbad)
Local minimum possible.

lsqcurvefit stopped because the final change in the sum of squares relative to 
its initial value is less than the default value of the function tolerance.




xbad =

  Columns 1 through 3

 -22.903587482833103   2.479236586680044  28.027347454069787

  Column 4

   2.479103591273296


resnormbad =

   2.217300254426719


exitflagbad =

     3


outputbad = 

    firstorderopt: 0.005748278503129
       iterations: 32
        funcCount: 165
     cgiterations: 0
        algorithm: 'trust-region-reflective'
          message: [1x459 char]

The residual norm at the good ending point is 0.147723, and the residual norm at the bad ending point is 2.217300.

Curve Fitting and Distribution Fitting in MATLAB

Curve Fitting and Distribution Fitting

This example shows the difference between fitting a curve to a set of points, and fitting a probability distribution to a sample of data.
A common question is, "I have some data and I want to fit a Weibull distribution. What MATLAB functions can I use to do Weibull curve fitting?" Before answering that question, we need to figure out what kind of data analysis is really appropriate.
Curve Fitting
Consider an experiment where we measure the concentration of a compound in blood samples taken from several subjects at various times after taking an experimental medication.
time = [ 0.1   0.1   0.3   0.3   1.3   1.7   2.1   2.6   3.9   3.9 ...
         5.1   5.6   6.2   6.4   7.7   8.1   8.2   8.9   9.0   9.5 ...
         9.6  10.2  10.3  10.8  11.2  11.2  11.2  11.7  12.1  12.3 ...
        12.3  13.1  13.2  13.4  13.7  14.0  14.3  15.4  16.1  16.1 ...
        16.4  16.4  16.7  16.7  17.5  17.6  18.1  18.5  19.3  19.7];
conc = [0.01  0.08  0.13  0.16  0.55  0.90  1.11  1.62  1.79  1.59 ...
        1.83  1.68  2.09  2.17  2.66  2.08  2.26  1.65  1.70  2.39 ...
        2.08  2.02  1.65  1.96  1.91  1.30  1.62  1.57  1.32  1.56 ...
        1.36  1.05  1.29  1.32  1.20  1.10  0.88  0.63  0.69  0.69 ...
        0.49  0.53  0.42  0.48  0.41  0.27  0.36  0.33  0.17  0.20];
plot(time,conc,'o');
xlabel('Time'); ylabel('Blood Concentration');

Notice that we have one response variable, blood concentration, and one predictor variable, time after ingestion. The predictor data are assumed to be measured with little or no error, while the response data are assumed to be affected by experimental error. The main objective in analyzing data like these is often to define a model that predicts the response variable. That is, we are trying to describe the trend line, or the mean response of y (blood concentration), as a function of x (time). This is curve fitting with bivariate data.
Based on theoretical models of absorption into and breakdown in the bloodstream, we might, for example, decide that the concentrations ought to follow a Weibull curve as a function of time. The Weibull curve, which takes the form
$$y = c * (x/a)^{(b-1)} * exp(-(x/a)^b),$$
is defined with three parameters: the first scales the curve along the horizontal axis, the second defines the shape of the curve, and the third scales the curve along the vertical axis. Notice that while this curve has almost the same form as the Weibull probability density function, it is not a density because it includes the parameter c, which is necessary to allow the curve's height to adjust to data.
We can fit the Weibull model using nonlinear least squares.
modelFun =  @(p,x) p(3) .* (x ./ p(1)).^(p(2)-1) .* exp(-(x ./ p(1)).^p(2));
startingVals = [10 2 5];
coefEsts = nlinfit(time, conc, modelFun, startingVals);
xgrid = linspace(0,20,100);
line(xgrid, modelFun(coefEsts, xgrid), 'Color','r');

This scatterplot suggests that the measurement errors do not have equal variance, but rather, that their variance is proportional to the height of the mean curve. One way to address this problem would be to use weighted least squares. However, there is another potential problem with this fit.
The Weibull curve we're using, as with other similar models such as Gaussian, gamma, or exponential curves, is often used as a model when the response variable is nonnegative. Ordinary least squares curve fitting is appropriate when the experimental errors are additive and can be considered as independent draws from a symmetric distribution, with constant variance. However, if that were true, then in this example it would be possible to measure negative blood concentrations, which is clearly not reasonable.
It might be more realistic to assume multiplicative errors, symmetric on the log scale. We can fit a Weibull curve to the data under that assumption by logging both the concentrations and the original Weibull curve itself. That is, we can use nonlinear least squares to fit the curve
$$log(y) = log(c) + (b-1)*log(x/a) - (x/a)^b.$$
coefEsts2 = nlinfit(time, log(conc), @(p,x)log(modelFun(p,x)), startingVals);
line(xgrid, modelFun(coefEsts2, xgrid), 'Color',[0 .5 0]);
legend({'Raw Data' 'Additive Errors Model' 'Multiplicative Errors Model'});

This model fit should be accompanied by estimates of precision and followed by checks on the model's goodness of fit. For example, we should make residual plots on the log scale to check the assumption of constant variance for the multiplicative errors. For simplicity we'll leave that out here.
In this example, using the multiplicative errors model made little difference in the model's predictions, but that's not always the case. An example where it does matter is described in the Pitfalls in Fitting Nonlinear Models by Transforming to Linearity example.
Functions for Curve Fitting
MATLAB and several toolboxes contain functions that can be used to perform curve fitting. The Statistics Toolbox™ includes the functions nlinfit, for nonlinear least squares curve fitting, and glmfit, for fitting Generalized Linear Models. The Curve Fitting Toolbox™ provides command line and graphical tools that simplify many of the tasks in curve fitting, including automatic choice of starting coefficient values for many models, as well as robust and nonparametric fitting methods. Many complicated types of curve fitting analyses, including models with constraints on the coefficients, can be done using functions in the Optimization Toolbox™. The MATLAB function polyfit fits polynomial models, and fminsearch can be used in many other kinds of curve fitting.
Distribution Fitting
Now consider an experiment where we've measured the time to failure for 50 identical electrical components.
life = [ 6.2 16.1 16.3 19.0 12.2  8.1  8.8  5.9  7.3  8.2 ...
        16.1 12.8  9.8 11.3  5.1 10.8  6.7  1.2  8.3  2.3 ...
         4.3  2.9 14.8  4.6  3.1 13.6 14.5  5.2  5.7  6.5 ...
         5.3  6.4  3.5 11.4  9.3 12.4 18.3 15.9  4.0 10.4 ...
         8.7  3.0 12.1  3.9  6.5  3.4  8.5  0.9  9.9  7.9];
Notice that only one variable has been measured -- the components' lifetimes. There is no notion of response and predictor variables; rather, each observation consists of just a single measurement. The objective of an analysis for data like these is not to predict the lifetime of a new component given a value of some other variable, but rather to describe the full distribution of possible lifetimes. This is distribution fitting with univariate data.
One simple way to visualize these data is to make a histogram.
binWidth = 2;
binCtrs = 1:binWidth:19;
hist(life,binCtrs);
xlabel('Time to Failure'); ylabel('Frequency'); ylim([0 10]);

It might be tempting to think of this histogram as a set of (x,y) values, where each x is a bin center and each y is a bin height.
h = get(gca,'child');
set(h,'FaceColor',[.98 .98 .98],'EdgeColor',[.94 .94 .94]);
counts = hist(life,binCtrs);
hold on
plot(binCtrs,counts,'o');
hold off

We might then try to fit a curve through those points to model the data. Lifetime data often follow a Weibull distribution, so we might even think to use the Weibull curve from the curve fitting example above.
coefEsts = nlinfit(binCtrs,counts,modelFun,startingVals);
However, there are some potential pitfalls in fitting a curve to a histogram, and this simple fit is not appropriate. First, the bin counts are nonnegative, implying that measurement errors cannot be symmetric. Furthermore, the bin counts have different variability in the tails than in the center of the distribution. They also have a fixed sum, implying that they are not independent measurements. These all violate basic assumptions of least squares fitting.
It's also important to recognize that the histogram really represents a scaled version of an empirical probability density function (PDF). If we fit a Weibull curve to the bar heights, we would have to constrain the curve to be properly normalized.
These problems might be addressed by using a more appropriate least squares fit. But another concern is that for continuous data, fitting a model based on the histogram bin counts rather than the raw data throws away information. In addition, the bar heights in the histogram are very dependent on the choice of bin edges and bin widths. While it is possible to fit distributions in this way, it's usually not the best way.
For many parametric distributions, maximum likelihood is a much better way to estimate parameters. Maximum likelihood does not suffer from any of the problems mentioned above, and it is often the most efficient method in the sense that results are as precise as is possible for a given amount of data.
For example, the function wblfit uses maximum likelihood to fit a Weibull distribution to data. The Weibull PDF takes the form
$$y = (a/b) * (x/a)^{(b-1)} * exp(-(x/a)^b).$$
Notice that this is almost the same functional form as the Weibull curve used in the curve fitting example above. However, there is no separate parameter to independently scale the vertical height. Being a PDF, the function always integrates to 1.
We'll fit the data with a Weibull distribution, then plot a histogram of the data, scaled to integrate to 1, and superimpose the fitted PDF.
paramEsts = wblfit(life);
n = length(life);
prob = counts / (n * binWidth);
bar(binCtrs,prob,'hist');
h = get(gca,'child');
set(h,'FaceColor',[.9 .9 .9]);
xlabel('Time to Failure'); ylabel('Probability Density'); ylim([0 0.1]);
xgrid = linspace(0,20,100);
pdfEst = wblpdf(xgrid,paramEsts(1),paramEsts(2));
line(xgrid,pdfEst)

Maximum likelihood can, in a sense, be thought of as finding a Weibull PDF to best match the histogram. But it is not done by minimizing the sum of squared differences between the PDF and the bar heights.
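The same maximum-likelihood approach is available outside MATLAB; for example, SciPy's `weibull_min.fit` estimates the shape and scale by maximizing the likelihood of the raw data, never touching bin counts. The sketch below is illustrative, using synthetic lifetimes drawn from a known Weibull distribution rather than the life data above (in the notation of the PDF above, `weibull_min`'s shape is b and its scale is a):

```python
import numpy as np
from scipy.stats import weibull_min

# Synthetic lifetimes from a Weibull with shape b = 1.5, scale a = 10
rng = np.random.default_rng(1)
life = weibull_min.rvs(1.5, scale=10.0, size=500, random_state=rng)

# Maximum-likelihood fit; fix the location parameter at zero, since
# lifetimes start at t = 0 (analogous to wblfit's two-parameter model)
b_hat, loc, a_hat = weibull_min.fit(life, floc=0)
```

With 500 samples the estimates land near the generating parameters, and, as with wblfit, no histogram or binning choice enters the calculation.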
As with the curve fitting example above, this model fit should be accompanied by estimates of precision and followed by checks on the model's goodness of fit; for simplicity we'll leave that out here.
Functions for Distribution Fitting
The Statistics Toolbox includes functions, such as wblfit, for fitting many different parametric distributions using maximum likelihood, and the function mle can be used to fit custom distributions for which dedicated fitting functions are not explicitly provided. The function ksdensity fits a nonparametric distribution model to data. The Statistics Toolbox also provides the GUI dfittool, which simplifies many of the tasks in distribution fitting, including generation of various visualizations and diagnostic plots. Many types of complicated distributions, including distributions with constraints on the parameters, can be fit using functions in the Optimization Toolbox. Finally, the MATLAB function fminsearch can be used in many kinds of maximum likelihood distribution fitting.
Although fitting a curve to a histogram is usually not optimal, there are sensible ways to apply special cases of curve fitting in certain distribution fitting contexts. One method, applied on the cumulative probability (CDF) scale instead of the PDF scale, is described in the Fitting a Univariate Distribution Using Cumulative Probabilities example.
Summary
It's not uncommon to do curve fitting with a model that is a scaled version of a common probability density function, such as the Weibull, Gaussian, gamma, or exponential. Curve fitting and distribution fitting can be easy to confuse in these cases, but the two are very different kinds of data analysis.
  • Curve fitting involves modelling the trend or mean of a response variable as a function of a second predictor variable. The model usually must include a parameter to scale the height of the curve, and may also include an intercept term. The appropriate plot for the data is an x-y scatterplot.
  • Distribution fitting involves modelling the probability distribution of a single variable. The model is a normalized probability density function. The appropriate plot for the data is a histogram.