Inside core features of specialized data analysis languages.
Big data frameworks like Hadoop have received a lot of attention recently, and with good reason: when you have terabytes of data to work with — and these days, who doesn’t? — it’s amazing to have affordable, reliable and ubiquitous tools that allow you to spread a computation over tens or hundreds of CPUs on commodity hardware. The dirty truth is, though, that many analysts and scientists spend as much time or more working with mere megabytes or gigabytes of data: a small sample pulled from a larger set, or the aggregated results of a Hadoop job, or just a dataset that isn’t all that big (like, say, all of Wikipedia, which can be squeezed into a few gigs without too much trouble).
At this scale, you don’t need a fancy distributed framework. You can just load the data into memory and explore it interactively in your favorite scripting language. Or, maybe, a different scripting language: data analysis is one of the few domains where special-purpose languages are very commonly used. Although in many respects these are similar to other dynamic languages like Ruby or Javascript, these languages have syntax and built-in data structures that make common data analysis tasks both faster and more concise. This article will briefly cover some of these core features for two languages that have been popular for decades — MATLAB and R — and another, Julia, that was just announced this year.
MATLAB
MATLAB is one of the oldest programming languages designed specifically for data analysis, and it is still extremely popular today. MATLAB was conceived in the late ’70s as a simple scripting language wrapped around the FORTRAN libraries LINPACK and EISPACK, which at the time were the best way to efficiently work with large matrices of data — as they arguably still are, through their successor LAPACK. These libraries, and thus MATLAB, were solely concerned with one data type: the matrix, a two-dimensional array of numbers.This may seem very limiting, but in fact, a very wide range of scientific and data-analysis problems can be represented as matrix problems, and often very efficiently. Image processing, for example, is an obvious fit for the 2D data structure; less obvious, perhaps, is that a directed graph (like Twitter’s follow graph, or the graph of all links on the web) can be expressed as an adjacency matrix, and that graph algorithms like Google’s PageRank can be easily implemented as a series of additions and multiplications of these matrices. Similarly, the winning entry to the Netflix Prize recommendation challenge relied, in part, on a matrix representation of everyone’s movie ratings (you can imagine every row representing a Netflix user, every column a movie, and every entry in the matrix a rating), and in particular on an operation called Singular Value Decomposition, one of those original LINPACK matrix routines that MATLAB was designed to make easy to use.
Its focus on matrices led to some important differences in MATLAB’s design compared to general-purpose programming languages. First, it has special syntax for matrix literals. For simple cases, this will look pretty familiar; here’s a 1×2 matrix (in other words, a row vector):
V = [1 2]
Here’s a 2×3 matrix (the semicolon is what breaks up the rows; the line break is ignored):A = [4 5 6;
7 8 9]
It gets more interesting when you take advantage of interpolation. As
with strings in some other languages, in MATLAB you can mix matrix
variables in with literal data, almost like a template. For example,
given the above definitions, then this:B = [V 3; A]
will produce this 3×3 matrix (I’ll use bold italic to show results throughout):B =
1 2 3
4 5 6
7 8 9
In two dimensions, it’s a little tricker to fit things together than
with strings. If we hadn’t included the “3″ to pad out the first row,
the interpreter would have complained that the dimensions don’t match,
and the literal would have been invalid.More important than the literal syntax is that all of the core library functions and operators in MATLAB were designed to accept and return matrices rather than individual values. This can take some getting used to. It may not seem that strange that we can double every element of our matrix above by adding it to itself:
BB = B + B
BB =
2 4 6
8 10 12
14 16 18
Or even that a function like sqrt
will take the square root of every element:BR = sqrt(B)
BR =
1.0000 1.4142 1.7321
2.0000 2.2361 2.4495
2.6458 2.8284 3.0000
It’s a little bit more strange to pass it to a function like isprime
and get a matrix back as a result, with a “1″ in every matrix location that held a prime number:BP = isprime(B)
BP =
0 1 1
0 1 0
1 0 0
It’s also strangely powerful to be able to naturally extend
algorithms into multiple dimensions. For example, in most programming
languages, if we had a column of numbers and wanted to find the mean, it
would be straightforward to sum up the column and divide by the number
of rows:C = [1; 2; 3; 4; 5; 6; 7]
mean = sum(C) / rows(C)
mean = 4
But let’s say that instead we had a matrix where each row represented an (x,y) point:D = [3 4;
2 0;
6 1;
1 3]
Plotted, it looks like this:Because
sum()
works on columns, no matter how many there are, we can use the exact same code to find the center — technically, the centroid — of these points, which has the mean of all the x values for its x, and the mean of all the y values for its y:center = sum(D) / rows(D)
center =
3 2
What if we wanted to find the distance from each point to that center? Again, we can operate on the matrix as a whole. Distance is the square root of the sum of the squares of the differences for each dimension, or in MATLAB:
distances = sqrt(sum(power(D — center, 2), 2))
distances =
2.0000
2.2361
3.1623
2.2361
(If you’re wondering why we’re passing a 2 to sum()
, that’s because we’re asking to sum the rows — the second dimension — rather than columns, which is the default).All of this would also work unchanged for points with three (or more!) dimensions, simply by adding more columns to our original matrix.
The other important feature to call out in MATLAB’s matrix syntax is the very flexible support for indexing into a matrix. You can, of course, pick out an individual element — for example,
D(1,1)
picks out row 1, column 1 of D — but you can also use ranges, or
wildcards. For example, because a colon by itself acts as a wildcard, Dy = D(:,2)
will pick out the second (y) column for each row in D:Dy =
4
0
1
3
You can also use “logical indexing” to index a matrix by another matrix: M(I)
will return only those elements in M that correspond to non-zero elements of I. To return to our earlier example, primes = B(BP)
would return only those elements of B that are prime, and thus have corresponding 1s in the BP matrix:primes =
7
2
5
3
Because logical and comparison operators also work on matrices, you
can use this almost like a little query language. For example, this will
return all elements of B that are both prime and greater than 5:result = B(isprime(B) & (B > 5))
result = 7
Just to reiterate what’s happening here: each of isprime(B)
and B > 5
are returning matrices full of 0s and 1s; the &
operator is combining them; the resulting matrix is being used to index
into B and return the (single, in this case) result of 7. And yet, it
reads almost like SQL.It’s worth mentioning that MATLAB is commercial software; throughout these examples, I’ve in fact been using GNU Octave, an open source language designed to be compatible in most respects.
R
The other widely used open source language for data analysis is R, a modern version of the S language for statistical computing that originally came out of the Bell Labs around the same time MATLAB was being developed.Although R has access to a similar set of matrix math routines as MATLAB — via the LAPACK library — it avoids any specialized syntax for numeric matrices. Where R shines, instead, is in datasets that are richer or messier than the pure numerics of MATLAB. R introduces the “data frame” as a core data structure, which is like a matrix with two key differences: first, its columns and rows can have names; second, each column can hold a different data type. Like MATLAB matrices, you can operate on them as a whole. For example, our centroid example from earlier would look like this in R:
D = data.frame(x = c(3,2,6,1), y=c(4,0,1,3))
center = colSums(D) / nrow(D)
center =
x y
3 2
Even in this simple example, and even without making use of multiple
data types in the columns, it’s awfully convenient to see named columns
rather than just numeric indices. R exploits the names for much more
than just output, however. First, it has specialized syntax for
referencing a column by name. Although we could index the second column
with D[,2] (unlike MATLAB, instead of using a specific wildcard
character, we just omit the row index that would go before the comma),
we can also reference it more simply:D$y
[1] 4 0 1 3
Even more simply, in the common case where we are working primarily with one dataset, we can use R’s attach()
function to import the column names into our local variable scope.
(Experienced R users would, however, warn you against using this in more
complex cases where you had multiple datasets, or a large number of
columns, since you can quickly and confusingly clutter up your scope).
Once we’ve attached D, we can access its columns just as “x” and “y”. If
we wanted to compute, say, a new column that was the product of D’s x
and y columns, nothing could be easier:attach(D)
x * y
[1] 12 0 6 3
Unlike matrices, R’s data frames can also be extended with new rows
or columns, so that we could create a new column for this result if we
liked:D$xy = x * y
However, because this column didn’t exist when we performed the attach()
, it won’t be available as a local variable unless we attach(D)
again.R can do the same logical indexing tricks that MATLAB can, but they work even better with attached named columns. Let’s say we had a data frame with columns for height, weight, and gender:
M = data.frame(height = c(62, 70, 67), weight = c(120, 178, 180),
gender = c("m", "f", "m"))
attach(M)
We can use logical operations on the column variables to produce a
vector of booleans, showing us which rows represent men taller than 65″:height > 65 & gender == "m"
[1] FALSE FALSE TRUE
But as with MATLAB, we can also use that vector to index into the
rows of the original data frame, returning the (in this case) single
matching result:M[height > 65 & gender == "m",]
height weight gender
67 180 m
Note the comma at the end, which ensures that this is interpreted as a
row index and that we get all columns. (For this specific case, it
would also be idiomatic to use the subset()
function, which operates similarly but doesn’t require attach())
.Finally, attaching named columns also makes it easy to use another piece of special purpose R syntax: model formula notation. This allows you to express a relationship between two or more variables and is used pervasively throughout R. Functions for plotting, regression and ANOVA, machine learning, and so on can all make use of models described in this form. For example, if we believed there was a linear relationship between the height values and the weight values of D, we might ask R to try to fit this model like so, using the
lm()
linear model function:model = lm(weight ~ height)
Similarly, we could bring up a scatter plot with just:plot(weight ~ height)
If we believed that weight depended on both height and gender, we could express that as weight ~ height + gender
. More complicated relationships are possible, too, within limits; for example, the *
operator can be used to relate two variables that interact rather than
being independent. It would be unusual for a general-purpose language to
have built-in syntax for defining these kinds of models; because R was
designed for statistics, it’s both available and widely used, and helps
greatly in allowing complex, highly configurable features like plotting
or generalized linear modeling to be hidden behind a single function
call.MATLAB and R both share this style of exposing very complex functionality through a single function, often with a huge number of optional parameters. For example, R’s plot function literally can take dozens of graphical parameters, and MATLAB’s isn’t far behind. Functions for optimization, clustering, regression, and so on are similar, if less extreme. This works very well for the common case of loading a research data set into an interactive session and exploring it using the standard library toolkit — especially when you have something like model formula notation to keep the default usage very simple. It can be daunting, however, to dive deeper and build larger programs that need to extend, tweak, or reuse parts of this toolkit because a function like
plot()
or lm()
appears to be a black box; either you need exactly what it does, or you need to reimplement the whole thing.This is certainly not universally true; most would probably agree that Hadley Wickham’s plyr and ggplot2 packages, for example, have elegantly composable APIs. But the underlying structures of these languages may not encourage this kind of careful design. It’s telling, for example, that MATLAB awkwardly requires a separate file for each and every public function or that John Myles White’s excellent R-focused blog describes object-oriented programming in R as “a hack that was put on top of the original S language,” and has trouble puzzling out what the right style is for defining setter methods; it’s also telling that while a tutorial for a general-purpose language like Python will cover defining functions and classes early on, many R and MATLAB tutorials never cover these topics at all.
Performance may also be a factor. Although they can do matrix math very fast, thanks to the underlying libraries, both MATLAB and R have notoriously slow language interpreters (and this goes double for the open-source Octave implementation). This discourages writing large libraries or complex abstractions in the languages themselves and tends to relegate the computational core of any new function to a C or FORTRAN extension, which makes the function even more of a daunting black box to the casual hacker.
Julia
Julia is a modern language for scientific computing, designed to address some of these concerns. Superficially, Julia strongly resembles MATLAB. For example, here’s how the MATLAB documentation says you should compute the density of a matrix — that is, what proportion of its values are non-zero:x = [1 0 3 0; 4 3 0 1; 2 3 5 5]
density = nnz(x)/prod(size(x))
density = 0.75
To unpack this a little bit: the obscurely named nnz
function returns the number of non-zero values in the matrix; size()
returns a vector with the matrix dimensions (in this case, [3 4]); and prod()
multiplies up all the values in that vector, giving us the total number
of entries in the matrix. Simple division gives us the density.Here’s the equivalent code in Julia:
x = [1 0 3 0; 4 3 0 1; 2 3 5 5]
density = nnz(x)/prod(size(x))
Well, not just equivalent, but identical! Matrix literals and obscure
function names and all. Most code won’t port over quite this easily,
but Julia is clearly designed to make MATLAB users feel at home. Under
the covers, however, things look extremely different. It’s instructive
to look at Julia’s implementation of the prod
function and compare it to Octave’s (since we don’t have access to MATLAB’s). Here’s a snippet of Octave’s prod
:DEFUN (prod, args, ,
"prod (X): products")
{
octave_value_list retval;
int nargin = args.length ();
if (nargin == 1) {
octave_value arg = args(0);
if (arg.is_real_type ()) {
Matrix tmp = arg.matrix_value ();
if (! error_state)
retval(0) = tmp.prod ();
} else if (arg.is_complex_type ()) {
…
} else {
gripe_wrong_type_arg ("prod", arg);
return retval;
}
} else {
…
}
}
A few things to notice: first of all, Octave implements this and many
of the standard library functions in C; second, there are hard-coded
checks for the number of arguments, and for two possible argument types —
a standard real-valued matrix and a complex matrix — with calls to
separate implementations for each, where the actual computation happens.Here’s the equivalent code in Julia:
function prod{T}(A::StridedArray{T})
if isempty(A)
return one(T)
end
v = A[1]
for i=2:numel(A)
v *= A[i]
end
v
end
It’s considerably shorter and easier to read, mostly because — even
though it’s a core function — it’s implemented in Julia itself. It’s
also generic, which is to say, this one piece of code will work for
integer matrices, or complex, or double-precision, and so on.
StridedArray is Julia’s type for dense (as opposed to sparse)
n-dimensional arrays; the {T}
in the function signature
represents a type parameter and can take on any value, including a
user-supplied type. An especially interesting thing here is the behavior
when the array is empty: even though it doesn’t have any example values
to work with, it can pass the type parameter to the one()
function to get a “1″ of the right type. It’s also important to point out that even though these arrays are generic, they’re not boxed: an Int8 array will take up much less memory than an Int64 array, and both will be laid out as continuous blocks of memory; Julia can deal seamlessly and generically with these different immediate types as well as pointer types like String.
In MATLAB, if you define a new data type, it’s possible to provide alternative implementations of functions like
prod
that operate on that type. The same is true of Julia: the
implementation shown above is only used for StridedArray, and Julia
provides entirely separate implementations for other types — like
DArray, Julia’s distributed array implementation.Unlike in MATLAB, Julia also lets you provide alternative implementations for more subtle variations. For example, the specific case of a dense array of booleans is overridden to error:
prod(A::StridedArray{Bool}) = error("use all() instead of prod() for boolean arrays")
Julia has full multiple dispatch, meaning that the implementation is
chosen based on the specific type (and number) of all of the arguments
to the function — which is why the Octave code above has an explicit
check for the number of arguments but Julia doesn’t. Here’s a variation
that allows you to pass an extra argument specifying which dimension to
multiply on (like the extra 2 passed to sum()
in the MATLAB example computing distances):prod{T}(A::StridedArray{T}, region::Dimspec) = areduce(*,A,region,one(T))
This is extremely concise because it’s implemented with higher-order functions, passing the *
function as a value to a generic array reduction function — if you passed in +
instead, you’d have sum()
.
Julia makes extensive use of this functional programming style,
allowing its core library to be implemented at a high level of
abstraction. Abstraction can be costly, but it’s made possible in Julia
by a very high-performance language implementation.How high performance? The Julia site lists a handful of benchmarks comparing R, MATLAB, and Julia (as well as some others). For tasks that mostly exercise the underlying matrix libraries, like random matrix multiplication, they all do similarly well, as does C++; for tasks that exercise basic language features, like a simple recursive fibonacci implementation, Julia is a few times slower than C++ but is around 100 times faster than R and nearly 1000 times faster than MATLAB or Octave. This is a stunning difference, and may sound too good to be true, but although microbenchmarks should certainly be taken with a grain of salt, and these come from an obviously biased source, there’s no reason to think that Julia can’t be that fast. As an existence proof, Google’s V8 Javascript engine gets very similar performance from a language that’s even more dynamic and difficult to optimize; matching it is impressive but certainly not impossible. (Google has Lars Bak, a master VM implementer with decades of experience starting with the influential SELF project, but the Julia team, like anyone else, has access to those same influential papers).
Julia’s weakness, however, is its libraries. R has CRAN, certainly the most impressive collection of statistical libraries available anywhere. MATLAB also has a wide range of toolboxes available, for a price. Julia also lacks a rich development environment, like RStudio, and has only rudimentary support for plotting, which is a pretty critical part of most exploratory data analysis. Julia does, however, have a very active community, and I hope and believe that the rest will come with time; for now, it’s hard to compete with the decades of accumulated contributions that the older languages have.
… and Python
Reading this, you might get the impression that data analysis is all about numerics and filtering, and maybe plotting. Of course, that’s not true in the real world: data is messy, and in many cases, the majority of the work in a data analysis project is retrieving the data, parsing it, munging it, and so on. In this area, it’s unfortunately hard to dispute that general-purpose scripting languages like Perl, Ruby, and Python, have much better language and library support in this area than any of the data-specific languages. For that reason, despite the obvious advantages of MATLAB, R, and Julia, it’s also always worth considering what a general-purpose language can bring to the table.The leading contender here is almost certainly Python. The NumPy library provides a solid MATLAB-like matrix data structure, with efficient matrix and vector operations. That’s not unusual, however. For example, NArray provides a similar facility for Ruby; Java and the various JVM languages can use Colt, and so on. What makes Python stand out are two more libraries that have been built on top of NumPy:
- SciPy includes a very large collection of numerical, statistical, and optimization algorithms.
- Wes McKinney’s Pandas provides R-style Data Frame objects (using NumPy arrays underneath to ensure fast computation), along with a wealth of tools for manipulating them.