Statistics with Julia: Least Squares Regression with Direct Methods

Linear regression, and ordinary least squares in particular, is one of the most popular tools for data analysis. I am continuing my series on using the Julia language for basic statistical analysis with a review of the most well known direct solutions to the least squares problem.
The Least Squares approach to linear regression was discovered by Gauss and first published by Legendre, although there has been some historical controversy over this point (a translation of Legendre's original from 1804). It has grown to become the workhorse of most statistical analysis: most common estimators can be cast into this framework, it is mathematically very tractable, and it has stood the test of time.
From a machine learning standpoint, this is an example of supervised learning since we have data for the dependent (predicted) variable and the independent variable(s) (the predictors). This means that we will be training the model on some known data, and trying to find the best set of parameters to minimize the difference between our model and the observations.
The notation will vary by field, but it is common in Statistics to denote the observations of the dependent variable by the vector y, the observations of the independent variables by a matrix X, and the model, with a set of parameters \beta, by either \hat y or more often f(X, \beta). As such, our least squares minimization problem can be defined as:
min_{\beta} || f(X, \beta) - y ||_2^2
We can simplify the notation further by adding a column of ones to the matrix X to account for the intercept term, at which point the model we want to fit becomes:
f(X, \beta) = X \beta = \hat y
Those coming from a more numerical background will find this same problem defined as Ax = b.
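To make the augmented notation concrete, here is a minimal sketch in the old-style Julia syntax used throughout this post (the names x and X here are just illustrative):
x = [1:4];                  # raw predictor values
X = [ones(length(x)) x];    # design matrix: a column of ones for the intercept, then x
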
The Least Squares solution is found by minimizing the sum of squares of the residuals (i.e. the difference between the prediction f(X, \beta) and the observations y). In the case of a classical linear model with i.i.d. errors, the estimator that minimizes this Euclidean norm of the residuals is the best linear unbiased estimator (BLUE, from the Gauss-Markov Theorem). This is a common simplifying assumption in statistical models, and there are many transformations of the variables that can help make it hold.
We will now focus on several direct methods for solving this problem; iterative methods will follow in a subsequent post. R and Matlab have built-in methods for doing this, and they both take advantage of different characteristics of the input data (e.g. sparsity); as such, I would never advise using what I provide below in place of these functions, but only as a way to get a better understanding of the different methodologies.

A Simple Motivating Example

We start with a very simple example of four x and y data points. In R:
X = matrix(1:4)
y = c(2, 1, 1, 1)
lm(y ~ X)

Similarly, the simple solution in Julia is available through the linreg() function, which solves the problem efficiently with LAPACK:
X = [1:4]; y = [2, 1, 1, 1];
linreg(X, y)
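
As a quick sanity check (my own addition, with A as an illustrative name), the same fit can be reproduced with the backslash operator on an explicit design matrix:
A = [ones(4) X];   # design matrix with an explicit intercept column
A \ y              # least squares solve; both approaches should give an intercept of 2.0 and a slope of -0.3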

Julia's linreg (which uses LAPACK) is much less rich than the infrastructure that supports lm in R: the only values returned are the coefficients. R's function accepts a formula, following the design from the famous "White Book", "Statistical Models in S" by Chambers and Hastie, and it returns an lm object, which provides a wealth of additional data and features.
I want to look into several different methods for solving the least squares problem. R's lm function uses C and LINPACK under the hood, so this shouldn't be considered a fair comparison of the statistical functions, but more a comparison of the languages themselves. I also don't go into the mathematical derivations, but simply provide the Julia code.

Normal Equations

[Note: By convention, I will be putting semicolons (;) after expressions; they aren't required, but they prevent Julia from printing out the return value.]
The normal equations result from defining the problem in terms of X and finding the first order condition for a minimum.
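Spelled out in the notation above (my own addition), that first order condition is
d/d\beta || X \beta - y ||_2^2 = 2 X' (X \beta - y) = 0  =>  X' X \beta = X' y
The code below factors a "bordered" version of this system, [X'X X'y; y'X y'y], so that a single Cholesky factorization yields both the coefficients and the norm of the residuals.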
Note that I am adding a column of 1's to the input matrix, as this will represent the intercept in the regression equation.
X = [1 1; 1 2; 1 3; 1 4]; y = [2 1 1 1]';
(m, n) = size(X);
C = X' * X; c = X' * y; yy = y' * y;
Cbar = [C c; c' yy];            # "bordered" cross-product matrix
Gbar = chol(Cbar)';             # lower triangular Cholesky factor of Cbar
G = Gbar[1:2, 1:2]; z = Gbar[3, 1:2]'; rho = Gbar[3, 3];   # rho is the norm of the residuals
beta = G' \ z

This makes use of the Cholesky decomposition (you can also find an implementation of a parallel Cholesky decomposition in Julia by Omar Mysore as part of a project for MIT 18.337 "Parallel Computing").
[Note: Apart from the comments, this code is almost exactly valid in Matlab/Octave; the only differences are that the multiple return values from size should be wrapped in square brackets and the matrix slicing should use round parentheses.]
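As a quick check (a sketch of my own, reusing X, y, and beta from the code above; beta_ne is just an illustrative name), the solution from the bordered Cholesky factorization should agree with solving the normal equations directly:
beta_ne = (X' * X) \ (X' * y);   # solve X'X beta = X'y directly
norm(beta - beta_ne)             # should be on the order of machine epsilon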

QR factorization

The QR factorization solution breaks the matrix A into an orthogonal matrix Q and an upper triangular matrix R. It can be convenient to split Q into two blocks of columns: Q1, which multiplies the upper triangular block R1, and Q2, which multiplies the zero block of R.
X = [1 1; 1 2; 1 3; 1 4]; y = [2 1 1 1]';
(m, n) = size(X);
(Q, R) = qr(X);
R1 = R[1:n, 1:n]; Q1 = Q[:, 1:n];   # keep the blocks corresponding to the n columns of X
beta = R1 \ (Q1' * y)               # solve the triangular system R1 * beta = Q1' * y
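
Once beta is in hand, the fitted values and residuals follow directly (again a sketch of my own; yhat and r are illustrative names):
yhat = X * beta;   # fitted values
r = y - yhat;      # residuals; X' * r is (numerically) zero
norm(r)            # Euclidean norm of the residuals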

SVD factorization

The SVD approach can handle situations in which the columns of matrix A do not have full rank.
X = [1 1; 1 2; 1 3; 1 4]; y = [2 1 1 1]';
(m, n) = size(X);
(U, S, V) = svd(X);      # thin SVD: X == U * diagm(S) * V'
c1 = U' * y;             # components of y along the left singular vectors (with the full SVD, the remaining m - n components give the residual)
z1 = c1 ./ S;            # scale by the inverse of the singular values
beta = V * z1
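
To actually exploit the rank deficient case mentioned above, one keeps only the directions whose singular values are not numerically zero; a minimal sketch, with a tolerance of my own choosing and illustrative names tol, k, and beta_minnorm:
tol = S[1] * eps() * max(m, n);                   # singular values come back sorted, so S[1] is the largest
k = sum(S .> tol);                                # numerical rank of X
beta_minnorm = V[:, 1:k] * (c1[1:k] ./ S[1:k])    # minimum norm least squares solution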

You can also find a nice discussion of this with Matlab in Jim Peterson's "Numerical Analysis: Adventures in Frustration".
The SVD provides a more accurate and robust solution than QR (in particular for ill-conditioned or rank deficient problems), but it is considerably more expensive to compute. As a result, modern software uses QR most of the time, unless there are specific cases in which that methodology underperforms or breaks down.

End notes

We will look at iterative methods in the next post, then spend some time looking at how we can characterize the "goodness of fit" and statistical significance in the traditional sense, and finally look at ways to avoid overfitting (e.g. regularization) and models other than OLS (including weighted least squares and logistic regression).
Some useful references on this subject:
* Much of what I showed above was based on sections from "Numerical Methods and Optimization in Finance" by Gilli, Maringer, and Schumann. I highly recommend this text to anyone involved with optimization in finance as it provides a strong overview of most major areas, including computational considerations.
* Another nice free resource is Steven E. Pav's "Numerical Methods Course Notes", which includes examples of direct and iterative methods for least squares using Octave.