Mahout
Mahout is a framework for machine learning and part of the Apache Foundation. A sub-framework of Mahout, Taste, is used specifically for collaborative filtering. The Taste framework comes in two tastes (pun intended):
- Online, where recommendations are computed on demand, typically on smaller datasets. This version is easily integrated into existing Java applications, for example by using one of the existing Recommender algorithms (see the sketch after this list). Online computations are done in memory (as long as the data fits), and the underlying data can be updated frequently by, for example, pushing new .csv files or reading from a SQL database.
- Offline, which utilises Apache Hadoop to achieve scalability. The Mahout developers point out, however, that MapReduce doesn't logically fit all types of algorithms, and are hence exploring alternative distribution methods too.
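To make the online flavor concrete, here is a minimal sketch of a user-based recommender using the classic Taste API (class names as of Mahout 0.x; the file name ratings.csv, the neighborhood size, and the user ID are assumptions for illustration):

```java
import java.io.File;
import java.util.List;

import org.apache.mahout.cf.taste.impl.model.file.FileDataModel;
import org.apache.mahout.cf.taste.impl.neighborhood.NearestNUserNeighborhood;
import org.apache.mahout.cf.taste.impl.recommender.GenericUserBasedRecommender;
import org.apache.mahout.cf.taste.impl.similarity.PearsonCorrelationSimilarity;
import org.apache.mahout.cf.taste.model.DataModel;
import org.apache.mahout.cf.taste.neighborhood.UserNeighborhood;
import org.apache.mahout.cf.taste.recommender.RecommendedItem;
import org.apache.mahout.cf.taste.recommender.Recommender;
import org.apache.mahout.cf.taste.similarity.UserSimilarity;

public class TasteSketch {
  public static void main(String[] args) throws Exception {
    // ratings.csv (assumed name): one "userID,itemID,preference" triple per line.
    // The whole model is held in memory, which is why this suits smaller datasets.
    DataModel model = new FileDataModel(new File("ratings.csv"));

    // Pearson correlation between users' rating vectors.
    UserSimilarity similarity = new PearsonCorrelationSimilarity(model);

    // The 10 most similar users form the neighborhood.
    UserNeighborhood neighborhood = new NearestNUserNeighborhood(10, similarity, model);

    Recommender recommender = new GenericUserBasedRecommender(model, neighborhood, similarity);

    // Top-3 recommendations for user 42, computed on demand.
    List<RecommendedItem> recommendations = recommender.recommend(42, 3);
    for (RecommendedItem item : recommendations) {
      System.out.println(item.getItemID() + " -> " + item.getValue());
    }
  }
}
```

Swapping in a different similarity measure or an item-based recommender is a one-line change, which is what makes this flavor easy to embed in an existing application.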
Not all algorithms provided by Taste are available as Hadoop implementations. There is, for example, an iterative matrix factorization algorithm based on Alternating Least Squares (ALS). Iterative algorithms incur significant overhead when written as MapReduce jobs in Hadoop, since every iteration becomes a separate job with its own scheduling and disk I/O; a better fit could be to model the computation with the bulk synchronous parallel model, as in Pregel.
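For reference, ALS in its common regularized formulation minimizes the following objective (notation mine; a sketch of the technique rather than Mahout's exact formulation):

```latex
\min_{U,V} \sum_{(u,i)\,\text{observed}} \bigl(r_{ui} - \mathbf{u}_u^{\top}\mathbf{v}_i\bigr)^2
  \;+\; \lambda \Bigl(\sum_u \lVert \mathbf{u}_u \rVert^2 + \sum_i \lVert \mathbf{v}_i \rVert^2 \Bigr)
```

Fixing the item vectors $\mathbf{v}_i$ turns this into independent least-squares problems for each user vector $\mathbf{u}_u$, and vice versa; the algorithm alternates between the two. Each such alternation maps to a full pass over the data, which in Hadoop means a fresh MapReduce job per half-iteration, and that is exactly where the per-iteration overhead mentioned above comes from.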
Building a system which combines Mahout's offline and online capabilities seems yet to be done. Since you want your online computations to be O(1), I'm not sure that Mahout is a good fit there; it might be easier to handle online updates on the side, and use Mahout only for the offline computations.
Decent introductions to Mahout can be found here and here.
This page has information about the recommender architecture and how to build your own recommenders. The architecture documentation does not give an intuitive explanation of how the collaborative filtering framework connects to Hadoop. Based on a brief tour of the source code, it looks like Mahout provides a "Hadoop job factory" which generates and submits MapReduce jobs to your Hadoop cluster.
Mahout has also been shown to run on AWS Elastic MapReduce, which, judging by the README, does not seem like a trivial task.
Foursquare provides a pretty interesting use case of Mahout with extremely large datasets, and also emphasizes that Mahout is geared towards industry.
Graphlab
The Graphlab project, on the other hand, takes a quite different approach to parallel collaborative filtering (and machine learning more broadly), and is primarily used by academic institutions. Graphlab jobs operate on a graph data structure, much like Google's Pregel system. Computation is defined through an update function which operates on one vertex of the graph at a time. During an update call, new updates can be scheduled for other vertices of the graph, and a central scheduler delegates vertices for processing. For a good example, see the Graphlab implementation of Pagerank.
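To make the update-function model concrete, here is a minimal single-threaded Java sketch of the pattern applied to a toy Pagerank. This illustrates the programming model only; it is not Graphlab's actual (C++) API, and all names and constants are mine:

```java
import java.util.ArrayDeque;
import java.util.List;
import java.util.Queue;

// Toy illustration of the update-function-plus-scheduler model:
// an update call recomputes one vertex from its neighbors' most
// recent values and may schedule further updates (asynchronous,
// no global barrier between iterations).
public class UpdateFunctionSketch {
  static final double DAMPING = 0.85;
  static final double TOLERANCE = 1e-4;

  public static void main(String[] args) {
    // Toy 3-vertex graph: incoming.get(v) lists vertices linking to v,
    // outDegree[v] is the number of links leaving v.
    List<List<Integer>> incoming = List.of(
        List.of(1, 2),   // 1 -> 0, 2 -> 0
        List.of(2),      // 2 -> 1
        List.of(0));     // 0 -> 2
    int[] outDegree = {1, 1, 2};
    double[] rank = {1.0, 1.0, 1.0};

    // The scheduler: vertices waiting for an update call.
    Queue<Integer> scheduler = new ArrayDeque<>(List.of(0, 1, 2));

    while (!scheduler.isEmpty()) {
      int v = scheduler.poll();

      // Update function: recompute this vertex's rank from its
      // in-neighbors' current (most recent) values.
      double sum = 0.0;
      for (int u : incoming.get(v)) {
        sum += rank[u] / outDegree[u];
      }
      double newRank = (1 - DAMPING) + DAMPING * sum;
      double change = Math.abs(newRank - rank[v]);
      rank[v] = newRank;

      // If the value changed noticeably, schedule the vertices that
      // depend on it, mirroring how an update call can request
      // further updates on other vertices.
      if (change > TOLERANCE) {
        for (int w = 0; w < outDegree.length; w++) {
          if (incoming.get(w).contains(v)) {
            scheduler.add(w);
          }
        }
      }
    }

    for (int v = 0; v < rank.length; v++) {
      System.out.println("vertex " + v + " rank " + rank[v]);
    }
  }
}
```

The real system runs many such update calls in parallel and lets the scheduler prioritize vertices, but the structure, a per-vertex update function plus dynamic scheduling, is the same.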
Contrary to Hadoop, Graphlab is built for multi-core parallelism, although there is ongoing work on making it easier to use in a distributed setup. It also seems to lack Hadoop-style fault-tolerance mechanisms (in Hadoop, for example, map and reduce tasks that fail to complete are restarted by the master).
However, Graphlab boasts that "implementing efficient and provably correct parallel machine learning algorithms" is easier than with MapReduce, especially since computation is not required to be transformed into an embarrassingly parallel form. It differs from Pregel in that communication between vertices is implicit, and in that computation is asynchronous. The latter implies that computation on a vertex happens on the most recent available data. By ensuring that all computations are sequentially consistent, the end result will eventually also be consistent, programs become easier to debug, and the complexity of parallelism is reduced.