Thursday, May 17, 2012

GraphLab: A parallel framework for machine learning (ML)

GraphLab: A parallel framework for machine learning (ML)

GraphLab is a powerful new system (A Tool) for designing and implementing parallel algorithms in machine learning. At the beginning it has been developed for scientific purposes.

The GraphLab is a flexible tool developed in Java 1.6 which provide graph facilities.These facilities have been developed and used to implement CAD (computer added design) design flows for embedded system design and more precisely hardware IP generation.

The GraphLab tool includes many design flows for High-Level Synthesis and many more functionality's. Its key features for hardware designers are:

High-level synthesis under latency constraint,
High-level synthesis under area constraint (tutorial),
High-level synthesis using non-uniform word-length (tutorial),
High-level synthesis for multimode design (mutually exlusive applications sharing resources in a single design),
High-level synthesis using redundancy techniques for high-reliabilty applications,
Full pipeline design generation for high throughtput applications,
FSM controller optimization for low area and low power design

The other input labguages supported are: MATLAB, C/C++ and other input languages used in the GraphLab tool.

There is not a released version of GraphLab, since GraphLab is a tool which increases everyday its possibilities with new functionnalities.

GraphLab Vs MapReduce

MapReduce abstraction, is defined in two parts:

A Map stage which performs computation on indepedent problems which can be solved in isolation, and
2. A Reduce stage which combines the results.

GraphLab provides a similar analog to the Map in the form of an Update Function. The Update Function however, is able to read and modify overlapping sets of data (program state) in a controlled fashion as defined by the user provided data graph. The user provided data graph represents the program state with arbitrary blocks of memory associated with each vertex and edges. In addition the update functions can be recursively triggered with one update function spawning the application of update functions to other vertices in the graph enabling dynamic iterative computation. GraphLab uses powerful scheduling primitives to control the order update functions are executed.
The GraphLab analog to Reduce is the Sync Operation. The Sync Operation also provides the ability to perform reductions in the background while other computation is running. Like the update function sync operations can look at multiple records simultaneously providing the ability to operate on larger dependent contexts.

GraphLab performs around x50 - x100 faster than hadoop based Mahout (Is a framework for machine learning and part of the Apache Foundation. A sub-framework of Mahout is Taste used specifically for collaborative filtering).

GraphLab: A framework for Parallel Machine Learning Documentation

1.0

Introduction

GraphLab is a powerful new system for designing and implementing parallel algorithms in machine learning. While the current targets multi-core shared memory parallel systems we are in the process of implementing a distributed version and plan to provide support for alternative parallel architectures include GPUs in the near future.
The easiest way to pick up GraphLab is to code! Here is a pagerank example which will provide you with some of the high level ideas of GraphLab And here is a more detailed example which provides more details as well as the the supporting APIs surrounding GraphLab.
The key pages of interest are:

The graphlab::graph data structure.
represents a directed graph container and is used extensively throughout GraphLab.
Scopes
Schedulers
Shared Data

GraphLab is heavily templatized and the following two structures help to simplify template usage.

The graphlab::core data structure.
This provides a convenient wrapper around most of Graphlab.
graphlab::types
This provides typedefs for all shared memory GraphLab types.

GraphLab is additionally supported by a serialization library, a fast parallel/thread-safe random number generator, as well as a flexible command line parsing system.

Serialization
Random Number Generators
graphlab::command_line_options
The detailed example provides a good example of how this is used

The GraphLab library also has a collection of parallel utility classes and functions which may be useful.

Distributed GraphLab

There is a basic usage documentation for the distributed GraphLab implementation here: Using Distributed GraphLab .
It relies on a nice RPC implementation documented here: GraphLab RPC .

Installing GraphLab on Ubuntu

Recently I am spending more and more time giving supports to people who are installing Graphlab on ubuntu. Installation should be rather simple. The directions below are for Ubuntu Natty (11.04)

0) Login into your ubuntu machine (64 bit).
On amazon EC2 - you can launch instance AMI-e9965180

1) Installing libboost
sudo apt-get update
sudo apt-get install zlib1g-dev libbz2-dev
sudo apt-get install build-essential
sudo apt-get install libboost1.42-all-dev

2) Installing itpp

sudo apt-get install libitpp-dev libitpp7-dbg

3) Install Mercurial

sudo apt-get install mercurial

4) Install cmake

sudo apt-get install cmake

5a) Install graphlab from mercurial

Go to graphlab download page, and follow the download link to the mercurial repository.

copy the command string: "hg clone..." and execute it in your ubuntu shell.

or 5b) Install graphlab from tgz file

Go to graphlab download page, and download the latest release.
Extract the tgz file using the command: "tar xvzf graphlabapi_v1_XXX.tar.gz"
where XXX is the version number you downloaded.

6a) configure and compile - for GraphLab version 1

cd graphlabapi

export BOOST_ROOT=/usr/

./configure --bootstrap --yes
cd release/
make -j4

6b) configure and compile - for GraphLab version 2

cd graphlabapi
hg pull
hg update v2

export BOOST_ROOT=/usr/

./configure
cd release/
make -j4

7) Test Graphlab
cd tests
./runtests.sh

8) Optional: install Octave.
If you don't have access to Matlab, Octave is an open source replacement. Octave is useful for preparing input formats to GraphLab's collaborative filtering library and reading the output.
You can install Octave using the command:

sudo apt-get install octave3.2

Let me know how it went!

Note: from Clive Cox, RummbleLabs.com : For Ubuntu Lucid add the following:

sudo add-apt-repository ppa:lucid-bleed/ppa
sudo apt-get update

Graph-parallel frameworks

By Mosharaf on November 18, 2011

Google, “Pregel: A System for Large-Scale Graph Processing,” SIGMOD, 2010. [PDF]

Carnegie Mellon, “GraphLab: A New Framework for Parallel Machine Learning,” arXiv:1006.4990, 2010. [PDF]

Summary

Data-parallel frameworks such as MapReduce and Dryad are good at performing embarrassingly parallel jobs. These frameworks are not ideal for iterative jobs and for jobs where data-dependencies across stages are sparse (e.g., in MapReduce, each reducer is likely to depend on each mapper). However, there are many problems, specially in machine learning, that can be intuitively expressed using graphs with sparse computational dependencies, require multiple iterations to converge, and have variable convergence rate for different parameters. Pregel and GraphLab are two frameworks optimized for this type of graph-based problems.
A typical graph-parallel problem is expressed using graphs with vertices and edges, where each vertex and edge have associated data with them. In every iteration, vertex and edge data are updated and a bunch messages are exchanged between neighboring entities. This update function is typically the same for every vertex, and it is written by the user. There may or may not be a synchronization step at the end of every iteration. In a distributed setting, the graph is cut and divided across multiple nodes and updates from a collection of vertices in one node is communicated to another using message passing.

Pregel vs GraphLab

The key difference between Pregel and GraphLab is that Pregel has a barrier at the end of every iteration, whereas GraphLab is completely asynchronous. Asynchrony in GraphLab allows it to prioritize more complex vertices over others, but it also calls for consistency models to maintain sanity of results. GraphLab proposes three consistency models: full, edge, and vertex consistency, to allow different levels of parallelism. Another difference is that Pregel allows dynamic modifications to the graph structure, whereas GraphLab does not.

Comments

Pregel and GraphLab sit at two ends of the “power of framework” vs “ease of use” tradeoff space. Allowing asynchrony makes GraphLab more general and powerful than Pregel, but it is more complex and requires users to understand which consistency model is suitable for them. Pregel is simpler (common for most frameworks in Google’s arsenal), but still capable of handling a wide variety of problems. Given its origin at Google, open-source clones like Giraph, Pregel’s model is more likely to succeed in near future.

Mahout

Is a framework for machine learning and part of the Apache Foundation. A sub-framework of Mahout is Taste used specifically for collaborative filtering.
The Taste framework comes in two tastes (pun intended):

Online where recommendations are computed on demand, typically on smaller datasets. This version is easily integrated in existing Java applications either by using on of the existing Recommender algorithms. Online computations are done in memory (as long as they fit) these can be updated more frequently by, for example, pushing new .csv files, or using data from a SQL database.
Offline which utilise Apache Hadoop to achieve scalability. Mahout points out, however, that map-reduce tasks doesn’t logically fit all types of algorithms and are hence exploring alternative distribution methods too.

The recommender beginner’s wiki points out that datasets containing up to 100M user-item ratings should be computable online using a decent server.
Not all algorithms provided by Taste are available as Hadoop implementations. There is an iterative algorithm for matrix factorization using Alternating Least Squares. Iterative algorithms incur significant overhead when written as MapReduce jobs in Hadoop (a better way could be to model the computation using bulk synchronous processing, like Pregel).
Building a system which combines Mahout’s offline and online capabilities seems yet to be done. Basically since you want your online computations to be O(1) I’m not sure that Mahout is a good fit. It might be easier to do online updates on data on the side, and possibly use Mahout for the offline computations.
Decent introductions to Mahout can be found here and here.
This page has information about the recommender architecture and how to build your own recommenders. The architecture does not provide an intuitive explanation for how the collaborative framework connects to Hadoop. Based on a brief tour in the source code it looks like Mahout provides a “Hadoop Job Factory” which generates and submits map- and reduce tasks (aka jobs) to your Hadoop cluster.
Mahout has also been shown to run on AWS Elastic MapReduce which, given the readme, does not seem like a trivial task.
Foursquare provides a pretty interesting use-case on Mahout with extremely large datasets, and also emphasizes the fact that Mahout is geared towards industry.

Graphlab

On the other hand, the Graphlab project takes a quite different approach to parallel collaborative filtering (more broadly, machine learning), and is primarily used by academic institutions.
Graphlab jobs operate on a graph data structure much similar to Google’s system Pregel. Computation is defined through an update-function which operates on one vertex of the graph at the time. During an update call, new update requests can be scheduled with other vertices of the graph. A central scheduler delegates vertices for processing. For a good example, see the Graphlab implementation of Pagerank.
Contrary to Hadoop, Graphlab is built for multi-core parallelism, although there is on-going work in making it easier to user in a distributed setup. It also seem to lack mechanisms for fault-tolerance (for example, map or reduce tasks are restarted by the master if they fail to complete).
However, Graphlab boasts that “implementing efficient and provably correct parallel machine algorithms” is easier when compared to MapReduce. Especially since computation is not required to be transformed into an embarrassingly parallel form. It is different from Pregel in the sense that communication between vertices is implicit, and that computation is asynchronous. The latter implies that computation on a vertex will happen on the most recent available data. By ensuring that all computations are sequentially consistent, the end result data will eventually also be consistent, programs becomes easier to debug, and complexity of parallelism is reduced.

Summary

Mahout looks like a more polished product, especially as it relies on Hadoop for scalability and distribution. Its computational model may, however, be constrained just because of the same prerequisite. It is, hence, here Graphlab excells since it is built ground up for iterative algorithms such as those used in collaborative filtering. On the downside, Graphlab lacks a production-ready distribution framework.

What is this?
Many researchers say, start writing your thesis early. This is my way of approaching that. The goal is to write at least one post everyday. For common reasons I will not be able to share all details, however, I will at least try to collect interesting findings, ideas that come to mind, portions of the report, and of course occasional hickups.
If you have suggestions for improvements, feel free to contact me.
The Feed
Feeling for a post a day? Then the feed is for you. And no, I don't normally post on weekends ;)

Marcus who?

Not interested in my thesis work but want to find out more about me... (a little freakish), check my personal blog on http://ljungblad.nu
Contact me on Twitter or by e-mail.

What is Tuenti?

Tuenti is the largest private social network in the world with a primary focus on the Spanish market. It has around 13 million users and is still growing strong. For more information about Tuenti check their blog and their tech blog.
Btw, it is a really cool place to work at.

EMDC what?

EMDC stands for European Master in Distributed Systems and is a double-degree computer science programme jointly run by the KTH, UPC, and IST. Our class is a small group of dedicated students with a passion for technologies related to scalability, fault-tolerance, high-availability, networks, clouds, distribution and much more.

GraphLab website

High-Level Synthesis and much more...

The GraphLab tool is dedicated to graph model applications. It has been developed for scientific purposes at the beginning. As my research field includes CAD tool and more precisely the High-Level Synthesis ones, I have developed HLS flows using the GraphLab framework.

At the moment the GraphLab tool implements multiple HLS flows handling various constraints and different optimization goals. A list of the HLS functionnalities is provided bellow.

What is GraphLab ?

The GraphLab is a flexible tool developed in Java 1.6 which provide graph facilities. These facilities have been developed and used to implement CAD design flows for embedded system design and more precisely hardware IP generation.

The GraphLab tool includes many design flows for High-Level Synthesis and many more functionality's. Its key features for hardware designers are:

High-level synthesis under latency constraint,
High-level synthesis under area constraint (tutorial),
High-level synthesis using non-uniform word-length (tutorial),
High-level synthesis for multimode design (mutually exlusive applications sharing resources in a single design),
High-level synthesis using redundancy techniques for high-reliabilty applications,
Full pipeline design generation for high throughtput applications,
FSM controller optimization for low area and low power design.

This list does not provide all the implemented functionnalities developed. For more information on their usage and examples, look at the tutorial section. The input langages supported are:

MatLab: using the built-in parser (not all the MatLab semantics are supported),
C/C++: using the C to XML parser developped with the CAIRN team,
Others input langages may be adder using the open XML graph format used in the GraphLab tool.

Is there a released version of GraphLab ?

For the moment we do not release GraphLab versions. GraphLab is a tool which increases everyday its possibilities with new functionnalities. In these conditions we prefer to develiver snapshot version of the tool, it helps to be more reactive to correct bugs or things like that...

In order to download the current snapshot of the GraphLab tool, go to the download section and get the lastest one...

A Data Scientist's blog

Search This Blog

Thursday, May 17, 2012

GraphLab: A parallel framework for machine learning (ML)

GraphLab: A framework for Parallel Machine Learning Documentation

GraphLab: A framework for Parallel Machine Learning Documentation

1.0

Introduction

Distributed GraphLab

Installing GraphLab on Ubuntu

Installing GraphLab on Ubuntu

Graph-parallel frameworks

Graph-parallel frameworks

Summary

Pregel vs GraphLab

Comments

Mahout

Thesis Log

Mahout

Graphlab

Summary

What is this?

The Feed

Marcus who?

What is Tuenti?

EMDC what?

GraphLab website