A Data Scientist's blog: What is Hadoop?

Hadoop is a fault-tolerant distributed system for data storage which is highly scalable. The scalability is the result of a Self-Healing High Bandwith Clustered Storage , known by the acronym of HDFS (Hadoop Distributed File System) and a specific fault-tolerant Distributed Processing, known as MapReduce.

(Hadoop Distributed File System) and a specific fault-tolerant Distributed Processing, known as MapReduce.

Why Hadoop as part of the IT?

It processes and analyzes variety of new and older data to extract meaningful business operations wisdom. Traditionally data moves to the computation node. In Hadoop, data is processed where the data resides . The type of questions one Hadoop helps answer are:

Event analytics — what series of steps lead a purchase or registration

Large scale web click stream analytics

Revenue assurance and price optimizations

Financial risk management and affinity engine

Many other...

The Hadoop cluster or cloud is disruptive in data center. Some grid software resource managers can be integrated with Hadoop. The main advantage is that Hadoop jobs can be submitted orderly from within the data center. See below the integration with Oracle Grid Engine.

What types of data we handle today?

Human-generated data that fits well into relational tables or arrays. Examples are conventional transactions – purchase/sale, inventory/manufacturing, employment status change, etc. This is the core data managed by OLTP relational DBMS everywhere. In the last decade, humans generated other kinds of data as well, like text, documents (text or otherwise), pictures, videos, slideware. Traditional relational databases are a poor home for this kind of data because:

It often deals with opinions or aesthetic judgments – there is little concept of perfect accuracy.
There is little concept of perfect completeness.
There’s also little concept of perfectly, unarguably accurate query results –
- Different people will have different opinions as to what comprises good results for a search.
No clear cut binary answers; documents can have differing degrees of relevancy

Another type of data is the machine generated data, machine that human created and that produce unstoppable streams of data

Computer logs
Satellite telemetry (espionage or science)
GPS outputs
Temperature and environmental sensors
Industrial sensors
Video from security cameras
Outputs from medical devises
Seismic and Geo-phisical sensors
…

According to Gartner , Enterprise Data will grow 650% by 2014. 85% of these data will be “unstructured data”, and this segment has a CAGR of 62% per year, far larger than transactional data.

Example of Hadoop usage

Netflix (NASDAQ: NFLX) is a service offering online flat rate DVD and Blu-ray disc rental-by-mail and video streaming in the United States. It has over 100,000 titles and 10 million subscribers. The company has 55 million discs and, on average, ships 1.9 million DVDs to customers each day. Netflix offers Internet video streaming, enabling the viewing of films directly on a PC or TV at home. Netflix’s movie recommendation algorithm uses Hive (underneath using Hadoop, HDFS & MapReduce) for query processing and Business Intelligence. Netflix collects all logs from website which are streaming logs collected using Hunu.
They parse 0.6TB of data running on Amazon S3 50 nodes. All data are processed for Business Intelligence using a software called MicroStrategy.

Hadoop challenges

Traditionally, Hadoop was opened for developers. But the wide adoption and success of Hadoop depends on business users, not developers. Commercial distributions will have to make it even easier for business analysts to use Hadoop.

Servers running Hadoop at Yahoo.com Templates for business scripts are a start, but getting away from scripting altogether should be the long term goal for the business user segment. This has not happened yet. Nevertheless Cloudera is trying to win the business user segment, and if they succeed they will create an enterprise Hadoop market.
To best illustrate, here it is a quote from Yahoo Hadoop development team:

“The way Yahoo! uses Hadoop is changing. Previously, most Hadoop users at Yahoo! were researchers. Researchers are usually hungry for scalability and features, but they are fairly tolerant of failures. Few scientists even know what "SLA" means, and they are not in the habit of counting the number of nines in your uptime. Today, more and more of Yahoo! production applications have moved to Hadoop. These mission-critical applications control every aspect of Yahoo!'s operation, from personalizing user experience to optimizing ad placement. They run 24/7, processing many terabytes of data per day. They must not fail. So we are looking for software engineers who want to help us make sure Hadoop works for Yahoo! and the numerous Hadoop users outside Yahoo!”

Hadoop Integration with resource management cloud software

One such example is Oracle Grid Engine 6.2 Update 5. Cycle Computing also announced an integration with Hadoop. It reduces the cost of running Apache Hadoop applications by enabling them to share resources with other data center applications, rather than having to maintain a dedicated cluster for running Hadoop applications. Here is a relevant customer quote “The Grid Engine software has dramatically lowered for us the cost of data intensive, Hadoop centered, computing. With its native understanding of HDFS data locality and direct support for Hadoop job submission, Grid Engine allows us to run Hadoop jobs within exactly the same scheduling and submission environment we use for traditional scalar and parallel loads. Before we were forced to either dedicate specialized clusters or to make use of convoluted, adhoc, integration schemes; solutions that were both expensive to maintain and inefficient to run. Now we have the best of both worlds: high flexibility within a single, consistent and robust, scheduling system"”

Getting Started with Hadoop

Hadoop is an open source implementation of the MapReduce algorithms and distributed file system. Hadoop is primarily developed in Java. Writing a Java application, obviously, will give you much more control and presumably improved performance. However, it can be used with other environments including scripting languages using “streaming”. Streaming applications simply reads data from stdin and write their output to stdout.

Installing Hadoop

To install Hadoop, you will need to download Hadoop Common (also referred as Hadoop Core) from http://hadoop.apache.org/common/. The binaries are available from Open Source under an Apache License. Once you have downloaded the Hadoop Common, follow the installation and configuration instructions.

Hadoop With Virtual Machine

If you have no experience playing with Hadoop, there is an easier way to install and experiment with Hadoop. Rather than installing a local copy of Hadoop, install a virtual machine from Yahoo! Virtual machine comes with Hadoop pre-installed and pre-configured and is almost ready to use. The virtual machine is available from their Hadoop tutorial. This tutorial includes well documented instructions for running the virtual machine and running Hadoop applications. The virtual machine, in addition to Hadoop, includes Eclipse IDE for writing Java based Hadoop applications.

Hadoop Cluster

By default, Hadoop distributions are configured to run on single machine and the Yahoo virtual machine is a good way to get going. However, the power of Hadoop comes from its inherent distributed nature and deploying distributed computing on a single machine misses its very point. For any serious processing with Hadoop, you’ll need many more machines. Amazon’s Elastic Compute Cloud (EC2) is perfect for this. An alternative option to running Hadoop on EC2 is to use the Cloudera distribution. And of course, you can set up your own cluster of Hadoop by following the Apache instructions. Resources
There is a large active developer community who created many scripted languages such as HBase, Hive, Pig and others). Cloudera, has a supported distribution.

A Data Scientist's blog

Search This Blog

Thursday, May 10, 2012

What is Hadoop?