By Jonathan Gershater
May 10, 2013 08:00 AM EDT | Reads: 8,122

What is Hadoop?
Following my high-level write-up of Hadoop and Big Data, this article gives a technical description of each of the components, or projects, that make up Hadoop.
First, what is Hadoop?
Hadoop stores and processes large volumes of a wide variety of data that changes rapidly. It analyses and summarizes the data. Typical examples include a city census, web page analytics, threat analysis, risk models and network failure analysis.
Hadoop is redundant and reliable, powerful and focused on batch processing.
Hadoop divides a large data processing job into many smaller tasks that can be distributed across all the nodes in a cluster.
Hadoop comprises two main components:
- MapReduce: the processing framework that analyses the data and summarizes the results
- HDFS: the distributed file system, running on commodity server hardware, that stores the data
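The divide-and-summarize idea can be sketched in a few lines of plain Python. This is only an illustration, not Hadoop's API: the function names (`split`, `map_household`, `shuffle`, `reduce_counts`, `run_job`) and the sample census records are hypothetical, and everything runs in a single process rather than across the nodes of a cluster.

```python
from collections import defaultdict

# Hypothetical census records: (city, household_size). In Hadoop these
# would be stored in HDFS blocks spread across many data nodes.
RECORDS = [
    ("Springfield", 4), ("Shelbyville", 2), ("Springfield", 3),
    ("Shelbyville", 5), ("Springfield", 1),
]

def split(records, n_splits):
    """Divide the input into chunks, one per map task."""
    return [records[i::n_splits] for i in range(n_splits)]

def map_household(record):
    """Map step: emit a (key, value) pair for one input record."""
    city, household_size = record
    return (city, household_size)

def shuffle(pairs):
    """Group values by key so each reduce call sees all values for one key."""
    grouped = defaultdict(list)
    for key, value in pairs:
        grouped[key].append(value)
    return grouped

def reduce_counts(key, values):
    """Reduce step: summarize all the values for one key."""
    return (key, sum(values))

def run_job(records, n_splits=2):
    # In a real cluster each chunk would be processed by a separate map
    # task on a different node; here the chunks are processed in a loop.
    mapped = [map_household(r)
              for chunk in split(records, n_splits)
              for r in chunk]
    grouped = shuffle(mapped)
    return dict(reduce_counts(k, v) for k, v in grouped.items())

if __name__ == "__main__":
    print(run_job(RECORDS))
```

Changing `n_splits` changes how the work is divided but not the result, which is the property that lets a job spread across more nodes as the data grows.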
On each server there is a task tracker and a data node:
DataNode
The data node stores the data in HDFS and serves read and write requests for that data.
TaskTracker
The task tracker launches and manages the individual map and reduce tasks of a MapReduce job that run on its node. So if my project were to conduct a census count, a task tracker might count the members of the households stored on its data node. When finished, the task tracker reports its status to the job tracker. (Note: as of this writing, May 2013, TaskTracker is being obsoleted and replaced by YARN in MapReduce v2.)
JobTracker
The job tracker keeps track of all the jobs being executed and tries to schedule each map task as close as possible to the data it processes. If a task has failed or disappeared, perhaps due to hardware failure, the job tracker will assign that task to another node.
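The two behaviours described above, preferring a node that already holds the data and reassigning work when a node is lost, can be sketched as follows. This is a simplified illustration under my own assumptions, not the JobTracker's actual algorithm: the functions `schedule` and `reschedule_failed`, and the task, block and node names, are all hypothetical.

```python
def schedule(tasks, block_locations, live_nodes):
    """Assign each task to a node, preferring one that holds its data block.

    tasks:           dict of task id -> block id the task reads
    block_locations: dict of block id -> list of nodes holding a replica
    live_nodes:      list of nodes currently reporting in
    """
    if not live_nodes:
        raise RuntimeError("no live nodes to schedule on")
    assignments = {}
    for i, (task, block) in enumerate(sorted(tasks.items())):
        local = [n for n in block_locations.get(block, []) if n in live_nodes]
        # Prefer a data-local node; otherwise fall back to any live node.
        assignments[task] = local[0] if local else live_nodes[i % len(live_nodes)]
    return assignments

def reschedule_failed(assignments, tasks, block_locations, live_nodes):
    """Reassign only the tasks whose node is no longer alive."""
    lost = {t: tasks[t] for t, node in assignments.items()
            if node not in live_nodes}
    updated = dict(assignments)
    updated.update(schedule(lost, block_locations, live_nodes))
    return updated

if __name__ == "__main__":
    tasks = {"map-0": "blk-A", "map-1": "blk-B"}
    block_locations = {"blk-A": ["node1", "node2"], "blk-B": ["node2", "node3"]}

    plan = schedule(tasks, block_locations, ["node1", "node2", "node3"])
    print(plan)

    # node1 disappears; its task moves to another node holding blk-A.
    print(reschedule_failed(plan, tasks, block_locations, ["node2", "node3"]))
```

Because HDFS keeps more than one replica of each block, the reassigned task can often still run on a node that has the data locally.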
So, now that I know what a task and a job are, how do I write tasks? How does a user create a MapReduce job? There are various projects that make it easy. (As to how the projects were named, don't ask me!)
Apache Pig
To write a computer program, a software engineer might use a high-level language such as C, whose compiler translates near-English instructions (IF, THEN, FOR, ELSE) into machine code that a computer can execute. Similarly, Apache Pig provides a high-level language, Pig Latin, that expresses data processing steps and compiles them into MapReduce jobs, which run as Java code. Pig's primary feature is that its programs can be run in parallel, meaning many map and reduce tasks can run simultaneously to allow linear scaling and efficiency.
Apache Hive
Hive provides a SQL-like language, HiveQL, which allows you to define a computation in familiar SQL-style syntax and then translates it into MapReduce jobs that run as Java code. Hive also allows traditional MapReduce programmers to plug in their custom mappers and reducers when it is inefficient to express their logic in HiveQL.
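To give a feel for the kind of translation involved, the sketch below pairs a SQL-style query (shown in a comment) with a hand-written map and reduce in Python that produce the same answer. It is conceptual only: Hive generates its own execution plan, and the table `page_views`, its columns, and the functions `mapper`, `reducer` and `run_query` are hypothetical names chosen for illustration.

```python
from itertools import groupby
from operator import itemgetter

# Query being illustrated (SQL-style, hypothetical table and columns):
#   SELECT page, COUNT(*) FROM page_views WHERE status = 200 GROUP BY page;

PAGE_VIEWS = [
    {"page": "/home", "status": 200},
    {"page": "/about", "status": 200},
    {"page": "/home", "status": 404},
    {"page": "/home", "status": 200},
]

def mapper(row):
    """WHERE becomes a filter in the map step; the GROUP BY column is the key."""
    if row["status"] == 200:
        yield (row["page"], 1)

def reducer(key, values):
    """COUNT(*) becomes a sum over the values collected for each key."""
    return (key, sum(values))

def run_query(rows):
    pairs = [pair for row in rows for pair in mapper(row)]
    pairs.sort(key=itemgetter(0))  # stands in for the shuffle/sort phase
    return [reducer(key, [value for _, value in group])
            for key, group in groupby(pairs, key=itemgetter(0))]

if __name__ == "__main__":
    print(run_query(PAGE_VIEWS))
```

The one-line query is considerably shorter than the code that implements it, which is the convenience a SQL-like layer offers to people who already know SQL.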
HBase
HBase is a simple interface to distributed data that allows incremental processing. HBase stores its information in HDFS and its metadata in ZooKeeper.
HCatalog
HCatalog is an abstraction layer for referencing data without using the underlying filenames or formats. It insulates users and scripts from how and where the data is physically stored.
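The benefit of that insulation can be shown with a toy catalog in Python. This is not HCatalog's API, just a minimal sketch of the idea: the classes `TableInfo` and `Catalog`, the function `describe_input`, the table name and the paths are all hypothetical.

```python
from dataclasses import dataclass

@dataclass(frozen=True)
class TableInfo:
    location: str      # where the data physically lives
    file_format: str   # how the data is stored

class Catalog:
    """Maps logical table names to physical storage details."""

    def __init__(self):
        self._tables = {}

    def register(self, name, location, file_format):
        self._tables[name] = TableInfo(location, file_format)

    def lookup(self, name):
        return self._tables[name]

def describe_input(catalog, table_name):
    """A 'script' that refers to data only by its table name."""
    info = catalog.lookup(table_name)
    return f"read {table_name} from {info.location} as {info.file_format}"

catalog = Catalog()
catalog.register("census", "/data/census/2013", "text")

if __name__ == "__main__":
    print(describe_input(catalog, "census"))
    # The data is moved and converted; only the catalog entry changes.
    catalog.register("census", "/warehouse/census", "sequencefile")
    print(describe_input(catalog, "census"))
```

The script keeps asking for `census` by name, so relocating or reformatting the data requires updating one catalog entry rather than every script that reads it.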
Some of the smaller projects
Mahout
Mahout is a library of machine learning algorithms implemented as MapReduce applications.
Ambari, Ganglia and Nagios
These projects help you understand what goes on in your cluster: Ambari manages and monitors the cluster, Ganglia collects metrics, and Nagios raises alerts.
Sqoop
Sqoop is a tool that runs MapReduce jobs to transfer data between Hadoop and SQL databases, in either direction.
Oozie
Oozie is a workflow scheduler that triggers MapReduce jobs, either running them automatically on a schedule or launching them when new data becomes available.
Flume
Flume streams inputs into Hadoop and loads that data into HDFS.
Here is a graphical view of the components
(courtesy of Hortonworks)
Copyright © 2013 Ulitzer, Inc. — All Rights Reserved.
Syndicated stories and blog feeds, all rights reserved by the author.
Jonathan Gershater has lived and worked in Silicon Valley since 1996, primarily doing system and sales engineering specializing in: Web Applications, Identity and Security. At Red Hat, he provides Technical Marketing for Virtualization and Cloud. Prior to joining Red Hat, Jonathan worked at 3Com, Entrust (by acquisition) two startups, Sun Microsystems and Trend Micro.
(The views expressed in this blog are entirely mine and do not represent my employer - Jonathan).