Overview of Hadoop, new features in Version 1, and a description of the Version 2 branch

Introduction

This essay assumes the reader has no knowledge of Hadoop or MapReduce. It gives an overview of Hadoop and the confusion around the project branches that form v1.0.0 and v2.0.0, and discusses the features introduced in v1.0.0 along with some of the use-cases to which they can be applied.

Hadoop background

Hadoop (Apache Hadoop, 2013) is an open-source project coordinated by the Apache Software Foundation. Core Hadoop is “analogous to an operating system kernel” (Sammer E, 2012); its key modules are the Common utilities, the Hadoop Distributed File System (HDFS), the job scheduler/tracker and MapReduce.

The Hadoop ecosystem is wide and varied; there are several commercially available third-party platform distributions that provide a framework around the Apache distribution. Distributions are key to Hadoop adoption within the enterprise because they offer a stable, production-ready build, often with additional features built into the distribution. The vendors also offer a definitive presence for support, training and consultancy, something that can be problematic for open-source software.

Key vendors in this space are Cloudera, Hortonworks, MapR, EMC Greenplum, IBM and Intel (Puccinelli S, 15 Jan 2013), (Molla R, Harris D, 5 Mar 2013). There are also many other related projects that add to or replace components within the core Hadoop project.

Hadoop project tree

The perception within the enterprise is that versions prior to v1.0.0 are technology previews; certainly within the Microsoft world it has long been common for companies to wait until the first service pack before considering a product stable and ready for production.

There are two distinct branches (Sammer E, 2012), (Zedlewski C, 8 Jan 2012) that incorporate different feature sets and levels of stability. This adds to confusion within the enterprise because the 0.20.205 line became v1 and the 0.23.0 line became v2; it would be a mistake to think version 2.0 succeeds version 1.0, which is the expectation within the commercial world.

[Figure: Hadoop branch history. Source: http://blog.cloudera.com/blog/2012/01/an-update-on-apache-hadoop-1-0/]

Hadoop Use-Cases

Hadoop has been architected to run on many thousands of commodity machines, dramatically reducing cost compared with the traditional scale-up database approach. Hadoop can also be run in the cloud, for instance with Amazon Elastic MapReduce (Amazon EMR) and Windows Azure HDInsight.

A number of Hadoop-related projects give it a broad range of use-cases, for example Machine Learning (Mahout), Data Analytics (Hive, Pig) and Data Storage (HBase); there are others, such as Sqoop, that facilitate using Hadoop as a Data Warehouse.

Cloudera’s Hadoop platform distribution is used in many industry sectors (Cloudera, 10 Mar 2013): in Finance for fraud detection and consumer and market risk modelling; in Energy and Utilities for seismic data processing and smart meter analytics; and in Telecommunications for churn analysis, network capacity trending, and product research and development.

Hadoop primarily processes data in batch, but there is increasing movement into real-time analytics, for instance the recent release of EMC Greenplum Pivotal HD.

Changes introduced with v1.0.0

The key features introduced with v1.0.0 are: Kerberos security, optimisations to the MapReduce framework, HBase support for flush and sync, a new implementation of file append, performance enhancements for local files, and WebHDFS.

Security

Security is a key concern for the enterprise; a reputation can be damaged and fines incurred by the leaking of personal information.

Prior to the introduction of Kerberos there was little in the way of solid security: through jobs, users could execute arbitrary code against all the nodes using elevated permissions, and there was no privacy, no integrity and poor authentication (Becherer A, 2010).

The introduction of Kerberos goes part way towards providing a secure platform; however, the implementation does not go far enough and there are several security weak points (Informatica, 6 Sep 2011). A Hadoop cluster used to store non-public data will still need to be isolated from the main network behind a cluster-specific firewall. The type of user access required will dictate whether sensitive data, such as credit card numbers, social security numbers and other identifiable data, can be stored.
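As an illustration, enabling Kerberos is controlled through cluster configuration. The fragment below is a minimal, illustrative core-site.xml excerpt; the property names follow the Hadoop security documentation, while the surrounding file contents and values shown are example settings only:

```xml
<!-- core-site.xml (illustrative fragment): switch authentication
     from the default "simple" mode to Kerberos, and turn on
     service-level authorization checks. -->
<property>
  <name>hadoop.security.authentication</name>
  <value>kerberos</value>
</property>
<property>
  <name>hadoop.security.authorization</name>
  <value>true</value>
</property>
```

Every node in the cluster must carry this configuration, and each Hadoop service additionally needs its own Kerberos principal and keytab, which is part of why a cluster-specific firewall remains advisable.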

HBase feature enhancements

HBase allows billions of rows and millions of columns to be stored in a column-oriented store (Apache HBase).

The new HDFS flush and sync feature guarantees that data has been persisted to the HDFS cluster, giving HBase consistency and durability. File append has also been added, which allows HBase to keep its Write-Ahead Log on HDFS, another major component in achieving consistency and durability.

Consistency and durability are a major consideration for many types of application: some, for example certain forms of clickstream storage, might tolerate data loss, but others, for example order fulfilment, may not. The HBase features introduced in this version open Hadoop to another class of applications: those that need durable writes.
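The write-ahead-logging idea underpinning these features can be shown with a toy sketch. The class below is purely illustrative (it is not HBase’s implementation, and the file name is invented): every mutation is appended and synced to a log file before it is applied in memory, so a crash can be survived by replaying the log, which is exactly the guarantee that HDFS flush/sync and append make possible for HBase.

```python
import json
import os

class ToyStore:
    """A toy key-value store illustrating write-ahead logging."""

    def __init__(self, log_path):
        self.log_path = log_path
        self.data = {}
        # Recovery: replay any existing log entries into memory.
        if os.path.exists(log_path):
            with open(log_path) as f:
                for line in f:
                    entry = json.loads(line)
                    self.data[entry["key"]] = entry["value"]
        self.log = open(log_path, "a")

    def put(self, key, value):
        # 1. Append the mutation to the log and force it to stable
        #    storage (the step HDFS flush/sync enables for HBase).
        self.log.write(json.dumps({"key": key, "value": value}) + "\n")
        self.log.flush()
        os.fsync(self.log.fileno())
        # 2. Only then apply the change in memory.
        self.data[key] = value

store = ToyStore("wal.log")
store.put("order-1", "fulfilled")
```

If the process dies after step 1 but before step 2, a fresh `ToyStore("wal.log")` rebuilds the lost state from the log on start-up; without durable sync, step 1 gives no real guarantee, which is why these HDFS features matter to HBase.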

WebHDFS

Until the availability of WebHDFS, both Java and Hadoop needed to be installed in order to get data into or out of the Hadoop cluster. WebHDFS defines a public HTTP REST API (Sze N, 2 Dec 2011) for interacting with the cluster; the API supports the complete FileSystem interface for HDFS.

It can also be seen as an additional security measure, because WebHDFS can act as a proxy between an external client that wants to load data into the cluster and the cluster itself; users can authenticate using Kerberos (SPNEGO) as well as Hadoop delegation tokens.
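Because WebHDFS is plain HTTP, any client can construct its requests. The sketch below builds WebHDFS URLs in Python; the URL layout and the `op` query parameter follow the WebHDFS REST API, while the host name, port and file path are invented for illustration:

```python
from urllib.parse import urlencode

def webhdfs_url(host, port, path, op, **params):
    """Build a WebHDFS request URL of the form
    http://<host>:<port>/webhdfs/v1/<path>?op=<OP>&..."""
    query = urlencode({"op": op, **params})
    return f"http://{host}:{port}/webhdfs/v1{path}?{query}"

# Reading a file: any HTTP client can issue a GET against this URL.
url = webhdfs_url("namenode.example.com", 50070,
                  "/user/alice/data.csv", "OPEN",
                  **{"user.name": "alice"})
print(url)
```

No Java or Hadoop client libraries are involved, which is the point of the feature: a mobile device, desktop PC or web service only needs an HTTP stack to read from or write to the cluster.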

Conclusion

Hadoop is maturing into a versatile and secure platform. It has a huge ecosystem of vendors, with a variety of add-on applications to customise its use; v1.0.0 and v2.0.0 are major milestones and will give comfort to those enterprises wishing to adopt the platform.

Coupling WebHDFS with Kerberos makes for a more secure environment by isolating the Hadoop cluster from its users; a custom API can be built that allows communication between a device (mobile, desktop PC, web service) and the Hadoop cluster, opening up the full power of Hadoop’s MPP capabilities to any device that supports HTTP.

Platform distributions hide the complexities caused by the two Hadoop branches; enterprises will want training, support and consultancy from recognised players, thus circumventing the problems of open-source software.

The vibrant ecosystem and the continued high investment in existing and start-up companies, coupled with an IDC prediction that the Hadoop software market will be worth $813 million in 2016, all add to the positivity the v1.0.0 release brings and will only serve to encourage its use within commerce.

References

An Update on Apache Hadoop 1.0, Zedlewski C, 8 Jan 2012 (http://blog.cloudera.com/blog/2012/01/an-update-on-apache-hadoop-1-0 accessed 10 Mar 2013)

Apache Hadoop 1.0 Doesn’t Clear Up Trunks and Branches Questions. Do Distributions? Gartner 23 Jan 2012 (http://blogs.gartner.com/merv-adrian/2012/01/23/apache-hadoop-1-0-doesnt-clear-up-trunks-and-branches-questions-do-distributions accessed 8 Mar 2013)

Apache Hadoop PoweredBy Wiki (http://wiki.apache.org/hadoop/PoweredBy, accessed 5 Mar 2013)

Apache Hadoop project (http://hadoop.apache.org, accessed 8 Mar 2013)

Append to files in HDFS, Jira: Hadoop-1700, Stack M, 8 Jul 2009, (https://issues.apache.org/jira/browse/HADOOP-1700 accessed 10 Mar 2013)

Blackhat 2010 – Hadoop Security Design? Becherer A, 2010 (http://www.securitytube.net/video/6966 accessed 10 Mar 2013)

Cloudera Use-Case: Industries, Cloudera (http://www.cloudera.com/content/cloudera/en/solutions/industries/financial-services.html accessed 10 Mar 2013)

HADOOP Operations, Sammer E, 2012

Hadoop Security: Part 6 of Hadoop Series, Informatica, 6 Sep 2011 (http://blogs.informatica.com/perspectives/2011/09/06/hadoop-security-part-6-of-hadoop-series-2 accessed 8 Mar 2013)

HBase Architecture 101 – Write Ahead Log, George L, 30 Jan 2010 (http://www.larsgeorge.com/2010/01/hbase-architecture-101-write-ahead-log.html accessed 10 Mar 2013)

HBase, HDFS and durable sync, Hofhansl L, 30 May 2012 (http://hadoop-hbase.blogspot.co.uk/2012/05/hbase-hdfs-and-durable-sync.html accessed 10 Mar 2013)

Securing Big Data: Security Recommendations for Hadoop and NoSQL Environments, Securosis, Oct 2012 (https://securosis.com/assets/library/reports/SecuringBigData_FINAL.pdf, accessed 8 Mar 2013)

The Hadoop ecosystem as of January 2013, Datameer (Puccinelli S), 15 Jan 2013 (http://www.datameer.com/blog/perspectives/hadoop-ecosystem-as-of-january-2013-now-an-app.html, accessed 23 Feb 2013)

The Hadoop ecosystem: the (welcome) elephant in the room (infographic), Molla R, Harris D, 5 Mar 2013 (http://gigaom.com/2013/03/05/the-hadoop-ecosystem-the-welcome-elephant-in-the-room-infographic accessed 8 Mar 2013)

WebHDFS – HTTP REST Access to HDFS, Sze N, 2 Dec 2011, (http://hortonworks.com/blog/webhdfs-%e2%80%93-http-rest-access-to-hdfs accessed 10 Mar 2013)

Welcome to Apache HBase, Apache (http://hbase.apache.org accessed 10 Mar 2013)