Big Data Distribution Package

Hadoop is the open-source software framework at the heart of much of the Big Data and analytics revolution. It provides solutions for enterprise data storage and analytics with almost unlimited scalability. Since its 1.0 release in 2011, it has grown rapidly in popularity, and a strong ecosystem of distributors, vendors, and consultants has emerged to support its use across the industry.

At its core, Hadoop is an open-source system, which, among other considerations, means it is essentially free for anyone to use. However, the need to align it with the requirements of individual organizations has led to the emergence of many commercial distributions. These generally come packaged with support or with additional features designed to streamline deployment or to let users build extra analytics, security, or data handling into their framework.

Competition in this market is fierce and the landscape is constantly shifting. For example, all the top distributions now include the Apache Spark parallel processing framework, whereas a few years ago this was not the case. The growing prominence of Spark has resulted in many vendors increasing the resources dedicated to Spark deployment and support.

One important factor to consider in choosing a Hadoop distribution is whether you want an on-premises or cloud-based solution. If there is no room to compromise when it comes to maintaining complete control and ownership of your data, an onsite solution still theoretically offers the highest level of security. In recent years, though, cloud solutions have become less expensive, more flexible, and easier to scale.

Most of the vendor products here can be installed either in the cloud or on-premises. However, some cannot be run on-site. These are generally products from web service providers, such as Amazon or Microsoft, which run either their own distributions or Hadoop distributions from platform-focused vendors such as Hortonworks or MapR.

Beyond that, all of the top distributions have subtle differences which could make them more or less suitable for your business. Here’s a non-exhaustive guide to some of the most popular on the market today.

Cloudera

Cloudera was the first vendor to offer Hadoop as a package and continues to be a leader in the industry. Its CDH distribution, which contains all the open-source components, is the most popular Hadoop distribution. Cloudera is known for acting quickly to innovate with additions to the core framework; it was the first to offer SQL-for-Hadoop with its Impala query engine. Other additions include a user interface, security features, and interfaces for integration with third-party applications. It offers support for the whole of the distribution through its Cloudera Enterprise subscription service.

Hortonworks

Hortonworks’ platform is entirely open source. The company is known for acquiring other companies with useful code and releasing that code to the open-source community. What some have seen as the start of a trend towards consolidation in the market has prompted a growth in the popularity of Hortonworks’ product. Recently, Pivotal stopped development of its own distribution, and both Amazon and IBM now offer Hortonworks as an option on their platforms, alongside their own Hadoop distributions. Hortonworks’ platform is also at the core of the Open Data Platform Initiative, a group looking to simplify and standardize specifications in the Big Data ecosystem. In the long run, this is likely to mean it will become even more widely supported.

MapR

Like Hortonworks and Cloudera, MapR is a platform-focused provider, rather than a managed service provider like Amazon or Microsoft. MapR integrates its own database system, MapR-DB, which it claims is between four and seven times faster than HBase, the stock Hadoop database, running on competing distributions. Due to its power and speed, MapR is often seen as a good choice for the biggest of Big Data projects.

Amazon Elastic MapReduce

Amazon offers a cloud-only Hadoop-as-a-service platform through its Amazon Web Services arm. A key advantage of the pay-as-you-go model used by cloud-only service providers is scalability: storage and data processing can be ramped up or wound down as demands change. Amazon has recently announced that customers can now use the Apache Flink stream processing framework for real-time data analytics on the platform, along with other popular tools such as Kafka and Presto. It also connects seamlessly with Amazon’s other cloud services, such as EC2 for cloud processing, Amazon S3 and DynamoDB for storage, and AWS IoT for collecting data from Internet of Things-enabled devices.

Microsoft

Microsoft’s Azure HDInsight platform is a cloud-only service that offers managed installations of several open-source Hadoop distributions including Hortonworks, Cloudera, and MapR. It integrates them with its own Azure Data Lake platform to offer a complete solution for cloud-based storage and analytics. As well as the core Hadoop framework, HDInsight provides Spark, Hive, Kafka, and Storm cloud services, and its own cloud security framework.

Altiscale

Acquired recently by SAP for $125 million, Altiscale is another company offering cloud-based, managed Hadoop-as-a-service. It continues to offer its Altiscale Data Cloud product, which includes additional operational services like automation, security, scaling, and performance tuning alongside the core Hadoop framework. Data Cloud also provides managed Spark, Hive, and Pig services, like most of the other products here, but unlike the other as-a-service offerings, it uses its own Hadoop distribution rather than that of one of the platform-focused vendors such as Hortonworks or MapR.
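
As a rough way to apply the on-premises versus cloud question to the products described above, the short Python sketch below records each one's deployment options and filters on them. The field values are one reading of the descriptions in this guide, not vendor specifications, and the names Distribution, DISTRIBUTIONS, and candidates are hypothetical, chosen only for illustration.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Distribution:
    name: str
    on_premises: bool  # can be installed on your own hardware
    cloud: bool        # can be run in the cloud


# Assumption: values transcribed from this guide's descriptions.
# Platform-focused vendors are treated as installable in either place;
# the as-a-service offerings are treated as cloud-only.
DISTRIBUTIONS = [
    Distribution("Cloudera", on_premises=True, cloud=True),
    Distribution("Hortonworks", on_premises=True, cloud=True),
    Distribution("MapR", on_premises=True, cloud=True),
    Distribution("Amazon Elastic MapReduce", on_premises=False, cloud=True),
    Distribution("Microsoft Azure HDInsight", on_premises=False, cloud=True),
    Distribution("Altiscale", on_premises=False, cloud=True),
]


def candidates(require_on_premises: bool) -> list[str]:
    """Return the names of products that satisfy the deployment constraint."""
    return [
        d.name
        for d in DISTRIBUTIONS
        if d.on_premises or not require_on_premises
    ]


if __name__ == "__main__":
    print("Must run on-site:", candidates(require_on_premises=True))
    print("Cloud is acceptable:", candidates(require_on_premises=False))
```

A filter like this only narrows the shortlist; the subtler differences covered above, such as MapR-DB or Impala, still have to be weighed against your own workload.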
