Bossie Awards 2014: The best open source big data tools

feature
Sep 29, 2014 | 13 mins
Business Intelligence | Hadoop | Open Source

InfoWorld's top picks in distributed data processing, data analytics, machine learning, NoSQL databases, and the Hadoop ecosystem


Although Hadoop is more popular than ever, MapReduce seems to be running out of friends. Everyone wants the answer faster, faster, now, and often in response to SQL queries. This year’s Bossies in big data track important new developments in the Hadoop stack, underscore a maturing NoSQL space, and highlight a number of useful tools for data wrangling, data analysis, and machine learning.

Hadoop

No technology in recent memory has made as big or as quick an impact as Hadoop. Hadoop encompasses many topics: HDFS, YARN, MapReduce, HBase, Hive, Pig, and a growing ecosystem of tools and engines. In the last year Hadoop has gone from being that thing everyone is talking about to being that thing even fairly conservative companies are deploying. Whether you’re trying to spark information sharing in your organization, mine data for new uses, replace expensive data warehousing technology, or offload ETL processing, Hadoop is the platform of technologies you should be looking at today.

— Andrew C. Oliver

Hive

Hive enables interactive queries over petabytes of data using familiar SQL semantics. Version 0.13 realized the community's three-part vision of speed, scale, and SQL compliance.

Hive preserves existing investments in SQL tools, skills, and processes and allows them to be applied to large-scale data sets. Tez integration, vector-based execution, and a cost-based optimizer significantly improve interactive query performance for users, while also lowering resource requirements. Hive 0.13 is the de facto SQL interface to big data, with a large and growing community behind it, including Microsoft, which contributed several key pieces of SQL Server technology.
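That "familiar SQL semantics" claim is the whole point: the aggregate queries a team already writes against a relational database are, shape for shape, valid HiveQL over tables of any size. As a rough sketch, the query below runs against Python's built-in sqlite3 (standing in for a Hive table; the schema and data are invented), and the same GROUP BY statement would run unchanged in Hive:

```python
# Toy illustration: the kind of aggregate query Hive accepts is ordinary SQL.
# sqlite3 stands in for a Hive table here; schema and rows are invented.
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE page_views (user_id TEXT, url TEXT, ms INTEGER)")
conn.executemany("INSERT INTO page_views VALUES (?, ?, ?)", [
    ("u1", "/home", 120), ("u1", "/docs", 340), ("u2", "/home", 90),
])

# The same GROUP BY/aggregate shape is valid HiveQL at petabyte scale.
rows = conn.execute(
    "SELECT url, COUNT(*) AS views, AVG(ms) AS avg_ms "
    "FROM page_views GROUP BY url ORDER BY views DESC"
).fetchall()
print(rows)  # [('/home', 2, 105.0), ('/docs', 1, 340.0)]
```

The investment Hive preserves is exactly this: the query, the BI tool that generated it, and the analyst who wrote it all carry over.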

— Steven Nunez

Mahout

Mahout is a computation-engine-independent collection of algorithms for classification, clustering, and collaborative filtering, along with the underlying math required to implement them. It focuses on highly scalable machine learning.

Currently at release 0.9, with the 1.0 release around the corner, the project is moving from Hadoop's MapReduce execution paradigm to a domain-specific language for linear algebra that is optimized for execution on Spark clusters. MapReduce contributions are no longer accepted. Given the iterative nature of machine learning, this makes good sense and shows that the community isn't afraid to do what's right when technology changes.
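To make "collaborative filtering" concrete, here is a toy item-based recommender in plain Python: score an unrated item for a user by weighting the user's existing ratings by item-to-item cosine similarity. This is a sketch of the technique Mahout implements at cluster scale, not Mahout's API; the users, items, and ratings are invented:

```python
# Toy item-based collaborative filtering: predict a rating from item-to-item
# cosine similarity. Invented data; illustrates the algorithm, not Mahout's API.
from math import sqrt

ratings = {  # user -> {item: rating}
    "alice": {"a": 5, "b": 3, "c": 4},
    "bob":   {"a": 4, "b": 1, "c": 5},
    "carol": {"b": 4, "c": 2},
}

def cosine(item_x, item_y):
    """Cosine similarity over users who rated both items."""
    common = [u for u in ratings if item_x in ratings[u] and item_y in ratings[u]]
    if not common:
        return 0.0
    dot = sum(ratings[u][item_x] * ratings[u][item_y] for u in common)
    nx = sqrt(sum(ratings[u][item_x] ** 2 for u in common))
    ny = sqrt(sum(ratings[u][item_y] ** 2 for u in common))
    return dot / (nx * ny)

# Predict carol's rating for unseen item "a": her known ratings,
# weighted by how similar each rated item is to "a".
sims = {i: cosine("a", i) for i in ("b", "c")}
pred = sum(sims[i] * ratings["carol"][i] for i in sims) / sum(sims.values())
print(round(pred, 2))
```

The iterative, linear-algebra-heavy character of algorithms like this is exactly why the project is betting on an in-memory engine such as Spark rather than chained MapReduce jobs.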

— Steven Nunez

Cascading

The learning curve for writing Hadoop applications can be steep. Cascading is an SDK that brings a functional programming paradigm to Hadoop data workflows. With the 3.0 release, Cascading provides an easier way for application developers to take advantage of next-generation Hadoop features like YARN and Tez.

The SDK provides a rich set of commonly used ETL patterns that abstract away much of the complexity of Hadoop, increasing the robustness of applications and making it simpler for Java developers to utilize their skills in a Hadoop environment. Connectors for common third-party applications are available, enabling Cascading applications to tap into databases, ERP, and other enterprise data sources.
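The functional paradigm Cascading brings to Hadoop can be sketched in a few lines: data flows as a stream of tuples through composable stages (transform, filter, group), wired together into a single flow from source to sink. The sketch below uses plain Python rather than the Cascading Java API, and all names and data are invented:

```python
# Sketch of the pipe-and-tuple idea behind Cascading: each stage transforms a
# stream of records, and stages compose into one flow. Not Cascading's API.
from functools import reduce

def each(fn):          # apply fn to every record in the stream
    return lambda stream: map(fn, stream)

def filter_by(pred):   # keep only records matching the predicate
    return lambda stream: filter(pred, stream)

def group_count(key):  # group records by a field and count each group
    def run(stream):
        counts = {}
        for rec in stream:
            counts[rec[key]] = counts.get(rec[key], 0) + 1
        return sorted(counts.items())
    return run

def flow(*stages):     # compose stages, source to sink
    return lambda source: reduce(lambda s, stage: stage(s), stages, source)

log_lines = [
    {"status": 200, "path": "/home"},
    {"status": 404, "path": "/missing"},
    {"status": 200, "path": "/docs"},
]

count_ok = flow(
    filter_by(lambda rec: rec["status"] == 200),
    each(lambda rec: {"path": rec["path"]}),
    group_count("path"),
)
print(count_ok(log_lines))  # [('/docs', 1), ('/home', 1)]
```

The appeal is that developers reason about this kind of declarative pipeline while Cascading handles translating it into MapReduce or, with 3.0, Tez jobs.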

— Steven Nunez