I recently attended the Hadoop Summit in San Jose. This is one of two major conferences organized around Hadoop, the other being Hadoop World. Nearly all the companies with Hadoop distributions were present, along with several big users of Hadoop like Netflix, Twitter, and LinkedIn.
Crossing The Chasm
If you’re not deeply involved with Hadoop, attending these conferences a year apart can be shocking. The advancements made in the span of a single year are amazing. The conference seemed notably larger this year, and I noticed more non-tech companies in the audience. I think it’s safe to say that Hadoop has crossed the chasm, at least for enterprise IT users.
Besides the mix of attendees, the other signal for me was the emergence of Hadoop 2.0. This second version of Hadoop focuses on features that matter to users who want to run production-grade software for mission-critical systems. High availability has finally arrived for the name node (in the open source project, not the version Cloudera released for its distribution), along with a new version of Hive with more SQL-friendly features and YARN, which allows users to run just about anything on top of the Hadoop Distributed File System (HDFS). These kinds of stability and availability features tend to show up once there is a critical mass of users who want to run software in production.
Quite A YARN
One of the hot topics at the Summit was YARN. YARN (MapReduce 2.0) is a resource manager that will be introduced in Hadoop 2.0. In this release, resource management has been separated from the MapReduce engine, which increases the flexibility of the platform: MapReduce is now just one of many kinds of applications that can run against data in HDFS. Some examples of applications that can now run natively in Hadoop with YARN are streaming applications (like Storm), interactive apps (like Tez) and, of course, batch (which is MapReduce). Although there was often hyperbole in the panels (“Run your microwave on YARN!”), this feature definitely introduces significant capabilities to Hadoop.
One company that has extensive experience using YARN at scale is Yahoo. Yahoo has one of the largest Hadoop deployments in the world: 40,000 nodes with 500,000 MapReduce jobs running daily. Yahoo has seen performance improvements with YARN: Bobby Evans noted in his presentation that the efficiency gains were like buying 1,000 new servers. He also liked the analysis he could do with log aggregation and the flexibility of running non-MapReduce applications.
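To make the YARN idea a bit more concrete, here is a minimal Python sketch of one shared resource manager handing out containers to different kinds of applications. This is purely a toy under my own assumptions: `ToyResourceManager`, its "slots", and the application names are hypothetical and are not the YARN API, and the real scheduler is far more sophisticated.

```python
from dataclasses import dataclass, field


@dataclass
class ToyResourceManager:
    """Hypothetical stand-in for a cluster-wide resource manager (not the YARN API)."""
    total_slots: int
    allocations: dict = field(default_factory=dict)  # app name -> slots held

    def free_slots(self) -> int:
        # Capacity not currently held by any application.
        return self.total_slots - sum(self.allocations.values())

    def request(self, app: str, slots: int) -> int:
        """Grant up to `slots` containers to `app`; return how many were granted."""
        granted = min(slots, self.free_slots())
        if granted > 0:
            self.allocations[app] = self.allocations.get(app, 0) + granted
        return granted

    def release(self, app: str) -> None:
        """Return all of an application's containers to the shared pool."""
        self.allocations.pop(app, None)


# One pool of capacity shared by batch, streaming, and interactive workloads.
rm = ToyResourceManager(total_slots=10)
batch = rm.request("mapreduce-batch", 6)
streaming = rm.request("storm-streaming", 3)
interactive = rm.request("tez-interactive", 4)  # only partially granted

rm.release("mapreduce-batch")   # batch job finishes, capacity is reusable
after_release = rm.free_slots()
```

The point of the sketch is only that the manager knows nothing about MapReduce: every workload asks the same pool for capacity, which is what lets non-MapReduce applications share the cluster.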
Microsoft Tames The Elephant
Microsoft has been making a big push into Big Data, and its strategy has been to partner with Hortonworks. This strategy makes sense for both companies: HDP is now the only distribution that runs on Windows, and Microsoft now has deeper integration with Hadoop. In addition to helping Hortonworks develop HDP on Windows and a connector to SQL Server, Microsoft now has multiple ways of interacting with Hadoop:
- HDInsight – This is Microsoft’s “Hadoop as a service” that runs on Azure. It is used by the Halo 4 team to improve the user gaming experience.
- Power Query – This Excel add-in (formerly known as Data Explorer) allows users to extract data directly from HDFS.
- Power Pivot – The row limitation in Excel won’t matter now with Power Pivot, which lets you analyze your entire Hadoop result set in Excel.
I think it was clever to enhance Excel by integrating Hadoop. Although Excel isn’t the most exciting product, it is used by millions of workers and is the de facto analytic tool for hundreds of companies. I think Microsoft’s strategy is a big commitment and a leading indicator that Hadoop is ready to enter the mainstream.
The Future is Now
Sanjay Radia’s presentation was key in convincing me that Hadoop is ready for the mainstream. Many of the issues that were raised with the first release of Hadoop (and solved over time through various forks) have now found a home in the 2.0 version. HDFS finally gains multiple independent name nodes, the ability to run an object store on block storage, read-path performance improvements, and on-the-wire encryption. The reliability and performance improvements are welcome additions to HDFS and will definitely be important in all production deployments. I’m also excited about planned improvements, such as the ability to run heterogeneous storage and support for volumes.
Another interesting development is the new version of Hive (called the Stinger Initiative), which again leverages YARN. Most users who work with databases are far more comfortable with SQL than Java. Although you still need to know Java to truly unlock the power of Hadoop, most analysts are happy to use Hive as the SQL translator to extract data from HDFS. However, Hive had limitations and still relied on MapReduce. This new version leverages Tez, which means Hive no longer requires MapReduce: a query is executed as a single DAG of tasks rather than as a chain of MapReduce jobs. Tez also helped remove the 15-second startup penalty, which improved the performance of Hive (among other improvements that boosted its overall performance). Stinger also added an enhanced optimizer (and a more usable explain plan), support for additional data types, and more SQL extensions like subqueries in the WHERE clause and secure GRANT and REVOKE commands. Many of these are common features in SQL clients, and this is definitely a sign of Hadoop’s movement to the mainstream.
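For readers who haven’t seen a query run as a DAG, here is a small Python sketch of the general idea. It is not how Tez or Hive is implemented: the stage names, the `run_dag` helper, and the sample rows are all made up for illustration. Each stage runs once its upstream stages have finished, and intermediate results are handed along in memory instead of being written out between separate jobs.

```python
from graphlib import TopologicalSorter  # standard library, Python 3.9+

# Toy table of (user, amount) rows; the data is invented for this example.
rows = [("ann", 5), ("bob", 7), ("ann", 3), ("cy", 1)]


# Each stage receives a dict of its upstream results and returns its own output.
def scan(_inputs):
    return rows


def filter_big(inputs):
    # Roughly: WHERE amount >= 3
    return [r for r in inputs["scan"] if r[1] >= 3]


def aggregate(inputs):
    # Roughly: SUM(amount) GROUP BY user
    totals = {}
    for user, amount in inputs["filter"]:
        totals[user] = totals.get(user, 0) + amount
    return totals


def sort_desc(inputs):
    # Roughly: ORDER BY total DESC
    return sorted(inputs["aggregate"].items(), key=lambda kv: kv[1], reverse=True)


stages = {"scan": scan, "filter": filter_big, "aggregate": aggregate, "sort": sort_desc}
# Map each stage to the set of stages it depends on.
deps = {"scan": set(), "filter": {"scan"}, "aggregate": {"filter"}, "sort": {"aggregate"}}


def run_dag(stages, deps):
    """Run every stage after its dependencies, keeping results in memory."""
    results = {}
    for name in TopologicalSorter(deps).static_order():
        upstream = {d: results[d] for d in deps[name]}
        results[name] = stages[name](upstream)
    return results


result = run_dag(stages, deps)["sort"]
print(result)
```

This plan happens to be a straight line, but the same runner works for plans that branch and join, which is where a DAG engine has more room to improve on a fixed map-then-reduce pipeline.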
Hadoop is no longer the experimental project that developers used to play with in their spare time. It’s now a serious analytic platform used by large companies in mission-critical applications. The focus on stability, security, and performance is a clear indication that users are asking for the features that are important to them. The growth of Hadoop has been phenomenal over the past few years, and I believe the launch of Hadoop 2.0 will bring even more mainstream users into the fold. If you aren’t seriously looking into Hadoop now, you could be ceding ground to competitors who are taking advantage of its superior analytic abilities.