With Google’s recent announcement of Cloud DataFlow (intended as the successor to MapReduce) and with Cloudera now focusing on Spark for many of its projects, it looks like the days of MapReduce may be numbered. Although the change may seem sudden, it’s been a long time coming. Google published the MapReduce white paper 10 years ago, and developers have been using at least one distribution of Hadoop for about 8 years, so users have had ample time to determine the strengths and weaknesses of MapReduce. The release of Hadoop 2.0 and YARN made it clear that users wanted to live in a more diverse Big Data world.
Earlier versions of Hadoop could be described as MapReduce + HDFS (Hadoop Distributed File System) because that was the paradigm that everything in Hadoop revolved around. Because users clamored for easier, more interactive access to Hadoop data, the Hive and Pig projects were started. But even though you could write SQL queries with Hive and scripts in Pig Latin with Pig, under the covers Hadoop was still running MapReduce jobs. That all changed in Hadoop 2.0 with the introduction of YARN. YARN became the resource manager for a Hadoop cluster, breaking the tight coupling between MapReduce and HDFS. Although HDFS remained the file system, MapReduce became just another application that interfaces with Hadoop through YARN. This change made it possible for other applications to run on Hadoop through YARN as well.
Google is not known as a backer of the open-source Hadoop ecosystem in the mold of Hortonworks or Cloudera. After all, Google was running its own MapReduce implementation and its own file system (the Google File System), the technologies on which the open-source MapReduce and HDFS projects are based. Because they are integral parts of Google’s internal applications, Google has the most experience with using these technologies. And although Cloud DataFlow is specifically for use on the Google cloud and appears more like a competitor to Amazon’s Kinesis product, Google is very influential in Big Data circles, so I can see other developers following Google’s lead and adopting a similar technology in place of MapReduce.
Although Google’s Cloud DataFlow may have a thought-leadership impact, Cloudera’s decision to adopt Spark as the standard processing engine for its projects (in particular, Hive) will have a greater impact on open-source Big Data developers. Cloudera has one of the most popular Hadoop distributions on the market and has partnered with Databricks, Intel, MapR, and IBM to work on integrating Spark with Hive. This move is surprising given Cloudera’s investment in Impala (its SQL query engine), but the company clearly feels that Spark is the future. As little as a year ago, Spark was mostly seen as a fast in-memory computing engine for machine learning algorithms. However, with its promotion to an Apache Top-Level Project in February 2014 and its backing company Databricks receiving $33 million in Series B funding, Spark clearly has greater ambitions. The advent of YARN made it much easier to tie Spark to the growing Hadoop ecosystem. Cloudera’s decision to use Spark in Hive and other projects makes Spark even more important to users of the CDH distribution.
With MapR and Cloudera working on Hive + Spark, however, what does that mean for Hortonworks’ Stinger? Stinger is billed as the future of Hive, with integration into YARN using Tez. Does this development mean users will have to choose between Tez and Spark? We’ll have to wait and see: Hive + Tez has already been delivered in Hive 0.13, while Hive + Spark is still in progress.
Spark is not tied specifically to Hadoop. Although it works with YARN, it also runs well on Apache Mesos and can read data from Cassandra. So although Spark may become the real-time engine for Hadoop, it can also live independently of it, with users taking advantage of its related projects such as Spark SQL, Spark Streaming, and MLlib (machine learning). I think this flexibility means that Spark will soon become more important to Big Data developers, and that MapReduce will in turn become the solution for batch processing rather than the core paradigm for Hadoop. For batch use cases specifically, MapReduce will for now remain stronger than Spark, especially on very large datasets.
Although it’s certainly early days for Spark, this hasn’t stopped Big Data vendors from going full throttle on integrating it as a core part of their distributions. It also hasn’t stopped companies like Alibaba and Baidu from adopting it as part of their stack. Traditional MapReduce will continue to have its place for certain use cases, but you should keep an eye out for Cloud DataFlow and Spark, which are certain to become important components for Big Data applications in the near future.