
Spark Restarts the Data Processing Race

It’s still early days for Apache Spark, but you’d be forgiven for thinking otherwise based on the corporate sponsorship at Spark Summit. For only the second conference dedicated to such a young technology, the list of notable sponsors is impressive: IBM, SAP, Amazon Web Services, SanDisk and Red Hat. SAP also announced Spark integration with HANA, its flagship DBMS appliance. Other companies, like MapR and DataStax, also announced (or reinforced) partnerships with Databricks, the Spark commercializer.

Given the relative immaturity of this open source project, why are these companies – particularly the large vendors – rushing to support Spark? I think there are a few things happening here.

First, for vendors that had already built integration with MapReduce, integrating with Spark was easy. SAP’s integration with Spark uses Smart Data Access, the same method used for MapReduce integration. I imagine it’s only a matter of time before similar integration occurs with Teradata’s QueryGrid or IBM’s BigSQL, among others. After all, this looks a lot like external tables, something the DBMS vendors have been doing for at least a decade.

The ease of integration only explains part of the sudden interest in Spark. More important is the need to not be left out of the next iteration in data processing. While Hadoop is an important component of any data management discussion today, it had a long road to credibility. Many vendors simply took a “wait and see” approach to Hadoop, and they waited too long. I don’t think the same mistake will happen with Spark. Customers are less resistant to open source options, and large vendors need to get behind every project with momentum to compete with startups.

It’s too early to pick winners and losers. The incumbent vendors are upping their game, while much of the messaging coming from the Hadoop distribution vendors is confusing. However this shakes out, it should make a great show for the rest of 2014.

Spark and Tez Highlight MapReduce Problems

On February 3rd, Cloudera announced support for Apache Spark as part of Cloudera Enterprise. I’ve blogged about Spark before so I won’t go into substantial detail here, but the short version is Spark improves upon MapReduce by removing the need to write data to disk between steps. Spark also takes advantage of in-memory processing and data sharing for further optimizations.
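To make the "no disk between steps" point concrete, here is a toy sketch in plain Python. It is not Spark or MapReduce code: the function names (`tokenize`, `count`, `run_with_disk_handoff`, `run_in_memory`) and the sample input are invented for illustration. It only contrasts a two-step job that hands its intermediate result off through a file with one that passes it along in memory.

```python
import json
import os
import tempfile


def tokenize(lines):
    """Step 1: split each line into lowercase words."""
    return [word.lower() for line in lines for word in line.split()]


def count(words):
    """Step 2: tally how often each word appears."""
    totals = {}
    for word in words:
        totals[word] = totals.get(word, 0) + 1
    return totals


def run_with_disk_handoff(lines):
    """Chain the steps the disk-bound way: step 1 writes its output to a
    file, and step 2 reads that file back before it can start."""
    with tempfile.TemporaryDirectory() as workdir:
        path = os.path.join(workdir, "step1.json")
        with open(path, "w") as handle:
            json.dump(tokenize(lines), handle)
        with open(path) as handle:
            words = json.load(handle)
    return count(words)


def run_in_memory(lines):
    """Chain the same steps with the intermediate result kept in memory."""
    return count(tokenize(lines))


# Hypothetical sample input, chosen only for the demonstration.
lines = ["Spark and Tez", "Tez and MapReduce"]
disk_result = run_with_disk_handoff(lines)
memory_result = run_in_memory(lines)
print(disk_result == memory_result)
```

Both paths produce the same counts. The difference is the extra write and read between steps, which is small here but adds up when a job chains many steps over large data.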

The other successor to MapReduce (of course there is more than one) is Apache Tez. Tez improves upon MapReduce by removing the need to write data to disk between steps (sound familiar?). It also has in-memory capabilities similar to Spark. Thus far Hortonworks has thrown its weight behind Tez development as part of the Stinger project.

Both Tez and Spark are described as supplementing MapReduce workloads. However, I don’t think this will be the case much longer. The world has changed since Google published the original MapReduce paper in 2004. Memory prices have plummeted while data volumes and sources have increased, making legacy MapReduce less appealing.

Vendors will likely begin distancing themselves from MapReduce in favor of more performant options once there are some high-profile customer references. It remains to be seen what this means for early adopters with legacy MapReduce applications.

Thanks to Josh Wills at Cloudera for helping clarify the advantage provided by Spark & Tez.

Finding a Spark at Yahoo!

Recently I had an opportunity to learn a little more about Apache Spark, a new in-memory cluster computing system originally developed at the UC Berkeley AMPLab. By moving data into memory, Spark improves performance for tasks like interactive data analysis and iterative machine learning. These improvements are especially pronounced when compared with a batch-oriented, disk-bound system like Apache Hadoop. While Spark has seen rapid adoption at a number of companies, I learned how Yahoo! has started integrating Spark into its analytics.


Copyright © Nick Heudecker
