
Cloudera-Teradata Partnership Highlights Hadoop Reality

Last week, Cloudera and Teradata announced an expanded partnership that extends Teradata’s Unified Data Architecture to Cloudera’s products. The announcement also includes tighter software integration between the two companies, with Teradata’s Loom (acquired from Revelytix) and QueryGrid both supporting CDH.

Given that Cloudera likes to take shots at Teradata, this may seem like an unlikely relationship. However, it highlights an important reality in the information management market. Despite the hype, Hadoop isn’t replacing data warehouses. This hype has only confused the market and resulted in delayed adoption for both technologies. Hadoop remains largely application-centric, with very few enterprise-wide deployments.

The emerging understanding for early adopters is that the strengths of Hadoop and the data warehouse are complementary. Your information management infrastructure discussion won’t be about whether you should use Hadoop or the EDW, but about how you should use both. This expanded partnership helps to tell that story.

Spark Restarts the Data Processing Race

It’s still early days for Apache Spark, but you’d never know it from the corporate sponsorship at Spark Summit. For only the second conference around a very young technology, the list of notable sponsors is impressive: IBM, SAP, Amazon Web Services, SanDisk and Red Hat. SAP also announced Spark integration with HANA, its flagship in-memory DBMS. Other companies, like MapR and DataStax, also announced (or reinforced) partnerships with Databricks, the company commercializing Spark.

Given the relative immaturity of this open source project, why are these companies – particularly the large vendors – rushing to support Spark? I think there are a few things happening here.

First, after building out integration with MapReduce, integrating with Spark was easy. SAP’s integration with Spark uses Smart Data Access, the same method used for its MapReduce integration. I imagine it’s only a matter of time before similar integration occurs with Teradata’s QueryGrid or IBM’s BigSQL, among others. After all, this looks a lot like external tables, something DBMS vendors have been doing for at least a decade.

The ease of integration only explains part of the sudden interest in Spark. More important is the need not to be left out of the next iteration in data processing. While Hadoop is an important component of any data management discussion today, it had a long road to credibility. Many vendors simply took a “wait and see” approach to Hadoop, and they waited too long. They won’t make the same mistake with Spark. Customers are less resistant to open source options, and large vendors need to get behind every project with momentum to compete with startups.

It’s too early to pick winners and losers. The incumbent vendors are upping their game, while much of the messaging coming from the Hadoop distribution vendors is confusing. However this shakes out, it should make a great show for the rest of 2014.

Don’t Forget the Hadoop Developers

Over the last two years, several companies have rushed to get SQL-on-Hadoop products or projects to market. Having a familiar SQL interface makes the data stored in Hadoop more accessible, and therefore more useful to larger parts of the organization. Search, another capability broadly available from several Hadoop vendors, enables more use cases for a different set of audiences.

This rush for SQL-on-Hadoop has left the developer market effectively underserved. But here’s the reality: if you can’t accomplish your task with SQL or even Pig, it’s time to break out the editor or IDE and start writing code. That means writing MapReduce (or tomorrow, Spark?), which has its own challenges:

  • Development tool support is fairly limited.
  • Application deployment and management is lacking.
  • Testing and debugging is difficult, if not impossible (the same can be said for just about any distributed system).
  • Integrating with non-HDFS data sources requires a lot of custom code.
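To make the gap concrete, here is what a minimal word-count job looks like in the Hadoop Streaming style, sketched as plain Python generators. The function names and structure are illustrative, not any framework’s actual API:

```python
from itertools import groupby

def mapper(lines):
    # Map phase: emit a (word, 1) pair for every token.
    for line in lines:
        for word in line.split():
            yield word.lower(), 1

def reducer(pairs):
    # Hadoop sorts and shuffles map output by key before the reduce
    # phase; sorted() plus groupby() stands in for that here.
    for word, group in groupby(sorted(pairs), key=lambda kv: kv[0]):
        yield word, sum(count for _, count in group)
```

In a real Streaming job, the mapper and reducer would be separate scripts reading tab-delimited records from stdin and writing to stdout, with the framework handling the sort and shuffle between them — and that is before any of the testing, deployment and integration challenges above come into play.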

None of these are new or unknown challenges, and developers have simply dealt with them with mixed levels of success. But Hadoop is growing up. The workloads it handles are increasing in priority and complexity. Developers on Hadoop need the same empowerment as BI/analytics users.

This push for developer empowerment on the broader Hadoop stack went largely unnoticed at June’s Hadoop Summit, but a number of companies are filling this gap, such as Concurrent, Continuuity and BMC with its Control-M product. And the ubiquitous Spring Framework has several stories to tell, with Spring-Hadoop and Spring-Batch.

What’s interesting, at least to me, is the traditional Hadoop vendors are largely absent from empowering developers (except for Pivotal). Has the developer base been abandoned in favor of the enterprise, or is this a natural evolution of a data management application?

Update: Apparently Cloudera is leading the development of Kite SDK. Kite looks like a good start at addressing some of the pain points developers frequently encounter, such as building ETL pipelines and working with Maven.

Another Update: Milind Bhandarkar reminded me about Spring-XD.

What’s Beyond MapReduce? It Depends.

Hadoop, or the fundamental concept behind it, has now existed for ten years. In 2004, Google released the original MapReduce paper. This paper resulted in the development of Hadoop, which helped spur much of the Big Data hype and discussion. Processing massive amounts of data with MapReduce has resulted in innovations and cost savings. But MapReduce is a batch solution. The world has changed since 2004 and so has Hadoop.

Recently I moderated a panel of Hadoop luminaries. Every prominent Hadoop vendor, and a promising startup, was represented. The topic, ‘Beyond MapReduce,’ explored the variety of options emerging in the Hadoop ecosystem. Interestingly, I got several questions after the panel asking, “So what’s beyond MapReduce?” The panel discussion was clear: everything is beyond MapReduce. But applying new data processing options depends on your use case.

Hadoop is in the Mind of the Beholder

This post was jointly authored by Merv Adrian (@merv) and Nick Heudecker (@nheudecker) and appears on both blogs.

In the early days of Hadoop (versions up through 1.x), the project consisted of two primary components: HDFS and MapReduce. One thing to store the data in an append-only file model, distributed across an arbitrarily large number of inexpensive nodes with disk and processing power; another to process it, in batch, with a relatively small number of available function calls. And some other stuff called Commons to handle bits of the plumbing. But early adopters demanded more functionality, so the Hadoop footprint grew. The result was an identity crisis that grows progressively more challenging for decision makers with almost every new announcement.

This expanding footprint included a sizable group of “related projects”, mostly under the Apache Software Foundation. When Gartner published How to Choose the Right Apache Hadoop Distribution in early February 2012, the leading vendors we surveyed (Cloudera, MapR, IBM, Hortonworks, and EMC) all included Pig, Hive, HBase, and Zookeeper. Most were willing to support Flume, Mahout, Oozie, and Sqoop. Several other projects were supported by some, but not all. If you were asked at the time, “What is Hadoop?” this set of ten projects, the commercially supported ones, would have made a good response.

In 2013, Hadoop 2.0 arrived, and with it a radical redefinition. YARN muddied the clear definition of Hadoop by introducing a way for multiple applications to use the cluster resources. You have options. Instead of just MapReduce, you can run Storm (or S4 or Spark Streaming), Giraph, or HBase, among others. The list of projects with abstract names goes on. At least fewer of them are animals now.

During the intervening time, vendors have selected different projects and versions to package and support. To a greater or lesser degree, all of these vendors call their products Hadoop – some are clearly attempting to move “beyond” that message. Some vendors are trying to break free from the Hadoop baggage by introducing new, but just as awful, names. We have data lakes, hubs, and no doubt more to come.

But you get the point. The vague names indicate the vendors don’t know what to call these things either. If they don’t know what they’re selling, do you know what you’re buying? If the popular definition of Hadoop has shifted from a small conglomeration of components to a larger, increasingly vendor-specific conglomeration, does the name “Hadoop” really mean anything anymore?

Today the list of projects supported by leading vendors (now Cloudera, Hortonworks, MapR, Pivotal and IBM) numbers 13: HDFS, YARN, MapReduce, Pig, Hive, HBase, Zookeeper, Flume, Mahout, Oozie, Sqoop – and Cascading and HCatalog. Coming up fast are Spark, Storm, Accumulo, Sentry, Falcon, Knox, Whirr… and maybe Lucene and Solr. Numerous others are only supported by their distributor and are likely to remain so, though perhaps MapR’s support for Cloudera Impala will not be the last time we see an Apache-licensed, but not Apache, project break the pattern. All distributions have their own unique value-add. The answer to the question, “What is Hadoop?” and the choice buyers must make will not get easier in the year ahead – it will only become more difficult.

Apache Tajo Enters the SQL-on-Hadoop Space

The number of SQL options for Hadoop expanded substantially over the last 18 months. Most get a large amount of attention when announced, but a few slip under the radar. One of these low-flying options is Apache Tajo. I learned about Tajo in November of 2013 at a Hadoop User Group meeting.

Billed as a big data warehouse system for Hadoop, Tajo development started in 2010 and moved to the Apache Software Foundation in March of 2013. Tajo is currently incubating. Its primary development sponsor is Gruter, a big data infrastructure startup in South Korea. Despite the lack of public awareness, Tajo has a fairly robust feature set:

  • SQL compliance
  • Fully distributed query processing against HDFS and other data sources
  • ETL feature set
  • User-defined functions
  • Compatibility with HiveQL and Hive MetaStore
  • Fault tolerance through a restart mechanism for failed tasks
  • Cost-based query optimization and an extensible query rewrite engine

Things get interesting when comparing performance against Apache Hive and Cloudera Impala. SK Telecom, the largest telecommunications provider in South Korea, tested Tajo, Hive and Impala using five sample queries, running Hive 0.10 and Impala 1.1.1 on CDH 4.3.0. Test data size was 1.7TB and query results were 8GB or less in size. (The result charts are available in the linked presentation.) The five queries were:

  • Query 1: Heavy scan with 20 text-matching filters
  • Query 2: Seven unions with joins
  • Query 3: Simple joins
  • Query 4: Group by and order by
  • Query 5: 30 pattern-matching filters with OR conditions using group by, having and sorting
What do these results indicate? Clearly, different SQL-on-Hadoop implementations have different performance characteristics. Until these options mature to be truly multi-purpose, selecting a single option may not result in the best overall performance. Also, these benchmarks are for a specific set of use cases – not your use cases. The tested queries may have no relevance to your data and how you’re using it.

The other important takeaway is the absolute performance of these options. The sample data set and results are small in modern terms, yet none of the results are astounding relative to a modern data warehouse or RDBMS. There’s a difference between “fast” and “fast for Hadoop.” Cloudera appears to be making some headway, but a lot of ground must be covered before any Hadoop distribution is competitive with the systems vendors claim to be replacing.

How Square Secured Your Data in Hadoop

If any company must face issues around data security, a credit card payment processor is a likely candidate. Square’s card readers and applications likely process millions of payments per day. One output of this processing is a lot of data. According to a presentation delivered at Facebook, Square stores a substantial amount of this data in Hadoop, conducting data analysis and offering data products to buyers and merchants.

Square does this without storing sensitive information in Hadoop. Redacted data meets about 80% of their use cases. Meeting the remaining 20% of use cases required some fairly radical engineering.

If you’re not familiar with Hadoop, it’s a distributed filesystem and data processing mechanism. You can put anything you want in those files. Square stores its data as Protocol Buffers, or “protobufs”. Protobufs is a way to serialize data. (Trying to stay as non-technical as possible in this post and already failing miserably. It’s a data format.)

Hadoop’s security model applies access control at the file level. This isn’t unreasonable since all it knows about are files. File contents are a mystery. However, file-level security is a blunt instrument in a modern data architecture. Addressing this security impedance mismatch led Square to look at a few options:

  • Create a separate, secure Hadoop cluster for sensitive data. This was rejected because it introduces more operational overhead. It also wasn’t clear how to separate sensitive data. And everyone will want access to the secure cluster. Maybe the answer is to…
  • Open up access to everyone and track access with audit logs. Yep, this is a bad idea. Nobody looks at logs.

So Square thought about what they needed to support the business. That resulted in the following requirements:

  • Control access to individual fields within each record.
  • Extend the security framework to the ingest method (Apache Kafka), not just Hadoop.
  • Developer transparency.
  • Support algorithmic agility and key management.
  • Don’t store cryptographic keys with the data.

None of these requirements are unreasonable. All of them should be considered table stakes for an information management application. However, Square had to implement their own encryption model to meet these modest requirements. It did this by modifying the protobufs code to use an existing online crypto service. Key management overhead (always difficult) was challenging when facing growing data volumes and required a few iterations to get right.
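Square didn’t publish its code, but the shape of the approach can be sketched in a few lines of Python. Everything here is illustrative — the field names, the key id and especially the encrypt() stub, which uses Base64 purely as a stand-in for the ciphertext a real crypto service would return:

```python
from base64 import b64encode

SENSITIVE_FIELDS = {"card_number", "cvv"}  # illustrative field names

def encrypt(plaintext: bytes, key_id: str) -> bytes:
    # Stand-in only: a real implementation would call an online crypto
    # service identified by key_id. Base64 is NOT encryption.
    return b64encode(plaintext)

def redact_record(record: dict) -> dict:
    # Field-level protection: sensitive values become ciphertext plus
    # key metadata, while other fields stay readable -- covering the
    # ~80% of use cases that only need redacted data.
    out = {}
    for field, value in record.items():
        if field in SENSITIVE_FIELDS:
            out[field] = {
                "key_id": "k1",  # stored with the data; the key itself is not
                "ciphertext": encrypt(str(value).encode(), "k1").decode(),
            }
        else:
            out[field] = value
    return out
```

Storing only the key id alongside the data reflects the last requirement above: the cryptographic keys live in the external service, never with the records themselves.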

Square didn’t disclose the number of hours spent on developing these capabilities. I think it’s fair to assume the investment was substantial.

Maybe the level of effort required explains why data security doesn’t surface as a primary inhibitor to Hadoop adoption: the lack of data-level security simply keeps the relevant projects off of Hadoop in the first place.

Spark and Tez Highlight MapReduce Problems

On February 3rd, Cloudera announced support for Apache Spark as part of Cloudera Enterprise. I’ve blogged about Spark before so I won’t go into substantial detail here, but the short version is Spark improves upon MapReduce by removing the need to write data to disk between steps. Spark also takes advantage of in-memory processing and data sharing for further optimizations.

The other successor to MapReduce (of course there is more than one) is Apache Tez. Tez improves upon MapReduce by removing the need to write data to disk between steps (sound familiar?). It also has in-memory capabilities similar to Spark’s. Thus far, Hortonworks has thrown its weight behind Tez development as part of the Stinger initiative.
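The shared idea can be shown with a toy model in Python (a sketch of the execution difference, not either engine’s API): chained MapReduce jobs materialize every intermediate result to disk, while a Spark- or Tez-style DAG pipelines each record through all stages in memory:

```python
def mapreduce_style(data, stages):
    # Each stage runs to completion and writes its full output to
    # "disk" (a list standing in for HDFS) before the next stage reads it.
    intermediate = list(data)
    spills = 0
    for stage in stages:
        intermediate = [stage(x) for x in intermediate]
        spills += 1  # simulated HDFS write between chained jobs
    return intermediate, spills

def dag_style(data, stages):
    # Stages are composed, and each record flows through the whole
    # pipeline in memory -- no intermediate materialization.
    def pipeline(x):
        for stage in stages:
            x = stage(x)
        return x
    return [pipeline(x) for x in data], 0
```

Both produce identical results; the difference is that the MapReduce-style run pays one round of disk I/O per stage, which is exactly the cost Spark and Tez eliminate.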

Both Tez and Spark are described as supplementing MapReduce workloads. However, I don’t think this will be the case much longer. The world has changed since Google published the original MapReduce paper in 2004. Memory prices have plummeted while data volumes and sources have increased, making legacy MapReduce less appealing.

Vendors will likely begin distancing themselves from MapReduce for more performant options once there are some high profile customer references. It remains to be seen what this means for early adopters with legacy MapReduce applications.

Thanks to Josh Wills at Cloudera for helping clarify the advantage provided by Spark & Tez.

BI Hadoop Specialists Trail, Broader Tools Lead

There is a romantic notion of leaving the past behind and embracing the future unencumbered. Previous mistakes forgotten, we can venture forward to accomplish great things to the amazement of friends, colleagues and casual onlookers.

This is the promise made by BI and analytics vendors in the Hadoop-only ecosystem. After all, if your data moves to Hadoop, why concern yourself with data stored in legacy data warehouses? Based on the audience response from a polling question conducted during a webinar on Hadoop 2.0, you can’t escape your past. You can only embrace it.


Finding a Spark at Yahoo!

Recently I had an opportunity to learn a little more about Apache Spark, a new in-memory cluster computing system originally developed at the UC Berkeley AMPLab. By moving data into memory, Spark improves performance for tasks like interactive data analysis and iterative machine learning. These improvements are especially pronounced when compared to a batch-oriented, disk-bound system like Apache Hadoop. Spark has seen rapid adoption at a number of companies; in particular, I learned how Yahoo! has started integrating Spark into its analytics.


Copyright © Nick Heudecker
