Don’t Forget the Hadoop Developers

Over the last two years, several companies have rushed to get SQL-on-Hadoop products or projects to market. A familiar SQL interface makes the data stored in Hadoop more accessible, and therefore more useful to larger parts of the organization. Search, another capability broadly available from several Hadoop vendors, enables more use cases for a different set of users.

This rush for SQL-on-Hadoop has left the developer market effectively underserved. But here’s the reality: if you can’t accomplish your task with SQL or even Pig, it’s time to break out the editor or IDE and start writing code. That means writing MapReduce (or tomorrow, Spark?), which has its own challenges:

  • Development tool support is fairly limited.
  • Application deployment and management are lacking.
  • Testing and debugging are difficult, if not impossible (the same can be said for just about any distributed system); see the testing sketch after this list.
  • Integrating with non-HDFS data sources requires a lot of custom code.
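On the testing point specifically, the situation isn’t entirely hopeless. Apache MRUnit (my example here, not something named above) at least makes mapper and reducer logic unit-testable in-process, without a cluster. A minimal sketch, using a hypothetical word-count mapper written for illustration:

    import java.io.IOException;

    import org.apache.hadoop.io.IntWritable;
    import org.apache.hadoop.io.LongWritable;
    import org.apache.hadoop.io.Text;
    import org.apache.hadoop.mapreduce.Mapper;
    import org.apache.mrunit.mapreduce.MapDriver;
    import org.junit.Test;

    public class WordCountMapperTest {

        /** The classic word-count mapper: emits (token, 1) for every token in a line. */
        public static class WordCountMapper
                extends Mapper<LongWritable, Text, Text, IntWritable> {
            private static final IntWritable ONE = new IntWritable(1);
            private final Text word = new Text();

            @Override
            protected void map(LongWritable key, Text value, Context context)
                    throws IOException, InterruptedException {
                for (String token : value.toString().split("\\s+")) {
                    word.set(token);
                    context.write(word, ONE);
                }
            }
        }

        @Test
        public void emitsOneCountPerToken() throws Exception {
            // MRUnit drives the mapper entirely in-process: no cluster required.
            MapDriver.newMapDriver(new WordCountMapper())
                    .withInput(new LongWritable(0L), new Text("hadoop hadoop spark"))
                    .withOutput(new Text("hadoop"), new IntWritable(1))
                    .withOutput(new Text("hadoop"), new IntWritable(1))
                    .withOutput(new Text("spark"), new IntWritable(1))
                    .runTest();
        }
    }

That covers logic errors in a single map or reduce function; debugging a full job across a distributed cluster remains as painful as ever.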

None of these challenges is new or unknown, and developers have simply dealt with them, with mixed success. But Hadoop is growing up. The workloads it handles are increasing in priority and complexity. Developers on Hadoop need the same empowerment as BI/analytics users.

This push for developer empowerment on the broader Hadoop stack went largely unnoticed at June’s Hadoop Summit, but a number of companies are filling the gap: Concurrent, Continuuity, and BMC with its Control-M product. And the ubiquitous Spring Framework has several stories to tell, with Spring-Hadoop and Spring-Batch.

What’s interesting, at least to me, is that the traditional Hadoop vendors are largely absent from empowering developers (Pivotal excepted). Has the developer base been abandoned in favor of the enterprise, or is this a natural evolution of a data management application?

Update: Apparently Cloudera is leading the development of Kite SDK. Kite looks like a good start at addressing some of the pain points developers frequently encounter, such as building ETL pipelines and working with Maven.
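To give a taste of what Kite offers, here is a minimal sketch of its Dataset API, under my own assumptions (the HDFS URI and the Avro schema are invented for illustration): instead of hand-rolling HDFS layout and serialization code, you describe a dataset with a schema and let Kite manage the storage details.

    import org.apache.avro.Schema;
    import org.apache.avro.SchemaBuilder;
    import org.apache.avro.generic.GenericRecord;
    import org.kitesdk.data.Dataset;
    import org.kitesdk.data.DatasetDescriptor;
    import org.kitesdk.data.Datasets;

    public class CreateEventsDataset {
        public static void main(String[] args) {
            // Define the record schema with Avro's SchemaBuilder; Kite datasets
            // are schema-first rather than path-and-format-first.
            Schema schema = SchemaBuilder.record("Event").fields()
                    .requiredString("id")
                    .requiredLong("timestamp")
                    .endRecord();

            DatasetDescriptor descriptor = new DatasetDescriptor.Builder()
                    .schema(schema)
                    .build();

            // Kite creates the HDFS directory layout and metadata for us.
            // The dataset URI below is hypothetical.
            Dataset<GenericRecord> events = Datasets.create(
                    "dataset:hdfs:/data/events", descriptor, GenericRecord.class);
        }
    }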

Another Update: Milind Bhandarkar reminded me about Spring-XD.

Benefits and Risks in Curated Open Source

Today, Aerospike announced that its in-memory NoSQL DBMS is available under the AGPL, the same license used by a few of its competitors. According to Aerospike, there were a number of reasons to pursue an open source path, chief among them getting its DBMS into the hands of developers, the people leading the NoSQL charge. Of course, the long-term objective is that some of those OSS users will eventually become paying customers.

The unexpected result is that enterprises with open source mandates will be able to use Aerospike more broadly. As closed source software, Aerospike was a point solution. But the licensing change means Aerospike’s addressable use cases expand overnight.

This is a fundamental shift in enterprise attitudes toward open source and vendor lock-in.

During my career, I’ve seen open source software transition from a heretical notion to an essential factor in how enterprises evaluate and purchase software. This is especially true in the Information Management space, which has a long history of understanding and adopting open source, starting essentially with Ingres and spawning the variety of data management options available today.

However, it takes more than simply having an Apache project or something on GitHub. Enterprises aren’t turning to Stack Overflow, IRC, or mailing lists for support. Open source software needs to be curated by commercializers before enterprises are willing to use it.

It’s an interesting shift. Companies are directing, or outright owning, the development of open source projects to make them palatable to enterprises. In some cases, only one company is developing or shipping the open source project. That leads to an interesting question about the actual value of open source in that scenario: if only one company supports an open source product, does it really protect adopters from vendor lock-in?

Let me know what you think in the comments.

What’s Beyond MapReduce? It Depends.

Hadoop, or the fundamental concept behind it, has now existed for ten years. Google released the original MapReduce paper in 2004, and that paper led to the development of Hadoop, which helped spur much of the Big Data hype and discussion. Processing massive amounts of data with MapReduce has produced both innovations and cost savings. But MapReduce is a batch solution. The world has changed since 2004, and so has Hadoop.

Recently I moderated a panel of Hadoop luminaries. Every prominent Hadoop vendor, and a promising startup, was represented. The topic, ‘Beyond MapReduce,’ explored the variety of options emerging in the Hadoop ecosystem. Interestingly, I got several questions after the panel asking, “So what’s beyond MapReduce?” The panel discussion was clear: everything is beyond MapReduce. But which new data processing option applies depends on your use case.
