Big Data Challenges Move from Tech to the Organization

This year I had the opportunity to lead our predictions for big data. Unlike most predictions this time of year, we don’t just look ahead for the coming 12 months. The effects of innovation, changes in the market and impact on IT budgets are hard to recognize over such a short timeframe. That’s why our predictions often extend to 36 months. (We also do lookbacks to see if we were right or not, but that’s a topic for another blog post.)

What became clear during the process of selecting and refining predictions is that the focus has changed. Technology is no longer the interesting part of big data. What’s interesting is how organizations deal with it. The hype is receding and big data is no longer viewed as a simple technology problem. Organizations have to focus on the building blocks of enterprise information management (EIM):

[Figure: the building blocks of enterprise information management]

So far, only the most rudimentary elements of enabling infrastructure have been considered. This is not sustainable. One prediction from my colleague Roxane Edjlali is that 60% of big data projects will fail to make it into production, either because they cannot demonstrate value or because they cannot evolve into existing EIM processes.

This is only part of the story. Cultural or business model changes will be necessary to benefit from big data. And ethics must be a primary consideration as privacy concerns rise in importance.

Gartner clients can read the full report here: Predicts 2015: Big Data Challenges Move From Technology to the Organization. If you want to ensure your organization is on the right side of the analytical divide, join me and my Gartner colleagues at the Gartner Business Intelligence & Analytics Summit.


Cloudera-Teradata Partnership Highlights Hadoop Reality

Last week, Cloudera and Teradata announced an expanded partnership that extends Teradata’s Unified Data Architecture to Cloudera’s products. The announcement also includes tighter software integration between the two companies, with Teradata’s Loom (acquired from Revelytix) and QueryGrid both supporting CDH.

Given that Cloudera likes to take shots at Teradata, this may seem like an unlikely relationship. However, it highlights an important reality in the information management market. Despite the hype, Hadoop isn’t replacing data warehouses. This hype has only confused the market and resulted in delayed adoption for both technologies. Hadoop remains largely application-centric, with very few enterprise-wide deployments.

The emerging understanding for early adopters is that the strengths of Hadoop and the data warehouse are complementary. Your information management infrastructure discussion won’t be about whether you should use Hadoop or the EDW, but about how you should use both. This expanded partnership helps to tell that story.

Is it Time For a DBMS Mass Extinction?

On Sept 19, 2014, InfiniDB (formerly Calpont) announced it was closing its doors after failing to secure financing to continue operations. Establishing differentiation in a crowded market and competing over a finite supply of large enterprises eroded InfiniDB’s position. It’s easy to think this is an isolated story of a single company that failed to gain traction. However, I think this is just the first sign that the asteroid is coming.

[Image] Source: NatGeo – The Permian Extinction—When Life Nearly Came to an End

Over the last eight years or so, dozens of new vendors have emerged offering specialized types of DBMSs. The website nosql-databases.org tracks about 150 different DBMSs. As I see it, the current level of diversity in the DBMS space is simply unsustainable.

This won’t be a popular view. After all, several reasons have been given for the explosion of DBMSs. Most of the reasons are thinly veiled market positioning from vendors desperate for market share. Market positioning reasons almost always talk about how the “old” rules of data management no longer apply. And when people say the old rules no longer apply, you’re in a bubble.

Open source is another element driving DBMS diversity. Open source licensing allows independent developers or companies to derive new products from OSS staples like MySQL and PostgreSQL, or to combine several projects into entirely new offerings. To some extent, this is facilitated by sites like GitHub. These sites provide a virtual water cooler to develop ideas and features without needing to get together in meatspace. Redis is a great example of the power of virtual collaboration and development.

Another reason given for the diversity of DBMS models is the growth of data volumes and varieties. This argument looks much like the first: legacy DBMS vendors simply can’t cope with new data demands. To a certain extent this reason has some merit, but not nearly enough to justify the continued existence of dozens of DBMS vendors. And don’t forget the resources these vendors have. When the asteroid strikes, they’ll be hiding under piles of money.

If the old rules still apply and the data expansion argument has questionable merit, what has supported the number of DBMS vendors entering the market? Other than loads of VC funding, the answer appears to be a simple one: hardware.

Computing hardware is always increasing in capability and decreasing in cost. But it’s uncommon for processing, storage and networking to all experience massive capability increases and cost reductions in concert. The last time it happened was in 2009. Each time this convergence happens, application developers have free rein over implementation decisions. This developer-centric approach typically lasts for a few years. After all, any code will run, and run quite well, with better hardware.

The slack capacity provided by better hardware might make you think you can do things you wouldn’t previously consider. The old rules may no longer apply. There might even be a free lunch!

Applications are developed and some value is created, but the result is a proliferation of data silos and abandoned information governance.

Information management realities always assert themselves. Data in silos is fine for systems of innovation because you’re only focused on data use. But those systems eventually become systems of record, where data reuse is paramount. Systems of record must provide capabilities such as description, organization, governance and integration, among others.

When the asteroid strikes, it won’t be a fiery rock falling from the sky. It will be IT Ops reasserting the need for adult supervision. DBMSs providing the required capabilities will likely thrive, while those that don’t, won’t.

And not even Bruce Willis can save vendors from IT Ops.

Thanks to Merv Adrian for reviewing and contributing to this post.

The Big Data A-Ha Moment is Only the Beginning

If Big Data is going to remake an industry, insurance is certainly a great candidate. With massive amounts of historical data as well as emerging IoT-based data sources, insurance is well positioned to take advantage of new analytical methods and techniques. Recently, I had a chance to speak to a very large insurance company about their Big Data opportunities and challenges. What made the day unique was that my session was followed by a predictive analytics demonstration created by the company’s IT department.

Once the gathered executives saw what was possible with predictive analytics, even at a small scale, they wanted to do more: new cuts of data, additional facets, different questions. But the requests from the invigorated audience reinforced fundamental challenges with Big Data.

First, the data must be correct. This isn’t limited to data quality; it also includes metadata. Almost any organization is going to have multiple data sources, often for the same data. In the case of this insurer, it has several claims systems, each with different attributes. For example, one claims system has five different categories for marital status, while another has seven. Inconsistent dates of birth also complicated the analysis.
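To make the metadata problem concrete, here is a minimal Python sketch of the reconciliation work that has to happen before any predictive model touches the data. The systems, field names and code mappings are hypothetical, invented purely for illustration; the real mappings would come from the insurer’s own metadata.

```python
# A hypothetical sketch of metadata reconciliation. The source systems, field
# names and code mappings below are invented for illustration only; the real
# mappings would have to come from the insurer's own metadata and the business.

# Each claims system describes the same attribute with its own category scheme.
SYSTEM_A_MARITAL = {"S": "single", "M": "married", "D": "divorced",
                    "W": "widowed", "U": "unknown"}            # five categories
SYSTEM_B_MARITAL = {"1": "single", "2": "married", "3": "divorced",
                    "4": "widowed", "5": "separated",
                    "6": "domestic_partner", "9": "unknown"}   # seven categories

def normalize_claim(claim, source):
    """Map a raw claim record onto a shared vocabulary before analysis."""
    mapping = SYSTEM_A_MARITAL if source == "A" else SYSTEM_B_MARITAL
    return {
        "claim_id": claim["claim_id"],
        # Unmapped codes are flagged rather than silently dropped.
        "marital_status": mapping.get(claim["marital_status"], "unmapped"),
        "source_system": source,
    }

if __name__ == "__main__":
    raw = [({"claim_id": "A-100", "marital_status": "M"}, "A"),
           ({"claim_id": "B-200", "marital_status": "6"}, "B")]
    for record, source in raw:
        print(normalize_claim(record, source))
```

Trivial as it looks, someone on the business side has to decide whether system B’s “separated” and “domestic partner” categories map onto, or stay distinct from, system A’s five codes – which is exactly the kind of IT and business partnership the next point is about.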

Second, IT and the business must be partners. For the purposes of the demonstration, IT simply picked what they thought was an interesting problem – and they picked correctly. After that, the executives started asking for more things from the IT team, without any consideration for the work that must happen on the business side. The executives believed they could simply request predictive insights from IT in the same way they ask for new descriptive analytics reports.

Without meaningful collaboration and investment from the business side, in the form of people, process and data, Big Data initiatives will fail. And they will fail quite spectacularly.

Spark Restarts the Data Processing Race

It’s still early days for Apache Spark, but you’d be forgiven for thinking otherwise based on the corporate sponsorship at Spark Summit. For only the second conference devoted to a very early technology, the list of notable sponsors is impressive: IBM, SAP, Amazon Web Services, SanDisk and Red Hat. SAP also announced Spark integration with HANA, its flagship DBMS appliance. Other companies, like MapR and DataStax, also announced (or reinforced) partnerships with Databricks, the Spark commercializer.

Given the relative immaturity of this open source project, why are these companies – particularly the large vendors – rushing to support Spark? I think there are a few things happening here.

First, after building out integration with MapReduce, integrating with Spark was easy. SAP’s integration with Spark uses Smart Data Access, the same method used for MapReduce integration. I imagine it’s only a matter of time before similar integration occurs with Teradata’s QueryGrid or IBM’s BigSQL, among others. After all, this looks a lot like external tables, something the DBMS vendors have been doing for at least a decade.

The ease of integration only explains part of the sudden interest in Spark. More important is the need to not be left out of the next iteration in data processing. While Hadoop is an important component of any data management discussion today, it had a long road to credibility. Many vendors simply took a “wait and see” approach to Hadoop, and they waited too long. They don’t intend to make the same mistake with Spark. Customers are less resistant to open source options, and large vendors need to get behind every project with momentum to compete with startups.

It’s too early to pick winners and losers. The incumbent vendors are upping their game, while much of the messaging coming from the Hadoop distribution vendors is confusing. However this shakes out, it should make a great show for the rest of 2014.

Don’t Forget the Hadoop Developers

Over the last two years, several companies have rushed to get SQL-on-Hadoop products or projects to market. Having a familiar SQL interface makes the data stored in Hadoop more accessible, and therefore more useful to larger parts of the organization. Search, another capability broadly available from several Hadoop vendors, enables more use cases for a different set of audiences.

This rush for SQL-on-Hadoop has left the developer market effectively underserved. But here’s the reality: if you can’t accomplish your task with SQL or even Pig, it’s time to break out the editor or IDE and start writing code. That means writing MapReduce (or tomorrow, Spark?), which has its own challenges (a minimal code sketch follows the list):

  • Development tool support is fairly limited.
  • Application deployment and management is lacking.
  • Testing and debugging is difficult, if not impossible (the same can be said for just about any distributed system).
  • Integrating with non-HDFS data sources requires a lot of custom code.
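To illustrate, here is a minimal word-count job written against the Hadoop Streaming interface. This is a hedged sketch, not a template, and Streaming is only one of several ways to write MapReduce code:

```python
#!/usr/bin/env python
"""Minimal word count for Hadoop Streaming. The job logic is trivial; the
packaging, deployment and debugging around it are where the pain lives."""
import sys

def mapper():
    # Emit one tab-separated (word, 1) pair per word, as Streaming expects.
    for line in sys.stdin:
        for word in line.strip().split():
            print("%s\t1" % word.lower())

def reducer():
    # Streaming delivers keys in sorted order, so a running total is enough.
    current, count = None, 0
    for line in sys.stdin:
        word, value = line.rstrip("\n").split("\t", 1)
        if word != current:
            if current is not None:
                print("%s\t%d" % (current, count))
            current, count = word, 0
        count += int(value)
    if current is not None:
        print("%s\t%d" % (current, count))

if __name__ == "__main__":
    # Run as "wordcount.py map" for the map phase; anything else reduces.
    mapper() if sys.argv[1:] == ["map"] else reducer()
```

Even a toy job like this has to be shipped to the cluster, submitted through the Hadoop Streaming JAR, and debugged with little more than counters and log files – exactly the tooling gap described above.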

None of these are new or unknown challenges, and developers have simply dealt with them with mixed levels of success. But Hadoop is growing up. The workloads it handles are increasing in priority and complexity. Developers on Hadoop need the same empowerment as BI/analytics users.

This push for developer empowerment on the broader Hadoop stack went largely unnoticed at June’s Hadoop Summit, but a number of companies are filling this gap, such as Concurrent, Continuuity and BMC with its Control-M product. And the ubiquitous Spring Framework has several stories to tell, with Spring-Hadoop and Spring-Batch.

What’s interesting, at least to me, is that the traditional Hadoop vendors are largely absent from empowering developers (except for Pivotal). Has the developer base been abandoned in favor of the enterprise, or is this a natural evolution of a data management application?

Update: Apparently Cloudera is leading the development of Kite SDK. Kite looks like a good start at addressing some of the pain points developers frequently encounter, such as building ETL pipelines and working with Maven.

Another Update: Milind Bhandarkar reminded me about Spring-XD.

Benefits and Risks in Curated Open Source

Today, Aerospike announced that its in-memory NoSQL DBMS is available under the AGPL license, the same license used by a few of its competitors. According to Aerospike, there were a number of reasons to pursue an open source path, such as getting their DBMS into the hands of developers – who are the people leading the NoSQL charge. Of course, the long-term objective is that some of those OSS users will eventually become paying customers.

The unexpected result is that enterprises with open source mandates will be able to use Aerospike more broadly. As closed source software, Aerospike was a point solution. But the licensing change means Aerospike’s addressable use cases expand overnight.

This is a fundamental shift in enterprise attitudes toward open source and vendor lock-in.

During my career, I’ve seen open source software transition from a heretical notion to an essential factor in how enterprises evaluate and purchase software. This is especially true in the Information Management space. Information Management has a long history of understanding and adopting open source, essentially starting with Ingres and spawning a variety of data management options available today.

However, it takes more than simply having an Apache project or something on GitHub. Enterprises aren’t turning to Stack Overflow, IRC or mailing lists for support. Open source software needs to be curated by commercializers before enterprises are willing to use it.

It’s an interesting shift. Companies are directing – or outright owning – the development of open source projects to make them palatable to enterprises. In some cases, only one company is developing or shipping the open source project. That leads to an interesting question about the actual value of open source in that scenario: if only one company supports an open source product, does that product really help you avoid vendor lock-in?

Let me know what you think in the comments.

What’s Beyond MapReduce? It Depends.

Hadoop, or the fundamental concept behind it, has now existed for ten years. In 2004, Google released the original MapReduce paper. This paper resulted in the development of Hadoop, which helped spur much of the Big Data hype and discussion. Processing massive amounts of data with MapReduce has resulted in innovations and cost savings. But MapReduce is a batch solution. The world has changed since 2004 and so has Hadoop.

Recently I moderated a panel of Hadoop luminaries. Every prominent Hadoop vendor, and a promising startup, was represented. The topic, ‘Beyond MapReduce,’ explored the variety of options emerging in the Hadoop ecosystem. Interestingly, I got several questions after the panel asking, “So what’s beyond MapReduce?” The panel discussion was clear: everything is beyond MapReduce. But applying new data processing options depends on your use case.
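For those who asked, a concrete illustration may help. The sketch below, with placeholder paths and endpoints and very much an assumption-laden example rather than a recommendation, expresses the same word count first as a Spark batch job and then as a Spark Streaming job over ten-second micro-batches. Which variant you reach for (or whether you reach for Storm, Tez, Giraph or plain MapReduce instead) is a workload decision, not a technology allegiance.

```python
# A hedged sketch, not a recommendation: the same word count expressed as a
# Spark batch job and as a Spark Streaming job. The input path and socket
# endpoint are placeholders invented for illustration.
from pyspark import SparkContext
from pyspark.streaming import StreamingContext

sc = SparkContext(appName="beyond-mapreduce-sketch")

# Batch: counts over data already sitting in HDFS (or any Hadoop-readable store).
batch_counts = (sc.textFile("hdfs:///data/events/*.log")   # placeholder path
                  .flatMap(lambda line: line.split())
                  .map(lambda word: (word, 1))
                  .reduceByKey(lambda a, b: a + b))
print(batch_counts.take(10))

# Streaming: the same logic applied to ten-second micro-batches from a socket.
ssc = StreamingContext(sc, 10)
stream_counts = (ssc.socketTextStream("localhost", 9999)    # placeholder source
                    .flatMap(lambda line: line.split())
                    .map(lambda word: (word, 1))
                    .reduceByKey(lambda a, b: a + b))
stream_counts.pprint()

ssc.start()
ssc.awaitTermination()
```

Whether the streaming half runs on Spark, Storm or something newer is the use-case-dependent part; what’s no longer in question is that batch-only MapReduce is just one option among many.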

BI Summit Roundup: Big Data Confusion Reigns

The Gartner BI Summits are an ideal venue to connect with vendors and end users not just in BI, but also in the general Big Data space. Last week in Las Vegas I led two great roundtable discussions on the Big Data ecosystem. Interest was high: I was supposed to have a total of 28 attendees, but had 43. End users were happy to sit on the floor just to be part of the discussion.

The roundtables were also a great opportunity to collect some data. Clients are always interested in what other people are doing. The most revealing question was the current status of Big Data deployments:

[Figure: poll of roundtable attendees on the current status of their Big Data deployments]

For all of the massive hype in the marketplace, only 12% (that’s 5 people) are in production. Even defining what “production” meant was challenging. If your Big Data project is impacting business processes or results, congratulations – you’re in production.

The core takeaway from the discussions was confusion. End users, particularly those on the business side, have difficulty differentiating between vendors, especially in the Hadoop space. There is also growing exhaustion around the data warehouse replacement marketing message. While some companies may be interested in exploring alternatives to their current data warehouse, most see a Hadoop-based solution as additive to what they’re running today.

Hadoop distributions have already started their descent into the Trough of Disillusionment according to the 2013 Hype Cycle for Big Data. This increasingly negative sentiment will likely push Hadoop distributions along the Hype Cycle curve in 2014.

Hadoop is in the Mind of the Beholder

This post was jointly authored by Merv Adrian (@merv) and Nick Heudecker (@nheudecker) and appears on both blogs.

In the early days of Hadoop (versions up through 1.x), the project consisted of two primary components: HDFS and MapReduce. One thing to store the data in an append-only file model, distributed across an arbitrarily large number of inexpensive nodes with disk and processing power; another to process it, in batch, with a relatively small number of available function calls. And some other stuff called Commons to handle bits of the plumbing. But early adopters demanded more functionality, so the Hadoop footprint grew. The result was an identity crisis that grows progressively more challenging for decision makers with almost every new announcement.

This expanding footprint included a sizable group of “related projects”, mostly under the Apache Software Foundation. When Gartner published How to Choose the Right Apache Hadoop Distribution in early February 2012, the leading vendors we surveyed (Cloudera, MapR, IBM, Hortonworks, and EMC) all included Pig, Hive, HBase, and Zookeeper. Most were willing to support Flume, Mahout, Oozie, and Sqoop. Several other projects were supported by some, but not all. If you were asked at the time, “What is Hadoop?” this set of ten projects, the commercially supported ones, would have made a good response.

In 2013, Hadoop 2.0 arrived, and with it a radical redefinition. YARN muddied the clear definition of Hadoop by introducing a way for multiple applications to use the cluster resources. You have options. Instead of just MapReduce, you can run Storm (or S4 or Spark Streaming), Giraph, or HBase, among others. The list of projects with abstract names goes on. At least fewer of them are animals now.

During the intervening time, vendors have selected different projects and versions to package and support. To a greater or lesser degree, all of these vendors call their products Hadoop – some are clearly attempting to move “beyond” that message. Some vendors are trying to break free from the Hadoop baggage by introducing new, but just as awful, names. We have data lakes, hubs, and no doubt more to come.

But you get the point. The vague names indicate the vendors don’t know what to call these things either. If they don’t know what they’re selling, do you know what you’re buying? If the popular definition of Hadoop has shifted from a small conglomeration of components to a larger, increasingly vendor-specific conglomeration, does the name “Hadoop” really mean anything anymore?

Today the list of projects supported by leading vendors (now Cloudera, Hortonworks, MapR, Pivotal and IBM) numbers 13. Today it’s HDFS, YARN, MapReduce, Pig, Hive, HBase, Zookeeper, Flume, Mahout, Oozie and Sqoop – plus Cascading and HCatalog. Coming up fast are Spark, Storm, Accumulo, Sentry, Falcon, Knox, Whirr… and maybe Lucene and Solr. Numerous others are only supported by their distributor and are likely to remain so, though perhaps MapR’s support for Cloudera Impala will not be the last time we see an Apache-licensed, but not Apache, project break the pattern. All distributions have their own unique value-add. The answer to the question, “What is Hadoop?” and the choice buyers must make will not get easier in the year ahead – it will only become more difficult.

Copyright © Nick Heudecker
