
Don’t Forget the Hadoop Developers

Over the last two years, several companies have rushed to get SQL-on-Hadoop products or projects to market. Having a familiar SQL interface makes the data stored in Hadoop more accessible, and therefore more useful to larger parts of the organization. Search, another capability broadly available from several Hadoop vendors, enables more use cases for a different set of audiences.

This rush for SQL-on-Hadoop has left the developer market effectively underserved. But here’s the reality: if you can’t accomplish your task with SQL or even Pig, it’s time to break out the editor or IDE and start writing code. That means writing MapReduce (or tomorrow, Spark?), which has its own challenges:

  • Development tool support is fairly limited.
  • Application deployment and management are lacking.
  • Testing and debugging are difficult, if not impossible (the same can be said for just about any distributed system).
  • Integrating with non-HDFS data sources requires a lot of custom code.

None of these are new or unknown challenges, and developers have simply dealt with them with mixed levels of success. But Hadoop is growing up. The workloads it handles are increasing in priority and complexity. Developers on Hadoop need the same empowerment as BI/analytics users.

This push for developer empowerment on the broader Hadoop stack went largely unnoticed at June’s Hadoop Summit, but a number of companies are filling this gap, such as Concurrent, Continuuity and BMC with its Control-M product. And the ubiquitous Spring Framework has several stories to tell, with Spring-Hadoop and Spring-Batch.

What’s interesting, at least to me, is that the traditional Hadoop vendors are largely absent from empowering developers (except for Pivotal). Has the developer base been abandoned in favor of the enterprise, or is this a natural evolution of a data management application?

Update: Apparently Cloudera is leading the development of Kite SDK. Kite looks like a good start at addressing some of the pain points developers frequently encounter, such as building ETL pipelines and working with Maven.

Another Update: Milind Bhandarkar reminded me about Spring-XD.

Benefits and Risks in Curated Open Source

Today, Aerospike announced its in-memory NoSQL DBMS is available under the AGPL license, the same license used by a few of its competitors. According to Aerospike, there were a number of reasons to pursue an open source path, such as getting its DBMS into the hands of developers – who are the people leading the NoSQL charge. Of course, the long-term objective is that some of those OSS users will eventually become paying customers.

The unexpected result is enterprises with open source mandates will be able to use Aerospike more broadly. As closed source software, Aerospike was a point solution. But the licensing change means Aerospike’s addressable use cases expand overnight.

This is a fundamental shift in enterprise attitudes toward open source and vendor lock-in.

During my career, I’ve seen open source software transition from a heretical notion to an essential factor in how enterprises evaluate and purchase software. This is especially true in the Information Management space. Information Management has a long history of understanding and adopting open source, essentially starting with Ingres and spawning a variety of data management options available today.

However, it takes more than simply having an Apache project or something on GitHub. Enterprises aren’t turning to Stack Overflow, IRC or mailing lists for support. Open source software needs to be curated by commercializers for enterprises to be willing to use it.

It’s an interesting shift. Companies are directing – or outright owning – the development of open source projects to make them palatable to enterprises. In some cases, only one company is developing or shipping the open source project. That raises an interesting question about the actual value of open source in that scenario: if only one company supports an open source product, does that product really protect you from vendor lock-in?

Let me know what you think in the comments.

Which Database are You Taking to Prom?

Project downloads, job postings, GitHub commits and mailing list activity are common methods to gauge traction for commercial and open source software projects. I frequently use those same metrics as input when looking at emerging projects, and for good reason.

Increasing traffic on a user mailing list indicates growing use and adoption. However, if that traffic continues growing, especially after a milestone release, it can indicate poor documentation or quality control. Do the same questions keep coming up on user forums? Your documentation is poor or nonexistent. Did the bug count go up after the 1.0 release? Looks like you’re focusing on features over quality.

Metrics like these are useful but they don’t exist in a vacuum. They need context, the type provided by the overall market, vendor strategies and customer feedback. A simple measure of popularity lacks this context, and therefore lacks insight. Additionally, popularity doesn’t measure sentiment. You don’t know if someone on Stack Overflow is saying “Database X is awesome,” or “Database Y is awful.”

Also concerning is the self-fulfilling notion of popularity. Once a database is measured as relatively more popular than another, it will experience a further uptick in popularity metrics and the cycle continues. Ultimately this results in more marketplace hype and confusion.


Finding a Spark at Yahoo!

Recently I had an opportunity to learn a little more about Apache Spark, a new in-memory cluster computing system originally developed at the UC Berkeley AMPLab. By moving data into memory, Spark improves performance for tasks like interactive data analysis and iterative machine learning. These improvements are especially pronounced when compared to a batch-oriented, disk-bound system like Apache Hadoop. While Spark has seen rapid adoption at a number of companies, I learned how Yahoo! has started integrating Spark into its analytics.


Service Lifetimes of Popes

The recent news of Pope Benedict announcing his retirement got me curious about how long popes actually serve. It was also an excuse to do some work in R and play with ggplot2. As with most data analysis projects, I spent most of my time cleaning up the data I scraped from Wikipedia. It took a few hours with some Python to get the data into a usable state.

What I saw surprised me. I expected to see an average service lifetime of 15 years. Instead, the average is 6.9 years with a median of 5 years. Apparently being the Bishop of Rome is a tough job: of the 265 popes, 41 served less than a year.


I was also curious to see if certain time periods had higher turnover. The 9th, 10th and 11th centuries were especially difficult times to be the Pope.


I’d like to do some additional research to determine how each Pope died and see if there’s an interesting way to visualize it.

Launching Summly on MongoDB

Summly is an innovative iOS application providing meaningful summaries of news articles. Immediately prior to its UK launch in October, I was enlisted to help configure MongoDB for its production environment. I only had a few days to set up the environment and help the dev team work through any issues encountered in a replicated environment. This is how I did it and what I learned along the way.

The Environment

The Summly team didn’t know what kind of load their launch would generate on the data store, so we rolled out a 5-node replica set on m1.large instances, each with a 1TB EBS volume.

In retrospect this was excessive but we wanted to err on the side of having too much capacity instead of too little. Once the environment was configured and operating normally, testing with the existing application code began.

Integration with the Replica Set

The application was configured to prefer reading from secondaries, and that’s where the application team encountered its first problem. With the application’s pattern of an immediate write (to the primary) followed by a read (from a secondary), replication lag meant the written data often wasn’t available on the secondary yet, resulting in errors. There was only one access pattern like this in the code, and it was quickly resolved by using the consistent request feature available in the MongoDB Java driver.

The consistent request pattern ensures your reads and writes occur using the same socket, avoiding the asynchronous replication problem inherent in a replica set. Its usage is straightforward:
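A minimal sketch of the pattern with the 2.x-era Java driver follows; the database and collection names are hypothetical, and a running MongoDB deployment is assumed:

```java
import com.mongodb.BasicDBObject;
import com.mongodb.DB;
import com.mongodb.DBCollection;
import com.mongodb.Mongo;

public class ConsistentRequestExample {
    public static void main(String[] args) throws Exception {
        Mongo mongo = new Mongo("localhost", 27017);
        DB db = mongo.getDB("summly"); // hypothetical database name

        db.requestStart(); // pin this thread's operations to one socket
        try {
            db.requestEnsureConnection();
            DBCollection articles = db.getCollection("articles");
            articles.insert(new BasicDBObject("articleId", 42));
            // This read uses the same socket as the write above, so it
            // observes the write even with secondary reads preferred.
            articles.findOne(new BasicDBObject("articleId", 42));
        } finally {
            db.requestDone(); // release the socket back to the pool
        }
    }
}
```

The try/finally structure matters: requestDone() must run even if an operation throws, or the socket stays pinned to the thread.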

Neither of the popular MongoDB Java abstractions (Spring Data-MongoDB or Morphia) provide direct access to the consistent request pattern. This isn’t a problem since both make the underlying MongoDB driver objects accessible, but you lose most of the convenience these frameworks provide.

Deployment and Monitoring

Once the Summly application was released to Apple’s UK app store, I monitored MongoDB’s logs to make sure everything was operating normally. After creating a few missing indexes and optimizing another, the iOS application became significantly more responsive and load decreased considerably.

Lessons Learned

  • As always, create indexes for your most commonly used queries. Make sure you understand the indexing trade-offs.
  • Use some form of non-datastore cache to improve application performance.
  • Don’t launch without some way of monitoring the health of every component in your infrastructure. MongoDB Monitoring Service is an excellent option.
  • Build your application with replica sets in mind. The asynchronous nature of the replication may impact your data access patterns.
  • Hire a professional, like me. 🙂


MongoDB Indexes

When you’re developing an application, it’s common to only work with a small quantity of data. The amount of data your production system encounters may not be available or may simply be unmanageable on development systems. Developing with a limited amount of data may mask eventual performance problems. Your queries respond quickly with a few hundred megabytes of sample data, but query latency may skyrocket when users query over a few hundred gigabytes. This article looks at using indexes in your MongoDB collections to address performance problems, as well as how to optimize your indexes.

What are Indexes?

You can think of an index in MongoDB, or its relational counterparts, like a book index. If you’re looking for a specific topic in a textbook without an index, your only option is to flip through every page to see if that page matches your interests. Depending on the size of the book, this can get tedious and time-consuming. MongoDB has to do the same thing when you query a collection: each document in the collection has to be examined and matches are added to the result set.

The search changes drastically with an index. The index is a tree-like structure containing the values you specify. If we keep going with our textbook example, a topic index might look like:
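An abridged topic index might contain entries like these (the topics and page numbers are made up for illustration):

```
Aggregation ............ 112, 187
Indexes ................ 94-103
Replication ............ 215
Sharding ............... 241, 255
```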

As you can see, an index makes finding what you’re looking for much faster and simpler. Let’s look at how indexes impact an application with some real-world data.

Applying Indexes

To get a suitable amount of data to work with, I pulled in some content from Twitter and added it to a collection in MongoDB. (You can find the script here.) Overall, I ended up putting about 50,000 tweets into the collection. Each tweet has the following structure:
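An abridged example of a stored tweet follows; apart from from_user, the field names and values are illustrative of what the Twitter search API returned at the time:

```javascript
{
    "_id" : ObjectId("512bc95fe835e68f199c8686"),
    "created_at" : "Mon, 02 Apr 2012 15:04:03 +0000",
    "from_user" : "paasdude",
    "text" : "Getting started with indexes in MongoDB",
    "iso_language_code" : "en",
    "source" : "web"
}
```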

If I run a simple query for tweets from OpenShift’s (in)famous Jimmy Guerrero, the query has to scan all of the documents in the collection. We can see this by executing the query with an explain() suffix:

> db.tweets.find({'from_user':'paasdude'}).explain();
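Against this unindexed collection, the output looks roughly like the following (abridged to the fields discussed below; exact values will vary):

```javascript
{
    "cursor" : "BasicCursor",
    "n" : 35,
    "nscannedObjects" : 50000,
    "nscanned" : 50000,
    "millis" : 40
}
```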

The results of explain() describe the details of how MongoDB executes the query. Some of the relevant fields are:

  • cursor: A result of BasicCursor indicates a non-indexed query. If we had used an indexed query, the cursor would have a type of BtreeCursor.
  • nscanned and nscannedObjects: The difference between these two similar fields is subtle but important. nscannedObjects is the total number of documents scanned by the query, while nscanned counts documents and index entries. Depending on the query, nscanned can be greater than nscannedObjects.
  • n: The number of matching objects.
  • millis: Query execution duration.

The point here should be clear – a simple query results in the database having to scan every document in the collection. The query only took 40ms, but we’re also only dealing with about 50,000 documents. This duration will increase as the collection size increases.

What happens when an index is added to the from_user field? The difference is dramatic. We can add an index with the ensureIndex(…) command:

> db.tweets.ensureIndex({'from_user' : 1});

With the field indexed, our query execution is dramatically different:

> db.tweets.find({'from_user':'paasdude'}).explain();
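This time the output looks roughly like this (again abridged; the BtreeCursor name reflects the indexed field):

```javascript
{
    "cursor" : "BtreeCursor from_user_1",
    "n" : 35,
    "nscannedObjects" : 35,
    "nscanned" : 35,
    "millis" : 3
}
```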

The query still matches 35 results, but only 35 objects were scanned and the query took 3ms. This is quite an improvement over 40ms to scan 50,000 documents!

Indexes can clearly improve performance in your MongoDB application, but there are some guidelines to keep in mind when indexing your collections. These guidelines are covered below.

Index Guidelines

Like everything else in the NoSQL world, there is no silver bullet to performance issues. Indexes are no exception. This section outlines some considerations to keep in mind, starting with what to index.

Indexes in MongoDB are similar to their relational counterparts. You’ll get the most value out of indexing the most distinct and frequently queried fields in your collections. For example, some ORM frameworks will automatically create indexes on the type of object stored in the collection. If you’re only storing a single type of object per collection (which I recommend), these indexes provide no value to the query engine since all of the values are the same. Only index distinct and frequently queried fields. This is also relevant when creating and updating indexes.

When it comes to indexes, it’s likely you are primarily concerned with query performance. However, indexes have to be maintained. Each time a document is inserted or updated, associated index entries must also be updated. Similarly, index entries have to be found and deleted when documents are removed. This index maintenance can impact write performance. You’re best served creating indexes for your most frequently used queries.

Finally, you’ll want to periodically review your indexes. As application usage changes, you may find your previously valuable indexes are just slowing things down. The problem is that MongoDB doesn’t provide great visibility into index statistics. To address this, Jason Wilder created an excellent set of scripts to make sense of the information MongoDB does provide. You can find his mongodb-tools here.


Index management is a central part of your MongoDB application. Knowing what to index based on your application’s usage and the potential performance impact should drive your decisions. MongoDB’s explain() function, as well as Jason Wilder’s tools, can provide valuable insight and guidance.



Apollo Group Evaluates MongoDB

The Apollo Group is the company behind University of Phoenix. It recently conducted a thorough evaluation of MongoDB and published the results. The paper is included below, but Apollo Group outlined the following keys for successfully deploying to MongoDB:

  • The key to MongoDB performance is thoughtful data model design.
  • Design the deployment to match data usage.
  • MongoDB is a great match for cloud-based, on-premise, and hybrid deployments.
  • MongoDB has excellent availability.
  • Latency between different Amazon EC2 Availability Zones within the same EC2 Region is very low.
  • At 85% CPU utilization, the behavior of MongoDB changes, and performance levels off.

Overall, this is a great approach to follow when you’re evaluating new technology, not just MongoDB.

Joining a Distributed Team

With software development talent harder to come by in hotspots like Silicon Valley, companies are looking for talent wherever they can find it. Distributed development teams are becoming more common. As a developer, joining an established distributed team can be a challenge. Aside from the technical challenges, such as learning an existing code base and processes, you also have to socialize with your new team.

When I joined Nodeable’s distributed development team a few months ago, the team was already in place and processes were defined. I had to quickly adapt to developers that had worked together for over a year and start providing value immediately. These are some of the things I did to acclimate to my new team. You may find these helpful if you’re in a similar position.

Start Early

A few weeks before starting, I learned about the development and deployment processes. This allowed me to ramp up on any tools I may not have been familiar with. There were very few in this case, but the opportunity was there to close any gaps.

Additionally, I started reviewing the bug database. This provided two advantages. First, I could figure out how the team used Jira. Is it just used for bugs and immediate work items, or for broader road map items as well? The second advantage was less obvious. From looking at the bug database, the history of the team was apparent. What areas of the application underwent significant change or were abandoned? Which ideas looked promising but were later dropped due to changing priorities or a pivot? This doesn’t seem important, but it can give you insight into how your new team works. And when you’re not seeing them in person every day, every piece of data is helpful.

Avoid Cost Center Work

This point isn’t isolated to distributed teams, but I’ve found the impact of working on cost center stuff to be greater when working remotely. When you join a new team, distributed or not, you want to avoid working on tasks not directly related to the product road map and/or a profit center. Cost center tasks are things like chasing obscure bugs, anything related to devops, or writing unit tests. (Sorry, but unit testing is a massive cost center.) While you can’t avoid cost center work all the time, and you shouldn’t, it’s better to avoid it when starting in a new position. It’s demotivating and helps to erode that crucial honeymoon phase at a new company.


Copyright © Nick Heudecker
