NoSQL for Competitive Hiring Advantage

Regardless of your location or industry, hiring technical talent is challenging. This is particularly true in areas like Silicon Valley, where 59% of employers plan to add workers. Companies typically resort to blunt instruments, like larger salaries or signing bonuses, to bring talent in the door. (The merits of this strategy are debatable.) Some companies are thinking beyond these basic incentives, exploring how adopting new technologies can attract scarce development talent.

This is counter-intuitive. After all, you should only adopt a technology if it solves a business problem. Does it save money, increase revenue or capture some competitive advantage? If so, adopt. Otherwise, avoid. The flaw in that reasoning is that hiring is itself a business problem. Companies have to attract and retain talent to enable competitive advantage.

Doesn’t this sound familiar? “We have a problem. Let’s throw technology at it.” Now that type of thinking is expanding to hiring. It’s the technology trap all over again. If your company isn’t an appealing place to work, technology won’t fix the problem. It only delays meaningful change to your organization.

Looking to work at one of the best companies in the world? The Information Management team is hiring in North America and Europe. Leave a comment (I won’t publish it) for more information.

Apache Tajo Enters the SQL-on-Hadoop Space

The number of SQL options for Hadoop expanded substantially over the last 18 months. Most get a large amount of attention when announced, but a few slip under the radar. One of these low-flying options is Apache Tajo. I learned about Tajo in November of 2013 at a Hadoop User Group meeting.

Billed as a big data warehousing system for Hadoop, Tajo began development in 2010 and moved to the Apache Software Foundation in March of 2013, where it is currently incubating. Its primary development sponsor is Gruter, a big data infrastructure startup in South Korea. Despite the lack of public awareness, Tajo has a fairly robust feature set (a brief usage sketch follows the list):

  • SQL compliance
  • Fully distributed query processing against HDFS and other data sources
  • ETL feature set
  • User-defined functions
  • Compatibility with HiveQL and Hive MetaStore
  • Fault tolerance through a restart mechanism for failed tasks
  • Cost-based query optimization and an extensible query rewrite engine
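
As a rough illustration of how the HiveQL compatibility and distributed query processing would look from client code, here is a hedged Python sketch using the generic jaydebeapi JDBC bridge. The driver class name, URL, port, jar path and the lineitem table are all assumptions for illustration, not verified values; treat it as a sketch rather than a recipe.

    # Hypothetical sketch: querying Tajo from Python through a JDBC bridge.
    # The driver class, URL, port and jar path are assumptions; check your
    # Tajo install for the real values.
    import jaydebeapi

    conn = jaydebeapi.connect(
        "org.apache.tajo.jdbc.TajoDriver",                     # assumed class name
        "jdbc:tajo://tajo-master.example.com:26002/default",   # assumed URL and port
        [],
        "/opt/tajo/tajo-jdbc.jar",                             # assumed jar location
    )
    try:
        cur = conn.cursor()
        # HiveQL-compatible SQL runs unchanged against HDFS-resident tables.
        cur.execute("""
            SELECT l_returnflag, COUNT(*) AS cnt, SUM(l_extendedprice) AS revenue
            FROM lineitem
            GROUP BY l_returnflag
            ORDER BY revenue DESC
        """)
        for row in cur.fetchall():
            print(row)
    finally:
        conn.close()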

Things get interesting when comparing performance against Apache Hive and Cloudera Impala. SK Telecom, the largest telecommunications provider in South Korea, tested Tajo, Hive and Impala using five sample queries. Hive 0.10 and Impala 1.1.1 on CDH 4.3.0 were used for the test. Test data size was 1.7TB and query results were 8GB or less in size. (The following images were taken from the presentation in the previous link.)

The five test queries were:

  • Query 1: Heavy scan with 20 text matching filters
  • Query 2: 7 unions with joins
  • Query 3: Simple joins
  • Query 4: Group by and order by
  • Query 5: 30 pattern matching filters with OR conditions using group by, having and sorting

[Charts: per-query benchmark results for Tajo, Hive and Impala]

What do these results indicate? Clearly, different SQL-on-Hadoop implementations have different performance characteristics. Until these options mature to be truly multi-purpose, selecting a single option may not result in the best overall performance. Also, these benchmarks are for a specific set of use cases – not your use cases. The tested queries may have no relevance to your data and how you’re using it.

The other important takeaway is the absolute performance of these options. The sample data set and results are small in modern terms, yet none of the results are astounding relative to a modern data warehouse or RDBMS. There’s a difference between “fast” and “fast for Hadoop.” Cloudera appears to be making some headway, but a lot of ground must be covered before any Hadoop distribution is competitive with the systems vendors claim to be replacing.

How Square Secured Your Data in Hadoop

If any company must face issues around data security, a credit card payment processor is a likely candidate. Square’s card readers and applications likely process millions of payments per day. One output of this processing is a lot of data. According to a presentation delivered at Facebook, Square stores a substantial amount of this data in Hadoop, conducting data analysis and offering data products to buyers and merchants.

Square does this without storing sensitive information in Hadoop. Redacted data meets about 80% of their use cases. Meeting the remaining 20% of use cases required some fairly radical engineering.

If you’re not familiar with Hadoop, it’s a distributed filesystem and data processing mechanism. You can put anything you want in those files. Square stores its data as Protocol Buffers, or “protobufs”. Protobufs is a way to serialize data. (Trying to stay as non-technical as possible in this post and already failing miserably. It’s a data format.)
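
Since the whole approach hinges on that format, here is a rough illustration of protobuf serialization in Python. The payment.proto schema, the generated payment_pb2 module and the field names are hypothetical examples for this post, not Square's actual schema.

    # Hypothetical illustration of protobuf serialization. Assumes a schema
    # roughly like this has been compiled with `protoc --python_out=.`:
    #
    #   message Payment {
    #     required string merchant_id = 1;
    #     required int64  amount_cents = 2;
    #     optional string pan = 3;   // sensitive: primary account number
    #   }
    #
    # The payment_pb2 module and field names are invented for this example.
    import payment_pb2

    record = payment_pb2.Payment(merchant_id="m-123", amount_cents=4250,
                                 pan="4111111111111111")
    blob = record.SerializeToString()   # compact binary bytes, e.g. what lands in HDFS
    restored = payment_pb2.Payment.FromString(blob)
    assert restored.merchant_id == "m-123"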

Hadoop’s security model applies access control at the file level. This isn’t unreasonable since all it knows about are files. File contents are a mystery. However, file-level security is a blunt instrument in a modern data architecture. Addressing this security impedance mismatch led Square to look at a few options:

  • Create a separate, secure Hadoop cluster for sensitive data. This was rejected because it introduces more operational overhead. It also wasn’t clear how to separate sensitive data. And everyone will want access to the secure cluster. Maybe the answer is to…
  • Open up access to everyone and track access with audit logs. Yep, this is a bad idea. Nobody looks at logs.

So Square thought about what they needed to support the business. That resulted in the following requirements:

  • Control access to individual fields within each record.
  • Extend the security framework to the ingest method (Apache Kafka), not just Hadoop.
  • Developer transparency.
  • Support algorithmic agility and key management.
  • Don’t store cryptographic keys with the data.

None of these requirements are unreasonable. All of them should be considered table stakes for an information management application. However, Square had to implement its own encryption model to meet these modest requirements, which it did by modifying the protobuf code to use an existing online crypto service. Key management overhead (always difficult) grew more challenging as data volumes increased and required a few iterations to get right.
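
To make the idea concrete, here is a minimal sketch of field-level encryption in that spirit. It is not Square's implementation: the Fernet cipher, the in-process stand-in for an online key service and every name here are illustrative assumptions. The point it demonstrates is that a sensitive field carries only a key ID, while the key itself stays in a separate service.

    # Minimal sketch (not Square's implementation) of field-level encryption
    # where only a key ID travels with the record and the key itself stays in
    # a separate key service. Requires the `cryptography` package.
    from cryptography.fernet import Fernet

    # Stand-in for an online crypto/key service; in practice this is a remote call.
    KEY_SERVICE = {"payments-key-v1": Fernet.generate_key()}

    def encrypt_field(plaintext, key_id):
        f = Fernet(KEY_SERVICE[key_id])
        return {"key_id": key_id, "ciphertext": f.encrypt(plaintext.encode())}

    def decrypt_field(envelope):
        f = Fernet(KEY_SERVICE[envelope["key_id"]])
        return f.decrypt(envelope["ciphertext"]).decode()

    record = {
        "merchant_id": "m-123",                                       # stored in the clear
        "pan": encrypt_field("4111111111111111", "payments-key-v1"),  # encrypted field
    }
    print(decrypt_field(record["pan"]))   # only callers with key-service access can do this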

Square didn’t disclose the number of hours spent on developing these capabilities. I think it’s fair to assume the investment was substantial.

Maybe the level of effort required explains why security doesn’t register as a primary inhibitor to Hadoop adoption: the lack of data-level security simply keeps the relevant projects off of Hadoop in the first place.

NoSQL Shouldn’t Mean NoDBA

Last September I conducted an informal survey of NoSQL adopters to improve our understanding of who is using NoSQL and why. The results were largely what I expected, except for the respondent profile. Database administrators (DBAs) appear to be significantly underrepresented in the NoSQL space, representing only 5.5% of respondents:

[Chart: survey respondents by primary job role]

The possibility of selection bias occurred to me, but these numbers look accurate based on conversations with clients, vendors and developers. DBAs simply aren’t a part of the NoSQL conversation. This means DBAs, intentionally or not, are being eliminated from a rapidly growing area of information management. Application developers may be getting what they want from NoSQL now, but cutting out the primary data stewards will result in long-term data quality and information governance challenges for the larger enterprise (see “Does Your NoSQL DBMS Result in Information Governance Debt?” G00257910).

If you’ve adopted a NoSQL DBMS, what are your integration and governance plans? Do you even have any?

Spark and Tez Highlight MapReduce Problems

On February 3rd, Cloudera announced support for Apache Spark as part of Cloudera Enterprise. I’ve blogged about Spark before so I won’t go into substantial detail here, but the short version is Spark improves upon MapReduce by removing the need to write data to disk between steps. Spark also takes advantage of in-memory processing and data sharing for further optimizations.
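
For anyone who hasn't seen the difference in practice, here is a small PySpark sketch of that behavior: an intermediate dataset is cached in memory and reused by two later steps, rather than being written back to disk between jobs. The HDFS path, file layout and field positions are illustrative assumptions.

    # Sketch: cache an intermediate RDD in memory and reuse it across steps,
    # instead of materializing it to disk between MapReduce jobs.
    from pyspark import SparkContext

    sc = SparkContext(appName="cache-demo")
    events = sc.textFile("hdfs:///data/events")                 # illustrative path

    parsed = events.map(lambda line: line.split("\t")).cache()  # held in memory

    error_count = parsed.filter(lambda f: f[1] == "ERROR").count()    # step 1 reuses cache
    distinct_users = parsed.map(lambda f: f[0]).distinct().count()    # step 2 reuses cache

    print(error_count, distinct_users)
    sc.stop()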

The other successor to MapReduce (of course there is more than one) is Apache Tez. Tez improves upon MapReduce by removing the need to write data to disk between steps (Sound familiar?). It also has in-memory capabilities similar to Spark. Thus far Hortonworks has thrown its weight behind Tez development as part of the Stinger project.

Both Tez and Spark are described as supplementing MapReduce workloads. However, I don’t think this will be the case much longer. The world has changed since Google published the original MapReduce paper in 2004. Memory prices have plummeted while data volumes and sources have increased, making legacy MapReduce less appealing.

Vendors will likely begin distancing themselves from MapReduce for more performant options once there are some high profile customer references. It remains to be seen what this means for early adopters with legacy MapReduce applications.

Thanks to Josh Wills at Cloudera for helping clarify the advantage provided by Spark & Tez.

Kiva Lending History

Over the past seven years I’ve participated in 46 loans through Kiva.org, a microcredit lending facilitator. I try to target my loans at female entrepreneurs in Sub-Saharan Africa whenever possible, but I never had a feel for how consistent I was being in my lending. Luckily, Kiva saved me the trouble of having to process my portfolio data.

Gender

I’ve been successful at targeting loans towards female entrepreneurs. Across the 46 loans, 86.96% (40 of 46) have gone to women or to groups comprised of women.

[Chart: loans by borrower gender]

Countries

This chart shows the percentages for the 10 countries receiving the most loans. I’ve lent to borrowers in 23 different countries, with the largest share of loans going to Sub-Saharan Africa.

[Chart: loans by country, top 10]

Sectors

I didn’t expect this breakdown, but I suppose it makes sense. Agriculture is a primary industry in lesser-developed countries and employs a large number of people.

[Chart: loans by sector]

Most impressively, of the 46 loans I’ve made, only one has defaulted. Most pay back on time and I reuse those funds for additional loans.

If you haven’t made a loan through Kiva.org, I highly recommend it.

Which Database are You Taking to Prom?

Project downloads, job postings, Github commits and mailing list activity are common methods to gauge traction for commercial and open source software projects. I frequently use those same metrics as input when looking at emerging projects, and for good reason.

Increasing traffic on a user mailing list indicates growing use and adoption. However, if that traffic continues growing, especially after a milestone release, it can indicate poor documentation or quality control. Do the same questions keep coming up on user forums? Your documentation is poor or nonexistent. Did the bug count go up after the 1.0 release? Looks like you’re focusing on features over quality.

Metrics like these are useful but they don’t exist in a vacuum. They need context, the type provided by the overall market, vendor strategies and customer feedback. A simple measure of popularity lacks this context, and therefore lacks insight. Additionally, popularity doesn’t measure sentiment. You don’t know if someone on Stackoverflow is saying “Database X is awesome,” or “Database Y is awful.”

Also concerning is the self-fulfilling notion of popularity. Once a database is measured as relatively more popular than another, it will experience a further uptick in popularity metrics and the cycle continues. Ultimately this results in more marketplace hype and confusion.

BI Hadoop Specialists Trail, Broader Tools Lead

There is a romantic notion of leaving the past behind and embracing the future unencumbered. Previous mistakes forgotten, we can venture forward to accomplish great things to the amazement of friends, colleagues and casual onlookers.

This is the promise made by BI and analytics vendors in the Hadoop-only ecosystem. After all, if your data moves to Hadoop, why concern yourself with data stored in legacy data warehouses? Based on the audience response from a polling question conducted during a webinar on Hadoop 2.0, you can’t escape your past. You can only embrace it.

On January 16th, Merv Adrian and I presented a webinar discussing the impact of Hadoop 2.0. As part of the webinar, we asked our audience three polling questions. One of the questions asked participants which SQL-on-Hadoop methods they were most likely to use to access data stored in HDFS and HBase. We asked this question because of the SQL arms race under way in the Hadoop ecosystem. Here are the responses:

[Chart: poll responses on preferred SQL-on-Hadoop access methods]

These results indicate that established analytics tools lead while Hadoop-specific tools trail significantly. Most respondents said they were likely to use interfaces provided by their existing analytics tool providers. Hive came in second, indicating increasing familiarity and comfort with the most established SQL-on-Hadoop interface. The Hadoop-specific BI specialists, however, were tied for last. There are a few possible reasons for this apparent reluctance, such as maturity, availability of skills and existing investment in analytics tools. But the recognition that data exists in more places than just Hadoop may be the core factor behind these responses. If your data lives in warehouses and marts as well as in Hadoop, your analytics tools must cope with that reality.

Copyright © Nick Heudecker
