Apache Tajo Enters the SQL-on-Hadoop Space

The number of SQL options for Hadoop has expanded substantially over the last 18 months. Most get a lot of attention when they're announced, but a few slip under the radar. One of these low-flying options is Apache Tajo. I learned about Tajo in November 2013 at a Hadoop User Group meeting.

Billed as a big data warehousing system for Hadoop, Tajo started development in 2010 and moved to the Apache Software Foundation in March 2013, where it is currently incubating. Its primary development sponsor is Gruter, a big data infrastructure startup in South Korea. Despite the lack of public awareness, Tajo has a fairly robust feature set:

  • SQL compliance
  • Fully distributed query processing against HDFS and other data sources
  • ETL feature set
  • User-defined functions
  • Compatibility with HiveQL and Hive MetaStore
  • Fault tolerance through a restart mechanism for failed tasks
  • Cost-based query optimization and an extensible query rewrite engine
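
To make the HiveQL compatibility and HDFS-backed querying a bit more concrete, here is a minimal Python sketch that connects to a Tajo master over its JDBC driver using JayDeBeApi. The host, port, jar path and table name are placeholders I've made up for illustration, not values from the Tajo documentation; check your Tajo release for the actual connection details.

```python
# Sketch: querying Tajo from Python over JDBC via JayDeBeApi.
# The driver class, URL, port, jar path and table are assumptions for
# illustration; check your Tajo release for the actual connection details.
import jaydebeapi  # pip install JayDeBeApi

conn = jaydebeapi.connect(
    "org.apache.tajo.jdbc.TajoDriver",                     # assumed driver class
    "jdbc:tajo://tajo-master.example.com:26002/default",   # assumed host/port/db
    jars="/opt/tajo/share/jdbc-dist/tajo-jdbc.jar",        # hypothetical jar path
)
try:
    cur = conn.cursor()
    # A HiveQL-style aggregate over a table whose data lives in HDFS.
    cur.execute("""
        SELECT l_returnflag, COUNT(*) AS cnt, SUM(l_extendedprice) AS total
        FROM lineitem
        GROUP BY l_returnflag
        ORDER BY cnt DESC
    """)
    for row in cur.fetchall():
        print(row)
finally:
    conn.close()
```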

Things get interesting when comparing performance against Apache Hive and Cloudera Impala. SK Telecom, the largest telecommunications provider in South Korea, tested Tajo, Hive and Impala using five sample queries. Hive 0.10 and Impala 1.1.1 on CDH 4.3.0 were used for the test. Test data size was 1.7TB and query results were 8GB or less in size. The five test queries, with results charted in SK Telecom's presentation, are summarized below:

  • Query 1: Heavy scan with 20 text-matching filters
  • Query 2: Seven unions with joins
  • Query 3: Simple joins
  • Query 4: Group by and order by
  • Query 5: 30 pattern-matching filters with OR conditions, using group by, having and sorting

[Benchmark charts: per-query execution times for Tajo, Hive 0.10 and Impala 1.1.1]

What do these results indicate? Clearly, different SQL-on-Hadoop implementations have different performance characteristics. Until these options mature to be truly multi-purpose, selecting a single option may not result in the best overall performance. Also, these benchmarks are for a specific set of use cases – not your use cases. The tested queries may have no relevance to your data and how you’re using it.

The other important takeaway is the absolute performance of these options. The sample data set and results are small in modern terms, yet none of the results are astounding relative to a modern data warehouse or RDBMS. There’s a difference between “fast” and “fast for Hadoop.” Cloudera appears to be making some headway, but a lot of ground must be covered before any Hadoop distribution can compete with the systems vendors claim to be replacing.

How Square Secured Your Data in Hadoop

If any company has to take data security seriously, a credit card payment processor is a likely candidate. Square’s card readers and applications likely process millions of payments per day. One output of this processing is a lot of data. According to a presentation delivered at Facebook, Square stores a substantial amount of this data in Hadoop, conducting data analysis and offering data products to buyers and merchants.

Square does this without storing sensitive information in Hadoop. Redacted data meets about 80% of their use cases. Meeting the remaining 20% of use cases required some fairly radical engineering.

If you’re not familiar with Hadoop, it’s a distributed filesystem and data processing framework, and you can put anything you want in those files. Square stores its data as Protocol Buffers, or “protobufs”, a compact way to serialize structured data. (I’m trying to stay as non-technical as possible in this post and already failing miserably. It’s a data format.)
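
For readers who want slightly more than “it’s a data format”: below is a minimal Python sketch of protobuf serialization. The `Payment` message and the `payment_pb2` module are hypothetical (they would be generated by `protoc` from a `.proto` definition); the point is only that each record becomes a compact, schema-described byte string that can be written to files in HDFS.

```python
# Minimal protobuf sketch. The Payment message and the payment_pb2 module
# are hypothetical; they would be generated by `protoc --python_out=.`
# from a .proto definition roughly like:
#
#   message Payment {
#     optional string payment_id   = 1;
#     optional string merchant_id  = 2;
#     optional int64  amount_cents = 3;
#     optional string card_number  = 4;  // sensitive field
#   }
import payment_pb2  # hypothetical generated module

record = payment_pb2.Payment(
    payment_id="p-12345",
    merchant_id="m-987",
    amount_cents=1250,
    card_number="4111111111111111",
)

# Serialize to a compact byte string; this is what lands in files on HDFS.
blob = record.SerializeToString()

# Deserialize later, e.g. in a job reading those files back out of Hadoop.
restored = payment_pb2.Payment()
restored.ParseFromString(blob)
print(restored.merchant_id, restored.amount_cents)
```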

Hadoop’s security model applies access control at the file level. This isn’t unreasonable since all it knows about are files. File contents are a mystery. However, file-level security is a blunt instrument in a modern data architecture. Addressing this security impedance mismatch led Square to look at a few options:

  • Create a separate, secure Hadoop cluster for sensitive data. This was rejected because it introduces more operational overhead. It also wasn’t clear how to separate sensitive data. And everyone will want access to the secure cluster. Maybe the answer is to…
  • Open up access to everyone and track access with audit logs. Yep, this is a bad idea. Nobody looks at logs.

So Square thought about what they needed to support the business. That resulted in the following requirements:

  • Control access to individual fields within each record.
  • Extend the security framework to the ingest method (Apache Kafka), not just Hadoop.
  • Developer transparency.
  • Support algorithmic agility and key management.
  • Don’t store cryptographic keys with the data.

None of these requirements is unreasonable. All of them should be considered table stakes for an information management application. However, Square had to implement its own encryption model to meet these modest requirements. It did this by modifying its protobuf code to call an existing online crypto service. Key management (always difficult) became more challenging as data volumes grew, and it took a few iterations to get right.
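
Square hasn’t published this code, but the general pattern is straightforward to sketch: encrypt each sensitive field on its own before the record is serialized, fetch the data key from a separate key service, and store only a key identifier (never the key itself) alongside the ciphertext. The Python sketch below uses the `cryptography` library; the `fetch_data_key` helper, key ids and field names are my own stand-ins, not Square’s implementation.

```python
# Field-level encryption sketch (not Square's actual implementation).
# Sensitive fields are encrypted individually before the record is
# serialized, the data key comes from an external key service, and only
# a key id (never the key itself) travels with the data into Kafka/Hadoop.
from cryptography.fernet import Fernet  # pip install cryptography

# Simulated key service. In practice this is a separate networked service;
# keys are never stored alongside the data.
_KEY_SERVICE = {"cards-2014-02": Fernet.generate_key()}

def fetch_data_key(key_id: str) -> bytes:
    """Stand-in for a call to an external key management / crypto service."""
    return _KEY_SERVICE[key_id]

def encrypt_field(plaintext: str, key_id: str) -> dict:
    """Encrypt one sensitive field; keep the key id so an authorized reader
    can fetch the same key later."""
    f = Fernet(fetch_data_key(key_id))
    return {"key_id": key_id, "ciphertext": f.encrypt(plaintext.encode())}

def decrypt_field(field: dict) -> str:
    """Reverse of encrypt_field, for readers allowed to call the key service."""
    f = Fernet(fetch_data_key(field["key_id"]))
    return f.decrypt(field["ciphertext"]).decode()

record = {
    "payment_id": "p-12345",        # non-sensitive, stored in the clear
    "amount_cents": 1250,           # non-sensitive
    "card_number": encrypt_field("4111111111111111", key_id="cards-2014-02"),
}

# `record` can now be serialized (e.g. as a protobuf) and written to Kafka
# or HDFS; anyone without access to the key service sees only ciphertext.
assert decrypt_field(record["card_number"]).endswith("1111")
```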

Square didn’t disclose the number of hours spent on developing these capabilities. I think it’s fair to assume the investment was substantial.

Maybe the level of effort required explains why data security doesn’t show up as a primary inhibitor to Hadoop adoption: the sensitive workloads simply never make it onto the platform. The lack of data-level security is keeping relevant projects off of Hadoop entirely.

NoSQL Shouldn’t Mean NoDBA

Last September I conducted an informal survey of NoSQL adopters to improve our understanding of who is using NoSQL and why. The results were largely what I expected, except for the respondent profile. Database administrators (DBAs) appear to be significantly underrepresented in the NoSQL space, representing only 5.5% of respondents:

[Chart: primary job role of survey respondents]

The possibility of selection bias occurred to me, but these numbers look accurate based on conversations with clients, vendors and developers. DBAs simply aren’t a part of the NoSQL conversation. This means DBAs, intentionally or not, are being eliminated from a rapidly growing area of information management. Application developers may be getting what they want from NoSQL now, but cutting out the primary data stewards will result in long-term data quality and information governance challenges for the larger enterprise (see “Does Your NoSQL DBMS Result in Information Governance Debt?”, G00257910).

If you’ve adopted a NoSQL DBMS, what are your integration and governance plans? Do you even have any?

Spark and Tez Highlight MapReduce Problems

On February 3rd, Cloudera announced support for Apache Spark as part of Cloudera Enterprise. I’ve blogged about Spark before so I won’t go into substantial detail here, but the short version is Spark improves upon MapReduce by removing the need to write data to disk between steps. Spark also takes advantage of in-memory processing and data sharing for further optimizations.
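
A small PySpark sketch makes the difference concrete: an intermediate dataset is filtered once, cached in memory, and reused by two downstream computations, where a chain of MapReduce jobs would typically write that intermediate result to HDFS and read it back. The input path and record layout are made up for illustration.

```python
# PySpark sketch: an intermediate RDD is cached in memory and reused by two
# downstream computations, instead of being written to disk between steps
# as a chain of MapReduce jobs would do. Path and record layout are made up.
from pyspark import SparkContext

sc = SparkContext(appName="spark-vs-mapreduce-sketch")

events = sc.textFile("hdfs:///data/events/2014-02/*")         # hypothetical input
errors = events.filter(lambda line: "ERROR" in line).cache()  # keep in memory

# Both computations below reuse the cached intermediate result directly.
error_count = errors.count()
errors_by_host = (errors.map(lambda line: (line.split("\t")[0], 1))
                        .reduceByKey(lambda a, b: a + b)
                        .collect())

print(error_count, errors_by_host[:5])
sc.stop()
```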

The other successor to MapReduce (of course there is more than one) is Apache Tez. Tez improves upon MapReduce by removing the need to write data to disk between steps (sound familiar?). It also has in-memory capabilities similar to Spark. Thus far, Hortonworks has thrown its weight behind Tez development as part of the Stinger project.
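
Tez itself is a Java DAG framework rather than something you call directly from a script, but the easiest way to see it in action is through Hive: with a Tez-enabled Hive build from the Stinger work (0.13 and later), a session setting switches the execution engine. The sketch below simply shells out to the Hive CLI; the table and query are placeholders.

```python
# Sketch: running a Hive query on Tez rather than MapReduce by switching the
# session's execution engine. Requires a Tez-enabled Hive (0.13+); the table
# and query are placeholders.
import subprocess

query = """
SET hive.execution.engine=tez;
SELECT page, COUNT(*) AS hits
FROM web_logs
GROUP BY page
ORDER BY hits DESC
LIMIT 10;
"""

subprocess.check_call(["hive", "-e", query])
```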

Both Tez and Spark are described as supplementing MapReduce workloads. However, I don’t think this will be the case much longer. The world has changed since Google published the original MapReduce paper in 2004. Memory prices have plummeted while data volumes and sources have increased, making legacy MapReduce less appealing.

Vendors will likely begin distancing themselves from MapReduce for more performant options once there are some high profile customer references. It remains to be seen what this means for early adopters with legacy MapReduce applications.

Thanks to Josh Wills at Cloudera for helping clarify the advantage provided by Spark & Tez.

Copyright © Nick Heudecker
