Presented By O'Reilly and Cloudera
Make Data Work
Feb 17–20, 2015 • San Jose, CA

Top Ten Pitfalls to Avoid in a SQL-on-Hadoop Implementation

Monte Zweben (Splice Machine Inc.)
4:00pm–4:40pm Friday, 02/20/2015
Hadoop Platform
Location: 210 C/G
Slides: 1-PPTX

SQL-on-Hadoop solutions have been a growing force in the Big Data marketplace because of their ability to include a SQL layer, or even a full SQL database, on top of Hadoop. SQL-on-Hadoop delivers many benefits for Hadoop, such as integration with existing SQL and BI tools and bypassing the need to program in Java and MapReduce. SQL also eliminates the need to retrain armies of data and business analysts at many companies.

With the increased traction of SQL-on-Hadoop technology, we are seeing more and more enterprises add it to their database environment. Increased adoption has led many companies to uncover similar issues as they transition workloads from familiar systems like Oracle, IBM DB2 and MySQL to the unfamiliar realm of Hadoop.

From data ingestion and data types to SQL coverage and joins, there are many issues that can slow – or prevent – users from seeing the benefits of the new systems.
In this talk, we’ll review ten of the most common pitfalls that SQL-on-Hadoop users encounter, and then discuss how to avoid them. Topics that we’ll cover include:

  • How do I avoid constantly restarting ETL pipelines?
  • Why do my joins break and why do I need so much memory?
  • Why does my storage footprint triple in size?
  • How can I express my query with limited dialects of SQL?
  • How can I clean my data without causing full data reloads?
  • How do I complete my ETL without causing reporting downtime?
  • How do I query across both structured and unstructured data?
  • When should I create schema or structure? On ingest? On read?

The goal of this talk is to give current and future adopters of SQL-on-Hadoop solutions the knowledge they need to realize the value of SQL-on-Hadoop with easier implementation and smoother ongoing operations.

Monte Zweben

Splice Machine Inc.

Monte Zweben is the CEO and co-founder of Splice Machine, provider of the only Hadoop RDBMS. A SQL-on-Hadoop solution, Splice Machine has helped many companies scale real-time applications using commodity hardware without application rewrites.

A technology industry veteran, Monte’s early career was spent with the NASA Ames Research Center as the Deputy Chief of the Artificial Intelligence Branch, where he won the prestigious Space Act Award for his work on the Space Shuttle program. Monte then founded and was the Chairman and CEO of Red Pepper Software, a leading supply chain optimization company, which merged in 1996 with PeopleSoft, where he was VP and General Manager, Manufacturing Business Unit.

In 1998, Monte was the founder and CEO of Blue Martini Software – the leader in e-commerce and multi-channel systems for retailers. Blue Martini went public on NASDAQ in one of the most successful IPOs of 2000, and is now part of JDA. Monte is also the co-author of Intelligent Scheduling and has published articles in the Harvard Business Review and various computer science journals and conference proceedings.

Zweben currently serves on the Board of Directors of Rocket Fuel Inc. as well as the Dean’s Advisory Board for Carnegie Mellon’s School of Computer Science.