Presented By O'Reilly and Cloudera
Make Data Work
March 28–29, 2016: Training
March 29–31, 2016: Conference
San Jose, CA
 
LL20 A
11:00am How LinkedIn built a text analytics platform at scale Chi-Yi Kuan (LinkedIn), Weidong Zhang (LinkedIn), Tiger Zhang (LinkedIn)
11:50am Python scalability: A convenient truth Travis Oliphant (Continuum Analytics)
1:50pm Data modeling for data science: Simplify your workload with complex types Marcel Kornacker (Cloudera), Alexander Behm (Cloudera)
2:40pm Atom smashing using machine learning at CERN Siddha Ganju (NVIDIA)
4:20pm The polyglot Beaker notebook Scott Draves (Two Sigma Open Source)
LL20 C
11:00am The state of visualization: Application and practice Noah Iliinsky (Amazon Web Services)
11:50am Data scientists, you can help save lives Jeremy Howard (fast.ai | USF | doc.ai and platform.ai)
1:50pm Visualization is distortion: How to lie less Aneesh Karve (Quilt)
2:40pm Building responsive data visualization for the Web Bill Hinderman (Vaystays)
4:20pm Vowpal Wabbit: The essence of speed in machine learning Jeroen Janssens (Data Science Workshops)
LL20 D
1:50pm Netflix's big leap from Oracle to Cassandra Roopa Tangirala (Netflix)
2:40pm BI on Hadoop: What are your options? Jacques Nadeau (Dremio)
LL21 B
2:40pm Strategies for agile instrumentation, ingestion, and analytics across many platforms and products Yann Landrin-Schweitzer (Autodesk), Charlie Crocker (Autodesk)
LL21 C/D
11:00am Adopting analytics: The Autodesk journey Adam Sugano (Autodesk)
11:50am Inside Cigna's big data journey Jeffrey Shmain (Cloudera), Mohammad Quraishi (Cigna)
1:50pm How big data is helping to save babies around the world Linus Liang (Embrace), Brad Allen (Silicon Valley Data Science)
2:40pm Publicly broadcasting data exhaust at a public broadcaster Christopher Berry (Canadian Broadcasting Corporation)
4:20pm Transforming Telefónica John Belchamber (Telefónica), Arturo Canales (Telefónica)
LL21 E/F
11:00am Netflix: Making big data small Daniel Weeks (Netflix)
11:50am Data applications and infrastructure at Coursera Roshan Sumbaly (Facebook), Pierre Barthelemy (Coursera)
1:50pm Architecting immediacy: The design of a high-performance, portable wrangling engine Joe Hellerstein (UC Berkeley), Seshadri Mahalingam (Trifacta)
4:20pm Secrets of natural language UIs: Translating English into computer actions Joseph Turian (Workday), Alex Nisnevich (Bayes Impact)
210 A/E
2:40pm Scalable schema management for Hadoop and Spark applications Kelvin Chu (Uber), Evan Richards (Uber)
4:20pm Cancer genomics analysis in the cloud with Spark and ADAM Timothy Danford (Tamr, Inc.)
210 C/G
11:00am Fast data made easy with Apache Kafka and Apache Kudu (incubating) Ted Malaska (Capital One), Jeff Holoman (Cloudera)
210 D/H
11:50am Embeddable data transformation for real-time streams Joey Echeverria (Rocana)
1:50pm Toppling the mainframe: Enterprise-grade streaming under 2 ms on Hadoop Ilya Ganelin (Capital One Data Innovation Lab)
2:40pm Apache Flink: Streaming done right Kostas Tzoumas (data Artisans)
211 A-C
11:00am Ask me anything: Apache Hadoop operations for production systems Kathleen Ting (Cloudera), Vikram Srivastava (Cloudera), Darren Lo (Cloudera), Jordan Hambleton (Cloudera)
11:50am Ask me anything: Hadoop application architectures Mark Grover (Lyft), Jonathan Seidman (Cloudera), Ted Malaska (Capital One), Gwen Shapira (Confluent)
1:50pm Ask me anything: Apache Spark Reynold Xin (Databricks), Tathagata Das (Databricks), Michael Armbrust (Databricks)
2:40pm Ask me anything: Apache Kafka Joseph Adler (Facebook), Ewen Cheslack-Postava (Confluent), Jun Rao (Confluent), Jesse Anderson (Big Data Institute), Neha Narkhede (Confluent)
4:20pm Ask me anything: Developing a modern enterprise data strategy John Akred (Silicon Valley Data Science), Scott Kurth (Silicon Valley Data Science), Colette Glaeser (Silicon Valley Data Science)
230 A
11:50am Big data for telcos: A trio of use cases Amy O'Connor (Cloudera)
1:50pm Format wars: From VHS and Beta to Avro and Parquet Silvia Oliveros (Silicon Valley Data Science), Stephen O'Sullivan (Data Whisperers)
2:40pm Did you accidentally build a database? Spencer Kimball (Cockroach Labs)
4:20pm Analyzing drivers of Net Promoter Score and their impact on customer engagement in the OTA industry Krishnan Venkata (LatentView Analytics), Jose Abelenda (Hotwire)
230 C
11:00am Deploying Hadoop on user namespace containers Abin Shahab (Altiscale)
1:50pm Twitter Heron at scale Karthik Ramasamy (Twitter)
2:40pm How the oil and gas industry is igniting a spark with information fusion and metadata analytics Brian Clark (Objectivity), Marco Ippolito (CGG GeoSoftware)
LL20 B
11:00am Demonstrating the art of the possible with Spark David Taieb (IBM), Mythili Venkatakrishnan (IBM)
LL21 A
11:50am Building a modern data architecture Ben Sharma (Zaloni)
1:50pm Solr as a SparkSQL datasource TJ Potter (Lucidworks)
2:40pm Batch is back: Critical for agile application adoption Joe Goldberg (BMC Software)
230 B
11:50am Filling the data lake Chuck Yarbrough (Pentaho), Mark Burnette (Pentaho, a Hitachi Group Company)
2:40pm Turn big data into big results Jeff Pohlmann (Oracle)
210 B/F
11:00am High-frequency decisioning Steve Wooledge (MapR Technologies)
11:50am Automated model selection and tuning at scale with Spark Peter Prettenhofer (DataRobot), Owen Zhang (DataRobot)
1:50pm Delivering "DARPA hard" Matthew Van Adelsberg (CACI)
2:40pm Virtualizing big data: Effective approaches from real-world deployments Martin Yip (VMware), Justin Murray (VMware)
Grand Ballroom 220
8:45am Thursday keynote welcome Roger Magoulas (O'Reilly Media), Doug Cutting (Cloudera), Alistair Croll (Solve For Interesting)
8:50am Apache Hadoop meets cybersecurity Tom Reilly (Cloudera), Alan Ross (Intel)
9:00am Thinking like a Bayesian Julia Galef (Center for Applied Rationality)
9:15am Connected brains Joseph Sirosh (Compass), Kai Miller (Stanford University)
9:25am Building practical AI systems Adam Cheyer (Samsung)
9:45am What's next for BDAS (the Berkeley Data Analytics Stack)? Michael Franklin (AMPLab/UC Berkeley)
9:55am Open by design, open for data Adam Kocoloski (IBM)
10:00am Nonsense science Paula Poundstone (Star of NPR's #1 radio show, "Wait Wait...Don't Tell Me")
10:30am Morning Break sponsored by IBM | Room: Expo Hall
12:30pm Lunch sponsored by MapR | Thursday BoF Tables | Room: Expo Hall
3:20pm Afternoon Break sponsored by Intel | Room: Expo Hall
5:00pm Event Ice Cream Social | Room: The Hub
8:00am Coffee Break | Room: Grand Ballroom Foyer
11:00am-11:40am (40m) Data Science & Advanced Analytics Artificial intelligence, Machine learning
How LinkedIn built a text analytics platform at scale
Chi-Yi Kuan (LinkedIn), Weidong Zhang (LinkedIn), Tiger Zhang (LinkedIn)
Chi-Yi Kuan, Weidong Zhang, and Yongzheng Zhang explain how LinkedIn has built a "voice of member" platform to analyze hundreds of millions of text documents. Chi-Yi, Weidong, and Yongzheng illustrate the critical components of this platform and showcase how LinkedIn leverages it to derive insights such as customer value propositions from an enormous amount of unstructured data.
11:50am-12:30pm (40m) Data Science & Advanced Analytics
Python scalability: A convenient truth
Travis Oliphant (Continuum Analytics)
Despite Python's popularity throughout the data-engineering and data science workflow, the principles behind its performance and scaling behavior are less understood. Travis Oliphant explains best practices and modern tools to scale Python to larger-than-memory and distributed workloads without sacrificing its ease of use or being forced to adopt heavyweight frameworks.
1:50pm-2:30pm (40m) Data Science & Advanced Analytics
Data modeling for data science: Simplify your workload with complex types
Marcel Kornacker (Cloudera), Alexander Behm (Cloudera)
Marcel Kornacker explains how to use nested data structures to increase analytic productivity. Marcel uses the well-known TPC-H schema to demonstrate how to simplify analytic workloads with nested schemas.
2:40pm-3:20pm (40m) Data Science & Advanced Analytics Artificial intelligence, Machine learning
Atom smashing using machine learning at CERN
Siddha Ganju (NVIDIA)
Siddha Ganju explains how CERN uses machine-learning models to predict which datasets will become popular over time. This helps to replicate the datasets that are most heavily accessed, which improves the efficiency of physics analysis in CMS. Analyzing this data leads to useful information about the physical processes.
4:20pm-5:00pm (40m) Data Science & Advanced Analytics Smart agents and human/machine augmentation
The polyglot Beaker notebook
Scott Draves (Two Sigma Open Source)
Scott Draves gives an overview of the Beaker notebook, a new open source tool for data scientists. Beaker was designed to be polyglot: a single notebook may contain cells from multiple languages that communicate with one another through a unique feature called autotranslation. Scott discusses motivations for the design, reviews the architecture, and gives a demo of Beaker in action.
11:00am-11:40am (40m) Visualization & User Experience Virtual and Augmented Reality
The state of visualization: Application and practice
Noah Iliinsky (Amazon Web Services)
Noah Iliinsky surveys the state of visualization, outlines the major trends in the field, and explores the directions that visualization is headed. Noah also dives into the assorted tool domains—from enterprise to desktop to code-based—and discusses the pros and cons and use cases of each.
11:50am-12:30pm (40m) Data-driven Business
Data scientists, you can help save lives
Jeremy Howard (fast.ai | USF | doc.ai and platform.ai)
In his 20+ years of applying machine learning and data analysis to a wide range of industries, Jeremy Howard never felt that his work really changed anyone's life in a deep and positive way, so he spent a year researching ways he might effect real change. Jeremy outlines the impact that deep learning is going to make on the world and explains how you too can make a difference.
1:50pm-2:30pm (40m) Visualization & User Experience
Visualization is distortion: How to lie less
Aneesh Karve (Quilt)
Seemingly harmless choices in visualization, design, and content selection can distort your data and lead to false conclusions. Aneesh Karve presents a framework for identifying and overcoming these distortions by drawing upon research in human perception, focus and context, and mobile design.
2:40pm-3:20pm (40m) Visualization & User Experience
Building responsive data visualization for the Web
Bill Hinderman (Vaystays)
With more than 1.4 billion smartphones and at least half that many tablets in use, there is a tremendous need for responsive web design in the data-visualization sphere. Bill Hinderman explains the principles of responsive data visualization, which allows you to respond to screen conditions as well as data conditions.
4:20pm-5:00pm (40m) Data Science & Advanced Analytics Machine learning
Vowpal Wabbit: The essence of speed in machine learning
Jeroen Janssens (Data Science Workshops)
Vowpal Wabbit (VW) is a fast out-of-core learning system that pushes the frontier of machine learning. Jeroen Janssens offers a practical introduction to VW from both RStudio and the Unix command line and demonstrates how it can be used to perform tasks such as classification, regression, matrix factorization, and topic modeling.
11:00am-11:40am (40m) Hadoop Use Cases
In search of database nirvana: The challenges of delivering HTAP
Rohit Jain (Esgyn)
Companies are looking for a single database engine that can address all their varied needs—from transactional to analytical workloads, against structured, semistructured, and unstructured data, leveraging graph databases, document stores, text search engines, column stores, key value stores, and wide column stores. Rohit Jain discusses the challenges one faces on the path to this nirvana.
11:50am-12:30pm (40m) Enterprise Adoption
Best practices for enterprise adoption of big data in the cloud
Prat Moghe (Cazena)
Many big data projects work in the lab yet never make it to full-scale production. Lengthy deployments and expertise shortages hinder enterprise adoption. Cloud deployments help but create challenges with security, integration, and costs. Prat Moghe outlines best practices for leveraging the modern big data stack and public cloud infrastructure while maintaining enterprise-grade standards.
1:50pm-2:30pm (40m) Enterprise Adoption
Netflix's big leap from Oracle to Cassandra
Roopa Tangirala (Netflix)
Roopa Tangirala details Netflix's migration from Oracle to Cassandra, covering the problems encountered, what worked and what didn't, and lessons learned along the way.
2:40pm-3:20pm (40m) Enterprise Adoption
BI on Hadoop: What are your options?
Jacques Nadeau (Dremio)
There are (too?) many options for BI on Hadoop. Some are great at exploration, some are great at OLAP, some are fast, and some are flexible. Understanding the options and how they work with Hadoop systems is a key challenge for many organizations. Jacques Nadeau provides a survey of the main options, both traditional (Tableau, Qlik, etc.) and new (Platfora, Datameer, etc.).
4:20pm-5:00pm (40m) Enterprise Adoption
Building a scalable, secure data platform: If I knew then what I know now
Bill Loconzolo (Intuit)
Data initiatives are often approached with a feast-or-famine mentality: go big and do it all or go home. Bill Loconzolo explains how established enterprises can build scalable, secure data pipelines that create connections between central data and product teams and enable business results that matter. Learn the framework Bill developed to realize your big data vision.
11:00am-11:40am (40m) Enterprise Adoption
10 concepts the enterprise decision maker needs to understand about Hadoop
Donald Miner (Miner & Kasch)
Figuring out Hadoop is daunting. However, understanding a set of basic yet important principles is all you need to cut through the hype and make intelligent enterprise decisions. Donald Miner breaks down modern Hadoop into 10 important principles you need to know to understand what Hadoop is and how it is different from the old way of doing things.
11:50am-12:30pm (40m) Enterprise Adoption Artificial intelligence, Machine learning
Old industries, sexy data: How machine learning is reshaping the world's backbone industries
David Beyer (Amplify Partners)
Over the past decade, machine learning has become intertwined with newer, Internet-born businesses, despite the fact that the vast majority of global GDP turns on larger, less visible industries like energy and construction. David Beyer explores the ways these backbone industries are adopting machine-intelligent applications and the trends underlying this shift.
1:50pm-2:30pm (40m) Enterprise Adoption
Self-service, interactive analytics at multipetabyte scale in capital markets regulation on the cloud
Scott Donaldson (FINRA), Matt Cardillo (FINRA)
Scott Donaldson and Matt Cardillo detail the security measures and system architecture needed to bring alive a multipetabyte data warehouse via interactive analytics and directed graphs from several trillions of market events, using HBase, EMR, Hive, Redshift, and S3 technologies in a cost-efficient manner.
2:40pm-3:20pm (40m) Enterprise Adoption
Strategies for agile instrumentation, ingestion, and analytics across many platforms and products
Yann Landrin-Schweitzer (Autodesk), Charlie Crocker (Autodesk)
Autodesk's next-gen analytics pipeline, based on SDKs, Kafka, Spark, and containers, will solve the problems of platform and product fragmentation, instrumentation quality, and ease of access to analytics. Yann Landrin-Schweitzer and Charlie Crocker explore the features that will enable teams to build reliable, high-quality usage analytics for Autodesk's products, autonomously and in mere minutes.
4:20pm-5:00pm (40m) Data Science & Advanced Analytics Artificial intelligence, Machine learning
Large-scale product classification via text and image-based signals using a fusion of discriminative and deep learning-based classifiers
Sreeni Iyer (Quad Analytix), Anurag Bhardwaj (Quad Analytix)
Typically, 8–10% of product URLs in ecommerce sites are misclassified. Sreeni Iyer and Anurag Bhardwaj discuss a machine-learning-based solution that relies on an innovative fusion of classifiers that are both text- and image-based, along with human touch to handle edge cases, to automatically classify product URLs according to a canonical taxonomic organization with a high F-score.
11:00am-11:40am (40m) Data-driven Business
Adopting analytics: The Autodesk journey
Adam Sugano (Autodesk)
Autodesk's transition to a subscription business model has caused the company to rethink how it interacts with and engages its customers. Adam Sugano details how, in a short period of time, Autodesk has executed numerous data science projects that have enhanced its capabilities to acquire, retain, and provide more value to its customers.
11:50am-12:30pm (40m) Data-driven Business Machine learning
Inside Cigna's big data journey
Jeffrey Shmain (Cloudera), Mohammad Quraishi (Cigna)
How do you implement Apache Hadoop in a large healthcare company with a mature data-analysis infrastructure? Jeffrey Shmain and Mohammad Quraishi describe Cigna's journey toward big data and Hadoop, including an overview of new Hadoop capabilities like heterogeneous data integration and large-scale machine learning.
1:50pm-2:30pm (40m) Data-driven Business
How big data is helping to save babies around the world
Linus Liang (Embrace), Brad Allen (Silicon Valley Data Science)
Linus Liang and Brad Allen explain how big data is helping Embrace save millions of babies around the world. Embrace invented the world's most affordable infant incubator, but the data it collects—from the hardest to reach and most rural parts of the world—will actually save more lives than the device will.
2:40pm-3:20pm (40m) Data-driven Business Machine learning
Publicly broadcasting data exhaust at a public broadcaster
Christopher Berry (Canadian Broadcasting Corporation)
The Canadian Broadcasting Corporation broadcasts a lot of digital content. And Canadians create a huge amount of data about that content. So how does a public broadcaster, of all entities, broadcast its data exhaust? Christopher Berry details the CBC's early experiments with importing a variant of the lean startup into a 79-year-old institution.
4:20pm-5:00pm (40m) Data-driven Business
Transforming Telefónica
John Belchamber (Telefónica), Arturo Canales (Telefónica)
Increasing competition and technological change are impelling the telco industry toward a new model of analytics. Telefónica has been at the forefront of this change, driving business transformation to a digital telco. John Belchamber and Arturo Canales tell the story of that transformation and detail the pitfalls and challenges faced by teams looking to follow a similar journey.
11:00am-11:40am (40m) Data Innovations
Netflix: Making big data small
Daniel Weeks (Netflix)
Netflix is exploring new avenues for data processing where traditional approaches fail to scale. Daniel Weeks explains how Netflix has enhanced its 25+ petabyte warehouse by combining Parquet's features with Presto and Spark to boost both ETL and interactive queries. Daniel explores how these approaches offer new ways to look at the relationship between storage and compute.
11:50am-12:30pm (40m) Data Innovations
Data applications and infrastructure at Coursera
Roshan Sumbaly (Facebook), Pierre Barthelemy (Coursera)
Coursera's platform allows 15 million learners to take courses from the best universities. Roshan Sumbaly and Pierre Barthelemy outline the pieces of Coursera's data infrastructure (streaming, data warehouse) that support its growing semi- and unstructured data requirements and explain how this ecosystem allows Coursera to build various instructor- and learner-side data products.
1:50pm-2:30pm (40m) Data Innovations
Architecting immediacy: The design of a high-performance, portable wrangling engine
Joe Hellerstein (UC Berkeley), Seshadri Mahalingam (Trifacta)
Seshadri Mahalingam and Joe Hellerstein discuss Photon, a high-performance data-transformation engine that provides immediacy to the data-wrangling experience, and demonstrate how to make the most of modern processors from both the browser and the desktop, with a focus on issues specific to the variety of big raw data, including heavy string manipulation and statistical data profiling.
2:40pm-3:20pm (40m) Data Innovations
Architecting distributed systems for failure: How Druid guarantees data availability
Fangjin Yang (Imply)
Running distributed systems in production can be tremendously challenging. Fangjin Yang covers common problems and failures with distributed systems and discusses design patterns that can be used to maintain data integrity and availability when everything goes wrong. Fangjin uses Druid as a real-world case study of how these patterns are implemented in an open source technology.
4:20pm-5:00pm (40m) Data Innovations Machine learning, Smart agents and human/machine augmentation
Secrets of natural language UIs: Translating English into computer actions
Joseph Turian (Workday), Alex Nisnevich (Bayes Impact)
Next-gen UIs will allow people to use plain English to interact with software. However, current published research focuses on abstract understanding, not on translating English into concrete software actions. Joseph Turian and Alex Nisnevich outline UPSHOT's English-to-SQL semantic parser and demonstrate how to build your own English-to-“your software application” parser.
11:00am-11:40am (40m) Spark & Beyond
Apache Spark and real-time analytics: From interactive queries to streaming
Michael Armbrust (Databricks)
Michael Armbrust explores real-time analytics with Spark from interactive queries to streaming.
11:50am-12:30pm (40m) Spark & Beyond
Taking Spark Streaming to the next level with DataFrames
Tathagata Das (Databricks)
Tathagata Das introduces Streaming DataFrames, the next evolution of Spark Streaming. Streaming DataFrames unify an additional dimension: interactive analysis. In addition, they provide enhanced support for out-of-order (delayed) data, zero-latency decision making, and integration with existing enterprise data warehouses.
1:50pm-2:30pm (40m) Spark & Beyond
Breaking Spark: Top 5 mistakes to avoid when using Apache Spark in production
Neelesh Salian (Stitch Fix)
Spark deployments have been growing for the past year. Neelesh Srinivas Salian explores common issues observed in clusters running Apache Spark and offers guidelines for setting up a real-world environment when planning an Apache Spark deployment in a cluster. Attendees can use these observations to improve the usability and supportability of Apache Spark in their projects.
2:40pm-3:20pm (40m) Hadoop Use Cases
Scalable schema management for Hadoop and Spark applications
Kelvin Chu (Uber), Evan Richards (Uber)
Schema plays a key role in the Hadoop architecture at Uber. Kelvin Chu and Evan Richards explain why schema is important and how it can make your Hadoop and Spark application more reliable and efficient.
4:20pm-5:00pm (40m) Spark & Beyond
Cancer genomics analysis in the cloud with Spark and ADAM
Timothy Danford (Tamr, Inc.)
To keep up with the DNA-sequencing-technology revolution, bioinformaticians need more-scalable tools for genomics analysis. Timothy Danford outlines one possible solution in a case study of a cancer genomics analysis pipeline implemented as part of the open source genomics software project, ADAM, which uses Apache Spark-generated abstractions executed on commodity computing infrastructure.
11:00am-11:40am (40m) IoT and Real-time Machine learning
Fast data made easy with Apache Kafka and Apache Kudu (incubating)
Ted Malaska (Capital One), Jeff Holoman (Cloudera)
Ted Malaska and Jeff Holoman explain how to go from zero to full-on time series and mutable-profile systems in 40 minutes. Ted and Jeff cover code examples of ingestion from Kafka and Spark Streaming and access through SQL, Spark, and Spark SQL to explore the underlying theories and design patterns that will be common for most solutions with Kudu.
11:50am-12:30pm (40m) Data Innovations
When one data center is not enough: Building large-scale stream infrastructure across multiple data centers with Apache Kafka
Guozhang Wang (Confluent)
You may have successfully made the transition from single machines and one-off solutions to large distributed stream infrastructures in your data center. But what if one data center is not enough? Guozhang Wang offers an overview of best practices for multi-data-center deployments, architecture guidelines for data replication, and disaster scenarios.
1:50pm-2:30pm (40m) IoT and Real-time Smart agents and human/machine augmentation
Transforming industrial enterprises with data science: From deterministic machines to probabilistic systems
Sean Murphy (PingThings)
Sean Murphy demonstrates how and why the power grid and other legacy industrials built on traditional engineering will be transformed from deterministic machines described by mathematical equations to probabilistic systems requiring streaming data and analytics. Sean demonstrates how to take an agile approach to the scientific method with big data and fuse the two approaches.
2:40pm-3:20pm (40m) Data Innovations
Building DistributedLog, a high-performance replicated log service
Sijie Guo (StreamNative)
DistributedLog is a high-performance replicated log service built on top of Apache BookKeeper that is the foundation of publish-subscribe at Twitter, serving traffic from transactional databases to real-time data analytic pipelines. Sijie Guo offers an overview of DistributedLog, detailing the technical decisions and challenges behind its creation and how it is used at Twitter.
4:20pm-5:00pm (40m) IoT and Real-time
Pulsar: Real-time analytics at scale leveraging Kafka, Kylin, and Druid
Tony Ng (WeWork)
Enterprises are increasingly demanding real-time analytics and insights. Tony Ng offers an overview of Pulsar, an open source real-time streaming system used at eBay, which can scale to millions of events per second with 4GL SQL-like language support. Tony explains how Pulsar integrates Kafka, Kylin, and Druid to provide flexibility and scalability in event and metrics consumption.
11:00am-11:40am (40m) Data Innovations Machine learning
Elasticsearch and Apache Lucene for Apache Spark and MLlib
Costin Leau (Elastic)
Costin Leau offers an overview of Elastic’s current efforts to enhance Elasticsearch's existing integration with Spark, going beyond Spark core and Spark SQL by focusing on text processing and machine learning to allow data processing and tokenizing to be combined with Spark's MLlib algorithms.
11:50am-12:30pm (40m) IoT and Real-time
Embeddable data transformation for real-time streams
Joey Echeverria (Rocana)
Real-time analysis starts with transforming raw data into structured records. Typically this is done with bespoke business logic custom written for each use case. Joey Echeverria presents a configuration-based, reusable library for data transformation that can be embedded in real-time stream-processing systems and demonstrates its real-world use cases with Apache Kafka and Apache Hadoop.
1:50pm-2:30pm (40m) Data Innovations Machine learning
Toppling the mainframe: Enterprise-grade streaming under 2 ms on Hadoop
Ilya Ganelin (Capital One Data Innovation Lab)
What if we have reached the point where open source can handle massively difficult streaming problems with enterprise-grade durability? Ilya Ganelin presents Capital One’s novel solution for real-time decisioning on Apache Apex. Ilya shows how Apex provides unique capabilities that ensure less than 2 ms latency in an enterprise-grade solution on Hadoop.
2:40pm-3:20pm (40m) IoT and Real-time
Apache Flink: Streaming done right
Kostas Tzoumas (data Artisans)
Apache Flink is a full-featured streaming framework with high throughput, millisecond latency, strong consistency, support for out-of-order streams, and support for batch as a special case of streaming. Kostas Tzoumas gives an overview of Flink and its streaming-first philosophy, as well as the project roadmap and vision: fully unifying the worlds of “batch” and “streaming” analytics.
4:20pm-5:00pm (40m) IoT and Real-time
Scaling your business with a messaging platform on the Zeta Architecture
Jim Scott (NVIDIA)
The Zeta Architecture is an enterprise architecture that moves beyond the data lake. The most logical way to scale applications across tiers is to put a messaging platform between the tiers, which makes it far simpler to scale communication among applications. Jim Scott covers the benefits of this model and offers an example of data-center monitoring.
11:00am-11:40am (40m) Ask Me Anything
Ask me anything: Apache Hadoop operations for production systems
Kathleen Ting (Cloudera), Vikram Srivastava (Cloudera), Darren Lo (Cloudera), Jordan Hambleton (Cloudera)
Kathleen Ting, Vikram Srivastava, Darren Lo, and Jordan Hambleton, the instructors of the full-day tutorial Apache Hadoop Operations for Production Systems, field a wide range of detailed questions. Even if you don’t have a specific question, join in to hear what others are asking.
11:50am-12:30pm (40m) Ask Me Anything
Ask me anything: Hadoop application architectures
Mark Grover (Lyft), Jonathan Seidman (Cloudera), Ted Malaska (Capital One), Gwen Shapira (Confluent)
Mark Grover, Jonathan Seidman, Ted Malaska, and Gwen Shapira, the authors of Hadoop Application Architectures, participate in an open Q&A session on considerations and recommendations for the architecture and design of applications using Hadoop. Come with questions about your use case and its big data architecture or just listen in on the conversation.
1:50pm-2:30pm (40m) Ask Me Anything
Ask me anything: Apache Spark
Reynold Xin (Databricks), Tathagata Das (Databricks), Michael Armbrust (Databricks)
Join the Spark team for an informal Q&A session. Apache Spark architects Reynold Xin, Tathagata Das, and Michael Armbrust will be on hand to field a wide range of detailed questions. Even if you don’t have a specific question, join in to hear what others are asking.
2:40pm-3:20pm (40m) Ask Me Anything
Ask me anything: Apache Kafka
Joseph Adler (Facebook), Ewen Cheslack-Postava (Confluent), Jun Rao (Confluent), Jesse Anderson (Big Data Institute), Neha Narkhede (Confluent)
Joseph Adler, Ewen Cheslack-Postava, Jun Rao, Jesse Anderson, and Neha Narkhede, the instructors of the Apache Kafka tutorials, field a wide range of detailed questions. Even if you don’t have a specific question, join in to hear what others are asking.
4:20pm-5:00pm (40m) Ask Me Anything
Ask me anything: Developing a modern enterprise data strategy
John Akred (Silicon Valley Data Science), Scott Kurth (Silicon Valley Data Science), Colette Glaeser (Silicon Valley Data Science)
John Akred, Scott Kurth, and Colette Glaeser field a wide range of detailed questions about developing a modern enterprise data strategy. Even if you don’t have a specific question, join in to hear what others are asking.
11:00am-11:40am (40m) Security
Governance for custom Hadoop applications via the enterprise (meta)data hub
Chang She (Cloudera)
Many third-party apps are built on top of the Hadoop platform for data ingest, ETL, analytics, and predictive modeling. These services/apps need a data-governance layer for security and compliance, but it is often burdensome for each individual app to build its own. Chang She describes the challenges in building an extensible metadata layer that serves common governance needs for Hadoop.
11:50am-12:30pm (40m) Hadoop Use Cases
Big data for telcos: A trio of use cases
Amy O'Connor (Cloudera)
Telcos are graduating from exploring Hadoop’s technical capabilities to implementing full-blown, multiworkload data hubs at the heart of their operations. The world’s leading telcos are delivering compelling results for strategic use cases that leverage big data solutions. Amy O'Connor explores three key case studies that showcase these successes.
1:50pm-2:30pm (40m) Hadoop Internals & Development
Format wars: From VHS and Beta to Avro and Parquet
Silvia Oliveros (Silicon Valley Data Science), Stephen O'Sullivan (Data Whisperers)
You have your Hadoop cluster, and you are ready to fill it up with data. But wait! Which format should you use to store your data? Should you store it in plain text, SequenceFile, Avro, or Parquet? (And should you compress it?) Silvia Oliveros and Stephen O'Sullivan cover the hows, whys, and whens of choosing one format over another and take a closer look at some of the tradeoffs each offers.
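One axis of the format decision the talk covers — whether to compress — can be sketched with only the standard library. This is an illustrative toy (the sample rows are hypothetical, and real deployments would use Avro or Parquet via third-party libraries rather than raw CSV), but it shows the basic size-versus-splittability tradeoff:

```python
import csv
import gzip
import io

# Build a small sample dataset in memory (hypothetical clickstream rows).
rows = [("user_%d" % i, "page_%d" % (i % 10), i * 7 % 100) for i in range(10000)]

# Plain-text CSV: human-readable and splittable, but verbose on disk.
buf = io.StringIO()
writer = csv.writer(buf)
writer.writerow(["user", "page", "dwell_seconds"])
writer.writerows(rows)
plain = buf.getvalue().encode("utf-8")

# The same bytes gzip-compressed: much smaller, but whole-file gzip is not
# splittable, so a single mapper must read the entire file.
compressed = gzip.compress(plain)

ratio = len(compressed) / len(plain)
print(f"plain: {len(plain)} bytes, gzip: {len(compressed)} bytes, ratio: {ratio:.2f}")
```

Columnar formats like Parquet sidestep this tradeoff by compressing per column chunk, which is part of why the "which format?" question has no one-size-fits-all answer.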
2:40pm-3:20pm (40m) Data Innovations
Did you accidentally build a database?
Spencer Kimball (Cockroach Labs)
Often without realizing it, companies spend significant resources engineering new databases. The need to combine traditional relational datasets with new operational and historical data leads to sharded RDBMS or hybridized RDBMS and NoSQL systems, typically leaving few of the constituent database guarantees intact. Spencer Kimball introduces CockroachDB, an open source, scale-out SQL database.
4:20pm-5:00pm (40m) Enterprise Adoption
Analyzing drivers of Net Promoter Score and their impact on customer engagement in the OTA industry
Krishnan Venkata (LatentView Analytics), Jose Abelenda (Hotwire)
While organizations understand the importance of customer satisfaction, quantifying its impact on future engagement is a surprisingly hard analytical problem (most rely on Net Promoter Scores). Krishnan Venkata and Jose Abelenda explain how Hotwire used big data to put a dollar figure on promoter/detractor behavior to help the organization objectively prioritize customer-engagement initiatives.
11:00am-11:40am (40m) Data Innovations
Deploying Hadoop on user namespace containers
Abin Shahab (Altiscale)
Abin Shahab walks attendees through Altiscale's Docker deployment strategy, describes the design decisions behind it, and discusses the issues encountered and fixed along the way.
11:50am-12:30pm (40m) Data Innovations | Smart agents and human/machine augmentation
Lessons learned building a scalable self-serve, real-time, multitenant monitoring service at Yahoo
Sumeet Singh (Yahoo), Mridul Jain (Yahoo)
Building a real-time monitoring service that handles millions of custom events per second while satisfying complex rules, varied throughput requirements, and numerous dimensions simultaneously is a complex endeavor. Sumeet Singh and Mridul Jain explain how Yahoo approached these challenges with Apache Storm Trident, Kafka, HBase, and OpenTSDB and discuss the lessons learned along the way.
1:50pm-2:30pm (40m) IoT and Real-time
Twitter Heron at scale
Karthik Ramasamy (Twitter)
Heron, Twitter's streaming system, has been in production for nearly two years and is widely used by several teams for diverse use cases. Karthik Ramasamy discusses Twitter's operating experiences and shares the challenges of running Heron at scale, as well as the approaches Twitter took to solve them.
2:40pm-3:20pm (40m) Hadoop Use Cases
How the oil and gas industry is igniting a spark with information fusion and metadata analytics
Brian Clark (Objectivity), Marco Ippolito (CGG GeoSoftware)
Oil and gas organizations are at the forefront of big data, adopting technologies such as Hadoop and Spark to develop next-generation fusion systems. Brian Clark and Marco Ippolito present a case study from CGG, which built a common data model to drive analytics of sensor data and associated metadata from fast-changing big data streams, showing how to derive richer value from big data assets.
4:20pm-5:00pm (40m) Hadoop Use Cases
High-performance clickstream analytics with Apache Phoenix and HBase
Arun Thangamani (CDK)
Traditional data-warehousing techniques are sometimes limited by the scalability of the implementation tools themselves. Arun Thangamani explains how the advanced architectural approaches by tools like Apache Phoenix and HBase allow new, highly scalable live-analytics solutions using the same traditional techniques and showcases a successful implementation at CDK.
11:00am-11:40am (40m) Sponsored
Demonstrating the art of the possible with Spark
David Taieb (IBM), Mythili Venkatakrishnan (IBM)
Did you know Apache Spark is helping transform industries, companies, and your everyday life? David Taieb and Mythili Venkatakrishnan demonstrate two use cases of how Apache Spark is being used to harness valuable insights from complex data across cloud and hybrid environments.
11:50am-12:30pm (40m)
Overcoming the top 5 hurdles to real-time analytics
Pat McGarry (Ryft)
High-velocity, high-volume, and high-variety data streams challenge analytics organizations because the ability to get critical insights often decays rapidly. Pat McGarry explains how organizations that embrace heterogeneous computing techniques can overcome hurdles to real-time insights, thereby gaining significant competitive advantages.
1:50pm-2:30pm (40m) Sponsored
TensorFlow: Large-scale analytics and distributed machine learning with TensorFlow, BigQuery, and Dataflow (Apache Beam)
Kaz Sato (Google), Amy Unruh (Google)
Kazunori Sato and Amy Unruh explore how you can use TensorFlow to drive large-scale distributed machine learning against your analytic data sitting in Google BigQuery, with data preprocessing driven by Dataflow (now Apache Beam). Kazunori and Amy dive into practical examples of how these technologies can work together to enable a powerful workflow for distributed machine learning.
2:40pm-3:20pm (40m) Sponsored
Transforming core business operations with SAP HANA Vora on Hadoop and Apache Spark
Amit Satoor (SAP), Balalji Krishna (SAP)
Join the SAP team for a demonstration of how OLAP on Hadoop and real-time query federation help unify enterprise and big data, using SAP's new big data solution, SAP HANA Vora. Amit Satoor and Balalji Krishna explore real-world use cases where instant insights from a combination of operational and Hadoop data impact core business operations.
11:00am-11:40am (40m) Sponsored
How we Hadoop: Inmar’s transformation from a business-services outsourcing company to a data-driven enterprise
Kevin Goode (Inmar)
Inmar handles 3.7 billion transactions annually. Kevin Goode explains Inmar's transformation, starting in 2012, from a business-services company to a data-driven enterprise using Hadoop.
11:50am-12:30pm (40m) Sponsored
Building a modern data architecture
Ben Sharma (Zaloni)
When building your data stack, the architecture could be your biggest challenge. Yet it could also be the best predictor for success. With so many elements to consider and no proven playbook, where do you begin to assemble best practices for a scalable data architecture? Ben Sharma offers lessons learned from the field to get you started.
1:50pm-2:30pm (40m) Sponsored
Solr as a SparkSQL datasource
TJ Potter (Lucidworks)
Solr has been adopted by all major Hadoop platform vendors as the de facto standard for big data search. Timothy Potter introduces an open source project that exposes Solr as a SparkSQL datasource. Timothy offers common use cases, access to open source code, and performance metrics to help you develop your own large-scale search and discovery solution.
2:40pm-3:20pm (40m) Sponsored
Batch is back: Critical for agile application adoption
Joe Goldberg (BMC Software)
Joseph Goldberg discusses the attributes required of a batch management platform that can accelerate development by enabling programmers to generate workflows as code, support continuous deployment with rich APIs and lightweight workflow-scheduling infrastructure, and optimize production with comprehensive enterprise operational capabilities like SLA management and full log and output management.
11:00am-11:40am (40m) Sponsored
Master the Internet of Things with integrated analytics
Bob Rogers (Intel)
Join Bob Rogers, Intel’s chief data scientist for big data solutions, and special guests to see how Intel’s open source Trusted Analytics Platform has accelerated and simplified the development of powerful analytics that are changing the game.
11:50am-12:30pm (40m) Sponsored
Filling the data lake
Chuck Yarbrough (Pentaho), Mark Burnette (Pentaho, a Hitachi Group Company)
A major challenge in today’s world of big data is getting data into the data lake in a simple, automated way. Coding scripts for disparate sources is time consuming and difficult to manage. Developers need a process that supports disparate sources by detecting and passing metadata automatically. Chuck Yarbrough and Mark Burnette explain how to simplify and automate your data ingestion process.
1:50pm-2:30pm (40m) Sponsored
Big data-fueled feedback loops leveraging streaming data in SDN/NFV
Matt Olson (CenturyLink)
Software-defined networking (SDN) and network functions virtualization (NFV) hold tremendous potential to enable efficiency and flexibility in service delivery, but SDN/NFV environments are also highly complex and multilayered. Matt Olson explains why effective support for SDN/NFV services requires leveraging the tremendous amount of service and data streaming from the platform.
2:40pm-3:20pm (40m) Sponsored
Turn big data into big results
Jeff Pohlmann (Oracle)
Jeff Pohlmann explores the skills, challenges, and solutions necessary to turn big data into big results. Learn more effective ways to increase productivity and decrease costs, aid in the allocation of key personnel and resources, better determine the true sentiment of customers, determine the impact of changing processes on production, and help solve a host of other needs.
11:00am-11:40am (40m) Sponsored
High-frequency decisioning
Steve Wooledge (MapR Technologies)
In order to remain competitive, you need to be able to respond to changing conditions in the moment. New stream-based technologies allow you to build applications that incorporate low-latency processing so you can stream data immediately or whenever you’re ready. Steve Wooledge explores how new streaming technologies make this approach work and how they can be applied in many industries.
11:50am-12:30pm (40m) Sponsored
Automated model selection and tuning at scale with Spark
Peter Prettenhofer (DataRobot), Owen Zhang (DataRobot)
Effective and efficient model selection and tuning is crucial for building machine-learning systems, but large-scale machine-learning problems require us to rethink the model-selection and tuning process. Peter Prettenhofer and Owen Zhang outline the tradeoffs we need to make and demonstrate how to efficiently search and tune complex machine-learning pipelines in MLlib.
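In Spark MLlib, exhaustive tuning of this kind is expressed with `ParamGridBuilder` and `CrossValidator`; since a Spark cluster is out of scope here, the core idea can be sketched in plain Python. The `validation_score` function below is a hypothetical stand-in for training a model and scoring it on held-out data:

```python
from itertools import product

def validation_score(params):
    """Stand-in for fitting a model and scoring it on a validation set.
    A synthetic score surface with its optimum at alpha=0.1, depth=5."""
    alpha, depth = params["alpha"], params["depth"]
    return -((alpha - 0.1) ** 2) - 0.01 * (depth - 5) ** 2

def grid_search(grid):
    """Exhaustively evaluate every hyperparameter combination, keep the best."""
    names = sorted(grid)
    best_params, best_score = None, float("-inf")
    for values in product(*(grid[n] for n in names)):
        params = dict(zip(names, values))
        score = validation_score(params)
        if score > best_score:
            best_params, best_score = params, score
    return best_params, best_score

grid = {"alpha": [0.01, 0.1, 1.0], "depth": [3, 5, 7]}
best, score = grid_search(grid)
print(best)  # {'alpha': 0.1, 'depth': 5}
```

The tradeoff the speakers highlight is visible even here: the grid grows multiplicatively with each hyperparameter, which is what forces smarter search strategies at scale.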
1:50pm-2:30pm (40m) Sponsored
Delivering "DARPA hard"
Matthew Van Adelsberg (CACI)
The Defense Advanced Research Projects Agency (DARPA) is synonymous with transformational change, developing the seemingly impossible into the practical. Matthew Van Adelsberg demonstrates how collaborative teams of SMEs, data scientists, and engineers have been organized to achieve “DARPA hard” results for nearly a decade and offers insights into how companies can do the same.
2:40pm-3:20pm (40m) Sponsored
Virtualizing big data: Effective approaches from real-world deployments
Martin Yip (VMware), Justin Murray (VMware)
Martin Yip and Justin Murray explore the benefits of virtualization of Hadoop on vSphere and delve into three different examples of real-world deployments—at small, medium, and large scales—to demonstrate how enterprises are currently deploying Hadoop differently on virtual machines.
8:45am-8:50am (5m)
Thursday keynote welcome
Roger Magoulas (O'Reilly Media), Doug Cutting (Cloudera), Alistair Croll (Solve For Interesting)
Strata + Hadoop World program chairs Roger Magoulas, Doug Cutting, and Alistair Croll welcome you to the second day of keynotes.
8:50am-9:00am (10m)
Apache Hadoop meets cybersecurity
Tom Reilly (Cloudera), Alan Ross (Intel)
The cybersecurity landscape is quickly changing, and Apache Hadoop is becoming the analytics and data management platform of choice for cybersecurity practitioners. Tom Reilly and Alan Ross explain why organizations are turning toward the open source ecosystem to break down traditional cybersecurity analytics and data constraints in order to detect a new breed of sophisticated attacks.
9:00am-9:15am (15m)
Thinking like a Bayesian
Julia Galef (Center for Applied Rationality)
Julia Galef explores why Bayesian thinking is so different from what we do by default and outlines the most important principles of thinking like a Bayesian.
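The core move of Bayesian thinking — updating a prior belief in proportion to the evidence — is often illustrated with the diagnostic-test calculation; the numbers below are purely illustrative:

```python
def posterior(prior, sensitivity, false_positive_rate):
    """Probability of the hypothesis given one positive test, via Bayes' rule:
    P(H|+) = P(+|H)P(H) / [P(+|H)P(H) + P(+|~H)P(~H)]"""
    numerator = sensitivity * prior
    denominator = numerator + false_positive_rate * (1 - prior)
    return numerator / denominator

# A rare condition (1% base rate) and a fairly accurate test (90% sensitivity,
# 5% false-positive rate) still yield a modest posterior after one positive:
p = posterior(prior=0.01, sensitivity=0.90, false_positive_rate=0.05)
print(f"posterior after one positive test: {p:.3f}")  # roughly 0.154
```

The counterintuitive result — a "positive" test leaving the hypothesis still unlikely — is exactly the kind of departure from default reasoning the talk addresses.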
9:15am-9:25am (10m) Sponsored
Connected brains
Joseph Sirosh (Compass), Kai Miller (Stanford University)
Joseph Sirosh offers a fascinating look at how brains connected to the cloud via sensors, combined with machine learning, could revolutionize a field of medicine.
9:25am-9:40am (15m)
Building practical AI systems
Adam Cheyer (Samsung)
As a technical founder at Siri, Sentient, and Viv Labs, Adam Cheyer has helped design and develop a number of intelligent systems solving real-world problems for hundreds of millions of users. Drawing on specific examples, Adam reveals some of the techniques he uses to maximize the impact of the AI technologies he employs.
9:40am-9:45am (5m) Sponsored
Advanced analytics and the mystery of the missing jeans
Bob Rogers (Intel)
Bob Rogers, Intel's chief data scientist for big data solutions, demonstrates the power of the question in analytics. Learn how different types of data, from cubes of structured data to live video streams from mobile systems, combine with analytical technology to inform the questions that can be answered.
9:45am-9:55am (10m)
What's next for BDAS (the Berkeley Data Analytics Stack)?
Michael Franklin (AMPLab/UC Berkeley)
Michael Franklin offers an overview of the Berkeley Data Analytics Stack, outlines the current directions it's taking, and settles once and for all how BDAS should be pronounced.
9:55am-10:00am (5m) Sponsored
Open by design, open for data
Adam Kocoloski (IBM)
As the volume and variety of data continue to grow, organizations have the opportunity to transform their industries and professions, but companies are grappling with how to deliver innovation. Adam Kocoloski shares his experience around this market shift and challenges attendees to join his mission of contributing to the community and investing in the power of open source and the cloud.
10:00am-10:25am (25m)
Nonsense science
Paula Poundstone (Star of NPR's #1 radio show, "Wait Wait...Don't Tell Me")
Paula Poundstone isn’t just any comedian. After years of justly criticizing and questioning the purpose of the many studies used for questions on NPR’s #1 show, Wait Wait...Don’t Tell Me, on which she's a popular panelist, she’s here to explore what we can learn about asking the right questions from a unique critique of published behavioral research.
10:30am-11:00am (30m)
Break: Morning Break sponsored by IBM
12:30pm-1:50pm (1h 20m) Event
Thursday BoF Tables
Birds of a Feather (BoF) discussions are a great way to informally network with people in similar industries or interested in the same topics.
3:20pm-4:20pm (1h)
Break: Afternoon Break sponsored by Intel
5:00pm-6:00pm (1h) Event
Ice Cream Social
Join attendees, speakers, and exhibitors as we end the conference on a sweet note with some ice cream.
8:00am-8:45am (45m)
Break: Coffee Break