Presented By O'Reilly and Cloudera
Make Data Work
September 26–27, 2016: Training
September 27–29, 2016: Tutorials & Conference
New York, NY
 
1 E 07/1 E 08
11:20am Parquet performance tuning: The missing guide Ryan Blue (Netflix)
1:15pm The evolution of massive-scale data processing Tyler Akidau (Google)
2:55pm Smart data for smarter firefighters Bart van Leeuwen (Netage)
4:35pm An introduction to Druid Fangjin Yang (Imply)
1 E 10/1 E 11
1:15pm Data risk intelligence in a regulated world Uma Raghavan (Integris Software)
2:05pm Model visualization Amit Kapoor (narrativeVIZ)
2:55pm Corporate strategy: Artificial intelligence or bust Stephen Pratt (Noodle.ai)
4:35pm Five-senses data: Using your senses to improve data signal and value Cameron Turner (The Data Guild), Brad Sarsfield (Microsoft HoloLens), Hanna Kang-Brown (R/GA), Evan Macmillan (Gridspace)
1 E 12/1 E 13
2:05pm How GE analyzes billions of mission-critical events in real time using Apache Apex, Spark, and Kudu Venkatesh Sivasubramanian (GE Digital), Luis Ramos (GE Digital)
1 E 15/1 E 16
11:20am Helping computers help us see Susan Etlinger (Altimeter Group)
2:05pm Breeding data scientists: A four-year study Danielle Dean (iRobot), Amy O'Connor (Cloudera)
3D 12
11:20am Tackling machine-learning complexity for data curation Ihab Ilyas (University of Waterloo)
2:05pm Lessons learned running Hadoop and Spark in Docker Thomas Phelan (HPE BlueData)
River Pavilion
1:15pm Streaming cybersecurity into Graph: Accelerating data into Datastax Graph and Blazegraph Joshua Patterson (NVIDIA), Mike Wendt (NVIDIA), Keith Kraus (NVIDIA)
2:05pm Securing Apache Kafka Jun Rao (Confluent)
4:35pm Apache Kudu: 1.0 and beyond Todd Lipcon (Cloudera)
3D 10
1:15pm Recent developments in SparkR for advanced analytics Xiangrui Meng (Databricks)
2:05pm Fast deep learning at your fingertips Amitai Armon (Intel), Nir Lotan (Intel)
4:35pm Data modeling for microservices with Cassandra and Spark Jeffrey Carpenter (DataStax)
3D 08
11:20am Data science and the Internet of Things: It's just the beginning Mike Stringer (Datascope Analytics)
1:15pm Shifting cities: A case study in data visualization Brian Kahn (Climate Central), Edward Wisniewski (Radish Lab)
2:55pm Machine intelligence at Google scale Kaz Sato (Google)
4:35pm Amazon Kinesis: Real-time streaming data in the AWS cloud Roy Ben-Alta (Amazon Web Services)
Hall 1C
2:05pm A data-driven approach to the US presidential election Amir Hajian (Thomson Reuters), Khaled Ammar (Thomson Reuters), Alex Constandache (Thomson Reuters)
2:55pm Evaluating models for a needle in a haystack: Applications in predictive maintenance Danielle Dean (iRobot), Shaheen Gauher (Microsoft)
4:35pm Predicting patent litigation Josh Lemaitre (Thomson Reuters)
Hall 1B
11:20am A deep dive into Structured Streaming in Spark Ram Sriharsha (Databricks)
2:05pm Spark Structured Streaming for machine learning Holden Karau (Independent), Seth Hendrickson (Cloudera)
2:55pm Choice Hotels' journey to better understand its customers through self-service analytics Narasimhan Sampath (Choice Hotels International), Avinash Ramineni (Clairvoyant)
4:35pm Spark and Java: Yes, they work together Jesse Anderson (Big Data Institute)
1 C03
11:20am Ask me anything: Deep learning with TensorFlow Martin Wicke (Google), Joshua Gordon (Google)
1:15pm Ask me anything: Hadoop application architectures Mark Grover (Lyft), Jonathan Seidman (Cloudera), Ted Malaska (Capital One)
2:55pm Ask me anything: The state of Spark Ram Sriharsha (Databricks), Xiangrui Meng (Databricks)
4:35pm Ask me anything: Developing a modern enterprise data strategy John Akred (Silicon Valley Data Science), Stephen O'Sullivan (Data Whisperers), Julie Steele (Manifold)
1 C04 / 1 C05
2:05pm Big data journeys from the real world John Morrell (Datameer)
4:35pm Rethinking operational data stores on Hadoop Vinayak Borkar (X15 Software)
1 E 09
11:20am 5 cloud AI innovations Rimma Nehme (Microsoft)
1:15pm BigQuery for data warehousing Chad W. Jennings (Google)
4:35pm Ask me anything: Apache Kafka Jun Rao (Confluent), Ewen Cheslack-Postava (Confluent)
1 E 14
2:05pm Sensitive data sharing for analytics Steven Touw (Immuta)
2:55pm Ask me anything: White House Office of Science and Technology Policy DJ Patil (White House Office of Science and Technology Policy), Lynn Overmann (Office of the Chief Technology Officer)
1B 01/02
11:20am ODPi: The foundation for cross-distribution interoperability Berni Schiefer (IBM), Susan Malaika
1:15pm Accelerate EDW modernization with the Hadoop ecosystem Joe Goldberg (BMC Software)
1B 03/04
11:20am How an open analytics ecosystem became a lifesaver Douglas Liming (SAS Institute Inc.)
2:05pm Path-to-purchase analytics using a data lake and Spark Joe Caserta (Caserta Concepts)
2:55pm Neptune: A machine-learning platform for experiment management Mariusz Gadarowski (deepsense.io)
Javits North
8:50am Thursday keynotes Roger Magoulas (O'Reilly Media), Doug Cutting (Cloudera), Alistair Croll (Solve For Interesting)
8:55am Hadoop in the cloud: A Nielsen use case Tom Reilly (Cloudera), James Powell (Nielsen)
9:05am Inbox is the Trojan horse of AI Alistair Croll (Solve For Interesting)
9:10am Connected eyes Joseph Sirosh (Compass)
9:20am The tech behind the biggest journalism leak in history Mar Cabra (International Consortium of Investigative Journalists)
9:30am Business insights driven by speed Raghunath Nambiar (Cisco)
9:35am Google BigQuery for enterprise Chad W. Jennings (Google)
9:40am Data science: A view from the White House DJ Patil (White House Office of Science and Technology Policy), Lynn Overmann (Office of the Chief Technology Officer)
10:05am From big data to human-level artificial intelligence Gary Marcus (Geometric Intelligence)
10:20am SAS: More open than you might think Paul Kent (SAS)
10:25am Statistics, machine learning, and the crazy 2016 election Sam Wang (Princeton University)
8:00am Coffee Break | Room: Break
10:50am Morning Break sponsored by SAS | Room: Hall 3 A/B
12:00pm Lunch sponsored by Microsoft | Thursday BoF Tables | Room: Hall 3 A/B
3:35pm Afternoon Break sponsored by IBM | Room: Hall 3 A/B
11:20am-12:00pm (40m) Data innovations
Parquet performance tuning: The missing guide
Ryan Blue (Netflix)
Netflix is exploring new avenues for data processing where traditional approaches fail to scale. Ryan Blue explains how Netflix is building on Parquet to enhance its 40+ petabyte warehouse, combining Parquet's features with Presto and Spark to boost ETL and interactive queries. Information about tuning Parquet is hard to find. Ryan shares what he's learned, creating the missing guide you need.
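By way of illustration only (this sketch is not from the session), Parquet tuning of the kind the talk covers is often applied when writing files from Spark; the option values, column names, and S3 paths below are placeholders, not recommendations from the speaker.

import org.apache.spark.sql.Dataset;
import org.apache.spark.sql.Row;
import org.apache.spark.sql.SparkSession;

public class ParquetTuningSketch {
    public static void main(String[] args) {
        SparkSession spark = SparkSession.builder()
                .appName("parquet-tuning-sketch")
                // Row-group size governs the unit of I/O and predicate pushdown (value is illustrative).
                .config("spark.hadoop.parquet.block.size", 128L * 1024 * 1024)
                // Snappy keeps decompression cheap for interactive engines such as Presto.
                .config("spark.sql.parquet.compression.codec", "snappy")
                .getOrCreate();

        Dataset<Row> events = spark.read().parquet("s3://example-bucket/events/raw"); // placeholder path

        // Sorting within partitions improves dictionary and run-length encoding,
        // which shrinks files and speeds up scans.
        events.sortWithinPartitions("event_type", "event_time") // placeholder columns
              .write()
              .mode("overwrite")
              .parquet("s3://example-bucket/events/tuned");     // placeholder path

        spark.stop();
    }
}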
1:15pm-1:55pm (40m) Data innovations
The evolution of massive-scale data processing
Tyler Akidau (Google)
Tyler Akidau offers a whirlwind tour of the conceptual building blocks of massive-scale data processing systems over the last decade, comparing and contrasting systems at Google with popular open source systems in use today.
2:05pm-2:45pm (40m) Data innovations
Streaming analytics at 300 billion events per day with Kafka, Samza, and Druid
Xavier Léauté (Confluent)
Ever wondered what it takes to scale Kafka, Samza, and Druid to handle complex, heterogeneous analytics workloads at petabyte size? Xavier Léauté discusses his experience scaling Metamarkets's real-time processing to over 3 million events per second and shares the challenges encountered and lessons learned along the way.
2:55pm-3:35pm (40m) Data innovations
Smart data for smarter firefighters
Bart van Leeuwen (Netage)
Smart data allows fire services to better protect the people they serve and keep their firefighters safe. The combination of open and nonpublic data used in a smart way generates new insights both in preparation and operations. Bart van Leeuwen discusses how the fire service is benefiting from open standards and best practices.
4:35pm-5:15pm (40m) Data innovations
An introduction to Druid
Fangjin Yang (Imply)
Cluster computing frameworks such as Hadoop or Spark are tremendously beneficial in processing and deriving insights from data. However, long query latencies make these frameworks suboptimal choices to power interactive applications. Fangjin Yang discusses using Druid for analytics and explains why the architecture is well suited to power analytic dashboards.
11:20am-12:00pm (40m) Visualization & user experience
Caravel: An open source data exploration and visualization platform
Maxime Beauchemin (Lyft)
Airbnb developed Caravel to provide all employees with interactive access to data while minimizing friction. Caravel's main goal is to make it easy to slice, dice, and visualize data. Maxime Beauchemin explains how Caravel empowers each and every employee to perform analytics at the speed of thought.
1:15pm-1:55pm (40m) IoT & real-time, Visualization & user experience
Data risk intelligence in a regulated world
Uma Raghavan (Integris Software)
Uma Raghavan explains why you're about to see companies whose business models depend on using their customers' data, like Facebook, Google, and many others, scramble to keep up with the flood of new and evolving laws on data privacy.
2:05pm-2:45pm (40m) Data science & advanced analytics
Model visualization
Amit Kapoor (narrativeVIZ)
Though visualization is used in data science to understand the shape of the data, it's not widely used for statistical models, which are evaluated based on numerical summaries. Amit Kapoor explores model visualization, which aids in understanding the shape of the model, the impact of parameters and input data on the model, the fit of the model, and where it can be improved.
2:55pm-3:35pm (40m) Data-driven business
Corporate strategy: Artificial intelligence or bust
Stephen Pratt (Noodle.ai)
Stephen Pratt, the CEO of Noodle.ai and former head of Watson for IBM GBS, presents a shareholder value perspective on why enterprise artificial intelligence (eAI) will be the single largest competitive differentiator in business over the next five years—and what you can do to end up on top.
4:35pm-5:15pm (40m) Visualization & user experience
Five-senses data: Using your senses to improve data signal and value
Cameron Turner (The Data Guild), Brad Sarsfield (Microsoft HoloLens), Hanna Kang-Brown (R/GA), Evan Macmillan (Gridspace)
Data should be something you can see, feel, hear, taste, and touch. Drawing on real-world examples, Cameron Turner, Brad Sarsfield, Hanna Kang-Brown, and Evan Macmillan cover the emerging field of sensory data visualization, including data sonification, and explain where it's headed in the future.
11:20am-12:00pm (40m) IoT & real-time
Implementing extreme scaling and streaming in finance
Jim Scott (NVIDIA)
Jim Scott outlines the core tenets of a message-driven architecture and explains its importance in real-time big data-enabled distributed systems within the realm of finance.
1:15pm-1:55pm (40m) IoT & real-time
When one data center is not enough: Building large-scale stream infrastructures across multiple data centers with Apache Kafka
Ewen Cheslack-Postava (Confluent)
You may have successfully made the transition from single machines and one-off solutions to large, distributed stream infrastructures in your data center. But what if one data center is not enough? Ewen Cheslack-Postava explores resilient multi-data-center architecture with Apache Kafka, sharing best practices for data replication and mirroring as well as disaster scenarios and failure handling.
2:05pm-2:45pm (40m) IoT & real-time
How GE analyzes billions of mission-critical events in real time using Apache Apex, Spark, and Kudu
Venkatesh Sivasubramanian (GE Digital), Luis Ramos (GE Digital)
Opportunities in the industrial world are expected to outpace consumer business cases. Time series data is growing exponentially as new machines get connected. Venkatesh Sivasubramanian and Luis Ramos explain how GE makes it faster and easier for systems to access (using a common layer) and perform analytics on a massive volume of time series data by combining Apache Apex, Spark, and Kudu.
2:55pm-3:35pm (40m) IoT & real-time
How to achieve zero-latency IoT and FSI data processing with Spark
Yaron Haviv (iguaz.io)
Yaron Haviv explains how to design real-time IoT and FSI applications, leveraging Spark with advanced data frame acceleration. Yaron then presents a detailed, practical use case, diving deep into the architectural paradigm shift that makes the powerful processing of millions of events both efficient and simple to program.
4:35pm-5:15pm (40m) IoT & real-time
Stream analytics in the enterprise: A look at Intel’s internal IoT implementation
Moty Fania (Intel)
Moty Fania shares Intel’s IT experience implementing an on-premises IoT platform for internal use cases. The platform was designed as a multitenant platform with built-in analytical capabilities and based on open source big data technologies and containers. Moty highlights the lessons learned from this journey with a thorough review of the platform’s architecture.
11:20am-12:00pm (40m)
Helping computers help us see
Susan Etlinger (Altimeter Group)
The history of the digital age is being written in photographs. To innovate in the visual age, we have to crack the visual code. Susan Etlinger explores how the ability to understand why one photo resonates and another doesn’t can make or break reputations, spark new products or lines of business, and make or save millions of dollars.
1:15pm-1:55pm (40m) Data-driven business
AI-fueled customer experience: How online retailers are moving toward real-time perception, reasoning, and learning
Rupert Steffner (Otto GmbH & Co. KG)
Today’s online storefronts are good at procuring transactions but poor at managing customers. Rupert Steffner explains why online retailers must build a complementary intelligence to perceive and reason on customer signals to better manage opportunities and risks along the customer journey. Individually managed customer experience is retailers' next challenge, and fueling AI is the right answer.
2:05pm-2:45pm (40m) Data-driven business
Breeding data scientists: A four-year study
Danielle Dean (iRobot), Amy O'Connor (Cloudera)
At Strata + Hadoop World 2012, Amy O'Connor and her daughter Danielle Dean shared how they learned and built data science skills at Nokia. This year, Amy and Danielle explore how the landscape in the world of data science has changed in the past four years and explain how to be successful deriving value from data today.
2:55pm-3:35pm (40m) Data-driven business
CANCELED: How to hire and test for data skills: A one-size-fits-all interview kit
Tanya Cashorali (TCB Analytics)
Given the recent demand for data analytics and data science skills, adequately testing and qualifying candidates can be a daunting task. Interviewing hundreds of individuals of varying experience and skill levels requires a standardized approach. Tanya Cashorali explores strategies, best practices, and deceptively simple interviewing techniques for data analytics and data science candidates.
4:35pm-5:15pm (40m) Data-driven business
Using the explosion of data in the utility industry to prevent explosions in utility infrastructure
Kim Montgomery (GridCure)
With the advent of smart grid technology, the quantity of data collected by electrical utilities has increased by 3–5 orders of magnitude. To make full use of this data, utilities must expand their analytical capabilities and develop new analytical techniques. Kim Montgomery discusses some ways that big data tools are advancing the practice of preventative maintenance in the utility industry.
11:20am-12:00pm (40m) Data science & advanced analytics
Tackling machine-learning complexity for data curation
Ihab Ilyas (University of Waterloo)
Machine-learning tools promise to help solve data curation problems. While the principles are well understood, the engineering details in configuring and deploying ML techniques are the biggest hurdle. Ihab Ilyas explains why leveraging data semantics and domain-specific knowledge is key in delivering the optimizations necessary for truly scalable ML curation solutions.
1:15pm-1:55pm (40m) IoT & real-time
Twitter's real-time stack: Processing billions of events with Heron and DistributedLog
Karthik Ramasamy (Twitter)
Twitter generates billions and billions of events per day. Analyzing these events in real time presents a massive challenge. Karthik Ramasamy offers an overview of the end-to-end real-time stack Twitter designed in order to meet this challenge, consisting of DistributedLog (the distributed and replicated messaging system) and Heron (the streaming system for real-time computation).
2:05pm-2:45pm (40m) Data innovations
Lessons learned running Hadoop and Spark in Docker
Thomas Phelan (HPE BlueData)
Many initiatives for running applications inside containers have been scoped to run on a single host. Using Docker containers for large-scale environments poses new challenges, especially for big data applications like Hadoop. Thomas Phelan shares lessons learned and some tips and tricks on how to Dockerize your big data applications in a reliable, scalable, and high-performance environment.
2:55pm-3:35pm (40m) Enterprise adoption
Life of a click: How Hearst manages clickstream analytics in the cloud
Rick McFarland (Hearst Corp)
Rick McFarland explains how the Hearst Corporation utilizes big data and analytics tools like Spark and Kinesis to stream click data in real time from its 300+ websites worldwide. This streaming process feeds an editorial tool called Buzzing@Hearst, which provides instant feedback to authors on what is trending across the Hearst network.
4:35pm-5:15pm (40m) Enterprise adoption
Machine intelligence in the wild: How AI will reshape global industries
David Beyer (Amplify Partners)
Society is standing at the gates of what promises to be a profound transformation in the nature of work, the role of data, and the future of the world's major industries. Intelligent machines will play a variety of roles in every sector of the economy. David Beyer explores a number of key industries and their idiosyncratic journeys on the way to adopting AI.
11:20am-12:00pm (40m) Security
Account takeovers are taking over: How big data can stop them
Fang Yu (DataVisor)
The value of online user accounts has led to a significant increase in account takeover (ATO) attacks. Cyber criminals create armies of compromised accounts to perform attacks including fraudulent transactions, bank withdrawals, reward program theft, and more. Fang Yu explains how the latest in big data technology is helping turn the tide on ATO campaigns.
1:15pm-1:55pm (40m) Security
Streaming cybersecurity into Graph: Accelerating data into Datastax Graph and Blazegraph
Joshua Patterson (NVIDIA), Mike Wendt (NVIDIA), Keith Kraus (NVIDIA)
Cybersecurity has become a data problem and thus needs the best-in-breed big data tools. Joshua Patterson, Michael Wendt, and Keith Kraus explain how Accenture Labs's Cybersecurity team is using Apache Kafka, Spark, and Flink to stream data into Blazegraph and Datastax Graph to accelerate cyber defense.
2:05pm-2:45pm (40m) Security
Securing Apache Kafka
Jun Rao (Confluent)
With Apache Kafka 0.9, the community has introduced a number of features to make data streams secure. Jun Rao explains the motivation for making these changes, discusses the design of Kafka security, and demonstrates how to secure a Kafka cluster. Jun also covers common pitfalls in securing Kafka and talks about ongoing security work.
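As a companion to the abstract (not code from the session), here is a minimal sketch of a Java producer configured for TLS, one of the security features introduced in 0.9; the broker address, topic, truststore path, and password are placeholders.

import java.util.Properties;
import org.apache.kafka.clients.producer.KafkaProducer;
import org.apache.kafka.clients.producer.ProducerRecord;

public class SecureProducerSketch {
    public static void main(String[] args) {
        Properties props = new Properties();
        props.put("bootstrap.servers", "broker1.example.com:9093");   // placeholder broker (TLS listener)
        props.put("security.protocol", "SSL");                        // encrypt traffic and authenticate the broker
        props.put("ssl.truststore.location", "/etc/kafka/client.truststore.jks"); // placeholder path
        props.put("ssl.truststore.password", "changeit");             // placeholder password
        props.put("key.serializer", "org.apache.kafka.common.serialization.StringSerializer");
        props.put("value.serializer", "org.apache.kafka.common.serialization.StringSerializer");

        // try-with-resources closes the producer and flushes pending records.
        try (KafkaProducer<String, String> producer = new KafkaProducer<>(props)) {
            producer.send(new ProducerRecord<>("secured-topic", "hello")); // placeholder topic
        }
    }
}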
2:55pm-3:35pm (40m) Security
Authorization in the cloud: Enforcing access control across compute engines
Li Li (Google), Hao Hao (Cloudera)
Li Li and Hao Hao detail the architecture of Apache Sentry + RecordService for Hadoop in the cloud, which provides unified, fine-grained authorization via role- and attribute-based access control, and explain how adopting Apache Sentry and RecordService can protect sensitive data in a multitenant cloud across the Hadoop ecosystem.
4:35pm-5:15pm (40m) Hadoop internals & development
Apache Kudu: 1.0 and beyond
Todd Lipcon (Cloudera)
Apache Kudu was first announced as a public beta release at Strata NYC 2015 and recently reached 1.0. This conference marks its one-year anniversary as a public open source project. Todd Lipcon offers a very brief refresher on the goals and feature set of the Kudu storage engine, covering the development that has taken place over the last year.
11:20am-12:00pm (40m) Enterprise adoption
Big data is a household word: How Procter & Gamble uses on-cluster Hadoop BI to give visual insight to hundreds of business users for everyday use
Terry McFadden (P&G), Priyank Patel (Arcadia Data)
Terry McFadden and Priyank Patel discuss Procter & Gamble's three-year journey to enable production applications with on-cluster BI technology, exploring in detail the architecture challenges and choices the team made along the way.
1:15pm-1:55pm (40m) Data science & advanced analytics
Recent developments in SparkR for advanced analytics
Xiangrui Meng (Databricks)
Xiangrui Meng explores recent community efforts to extend SparkR for scalable advanced analytics—including summary statistics, single-pass approximate algorithms, and machine-learning algorithms ported from Spark MLlib—and shows how to integrate existing R packages with SparkR to accelerate existing R workflows.
2:05pm-2:45pm (40m) Data science & advanced analytics
Fast deep learning at your fingertips
Amitai Armon (Intel), Nir Lotan (Intel)
Amitai Armon and Nir Lotan outline a new, free software tool that enables the creation of deep learning models quickly and easily. The tool is based on existing deep learning frameworks and incorporates extensive optimizations that provide high performance on standard CPUs.
2:55pm-3:35pm (40m) Data science & advanced analytics
Machine-learning techniques for class imbalances and adversaries
Brendan Herger (Capital One)
Many areas of applied machine learning require models optimized for rare occurrences, such as class imbalances, and users actively attempting to subvert the system (adversaries). Brendan Herger offers an overview of multiple published techniques that specifically attempt to address these issues and discusses lessons learned by the Data Innovation Lab at Capital One.
4:35pm-5:15pm (40m) Data innovations
Data modeling for microservices with Cassandra and Spark
Jeffrey Carpenter (DataStax)
Jeff Carpenter describes how data modeling can be a key enabler of microservice architectures for transactional and analytics systems, including service identification, schema design, and event streaming.
11:20am-12:00pm (40m) Data science & advanced analytics
Data science and the Internet of Things: It's just the beginning
Mike Stringer (Datascope Analytics)
We're likely just at the beginning of data science. The people and things that are starting to be equipped with sensors will enable entirely new classes of problems that will have to be approached more scientifically. Mike Stringer outlines some of the issues that may arise for business, for data scientists, and for society.
1:15pm-1:55pm (40m) IoT & real-time
Shifting cities: A case study in data visualization
Brian Kahn (Climate Central), Edward Wisniewski (Radish Lab)
Radish Lab teamed up with science news nonprofit Climate Central to transform temperature data from 1,001 US cities into a compelling, simple interactive that received more than 1 million views within three days of launch. Brian Kahn and Edward Wisniewski offer an overview of the process of creating a viral, interactive data visualization with teams that regularly produce powerful data stories.
2:05pm-2:45pm (40m) IoT & real-time
Implementing streaming architecture with Apache Flink: Present and future
Kostas Tzoumas (data Artisans)
Apache Flink has seen incredible growth during the last year, both in development and usage, driven by the fundamental shift from batch to stream processing. Kostas Tzoumas demonstrates how Apache Flink enables real-time decisions, makes infrastructure less complex, and enables extremely efficient, accurate, and fault-tolerant streaming applications.
2:55pm-3:35pm (40m) Data science & advanced analytics
Machine intelligence at Google scale
Kaz Sato (Google)
The largest challenge for deep learning is scalability. Google has built a large-scale neural network in the cloud and is now sharing that power. Kazunori Sato introduces pretrained ML services, such as the Cloud Vision API and the Speech API, and explores how TensorFlow and Cloud Machine Learning can accelerate custom model training 10x–40x with Google's distributed training infrastructure.
4:35pm-5:15pm (40m) IoT & real-time
Amazon Kinesis: Real-time streaming data in the AWS cloud
Roy Ben-Alta (Amazon Web Services)
Roy Ben-Alta explores the Amazon Kinesis platform in detail and discusses best practices for scaling your core streaming data ingestion pipeline as well as real-world customer use cases and design pattern integration with Amazon Elasticsearch, AWS Lambda, and Apache Spark.
11:20am-12:00pm (40m) Data science & advanced analytics
Recent advances in applications of deep learning for text and speech
Yishay Carmiel (IntelligentWire)
Deep learning has taken us a few steps further toward achieving AI for a man-machine interface. However, deep learning technologies like speech recognition and natural language processing remain a mystery to many. Yishay Carmiel reviews the history of deep learning, the impact it's made, recent breakthroughs, interesting solved and open problems, and what's in store for the future.
1:15pm-1:55pm (40m) Data science & advanced analytics
Semantic natural language understanding with Spark Streaming, UIMA, and machine-learned ontologies
David Talby (Pacific AI), Claudiu Branzan (Accenture)
David Talby and Claudiu Branzan lead a live demo of an end-to-end system that makes nontrivial clinical inferences from free-text patient records. Infrastructure components include Kafka, Spark Streaming, Spark, Titan, and Elasticsearch; data science components include custom UIMA annotators, curated taxonomies, machine-learned dynamic ontologies, and real-time inferencing.
2:05pm-2:45pm (40m) Data science & advanced analytics
A data-driven approach to the US presidential election
Amir Hajian (Thomson Reuters), Khaled Ammar (Thomson Reuters), Alex Constandache (Thomson Reuters)
Amir Hajian, Khaled Ammar, and Alex Constandache offer an approach to mining a large dataset to predict the electability of hypothetical candidates in the US presidential election race, using machine learning, natural language processing, and deep learning on an infrastructure that includes Spark and Elasticsearch, which serves as the backbone of the mobile game White House Run.
2:55pm-3:35pm (40m) Data science & advanced analytics
Evaluating models for a needle in a haystack: Applications in predictive maintenance
Danielle Dean (iRobot), Shaheen Gauher (Microsoft)
In the realm of predictive maintenance, the event of interest is an equipment failure. In real scenarios, this is usually a rare event. Unless the data collection has been taking place over a long period of time, the data will have very few of these events or, in the worst case, none at all. Danielle Dean and Shaheen Gauher discuss the various ways of building and evaluating models for such data.
4:35pm-5:15pm (40m) Data science & advanced analytics
Predicting patent litigation
Josh Lemaitre (Thomson Reuters)
How can the value of a patent be quantified? Josh Lemaitre explores how Thomson Reuters Labs approached this problem by applying machine learning to the patent corpus in an effort to predict those most likely to be enforced via litigation. Josh covers infrastructure, methods, challenges, and opportunities for future research.
11:20am-12:00pm (40m) Spark & beyond
A deep dive into Structured Streaming in Spark
Ram Sriharsha (Databricks)
Structured Streaming is a new effort in Apache Spark to make stream processing simple without the need to learn a new programming paradigm or system. Ram Sriharsha offers an overview of Structured Streaming, discussing its support for event-time, out-of-order/delayed data, sessionization, and integration with the batch data stack to show how it simplifies building powerful continuous applications.
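To make the event-time and out-of-order handling concrete (this sketch is an editorial addition, not the speaker's code, and assumes a recent Spark release with the built-in "rate" source and watermark support), a windowed count might look like this:

import org.apache.spark.sql.Dataset;
import org.apache.spark.sql.Row;
import org.apache.spark.sql.SparkSession;
import org.apache.spark.sql.streaming.StreamingQuery;

import static org.apache.spark.sql.functions.col;
import static org.apache.spark.sql.functions.window;

public class StructuredStreamingSketch {
    public static void main(String[] args) throws Exception {
        SparkSession spark = SparkSession.builder()
                .appName("structured-streaming-sketch")
                .getOrCreate();

        // The built-in "rate" source generates (timestamp, value) rows for experimentation.
        Dataset<Row> events = spark.readStream().format("rate").load();

        // Event-time windowing plus a watermark: rows arriving up to 10 minutes late
        // are still counted in their original 5-minute window.
        Dataset<Row> counts = events
                .withWatermark("timestamp", "10 minutes")
                .groupBy(window(col("timestamp"), "5 minutes"))
                .count();

        StreamingQuery query = counts.writeStream()
                .outputMode("update")
                .format("console")
                .start();
        query.awaitTermination();
    }
}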
1:15pm-1:55pm (40m) Spark & beyond
Apache Spark in fintech: Building fraud detection applications with distributed machine learning at Intel
Yuhao Yang (Intel)
Through collaboration with some of the top payments companies around the world, Intel has developed an end-to-end solution for building fraud detection applications. Yuhao Yang explains how Intel used and extended Spark DataFrames and ML Pipelines to build the tool chain for financial fraud detection and shares the lessons learned during development.
2:05pm-2:45pm (40m) Spark & beyond
Spark Structured Streaming for machine learning
Holden Karau (Independent), Seth Hendrickson (Cloudera)
Holden Karau and Seth Hendrickson demonstrate how to do streaming machine learning using Spark's new Structured Streaming and walk you through creating your own streaming model.
2:55pm-3:35pm (40m) Spark & beyond
Choice Hotels' journey to better understand its customers through self-service analytics
Narasimhan Sampath (Choice Hotels International), Avinash Ramineni (Clairvoyant)
Narasimhan Sampath and Avinash Ramineni share how Choice Hotels International used Spark Streaming, Kafka, Spark, and Spark SQL to create an advanced analytics platform that lets business users be self-reliant: they can access the data they need from a variety of sources, generate customer insights and property dashboards, and make data-driven decisions with minimal IT engagement.
4:35pm-5:15pm (40m) Spark & beyond
Spark and Java: Yes, they work together
Jesse Anderson (Big Data Institute)
Although Spark gets a lot of attention, we tend to think of only two supported languages: Python and Scala. Jesse Anderson proves that Java works just as well. With lambdas, we even get syntax comparable to Scala's, so Java developers get the best of both worlds without having to learn Scala.
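As a rough illustration of the point (not code from the session), a word count in the Dataset API with a Java 8 lambda stays close to the equivalent Scala; the input path is a placeholder.

import java.util.Arrays;
import org.apache.spark.api.java.function.FlatMapFunction;
import org.apache.spark.sql.Dataset;
import org.apache.spark.sql.Encoders;
import org.apache.spark.sql.SparkSession;

public class JavaWordCountSketch {
    public static void main(String[] args) {
        SparkSession spark = SparkSession.builder()
                .appName("java-word-count-sketch")
                .master("local[*]")
                .getOrCreate();

        // textFile yields a Dataset<String> whose single column is named "value".
        Dataset<String> lines = spark.read().textFile("data/sample.txt"); // placeholder path

        // A Java 8 lambda stands in for Scala's flatMap closure; the cast picks the Java-friendly overload.
        Dataset<String> words = lines.flatMap(
                (FlatMapFunction<String, String>) line -> Arrays.asList(line.split("\\s+")).iterator(),
                Encoders.STRING());

        words.groupBy("value").count().show();

        spark.stop();
    }
}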
11:20am-12:00pm (40m) AMA
Ask me anything: Deep learning with TensorFlow
Martin Wicke (Google), Joshua Gordon (Google)
Martin Wicke and Josh Gordon field questions related to their tutorial, Deep Learning with TensorFlow.
1:15pm-1:55pm (40m) AMA
Ask me anything: Hadoop application architectures
Mark Grover (Lyft), Jonathan Seidman (Cloudera), Ted Malaska (Capital One)
Mark Grover, Jonathan Seidman, and Ted Malaska, the authors of Hadoop Application Architectures, participate in an open Q&A session on considerations and recommendations for the architecture and design of applications using Hadoop. Come with questions about your use case and its big data architecture or just listen in on the conversation.
2:05pm-2:45pm (40m) AMA
Ask me anything: Stream processing with Apache Beam and Google Cloud Dataflow engineers
Tyler Akidau (Google)
Join Apache Beam and Google Cloud Dataflow engineers to ask all of your questions about stream processing. They'll answer everything from general streaming questions about concepts, semantics, capabilities, limitations, etc. to questions specifically related to Apache Beam, Google Cloud Dataflow, and other common streaming systems (Flink, Spark, Storm, etc.).
2:55pm-3:35pm (40m) AMA
Ask me anything: The state of Spark
Ram Sriharsha (Databricks), Xiangrui Meng (Databricks)
Join Xiangrui Meng and Ram Sriharsha to discuss the state of Spark.
4:35pm-5:15pm (40m) AMA
Ask me anything: Developing a modern enterprise data strategy
John Akred (Silicon Valley Data Science), Stephen O'Sullivan (Data Whisperers), Julie Steele (Manifold)
John Akred, Stephen O'Sullivan, and Julie Steele field a wide range of detailed questions about developing a modern data strategy, architecting a data platform, and best practices for the evolving CDO role. Even if you don’t have a specific question, join in to hear what others are asking.
11:20am-12:00pm (40m) Sponsored
VoltDB and the Jepsen test: What we learned about data accuracy and consistency
John Hugg (VoltDB)
VoltDB promises full ACID with strong serializability in a fault-tolerant, distributed SQL platform, as well as higher throughput than other systems that promise much less. But why should users believe this? John Hugg discusses VoltDB's internal testing and support processes, its work with Kyle Kingsbury on the VoltDB Jepsen testing project, and where VoltDB will continue to improve.
1:15pm-1:55pm (40m) Sponsored
Yellow Pages (Canada): Our journey to speed of thought interactive analytics on top of Hadoop
Richard Langlois (IT Architecture & Strategy)
The self-service YP Analytics application allows advertisers to understand their digital presence and ROI. Richard Langlois explains how Yellow Pages applied this expertise to an internal use case that delivers real-time analytics with Tableau, using OLAP on Hadoop on a stack that includes HDFS, Parquet, Hive, Impala, and AtScale for fast analytics and data exploration.
2:05pm-2:45pm (40m) Sponsored
Big data journeys from the real world
John Morrell (Datameer)
A panel of practitioners from Dell, National Instruments, and Citi—companies that are gaining real value from big data analytics—explores their companies' big data journeys, explaining how analytics can answer groundbreaking new questions about business and create a path to becoming a data-driven organization.
2:55pm-3:35pm (40m) Data innovations
Alluxio (formerly Tachyon): The journey thus far and the road ahead
Haoyuan Li (Alluxio)
Haoyuan Li offers an overview of Alluxio (formerly Tachyon), a memory-speed virtual distributed storage system. In the past year, the Alluxio project experienced a tremendous improvement in performance and scalability and was extended with key new features. This year, the goal is to make Alluxio accessible to an even wider set of users through a focus on security, new language bindings, and APIs.
4:35pm-5:15pm (40m) Hadoop internals & development
Rethinking operational data stores on Hadoop
Vinayak Borkar (X15 Software)
Starting from first principles, Vinayak Borkar defines the requirements for a modern operational data store and explores some possible architectures to support those requirements.
11:20am-12:00pm (40m) Sponsored
5 cloud AI innovations
Rimma Nehme (Microsoft)
The amount of cutting-edge technology that Azure puts at your fingertips is incredible. Artificial intelligence is no exception. Azure enables sophisticated capabilities in artificial intelligence, machine learning, deep learning, cognitive services, and advanced analytics. Rimma Nehme explains why Azure is the next AI supercomputer and how this vision is being implemented in reality.
1:15pm-1:55pm (40m) Sponsored
BigQuery for data warehousing
Chad W. Jennings (Google)
BigQuery provides petabyte-scale data warehousing with consistently high performance for all users. However, users coming from traditional enterprise data warehousing platforms often have questions about how best to adapt their workloads. Chad Jennings explores best practices and integrations, with special emphasis on loading and transforming data for BigQuery.
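By way of illustration only (not material from the session), a batch load into BigQuery with the google-cloud-bigquery Java client might look like the sketch below; the dataset, table, and Cloud Storage URI are placeholders.

import com.google.cloud.bigquery.BigQuery;
import com.google.cloud.bigquery.BigQueryOptions;
import com.google.cloud.bigquery.FormatOptions;
import com.google.cloud.bigquery.Job;
import com.google.cloud.bigquery.JobInfo;
import com.google.cloud.bigquery.LoadJobConfiguration;
import com.google.cloud.bigquery.TableId;

public class BigQueryLoadSketch {
    public static void main(String[] args) throws InterruptedException {
        BigQuery bigquery = BigQueryOptions.getDefaultInstance().getService();

        // Placeholder dataset, table, and source file.
        TableId table = TableId.of("analytics_dataset", "events");
        LoadJobConfiguration loadConfig =
                LoadJobConfiguration.newBuilder(table, "gs://example-bucket/events/2016-09-29.csv")
                        .setFormatOptions(FormatOptions.csv())
                        .setAutodetect(true) // let BigQuery infer the schema for this sketch
                        .build();

        Job job = bigquery.create(JobInfo.of(loadConfig));
        job = job.waitFor(); // block until the load completes

        if (job.getStatus().getError() != null) {
            throw new RuntimeException(job.getStatus().getError().toString());
        }
        System.out.println("Loaded into " + table);
    }
}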
2:05pm-2:45pm (40m) Sponsored
Running Presto and Spark on AWS: From zero to insight in less than five minutes
Jonathan Fritz (Amazon Web Services)
Running Hadoop, Spark, and Presto can be as fast and inexpensive as ordering a latte at your favorite coffee shop. Jonathan Fritz explains how organizations are deploying these and other big data frameworks with Amazon Web Services (AWS) and how you too can quickly and securely run Spark and Presto on AWS. Jonathan shows you how to get started and shares best practices and common use cases.
2:55pm-3:35pm (40m) AMA
Ask me anything: Getting into (and out of) data science consulting
Max Shron (Warby Parker)
Join Max Shron, former consultant on data science and current head of Warby Parker's data science team, for a Q&A all about data science consulting. Bring your questions about getting into the data science consulting business (or your questions about how to transition from consulting to something new). Even if you don't have questions, join in to hear what others are asking.
4:35pm-5:15pm (40m) AMA
Ask me anything: Apache Kafka
Jun Rao (Confluent), Ewen Cheslack-Postava (Confluent)
Join Apache Kafka cocreator and PMC chair Jun Rao and Apache Kafka committer and architect of Kafka Connect Ewen Cheslack-Postava for a Q&A session about Apache Kafka. Bring your questions about Kafka internals or key considerations for developing your data pipeline and architecture, designing your applications, and running in production with Kafka.
11:20am-12:00pm (40m) Sponsored
Big data and analytics with Cisco UCS: Lessons learned and platform considerations
Rajesh Shroff (Cisco Systems Inc)
Rajesh Shroff reviews the big data and analytics landscape, lessons learned in the enterprise over the last few years, and some of the key considerations in designing a big data system.
1:15pm-1:55pm (40m) Sponsored
Big data governance: Making big data an enterprise-class citizen
Michael Eacrett (SAP)
Big data is a critical part of the enterprise data fabric and must meet the critical enterprise criteria of correctness, quality, consistency, compliance, and traceability. Michael Eacrett explains how companies are using big data infrastructures, asynchronously and in real time, to actively solve information governance and data-quality challenges.
2:05pm-2:45pm (40m) Sponsored
Sensitive data sharing for analytics
Steven Touw (Immuta)
Sharing your valuable data internally or with third-party consumers can be risky due to data privacy regulations and IP considerations, but sharing can also generate revenue or help nonprofits succeed at world-changing missions. Steve Touw explores real-world examples of how a proper data architecture enables philanthropic missions and offers ideas for how to better share your data.
2:55pm-3:35pm (40m) AMA
Ask me anything: White House Office of Science and Technology Policy
DJ Patil (White House Office of Science and Technology Policy), Lynn Overmann (Office of the Chief Technology Officer)
Join DJ Patil and Lynn Overmann to ask your questions about data science at the White House.
11:20am-12:00pm (40m) Sponsored
ODPi: The foundation for cross-distribution interoperability
Berni Schiefer (IBM), Susan Malaika
With so much variance across Hadoop distributions, ODPi was established to create standards for both Hadoop components and testing applications on those components. Join John Mertic and Berni Schiefer to learn how application developers and companies considering Hadoop can benefit from ODPi.
1:15pm-1:55pm (40m) Sponsored
Accelerate EDW modernization with the Hadoop ecosystem
Joe Goldberg (BMC Software)
Joe Goldberg explores how companies like GoPro, Produban, Navistar, and others have taken a platform approach to managing their workflows; how they are using workflows to power data ingest, ETL, and data integration processing; how an end-to-end view of workflows has reduced issue resolution time; and how these companies are achieving success in their data warehouse modernization projects.
2:05pm-2:45pm (40m) Sponsored
Powering the future of data with connected data platforms
Scott Gnau (Hortonworks)
Scott Gnau provides unique insights into the tipping point for data, how enterprises are now rethinking everything from their IT architecture and software strategies to data governance and security, and the cultural shifts CIOs must grapple with when supporting a business using real-time data to scale and grow.
11:20am-12:00pm (40m) Sponsored
How an open analytics ecosystem became a lifesaver
Douglas Liming (SAS Institute Inc.)
Ready to take a deeper look at how Hadoop and its ecosystem have a widespread impact on analytics? Douglas Liming explains where SAS fits into the open ecosystem, why you no longer have to choose between analytics languages like Python, R, or SAS, and how a single, unified open analytics architecture empowers you to literally have it all.
1:15pm-1:55pm (40m) Sponsored
Governance and metadata management of Cigna's enterprise data lake
Sherri Adame (Cigna)
Launched in late 2015, Cigna's enterprise data lake project is taking the company on a data governance journey. Sherri Adame offers an overview of the project, providing insights into some of the business pain points and key drivers, how it has led to organizational change, and the best practices associated with Cigna’s new data governance process.
2:05pm-2:45pm (40m) Sponsored
Path-to-purchase analytics using a data lake and Spark
Joe Caserta (Caserta Concepts)
Joe Caserta explores how a leading membership interest group is utilizing a data lake to track its members’ path-to-purchase touch points across multiple channels by matching and mastering individuals using Spark GraphFrames and stitching together website, marketing, email, and transaction data to discover the most effective way to attract new members and retain existing high-value members.
2:55pm-3:35pm (40m) Sponsored
Neptune: A machine-learning platform for experiment management
Mariusz Gadarowski (deepsense.io)
Mariusz Gądarowski offers an overview of Neptune, deepsense.io’s new platform for managing machine-learning experiments. Neptune enhances the management of machine-learning tasks such as dependent computational processes, code versioning, comparing results, monitoring tasks and progress, sharing infrastructure among teammates, and more.
8:50am-8:55am (5m)
Thursday keynotes
Roger Magoulas (O'Reilly Media), Doug Cutting (Cloudera), Alistair Croll (Solve For Interesting)
Strata + Hadoop World program chairs Roger Magoulas, Doug Cutting, and Alistair Croll welcome you to the second day of keynotes.
8:55am-9:05am (10m)
Hadoop in the cloud: A Nielsen use case
Tom Reilly (Cloudera), James Powell (Nielsen)
Cloudera CEO Tom Reilly and James Powell, global CTO of Nielsen, discuss the dynamics of Hadoop in the cloud, what to consider at the start of the journey, and how to implement a solution that delivers flexibility while meeting key enterprise requirements.
9:05am-9:10am (5m)
Inbox is the Trojan horse of AI
Alistair Croll (Solve For Interesting)
When Hollywood portrays artificial intelligence, it's either a demon or a savior. But the reality is that AI is far more likely to be an extension of ourselves. Strata program chair Alistair Croll looks at the sometimes surprising ways that machine learning is insinuating itself into our everyday lives.
9:10am-9:20am (10m) Sponsored keynote
Connected eyes
Joseph Sirosh (Compass)
Will machine learning give us better eyesight? Join Joseph Sirosh for a surprising story about how machine learning, population data, and the cloud are coming together to fundamentally reimagine eye care in one of the world’s most populous countries, India.
9:20am-9:30am (10m)
The tech behind the biggest journalism leak in history
Mar Cabra (International Consortium of Investigative Journalists)
The Panama Papers investigation revealed the offshore holdings and connections of dozens of politicians and prominent public figures around the world and led to high-profile resignations, police raids, and official investigations. Almost 500 journalists had to sift through 2.6 terabytes of data—the biggest leak in the history of journalism. Mar Cabra explains how technology made it all possible.
9:30am-9:35am (5m) Sponsored keynote
Business insights driven by speed
Raghunath Nambiar (Cisco)
The need to quickly acquire, process, prepare, store, and analyze data has never been greater. The need for performance crosses the big data ecosystem too—from the edge to the server to the analytics software, speed matters. Raghunath Nambiar shares a few use cases that have had significant organizational impact where performance was key.
9:35am-9:40am (5m) Sponsored keynote
Google BigQuery for enterprise
Chad W. Jennings (Google)
Chad W. Jennings demonstrates the power of BigQuery through an exciting demo and announces several new features that will make BigQuery a better home for your enterprise big data workloads.
9:40am-10:00am (20m)
Data science: A view from the White House
DJ Patil (White House Office of Science and Technology Policy), Lynn Overmann (Office of the Chief Technology Officer)
Keynote by DJ Patil and Lynn Overmann
10:00am-10:05am (5m) Sponsored keynote
Bring data to life with Immersive Visualization
Robert Thomas (IBM)
Data has long stopped being structured and flat, but the results of our analysis are still rendered as flat bar charts and scatter plots. We live in a 3D world, and we need to be able to enable data interaction from all perspectives. Robert Thomas offers an overview of Immersive Visualization—integrated with notebooks and powered by Spark—which helps bring insights to life.
10:05am-10:20am (15m)
From big data to human-level artificial intelligence
Gary Marcus (Geometric Intelligence)
Gary Marcus explores the gap between what machines do well and what people do well, as well as what needs to happen before machines can match the flexibility and power of human cognition.
10:20am-10:25am (5m) Sponsored keynote
SAS: More open than you might think
Paul Kent (SAS)
Hadoop and its ecosystem have changed analytics profoundly. Paul Kent offers an overview of SAS's participation in open platforms and introduces SAS Viya, a new unified and open analytics architecture that lets you scale analytics in the cloud and code as you choose.
10:25am-10:45am (20m)
Statistics, machine learning, and the crazy 2016 election
Sam Wang (Princeton University)
Although 2016 is a highly unusual political year, elections and public opinion follow predictable statistical properties. Sam Wang explains how the presidential, Senate, and House races can be tracked and forecast from freely available polling data using tools from statistics and machine learning.
8:00am-8:45am (45m)
Break: Coffee Break
10:50am-11:20am (30m)
Break: Morning Break sponsored by SAS
12:00pm-1:15pm (1h 15m) Event
Thursday BoF Tables
Birds of a Feather (BoF) discussions are a great way to informally network with people in similar industries or interested in the same topics.
3:35pm-4:35pm (1h)
Break: Afternoon Break sponsored by IBM