Presented By
O’Reilly + Cloudera
Make Data Work
March 25-28, 2019
San Francisco, CA

Schedule: AI and Data technologies in the cloud sessions

9:00am - 5:00pm Monday, March 25 & Tuesday, March 26
Jorge Lopez (Amazon Web Services), Roy Hasson (Amazon Web Services), Rajeev Chakrabarti (Amazon Web Services), Jesse Gebhardt (Amazon Web Services), Gautam Srinivasan (Amazon Web Services), Anthony Nguyen (Amazon Web Services)
Average rating: ****.
(4.50, 4 ratings)
Serverless technologies let you build and scale applications and services rapidly without the need to provision or manage servers. Join in to learn how to incorporate serverless concepts into your big data architectures, looking at design patterns to ingest, store, and analyze your data. You'll then build a big data application using AWS technologies such as S3, Athena, Kinesis, and more. Read more.
1:30pm - 5:00pm Tuesday, March 26, 2019
Jason Wang (Cloudera), Brandon Freeman (Cloudera), Michael Kohs (Cloudera), Akihiro Nishikawa (Cloudera), Toby Ferguson (Cloudera)
Average rating: ***..
(3.20, 5 ratings)
There are many challenges with moving multidisciplinary big data workloads to the cloud and running them. Jason Wang, Brandon Freeman, Michael Kohs, Akihiro Nishikawa, and Toby Ferguson explore cloud architecture and its challenges and walk you through using Cloudera Altus to build data warehousing and data engineering clusters and run workloads that share metadata between them using Cloudera SDX. Read more.
1:30pm - 5:00pm Tuesday, March 26, 2019
Holden Karau (Google), Francesca Lazzeri (Microsoft), Trevor Grant (IBM)
Average rating: ***..
(3.00, 2 ratings)
Holden Karau, Francesca Lazzeri, and Trevor Grant offer an overview of Kubeflow and walk you through using it to train and serve models across different cloud environments (and on-premises). You'll use a script to do the initial setup work, so you can jump (almost) straight into training a model on one cloud and then look at how to set up serving in another cluster/cloud. Read more.
1:30pm - 5:00pm Tuesday, March 26, 2019
Arun Kejariwal (Facebook), Karthik Ramasamy (Streamlio)
Average rating: **...
(2.67, 12 ratings)
Many industry segments have been grappling with fast data (high-volume, high-velocity data). Arun Kejariwal and Karthik Ramasamy walk you through the state-of-the-art systems for each stage of an end-to-end data processing pipeline—messaging, compute, and storage—for real-time data and algorithms to extract insights (e.g., heavy hitters and quantiles) from data streams. Read more.
11:00am - 11:40am Wednesday, March 27, 2019
Jeremy Howard (fast.ai | USF | doc.ai and platform.ai)
Average rating: ****.
(4.80, 5 ratings)
Jeremy Howard describes how to leverage the latest research from the deep learning and HCI communities to train neural networks from scratch—without code or preexisting labels. He then shares case studies in fashion, retail and ecommerce, travel, and agriculture where these approaches have been used. Read more.
11:00am - 11:40am Wednesday, March 27, 2019
Shubham Tagra (Qubole)
Average rating: ***..
(3.50, 8 ratings)
Did you know you can run Presto in AWS at a tenth of the cost with AWS Spot nodes, with just a few architectural enhancements to Presto? Shubham Tagra explores the gaps in Presto architecture, explains how to use Spot nodes, covers enhancements, and showcases the improvements in terms of reliability and TCO achieved through them. Read more.
11:00am - 11:40am Wednesday, March 27, 2019
Diego Oppenheimer (Algorithmia)
Average rating: ****.
(4.00, 11 ratings)
You've invested heavily in cleaning your data, feature engineering, training, and tuning your model—but now you have to deploy your model into production, and you discover it's a huge challenge. Diego Oppenheimer shares common architectural patterns and best practices from the most advanced organizations that deploy models for scalability and accessibility. Read more.
11:00am - 11:40am Wednesday, March 27, 2019
Jaipaul Agonus (FINRA), Daniel Monteiro (FINRA)
Average rating: ***..
(3.40, 5 ratings)
Jaipaul Agonus and Daniel Monteiro do Carmo Rosa detail big data analytics and visualization practices and tools used by FINRA to support machine learning and other surveillance activities that the Market Regulation Department conducts in the AWS cloud. Read more.
11:00am - 11:40am Wednesday, March 27, 2019
Tristan Zajonc (Cloudera), Tim Chen (Cloudera)
Average rating: ****.
(4.40, 5 ratings)
Data platforms are being asked to support an ever increasing range of workloads and compute environments, including machine learning and elastic cloud platforms. Tristan Zajonc and Tim Chen discuss emerging capabilities, including running machine learning and Spark workloads on autoscaling container platforms, and share their vision for the road ahead for ML and AI in the cloud. Read more.
11:00am - 11:40am Wednesday, March 27, 2019
Robert Horton (Microsoft), Mario Inchiosa (Microsoft), Ali Zaidi (Microsoft)
Average rating: ****.
(4.70, 10 ratings)
Robert Horton, Mario Inchiosa, and Ali Zaidi demonstrate how to use three cutting-edge machine learning techniques—transfer learning from pretrained language models, active learning to make more effective use of a limited labeling budget, and hyperparameter tuning to maximize model performance—to up your modeling game. Read more.
11:00am - 11:40am Wednesday, March 27, 2019
Alon Kaufman (Duality), Vinod Vaikuntanathan (MIT and Duality Technologies)
Average rating: ***..
(3.75, 4 ratings)
Alon Kaufman and Vinod Vaikuntanathan discuss the challenges and opportunities of machine learning on encrypted data and describe the state of the art in this space. Read more.
11:00am - 11:40am Wednesday, March 27, 2019
Mike Olson (Cloudera)
Average rating: ***..
(3.80, 5 ratings)
It's easier than ever to collect data, but managing it securely in compliance with regulations and legal constraints is harder. Mike Olson discusses the risks and the issues that matter most and explains how an enterprise data cloud that embraces your data center and the public cloud in combination can address them, delivering real business results for your organization. Read more.
11:50am - 12:30pm Wednesday, March 27, 2019
Sarah Aerni (Salesforce)
Average rating: ****.
(4.25, 4 ratings)
How does Salesforce make data science an Agile partner to over 100,000 customers? Sarah Aerni shares the nuts and bolts of the platform and details the Agile process behind it. From the open source AutoML library TransmogrifAI and experimentation to deployment and monitoring, Sarah covers the tools that make it possible for data scientists to rapidly iterate and adopt a truly Agile methodology. Read more.
11:50am - 12:30pm Wednesday, March 27, 2019
Tobias Knaup (Mesosphere), Joerg Schad (Suki)
Average rating: ****.
(4.50, 2 ratings)
There are many great tutorials for training your deep learning models, but training is only a small part of the overall deep learning pipeline. Tobias Knaup and Joerg Schad offer an introduction to building a complete automated deep learning pipeline, from exploratory analysis through training, model storage, model serving, and monitoring. Read more.
11:50am - 12:30pm Wednesday, March 27, 2019
Ron Bodkin (Google)
Average rating: ****.
(4.33, 6 ratings)
Google uses deep learning extensively in new and existing products. Join Ron Bodkin to learn how Google has used deep learning for recommendations at YouTube, in the Play store, and for customers in Google Cloud. You'll explore the role of embeddings, recurrent networks, contextual variables, and wide and deep learning and discover how to do candidate generation and ranking with deep learning. Read more.
2:40pm - 3:20pm Wednesday, March 27, 2019
Yaron Haviv (iguazio)
Average rating: ****.
(4.00, 2 ratings)
Faced with the need to handle increasing volumes of data, alternative datasets ("alt data"), and AI, many enterprises are working to design or redesign their big data architectures, but traditional batch platforms fail to generate sufficient ROI. Yaron Haviv shares a continuous analytics approach that yields faster answers for the business while remaining simpler and less expensive for IT. Read more.
4:20pm - 5:00pm Wednesday, March 27, 2019
Jowanza Joseph (Pluralsight), Karthik Ramasamy (Streamlio)
Average rating: ****.
(4.00, 1 rating)
After two years of running streaming pipelines through Kinesis and Spark at One Click Retail, Jowanza Joseph and Karthik Ramasamy decided to explore a new platform that would take advantage of Kubernetes and support a simpler data processing DSL. Join in to discover why they chose Apache Pulsar and learn tips and tricks for using Pulsar Functions. Read more.
5:10pm - 5:50pm Wednesday, March 27, 2019
Skyler Thomas (MapR), Terry He (MapR Technologies)
Average rating: ****.
(4.75, 4 ratings)
Kubeflow separates compute and storage to provide the ability to deploy best-of-breed open source systems for machine learning to any cluster running Kubernetes, whether on-premises or in the cloud. Skyler Thomas and Terry He explore the problems of state and storage and explain how distributed persistent storage can logically extend the compute flexibility provided by Kubeflow. Read more.
5:10pm - 5:50pm Wednesday, March 27, 2019
Rustem Feyzkhanov (Instrumental)
Average rating: ***..
(3.50, 8 ratings)
Serverless implementation of core processing is quickly becoming a production-ready solution. However, companies with existing processing pipelines may find it hard to go completely serverless. Serverless workflows unite the serverless and cluster worlds, with the benefits of both approaches. Rustem Feyzkhanov demonstrates how serverless workflows change your perception of software architecture. Read more.
5:10pm - 5:50pm Wednesday, March 27, 2019
Gwen Shapira (Confluent)
Average rating: ****.
(4.64, 11 ratings)
As microservices, data services, and serverless APIs proliferate, data engineers need to collect and standardize data in an increasingly complex and diverse system. Gwen Shapira discusses how data engineering requirements have changed in a cloud native world and shares architectural patterns that are commonly used to build flexible, scalable, and reliable data pipelines. Read more.
5:10pm - 5:50pm Wednesday, March 27, 2019
Kevin Moore (Salesforce)
Average rating: ****.
(4.50, 2 ratings)
Kevin Moore walks you through how TransmogrifAI—Salesforce's open source AutoML library built on Spark—automatically generates models customized to a company's dataset and use case and provides insights into why the model is making the predictions it does. Read more.
11:00am - 11:40am Thursday, March 28, 2019
Thomas Phelan (BlueData)
Average rating: ****.
(4.50, 2 ratings)
Recent headline-grabbing data breaches demonstrate that protecting data is essential for every enterprise. The best-of-breed approach for big data is HDFS configured with Transparent Data Encryption (TDE). But TDE is difficult to configure and manage—particularly when run in Docker containers. Thomas Phelan discusses these challenges and explains how to overcome them. Read more.
11:00am - 11:40am Thursday, March 28, 2019
Eric Jonas (UC Berkeley)
Average rating: ****.
(4.50, 2 ratings)
Eric Jonas offers a quick history of cloud computing, including an accounting of the predictions of the 2009 "Berkeley View of Cloud Computing" paper, explains the motivation for serverless computing, describes applications that stretch the current limits of serverless, and then lists obstacles and research opportunities required for serverless computing to fulfill its full potential. Read more.
11:50am - 12:30pm Thursday, March 28, 2019
Ken Johnston (Microsoft), Ankit Srivastava (Microsoft)
Average rating: ****.
(4.50, 2 ratings)
Today, normal growth isn't enough—you need hockey-stick levels of growth. Sales and marketing orgs are looking to AI to "growth hack" their way to new markets and segments. Ken Johnston and Ankit Srivastava explain how to use mutual information at scale across massive data sources to help filter out noise and share critical insights with new cohorts of users, businesses, and networks. Read more.
11:50am - 12:30pm Thursday, March 28, 2019
Jason Wang (Cloudera), Sushant Rao (Cloudera)
Average rating: ****.
(4.00, 2 ratings)
Jason Wang and Sushant Rao offer an overview of cloud architecture, then go into detail on core cloud paradigms like compute (virtual machines), cloud storage, authentication and authorization, and encryption and security. They conclude by bringing these concepts together through customer stories to demonstrate how real-world companies have leveraged the cloud for their big data platforms. Read more.
11:50am - 12:30pm Thursday, March 28, 2019
Avner Braverman (Binaris)
Average rating: ****.
(4.00, 3 ratings)
What is serverless, and how can it be utilized for data analysis and AI? Avner Braverman outlines the benefits and limitations of serverless with respect to data transformation (ETL), AI inference and training, and real-time streaming. This is a technical talk, so expect demos and code. Read more.
1:50pm - 2:30pm Thursday, March 28, 2019
Visualization and UX
Location: 2024
Stefaan Vervaet (Western Digital Corporation), Alain Dufaux (École Polytechnique Fédérale de Lausanne (EPFL))
Average rating: *****
(5.00, 1 rating)
The École Polytechnique Fédérale de Lausanne (EPFL) spearheaded the official digital archival of 15,000+ hours of A/V content captured from the Montreux Jazz Festival since 1967. Stefaan Vervaet and Alain Dufaux explain how EPFL created an immersive 3D VR experience. From capture and store to delivery and experience, they detail the evolution of the workflow that made it all possible. Read more.
2:40pm - 3:20pm Thursday, March 28, 2019
Till Bergmann (Salesforce)
Average rating: ***..
(3.67, 6 ratings)
A problem in predictive modeling data is label leakage. At enterprise companies such as Salesforce, this problem takes on monstrous proportions as the data is populated by diverse business processes, making it hard to distinguish cause from effect. Till Bergmann explains how Salesforce—which needs to churn out thousands of customer-specific models for any given use case—tackled this problem. Read more.
2:40pm - 3:20pm Thursday, March 28, 2019
Paul Curtis (MapR Technologies)
Average rating: ****.
(4.50, 2 ratings)
What do you do when your technology doesn’t easily fit on a single laptop and consists of many components? Paul Curtis explains how MapR Technologies rolled out a containerized, scalable, globally available, and easily updatable environment using a combination of Kubernetes to orchestrate, shared data fabric to store and persist, and AppLariat to provide the user interface. Read more.
2:40pm - 3:20pm Thursday, March 28, 2019
Eva Andreasson (Cloudera), Mark Brine (Cloudera), Michael Kohs (Cloudera)
Average rating: **...
(2.00, 3 ratings)
Michael Kohs, Eva Andreasson, and Mark Brine explain how Cloudera’s Finance Department used a hybrid model to speed up report delivery and reduce the cost of end-of-quarter reporting. They also share guidelines for deploying modern data warehousing in a hybrid cloud environment, outlining when you should choose a private cloud service over a public one, the available options, and some dos and don'ts. Read more.
3:50pm - 4:30pm Thursday, March 28, 2019
Patrick Miller (Civis Analytics)
Average rating: ***..
(3.40, 5 ratings)
Brands that test the content of ads before they are shown to an audience can avoid spending resources on the 11% of ads that cause backlash. Using a survey experiment to choose the best ad improves the effectiveness of marketing campaigns by 13% on average, and by up to 37% for particular demographics. Patrick Miller explores data collection and statistical methods for analysis and reporting. Read more.
3:50pm - 4:30pm Thursday, March 28, 2019
Igor Canadi (Rockset), Dhruba Borthakur (Rockset)
Average rating: ****.
(4.00, 1 rating)
Most existing big data systems prefer sequential scans for processing queries. Igor Canadi and Dhruba Borthakur challenge this view, offering an overview of converged indexing: a single system called ROCKSET that builds inverted, columnar, and document indices. Converged indexing is economically feasible due to the elasticity of cloud resources and write-optimized storage engines. Read more.
4:40pm - 5:20pm Thursday, March 28, 2019
Jinchul Kim (SK Telecom)
Average rating: **...
(2.17, 6 ratings)
Druid supports autoscaling for data ingestion, but it's only available on AWS EC2. You can't rely on the feature on your private cloud. Jinchul Kim demonstrates autoscale-out/in on Kubernetes, details the benefits of this approach, and discusses the development of Druid Helm charts, rolling updates, and custom metric usage for horizontal autoscaling. Read more.