Schedule: AI and Data technologies in the cloud sessions: Big data conference & machine learning training

9:00am–5:00pm Monday, March 25 & Tuesday, March 26, 2019

Building a serverless big data application on AWS

Data Engineering & Architecture
Location: 2018

Jorge Lopez (Amazon Web Services), Roy Hasson (Amazon Web Services), Rajeev Chakrabarti (Amazon Web Services), Jesse Gebhardt (Amazon Web Services), Gautam Srinivasan (Amazon Web Services), Anthony Nguyen (Amazon Web Services)

Average rating:

(4.50, 4 ratings)

Serverless technologies let you build and scale applications and services rapidly without the need to provision or manage servers. Join in to learn how to incorporate serverless concepts into your big data architectures, looking at design patterns to ingest, store, and analyze your data. You'll then build a big data application using AWS technologies such as S3, Athena, Kinesis, and more. Read more.

1:30pm–5:00pm Tuesday, March 26, 2019

Running multidisciplinary big data workloads in the cloud

Data Engineering & Architecture
Location: 2008

Jason Wang (Cloudera), Brandon Freeman (Cloudera), Michael Kohs (Cloudera), Akihiro Nishikawa (Cloudera), Toby Ferguson (Cloudera)

Average rating:

(3.20, 5 ratings)

There are many challenges with moving multidisciplinary big data workloads to the cloud and running them. Jason Wang, Brandon Freeman, Michael Kohs, Akihiro Nishikawa, and Toby Ferguson explore cloud architecture and its challenges and walk you through using Cloudera Altus to build data warehousing and data engineering clusters and run workloads that share metadata between them using Cloudera SDX. Read more.

1:30pm–5:00pm Tuesday, March 26, 2019

Cross-cloud model training and serving with Kubeflow

Data Engineering & Architecture
Location: 2007

Holden Karau (Independent), Francesca Lazzeri (Microsoft), Trevor Grant (IBM)

Average rating:

(3.00, 2 ratings)

Holden Karau, Francesca Lazzeri, and Trevor Grant offer an overview of Kubeflow and walk you through using it to train and serve models across different cloud environments (and on-premises). You'll use a script to do the initial setup work, so you can jump (almost) straight into training a model on one cloud and then look at how to set up serving in another cluster/cloud. Read more.

1:30pm–5:00pm Tuesday, March 26, 2019

Architecture and algorithms for end-to-end streaming data processing

Data Engineering & Architecture
Location: 2005

Arun Kejariwal (Independent), Karthik Ramasamy (Streamlio)

Average rating:

(2.67, 12 ratings)

Many industry segments have been grappling with fast data (high-volume, high-velocity data). Arun Kejariwal and Karthik Ramasamy walk you through the state-of-the-art systems for each stage of an end-to-end data processing pipeline—messaging, compute, and storage—for real-time data and algorithms to extract insights (e.g., heavy hitters and quantiles) from data streams. Read more.

11:00am–11:40am Wednesday, March 27, 2019

Deep learning applications for non-engineers

Data Science, Machine Learning & AI
Location: 2016

Jeremy Howard (fast.ai | USF | doc.ai and platform.ai)

Average rating:

(4.80, 5 ratings)

Jeremy Howard describes how to leverage the latest research from the deep learning and HCI communities to train neural networks from scratch—without code or preexisting labels. He then shares case studies in fashion, retail and ecommerce, travel, and agriculture where these approaches have been used. Read more.

11:00am–11:40am Wednesday, March 27, 2019

Cost-effective Presto on AWS with Spot nodes

Data Engineering & Architecture
Location: 2004

Shubham Tagra (Qubole)

Average rating:

(3.50, 8 ratings)

Did you know you can run Presto in AWS at a tenth of the cost with AWS Spot nodes, given just a few architectural enhancements to Presto? Shubham Tagra explores the gaps in Presto architecture, explains how to use Spot nodes, covers the enhancements, and showcases the improvements in reliability and TCO achieved through them. Read more.

11:00am–11:40am Wednesday, March 27, 2019

Automating DevOps for machine learning

Data Engineering & Architecture
Location: 2008

Diego Oppenheimer (Algorithmia)

Average rating:

(4.00, 11 ratings)

You've invested heavily in cleaning your data, feature engineering, training, and tuning your model, but now you have to deploy it into production, and you discover that's a huge challenge. Diego Oppenheimer shares common architectural patterns and best practices from the most advanced organizations that are deploying models for scalability and accessibility. Read more.

11:00am–11:40am Wednesday, March 27, 2019

Scaling visualization for big data and analytics in the cloud

Strata Business Summit, Visualization and UX
Location: 2018

Jaipaul Agonus (FINRA), Daniel Monteiro (FINRA)

Average rating:

(3.40, 5 ratings)

Jaipaul Agonus and Daniel Monteiro do Carmo Rosa detail big data analytics and visualization practices and tools used by FINRA to support machine learning and other surveillance activities that the Market Regulation Department conducts in the AWS cloud. Read more.

11:00am–11:40am Wednesday, March 27, 2019

Cloud native machine learning: Emerging trends and the road ahead

Data Science, Machine Learning & AI
Location: 2011

Tristan Zajonc (Cloudera), Tim Chen (Cloudera)

Average rating:

(4.40, 5 ratings)

Data platforms are being asked to support an ever increasing range of workloads and compute environments, including machine learning and elastic cloud platforms. Tristan Zajonc and Tim Chen discuss emerging capabilities, including running machine learning and Spark workloads on autoscaling container platforms, and share their vision for the road ahead for ML and AI in the cloud. Read more.

11:00am–11:40am Wednesday, March 27, 2019

Building high-performance text classifiers on a limited labeling budget

Data Science, Machine Learning & AI
Location: 2010

Robert Horton (Microsoft), Mario Inchiosa (Microsoft), Ali Zaidi (Microsoft)

Average rating:

(4.70, 10 ratings)

Robert Horton, Mario Inchiosa, and Ali Zaidi demonstrate how to use three cutting-edge machine learning techniques—transfer learning from pretrained language models, active learning to make more effective use of a limited labeling budget, and hyperparameter tuning to maximize model performance—to up your modeling game. Read more.

11:00am–11:40am Wednesday, March 27, 2019

Machine learning on encrypted data: Challenges and opportunities

Data Science, Machine Learning & AI, Expo Hall
Location: Expo Hall

Alon Kaufman (Duality), Vinod Vaikuntanathan (MIT and Duality Technologies)

Average rating:

(3.75, 4 ratings)

Alon Kaufman and Vinod Vaikuntanathan discuss the challenges and opportunities of machine learning on encrypted data and describe the state of the art in this space. Read more.

11:00am–11:40am Wednesday, March 27, 2019

Executive Briefing: From the edge to AI—Taking control of your data for fun and profit

Executive Briefing and best practices, Strata Business Summit
Location: 2020

Mike Olson (Cloudera)

Average rating:

(3.80, 5 ratings)

It's easier than ever to collect data, but managing it securely in compliance with regulations and legal constraints is harder. Mike Olson discusses the risks and the issues that matter most and explains how an enterprise data cloud that embraces your data center and the public cloud in combination can address them, delivering real business results for your organization. Read more.

11:50am–12:30pm Wednesday, March 27, 2019

Automated machine learning for Agile data science at scale

Data Science, Machine Learning & AI
Location: 2011

Sarah Aerni (Salesforce)

Average rating:

(4.25, 4 ratings)

How does Salesforce make data science an Agile partner to over 100,000 customers? Sarah Aerni shares the nuts and bolts of the platform and details the Agile process behind it. From the open source AutoML library TransmogrifAI and experimentation to deployment and monitoring, Sarah covers the tools that make it possible for data scientists to rapidly iterate and adopt a truly Agile methodology. Read more.

11:50am–12:30pm Wednesday, March 27, 2019

Deep learning beyond the learning

Data Engineering & Architecture
Location: 2008

Tobias Knaup (Mesosphere), Joerg Schad (ArangoDB)

Average rating:

(4.50, 2 ratings)

There are many great tutorials for training your deep learning models, but training is only a small part of the overall deep learning pipeline. Tobias Knaup and Joerg Schad offer an introduction to building a complete automated deep learning pipeline, from exploratory analysis through training, model storage, model serving, and monitoring. Read more.

11:50am–12:30pm Wednesday, March 27, 2019

Applying deep learning at Google for recommendations

Data Science, Machine Learning & AI, Expo Hall
Location: Expo Hall

Ron Bodkin (Google)

Average rating:

(4.33, 6 ratings)

Google uses deep learning extensively in new and existing products. Join Ron Bodkin to learn how Google has used deep learning for recommendations at YouTube, in the Play store, and for customers in Google Cloud. You'll explore the role of embeddings, recurrent networks, contextual variables, and wide and deep learning and discover how to do candidate generation and ranking with deep learning. Read more.

2:40pm–3:20pm Wednesday, March 27, 2019

Goodbye, data lake: Why continuous analytics yield higher ROI

Data Engineering & Architecture
Location: 2002

Yaron Haviv (iguazio)

Average rating:

(4.00, 2 ratings)

Faced with the need to handle increasing volumes of data, alternative datasets ("alt data"), and AI, many enterprises are working to design or redesign their big data architectures, but traditional batch platforms fail to generate sufficient ROI. Yaron Haviv shares a continuous analytics approach that yields faster answers for the business while remaining simpler and less expensive for IT. Read more.

4:20pm–5:00pm Wednesday, March 27, 2019

Reducing stream processing complexity using Apache Pulsar Functions

Data Engineering & Architecture
Location: 2002

Jowanza Joseph (Pluralsight), Karthik Ramasamy (Streamlio)

Average rating:

(4.00, 1 rating)

After two years of running streaming pipelines through Kinesis and Spark at One Click Retail, Jowanza Joseph and Karthik Ramasamy decided to explore a new platform that would take advantage of Kubernetes and support a simpler data processing DSL. Join in to discover why they chose Apache Pulsar and learn tips and tricks for using Pulsar Functions. Read more.

5:10pm–5:50pm Wednesday, March 27, 2019

Persistent storage for machine learning in Kubeflow

Data Engineering & Architecture
Location: 2008

Skyler Thomas (MapR), Terry He (MapR Technologies)

Average rating:

(4.75, 4 ratings)

Kubeflow separates compute and storage to provide the ability to deploy best-of-breed open source systems for machine learning to any cluster running Kubernetes, whether on-premises or in the cloud. Skyler Thomas and Terry He explore the problems of state and storage and explain how distributed persistent storage can logically extend the compute flexibility provided by Kubeflow. Read more.

5:10pm–5:50pm Wednesday, March 27, 2019

Serverless workflows for orchestrating hybrid cluster-based and serverless processing

Data Engineering & Architecture
Location: 2002

Rustem Feyzkhanov (Instrumental)

Average rating:

(3.50, 8 ratings)

Serverless implementation of core processing is quickly becoming a production-ready solution. However, companies with existing processing pipelines may find it hard to go completely serverless. Serverless workflows unite the serverless and cluster worlds, with the benefits of both approaches. Rustem Feyzkhanov demonstrates how serverless workflows change your perception of software architecture. Read more.

5:10pm–5:50pm Wednesday, March 27, 2019

Cloud native data pipelines with Apache Kafka

Data Engineering & Architecture
Location: 2001

Gwen Shapira (Confluent)

Average rating:

(4.64, 11 ratings)

As microservices, data services, and serverless APIs proliferate, data engineers need to collect and standardize data in an increasingly complex and diverse system. Gwen Shapira discusses how data engineering requirements have changed in a cloud native world and shares architectural patterns that are commonly used to build flexible, scalable, and reliable data pipelines. Read more.

5:10pm–5:50pm Wednesday, March 27, 2019

Point, click, predict

Data Science, Machine Learning & AI, Expo Hall
Location: Expo Hall

Kevin Moore (Salesforce)

Average rating:

(4.50, 2 ratings)

Kevin Moore walks you through how TransmogrifAI (Salesforce's open source AutoML library built on Spark) automatically generates models customized to a company's dataset and use case and provides insights into why each model makes the predictions it does. Read more.

11:00am–11:40am Thursday, March 28, 2019

How to protect big data in a containerized environment

Data Engineering & Architecture
Location: 2024

Thomas Phelan (HPE BlueData)

Average rating:

(4.50, 2 ratings)

Recent headline-grabbing data breaches demonstrate that protecting data is essential for every enterprise. The best-of-breed approach for big data is HDFS configured with Transparent Data Encryption (TDE). But TDE is difficult to configure and manage—particularly when run in Docker containers. Thomas Phelan discusses these challenges and explains how to overcome them. Read more.

11:00am–11:40am Thursday, March 28, 2019

Cloud programming simplified: A Berkeley view on serverless computing

Data Engineering & Architecture
Location: 2007

Eric Jonas (UC Berkeley)

Average rating:

(4.50, 2 ratings)

Eric Jonas offers a quick history of cloud computing, including an accounting of the predictions of the 2009 "Berkeley View of Cloud Computing" paper, explains the motivation for serverless computing, describes applications that stretch the current limits of serverless, and then lists obstacles and research opportunities required for serverless computing to fulfill its full potential. Read more.

11:50am–12:30pm Thursday, March 28, 2019

Infinite segmentation: Scalable mutual information ranking on real-world graphs

Data Science, Machine Learning & AI
Location: 2011

Ken Johnston (Microsoft), Ankit Srivastava (Microsoft)

Average rating:

(4.50, 2 ratings)

Today, normal growth isn't enough; you need hockey-stick levels of growth. Sales and marketing orgs are looking to AI to "growth hack" their way to new markets and segments. Ken Johnston and Ankit Srivastava explain how to use mutual information at scale across massive data sources to help filter out noise and share critical insights with new cohorts of users, businesses, and networks. Read more.

11:50am–12:30pm Thursday, March 28, 2019

Journey to the cloud: Architecting for the cloud through customer stories

Data Engineering & Architecture
Location: 2001

Jason Wang (Cloudera), Sushant Rao (Cloudera)

Average rating:

(4.00, 2 ratings)

Jason Wang and Sushant Rao offer an overview of cloud architecture, then go into detail on core cloud paradigms like compute (virtual machines), cloud storage, authentication and authorization, and encryption and security. They conclude by bringing these concepts together through customer stories to demonstrate how real-world companies have leveraged the cloud for their big data platforms. Read more.

11:50am–12:30pm Thursday, March 28, 2019

Serverless for data and AI

Data Engineering & Architecture, Data Science, Machine Learning & AI, Streaming and IoT
Location: 2007

Avner Braverman (Binaris)

Average rating:

(4.00, 3 ratings)

What is serverless, and how can it be utilized for data analysis and AI? Avner Braverman outlines the benefits and limitations of serverless with respect to data transformation (ETL), AI inference and training, and real-time streaming. This is a technical talk, so expect demos and code. Read more.

1:50pm–2:30pm Thursday, March 28, 2019

How EPFL captured the feel of the Montreux Jazz Festival with its immersive 3D VR to three-geo archive

Visualization and UX
Location: 2024

Stefaan Vervaet (Western Digital Corporation), Alain Dufaux (École Polytechnique Fédérale de Lausanne (EPFL))

Average rating:

(5.00, 1 rating)

The École Polytechnique Fédérale de Lausanne (EPFL) spearheaded the official digital archival of 15,000+ hours of A/V content captured from the Montreux Jazz Festival since 1967. Stefaan Vervaet and Alain Dufaux explain how EPFL created an immersive 3D VR experience. From capture and store to delivery and experience, they detail the evolution of the workflow that made it all possible. Read more.

2:40pm–3:20pm Thursday, March 28, 2019

How to train your model (and catch label leakage)

Data Science, Machine Learning & AI
Location: 2010

Till Bergmann (Salesforce)

Average rating:

(3.67, 6 ratings)

One problem with predictive modeling data is label leakage. At enterprise companies such as Salesforce, this problem takes on monstrous proportions because the data is populated by diverse business processes, making it hard to distinguish cause from effect. Till Bergmann explains how Salesforce, which needs to churn out thousands of customer-specific models for any given use case, tackled this problem. Read more.

2:40pm–3:20pm Thursday, March 28, 2019

Clusters in Kubernetes on a cluster: Building a multitenant environment for the field

Data Engineering & Architecture
Location: 2008

Paul Curtis (Weaveworks)

Average rating:

(4.50, 2 ratings)

What do you do when your technology doesn’t easily fit on a single laptop and consists of many components? Paul Curtis explains how MapR Technologies rolled out a containerized, scalable, globally available, and easily updatable environment using a combination of Kubernetes to orchestrate, shared data fabric to store and persist, and AppLariat to provide the user interface. Read more.

2:40pm–3:20pm Thursday, March 28, 2019

How to survive future data warehousing challenges with the help of a hybrid cloud

Data Engineering & Architecture
Location: 2004

Eva Andreasson (Cloudera), Mark Brine (Cloudera), Michael Kohs (Cloudera)

Average rating:

(2.00, 3 ratings)

Michael Kohs, Eva Andreasson, and Mark Brine explain how Cloudera’s Finance Department used a hybrid model to speed up report delivery and reduce the cost of end-of-quarter reporting. They also share guidelines for deploying modern data warehousing in a hybrid cloud environment, outlining when you should choose a private cloud service over a public one, the available options, and some dos and don'ts. Read more.

3:50pm–4:30pm Thursday, March 28, 2019

Testing ad content with survey experiments

Data Science, Machine Learning & AI
Location: 2010

Patrick Miller (Civis Analytics)

Average rating:

(3.40, 5 ratings)

Brands that test the content of ads before they are shown to an audience can avoid spending resources on the 11% of ads that cause backlash. Using a survey experiment to choose the best ad improves the effectiveness of marketing campaigns by 13% on average, and by up to 37% for particular demographics. Patrick Miller explores data collection and statistical methods for analysis and reporting. Read more.

3:50pm–4:30pm Thursday, March 28, 2019

ROCKSET: The design and implementation of a data system for low-latency queries for search and analytics

Data Engineering & Architecture
Location: 2002

Igor Canadi (Rockset), Dhruba Borthakur (Rockset)

Average rating:

(4.00, 1 rating)

Most existing big data systems prefer sequential scans for processing queries. Igor Canadi and Dhruba Borthakur challenge this view, offering an overview of converged indexing: a single system called ROCKSET that builds inverted, columnar, and document indices. Converged indexing is economically feasible due to the elasticity of cloud resources and write-optimized storage engines. Read more.

4:40pm–5:20pm Thursday, March 28, 2019

Apache Druid autoscale-out/in for streaming data ingestion on Kubernetes

Data Engineering & Architecture, Streaming and IoT
Location: 2006

Jinchul Kim (SK Telecom)

Average rating:

(2.17, 6 ratings)

Druid supports autoscaling for data ingestion, but it's only available on AWS EC2. You can't rely on the feature on your private cloud. Jinchul Kim demonstrates autoscale-out/in on Kubernetes, details the benefits of this approach, and discusses the development of Druid Helm charts, rolling updates, and custom metric usage for horizontal autoscaling. Read more.
