Sep 23–26, 2019

Schedule: Deep dive into specific tools, platforms, or frameworks sessions

9:00am - 5:00pm Monday, September 23 & Tuesday, September 24
Location: 1E 07
Dylan Bargteil (The Data Incubator)
The TensorFlow library provides for the use of computational graphs with automatic parallelization across resources. This architecture is ideal for implementing neural networks. Dylan Bargteil explores TensorFlow's capabilities in Python, demonstrating how to build machine learning algorithms piece by piece and how to use TensorFlow's Keras API with several hands-on applications. Read more.
9:00am - 5:00pm Monday, September 23 & Tuesday, September 24
Location: 1A 15/16
Michael Cullan (The Data Incubator)
Michael Cullan walks you through developing a machine learning pipeline from prototyping to production. You'll learn about data cleaning, feature engineering, model building and evaluation, and deployment and then extend these models into two applications from real-world datasets. All work will be done in Python. Read more.
9:00am - 5:00pm Monday, September 23 & Tuesday, September 24
Location: 1E 06
Jesse Anderson (Big Data Institute)
Jesse Anderson offers you an in-depth look at Apache Kafka. You'll learn how Kafka works and how to create real-time systems with it, as well as how to create consumers and publishers. Jesse then walks you through Kafka’s ecosystem, demonstrating how to use tools like Kafka Streams, Kafka Connect, and KSQL. Read more.
9:00am - 5:00pm Monday, September 23 & Tuesday, September 24
Location: 1A 17
Jorge Lopez (Amazon Web Services), Radhika Ravirala (Amazon Web Services), Nikki Rouda (Amazon Web Services), Jesse Gebhardt (Amazon Web Services), Rajeev Chakrabarti (Amazon Web Services)
Serverless technologies let you build and scale applications and services rapidly without the need to provision or manage servers. Join the AWS team to learn how to incorporate serverless concepts into your big data architectures. You'll explore design patterns to ingest, store, and analyze your data as you build a big data application using AWS technologies such as S3, Athena, Kinesis, and more. Read more.
9:00am - 12:30pm Tuesday, September 24, 2019
Location: 1E 10
Viktor Gamov (Confluent)
Building stream processing applications is certainly one of the hot topics in the IT community. But if you've ever thought you needed to be a programmer to do stream processing and build stream processing data pipelines, think again. Viktor Gamov explores KSQL, the stream processing query engine built on top of Apache Kafka. Read more.
9:00am - 12:30pm Tuesday, September 24, 2019
Location: 1E 11
Purnima Reddy Kuchikulla (Cloudera), Timothy Spann (Cloudera), Abdelkrim Hadjidj (Cloudera), Andre Araujo (Cloudera), Hemanth Yamijala (Cloudera)
There are too many edge devices and agents, and you need to control and manage them. Purnima Reddy Kuchikulla, Timothy Spann, Abdelkrim Hadjidj, and Andre Araujo walk you through handling the difficulty in collecting real-time data and the trouble with updating a specific set of agents with edge applications. Get your hands dirty with CEM, which addresses these challenges with ease. Read more.
9:00am - 12:30pm Tuesday, September 24, 2019
Location: 1E 08
Matt Fuller (Starburst)
Used by Facebook, Netflix, Airbnb, LinkedIn, Twitter, Uber, and others, Presto has become the ubiquitous open source software for SQL on anything. Presto was built from the ground up for fast interactive SQL analytics against disparate data sources ranging in size from GBs to PBs. Join Matt Fuller to learn how to use Presto and explore use cases and best practices you can implement today. Read more.
1:30pm - 5:00pm Tuesday, September 24, 2019
Location: 1A 23/24
David Talby (Pacific AI), Alex Thomas (John Snow Labs), Saif Addin Ellafi (John Snow Labs), Claudiu Branzan (Accenture)
David Talby, Alex Thomas, Saif Addin Ellafi, and Claudiu Branzan walk you through state-of-the-art natural language processing (NLP) using the highly performant, highly scalable open source Spark NLP library. You'll spend about half your time coding as you work through four sections, each with an end-to-end working codebase that you can change and improve. Read more.
1:30pm - 5:00pm Tuesday, September 24, 2019
Location: 1A 21
Karthik Sonti (Amazon Web Services), Emily Webber (Amazon Web Services), Varun Rao Bhamidimarri (Amazon Web Services)
Karthik Sonti, Emily Webber, and Varun Rao Bhamidimarri introduce you to the Amazon SageMaker machine learning platform and provide a high-level discussion of recommender systems. You'll dig into different machine learning approaches for recommender systems, including common methods such as matrix factorization as well as newer embedding approaches. Read more.
1:30pm - 5:00pm Tuesday, September 24, 2019
Location: 1E 14
Purnima Reddy Kuchikulla (Cloudera), Dan Chaffelson (Cloudera), Attila Kanto (Cloudera), Tony Wu (Cloudera)
Kafka is omnipresent and the backbone of streaming analytics applications and data lakes. The challenge is understanding what's going on overall in the Kafka cluster, including performance, issues, and message flows. Purnima Reddy Kuchikulla and Dan Chaffelson walk you through a hands-on experience to visualize the entire Kafka environment end-to-end and simplify Kafka operations via SMM. Read more.
11:20am - 12:00pm Wednesday, September 25, 2019
Location: 1A 08/10
Nan Zhu (Uber), Felix Cheung (Uber)
XGBoost has been widely deployed in companies across the industry. Nan Zhu and Felix Cheung dive into the internals of distributed training in XGBoost and demonstrate how XGBoost solves business problems at Uber, scaling to thousands of workers and tens of TB of training data. Read more.
1:15pm - 1:55pm Wednesday, September 25, 2019
Location: 1E 09
The Apache Parquet community is working on a column encryption mechanism that protects sensitive data and enables access control for table columns. Many companies are involved, and the mechanism specification has recently been signed off on by the community management committee. Gidon Gershinsky explores the basics of Parquet encryption technology, its usage model, and a number of use cases. Read more.
1:15pm - 1:55pm Wednesday, September 25, 2019
Location: 1A 15/16
Michael Noll (Confluent)
Would you cross the street with traffic information that's a minute old? Certainly not. Modern businesses have the same needs. Michael Noll explores why and how you can use Kafka and its growing ecosystem to build elastic event-driven architectures. Specifically, you'll look at Kafka as the storage layer, at Kafka Connect for data integration, and at Kafka Streams and KSQL as the compute layer. Read more.
1:15pm - 1:55pm Wednesday, September 25, 2019
Location: 1A 23/24
Wim Stoop (Cloudera), Srikanth Venkat (Cloudera)
Establishing enterprise-wide security and governance remains a challenge for most organizations. Integrations and exchanges across the landscape are costly to manage and maintain, and typically work in one direction only. Wim Stoop and Srikanth Venkat explore how ODPi's Egeria standard and framework removes the challenges and is leveraged by Cloudera and partners alike to deliver value. Read more.
4:35pm - 5:15pm Wednesday, September 25, 2019
Location: 1A 21/22
Prakhar Jain (Qubole), Sourabh Goyal (Qubole)
Autoscaling of resources aims to achieve low latency for a big data application while reducing resource costs. Upscaling a cluster in the cloud is fairly easy compared to downscaling nodes, and so the overall total cost of ownership (TCO) goes up. Prakhar Jain and Sourabh Goyal examine a new design for efficient downscaling, which helps achieve better resource utilization and lower TCO. Read more.
4:35pm - 5:15pm Wednesday, September 25, 2019
Location: 1E 07/08
Wangda Tan (Cloudera), Wei-Chiu Chuang (Cloudera)
Wangda Tan and Wei-Chiu Chuang outline the current status of the Apache Hadoop community and dive into the present and future of Hadoop 3.x. You'll get a peek at new features like erasure coding, GPU support, NameNode federation, Docker, long-running services support, powerful container placement constraints, data node disk balancing, etc. They also walk you through upgrade guidance from 2.x to 3.x. Read more.
4:35pm - 5:15pm Wednesday, September 25, 2019
Location: 1E 14
Elasticsearch (ES) allows extremely quick search and drilldowns on large amounts of semistructured data. Elasticsearch, however, does not have relational join capabilities. Giovanni Tummarello examines a plug-in for ES that adds cluster distributed joins and demonstrates how it enables an exciting array of use cases dealing with interconnected or "Knowledge Graph" enterprise data. Read more.
5:25pm - 6:05pm Wednesday, September 25, 2019
Location: 1A 21/22
Chenzhao Guo (Intel), Carson Wang (Intel)
Shuffle in Spark requires the shuffle data to be persisted on local disks. However, the assumptions of collocated storage do not always hold in today’s data centers. Chenzhao Guo and Carson Wang outline the implementation of a new Spark shuffle manager, which writes shuffle data to a remote cluster with different storage backends, making life easier for customers. Read more.
5:25pm - 6:05pm Wednesday, September 25, 2019
Location: 1E 07/08
Krishna Maheshwari (Cloudera)
Krishna Maheshwari provides an overview of the major features and enhancements in the HBase 2.0 release, upcoming releases, and the future of HBase. You'll be able to ask questions at the end. Apache HBase 2.0 comes packed with a lot of new functionality: off-heap read paths, multitier bucket cache, new finite state machine-based assignment manager, etc. Read more.
1:15pm - 1:55pm Thursday, September 26, 2019
Location: 1A 15/16
Alon Gavra (AppsFlyer)
Frequently, Kafka is just a piece of the production stack that no one wants to touch, because it just works. Alon Gavra outlines how Kafka sits at the core of AppsFlyer's infrastructure that processes billions of events daily. Read more.
1:15pm - 1:55pm Thursday, September 26, 2019
Location: 3B - Expo Hall
Victor Dibia (Cloudera Fast Forward Labs)
Recent advances in machine learning frameworks for the browser such as TensorFlow.js provide the opportunity to craft truly novel experiences within frontend applications. Victor Dibia explores the state of the art for machine learning in the browser using TensorFlow.js and outlines its use in the design of Handtrack.js, a library for prototyping real-time hand detection in the browser. Read more.
1:15pm - 1:55pm Thursday, September 26, 2019
Location: 1A 23/24
Omkar Joshi (Uber), Bo Yang (Uber)
Omkar Joshi and Bo Yang offer an overview of how Uber’s ingestion (Marmaray) and observability team improved the performance of Apache Spark applications running on thousands of cluster machines and across hundreds of thousands of applications, and how the team methodically tackled these issues. They also cover how they used Uber’s open-sourced jvm-profiler for debugging issues at scale. Read more.
2:05pm - 2:45pm Thursday, September 26, 2019
Location: 1A 03
Stephan Ewen (Ververica)
Stephan Ewen details how stream processing is becoming a "grand unifying paradigm" for data processing and the newest developments in Apache Flink to support this trend: new cross-batch-streaming machine learning algorithms, state-of-the-art batch performance, and new building blocks for data-driven applications and application consistency. Read more.
3:45pm - 4:25pm Thursday, September 26, 2019
Location: 1A 21/22
Sireesha Muppala (Amazon Web Services), Shelbee Eigenbrode (Amazon Web Services), Randall DeFauw (Amazon Web Services)
As an increasing level of automation becomes available to data science, the balance between automation and quality needs to be maintained. Applying DevOps practices to machine learning workloads brings models to the market faster and maintains the quality and integrity of those models. Sireesha Muppala, Shelbee Eigenbrode, and Randall DeFauw explore applying DevOps practices to ML workloads. Read more.
3:45pm - 4:25pm Thursday, September 26, 2019
Location: 1A 08/10
Chad Scherrer (Metis)
Chad Scherrer explores the basic ideas in Soss, a new probabilistic programming library for Julia. Soss allows a high-level representation of the kinds of models often written in PyMC3 or Stan, and offers a way to programmatically specify and apply model transformations like approximations or reparameterizations. Read more.
3:45pm - 4:25pm Thursday, September 26, 2019
Location: 1E 09
Owen O'Malley (Cloudera)
Fine-grained data protection at a column level in data lake environments has become a mandatory requirement to demonstrate compliance with multiple local and international regulations across many industries today. Owen O'Malley dives into how column encryption in ORC files enables both fine-grained protection and audits of who accessed the private data. Read more.