Tuesday, March 6: Tutorials (Gold & Silver passes)
Wednesday, March 7: Keynotes & Sessions (Platinum, Gold, Silver & Bronze passes)
8:45am Strata Data Conference Keynotes (Location: San Jose Ballroom, salon 1&2)
10:30am Morning break
Thursday, March 8: Keynotes & Sessions (Platinum, Gold, Silver & Bronze passes)
8:45am Strata Data Conference Keynotes (Location: San Jose Ballroom, salon 1&2)
10:30am Morning break
9:00am–5:00pm Monday, March 5 & Tuesday, March 6
To handle real-time big data, you need to solve two difficult problems: how do you ingest that much data and how will you process that much data? Jesse Anderson explores the latest real-time frameworks (both open source and managed cloud services), discusses the leading cloud providers, and explains how to choose the right one for your company.
9:00am–12:30pm Tuesday, March 6, 2018
New regulations are driving compliance, governance, and security challenges for big data, and infosec and security groups must ensure a consistently secured and governed environment across multiple workloads that span a variety of deployments. Mark Donsky, Andre Araujo, Syed Rafice, and Mubashir Kazia walk you through securing a Hadoop cluster, with special attention to GDPR.
9:00am–12:30pm Tuesday, March 6, 2018
Want to learn how to use Amazon's big data web services to launch your first big data application in the cloud? Jorge Lopez walks you through building a big data application using a combination of open source technologies and AWS managed services.
9:00am–12:30pm Tuesday, March 6, 2018
Secondary topics:
Graphs and Time-series
Across diverse segments in industry, there has been a shift in focus from big data to fast data. Karthik Ramasamy, Sanjeev Kulkarni, Arun Kejariwal, and Sijie Guo walk you through state-of-the-art streaming architectures, streaming frameworks, and streaming algorithms, covering the typical challenges in modern real-time big data platforms and offering insights on how to address them.
9:00am–12:30pm Tuesday, March 6, 2018
Tim Berglund leads a basic architectural introduction to Kafka and walks you through using Kafka Streams and KSQL to process streaming data.
9:00am–12:30pm Tuesday, March 6, 2018
Aishwarya Venkataraman, Jason Wang, Mala Ramakrishnan, Stefan Salandy, and Vinithra Varadharajan lead a deep dive into running data analytic workloads in a managed service capacity in the public cloud and highlight cloud infrastructure best practices.
9:00am–5:00pm Tuesday, March 6, 2018
Madhav Madaboosi (BP),
Meenakshisundaram Thandavarayan (Infosys),
Matt Conners (Microsoft),
Katie Malone (Civis Analytics),
Mike Prorock (mesur.io),
Thomas Miller (Northwestern University),
Ann Nguyen (Whole Whale),
Jennie Shin (Kaiser Permanente),
Valentin Bercovici (PencilDATA),
Wayde Fleener (General Mills),
Joe Dumoulin (Next IT),
Jules Malin (GoPro),
Taylor Martin (O'Reilly Media),
Divya Ramachandran (Captricity)
Hear practical insights from household brands and global companies: the challenges they tackled, approaches they took, and the benefits—and drawbacks—of their solutions.
1:30pm–5:00pm Tuesday, March 6, 2018
Apache Impala (incubating) is an exceptional, best-of-breed massively parallel processing SQL query engine that is a fundamental component of the big data software stack. Juan Yu demystifies the cost model the Impala planner uses and how Impala optimizes queries, then explains how to identify performance bottlenecks through query plans and profiles and how to drive Impala to its full potential.
1:30pm–5:00pm Tuesday, March 6, 2018
TensorFlow and Keras are popular libraries for machine learning because of their support for deep learning and GPU deployment. Join Ron Bodkin and Brian Foo to learn how to execute these libraries in production with vision and recommendation models and how to export, package, deploy, optimize, serve, monitor, and test models using Docker and TensorFlow Serving in Kubernetes.
1:30pm–5:00pm Tuesday, March 6, 2018
Secondary topics:
Graphs and Time-series
If you have data that has a time factor to it, then you need to think in terms of time series datasets. Ted Malaska explores time series in all of its forms, from tumbling windows to sessionization in batch or in streaming. You'll gain exposure to the tools and background you need to be successful in the world of time-oriented data.
1:30pm–5:00pm Tuesday, March 6, 2018
Join Dean Wampler and Boris Lublinsky to learn how to build two microservice streaming applications based on Kafka using Akka Streams and Kafka Streams for data processing. You'll explore the strengths and weaknesses of each tool for particular design needs and contrast them with Spark Streaming and Flink, so you'll know when to choose them instead.
11:00am–11:40am Wednesday, March 7, 2018
The monolithic cloud is dying. Delivering capabilities across multiple clouds and, simultaneously, transitioning to next-generation platforms and applications is the challenge today. Tom Fisher explores technological approaches and solutions that make this possible while delivering data-driven applications and operations.
11:00am–11:40am Wednesday, March 7, 2018
Having trouble coordinating development of your production ML system between a team of developers? Microservices drifting and causing problems during debugging? Kinnary Jangla explains how Pinterest dockerized the services powering its home feed and how it impacted the engineering productivity of the company's ML teams while increasing uptime and ease of deployment.
11:00am–11:40am Wednesday, March 7, 2018
Secondary topics:
Graphs and Time-series
Getting the best performance, predictability, and reliability for Kafka-based applications is a complex art. Shivnath Babu and Dhruv Goel explain how to simplify the process by leveraging recent advances in machine learning and AI and outline a methodology for applying statistical learning to the rich and diverse monitoring data that is available from Kafka.
11:00am–11:40am Wednesday, March 7, 2018
Hadoop 3.0 has been years in the making, and now it's finally arriving. Andrew Wang and Daniel Templeton offer an overview of new features, including HDFS erasure coding, YARN Timeline Service v2, YARN federation, and much more, and discuss current release management status and community testing efforts dedicated to making Hadoop 3.0 the best Hadoop major release yet.
11:00am–11:40am Wednesday, March 7, 2018
Secondary topics:
Data Integration and Data Pipelines
Gwen Shapira shares design and architecture patterns that are used to modernize data engineering and details how Apache Kafka, microservices, and event streams are used by modern engineering organizations to efficiently build data pipelines that are scalable, reliable, and built to evolve.
11:50am–12:30pm Wednesday, March 7, 2018
Bin Fan and Shaoshan Liu explain how PerceptIn designed and implemented a cloud architecture to support video streaming and online object recognition tasks and demonstrate how Alluxio delivers high throughput, low latency, and a unified namespace to support these emerging cloud architectures.
11:50am–12:30pm Wednesday, March 7, 2018
Acme Corporation is a global leader in commerce marketing. Manu Mukerji walks you through Acme Corporation's machine learning example for universal catalogs, explaining how the training and test sets are generated and annotated; how the model is pushed to production, automatically evaluated, and used; issues that arise when applying ML at scale in production; lessons learned; and more.
11:50am–12:30pm Wednesday, March 7, 2018
Secondary topics:
Graphs and Time-series
William Chambers and Michael Armbrust discuss the motivation and basics of Apache Spark's Structured Streaming processing engine and share lessons they've learned running hundreds of Structured Streaming workloads in the cloud.
11:50am–12:30pm Wednesday, March 7, 2018
Spark applications need to be well tuned so that individual applications run quickly and reliably and cluster resources are efficiently utilized. Edwina Lu, Ye Zhou, and Min Shen outline a fast, reliable, and automated process used at LinkedIn for tuning Spark applications, enabling users to quickly identify and fix problems.
11:50am–12:30pm Wednesday, March 7, 2018
A warming planet needs precise, localized predictions about the effects of climate change to support good long-term and medium-term economic decision making. Ari Gesher demonstrates how to use a mix of physical simulation, enhanced scientific models, machine learning verification, and high-scale computing to predict and package climate predictions as data products.
11:50am–12:30pm Wednesday, March 7, 2018
Secondary topics:
Data Integration and Data Pipelines
Apache Beam equips users with a novel programming model in which the classic batch/streaming data processing dichotomy is erased. Eugene Kirpichov details the modularity and composability advantages created by treating data ingestion as just another data processing task and walks you through building highly modular data ingestion APIs using the new Beam programming model primitive Splittable DoFn.
1:50pm–2:30pm Wednesday, March 7, 2018
Kurt Brown explains how to get the most out of your data infrastructure with 20 principles and practices used at Netflix. Kurt covers each in detail and explores how they relate to the technologies used at Netflix, including S3, Spark, Presto, Druid, R, Python, and Jupyter.
1:50pm–2:30pm Wednesday, March 7, 2018
Zhen Fan and Wei Ting Chen explain how JD.com uses Spark on Kubernetes in a production environment and why the company chose Spark on Kubernetes for its AI workloads. You'll learn how to run Spark with Kubernetes and gain an understanding of the advantages this provides.
1:50pm–2:30pm Wednesday, March 7, 2018
Debasish Ghosh explores the role that approximation data structures play in processing streaming data. Typically, streams are unbounded in space and time, and processing has to be done online using sublinear space. Debasish covers the probabilistic bounds that these data structures offer and shows how they can be used to implement solutions for fast and streaming architectures.
1:50pm–2:30pm Wednesday, March 7, 2018
Query processing technology has rapidly evolved since the iconic C-Store paper was published in 2005, with a focus on designing query processing algorithms and data structures that efficiently utilize CPU and leverage the changing trends in hardware to deliver optimal performance. Siddharth Teotia outlines the different types of vectorized query processing in Dremio using Apache Arrow.
1:50pm–2:30pm Wednesday, March 7, 2018
Secondary topics:
Data Integration and Data Pipelines
When Kafka stream processing pipelines fail, they can leave users panicked about data loss when restarting their application. Jordan Hambleton and Guru Medasani explain how offset management provides users the ability to restore the state of the stream throughout its lifecycle, deal with unexpected failure, and improve accuracy of results.
2:40pm–3:20pm Wednesday, March 7, 2018
Mark Grover and Arup Malakar offer an overview of how Lyft leverages application metrics, logs, and auditing to monitor and troubleshoot its data platform and share how the company dogfoods the platform to provide security, auditing, alerting, and replayability. They also detail some of the internal services and tools Lyft has developed to make its data more robust, scalable, and self-serving.
2:40pm–3:20pm Wednesday, March 7, 2018
DataOps—a culture and practice for building data-intensive applications, including machine learning pipelines—expands DevOps philosophy to include data-heavy roles such as data engineering and data science. DataOps is based on cross-functional collaboration resulting in fast time to value and an agile workflow. Ellen Friedman offers an overview of DataOps and explains how to implement it.
2:40pm–3:20pm Wednesday, March 7, 2018
With the rise of large-scale real-time computation, there is a growing need to link legacy MySQL systems with real-time platforms. Henry Cai and Yi Yin offer an overview of WaterMill, Pinterest's continuous DB ingestion system for streaming SQL data into near-real-time computation pipelines to support dynamic personalized recommendations and search indices.
2:40pm–3:20pm Wednesday, March 7, 2018
Deep learning has shown tremendous improvements in a number of areas and has justifiably generated enormous excitement. However, several key challenges—from prohibitive hardware requirements to immature software offerings—are impeding widespread enterprise adoption. Evan Sparks details fundamental challenges facing organizations looking to adopt deep learning and shares possible solutions.
2:40pm–3:20pm Wednesday, March 7, 2018
Presto has emerged as the de facto query engine to quickly process petabytes of data. However, rogue SQL queries can waste a significant amount of critical compute resource and reduce Presto's throughput. Ritesh Agrawal and Anirban Deb explain how Uber uses machine learning to identify and stop rogue queries, saving both computational power and money.
2:40pm–3:20pm Wednesday, March 7, 2018
Secondary topics:
Data Integration and Data Pipelines
Organizations leverage reporting, analytic, and machine learning pipelines to drive decision making and power critical operational systems. Sean Ma discusses methods for detecting, visualizing, and resolving inconsistencies between source and target data models across these pipelines.
2:40pm–3:20pm Wednesday, March 7, 2018
Secondary topics:
Expo Hall,
Graphs and Time-series
Graph databases are the fastest growing category in data management. However, most graph queries only traverse two hops in big graphs due to limitations in most graph databases. Real-world applications require deep link analytics that traverse far more than three hops. Yu Xu offers an overview of a fraud detection system that manages 100 billion graph elements to detect risk and fraudulent groups.
4:20pm–5:00pm Wednesday, March 7, 2018
Carlo Torniai shares the architectural challenges Pirelli faced in building Pirelli Connesso, an IoT cloud-based system providing information on tire operating conditions, consumption, and maintenance, and highlights the operative approaches that enabled the integration of contributions across cross-functional teams.
4:20pm–5:00pm Wednesday, March 7, 2018
Personalization is a powerful tool for building sticky and impactful product experiences. Rahim Daya shares Pinterest's frameworks for building personalized user experiences, from sourcing the right contextual data to designing and evaluating personalization algorithms that can delight the user.
4:20pm–5:00pm Wednesday, March 7, 2018
Secondary topics:
Graphs and Time-series
Apache BookKeeper, a scalable, fault-tolerant, and low-latency storage service optimized for real-time workloads, has been widely adopted by enterprises like Twitter, Yahoo, and Salesforce to store and serve mission-critical data. Sijie Guo explains how Apache BookKeeper satisfies the needs of stream storage.
4:20pm–5:00pm Wednesday, March 7, 2018
Gian Merlino discusses the SQL layer recently added to the open source Druid project. It's based on Apache Calcite, which bills itself as "the foundation for your next high-performance database." Gian explains how Druid and Calcite are integrated and why you should stop worrying and learn to love relational algebra in your own projects.
4:20pm–5:00pm Wednesday, March 7, 2018
Secondary topics:
Data Integration and Data Pipelines
Dorna Bandari offers an overview of the machine learning pipeline at B2B AI startup Jetlore and explains why even small B2B startups in AI should invest in a flexible machine learning pipeline. Dorna covers the design choices, the trade-offs made when implementing and maintaining the pipeline, and how it has accelerated Jetlore's product development and growth.
4:20pm–5:00pm Wednesday, March 7, 2018
Secondary topics:
Expo Hall
One of the key application domains leveraging live data is smart cities, but success depends on the availability of generic platforms that support high throughput and ultralow latency. Arun Kejariwal and Francois Orsini offer an overview of Satori's live data platform and walk you through a country-scale case study of its implementation.
5:10pm–5:50pm Wednesday, March 7, 2018
Recent headline-grabbing data breaches demonstrate that protecting data is essential for every enterprise. The best-of-breed approach for big data is HDFS configured with Transparent Data Encryption (TDE). However, TDE can be difficult to configure and manage—issues that are only compounded when running on Docker containers. Thomas Phelan explores these challenges and how to overcome them.
5:10pm–5:50pm Wednesday, March 7, 2018
Ted Dunning offers an overview of the rendezvous architecture, developed to be the "continuous integration" system for machine learning models. It allows always-hot zero latency rollout and rollback of new models and supports extensive metrics and diagnostics so models can be compared as they process production data. It can even hot-swap the framework itself with no downtime.
5:10pm–5:50pm Wednesday, March 7, 2018
Fabian Hueske and Shuyi Chen explore SQL's role in the world of streaming data and its implementation in Apache Flink and cover fundamental concepts, such as streaming semantics, event time, and incremental results. They also share their experience using Flink SQL in production at Uber, explaining how Uber leverages Flink SQL to solve its unique business challenges.
5:10pm–5:50pm Wednesday, March 7, 2018
Secondary topics:
Data Integration and Data Pipelines
Data science and engineering have been missing out on one of the biggest productivity boosters in modern software development: automated testing. Abe Gong and James Campbell discuss the concept of pipeline tests and offer an overview of Great Expectations, an open source Python framework for bringing data pipelines and products under test.
5:10pm–5:50pm Wednesday, March 7, 2018
Ash Munshi shares techniques for labeling big data apps using runtime measurements of CPU, memory, I/O, and network and details a deep neural network to help operators understand the types of apps running on the cluster and better predict runtimes, tune resource utilization, and increase efficiency. These methods are new and are the first approach to classify multivariate time series.
5:10pm–5:50pm Wednesday, March 7, 2018
Secondary topics:
Data Integration and Data Pipelines
Gwen Shapira shares design and architecture patterns that are used to modernize data engineering and details how Apache Kafka, microservices, and event streams are used by modern engineering organizations to efficiently build data pipelines that are scalable, reliable, and built to evolve.
11:00am–11:40am Thursday, March 8, 2018
Francesca Lazzeri and Fidan Boylu Uz explain how to operationalize LSTM networks to predict the remaining useful life of aircraft engines. They use simulated aircraft sensor values to predict when an aircraft engine will fail in the future so that maintenance can be planned in advance.
11:00am–11:40am Thursday, March 8, 2018
For many organizations, the cloud will likely be the destination of their next big data warehouse. Greg Rahn shares considerations when evaluating the cloud for analytics and big data warehousing in order to help you get the most from the cloud. You'll leave with an understanding of different architectural approaches and impacts for moving analytic workloads to the cloud.
11:00am–11:40am Thursday, March 8, 2018
What does it mean to execute streaming queries in SQL? What is the relationship of streaming queries to classic relational queries? Are streams and tables the same thing? And how does all of this relate to the programmatic frameworks we’re all familiar with? Tyler Akidau answers these questions and more as he walks you through key concepts underpinning data processing in general.
11:00am–11:40am Thursday, March 8, 2018
LinkedIn runs more than 1,800 Kafka brokers that deliver more than two trillion messages a day. Running Kafka at such a scale makes automated operations a necessity. Jiangjie Qin shares lessons learned from operating Kafka at scale with minimum human intervention.
11:00am–11:40am Thursday, March 8, 2018
Secondary topics:
Expo Hall
Dean Wampler compares and contrasts data processing with Akka Streams and Kafka Streams, microservice streaming applications based on Kafka. Dean discusses the strengths and weaknesses of each tool for particular design needs and contrasts them with Spark Streaming and Flink, so you'll know when to choose them instead.
11:50am–12:30pm Thursday, March 8, 2018
Deep learning model performance relies on underlying data. Dong Meng offers an overview of a converged data platform that serves as a data infrastructure, providing a distributed filesystem, key-value storage and streams, and Kubernetes as an orchestration layer to manage the containers used to train and deploy deep learning models on GPU clusters.
11:50am–12:30pm Thursday, March 8, 2018
Hive is the main data transformation tool at Criteo, and hundreds of analysts and thousands of automated jobs run Hive queries every day. Szehon Ho and Pawel Szostek discuss the evolution of Criteo's Hive platform from an error-prone add-on installed on some spare machines to a best-in-class installation capable of self-healing and automatically scaling to handle its growing load.
11:50am–12:30pm Thursday, March 8, 2018
Stream processing systems must support a number of different types of processing semantics due to the diverse nature of streaming applications. Karthik Ramasamy and Sanjeev Kulkarni explore effectively once, exactly once, and other types of stateful processing techniques, explain how they are implemented in Heron, and demonstrate how your applications will benefit from using them.
11:50am–12:30pm Thursday, March 8, 2018
Secondary topics:
Graphs and Time-series
In the customer age, being able to extract relevant communications information in real time and cross-reference it with context is key. Alexis Roos and Noah Burbank explain how Salesforce uses data science and engineering to enable salespeople to monitor their emails in real time and surface insights and recommendations using a graph that models contextual data.
11:50am–12:30pm Thursday, March 8, 2018
PostgreSQL is often regarded as the world’s most advanced open source database—and it’s on fire. Umur Cubukcu moves beyond the typical list of features in the next release to explore why so many new projects “just use Postgres” as their system of record (or system of engagement) at scale. Along the way, you'll learn how PostgreSQL's extension APIs are fueling innovations in relational databases.
11:50am–12:30pm Thursday, March 8, 2018
Kurt Brown explains how to get the most out of your data infrastructure with 20 principles and practices used at Netflix. Kurt covers each in detail and explores how they relate to the technologies used at Netflix, including S3, Spark, Presto, Druid, R, Python, and Jupyter.
1:50pm–2:30pm Thursday, March 8, 2018
Most machine learning algorithms are designed to work on stationary data, but real-life streaming data is rarely stationary. Models lose prediction accuracy over time if they are not retrained. Without model quality monitoring, retraining decisions are suboptimal and costly. Emre Velipasaoglu evaluates monitoring methods for applicability in modern fast data and streaming applications.
1:50pm–2:30pm Thursday, March 8, 2018
The number of resources explaining how to build a machine learning model from data greatly overshadows information on how to make real data products from such models, creating a gap between what machine learning engineers and data scientists know is possible and what users experience. Using examples from Textio's augmented writing platform, Chris Harland walks you through building a data product.
1:50pm–2:30pm Thursday, March 8, 2018
Secondary topics:
Graphs and Time-series
Michael Freedman offers an overview of TimescaleDB, a new open source scale-out database designed for time series workloads and engineered as a plugin to Postgres. Unlike most time series newcomers, TimescaleDB supports full SQL while achieving fast ingest and complex queries.
1:50pm–2:30pm Thursday, March 8, 2018
Holden Karau and Rachel Warren explore the state of the current big data ecosystem and explain how to best work with it in non-JVM languages. While much of the focus will be on Python + Spark, the talk will also include interesting anecdotes about how these lessons apply to other systems (including Kafka).
1:50pm–2:30pm Thursday, March 8, 2018
Secondary topics:
Expo Hall,
Graphs and Time-series
Many domains, such as mobile, web, the IoT, ecommerce, and more, have turned to analyzing streaming data. However, this presents challenges both in transforming the raw data into metrics and in automatically analyzing the metrics to produce insights. Roy Ben-Alta and Ira Cohen share a solution implemented using Amazon Kinesis as the real-time pipeline feeding Anodot's anomaly detection solution.
1:50pm–2:30pm Thursday, March 8, 2018
The BMW Group IT team drives the usage of data-driven technologies and forms the nucleus of a data-centric culture inside the organization. Josef Viehhauser and Tobias Bürger discuss the end-to-end relationship of data and models and share best practices for scaling applications in real-world environments.
2:40pm–3:20pm Thursday, March 8, 2018
Michelle Casbon explains how to speed up the development of ML models by using open source tools such as Kubernetes, Docker, Scala, Apache Spark, and Weave Flux, detailing how to build resilient systems so that you can spend more of your time on product improvement rather than triage and uptime.
2:40pm–3:20pm Thursday, March 8, 2018
Ajay Mothukuri and Vijay Srinivas Agneeswaran explain how to use open source blockchain technologies such as Hyperledger to implement the European Union's General Data Protection Regulation (GDPR).
2:40pm–3:20pm Thursday, March 8, 2018
Secondary topics:
Graphs and Time-series
Flavio Junqueira and Fabian Hueske detail an open source streaming data stack consisting of Pravega (stream storage) and Apache Flink (computation on streams). The stack offers an unprecedented way of handling “everything as a stream”: it includes unbounded streaming storage and a unified batch and streaming abstraction, and it dynamically accommodates workload variations.
2:40pm–3:20pm Thursday, March 8, 2018
Most organizations manage 5 to 15 copies of their data in multiple systems and formats to support different analytical use cases. Tomer Shiran and Jacques Nadeau introduce a new approach called data reflections, which dramatically reduces the need for data copies, demonstrate an open source implementation built with Apache Calcite, and explore two production case studies.
2:40pm–3:20pm Thursday, March 8, 2018
Secondary topics:
Expo Hall
Chris Fregly demonstrates how to extend existing Spark-based data pipelines to include TensorFlow model training and deployment and offers an overview of TensorFlow’s TFRecord format, including libraries for converting to and from other popular file formats such as Parquet, CSV, JSON, and Avro stored in HDFS and S3.
4:20pm–5:00pm Thursday, March 8, 2018
There is growing interest in running Spark natively on Kubernetes, and Spark data is often stored in HDFS. Kimoon Kim and Ilan Filonenko explain how to make Spark on Kubernetes work seamlessly with HDFS by addressing challenges such as HDFS data locality and secure HDFS support.
4:20pm–5:00pm Thursday, March 8, 2018
Lyft’s business has grown over 100x in the past four years. Shenghu Yang explains how Lyft’s data pipeline has evolved over the years to serve its ever-growing analytics use cases, migrating from the world's largest AWS Redshift clusters to Apache Hive and Presto to overcome hard limits on scalability and concurrency.
4:20pm–5:00pm Thursday, March 8, 2018
Traditionally, messaging systems have offered at-least-once delivery semantics, leaving the task of implementing idempotent processing to the application developers. Matteo Merli explains how to add effectively once semantics to Apache Pulsar using a message deduplication layer that can ensure those stricter semantics with guaranteed accuracy and no performance penalty.
4:20pm–5:00pm Thursday, March 8, 2018
Tomer Kaftan offers an overview of Cuttlefish, a lightweight framework prototyped in Apache Spark that helps developers adaptively improve the performance of their data processing applications by inserting a few library calls into their code. These calls construct tuning primitives that use reinforcement learning to adaptively modify execution as they observe application performance over time.
4:20pm–5:00pm Thursday, March 8, 2018
GoDaddy ingests and analyzes over 100,000 data points per second. Felix Gorodishter discusses the company's big data journey from ingest to automation, how it is evolving its systems to scale to over 10 TB of new data per day, and how it uses tools like anomaly detection to produce valuable insights, such as the worth of a reminder email.