Presented By O'Reilly and Cloudera
December 5–6, 2016: Training
December 6–8, 2016: Tutorials & Conference

Speaker slides

Presentation slides will be made available after the session has concluded and the speaker has given us the files. Check back if you don't see the file you're looking for—it might be available later! (However, please note some speakers choose not to share their presentations.)

Mathieu Dumoulin offers an overview of stream processing and explains how to simplify a seemingly complex real-time enterprise streaming architecture using an open source business rules engine and Apache Kafka API streaming. Mathieu then illustrates this architecture with a demo based on a successful production use case for Busan, South Korea's Smart City initiative.
Ted Dunning explains how a stream-first approach simplifies and speeds development of applications, resulting in real-time applications that have significant impact. Along the way, Ted contrasts a stream-first approach with existing approaches that start with an application that dictates specialized data structures, ETL activities, data silos, and processing delays.
One challenge in manufacturing sensor data analysis is formulating an efficient model of the underlying physical system. Rajesh Sampathkumar shares his experience working with sensor data at scale to model a real-world manufacturing subsystem, applying simple techniques, such as moving-average analysis, and advanced ones, like vector autoregression (VAR), to the problem of predictive maintenance.
Alluxio is an open source memory-speed virtual distributed storage system. In the past year, the Alluxio open source community has grown to more than 300 developers. The project also experienced a tremendous improvement in performance and scalability and was extended with new features. Haoyuan Li offers an overview of Alluxio, covering its use cases, its community, and the value it brings.
Apache Beam (incubating) defines a new data processing programming model evolved from more than a decade of experience building big data infrastructure within Google. Beam pipelines are portable across open source and private cloud runtimes. Dan Halperin covers the basics of Apache Beam—its evolution, the main concepts in the programming model, and how it compares to similar systems.
With enterprise adoption of Apache Spark come enterprise security requirements and the need to meet enterprise security standards. Vinay Shukla walks you through enterprise security requirements, provides a deep dive into Spark security features, and shows how Spark meets these enterprise security requirements.
Hybrid cloud architectures marry the flexibility to scale workloads on-demand in the public cloud with the ability to control mission-critical applications on-premises. Publish-subscribe message streams offer a natural paradigm for hybrid cloud use cases. Mathieu Dumoulin describes how to architect a real-time, global IoT analytics hybrid cloud application with a Kafka-based message stream system.
The future of big data is AI, and the future is here. With machine learning and deep learning, with robotics and heuristics, with fuzzy logic and AI, the convergence is changing myriad industries like healthcare, banking, insurance, and gaming. Raju Chellam explains why it’s time to step back and consider how big data combined with HPC and AI can make a key difference in how you manage your business.
Alex Gutow and Henry Robinson explain how Apache Hadoop and Apache Impala (incubating) take advantage of the benefits of the cloud to provide the same great functionality, partner ecosystem, and flexibility of on-premises deployments combined with the flexibility and cost efficiency of the cloud.
The opportunity to harness data to impact business is ripe, and as a result, every industry, every organization, and every department is going through a huge change, whether they realize it or not. John Kreisa shares use cases from across Asia and Europe of businesses that are successfully leveraging new platform technologies to transform their organizations using data.
O'Reilly recently launched Oriole, a new learning medium for online tutorials that combines Jupyter notebooks, video timelines, and Docker containers run on a Mesos cluster, based on the pedagogical theory of computable content. Paco Nathan explores the system architecture, shares project experiences, and considers the impact of notebooks for sharing and learning across a data-centric organization.
Ofer Ron examines the development of LivePerson's traffic targeting solution from a generic to a domain-specific implementation to demonstrate that a thorough understanding of the problem domain is essential to a good machine-learning-based product. Ofer then reviews the underlying architecture that makes this possible.
Making recommendations for the food and beverage industry is tricky, as such recommendations must take the user's context (location, time, day, etc.) into account in addition to the constraints of a regular recommendation algorithm. Arun Veettil explains how to incorporate user contextual information into recommendation algorithms and apply reinforcement learning to track continuously changing user behavior.
Shopback, a company that gives customers cash back on successful transactions across various lifestyle categories, crawls 25 million products from multiple ecommerce websites to provide a smooth customer experience. Qiaoliang Xiang walks you through how to crawl and update products, how to scale the process using big data tools, and how to design a modularized system.
Join Lean Analytics author, Harvard lecturer, and Strata chair Alistair Croll for a look at how to think critically about data, based on his Harvard Business School course.
Raymond Chan dives into the trials and tribulations of DataKind SG, a data science consulting social good organization operating in the digitally underserved but rapidly developing frontier of Southeast Asia.
In 2007, a computer game company decided to jump ahead of competitors by capturing and using data created during online gaming, but it wasn't prepared to deal with the data management and process challenges stemming from distributed devices creating data. Mark Madsen shares a case study that explores the oversights, failures, and lessons the company learned along its journey.
Deep learning has made a huge impact on predictive analytics and is here to stay, so you'd better get up to speed with the neural net craze. Mateusz Dymczyk explains why all the top companies are using deep learning, what it's all about, and how you can start experimenting and implementing deep learning solutions in your business in only a few easy steps.
Ever wondered how Google Translate works so well, how the autocaptioning works on YouTube, or how to mine the sentiments of tweets on Twitter? What’s the underlying theme? They all use deep learning. Bargava Subramanian and Amit Kapoor explore artificial neural networks and deep learning for natural language processing to get you started.
Everyone is talking about data lakes. The intended use of a data lake is as a central storage facility for performing analytics. But, Jim Scott asks, why have a separate data lake when your entire (or most of your) infrastructure can run directly on top of your storage, minimizing or eliminating the need for data movement, separate processes and clusters, and ETL?
Application developers have long created complex schemas to store data with many minor relationships in an RDBMS. This session shows how to convert an existing music database with a complicated schema to HBase for transactional workloads and how to use Drill against HBase for real-time queries. HBase column families will also be discussed.
Verdi March demystifies deep learning and shares his experience on how to gradually transition to deep learning. Using a specific example in computer vision, Verdi touches upon key differences in engineering traditional software versus deep learning-based software.
Huge amounts of data are generated every minute by nearly every company, and much of it goes unused. Historically, this so-called data exhaust has been collected only for manual analysis in the case of a fault or failure. Cameron Turner explains why companies are increasingly looking to their data exhaust as a valuable asset that can influence their revenue and profit through machine learning.
Santander was one of the last big banks in the UK to start using Hadoop and other big data technologies. However, the maturity of the technology made it possible to create a customer-facing data product in production in less than a year and a fully adopted production analytics platform in less than two. Antonio Alvarez shares what other late entrants can learn from this experience.
If your organization has Hadoop clusters in research or as point solutions and you're wondering where you go from there, this session is for you. Phillip Radley explains how to run Hadoop as a service, providing an enterprise-wide data platform hosting hundreds of projects securely and predictably.
Real-time data analysis is becoming more and more important to Internet companies’ daily business. Qunar has been running Alluxio in production for over a year. Lei Xu explores how stream processing on Alluxio has led to a 16x performance improvement on average and 300x improvement at service peak time on workloads at Qunar.
As the number of products on Lazada grows exponentially, helping customers find relevant, quality products is key to customer experience. Eugene Yan shares how Lazada ranks products on its website, covering how Lazada scales data pipelines to collect user-behavioral data, cleans and prepares data, creates simple features, builds models to meet key objectives, and measures outcomes.
Marketing has become ever more data driven. While there are thousands of marketing applications available, it is challenging to get an end-to-end line of sight and fully understand customers. Franz Aman explains how bringing the data from the various applications and data sources together in a data lake changes everything.
Can you imagine intelligent software that assists in your decision making and drives actions? Flavio Clesio and Eiti Kimura offer a practical demonstration of using machine learning to create an intelligent monitoring application based on distributed system data analysis using Apache Spark MLlib.
Creating better models is a critical component of building a good data science product. It is relatively easy to build a first-cut machine-learning model, but what does it take to build a reasonably good or state-of-the-art model? One answer is ensemble models, which exploit the power of computing to search the solution space. Bargava Subramanian discusses various strategies for building ensemble models.
Modern telecommunications systems are alphabet soups that produce massive amounts of diagnostic data. Ted Dunning offers an overview of a real-time, low-fidelity simulation of the edge protocols of such a system to help illustrate how modern big data tools can be used for telecom analytics. Ted demos the system and shows how several tools can produce useful analytical results and system understanding.
Maps are vitally important in disaster response. Yantisa Akhadi explores how to use OpenStreetMap (OSM), the biggest crowdsourced mapping platform, for safer urban environments, drawing on case studies from several major cities in Indonesia where citizen and government mapping has played a major role in improving resilience.
Building a data lake involves more than installing and using Hadoop. The goal in most organizations is to build multiuse data infrastructure that is not subject to past constraints. Mark Madsen discusses hidden design assumptions, reviews design principles to apply when building multiuse data infrastructure, and provides a reference architecture.
Aljoscha Krettek offers a very short introduction to stream processing before diving into writing code and demonstrating the features in Apache Flink that make truly robust stream processing possible. All of this will be done in the context of a real-time analytics application that we'll be modifying on the fly based on the topics we're working through.
The success of Apache Spark is bringing developers to Scala. For big data, the JVM uses memory inefficiently, causing significant GC challenges. Spark's Project Tungsten fixes these problems with custom data layouts and code generation. Dean Wampler gives an overview of Spark, explaining ongoing improvements and what we should do to improve Scala and the JVM for big data.
YARN includes security features such as SSL encryption, Kerberos-based authentication, and HDFS encryption. Nitin Khandelwal and Abhishek Modi share the challenges they faced in enabling these features for ephemeral clusters running in the cloud with multitenancy support as well as performance numbers for different encryption algorithms available.
Jorge Pablo Fernandez and Nicolette Bullivant explore Santander Bank's Spendlytics app, which helps customers track their spending by offering a listing of transactions, transaction aggregations, and real-time enrichment based on the categorization of transactions depending on market and brands. Along the way, they share the challenges encountered and lessons learned while implementing the app.
Most consumer-facing personalization today is rudimentary and coarsely targeted at best, and designers don’t give users cues for how they are meant to interact with and interpret personalized experiences and interfaces. Sara Watson makes the case for personalization signals that give context to personalization and expose levers of control to users.
When data is transformed into visualizations, the impact can sometimes be lost on the user. Drawing on her work with Doctors Without Borders, Vivian Peng explains how emotions help convey impact and move people to take action and demonstrates how we might design emotions into the data visualization experience.
Watermarks are a system for measuring progress and completeness in out-of-order streaming systems and are utilized to emit correct results in a timely manner. Given the trend toward out-of-order processing in existing streaming systems, watermarks are an increasingly important tool when designing streaming pipelines. Slava Chernyak explains watermarks and explores real-world applications.
Jason Dai and Yiheng Wang share their experience building web-scale machine learning using Apache Spark—focusing specifically on "war stories" (e.g., in-game purchase, fraud detection, and deep learning)—outline best practices for scaling these learning algorithms, and discuss trade-offs in designing learning systems for the Spark framework.
Stop copying and pasting your D3.js visualization code each time you start a new project and start writing intelligent visualization software. Michael Freeman demonstrates how to build modular, reusable charting code by leveraging foundational JavaScript principles (such as closures) and the reusability structure used internally by the D3.js library.