Presented By O'Reilly and Cloudera
Make Data Work
Dec 4–5, 2017: Training
Dec 5–7, 2017: Tutorials & Conference
Singapore

Data Engineering & Architecture


Ben Lorica, Strata Conference Chair



How to build an analytics infrastructure that unlocks the value of your data

The right data infrastructure and architecture streamline your workflows, reduce costs, and scale your data analysis. The wrong architecture costs time and money that may never be recovered.

It’s not easy. Selecting and building the tools and architecture you need is complex. You need to consider scalability, tools and frameworks (open source or proprietary), platforms (on-premise, cloud, or hybrid), integration, adoption and migration, and security. That’s why good data engineers are in such demand.

These sessions will help you navigate the pitfalls, select the right tools and technologies, and design a robust data pipeline.

All Strata Data Conference Gold and Silver pass holders have access to Data Engineering and Architecture sessions Tuesday–Thursday. Platinum and Bronze pass holders have access to Data Engineering and Architecture sessions Wednesday–Thursday.

Monday–Tuesday, December 4–5: Training courses (Platinum & Training passes)
Location: 334
Tuesday, December 5: Tutorials (Gold & Silver passes)
Location: 308/309
Wednesday, December 6: Keynotes & Sessions (Gold, Silver & Bronze passes)
Locations: Summit 1, 308/309, 310/311
8:50am | Location: Hall 404AXF
Strata Data Conference Keynotes
10:45am
Morning break
12:45pm
Lunch
3:15pm
Afternoon break
5:45pm | Location: Sponsor Pavilion
Sponsor Pavilion Reception
Thursday, December 7: Keynotes & Sessions (Gold, Silver & Bronze passes)
Locations: Summit 1, Summit 2, 308/309, 310/311
8:50am | Location: Hall 404AXF
Strata Data Conference Keynotes
10:45am
Morning break
12:45pm
Lunch
3:15pm
Afternoon break
1:30pm–5:00pm Tuesday, December 5, 2017
Location: 308/309 Level: Intermediate
Jonathan Seidman (Cloudera), Ted Malaska (Blizzard Entertainment)
Using Customer 360 and the IoT as examples, Jonathan Seidman and Ted Malaska explain how to architect a modern, real-time big data platform leveraging recent advancements in the open source software world, using components like Kafka, Impala, Kudu, Spark Streaming, and Spark SQL with Hadoop to enable new forms of data processing and analytics.
11:15am–11:55am Wednesday, December 6, 2017
Location: Summit 1 Level: Beginner
In the current Agile business environment, where developers are required to experiment with multiple ideas and react to various situations, cloud-native development is the way to go. Harjinder Mistry and Bargava Subramanian explain how to design and build a microservices-based cloud-native machine learning application.
11:15am–11:55am Wednesday, December 6, 2017
Location: 308/309 Level: Intermediate
Ted Malaska (Blizzard Entertainment)
Ted Malaska shares the top five mistakes that no one talks about when you start writing your streaming app, along with the practices you'll inevitably need to learn along the way.
11:15am–11:55am Wednesday, December 6, 2017
Location: 310/311 Level: Intermediate
Neelesh Srinivas Salian offers an overview of the data platform used by data scientists at Stitch Fix, based on the Spark ecosystem. Neelesh explains the development process and shares some lessons learned along the way.
12:05pm–12:45pm Wednesday, December 6, 2017
Location: Summit 1 Level: Intermediate
Jared Lander (Lander Analytics)
One common (but false) knock against R is that it doesn't scale well. Jared Lander shows how to use R in a performant manner, both in terms of speed and data size, and offers an overview of packages for running R at scale.
1:45pm–2:25pm Wednesday, December 6, 2017
Location: Summit 1 Level: Intermediate
Aki Ariga (Cloudera)
Aki Ariga explains how to put your machine learning model into production, discusses common issues and obstacles you may encounter, and shares best practices and typical architecture patterns for deploying ML models, with example designs from the Hadoop and Spark ecosystem using Cloudera Data Science Workbench.
1:45pm–2:25pm Wednesday, December 6, 2017
Location: 308/309 Level: Beginner
Ofir Sharony (MyHeritage)
What are the most important considerations when shipping billions of daily events for analysis? Ofir Sharony shares MyHeritage's journey to find a reliable and efficient way to achieve real-time analytics. Along the way, Ofir compares several data loading techniques, helping you make better choices when building your next data pipeline.
1:45pm–2:25pm Wednesday, December 6, 2017
Location: 310/311 Level: Intermediate
Vickye Jain (ZS Associates), Raghav Sharma (ZS Associates)
Vickye Jain and Raghav Sharma explain how they built a very high-performance data processing platform powered by Spark that balances the considerations of extreme performance, speed of development, and cost of maintenance.
2:35pm–3:15pm Wednesday, December 6, 2017
Location: Summit 1 Level: Beginner
Wai Yau (Zendesk), Jeffrey Theobald (Zendesk)
Building a successful machine learning model is challenging enough on its own, and just as much effort is needed to turn that model into a customer-facing product. Drawing on their experience working on Zendesk's article recommendation product, Wai Yau and Jeffrey Theobald discuss design challenges and real-world problems you may encounter when building a machine learning product at scale.
2:35pm–3:15pm Wednesday, December 6, 2017
Location: Summit 2 Level: Intermediate
Modern engineering requires machine learning engineers to monitor and implement ETL and machine learning models in production. Natalino Busa shares technologies, techniques, and blueprints for robustly and reliably managing data science and ETL flows from inception to production.
2:35pm–3:15pm Wednesday, December 6, 2017
Location: 308/309 Level: Intermediate
Yousun Jeong (SK Telecom), Ah Young Hwang (SK Telecom)
Data transfer is one of the most pressing problems for telecom companies, as costs increase in tandem with growing data requirements. Yousun Jeong and Ah Young Hwang detail how SKT has dealt with this problem.
2:35pm–3:15pm Wednesday, December 6, 2017
Location: 310/311 Level: Intermediate
Mingxi Wu (TigerGraph), Yu Xu (TigerGraph)
Mingxi Wu and Yu Xu offer an overview of TigerGraph, a high-performance enterprise graph data platform that enables businesses to transform structured, semistructured, and unstructured data in massive enterprise data silos into an intelligent, interconnected data network, allowing them to uncover implicit patterns and critical insights that drive business growth.
4:15pm–4:55pm Wednesday, December 6, 2017
Location: Summit 1 Level: Intermediate
Holden Karau (Google)
Apache Spark’s machine learning (ML) pipelines provide a lot of power, but sometimes the tools you need for your specific problem aren’t available yet. Holden Karau introduces Spark’s ML pipelines and explains how to extend them with your own custom algorithms, allowing you to take advantage of Spark's meta-algorithms and existing ML tools.
4:15pm–4:55pm Wednesday, December 6, 2017
Location: 308/309 Level: Intermediate
Xiaochang Wu (Intel)
Xiaochang Wu explains how to design and implement a real-time processing platform using the Spark Structured Streaming framework to intelligently transform production lines in the manufacturing industry.
4:15pm–4:55pm Wednesday, December 6, 2017
Location: 310/311 Level: Beginner
Feng Cheng (Grab), Yanyu Qu (Grab)
Grab uses Presto to support operational reporting (batch and near real-time), ad hoc analyses, and its data pipeline. Currently, Grab has 5+ clusters with 100+ instances in production on AWS and serves up to 30K queries per day while supporting more than 200 internal data users. Feng Cheng and Yanyu Qu explain how Grab operationalizes Presto in the cloud and share lessons learned along the way.
5:05pm–5:45pm Wednesday, December 6, 2017
Location: Summit 1 Level: Intermediate
Peng Meng (Intel)
Apache Spark ML and MLlib are hugely popular in the big data ecosystem, and Intel has been deeply involved in Spark from a very early stage. Peng Meng outlines the methodology behind Intel's work on Spark ML and MLlib optimization and shares a case study on boosting the performance of Spark MLlib ALS by 60x in JD.com’s production environment.
5:05pm–5:45pm Wednesday, December 6, 2017
Location: 308/309 Level: Intermediate
Andreas Hadimulyono discusses the challenges that Grab is facing with the ever-increasing volume and velocity of its data and shares the company's plans to overcome them.
5:05pm–5:45pm Wednesday, December 6, 2017
Location: 310/311 Level: Intermediate
Henry Robinson (Cloudera), Greg Rahn (Cloudera)
Cloud environments will likely play a key role in your business’s future. Henry Robinson and Greg Rahn explore the workload considerations when evaluating the cloud for analytics and discuss common architectural patterns to optimize price and performance.
11:15am–11:55am Thursday, December 7, 2017
Location: Summit 1 Level: Beginner
Paco Nathan (O'Reilly Media)
Paco Nathan explains how O'Reilly employs AI, from the obvious (chatbots, case studies about other firms) to the less so (using AI to show the structure of content in detail, enhance search and recommendations, and guide editors for gap analysis, assessment, pathing, etc.). Approaches include vector embedding search, summarization, TDA for content gap analysis, and speech-to-text to index video.
11:15am–11:55am Thursday, December 7, 2017
Location: Summit 2 Level: Intermediate
Wee Hyong Tok (Microsoft), Danielle Dean (Microsoft)
Deep neural networks are responsible for many advances in natural language processing, computer vision, speech recognition, and forecasting. Danielle Dean and Wee Hyong Tok illustrate how cloud computing has been leveraged for exploration, programmatic training, real-time scoring, and batch scoring of deep learning models for projects in healthcare, manufacturing, and utilities.
11:15am–11:55am Thursday, December 7, 2017
Location: 308/309 Level: Beginner
Wataru Yukawa (LINE)
Data is a very important asset to LINE, one of the most popular messaging applications in Asia. Wataru Yukawa explains how LINE gets the most out of its data using a Hadoop data lake and an in-house log analysis platform.
11:15am–11:55am Thursday, December 7, 2017
Location: 310/311 Level: Intermediate
Holden Karau (Google), Joey Echeverria (Rocana)
Apache Spark offers greatly improved performance over traditional MapReduce models. Much of Apache Spark’s power comes from lazy evaluation along with intelligent pipelining, which can make debugging more challenging. Holden Karau and Joey Echeverria explore how to debug Apache Spark applications, the different options for logging in Spark, and more.
12:05pm–12:45pm Thursday, December 7, 2017
Location: Summit 1 Level: Intermediate
Graham Gear (Cloudera)
How can we drive more data pipelines, advanced analytics, and machine learning models into production? How can we do this both faster and more reliably? Graham Gear draws on real-world processes and systems to explain how it's possible to apply continuous delivery techniques to advanced analytics, realizing business value earlier and more safely.
12:05pm–12:45pm Thursday, December 7, 2017
Location: Summit 2 Level: Beginner
Xianyan Jia (Intel), Zhenhua Wang (JD.com)
Xianyan Jia and Zhenhua Wang explore deep learning applications built successfully with BigDL. They also teach you how to develop fast prototypes with BigDL's off-the-shelf deep learning toolkit and build end-to-end deep learning applications with flexibility and scalability using BigDL on Spark.
12:05pm–12:45pm Thursday, December 7, 2017
Location: 308/309 Level: Intermediate
Tzu-Li (Gordon) Tai (data Artisans)
Apache Flink is evolving from a framework for streaming data analytics into a platform that offers a foundation for event-driven applications, replacing the data management aspects that are typically handled by a database in more conventional architectures. Tzu-Li (Gordon) Tai explores the key features powering Flink's evolution and demonstrates them in action.
12:05pm–12:45pm Thursday, December 7, 2017
Location: 310/311 Level: Intermediate
Carson Wang (Intel), Yucai Yu (Intel)
Spark SQL is one of the most popular components of Apache Spark. Carson Wang and Yucai Yu explore Intel's efforts to improve SQL performance and offer an overview of an adaptive execution mode they implemented for Spark SQL.
1:45pm–2:25pm Thursday, December 7, 2017
Location: Summit 1 Level: Intermediate
Teresa Tung (Accenture Labs), Ishmeet Grewal (Accenture Labs), Jurgen Weichenberger (Accenture Analytics)
As Accenture scaled to millions of predictive models, it required automation to ensure accuracy, prevent false alarms, and preserve trust. Teresa Tung, Ishmeet Grewal, and Jurgen Weichenberger explain how Accenture implemented a DevOps process for analytical models that's akin to software development, guaranteeing analytics modeling at scale and even in noncloud environments at the edge.
1:45pm–2:25pm Thursday, December 7, 2017
Location: Summit 2 Level: Beginner
YongLiang Xu (StarHub), Masaru Dobashi (NTT Data Corp.)
SmartHub and NTT DATA have embarked on a partnership to design next-generation architecture to power the data products that will help generate new insights. YongLiang Xu and Masaru Dobashi explain how deep learning and other analytics models coexist within the same platform to address issues relating to smart cities.
1:45pm–2:25pm Thursday, December 7, 2017
Location: 308/309 Level: Advanced
Apache Beam allows data pipelines to run in batch or streaming mode on a variety of open source and private cloud data processing backends, including Apache Flink, Apache Spark, and Google Cloud Dataflow. Jean-Baptiste Onofré offers an overview of Apache Beam's programming model, explores mechanisms for efficiently building data pipelines, and demos an IoT use case dealing with MQTT messages.
1:45pm–2:25pm Thursday, December 7, 2017
Location: 310/311 Level: Beginner
Calvin Jia (Alluxio), Haoyuan Li (Alluxio)
Calvin Jia and Haoyuan Li explain how to decouple compute and storage with Alluxio, exploring the decision factors, considerations, and production best practices and solutions to best utilize CPUs, memory, and different tiers of disaggregated compute and storage systems to build out a multitenant high-performance platform.
2:35pm–3:15pm Thursday, December 7, 2017
Location: Summit 1 Level: Intermediate
Kazunori Sato (Google)
BigQuery is Google's fully managed, petabyte-scale data warehouse. Its user-defined function realizes "smart" queries with the power of machine learning, such as similarity searches or recommendations on images or documents with feature vectors and neural network prediction. Kazunori Sato demonstrates how BigQuery and TensorFlow together enable a powerful "data warehouse + ML" solution.
2:35pm–3:15pm Thursday, December 7, 2017
Location: Summit 2 Level: Intermediate
Chris Hausler (Zendesk), Arwen Griffioen (Zendesk)
Chris Hausler and Arwen Griffioen discuss Zendesk's experience with deep learning, using the example of Answer Bot, a question-answering system that resolves support tickets without agent intervention. They cover the benefits Zendesk has already seen and challenges encountered along the way.
2:35pm–3:15pm Thursday, December 7, 2017
Location: 308/309 Level: Beginner
Supreet Oberoi (Oracle)
Time series data is any dataset that is plotted over a range of time. Often, in IoT use cases, what is of interest is finding a pattern in the sequence of measurements. However, queries on time series data do not traditionally scale. Supreet Oberoi explains how Oracle adapted and extended symbolic aggregate approximation (SAX) to solve such challenges.
2:35pm–3:15pm Thursday, December 7, 2017
Location: 310/311 Level: Intermediate
Dong Li (Kyligence), Luke Han (Kyligence)
Apache Kylin is an extreme-scale distributed OLAP engine on Hadoop. Well-tuned cubes deliver the best performance at the least cost but require a comprehensive understanding of tuning principles. Dong Li and Luke Han explain advanced tuning and introduce KyBot, which finds and solves bottlenecks intelligently using AI methods applied to log analysis results.
4:15pm–4:55pm Thursday, December 7, 2017
Location: Summit 1 Level: Beginner
Prateek Nagaria (The Data Team)
Most data scientists use traditional methods of forecasting, such as exponential smoothing or ARIMA, to forecast product demand. However, when a product experiences several periods of zero demand, approaches such as Croston's method may provide better accuracy than these traditional methods. Prateek Nagaria compares traditional methods and Croston's method in R on intermittent demand time series.
4:15pm–4:55pm Thursday, December 7, 2017
Location: Summit 2
Adam Gibson (Skymind)
Adam Gibson demonstrates how to use variational autoencoders to automatically label time series location data. You'll explore the challenge of imbalanced classes and anomaly detection, learn how to leverage deep learning for automatically labeling (and the pitfalls of this), and discover how you can deploy these techniques in your organization.
4:15pm–4:55pm Thursday, December 7, 2017
Location: 308/309 Level: Intermediate
Xie Qi (Intel China), Quanfu Wang (Intel China)
Xie Qi and Quanfu Wang offer an overview of a configurable FPGA-based Spark SQL acceleration architecture that leverages FPGAs' very high parallel computing capability to tremendously accelerate Spark SQL queries and FPGAs' power efficiency to lower power consumption.
4:15pm–4:55pm Thursday, December 7, 2017
Location: 310/311 Level: Intermediate
Wei Chen (Intel), Zhaojuan Bian (Intel)
Kudu is designed to fill the gap between HDFS and HBase. However, designing a Kudu-based cluster presents a number of challenges. Wei Chen and Zhaojuan Bian share a real-world use case from the automobile industry to explain how to design a Kudu-based E2E system. They also discuss key indicators to tune Kudu and OS parameters and how to select the best hardware components for different scenarios.
5:05pm–5:45pm Thursday, December 7, 2017
Location: Summit 2 Level: Intermediate
Markus Kirchberg (Wismut Labs Pte. Ltd.)
As the share of digital payments increases, so does payment fraud, which almost tripled between 2013 and 2016. Markus Kirchberg explains how recent advances in AI and machine learning, decision sciences, and network sciences are driving the development of next-generation payment fraud capabilities for fraud scoring, deceptive merchant detection, and merchant compromise detection.
5:05pm–5:45pm Thursday, December 7, 2017
Location: 308/309 Level: Advanced
Yu-Xi Lim (Teralytics), Michal Wegrzyn (Teralytics)
Yu-Xi Lim and Michal Wegrzyn outline a high-throughput distributed software pattern capable of processing event streams in real time. At its core, the pattern relies on functional reactive programming idioms to shard and splice state fragments, ensuring high horizontal scalability, reliability, and high availability.
5:05pm–5:45pm Thursday, December 7, 2017
Location: 310/311 Level: Intermediate
Graham Dumpleton (Red Hat)
Jupyter notebooks provide a rich interactive environment for working with data. Running a single notebook is easy, but what if you need to provide a platform for many users at the same time? Graham Dumpleton demonstrates how to use JupyterHub to run a highly scalable environment for hosting Jupyter notebooks in education and business.