Presented By O'Reilly and Cloudera
Make Data Work
Dec 4–5, 2017: Training
Dec 5–7, 2017: Tutorials & Conference
Singapore

Data Engineering & Architecture

Ben Lorica, Strata Conference Chair

How to build an analytics infrastructure that unlocks the value of your data

The right data infrastructure and architecture streamline your workflows, reduce costs, and scale your data analysis. The wrong architecture costs time and money that may never be recovered.

It’s not easy. Selecting and building the tools and architecture you need is complex. You need to consider scalability, tools and frameworks (open source or proprietary), platforms (on-premises, cloud, or hybrid), integration, adoption and migration, and security. That’s why good data engineers are in such demand.

These sessions will help you navigate the pitfalls, select the right tools and technologies, and design a robust data pipeline.

All Strata Data Conference Gold and Silver passes have access to Data Engineering and Architecture sessions Tuesday-Thursday. Platinum and Bronze passes have access to Data Engineering and Architecture sessions Wednesday-Thursday.

Tuesday December 5: Tutorials (Gold & Silver passes)
Location: 308/309
Wednesday December 6: Keynotes & Sessions (Gold, Silver & Bronze passes)
Locations: Summit 1, 308/309, 310/311
8:50am | Location: Hall 404AXF
Strata Data Conference Keynotes
10:45am
Morning break
12:45pm
Lunch
3:15pm
Afternoon break
5:45pm | Location: Sponsor Pavilion
Sponsor Pavilion Reception
Thursday December 7: Keynotes & Sessions (Gold, Silver & Bronze passes)
Locations: Summit 1, Summit 2, 308/309, 310/311
8:50am | Location: Hall 404AXF
Strata Data Conference Keynotes
10:45am
Morning break
12:45pm
Lunch
3:15pm
Afternoon break
1:30pm–5:00pm Tuesday, December 5, 2017
Location: 308/309 Level: Intermediate
Jonathan Seidman (Cloudera), Ted Malaska (Blizzard Entertainment)
Average rating: ****.
(4.00, 5 ratings)
Using Customer 360 and the IoT as examples, Jonathan Seidman and Ted Malaska explain how to architect a modern, real-time big data platform leveraging recent advancements in the open source software world, using components like Kafka, Impala, Kudu, Spark Streaming, and Spark SQL with Hadoop to enable new forms of data processing and analytics.
11:15am–11:55am Wednesday, December 6, 2017
Location: Summit 1 Level: Beginner
Average rating: *....
(1.25, 4 ratings)
In the current Agile business environment, where developers must experiment with multiple ideas and react to various situations, cloud-native development is the way to go. Harjinder Mistry and Bargava Subramanian explain how to design and build a microservices-based cloud-native machine learning application.
11:15am–11:55am Wednesday, December 6, 2017
Location: 308/309 Level: Intermediate
Ted Malaska (Blizzard Entertainment)
Average rating: ****.
(4.67, 9 ratings)
Ted Malaska shares the top five mistakes that no one talks about when you start writing your streaming app along with the practices you'll inevitably need to learn along the way.
11:15am–11:55am Wednesday, December 6, 2017
Location: 310/311 Level: Intermediate
Average rating: ***..
(3.00, 2 ratings)
Neelesh Srinivas Salian offers an overview of the data platform used by data scientists at Stitch Fix, based on the Spark ecosystem. Neelesh explains the development process and shares some lessons learned along the way.
12:05pm–12:45pm Wednesday, December 6, 2017
Location: Summit 1 Level: Intermediate
Jared Lander (Lander Analytics)
Average rating: ***..
(3.33, 3 ratings)
One common (but false) knock against R is that it doesn't scale well. Jared Lander shows how to use R in a performant manner, in terms of both speed and data size, and offers an overview of packages for running R at scale.
1:45pm–2:25pm Wednesday, December 6, 2017
Location: Summit 1 Level: Intermediate
Aki Ariga (Cloudera)
Average rating: ***..
(3.00, 1 rating)
Aki Ariga explains how to put your machine learning model into production, discusses common issues and obstacles you may encounter, and shares best practices and typical architecture patterns for deploying ML models, with example designs from the Hadoop and Spark ecosystem using Cloudera Data Science Workbench.
1:45pm–2:25pm Wednesday, December 6, 2017
Location: 308/309 Level: Beginner
Ofir Sharony (MyHeritage)
Average rating: ****.
(4.57, 7 ratings)
What are the most important considerations for shipping billions of daily events to analysis? Ofir Sharony shares MyHeritage's journey to find a reliable and efficient way to achieve real-time analytics. Along the way, Ofir compares several data loading techniques, helping you make better choices when building your next data pipeline.
1:45pm–2:25pm Wednesday, December 6, 2017
Location: 310/311 Level: Intermediate
Vickye Jain (ZS Associates), Raghav Sharma (ZS Associates)
Average rating: *****
(5.00, 1 rating)
Vickye Jain and Raghav Sharma explain how they built a very high-performance data processing platform powered by Spark that balances the considerations of extreme performance, speed of development, and cost of maintenance.
2:35pm–3:15pm Wednesday, December 6, 2017
Location: Summit 1 Level: Beginner
Wai Yau (Zendesk), Jeffrey Theobald (Zendesk)
Average rating: ****.
(4.75, 8 ratings)
Simply building a successful machine learning model is extremely challenging, and just as much effort is needed to turn that model into a customer-facing product. Drawing on their experience working on Zendesk's article recommendation product, Wai Yau and Jeffrey Theobald discuss design challenges and real-world problems you may encounter when building a machine learning product at scale.
2:35pm–3:15pm Wednesday, December 6, 2017
Location: Summit 2 Level: Intermediate
Natalino Busa (DBS), Matteo Pelati (DataRobot)
Average rating: ****.
(4.00, 3 ratings)
Modern engineering requires machine learning engineers who can implement and monitor ETL and machine learning models in production. Natalino Busa shares technologies, techniques, and blueprints on how to robustly and reliably manage data science and ETL flows from inception to production.
2:35pm–3:15pm Wednesday, December 6, 2017
Location: 308/309 Level: Intermediate
Yousun Jeong (SK Telecom)
Average rating: ****.
(4.25, 4 ratings)
Data transfer is one of the most pressing problems for telecom companies, as cost increases in tandem with the growing data requirements. Yousun Jeong details how SKT has dealt with this problem.
2:35pm–3:15pm Wednesday, December 6, 2017
Location: 310/311 Level: Intermediate
Mingxi Wu (TigerGraph), Yu Xu (TigerGraph)
Average rating: *****
(5.00, 1 rating)
Mingxi Wu and Yu Xu offer an overview of TigerGraph, a high-performance enterprise graph data platform that enables businesses to transform structured, semistructured, and unstructured data and massive enterprise data silos into an intelligent interconnected data network, allowing them to uncover the implicit patterns and critical insights to drive business growth.
4:15pm–4:55pm Wednesday, December 6, 2017
Location: Summit 1 Level: Intermediate
Holden Karau (Google)
Average rating: ****.
(4.50, 6 ratings)
Apache Spark’s machine learning (ML) pipelines provide a lot of power, but sometimes the tools you need for your specific problem aren’t available yet. Holden Karau introduces Spark’s ML pipelines and explains how to extend them with your own custom algorithms, allowing you to take advantage of Spark's meta-algorithms and existing ML tools.
4:15pm–4:55pm Wednesday, December 6, 2017
Location: 308/309 Level: Intermediate
Xiaochang Wu (Intel)
Average rating: ****.
(4.00, 1 rating)
Xiaochang Wu explains how to design and implement a real-time processing platform using the Spark Structured Streaming framework to intelligently transform production lines in the manufacturing industry.
4:15pm–4:55pm Wednesday, December 6, 2017
Location: 310/311 Level: Beginner
Feng Cheng (Grab), Yanyu Qu (Grab)
Average rating: *****
(5.00, 1 rating)
Grab uses Presto to support operational reporting (batch and near real-time), ad hoc analyses, and its data pipeline. Currently, Grab has 5+ clusters with 100+ instances in production on AWS and serves up to 30K queries per day while supporting more than 200 internal data users. Feng Cheng and Yanyu Qu explain how Grab operationalizes Presto in the cloud and share lessons learned along the way.
5:05pm–5:45pm Wednesday, December 6, 2017
Location: Summit 1 Level: Intermediate
Peng Meng (Intel)
Average rating: *....
(1.00, 1 rating)
Apache Spark ML and MLlib are hugely popular in the big data ecosystem, and Intel has been deeply involved in Spark from a very early stage. Peng Meng outlines the methodology behind Intel's work on Spark ML and MLlib optimization and shares a case study on boosting the performance of Spark MLlib ALS by 60x in JD.com’s production environment.
5:05pm–5:45pm Wednesday, December 6, 2017
Location: 308/309 Level: Intermediate
Average rating: ****.
(4.33, 3 ratings)
Andreas Hadimulyono discusses the challenges that Grab is facing with the ever-increasing volume and velocity of its data and shares the company's plans to overcome them.
5:05pm–5:45pm Wednesday, December 6, 2017
Location: 310/311 Level: Intermediate
Greg Rahn (Cloudera)
Average rating: ****.
(4.00, 1 rating)
Cloud environments will likely play a key role in your business’s future. Henry Robinson and Greg Rahn explore the workload considerations when evaluating the cloud for analytics and discuss common architectural patterns to optimize price and performance.
11:15am–11:55am Thursday, December 7, 2017
Location: Summit 1 Level: Beginner
Paco Nathan (O'Reilly Media)
Average rating: ****.
(4.60, 5 ratings)
Paco Nathan explains how O'Reilly employs AI, from the obvious (chatbots, case studies about other firms) to the less so (using AI to show the structure of content in detail, enhance search and recommendations, and guide editors for gap analysis, assessment, pathing, etc.). Approaches include vector embedding search, summarization, TDA for content gap analysis, and speech-to-text to index video.
11:15am–11:55am Thursday, December 7, 2017
Location: 308/309 Level: Beginner
Wataru Yukawa (LINE)
Average rating: ****.
(4.00, 3 ratings)
Data is a very important asset to LINE, one of the most popular messaging applications in Asia. Wataru Yukawa explains how LINE gets the most out of its data using a Hadoop data lake and an in-house log analysis platform.
11:15am–11:55am Thursday, December 7, 2017
Location: 310/311 Level: Intermediate
Holden Karau (Google), Joey Echeverria (Rocana)
Average rating: ****.
(4.20, 5 ratings)
Apache Spark offers greatly improved performance over traditional MapReduce models. Much of Apache Spark’s power comes from lazy evaluation along with intelligent pipelining, which can make debugging more challenging. Holden Karau and Joey Echeverria explore how to debug Apache Spark applications, the different options for logging in Spark, and more.
12:05pm–12:45pm Thursday, December 7, 2017
Location: Summit 1 Level: Intermediate
Graham Gear (Cloudera)
Average rating: *****
(5.00, 3 ratings)
How can we drive more data pipelines, advanced analytics, and machine learning models into production? How can we do this both faster and more reliably? Graham Gear draws on real-world processes and systems to explain how it's possible to apply continuous delivery techniques to advanced analytics, realizing business value earlier and more safely.
12:05pm–12:45pm Thursday, December 7, 2017
Location: Summit 2 Level: Beginner
Xianyan Jia (Intel), Zhenhua Wang (JD.com)
Xianyan Jia and Zhenhua Wang explore deep learning applications built successfully with BigDL. They also teach you how to develop fast prototypes with BigDL's off-the-shelf deep learning toolkit and build end-to-end deep learning applications with flexibility and scalability using BigDL on Spark.
12:05pm–12:45pm Thursday, December 7, 2017
Location: 308/309 Level: Intermediate
Tzu-Li (Gordon) Tai (data Artisans)
Average rating: *****
(5.00, 1 rating)
Apache Flink is evolving from a framework for streaming data analytics to a platform that offers a foundation for event-driven applications, replacing the data management aspects that are typically handled by a database in more conventional architectures. Tzu-Li (Gordon) Tai explores the key features that are powering Flink's evolution, along with demonstrations of them in action.
12:05pm–12:45pm Thursday, December 7, 2017
Location: 310/311 Level: Intermediate
Carson Wang (Intel), Yucai Yu (Intel)
Average rating: ****.
(4.50, 2 ratings)
Spark SQL is one of the most popular components of Apache Spark. Carson Wang and Yucai Yu explore Intel's efforts to improve SQL performance and offer an overview of an adaptive execution mode they implemented for Spark SQL.
1:45pm–2:25pm Thursday, December 7, 2017
Location: Summit 1 Level: Intermediate
Teresa Tung (Accenture Labs), Ishmeet Grewal (Accenture Labs), Jurgen Weichenberger (Accenture Analytics)
Average rating: ****.
(4.50, 2 ratings)
As Accenture scaled to millions of predictive models, it required automation to ensure accuracy, prevent false alarms, and preserve trust. Teresa Tung, Ishmeet Grewal, and Jurgen Weichenberger explain how Accenture implemented a DevOps process for analytical models that's akin to software development, guaranteeing analytics modeling at scale and even in noncloud environments at the edge.
1:45pm–2:25pm Thursday, December 7, 2017
Location: Summit 2 Level: Beginner
YongLiang Xu (StarHub), Masatake Iwasaki (NTT DATA)
Average rating: *****
(5.00, 1 rating)
SmartHub and NTT DATA have embarked on a partnership to design next-generation architecture to power the data products that will help generate new insights. YongLiang Xu and Masatake Iwasaki explain how deep learning and other analytics models can coexist on the same platform to address opportunities and challenges in initiatives such as smart cities.
1:45pm–2:25pm Thursday, December 7, 2017
Location: 308/309 Level: Advanced
Average rating: *****
(5.00, 1 rating)
Apache Beam allows data pipelines to run in batch or streaming mode on a variety of open source and private cloud data processing backends, including Apache Flink, Apache Spark, and Google Cloud Dataflow. Jean-Baptiste Onofré offers an overview of Apache Beam's programming model, explores mechanisms for efficiently building data pipelines, and demos an IoT use case dealing with MQTT messages.
1:45pm–2:25pm Thursday, December 7, 2017
Location: 310/311 Level: Beginner
Calvin Jia (Alluxio), Haoyuan Li (Alluxio)
Calvin Jia and Haoyuan Li explain how to decouple compute and storage with Alluxio, exploring the decision factors, considerations, and production best practices and solutions to best utilize CPUs, memory, and different tiers of disaggregated compute and storage systems to build out a multitenant high-performance platform.
2:35pm–3:15pm Thursday, December 7, 2017
Location: Summit 1 Level: Intermediate
Kaz Sato (Google)
Average rating: ****.
(4.00, 1 rating)
BigQuery is Google's fully managed, petabyte-scale data warehouse. Its user-defined functions enable "smart" queries with the power of machine learning, such as similarity searches or recommendations on images or documents with feature vectors and neural network prediction. Kazunori Sato demonstrates how BigQuery and TensorFlow together enable a powerful "data warehouse + ML" solution.
2:35pm–3:15pm Thursday, December 7, 2017
Location: 308/309 Level: Beginner
Supreet Oberoi (Oracle)
Average rating: ****.
(4.00, 1 rating)
Time series data is any dataset that is plotted over a range of time. Often, in IoT use cases, what is of interest is finding a pattern in the sequence of measurements. However, queries on time series data do not traditionally scale. Supreet Oberoi explains how Oracle adapted and extended symbolic aggregate approximation (SAX) to solve such challenges.
2:35pm–3:15pm Thursday, December 7, 2017
Location: 310/311 Level: Intermediate
Dong Li (Kyligence), Luke Han (Kyligence)
Average rating: *****
(5.00, 1 rating)
Apache Kylin is an extreme distributed OLAP engine on Hadoop. Well-tuned cubes bring about the best performance with the least cost but require a comprehensive understanding of tuning principles to use. Dong Li and Luke Han explain advanced tuning and introduce KyBot, which helps find and solve bottlenecks in an intelligent way with AI methods performed on log analysis results.
4:15pm–4:55pm Thursday, December 7, 2017
Location: Summit 1 Level: Beginner
Prateek Nagaria (The Data Team)
Most data scientists use traditional methods of forecasting, such as exponential smoothing or ARIMA, to forecast product demand. However, when the product experiences several periods of zero demand, approaches such as Croston may provide better accuracy than these traditional methods. Prateek Nagaria compares traditional and Croston methods in R on intermittent demand time series.
4:15pm–4:55pm Thursday, December 7, 2017
Location: 308/309 Level: Intermediate
Xie Qi (Intel), Quanfu Wang (Intel China)
Xie Qi and Quanfu Wang offer an overview of a configurable FPGA-based Spark SQL acceleration architecture that leverages FPGAs' very high parallel computing capability to tremendously accelerate Spark SQL queries and FPGAs' power efficiency to lower power consumption.
4:15pm–4:55pm Thursday, December 7, 2017
Location: 310/311 Level: Intermediate
Wei Chen (Intel), Zhaojuan Bian (Intel)
Kudu is designed to fill the gap between HDFS and HBase. However, designing a Kudu-based cluster presents a number of challenges. Wei Chen and Zhaojuan Bian share a real-world use case from the automobile industry to explain how to design a Kudu-based E2E system. They also discuss key indicators to tune Kudu and OS parameters and how to select the best hardware components for different scenarios.
5:05pm–5:45pm Thursday, December 7, 2017
Location: 308/309 Level: Advanced
Yu-Xi Lim (Teralytics), Michał Węgrzyn (Teralytics)
Average rating: *****
(5.00, 2 ratings)
Yu-Xi Lim and Michał Węgrzyn outline a high-throughput distributed software pattern capable of processing event streams in real time. At its core, the pattern relies on functional reactive programming idioms to shard and splice state fragments, ensuring high horizontal scalability, reliability, and high availability.
5:05pm–5:45pm Thursday, December 7, 2017
Location: 310/311 Level: Intermediate
Graham Dumpleton (Red Hat)
Average rating: ***..
(3.00, 2 ratings)
Jupyter notebooks provide a rich interactive environment for working with data. Running a single notebook is easy, but what if you need to provide a platform for many users at the same time? Graham Dumpleton demonstrates how to use JupyterHub to run a highly scalable environment for hosting Jupyter notebooks in education and business.