Presented By O'Reilly and Cloudera
December 5–6, 2016: Training
December 6–8, 2016: Tutorials & Conference

Data platform conference sessions

11:30am–12:00pm Tuesday, 12/06/2016
In semiconductor manufacturing, creating a high-yield process in which a sufficient portion of chips passes acceptance testing is extremely difficult. Data is collected and analyzed at every stage to improve yield and productivity. Amit Rustagi and Jingwen Ouyang share a Hadoop-based solution that reveals the true value and benefits of the manufacturing data generated about every chip.
4:15pm–4:55pm Wednesday, 12/07/2016
Santander was one of the last big banks in the UK to start using Hadoop and other big data technologies. However, the maturity of the technology made it possible to create a customer-facing data product in production in less than a year and a fully adopted production analytics platform in less than two. Antonio Alvarez shares what other late entrants can learn from this experience.
9:30am–10:00am Tuesday, 12/06/2016
Sarang Anajwala discusses Autodesk’s next-generation data platform and its transition from an application for usage analytics to a platform for data analytics that provides capabilities such as self-service ETL, data exploration, multitenant data apps, and data products. This versatile platform supports use cases ranging from dashboards to data science, supporting Autodesk’s move toward a data-centric future.
1:45pm–2:25pm Thursday, 12/08/2016
Mediacorp analyzes its online audience through a computationally and economically efficient cloud-based platform. The cornerstone of the platform is Apache Spark, a framework whose clean APIs and performance gains make it an ideal choice for data scientists. Andrea Gagliardi La Gala and Brandon Lee highlight the platform’s architecture, benefits, and considerations for deploying it in production.
5:05pm–5:45pm Thursday, 12/08/2016
Rebecca Tien Yu Lin and Mon-Fong Mike Jiang offer an overview of a Hadoop-based big data solution that helps the semiconductor industry increase yield by monitoring the huge volume of tool logs and data generated by the FDC (fault detection and classification) system.
5:05pm–5:45pm Thursday, 12/08/2016
Marketing has become ever more data driven. While there are thousands of marketing applications available, it is challenging to get an end-to-end line of sight and fully understand customers. Franz Aman explains how bringing the data from the various applications and data sources together in a data lake changes everything.
5:05pm–5:45pm Wednesday, 12/07/2016
Takayuki Nishikawa and Ei Yamaguchi explain how Panasonic developed an integrated data analytics platform to analyze the growing number of home appliance logs from its IoT products, achieving scalability to millions of households and a 10x improvement in processing time with Hadoop and Hive, and in the process gaining more reliable knowledge about users’ lifestyles with Spark MLlib.
11:15am–11:55am Wednesday, 12/07/2016
IHI has developed a common platform for remote monitoring and maintenance and has started leveraging Spark MLlib to get up to speed developing applications for process improvement and product fault diagnosis. Yoshitaka Suzuki and Masaru Dobashi explain how IHI used PySpark and MLlib to improve its services and share best practices for application development and lessons learned from operating Spark on YARN.
10:15am–10:35am Thursday, 12/08/2016
M. C. Srivas covers the technologies underpinning the big data architecture at Uber and explores some of the real-time problems Uber needs to solve to make ride sharing as smooth and ubiquitous as running water, explaining how they are related to real-time big data analytics.
9:05am–9:30am Tuesday, 12/06/2016
Lyudmila Lugovskaya and Stuart Coleman discuss some of the many challenges that organizations face on their journey to become data-centric and share lessons learned from their experience doing and promoting data science within organizations of different types and sizes while dealing with restrictions imposed by traditional governance structures and policies.
12:05pm–12:45pm Wednesday, 12/07/2016
Twitter generates billions of events per day, and analyzing these events in real time presents a massive challenge. Maosong Fu offers an overview of the end-to-end real-time stack Twitter designed to meet this challenge, consisting of DistributedLog (the distributed and replicated messaging system) and Heron (the streaming system for real-time computation).
12:05pm–12:45pm Thursday, 12/08/2016
Chi-Yi Kuan, Weidong Zhang, and Yongzheng Zhang explain how LinkedIn has built a "voice of member" platform to analyze hundreds of millions of text documents. Chi-Yi, Weidong, and Yongzheng illustrate the critical components of this platform and showcase how LinkedIn leverages it to derive insights such as customer value propositions from an enormous amount of unstructured data.