Presented by O'Reilly and Cloudera
December 5–6, 2016: Training
December 6–8, 2016: Tutorials & Conference
Singapore

Architecture conference sessions

2:35pm–3:15pm Wednesday, 12/07/2016
Alluxio is an open source memory-speed virtual distributed storage system. In the past year, the Alluxio open source community has grown to more than 300 developers, and the project has seen major improvements in performance and scalability along with a range of new features. Haoyuan Li offers an overview of Alluxio, covering its use cases, its community, and the value it brings.
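For readers who haven't used Alluxio, the sketch below shows roughly how an application writes and reads a file through the Alluxio 1.x Java client; the path and payload are placeholder assumptions, not material from the session.

```java
import java.nio.charset.StandardCharsets;

import alluxio.AlluxioURI;
import alluxio.client.file.FileInStream;
import alluxio.client.file.FileOutStream;
import alluxio.client.file.FileSystem;

public class AlluxioHello {
  public static void main(String[] args) throws Exception {
    // Client for the Alluxio cluster configured in alluxio-site.properties.
    FileSystem fs = FileSystem.Factory.get();
    AlluxioURI path = new AlluxioURI("/demo/events.log"); // hypothetical path

    // Write a small file; by default it lands in Alluxio's in-memory tier.
    try (FileOutStream out = fs.createFile(path)) {
      out.write("hello from alluxio".getBytes(StandardCharsets.UTF_8));
    }

    // Read it back through the same virtual namespace.
    byte[] buf = new byte[64];
    try (FileInStream in = fs.openFile(path)) {
      int n = in.read(buf);
      System.out.println(new String(buf, 0, n, StandardCharsets.UTF_8));
    }
  }
}
```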
1:45pm–2:25pm Wednesday, 12/07/2016
Hybrid cloud architectures marry the flexibility to scale workloads on demand in the public cloud with the ability to control mission-critical applications on-premises. Publish-subscribe message streams offer a natural paradigm for hybrid cloud use cases. Mathieu Dumoulin describes how to architect a real-time, global IoT analytics hybrid cloud application with a Kafka-based message stream system.
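As a minimal sketch of the publish side of such a pub-sub bridge (the broker addresses, topic name, and payload below are illustrative assumptions, not details from the talk), a Java Kafka producer publishing device readings looks roughly like this; a replication tool such as MirrorMaker can then mirror the topic between the on-premises and cloud sites.

```java
import java.util.Properties;

import org.apache.kafka.clients.producer.KafkaProducer;
import org.apache.kafka.clients.producer.Producer;
import org.apache.kafka.clients.producer.ProducerRecord;

public class SensorPublisher {
  public static void main(String[] args) {
    Properties props = new Properties();
    props.put("bootstrap.servers", "broker1:9092,broker2:9092"); // placeholder brokers
    props.put("key.serializer", "org.apache.kafka.common.serialization.StringSerializer");
    props.put("value.serializer", "org.apache.kafka.common.serialization.StringSerializer");

    try (Producer<String, String> producer = new KafkaProducer<>(props)) {
      // Key by device ID so each device's readings stay ordered within a partition.
      producer.send(new ProducerRecord<>("iot-sensor-readings", "device-42",
          "{\"temp\":21.5,\"ts\":1481094000}"));
    }
  }
}
```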
9:00am–12:30pm Tuesday, 12/06/2016
Mark Grover, Ted Malaska, and Jonathan Seidman explain how to architect a modern, real-time big data platform leveraging recent advancements in the open source software world and discuss how to use components like Kafka, Impala, Kudu, Spark Streaming, and Spark SQL with Hadoop to enable new forms of data processing and analytics.
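As a hedged illustration of one piece of such a platform (the topic, brokers, and batch interval are placeholder assumptions), the snippet below wires Kafka into Spark Streaming with the direct-stream API; a fuller pipeline would typically transform the records and write results to a store such as Kudu for SQL access.

```java
import java.util.Collections;
import java.util.HashMap;
import java.util.Map;

import org.apache.kafka.clients.consumer.ConsumerRecord;
import org.apache.spark.SparkConf;
import org.apache.spark.streaming.Durations;
import org.apache.spark.streaming.api.java.JavaInputDStream;
import org.apache.spark.streaming.api.java.JavaStreamingContext;
import org.apache.spark.streaming.kafka010.ConsumerStrategies;
import org.apache.spark.streaming.kafka010.KafkaUtils;
import org.apache.spark.streaming.kafka010.LocationStrategies;

public class ClickstreamCounter {
  public static void main(String[] args) throws InterruptedException {
    SparkConf conf = new SparkConf().setAppName("ClickstreamCounter");
    JavaStreamingContext jssc = new JavaStreamingContext(conf, Durations.seconds(10));

    Map<String, Object> kafkaParams = new HashMap<>();
    kafkaParams.put("bootstrap.servers", "broker1:9092"); // placeholder broker
    kafkaParams.put("key.deserializer", "org.apache.kafka.common.serialization.StringDeserializer");
    kafkaParams.put("value.deserializer", "org.apache.kafka.common.serialization.StringDeserializer");
    kafkaParams.put("group.id", "clickstream-demo");

    // Direct stream: one Kafka partition maps to one Spark partition.
    JavaInputDStream<ConsumerRecord<String, String>> stream = KafkaUtils.createDirectStream(
        jssc,
        LocationStrategies.PreferConsistent(),
        ConsumerStrategies.<String, String>Subscribe(
            Collections.singletonList("clickstream"), kafkaParams));

    // Count events per micro-batch; a real pipeline would do more useful work here.
    stream.count().print();

    jssc.start();
    jssc.awaitTermination();
  }
}
```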
2:35pm–3:15pm Wednesday, 12/07/2016
Mark Grover, Jonathan Seidman, and Ted Malaska, the authors of Hadoop Application Architectures, participate in an open Q&A session on considerations and recommendations for the architecture and design of applications using Hadoop. Come with questions about your use case and its big data architecture or just listen in on the conversation.
2:35pm–3:15pm Thursday, 12/08/2016
Alex Gutow and Henry Robinson explain how Apache Hadoop and Apache Impala (incubating) deliver the same functionality, partner ecosystem, and flexibility as on-premises deployments while taking advantage of the elasticity and cost efficiency of the cloud.
4:15pm–4:55pm Thursday, 12/08/2016
Ofer Ron examines the development of LivePerson's traffic targeting solution from a generic to a domain-specific implementation to demonstrate that a thorough understanding of the problem domain is essential to a good machine-learning-based product. Ofer then reviews the underlying architecture that makes this possible.
5:05pm–5:45pm Wednesday, 12/07/2016
Todd Lipcon and Marcel Kornacker provide an introduction to using Impala + Kudu to power real-time, data-centric applications for use cases like time series analysis (fraud detection, streaming market data), machine data analytics, and online reporting.
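As a rough sketch of the write path in such an application (the master address, table name, and schema here are hypothetical, not taken from the session), rows inserted through the Kudu Java client become immediately queryable from Impala with ordinary SQL.

```java
import org.apache.kudu.client.Insert;
import org.apache.kudu.client.KuduClient;
import org.apache.kudu.client.KuduSession;
import org.apache.kudu.client.KuduTable;

public class MarketTickWriter {
  public static void main(String[] args) throws Exception {
    // Hypothetical Kudu master; the table would be created beforehand,
    // for example through Impala DDL, with columns (symbol, ts, price).
    KuduClient client = new KuduClient.KuduClientBuilder("kudu-master:7051").build();
    try {
      KuduTable table = client.openTable("market_ticks");
      KuduSession session = client.newSession(); // default mode flushes each apply()

      Insert insert = table.newInsert();
      insert.getRow().addString("symbol", "ABC");
      insert.getRow().addLong("ts", System.currentTimeMillis());
      insert.getRow().addDouble("price", 12.34);
      session.apply(insert);
      session.close();
    } finally {
      client.close();
    }
  }
}
```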
11:15am–11:55am Wednesday, 12/07/2016
In 2007, a computer game company decided to jump ahead of competitors by capturing and using data created during online gaming, but it wasn't prepared to deal with the data management and process challenges stemming from distributed devices creating data. Mark Madsen shares a case study that explores the oversights, failures, and lessons the company learned along its journey.
1:30pm–5:00pm Tuesday, 12/06/2016
Big data and data science have great potential for accelerating business, but how do you reconcile the business opportunity with the sea of possible technologies? Data should serve the strategic imperatives of a business—those aspirations that will define an organization’s future vision. Scott Kurth and John Akred explain how to create a modern data strategy that powers data-driven business.
4:15pm–4:55pm Thursday, 12/08/2016
Everyone is talking about data lakes. The intended use of a data lake is as a central storage facility for performing analytics. But, Jim Scott asks, why have a separate data lake when your entire (or most of your) infrastructure can run directly on top of your storage, minimizing or eliminating the need for data movement, separate processes and clusters, and ETL?
5:05pm–5:45pm Wednesday, 12/07/2016
Creating big data solutions that process data at terabyte scale and produce spatiotemporal insights in real time demands a well-thought-out system architecture. Chandra Sekhar Saripaka details the production architecture at DataSpark that works through terabytes of spatiotemporal telco data each day in PaaS mode and showcases how DataSpark operates in SaaS mode.
12:05pm–12:45pm Thursday, 12/08/2016
If your organization has Hadoop clusters in research or as point solutions and you're wondering where you go from there, this session is for you. Phillip Radley explains how to run Hadoop as a service, providing an enterprise-wide data platform hosting hundreds of projects securely and predictably.
11:15am–11:55am Wednesday, 12/07/2016
Real-time data analysis is becoming more and more important to Internet companies’ daily business. Qunar has been running Alluxio in production for over a year. Lei Xu explores how stream processing on Alluxio has led to a 16x performance improvement on average and 300x improvement at service peak time on workloads at Qunar.
1:45pm–2:25pm Thursday, 12/08/2016
Mediacorp analyzes its online audience through a computationally and economically efficient cloud-based platform. The cornerstone of the platform is Apache Spark, a framework whose clean APIs and performance gains make it an ideal choice for data scientists. Andrea Gagliardi La Gala and Brandon Lee highlight the platform’s architecture, benefits, and considerations for deploying it in production.
5:05pm–5:45pm Thursday, 12/08/2016
Marketing has become ever more data driven. While there are thousands of marketing applications available, it is challenging to get an end-to-end line of sight and fully understand customers. Franz Aman explains how bringing the data from the various applications and data sources together in a data lake changes everything.
2:35pm–3:15pm Thursday, 12/08/2016
Implementing a data governance strategy that is agile enough to take on the new technical challenges of big data while being robust enough to meet corporate standards is a huge, emerging challenge. Clara Fletcher explores what next-generation data governance will look like and what the trends will be in this space.
3:25pm–4:05pm Wednesday, 12/07/2016
Interested in Alluxio or storage? Stop by and meet Jiri.
1:45pm–2:25pm Wednesday, 12/07/2016
Stop by and talk with John Akred if you want to build a strong data strategy.
1:45pm–2:25pm Thursday, 12/08/2016
Ted will talk about streaming architecture, microservices, how to build high-performance systems, open source, math, or machine learning.
5:05pm–5:45pm Thursday, 12/08/2016
Building a data lake involves more than installing and using Hadoop. The goal in most organizations is to build multiuse data infrastructure that is not subject to past constraints. Mark Madsen discusses hidden design assumptions, reviews design principles to apply when building multiuse data infrastructure, and provides a reference architecture.
12:05pm–12:45pm Wednesday, 12/07/2016
YARN includes security features such as SSL encryption, Kerberos-based authentication, and HDFS encryption. Nitin Khandelwal and Abhishek Modi share the challenges they faced in enabling these features for ephemeral clusters running in the cloud with multitenancy support, as well as performance numbers for the different encryption algorithms available.
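The speakers' exact configuration isn't reproduced here, but as a hedged sketch, the features in question map to standard Hadoop and YARN properties, which can be set in the *-site.xml files or programmatically:

```java
import org.apache.hadoop.conf.Configuration;

public class SecureClusterConfig {
  public static Configuration build() {
    Configuration conf = new Configuration();

    // Kerberos-based authentication and service-level authorization.
    conf.set("hadoop.security.authentication", "kerberos");
    conf.set("hadoop.security.authorization", "true");

    // Encrypt data transferred between HDFS clients and DataNodes.
    conf.set("dfs.encrypt.data.transfer", "true");

    // Serve the YARN and HDFS web UIs over HTTPS only.
    conf.set("yarn.http.policy", "HTTPS_ONLY");
    conf.set("dfs.http.policy", "HTTPS_ONLY");

    return conf;
  }
}
```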
2:35pm–3:15pm Wednesday, 12/07/2016
Garbage in, garbage out—this truism has become significantly more impactful for big data as companies have moved away from traditional schema-based approaches to more flexible and dynamic file system approaches. Steve Jones explains how to add governance, schema evolution, and the industrialization required to deliver true enterprise-grade big data solutions.
4:15pm–4:55pm Wednesday, 12/07/2016
If your design focuses only on the processing layer to get speed and power, you may be leaving a significant amount of optimization untapped. Ted Malaska describes a set of storage design patterns and schemas implemented on HBase, Kudu, Kafka, Solr, HDFS, and S3 that, by carefully tailoring how data is stored, can reduce processing and access times by two to three orders of magnitude.
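As one small, hedged example of the kind of storage-side tailoring this refers to (the table layout and key scheme below are illustrative assumptions, not the patterns from the talk), an HBase row key can be designed so that "latest events for an entity" becomes a short forward scan rather than a large filtered read:

```java
import org.apache.hadoop.hbase.client.Put;
import org.apache.hadoop.hbase.util.Bytes;

public class TimeSeriesKeyDesign {
  /**
   * Row key layout: <entityId>:<reversed timestamp>.
   * Reversing the timestamp stores the newest cells first, so the most recent
   * events for an entity sit at the start of that entity's key range.
   */
  static byte[] rowKey(String entityId, long eventTimeMillis) {
    long reversedTs = Long.MAX_VALUE - eventTimeMillis;
    return Bytes.add(Bytes.toBytes(entityId + ":"), Bytes.toBytes(reversedTs));
  }

  static Put toPut(String entityId, long eventTimeMillis, String payloadJson) {
    Put put = new Put(rowKey(entityId, eventTimeMillis));
    // A single short column family keeps the on-disk layout compact.
    put.addColumn(Bytes.toBytes("d"), Bytes.toBytes("event"), Bytes.toBytes(payloadJson));
    return put;
  }
}
```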
4:15pm–4:55pm Wednesday, 12/07/2016
Jorge Pablo Fernandez and Nicolette Bullivant explore Santander Bank's Spendlytics app, which helps customers track their spending by offering a listing of transactions, transaction aggregations, and real-time enrichment based on the categorization of transactions by market and brand. Along the way, they share the challenges encountered and lessons learned while implementing the app.
11:15am–11:55am Wednesday, 12/07/2016
Spark is white-hot, but why does it matter? Some technologies cause more excitement than others, and at first the only people who understand why are the developers who use them. John Akred offers a tour through the hottest emerging data technologies of 2016 and explains why they’re exciting, in the context of the new capabilities and economies they bring.
12:05pm–12:45pm Thursday, 12/08/2016
Chi-Yi Kuan, Weidong Zhang, and Yongzheng Zhang explain how LinkedIn has built a "voice of member" platform to analyze hundreds of millions of text documents. Chi-Yi, Weidong, and Yongzheng illustrate the critical components of this platform and showcase how LinkedIn leverages it to derive insights such as customer value propositions from an enormous amount of unstructured data.
11:15am–11:55am Thursday, 12/08/2016
Jason Dai and Yiheng Wang share their experience building web-scale machine learning using Apache Spark—focusing specifically on "war stories" (e.g., in-game purchases, fraud detection, and deep learning)—outline best practices for scaling these learning algorithms, and discuss trade-offs in designing learning systems for the Spark framework.
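The speakers' systems aren't reproduced here, but as a hedged sketch of the general style of pipeline involved (the input path and features are placeholders), Spark's DataFrame-based ML API keeps a fraud-style binary classifier roughly this small in Java:

```java
import org.apache.spark.ml.classification.LogisticRegression;
import org.apache.spark.ml.classification.LogisticRegressionModel;
import org.apache.spark.sql.Dataset;
import org.apache.spark.sql.Row;
import org.apache.spark.sql.SparkSession;

public class FraudModelSketch {
  public static void main(String[] args) {
    SparkSession spark = SparkSession.builder().appName("FraudModelSketch").getOrCreate();

    // Placeholder input: a libsvm-format file with a binary "is fraud" label
    // and already-engineered features.
    Dataset<Row> training = spark.read().format("libsvm").load("hdfs:///data/fraud_train.libsvm");

    LogisticRegression lr = new LogisticRegression()
        .setMaxIter(100)
        .setRegParam(0.01);

    LogisticRegressionModel model = lr.fit(training);
    model.transform(training).select("label", "prediction").show(5);

    spark.stop();
  }
}
```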