Presented By O'Reilly and Cloudera
Make Data Work
December 1–3, 2015 • Singapore

All Sessions

Below is a preliminary listing of all confirmed sessions for Strata + Hadoop World in Singapore 2015. We'll be adding more sessions daily and unveiling the full program soon.

Josh Patterson (Patterson Consulting)
Slides:   1-PPTX 
In this session we will take a practical look at what deep learning is and introduce DL4J. We'll look at how it supports deep learning in the enterprise on the JVM. We’ll discuss the architecture of DL4J’s scale-out parallelization on Hadoop and Spark in support of modern machine learning workflows.
The music industry has been forever changed by tech and data. Here are some of the surprising things I learned over the course of six months and 70 interviews with some of the industry's biggest innovators.
This talk introduces a 12-step guide to help secure a data deployment in the cloud. Using the help of open source solutions and security best practices, you will be familiarized with a simple yet effective framework that can be used to fortify your own data-driven deployment in the cloud against accidental and malicious data breaches.
PechaKucha 20x20 is a simple presentation format where you show 20 images, each for 20 seconds. The images advance automatically and you talk along to the images.
Jeff Markham (Hortonworks)
Attend this session to: *Learn about the key drivers behind the shift to Hadoop-based platforms *Understand the common steps in the journey to adopting Hadoop *Hear real-world case studies of business transformation using Hadoop *Get insight into the core components that make up Open Enterprise Hadoop
In this talk, I will share my experience and lessons learned from my professional career in industrial R&D on how to achieve meaningful impact as a researcher (or engineer, scientist, technician, innovator, <insert what-not>). Throughout my talk, I will draw examples from my current team, in particular our approach to the innovation life cycle and some of our team dynamics.
Feng-Yuan Liu (Infocomm Development Authority of Singapore)
At IDA’s Government Analytics department, our team of data scientists works with bus operators to offer demand-driven express bus routes by combining crowdsourcing and big data. We use Apache Spark to analyze ticketing, taxi, and crowdsourced data to find bus routes that are both time-saving and financially viable. We show how these insights are delivered into a new transport option for commuters.
Slides:   1-PPTX 
In this talk, we will present our efforts on building large scale distributed ML on Apache Spark with many "web-scale" companies, including very complex and advanced analytics applications / algorithms (e.g., topic modelling, deep neural network, etc.), as well as massively scalable learning system/platform leveraging both application and infrastructure specific optimizations.
Patrick McFadin (DataStax)
This tutorial is all about managing large volumes of data coming at your data center fast and continuously. If you don't have a strategy, then allow me to help. Amazing Apache Project software can make this problem a lot easier to deal with. Spend a few hours and learn about how each part works, and how they work together. Your users will thank you.
Deepak Ramanathan (SAS Asia Pacific)
With Hadoop becoming the chosen Data Platform across enterprises, analytical lifecycles are now being powered with Hadoop being the centrepiece for discovery and deployment. During this talk, attendees will get insights from organisations that are building and deploying thousands of analytical models into their operational environments.
Jonathan Hsieh (Cloudera, Inc), Philip Langdale (Cloudera), Kathleen Ting (Cloudera), Kostas Sakellis (Cloudera)
Collectively we have sixteen years of Hadoop ops experience. Come to our AMA and discuss debugging and tuning between the different layers (app, Hadoop, JVM, kernel, networking) as well as tools and subsystems to keep your Hadoop clusters always up, running, and secure.
Kathleen Ting (Cloudera), Jonathan Hsieh (Cloudera, Inc), Philip Langdale (Cloudera), Kostas Sakellis (Cloudera)
Slides:   external link,   2-PDF 
Hadoop is emerging as the standard for big data processing and analytics. However, as usage of Hadoop clusters grows, so do the demands of managing and monitoring these systems. In this tutorial, attendees will get an overview of all phases of successfully managing Hadoop clusters, with an emphasis on production systems.
Ju Fan (National University of Singapore), Wei Wang (National University of Singapore)
We will introduce Apache SINGA, a flexible and scalable deep learning platform for big data analytics. SINGA is flexible enough to support various deep learning models and general enough to provide a scalable training architecture. We will also show two applications that demonstrate how SINGA helps with healthcare data analytics: predicting risk of readmission and modeling chronic disease progression.
Masaru Dobashi (NTT DATA Corporation), Yoshitaka Suzuki (IHI Corporation)
We are developing a platform to process massive sensor data obtained from social infrastructures and industrial machinery all over the world, in order to achieve advanced safety management. In this session, we'll talk about the capability of Spark to realize time-series data processing, the best practices of application development, and realistic lessons on operating Spark on YARN.
Ted Malaska (Capital One), Mark Grover (Lyft)
In this session, we will discuss common architectural patterns for building streaming applications.
Grab a drink, mingle with fellow Strata + Hadoop World participants, and see the latest technologies and products from leading companies in the data space.
Join us as we review the Big Data landscape and reflect on Big Data lessons being learned in enterprise over the last few years and how these organisations are avoiding their Big Data environments becoming unmanageable by using simplex management for deployment, administration, monitoring and reporting no matter how much the environment scales.
Ivan Teh (Fusionex)
If you could simulate the results of your business decisions, wouldn't that change the way you manage your business? The availability of big data solutions today introduces new management principles, opportunities as well as challenges.
SeongHwa Ahn (SK telecom), jisung kim (SK Telecom)
With a big data system built on the Hadoop platform, we resolved the performance slowdowns of our existing RDBMS-based legacy system. We also set up a real-time pattern analysis system using Spark. Compared with the legacy system, it gives hands-on workers an easy and quick way to monitor and diagnose manufacturing processes.
Sujit Mathew (PayPal), Yew Yap Goh (PayPal)
Slides:   external link
Our team’s main focus at PayPal is to boost customer engagement. This talk is about how we use predictive modeling to recommend products to consumers. We will talk about the technologies we use and how we deploy our models to production.
Evan Chan (Tuplejump)
Slides:   external link
This talk will show architectures and techniques for combining Apache Cassandra and Spark to yield a 10-1000x improvement in OLAP analytical performance, and introduce a new open source database that takes advantage of these techniques.
Shirshanka Das (LinkedIn)
Slides:   1-PDF 
LinkedIn describes how they’ve built a self-serve petabyte-scale reporting platform centered around Hadoop, that powers all business decision making at LinkedIn. We describe how we overcame challenges to scale to over a thousand analysts, over a thousand metrics, and provide daily, hourly, as well as real-time reports. This has reduced turnaround times for dashboards from weeks to a few hours.
Deepak Agrawal (24[7] Inc.)
This talk is about an application of big data predictive analytics to improve the online customer experience. The application is built using big data infrastructure with Hadoop, Cassandra, and machine learning algorithms using R and Python, that predict customer intent and take actions in real time to deliver an enhanced experience. Key challenges and lessons learned are also discussed.
Kai Xin Thia (Lazada)
Southeast Asia provides a unique challenge to large recommender systems: how will you design one system that recommends products to millions of users, many of whom are spread across several countries, with their own language and cultural preferences? Well, you don't. Instead, we will explore a hybrid system that integrates inputs from a variety of recommenders and deploys it on a distributed system.
Doug Cutting (Cloudera)
The data century is upon us and Apache Hadoop has emerged as the platform for managing your big data opportunity. The path to success is not without its perils, however, and without a thoughtful approach progress can be hindered by the impact of change, trust and security.
Wing Leong Ho (CLOUDERA)
Cloudera University's one-day essentials course presents an overview of Apache Hadoop and how it can help decision-makers meet business goals, providing a fundamental introduction to the main components of Hadoop and its use cases in various industries. This course is a good starting point for any role or set of objectives and is part of the data analyst learning path.
Yves-Alexandre de Montjoye (Imperial College London | MIT Media Lab)
We're living in an age of big data, a time when metadata about most of our movements and actions are collected and stored in real time. These data offer unprecedented insights on how we behave. Mathematical analysis of metadata, however, reveals how unique our behavior is and how this behavior puts fundamental constraints on our privacy.
Dave Chan (UBM Asia), Sonal Goyal (Nube)
Slides:   1-PDF 
UBM Asia is the largest trade show organizer in Asia. To deal with duplicate customer records and ensure clean marketing data, UBM Asia has built an end-to-end solution using Reifier from Nube Technologies, built atop Spark. This talk will discuss UBM's use case and our use of the Reifier fuzzy matching engine, Spark, and machine learning. We will also cover Reifier's architecture and usage of Spark.
Marcel Kornacker (Cloudera), Skye Wanderman-Milne (Cloudera)
In this talk, we will explain how data scientists use nested data structures to increase analytic productivity. We will use two well-known relational schemas - TPC-H and Twitter - to demonstrate how to simplify data science workloads with nested schemas. Also, we will outline best practices for converting flat relational schemas into nested ones, and give examples of data science-style analysis.
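As a rough illustration of what converting a flat relational schema into a nested one can look like, here is a minimal Python sketch. It is not taken from the talk: the rows, field names, and the `nest_orders` helper are hypothetical, and the data only loosely resembles an orders/line-items layout such as the one in TPC-H.

```python
# Hypothetical flat rows, as a join of orders and line items might return them.
flat_rows = [
    {"order_id": 1, "customer": "acme", "item": "bolt", "qty": 10},
    {"order_id": 1, "customer": "acme", "item": "nut", "qty": 20},
    {"order_id": 2, "customer": "zenith", "item": "gear", "qty": 1},
]


def nest_orders(rows):
    """Group flat order/line-item rows into one nested record per order."""
    orders = {}
    for row in rows:
        # Create the parent record the first time an order_id is seen.
        order = orders.setdefault(
            row["order_id"],
            {"order_id": row["order_id"], "customer": row["customer"], "items": []},
        )
        # Attach the line item to its parent instead of repeating order fields.
        order["items"].append({"item": row["item"], "qty": row["qty"]})
    return list(orders.values())


nested = nest_orders(flat_rows)

# With the nested form, a per-order aggregate needs no join or GROUP BY.
total_qty = {o["order_id"]: sum(i["qty"] for i in o["items"]) for o in nested}
print(total_qty)  # {1: 30, 2: 1}
```

The design point is that each order carries its own line items, so per-order analysis reads a single record rather than joining two tables.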
Juliet Hougland (Cloudera), Sandy Ryza (Cloudera)
In this half-day tutorial, attendees will get a taste of how large-scale data science techniques and technologies developed for the consumer internet can be applied in the world of Telecom.
Julia Rodriguez (Eagle Investment Systems)
Designing data visualizations presents us with unique and interesting challenges: how to tell a compelling story; how to deliver important information in a forthright, clear format; and how to make visualizations beautiful and engaging. In this talk, Julie will share a few disruptive designs and connect those back to vizipedia, her compiled data visualization library.
Tara Hirebet (R/GA)
When data is hidden and crunched, and used purely for organization and optimization, we may be losing out on a crucial value it can offer – that of empowerment, engagement and impactful behavioral change.
Amit Bansal (Accenture Digital)
Learn how the intersection of people, data and intelligent machines will have far-reaching impact on the productivity, efficiency and operations of industries around the world as organizations transform to become data-driven, insight-powered enterprises.
Hallie Benjamin (Accenture)
As the lead for the Accenture-UC Berkeley Data & Analytics Partnership, Hallie has worked with a multidisciplinary team of professors, data scientists, and students to design the Applied Learning Course in Data Science. She will talk about her experience and lessons learned for bringing together technical and business minds to help data science feature more prominently in business strategy.
Melanie Warrick (Google)
Deep learning is taking hold as a popular machine learning modeling technique because of its real-world applications, especially with regard to image, signal, and language datasets (e.g., medical diagnosis, self-driving cars, real-time language translation). This talk provides an overview of what deep learning is, with a focus on recent applications.
Tushar Shanbhag (Adatao, Inc)
All analytics is prescriptive analytics; it just depends on who's writing the prescription, human or machine. In this talk, I will present how to humanize the big data experience, promote collaboration between business users and data scientists, and bridge the gap between human and machine.
In this practical demonstration, participants will see how they can perform a simple but meaningful analysis of social sentiment data using freely available and easy-to-deploy tools. Participants will be equipped with the download links, scripts, and a complete step-by-step walkthrough of the analysis from start to finish.
Danielle Dean (iRobot), Wee Hyong Tok (Microsoft)
In this tutorial, you will create end-to-end predictive models based on an extensive library of machine learning algorithms included in Microsoft Azure Machine Learning studio with its R and Python language extensibility. You will then deploy and consume the model and use it for making predictions over business data.
Jun Liu (Intel), Zhaojuan Bian (Intel)
Slides:   1-PDF 
Based on previous experience, there are many challenges in designing an Impala cluster for production, such as table schema, data placement, file format selection, hardware selection, and software stack parameters tuning. We will walk through a real-world case study in the banking and financial services sector to illustrate how we use our simulator-based approach to design an Impala cluster.
Edd Wilder-James (Google), John Akred (Silicon Valley Data Science)
Slides:   1-PDF 
Big data and data science have great potential for accelerating business, but how do you reconcile the opportunity with the sea of possible technologies? Conventional data strategy has little to guide us, focusing more on governance than on creating new value. In this tutorial, we explain how to create a modern data strategy that powers data-driven business.
Hong Eng Koh (Oracle), Vladimir Videnovic (Oracle)
Slides:   1-PDF 
Public safety and national security are increasingly being challenged by technology; the need to use data to detect and investigate criminal activities has increased dramatically. But with the sheer volume of data and noise, law enforcement organisations are struggling to keep up. This session will examine trends and use cases on how big data can be utilised to make the world a safer place.
The high volume, velocity, variety and veracity of big data have been pushing for more comprehensive solutions and services to enable decision-making and insight discovery across business market segments. Ziya Ma, General Manager of Big Data Software Technologies in Intel's Software and Services Group, will discuss Intel’s software enabling role for making this possible and easier.
Fangjin Yang (Imply)
Organizations frequently rely on dedicated query layers, such as relational databases and key/value stores, for faster query latencies, but these technologies suffer many drawbacks for analytic use cases. In this session, we examine using Druid to power applications designed to analyze sensor data, and why the architecture is well suited for different use cases in “smart cities”.
Sandy Ryza (Cloudera)
This talk will cover Spark design patterns in time series analysis, visualizing data, and Monte Carlo simulation; and will show you what it is like to approach financial modeling with Spark.
Jim Scott (NVIDIA)
Slides:   1-PPTX 
Application developers have long created complex schemas to handle storing with minor relationships in an RDBMS. This talk will show how to convert an existing (complicated schema) music database to HBase for transactional workloads, plus how to use Drill against HBase for real-time queries. HBase column families will also be discussed.
Bin Fan (Alluxio), Xiang Wen (Baidu)
Slides:   1-PPTX 
Baidu runs Tachyon in production with more than 100 nodes managing 2PB space! In this talk we will focus on how Tachyon can help improve big data analytics (ad-hoc query) with 30X performance improvement within Baidu.
Nirmal Ranganathan (Rackspace)
In this talk, we will discuss how a streamlined Spark stack including Tachyon and Zeppelin can address both the need for speed and the need for reduced development time. We will walk through a sample use case that utilizes 20 years of data to look for insights and create a predictive model from the dataset.
Guy Harrison (Dell Software)
Slides:   1-ZIP 
When people think of big data processing, they think of Apache Hadoop, but that doesn't mean traditional databases don't play a role. In most cases users will still draw from data stored in RDBMS systems. Apache Sqoop can be used to unlock that data and transfer it to Hadoop, enabling users with information stored in existing SQL tables to use new analytic tools.
Felipe Hoffa (Google), Kalev Leetaru (GDELT Project)
The GDELT Project is a real-time open data global graph over human society, inventorying the world’s events, emotions, and narratives in 65 languages, used by organizations from the UN to Wall Street. Google BigQuery enables real-time querying and whole-of-data analysis of GDELT, such as exploring the cycles of world history through mass cross-correlation.
Sean Zhong (Previously Intel)
Slides:   1-PPTX 
GearPump is an Akka-based framework that processes real-time data across a DAG of actors. Its data delivery is highly scalable, with at-least-once delivery guarantees.
Shift how you approach data-driven application development by extending the DevOps model to Big Data.
Yuichi Kuroda (Mitsubishi UFJ Information Technology (MUIT))
In this session, attendees will learn the concepts underlying graph data analytics based on MUFG's experiences. Moreover, it will cover how to analyze huge graph data with Apache Spark GraphX. Finally, it will explore what type of data tends to cause problems and how to solve them.
Everybody talks about how wonderful "Big Data" and "Hadoop" technologies are, but no one shares the mistakes and the lessons learned from the sleepless nights. I will be sharing my experience of building better data infrastructure.
Mark Grover (Lyft), Ted Malaska (Capital One), Gwen Shapira (Confluent), Jonathan Seidman (Cloudera)
Join the authors of Hadoop Application Architectures for an open Q/A session on considerations and recommendations for architecture and design of applications using Hadoop. Talk to us about your use-case and its big data architecture, or just come to listen in.
Gwen Shapira (Confluent), Ted Malaska (Capital One), Mark Grover (Lyft), Jonathan Seidman (Cloudera)
Looking for a deeper understanding of how to architect real-time data processing solutions? This tutorial will provide this understanding using a real-world example of a fraud detection system. We’ll use this example to discuss considerations for building such a system, how you’d integrate various technologies, and why those choices make sense for the use case in question.
Paul Scott-Murphy (WANdisco)
Slides:   1-ZIP 
Hadoop lacks a mechanism to extend the distributed file system beyond the confines of a single cluster. Done right, active-active consensus can guarantee consistency of replicated file system changes regardless of Hadoop versions, distributions and communication latency. Find out how to perform selective data replication for cluster migration, disaster recovery, multi-site ingest, backup and more.
Nikhil Joshi (EMC, Advanced Software Division), Priya Lakshminarayanan (EMC Corporation)
Slides:   1-PDF 
This session focuses on strategies and technologies you can use to build a global Hadoop cloud with geo-distributed access and protection for analytics in various use-cases like IoT - handling billions of small files or multi-terabyte files in the same system.
Jairam Ranganathan (Cloudera)
Apache Hadoop was designed when cloud models were in their infancy. Despite this fact, Hadoop has proven remarkably adept at migrating its architecture to work well in the context of the cloud, as production workloads migrate to a cloud environment. This talk will cover several topics on adapting Hadoop to the cloud.
Todd Lipcon (Cloudera)
This session will investigate the trade-offs between real-time transactional access and fast analytic performance in Hadoop from the perspective of storage engine internals. We will discuss recent advances, evaluate benchmark results from current generation Hadoop technologies, and propose potential ways ahead for the Hadoop ecosystem to conquer its newest set of challenges.
Majken Sander (Majken Sander), Joerg Blumtritt (Datarella)
Algorithms are what make things "smart." More or less arbitrary, subjective decisions are regularly built into our connected things, when we choose a certain method or set parameters. These underlying value judgments imposed on users are hardly present in the privacy discussion or business point of view. However, they may be more important than the more obvious data collection and security.
Mark Donsky (Okera), Naren Koneru (Cloudera)
Slides:   1-PDF 
Find out how the world's most sophisticated Hadoop deployments are addressing data governance challenges head-on, while preserving Hadoop's flexibility, through an integrated data management and governance approach.
Slides:   1-PPT 
The fast evolution of services and mobile terminals combined with the aggressive competition between mobile operators is driving a continuous upgrade of the radio access network (RAN). This upgrade process is expensive and time consuming, and it scales with the number of base stations. This talk stresses the importance of the customer and proposes a new methodology for an efficient RAN upgrade.
Melanie Warrick (Google)
This talk will briefly explain what neural nets are and why they’re important, as well as give context about GPUs. Then we will walk through code and launch a neural net on a GPU. I will cover key pitfalls you may hit and techniques to diagnose and troubleshoot. You will walk away understanding how to start using GPUs and where to go for additional help.
Farrah Bostic (The Difference Engine)
This talk will highlight the top 5 mistakes we make in collecting and analyzing qualitative data, how to do it better, and how it can inspire your next big thing.
Selene Chew (Adatao)
In "The Power of Myth," Joseph Campbell distilled the modern story form as "A Hero's Journey." This talk presents the relevance of the story form in data analysis, and shows examples of how to tell insightful data stories.
Using data science to make better corporate, financial, and strategic decisions.
Avind Shrivastava (Hewlett Packard Enterprise)
Learn how different industry verticals are using the HPE platform for Hadoop and big data analytics. The session covers various use cases across industry verticals that have been implemented by customers across APJ, and explains why HPE has a unique value proposition for Hadoop solutions.
Albert Bifet (Télécom ParisTech), Silviu Maniu (Huawei)
Slides:   1-PDF 
Real-time analytics are becoming increasingly important to telecommunication operators due to the large amount of data that flows through their networks. Drawing from our experience at Huawei, we present StreamDM, a new open source data mining and machine learning library on top of Spark Streaming. We will present its implemented advanced methods, and demonstrate its ease of use and extensibility.
Paul Marriott (SAP Asia Pacific Japan)
SAP HANA Vora is a new in-memory query engine that leverages and extends the Apache Spark execution framework to provide enriched interactive analytics on Hadoop.
Ken Medlock (ANZ Banking Group)
Slides:   1-PDF 
ANZ has adopted an innovative approach to drive continuous business and identification of business value opportunities using disruptive and new big data technologies.
Matthew Conlen (FiveThirtyEight)
This session teaches use of modern data analysis and visualization tools for effective interactive data science. Attendees will learn how to use notebook environments to set up sharable and reproducible analysis pipelines, and will leverage tools for large scale analysis and web-based data visualization to drive further analysis and decision making.
Amy Shi-Nash (Singtel)
This talk will broach the topic of how DataSpark has created an innovative way of understanding people and what is important to them, by leveraging advanced data science and the wealth of data in an aggregated manner, while adhering to high standards of data privacy.
Vivian Balakrishnan (Government of Singapore)
Dr. Vivian Balakrishnan is the Singapore Minister for Foreign Affairs and the Minister-in-charge of the Smart Nation Programme Office.
We all think we can do it naturally if we just focus or care enough. The reality is that like everything else, we need to build a practice around listening to innovate, change, grow, and learn. I'll share my tips to help you nurture your nature in listening.
Rakesh Menon (McLaren Applied Technology)
High performance design requires the ability to optimally measure a system, understand its working, predict its performance, and continuously refine it. Central to this process is leveraging the data in an intelligent way. This talk explains how this can be achieved.
Industry Table discussions are a great way to informally network with people in similar industries or interested in the same topics.
Andreas Mueller (NYU, scikit-learn)
This talk is a tutorial for the machine learning library scikit-learn in Python. It starts with a short introduction into what machine learning is, and then dives in-depth into how to use scikit-learn in practice. The tutorial will be in the format of an IPython notebook and includes exercises.
Mingfei Shi (Intel), Bin Fan (Alluxio)
Current memory sizes are far from enough to host data sets. NVM has emerged to respond to this need. However, integrating NVM to support a modernized big data system is a challenge. In this talk, we present our efforts to build a tiered store in Tachyon, which provides a software solution for next-gen data center platforms with NVM.
Danielle Dean (iRobot)
Predictive maintenance is a technique to predict when an in-service machine will fail so that maintenance can be planned in advance. This talk introduces the landscape and challenges of predictive maintenance applications in the industry. Through a real-world example, the talk also illustrates how to formulate a predictive maintenance problem with three machine learning models.
Eric Frenkiel (MemSQL)
Slides:   external link
Eric Frenkiel, CEO/cofounder, MemSQL, will demonstrate a prototype of a futuristic smart city where all household energy devices are tracked in real-time. He will show the challenges, design choices & architecture required to enable urban planners/energy companies to see what is possible for efficient energy consumption through a real-time data pipeline combining Kafka+Spark+an in-memory database.
Thomas Holleczek (Singtel)
We present a traffic measurement system that monitors subway and expressway traffic from telco location data.
Heesun Won (ETRI), Minh Chau Nguyen (ETRI)
This session will address how one single Hadoop cluster can be built across many geographically distributed data centers to provide multitenant analytics services. We extend the overall architecture of Hadoop so that multiple tenants can securely access, share, and analyze data in their own isolated executing environments.
Rishi Malhotra (Saavn)
In this session, we’ll take a look at how music streaming delivers real time data that enables us to proxy a billion behaviors and apply the signals to other industries. Rishi was also a participant in the O’Reilly Study “Music Science”, published in 2015 by Alistair Croll.
Markus Kirchberg (Deep Labs Pte. Ltd.)
In this talk, we will first take a look at current IoT standards, solutions, and common challenges; change management; and near real-time decision-making capabilities that are yet to be adequately addressed.
Joanna Schloss (Dell Software)
Join us to hear how Big Data and Analytics Subject Matter Expert, Joanna Schloss, envisions emerging technologies shaping the future of mission critical initiatives such as security and analytics. How will in-memory, big data, and IoT shape and guide businesses with deployment and maintenance of key capabilities?
Utkarsh B (Flipkart), Vinod Venkatraman (Flipkart Internet Private Limited)
Slides:   1-FILE 
Have you faced the challenge of storing and optimally serving multibillion-row EAV modeled data out of a traditional data store? Monolithic data stores fall short, even with fast storage like SSDs for a large online marketplace, quantified here as 3 billion catalog entries and 100 million catalog updates in a day. This talk is about paradigms and patterns we adopted to address this problem.
Deepak Ramanathan (SAS Asia Pacific)
Join this keynote presentation to get tips from the future and hear about key patterns emerging from a wide cross section of corporate and institutional Hadoop journeys. Perhaps they’ll inspire yours.
Uri Laserson (Cloudera)
The advent of next-generation DNA sequencing technologies is revolutionizing life sciences research by routinely generating extremely large data sets. Big data tools developed to handle large-scale internet data (like Hadoop) will help scientists effectively manage this new scale of data, and also enable addressing a host of questions that were previously out of reach.
Oliver Chen (DataKind Singapore)
Private sector companies are becoming more data-driven, but what does it take to help the social sector become data-driven? DataKind is a global nonprofit that harnesses the power of data science in the service of humanity. Learn about two DataKind Singapore projects that brought together data science volunteers and nonprofit organizations to move the needle in the fight against climate change.
Neil Mendelson (Oracle)
Companies who are successful with big data need to be analytics-driven. During this session, Neil will look at new analytics capabilities that are essential for big data to deliver results, and discuss how to maximize the time you spend providing differentiation for your organization. This session will also cover some common big data use cases in both industry and government.
Isaac Jacob (Fusionex)
As organizations strive to survive and break into other markets, discover how big data analytics can be the key to unlocking new business opportunities or simply exploiting a company’s full potential in its own competitive space. Companies are increasingly swimming in more and more data, so developing the ability to breathe data will be crucial to staying ahead of the competition.
Regunath Balasubramanian (Flipkart Internet)
Slides:   1-PDF 
Aesop is an open source reliable change data propagation system. It has been used to build tiered data stores using best in class SQL and NoSQL databases. Aesop provides simple pubsub-like interfaces with implementations for popular technologies like MySQL, HBase, Redis, Elasticsearch, and Kafka. Aesop scales to multi-node clusters that process millions of data records.
Rod Smith (IBM Emerging Internet Technologies)
Big data and analytics continue to be a disruptive business force. Are we entering another phase – real-time digital business transformation, where businesses are realizing that the time to adjust to market and customer opportunities and threats is shrinking quickly?
Ted Dunning (MapR, now part of HPE)
Flexible data model. SQL compatibility. Unlimited scale. Nearly all data systems require that you pick at most one or two of three. You can now have them all. I will show how a real-world relational database can be massively simplified using document structure, how that database can be queried using SQL and how it can scale to the trillion-row, TB-scale required by modern applications.
Wes McKinney (Two Sigma Investments)
Many data applications are written in Python or R, but developing and deploying these applications at scale or in production is a pain point for many users. We will discuss our new efforts to bridge the gap between familiar in-memory data tools and distributed data systems. In particular, we are working to enable users to streamline interactions with Hadoop and scalable query engines like Impala.
Reynold Xin (Databricks)
In this talk, we introduce a recent effort in Spark to employ randomized algorithms for a number of common, expensive methods: membership testing, cardinality, stratified sampling, frequent items, quantile estimation.
Criminals and terrorists are leveraging the sharing economy and social networking to achieve their illicit ends too! Police and law enforcement have to revive the spirit of community policing introduced in the 19th century: "the Police are the Public and the Public are the Police". In this age of social networking, we call it "Social-Enabled Policing"; and Big Data has a big role in enabling it.
Sameer Farooqui (Databricks), Paco Nathan, Reynold Xin (Databricks)
The real power and value proposition of Apache Spark is in building a unified use case that combines ETL, batch analytics, real-time stream analysis, machine learning, graph processing, and visualizations. In class we will explore various Wikipedia datasets while applying the ideal programming paradigm for each analysis. The class will consist of about 50% lecture and 50% hands-on labs and demos.
Reynold Xin (Databricks)
In this talk, Reynold will look back and review Spark’s growth in adoption, use cases, and development. He will then look forward and discuss both technical initiatives and the evolution of the Spark community for 2016.
Kevin Lee (GrabTaxi)
Why do taxi drivers not want to pick me up when I most need a taxi? Join GrabTaxi's Kevin Lee to learn how GrabTaxi uses machine learning to answer this age-old question and build models for predicting taxi availability in order to improve matching on the platform.
Stephen Hardy (National ICT Australia)
Slides:   1-PDF 
Privacy in the world of big data is often considered as a legal or regulatory function. However, there are technology solutions for analytics that can be used today to protect users' privacy and to enable applications over data that is too sensitive to share. We will illustrate the state-of-the-art in privacy-preserving machine learning, including new techniques we have developed.
Pauline Brown (Dataiku)
Slides:   1-PDF 
Getting from raw data to deploying data-driven solutions requires technology, data, and people, all of which exist. So why aren’t we seeing more truly data-driven companies? What's missing, and why? Find out how a lack of collaboration is keeping companies from imagining and actually doing what is possible with big data.
Tyler Akidau (Google)
Join me for a whirlwind tour of the conceptual building blocks of massive-scale data processing systems over the last decade, comparing and contrasting systems at Google with popular open source systems in use today.
Thomas Beaujard (Accenture Digital), Tom Ridsdill-Smith (Woodside)
Slides:   1-PDF 
In 2015 Woodside is working with Accenture to deliver predictive analytics to Woodside’s LNG operations. By combining Accenture’s expertise in data analytics and Woodside’s leading operational experience in oil and gas, valuable, actionable insights have been discovered throughout 2015.
Mike Olson (Cloudera)
Hadoop has come a long way from monolithic storage and batch processing; today the ecosystem is diverse and flexible and is emerging as the foundation of next-generation analytic applications. Join Mike Olson, Cloudera's Chief Strategy Officer, as he discusses new innovations across the ecosystem and gives a vision for Hadoop as an architectural must-have for analytics transformation.
Whye Loon Tung (Nielsen)
Geospatial data is revolutionising the marketing research industry. In this talk, Nielsen researchers will describe how such information is being used by the company to improve internal processes and to give new insights into client behaviour. The goal is to give clients an analytic edge, as will be illustrated through key methodology and insights of recent projects.
Data professionals need a range of soft skills to make analytics successful in their organisations. We need storytelling skills, to be excellent at stakeholder engagement, and to be able to navigate company politics.
On every smartphone, more than 20 sensors continuously monitor our movements, actions, and our environment. These data draw a vivid picture, telling the story of our everyday lives. Come to see how creepy this can get with wearable technology like smart watches.
Roger Magoulas (O'Reilly Media), Doug Cutting (Cloudera), Alistair Croll (Solve For Interesting)
Strata + Hadoop World Program Chairs Roger Magoulas, Doug Cutting, and Alistair Croll welcome you to the second day of keynotes.
Office Hours are your chance to meet face-to-face with Strata + Hadoop World presenters in a small-group setting. Drop in to discuss their sessions, ask questions, or make suggestions.
Sanqi Li (Huawei)
With the recent advances in big data and machine learning technologies, there has never been a better time for developing telecom data products. However, there are various challenges associated with researching and developing telecom data products at scale.
Jennifer Marsman (Microsoft)
Slides:   1-ZIP 
Using the EPOC headset from Emotiv, I can capture the big data stream of EEG from our brains. I will share my results on a “lie detector” experiment comparing brain waves when telling the truth and lying. I have built classifiers based on the EEG data using Azure Machine Learning to predict whether a subject is telling the truth. The effectiveness of multiple classifiers can be easily compared.
Arshak Navruzyan
Like most large internet sites, telecom networks are constantly under attack by highly sophisticated fraudsters. Historically, carriers have tried to isolate fraudulent behavior through complex rules. Increasingly, however, there is a need for machine learning algorithms that can keep up with the changing face of telecom fraud.
Amit Kapoor (narrativeVIZ)
Slides:   1-PDF 
Understand techniques to effectively visualise multi-dimensional data to aid exploratory data analysis. We will look at standard 2D/3D, geometric transformations, glyph-based, pixel-based, and stacking-based approaches to visualise this data, and also explore the interactive approaches needed to make them work.
Roger Magoulas (O'Reilly Media), Doug Cutting (Cloudera), Alistair Croll (Solve For Interesting)
Strata + Hadoop World Program Chairs Roger Magoulas, Doug Cutting, and Alistair Croll welcome you to the first day of keynotes.
Author book signings will be held in the O’Reilly booth on Wednesday. This is a great opportunity for you to meet O’Reilly authors and to get a free copy of their book. Complimentary copies will be provided for the first 25 attendees. Limit one free book per attendee.
Office Hours are your chance to meet face-to-face with Strata + Hadoop World presenters in a small-group setting. Drop in to discuss their sessions, ask questions, or make suggestions.
Jana Eggers (Nara Logics)
Within the next decade, 16 percent of current US jobs will be done by artificial intelligences. It’s time to start thinking about how we onboard these employees. While we’ll look at what it takes to get started with machine learning projects, our focus will be on the top 5 things you need to consider when your next employee is an AI.
Gwen Shapira (Confluent)
Kafka provides the low latency, high throughput, high availability, and scale that financial services firms require. But can it also provide complete reliability? In this session, we will go over everything that happens to a message - from producer to consumer - and pinpoint all the places where data can be lost if you are not careful.
Edd Wilder-James (Google)
Slides:   1-PDF 
Big data and data science have great potential for accelerating business, but how do you reconcile the opportunity with the sea of possible technologies? Conventional data strategy has little to guide us, focusing more on governance than on creating new value. In this talk, we explain how to create a modern data strategy that powers data-driven business.