Presented By O'Reilly and Cloudera
Make Data Work
22–23 May 2017: Training
23–25 May 2017: Tutorials & Conference
London, UK

Sessions

Tuesday, 23 May

9:00–17:00 Tuesday, 23 May 2017
Location: London Suite 2/3
Angie Ma (ASI), Ben Lorica (O'Reilly Media), Ira Cohen (Anodot), Yingsong Zhang (ASI Data Science), Ali Hürriyetoglu (Statistics Netherlands), Nelleke Oostdijk (Radboud University), Robin Senge (inovex GmbH), Mathew Salvaris (Microsoft), Miguel Gonzalez-Fierro (Microsoft), Amitai Armon (Intel), Yahav Shadmi (Intel), Kay Brodersen (Google), Ding Ding (Intel), Alan Mosca (Sendence | Birkbeck, University of London), Eduard Vazquez (Cortexica Vision Systems), Aida Mehonic (ASI Data Science), David Barber (Department of Computer Science, UCL)
A full day of hardcore data science, exploring emerging topics and new areas of study made possible by vast troves of raw data and cutting-edge architectures for analyzing and exploring information. Along the way, leading data science practitioners teach new techniques and technologies to add to your data science toolbox. Read more.

Wednesday, 24 May

11:15–11:55 Wednesday, 24 May 2017
Location: Hall S21/23 (A)
Secondary topics:  Deep learning
Level: Intermediate
Mikio Braun (Zalando SE)
Average rating: ***..
(3.14, 7 ratings)
Deep learning has become the go-to solution for challenges such as image classification or speech processing, but does it work for all application areas? Mikio Braun offers background on deep learning and shares his practical experience working with these exciting technologies. Read more.

11:15–11:55 Wednesday, 24 May 2017
Location: Hall S21/23 (B)
Level: Intermediate
Matthew Rocklin (Anaconda)
Average rating: ****.
(4.33, 3 ratings)
Dask parallelizes Python libraries like NumPy, pandas, and scikit-learn, bringing a popular data science stack to the world of distributed computing. Matthew Rocklin discusses the architecture and current applications of Dask in the wild. Read more.

11:15–11:55 Wednesday, 24 May 2017
Location: Capital Suite 8/9
Secondary topics:  AI, Deep learning, Ecommerce
Level: Beginner
Yishay Carmiel (IntelligentWire)
Average rating: ***..
(3.00, 1 rating)
For years, people have been talking about the great promise of conversational AI. Recently, deep learning has taken us a few steps further toward achieving tangible goals, making a big impact on technologies like speech recognition and natural language processing. Yishay Carmiel offers an overview of the impact of deep learning, recent breakthroughs, and challenges for the future. Read more.

11:15–11:55 Wednesday, 24 May 2017
Location: Capital Suite 10/11
Level: Intermediate
Average rating: ***..
(3.20, 5 ratings)
If you have Hadoop clusters in research or an early-stage data lake and are considering strategic vision and goals, this session is for you. Phillip Radley explains how to run Hadoop as a shared service, providing an enterprise-wide data platform hosting hundreds of projects securely and predictably. Read more.

11:15–11:55 Wednesday, 24 May 2017
Location: Capital Suite 12
Average rating: **...
(2.00, 5 ratings)
Herman van Hövell tot Westerflier looks back at the history of data systems, from filesystems, databases, and big data systems (e.g., MapReduce) to "small data" systems (e.g., R and Python), covering the pros and cons of each, the abstractions they provide, and the engines underneath. Herman then shares lessons learned from this evolution, explains how Spark is developed, and offers a peek... Read more.

11:15–11:55 Wednesday, 24 May 2017
Location: Capital Suite 13
Level: Beginner
Marcel Kornacker (Cloudera)
Average rating: ****.
(4.12, 8 ratings)
Marcel Kornacker offers an introduction to using Impala and Kudu to power your real-time data-centric applications for use cases like time series analysis (fraud detection, stream market data), machine data analytics, and online reporting. Read more.

11:15–11:55 Wednesday, 24 May 2017
Location: Capital Suite 14
Level: Intermediate
Mark Donsky (Cloudera), Vikas Singh (Cloudera)
Average rating: ****.
(4.33, 9 ratings)
Big data needs governance—not just for compliance but also for data scientists. Governance empowers data scientists to find, trust, and use data on their own, yet it can be overwhelming to know where to start, especially if your big data environment spans beyond your enterprise to the cloud. Mark Donsky and Vikas Singh share a step-by-step approach to kickstart your big data governance. Read more.

11:15–11:55 Wednesday, 24 May 2017
Location: Capital Suite 15/16
Level: Intermediate
Jack Norris (MapR Technologies)
Average rating: ***..
(3.80, 5 ratings)
Leading companies are integrating operations and analytics to make real-time adjustments to improve revenues, reduce costs, and mitigate risks. There are many aspects to digital transformation, but the timely delivery of actionable data is both a key enabler and an obstacle. Jack Norris explores how companies from Altitude Digital to Uber are transforming their businesses. Read more.

11:15–11:55 Wednesday, 24 May 2017
Location: Capital Suite 17
Nicolaus Henke (McKinsey & Company)
Average rating: ****.
(4.33, 6 ratings)
Nicolaus Henke explores what CEOs currently think about AI and explains how to drive adoption successfully across the enterprise. Read more.

11:15–11:55 Wednesday, 24 May 2017
Location: Capital Suite 7
Level: Intermediate
Damien Lefortier (Facebook)
Average rating: **...
(2.71, 7 ratings)
There are use cases where the only accessible feedback for training machine-learning models is partial and biased (e.g., when feedback is obtained through surveys). Damien Lefortier shares methods to handle these cases and explains how to ensure that they are performing well. Read more.

11:15–11:55 Wednesday, 24 May 2017
Location: Capital Suite 2/3
Jan Willem Gehrels (IBM Corporation)
Big data is the new oil—an extremely valuable commodity—but how do you transform raw data into actionable insights, recommendations, and potential profits? Jan Willem Gehrels outlines the tangible value of applying advanced (predictive and prescriptive) analytics to business questions across several markets and industries. Read more.

11:15–11:55 Wednesday, 24 May 2017
Location: Capital Suite 4
Darren Strange (Google)
Average rating: ***..
(3.50, 10 ratings)
Darren Strange explores Google's lifelong mission to organize the world's information and make it universally accessible and useful and shares lessons learned along the way. Darren explains how Google grew from thinking of itself as a data company to being a machine-learning company and offers a glimpse of the company's future. Read more.

12:05–12:45 Wednesday, 24 May 2017
Location: Hall S21/23 (A)
Secondary topics:  Cloud
Anima Anandkumar (UC Irvine)
Average rating: ****.
(4.00, 2 ratings)
Anima Anandkumar demonstrates how to use preconfigured Deep Learning AMIs and CloudFormation templates on AWS to help speed up deep learning development and shares use cases in computer vision and natural language processing. Read more.

12:05–12:45 Wednesday, 24 May 2017
Location: Hall S21/23 (B)
Secondary topics:  R-lang
Level: Intermediate
Colin Gillespie (Jumping Rivers | Newcastle University)
Average rating: ****.
(4.33, 6 ratings)
R has the reputation for being slow. Colin Gillespie covers key ideas and techniques for making your R code as efficient as possible, from R setup to common R coding problems to linking R with C++ for an extra speed boost. Read more.

12:05–12:45 Wednesday, 24 May 2017
Location: Capital Suite 8/9
Level: Beginner
Michael Noll (Confluent)
Average rating: ****.
(4.00, 11 ratings)
Michael Noll explains how Apache Kafka helps you radically simplify your data processing architectures by building normal applications to serve your real-time processing needs rather than building clusters or similar special-purpose infrastructure—while still benefiting from properties typically associated exclusively with cluster technologies. Read more.

12:05–12:45 Wednesday, 24 May 2017
Location: Capital Suite 10/11
Level: Intermediate
Ben Sharma (Zaloni)
Average rating: ***..
(3.83, 6 ratings)
When building your data stack, the architecture could be your biggest challenge. Yet it could also be the best predictor for success. With so many elements to consider and no proven playbook, where do you begin to assemble best practices for a scalable data architecture? Ben Sharma offers lessons learned from the field to get you started. Read more.

12:05–12:45 Wednesday, 24 May 2017
Location: Capital Suite 12
Level: Intermediate
Average rating: **...
(2.00, 3 ratings)
Security has been a large and growing aspect of distributed systems, specifically in the big data ecosystem, but it's an underappreciated topic within the Spark framework itself. Neelesh Srinivas Salian explains how detailed knowledge of setting up security and an awareness of what to be looking out for in terms of problems and issues can help an organization move forward in the right way. Read more.

12:05–12:45 Wednesday, 24 May 2017
Location: Capital Suite 13
Level: Intermediate
Luke Han (Kyligence)
Average rating: *****
(5.00, 2 ratings)
Apache Kylin is rapidly being adopted around the world, especially in China. Luke Han explores how various industries use Apache Kylin, sharing why these companies choose Apache Kylin (a technology comparison), how they use Apache Kylin (their production deployment pattern), and most importantly, the resulting business impact. Read more.

12:05–12:45 Wednesday, 24 May 2017
Location: Capital Suite 14
Level: Intermediate
Steve Touw (Immuta)
Average rating: ****.
(4.17, 12 ratings)
The global populace is asking for the IT industry to be held responsible for the safeguarding of individual data. Steve Touw examines some of the data privacy regulations that have arisen and covers design strategies to protect personally identifiable data while still enabling analytics. Read more.

12:05–12:45 Wednesday, 24 May 2017
Location: Capital Suite 15/16
Level: Non-technical
Kim Nilsson (Pivigo)
Average rating: ****.
(4.50, 8 ratings)
More organizations are becoming aware of the value of data and want to get started and scaled up as quickly as possible. But how? Is it possible to get something useful done in five weeks? Kim Nilsson shares her experiences, both good and bad, delivering over 80 five-week data science projects to over 50 organizations, as well as some concrete tips on how to become a data star organization. Read more.

12:05–12:45 Wednesday, 24 May 2017
Location: Capital Suite 17
Manuel Sevilla (Capgemini)
Average rating: ****.
(4.00, 3 ratings)
Manuel Sevilla shares real-world examples to illustrate the rules you need to keep in mind when designing your own cloud strategy. Read more.

12:05–12:45 Wednesday, 24 May 2017
Location: Capital Suite 7
Secondary topics:  R-lang
Level: Intermediate
Jeroen Janssens (Data Science Workshops)
Average rating: ***..
(3.00, 2 ratings)
Leaflet, one of the most popular open source JavaScript libraries for interactive maps, is used by websites ranging from the New York Times and the Washington Post to GitHub and Flickr, as well as GIS specialists like OpenStreetMap, Mapbox, and CartoDB. Jeroen Janssens explains how the Leaflet R package makes it easy to integrate and control Leaflet maps in R. Read more.

12:05–12:45 Wednesday, 24 May 2017
Location: Capital Suite 2/3
Martin Oberhuber (Think Big, a Teradata company)
Average rating: ****.
(4.00, 1 rating)
In today’s big data world, organizations are struggling to establish new capabilities, processes, and organization models to deliver advanced analytics solutions. Martin Oberhuber explores real-world use cases that illustrate the capabilities needed to develop, deploy, monitor, and maintain analytical processes to seamlessly go from insight to production. Read more.

12:05–12:45 Wednesday, 24 May 2017
Location: Capital Suite 4
Eric Lotter (WANdisco)
Eric Lotter offers an overview of WANdisco's strongly consistent replication service for replicating between cloud object stores, HDFS, NFS, and other S3- and Hadoop-compatible filesystems. Read more.

14:05–14:45 Wednesday, 24 May 2017
Location: Hall S21/23 (A)
Secondary topics:  AI, Cloud, Deep learning
Level: Beginner
Kazunori Sato (Google)
Average rating: ****.
(4.20, 5 ratings)
TensorFlow is democratizing the world of machine intelligence. With TensorFlow (and Google's Cloud Machine Learning platform), anyone can leverage deep learning technology cheaply and without much expertise. Kazunori Sato explores how a cucumber farmer, a car auction service, and a global insurance company adopted TensorFlow and Cloud ML to solve their real-world problems. Read more.

14:05–14:45 Wednesday, 24 May 2017
Location: Hall S21/23 (B)
Level: Intermediate
Seth Hendrickson (Cloudera)
Average rating: ***..
(3.57, 7 ratings)
There are many resources available for learning how to use Spark to build collaborative filtering models. However, there are relatively few that explain how to build a large-scale, end-to-end recommender system. Seth Hendrickson demonstrates how to create such a system using Spark Streaming, Spark ML, and Elasticsearch. Read more.

14:05–14:45 Wednesday, 24 May 2017
Location: Capital Suite 8/9
Level: Intermediate
Fabian Hueske (data Artisans)
Average rating: ***..
(3.50, 2 ratings)
Although the most widely used language for data analysis, SQL is only slowly being adopted by open source stream processors. One reason is that SQL's semantics and syntax were not designed with streaming data in mind. Fabian Hueske explores Apache Flink's two relational APIs for streaming analytics—standard SQL and the LINQ-style Table API—discussing their semantics and showcasing their usage. Read more.

14:05–14:45 Wednesday, 24 May 2017
Location: Capital Suite 10/11
Secondary topics:  Logistics
Level: Intermediate
Hellmar Becker (Hortonworks), Jorn Eilander (ING)
Average rating: ***..
(3.00, 1 rating)
Hellmar Becker and Jorn Eilander explore the real-time collection and predictive analytics of flight radar data with IoT devices, NiFi, HBase, Spark, and Zeppelin. Read more.

14:05–14:45 Wednesday, 24 May 2017
Location: Capital Suite 12
Level: Intermediate
Holden Karau (IBM), Seth Hendrickson (Cloudera)
Average rating: ***..
(3.25, 8 ratings)
Structured Streaming is new in Apache Spark 2.0, and work is being done to integrate the machine-learning interfaces with this new streaming system. Holden Karau and Seth Hendrickson demonstrate how to do streaming machine learning using Structured Streaming and walk you through creating your own streaming model. Read more.

14:05–14:45 Wednesday, 24 May 2017
Location: Capital Suite 13
Level: Beginner
Trent Gray-Donald and Gil Vernik explain the challenges of current Hadoop and Apache Spark integration with object stores and discuss Stocator, an open source object store connector that overcomes these shortcomings by leveraging object store semantics. Compared to native Hadoop connectors, Stocator provides close to a 100% speedup for DFSIO on Hadoop and a 500% speedup for Terasort on Spark. Read more.

14:05–14:45 Wednesday, 24 May 2017
Location: Capital Suite 14
Level: Non-technical
Antonio Alvarez (Santander Group), Lidia Crespo (Santander UK)
Average rating: ****.
(4.00, 5 ratings)
Successful organizations are becoming increasingly Agile, and the autonomy and empowerment that Agile brings create new, active modes of engagement. Data governance, however, is still very much a centralized task that only CDOs and data owners actively care about. Antonio Alvarez and Lidia Crespo outline a more engaging and active method of data governance: data citizenship. Read more.

14:05–14:45 Wednesday, 24 May 2017
Location: Capital Suite 15/16
Level: Intermediate
Mark Madsen (Third Nature)
Average rating: ***..
(3.33, 12 ratings)
Building a data lake involves more than installing and using Hadoop. The goal in most organizations is to build multiuse data infrastructure that is not subject to past constraints. Mark Madsen discusses hidden design assumptions, reviews design principles to apply when building multiuse data infrastructure, and provides a reference architecture. Read more.

14:05–14:45 Wednesday, 24 May 2017
Location: Capital Suite 17
Average rating: ***..
(3.00, 2 ratings)
This Executive Briefing is a part of the Strata Business Summit. Details to come. Read more.

14:05–14:45 Wednesday, 24 May 2017
Location: Capital Suite 7
Secondary topics:  Deep learning
Level: Intermediate
Average rating: ***..
(3.33, 3 ratings)
Deep learning is one of the most exciting techniques in machine learning. Miguel González-Fierro explores the problem of image classification using ResNet, the deep neural network that surpassed human-level accuracy for the first time, and demonstrates how to create an end-to-end process to operationalize deep learning in computer vision for business problems using Microsoft R Server and GPU VMs. Read more.

14:05–14:45 Wednesday, 24 May 2017
Location: Capital Suite 4
Shree Dandekar (Honeywell)
A digital twin is a virtual model of a product or service that allows analysis of data and monitoring of systems to avert problems before they even occur—and even plan for the future by using simulations. Shree Dandekar explores a new cloud-based service from the Honeywell Connected Plant that provides industrial users with around-the-clock monitoring of plant data and rigorous simulations. Read more.

14:55–15:35 Wednesday, 24 May 2017
Location: Hall S21/23 (A)
Level: Advanced
Ted Dunning (MapR Technologies)
Average rating: ***..
(3.00, 2 ratings)
Ted Dunning offers an overview of tensor computing—covering, in practical terms, the high-level principles behind tensor computing systems—and explains how it can be put to good use in a variety of settings beyond training deep neural networks (the most common use case). Read more.

14:55–15:35 Wednesday, 24 May 2017
Location: Hall S21/23 (B)
Level: Intermediate
Michelle Casbon (Qordoba)
Average rating: *****
(5.00, 1 rating)
Supporting multiple locales involves the maintenance and generation of localized strings. Michelle Casbon explains how machine learning and natural language processing are applied to the underserved domain of localization using primarily open source tools, including Scala, Apache Spark, Apache Cassandra, and Apache Kafka. Read more.

14:55–15:35 Wednesday, 24 May 2017
Location: Capital Suite 8/9
Level: Intermediate
Tristan Stevens (Cloudera)
Average rating: ***..
(3.67, 3 ratings)
Vodafone UK’s new SIEM system relies on Apache Flume and Apache Kafka to ingest over 1 million events per second. Tristan Stevens discusses the architecture, deployment, and performance-tuning techniques that enable the system to perform at IoT-scale on modest hardware and at a very low cost. Read more.

14:55–15:35 Wednesday, 24 May 2017
Location: Capital Suite 10/11
Level: Intermediate
Victor Zabalza (ASI Data Science)
Average rating: ***..
(3.75, 4 ratings)
Data exploration usually entails making endless one-use exploratory plots. Victor Zabalza shares a Python package based on dask execution graphs and interactive visualization in Jupyter widgets built to overcome this drudge work. Victor offers an overview of the tool and explains how it was built and why it will become essential in the first steps of every data science project. Read more.

14:55–15:35 Wednesday, 24 May 2017
Location: Capital Suite 12
Level: Intermediate
Average rating: ***..
(3.31, 13 ratings)
Spark is now the de facto engine for big data processing. Vincent Van Steenbergen walks you through two real-world applications that use Spark to build functional machine-learning pipelines (wine price prediction and malware analysis), discussing the architecture and implementation and sharing the good, the bad, and the ugly experiences he had along the way. Read more.

14:55–15:35 Wednesday, 24 May 2017
Location: Capital Suite 13
Level: Intermediate
Marcel Kornacker (Cloudera), Mostafa Mokhtar (Cloudera)
Average rating: *****
(5.00, 2 ratings)
Marcel Kornacker and Mostafa Mokhtar help simplify the process of making good SQL-on-Hadoop decisions and cover top performance optimizations for Apache Impala (incubating), from schema design and memory optimization to query tuning. Read more.

14:55–15:35 Wednesday, 24 May 2017
Location: Capital Suite 14
Level: Non-technical
Yves-Alexandre de Montjoye (Imperial College London | MIT Media Lab)
Average rating: ****.
(4.00, 1 rating)
Yves-Alexandre de Montjoye shows how metadata can work as a fingerprint to identify people in a large-scale metadata database even though no “private” information was ever collected, shares a formula that can be used to estimate the privacy of a dataset if you know its spatial and temporal resolution, and offers an overview of OPAL, a project that enables safe big data use using modern CS tools. Read more.

14:55–15:35 Wednesday, 24 May 2017
Location: Capital Suite 15/16
Level: Non-technical
Duncan Ross (TES Global), Emma Prest (DataKind)
Average rating: ****.
(4.50, 2 ratings)
Since its creation, DataKind has helped charities do some fantastic things with data science through volunteers from the data science community (that's you!). But charities often don't know what to do next. Duncan Ross and Emma Prest share lessons learned from DataKind's projects and outline a data maturity model for doing good with data. Read more.

14:55–15:35 Wednesday, 24 May 2017
Location: Capital Suite 17
Aurélie Pols (Mind Your Privacy)
Average rating: ****.
(4.75, 4 ratings)
The EU's General Data Protection Regulation is an ambitious legal project to reinstate the rights of "data subjects" within an increasingly lucrative data ecosystem. Aurélie Pols explores the legal obligations on companies and their respective interpretations and looks at how scale and integrity will be safeguarded in the data we increasingly base decisions upon in the long term. Read more.

14:55–15:35 Wednesday, 24 May 2017
Location: Capital Suite 7
Secondary topics:  AI, Deep learning, Logistics
Level: Intermediate
Josef Viehhauser (BMW Group), Dominik Schniertshauer (BMW Group)
Average rating: ****.
(4.60, 5 ratings)
Data-driven solutions based on machine and deep learning are gaining momentum in the automotive industry beyond autonomous driving. Josef Viehhauser and Dominik Schniertshauer explore use cases from the BMW Group where novel machine-learning pipelines (such as those based on XGBoost and convolutional neural nets) support a broad variety of business stakeholders. Read more.

14:55–15:35 Wednesday, 24 May 2017
Location: Capital Suite 4
Cory Minton (EMC)
Average rating: **...
(2.50, 4 ratings)
Cory Minton pulls back the covers on how big data applications impact underlying hardware based on real-world deployments and shares Dell EMC’s internal testing and benchmarking used to develop its architecture best practices. Along the way, Cory shows you how to get your architecture right the first time for optimal performance and scaling. Read more.

14:55–15:35 Wednesday, 24 May 2017
Location: Capital Suite 2/3
Average rating: **...
(2.67, 3 ratings)
Adam Grzywaczewski offers an overview of the types of analytical problems that can be solved using AI and shares a set of heuristics that can be used to evaluate the feasibility of analytical AI projects. Adam then covers the computational requirements for the deep learning training process, leaving you with the key tools you need to initiate an analytical AI project. Read more.

16:35–17:15 Wednesday, 24 May 2017
Location: Hall S21/23 (A)
Level: Intermediate
Dean Wampler (Lightbend)
Average rating: ****.
(4.57, 7 ratings)
"Stream" is a buzzword for several things that share the idea of timely handling of never-ending data. Big data architectures are evolving to be stream oriented. Microservice architectures are inherently message driven. Dean Wampler defines "stream" based on characteristics for such systems, using specific tools as examples, and argues that big data and microservices architectures are converging. Read more.

16:35–17:15 Wednesday, 24 May 2017
Location: Hall S21/23 (B)
Level: Beginner
Sean Owen (Cloudera)
Average rating: ***..
(3.80, 5 ratings)
Nobody seems to agree just what data science is. Is it engineering, statistics, or both? David Donoho's "50 Years of Data Science" offers a criticism of the hype around data science from a statistics perspective, arguing that it's not a new field. Sean Owen responds, offering counterpoints from an engineer, in search of a better understanding of how to teach and practice data science in 2017. Read more.

16:35–17:15 Wednesday, 24 May 2017
Location: Capital Suite 8/9
Level: Intermediate
Ben Stopford (Confluent), Ismael Juma (Confluent)
Dynamic data rebalancing is a complex process. Ben Stopford and Ismael Juma explain how to do data rebalancing and use replication quotas in the latest version of Apache Kafka. Read more.

16:35–17:15 Wednesday, 24 May 2017
Location: Capital Suite 10/11
Secondary topics:  Deep learning, Text Analysis and Mining
Level: Beginner
Jonathon Morgan (New Knowledge)
Average rating: *****
(5.00, 12 ratings)
Jonathon Morgan explores computer vision, deep learning, and natural language processing techniques for uncovering communities of white nationalists and neo-Nazis on social media and identifying which ones are on the path to radicalization. Read more.

16:35–17:15 Wednesday, 24 May 2017
Location: Capital Suite 12
Secondary topics:  Ecommerce, Financial services
Level: Intermediate
Harry Powell (Barclays), Raffael Strassnig (Barclays)
Average rating: ****.
(4.00, 6 ratings)
Harry Powell and Raffael Strassnig demonstrate how to model unobserved customer preferences over businesses by thinking about transactional data as a bipartite graph and then computing a new similarity metric—the expected degrees of separation—to characterize the full graph. Read more.

16:35–17:15 Wednesday, 24 May 2017
Location: Capital Suite 13
Level: Intermediate
Daniel Bäurer (inovex GmbH), Sascha Askani (inovex GmbH)
Average rating: *****
(5.00, 1 rating)
Multiple challenges arise if distributed applications are provisioned in a containerized environment. Daniel Bäurer and Sascha Askani share a solution for distributed storage in cloud-native environments using Spark on Kubernetes. Read more.

16:35–17:15 Wednesday, 24 May 2017
Location: Capital Suite 14
Level: Non-technical
Emma Deraze (DataKind UK)
Emma Deraze explores a collaborative project between DataKind, Global Witness, and Open Corporates to analyze open UK corporate ownership data and presents findings and insights into the challenges facing open official data, specifically in the context of an international setting, such as complex corporate networks. Read more.

16:35–17:15 Wednesday, 24 May 2017
Location: Capital Suite 15/16
Level: Beginner
Yuval Dvir (Google)
In an era when we are bombarded with data and tasks to finish, our ability to focus our attention becomes critical. When 70% of our code is for DevOps purposes and 90% of our data is dark, the cloud is a welcome, secure, and efficient relief. Yuval Dvir refutes common misconceptions about the cloud and explains why it's not a matter of "if" but "when" you'll move to the cloud. Read more.

16:35–17:15 Wednesday, 24 May 2017
Location: Capital Suite 17
Level: Beginner
Mark Madsen (Third Nature)
Average rating: ****.
(4.75, 4 ratings)
In 2007, a computer game company decided to jump ahead of competitors by capturing and using data created during online gaming, but it wasn't prepared to deal with the data management and process challenges stemming from distributed devices creating data. Mark Madsen shares a case study that explores the oversights, failures, and lessons the company learned along its journey. Read more.

16:35–17:15 Wednesday, 24 May 2017
Location: Capital Suite 7
Secondary topics:  AI, Deep learning
Level: Beginner
Laura Froelich (Think Big Analytics, a Teradata Company)
Average rating: ***..
(3.00, 3 ratings)
Laura Froelich explores applications of deep learning in companies (looking at practical examples of assessing the opportunity for AI, phased adoption, and lessons going from research to prototype to scaled production deployment) and discusses the future of enterprise AI. Read more.

16:35–17:15 Wednesday, 24 May 2017
Location: Capital Suite 2/3
Level: Non-technical
Jesse Anderson (Big Data Institute)
Average rating: *****
(5.00, 6 ratings)
Early project success is predicated on management making sure a data engineering team is ready and has all of the skills needed. Jesse Anderson outlines five of the most common non-technology reasons why data engineering teams fail. Read more.

17:25–18:05 Wednesday, 24 May 2017
Location: Hall S21/23 (A)
Secondary topics:  Deep learning
Sherry Moore (Google)
Average rating: ***..
(3.60, 5 ratings)
Sherry Moore discusses TensorFlow progress and adoption over 2016 and looks ahead to TensorFlow efforts in future areas of importance, such as performance, usability, and ubiquity. Read more.

17:25–18:05 Wednesday, 24 May 2017
Location: Hall S21/23 (B)
Secondary topics:  AI, IoT, Logistics, Streaming
Level: Beginner
Dr.-Ing. Michael Nolting (Volkswagen Commercial Vehicles)
Average rating: *....
(1.67, 6 ratings)
It is nearly impossible to sample enough training data initially to prevent autonomous driving accidents on the road, as has been sadly proven by Tesla’s autopilot. Michael Nolting explains that to overcome this problem, a system has to be created to detect dangerous runtime situations in real time, a process much like website monitoring. Read more.

17:25–18:05 Wednesday, 24 May 2017
Location: Capital Suite 8/9
Level: Beginner
Sanjeev Kulkarni (Streamlio), Maosong Fu (Twitter)
Twitter processes billions of events per day at the instant the data is generated. To achieve real-time performance, Twitter employs Heron, an open source streaming engine tailored for large-scale environments. Sanjeev Kulkarni and Maosong Fu share several optimizations implemented in Heron to improve throughput by 5x and reduce latency by 50–60%. Read more.

17:25–18:05 Wednesday, 24 May 2017
Location: Capital Suite 10/11
Level: Beginner
Wojciech Biela (Teradata), Łukasz Osipiuk (Teradata)
Average rating: ****.
(4.00, 1 rating)
Wojciech Biela and Łukasz Osipiuk offer an introduction to Presto, an open source distributed analytical SQL engine that enables users to run interactive queries over their datasets stored in various data sources, and explore its applications in various big data problems. Read more.
17:25–18:05 Wednesday, 24 May 2017
Location: Capital Suite 12
Level: Intermediate
Natalino Busa (Teradata)
Natalino Busa shares an implementation for classifying pictures based on Spark and Slider that was developed during the 2016 Yelp Restaurant Photo Classification challenge. Spark processes data and trains the ML model, which consists of deep learning and ensemble classification methods, while picture scoring is exposed via an API that is persisted and scaled with Slider. Read more.
17:25–18:05 Wednesday, 24 May 2017
Location: Capital Suite 13
Secondary topics:  Deep learning
Level: Intermediate
Chris Fregly (PipelineAI)
Average rating: ***..
(3.00, 1 rating)
Chris Fregly explores an often-overlooked area of machine learning and artificial intelligence (the real-time, end-user-facing “serving” layer in hybrid-cloud and on-premises deployment environments) and shares a production-ready environment for serving your notebook-based Spark ML and TensorFlow AI models in a highly scalable, highly available way. Read more.
17:25–18:05 Wednesday, 24 May 2017
Location: Capital Suite 14
Level: Non-technical
Daniele Quercia (Bell Labs), Giovanni Quattrone (UCL)
Sharing economy platforms are poorly regulated because there is no evidence upon which to draft policies. Daniele Quercia and Giovanni Quattrone propose a means for gathering evidence by matching web data with official socioeconomic data and use data analysis to envision regulations that are responsive to real-time demands, contributing to the emerging idea of algorithmic regulation. Read more.
17:25–18:05 Wednesday, 24 May 2017
Location: Capital Suite 15/16
Secondary topics:  AI, Text Analysis and Mining
Level: Non-technical
Adam Smith (Automated Insights)
Average rating: ***..
(3.75, 4 ratings)
Natural language generation, the branch of AI that turns raw data into human-sounding narratives, is coming into its own in 2016. Adam Smith explores the real-world advances in NLG over the past decade and then looks ahead to the next. Computers are already writing finance, sports, ecommerce, and business intelligence stories. Find out what—and how—they’ll be writing by 2026. Read more.
17:25–18:05 Wednesday, 24 May 2017
Location: Capital Suite 17
Level: Non-technical
Carme Artigas (Synergic Partners)
Average rating: ****.
(4.08, 13 ratings)
Big data technology is mature, but its adoption by business is slow, due in part to challenges like a lack of resources or the need for a cultural change. Carme Artigas explains why an analytics center of excellence (ACoE), whether internal or outsourced, is an effective way to accelerate the adoption and shares an approach to implementing an ACoE. Read more.
17:25–18:05 Wednesday, 24 May 2017
Location: Capital Suite 7
Level: Intermediate
Arshak Navruzyan (Startup.ML)
Average rating: ****.
(4.80, 5 ratings)
Deep learning affords novel and powerful techniques for video prediction and analysis. Arshak Navruzyan explores the current state of the art for video analysis using deep learning techniques and the associated challenges. Read more.

Thursday, 25 May

11:15–11:55 Thursday, 25 May 2017
Location: Hall S21/23 (A)
Secondary topics:  Deep learning
Level: Intermediate
Average rating: ****.
(4.00, 2 ratings)
Nikolay Manchev offers an overview of the restricted Boltzmann machine, a type of neural network with a wide range of applications, and shares his experience using it on Hadoop (MapReduce and Spark) to process unstructured and semistructured data at scale. Read more.
11:15–11:55 Thursday, 25 May 2017
Location: Hall S21/23 (B)
Level: Intermediate
David Talby (Pacific AI)
Average rating: ***..
(3.57, 7 ratings)
Machine learning and data science systems often fail in production in unexpected ways. David Talby shares real-world case studies showing why this happens and explains what you can do about it, covering best practices and lessons learned from a decade of experience building and operating such systems at Fortune 500 companies across several industries. Read more.
11:15–11:55 Thursday, 25 May 2017
Location: Capital Suite 8/9
Level: Beginner
Tyler Akidau (Google)
Average rating: ***..
(3.80, 5 ratings)
The world of big data involves an ever-changing field of players. Much as SQL is a lingua franca for declarative data analysis, Apache Beam aims to provide a portable standard for expressing robust, out-of-order data processing pipelines in a variety of languages across a variety of platforms. Tyler Akidau explains how this vision has been realized and discusses the challenges that lie ahead. Read more.
11:15–11:55 Thursday, 25 May 2017
Location: Capital Suite 10/11
Level: Beginner
Aurélien Géron (Kiwisoft)
Average rating: *****
(5.00, 4 ratings)
Collaborative filtering is great for recommendations, yet it suffers from the cold-start problem. New content with no views is ignored, and new users get poor recommendations. Aurélien Géron shares a solution: knowledge graphs. With a knowledge graph, you can truly understand your users' interests and make better, more relevant recommendations. Read more.
11:15–11:55 Thursday, 25 May 2017
Location: Capital Suite 12
Level: Intermediate
Average rating: *****
(5.00, 1 rating)
Herman van Hövell tot Westerflier offers a deep dive into Spark SQL's Catalyst optimizer, introducing the core concepts of Catalyst and demonstrating how new and upcoming features are implemented using Catalyst. Read more.
11:15–11:55 Thursday, 25 May 2017
Location: Capital Suite 14
Level: Intermediate
Eric Tilenius (BlueTalon)
Average rating: ***..
(3.00, 4 ratings)
Many businesses will have to address EU GDPR as they deploy big data projects. This is an opportunity to rethink data security and deploy a flexible policy framework adapted to big data and regulations. Eric Tilenius explains how consistent visibility and control at a granular level across data domains can address both security and GDPR compliance. Read more.
11:15–11:55 Thursday, 25 May 2017
Location: Capital Suite 15/16
Secondary topics:  AI
Level: Non-technical
Martin Goodson (Evolution AI), Andrew Crisp (Dun & Bradstreet)
Average rating: ****.
(4.00, 8 ratings)
Martin Goodson gives a tell-all account of an ultimately successful installation of a deep learning system in an enterprise environment. Andy Crisp then shares insights into the challenges of integrating artificial intelligence systems into real-world business processes. Read more.
11:15–11:55 Thursday, 25 May 2017
Location: Capital Suite 17
Level: Beginner
Barzan Mozafari (University of Michigan, Ann Arbor)
Average rating: ****.
(4.50, 4 ratings)
Visualization and exploratory analytics require subsecond interactions with massive volumes of data, a goal that has remained elusive due to numerous inefficiencies across the stack. Barzan Mozafari offers an overview of Verdict, an open source middleware that guarantees subsecond visualization and analytics and works with Impala, Spark, Hive, and most other engines in the Hadoop ecosystem. Read more.
11:15–11:55 Thursday, 25 May 2017
Location: Capital Suite 2/3
Secondary topics:  Deep learning
Radhika Rangarajan explains how Intel works with its users to build deep learning-powered big data analytics applications (object detection, image recognition, NLP, etc.) using BigDL. Read more.
12:05–12:45 Thursday, 25 May 2017
Location: Hall S21/23 (A)
Secondary topics:  AI, Cloud, Deep learning
Level: Intermediate
Barbara Fusinska (Katacoda)
Average rating: ***..
(3.00, 5 ratings)
The popularity of deep learning is due in part to its capabilities in recognizing patterns from inputs such as images or sounds. Barbara Fusinska offers an overview of Microsoft Cognitive Toolkit, an open source framework offering various modules and algorithms enabling machines to learn like a human brain. Read more.
12:05–12:45 Thursday, 25 May 2017
Location: Hall S21/23 (B)
Secondary topics:  Cloud
Level: Intermediate
Leah McGuire (Salesforce)
Average rating: ****.
(4.00, 2 ratings)
What if you had to build more models than there are data scientists in the world—a feat enterprise companies serving hundreds of thousands of businesses often have to do? Leah McGuire offers an overview of Salesforce's general-purpose machine-learning platform that automatically builds per-company optimized models for any given predictive problem at scale, beating out most hand-tuned models. Read more.
12:05–12:45 Thursday, 25 May 2017
Location: Capital Suite 8/9
Level: Intermediate
Bas Geerdink (ING)
Average rating: ***..
(3.25, 4 ratings)
As a data-driven enterprise, ING is heavily investing in big data, analytics, and stream processing. Bas Geerdink shares three use cases at ING and discusses their respective architectures and technology. All software is currently in production, running with modern tools such as Kafka, Cassandra, Spark, Flink, and H2O.ai. Read more.
12:05–12:45 Thursday, 25 May 2017
Location: Capital Suite 10/11
Level: Beginner
Xueyan Li (Qunar), Yupeng Fu (Alluxio)
Average rating: ***..
(3.00, 1 rating)
Alluxio, the first memory-speed virtual distributed storage system in the world, unifies the data from various under storage systems and presents a global namespace to various computation frameworks. Xueyan Li and Yupeng Fu explore how Alluxio has led to an average 300x performance improvement at service peak time on stream processing workloads at Qunar. Read more.
12:05–12:45 Thursday, 25 May 2017
Location: Capital Suite 12
Level: Intermediate
Holden Karau (IBM)
Average rating: ****.
(4.75, 4 ratings)
Much of Apache Spark’s power comes from lazy evaluation along with intelligent pipelining, which can make debugging more challenging than on traditional distributed systems. Holden Karau explores how to debug Apache Spark applications, the different options for logging in Spark’s variety of supported languages, and some common errors and how to detect them. Read more.
12:05–12:45 Thursday, 25 May 2017
Location: Capital Suite 13
Level: Intermediate
Andrei Savu (Cloudera), Philip Langdale (Cloudera)
Cloudera Enterprise has made many focused optimizations in order to leverage all of the cloud-native capabilities of AWS for the CDH platform. Andrei Savu and Philip Langdale take you through all the ins and outs of successfully running end-to-end batch data engineering workflows in AWS and demonstrate a Cloudera on AWS data engineering workflow with a sample use case. Read more.
12:05–12:45 Thursday, 25 May 2017
Location: Capital Suite 14
Level: Intermediate
Eddie Garcia (Cloudera)
Average rating: ****.
(4.00, 4 ratings)
The use of big data and machine learning to detect and predict security threats is a growing trend, with interest from financial institutions, telecommunications providers, healthcare companies, and governments alike. Eddie Garcia explores how companies are using Apache Hadoop-based approaches to protect their organizations and explains how Apache Spot is tackling this challenge head-on. Read more.
12:05–12:45 Thursday, 25 May 2017
Location: Capital Suite 15/16
Level: Beginner
Sean Kandel (Trifacta)
Average rating: ***..
(3.00, 6 ratings)
Sean Kandel offers an overview of an entirely new approach to visualizing metadata and data lineage, explaining how to track how different attributes of data are derived during the data preparation process and the associated linkages across different elements in the data. Read more.
12:05–12:45 Thursday, 25 May 2017
Location: Capital Suite 17
Level: Beginner
Paco Nathan (O'Reilly Media)
Average rating: ****.
(4.33, 3 ratings)
O'Reilly recently launched Oriole, a new learning medium for online tutorials that combines Jupyter notebooks, video timelines, and Docker containers run on a Mesos cluster, based on the pedagogical theory of computable content. Paco Nathan explores the system architecture, shares project experiences, and considers the impact of notebooks for sharing and learning across a data-centric organization. Read more.
12:05–12:45 Thursday, 25 May 2017
Location: Capital Suite 4
Neil Cullum (BMC Software), Alon Lebenthal (BMC Software)
Average rating: *....
(1.00, 1 rating)
Neil Cullum and Alon Lebenthal demonstrate how BMC can help automate every aspect of the big data journey with Control-M’s enterprise-grade automation capabilities and job-as-code approach, helping you deliver big data projects faster and better. Read more.
12:05–12:45 Thursday, 25 May 2017
Location: Capital Suite 2/3
Alon Elishkov (Outbrain)
Migrating petabyte-scale Hadoop installations to a new cluster with hundreds of machines, several thousands of jobs daily, and countless ecosystem integrations while maintaining a stable production environment is a challenging task. Alon Elishkov discusses the techniques and tools Outbrain has developed to achieve this goal. Read more.
14:05–14:45 Thursday, 25 May 2017
Location: Hall S21/23 (A)
Secondary topics:  Deep learning, PyData
Level: Intermediate
Martin Görner (Google)
Average rating: ****.
(4.75, 12 ratings)
With TensorFlow, deep machine learning has transitioned from an area of research into mainstream software engineering. Martin Görner walks you through building and training a neural network that recognizes handwritten digits with >99% accuracy using Python and TensorFlow. Read more.
14:05–14:45 Thursday, 25 May 2017
Location: Hall S21/23 (B)
Level: Intermediate
Rumman Chowdhury (Accenture)
Average rating: ***..
(3.00, 1 rating)
Multilevel regression and poststratification (MRP) is a method of estimating granular results from higher-level analyses. While it is generally used to estimate survey responses at a more granular level, MRP has clear applications in industry-level data science. Rumman Chowdhury reviews the methodology behind MRP and provides a hands-on programming tutorial. Read more.
14:05–14:45 Thursday, 25 May 2017
Location: Capital Suite 8/9
Level: Advanced
Aljoscha Krettek (data Artisans)
Average rating: ***..
(3.50, 2 ratings)
Apache Beam's new State API brings scalability and consistency to fine-grained stateful processing while remaining portable to any Beam runner. Aljoscha Krettek introduces the new state and timer features in Beam and shows how to use them to express common real-world use cases in a backend-agnostic manner. Read more.
14:05–14:45 Thursday, 25 May 2017
Location: Capital Suite 10/11
Level: Advanced
Mark Grover (Lyft), Ted Malaska (Blizzard Entertainment)
Average rating: ****.
(4.00, 4 ratings)
Any nontrivial streaming app requires that you consider a number of important topics, but questions like how to manage offsets or state often go unanswered. Mark Grover and Ted Malaska share practices that no one talks about when you start writing a streaming app but that you'll inevitably need to learn along the way. Read more.
14:05–14:45 Thursday, 25 May 2017
Location: Capital Suite 12
Level: Intermediate
Nicolas Poggi (Barcelona Supercomputing-Microsoft Research Center)
Nicolas Poggi evaluates the out-of-the-box support for Spark and compares the offerings, reliability, scalability, and price-performance from major PaaS providers, including Azure HDInsight, Amazon Web Services EMR, Google Dataproc, and Rackspace Cloud Big Data, with an on-premises commodity cluster as baseline. Read more.
14:05–14:45 Thursday, 25 May 2017
Location: Capital Suite 13
Level: Intermediate
Calum Murray (Intuit)
Average rating: ****.
(4.00, 1 rating)
As Intuit moves its SaaS platform from its own data centers to AWS, it will straddle both worlds for a period of time (and potentially indefinitely). Calum Murray looks at what straddling means to data and data systems. Read more.
14:05–14:45 Thursday, 25 May 2017
Location: Capital Suite 14
Level: Intermediate
Haifeng Chen (Intel)
Average rating: ****.
(4.00, 2 ratings)
As the processing capability of modern platforms approaches memory speed, securing big data using encryption usually hurts performance. Haifeng Chen shares proven ways to speed up data encryption in Hadoop and Spark, as well as the latest progress in open source, and demystifies using hardware acceleration technology to protect your data. Read more.
14:05–14:45 Thursday, 25 May 2017
Location: Capital Suite 15/16
Level: Beginner
Irene Ros (Bocoup)
Average rating: ***..
(3.00, 1 rating)
Measurement Lab is the largest collection of open internet performance data on the planet, with over five petabytes of information about the quality of experience on the internet and more data generated every day. Irene Ros shares recent work to develop a data processing pipeline, API, and visualizations to make the data more accessible. Read more.
14:05–14:45 Thursday, 25 May 2017
Location: Capital Suite 17
Level: Non-technical
David Martinez Rego (DataSpartan)
Average rating: **...
(2.50, 4 ratings)
The growth of data science as a strategic discipline makes its correct management paramount to the survival of new and traditional businesses that want to compete in a foreseeable data-driven economy. David Martinez Rego shares a set of sound, solid principles that will help increase your effectiveness as a data science manager. Read more.
14:05–14:45 Thursday, 25 May 2017
Location: Capital Suite 2/3
Pranav Rastogi (Microsoft)
Average rating: ****.
(4.00, 1 rating)
Pranav Rastogi explains how to simplify your big data solutions with Datameer, AtScale, Dataiku, and StreamSets on Microsoft’s Azure HDInsight, a cloud Spark and Hadoop service for the enterprise. Join in to learn practical information that will enable faster time to insights for you and your business. Read more.
14:55–15:35 Thursday, 25 May 2017
Location: Hall S21/23 (A)
Secondary topics:  Deep learning
Level: Beginner
Nir Lotan (Intel), Barak Rozenwax (Intel)
Average rating: *****
(5.00, 3 ratings)
Barak Rozenwax and Nir Lotan explain how to easily train and deploy deep learning models for image and text analysis problems using Intel's Deep Learning SDK, which enables you to use deep learning frameworks that were optimized to run fast on regular CPUs, including Caffe and TensorFlow. Read more.
14:55–15:35 Thursday, 25 May 2017
Location: Hall S21/23 (B)
Level: Advanced
Gary Willis (ASI)
Average rating: ***..
(3.20, 5 ratings)
Gary Willis offers a technical presentation of a novel algorithm that uses public data and an unsupervised tree-based learning algorithm to help companies leverage locational data they have on their clients. Along the way, Gary also discusses a wide range of further potential applications. Read more.
14:55–15:35 Thursday, 25 May 2017
Location: Capital Suite 8/9
Level: Intermediate
Rekha Joshi (Intuit)
Average rating: **...
(2.00, 5 ratings)
Performance and security are often at loggerheads. Rekha Joshi explains why and offers a deep dive into how performance and security are managed in some of the most intense and critical data platform services at Intuit. Read more.
14:55–15:35 Thursday, 25 May 2017
Location: Capital Suite 10/11
Level: Intermediate
Tomer Shiran (Dremio), Jacques Nadeau (Dremio)
Average rating: ****.
(4.75, 4 ratings)
In most organizations, data is spread across multiple data sources, such as Hadoop/cloud storage, RDBMS, and NoSQL. Tomer Shiran and Jacques Nadeau offer an overview of Apache Arrow, an open source in-memory columnar technology that enables users to combine multiple data sources and expose them as a virtual data lake to users of Spark, SQL-on-Hadoop, Python, and R. Read more.
14:55–15:35 Thursday, 25 May 2017
Location: Capital Suite 12
Level: Beginner
Matthias Niehoff (codecentric AG)
Average rating: ****.
(4.00, 4 ratings)
Matthias Niehoff shares lessons learned working with Spark, Cassandra, and the Spark-Cassandra connector and best practices drawn from his work on multiple big and fast data projects, as well as challenges encountered along the way. Read more.
14:55–15:35 Thursday, 25 May 2017
Location: Capital Suite 13
Level: Beginner
Matt Brandwein (Cloudera), Tristan Zajonc (Cloudera)
Average rating: ***..
(3.00, 1 rating)
Self-service data science is easier said than delivered, especially on Apache Hadoop. Most organizations struggle to balance the diverging needs of the data scientist, data engineer, operator, and architect. Matt Brandwein and Tristan Zajonc cover the underlying root causes of these challenges and introduce new capabilities being developed to make self-service data science a reality. Read more.
14:55–15:35 Thursday, 25 May 2017
Location: Capital Suite 14
Level: Advanced
Graham Ahearne (Corvil), Fergal Toomey (Corvil)
Fergal Toomey and Graham Ahearne outline the challenges facing network security in complex industries, sharing key lessons learned from their experiences safeguarding electronic trading environments to demonstrate the utility of machine learning and machine time network data analytics. Read more.
14:55–15:35 Thursday, 25 May 2017
Location: Capital Suite 15/16
Grace Huang (Pinterest)
Average rating: *****
(5.00, 1 rating)
Grace Huang offers a glimpse into the unique challenges of maintaining a healthy ecosystem around machine-learning products at Pinterest. Grace explores the suite of tools Pinterest built to make sense of machine-learning experiment results and the panel of metrics it developed to help gauge the health of the content ecosystem. Read more.
14:55–15:35 Thursday, 25 May 2017
Location: Capital Suite 17
Level: Intermediate
Iñaki Puigdollers (Social Point)
Average rating: ****.
(4.00, 1 rating)
Low cost, big impact: this is what data science can bring to your business. Iñaki Puigdollers explores how the analytics department changed Social Point games, creating an even better gaming experience and business. Read more.
16:35–17:15 Thursday, 25 May 2017
Location: Hall S21/23 (A)
Level: Beginner
Paco Nathan (O'Reilly Media)
Average rating: ****.
(4.50, 2 ratings)
Paco Nathan explains how O'Reilly employs AI, from the obvious (chatbots, case studies about other firms) to the less so (using AI to show the structure of content in detail, enhance search and recommendations, and guide editors for gap analysis, assessment, pathing, etc.). Approaches include vector embedding search, summarization, TDA for content gap analysis, and speech-to-text to index video. Read more.
16:35–17:15 Thursday, 25 May 2017
Location: Hall S21/23 (B)
Level: Beginner
Galiya Warrier (Microsoft)
Galiya Warrier demonstrates how to apply a conversational interface (in the form of a chatbot) to communicate with an existing data science model. Read more.
16:35–17:15 Thursday, 25 May 2017
Location: Capital Suite 8/9
Secondary topics:  Deep learning, Streaming
Level: Intermediate
Kamran Yousaf (Redis Labs)
Average rating: ***..
(3.50, 6 ratings)
Kamran Yousaf explains how to substantially accelerate and radically simplify common practices in machine learning, such as running a trained model in production, to meet real-time expectations, using Redis modules that natively store and execute common models generated by Spark ML and TensorFlow algorithms. Read more.
16:35–17:15 Thursday, 25 May 2017
Location: Capital Suite 10/11
Level: Beginner
The class of big data computations known as distributed merge trees was built to aggregate user information across multiple data sources in the media domain. Vijay Srinivas Agneeswaran explores prototypes built on top of Apache HAWQ, Druid, and Kinetica, one of the open source GPU databases. Results show that Kinetica on a single G2.8x node outperformed clusters of HAWQ and Druid nodes. Read more.
16:35–17:15 Thursday, 25 May 2017
Location: Capital Suite 12
Secondary topics:  Deep learning, IoT
Level: Intermediate
Mads Ingwar (Think Big), Eliano Marques (Think Big)
Average rating: ****.
(4.50, 2 ratings)
Eliano Marques and Mads Ingwar share a case study on how to leverage data science to plan ship engine maintenance by warning about potential piston ring failure. Read more.
16:35–17:15 Thursday, 25 May 2017
Location: Capital Suite 13
Level: Intermediate
Arturo Bayo (Synergic Partners), Alvaro Fernandez Velando (Santander Spain)
Average rating: ****.
(4.50, 6 ratings)
Arturo Bayo and Alvaro Fernandez Velando explain how a data hub strategy helps clarify data sharing and governance in an organization and share one way to implement a data hub architecture using big data technology and resources that are already established in the enterprise. Read more.
16:35–17:15 Thursday, 25 May 2017
Location: Capital Suite 14
Level: Intermediate
Average rating: ****.
(4.00, 1 rating)
Brendan Rizzo explains how data encryption and tokenization can help you protect your Hadoop environment and outlines options for securing data and speeding Hadoop implementation, drawing on recent deployments in pharma, health insurance, retail, and telecoms to illustrate the impact to operations and other areas of the business. Read more.
16:35–17:15 Thursday, 25 May 2017
Location: Capital Suite 15/16
Level: Non-technical
Andy Petrella (Kensu)
Average rating: *****
(5.00, 1 rating)
Data science for enterprise use cases explodes the number of intermediate datasets. Thus, one of the upcoming challenges is to find a way into these ever-growing data sources. Andy Petrella proposes a data-science-on-data-science approach, using behavioral data combined with static and runtime metadata of processes. Read more.
16:35–17:15 Thursday, 25 May 2017
Location: Capital Suite 17
Level: Intermediate
John Akred (Silicon Valley Data Science)
Average rating: ***..
(3.00, 2 ratings)
Valuing data can be a headache. The unique properties of data make it difficult to assess its overall value using traditional valuation approaches. John Akred discusses a number of alternative approaches to valuing data within an organization for specific purposes so that you can optimize decisions around its acquisition and management. Read more.