Speaker Slides & Video: Big data conference & machine learning training

Architecting a data platform for enterprise use

Mark Madsen (Teradata), Todd Walter (Archimedata)

Download slides (PDF)

Building a data lake involves more than installing Hadoop or putting data into AWS. The goal in most organizations is to build multiuse data infrastructure that is not subject to past constraints. Mark Madsen and Todd Walter explore design assumptions and principles and walk you through a reference architecture to use as you work to unify your analytics infrastructure.

50 reasons to learn the shell for doing data science

Jeroen Janssens (Data Science Workshops)

Download slides (PDF)

"Anyone who does not have the command line at their beck and call is really missing something," tweeted Tim O'Reilly when Jeroen Janssens's Data Science at the Command Line was recently made available online for free. Join Jeroen to learn what you're missing out on if you're not applying the command line and many of its power tools to typical data science problems.

A high-performance system for deep learning inference and visual inspection

Moty Fania (Intel)

Download slides (PDF)

Moty Fania explains how Intel implemented an AI inference platform to enable internal visual inspection use cases and shares lessons learned along the way. The platform is based on open source technologies and was designed for real-time streaming and online actuation.

Architecting a next-generation data platform

Ted Malaska (Capital One), Jonathan Seidman (Cloudera)

Download slides (PDF)

Using Customer 360 and the IoT as examples, Jonathan Seidman and Ted Malaska explain how to architect a modern, real-time big data platform leveraging recent advancements in the open source software world, using components like Kafka, Flink, Kudu, Spark Streaming, and Spark SQL and modern storage engines to enable new forms of data processing and analytics.

Architecting data platforms for cybersecurity

Charaka Goonatilake (Panaseer)

Download slides (PDF)

Data is becoming a crucial weapon to secure an organization against cyber threats. Charaka Goonatilake shares strategies for designing effective data platforms for cybersecurity using big data technologies, such as Spark and Hadoop, and explains how these platforms are being used in real-world examples of data-driven security.

Architectural design for interactive visualization

Bargava Subramanian (Binaize), Amit Kapoor (narrativeVIZ)

View slides

Creating visualizations for data science requires an interactive setup that works at scale. Bargava Subramanian and Amit Kapoor explore the key architectural design considerations for such a system and discuss the four key trade-offs in this design space: rendering for data scale, computation for interaction speed, adapting to data complexity, and being responsive to data velocity.

Audi's journey to an enterprise big data platform

Carsten Herbe (Audi Business Innovation GmbH), Matthias Graunitz (Audi AG)

Download slides (PDF)

Carsten Herbe and Matthias Graunitz detail Audi's journey from a Hadoop proof of concept to a multitenant enterprise platform, sharing lessons learned, the decisions Audi made, and how a number of use cases are implemented using the platform.

Autonomous ETL with materialized views

Adesh Rao (Qubole), Abhishek Somani (Qubole)

Download slides (PDF)

Adesh Rao and Abhishek Somani share a framework for materialized views in SQL-on-Hadoop engines that automatically suggests, creates, uses, invalidates, and refreshes views created on top of data for optimal performance and strict correctness.

Batch and real-time processing in LINE's log analysis platform

Wataru Yukawa (LINE)

View slides

LINE, one of the most popular messaging applications in Asia, offers many services, such as its news application. These services sometimes depend on real-time processing. Wataru Yukawa offers an overview of LINE's web tracking system, which consists of the JavaScript SDK, NGINX, Fluentd, Kafka, Elasticsearch, and Hadoop, and explains how it helps with batch and real-time processing.

Big data meets renewable energy: Building a real-time asset management platform for renewable energy

Stamatis Stefanakos (D ONE AG)

Download slides (PDF)

Switzerland-based startup WinJi capitalizes on two current megatrends: big data and renewable energy. Stamatis Stefanakos offers an overview of WinJi's TruePower Asset Management Platform, covering the overall architecture and the motivation behind it, the physics behind the data, and the business case.

Charting a data journey to the cloud

Mick Hollison (Cloudera), Sven Loeffler (Deutsche Telekom), Robert Neumann (Ultra Tendency)

Watch the keynote

What happens when you combine near-limitless data with on-demand access to powerful analytics and compute? For Deutsche Telekom, the results have been transformative. Mick Hollison, Sven Löffler, and Robert Neumann explain how Deutsche Telekom is harnessing machine learning and analytics in the cloud to build Europe’s largest and most powerful IoT data marketplace.

Complex event processing with Apache Flink

Kostas Kloudas (data Artisans)

Download slides (PDF)

Complex event processing (CEP) helps detect patterns over continuous streams of data. DNA sequencing, fraud detection, shipment tracking with specific characteristics (e.g., contaminated goods), and user activity analysis fall into this category. Kostas Kloudas offers an overview of Flink's CEP library and explains the benefits of its integration with Flink.
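To make the idea of pattern detection over a stream concrete, here is a minimal sketch in plain Python. It does not use Flink or its CEP library; the `Event` type, the event labels, and the `detect_followed_by` helper are hypothetical names chosen for illustration. It assumes events arrive in timestamp order and looks for one event kind followed by another for the same user within a time limit.

```python
from dataclasses import dataclass
from typing import Dict, Iterable, Iterator, Tuple


@dataclass(frozen=True)
class Event:
    user: str
    kind: str         # hypothetical labels, e.g. "small_payment"
    timestamp: float  # seconds


def detect_followed_by(
    events: Iterable[Event], first: str, second: str, within: float
) -> Iterator[Tuple[Event, Event]]:
    """Yield (a, b) where a `second` event follows a `first` event
    for the same user within `within` seconds.

    Assumes `events` is ordered by timestamp.
    """
    last_first: Dict[str, Event] = {}  # user -> most recent `first` event
    for ev in events:
        if ev.kind == first:
            last_first[ev.user] = ev
        elif ev.kind == second:
            start = last_first.pop(ev.user, None)
            if start is not None and ev.timestamp - start.timestamp <= within:
                yield (start, ev)


stream = [
    Event("alice", "small_payment", 0.0),
    Event("bob", "small_payment", 1.0),
    Event("alice", "large_payment", 5.0),
    Event("bob", "large_payment", 120.0),  # too late to match
]
matches = list(
    detect_followed_by(stream, "small_payment", "large_payment", within=60.0)
)
```

A real CEP engine would also need to handle out-of-order events, state cleanup, and richer patterns, all of which this sketch leaves out.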

Data protection and innovation

Eva Kaili (European Parliament | The Science and Technology Options Assessment Panel)

Watch the keynote

Keynote with Eva Kaili

Deep computer vision for manufacturing

Aurélien Géron (Kiwisoft)

Download slides (PDF)

Convolutional neural networks (CNNs) can now complete many computer vision tasks with superhuman ability. This will have a large impact on manufacturing by improving anomaly detection, product classification, analytics, and more. Aurélien Géron details common CNN architectures, explains how they can be applied to manufacturing, and covers potential challenges along the way.

Deep learning for recommender systems

Nick Pentreath (IBM)

Download slides (PDF)

In the last few years, deep learning has achieved significant success in a wide range of domains, including computer vision, artificial intelligence, speech, NLP, and reinforcement learning. However, deep learning in recommender systems has, until recently, received relatively little attention. Nick Pentreath explores recent advances in this area in both research and practice.

Deep learning in the browser: Explorable explanations, model inference, and rapid prototyping

Amit Kapoor (narrativeVIZ), Bargava Subramanian (Binaize)

View slides

Amit Kapoor and Bargava Subramanian lead three live demos of deep learning (DL) done in the browser—building explorable explanations to aid insight, building model inference applications, and rapid prototyping and training an ML model—using the emerging client-side JavaScript libraries for DL.

Deep learning with TensorFlow and Spark using GPUs and Docker containers

Nanda Vijaydev (BlueData), Thomas Phelan (HPE BlueData)

Download slides (PPTX)

In the past, you needed a high-end proprietary stack for advanced machine learning, but today, you can use open source machine learning and deep learning algorithms available with distributed computing technologies like Apache Spark and GPUs. Nanda Vijaydev and Thomas Phelan demonstrate how to deploy a TensorFlow and Spark with NVIDIA CUDA stack on Docker containers in a multitenant environment.

Designing ethical artificial intelligence

Jivan Virdee (Fjord), Hollie Lubbock (Fjord)

View slides

Artificial intelligence systems are powerful agents of change in our society, but as this technology becomes increasingly prevalent—transforming our understanding of ourselves and our society—issues around ethics and regulation will arise. Jivan Virdee and Hollie Lubbock explore how to address fairness, accountability, and the long-term effects on our society when designing with data.

Detecting small-scale mines in Ghana

Elena Terenzi (Microsoft), Michael Lanzetta (Microsoft)

View slides

Michael Lanzetta and Elena Terenzi offer an overview of a collaboration between Microsoft and the Royal Holloway University that applied deep learning to locate illegal small-scale mines in Ghana using satellite imagery, scaled training using Kubernetes, and investigated the mines' impact on surrounding populations and environment.

Distributed training of deep learning models

Mathew Salvaris (Microsoft), Miguel Gonzalez-Fierro (Microsoft), Ilia Karmanov (Microsoft)

Download slides (PDF)

Mathew Salvaris, Miguel Gonzalez-Fierro, and Ilia Karmanov offer a comparison of two platforms for running distributed deep learning training in the cloud, using a ResNet network trained on the ImageNet dataset as an example. You'll examine the performance of each as the number of nodes scales and learn some tips and tricks as well as some pitfalls to watch out for.

Driving better predictions in the oil and gas industry with modern data architecture

Jane McConnell (Teradata), Paul Ibberson (Teradata)

Download slides (1-PDF)

Download slides (2-PDF)

Oil exploration and production is technically challenging, and exploiting the associated data brings its own difficulties. Jane McConnell and Paul Ibberson share best practices and lessons learned helping oil companies modernize their data architecture and plan the IT/OT convergence required to benefit from full digitalization.

Executive Briefing: BI on big data

Mark Madsen (Teradata), Shant Hovsepian (Arcadia Data)

Download slides (PDF)

If your goal is to provide data to an analyst rather than a data scientist, what’s the best way to deliver analytics? There are 70+ BI tools in the market and a dozen or more SQL- or OLAP-on-Hadoop open source projects. Mark Madsen and Shant Hovsepian discuss the trade-offs between a number of architectures that provide self-service access to data.

Executive Briefing: GDPR—Getting your data ready for heavy, new EU privacy regulations

Mark Donsky (Okera), Syed Rafice (Cloudera)

Download slides (PDF)

In May 2018, the General Data Protection Regulation (GDPR) goes into effect for firms doing business in the EU, but many companies aren't prepared for the strict regulation or fines for noncompliance (up to €20 million or 4% of global annual revenue). Mark Donsky and Syed Rafice outline the capabilities your data environment needs to simplify compliance with GDPR and future regulations.

Executive Briefing: The ROI of data-driven digital transformation

Kevin Sigliano (IE Business School)

Download slides (1-PDF)

Download slides (2-PDF)

Financial and consumer ROI demands that business leaders understand the drivers and dynamics of digital transformation and big data. Kevin Sigliano explains why disrupting value propositions and continuous innovation are critical if you wish to dramatically improve the way your company engages customers, creates value, and maximizes financial results.

Executive Briefing: What you need to know about fast data

Dean Wampler (Anyscale)

View slides

Streaming data systems, so-called fast data, promise accelerated access to information, leading to new innovations and competitive advantages. But they aren't just faster versions of big data. They force architecture changes to meet new demands for reliability and dynamic scalability, more like microservices. Dean Wampler outlines what you need to know to exploit fast data successfully.

Executive Briefing: Killer robots and how not to do data science

Kate Vang (DataKind UK), Christine Henry (DataKind UK)

Download slides (PPTX)

Not a day goes by without reading headlines about the fear of AI or how technology seems to be dividing us more than bringing us together. DataKind UK is passionate about using machine learning and artificial intelligence for social good. Kate Vang and Christine Henry explain what socially conscious AI looks like and what DataKind is doing to make it a reality.

Fast analytics on fast data: Kudu as a storage layer for banking applications

Olaf Hein (ORDIX AG)

Download slides (PDF)

Olaf Hein explains how a large German bank relies on a Kudu-based data platform to speed up business processes. Olaf highlights key data access patterns and the system architecture and shares best practices and lessons learned using Kudu in development and operations.

Hadoop under attack: Securing data in a banking domain

Federico Leven (ReactoData)

Download slides (PDF)

The apparent difficulty of managing Hadoop compared to more traditional and proprietary data products makes some companies wary of the Hadoop ecosystem, but managing security is becoming more accessible in the Hadoop space, particularly in the Cloudera stack. Federico Leven offers an overview of an end-to-end security deployment on Hadoop and the data and security governance policies implemented.

How BT delivers better broadband and TV using Spark and Kafka

Phillip Radley (BT)

Download slides (PDF)

In the past year, British Telecom has added a streaming network analytics use case to its multitenant data platform. Phillip Radley demonstrates how the solution works and explains how it delivers better broadband and TV services, using Kafka and Spark on YARN and HDFS encryption.

How DHL is increasing efficiency and reducing distance traveled across the warehouse with the IoT

Michael Troughton (Conduce), Jonathan Genah (DHL Supply Chain)

View slides

DHL has partnered with Conduce to build a human interface whose real-time visualizations track and analyze the distance traveled by personnel and warehouse equipment, all calibrated around a center of activity. Michael Troughton explains how this immersive data visualization gives DHL unprecedented insight to evaluate and act on everything that occurs in its warehouses.

How to protect big data in a containerized environment

Thomas Phelan (HPE BlueData)

Download slides (PPT)

Recent headline-grabbing data breaches demonstrate that protecting data is essential for every enterprise. The best-of-breed approach for big data is HDFS configured with Transparent Data Encryption (TDE), but TDE can be difficult to configure and manage—issues that are only compounded when running on Docker containers. Thomas Phelan discusses these challenges and explains how to overcome them.

How Typeform's data and analytics team managed to embed its data scientists into cross-functional teams while maintaining their cohesion

Viola Melis (Typeform)

Download slides (PPTX)

Typeform's data team is transitioning into a less centralized structure and embedding its data scientists inside product and business teams. Viola Melis details initiatives the team developed to ensure alignment and cohesion, discusses the journey through this challenging process, and shares lessons learned, best practices, and new processes that were established.

How will the GDPR impact machine learning?

Steven Touw (Immuta)

Download slides (PDF)

The Strata Data conference in London takes place during one of the most important weeks in the history of data regulation, as GDPR begins to be enforced. Steve Touw explores the effects of the GDPR on deploying machine learning models in the EU.

Human-in-the-loop data science with Jupyter widgets

Pascal Bugnion (ASI Data Science)

Download slides (PDF)

Jupyter widgets let you create lightweight, interactive graphical interfaces directly in Jupyter notebooks. Pascal Bugnion demonstrates how to use Jupyter widgets to implement human-in-the-loop machine learning with highly interactive user interfaces.

Humanizing data: How to find the why

Hollie Lubbock (Fjord), Jivan Virdee (Fjord)

View slides

Data has opened up huge possibilities for analyzing and customizing services. However, although we can now manage experiences to dynamically target audiences and respond immediately, context is often missing. Hollie Lubbock and Jivan Virdee share a practical approach to discovering the reasons behind the data patterns you see and help you decide what level of personalized service to create.

Humans and the machine: Machine learning in context (sponsored by IBM)

Jean-François Puget (IBM Analytics)

Watch the keynote

On the way to active analytics for business, we have to answer two big questions: What must happen to data before running machine learning algorithms, and how should machine learning output be used to generate actual business value? Jean-François Puget demonstrates the vital role of human context in answering those questions.

Improving computer vision models at scale

Marton Balassi (Cloudera), Mirko Kämpf (Cloudera), Jan Kunigk (Cloudera)

Download slides (PDF)

Rigorous improvement of an image recognition model often requires multiple iterations of eyeballing outliers, inspecting statistics of the output labels, then modifying and retraining the model. Marton Balassi, Mirko Kämpf, and Jan Kunigk share a solution that automates the process of running the model on the testing data and populating an index of the labels so they become searchable.

Improving DevOps and QA efficiency using machine learning and NLP methods

Ran Taig (Dell), Omer Sagi (Dell)

Download slides (PPTX)

DevOps and QA engineers spend a significant amount of time investigating recurring issues. These issues are often represented by large configuration and log files, so investigating whether two issues are duplicates can be tedious. Ran Taig and Omer Sagi outline a solution that leverages NLP and machine learning algorithms to automatically identify duplicate issues.
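As a rough illustration of what flagging duplicate issues from log text can look like, here is a minimal sketch using Jaccard similarity over token sets. This is a simple stand-in technique, not the speakers' method; the helper names, the sample log lines, and the 0.6 threshold are all assumptions made for the example.

```python
import re


def tokens(text: str) -> set:
    """Lowercase the text and split it into alphanumeric tokens."""
    return set(re.findall(r"[a-z0-9_]+", text.lower()))


def jaccard(a: str, b: str) -> float:
    """Jaccard similarity: shared tokens divided by all distinct tokens."""
    ta, tb = tokens(a), tokens(b)
    if not ta and not tb:
        return 1.0  # two empty texts are treated as identical
    return len(ta & tb) / len(ta | tb)


def is_duplicate(a: str, b: str, threshold: float = 0.6) -> bool:
    """Flag two issue texts as likely duplicates (threshold is an assumption)."""
    return jaccard(a, b) >= threshold


# Hypothetical log excerpts attached to three issues.
issue_1 = "ERROR: connection timeout to db-host port 5432"
issue_2 = "error: connection timeout to db-host port 5433"
issue_3 = "WARN: disk usage above 90 percent on node7"

same = is_duplicate(issue_1, issue_2)       # differ only in the port number
different = is_duplicate(issue_1, issue_3)  # share no tokens
```

A production system would likely weight tokens, normalize volatile fields such as timestamps and IDs, and learn the threshold from labeled pairs.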

Incorporating data sources inside and outside of the data center (sponsored by Cisco)

Chiang Yang (Cisco)

View slides

Han Yang explains how Cisco is leveraging big data and analytics and details how the company is helping customers to incorporate data sources from the internet of things and deploy machine learning at the edge and at the enterprise.

Interpretable machine learning products

Mike Lee Williams (Cloudera Fast Forward Labs)

Download slides (PDF)

Interpretable models result in more accurate, safer, and more profitable machine learning products, but interpretability can be hard to ensure. Michael Lee Williams examines the growing business case for interpretability, explores concrete applications including churn, finance, and healthcare, and demonstrates the use of LIME, an open source, model-agnostic tool you can apply to your models today.

Journey to GDPR compliance

Alison Howard (Microsoft)

Watch the keynote

May 25, the day the GDPR goes into effect, is an important milestone for data protection in the EU and elsewhere, but the journey to GDPR compliance neither begins nor ends there. Alison Howard explains how Microsoft, one of the world’s largest companies, with operations across the EU and around the globe, has prepared for May 25 and beyond.

Kafka in jail: Running Kafka in container-orchestrated clusters

Sean Glover (Lightbend)

View slides

Kafka is best suited to run close to the metal on dedicated machines in static clusters, but these clusters are quickly becoming extinct. Companies want mixed-use clusters that take advantage of every resource available. Sean Glover offers an overview of leading Kafka implementations on DC/OS and Kubernetes to explore how reliably they run Kafka in container-orchestrated clusters.

Learning how to design automatically updating AI with Apache Kafka and Deeplearning4j

Jason Bell (Independent Speaker)

Download slides (1-PPTX)

Download slides (2-ZIP)

Jason Bell offers an overview of a self-learning knowledge system that uses Apache Kafka and Deeplearning4j to accept data, apply training to a neural network, and output predictions. Jason covers the system design, the rationale behind it, and the implications of using streaming data with deep learning and artificial intelligence.

Machine learning at Intuit: Five delightful use cases

Calum Murray (Intuit)

Download slides (1-PPTX)

Download slides (2-PPTX)

Machine learning-based applications are becoming the new norm. Calum Murray shares five use cases at Intuit that use the data of over 60 million users to create delightful experiences for customers by solving repetitive tasks, freeing them up to spend time more productively or solving very complex tasks with simplicity and elegance.

Machine learning platform lifecycle management

Hope Wang (Intuit)

Download slides (1-PPTX)

Download slides (2-PPTX)

A machine learning platform is not just the sum of its parts; the key is how it supports the model lifecycle end to end. Hope Wang explains how to manage various artifacts and their associations, automate deployment to support the lifecycle of a model, and build a cohesive machine learning platform.

Machine-learned model quality monitoring in fast data and streaming applications

Emre Velipasaoglu (Lightbend)

Download slides (PDF)

Most machine learning algorithms are designed to work on stationary data, but real-life streaming data is rarely stationary. Models lose prediction accuracy over time if they are not retrained. Without model quality monitoring, retraining decisions are suboptimal and costly. Emre Velipasaoglu reviews monitoring methods, focusing on their applicability in fast data and streaming applications.
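One simple way to picture model quality monitoring on a stream is a sliding-window accuracy check. The sketch below is a generic illustration, not one of the methods reviewed in the session; the `AccuracyMonitor` class, the window size, and the threshold are hypothetical, and it assumes true labels eventually arrive for each prediction.

```python
from collections import deque


class AccuracyMonitor:
    """Track accuracy over the most recent `window` labeled predictions."""

    def __init__(self, window: int, threshold: float) -> None:
        self.outcomes = deque(maxlen=window)  # True where prediction was right
        self.threshold = threshold

    def update(self, predicted, actual) -> None:
        self.outcomes.append(predicted == actual)

    def accuracy(self) -> float:
        if not self.outcomes:
            return 1.0  # no evidence yet, so assume the model is healthy
        return sum(self.outcomes) / len(self.outcomes)

    def needs_retraining(self) -> bool:
        # Only decide once the window is full, to avoid noisy early alarms.
        window_full = len(self.outcomes) == self.outcomes.maxlen
        return window_full and self.accuracy() < self.threshold


monitor = AccuracyMonitor(window=4, threshold=0.75)

# Four correct predictions: the window is full and accuracy is 1.0.
for predicted, actual in [(1, 1), (0, 0), (1, 1), (1, 1)]:
    monitor.update(predicted, actual)
healthy = monitor.needs_retraining()

# Two wrong predictions push out two correct ones: accuracy falls to 0.5.
for predicted, actual in [(1, 0), (0, 1)]:
    monitor.update(predicted, actual)
drifted = monitor.needs_retraining()
```

The window size trades reaction speed against noise: a short window reacts quickly to drift but raises more false alarms.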

Macroeconomic news sentiment: Enhanced risk assessment for sovereign bond spreads

Christina Erlwein-Sayer (OptiRisk Systems)

Download slides (PPTX)

Christina Erlwein-Sayer explains how to enhance the modeling and forecasting of sovereign bond spreads by considering quantitative information gained from macroeconomic news sentiment, using a number of large news analytics datasets.

Measure what matters: How your measurement strategy can reduce opex

Radhika Dutt (Radical Product), Geordie Kaytes (Fresh Tilled Soil), Nidhi Aggarwal (Radical Product)

View slides

These days it’s easy for companies to say, "We measure everything!" The problem is, most popular metrics may not be appropriate or relevant for your business. Measurement isn’t free and should be done strategically. Radhika Dutt, Geordie Kaytes, and Nidhi Aggarwal explain how to align measurement with your product strategy so you can measure what matters for your business.

Multi-data center and multitenant durable messaging with Apache Pulsar

Ivan Kelly (Streamlio)

Download slides (PDF)

Ivan Kelly offers an overview of Apache Pulsar, a durable, distributed messaging system, underpinned by Apache BookKeeper, that provides the enterprise features necessary to guarantee that your data is where it should be and only accessible by those who should have access.

Operationalize deep learning models for fraud detection with Azure Machine Learning Workbench

Francesca Lazzeri (Microsoft), Jaya Susan Mathew (Microsoft)

Download slides (PDF)

Advancements in computing technologies and ecommerce platforms have amplified the risk of online fraud, which results in billions of dollars of loss for the financial industry. This trend has urged companies to consider AI techniques, including deep learning, for fraud detection. Francesca Lazzeri and Jaya Mathew explain how to operationalize deep learning models with Azure ML to prevent fraud.

Practical advice for driving down the cost of cloud big data platforms

Christopher Royles (Cloudera)

Download slides (PDF)

Big data and cloud deployments return huge benefits in flexibility and economics but can also result in runaway costs and failed projects. Drawing on his production experience, Christopher Royles shares tips and best practices for determining initial sizing, strategic planning, and longer-term operation, helping you deliver an efficient platform, reduce costs, and implement a successful project.

Putting AI to work for business: It's a journey. (sponsored by IBM)

Carlo Appugliese (IBM)

Download slides (PDF)

What was once science fiction has now become reality as multiple AI consumer-based solutions have hit the market over the last few years. In turn, consumers have become more comfortable interacting with AI. But has AI really lived up to the hype? For consumers, perhaps not yet. However, AI for business is a different (and more valuable) animal. Carlo Appugliese details how business can put AI to work.

Real-time deep learning on video streams

Eran Avidan (Intel)

Download slides (PDF)

Deep learning is revolutionizing many domains within computer vision, but doing real-time analysis is challenging. Eran Avidan offers an overview of a novel architecture based on Redis, Docker, and TensorFlow that enables real-time analysis of high-resolution streaming video.

Rendezvous with AI

Ted Dunning (MapR, now part of HPE)

Download slides (PPTX)

Ted Dunning offers an overview of the rendezvous architecture, which is geared to deal with much of the complexity involved in deploying models to production, thus allowing more time to be spent thinking and doing real data science. Ted covers the ideas behind the architecture, practical scenarios, and advantages and disadvantages of the architecture.

Running data analytic workloads in the cloud

Eugene Fratkin (Cloudera), Vinithra Varadharajan (Cloudera), Mael Ropars (Cloudera), Jason Wang (Cloudera)

Download slides (PDF)

Vinithra Varadharajan, Jason Wang, Eugene Fratkin, and Mael Ropars detail new paradigms to effectively run production-level pipelines with minimal operational overhead. Join in to learn how to remove barriers to data discovery, metadata sharing, and access control.

Scaling data science (teams and technologies)

David Asboth (Cox Automotive Data Solutions), Shaun McGirr (Cox Automotive Data Solutions)

Download slides (PPTX)

Cox Automotive is the world’s largest automotive service organization, which means it can combine data from across the entire vehicle lifecycle. Cox is on a journey to turn this data into insights. David Asboth and Shaun McGirr share their experience building up a data science team at Cox and scaling the company's data science process from laptop to Hadoop cluster.

Scaling the AI hierarchy of needs with TensorFlow, Spark, and Hops

Jim Dowling (Logical Clocks)

Download slides (PDF)

Distributed deep learning can increase the productivity of AI practitioners and reduce time to market for training models. Hadoop can fulfill a crucial role as a unified feature store and resource management platform for distributed deep learning. Jim Dowling offers an introduction to writing distributed DL applications, covering TensorFlow and Apache Spark frameworks that make distribution easy.

Stream processing for the practitioner: Blueprints for common stream processing use cases with Apache Flink

Aljoscha Krettek (Ververica)

Download slides (PPTX)

Aljoscha Krettek offers an overview of the modern stream processing space, details the challenges posed by stateful and event-time-aware stream processing, and shares core archetypes ("application blueprints") for stream processing drawn from real-world use cases with Apache Flink.

StreamDM: Advanced data science with Spark Streaming

Heitor Murilo Gomes (Télécom ParisTech), Albert Bifet (Télécom ParisTech)

Download slides (PDF)

Heitor Murilo Gomes and Albert Bifet offer an overview of StreamDM, a real-time analytics open source software library built on top of Spark Streaming, developed at Huawei's Noah’s Ark Lab and Télécom ParisTech.

The cloud is expensive, so build your own redundant Hadoop clusters.

Stuart Pook (Criteo)

Download slides (PDF)

Criteo has a production cluster of 2K nodes running over 300K jobs a day in the company's own data centers. These clusters were meant to provide a redundant solution to Criteo's storage and compute needs. Stuart Pook offers an overview of the project, shares challenges and lessons learned, and discusses Criteo's progress in building another cluster to survive the loss of a full DC.

The Paradise Papers: Behind the scenes with the ICIJ

Pierre Romera (International Consortium of Investigative Journalists (ICIJ))

Watch the keynote

Last November, the International Consortium of Investigative Journalists (ICIJ) published the Paradise Papers, a yearlong investigation on the offshore dealings of multinational companies and the wealthy. Pierre Romera offers a behind-the-scenes look into the process and explores the challenges in handling 1.4 TB of data and making it available securely to journalists all over the world.

The ultimate data scientist's playground: Building a multipetabyte analytic infrastructure for cyber defense

Lee Blum (Verint Systems)

Download slides (PDF)

Lee Blum offers an overview of Verint's large-scale cyber-defense system built to serve its data scientists with versatile analytic operations on petabytes of data and trillions of records, covering the company's extremely challenging use case, decision considerations, major design challenges, tips and tricks, and the system’s overall results.

Understanding Spark tuning with auto-tuning; or, Magical spells to stop your pager going off at 2:00am

Holden Karau (Independent), Rachel B Warren (Salesforce Einstein)

Download slides (PDF)

Apache Spark is an amazing distributed system, but part of the bargain we've made with the infrastructure daemons involves providing the correct set of magic numbers (aka tuning) or our jobs may be eaten by Cthulhu. Holden Karau, Rachel Warren, and Anya Bida explore auto-tuning jobs using systems like Apache Beam, Mahout, and internal Spark ML jobs as workloads.

Using Alluxio as a fault-tolerant pluggable optimization component of JD.com's compute frameworks

Baolong Mao (JD.com), Yiran Wu (JD.com), Yupeng Fu (Alluxio)

Download slides (PPSX)

Mao Baolong, Yiran Wu, and Yupeng Fu explain how JD.com uses Alluxio to provide support for ad hoc and real-time stream computing, using Alluxio-compatible HDFS URLs and Alluxio as a pluggable optimization component. To give just one example, one framework, JDPresto, has seen a 10x performance improvement on average.

Using Siamese CNNs for removing duplicate entries from real estate listing databases

Sergey Ermolin (Intel), Olga Ermolin (MLS Listings)

Download slides (PDF)

Aggregation of geospecific real estate databases results in duplicate entries for properties located near geographical boundaries. Sergey Ermolin and Olga Ermolin detail an approach for identifying duplicate entries via the analysis of images that accompany real estate listings that leverages a transfer learning Siamese architecture based on VGG-16 CNN topology.

Web analytics at scale with Druid at Naver

Jason Heo (Naver), Dooyong Kim (Navercorp)

Download slides (1-PDF)

View slides

Naver.com is the largest search engine in Korea, with a 70% share of the Korean search market, and it handles billions of pages and events every day. Jason Heo and Dooyong Kim offer an overview of Naver's web analytics system, built with Druid.

You call it data lake; we call it Data Historian.

Naghman Waheed (Bayer Crop Science), Brian Arnold (Bayer)

Download slides (PDF)

There are a number of tools that make it easy to implement a data lake. However, most lack the essential features that prevent your data lake from turning into a data swamp. Naghman Waheed and Brian Arnold offer an overview of Monsanto's Data Historian platform, which can ingest, store, and access datasets without compromising ease of use, governance, or security.

You’re doing it wrong: How Zoomdata rearchitected streaming

Erin Recachinas (Zoomdata)

View slides

The value of real-time streaming analytics with historical data is immense. Big data application Zoomdata updates historical dashboards in real time without complex reaggregations, but streaming in the age of the IoT requires handling of data in volumes not seen in traditional feeds. Erin Recachinas explains how Zoomdata moved to a scalable microservice architecture for streaming sources.
