Presented By O'Reilly and Cloudera
Make Data Work
March 28–29, 2016: Training
March 29–31, 2016: Conference
San Jose, CA

Speaker slides & video

Presentation slides will be made available after the session has concluded and the speaker has given us the files. Check back if you don't see the file you're looking for—it might be available later! (However, please note some speakers choose not to share their presentations.)

Donald Miner (Miner & Kasch)
Figuring out Hadoop is daunting. However, understanding a set of basic yet important principles is all you need to cut through the hype and make intelligent enterprise decisions. Donald Miner breaks down modern Hadoop into 10 important principles you need to know to understand what Hadoop is and how it is different from the old way of doing things.
Sébastien Pierre (FFunction)
Big data is great for feeding ML algorithms, but you quickly face a bandwidth issue when interfacing with humans. The brain is a fantastic information-processing machine and has an unparalleled, innate ability to detect patterns. Sébastien Pierre explains what designers can teach engineers about creating new ways to make large volumes of data understandable at the human level.
Bob Rogers (Intel)
Bob Rogers, Intel's chief data scientist for big data solutions, demonstrates the power of the question in analytics. Learn how different types of data, from cubes of structured data to live video streams from mobile systems, combine with analytical technology to inform the questions that can be answered.
Aaron Kalb (Alation)
A data catalog provides context to help data analysts, data scientists, and other data consumers (including those with little technical background) find a relevant dataset, determine if it can be trusted, understand what it means, and utilize it to make better products and better decisions. Aaron Kalb explores how enterprises build interfaces that make sourcing data as easy as shopping on Amazon.
Kostas Tzoumas (data Artisans)
Apache Flink is a full-featured streaming framework with high throughput, millisecond latency, strong consistency, support for out-of-order streams, and support for batch as a special case of streaming. Kostas Tzoumas gives an overview of Flink and its streaming-first philosophy, as well as the project roadmap and vision: fully unifying the worlds of “batch” and “streaming” analytics.
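A minimal, language-agnostic sketch of that "batch as a special case of streaming" idea, in plain Python rather than Flink's actual JVM APIs:

```python
# Illustrative only: plain Python, not Flink's actual (JVM-based) APIs.
# The same per-event logic serves both modes; a "batch" is just a stream
# that happens to end.
def word_count(stream):
    """Consume an event stream, finite or infinite, emitting running counts."""
    counts = {}
    for line in stream:
        for word in line.split():
            counts[word] = counts.get(word, 0) + 1
            yield word, counts[word]

# Batch: a bounded stream (a list) terminates, so final counts emerge.
print(list(word_count(["to be or not to be"])))

# Streaming: the identical function works on an unbounded generator
# (e.g., lines arriving from a socket or a Kafka topic).
```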
Doug Cutting (Cloudera)
2016 marks the 10th anniversary of Apache Hadoop. This birthday provides us an opportunity to celebrate, as well as the chance to reflect on how we got here and where we are going.
Tom Reilly (Cloudera), Alan Ross (Intel)
The cybersecurity landscape is quickly changing, and Apache Hadoop is becoming the analytics and data management platform of choice for cybersecurity practitioners. Tom Reilly and Alan Ross explain why organizations are turning toward the open source ecosystem to break down traditional cybersecurity analytics and data constraints in order to detect a new breed of sophisticated attacks.
Kathleen Ting (Cloudera), Vikram Srivastava (Cloudera), Darren Lo (Cloudera), Jordan Hambleton (Cloudera)
In this full-day tutorial, participants will get an overview of all aspects of successfully managing Hadoop clusters—from installation to configuration management, service monitoring, troubleshooting, and support integration—with an emphasis on production systems.
Jean-Marc Spaggiari (Cloudera), Kevin O'Dell (Rocana)
Most already know HBase, but many don't know that it can be coupled with other tools from the ecosystem to increase efficiency. Jean-Marc Spaggiari and Kevin O'Dell walk attendees through some real-life HBase use cases and demonstrate how they have been efficiently implemented.
Siddha Ganju (Nvidia)
Siddha Ganju explains how CERN uses machine-learning models to predict which datasets will become popular over time. This helps to replicate the datasets that are most heavily accessed, which improves the efficiency of physics analysis in CMS. Analyzing this data leads to useful information about the physical processes.
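A hedged sketch of the general approach (invented features and labels for illustration, not the actual CMS pipeline):

```python
# Hypothetical sketch of predicting dataset popularity from access-log
# features; the features, labels, and threshold are illustrative.
import numpy as np
from sklearn.ensemble import RandomForestClassifier

# Per-dataset features: [accesses_last_week, distinct_users, age_days]
X = np.array([[120, 15, 30], [3, 1, 400], [88, 9, 10], [1, 1, 900]])
y = np.array([1, 0, 1, 0])  # 1 = became popular in the following window

model = RandomForestClassifier(n_estimators=50, random_state=0).fit(X, y)
# Datasets predicted to become popular are candidates for extra replicas.
print(model.predict([[100, 12, 20]]))
```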
Andreas Schmidt (Blue Yonder)
While many companies struggle to adopt big data, a number of industry leaders are leapfrogging big data adoption by going straight to automating core business processes. Andreas Schmidt presents examples from leading European companies that have overcome cultural, technical, and scientific challenges and unlocked the potential of big data in an entirely different way.
Jacques Nadeau (Dremio)
There are (too?) many options for BI on Hadoop. Some are great at exploration, some are great at OLAP, some are fast, and some are flexible. Understanding the options and how they work with Hadoop systems is a key challenge for many organizations. Jacques Nadeau provides a survey of the main options, both traditional (Tableau, Qlik, etc.) and new (Platfora, Datameer, etc.).
Matt Olson (CenturyLink)
Software-defined networking (SDN) and network functions virtualization (NFV) hold tremendous potential to enable efficiency and flexibility in service delivery, but SDN/NFV environments are also highly complex and multilayered. Matt Olson explains why effective support for SDN/NFV services requires leveraging the tremendous amount of service and performance data streaming from the platform.
Neelesh Salian (Stitch Fix)
Spark deployments have been growing over the past year. Neelesh Srinivas Salian explores common issues observed in cluster environments running Apache Spark and offers guidelines for setting up a real-world environment when planning an Apache Spark deployment. Attendees can use these observations to improve the usability and supportability of Apache Spark in their projects.
Siva Raghupathy (Amazon Web Services), Manjeet Chayel (Amazon Web Services)
Analyzing real-time streams of data is becoming increasingly important to remain competitive. Siva Raghupathy and Manjeet Chayel guide attendees through some of the proven architectures for processing streaming data using a combination of cloud and open source tools such as Apache Spark. Watch a live demo and learn how you can easily scale your applications with Amazon Web Services.
Mario Inchiosa (Microsoft), Roni Burd (Microsoft)
Hadoop is famously scalable, as is cloud computing. R, the thriving and extensible open source data science software...not so much. Mario Inchiosa and Roni Burd outline how to seamlessly combine Hadoop, cloud computing, and R to create a scalable data science platform that lets you explore, transform, model, and score data at any scale from the comfort of your favorite R environment.
Sijie Guo (StreamNative)
DistributedLog is a high-performance replicated log service built on top of Apache BookKeeper that is the foundation of publish-subscribe at Twitter, serving traffic from transactional databases to real-time data analytic pipelines. Sijie Guo offers an overview of DistributedLog, detailing the technical decisions and challenges behind its creation and how it is used at Twitter.
Jayant Shekhar (Sparkflows Inc.), Amandeep Khurana (Cloudera), Krishna Sankar (U.S.Bank), Vartika Singh (Cloudera)
Jayant Shekhar, Amandeep Khurana, Krishna Sankar, and Vartika Singh guide participants through techniques for building machine-learning apps using Spark MLlib and Spark ML and demonstrate the principles of graph processing with Spark GraphX.
Adam Cheyer (Samsung)
As a technical founder at Siri, Sentient, and Viv Labs, Adam Cheyer has helped design and develop a number of intelligent systems solving real-world problems for hundreds of millions of users. Drawing on specific examples, Adam reveals some of the techniques he uses to maximize the impact of the AI technologies he employs.
Brandon Ballinger (Cardiogram), Johnson Hsieh (Cardiogram)
Each year, 15 million people suffer strokes, and at least a fifth of those are due to atrial fibrillation, the most common heart arrhythmia. Brandon Ballinger reports on a collaboration between UCSF cardiologists and ex-Google data scientists that detects atrial fibrillation with deep learning.
Emma McGrattan (Actian)
Hadoop can bring great value to businesses but also big headaches. Some solutions that provide SQL access to Hadoop data mean changing your business processes to overcome limitations in the technologies. Emma McGrattan explains how users can unlock tremendous business value through SQL-driven Hadoop solutions. Emma outlines what should be on your checklist and the pitfalls to avoid.
Joseph Sirosh (Microsoft), Kai Miller (Stanford University)
Joseph Sirosh offers a fascinating look into how brains connected to the cloud through sensors, combined with machine learning, could revolutionize a field of medicine.
Jake Porway (DataKind), Rachel Quint (Hewlett Foundation), Sue-Ann Ma, Jeremy Anderson (IBM)
So many of the data projects making headlines—from a new app for finding public services to a new probabilistic model for predicting weather patterns for subsistence farmers—are great accomplishments but don’t seem to have end users in mind. Discover how organizations are designing with, not for, people, accounting for what drives them in order to make long-lasting impact.
Jake Porway (DataKind), Daniella Perlroth (Lyra Health), Tim Hwang (ROFLCon / The Web Ecology Project), Lucy Bernholz (Stanford University)
So many of the data projects making headlines—from a new app for finding public services to a new probabilistic model for predicting weather patterns for subsistence farmers—are great accomplishments but don’t seem to have end users in mind. Discover how organizations are designing with, not for, people, accounting for what drives them in order to make long-lasting impact.
Matthew van Adelsberg
The Defense Advanced Research Projects Agency (DARPA) is synonymous with transformational change, developing the seemingly impossible into the practical. Matthew van Adelsberg demonstrates how collaborative teams of SMEs, data scientists, and engineers have been organized to achieve “DARPA hard” results for nearly a decade and offers insights into how companies can do the same.
Ian Andrews (Pivotal)
Pivotal’s Ian Andrews explores why delivering information in context is the key to competitive differentiation in the digital economy.
Alex Silva (Pluralsight)
Alex Silva outlines the implementation of a real-time analytics platform using microservices and a Scala stack that includes Kafka, Spark Streaming, Spray, and Akka. This infrastructure can process vast amounts of streaming data, ranging from video events to clickstreams and logs. The result is a powerful real-time data pipeline capable of flexible data ingestion and fast analysis.
Bill Schmarzo
Organizations do not need a big data strategy. They need a business strategy that incorporates big data. Most organizations lack a roadmap for using big data to uncover new business opportunities. Bill Schmarzo explains how to explore, justify, and plan big data projects with business management.
Michelangelo D'Agostino
Data scientists inhabit such an ever-changing landscape of languages, packages, and frameworks that it can be easy to succumb to tool fatigue. If this sounds familiar, you may have missed the increasing popularity of Linux containers in the DevOps world, in particular Docker. Michelangelo D'Agostino demonstrates why Docker deserves a place in every data scientist’s toolkit.
Eric Frenkiel (MemSQL)
The next evolution in the on-demand economy is in predictive analytics fueled by live streams of data—in effect knowing what customers want before they do. Eric Frenkiel explains how a real-time trinity of technologies—Kafka, Spark, and MemSQL—is enabling Uber and others to power their own revolutions with predictive apps and analytics.
Joey Echeverria (Rocana)
Real-time analysis starts with transforming raw data into structured records. Typically this is done with bespoke business logic custom written for each use case. Joey Echeverria presents a configuration-based, reusable library for data transformation that can be embedded in real-time stream-processing systems and demonstrates its real-world use cases with Apache Kafka and Apache Hadoop.
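A generic sketch of the configuration-driven idea, in plain Python rather than the library presented in the session: transformation rules live in data, so a new use case needs configuration rather than bespoke code.

```python
# Generic sketch of configuration-driven record transformation (not the
# session's actual library): each rule is data, not custom business logic.
import json

CONFIG = [
    {"op": "rename", "from": "ts", "to": "timestamp"},
    {"op": "cast",   "field": "status", "type": "int"},
    {"op": "drop",   "field": "debug"},
]

def transform(record, config=CONFIG):
    for rule in config:
        if rule["op"] == "rename":
            record[rule["to"]] = record.pop(rule["from"])
        elif rule["op"] == "cast":
            caster = {"int": int, "float": float}[rule["type"]]
            record[rule["field"]] = caster(record[rule["field"]])
        elif rule["op"] == "drop":
            record.pop(rule["field"], None)
    return record

raw = json.loads('{"ts": "2016-03-29", "status": "200", "debug": "x"}')
print(transform(raw))  # {'status': 200, 'timestamp': '2016-03-29'}
```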
Denise McInerney (Intuit)
The most valuable people in your organization combine business acumen with data savviness. But these data heroes are rare. Denise McInerney describes how she has empowered business users at Intuit to make better decisions with data and explains how you can do the same thing in your organization.
Ted Malaska (Capital One), Jeff Holoman (Cloudera)
Ted Malaska and Jeff Holoman explain how to go from zero to full-on time series and mutable-profile systems in 40 minutes. Ted and Jeff cover code examples of ingestion from Kafka and Spark Streaming and access through SQL, Spark, and Spark SQL to explore the underlying theories and design patterns that will be common for most solutions with Kudu.
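For flavor, a hedged PySpark sketch of the access side; the connector package, master address, and table name are assumptions that vary by Kudu release (early releases shipped the connector under org.kududb.*):

```python
# Hedged sketch of querying a Kudu table through Spark SQL (Spark 1.x era).
from pyspark.sql import SQLContext

sqlContext = SQLContext(sc)  # assumes an existing SparkContext `sc`
events = (sqlContext.read
          .format("org.apache.kudu.spark.kudu")
          .option("kudu.master", "kudu-master:7051")
          .option("kudu.table", "events")
          .load())
events.registerTempTable("events")
sqlContext.sql("SELECT user_id, COUNT(*) AS n FROM events GROUP BY user_id").show()
```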
Wes McKinney (Two Sigma Investments), Jacques Nadeau (Dremio)
Hadoop’s traditional batch technologies are quickly being supplanted by in-memory columnar execution to drive faster data-to-value. Wes McKinney and Jacques Nadeau provide an overview of in-memory columnar execution, survey key related technologies, including Kudu, Ibis, Impala, and Drill, and cover a sample use case using Ibis in conjunction with Apache Drill to deliver real-time conclusions.
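A small illustration of the in-memory columnar idea using pyarrow (Apache Arrow, a related project from the same community, not necessarily the tooling shown in the session):

```python
# Each column lives in a contiguous buffer: cheap to scan, and cheap to
# hand between engines without copying row by row.
import pandas as pd
import pyarrow as pa

df = pd.DataFrame({"symbol": ["AAPL", "MSFT"], "price": [101.2, 54.3]})
table = pa.Table.from_pandas(df)  # row-oriented frame -> columnar table
print(table.column("price"))      # one contiguous column of doubles
print(table.to_pandas())          # and back again for analysis
```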
Chuck Yarbrough (Pentaho), Mark Burnette (Pentaho, a Hitachi Group Company)
A major challenge in today’s world of big data is getting data into the data lake in a simple, automated way. Coding scripts for disparate sources is time consuming and difficult to manage. Developers need a process that supports disparate sources by detecting and passing metadata automatically. Chuck Yarbrough and Mark Burnette explain how to simplify and automate your data ingestion process.
Thomas Phelan (HPE BlueData), Joel Baxter (BlueData)
Thomas Phelan and Joel Baxter investigate the advantages and disadvantages of running specific Hadoop workloads in different infrastructure environments. Thomas and Joel then provide a set of rules to help users evaluate big data runtime environments and deployment options to determine which is best suited for a given application.
Arun Thangamani (CDK)
Traditional data-warehousing techniques are sometimes limited by the scalability of the implementation tools themselves. Arun Thangamani explains how the advanced architectural approaches taken by tools like Apache Phoenix and HBase enable new, highly scalable live-analytics solutions using the same traditional techniques and showcases a successful implementation at CDK.
Brian Clark (Objectivity), Marco Ippolito (CGG GeoSoftware)
Oil and gas organizations are at the forefront of big data, adopting technologies such as Hadoop and Spark to develop next-generation fusion systems. Brian Clark and Marco Ippolito introduce a case study from CGG, a builder of common data models to drive analytics of sensor data and associated metadata from fast-changing big data streams, to show how to derive richer value from big data assets.
Alex Gorelik (Waterline Data)
It is fashionable today to declare doom and gloom for the data lake. Alex Gorelik discusses best practices for Hadoop data lake success and provides real-world examples of successful data lake implementations in a non-vendor-specific talk.
Robert Grossman (University of Chicago)
There is a big difference between running a machine-learning algorithm manually from time to time and building a production system that runs thousands of machine-learning algorithms each day on petabytes of data, while also dealing with all the edge cases that arise. Robert Grossman discusses some of the lessons learned when building such a system and explores the tools that made the job easier.
Brandon Rohrer (Microsoft)
Modern houses and robots have a lot in common. Both have a lot of sensors and have to make a lot of decisions. However, unlike houses, robots adapt and perform helpful tasks. Brandon Rohrer details an algorithm specifically designed to help houses, buildings, roads, and stores learn to actively help the people that use them.
Yael Garten (LinkedIn)
You’ve decided you need data scientists. You know who to hire. Now, what do you do with them? Yael Garten offers examples of how companies like LinkedIn use data to make business and product decisions. Yael reviews the spectrum of data science and discusses the culture, processes, and tools needed to transform companies into data-driven organizations.
Rohit Jain (Esgyn)
Companies are looking for a single database engine that can address all their varied needs—from transactional to analytical workloads, against structured, semistructured, and unstructured data, leveraging graph databases, document stores, text search engines, column stores, key value stores, and wide column stores. Rohit Jain discusses the challenges one faces on the path to this nirvana.
Jeffrey Shmain (Cloudera), Mohammad Quraishi (Cigna)
How do you implement Apache Hadoop in a large healthcare company with a mature data-analysis infrastructure? Jeffrey Shmain and Mohammad Quraishi describe Cigna's journey toward big data and Hadoop, including an overview of new Hadoop capabilities like heterogeneous data integration and large-scale machine learning.
Moty Fania (Intel)
Moty Fania shares Intel’s IT experience implementing an on-premises big data IoT platform for internal use cases. This unique platform was built on top of several open source technologies and enables highly scalable stream analytics with a stack of algorithms such as multisensor change detection, anomaly detection, and more.
Sreeni Iyer (Quad Analytix), Anurag Bhardwaj (Quad Analytix)
Typically, 8–10% of product URLs on ecommerce sites are misclassified. Sreeni Iyer and Anurag Bhardwaj discuss a machine-learning-based solution that relies on an innovative fusion of text- and image-based classifiers, along with a human in the loop to handle edge cases, to automatically classify product URLs according to a canonical taxonomy with a high F-score.
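An illustrative sketch of classifier fusion by weighted soft voting, with made-up weights and probabilities rather than Quad Analytix's production logic:

```python
# Combine per-class probabilities from a text model and an image model;
# weights, threshold, and inputs are invented for illustration.
import numpy as np

def fuse(p_text, p_image, w_text=0.6, w_image=0.4, threshold=0.8):
    """Weighted soft vote; route low-confidence URLs to human review."""
    p = w_text * p_text + w_image * p_image
    label, conf = int(np.argmax(p)), float(np.max(p))
    return (label, conf) if conf >= threshold else ("needs_human_review", conf)

# Per-class probabilities for one product URL from the two models:
print(fuse(np.array([0.9, 0.1]), np.array([0.7, 0.3])))    # (0, 0.82)
print(fuse(np.array([0.55, 0.45]), np.array([0.4, 0.6])))  # human review
```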
Sumeet Singh (Yahoo), Mridul Jain (Yahoo)
Building a real-time monitoring service that handles millions of custom events per second while satisfying complex rules, varied throughput requirements, and numerous dimensions simultaneously is a complex endeavor. Sumeet Singh and Mridul Jain explain how Yahoo approached these challenges with Apache Storm Trident, Kafka, HBase, and OpenTSDB and discuss the lessons learned along the way.
Jack Norris (MapR Technologies)
Big data is not limited to reporting and analysis; increasingly, companies are differentiating themselves by acting on data in real time. But what does "real time" really mean? Jack Norris discusses the challenges of coordinating data flows, analysis, and integration at scale to truly impact business as it happens.
Megan Price (Human Rights Data Analysis Group)
Megan Price demonstrates how machine-learning methods help us determine what we know, and what we don't, about the ongoing conflict in Syria. Megan then explains why these methods can be crucial to better understand patterns of violence, enabling better policy decisions, resource allocation, and ultimately, accountability and justice.
Roopa Tangirala (Netflix)
Roopa Tangirala details Netflix's migration from Oracle to Cassandra, covering the problems encountered, what worked and what didn't, and lessons learned along the way.
Helena Edelson (Apple), Evan Chan (Tuplejump)
Developers who want both streaming analytics and ad hoc, OLAP-like analysis have often had to develop complex architectures such as Lambda. Helena Edelson and Evan Chan highlight a much simpler approach using the NoLambda stack (Apache Spark/Scala, Mesos, Akka, Cassandra, Kafka) plus FiloDB, a new entrant to the distributed-database world, which combines streaming and ad hoc analytics.
Paula Poundstone (Star of NPR's #1 radio show, "Wait Wait...Don't Tell Me")
Paula Poundstone isn’t just any comedian. After years of justly criticizing and questioning the purpose of the many studies used for questions on NPR’s #1 show, Wait Wait...Don’t Tell Me, on which she's a popular panelist, she’s here to explore what we can learn about asking the right questions from a unique critique of published behavioral research.
David Beyer (Amplify Partners)
Over the past decade, machine learning has become intertwined with newer, Internet-born businesses. This despite the fact that the vast majority of global GDP turns on larger, less visible industries like energy and construction. David Beyer explores the ways these backbone industries are adopting machine-intelligent applications and the trends underlying this shift.
Adam Kocoloski
As the volume and variety of data continue to grow, organizations have the opportunity to transform their industries and professions, but companies are grappling with how to deliver innovation. Adam Kocoloski shares his experience around this market shift and challenges attendees to join his mission of contributing to the community and investing in the power of open source and the cloud.
Chris DuBois (Dato), Brian Kent (Dato), Srikrishna Sridhar (Dato), Piotr Teterwak (Dato)
This hands-on tutorial provides a quick start to building intelligent business applications using machine learning. Learn about machine-learning basics, feature engineering, anomaly detection, recommender systems, and deep learning as you are guided through all the steps of prototyping and production: data cleaning, feature engineering, model building and evaluation, and deployment.
Todd Palino (LinkedIn), Gwen Shapira (Confluent)
Apache Kafka lies at the heart of the largest data pipelines, handling trillions of messages and petabytes of data every day. Learn the right approach for getting the most out of Kafka from the experts at LinkedIn and Confluent. Todd Palino and Gwen Shapira explore how to monitor, optimize, and troubleshoot performance of your data pipelines—from producer to consumer, development to production.
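As a taste of the tuning surface, a hedged example of producer settings that commonly matter, using the kafka-python client (the tooling discussed in the session is Java-based; broker address and topic are placeholders):

```python
from kafka import KafkaProducer

producer = KafkaProducer(
    bootstrap_servers="broker1:9092",
    acks="all",            # durability: wait for all in-sync replicas
    retries=5,             # ride out transient broker failures
    linger_ms=20,          # wait up to 20 ms to batch records together
    batch_size=64 * 1024,  # bigger batches amortize request overhead
    compression_type="gzip",
)
producer.send("events", b'{"user": 42, "action": "click"}')
producer.flush()
```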
Travis Oliphant (Continuum Analytics)
Despite Python's popularity throughout the data-engineering and data science workflow, the principles behind its performance and scaling behavior are less understood. Travis Oliphant explains best practices and modern tools to scale Python to larger-than-memory and distributed workloads without sacrificing its ease of use or being forced to adopt heavyweight frameworks.
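A minimal sketch using Dask (also from Continuum), one tool that offers a pandas-like API over larger-than-memory, partitioned data; the file glob is a placeholder:

```python
import dask.dataframe as dd

df = dd.read_csv("data/logs-*.csv")        # lazy, partitioned load
daily = df.groupby("date")["bytes"].sum()  # builds a task graph only
print(daily.compute())                     # executes across cores/workers
```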
Kim Scott (Radical Candor, Inc.)
Bad bosses make people miserable. They kill innovation, stifle growth, increase costs, and create instability. Well-meaning people become bad bosses without even realizing it. Great bosses have relationships with employees and are sources of growth and stability for both individuals and companies. Kim Scott outlines three principles for approaching the relationship between a boss and their team.
Bolke de Bruin (ING), Hylke Hendriksen (ING)
If you consider user click paths a process, you can apply process mining. Process mining models users based on their actual behavior, which allows us to compare new clicks with modeled behavior and report any inconsistencies. Bolke de Bruin and Hylke Hendriksen explain how ING implemented process mining on Spark Streaming, enabling real-time fraud detection.
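A toy sketch of the underlying idea (not ING's implementation): learn first-order transition probabilities from historical click paths, then flag new transitions the model considers too unlikely.

```python
# Invented click paths and threshold, for illustration only.
from collections import defaultdict

def fit(paths):
    counts = defaultdict(lambda: defaultdict(int))
    for path in paths:
        for a, b in zip(path, path[1:]):
            counts[a][b] += 1
    return {a: {b: n / sum(nxt.values()) for b, n in nxt.items()}
            for a, nxt in counts.items()}

model = fit([["login", "accounts", "pay"],
             ["login", "accounts", "logout"],
             ["login", "accounts", "pay"]])

def suspicious(a, b, min_p=0.05):
    """Flag a click transition the modeled behavior makes implausible."""
    return model.get(a, {}).get(b, 0.0) < min_p

print(suspicious("login", "pay"))  # True: never seen directly -> flag
```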
Ted Dunning (MapR)
Application messaging isn’t new—solutions include IBM MQ, RabbitMQ, and ActiveMQ. Apache Kafka is a high-performance, high-scalability alternative that integrates well with Hadoop. Can a modern distributed messaging system like Kafka be considered a legacy replacement, or is it purely complementary? Ted Dunning outlines Kafka's architectural benefits and tradeoffs to find the answer.
Dean Wampler (Lightbend)
The success of Apache Spark is bringing developers to Scala. For big data, the JVM uses memory inefficiently, causing significant GC challenges. Spark's Project Tungsten fixes these problems with custom data layouts and code generation. Dean Wampler gives an overview of Spark, explaining ongoing improvements and what we should do to improve Scala and the JVM for big data.
Kelvin Chu (Uber), Evan Richards (Uber)
Schema plays a key role in the Hadoop architecture at Uber. Kelvin Chu and Evan Richards explain why schema is important and how it can make your Hadoop and Spark application more reliable and efficient.
Grega Kešpret (Celtra Inc.)
Celtra provides a platform for customers like Porsche and Fox to create, track, and analyze digital display advertising. Celtra's platform processes billions of ad events daily to give analysts fast and easy access to reports and ad hoc analytics. Grega Kešpret outlines Celtra’s data-pipeline challenges and explains how it solved them by combining Snowflake's cloud data warehouse with Spark.
Scott Donaldson (FINRA), Matt Cardillo (FINRA)
Scott Donaldson and Matt Cardillo detail the security measures and system architecture needed to bring alive a multipetabyte data warehouse via interactive analytics and directed graphs from several trillions of market events, using HBase, EMR, Hive, Redshift, and S3 technologies in a cost-efficient manner.
Phillip Radley (BT)
Phillip Radley explores how to use an “accumulation of marginal gains” approach to achieve success with an Apache Hadoop-based enterprise data hub (EDH), drawing on a set of design patterns built up over five years establishing BT’s EDH.
Jana Eggers (Nara Logics)
We hear about AI almost every day now. Opinions seem split between "impending doom" and "superintelligence will save the human race." Jana Eggers offers the real deal on AI, explaining what's hype, what isn't, and what we can do about it.
Keith Manthey (Dell EMC)
Many companies have created extremely powerful Hadoop use cases with highly valuable outcomes. The diverse adoption and application of Hadoop is producing an extremely robust ecosystem. However, teams often create silos around their Hadoop, forgetting some of the hard-learned lessons IT has gained over the years. Keith Manthey discusses one such often overlooked feature—governance.
Kaz Sato (Google), Amy Unruh (Google)
Kazunori Sato and Amy Unruh explore how you can use TensorFlow to drive large-scale distributed machine learning against your analytic data sitting in Google BigQuery, with data preprocessing driven by Dataflow (now Apache Beam). Kazunori and Amy dive into practical examples of how these technologies can work together to enable a powerful workflow for distributed machine learning.
Rajat Monga (Google)
TensorFlow is an open source software library for numerical computation with a focus on machine learning. Rajat Monga offers an introduction to TensorFlow and explains how to use it to train and deploy machine-learning models to make your next application smarter.
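A minimal linear-regression example in the graph-based API TensorFlow used around the time of this talk (available as tf.compat.v1 in modern releases):

```python
import tensorflow as tf

x = tf.placeholder(tf.float32)
y = tf.placeholder(tf.float32)
w = tf.Variable(0.0)
b = tf.Variable(0.0)
loss = tf.reduce_mean(tf.square(w * x + b - y))  # mean squared error
train = tf.train.GradientDescentOptimizer(0.01).minimize(loss)

with tf.Session() as sess:
    sess.run(tf.global_variables_initializer())
    for _ in range(200):
        sess.run(train, {x: [1, 2, 3, 4], y: [2, 4, 6, 8]})
    print(sess.run([w, b]))  # w approaches 2.0, b approaches 0.0
```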
Chris Rawles (Pivotal)
The Internet of Things (IoT) continues to provide value and hold promise for both the consumer and enterprise alike. To succeed, an IoT project must concern itself with how to ingest data, build actionable models, and react in real time. Chris Rawles describes approaches to addressing these concerns through a deep dive into an interactive demo centered around classification of human activities.
Noah Iliinsky (Amazon Web Services)
Noah Iliinsky surveys the state of visualization, outlines the major trends in the field, and explores the directions that visualization is headed. Noah also dives into the assorted tool domains—from enterprise to desktop to code-based—and discusses the pros and cons and use cases of each.
Julia Galef (Center for Applied Rationality)
Julia Galef explores why Bayesian thinking is so different from what we do by default and outlines the most important principles of thinking like a Bayesian.
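At the core is Bayes' theorem. A worked example with illustrative numbers (a condition with 1% prevalence, a test with 90% sensitivity and a 5% false-positive rate) shows why base rates matter: even after a positive test, the chance of having the condition is only about 15%.

```latex
% Bayes' theorem, evaluated with the illustrative numbers above.
P(H \mid E) = \frac{P(E \mid H)\,P(H)}{P(E \mid H)\,P(H) + P(E \mid \neg H)\,P(\neg H)}
            = \frac{0.9 \times 0.01}{0.9 \times 0.01 + 0.05 \times 0.99}
            \approx 0.15
```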
Pratik Verma (BlueTalon), Paulo Pereira (GE)
Pratik Verma and Paulo Pereira share three security architecture principles for Hadoop to protect sensitive data without disrupting users: modifying requests to filter content makes security transparent to users; centralizing data-access decisions and distributing enforcement makes security scalable; and using metadata instead of files or tables ensures systematic protection of sensitive data.
Ilya Ganelin (Capital One Data Innovation Lab)
What if we have reached the point where open source can handle massively difficult streaming problems with enterprise-grade durability? Ilya Ganelin presents Capital One’s novel solution for real-time decisioning on Apache Apex. Ilya shows how Apex provides unique capabilities that ensure less than 2 ms latency in an enterprise-grade solution on Hadoop.
John Hugg (VoltDB)
In the race to pair streaming systems with stateful systems, the winners will be stateful systems that process streams natively. These systems remove the burden on application developers to be distributed systems experts and enable new applications to be both powerful and robust. John Hugg describes what’s possible when integrated systems apply a transactional approach to event processing.
John Belchamber (Telefónica), Arturo Canales (Telefónica)
Increasing competition and technological change are impelling the telco industry toward a new model of analytics. Telefónica has been at the forefront of this change, driving business transformation to a digital telco. John Belchamber and Arturo Canales tell the story of that transformation and detail the pitfalls and challenges faced by teams looking to follow a similar journey.
Bruce Andrews (US Department of Commerce)
US Deputy Secretary of Commerce Bruce Andrews explores using commerce data to fuel innovation.
Alyosha Efros (UC Berkeley)
Alyosha Efros discusses using computer vision to understand big visual data.
Bob Levy (Virtual Cove, Inc.)
Ever want a superpower? Try increasing the bandwidth between your data and brain. Bob Levy explains how to pack higher dimensionality into your data exploration, enabled by virtual and augmented reality, and outlines what this means for the future of business.
Martin Yip (VMware), Justin Murray (VMware)
Martin Yip and Justin Murray explore the benefits of virtualization of Hadoop on vSphere and delve into three different examples of real-world deployments—at small, medium, and large scales—to demonstrate how enterprises are currently deploying Hadoop differently on virtual machines.
Aneesh Karve (Quilt)
Seemingly harmless choices in visualization, design, and content selection can distort your data and lead to false conclusions. Aneesh Karve presents a framework for identifying and overcoming these distortions by drawing upon research in human perception, focus and context, and mobile design.
Mike Lee Williams (Cloudera Fast Forward Labs)
Machines are not objective, and big data is not fair. Mike Lee Williams uses sentiment analysis to show that supervised machine learning has the potential to amplify the voices of the most privileged people in society, violate the spirit and letter of civil rights law, and make your product suck.
Michael Franklin (AMPLab/UC Berkeley)
Michael Franklin offers an overview of the Berkeley Data Analytics Stack, outlines the current directions it's taking, and settles once and for all how BDAS should be pronounced.
Benedikt Koehler (DataLion)
Benedikt Koehler offers approaches to analyzing and visualizing bitcoin data—accessing and downloading the blockchain, transforming the data into a networked data format, identifying hubs and clusters, and visualizing the results as dynamic network graphs—so that typical patterns and anomalies can quickly be identified.
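A hedged sketch of the network step using networkx, with made-up transactions standing in for parsed blockchain data:

```python
# Build a directed graph of bitcoin flows (addresses as nodes) and
# surface hub candidates via weighted PageRank. Edges are invented.
import networkx as nx

G = nx.DiGraph()
G.add_weighted_edges_from([
    ("addr_a", "addr_b", 1.5),   # (from, to, BTC amount)
    ("addr_a", "addr_c", 0.2),
    ("addr_d", "addr_b", 3.1),
])
hubs = nx.pagerank(G, weight="weight")
print(sorted(hubs.items(), key=lambda kv: -kv[1])[:2])  # top hubs
```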