Presented By O'Reilly and Cloudera
Make Data Work
Sept 29–Oct 1, 2015 • New York, NY
 
1 E8 / 1 E9
1:15pm How data science helps prevent churn at Avira, a 100-million user company Iulia Pasov (Avira), Călin-Andrei Burloiu (Avira)
1:35pm Probabilistic programming in data science Thomas Wiecki (Quantopian)
2:05pm Tackling machine learning complexity for data curation Ihab Ilyas (University of Waterloo)
2:25pm Learning to love Bayesian statistics Allen Downey (Olin College of Engineering)
2:55pm We need better maps: Smarter spatial clustering at the city level Brett Goldstein (University of Chicago)
1 E10 / 1 E11
11:20am How Hadoop is powering Walmart’s data-driven business Jeremy King (Walmart Global eCommerce)
1:15pm What can Big Pharma teach us about Wall Street? What can Wall Street teach us about Big Pharma? Joe Klobusicky (Geisinger Health System), Ali Habib (Northwestern Feinberg School of Medicine), Ekaterina Volkova (Cornell University)
2:55pm How women are conquering the S&P 500 Karen Rubin (Quantopian)
4:35pm Science fiction to product: Data-driven development Micha Gorelick (Fast Forward Labs)
1 E12 / 1 E13
11:20am Leverage data analytics to reduce human space mission risks Haden Land (Lockheed Martin IS&GS), Jason Loveland (Lockheed Martin)
2:05pm Big data, small internet: How to circumnavigate your information Raymond Collins (TE Connectivity), Scott Sokoloff (Orderup)
4:35pm Re-engineering legacy analytics solutions with big data Rosaria Silipo (KNIME.com AG)
1 E16 / 1 E17
11:20am Modern query processing with columnar formats: The best is yet to come Henry Robinson (Cloudera), Zuo Wang (Wanda), Arthur Peng (Intel)
2:55pm Great debate: Big data will live in the cloud Alistair Croll (Solve For Interesting), Joseph Adler (Facebook), Margaret Dawson (Red Hat), Joseph Sirosh (Microsoft), Evan Prodromou (Fuzzy.io)
4:35pm Should you trust your money to a robot? Vasant Dhar (NYU)
1 E18 / 1 E19
11:20am Big data at Netflix: Faster and easier Kurt Brown (Netflix)
2:05pm Paying the technical debt of machine learning: Managing ML models in production Carlos Guestrin (Apple | University of Washington)
1 E20 / 1 E21
11:20am What's new in Spark Streaming - a technical overview Tathagata Das (Databricks)
1:15pm Netflix: Integrating Spark at petabyte scale Daniel Weeks (Netflix)
2:05pm First-ever scalable, distributed deep learning architecture using Spark and Tachyon Christopher Nguyen (Arimo), Vu Pham (Adatao, Inc), Michael Bui (Adatao, Inc.)
2:55pm Spark on Mesos Dean Wampler (Lightbend)
4:35pm How Spark is working out at Comcast scale Sridhar Alla (BlueWhale), Jan Neumann (Comcast)
3D 02/11
11:20am Elastic stream processing without tears Michael Hausenblas (AWS)
2:55pm Building a real-time analytics stack with Kafka, Samza, and Druid Fangjin Yang (Imply), Gian Merlino (Imply)
4:35pm Oulu Smart City pilot Susanna Pirttikangas (University of Oulu)
3D 03/10
11:20am From profiling to analysis: Designing visualization tools for purpose Jeffrey Heer (Trifacta | University of Washington), Jock Mackinlay (Tableau)
1:15pm What have you done!? How to visualize methods and models for decision makers Michael Freeman (University of Washington)
4:35pm Designing happiness with data Pamela Pavliscak (SoundingBox)
3D 04/09
1:15pm Preventing a big data security breach Sam Heywood (Cloudera), Nick Curcuru (Mastercard), Ritu Kama (Intel)
2:05pm Data democratization versus data governance Peter Guerra (Booz Allen Hamilton)
4:35pm Big data governance Steven Totman (Cloudera), Mark Donsky (Okera), Kristi Cunningham (Capital One), Ben Harden (CapTech Consulting)
3D 05/08
11:20am Ask me anything: Hadoop application architectures Gwen Shapira (Confluent), Jonathan Seidman (Cloudera), Ted Malaska (Capital One), Mark Grover (Lyft)
1:15pm Ask me anything: Apache Spark Patrick Wendell (Databricks), Reynold Xin (Databricks)
2:05pm Ask me anything: Hadoop operations for production systems Miklos Christine (Databricks), Kathleen Ting (Cloudera), Philip Zeyliger (Cloudera), Philip Langdale (Cloudera)
2:55pm Ask me anything: Developing a modern enterprise data strategy John Akred (Silicon Valley Data Science), Julie Steele (Manifold), Scott Kurth (Silicon Valley Data Science)
4:35pm Ask me anything: Hadoop's storage gap - resolving transactional access/analytic performance tradeoffs with Kudu Todd Lipcon (Cloudera), JD Cryans (Cloudera), David Alves (Cloudera), Mike Percy (Cloudera), Dan Burkert (Cloudera), Michael Crutcher (Cloudera)
3D 06/07
11:20am Launch new financial products with confidence Beate Porst (IBM), Anand Ranganathan (IBM)
1:15pm Simplify big data with platform, discovery, and data preparation from the cloud Jeff Pollock (Oracle), Chris Lynskey (Oracle)
2:55pm Harnessing data to change banking for good Phil Kim (Capital One Labs)
1 E6 / 1 E7
11:20am Using Hadoop to detect high-risk fraud, waste, and abuse Alexander Barclay (UnitedHealthcare Shared Services)
2:55pm Enable secure data sharing and analytics in Hadoop with 5 key steps Reiner Kappenberger (HP Security Voltage)
4:35pm Cognitive computing: From theory to ubiquity Tim Estes (Digital Reasoning)
1 E14
11:20am Pentaho featuring Forrester: Delivering governed data for analytics at scale Michele Goetz (Forrester Research), Chuck Yarbrough (Pentaho)
1:15pm Hadoop II: The SQL Emma McGrattan (Actian)
2:55pm Commercializing IoT: What do you need to know? Ashish Verma (Deloitte)
1 E15
11:20am Patterns from the future Paul Kent (SAS)
1:15pm Do you know where your data is? Nidhi Aggarwal (Tamr, Inc.)
2:05pm Faster time to insight using Spark, Tachyon, and Zeppelin Nirmal Ranganathan (Rackspace)
2:55pm Building your first big data application on AWS Matt Yanchyshyn (Amazon Web Services)
4:35pm Apache Spark as a code-free data science workbench Michał Iwanowski (DeepSense.io), Piotr Piotr (deepsense.io)
3D 01/12
9:00am Spark Development Bootcamp (Day 3) Laurent Weichberger (OmPoint Innovations, LLC)
1B 03
9:00am Practical data science on Hadoop (Day 3) Brandon Mackenzie (IBM), John Rollins (IBM), Jacques Roy (IBM), Chris Fregly (PipelineAI), Mokhtar Kandil (IBM)
1B 04
8:00am Coffee Break
Room: Javits North
8:45am Plenary
Room: Javits North
Thursday keynote welcome Roger Magoulas (O'Reilly Media), Doug Cutting (Cloudera), Alistair Croll (Solve For Interesting)
8:50am Plenary
Room: Javits North
Data science for mission Doug Wolfe (CIA)
9:00am Plenary
Room: Javits North
Privacy protection and reproducible research Daniel Goroff (Alfred P. Sloan Foundation)
9:10am Plenary
Room: Javits North
The big data dividend Jack Norris (MapR Technologies)
9:20am Plenary
Room: Javits North
The rise of the citizen data scientist Ben Werther (Platfora)
9:25am Plenary
Room: Javits North
Patterns from the future Paul Kent (SAS)
9:30am Plenary
Room: Javits North
Doing it Wrong: 10 Problems with Qualitative Data Farrah Bostic (The Difference Engine)
9:45am Plenary
Room: Javits North
IBM sponsored keynote Shivakumar Vaithyanathan (IBM)
9:50am Plenary
Room: Javits North
What does it take to apply data science for social good? Jake Porway (DataKind)
10:00am Plenary
Room: Javits North
Haunted by data Maciej Ceglowski (Pinboard.in)
10:20am Plenary
Room: Javits North
In praise of boredom Maria Konnikova (The New Yorker | Mastermind)
10:40am Plenary
Room: Javits North
Closing remarks Roger Magoulas (O'Reilly Media), Doug Cutting (Cloudera), Alistair Croll (Solve For Interesting)
10:50am Morning Break sponsored by Cisco
Room: 3E
3:35pm Afternoon Break sponsored by Platfora
Room: 3E
12:00pm Lunch sponsored by MapR Technologies
Room: 3A & 3B
Lunch / Thursday BoF Tables
5:15pm Plenary
Room: South Concourse
Ice Cream Social
11:20am-12:00pm (40m) Data Science & Advanced Analytics
From anomalies to alerts: Identifying anomalies and rank ordering them to create alerts for data scientists to investigate
Robert Grossman (University of Chicago)
Large datasets have large numbers of anomalies, and the challenge is not just identifying anomalies but rank ordering them to create alerts, so that data scientists can examine the most interesting ones. We discuss three case studies that integrate machine learning and data engineering, and extract six techniques for identifying anomalies and rank ordering them by their potential significance.
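As a rough, hypothetical sketch of the rank-ordering idea (not the speakers' six techniques), a simple robust z-score can be used to surface the top-scoring records as alerts:

    import numpy as np

    def rank_anomalies(values, top_k=10):
        """Score each value with a robust z-score (median/MAD) and return
        the indices of the top_k most anomalous values, highest first."""
        values = np.asarray(values, dtype=float)
        median = np.median(values)
        mad = np.median(np.abs(values - median)) or 1.0  # guard against zero spread
        scores = np.abs(values - median) / mad
        order = np.argsort(scores)[::-1][:top_k]
        return list(zip(order.tolist(), scores[order].tolist()))

    # The largest outliers surface as the highest-ranked alerts.
    print(rank_anomalies([1.1, 0.9, 1.0, 12.0, 1.05, 0.95, 8.5]))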
1:15pm-1:35pm (20m) Data Science & Advanced Analytics
How data science helps prevent churn at Avira, a 100-million user company
Iulia Pasov (Avira), Călin-Andrei Burloiu (Avira)
Reaching 100,000,000 antivirus users was a big challenge for Avira, but we achieved that goal. The challenge now is to convince our users to stay with us by offering the best possible experience to each one of them. In this presentation we will share the entire churn-prevention workflow, from building custom surveys to applying machine learning algorithms.
1:35pm-1:55pm (20m) Data Science & Advanced Analytics
Probabilistic programming in data science
Thomas Wiecki (Quantopian)
Probabilistic programming has already revolutionized machine learning and will have a similar impact on the emerging field of data science. By automating the inference process, it dramatically increases the number of people who can build complex Bayesian models custom-made to the specific problem at hand, and it makes experts vastly more effective in devising new machine learning methods.
2:05pm-2:25pm (20m) Data Science & Advanced Analytics
Tackling machine learning complexity for data curation
Ihab Ilyas (University of Waterloo)
Machine learning tools offer promise in helping solve data curation problems. While the principles are well-understood, the engineering details in configuring and deploying ML techniques are the biggest hurdle. Leveraging data semantics and domain-specific knowledge is key in delivering the optimizations necessary for truly scalable ML curation solutions.
2:25pm-2:45pm (20m) Data Science & Advanced Analytics
Learning to love Bayesian statistics
Allen Downey (Olin College of Engineering)
Bayesian methods are well suited for business applications because they provide concrete guidance for decision-making under uncertainty. But many data science teams lack the background to take advantage of these methods. In this presentation I will explain these advantages and suggest ways for teams to develop the skills to add Bayesian methods to their toolkit.
2:55pm-3:35pm (40m) Data Science & Advanced Analytics
We need better maps: Smarter spatial clustering at the city level
Brett Goldstein (University of Chicago)
Spatial analytics is often hampered by the arbitrary choice of units, allowing local heterogeneity to obscure true patterns. A new “smart clustering” technique lets us use large quantities of open municipal data to literally redraw city maps to reflect facts on the ground, not administrative boundaries. This talk will explain what smart clusters are and the promise they hold for urban science.
4:35pm-5:15pm (40m) Data Science & Advanced Analytics
How Airbnb uses machine learning to detect host preferences
Bar Ifrach (Airbnb)
This talk describes the development of a machine learning model that infers Airbnb host preferences for accommodation requests based on their past behavior. The model is used to surface likely matches more prominently on Airbnb’s search results. In our A/B testing the model showed about a 3.75% increase in booking conversion, resulting in many more trips on Airbnb.
11:20am-12:00pm (40m) Data-driven Business
How Hadoop is powering Walmart’s data-driven business
Jeremy King (Walmart Global eCommerce)
Two years ago Walmart eCommerce moved from a small Hadoop cluster to a big one (250 nodes) and has since used Hadoop to consolidate 10 different websites, including Sam’s Club online, into one website. Walmart eCommerce now stores all incoming data in one central Hadoop cluster, which is driving the company’s focus on providing personalized, best-in-class customer experiences.
1:15pm-1:55pm (40m) Data-driven Business
What can Big Pharma teach us about Wall Street? What can Wall Street teach us about Big Pharma?
Joe Klobusicky (Geisinger Health System), Ali Habib (Northwestern Feinberg School of Medicine), Ekaterina Volkova (Cornell University)
Pharmaceutical companies follow a highly structured process for the approval of medications. From a financial viewpoint, the binary event of a drug’s passage offers a rare scientific opportunity: a well-defined, recurrent, and critical event spanning multiple companies. We will show that integrating multiple datatypes uncovers how drug passage influences the market, and vice versa.
2:05pm-2:45pm (40m) Data-driven Business
Your data is screaming at you. Learn to listen through customer choice modeling
Vivek Farias (Celect)
How can a retailer discover that expensive handbags have a large upside in Lancaster, PA, a fact that doesn't fit demographic stereotypes? The answer lies in understanding customer choice, that what a customer buys is constrained and influenced by what they're offered. Explore a new approach to machine learning, which models customer choice patterns and preferences from sparse transactional data.
2:55pm-3:35pm (40m) Data-driven Business
How women are conquering the S&P 500
Karen Rubin (Quantopian)
Karen Rubin has spent the last nine months exploring “What would happen if you invested in women CEOs?" In doing so, she has developed an investment algorithm that invests in the women-led companies of the Fortune 1000. Based on a simulation run from 2002-2014, this algorithm would have outperformed the S&P 500 by more than 200%. In this talk she will share her algorithm and results.
4:35pm-5:15pm (40m) Data-driven Business
Science fiction to product: Data-driven development
Micha Gorelick (Fast Forward Labs)
It's 2015. We understand the technology - how to build functional data pipelines, analytics, and reporting. We have algorithms. We understand the cultural issues of how to build a data-driven organization. This talk is about how to use these assets to imagine and create previously impossible products.
11:20am-12:00pm (40m) Hadoop Use Cases
Leverage data analytics to reduce human space mission risks
Haden Land (Lockheed Martin IS&GS), Jason Loveland (Lockheed Martin)
Lockheed Martin builds both unmanned and manned space systems, which must be tested for all possible conditions – even unforeseen situations. We present a test system, built as a learning system on big data technologies, that supports the testing of the Orion Multi-Purpose Crew Vehicle being designed for long-duration, human-rated deep space exploration.
1:15pm-1:55pm (40m) Hadoop Use Cases
Data and music: How India’s music streaming service uses big data to address a 1 billion-user market
Sriranjan Manjunath (Saavn Inc), Rahul Saxena (Saavn)
Saavn is the leading music streaming service in the South Asian market. This talk will focus on how we are leveraging data to adapt to the very specific demands of that market. We will demonstrate how Hadoop, Kafka, and Storm came together to help us solve some of these challenges.
2:05pm-2:45pm (40m) Hadoop Use Cases
Big data, small internet: How to circumnavigate your information
Raymond Collins (TE Connectivity), Scott Sokoloff (Orderup)
Scott and Ray will discuss a real-life use case from a large manufacturing company, where data was produced in remote factories faster than it could be sent through the internet. This session is an interactive discussion around how to resolve the issue of "big data, small internet."
2:55pm-3:35pm (40m) Hadoop Use Cases
Use case examples of building applications on Hadoop with CDAP
Jonathan Gray (Cask)
Hadoop has evolved into a rich collection of technologies that enable a broad range of use cases. However, the technology innovation has outpaced the skills of most developers. The open-source Cask Data Application Platform (CDAP) project was initiated to close this developer gap. In this session, we will show how three different organizations utilized CDAP to deliver solutions on Hadoop.
4:35pm-5:15pm (40m) Hadoop Use Cases
Re-engineering legacy analytics solutions with big data
Rosaria Silipo (KNIME.com AG)
In this project, we re-engineered a few barely usable legacy solutions and made them viable again by exploiting the speed and performance of Hadoop-based execution.
11:20am-12:00pm (40m) Hadoop Internals & Development
Modern query processing with columnar formats: The best is yet to come
Henry Robinson (Cloudera), Zuo Wang (Wanda), Arthur Peng (Intel)
Columnar data formats such as Apache Parquet promise much in terms of performance, but they need help from modern CPUs to fully realize all the benefits. In this talk we'll show how the combination of the newest SIMD instruction sets and an open-source columnar file format can provide an enormous performance advantage. Our example system will be Impala, Parquet, and Intel's AVX2 instruction set.
1:15pm-1:55pm (40m) Hadoop Internals & Development
What does it mean to virtualize the Hadoop distributed file system?
Thomas Phelan (HPE BlueData)
This session will delve into the multiple meanings of "virtualized HDFS." It will investigate abstracting the HDFS protocol to permit any storage device to deliver data to a Hadoop application in a performance-critical environment, and will include a discussion and assessment of the work in this area done by projects such as Tachyon and MemHDFS.
2:05pm-2:45pm (40m) Hadoop Internals & Development
HDFS operations made easy: Guide to the improved, full service HDFS File Browser
Ravi Prakash (Altiscale)
The HDFS File Browser now has improved accessibility and is easier to use! Hadoop 2.4.0 introduced a new UI for file browsing with WebHDFS. This feature set has been expanded to include write operations and file uploads. Authentication issues have been addressed and the file browser is now configured with HttpFS. We'll present a demonstration and overview of possible configuration requirements.
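For readers who prefer the REST route to the file browser UI, a minimal sketch of listing a directory through the WebHDFS API (the NameNode host, port, and user below are placeholders) looks like this in Python:

    import requests

    # Hypothetical NameNode address; when the browser is fronted by HttpFS,
    # point at the HttpFS gateway (typically port 14000) instead.
    NAMENODE = "http://namenode.example.com:50070"

    def list_directory(path, user="hdfs"):
        """List an HDFS directory via the WebHDFS LISTSTATUS operation."""
        url = "{0}/webhdfs/v1{1}".format(NAMENODE, path)
        resp = requests.get(url, params={"op": "LISTSTATUS", "user.name": user})
        resp.raise_for_status()
        statuses = resp.json()["FileStatuses"]["FileStatus"]
        return [s["pathSuffix"] for s in statuses]

    print(list_directory("/tmp"))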
2:55pm-3:35pm (40m) Business & Innovation
Great debate: Big data will live in the cloud
Alistair Croll (Solve For Interesting), Joseph Adler (Facebook), Margaret Dawson (Red Hat), Joseph Sirosh (Microsoft), Evan Prodromou (Fuzzy.io)
Data has gravity. Jim Gray once said that “compared to the cost of moving bytes around, everything else is free,” and because of what this means for the economics of computing, the more data you have, the more it wants to be near other data. That means all big data systems, eventually, will live in centralized cloud environments. On the other hand, different data is processed in different ways.
4:35pm-5:15pm (40m) Data Science & Advanced Analytics
Should you trust your money to a robot?
Vasant Dhar (NYU)
Financial markets emanate massive amounts of data from which machines can, in principle, learn to invest with minimal initial guidance from humans. I contrast human and machine strengths and weaknesses in making investment decisions.
11:20am-12:00pm (40m) Data Innovations
Big data at Netflix: Faster and easier
Kurt Brown (Netflix)
The Netflix Data Platform is a constantly evolving, large scale infrastructure running in the (AWS) cloud. We are especially focused on performance and ease of use, with initiatives including Presto integration, Spark, and our big data portal and API. This talk will dive into the various technologies we use, the motivations behind our approach, and the business benefits we get.
1:15pm-1:55pm (40m) Data Innovations
Copycat: Fault tolerant streaming data ingestion powered by Apache Kafka
Neha Narkhede (Confluent)
Often the hardest step in processing streams is being able to collect all your data in a structured way. We present Copycat, a framework for data ingestion that addresses some common impedance mismatches between data sources and stream processing systems. Copycat uses Kafka as an intermediary, making it easy to get streaming, fault-tolerant data ingestion across a variety of data sources.
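Copycat's connector API is not reproduced here; purely as a sketch of the underlying pattern (structured records flowing into Kafka as the intermediary), a hand-written producer using the kafka-python client, with broker address and topic name as placeholders, might look like:

    import json
    from kafka import KafkaProducer

    # Hypothetical broker and topic; in a Copycat-style setup the connector
    # framework, not application code, owns this producing loop.
    producer = KafkaProducer(
        bootstrap_servers="kafka-broker.example.com:9092",
        value_serializer=lambda record: json.dumps(record).encode("utf-8"),
    )

    for row in [{"user": "alice", "action": "login"},
                {"user": "bob", "action": "purchase"}]:
        producer.send("ingest-events", row)

    producer.flush()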
2:05pm-2:45pm (40m) Data Innovations
Paying the technical debt of machine learning: Managing ML models in production
Carlos Guestrin (Apple | University of Washington)
As companies increase the number of deployments of machine learning-based applications, the number of models that need to be monitored grows at a tremendous pace. In this talk, we outline some of the key challenges in large-scale deployments of machine learning models, then describe a methodology to manage such models in production to mitigate the technical debt.
2:55pm-3:35pm (40m) Data Innovations
Calculating high-resolution, global-scale geospatial analytics with MapReduce Geospatial
Ryan Smith (DigitalGlobe)
MrGeo is a geospatial toolkit designed to provide raster-based geospatial capabilities that can be performed at scale by leveraging the Hadoop ecosystem. This session will provide an overview of the MrGeo design for storing and processing large-scale raster datasets in the cloud, highlight core operations, and present performance benchmarks for some example operations on open data sets.
4:35pm-5:15pm (40m) Data Innovations
Considerations for building a cognitive application
Venky Ganti (Alation)
Recommendation engines are cognitive computing applications. Their algorithms “learn” from experience. What if a recommendation engine could help analysts sort through big data? Building a query recommendation engine is complex. We’ll share some of the technical challenges and lessons from building a cognitive application in daily use today by analyst teams from eBay to Square.
11:20am-12:00pm (40m) Spark & Beyond
What's new in Spark Streaming - a technical overview
Tathagata Das (Databricks)
As the adoption of Spark Streaming in the industry is increasing, so is the community's demand for more features. Since the beginning of this year, we have made significant improvements in performance, usability, and semantic guarantees. In this talk, I discuss these improvements, as well as the features we plan to add in the near future.
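For orientation, a minimal DStream word count (socket source; host and port are placeholders) shows the basic Spark Streaming programming model the talk builds on:

    from pyspark import SparkContext
    from pyspark.streaming import StreamingContext

    # Count words arriving on a socket, in 5-second batches.
    sc = SparkContext(appName="StreamingWordCount")
    ssc = StreamingContext(sc, batchDuration=5)

    lines = ssc.socketTextStream("localhost", 9999)
    counts = (lines.flatMap(lambda line: line.split())
                   .map(lambda word: (word, 1))
                   .reduceByKey(lambda a, b: a + b))
    counts.pprint()

    ssc.start()
    ssc.awaitTermination()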
1:15pm-1:55pm (40m) Spark & Beyond
Netflix: Integrating Spark at petabyte scale
Daniel Weeks (Netflix)
The Big Data Platform team at Netflix continues to push big data processing in the cloud with the addition of Spark to our platform. Recent enhancements to Spark allow us to effectively leverage it for processing against a 10+ petabyte warehouse backed by S3. We will share our experiences and performance of production jobs along with the pains and gains of deploying Spark at scale on YARN.
2:05pm-2:45pm (40m) Spark & Beyond
First-ever scalable, distributed deep learning architecture using Spark and Tachyon
Christopher Nguyen (Arimo), Vu Pham (Adatao, Inc), Michael Bui (Adatao, Inc.)
Deep learning algorithms have been used in many real-world applications, such as computer vision, machine translation, and fraud detection. We'll present an overview of the system architecture, the training and running of Deep Learning models such as Convolutional Neural Networks (CNNs) and Recurrent Neural Networks (RNNs) on Spark with Tachyon, including the use of GPUs to improve execution time.
2:55pm-3:35pm (40m) Spark & Beyond
Spark on Mesos
Dean Wampler (Lightbend)
Apache Spark is often seen as a replacement for MapReduce in Hadoop systems, but Spark clusters can also be deployed and managed by Mesos. This talk explains how to use Mesos for Spark applications. We'll examine the pros and cons of using Mesos vs. Hadoop YARN as a data platform, and discuss practical issues when running Spark on Mesos. We'll even discuss how to combine the two with Myriad.
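As a small sketch of what "Spark on Mesos" means in practice (the mesos:// URL and the executor package location are placeholders), an application simply points its master at the Mesos cluster:

    from pyspark import SparkConf, SparkContext

    conf = (SparkConf()
            .setAppName("spark-on-mesos-demo")
            .setMaster("mesos://mesos-master.example.com:5050")
            # Where Mesos executors fetch the Spark distribution from.
            .set("spark.executor.uri",
                 "hdfs://namenode.example.com/dist/spark-1.5.0-bin-hadoop2.6.tgz"))

    sc = SparkContext(conf=conf)
    print(sc.parallelize(range(1000)).sum())
    sc.stop()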
4:35pm-5:15pm (40m) Spark & Beyond
How Spark is working out at Comcast scale
Sridhar Alla (BlueWhale), Jan Neumann (Comcast)
Comcast uses Hadoop as the big data platform in several areas of its business. Its use cases have evolved in recent years and include personalization, clickthrough analytics, modeling, and customer support initiatives, all adding up to billions of dollars in revenue.
11:20am-12:00pm (40m) IoT & Real-time
Elastic stream processing without tears
Michael Hausenblas (AWS)
Researchers estimate that by 2020 there will be 100 million internet-connected devices. Processing this data in real time—whether from mobile phones or jet engines—will be the new normal. How are companies today adapting to this new real-time stream of data?
1:15pm-1:55pm (40m) IoT & Real-time
Modeling predictive maintenance applications in the IoT Era
Yan Zhang (Microsoft)
This talk introduces the landscape and challenges of predictive maintenance applications in the industry, illustrates how to formulate the problem (data labeling and feature engineering) with three machine learning models (regression, binary classification, multi-class classification), and showcases how the models can be conveniently trained and compared with different algorithms.
2:05pm-2:45pm (40m) IoT & Real-time
High performance results using Spark to analyze mining equipment sensor data
Ankur Gupta (Bitwise Inc.)
Using an open source technology stack, we implemented a solution for real-time analysis of sensor data from mining equipment. We will share the technical architecture used to show the tools we implemented for real-time complex event processing, why we implemented Spark instead of Storm, some of the challenges faced, benchmarks achieved, and tips for easy integration.
2:55pm-3:35pm (40m) IoT & Real-time
Building a real-time analytics stack with Kafka, Samza, and Druid
Fangjin Yang (Imply), Gian Merlino (Imply)
The maturation and development of open source technologies has made it easier than ever for companies to derive insights from vast quantities of data. In this session, we will cover how to build a real-time analytics stack using Kafka, Samza, and Druid. This combination of technologies can power a robust data pipeline that supports real-time ingestion and flexible, low-latency queries.
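On the query side of such a stack, Druid accepts plain JSON over HTTP; a hypothetical timeseries query against a broker (endpoint, datasource, and interval below are placeholders, not the speakers' setup) can be issued like this:

    import json
    import requests

    BROKER = "http://druid-broker.example.com:8082/druid/v2/"

    query = {
        "queryType": "timeseries",
        "dataSource": "events",
        "granularity": "minute",
        "intervals": ["2015-09-29T00:00:00/2015-09-30T00:00:00"],
        "aggregations": [{"type": "count", "name": "rows"}],
    }

    resp = requests.post(BROKER, data=json.dumps(query),
                         headers={"Content-Type": "application/json"})
    resp.raise_for_status()
    for bucket in resp.json():
        print(bucket["timestamp"], bucket["result"]["rows"])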
4:35pm-5:15pm (40m) IoT & Real-time
Oulu Smart City pilot
Susanna Pirttikangas (University of Oulu)
Oulu Smart City has a lively living lab tradition; we continuously collect data and expand our ecosystem of companies, research institutes, city officials, and citizens, and develop data-intensive services on top of the ecosystem. We present real use cases implementing big data platforms and development of higher level distributed reasoning and machine learning to exploit our data lake.
11:20am-12:00pm (40m) Design, User Experience, & Visualization
From profiling to analysis: Designing visualization tools for purpose
Jeffrey Heer (Trifacta | University of Washington), Jock Mackinlay (Tableau)
The talk will focus on considerations for designing data visualizations for the data profiling required during data preparation, and for the later exploratory analysis and consumption phases of the overall analysis process.
1:15pm-1:55pm (40m) Design, User Experience, & Visualization
What have you done!? How to visualize methods and models for decision makers
Michael Freeman (University of Washington)
Data-driven decision-making can only be properly executed when the decision makers understand both the underlying data and the types of manipulations that have been applied to it. In this session, we’ll explore what exactly we "do" to data (aggregation, "cleaning," statistical modeling, machine learning), and how to visually communicate the processes and implications of our work.
2:05pm-2:45pm (40m) Design, User Experience, & Visualization
LIVE from New York: An introduction to Linked Immersive Visualization Environments
Margit Zwemer (LiquidLandscape)
Linked Immersive Visualization Environments (LIVE) is a framework that my startup, LiquidLandscape, has developed for combining multiple, high-volume data visualizations (d3, WebGL, WebVR) to provide comprehensive situational awareness for financial markets. We will discuss architecture and design challenges of visualizing real-time data at speed and scale, with lots of visual examples.
2:55pm-3:35pm (40m) Design, User Experience, & Visualization
Data, Design, and Organizations: Design thinking and prototyping approaches to data challenges in orgs
Peter Olson (IDEO), David Boardman (IDEO)
The experience of data extends beyond capturing, storing, and presenting it. Data can help shape customer journeys through products, change the way organizations communicate, and be either a source of confusion or a tool for communication. This talk will focus on how design thinking can be applied to data, and how data design can be applied to a wide array of consumer and organizational experiences.
4:35pm-5:15pm (40m) Design, User Experience, & Visualization
Designing happiness with data
Pamela Pavliscak (SoundingBox)
Our understanding of happiness is becoming more nuanced, and much of that new knowledge relies on data from social media, quantified self apps, and large datasets. This session will look at the lessons we can learn from happiness data to design positive experiences with technology.
11:20am-12:00pm (40m) Security & Governance
Leveraging asset reputation systems to detect and prevent fraud and abuse at LinkedIn
Jenelle Bray (LinkedIn)
LinkedIn’s Security Data Science group uses various reputation systems as input to models designed to stop fraud and abuse. This session will discuss how we build these reputation systems and compare instantaneous online reputation scores to more complex offline systems.
1:15pm-1:55pm (40m) Security & Governance
Preventing a big data security breach
Sam Heywood (Cloudera), Nick Curcuru (Mastercard), Ritu Kama (Intel)
Hadoop is widely used thanks to its ability to handle the volume, velocity, and variety of data. However, this flexibility and scale present challenges for securing and governing that data. To avoid your company making the front pages over a data breach, experts from MasterCard, Intel, and Cloudera share the Hadoop Security Maturity Model (phases 0-4) and the steps to get your cluster ready for a PCI audit.
2:05pm-2:45pm (40m) Security & Governance
Data democratization versus data governance
Peter Guerra (Booz Allen Hamilton)
Combining data in Hadoop for the purpose of data discovery often runs into barriers from the security group because of legal or corporate policy. This talk will discuss the challenges with implementing data governance in big data systems, a design pattern for addressing those challenges within an organization, and a recent case study.
2:55pm-3:35pm (40m) Security & Governance
Transparent encryption in HDFS: The missing piece in big data security
Andrew Wang (Cloudera)
Encryption is a requirement for many business sectors dealing with confidential information. To meet these requirements, transparent, end-to-end encryption was added to HDFS. This protects data while it is in flight and at rest, and it can be used compatibly with existing Hadoop apps. We will cover the design and implementation of transparent encryption in HDFS, as well as performance results.
4:35pm-5:15pm (40m) Security & Governance
Big data governance
Steven Totman (Cloudera), Mark Donsky (Okera), Kristi Cunningham (Capital One), Ben Harden (CapTech Consulting)
Moderator: Steve Totman, Big Data Evangelist at Cloudera
Panelist: Kristi Cunningham, VP Enterprise Data Management at Capital One
Panelist: Susan Meyer, Business Leader - Fraud Management Solutions at MasterCard Worldwide
Panelist: Ben Harden, Managing Director at CapTech
Panelist: Mark Donsky, Navigator Product Manager at Cloudera
11:20am-12:00pm (40m) Ask Me Anything
Ask me anything: Hadoop application architectures
Gwen Shapira (Confluent), Jonathan Seidman (Cloudera), Ted Malaska (Capital One), Mark Grover (Lyft)
Join the authors of Hadoop Application Architectures for an open Q&A session on considerations and recommendations for the architecture and design of applications using Hadoop. Talk to us about your use case and its big data architecture, or just come to listen in.
1:15pm-1:55pm (40m) Ask Me Anything
Ask me anything: Apache Spark
Patrick Wendell (Databricks), Reynold Xin (Databricks)
Join the Spark team for an informal question and answer session. Spark committers from Databricks will be on hand to field a wide range of detailed questions. Even if you don’t have a specific question, join in to hear what others are asking.
2:05pm-2:45pm (40m) Ask Me Anything
Ask me anything: Hadoop operations for production systems
Miklos Christine (Databricks), Kathleen Ting (Cloudera), Philip Zeyliger (Cloudera), Philip Langdale (Cloudera)
Join the instructors of the all-day tutorial "Apache Hadoop operations for production systems," as they field a wide range of detailed questions. Even if you don’t have a specific question, join in to hear what others are asking.
2:55pm-3:35pm (40m) Ask Me Anything
Ask me anything: Developing a modern enterprise data strategy
John Akred (Silicon Valley Data Science), Julie Steele (Manifold), Scott Kurth (Silicon Valley Data Science)
Join the team behind the tutorial “Developing a modern enterprise data strategy," as they field a wide range of detailed questions. Even if you don’t have a specific question, join in to hear what others are asking.
4:35pm-5:15pm (40m) Ask Me Anything
Ask me anything: Hadoop's storage gap - resolving transactional access/analytic performance tradeoffs with Kudu
Todd Lipcon (Cloudera), JD Cryans (Cloudera), David Alves (Cloudera), Mike Percy (Cloudera), Dan Burkert (Cloudera), Michael Crutcher (Cloudera)
Ask the panel questions about Kudu and the tradeoffs between real-time transactional access and fast analytic performance.
11:20am-12:00pm (40m) Sponsored
Launch new financial products with confidence
Beate Porst (IBM), Anand Ranganathan (IBM)
Financial institutions use data such as streaming news feeds and proprietary data for insight. One company is taking filings from 130 countries and data from 500,000 equity instruments to create real-time applications. Data integration is essential for information to be trusted in these applications. Explore an architecture designed to capture all data and ensure it is trusted.
1:15pm-1:55pm (40m) Sponsored
Simplify big data with platform, discovery, and data preparation from the cloud
Jeff Pollock (Oracle), Chris Lynskey (Oracle)
In this session you’ll learn how Oracle has leveraged Spark-based machine learning (ML), natural language processing (NLP), and data graph semantics (Linked Open Data) to create the simplest and most powerful big data discovery and big data preparation tools in the market.
2:05pm-2:45pm (40m) Sponsored
How Autodesk is using Tableau to visualize its Kafka-Splunk-Hadoop pipeline
Charlie Crocker (Autodesk)
Building design software for industries from engineering to construction and manufacturing to media meant Autodesk needed to architect its analytics platform to handle massive amounts of data. Learn how Autodesk uses open-source technologies like Kafka and Hadoop and integrates them with solutions like Splunk, Google BigQuery, and Tableau to achieve data insights at scale.
2:55pm-3:35pm (40m) Sponsored
Harnessing data to change banking for good
Phil Kim (Capital One Labs)
Capital One is on a mission to Change Banking for Good. Join Capital One as we take you through the journey of the Data Lab. How did we get started? What have we learned about mingling disciplines such as human centered design, full stack engineering, and data science? And how are we taking an entrepreneurial approach to develop successful solutions that deliver real impact?
11:20am-12:00pm (40m) Sponsored
Using Hadoop to detect high-risk fraud, waste, and abuse
Alexander Barclay (UnitedHealthcare Shared Services)
UnitedHealth Group has long been defined by our innovative approach to health care, and our approach to IT and analytics is no different. With the goal of making health care more affordable by identifying fraud, waste, and abuse activities, this session will provide details on how we leveraged Hadoop for payment integrity analytics to identify thousands of high-risk providers and claims.
1:15pm-1:55pm (40m) Sponsored
Case study: How YP.com addresses real-world analytical challenges for SQL on Hadoop
William Theisinger (YP), Ignacio Hwang (HP)
If you’re struggling with determining which implementation of SQL on Hadoop can meet your analytics needs, you’re not alone. Join us for a discussion on how YP.com, a leading local marketing solutions provider in the U.S. dedicated to helping local businesses and communities grow, uses HP Vertica for SQL on Hadoop to solve their organization’s big data challenges.
2:05pm-2:45pm (40m) Sponsored
Big data modeling and analytic patterns – beyond schema on read
Ron Bodkin (Google)
While schema on read is powerful, it’s just a first step on the journey to understanding effective ways of working with data in new big data systems. In this talk we highlight new patterns of working with data.
2:55pm-3:35pm (40m) Sponsored
Enable secure data sharing and analytics in Hadoop with 5 key steps
Reiner Kappenberger (HP Security Voltage)
Building a strategy and methodology that protects sensitive data is vital in securing your big data systems and enterprise assets. Learn how people protect big data in Hadoop, and understand how protecting the information is possible without removing the value of the data, or paying a performance penalty.
4:35pm-5:15pm (40m) Sponsored
Cognitive computing: From theory to ubiquity
Tim Estes (Digital Reasoning)
Cognitive computing has made the transition from a theoretical technology into one that is having a transformative impact on business and our daily lives. In this session, Tim Estes, CEO and founder of Digital Reasoning, will explore how key enabling technologies, such as artificial intelligence and natural language processing, have made this possible.
11:20am-12:00pm (40m) Sponsored
Pentaho featuring Forrester: Delivering governed data for analytics at scale
Michele Goetz (Forrester Research), Chuck Yarbrough (Pentaho)
Forrester Research Principal Analyst Michele Goetz discusses findings from Delivering Governed Data for Analytics at Scale, a June 2015 commissioned study conducted by Forrester Consulting on behalf of Pentaho on the topic of data governance and delivery.
1:15pm-1:55pm (40m) Sponsored
Hadoop II: The SQL
Emma McGrattan (Actian)
Can Hadoop now handle your enterprise analytic workloads? Actian SVP of Engineering Emma McGrattan will describe the various solutions that comprise the SQL on Hadoop landscape, identify the features that are important for those modernizing their enterprise analytic workloads on Hadoop, and describe the successes that Actian customers have had in moving their BI and Analytic workloads to Hadoop.
2:05pm-2:45pm (40m) Sponsored
Eventually consistent systems (a.k.a. mostly inconsistent systems) vs. strongly consistent systems in big data
Jagane Sundar (WANdisco)
This talk explores the actual behavior of eventually consistent systems (a.k.a. mostly inconsistent systems) while presenting a Paxos-based alternative. We’ll highlight the Amazon use case and the various fixes made to S3 in order to enable Hadoop workflows, along with the alternatives offered by Cassandra, then explore Paxos as an alternative to such inconsistent systems for Hadoop storage and HBase solutions.
2:55pm-3:35pm (40m) Sponsored
Commercializing IoT: What do you need to know?
Ashish Verma (Deloitte)
The Internet of Things (IoT) continues to give rise to new business models in the retail, industrial manufacturing, healthcare, insurance, medical device, telecommunications, and technology industries. Learn what those efforts are and how to capitalize on these opportunities for your clients.
4:35pm-5:15pm (40m) Sponsored
SAP HANA Vora to query Big Data with greater ease
Balaji Krishna (SAP)
Join us to learn how SAP HANA Vora can be used standalone or in concert with the SAP HANA platform to extend enterprise-grade analytics to Hadoop clusters and provide enriched, interactive analytics on Hadoop.
11:20am-12:00pm (40m) Sponsored
Patterns from the future
Paul Kent (SAS)
Imagine the possibilities of having all of your data in one place – at a reasonable cost – with the computing potential to learn from relationships between data in all domains. Advanced analytics and Hadoop are changing the way organizations approach big data. Hear tips from the future and learn about key patterns emerging from a wide cross section of Hadoop journeys. Perhaps they’ll inspire yours.
1:15pm-1:55pm (40m) Sponsored
Do you know where your data is?
Nidhi Aggarwal (Tamr, Inc.)
Enterprises find it far too costly and time-consuming to locate all of the data relevant to an analysis. Data is so fragmented that most enterprises lack even a basic inventory of all sources and attributes, an enormous constraint on getting a return on your big data investment. Tamr Catalog solves this by creating an inventory of all enterprise metadata in a central, platform-neutral place.
2:05pm-2:45pm (40m) Sponsored
Faster time to insight using Spark, Tachyon, and Zeppelin
Nirmal Ranganathan (Rackspace)
All of us involved in big data are working to decrease time to insight. We're building Spark on YARN clusters with Hadoop ecosystem components, and there are clear benefits to this implementation. However, there are other use cases that may benefit from a more streamlined stack.
2:55pm-3:35pm (40m) Sponsored
Building your first big data application on AWS
Matt Yanchyshyn (Amazon Web Services)
Want to get ramped up on how to use Amazon's big data web services and launch your first big data application on AWS?
4:35pm-5:15pm (40m) Sponsored
Apache Spark as a code-free data science workbench
Michał Iwanowski (DeepSense.io), Piotr Piotr (deepsense.io)
With Spark becoming the rising star of cluster computing comes the prospect of putting it to use as a platform for end-to-end data science. At DeepSense.io we have built an intuitive interface to take Spark to the next level of usability. By introducing a layer that provides code-free UX and simplified resource management, Spark is brought even closer to the concepts known in data science.
9:00am-5:00pm (8h) Training
Spark Development Bootcamp (Day 3)
Laurent Weichberger (OmPoint Innovations, LLC)
This three-day curriculum features advanced lectures and hands-on technical exercises for Spark usage in data exploration, analysis, and building big data applications.
9:00am-5:00pm (8h) Training
Practical data science on Hadoop (Day 3)
Brandon Mackenzie (IBM), John Rollins (IBM), Jacques Roy (IBM), Chris Fregly (PipelineAI), Mokhtar Kandil (IBM)
In this three-day course, you will:
* Learn how to use machine learning, text analysis, and real-time analytics to solve frequently encountered, high-value business problems
* Understand data science methodology and the end-to-end workflow of a solution, including data preparation, model building and validation, and model deployment
* Use Apache Spark and other tools for analytics
9:00am-5:00pm (8h) Training
Designing and building big data applications (Day 3)
Nathan Neff (Cloudera)
Cloudera University’s three-day course for designing and building big data applications prepares you to analyze and solve real-world problems using Apache Hadoop and associated tools in the enterprise data hub (EDH).
8:00am-8:45am (45m)
Break: Coffee Break
8:45am-8:50am (5m)
Thursday keynote welcome
Roger Magoulas (O'Reilly Media), Doug Cutting (Cloudera), Alistair Croll (Solve For Interesting)
Strata + Hadoop World Program Chairs Roger Magoulas, Doug Cutting, and Alistair Croll welcome you to the second day of keynotes.
8:50am-9:00am (10m)
Data science for mission
Doug Wolfe (CIA)
In his ten-minute keynote, CIA Chief Information Officer Douglas Wolfe discusses how data science is a true team sport, and how the rapid evolution of this field continually improves the impact of the CIA mission.
9:00am-9:10am (10m)
Privacy protection and reproducible research
Daniel Goroff (Alfred P. Sloan Foundation)
It is easy to make "false discoveries" when analyzing big data. It is harder to draw causal conclusions that are reliable and reproducible, especially when private or proprietary information is involved. Recent mathematical ideas, like differential privacy, offer new ways of reaching robust conclusions while provably protecting personal information.
9:10am-9:20am (10m)
The big data dividend
Jack Norris (MapR Technologies)
The big data dividend refers to the ongoing, significant profits that are derived by running data-driven applications. This session will include examples of applications by leading companies, and provide insights into how developers and organizations can realize big data dividends from a new class of scalable applications with continuous analytics.
9:20am-9:25am (5m) Sponsored
The rise of the citizen data scientist
Ben Werther (Platfora)
The traditional BI and analytics tools of the last decade have made it difficult for users to work directly with their data. With the latest innovations in big data discovery platforms, a new role has emerged: the citizen data scientist. In this keynote, Ben will share Platfora’s research behind the importance of this emerging role so that companies can become truly data-driven.
9:25am-9:30am (5m) Sponsored
Patterns from the future
Paul Kent (SAS)
Imagine the possibilities of having all of your data in one place – at a reasonable cost – with the computing potential to learn from relationships between data in all domains. Advanced analytics and Hadoop are changing the way organizations approach big data. Hear tips from the future and learn about key patterns emerging from a wide cross section of Hadoop journeys.
9:30am-9:45am (15m)
Doing it Wrong: 10 Problems with Qualitative Data
Farrah Bostic (The Difference Engine)
Farrah Bostic, Founder, The Difference Engine
9:45am-9:50am (5m) Sponsored
IBM sponsored keynote
Shivakumar Vaithyanathan (IBM)
Shivakumar Vaithyanathan, IBM Fellow and Director, Watson Content Services, IBM
9:50am-10:00am (10m)
What does it take to apply data science for social good?
Jake Porway (DataKind)
Jake Porway, founder and executive director of DataKind, unveils five keys for successful data science for good projects, based on the organization's three years of work rallying thousands of volunteers worldwide to give back.
10:00am-10:20am (20m)
Haunted by data
Maciej Ceglowski (Pinboard.in)
Big data is a bit like nuclear energy: while full of promise, it generates residue that is difficult to dispose of, poses risks for those who store it, and leaves the industry one major incident away from scaring the public off the technology entirely.
10:20am-10:40am (20m)
In praise of boredom
Maria Konnikova (The New Yorker | Mastermind)
What do you do when you find a momentary break in your otherwise endless barrage of tasks? In this talk, Maria argues for the vital importance of recapturing the seeming nothingness of boredom, of harnessing the pauses of life for their creative potential. It is in boredom that the truly deep questions and discoveries lie.
10:40am-10:45am (5m)
Closing remarks
Roger Magoulas (O'Reilly Media), Doug Cutting (Cloudera), Alistair Croll (Solve For Interesting)
Program Chairs Roger Magoulas, Doug Cutting, and Alistair Croll close out the Strata + Hadoop World keynotes.
10:50am-11:20am (30m)
Break: Morning Break sponsored by Cisco
3:35pm-4:35pm (1h)
Break: Afternoon Break sponsored by Platfora
12:00pm-1:15pm (1h 15m) Events
Lunch / Thursday BoF Tables
Birds of a Feather (BoF) discussions are a great way to informally network with people in similar industries or interested in the same topics.
5:15pm-6:15pm (1h) Events
Ice Cream Social
Join attendees, speakers, and exhibitors as we end the conference on a sweet note with some ice cream.