Presented By O'Reilly and Cloudera
Make Data Work
March 28–29, 2016: Training
March 29–31, 2016: Conference
San Jose, CA
 
LL20 A
11:00am SparkNet: Training deep networks in Spark Robert Nishihara (University of California, Berkeley)
5:10pm Data science teams: Hold out for the unicorn or build bands of steeds? Michael Dauber (Amplify Partners), Yael Garten (LinkedIn), Monica Rogati (Data Natives), Daniel Tunkelang (Various)
LL20 C
11:50am Attack graphs: Visually exploring 300M alerts per day Leo Meyerovich (Graphistry), Joshua Patterson (NVIDIA), Mike Wendt (NVIDIA)
1:50pm Delivering big data insight at Markerstudy Nick Turner (Markerstudy)
2:40pm Visualization as data and data as visualization: Building insights in a data-flow world Christopher Nguyen (Arimo), Anh Trinh (Arimo, Inc.)
5:10pm A few things engineers can learn from designers Sebastien Pierre (FFunction)
LL20 D
11:00am A year of anomalies: Building shared infrastructure for anomaly detection Chris Sanden (Netflix), Christopher Colburn (Netflix)
11:50am Real-time fraud detection using process mining with Spark Streaming Bolke de Bruin (ING), Hylke Hendriksen (ING)
1:50pm Docker for data scientists Michelangelo D'Agostino (ShopRunner)
2:40pm BayesDB: Query the probable implications of your data Richard Tibbetts (Tableau), Vikash Mansinghka (MIT)
4:20pm Can deep neural networks save your neural network? Artificial intelligence, sensors, and strokes Brandon Ballinger (Cardiogram), Johnson Hsieh (Cardiogram)
5:10pm Deep learning and recurrent neural networks applied to electronic health records Josh Patterson (Patterson Consulting), Dave Kale (Skymind), Zachary Lipton (University of California, San Diego)
LL21 B
11:00am How to tackle false positives in big data security applications Ram Shankar Siva Kumar (Microsoft (Azure Security Data Science)), Cody Rioux (Netflix (Real-time Analytics))
11:50am Securing Apache Kafka Jun Rao (Confluent)
5:10pm Protecting enterprise data in Apache Hadoop Don Bosco Durai (Privacera)
LL21 C/D
11:00am Empowering business users to lead with data Denise McInerney (Intuit)
1:50pm Where’s the puck headed? Michael Dauber (Amplify Partners), Shivon Zilis (Bloomberg Beta), Cack Wilhelm (Scale Venture Partners), Roseanne Wincek (Institutional Venture Partners), Kristina Bergman (Ignition Partners)
2:40pm Why a data career is a great choice, now more than ever Jin Zhang (CA Technologies), Jerry Overton (DXC), Michele Chambers (Continuum Analytics)
LL21 E/F
11:00am Developing a big data business strategy Bill Schmarzo (EMC)
11:50am How to build a successful data lake Alex Gorelik (Waterline Data)
2:40pm An introduction to Transamerica's product recommendation platform Vishal Bamba (Transamerica), Nitin Prabhu (Transamerica), Jeremy Beck, Amy Wang (H2O.ai)
210 A/E
11:00am The state of Spark and where it is going in 2016 Reynold Xin (Databricks)
2:40pm Analyzing time series data with Spark Sandy Ryza (Clover Health)
5:10pm Testing and validating Spark programs Holden Karau (Independent)
210 C/G
11:00am Distributed stream processing with Apache Kafka Jay Kreps (Confluent)
4:20pm Real-world smart applications with Amazon Machine Learning Alex Ingerman (Amazon Web Services)
5:10pm Putting Kafka into overdrive Todd Palino (LinkedIn), Gwen Shapira (Confluent)
210 D/H
1:50pm Grounding big data: A meta-imperative Joe Hellerstein (UC Berkeley), Vikram Sreekanti (Berkeley AMP Lab)
2:40pm Unified namespace and tiered storage in Alluxio Calvin Jia (Alluxio), Jiri Simsa (Alluxio)
4:20pm Building the data infrastructure of the future with persistent memory Derrick Harris (Mesosphere), Rob Peglar (Micron Technology, Inc), Milind Bhandarkar (Ampool, Inc.), Richard Probst (SAP), Todd Lipcon (Cloudera)
5:10pm Streaming architecture: Why flow instead of state? Ted Dunning (MapR, now part of HPE)
211 A-C
11:00am Data science for good means designing for people: Part 1 Jake Porway (DataKind), Rachel Quint (Hewlett Foundation), Sue-Ann Ma, Jeremy Anderson (IBM)
11:50am Data science for good means designing for people: Part 2 Jake Porway (DataKind), Daniella Perlroth (Lyra Health), Tim Hwang (ROFLCon / The Web Ecology Project), Lucy Bernholz (Stanford University)
1:50pm We enhance privilege with supervised machine learning Mike Lee Williams (Cloudera Fast Forward Labs)
2:40pm Data ethics (not what you think) Louis Suarez-Potts (Age of Peers, Inc.)
4:20pm Big data ethics and a future for privacy Jonathan King (Ericsson)
5:10pm It’s a brave new world: Avoiding legal privacy and security snafus with big data and the IoT Alysa Z. Hutnik (Kelley Drye & Warren LLP), Kristi Wolff (Kelley Drye & Warren LLP)
230 A
11:00am Hadoop without borders: Building on-prem, cloud, and hybrid data flows Hiren Shah (Microsoft), Anand Subbaraj (Microsoft)
1:50pm Hadoop in the cloud: Good fit or round peg in a square hole? Thomas Phelan (HPE BlueData), Joel Baxter (BlueData)
5:10pm Best practices for achieving customer 360 Steven Totman (Cloudera), Nick Curcuru (Mastercard), Robert Bagley (ClickFox), Lori Bieda (Bank of Montreal)
230 C
11:00am The next 10 years of Apache Hadoop Ben Lorica (O'Reilly), Doug Cutting (Cloudera), Mike Cafarella (University of Michigan)
11:50am Bringing the Apache Hadoop ecosystem to the Google Cloud Platform Jennifer Wu (Cloudera), James Malone (Google)
2:40pm Faster conclusions using in-memory columnar SQL and machine learning Wes McKinney (Two Sigma Investments), Jacques Nadeau (Dremio)
4:20pm Just-in-time optimizing a database Ted Dunning (MapR, now part of HPE)
5:10pm Architecting HBase in the field Jean-Marc Spaggiari (Cloudera), Kevin O'Dell (Rocana)
LL20 B
11:00am How Siemens handles complexity in streaming data from millions of sensors Yvonne Quacken (Siemens), Allen Hoem (Teradata)
11:50am Where are you in your big data journey? Armando Acosta (Dell), Adnan Khaleel, Jeff Weidner (Dell), Deepak Gattala (Dell)
4:20pm Making big data ready for business now Amit Walia (Informatica), Badhrinath Krishnamoorthy (Cognizant)
5:10pm Containers: The natural platform for data applications Partha Seetala (Robin Systems)
LL21 A
11:50am Wrangling, metadata, and governance: Supervision vs. adoption Wei Zheng (Trifacta), Mohan Sadashiva (Waterline Data), Mark Donsky (Okera)
2:40pm The emerging data imperative Wei Wang (Hortonworks), Scott Gnau (Hortonworks)
4:20pm How TD Bank is using Hadoop to create IT 3.0 and launch the next-generation bank Mok Choe (TD Bank Group), Paul Barth (Podium Data)
230 B
11:00am Building a scalable data science platform with R Mario Inchiosa (Microsoft), Roni Burd (Microsoft)
11:50am The Internet of Things: How to do it. Seriously! Chris Rawles (Pivotal)
2:40pm From X-ray to MRI: New insights on data about data Dave Wells (Paxata), Nenshad Bardoliwalla (Paxata), Travis Ringger (PwC), Conrad Mulcahy (K2 Intelligence)
5:10pm Moving beyond the enterprise: Data sharing as the next big idea Sandy Steier (1010data), Dennis Gleeson (1010data)
210 B/F
5:10pm Building a scalable architecture for processing streaming data on AWS Siva Raghupathy (Amazon Web Services), Manjeet Chayel (Amazon Web Services)
Grand Ballroom 220
8:45am Wednesday keynote welcome Roger Magoulas (O'Reilly Media), Doug Cutting (Cloudera), Alistair Croll (Solve For Interesting)
8:55am Apache Hadoop at 10 Doug Cutting (Cloudera)
9:15am Machine learning for human rights advocacy: Big benefits, serious consequences Megan Price (Human Rights Data Analysis Group)
9:25am Let's get real: Acting on data in real time Jack Norris (MapR Technologies)
9:35am Delivering information in context Ian Andrews (Pivotal)
9:40am Using commerce data to fuel innovation Bruce Andrews (US Department of Commerce)
10:10am Using computer vision to understand big visual data Alyosha Efros (UC Berkeley)
10:30am Morning Break sponsored by MemSQL | Room: Expo Hall
12:30pm Lunch sponsored by Microsoft Wednesday BoF Tables | Room: Expo Hall
3:20pm Afternoon Break sponsored by Pivotal | Room: Expo Hall
5:50pm Event Booth Crawl | Room: Expo Hall
6:30am Data Dash | Room: Guadalupe River Park
7:00pm Data After Dark: Cirque Celebration | Room: South Hall
12:30pm Event Women in Big Data Forum Meetup | Room: Hilton, Almaden 1
7:30am Coffee Break | Room: Grand Ballroom Foyer
11:00am-11:40am (40m) Spark & Beyond Artificial intelligence, Machine learning
SparkNet: Training deep networks in Spark
Robert Nishihara (University of California, Berkeley)
Robert Nishihara offers an overview of SparkNet, a framework for training deep networks in Spark using existing deep learning libraries (such as Caffe) for the backend. SparkNet gets an order of magnitude speedup from distributed training relative to Caffe on a single GPU, even in the regime in which communication is extremely expensive.
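SparkNet's basic pattern is to trade communication for extra local computation: each worker trains its own replica for a fixed number of iterations, and the driver then averages the parameters. The following is a minimal, hypothetical PySpark/NumPy sketch of that parameter-averaging loop on a toy least-squares problem; the local_sgd helper and the toy data are illustrative only, not SparkNet's Caffe-backed implementation.

```python
# Sketch of SparkNet-style parameter averaging on a toy problem (illustrative only).
import numpy as np
from pyspark.sql import SparkSession

spark = SparkSession.builder.master("local[4]").appName("avg-sgd-sketch").getOrCreate()
sc = spark.sparkContext

def local_sgd(weights, samples, lr=0.1, steps=5):
    """Run a few SGD steps of least-squares regression on one partition's data."""
    w = weights.copy()
    for _ in range(steps):
        for x, y in samples:
            w -= lr * (w.dot(x) - y) * x
    return w

# Toy data: y = 2*x0 + 3*x1 plus a little noise, spread across 4 partitions.
rng = np.random.default_rng(0)
data = [(x, float(x.dot(np.array([2.0, 3.0])) + rng.normal(scale=0.01)))
        for x in rng.normal(size=(400, 2))]
rdd = sc.parallelize(data, 4).cache()

w = np.zeros(2)
for _ in range(10):                               # communication happens once per round
    bw = sc.broadcast(w)
    replicas = rdd.mapPartitions(
        lambda it, bw=bw: [local_sgd(bw.value, list(it))]).collect()
    w = np.mean(replicas, axis=0)                 # average the locally trained weights
print("learned weights:", w)                      # should be close to [2, 3]
spark.stop()
```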
11:50am-12:30pm (40m) Data Science & Advanced Analytics Machine learning, Smart agents and human/machine augmentation
Augmenting machine learning with human computation for better personalization
Eric Colson (Stitch Fix)
Recommender systems use machine-learning algorithms to surface relevant products to consumers. While they are extremely effective, they cannot fully replace human interpretation. The two have very different capabilities that are additive. Eric Colson shows what's possible when the unique contributions of machines are combined with those of human experts to create a truly personalized experience.
1:50pm-2:30pm (40m) Data Science & Advanced Analytics Machine learning
Building a marketplace: Eventbrite's approach to search and recommendation
John Berryman (Eventbrite)
At Eventbrite, users can serendipitously discover events they will love. But making this possible isn't easy. Events are short lived, and by the time Eventbrite can build an adequate collaborative-filtering model, the event is already over. John Berryman explains how Eventbrite overcomes these technical challenges with a combination of collaborative-filtering and content-based methods.
2:40pm-3:20pm (40m) Data Science & Advanced Analytics Machine learning
How to make analytic operations look more like DevOps: Lessons learned moving machine-learning algorithms to production environments
Robert Grossman (University of Chicago)
There is a big difference between running a machine-learning algorithm manually from time to time and building a production system that runs thousands of machine-learning algorithms each day on petabytes of data, while also dealing with all the edge cases that arise. Robert Grossman discusses some of the lessons learned when building such a system and explores the tools that made the job easier.
4:20pm-5:00pm (40m) Data Science & Advanced Analytics
Putting the “science” into data science: The importance of reproducibility and peer review for quantitative research
Erik Andrejko (The Climate Corporation)
Best practices from scientific research can significantly increase the pace and quality of data science projects. Erik Andrejko discusses the benefits and challenges of reproducibility and collaboration, including review and inter-team communication, for data science work at the Climate Corporation.
5:10pm-5:50pm (40m) Data Science & Advanced Analytics
Data science teams: Hold out for the unicorn or build bands of steeds?
Michael Dauber (Amplify Partners), Yael Garten (LinkedIn), Monica Rogati (Data Natives), Daniel Tunkelang (Various)
We’ve all heard that rare breed, the data scientist, described as a unicorn. In building your data science team, should you hold out for that unicorn or create groups of specialists who can work together? Michael Dauber, Yael Garten, Monica Rogati, and Daniel Tunkelang discuss the pros and cons of various team models to help you decide what works best for your particular situation and organization.
11:00am-11:40am (40m) Visualization & User Experience
Panoramix: An open source data visualization platform
Maxime Beauchemin (Lyft)
Panoramix makes it easy to slice, dice, and visualize your data. Point it to Druid (or almost any other database) and navigate through your data at the speed of thought. Maxime Beauchemin outlines the features and use cases for Panoramix.
11:50am-12:30pm (40m) Security
Attack graphs: Visually exploring 300M alerts per day
Leo Meyerovich (Graphistry), Joshua Patterson (NVIDIA), Mike Wendt (NVIDIA)
“Assuming breach” led to centralizing all logs (SIEMs), but incident response and forensics are still behind on the analytics side. Leo Meyerovich, Mike Wendt, and Joshua Patterson share how Graphistry and Accenture Technology Labs are rethinking data engineering and data analysis and modernizing end-to-end architectures.
1:50pm-2:30pm (40m) Visualization & User Experience
Delivering big data insight at Markerstudy
Nick Turner (Markerstudy)
Nick Turner offers a case study of Markerstudy, an insurance and insurance-related-services company based in the UK that recreated its data platform around Hadoop. Dubbed the Big Data Insight project, the new platform features near real-time reporting and self-service exploration and has resulted in reduced claims costs, better fraud detection, and increased customer-retention rates.
2:40pm-3:20pm (40m) Visualization & User Experience
Visualization as data and data as visualization: Building insights in a data-flow world
Christopher Nguyen (Arimo), Anh Trinh (Arimo, Inc.)
Most people think of data visualizations as charts or graphs with perhaps some interactivity. Christopher Nguyen and Anh Trinh present a new approach that considers visualizations to be first-class objects that also act as data sources and sinks. This enables powerful collaboration where thousands of users can build on the work of one another by sharing these visualization objects.
4:20pm-5:00pm (40m) Visualization & User Experience
What can user-centered design do for visualizing your data?
Irene Ros (Bocoup)
Data visualization is everywhere—it communicates meaningful data, finds insights through exploratory interfaces, and informs people through data-driven content. More and more, consumers expect to interact with the data, not just consume it. Irene Ros explains how to employ techniques from user-centered design to build better data-visualization interfaces.
5:10pm-5:50pm (40m) Visualization & User Experience
A few things engineers can learn from designers
Sebastien Pierre (FFunction)
Big data is great for feeding ML algorithms, but you quickly face a bandwidth issue when interfacing with humans. The brain is a fantastic information-processing machine and has an unparalleled, innate ability to detect patterns. Sébastien Pierre explains what designers can teach engineers about creating new ways to make large volumes of data understandable at the human level.
11:00am-11:40am (40m) Data Science & Advanced Analytics
A year of anomalies: Building shared infrastructure for anomaly detection
Chris Sanden (Netflix), Christopher Colburn (Netflix)
Chris Sanden and Christopher Colburn outline a shared infrastructure for anomaly detection. Chris and Christopher explain how their solution addresses both real-time and batch use cases and offer a framework for performance evaluation.
11:50am-12:30pm (40m) Data Science & Advanced Analytics Artificial intelligence, Machine learning
Real-time fraud detection using process mining with Spark Streaming
Bolke de Bruin (ING), Hylke Hendriksen (ING)
If you consider user click paths a process, you can apply process mining. Process mining models users based on their actual behavior, which allows us to compare new clicks with modeled behavior and report any inconsistencies. Bolke de Bruin and Hylke Hendriksen explain how ING implemented process mining on Spark Streaming, enabling real-time fraud detection.
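A minimal sketch of the general idea, using the classic Spark Streaming (DStream) API: learn the set of normal page-to-page transitions from historical sessions, then flag streaming clicks whose transition falls outside that model. The input format (user,from_page,to_page lines on a socket) and the toy model are assumptions for illustration; this is not ING's implementation.

```python
# Illustrative sketch: flag click transitions that are absent from a learned process model.
from pyspark import SparkContext
from pyspark.streaming import StreamingContext

sc = SparkContext("local[2]", "process-mining-sketch")
ssc = StreamingContext(sc, batchDuration=5)

# Toy "process model": the set of transitions observed in historical sessions.
historical_paths = [["login", "overview", "transfer", "confirm"],
                    ["login", "overview", "logout"]]
allowed = {(a, b) for path in historical_paths for a, b in zip(path, path[1:])}
allowed_bc = sc.broadcast(allowed)

# Each incoming line is assumed to look like "user_id,from_page,to_page".
clicks = ssc.socketTextStream("localhost", 9999)

def parse(line):
    user, src, dst = line.strip().split(",")
    return user, (src, dst)

suspicious = (clicks.map(parse)
                    .filter(lambda kv: kv[1] not in allowed_bc.value))
suspicious.pprint()          # in practice: raise alerts via Kafka, a database, etc.

ssc.start()
ssc.awaitTermination()
```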
1:50pm-2:30pm (40m) Data Science & Advanced Analytics
Docker for data scientists
Michelangelo D'Agostino (ShopRunner)
Data scientists inhabit such an ever-changing landscape of languages, packages, and frameworks that it can be easy to succumb to tool fatigue. If this sounds familiar, you may have missed the increasing popularity of Linux containers in the DevOps world, in particular Docker. Michelangelo D'Agostino demonstrates why Docker deserves a place in every data scientist’s toolkit.
2:40pm-3:20pm (40m) Data Science & Advanced Analytics Machine learning
BayesDB: Query the probable implications of your data
Richard Tibbetts (Tableau), Vikash Mansinghka (MIT)
BayesDB enables rapid prototyping and incremental refinement of statistical models by combining a model-independent declarative query language, BQL, with machine-assisted modeling and compositional models. Richard Tibbetts and Vikash Mansinghka explore the applications of BayesDB for analyzing and understanding developmental economics data in collaboration with the Gates Foundation.
4:20pm-5:00pm (40m) Data Science & Advanced Analytics Artificial intelligence, Machine learning
Can deep neural networks save your neural network? Artificial intelligence, sensors, and strokes
Brandon Ballinger (Cardiogram), Johnson Hsieh (Cardiogram)
Each year, 15 million people suffer strokes, and at least a fifth of those are due to atrial fibrillation, the most common heart arrhythmia. Brandon Ballinger reports on a collaboration between UCSF cardiologists and ex-Google data scientists that detects atrial fibrillation with deep learning.
5:10pm-5:50pm (40m) Data Science & Advanced Analytics Machine learning
Deep learning and recurrent neural networks applied to electronic health records
Josh Patterson (Patterson Consulting), Dave Kale (Skymind), Zachary Lipton (University of California, San Diego)
Time series data is increasingly ubiquitous with both the adoption of electronic health record (EHR) systems in hospitals and clinics and the proliferation of wearable sensors. Josh Patterson, David Kale, and Zachary Lipton bring the open source deep learning library DL4J to bear on the challenge of analyzing clinical time series using recurrent neural networks (RNNs).
11:00am-11:40am (40m) Security Artificial intelligence, Machine learning
How to tackle false positives in big data security applications
Ram Shankar Siva Kumar (Microsoft (Azure Security Data Science)), Cody Rioux (Netflix (Real-time Analytics))
In the era of large-volume security applications, false positives, as Gartner says, can make the difference between building an "indicator machine" and an "answering machine." Ram Shankar and Cody Rioux explore how to suppress false positives in security monitoring systems through use cases from Microsoft and Netflix.
11:50am-12:30pm (40m) Security
Securing Apache Kafka
Jun Rao (Confluent)
With Apache Kafka 0.9, the community has introduced a number of features to make data streams secure. Jun Rao explains the motivation for making these changes, discusses the design of Kafka security, and demonstrates how to secure a Kafka cluster. Jun also covers common pitfalls in securing Kafka and ongoing security work.
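As a concrete (and hypothetical) taste of the client side, the snippet below connects a producer and consumer to a TLS-secured cluster using the third-party kafka-python library; broker-side keystores, listeners, and authorization are assumed to be configured already, and all hostnames and file paths are placeholders.

```python
# Hypothetical client-side example of talking to a TLS-secured Kafka cluster.
from kafka import KafkaProducer, KafkaConsumer

common = dict(
    bootstrap_servers="broker.example.com:9093",  # TLS listener, not the plaintext port
    security_protocol="SSL",
    ssl_cafile="ca.pem",          # CA that signed the broker certificates
    ssl_certfile="client.pem",    # client certificate, used for authentication
    ssl_keyfile="client.key",     # client private key
)

producer = KafkaProducer(**common)
producer.send("secured-topic", b"hello over TLS")
producer.flush()

consumer = KafkaConsumer("secured-topic", group_id="secure-demo",
                         auto_offset_reset="earliest", **common)
for record in consumer:
    print(record.value)
    break
```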
1:50pm-2:30pm (40m) Security
Three principles for a data-centric security architecture on Hadoop to simplify your life
Pratik Verma (BlueTalon), Paulo Pereira (GE)
Pratik Verma and Paulo Pereira share three security architecture principles for Hadoop to protect sensitive data without disrupting users: modifying requests to filter content makes security transparent to users; centralizing data-access decisions and distributing enforcement makes security scalable; and using metadata instead of files or tables ensures systematic protection of sensitive data.
2:40pm-3:20pm (40m) Security
Simplifying Hadoop with RecordService, a secure and unified data access path for compute frameworks
Chao Sun (Cloudera), Alex Leblang (Cloudera)
Chao Sun and Alex Leblang explore RecordService, a new solution that provides an API to read data from Hadoop storage managers and return them as canonical records. This eliminates the need for components to support individual file formats, handle security, perform auditing, and implement sophisticated IO scheduling and other common processing that is at the bottom of any computation.
4:20pm-5:00pm (40m) Security Machine learning
Leveraging Spark to analyze billions of user actions to reveal hidden fraudsters
Yinglian Xie (DataVisor)
Yinglian Xie describes the anatomy of modern online services, where large armies of malicious accounts hide among legitimate users and conduct a variety of attacks. Yinglian demonstrates how the Spark framework can facilitate early detection of these types of attacks by analyzing billions of user actions.
5:10pm-5:50pm (40m) Security
Protecting enterprise data in Apache Hadoop
Don Bosco Durai (Privacera)
Bosco Durai offers a top-down view of security in the Hadoop ecosystem. Bosco explores the right way to protect your data based on your enterprise's security requirements, as he covers the available mechanisms to achieve your information security goals.
11:00am-11:40am (40m) Data-driven Business
Empowering business users to lead with data
Denise McInerney (Intuit)
The most valuable people in your organization combine business acumen with data savviness. But these data heroes are rare. Denise McInerney describes how she has empowered business users at Intuit to make better decisions with data and explains how you can do the same thing in your organization.
11:50am-12:30pm (40m) Data-driven Business
How to hook up your event data for behavioral insights
Lior Abraham (Interana)
Lior Abraham explores how Tinder reinvented its behavioral analytics approach with Interana to tune matchmaking and business operations. Lior discusses strategies for behavioral analytics and explains how they can be applied at your company to increase conversion, improve engagement, and maximize retention.
1:50pm-2:30pm (40m) Data-driven Business
Where’s the puck headed?
Michael Dauber (Amplify Partners), Shivon Zilis (Bloomberg Beta), Cack Wilhelm (Scale Venture Partners), Roseanne Wincek (Institutional Venture Partners), Kristina Bergman (Ignition Partners)
In a panel discussion, top-tier VCs look over the horizon and consider the big trends in big data, explaining what they think the field will look like a few years (or more) down the road. Join us as Shivon Zilis, Cack Wilhelm, Michael Dauber, Kristina Bergman, and Roseanne Wincek talk about trends that everyone is seeing and areas for investment that they find exciting.
2:40pm-3:20pm (40m) Data-driven Business Artificial intelligence, Machine learning
Why a data career is a great choice, now more than ever
Jin Zhang (CA Technologies), Jerry Overton (DXC), Michele Chambers (Continuum Analytics)
Data has become a hot career choice, but some fear that a career in data is highly stressful or simply boring. Jin Zhang, Jerry Overton, and Michele Chambers give an overview of the field and its various specializations with the hope that this understanding will eliminate any fear and empower attendees to pursue a career in data.
4:20pm-5:00pm (40m) Data-driven Business Machine learning
Automating decision making with big data: How to make it work
Andreas Schmidt (Blue Yonder)
While many companies struggle to adopt big data, a number of industry leaders are leapfrogging big data adoption by going straight to automating core business processes. Andreas Schmidt presents examples from leading European companies that have overcome cultural, technical, and scientific challenges and unlocked the potential of big data in an entirely different way.
5:10pm-5:50pm (40m) Data-driven Business Machine learning
Working on the blockchain gang: Crunching and visualizing bitcoin data
Benedikt Koehler (DataLion)
Benedikt Koehler offers approaches to analyzing and visualizing bitcoin data—accessing and downloading the blockchain, transforming the data into a networked data format, identifying hubs and clusters, and visualizing the results as dynamic network graphs—so that typical patterns and anomalies can quickly be identified.
11:00am-11:40am (40m) Enterprise Adoption
Developing a big data business strategy
Bill Schmarzo (EMC)
Organizations do not need a big data strategy. They need a business strategy that incorporates big data. Most organizations lack a roadmap for using big data to uncover new business opportunities. Bill Schmarzo explains how to explore, justify, and plan big data projects with business management.
11:50am-12:30pm (40m) Enterprise Adoption Machine learning
How to build a successful data lake
Alex Gorelik (Waterline Data)
It is fashionable today to declare doom and gloom for the data lake. Alex Gorelik discusses best practices for Hadoop data lake success and provides real-world examples of successful data lake implementations in a non-vendor-specific talk.
1:50pm-2:30pm (40m) Enterprise Adoption Machine learning
eBay analysts and governed self-service analysis: Delivering “turn-by-turn” smart suggestions
Debora Seys (eBay)
Autofill, spellcheck, and turn-by-turn directions provide just-in-time suggestions. What if guiding users to accurate data were as simple? Debora Seys explains how eBay is delivering self-service analytics by moving from heavily engineered metadata systems to the new world of machine-learned guidance and asynchronous collaboration.
2:40pm-3:20pm (40m) Enterprise Adoption Machine learning
An introduction to Transamerica's product recommendation platform
Vishal Bamba (Transamerica), Nitin Prabhu (Transamerica), Jeremy Beck, Amy Wang (H2O.ai)
Transamerica built a product recommendation system that can be leveraged across multiple distribution channels to recommend products, serve customer needs, and reduce complexity. Vishal Bamba, Nitin Prabhu, Jeremy Beck, and Amy Wang highlight the machine-learning technology, models, and architecture behind Transamerica's product recommendation platform.
4:20pm-5:00pm (40m) Enterprise Adoption
Not your father's database: How to use Apache Spark properly in your big data architecture
Vida Ha (Databricks)
Apache Spark is a versatile big data processing framework, but just because you can program in SQL for Spark does not mean Spark is a database. For an optimal big data infrastructure, you may still need a distributed file system, databases (SQL or NoSQL), message queues, and specialized systems such as Elasticsearch. Vida Ha explains how to design architecture for different use cases.
5:10pm-5:50pm (40m) Enterprise Adoption Artificial intelligence, Machine learning
Amazon for information: Building a modern data catalog
Aaron Kalb (Alation)
A data catalog provides context to help data analysts, data scientists, and other data consumers (including those with little technical background) find a relevant dataset, determine if it can be trusted, understand what it means, and utilize it to make better products and better decisions. Aaron Kalb explores how enterprises build interfaces that make sourcing data as easy as shopping on Amazon.
11:00am-11:40am (40m) Spark & Beyond
The state of Spark and where it is going in 2016
Reynold Xin (Databricks)
Reynold Xin reviews Spark’s adoption and development in 2015. Reynold then looks to the future to outline three major technology trends—the integration of streaming systems and enterprise data infrastructure, cloud computing and elasticity, and the rise of new hardware—discuss the major efforts to address these trends, and explore their implications for Spark users.
11:50am-12:30pm (40m) Hadoop Use Cases
Uber, your Hadoop has arrived: Powering intelligence for Uber’s real-time marketplace
Vinoth Chandar (Apache Hudi)
Vinoth Chandar explains how Uber revamped its foundational data infrastructure with Hadoop as the source-of-truth data lake, sharing lessons from the experience.
1:50pm-2:30pm (40m) Spark & Beyond
Scala and the JVM as a big data platform: Lessons from Apache Spark
Dean Wampler (Anyscale)
The success of Apache Spark is bringing developers to Scala. For big data, the JVM uses memory inefficiently, causing significant GC challenges. Spark's Project Tungsten fixes these problems with custom data layouts and code generation. Dean Wampler gives an overview of Spark, explaining ongoing improvements and what we should do to improve Scala and the JVM for big data.
2:40pm-3:20pm (40m) Data Science & Advanced Analytics Machine learning
Analyzing time series data with Spark
Sandy Ryza (Clover Health)
Want to build models over data every second from millions of sensors? Dig into the histories of millions of financial instruments? Sandy Ryza discusses the unique challenges of time series data and explains how to work with it at scale. Sandy then introduces the open source Spark-Timeseries library, which provides a natural way of munging, manipulating, and modeling time series data.
4:20pm-5:00pm (40m) Spark & Beyond
Designing a scalable real-time data platform using Akka, Spark Streaming, and Kafka
Alex Silva (Pluralsight)
Alex Silva outlines the implementation of a real-time analytics platform using microservices and a Scala stack that includes Kafka, Spark Streaming, Spray, and Akka. This infrastructure can process vast amounts of streaming data, ranging from video events to clickstreams and logs. The result is a powerful real-time data pipeline capable of flexible data ingestion and fast analysis.
5:10pm-5:50pm (40m) Spark & Beyond
Testing and validating Spark programs
Holden Karau (Independent)
Apache Spark is a fast, general engine for big data processing. As Spark jobs are used for more mission-critical tasks, it is important to have effective tools for testing and validation. Holden Karau details reasonable validation rules for production jobs and best practices for creating effective tests, as well as options for generating test data.
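To give a flavor of what such tests look like, here is a minimal pytest sketch that spins up a local-mode SparkSession and checks a small DataFrame transformation; the filter_adults function is a made-up example, not code from the talk.

```python
# Minimal sketch of unit-testing a Spark transformation with pytest (illustrative only).
import pytest
from pyspark.sql import SparkSession
from pyspark.sql import functions as F

def filter_adults(df):
    """The logic under test: keep rows with age >= 18."""
    return df.filter(F.col("age") >= 18)

@pytest.fixture(scope="session")
def spark():
    session = (SparkSession.builder
               .master("local[2]")        # small local cluster, no external dependencies
               .appName("unit-tests")
               .getOrCreate())
    yield session
    session.stop()

def test_filter_adults_keeps_only_adults(spark):
    df = spark.createDataFrame([("a", 17), ("b", 18), ("c", 42)], ["name", "age"])
    result = filter_adults(df).collect()
    assert sorted(row.name for row in result) == ["b", "c"]
```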
11:00am-11:40am (40m) IoT and Real-time
Distributed stream processing with Apache Kafka
Jay Kreps (Confluent)
The world is moving to real-time data, and much of that data flows through Apache Kafka. Jay Kreps explores how Kafka forms the basis for our modern stream-processing architecture. Jay covers some of the pros and cons of different frameworks and approaches and discusses the recent APIs Kafka has added to allow direct stream processing of Kafka data.
11:50am-12:30pm (40m) IoT and Real-time
Real-time Hadoop: What an ideal messaging system should bring to Hadoop
Ted Dunning (MapR, now part of HPE)
Application messaging isn’t new—solutions include IBM MQ, RabbitMQ, and ActiveMQ. Apache Kafka is a high-performance, high-scalability alternative that integrates well with Hadoop. Can a modern distributed messaging system like Kafka be considered a legacy replacement, or is it purely complementary? Ted Dunning outlines Kafka's architectural benefits and tradeoffs to find the answer.
1:50pm-2:30pm (40m) IoT and Real-time Machine learning
IoT in the enterprise: A look at Intel (IoT) Inside
Moty Fania (Intel)
Moty Fania shares Intel’s IT experience implementing an on-premises big data IoT platform for internal use cases. This unique platform was built on top of several open source technologies and enables highly scalable stream analytics with a stack of algorithms such as multisensor change detection, anomaly detection, and more.
2:40pm-3:20pm (40m) IoT and Real-time Machine learning, Smart agents and human/machine augmentation
How to turn your house into a robot: An adaptive-learning algorithm for the Internet of Things
Brandon Rohrer (Microsoft)
Modern houses and robots have a lot in common. Both have a lot of sensors and have to make a lot of decisions. However, unlike houses, robots adapt and perform helpful tasks. Brandon Rohrer details an algorithm specifically designed to help houses, buildings, roads, and stores learn to actively help the people that use them.
4:20pm-5:00pm (40m) Data Science & Advanced Analytics Machine learning
Real-world smart applications with Amazon Machine Learning
Alex Ingerman (Amazon Web Services)
Alex Ingerman explains how several AWS services, including Amazon Machine Learning, Amazon Kinesis, AWS Lambda, and Amazon Mechanical Turk, can be tied together to build a predictive application to power a real-time customer-service use case.
5:10pm-5:50pm (40m) Data Innovations
Putting Kafka into overdrive
Todd Palino (LinkedIn), Gwen Shapira (Confluent)
Apache Kafka lies at the heart of the largest data pipelines, handling trillions of messages and petabytes of data every day. Learn the right approach for getting the most out of Kafka from the experts at LinkedIn and Confluent. Todd Palino and Gwen Shapira explore how to monitor, optimize, and troubleshoot performance of your data pipelines—from producer to consumer, development to production.
11:00am-11:40am (40m) Data Innovations
Analyzing billions of users with Druid and Theta Sketches
Eric Tschetter (Yahoo)
Yahoo uses Druid to provide visibility into the actions of its billions of users and developed a new type of sketch, the Theta Sketch, to enable this analysis. Eric Tschetter discusses how Yahoo leverages Druid and Theta Sketches together to achieve a user-level understanding of those billions of users.
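To illustrate the underlying idea without Yahoo's actual DataSketches code, here is a toy K-minimum-values sketch in Python: it keeps only the k smallest distinct hash values and estimates the number of distinct items from how tightly those values cluster near zero. Real Theta Sketches refine this and also support set operations (union, intersection, difference) on the sketches themselves, which is what makes user-level analysis practical at this scale.

```python
# Toy K-minimum-values sketch (illustrative only; not Yahoo's DataSketches library).
import hashlib
import random

def h(item):
    """Hash an item to a float in [0, 1)."""
    digest = hashlib.sha1(str(item).encode()).hexdigest()
    return int(digest, 16) / 16**40

def kmv_estimate(items, k=1024):
    """Estimate the number of distinct items while storing at most k hash values."""
    kept = set()    # the k smallest distinct hash values seen so far
    theta = 1.0     # current k-th smallest hash value (the "theta" in Theta Sketch)
    for item in items:
        v = h(item)
        if v >= theta or v in kept:
            continue
        kept.add(v)
        if len(kept) > k:
            kept.discard(max(kept))   # evict the largest; O(k) per eviction is fine here
            theta = max(kept)
    if len(kept) < k:
        return float(len(kept))       # fewer than k distinct items seen: count is exact
    return (k - 1) / max(kept)        # classic K-minimum-values estimator

stream = [random.randrange(1_000_000) for _ in range(200_000)]
print("exact distinct:", len(set(stream)))
print("KMV estimate  :", round(kmv_estimate(stream)))
```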
11:50am-12:30pm (40m) Data Innovations Machine learning
NoLambda: A new architecture combining streaming, ad hoc, machine-learning, and batch analytics
Helena Edelson (Apple), Evan Chan (Tuplejump)
Developers who want both streaming analytics and ad hoc, OLAP-like analysis have often had to develop complex architectures such as Lambda. Helena Edelson and Evan Chan highlight a much simpler approach using the NoLambda stack (Apache Spark/Scala, Mesos, Akka, Cassandra, Kafka) plus FiloDB, a new entrant to the distributed-database world, which combines streaming and ad hoc analytics.
1:50pm-2:30pm (40m) Data Innovations
Grounding big data: A meta-imperative
Joe Hellerstein (UC Berkeley), Vikram Sreekanti (Berkeley AMP Lab)
Metadata services are a critical missing piece of the current open source ecosystem for big data. Joe Hellerstein and Vikram Sreekanti give an overview of their vendor-neutral metadata services layer, Ground, through two reference use cases at UC Berkeley: genomics research driven by Spark and courseware using Jupyter Notebooks.
2:40pm-3:20pm (40m) Data Innovations
Unified namespace and tiered storage in Alluxio
Calvin Jia (Alluxio), Jiri Simsa (Alluxio)
Not all storage resources are equal. Alluxio has developed Alluxio tiered storage to achieve highly efficient utilization of memory, SSDs, and HDDs that is completely transparent to computation frameworks and user applications. Calvin Jia and Jiri Simsa outline the features and use cases of Alluxio tiered storage.
4:20pm-5:00pm (40m) Data Innovations
Building the data infrastructure of the future with persistent memory
Derrick Harris (Mesosphere), Rob Peglar (Micron Technology, Inc), Milind Bhandarkar (Ampool, Inc.), Richard Probst (SAP), Todd Lipcon (Cloudera)
Years of research in nonvolatile memory systems is being productized and has started coming to market. These exciting new technologies promise lower power consumption and higher density for persistent storage. Will these hardware advances revolutionize the data ecosystem as we know it? This compelling panel of data-infrastructure thought leaders discusses the possibilities.
5:10pm-5:50pm (40m) Data Innovations Machine learning
Streaming architecture: Why flow instead of state?
Ted Dunning (MapR, now part of HPE)
Until recently, batch processing has been the standard model for big data. Today, many have shifted to streaming architectures that offer large benefits in simplicity and robustness, but this isn't your father’s complex event processing. Ted Dunning explores the key design techniques used in modern systems, including percolators, replayable queues, state-point queuing, and microarchitectures.
11:00am-11:40am (40m) Law, Ethics, Governance
Data science for good means designing for people: Part 1
Jake Porway (DataKind), Rachel Quint (Hewlett Foundation), Sue-Ann Ma, Jeremy Anderson (IBM)
So many of the data projects making headlines—from a new app for finding public services to a new probabilistic model for predicting weather patterns for subsistence farmers—are great accomplishments but don’t seem to have end users in mind. Discover how organizations are designing with, not for, people, accounting for what drives them in order to make long-lasting impact.
11:50am-12:30pm (40m) Law, Ethics, Governance
Data science for good means designing for people: Part 2
Jake Porway (DataKind), Daniella Perlroth (Lyra Health), Tim Hwang (ROFLCon / The Web Ecology Project), Lucy Bernholz (Stanford University)
So many of the data projects making headlines—from a new app for finding public services to a new probabilistic model for predicting weather patterns for subsistence farmers—are great accomplishments but don’t seem to have end users in mind. Discover how organizations are designing with, not for, people, accounting for what drives them in order to make long-lasting impact.
1:50pm-2:30pm (40m) Law, Ethics, Governance Machine learning
We enhance privilege with supervised machine learning
Mike Lee Williams (Cloudera Fast Forward Labs)
Machines are not objective, and big data is not fair. Michael Williams uses sentiment analysis to show that supervised machine learning has the potential to amplify the voices of the most privileged people in society, violate the spirit and letter of civil rights law, and make your product suck.
2:40pm-3:20pm (40m) Law, Ethics, Governance
Data ethics (not what you think)
Louis Suarez-Potts (Age of Peers, Inc.)
2015 saw an increased urgency around the ethics of big data, as the UN began to adopt civil-society partnerships with big data organizations. But what, if anything, are we supposed to do with the data we acquire, interpret, and label as big data? Louis Suarez-Potts examines big data ethics to explain best practices for putting the information gained through big data methodology to use.
4:20pm-5:00pm (40m) Law, Ethics, Governance
Big data ethics and a future for privacy
Jonathan King (Ericsson)
Jonathan King outlines ethical best practices for big data and explores the difficult questions emerging from missteps that have caused public outcry, as well as the legal, ethical, and regulatory frameworks that are just beginning to take shape around big data.
5:10pm-5:50pm (40m) Law, Ethics, Governance
It’s a brave new world: Avoiding legal privacy and security snafus with big data and the IoT
Alysa Z. Hutnik (Kelley Drye & Warren LLP), Kristi Wolff (Kelley Drye & Warren LLP)
In the current explosion of the Internet of Things, big data, and mobile, compliance often takes a back seat. But the failure to address legal privacy and consumer-protection considerations has landed many in hot water, resulting in potential legal settlements and business failures. Alysa Hutnik and Kristi Wolff discuss flash points and proactive strategies to avoid becoming a target.
11:00am-11:40am (40m) Hadoop Use Cases
Hadoop without borders: Building on-prem, cloud, and hybrid data flows
Hiren Shah (Microsoft), Anand Subbaraj (Microsoft)
Whether you want to extend your on-prem data lake with workflows that leverage the benefits of the cloud’s elastic scale, or you have sensitive data that you need to anonymize and aggregate on-prem before sending to the cloud, you need a hybrid data-integration solution for Hadoop. Hiren Shah and Anand Subbaraj show how to build hybrid data flows with Microsoft HDInsight and Azure Data Factory.
11:50am-12:30pm (40m) Spark & Beyond
Fast big data analytics and machine learning using Alluxio and Spark in Baidu
Bin Fan (Alluxio), Haojun Wang (Baidu)
Baidu runs Alluxio in production with hundreds of nodes managing petabytes of data. Bin Fan and Haojun Wang demonstrate how Alluxio improves big data analytics (ad hoc query)—Baidu experienced a 30x performance improvement—and explain how Baidu leverages Alluxio in its machine-learning architecture and how it uses Alluxio to manage heterogeneous storage resources.
1:50pm-2:30pm (40m) Hadoop Use Cases
Hadoop in the cloud: Good fit or round peg in a square hole?
Thomas Phelan (HPE BlueData), Joel Baxter (BlueData)
Thomas Phelan and Joel Baxter investigate the advantages and disadvantages of running specific Hadoop workloads in different infrastructure environments. Thomas and Joel then provide a set of rules to help users evaluate big data runtime environments and deployment options to determine which is best suited for a given application.
2:40pm-3:20pm (40m) Hadoop Use Cases
Successful enterprise data hub design patterns at BT
Phillip Radley (BT)
Phillip Radley explores how to use an “accumulation of marginal gains” approach to achieve success with an Apache Hadoop-based enterprise data hub (EDH), drawing on a set of design patterns built up over five years establishing BT’s EDH.
4:20pm-5:00pm (40m) Hadoop Use Cases
Subject-matter experts and access to rich data: A case study in protecting a network from the Brobot distributed denial of service attacks
John Omernik (MapR Technologies)
John Omernik walks attendees through Operation Ababil's 2013 DDoS attacks to understand how banks were able to implement controls to protect their networks. Using subject-matter experts, Hadoop, and low-friction access to data, members of the US banking industry were able to come up with new models to protect their networks from distributed denial of service attacks.
5:10pm-5:50pm (40m) Data-driven Business
Best practices for achieving customer 360
Steven Totman (Cloudera), Nick Curcuru (Mastercard), Robert Bagley (ClickFox), Lori Bieda (Bank of Montreal)
In a panel discussion, Cloudera's Steve Totman talks about the practicalities and realities of big data-based customer 360 with big data experts Lori Bieda, Nick Curcuru, and Robert Bagley. Attend if you have challenges implementing big data-based customer 360 or just want to learn from the panel's real-world experiences.
11:00am-11:40am (40m) Hadoop Internals & Development
The next 10 years of Apache Hadoop
Ben Lorica (O'Reilly), Doug Cutting (Cloudera), Mike Cafarella (University of Michigan)
Ben Lorica hosts a conversation with Doug Cutting and Mike Cafarella, the cofounders of Apache Hadoop.
11:50am-12:30pm (40m) Enterprise Adoption
Bringing the Apache Hadoop ecosystem to the Google Cloud Platform
Jennifer Wu (Cloudera), James Malone (Google)
Jennifer Wu and James Malone offer an insider look at how Google has integrated Hadoop components like HDFS, Impala, and Apache Spark with Google Cloud Platform technologies like Google Compute Engine (GCE), Bigtable, BigQuery, and Cloud Storage. Jennifer and James also explore the importance of Google’s growing collaboration with open source communities.
1:50pm-2:30pm (40m) Hadoop Internals & Development
Hadoop's storage gap: Resolving transactional-access and analytic-performance tradeoffs with Apache Kudu (incubating)
Todd Lipcon (Cloudera)
Todd Lipcon explores the tradeoffs between real-time transactional access and fast analytic performance from the perspective of storage-engine internals. Todd also outlines Kudu, the new addition to the open source Hadoop ecosystem that complements HDFS and HBase to provide a new option for achieving fast scans and fast random access from a single API.
2:40pm-3:20pm (40m) Data Science & Advanced Analytics Machine learning
Faster conclusions using in-memory columnar SQL and machine learning
Wes McKinney (Two Sigma Investments), Jacques Nadeau (Dremio)
Hadoop’s traditional batch technologies are quickly being supplanted by in-memory columnar execution to drive faster data-to-value. Wes McKinney and Jacques Nadeau provide an overview of in-memory columnar execution, survey key related technologies, including Kudu, Ibis, Impala, and Drill, and cover a sample use case using Ibis in conjunction with Apache Drill to deliver real-time conclusions.
4:20pm-5:00pm (40m) Data Innovations
Just-in-time optimizing a database
Ted Dunning (MapR, now part of HPE)
SQL is normally a very static language that assumes a fixed and well-known schema. Apache Drill breaks these assumptions by restructuring the execution of queries so optimizations and type resolution can be done just in time. This has profound consequences for how applicable SQL is in the big data world. Ted Dunning walks attendees through Drill and explores its implications for big data.
5:10pm-5:50pm (40m) Hadoop Use Cases
Architecting HBase in the field
Jean-Marc Spaggiari (Cloudera), Kevin O'Dell (Rocana)
Most already know HBase, but many don't know that it can be coupled with other tools from the ecosystem to increase efficiency. Jean-Marc Spaggiari and Kevin O'Dell walk attendees through some real-life HBase use cases and demonstrate how they have been efficiently implemented.
11:00am-11:40am (40m) Sponsored
How Siemens handles complexity in streaming data from millions of sensors
Yvonne Quacken (Siemens), Allen Hoem (Teradata)
Yvonne Quacken and Allen Hoem explore the business and technical challenges that Siemens faced capturing continuous data from millions of sensors across different areas and explain how Teradata Listener helped Siemens simplify this data-capture process with a single, central service to ingest multiple real-time data streams simultaneously in a reliable fashion.
11:50am-12:30pm (40m) Sponsored
Where are you in your big data journey?
Armando Acosta (Dell), Adnan Khaleel, Jeff Weidner (Dell), Deepak Gattala (Dell)
An interactive panel, hosted by Dell's Armando Acosta, explores how business units have taken advantage of Hadoop's strengths to quickly identify and implement solutions that deal with massive amounts of data to deliver valuable results across the business.
1:50pm-2:30pm (40m) Sponsored
Transactional streaming: If you can compute it, you can probably stream it
John Hugg (VoltDB)
In the race to pair streaming systems with stateful systems, the winners will be stateful systems that process streams natively. These systems remove the burden on application developers to be distributed systems experts and enable new applications to be both powerful and robust. John Hugg describes what’s possible when integrated systems apply a transactional approach to event processing.
2:40pm-3:20pm (40m) Sponsored
Creating intelligence: An applications-first approach to machine learning
Carlos Guestrin (Dato Inc.)
Machine learning is a hot topic. Recommenders, sentiment analysis, churn and click-through prediction, image recognition, and fraud detection are at the core of intelligent applications. However, developing these models is laborious. Carlos Guestrin shares a new approach to leverage massive amounts of data and applied machine learning at scale to create intelligent applications.
4:20pm-5:00pm (40m) Sponsored
Making big data ready for business now
Amit Walia (Informatica), Badhrinath Krishnamoorthy (Cognizant)
Amit Walia, chief product officer of Informatica, hosts a discussion with industry experts on how big data management can enable organizations to deliver faster, more flexible, and more repeatable big data projects while ensuring security and governance. Learn how organizations are using big data management to be more successful with their big data initiatives.
5:10pm-5:50pm (40m) Sponsored
Containers: The natural platform for data applications
Partha Seetala (Robin Systems)
Containers have taken the world by storm by radically transforming the way applications are built and deployed. But many fail to appreciate how powerful containers can be for performance-sensitive data applications. Partha Seetala explains how containers can help you "virtualize" your mission-critical enterprise applications, simplify application life cycles, and increase data-center efficiency.
11:00am-11:40am (40m) Sponsored
Globally distributed hybrid on-premises/cloud big data
Jagane Sundar (WANdisco)
Jagane Sundar discusses the unique challenges of hybrid big data deployments and outlines strategies to address them.
11:50am-12:30pm (40m) Sponsored
Wrangling, metadata, and governance: Supervision vs. adoption
Wei Zheng (Trifacta), Mohan Sadashiva (Waterline Data), Mark Donsky (Okera)
Wei Zheng, Mohan Sadashiva, and Mark Donsky explain how data-wrangling tools not only enable users to work with a variety of new or complex sources of data in Hadoop but also ensure that the data lineage and metadata created through the process are appropriately catalogued and made available to others in the organization.
1:50pm-2:30pm (40m) Sponsored
Can you afford to drop ACID? Understanding real-world SQL requirements in the big data era
Emma McGrattan (Actian)
Hadoop can bring great value to businesses but also big headaches. Some solutions that provide SQL access to Hadoop data mean changing your business processes to overcome limitations in the technologies. Emma McGrattan explains how users can unlock tremendous business value through SQL-driven Hadoop solutions. Emma outlines what should be on your checklist and the pitfalls to avoid.
2:40pm-3:20pm (40m) Sponsored
The emerging data imperative
Wei Wang (Hortonworks), Scott Gnau (Hortonworks)
Join Hortonworks to discuss transformational use cases from its customers that manage data in motion and data at rest. Wei Wang and Scott Gnau explore the modern data applications being built and deployed in 2016 that are driving new frontiers in information technology.
4:20pm-5:00pm (40m) Sponsored
How TD Bank is using Hadoop to create IT 3.0 and launch the next-generation bank
Mok Choe (TD Bank Group), Paul Barth (Podium Data)
Learn how TD Bank is creating the bank of the future through IT 3.0. Central to this is business agility, fueled by secure, self-service access to enterprise and market data. Mok Choe and Paul Barth detail the fundamentals for success in this transformation, which started with rapid consolidation of hundreds of data sources onto a Hadoop enterprise data provisioning platform.
5:10pm-5:50pm (40m) Sponsored
Remedying the accounts receivable reporting gap for a large multinational imaging and electronics company using a Hadoop-based open source platform
Sudipto Dasgupta (Infosys Limited), Ganesan Pandurangan (Infosys Limited)
Sudipto Dasgupta and Ganesan Pandurangan offer a case study of a large multinational imaging and electronics company that migrated accounts receivable reports to the Hadoop-based open source Infosys Information Platform, which implemented dynamic age bucketing capabilities and reduced the number of end-user views from over 400 to 50.
11:00am-11:40am (40m) Sponsored
Building a scalable data science platform with R
Mario Inchiosa (Microsoft), Roni Burd (Microsoft)
Hadoop is famously scalable, as is cloud computing. R, the thriving and extensible open source data science software...not so much. Mario Inchiosa and Roni Burd outline how to seamlessly combine Hadoop, cloud computing, and R to create a scalable data science platform that lets you explore, transform, model, and score data at any scale from the comfort of your favorite R environment.
11:50am-12:30pm (40m) Sponsored
The Internet of Things: How to do it. Seriously!
Chris Rawles (Pivotal)
The Internet of Things (IoT) continues to provide value and hold promise for both the consumer and enterprise alike. To succeed, an IoT project must concern itself with how to ingest data, build actionable models, and react in real time. Chris Rawles describes approaches to addressing these concerns through a deep dive into an interactive demo centered around classification of human activities.
1:50pm-2:30pm (40m) Sponsored
How GE created a pervasive culture of data-driven insights at scale
Don Perigo (GE Power)
Applying big data to an internal business use case is challenging and requires expertise and focus. Even harder is scaling it out across a global enterprise. Don Perigo explains how GE Power Services has been able to deliver results in an uncertain world by leveraging big data and scaling its platform across a global employee base that spans over 25 countries.
2:40pm-3:20pm (40m) Sponsored
From X-ray to MRI: New insights on data about data
Dave Wells (Paxata), Nenshad Bardoliwalla (Paxata), Travis Ringger (PwC), Conrad Mulcahy (K2 Intelligence)
In a conversation moderated by Nenshad Bardoliwalla, analytic leaders Conrad Mulcahy, Travis Ringger, and Dave Wells share real-world data-preparation challenges and discuss new technologies, including Spark-powered machine learning, latent semantic indexing, statistical pattern recognition, and text analytics techniques, that accelerate the ability to transform data into usable information.
4:20pm-5:00pm (40m) Sponsored
What it takes to develop enterprise-grade Hadoop SQL Analytics
Bob Hansen (HPE)
Bob Hansen outlines the latest innovations from HPE for SQL on Hadoop.
5:10pm-5:50pm (40m) Sponsored
Moving beyond the enterprise: Data sharing as the next big idea
Sandy Steier (1010data), Dennis Gleeson (1010data)
Sandy Steier and Dennis Gleeson explain how the promise of easy data sharing and collaborative analysis—on petabyte-scale data—can fundamentally change business culture in the same way that the Internet has changed our consumer culture.
11:00am-11:40am (40m) Sponsored
Dash forward: From descriptive to predictive analytics with Apache Spark + end-user feature with Kellogg's JR Cahill
Eric Frenkiel (MemSQL), JR Cahill (Kellogg)
To win in the on-demand economy, businesses must embrace real-time analytics. Eric Frenkiel demos an enterprise approach to data solutions for predictive analytics. Eric is joined by JR Cahill, who outlines Kellogg's approach to advanced analytics with MemSQL, including moving from overnight to intraday analytics and integrating directly with business intelligence tools like Tableau.
11:50am-12:30pm (40m) Sponsored
How data science and spend analytics found $100 million+ in savings
Nidhi Aggarwal (Tamr, Inc.)
Data scientists have career-making opportunities to use more diverse datasets to deliver bigger business returns. Nidhi Aggarwal demonstrates how Tamr, a machine-driven, human-guided approach to finding, integrating, and preparing data, enables new levels of insight into corporate spend over previous analytics tools—in one case identifying new savings opportunities worth more than $100M.
1:50pm-2:30pm (40m) Sponsored
Tame that beast: How to bring operations, governance, and reliability to Hadoop
Keith Manthey (Dell EMC)
Many companies have created extremely powerful Hadoop use cases with highly valuable outcomes. The diverse adoption and application of Hadoop is producing an extremely robust ecosystem. However, teams often create silos around their Hadoop deployments, forgetting some of the hard-learned lessons IT has gained over the years. Keith Manthey discusses one often overlooked area: governance.
2:40pm-3:20pm (40m) Sponsored
Self-serve analytics journey at Celtra: Snowflake, Spark, and Databricks
Grega Kespret (Celtra Inc.)
Celtra provides a platform for customers like Porsche and Fox to create, track, and analyze digital display advertising. Celtra's platform processes billions of ad events daily to give analysts fast and easy access to reports and ad hoc analytics. Grega Kešpret outlines Celtra’s data-pipeline challenges and explains how it solved them by combining Snowflake's cloud data warehouse with Spark.
4:20pm-5:00pm (40m) Sponsored
A survival guide for machine learning: Top 10 tips from a battle-tested solution
Patrick Hall (SAS), Paul Kent (SAS)
Although it’s been around for decades, machine learning is currently thriving, and organizations are looking to benefit from it. Patrick Hall and Paul Kent offer 10 crucial tips to know before venturing into the mix—a personal survival guide from the creators of a solution that was there in the beginning and continues to drive the industry today.
5:10pm-5:50pm (40m) Sponsored
Building a scalable architecture for processing streaming data on AWS
Siva Raghupathy (Amazon Web Services), Manjeet Chayel (Amazon Web Services)
Analyzing real-time streams of data is becoming increasingly important to remain competitive. Siva Raghupathy and Manjeet Chayel guide attendees through some of the proven architectures for processing streaming data using a combination of cloud and open source tools such as Apache Spark. Watch a live demo and learn how you can easily scale your applications with Amazon Web Services.
8:45am-8:55am (10m)
Wednesday keynote welcome
Roger Magoulas (O'Reilly Media), Doug Cutting (Cloudera), Alistair Croll (Solve For Interesting)
Strata + Hadoop World program chairs Roger Magoulas, Doug Cutting, and Alistair Croll welcome you to the first day of keynotes.
8:55am-9:10am (15m)
Apache Hadoop at 10
Doug Cutting (Cloudera)
2016 marks the 10th anniversary of Apache Hadoop. This birthday provides us an opportunity to celebrate, as well as the chance to reflect on how we got here and where we are going.
9:10am-9:15am (5m)
Driving the on-demand economy with predictive analytics
Eric Frenkiel (MemSQL)
The next evolution in the on-demand economy is in predictive analytics fueled by live streams of data—in effect knowing what customers want before they do. Eric Frenkiel explains how a real-time trinity of technologies—Kafka, Spark, and MemSQL—is enabling Uber and others to power their own revolutions with predictive apps and analytics.
9:15am-9:25am (10m)
Machine learning for human rights advocacy: Big benefits, serious consequences
Megan Price (Human Rights Data Analysis Group)
Megan Price demonstrates how machine-learning methods help us determine what we know, and what we don't, about the ongoing conflict in Syria. Megan then explains why these methods can be crucial to better understand patterns of violence, enabling better policy decisions, resource allocation, and ultimately, accountability and justice.
9:25am-9:35am (10m) Sponsored
Let's get real: Acting on data in real time
Jack Norris (MapR Technologies)
Big data is not limited to reporting and analysis; increasingly, companies are differentiating themselves by acting on data in real time. But what does "real time" really mean? Jack Norris discusses the challenges of coordinating data flows, analysis, and integration at scale to truly impact business as it happens.
9:35am-9:40am (5m) Sponsored
Delivering information in context
Ian Andrews (Pivotal)
Pivotal’s Ian Andrews explores why delivering information in context is the key to competitive differentiation in the digital economy.
9:40am-9:55am (15m)
Using commerce data to fuel innovation
Bruce Andrews (US Department of Commerce)
US Deputy Secretary of Commerce Bruce Andrews explores using commerce data to fuel innovation.
9:55am-10:10am (15m)
Summoning the demon: My perspective from the belly of the beast of AI
Jana Eggers (Nara Logics)
We hear about AI almost every day now. Opinions seem split between impending doom and "superintelligence will save the human race." Jana Eggers offers the real deal on AI, explaining what's hype and what isn't and what we can do about it.
10:10am-10:25am (15m)
Using computer vision to understand big visual data
Alyosha Efros (UC Berkeley)
Alyosha Efros discusses using computer vision to understand big visual data.
10:30am-11:00am (30m)
Break: Morning Break sponsored by MemSQL
12:30pm-1:50pm (1h 20m) Event
Wednesday BoF Tables
Birds of a Feather (BoF) discussions are a great way to informally network with people in similar industries or interested in the same topics.
3:20pm-4:20pm (1h)
Break: Afternoon Break sponsored by Pivotal
5:50pm-6:50pm (1h) Event
Booth Crawl
Quench your thirst with vendor-hosted libations and snacks while you check out all the exhibitors in the Expo Hall.
6:30am-7:30am (1h) Event
Data Dash
Please join Cloudera and O'Reilly Media for the Data Dash run/walk, held in conjunction with Strata + Hadoop World in San Jose.
7:00pm-9:00pm (2h) Event
Data After Dark: Cirque Celebration
Join us as we commemorate Hadoop's 10th birthday with a Cirque Celebration! Experience an evening under the big top with incredible food, drinks, entertainment, and networking.
12:30pm-1:50pm (1h 20m) Event
Women in Big Data Forum Meetup
If you’re looking for a diverse, tech-minded community to join, come to the Women in Big Data Forum Meetup on Wednesday during lunch to meet other women (and men) interested in supporting diversity in the technology community.
7:30am-8:45am (1h 15m)
Break: Coffee Break