Presented By O'Reilly and Cloudera
Make Data Work
Feb 17–20, 2015 • San Jose, CA

All Trainings, Tutorials & Sessions

Confirmed tutorials and sessions for Strata + Hadoop World are listed below. New sessions are being added regularly—check back to see the latest updates. A detailed day-by-day schedule will be available soon.

Martin Waterhouse (Chevron)
Efficiency, cost effectiveness, organizational capability, corporate standards, risk aversion, shareholder returns, innovation, and talent management are stated essential ingredients for any large enterprise. When it comes to today's challenge of obtaining, engaging, developing, and retaining dynamic technical talent, there are few LESS appealing places to seek employment.
Andy Palmer (Tamr, Inc.)
As IT and big data/analytics investments increase, so do data silos. To get full value from these investments, businesses must embrace the variety of data silos - now. Current top-down approaches are tapped out. Innovations in data unification can overcome silos virtually, delivering 360-degree views of customers, long-tail opportunities in supply chains, and other business opportunities.
In a landmark partnership, IBM and Twitter are combining advances in analytics, cloud, and cognitive computing in a manner that has the potential to transform how institutions understand customers, markets, and trends. Adam Kocoloski, CTO of IBM Cloud Data Services and co-founder of Cloudant, will explain how, when it comes to gaining insights from Big Data, the future is brighter than we know.
Moderated by:
Roman Shaposhnik (Pivotal Inc.)
In the wake of the Open Data Platform initiative announced earlier this week, Roman Shaposhnik, Director of Open Source strategy at Pivotal and a VP of Apache Software Foundation Incubator, will talk about how a well-defined, fully validated ODP common core platform is going to address some of the biggest customer pain points around rapid evolution and standardization in the big data area.
Fei-Fei Li (Stanford University)
In this talk, I will give an overview of what computer vision technology is about and its brief history. I will particularly emphasize what we call the “three pillars” of AI in our quest for visual intelligence: data, learning and knowledge.
Fintan Quill (Kx Systems Inc.), Doug Talbott (Bedarra Research Labs)
One of the first industries to invest heavily in Big Data analytics was financial services, where firms have been pushing the boundaries on speed and scale in dynamically processing large volumes of structured market data for the past twenty years to gain competitive advantage. . .
Alan Gates (Hortonworks)
Slides:   external link
Starting in Hive 0.14, insert values, update, and delete have been added to Hive SQL. In addition, ACID compliant transactions have been added so that users get a consistent view of data while reading and writing. This talk will cover the intended use cases, architecture, and performance of insert, update, and delete in Hive.
Located on the Concourse level.
Adam Silberstein (Trifacta), Joe Hellerstein (UC Berkeley)
Leveraging a dataset’s summary or data profile to inform the analysis process isn't a new concept, but in the changing data landscape this process needs to be rethought to handle the different shapes and sizes of big data. Trifacta's Joe Hellerstein and Adam Silberstein discuss new approaches to data profiling specifically designed for quickly understanding the content & quality of modern datasets.
Rosie Atkins (Groupon)
Slides:   1-PDF 
30% of restaurants fail in the first year, so why would anyone go into the business? Most restaurateurs will tell you that it’s an act of love. They love hospitality; they love sharing great food; they love creating a place where people come together to share something special. Almost none of them tell you that they go into business based on data.
Ian Eslick (VitalLabs)
Capturing and integrating device-based and other health data for research is frustratingly difficult. We explain the open source technology framework we developed, working with O'Reilly and the Robert Wood Johnson Foundation, for capturing and routing device-based health data for use by healthcare providers and for access by researchers via a Trusted Analytic Container.
Kathleen Ting (Cloudera), Philip Zeyliger (Cloudera), Philip Langdale (Cloudera), Miklos Christine (Databricks)
Slides:   1-PDF    2-PDF    3-PDF 
Hadoop is emerging as the standard for big data processing & analytics. However, as usage of Hadoop clusters grows, so do the demands of managing and monitoring these systems. In this tutorial, attendees will get an overview of all phases for successfully managing Hadoop clusters, with an emphasis on production systems.
Sameer Farooqui (Databricks), Jesse Anderson (Smoking Hand)
This three-day curriculum features advanced lectures and hands-on technical exercises for advanced Spark usage in data exploration, analysis, and building Big Data applications. Course materials emphasize architectural design patterns and best practices for leveraging Spark in the context of other popular, complementary frameworks for building and managing Enterprise data workflows.
Chris Neumann (500 Startups)
Creating data analytics solutions for the cloud requires a new way of thinking about data architectures. Users expect to combine data seamlessly across services while IT demands that new tools leverage existing investments in security and administration. This talk will discuss the challenges of architecting for the cloud and present real-world case studies of the benefits of these architectures.
Tye Rattenbury (Trifacta), Jeffrey Heer (Trifacta | University of Washington)
The ability of software to recognize patterns in usage, data, or other inputs to improve a user’s experience & productivity is an expected attribute of modern software. Trifacta’s Jeffrey Heer and Tye Rattenbury discuss design and software architecture principles for creating intelligent software that incorporates learning to make the process of transforming data more intuitive and efficient.
Mark Grover (Lyft), Jonathan Seidman (Cloudera), Gwen Shapira (Confluent), Ted Malaska (Capital One)
Slides:   1-PDF    external link
Are you looking for a deeper understanding of how to integrate components in the Apache Hadoop ecosystem to implement data management and processing solutions? Then this tutorial is for you. We'll provide a clickstream analytics example illustrating how to architect solutions with Apache Hadoop along with providing best practices and recommendations for using Hadoop and related tools.
Irina Borisova (Chegg), Asim Mathur (eBay)
Slides:   1-PPTX 
In this talk we address the following aspects of machine translation development at eBay:
- leveraging huge amounts of transactional and behavioral data for development and evaluation of our MT systems;
- adapting evaluation metrics to reflect the eBay buyer experience and measuring translation quality and impact on the shopping experience of our international users.
Tara Sainath (Google)
DNNs were first explored for acoustic modeling, where numerous research labs demonstrated relative improvements in WER of between 10% and 40%. In this talk, I will provide an overview of the latest improvements in deep learning across various research labs since that initial work.
Slides:   1-PDF 
In far too many organizations, data scientists and designers work in silos, and quibble about who’s more important. This is a huge missed opportunity. At Intuit, we are reimagining how our data and design teams work together to fuel innovation and surpass Intuit’s business goals. I will walk through methods we are using to bridge these two wildly different groups and share stories of success.
Clint Sharp (Splunk)
In this session you will hear from big data expert Clint Sharp, who draws on real-world experience, about the architectural patterns and platform integrations used to solve real business problems with data.
Kurt Brown (Netflix)
Slides:   external link
The Netflix Data Platform is a constantly evolving, large scale infrastructure running in the (AWS) cloud. We are especially focused on performance and ease of use, with initiatives including Presto integration, Spark, and our Big Data Portal and API. This talk will dive into the various technologies we use, the motivations behind our approach, and the business benefits we get.
Jonathan King (CenturyLink)
Our modern world is one where virtually everything is public by default, making the very notion of privacy radically different from that of the “private by default” era when the concept was first enshrined in law. This session will explore what we can do with the exploding volume of our personal data alongside the increasingly important question of what we should be doing with this data.
Reena Tiwari (Cisco Systems Inc.)
Slides:   1-PPTX 
In many organizations, Marketing may be the most impacted by the advent of big data with new data on prospects and customers. New channels, new data types and sources, and new technologies … how did Cisco bring these all together to see a different view of customers?
Eden Medina (Indiana University, Bloomington)
We are often told that the past holds lessons on how to approach the present, but we rarely look to older technologies for inspiration. Rarer still do we look at the historical experiences of less industrialized nations to teach us about the technological problems of today.
R has emerged as the language of data science. In this session, IBM will discuss and demonstrate Big R, a comprehensive set of capabilities that provides end-to-end integration with open source R, transparent execution on Hadoop, and seamless access to machine learning algorithms based on SystemML. Learn also about how Big R and Spark can be used with new geo-spatial and text analytic tooling.
Ellen Friedman (Independent)
Slides:   1-PPTX 
Big data stories reveal fundamental concepts about emerging technologies, their potential impact on society and decisions that drive successful projects. Using real world examples, this talk shows key insights that inform critical choices about new technologies, including time series database tools and scalable machine learning algorithms, used to address important business and research problems.
Dorman Bazzell (Capgemini), Goutham Belliappa (Capgemini), David Freeman (Pentaho)
Tasked with improving engagement and data integrity with emphasis on a self-serve framework, Sears Hometown and Outlet (SHO) forged ahead along their journey in Big Data. With the help of Pentaho and CapGemini, SHO has transitioned from costly and rigid legacy systems to a dynamic, company owned/managed system. . .
Quench your thirst with vendor-hosted libations and snacks while you check out all the cool stuff in the Expo Hall.
Michael Brown (comScore, Inc.)
Bots don't drink soda, so advertisers don’t want to advertise to them. Accurately counting real people is critical in the digital ad industry. This session will show how comScore uses over 1.5 trillion events of data to separate real people from bots. I’ll describe how we use correlations at scale, heuristic classification, and multi-source anomaly detection to make decisions in real time.
George Corugedo (RedPoint Global)
Learn how Hadoop 2.0 and its YARN architecture can make a serious impact on the previously intractable problem of data quality and serve as a super-charged marshaling area for accessing, cleansing, and delivering high-quality data.
Eric Frenkiel (MemSQL)
Slides:   1-PPTX 
This session will cover approaches to building real-time pipelines with MemSQL, Hadoop, and Spark, including:
- How Novus built the premier financial portfolio management platform using MemSQL as a real-time data store and query engine
- Introduction to the MemSQL Spark connector
- Strategies for integrating Spark and Hadoop with real-time systems for transaction processing and operational analytics
Manu Mukerji (8x8), John Akred (Silicon Valley Data Science), Stephen O'Sullivan (Data Whisperers)
What are the essential components of a data platform? This tutorial will explain how the various parts of the Hadoop and big data ecosystem fit together in production to create a data platform supporting batch, interactive and realtime analytical workloads.
Anant Jhingran (Apigee)
In this session, Apigee VP of products Anant Jhingran will discuss how the combination of APIs and data is leading to the next generation of adaptive apps.
Tom White (Cloudera), Joey Echeverria (Rocana), Ryan Blue (Cloudera)
Slides:   1-PDF    2-PDF 
In the second (afternoon) half of the Architecture Day tutorial, attendees will apply the best practices they learned in the morning session to build a data application for sessionizing user data.
Fangjin Yang (Imply), Vadim Ogievetsky (Imply)
Slides:   1-PDF 
The maturation of big data technologies has enabled numerous organizations to derive insights from vast quantities of data. The next set of challenges we face involves building applications that allow us to visualize, navigate, and interpret this data. Creating intuitive user interfaces is often a cumbersome process requiring complex data transformations, integrations, and queries.
Jonathan Dinu (Zipfian Academy)
Slides:   external link,   2-ZIP 
The best insight you produce is only as good as your ability to explain it. As data scientists and engineers, our task is not only to execute robust analyses, but also to convince decision-makers to act on data. Through an example-driven approach, attendees will examine features of great graphics, techniques of effective visualization, and learn to use D3.js to create their own data narrative.
Jon Bock (Snowflake)
Tens of thousands of simultaneous game players generate lots of data. For online game-maker Kixeye, that data provides insights that drive decisions about game play and monetization. In this session Kixeye and Snowflake will discuss Snowflake’s data warehouse cloud service and how Kixeye uses it to get data insight with the performance, elasticity, and flexibility made possible by the cloud.
Jeffrey Heer (Trifacta | University of Washington)
Keynote with Jeffrey Heer, Co-Founder, Trifacta
Eric Frenkiel (MemSQL)
MemSQL CEO Eric Frenkiel will discuss the need for simplicity in enterprise data architecture, the convergence of transactions and analytics, and what is required to operationalize Spark and Hadoop in the enterprise.
Joseph Sirosh (Compass)
Armed with just a browser, data scientists can develop sophisticated machine learning models, and deploy them in a few clicks in cloud-hosted APIs that can be called from any device. The APIs scale elastically to power high volume intelligent apps for phones, websites and the internet of things. . .
Joseph Sirosh (Compass)
Join Microsoft’s Joseph Sirosh for a surprising conversation about a farmer's dilemma, a professor's ingenuity and how cloud, data and devices came together to fundamentally re-imagine an age old way of doing business.
Matt Ingenthron (Couchbase, Inc.), Justin Michaels (Couchbase), Michael Kehoe (LinkedIn)
Justin Michaels of Couchbase will provide an overview of the use case and review how this is handled within Couchbase while providing real-time access to user data. Matt Ingenthron of Couchbase will talk about key features of the underlying components to enable processing at the scale required by deployments such as AT&T and PayPal.
Nitesh Ambastha (Credit Suisse), David Brewster (Paxata), Nenshad Bardoliwalla (Paxata)
In this lively, technical session, Nitesh Ambastha, Global Head of Data IT, Private Banking & Wealth Management Products at Credit Suisse talks about what his organization demands from vendors who sell data preparation, data quality and governance technologies.
Ann Johnson (Interana)
People want data, they really do. Given the choice between knowing what's going on and not knowing, almost anyone will choose knowing. People will not, however, choose things that take a long time and are hard to understand.
Eric Colson (Stitch Fix)
Slides:   1-PDF 
Even the most data-driven organizations still incorporate “art” into their decision-making process. Values, culture, social norms, and biases influence decisions as much as the data. This isn’t always a bad thing—data can sometimes fail to tell the whole story. And, by combining data with the intellectual assets that reside in the heads of employees we can create new capabilities.
Join Data After Dark on our World Tour! Celebrate the global reach of Strata + Hadoop World as we pay homage to some of big data’s biggest markets.
When building a real-time data application, we must decide what tradeoffs are permissible without eroding core functionality. As the purposes of data applications become more complex and the data stores analyzed grow larger, maintaining integrity and speed becomes increasingly difficult.
Please join Cloudera and O'Reilly Media for the Data Dash run / walk, held in conjunction with Strata + Hadoop World San Jose 2015.
Greg Goldsmith (Attivio)
In this session Greg will share his insights on this gap from his years of experience in the visual exploratory data discovery & advanced analytics space working with customers and most of the major players in the Big and small data management ecosystem.
Sumeet Singh (Yahoo), Thiruvel Thirumoolan (Yahoo!, Inc.)
Hadoop has allowed us to move towards a unified source of truth for all of an organization's data. Managing data location, schema knowledge and evolution, fine-grained access control based on business rules, and audit and compliance needs will become critical with increasing scale of operations. In this talk, we will share an approach to tackling these challenges with a data discovery tool.
Emi Nomura (Jawbone)
Using case studies from the Jawbone UP activity tracker, we’ll discuss why data products are key to shaping the future of wearables as well as their implications for business and public health.
David Freeman (LinkedIn)
LinkedIn's Security Data Science team is tasked with detecting bad activity on the LinkedIn site and building proactive solutions to keep it from happening in the first place. In this talk we'll explore various types of abuse we see at LinkedIn and discuss some of the solutions we've built to defend against it.
DJ Patil (White House Office of Science and Technology Policy)
Data Science, where are we going? What impact can we expect?
Laura Fennell (Intuit), Bill Loconzolo (Intuit)
Slides:   1-PPTX 
When your company stores some of the most sensitive customer data that exists, how do you build game changing big data innovations while maintaining customer trust and loyalty? Combine the two groups responsible for that vision--legal and data science--and unite them toward a common goal! We'll discuss how Intuit turned the typical data-legal model on its head to boost data-driven innovation.
Ajit Gaddam (VISA)
Slides:   1-PDF 
Vendors and pundits suggest plug-n-play options for Hadoop security - do this and in <20 mins, your petabytes of data are now secure. What happens when PowerPoint approaches fail in a real-world enterprise deployment? In this session, we will review techniques that worked, controls that completely failed, and business processes we had to stand up.
The Data Visualization Lounge at Strata + Hadoop World highlights particularly successful visualizations that provide new insight and relevance to big data across a broad range of topics.
Julia Rodriguez (Eagle Investment Systems)
Slides:   1-ZIP 
Designing data visualizations presents us with unique and interesting challenges: how to tell a compelling story; how to deliver important information in a forthright, clear format; and how to make visualizations beautiful and engaging. In this talk, Julie will share a few disruptive designs and connect those back to vizipedia, her compiled data visualization library.
Alistair Croll (Solve For Interesting), Cait O'Riordan (Financial Times), Lutz Finger (Google), Kuang Chen (Captricity), Emi Nomura (Jawbone), AJ Loiacono (Truveris), Rosie Atkins (Groupon), Anne Johnson (Credit Suisse), Jerry Overton (DXC), Ann Johnson (Interana), Mark Madsen (Teradata), Leah Hunter (Tech Journalist), Ellen Friedman (Independent), India Swearingen (United Way of the Bay Area), Satyam Priyadarshy (Halliburton), Joerg Blumtritt (Datarella)
All-Day: For business strategists, marketers, product managers, and entrepreneurs, Data-Driven Business looks at how to use data to make better business decisions faster. Packed with case studies, panels, and eye-opening presentations, this fast-paced day focuses on how to solve today's thorniest business problems with Big Data. It's the missing MBA for a data-driven, always-on business world.
Poppy Crum (Dolby Laboratories | Stanford University)
Our experience of the sensory world does not need to be constrained by our physical limitations. When navigating the environment our senses interact to perceive a robust non-veridical experience. Understanding these interactions and being able to define them perceptually and algorithmically allows technological developments that can facilitate sensory enhancement and optimization.
Eddie Garcia (Cloudera)
Open data is quickly gaining momentum, and when applied as data for good, it becomes a much more powerful concept that we should all consider as good data stewards. Everyone from organizations to cities is starting to share data such as traffic conditions or climate sensor readings, allowing others to use this open data to improve quality of life.
Vijay Subramanian (Rent the Runway)
At Rent the Runway, we have focused on using data to make decisions since day 1. But, the best manifestation is driving the strategy and building products using data, which has been critical to our growth. This talk will share examples that illustrate this, and how data is an unlikely hero behind the scenes of successfully renting sparkly designer dresses.
Douglas Turnbull (OpenSource Connections)
Today we've got NoSQL. But relational databases were the noSomething. What was that something? Why and where did relational databases come from? Then why, years later, are we seemingly focused on rejecting the lessons that led us to relational databases? This talk covers lessons from the past that help strike a balance between the dueling promises of SQL and NoSQL.
Sheetal Dolas (Hortonworks)
Slides:   1-PPTX 
Businesses are moving from large-scale batch data analysis to large-scale real-time data analysis. Apache Storm has emerged as one of the most popular platforms for the purpose. This talk covers proven design patterns for real-time stream processing, patterns that have been vetted in large-scale production deployments that process 10s of billions of events/day and 10s of terabytes of data/day.
Dustin Clute (Cloudera), Michael Judd (Cloudera)
Cloudera University’s four-day course for designing and building Big Data applications prepares you to analyze and solve real-world problems using Apache Hadoop and associated tools in the enterprise data hub.
Gwen Shapira (Confluent)
Organizations do not store, process and analyze data for their amusement. They plan to use the data to drive business decisions. If data validity is uncertain, the data is useless for decision making. In this session we will show how to design architectures that allow us to prove and improve data validity at every step of the decision making process.
Lutz Finger (Google)
Slides:   1-PPTX 
Data is changing our world. Predictions using massive data have not only improved many products; in some industries, they have also disrupted business models and created new ones. What does an organization need to do to generate a new competitive advantage out of data?
Alonzo Canada (Interana)
Slides:   1-PDF 
Data products are poised to go mainstream, but only if they are designed well. Most data products are designed by developers for developers. This talk discusses methods from Stanford's D.School used by companies like Yahoo!, Samsung, and Audi to design break-out products. These principles can help developers get beyond technology and design products for everyday users.
Etan Lightstone (New Relic)
As Director of UX Design at software analytics company New Relic, my core focus is trying to present the over 200 billion data points across more than three million applications we monitor in a way that provides meaning so customers can make good decisions for their business. Today I’ll share some of what I’ve learned along the way.
Arianna McClain (IDEO), Coe Leta Stafford (IDEO), Kevin Ho (IDEO)
IDEO's Hybrid team brings all the design tools from IDEO's product design process to work with clients on data oriented projects. The team will share elements of their process and case studies to show how incorporating human-centered techniques from design can improve data as an input to decision making.
Get certified as a Spark Developer at Strata + Hadoop World in San Jose.
Prith Banerjee (Schneider Electric)
Slides:   1-PPTX 
Dr. Prith Banerjee, Managing Director of Global Technology Research and Development, Accenture, will present the Accenture Tech Vision 2015 and discuss how organizations are driving value from big data.
Mark Madsen (Teradata)
Slides:   1-PDF 
Storytelling is not about raising someone’s IQ, it’s about raising their blood pressure. Stories engage emotions rather than intellect, making “storytelling with data” a poor metaphor for data visualization when our goal is to communicate clearly.
Jacques Nadeau (Dremio)
I will talk about how Drill achieves high performance with flexibility and ease of use. Topics include: first read planning and statistics; flexible code generation depending on workload; code optimization and planning techniques; dynamic schema subsets; advanced memory use and moving between Java and C; and making static typing appear dynamic through any-time and multi-phase planning.
Ryan Michaluk (Allstate), Alexander Gray (Skytree, Inc.)
Slides:   1-PPTX 
Allstate’s foundation is data. We extract value from our data by applying machine learning to make data-driven decisions. In this session, we discuss Allstate’s drive for better business results by using machine learning on Hadoop.
Chris Re (Stanford University | Apple)
We describe how DeepDive is being used in a range of tasks from diagnosing rare diseases to drug purposing to filling out the tree of life. DeepDive helps to create knowledge bases that meet--and sometimes even exceed--human-level quality and to perform predictive analytics on top of this data.
Kirk Borne (George Mason University)
Slides:   1-PPT 
I will introduce USA’s next big astronomy project (LSST) and describe how this telescope requires massive data stream analytics – to discover and respond to exotic rapidly changing events in the Universe. I will discuss parallels between big data astronomy and Decision Science-as-a-Service for Business, Cybersecurity Information and Event Management, and Marketing Automation using Hadoop.
June Andrews (Wise / GE Digital)
With LinkedIn's wealth of data we can answer questions previously limited by human resources. We can ask: which industries have the most ties with health care? How do you meet Richard Branson? More seriously, what types of connections are used to find jobs? To answer these questions, we weave the algorithmic complexities and data harvesting into stories that enrich our understanding of the answers.
Vida Ha (Databricks), Holden Karau (Independent)
Slides:   external link
Writing efficient Spark programs requires a deeper understanding of Spark internals. In this talk, we present practical tips for writing better Spark programs for the beginner or intermediate Spark programmer.
Josh Byrd (GoPro), Darren Chinen (GoPro)
In this session, GoPro discusses their process for transforming the extreme volume and variety of datasets landing in GoPro’s data lake into usable formats for analysis tools or predictive modeling. Doing so demands improving their ability to overcome the current technical and human bottlenecks that typically limit the productivity of these efforts.
Moderated by:
Arnab Chakraborty (Accenture)
Alexander Prinz (Lufthansa Airlines), Reena Tiwari (Cisco Systems Inc.)
Slides:   1-PPTX 
This panel discussion will focus on how organizations can find value, equity and business opportunities in their data supply chain. The modern enterprise data supply chain allows organizations to move, manage and mobilize an ever-increasing amount of data across the organization for consumption by people and things.
Eamonn Keogh (University of California - Riverside)
In this talk I will argue that, relative to other types of data (text, social networks, etc.), time series data is underexploited, and that many opportunities are available for novel commercial applications and scientific discoveries.
Jeremy Heffner (Azavea)
We often face the need to analyze counts of discrete events that occur at a specific time and place, whether they be crime events, taxi requests, or phone calls. Forecasting these space-time events brings particular challenges: finding suitable tools for geographic processing and techniques for modeling the data. The session will cover the lessons learned in building such a system.
Matt Asay (AWS)
Silicon Valley may be the center of Big Data technology production, but its application is having a far bigger impact on old-school industries like agriculture and brick-and-mortar retailing. This session will detail some of the world's most innovative applications from some of the world's oldest organizations.
Marcel Kornacker (Cloudera)
In this talk, attendees will learn about Impala’s approach to on-the-fly, automatic data transformation, which, in conjunction with the ability to handle nested structures such as JSON and XML documents, addresses the needs of at-source analytics, including direct querying of your input schema, immediate querying of data as it lands in HDFS, and high performance on par with specialized engines.
Alistair Croll (Solve For Interesting), Doug Cutting (Cloudera), Roger Magoulas (O'Reilly Media)
Program Chairs, Roger Magoulas, Doug Cutting, and Alistair Croll, welcome you to the second day of Strata + Hadoop World keynotes.
Birds of a Feather (BoF) sessions are informal roundtable discussions happening during lunch on Thursday, February 19 and Friday, February 20. You can join any BoF table or start your own with a topic of your choice. The BoF sign-up board will be near the Registration area.
Author book signings will be held in the O’Reilly booth on Wednesday, Thursday, and Friday. This is a great opportunity for you to meet O’Reilly authors and to get a free copy of their book. Complimentary copies will be provided for the first 25 attendees. Limit one free book per attendee.
Office Hours are your chance to meet face-to-face with Strata + Hadoop World presenters in a small-group setting. Drop in to discuss their sessions, ask questions, or make suggestions.
Jake Klamka (Insight), Kathy Copic (Insight Data Science)
Scientists make the best data scientists. Yet there is a skills gap between quantitative data analysis done in a research context and data science in industry. The Data Science Fellows Program has helped over 150 PhDs make the transition. In this session, its founder will share lessons learned in bridging that gap, and the lessons that can be applied to building data science teams.
Vin Sharma (Intel), Jason (Jinquan) Dai (Intel)
Join this session to hear about lessons learnt in building these domain-specific solutions, Intel’s reference architecture for data science and analytics services deployment in the cloud, and the new Intel initiative to advance the state of the art in big data analytics on Hadoop and Spark.
Eric Schmidt (Google)
MapReduce, MillWheel and other technologies changed the way data scientists approached data problems. New technologies like Spark and Cloud Dataflow deal with the complexity of stringing together MapReduce jobs and creating end-to-end programming logic from multiple steps by making Big Data into a concrete set of executable operations. Gain insights into programming options and what comes next.
Bill Schmarzo (EMC Consulting)
CIOs and business executives alike are looking for ways to mine the potential value of their customer, product and operational data as they consider where and how to start their Big Data journey. What are the organizational ramifications of big data? How can CIOs foster a culture of data-driven decision-making? How can the data lake support an organization’s business transformation efforts?
Eric Sammer (Rocana)
While we frequently talk about how to build interesting products on top of machine and event data, the reality is that collecting, organizing, providing access to, and managing this data is where most people get stuck. In this session, we’ll describe one such system, in detail, handling terabytes an hour of event-oriented data, providing real time streaming, search, and SQL access to data.
Explore several different approaches taken by organizations embarking on a data governance journey to meet their own unique business objectives. Review best practices and lessons learned.
John Russell (Cloudera), Alan Choi (Cloudera)
Impala is the massively parallel analytic database delivering interactive performance on Hadoop. In this half-day tutorial, we'll walk you through hands-on exercises, taking you from zero to up and running with Impala.
Jagane Sundar (WANdisco)
Hadoop is now widely used to support mission-critical applications that operate within a ‘data lake’ infrastructure, but how can it overcome complete data center failures to guarantee continuous operation? In this session, we lay out the blueprint for a multi-data center Hadoop that solves the storage and compute problems in operating over the WAN using a single coordinated, Paxos-based file system.
Kurt Hurtado (Elasticsearch Inc), Tal Levy (Elasticsearch)
This tutorial will provide an introduction to the individual components of the ELK stack followed by a discussion of use cases and a hands-on lab. This includes installing and configuring Elasticsearch, Logstash, and Kibana.
Jay Kreps (Confluent)
What happens if you take everything that is happening in your company--every click, every impression, every database change, every application log--and make it all available as a real-time stream of well structured data? Companies such as LinkedIn have done this experiment and this talk will describe how this changes the way data is thought about and put to use in an organization.
David Andrzejewski (Sumo Logic)
Many of the millions of events logged inside a given software system are not isolated occurrences, but rather links in richly interconnected causal chains. However, classic SQL-style aggregation cannot easily capture this underlying structure. This talk discusses how graph mining techniques can surface high-value insights from the relationships between logged events.
Moderated by:
Alistair Croll (Solve For Interesting)
Jeremy Edberg (MinOps), Jerry Overton (DXC), Tatsiana Maskalevich (Stitch Fix), Anne Johnson (Credit Suisse)
Ruthless optimization squeezes every ounce of advantage from the current business model. But it takes a leap of faith—not something the numbers tend to encourage—to truly innovate. When we’re informed by data, are we blinded by opportunity? Or does data pave the way for the best innovations, forcing us to take a harder look at bad ideas that will never work out?
Moderated by:
Mark Grover (Lyft)
Jonathan Seidman (Cloudera), Gwen Shapira (Confluent), Ted Malaska (Capital One)
Join the authors of Hadoop Application Architectures for an open Q/A session on considerations and recommendations for architecture and design of applications using Hadoop. Talk to us about your use-case and its big data architecture, or just come to listen in.
Allen Day (MapR Technologies), Sungwook Yoon (MapR)
Genomics applications like the Genome Analysis Toolkit (GATK) have long used techniques like MapReduce to parallelize I/O, but have never before run on Hadoop. We will describe what we did to build an end-to-end GATK-based genome analysis pipeline on Hadoop, show how it scaled at lower platform cost, and demonstrate the results.
Aaron Myers (Cloudera, Inc.), Daniel Templeton (Cloudera)
The Hadoop ecosystem is a vibrant and growing set of tools for taming data at massive scales. It's also less than straightforward at times. During this talk we'll take a light-hearted and interactive plunge into the dark corners of Hadoop to shine light on some of the trap doors and blind alleys one may encounter in the wild. Attendees will leave dazed, confused, and hopefully a little wiser.
Amr Awadallah (Cloudera)
As Hadoop and the surrounding projects & vendors mature, their impact on the data management sector is growing. Amr will talk about his views on how that impact will change over the next five years. How central will Hadoop be to the data center of 2020? What industries will benefit most? Which technologies are at risk of displacement or encroachment?
Ben Lorica (O'Reilly), Ben Recht (University of California, Berkeley), Chris Re (Stanford University | Apple), Maya Gupta (Google), Alyosha Efros (UC Berkeley), Eamonn Keogh (University of California - Riverside), John Myles White (Facebook), Fei-Fei Li (Stanford University), Tara Sainath (Google), Michael Jordan (UC Berkeley), Anima Anandkumar (UC Irvine), John Canny (UC Berkeley), David Andrzejewski (Sumo Logic)
All-Day: Strata's regular data science track has great talks with real world experience from leading edge speakers. But we didn't just stop there—we added the Hardcore Data Science day to give you a chance to go even deeper. The Hardcore day will add new techniques and technologies to your data science toolbox, shared by leading data science practitioners from startups, industry, consulting...
Mike Flannagan (Cisco)
Organizations are experiencing unprecedented complexity in managing their data, with the rise of Big Data, Cloud and overall hyper connectivity of our world. Cisco is building solutions to help our customers adopt Big Data solutions, solve business problems using Analytics, and harness the power of an intelligent infrastructure to provide highly differentiated Data and Analytics solutions.
Azarias Reda (Republican National Committee)
In the 2014 election cycle, the Republican National Committee spent a significant amount of resources on engineering and data science to help GOP Senate candidates across the country. As the first ever Chief Data Officer of the RNC, Azarias led this effort. In this talk, he will discuss some of the lessons learned helping the Republican Party use data and engineering to win the US Senate.
Ross Fubini (Canaan Partners), Ari Gesher (Kairos Aerospace), Wei Zheng (Trifacta), Omer Trajman (ScalingData), Sylvain Le Borgne (Havas Media)
Big Data is exiting its buzzword phase and we are seeing applications which use big data infrastructure to power everyday lives. This is a discussion from the front lines with panelists from industry and startups describing real deployed applications powered by big data, but which are happy to be hiding the elephant behind beautiful interfaces.
John Canny (UC Berkeley)
How fast can machine learning (ML) and graph algorithms be? In "roofline" design, every kernel is driven toward the limits imposed by CPU, memory, network etc. "Codesign" pairs efficient algorithms with complementary hardware. These methods can lead to dramatic improvements in single node performance: BIDMach is a toolkit for machine learning that uses rooflined design and GPUs to...
Carter Shanklin (Hortonworks), Mostafa Mokhtar (Cloudera)
This session will examine Hive performance past, present and future.
Steven Beeckman (Ministry of Defence of Belgium)
While cutting-edge startups use Spark to see their data analysed in real-time, older and bigger organisations still struggle to share their data in a structured way between their HR, finance and operations departments. This talk will discuss how the Belgian MoD opened its data using open source tools and is becoming more and more data-driven.
Bing Xiao (Huawei)
In this talk, we’ll walk through a recent implementation at one of the world’s top 5 mobile operators (a company with 300 million subscribers).
Tatsiana Maskalevich (Stitch Fix)
During the last government shutdown, on "The Daily Show with Jon Stewart," John Oliver noted that Congress has a 90% retention rate despite a 10% approval rating. Why? Gerrymandering has become a prime suspect. Is this true, or just truthy? Come find out how a state with a 51% Democrat, 49% Republican electorate enjoys a lopsided congressional delegation of 4 Democrats and 9 Republicans.
Shankar Vedaraman (Netflix), Christopher Colburn (Netflix)
In this session we will talk through the challenges of anomaly detection in high cardinality dimensions, and specifically how we derive value through a combination of data science and traditional business intelligence.
Terence Spies (Voltage Security)
This talk by Voltage Security CTO Terence Spies presents options for securing data and speeding Hadoop implementation. Attendees will leave with strategies to de-risk Hadoop implementations in multi-platform Enterprise environments.
Julien Le Dem (WeWork)
Parquet is a columnar format designed to be efficient and interoperable across the Hadoop ecosystem. Its integration in most processing frameworks and serialization models makes it easy to use in existing ETL and processing pipelines, while giving flexibility of choice on the query engine.
Sasha Laundy (Warby Parker)
Many development teams fail to set up logging properly so that when they bring in a data science team down the road, their data is missing, wrong, or lacking key fields. A quick data audit could catch many of these common mistakes, saving money, time, and insight. This talk covers the three things to check and several handy tools so you can sleep soundly, knowing your data are collecting safely.
Ari Gesher (Kairos Aerospace), James Thompson (Palantir Technologies)
From its inception, Palantir Technologies has been about integrating the best of big data technology into systems that enable subject matter experts (as opposed to data scientists or programmers) to move through huge volumes of data and do their own data analysis. Equal parts data system design and UX, we break down the design of building systems usable by mere mortals.
Join attendees, speakers, and exhibitors as we end the conference on a sweet note with some gelato.
Vinayak Borkar (X15 Software)
We will take a close look at use cases related to log data processing, listing fundamental requirements that must be satisfied by log management systems. We will look at existing products and technologies harnessed for ingesting, storing, querying, and analyzing machine data. Finally, we will attempt to construct the archetype of the ideal platform for the management of log data.
Dave Holtz (Airbnb)
In a 2013 New York Times column, Thomas Friedman claimed that “Airbnb’s real innovation is not online rentals. It’s ‘trust’.” This session will discuss recent experiments conducted at Airbnb to improve the frequency and honesty of reviews, and the methods used to evaluate changes in the quality of subsequent reviews and the impact of these changes on other key business metrics.
Anil Gadre (MapR)
To get value out of today’s big and fast data, organizations must evolve beyond traditional analytic cycles that are heavy with data transformation and schema management. . .
Oliver Mainka (SAP Labs LLC)
In Predictive Maintenance and Service, makers of assets (like automotive) or operators of assets (like mining or manufacturing) bring together machine sensor data and maintenance data to better understand when and why machines fail, and also to predict future failures and needed business activities. This presentation gives an overview of the topic and what SAP customers have done.
Oscar Celma (Pandora)
Pandora is not the “Netflix for music.” The success of Pandora lies in the unique combination of a curated music catalog of 1.5M+ tracks with a sophisticated set of machine learning models that integrates contextual user feedback from more than 250M people. This talk will unveil Pandora’s dynamic ensemble learning approach that provides a truly personalized experience for each of our listeners.
Lu Cheng (Airbnb), Lisa Qian (Airbnb)
According to industry research, only half of travelers today know exactly where they want to go before planning a trip. This session will provide a holistic view on some of our recent personalization products aimed at inspiring travel. We will start with the product vision, describe the powerful algorithms deployed, and finally explain how we evaluated the long term effects of our product.
Michael Greene (Intel)
The exponential growth of digitally stored data and the transition of data science from academia to real world applications hold the promise of improving nearly every aspect of our lives.
Lisa Hammitt (Salesforce)
Wearables contribute to Big Data and the insights are already realizing significant gains in key industries.
Maya Gupta (Google)
What makes a large machine learning system more interpretable and robust in practice? How do we take into account engineers' prior information about signals? We'll discuss the importance of monotonicity, smoothness, semantically-meaningful inputs and outputs, and designing algorithms that are easy to debug.
Xuefu Zhang (Cloudera), Chengxiang Li (Intel)
Hive is Hadoop's de facto standard SQL on big data, and Spark is gaining significant momentum as a new data processing platform beyond MapReduce. The marriage of the two will generate more waves of momentum in both communities.
Brian Granger (Cal Poly San Luis Obispo), Fernando Perez (UC Berkeley and Lawrence Berkeley National Laboratory), Kyle Kelley (Netflix)
The Jupyter/IPython Notebook is a web-based interactive computing platform for Data Science in Python, Julia, R and other languages. In this talk we will describe our recent work to bring the Notebook to larger groups of users, both on the open web and within organizations. The focus will be on new collaboration features and deploying the Notebook in these contexts.
Shawn Scully (Dato), Carlos Guestrin (Apple | University of Washington), Alice Zheng (Amazon), Chris DuBois (Dato), Yucheng Low (Dato)
This all-day, hands-on training program provides a quick start to building and deploying predictive applications at scale. You will learn simple and effective ways of building powerful machine learning models and deploying them. We will walk you through all the steps of prototyping and production: data cleaning, feature engineering, model building and evaluation, and deployment.
Reynold Xin (Databricks), Matei Zaharia (Databricks)
Spark users have been pushing the boundary of data analytics. In this talk, we focus on the scalability dimension, including multiple real-world use cases on PBs of data and on clusters with thousands of machines; configuration and performance tuning tips learned from these deployments; and changes in recent releases of Spark for better handling of these workloads.
Satyam Priyadarshy (Halliburton)
The upstream oil and gas industry collects more data than most verticals. The emergence of digital oil fields, 4D seismic studies, sensor networks, etc. will increase the ability to get ever more relevant data. However, there remain many expensive challenges during drilling operations. Big data-driven approaches for a holistic digital oil field are proving helpful and are the way of the future.
Located in 230 A - 230 C.
Learn to develop machine learning, exploratory and predictive models at scale on data stored in-memory. This hands-on course will address exploratory statistical modeling with SAS Visual Statistics, a GUI designed for rapidly screening models and segments.
Shai Fine (Intel)
We will introduce the concept of Machine Learning Building Blocks - elements that can be mapped into hardware and software primitives and patterns. We demonstrate the implication of this concept in designing some specific workloads. Finally, we look at the Workload Optimization Framework, which includes a comprehensive Machine Learning workload suite, composed of sampled & constructed workloads.
Ben Hamner (Kaggle)
The US is in an oil boom, driven by new technologies that enable the economic production of shale resources. Conventional exploration techniques don’t work well for these unconventional reserves. In this talk, Kaggle’s Chief Scientist will discuss Kaggle’s pioneering work in machine learning for oil exploration. ML for energy applications differs dramatically from consumer web applications.
Cliff Click (0xdata), Michal Malohlava (0xdata, Inc)
H2O's powerful machine learning algorithms, coupled with Spark's SQL and Scala data munging, are a potent combination for solving real-world problems.
Yuliya Feldman (Dremio Corporation)
The good news: Hadoop has a lot of tools. The bad news: Hadoop has a lot of tools, and conflicting priorities. This talk shows how advances in YARN and Mesos allow you to run multiple distinct workloads together. We show how to use SLA and latency rules along with preemption in YARN to maintain high throughput while guaranteeing latency for applications such as HBase and Drill.
Scott Donaldson (FINRA)
In 2014, FINRA developed a new system to analyze the complex linkages between orders and trades in the US equities capital markets. This session will highlight the outcomes of the big data solution that allowed FINRA’s analysts to more efficiently conduct regulatory reviews and improve accessibility to over a trillion market events, and highlight the lessons learned during the implementation.
Entirely new industries are forming as the result of business model innovations. But discovering these disruptive ideas is, still, largely a matter of trial and error. We need faster, more effective ways of testing out new business model designs.
Spencer Herath (Accenture), Aaron Benz (Accenture)
HBase can be a good solution for hierarchical time series data. And we can access the data using both R and Python. This case study is a sanitized version of a solution we brought to a client that provided real business value—without requiring significant investment or time. We show how to move to a simple, scalable NoSQL solution without alienating the scientists who work with the data.
David Dobbins (Rackspace Hosting), Chris Lalonde (ObjectRocket)
During this session, learn how you can rapidly deploy a modern data platform and watch a live demo that highlights how our easy-to-use control panels and APIs with simple bridges allow you to manage, integrate, and gain insights from your data environments in minutes.
Located on the Concourse level.
AJ Loiacono (Truveris)
With drug inflation far outpacing inflation for the rest of the economy, consumers, companies, and government entities are struggling to understand one of the primary components of health care costs. Drug inflation has always been difficult to measure, since the information is published infrequently, often annually, with a high degree of variance between the reporting organizations.
Matei Zaharia (Databricks)
As the Apache Spark userbase grows, the developer community is working to adapt it for ever-wider use cases. 2014 saw fast adoption of Spark in the enterprise and major improvements in its performance, scalability and standard libraries.
Alan Gates (Hortonworks)
Alan is a cofounder of Hortonworks. Stop by his office hours and chat with him about the use cases they targeted when designing transactions for Hive, streaming ingestion of data using Hive transactions, and using INSERT, UPDATE, and DELETE in Hive.
Anu Tewary (Intuit), Lucian Lita (Yoyo Labs), Jonathan Goldman (Intuit)
If you’re working on building an effective data team, meet with Anu, Lucian, and Jonathan. They can show you how to structure your data team to get maximum advantage, how to build metrics capabilities in your team and your company, and how to make data-driven products and train product managers to be more data-driven.
Chengxiang Li (Intel), Xuefu Zhang (Cloudera)
If you’re using (or considering) Hive and Spark, check in with Xuefu and Chengxiang. They’ll talk about the status of Hive on Spark, compare Hive on Spark vs Hive on Tez, and discuss your own technical challenges.
Chris Neumann (500 Startups)
Architecting for the cloud does have its challenges, and Chris can help you meet them. He’s ready to answer your questions on self-service cloud, including data security and permissions considerations, and challenges integrating with decentralized cloud services.
Cliff Click (0xdata)
H2O's machine learning algorithms pair well with Spark's SQL and Scala data munging. If you’re looking at this combo to help solve some of your pressing data problems, stop by and chat with Cliff. He’ll help you with machine learning with H2O and Spark, H2O Internals, and H2O and Spark integration technology.
Stop by and play with Quid! Danielle is ready to talk about the tradeoffs you’ll make when building real-time data applications, such as data visualization (data quality vs fast interactivity), normalization techniques for the optimal user experience, and ways to find specific insights and quick analysis in large data sets.
Danyel Fisher
Data visualization pro Danyel Fisher is available to discuss your biggest viz challenges. Find out how to tool for visualization, and how you can use visualization to understand and explore big data.
Dean Wampler (Anyscale)
Meet with Dean to discuss all things Spark, including its effective use. He’ll also talk about Scala for Spark, Spark streaming and SQL, and the Hadoop alternative, Spark on Mesos.
Ellen Friedman (Independent)
Discuss your large-scale data challenges with Ellen, and learn the technical and human elements in the decisions that drive successful projects. She’ll talk about the use of proven design patterns to help you streamline the planning of your big data project; the five habits of successful Hadoop project leads; and ways to build effective coordination between stakeholders in a new big data project.
Emma McGrattan (Actian)
A leading authority on DBMS technologies, Emma is ready to discuss how you can navigate around the technology limitations of Hadoop and turn it into a high-performance, fully-functional analytics engine. She’ll also take you through various SQL on Hadoop solutions, along with the maze of features, vectorization, YARN integration, and more to determine which solution will work for you.
Eric Sammer (Rocana)
Building a system for machine and event-oriented data? Visit with Eric. He’ll answer your questions about event and time series data management with open source technologies, using Hadoop, Kafka, and Spark to build modern data processing infrastructure. He’ll also discuss Hadoop and distributed systems operations in production environments.
Gary Davis (McAfee, a division of Intel Security)
As Chief Consumer Security Evangelist at McAfee, Gary is available to discuss a wide range of vital topics, including current security trends—specifically mobile security as well as the future of security and the IoT.
Jacques Nadeau (Dremio)
Jacques is available to talk to you about query optimization and execution, the SQL-on-Hadoop and Hadoop execution landscapes, and high performance distributed computing in Java. He’ll also discuss the rising role of JSON and Parquet in Hadoop, the NoSchema revolution and how it impacts data management, and in-memory databases and analytics.
Jay Kreps (Confluent)
Jay is one of the primary architects for LinkedIn and one of the original authors of several open source projects in the scalable data systems space—including Voldemort, Azkaban, Kafka, and Samza. This is a great opportunity to ask Jay about Apache Kafka, stream processing, the wild world of data ingestion, and logs and logging.
Joe Hellerstein (UC Berkeley)
This is a unique opportunity to talk to Joe one-on-one about data transformation, big data infrastructure and distributed systems, and academic research.
John Haddad (Informatica)
During office hours, John will discuss the types of organizations that generate the most revenue and profits from big data initiatives. Ask him about the organization structures, roles, skills, and interactions that make data-driven organizations successful. Learn how to gain competitive advantage by turning data assets into revenue generating data products and actionable information.
Julia Rodriguez (Eagle Investment Systems)
Tap into Julie’s expertise in user research, analysis, and design for complex systems. She can answer all of your data viz questions, including applicable methods we can use in all our visualizations. Learn what makes a data visualization successful, as well as rules to direct good data visualization principles for representing data.
Julien Le Dem (WeWork)
Thinking about using Apache Parquet as a basis for ETL and analytics? A chat with Julien can save you a lot of wasted time. He’ll answer questions about integrating Parquet in existing ETL and processing pipelines, and using Parquet with a wide variety of data analysis tools like Spark, Impala, Pig, Hive, and Cascading. He’ll also discuss the future of Parquet.
June Andrews (Wise / GE Digital), David Freeman (LinkedIn)
Spend some time with June and David, and gain interesting insights into societal questions that can be answered with big data. They’ll discuss future directions for data science as both a field and an occupation.
Kirk Borne (George Mason University)
If you’re interested in streaming data, you need to spend some time with Kirk. He’ll discuss techniques, algorithms, and approaches to discovery in streaming data, including anomaly, correlation, association, and class discovery.
Kurt Brown (Netflix)
Curious what Netflix is up to with regard to their data platform? Visit Kurt to find out. He’ll talk about big data infrastructure, his experience building a cohesive framework for easy platform interaction, and Netflix’s architecture/approach and the benefits that they (and hopefully you) can achieve.
Lutz Finger (Google)
If you want to build a new competitive advantage from data, spend a few minutes with Lutz. He’s happy to talk to you about improving and creating data products, as well as data analytics, data science, building a data team, and innovation.
Marco Di Placido (O365 Security Signals), Ram Shankar Siva Kumar (Microsoft (Azure Security Data Science))
Meet Ram and Marco for a Security A team chat. They’ll tell you how things really work in the security data science spaces. They’ll also answer questions on model user accounts vs system accounts, the reliability of frequency/histogram analysis of events, whether unsupervised learning = clustering, and where visualization fits into all of this.
Michael Brown (comScore, Inc.)
Michael helped build the largest Decision Support Systems on the Windows platform, all with the ability to capture over 220 billion rows of new data every week. Talk to him about uncovering non-obvious insights by using heuristic classification and anomaly detection. Learn how comScore can scale to process over 1.5 trillion events and overcome big data challenges to uncover key marketing insights.
Miklos Christine (Databricks), Kathleen Ting (Cloudera)
Having issues with running Spark on YARN or porting your existing MR1 jobs over to YARN? Miklos and Kathleen will help you troubleshoot things like service management, monitoring & diagnostics, and extensibility.
Noelle Saldana (Heroku), Bob Filbin (Crisis Text Line)
Want to learn more about volunteering as a data scientist? Bob and Noelle will be happy to share tips and stories about their experiences in bringing a data-driven culture and approach to non-profit organizations. They’ll discuss how to become an effective data science volunteer and how to make an impact with data science processes, not just models.
Getting started with data governance can be a challenge; everyone has a different starting point and different priorities. With her useful insight on data governance, Paula can answer questions on how to do it and help you determine which people would make good data governance champions. Come by and talk to her about your own data governance challenges.
Philip Zeyliger (Cloudera), Philip Langdale (Cloudera), Kathleen Ting (Cloudera), Miklos Christine (Databricks)
Tap into a combined 16 years of Hadoop ops experience. Stop by and ask Philip, Miklos, Kathleen, and Philip about debugging and tuning between the different layers (app, hadoop, jvm, kernel, and networking), and using tools and subsystems to keep your Hadoop clusters always up, running, and secure.
Rahul Pathak (Amazon Web Services)
If you have questions about the cloud, Rahul is your man. Ask him about AWS big data & Analytic services, including EMR (Managed Hadoop), Redshift (MPP DW), Kinesis (Streaming Ingestion), Data Pipeline (Data flow/orchestration), S3 (object storage), and EC2 (compute). He’ll also discuss the core components for data and analytics.
Ron Bodkin (Google)
Have you deployed Hadoop NextGen MapReduce (YARN), or are you planning to? Stop by and talk to Ron about using YARN in practice, choosing YARN applications (Spark, Storm, Tez, Impala etc.), and using Hadoop 2.0 architectures.
Christopher Colburn (Netflix), Shankar Vedaraman (Netflix)
Want to learn not only how to detect anomalies, but also how to make them actionable? Shankar and Chris can help. They’ll talk about the business problem for anomaly detection, the algorithm, implementation details, and how the system is used at Netflix. Learn the challenges Netflix faced while engineering the system, and other use cases at Netflix for anomaly detection.
Sumeet Singh (Yahoo), Thiruvel Thirumoolan (Yahoo!, Inc.)
Yahoo! is the largest Hadoop user and a major contributor across most areas of the Hadoop ecosystem. Ask Sumeet and Thiruvel what it’s like to work on cloud platforms and Hadoop, including the notable use cases and technology stack that puts them at the frontier of Hadoop scale. Find out where they’re headed in the areas of core Hadoop, BI and adhoc queries, and developer tools.
Terence Spies (Voltage Security)
Want to use cryptographic tools to mask data, but retain analytics functionality in Hadoop and other systems? Terence can help. Stop by and ask him about the relative security of different algorithms, and the best way to integrate them into ingestion and analytics tools. And learn how to use de-identification tools, such as masking, tokenization, and encryption.
Tom White (Cloudera), Joey Echeverria (Rocana), Ryan Blue (Cloudera)
Slides:   1-PDF 
If you have Hadoop questions, bring them to Ryan, Joey, and Tom. They’ll explain the Hadoop ecosystem, as well as how to get started with Hadoop using the Kite SDK.
Vadim Ogievetsky (Imply), Fangjin Yang (Imply)
Fangjin and Vadim are happy to discuss building scalable, interactive, data-driven applications, particularly Facet and Druid. They’ll talk about common tradeoffs in building user-facing applications, and ways to architect systems for efficiency and cost in multi-tenant environments.
Woody Christy (Cloudera)
Woody can answer important Hadoop questions, like sizing hardware for a Hadoop cluster, considerations regarding network topology, and the impact of virtualization on Hadoop.
Michael Jordan (UC Berkeley)
In this talk we show how statistical decision theory provides a mathematical point of departure for achieving such a blending. We develop theoretical tradeoffs between statistical risk, amount of data and "externalities" such as computation, communication and privacy. We develop procedures that allow one to choose desired operating points along such tradeoff curves.
Randy Guck (Dell Software)
Slides:   1-PPTX 
Not all big data problems require big cluster solutions. Doradus OLAP compresses data into compact shards, yielding fast analytical queries using little disk even for big data sets. Learn how Doradus leverages OLAP techniques, columnar storage, and Cassandra to yield sophisticated query features while using amazingly little disk space.
Grab a drink, mingle with fellow Strata + Hadoop World participants, and see the latest technologies and products from leading companies in the data space.
Patrick McFadin (Datastax)
Apache Cassandra has proven to be one of the best solutions for storing and retrieving time series data. Add in Apache Spark and Kafka, and you have an amazing time series solution. We will talk data models, then go through deployment and code to build a functional, real-time application. Languages used: Java, Scala.
HP will discuss two innovations to help you take on analytics for your Hadoop Data.
Anu Tewary (Intuit), Lucian Lita (Yoyo Labs), Jonathan Goldman (Intuit)
Data scientists navel gazing in a corner. Engineers not thinking, just refactoring. Product just making slides. That’s no way to build data products. Is it even possible to have them play well together, without promising free lunches, unlimited gummy bears, and a Red Bull IV? We share our experience about what worked and what didn’t, both in a startup and in a big company environment.
Ozgun Erdogan (Citus Data)
PostgreSQL has recently become the most popular database for technology companies. In part, it owes this success to rethinking the monolithic SQL database, and providing an extensible architecture instead. In this talk, we will describe key challenges associated with scaling out SQL. We will then show PostgreSQL extensions that overcome these challenges, and describe how they do so.
Robert Grossman (University of Chicago)
Slides:   1-PDF 
Finding anomalies is essential for a wide range of applications, including cybersecurity, event detection and health and status monitoring. Anomaly techniques that scale successfully to large datasets tend to integrate machine learning with good data engineering. We discuss three case studies and extract eight techniques that have proved effective for detecting anomalies in large scale systems.
Ram Shankar Siva Kumar (Microsoft (Azure Security Data Science)), Marco Di Placido (O365 Security Signals)
The audience will learn about the novel ways of using ranking algorithms in intrusion detection systems, how to provide consistent security monitoring in a constantly changing environment and finally, data scientists will walk away with an actionable framework for testing their system even with the lack of labelled attack data.
Noelle Saldana (Heroku)
While many companies use data science to increase profits, some nonprofits are using it to save lives! Crisis Text Line connects teens in crisis to counselors via text message, and recently partnered with DataKind and Pivotal on a pro bono project to more quickly route teens to help. Go behind the scenes to learn how they came together to make an impact and how you can too!
Adam Jorgensen (Pragmatic Works)
Slides:   1-PPT 
Retail buyers are the backbone of the industry’s profitability. These individuals drive organizational goals with their performance. Many decisions are made by intuition and “gut” feeling, where predictive analytics would have made significant improvements in outcomes. This session takes real-world experiences and shows how to transform retail performance through data-driven buying decisions.
Jike Chong (Simply Hired)
Slides:   1-PPTX 
Learn how tools based on nation-wide job market data can help both students and institutions improve outcomes from the job market level down to curriculum and course choice.
Andreas Mueller (NYU, scikit-learn), Jennifer Klay (Cal Poly San Luis Obispo), Peter Wang (Anaconda), Travis Oliphant (Anaconda), Andy Terrel (NumFOCUS), Matthew Rocklin (Anaconda), Wes McKinney (Two Sigma Investments), Stefan van der Walt (UC Berkeley), Kyle Kelley (Netflix), Jonathan Frederic (IPython)
Join the presenters of the PyData Tutorials for further discussions on some of the most used tools in the Python data stack. This is a great opportunity to ask questions and share insight with those who have authored or contributed to scikit-learn, NumPy, Bokeh, IPython, Numba, Blaze, pandas, and scikit-image.
Andreas Mueller (NYU, scikit-learn), Jennifer Klay (Cal Poly San Luis Obispo), Peter Wang (Anaconda), Travis Oliphant (Anaconda), Andy Terrel (NumFOCUS), Matthew Rocklin (Anaconda), Wes McKinney (Two Sigma Investments), Stefan van der Walt (UC Berkeley), Jonathan Frederic (IPython), Kyle Kelley (Netflix)
Slides:   1-PDF 
Python has become an increasingly important part of the data engineering and analytics tool landscape. PyData at Strata provides in-depth coverage of the tools and techniques gaining traction with the data audience, including IPython Notebook, NumPy/matplotlib for visualization, SciPy, scikit-learn, and how to scale Python performance, including how to handle large, distributed data sets.
Garrett Grolemund (RStudio), Nicholas Horton (Amherst College), Winston Chang (RStudio)
From advanced visualization, collaboration, and reproducibility to data manipulation, R Day at Strata covers a raft of current topics that analysts and R users need to pay attention to. The R Day tutorials come from leading luminaries and R committers, the folks keeping the R ecosystem abreast of the challenges facing analysts and others who work with data.
Ted Dunning (MapR, now part of HPE), Ellen Friedman (Independent)
Slides:   1-PDF 
What’s important about a technology is what you can use it to do. We’ve looked at what a number of groups are doing with Apache Hadoop and NoSQL in production, and we’d like to relay what worked well for them and what did not. . .
Michael Conover (LinkedIn)
Building real-time relevance systems for mobile presents a unique blend of challenges from both modeling and architectural perspectives. In this talk, we’ll take an in-depth look at the machine learning infrastructure that powers Connected, LinkedIn’s mobile application that helps our members nurture and leverage their professional networks.
Adam Smith (Automated Insights)
This session tells the story of how a young technology company helped The Associated Press embrace cutting-edge data innovation at the heart of its business: the automation of corporate earnings stories. The challenges along the way offer valuable lessons for any brand engaged in a major data implementation – and any vendor who wants to help.
Harrison Mebane (Silicon Valley Data Science)
This session will discuss how to build a resilient, multi-modal event-detection system based on error-prone sources—video, audio, natural language, and external APIs. We will briefly review event-detection techniques and then demonstrate how to combine these across multiple data streams.
Annika Jimenez (Pivotal), Kaushik Das (Pivotal), Rashmi Raghu (Pivotal), Woo Jung (Pivotal), Srivatsan Ramanujam (Pivotal)
With a global team of 30 Data Scientists doing innovative work in almost every vertical market, Pivotal has a rich view into the trends impacting enterprises and their approach to Big Data.
Pankaj Mathur (Acxiom)
Companies using data, especially those deploying analytics-driven workflows, are challenged to find the right mix of first-party and third-party data. Many of these challenges stem from a lack of clarity about data sources and their reliability, privacy laws, and the logistics needed for mass-scale data aggregation.
Lance Olson (Microsoft)
Slides:   1-PPTX 
In this session, we will show you how easy it is to spin up a 32 node Storm cluster and give all attendees a free unlimited 30-day pass to deploy your own Hadoop cluster on Microsoft Azure.
Naser Al (Altiscale, Inc.)
Slides:   1-PPTX 
In this from-the-trenches, DevOps-focused talk we explore operational issues in running Hadoop on top of Docker containers in a production, multi-tenant setup. With Hadoop's native Docker support still in the works and Docker being more of a development tool, a production deployment of the two together is like swimming in treacherous waters... Here's a lantern and a lifeboat to the rescue.
Nick Curcuru (Mastercard), Craig Duncan (MasterCard), Craig Hibbeler (MasterCard Advisors)
In this session, attendees will gain an understanding of the technology and processes crucial to delivering a secure platform. Additionally, they’ll benefit from insights on how to ensure their organization’s Hadoop environment complies with stringent security requirements. Recent implementations of compliance programs will be highlighted as part of the discussion.
Slides:   1-PPTX    2-PDF 
Learn how SAS applications use YARN in order to be a good citizen in a busy Hadoop cluster. Best practices and customer examples for several different user scenarios will be shared and discussed.
Slides:   1-PPT 
This talk discusses how to do real-time analytics with a SQL-like query language. We will discuss the role of Complex Event Processing (CEP) in real-time analytics, and then describe a scalable CEP engine that lets users write their queries in a declarative, SQL-like CEP query language and execute those queries using a graph of CEP nodes deployed on top of Apache Storm.
Daniel Eklund (Think Big, a Teradata Company), Rick Stellwagen (Think Big, a Teradata Company)
This presentation will highlight why the concept of the "data lake" is a game changing paradigm. Attendees will examine data lake architecture and tradeoffs in construction and operations in data lake design as seen from Think Big consulting engagements.
Costin Leau (Elastic)
Search is more than typing words into a box. It's evolved into the backbone for today’s analytics demands, and an asset for businesses to ask the right questions to make sense of their data. This session will highlight how a versatile, agile search and analytics platform can give shape to data, and uncover the “uncommonly common” trends within, to make the right data-driven decisions.
Gary Davis (McAfee, a division of Intel Security)
Slides:   1-PPTX 
Consumers are widely adopting wearable technology – Deloitte predicts there will be 100 million wearable cameras, smartwatches, fitness trackers and other gadgets on the market by 2020. With this mass adoption of wearable devices comes a new data ecosystem that must be protected. Embracing the protection of this new, intricate data ecosystem is imperative to the success of the wearable industry.
Cait O'Riordan (Financial Times)
Slides:   1-PDF 
As the number of ways to discover and listen to music increases, Shazam's data becomes even more powerful in predicting music tastes/fashions. Labels/artists/radio stations increasingly look to Shazam to predict what the next big hit or summer smash will be. Shazam also uses its usage data to create new product opportunities.
Pamela Peele (UPMC)
Slides:   1-PPTX 
Big data is the sexy new frontier for many businesses but it’s expensive to stand up in an organization and expensive to buy from an external vendor. What is the most fundamental way to demonstrate that data science matters to the organization? This session covers the meaningful data consumption metric that every data science group needs to track.
Manu Mukerji (8x8), John Akred (Silicon Valley Data Science), Stephen O'Sullivan (Data Whisperers), Tatsiana Maskalevich (Stitch Fix), Harrison Mebane (Silicon Valley Data Science)
What does successful big data and data science really look like? As consultants out in the field, we've learned a lot of lessons and have great stories to tell about success, failure, and how to negotiate a path through a fast-moving technology landscape.
Anand Venugopal (Impetus Technologies Inc.)
This talk will address an emerging problem in the Modern Enterprise Data Landscape and a possible realization of a "Smart Enterprise Big Data Bus" using an open source stack including Apache Kafka and Apache Storm.
Joerg Blumtritt (Datarella)
Slides:   1-PDF 
Each smartphone generates huge heaps of data - up to hundreds of megabytes per day. Apart from location, all sorts of information on behavior and environmental conditions are seamlessly collected in the background of our devices. We will show how to harvest the data and how to tell the story of our everyday lives from the billions of data points that pile up continuously.
The Solutions Showcase Theater on the Strata + Hadoop World Expo floor will feature 60 lightning talks from the leading vendors in the Big Data space. These 10-minute rapid-fire sessions feature Strata + Hadoop World sponsors presenting use cases about real-world companies and how they solve big data's thorniest problems.
Brian Ulicny (Thomson Reuters)
Slides:   1-PPTX 
As the leading source of intelligent information, Thomson Reuters delivers must-have insight to the world’s financial and risk, legal, tax and accounting, intellectual property and science and media professionals, supported by the world’s most trusted news organization.
Paco Nathan, Holden Karau (Independent), Krishna Sankar (U.S.Bank), Reza Zadeh (Matroid | Stanford), Denny Guang-yeu Lee (Databricks), Chris Fregly (Amazon Web Services)
A full-day, hands-on tutorial introducing Apache Spark and libraries for building workflows: Spark SQL, Spark Streaming, MLlib, GraphX, etc.
Paco Nathan, Matei Zaharia (Databricks), michael dddd (Databricks), Reynold Xin (Databricks), Holden Karau (Independent), Reza Zadeh (Matroid | Stanford), Sameer Farooqui (Databricks), Denny Guang-yeu Lee (Databricks), Chris Fregly (Amazon Web Services)
Join the Spark team for an informal question and answer session. Several of the Spark committers, trainers, etc., from Databricks will be on hand to field a wide range of detailed questions. Even if you don't have a specific question, join in to hear what others are asking.
Tathagata Das (Databricks)
Spark Streaming extends Apache Spark to do large scale stream processing. In this talk, I will explain how Spark Streaming is revolutionizing the way Big "Streaming" Data applications are being written, and making it as easy as 1-2-3.
Emma McGrattan (Actian)
Slides:   1-PDF 
In this session you will hear of some of the fascinating use cases for SQL in Hadoop based on real-world customer examples. You will learn some of the innovative techniques that have emerged to overcome limitations of the Hadoop platform that enable features one expects in a proven mature database.
What new companies are at the leading edge of the data space? Meet some of the best, most innovative founders as they demonstrate their game-changing ideas at the Startup Showcase.
Tigran Khrimian (FINRA)
FINRA, an independent regulator charged with protecting investors, processes 30 billion market events per day and analyzes the data in search of patterns that indicate possible manipulation of US financial markets. This talk provides an overview of FINRA's Big Data architecture behind that process.
Alysa Z. Hutnik (Kelley Drye & Warren LLP), Lauri Mazzuchetti (Kelley Drye)
Slides:   1-PPT 
Privacy laws governing a company’s obligations on data collection, use, and disclosure are changing rapidly. Failing to understand how the laws affect a company’s personal data assets can result in media exposés, regulatory investigations, Congressional hearings and lawsuits. This session will provide guidance on “privacy by design” compliance and practical tips to avoid becoming a target of scrutiny.
Jim Scott (NVIDIA)
Slides:   1-PDF 
Processing data from social media streams and sensor devices in real time is becoming increasingly prevalent, and there are plenty of open source solutions to choose from. To help practitioners decide what to use when, we compare three popular Apache projects for stream processing: Apache Storm, Apache Spark and Apache Samza.
Subutai Ahmad (Numenta, Inc.)
The unprecedented increase in streaming data sources requires a new approach to analytical algorithms. Systems must be highly automated, adapt to changing statistics, and naturally deal with temporal data streams. They must require no batch training and should deploy custom models on the fly. It will be impossible to build scalable practical systems without these properties.
Richard Caudle (DataSift)
This session will outline strategies for cost effectively turning enormous streams of Social Data into intelligence for use in any application.
Jeff Pollock (Oracle)
In this session you’ll learn about how to apply Data Discovery and Deep Data Storage for new breakthroughs in data warehousing. We’ll discuss the benefits of using Hadoop technologies like Spark, Kafka, and Hive together with enterprise information architecture and data governance best practices.
Leah Hunter (Tech Journalist)
Slides:   1-PPTX 
People and startups altering the fabric of things through hardware, data science, and entrepreneurial vision. The shape and business of IoT is shifting. Learn about key startups making technological advances and surprising intellectual leaps. We aren't yet indistinguishable from magic. But these people are getting us there.
Anima Anandkumar (UC Irvine)
I will demonstrate how to exploit tensor methods for learning. Tensors are higher order generalizations of matrices, and are useful for representing rich information structures. Tensor factorization involves finding a compact representation of the tensor using simple linear and multilinear algebra.
This session covers a number of these challenges that Hadoop presents to efficient query processing and discusses a number of the novel approaches that modern SQL-on-Hadoop solutions take in order to overcome these hurdles.
Solomon Hsiang (UC Berkeley)
Advances in data science empower leaders to make better decisions for society. By using new kinds of information unavailable during the last several millennia of government, we can avoid mistakes of the past. We will discuss how data and statistical inference are informing how we manage the global climate rationally, a defining policy challenge for our generation.
Josh Baer (Spotify), Rafał Wojdyła (Spotify)
Slides:   1-PDF 
There are many confusing and painful things about setting up and operating a 900-node Hadoop cluster used as the centerpiece of many of Spotify's Big Data initiatives. We'll go over a few interesting stories and frustrations that have influenced the direction of our architectural choices, and the lessons we've learned from them.
Joey Echeverria (Rocana)
As the volume of data and number of applications moving to Apache Hadoop has increased, so has the need to secure that data and those applications. In this presentation, we'll take a brief look at where Hadoop security is today and then peer into the future.
Cathy Tanimura (Strava)
Slides:   1-PPTX    2-PPTX 
The most frustrating part of data science is when customers don’t “get it”: endless revisions, recommendations not implemented, or data products not adopted. Exciting new research in neurology, cognitive psychology, and behavioral economics has a lot to say about why. We’ll explore the findings and implications for designing more successful “human-data interfaces.”
Sensor devices are proliferating fast now that manufacturing price is under $100. And while more data is being generated, more is also being thrown out because of the resource gap between compute and network. We must make new computational trade-offs whilst ensuring quality. In this talk, we discuss these trade-offs and examine architectures for peer-based and crowd-sourced model generation.
Rado Kotorov (Information Builders)
This session uses actual case studies to illustrate how organizations are innovating, changing and growing their businesses with Big Data. The presentation will discuss the data requirements and the front end analytic applications used to deliver game changing Big Data initiatives. . .
John Haddad (Informatica)
What types of organizations generate the most revenue and profits from Big Data initiatives? They need to be agile and adopt new technologies, grow existing resource skills while attracting new skills, and yet still manage and govern data. In this session we’ll describe organization structures, roles, skills, and interactions that make these types of data-driven organizations successful.
Woody Christy (Cloudera), Steve Anderson (Intel), Patrick Schots (Intel), Floris Grandvarlet (Cisco)
There is often debate in the Hadoop community about the correct hardware combination for a cluster. In this talk, attendees will learn how varying different components impacts performance and how to choose the right components for their own workloads.
Joseph Adler (Facebook), Robert Johnson (Interana)
Thirty years ago, data warehouses revolutionized data storage at big companies, storing summarized data in a strict structure and making it possible to efficiently analyze data. We believe that modern technology lets you adopt a simpler and more powerful scheme to organize historical data: time ordered raw event logs. In this session, we'll explain why raw data is better.
Yanpei Chen (Cloudera), Karthik Kambatla (Cloudera)
You will never look at SSDs the same after this presentation. We discuss how SSDs improve the performance of MapReduce workloads. We identify cost-per-performance as a more pertinent metric than cost-per-capacity when evaluating SSDs versus HDDs for performance, and quantify that SSDs can achieve up to 70% higher performance for 2.5x higher cost-per-performance.
Slides:   1-PDF 
If we want to use data to understand human behavior and to design successful interventions to change that behavior, social scientists and data scientists will need to work together. However, the two often approach problems differently and speak strikingly different languages. This talk will present success stories and tips for productive collaboration between social scientists and data scientists.
Jairam Ranganathan (Cloudera)
With hundreds of developers from a variety of organizations participating, Hadoop moves quickly. This talk will survey the important changes admins and users should be aware of and their impact on various use cases.
The explosion of internal data sources, data “lakes” (e.g., Hadoop), external public data sources, and feeds from the Internet of Things is creating a tsunami of diverse data sources for enterprises to leverage. Top-down data-integration and data-scientist tools won’t scale to meet integration demands. Learn how a scalable data curation platform can help enterprises with data integration at scale.
Roger Magoulas (O'Reilly Media), Doug Cutting (Cloudera), Alistair Croll (Solve For Interesting)
Program Chairs Roger Magoulas, Doug Cutting, and Alistair Croll welcome you to the first day of Strata + Hadoop World Keynotes.
Birds of a Feather (BoF) sessions are informal roundtable discussions happening during lunch on Thursday, February 19 and Friday, February 20. You can join any BoF table or start your own with a topic of your choice. The BoF sign-up board will be near the Registration area.
Author book signings will be held in the O’Reilly booth on Wednesday, Thursday, and Friday. This is a great opportunity for you to meet O’Reilly authors and to get a free copy of their book. Complimentary copies will be provided for the first 25 attendees. Limit one free book per attendee.
Office Hours are your chance to meet face-to-face with Strata + Hadoop World presenters in a small-group setting. Drop in to discuss their sessions, ask questions, or make suggestions.
John Carnahan (Ticketmaster)
We will describe how we have used Storm, stream processing, and machine-learned classifiers to improve access to tickets during onsales and how this can extend to similar recipes for trend prediction and anomaly detection. We will also describe how we use tools such as Kafka, Storm, and HBase to build an optimal solution for real-time “n-squared” marketing.
Monte Zweben (Splice Machine Inc.)
Slides:   1-PPTX 
Once just the realm of Java jockeys and data scientists, Hadoop has become a mainstream tool for business analysts with the rapid proliferation of SQL-on-Hadoop solutions. But there are pitfalls that can plague implementations as IT teams get their first exposure to production Hadoop environments. We’ll discuss the most common pitfalls companies face and how to get around them.
Chad Meley (Teradata), John Kreisa (Hortonworks)
Hadoop and the Internet of Things have enabled data-driven companies to leverage new data sources and apply new analytical techniques in creative ways that provide competitive advantage. We will discuss real-world case studies from the field that describe the strategies, architectures, and results from forward-thinking companies across a variety of verticals.
Anirudh Todi (Twitter Inc.)
Slides:   1-PDF 
Twitter's users generate tens of billions of tweet views per day. Aggregating these events in real time - in a robust enough way to incorporate into our products - presents a massive scaling challenge. In this talk I'll introduce TSAR (the TimeSeries AggregatoR), a robust, flexible, and scalable service for real-time event aggregation designed to solve this problem and a range of similar ones.
Patrick Wendell (Databricks)
Slides:   1-PPTX 
Apache Spark is a popular engine for large scale analytics. This talk will give insights into tuning and debugging a production Spark deployment. It will start with details about Spark internals and an overview of the runtime behavior of a Spark application. I'll explain how to diagnose performance bottlenecks and get the best performance out of Spark jobs.
Michael Abbott (Stanford University), Ki Ra (Directly), Christopher Pouliot (Nio), Mike Polcari (23andMe)
Most people are familiar with the basic principles driving today’s hottest big data and enterprise companies. But what’s really going on underneath the hood? In this session, Kleiner Perkins Caufield & Byers General Partner Michael Abbott unboxes a variety of startups in the space to examine the technology, architecture, and innovations they’ve harnessed to deliver superior products and services.
Kuang Chen (Captricity)
Enterprise data grows over 65% a year. Last year, non-productive information work—reformatting, data entry, and so on—consumed more than US$1.5 Trillion. Yet companies continue to pour billions into human-driven paper-to-digital processes.
LianHui Wang (Tencent)
In this talk, we introduce the general data architecture of Tencent with a focus on our Spark use cases on a GAIA (our improved resource manager based on YARN) cluster of 8000+ nodes. We contrast Spark with the previous MapReduce use cases, followed by tuning methods and optimizations for large scale clusters.
Nima Sarshar (inPowered)
There is no shortage of opinions expressed across the Web on virtually any topic. This enables a diversity of voices to be heard, but often leaves users overwhelmed. We report on our implementation of a big data platform that identifies and ranks experts on a large number of topics. It allows users to cut through the noise and discover opinions expressed by credible experts in topics of interest.
India Swearingen (United Way of the Bay Area)
Slides:   external link
Social service organizations have a tough job when it comes to using data to drive social impact. With “world saving” goals and large scale impact, it's crucial that these organizations leverage a variety of data streams and do more with less. But pulling multiple data streams together and leveraging partners can be tricky. This session walks through some ins and outs using United Way as one example.
Stewart Collis (aWhere Inc.)
As climate change increases weather variability, farmers must adapt. Add to this global population growth and diet changes that will require world food production to increase by 70 percent by 2050, and farmers will struggle to meet demand. Of the 580 million farmers in the world, 500 million have little access to technology or information to ensure agile adaptation. Big data helps solve this problem.
Richard Williamson (Silicon Valley Data Science)
Slides:   1-PPTX 
Getting the full value from data often requires combining stream processing on new events with large scale historical analysis. While both these activities are served by Spark’s execution framework, leveraging multiple persistence layers is key to efficiently and extensibly enabling these use cases.
Daniel Crankshaw (UC Berkeley)
In this talk, I will introduce Velox, the newest component of the Berkeley Data Analytics Stack. Velox is the missing piece in the predictive analytics stack enabling interactive applications ranging from content recommendations to personalized search by addressing the challenges of serving and managing personalized machine learning models at scale.
Alyosha Efros (UC Berkeley)
In this talk, I will describe some of our efforts to bypass the "language bottleneck" and other information to help in visual understanding and visual data mining.
Moderated by:
Cornelia Levy-Bencheton (CLB)
Michele Chambers (RapidMiner), Alice Zheng (Amazon), Neha Narkhede (Confluent)
The future is all about information. It will belong to those who can find it, understand it and know how to use it. In this panel discussion, we explore evidence-based benefits of welcoming more women into the tech community and of increasing female talent power on work teams.
Danyel Fisher, Miriah Meyer (University of Utah)
Slides:   external link
We call lots of things "data visualization," from a news interactive, to spreadsheets, to an infographic counting calories. These surface similarities hide deep differences in what it means to interact with data. In this talk, we cross disciplines—from data science to design—to enliven our techniques and encourage us to try new methods for creating visualizations.
Anne Johnson (Credit Suisse)
Slides:   1-PPTX 
As the Global Head of Investment Risk, Anne Johnson of Credit Suisse takes data quality very seriously. A single misplaced number can put billions of dollars of clients’ assets at risk. Find out some of the challenges that Anne and her team face in governing the integrity of their data and the new ways they are thinking about data integration and quality.
Michael Dauber (Amplify Partners), Matthew Ocko (Data Collective), Max Gazor (Charles River Ventures), Cack Wilhelm (Scale Venture Partners), Arif Janmohamed (Lightspeed Venture Partners)
To anticipate who will succeed and invest wisely, investors spend a lot of time trying to understand the longer-term trends within an industry. In this panel discussion, we’ll consider the big trends in Big Data, asking top-tier VCs to look over the horizon and discuss the visions they have for two or more years in the future.
Doug Stein (metacog, Inc.)
Big Data can transform learning from the past's one-way push of finely crafted content "at" learners to a two-way data conversation that streams real-time feedback from students as learning challenges are tackled. What kind of rich opportunities exist in analyzing and visualizing that two-way data stream as learners interact with open-ended content?
Ron Bodkin (Google)
YARN has featured in the marketing of Hadoop distributions for the past two years. All the major vendors now ship production versions. What is the real-world state of deployment? What does it let you do? What are the limitations? In this talk we review three distinct deployments, look at benefits and challenges, and highlight lessons for those considering taking the plunge.
John Myles White (Facebook)
In this talk, I'll describe the ways in which Julia improves upon the current generation of languages used for data science.
Dean Wampler (Anyscale)
Spark is an open-source computation platform for Big Data. All the major Hadoop vendors have embraced Spark as a replacement for MapReduce, because Spark offers much better performance, a more powerful and productive API, and support for event processing. Spark's secrets for success are the Scala programming language and Functional Programming. We'll explore why.
If you’re a woman looking for like-minded communities to join, c’mon down to our meetup on Wednesday evening after the Opening Reception for more appetizers, drinks, and networking. In addition to meeting other women (and men) in the community, you’ll hear lightning talks from representatives of groups, companies, and projects that support diversity in the technology community.
Ted Dunning (MapR, now part of HPE)
YARN and Mesos are often positioned as competitors for managing datacenter resources, but in reality they can work together to share datacenter resources seamlessly. Why force IT to choose between these two great technologies when we can show you how they work in concert?
Kathleen Ting (Cloudera), Miklos Christine (Databricks)
The next generation of MapReduce, YARN, has widely touted job throughput and Apache Hadoop cluster utilization benefits. Less known are the pitfalls littering the migration path to YARN. Learn from our extensive field experience to avoid those pitfalls and get your YARN cluster configured right the first time.
Alistair Croll (Solve For Interesting)
Roughly every decade, some kind of military or enterprise technology makes its way into the mainstream: the personal computer; the consumer Internet; the mobile phone; the Internet of Things. What happens when Big Data turns into a consumer product? Strata chair Alistair Croll offers some speculation about what data will do to the way we live, love, work, and play.
Rahul Pathak (Amazon Web Services)
Join us as we explore the big data services of AWS in a speaker-led tutorial, with a link to a lab that you can take with you.