Strata + Hadoop World Speaker Slides & Video

Presentation slides will be made available after the session has concluded and the speaker has given us the files. Check back if you don't see the file you're looking for—it might be available later! (However, please note some speakers choose not to share their presentations.)

Python is the language of choice when it comes to integrating analytical components. We will present a series of concepts and walkthroughs that illustrate how easy scientific computing is in Python, from machine learning and time series to spatial relationships and network analysis.
Apache Hadoop is enabling companies across many different industries that need to process and analyze large data sets. In this tutorial you will learn why and how people are using Hadoop and related technologies like Hive, Pig and HBase.
Nobody knows data like a web analyst. That’s because everything we do online leaves a digital breadcrumb trail that’s easy to track and mine. The real world is less well instrumented—but that’s changing. Noted analyst Marshall Sponder takes us on a tour of some applications that blend real-world sensors with deep analytics.
Presentation: external link
Open-source developers all over the world contribute to millions of projects every day on GitHub: writing and reviewing code, filing bug reports and updating docs. Data from these events provides an amazing window into open source trends: project momentum, language adoption, community demographics, and more.
Samantha Ravich, former National Security Advisor to Vice President Richard Cheney, will discuss the challenges that face strategic decision makers from the wealth of data now provided by advances in technology.
A successful big data analytic project is not just about selecting the right algorithm for building a predictive model, but also about how to deploy the model efficiently into operational systems, how to evaluate the effectiveness of the model, and how to continuously improve it. In this tutorial we cover best practices for each of these phases in the life cycle of a predictive model.
Proper tooling and good habits that maximize reproducibility are essential to being productive as a data scientist. From management of raw data to model version control, the entire workflow must be carefully controlled from end-to-end to produce quality research that scales with the quantity and complexity of data being analyzed.
Hadoop started as an offline, batch-processing system. It made it practical to store and process much larger datasets than before. Subsequently, more interactive, online systems emerged, integrating with Hadoop.
Hadoop is considered THE technology for addressing Big Data. While it shines as a processing platform, it does not respond anywhere close to "human time". In developing our solution, we needed the ability to query across billions of rows in seconds. Hear how and why we developed Druid, our distributed, in-memory OLAP data store after investigating various commercial and open source alternatives.
Society confronts enormous challenges today: How will we feed nine billion people? How can we diagnose and treat diseases better, and more cheaply? How will we produce more energy, more cleanly, than ever before? Big questions like these demand new approaches, and "Big Data" is a crucial part of the toolkit we will use over the coming years to answer them.
In this session, Kevin Foster, IBM Big Data Solution Architect, will provide an overview of big data analytic accelerators and how they are being used by organizations to speed up deployments and solve big data problems sooner.
Nokia’s Big Data analytics service is a strategic multi-tenant, multi-petabyte platform that executes 10,000 jobs each day. It is made up of technologies that provide location content processing, ETL, ad-hoc SQL, dashboards and advanced analytics, including Calpont InfiniDB for SQL, Scribe, REST, Hadoop, and R. This talk discusses the platform, motivations behind design choices, and challenges.
In recent years, "Big Data" has matured from a vague description of massive corporate data to a household term that refers to not just volume but the diversity of data and velocity of change. Today, there's a wealth of data trapped in corporate data repositories, new platforms like Hadoop, a new generation of data marketplaces and volumes generated hourly on the Web.
Presentation: external link
Since the first human scrawled an image on a cave wall, the brain has been processing petabytes of data. Today, we're passing through an historical threshold where big data is leaching out of our braincases into the disembodied cloud. For the first time in human existence, we can "think" outside of our brains. What does this mean for privacy, morality, ethics, and the law?
If you are a manager on the IT team in your organization, chances are there is already a lot of buzz around big data. If you are wondering if this hype amounts to just another IT project, wait. Big data affects the whole enterprise, and requires more business ownership and drive than just IT.
Amy O'Connor, Sr. Director of Nokia Analytics, together with her daughter and Nokia Intern, Danielle Dean, will share what makes a great data scientist and their different paths to acquiring the diverse skill sets that are needed. Finally, Amy will discuss how to spot, attract and train emerging data scientists in what is quickly becoming a heated market.
The onset of the Big Data phenomenon has created a unique opportunity, but the challenge ahead of us is to move beyond Big Data infrastructure to morally and practically useful applications. This requires new technologies that close the "Understanding Gap" and, by doing so, can make great strides to prevent evil, reduce suffering, and create more actualized human potential.
Apache Flume (incubating) is a scalable, reliable, fault-tolerant, distributed system designed to collect and transfer massive amounts of event data from disparate systems into some storage tier such as Hadoop HDFS. In this tutorial we show how to easily build a large-scale data collection and transfer system in a scalable way using Flume NG, the next generation of Flume.
Data manipulation, cleaning, integration, and preparation can be one of the most time consuming parts of the data science process. In this talk I will discuss key points in the design and implementation of data structures and algorithms for structured data manipulation. It is an accumulation of lessons learned and experience building pandas, a widely-used Python data analysis toolkit.
Big data isn’t just about volume. It’s also about speed—making decisions in real time, and enabling interactive exploration—and richness of information. While we talk a lot about how large organizations can benefit from mining and connecting data sets, Big Data can help the little guy, too.
In this case study, David Boyle will look at how EMI changed itself, and the music industry, by moving from gut instinct and opinions to a data-informed business.
Online marketplace Etsy cares a lot about its customers, and what they’re worth. How do we understand the value they bring to an organization? In this session, Roberto Medri shows us how Etsy thinks about Customer Lifetime Value, and how this data is used to target your best customers, mitigate churn, and even get a meaningful value of your company’s worth.
In this rapid-fire keynote, we’ll introduce how virtually every new technology trend is inextricably linked – or should be to attain maximum leverage. We’ll discuss how you can use technologies such as cloud and mobility to spread the value of analytics pervasively across your virtual organization, and how that positively impacts your employees, customers and partners.
This session will delve into the MapReduce computation paradigm, introduced by Google and widely adopted via the open-source Hadoop platform, combined with commodity hardware to execute computation at the storage node where data exists.
Imagine the social graph where personal relationships are replaced by commercial relationships based on real financial data. Imagine the possibilities for small businesses to grow, connect, transact and prosper.
Communicating Data Clearly describes how to draw clear, concise, accurate graphs that are easier to understand than many of the graphs one sees today. The tutorial emphasizes how to avoid common mistakes that produce confusing or even misleading graphs. Graphs for one, two, three, and many variables are covered as well as general principles for creating effective graphs.
An effective data science team looks a lot like an effective design team: brainstorming creative ideas, making prototypes, receiving feedback, telling stories, and deeply understanding the needs of others.
In this tutorial, we’ll provide an introduction to an open source Map/Reduce library for R called RHadoop that makes Map/Reduce programming convenient and easy to understand for statistical modeling users. The session will cover the basics of RHadoop, common techniques and best practices, and some interactive real-world examples.
How does Opower deliver insights to millions of households with big (and getting bigger) data? I discuss how to effectively use Hadoop, integrate it with R and Python, and harness an engaged workforce to solve data science and efficiency problems.
This workshop is a jumpstart lesson on how to get from a blank page and a pile of data to a useful data visualization. We'll focus on the design process, not specific tools. Bring your sample data and paper or a laptop; leave with new visualization ideas.
An increasing number of organizations are embracing data to drive intelligent decisions. For many industries, this is a monumental shift in method and culture. Data communication strategies come in many flavors, from static metric reports to immersive data experiences. In this session I present a user-centered framework for designing or evaluating data delivery methods.
In this joint session, experts from Cisco and Cloudera reveal the fundamental design considerations of Hadoop in the Enterprise Data Center. Drawing from lessons learned in the real world, they'll share best practices from deployments of Cloudera's Hadoop distribution alongside Cisco's networking components.
Data is getting bigger faster than ever, and visualization is emerging as the preeminent tool for gaining insight, gleaning answers, and making decisions informed by your mountain of data. Unfortunately, most of what's being presented visually these days is, at best, more style than substance, and at worst, wildly misleading.
ODS is Facebook's internal large-scale monitoring system. HBase turns out to be a good fit for its workload and solves some manageability and scalability challenges with the previous MySQL-based setup. We would like to share a series of valuable lessons learned from building this large-scale real-time system based on HBase.
You need more than a database 'hammer' for today's Big Data projects. Organizations need a 'data platform' providing integrated tools to capture, store, process and present data. Without it, companies can achieve volume, velocity, or variety, but not all three. Join us to learn the extreme capabilities needed to distill new business signals from big data.
Can a million monkeys on a million typewriters eventually recreate Shakespeare? The great minds since Aristotle have been thinking about this theorem. In 2011, Jesse Anderson randomly recreated Shakespeare using Hadoop. Here's why you should care.
The exponential growth of graph-based data analysis is fueling the need for machine learning. Recently, frameworks have emerged to perform these computations at large scale. But feeding data to these frameworks is a challenge in itself. This talk introduces the GraphBuilder library for Hadoop, which makes the job easier for programmers. Several case studies showcase the utility of the library.
Presentation: external link
This hands-on tutorial teaches you how to set up and use Hive, a high-level data warehouse tool for Hadoop. Hive provides a SQL-like query language, HiveQL, that is easy to learn for people with prior SQL experience, making Hive attractive for data warehousing teams. Hive leverages the power of Hadoop for working with massive data sets without requiring expertise in MapReduce programming.
With the rise of Apache Hadoop, a next-generation enterprise data architecture is emerging that connects the systems powering business transactions and business analytics.
A look at using Hadoop, HBase and other technologies to bring together and process health data from many sources in real time. This includes techniques for dealing with data that's incomplete or out-of-order when it arrives, merging bulk and real-time data sets, and creating search indexes and data models to enable better health care.
This session will provide insights into how the combination of scale, efficiency, and analytic flexibility creates the power to expand the applications for Hadoop to transform companies as well as entire industries.
Big Data takes on the planet’s toughest challenge by analyzing weather’s complex behavior. Using hundreds of terabytes of data and trillions of simulation datapoints, The Climate Corporation models weather’s impact on crops to create customized insurance for farmers facing the financial impact of extreme weather.
How a traditional Spanish-language media company made the strategic decision to build a robust analytics intelligence division to more effectively target the Hispanic market. Attendees will walk away with insights on how this traditional media company implemented big data and MapReduce operations from the ground up.
How do you keep up with the velocity and variety of data streaming in from the operational systems that power your business? What about getting analytics on your data even before you persist and replicate it?
We’ve looked at the many uses of data, and what governance is required. But are our expectations of privacy realistic? In this session, Terence Craig and Mary Ludloff, authors of Privacy and Big Data, ask (and answer) the question: What level of privacy do we really have in the digital age?
Big data initiatives often begin with a pilot project. This can generate internal support to invest in larger big data initiatives. Nevertheless, executing pilot projects can be difficult, and many pilots don’t convert into larger big data projects. In this session we’ll explore the challenges of big data pilots and suggest ways to plan and execute a successful pilot.
Data has been locked in a mindset of rows and columns. Our brains are trapped by database schemas. To get out of that predisposition and communicate visually requires new thinking. This session covers techniques for reframing our thoughts about data, how to describe data, forming a narrative, and coming up with visual solutions.
This presentation discusses how cluster design impacts performance. It will cover several different design options and the trade-offs in terms of performance and cost, as well as some of the tuning options based on the underlying hardware considerations.
Presentation: external link
While many of the necessary building blocks for data processing exist within the Hadoop ecosystem, it can be a challenge to assemble them as a production ETL platform. This presentation covers one approach to data ingest, organization, format selection, process orchestration, and external system integration, based on collective experience acquired across many production Hadoop deployments.
In this session, Deborah Cooper will show how organizations can use one of the largest free data sets around, along with privately collected data, to provide business insights and drive market strategy. She’ll introduce some of the available datasets, show how to organize them, and provide examples that integrate public and private datasets together for practical decision making.
MLB captures 10Tb of game data every year. The data is valuable, but the lesson was quickly learned that using it effectively required different visual front-ends for fans, players, coaches and scouts. The ability to adapt and address different audiences helped the success of this project and can help other big data projects.
Apache Pig makes Apache Hadoop easier to use thanks to its high-level data flow language, Pig Latin. In this talk, we will discuss common data analysis tasks, the choices one can make while writing a query and impact of each on performance. The core principles behind the optimization recommendations shared during this presentation are applicable to all MapReduce applications.
The Hadoop and data science communities have matured to the point now that common design patterns across domains are beginning to emerge. Now that Hadoop is maturing and momentum is gaining in the user base, the experienced users can start documenting design patterns that can be shared. In this talk, we'll talk about what makes up a MapReduce design pattern and give some examples.
New York City is a complex, thriving organism. Hear how data science has played a surprising and effective role in helping the city government provide services to over 8 million people, from preventing public safety catastrophes to improving New Yorkers' quality of life.
Anne Milgram, Senior Fellow at the NYU Law Center on the Administration of Criminal Law.
A lot has been presented on the tools, models and technologies you should employ in Big Data, but this talk will focus on the critical strategies and tactics you need to employ. In particular, it will address topics such as finding and hiring the right talent, designing the right roles and responsibilities, building the right processes into your group, and molding the culture of the whole organization.
Our Data Science tech stack has shifted from best-of-breed, "classic" business intelligence technologies to a hybrid environment, fully leveraging Hadoop and other Big Data solutions. Our philosophy has also evolved, now distilled in thinking and practice into "data science as a service". Why did we do it? What does it look like? What are the benefits? Come find out.
The story of Big Data technology has centered on engines, algorithms, and statistical methods for data analysis. Less has been said, and too little has been done, regarding technology to improve the lives of data analysts.
Performing investigative analysis on data stored in HBase is challenging. Most tools operate on files stored in HDFS, and interact poorly with HBase's data model. This talk will describe characteristics of data in HBase and exploratory analysis patterns. We will describe best practices for modeling this data efficiently and survey tools and techniques appropriate for data science teams.
All the data in the world won’t make a difference if we can’t change people’s minds. There’s overwhelming evidence that we don’t behave rationally, and that small changes in how information is shared or tainted have huge impacts on its effects.
There has been a lot of excitement lately about streaming approaches to handling Big Data such as Storm, S4, SQLStream, and InfoStreams. But many use cases can be better handled by low latency access with NoSQL databases and search indexing backed by scoring with batch analytics in Hadoop. We compare such integrated Big Data with streaming systems and look to the future.
The quantity of digital information collected and processed every day is growing at an exponential rate. To make sense of this mountain of data we can no longer afford the delays of batch processing systems. In this track we'll introduce Storm, a new, real-time analytic framework, and show how to use it to massively parallelize information analysis, to get instant results from your data.
To kick off Bridge to Big Data Day, we present two views of big data. Is it truly something new, or just an evolution of what we have already? Join us for an interesting and entertaining talk that will help frame your thinking on big data.
This tutorial will help participants understand why distributed search is important and teach them how to use the landscape of tools available. Based on our hands-on experience at NetApp, we will lead a tutorial session that will teach participants how to set up and use search technologies such as Apache Solr and Lucene to enable real-time Big Data analytics with Hadoop, HBase, and other NoSQL technologies.
Your DNA, written out as a string of G, A, T, and C, is about three and a half gigabytes long. That string is about 99.9% identical to an arbitrary Reference Genome. Practically all of those differences are harmless, but a tiny fraction can cause disease, contribute to disease, or just change how your body reacts to drugs. We're using Hadoop to find the variants that actually matter.
Building analytical models is a process of trial and error. Often it makes sense to sample down a data set so that numerous methods and new variables can be tried quickly. Consider moving to the entire data set with Hadoop only after the lessons gleaned from the failures have been incorporated into a few candidate models.
The accepted wisdom from the very beginning of the Web was that the internet would change everything—media, marketing, commerce, communications. So why do marketers still try to find audiences using the clumsy tools of traditional media? In this session, Tom Phillips will challenge current marketing methodologies, arguing that in an era of big data, it’s time for the machines to take over.
This tutorial will explore the tools and techniques you need to ensure that your MapReduce applications are both correct and efficient. You'll learn how to do unit testing, integration testing and performance testing for your Hadoop jobs, as well as how to interpret diagnostic information to isolate and solve problems in your code.
Building a reliable data-driven solution to a complex business problem is like designing a pocket watch from scratch. At the heart of successful analytics is the art of decomposing the looming big objective into smaller components, each of which may have its own data feed, modeling technique and runtime constraint. We showcase this process on the example of M6D’s online display advertising.
While moving away from single powerful servers, distributed databases still tend to be monolithic solutions. But e.g. key-value storage is rapidly becoming a commodity service, on which richer databases might be built. What are the implications?
Business users' attitude to data is changing rapidly – remember when building an EDW was all consuming? Now Big Data is edging the EDW to the side or likely into obscurity. Is this good or bad? How do you bring the values and software investment surrounding the EDW to the wild west of Big Data?
Data integration for Big Data projects can consume up to 80% of the development effort, and yet too many developers reinvent the wheel by hand-coding custom connectors, data parsers, and data integration transformations. A metadata-driven, codeless IDE with pre-built transformations and data quality rules has proven to be up to 10X more productive than hand coding and easier to maintain.
Hadoop is scalable, inexpensive and can store near-infinite amounts of data. But driving it requires exotic skills and hours of batch processing to answer straightforward questions. Learn how everything is about to change.
Over the past two decades, Rick Smolan, creator of the best selling "Day in the Life" books, has produced a series of ambitious global projects in collaboration with hundreds of the world’s leading photographers, writers, and graphic designers. This year Smolan invited more than 100 journalists around the globe to explore the world of Big Data.
Presentation: external link
This session presents a simple analytical and generative toolkit for interface design. It provides designers with an effective starting point for creating satisfying and relevant user experiences for Big Data and discovery interfaces. The toolkit helps designers understand and describe users' activities and needs, and then define and design the interactions and interfaces necessary.
A fireside chat with Cathy O'Neil about why universities can't make data scientists. Lots of companies want to hire data scientists, and there aren't enough to go around. Some universities are adding data science graduate departments, but they're facing an uphill battle, thanks to a lack of good data for academics, political infighting, and scalability issues.
Data science is a team sport. Collaboration inside and outside your organization is the ultimate Big Data technique. Success depends on having a collaboration platform and solving the number one problem of the Big Data era: the supply and demand for data scientists. Learn how you can take action today to accelerate the success of your data science efforts.
Trecul is a dataflow system that powers Akamai's Online Advertising business, processing billions of events hourly. Trecul is built on top of HDFS & Hadoop Pipes to achieve fantastic runtime performance. We'll talk about its use of LLVM-based JIT compilation so everything runs as native C++ code, with no Java and no runtime interpreter. Akamai has open-sourced Trecul and it is available on GitHub.
We’re on the verge of a sea change of connectivity, as we instrument the world around us, a movement known as the Internet of Things. In this session, Rob Coneybeer looks at the many factors behind this transformation, and how they’re creating a wide range of new products and opportunities for business and technology.
As Apache HBase matures, the community has augmented it with new features that are considered hard requirements for many enterprises. We will discuss how the upcoming HBase 0.96 release addresses many of these shortcomings by introducing new features that will help the administrator minimize downtime, monitor performance, control access to the system, and geo-replicate data across data centers.
Jonathan Alexander, VP Engineering at Vocalocity and the author of Codermetrics (O’Reilly 2011) and Moneyball for Software Engineering (O’Reilly Radar 2011/2012) presents new ideas on how to gather data and use analytics to create more effective software development teams.
HBase is one of the more popular open source NoSQL databases that have cropped up over the last few years. Building applications that use HBase effectively is challenging. This tutorial is geared towards teaching the basics of building applications using HBase and covers concepts that a developer should know while using HBase as a backend store for their application.
Attendees will learn, through practical examples, how to build a collaborative environment that accelerates the value of big data, with the goal of “making data part of every conversation.”
As data scientists, we encounter large networks all the time. Recommendations, social ties, transactions, and other types of data are naturally represented as networks. To understand these networks, metrics help, but visualization is crucial. This talk will focus on tools, techniques, and frameworks to visualize networks cleanly, avoiding or at least minimizing “hairballs”.
If you’re going to transform your business by infusing it with data and analysis, you’re going to have to pay attention to what you use and how you use it. In this session, Micheline Casey provides an overview of data governance and data management principles that should be applied to big data projects.

