Below is the preliminary list of all confirmed sessions for Strata 2013. We are in the process of adding more content to the program, and will release the day-by-day schedule in the coming weeks.

As Hurricane Sandy prepared to pound NYC late last October, Palantir's Philanthropy Engineering team deployed with Direct Relief International and Team Rubicon - non-profit disaster relief organizations. Combining their expertise and data with Palantir's software led to some surprising outcomes that are continuing to transform data-driven disaster response. Here are our stories from the hurricane.
Oracle and the Frederick National Laboratory for Cancer Research used Hadoop, Hive, and R to analyze relationships between genomes and cancer subtypes. This project was selected by a panel of judges as the 2012 Government Big Data Solutions Award Winner (over eighty competing government Big Data projects were nominated).
In this Enterprise IT session, we will talk about big data's practical application in business and how certain architectures facilitate faster insights and lower TCO.
Access to Big Data in many countries – especially Argentina – is still a work in progress and somewhat politicized. Despite that, media outlets like the La Nacion newspaper are working with developers and data visualization experts to address the lack of transparency and accountability.
While many libraries are available today to help create interactive visualizations, they are generally not integrated with the data analysis tool chain. This talk will focus on how to combine agile data manipulations with web-based visualization libraries to create a more efficient workflow for data science.
Big data gives us a powerful new way to see patterns in information - but what can't we see? When does big data not tell us the whole story? This talk opens up the question of the biases we bring to big data, and how we might work beyond them.
This session is an overview of Apache Drill, a big data system inspired by Google's Dremel paper.
An introduction to Spark and Shark, two components of the open-source Berkeley Data Analytics Stack (BDAS) in development at UC Berkeley. Spark is a high-speed cluster computing system compatible with Hadoop that can outperform it by up to 100x. Shark is a port of Apache Hive onto Spark that is fully compatible with, and up to 100x faster than, Hive.
Crunch 40 years worth of daily global satellite data at the push of a button, perform spatial analyses on GBs of your own GIS data and securely share the results privately or publish to 1B Google Earth users. This talk will focus on how what was once the realm of a few is now easily and intuitively accessible from the comfort of your Chrome browser.
AMPLab’s open source data analysis projects, Spark and Shark, deliver iterative queries up to 100x faster than Hadoop MapReduce. Hear how companies are using Spark-based data platforms for fast, interactive analysis on big data.
IBM and the University of Oxford partnered in late 2012 to explore how organizations have begun leveraging big data to create competitive advantage in the marketplace. The joint study, based on a survey of more than 1100 business and IT executives, combines executive interviews and case studies to establish benchmarks that define the big data era ahead.
BigDataCamp is a free unconference for users of Hadoop and related Big Data technologies to exchange ideas in a loosely distributed format.
For CIOs, IT executives, and technology professionals, Strata's Enterprise Big Data day lays out the roadmap to get your organization up to speed on big data. In this all-day event, hear how to create a big data strategy, understand the issues of managing data, and learn how data science can be used powerfully in your organization.
Today's smartphones have evolved into incredibly rich sensing and computing devices that can be used to infer complex and interesting things about us, our environment, and our communities. This talk will give an overview of user-centric, continuous mobile sensing, and our work, originating at the MIT Media Lab, to develop open tools to democratize this capability.
The promise of big data is to enable business transformation through new and powerful insights. Such transformations are mandated by the executive team, but how does that work with IT? What does big data success look like, and how do enterprises get there? Discover why big data is a business-owned problem, and how the relationship between IT and business must change.
At Strata 2012 in New York, we discussed the hazards of curbing big data inferences by defining a new category of thoughtcrime. After all, acting on thoughts might constitute a crime, but thoughts, in isolation, cannot be criminal. It's time to go deeper. Let's create and evaluate a predictive criminal model that highlights where the sensitivities lie, both technically and ethically.
This hands-on tutorial will give you on an overview of how AWS can quickly and easily enable you to start generating insights from your company’s data.
Data science for consumer internet products relies on our ability to effectively analyze and understand ubiquitous computing in terms of a holistic product experience, as individuals consume and create data on mobile and desktop devices in their day-to-day lives. I'll talk about mobile data science challenges — from product development to data-driven decision making.
Come learn about ACG, Analytical Compute Grid, a solution Rackspace built leveraging OpenStack, Big Data and NoSQL to help end users manage complex information and data.
Come to the Data Science Meetup to get a great head start on your Strata networking and educational experience.
Apache Hadoop is an innovative emerging technology causing CIOs to rethink their data architecture - making this an exciting time to be a “big data” technologist. This tag-team presentation brings leaders in both Apache Hadoop and data warehousing to the stage to share their vision for the future of big data management and analytics.
2012 was particularly interesting for the variety of Big Data use-cases implemented. This session explores key patterns across horizontal and vertical use cases.
As big data makes inroads into all aspects of society, how governments regard the technology will be critical for its success. If the past is a guide, the state will embrace big data for its own uses (both good and ill). It will recognize that its authority is threatened and lash out.
In this keynote, we will explore some of the challenges of big data operating in a truly global context.
Factual believes that some data problems are bigger than any one company. This talk describes how Factual combines both machines and other (human) data communities to their best effect, within the context of similar data-centric, community-driven applications.
Quench your thirst with vendor-hosted libations and snacks while you check out all the cool stuff in the Expo Hall.
In this talk, we present the broad data challenge and discuss potential starting points for solutions. We illustrate these approaches using data from a "meta-catalog" of over 1,000,000 open datasets that have been collected from about two hundred governments from around the world.
This talk dives into the details of building recommendation platforms. It covers the end-to-end architecture and design of such a system, the various ML algorithms to be used, and solutions to commonly seen recommendation patterns, along with detailed use cases.
The Infrastructure team at Stumbleupon leverages state-of-the-art tools and technologies to build platforms that enable us to collect, categorize, organize, store, and analyze huge volumes of data. The platform is fast and robust, adding minimal latency to the site. Timely collection and analysis of data helps data scientists, analysts, and executives make the best decisions and validate them.
In this session we’ll first discuss our experience extending Hadoop development to new platforms & languages and then discuss our experiments and experiences building supporting developer tools and plugins for those platforms.
In this talk, Susan Etlinger will discuss how organizations are addressing the challenges of social data--technological, organizational and cultural--and what it can teach us on the road to big data.
As anyone who's seen Minority Report can attest, collecting data about people as they move in public places can be kind of creepy, whether it's for security, advertising, or any other reason. Meghan Athavale has made it her mission to find ways to collect user engagement and behavior data to create playful environments, eliminating the 'creepy' factor.
Many companies have figured out how to generate incremental value through the use of recommendation engines. As such, the underlying algorithms are considered a valuable asset. But what happens when a company’s entire business model rests on its ability to get relevant products in front of the customer? When this happens you see a massive commitment to algorithms, data, and data scientists.
Communicating Data Clearly describes how to draw clear, concise, accurate graphs that are easier to understand than many of the graphs one sees today. The tutorial emphasizes how to avoid common mistakes that produce confusing or even misleading graphs. Graphs for one, two, three, and many variables are covered as well as general principles for creating effective graphs.
Big Data is about more than petabytes; it is also about new paradigms, languages, and tools. This talk will cover work going on in Hadoop projects to coordinate sharing of data and user code between tools.
At Strata RX, we announced the release of DocGraph, the largest open named social graph data set that we know of. This data set included links between doctors who commonly team together in the Medicare dataset. Since then, we have added tremendous depth to the data by crowdfunding the acquisition of doctor credentialing data. Come learn how healthcare works under the covers.
An introduction to D3, one of the most powerful Javascript data visualization libraries.
In a connected, social world, honing the effectiveness of campaign efforts with data is critical. In this Q&A discussion, we'll look behind the scenes to see how data science is changing the face of campaigning, where the opportunities are, and what resistance and constraints candidates face.
A casual get-together for conference-goers after a busy day at Strata.
For business strategists, marketers, product managers, and entrepreneurs, Data Driven Business looks at how to use data to make better business decisions faster. Packed with case studies, panels, and eye-opening presentations, this fast-paced day focuses on how to solve today's thorniest business problems with Big Data. It's the missing MBA for a data-driven, always-on business world.
Everyone is looking for ways to define data as an asset that can be monetized. But data itself will never move the needle for the Fortune 1000. Data is a means to an end. The end is not just insight, or knowledge, or brief moments of wisdom (when marveling at gorgeous data visualizations). The end we seek is wise action.
A turning point in psychotherapy is seeing that your problems aren’t unique. Once you see that your problems are the same as others, you cross a threshold into the realization that they can be solved. Big Data is at this threshold. We all still see our data problems as unique. We must cross this threshold and see that we are not alone; that our problems can be solved with known solutions.
Data Science has created quite the movement in the data world, yet confusion between data science and analytics still remains across the enterprise. Rather than approaching the subject by talking about semantic differences between the two, we will discuss the topics as they relate to solving problems, how businesses are approaching them, and what you can start doing with data science.
Sensors are the future of distributed data. General-purpose computing is dissipating out into the environment and becoming increasingly invisible and embedded into our lives. We will soon begin to move in a sea of data, our movements tracked and our environments measured and adjusted to our preferences, without need for direct intervention.
Managing data in Hadoop gets complex quickly - *Loom* is the data set management system for Hadoop that makes it easy. *Loom* provides tools to track the lineage and provenance of all registered HDFS data, and *Activescan*, which dynamically collects all of the critical information about data sets.
This session explores applications of Shneiderman’s mantra for visual data analysis (overview first, zoom and filter, then details-on-demand) as a framework in the context of three complex analytical applications at Wells Fargo: (1) Analytics process, (2) Interactive meeting facilitation and (3) Dashboard design.
Be sure to visit our Data Visualization Lounge, featuring works selected for their innovation, creativity and beauty.
Learn how to wrangle data in R: from acquiring and cleaning data, to changing data formats and performing targeted, groupwise calculations. This course will emphasize the 'reshape2' and 'plyr' packages.
In this talk I will discuss the realities of human productivity bottlenecks in data analysis, and give an overview of research and product directions for addressing this critical bottleneck in a substantive way.
How Airbnb was able to quickly spin big data into a meaningful response to Super Storm Sandy.
Machine learning and AI have appeared on the front page of the New York Times three times in recent memory. In this talk, Kaggle's president and chief scientist will explain exactly what occurred, why it was front-page newsworthy for the New York Times, how it will impact business, and what you need to know to make these new algorithms work for you.
How software can transform human lives by bringing intelligence to wherever big data lives.
The emergence of Apache Hadoop over the past few years has required organizations to completely rethink architectures that have been in place for decades. And with changes in the underlying data fabric come ripple effects, and often bottlenecks, that impact all levels of an organization, both business and technical.
While audience analysis is an old topic, it is being reimagined: personas defined along topic distributions rather than the usual demographic terms. This provides deeper insight into online communities and into how the internet is consumed.
Electronic discovery has transformed the way cases are litigated. Gone are the days of manual review, where litigators spent days poring over emails, messages, and documents. Today's e-discovery technologies mine through vast troves of information, looking for the needle in the proverbial haystack that will blow a case wide open.
Whether the user is a business user or an IT user, with today's data complexity, there are a number of design principles that are key to achieving success. Hear how to approach product designing for today's data challenges and meet new user expectations for fast and timely insights at scale.
Every hour since noon on June 21, 1999 Stephen Cartwright has recorded the exact latitude, longitude and elevation of his position on the earth with a handheld GPS. With this and other data, Cartwright creates multi-dimensional maps and objects. His work offers a unique sculptural perspective of one person’s transit through life.
In today's world, decisions are made for us based on data. On one hand, this is appealing, but on the other hand it's disorienting. To address this, designers need to focus on the things that make us uniquely human and on the translation between the abstract and the human. This presentation will look at the ways humans make decisions and how big data and technology can enable this, not lead it.
Attend this session to hear how NetApp was able to solve their big data problem. Since designing and implementing the solution, NetApp has gathered a number of takeaways and best practices for converting theory into practice and completing an enterprise-level implementation.
This talk will discuss how Druid allows users to have interactive queries on real-time data at scale; we feature a case study with Netflix leveraging Druid to obtain at-the-moment insight as it ingests over two terabytes per hour.
You're standing in a dimly lit room inside the temple. The temperature is overall cool, and there is a cold draft blowing from the north east. What does your party do? Embrace your inner storyteller and become comfortable with thinking on your feet and working with others.
As enterprises deploy Hadoop, it’s not the volume or velocity of data that is problematic, but the variety of types and formats of their critical data. This session discusses how leading companies have integrated Hadoop, NoSQL (HBase) and enterprise sources on one platform. Data is combined and processed in one simplified architecture. Case studies and reference architectures will be reviewed.
Grab a drink, mingle with fellow Strata participants, and see the latest technologies and products from leading companies in the data space.
Many of the services that are critical to Google’s ad business have historically been backed by MySQL. We have recently migrated several of these services to F1, a new RDBMS developed at Google. F1 implements rich relational database features, including a strictly enforced schema, a powerful parallel SQL query engine, general transactions, change tracking and notification, and indexing.
Visualization is a powerful way to understand data, but today building the right data set and accompanying data visualization requires sophisticated programming skills. We discuss an approach to a unified language describing both visualization and database queries. This approach could be used by both programmers and business users, accelerating data exploration and speeding time to insight.
Most stable systems rely on feedback - from central heating to industrial plants and biological organisms. This introductory talk will explain what feedback is, why it is relevant to enterprise software development, and how to apply it to some typical problems arising in business and technical situations.
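The idea of feedback described above can be made concrete in a few lines. As a minimal, hypothetical sketch (not material from the session itself), here is a proportional controller nudging a measured value toward a target:

```python
def proportional_controller(setpoint, measurement, gain):
    """Classic proportional feedback: the correction scales with the error."""
    return gain * (setpoint - measurement)

# Simulate a system converging to its target under repeated feedback.
level = 0.0
for _ in range(20):
    level += proportional_controller(setpoint=10.0, measurement=level, gain=0.5)

print(round(level, 3))  # converges toward 10.0
```

The same shape of loop — measure, compare to a goal, correct — is what makes thermostats, industrial plants, and iterative software processes stable.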
Learn how HP has established itself as the premier Big Data vendor with a solid portfolio of turnkey solutions that can be deployed faster than ever, while keeping acquisition and operational costs down.
The MITRE Corporation supports the FAA in advancing the safety, security, and efficiency of civil aviation. Our research requires analysis of a diverse set of surveillance, weather, terrain, and infrastructure data. This talk will describe how we use Hadoop to fuse and analyze these data sets, based on statistical inference of textual, temporal, and geospatial features.
This talk discusses the broad design considerations necessary for effective visualizations. Attendees will learn what's required for a visualization to be successful, gain insight for critically evaluating visualizations they encounter, and come away with new ways to think about the visualization design process.
As an ecommerce site with more than 800,000 different sellers, Etsy is particularly interested in understanding how shoppers find the items they seek. This talk will discuss the challenges of funnel analysis at Etsy, the corresponding deficiencies of several widely used web analytics tools, and our event sequence matching tool implemented in Hadoop.
How must big companies evolve in order to realize big value from big data? Investing in data, technology and data scientists is just a first step.
When data volume and velocity become massive, processing and analysis solutions require specialized technologies for different parts of the data pipeline. Google’s Cloud Platform is designed to help you focus on building applications, not infrastructure. We’ll demonstrate how to build end to end Big Data applications - from data collection, to analysis, to reporting and visualization.
Hadoop and SAP HANA are taking the world by storm. SAP HANA is the fastest growing commercial database in the market, being adopted by the world’s top enterprises for real-time analytics and applications.
The Great Debate series returns to Strata. In this Oxford-style debate, two teams take opposing positions. We poll the audience, and the teams try to sway opinions. It'll be a fast-paced, sometimes irreverent look at some of the core challenges of putting data to work.
This hands-on tutorial teaches you how to use Hive, a high-level, data warehouse tool for Hadoop. Hive provides a SQL-like query language, HiveQL, that is easy to learn for people with prior SQL experience, making Hive attractive for data warehousing teams. Hive leverages the power of Hadoop for working with massive data sets without requiring expertise in MapReduce programming.
The excitement about Big Data stems from the results: the impact on revenue, the decrease in costs, the Big gains in competitive advantage that result from Hadoop and HBase applications. This keynote provides insights into how the combination of scale, efficiency and analytic flexibility creates the power to expand the applications for Hadoop to transform companies as well as entire industries.
Hadoop is the engine powering the Big Data era, an unstoppable force boasting massive investments and a rich ecosystem. But this is only the beginning: Hadoop has the potential to reach beyond Big Data and become the Foundation for Change, catalyzing new levels of business productivity and transformation.
Building on our previous tutorial introducing BDAS, the open-source Berkeley Data Analytics Stack, in this tutorial we will provide each audience member with a Spark/Shark cluster on EC2 and walk through hands-on coding examples. Lessons will cover the Spark and Shark command line interfaces, writing a standalone program, and data clustering using a distributed machine learning algorithm on Spark.
In this talk, we describe using Redis, an open source, in-memory key-value store, to capture large volumes of data from numerous remote sources while also allowing real-time monitoring and analytics. With this approach, we were able to capture a high volume of continuous data from numerous remote environmental sensors while consistently querying our database for real time monitoring and analytics.
Join Flip Kromer, co-founder and CTO of Infochimps, as he walks you through a series of decision trees, making you rethink your use of Hadoop in the cloud and opening up possibilities for new patterns of work that are uniquely developer-friendly. Patterns of work like tuning your cluster to the job, and why the first priority of any analytics cluster should be downtime.
This session will demonstrate to attendees how easy it is to crowdsource identity theft to commit fraud and make money. We will look at which segments of the population are easy targets for large scale identity fraud. Attendees will be given methodologies to combat this type of fraud leveraging Big Data and various technologies.
It's great being a data scientist -- some are even calling it the sexiest job of the 21st century. What's not so great is trying to hire data scientists when the demand for them far outstrips the supply. Which makes it imperative that you nail the hiring process. In this session, I'll share hiring tips, and specifically what we've learned at LinkedIn about how to interview data scientists.
Discussion of how big data is impacting modern business, which market trends are driving the adoption of big data solutions, and how big data professionals can choose the right technology to transform their business.
Designing for human fault-tolerance leads to important conclusions on the fundamental ways data systems should be architected.
Ignite is back at Strata, with a focus on how data is collected and interpreted to understand and shape the world around us--from the quirky to the sublime to the downright creepy.
The Cloudera Impala project is for the first time making scalable parallel database technology, which is the underpinning of Google's Dremel as well as that of commercial analytic DBMSs, available to the Hadoop community.
In this session we'll explore a Big Data architecture that combines the core components needed in most Big Data implementations.
In this talk, we'll examine compelling, real-world examples that offer a blueprint for integrating big data technologies, delivering rapid visibility and insights to IT professionals, data analysts and business users, and that accelerate the adoption of big data in the enterprise.
Julia is a new mathematical programming language that is scalable, high-performance, and open source. Julia is fast, approaching and often matching the performance of C/C++, easy to learn, and designed for distributed computation. This session will demonstrate some of the special capabilities of Julia and give you the tools you need to get started using this exciting technical computing language.
This tutorial provides a solid foundation for those seeking to understand large scale data processing with MapReduce and Hadoop, plus its associated ecosystem. This session is intended for those who are new to Hadoop and are seeking to understand where Hadoop is appropriate and how it fits with existing systems. No programming experience is required.
Everyone wants to predict the future; fame and fortune follow those who succeed. I cover the basics of forecasting including tips, tricks, and best practices, and how forecasting differs from prediction analysis. I walk through simple examples using R and link to several resources to put you on the path to becoming the next Nostradamus.
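The session's examples use R; as a hypothetical illustration of one forecasting basic in Python instead, simple exponential smoothing blends each new observation with the running forecast:

```python
def exponential_smoothing(series, alpha):
    """Simple exponential smoothing: each forecast is a weighted blend of
    the latest observation (weight alpha) and the previous forecast."""
    forecast = series[0]
    forecasts = [forecast]
    for value in series[1:]:
        forecast = alpha * value + (1 - alpha) * forecast
        forecasts.append(forecast)
    return forecasts

demand = [100, 102, 101, 105, 110]
print(exponential_smoothing(demand, alpha=0.5))
# [100, 101.0, 101.0, 103.0, 106.5]
```

A higher `alpha` tracks recent changes more aggressively; a lower one smooths out noise.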
As more industries adopt data-driven policies, people untrained in the formal analysis of data find themselves staring at a spreadsheet and asking what they did to deserve it. In this tutorial, two of Kaggle’s top data scientists will walk attendees through the basics of solving an analytics challenge, from defining the problem, to performing basic analysis, to visualizing the output.
Data science efforts can be derailed for many reasons. We highlight common pitfalls in planning & executing data science: the optimal organizational mindsets, the technical considerations, and what constitutes the diverse skills of practitioners. This talk is based on the upcoming Bad Data Handbook as well as a survey and analysis of a few hundred Data Science practitioners from around the world.
The majority of the world's data is now unstructured, non-English text. How can we extract useful information from it? Many of our assumptions about English do not carry over to other languages. This talk will give a high-level overview of how languages vary, what current language technologies can (and cannot) achieve, and how we can process and visualize this information at scale.
This panel will share insights on how K-16 education can benefit from developments in Big Data ecosystems.
In today’s data-driven age, healthcare is transitioning from opinion-based decisions to informed decisions based on data and analytics. Analyzing the data reveals trends and knowledge that may run contrary to our assumptions, causing a shift in ultimate decisions that will in turn better serve both patients and healthcare enterprises.
ParAccel runs analytic queries 100x faster than Hive with much deeper SQL Support. Hear how companies are using analytic platforms for fast, interactive analysis on big data.
Learn how LinkedIn endorsements used data mining techniques to develop a viral social tagging and reputation system.
Location Intelligence (LI) transforms how public health and agriculture initiatives are managed and monitored by translating big complex data from multiple sources and varying temporal and spatial scales into local, actionable insight. This empowers national governments and global development organizations to focus on saving lives and building healthy, sustainable communities.
The world of mapping is undergoing another revolution. New techniques for visualizing and querying increasingly large amounts of data can lead to new ways of interacting with and discovering meaning in your data. In this session, we'll talk about the latest in vector mapping and how you can use it to explore the hidden stories in your data.
The majority of data we consume today are presented in lists, one-dimensional orderings that limit the user's ability to understand context or perform strategic analyses. For unstructured data, we need to re-imagine what types of visualizations enable exploration in the way that geographic maps can.
This talk will discuss Rest Devices' proprietary low-cost sensor technology, its use of and vision for big biometric data, and the need for design integration in all facets of product development, be it software or hardware.
Code for America fellows have been tackling not only the promise of data in America’s cities, but the reality of the challenges, for the past two years. In February 2013, six new fellows will be working on our hardest problem yet: using data to unclog the criminal justice system in Louisville and New York City. If the public sector can innovate using data, the results benefit us all.
Hear from MailChimp’s Chief Scientist John Foreman as he dishes on dirty data and demonstrates the latest in MailChimp’s anti-abuse artificial intelligence. MailChimp sends 3 billion emails a month for their millions of users, and they can't afford to let a drop of spam go out. Learn how the company is using cutting edge NoSQL solutions and predictive models to leave the bad guys out in the cold.
Which friends make you healthier? Which specialists do your doctors trust most? What treatments work for others like you? We are about to enter a new phase of healthcare where the answers to these questions become commonplace. Healthcare powered by personalized recommendations and social network analysis. It won't be Instagram for X-rays, but it will change the way you experience healthcare.
Rachel Schutt, Senior Research Scientist at Johnson Research Labs, will discuss her Columbia Data Science course: her motivations for teaching it, how she designed the curriculum, how the NYC tech community was involved, and what impact, if any, she had on her students. She thought about the course as testing the hypothesis: It is possible to incubate awesome data science teams in the classroom.
Can open data help you land a date, and show her a great time? It sure can, and I will show you how.
With more data come more problems. Did you know Excel dates begin on January 1, 1900? Unless you're using the OS X version, then dates begin on January 1, 1904. Or Unix time, which begins January 1, 1970. These pervasive, easily-overlooked gremlins are the bane of any data scientist and in this session I will explore a variety of these little nuisances.
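For illustration (a sketch, not part of the session's material), the gap between the two Excel epochs is easy to demonstrate; the 1900-system epoch is conventionally anchored at 1899-12-30 to absorb Excel's well-known 1900 leap-year bug:

```python
from datetime import datetime, timedelta

# Common "day zero" conventions for spreadsheet and Unix dates
EXCEL_1900_EPOCH = datetime(1899, 12, 30)  # Excel's 1900 date system
EXCEL_1904_EPOCH = datetime(1904, 1, 1)    # Excel for Mac's 1904 date system
UNIX_EPOCH = datetime(1970, 1, 1)

def excel_serial_to_datetime(serial, epoch=EXCEL_1900_EPOCH):
    """Interpret an Excel serial day number against a given epoch."""
    return epoch + timedelta(days=serial)

# The same serial number means different dates in each system,
# exactly 1,462 days (four years and two days) apart:
d1900 = excel_serial_to_datetime(40000)
d1904 = excel_serial_to_datetime(40000, EXCEL_1904_EPOCH)
print((d1904 - d1900).days)  # 1462
```

Knowing which epoch produced a file is the difference between a correct analysis and one silently shifted by four years.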
This talk introduces an open-source distributed file system that will double the capacity of your Hadoop cluster and speed up your MapReduce jobs. The talk will describe the Reed-Solomon implementation and its implications for cluster performance, how it leverages the speed of modern networks to achieve better storage efficiency and make Hadoop jobs run faster.
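Reed-Solomon itself is beyond a short snippet, but the erasure-coding idea behind the storage savings can be sketched with its simplest cousin, single-parity XOR — an illustrative stand-in, not the talk's actual implementation:

```python
def xor_blocks(blocks):
    """XOR equal-length byte blocks together to form a parity block."""
    parity = bytearray(len(blocks[0]))
    for block in blocks:
        for i, byte in enumerate(block):
            parity[i] ^= byte
    return bytes(parity)

data = [b"ab", b"cd", b"ef"]
parity = xor_blocks(data)

# Lose block 1, then rebuild it from the survivors plus the parity block.
recovered = xor_blocks([data[0], data[2], parity])
print(recovered == data[1])  # True
```

One parity block protects any single lost block at a fraction of the cost of full replication; Reed-Solomon generalizes this to tolerate multiple simultaneous losses.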
Learn first-hand how advanced analytics are enabling modern enterprises to deal with big data challenges.
Prepare for the coming zombie apocalypse or subjugation by our vampire overlords by tracking the spread of these threats and understanding the characteristics of the populations already infected, using a combination of social media analytics and classic market research cluster analysis. Learn about new methods for unpacking consumer conversations and tracking true attitudinal consumer segments.
For food retailers, the fresh food category is important for customer satisfaction. Providing sufficient stock while avoiding food waste keeps customers happy and the retailer profitable. This case study shows how a fully automated, data-driven replenishment process is possible based on internal and external data sources combined with advanced predictive analytics.
This tutorial will be a hands-on introduction to the essential tools for working with structured data in Python: pandas and NumPy.
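As a taste of what such a session covers, a minimal pandas/NumPy workflow (the dataset below is made up for illustration):

```python
import numpy as np
import pandas as pd

# A small structured dataset: one row per session (invented numbers).
df = pd.DataFrame({
    "track": ["Hadoop", "Data Science", "Hadoop", "Design"],
    "attendees": [250, 310, 180, 95],
})

# Group-and-aggregate, the bread and butter of pandas:
per_track = df.groupby("track")["attendees"].sum()
print(per_track)

# pandas columns are NumPy arrays underneath, so vectorized math just works:
print(np.log(df["attendees"]).round(2))
```

The split-apply-combine pattern shown by `groupby` is the idiom most tabular analysis in pandas reduces to.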
Cloudera, the standard for Apache Hadoop in the enterprise, empowers data-driven enterprises to Ask Bigger Questions™ and get bigger answers from all their data at the speed of thought. Cloudera Enterprise, the platform for Big Data, enables organizations to easily derive business value from structured and unstructured data to achieve a significant competitive advantage.
With the growth in volume and velocity of data, businesses need a scalable solution alongside batch processing to process events on the fly and provide real time insights. In this session, we will describe how we used Storm to analyze network data to detect causes of network performance degradation.
Hadoop is great for analyzing data at rest. But what if your business problem requires the ability to analyze and respond in real-time and without a human in the loop?
How do you deploy real-time predictive models to production environments? This talk describes a five-stage process which begins with data distillation and ends with real-time in-database model scoring. We'll discuss the technologies used at each stage, and share some best practices for development and implementation of real-time models.
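To make the final stage concrete, here is a hypothetical in-application scoring function: a logistic model whose coefficients were fit offline and exported for real-time use. The feature names and weights are invented for illustration; in production they would come from a model store, not be hard-coded:

```python
import math

# Hypothetical coefficients exported from the offline modeling stage.
COEF = {"intercept": -2.0, "clicks": 0.8, "dwell_seconds": 0.01}

def score(event):
    """Real-time scoring in miniature: apply a pre-fit logistic model
    to one incoming event and return a probability."""
    z = COEF["intercept"] + sum(
        COEF[name] * event.get(name, 0.0)
        for name in ("clicks", "dwell_seconds")
    )
    return 1.0 / (1.0 + math.exp(-z))

print(score({"clicks": 3, "dwell_seconds": 45}))  # a probability in (0, 1)
```

The point is the separation of concerns: the expensive fitting happens offline, while the online path is a few arithmetic operations per event.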
Learn how LivePerson and Zoomdata perform stream processing and visualization on mobile devices of structured site traffic and unstructured chat data in real-time for business decision making. Technologies include Kafka, Storm, and d3.js for visualization on mobile devices. Byron Ellis, Data Scientist for LivePerson, will join Justin Langseth of Zoomdata to discuss and demonstrate the solution.
Given a machine learning (ML) problem, which method(s) should you use, and how does big data affect your choices? I will discuss some principles derived from decades of theory and practice, illustrated through real-world ML success stories in medicine, marketing, financial services, and astronomy.
Peeko is a onesie-enabled baby monitor system that allows parents to see their baby's breathing, body position, skin temperature, activity level, and audio in realtime on their smartphones, from anywhere in the world.
To kick off the Big Data for Enterprise IT Day, we present two views of big data. Is it truly something new, or just an evolution of what we have already? Join us for an interesting and entertaining talk that will help frame your thinking on big data.
Given the exponential rise in data, attorneys have an obligation to meet today’s Governance, Risk and Compliance (GRC) challenges and stay on top of technology in order to achieve broader institutional benefits. Join Digital Reasoning and the Clutch Group to learn how moving from document-centric to entity-centric analytics is key in gaining valuable knowledge from unstructured information.
When a data scientist crosses over to the dark side, look out. High-quality spam, large-scale CAPTCHA-breaking, impolite spiders, oh my! This talk will explore attack vectors that can be exploited by black-hat data scientists. We'll also discuss countermeasures and defenses that are available to the good guys, and assess their effectiveness.
In this hands-on tutorial, you will learn the importance of distributed search, drawing on our industry experience and knowledge of real use cases. We’ll introduce different architectures that incorporate distributed search techniques, share pain points experienced and lessons learned. For the hands-on part of the tutorial, you will learn how to install and use Apache Solr for real-time search on big data.
Kids and adults alike are fascinated by fire trucks. And yet little is known about the structure of fire truck society. In this talk I'll show how I used web scraping and social network analysis to develop an ethnography of the trucks of the Seattle Fire Department, and I'll share some of my more interesting findings.
There are often privacy and confidentiality concerns with putting sensitive personal information about employees or customers on the cloud. Secure computation methods allow the release of encrypted data while still permitting complex analytics to be performed on that encrypted data. This presentation will describe how secure analytics work and give examples of their application in the healthcare context.
In many modern web and big data applications the data arrives in a streaming fashion and needs to be processed on the fly. Due to the size of data, the computations need to be done incrementally, and hence sketches of data are used that take a small amount of memory but allow for fast updates and queries. We will present the techniques to design these sketches and provide clarifying examples.
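A count-min sketch is one classic example of such a structure: a small, fixed-size table of counters that supports fast updates and approximate frequency queries over a stream. A toy implementation (the hashing scheme is simplified for illustration):

```python
import random

class CountMinSketch:
    """Tiny count-min sketch: a depth x width table of counters that
    approximates item counts in bounded memory. Estimates never
    undercount; overcounts shrink as width grows."""

    def __init__(self, width=64, depth=4, seed=0):
        rng = random.Random(seed)
        self.width = width
        self.seeds = [rng.randrange(1 << 30) for _ in range(depth)]
        self.table = [[0] * width for _ in range(depth)]

    def _buckets(self, item):
        # One hashed bucket per row (toy hashing via Python's hash()).
        for row, s in enumerate(self.seeds):
            yield row, hash((s, item)) % self.width

    def add(self, item, count=1):
        for row, col in self._buckets(item):
            self.table[row][col] += count

    def estimate(self, item):
        # Take the minimum across rows to reduce collision overcounting.
        return min(self.table[row][col] for row, col in self._buckets(item))

cms = CountMinSketch()
for word in ["hadoop"] * 100 + ["storm"] * 5:
    cms.add(word)
print(cms.estimate("hadoop"))  # at least 100; collisions can only inflate it
```

The whole stream is summarized in `width * depth` integers, regardless of how many distinct items flow past.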
I will discuss how a wearable sensing platform, the Sociometric Badge, allows us to measure and analyze human behavior in the real-world, particularly in the workplace. We’ll discuss how we use the badges to recognize concepts such as persuasiveness and social support and how we have used the badges in real companies to drive organizational change and put hard numbers behind management methods.
Meet the Philanthropy Engineering team at Palantir Technologies. We work with non-profits that have good data about important problems, and donate both our software and our expertise to make a real difference in the world.
The Splice SQL Engine delivers real-time transaction updates with integrity and ACID compliance while offering standard SQL support so applications don’t have to be rewritten.
This talk is about the emergence of a new class of analytic databases based on principles first popularized by Google Dremel. These systems have been designed with the goal of enabling real-time SQL on Hadoop, while also supporting schema-on-read, semi-structured data, and pluggable storage engines. In this talk we will explain the novel architectural features that make these goals a reality.
Don't miss Startup Showcase, Strata's live demo program and competition for startups and early-stage companies. The judges will pick winners from 10 finalist companies selected to present at the showcase.
Privacy laws governing a company’s obligations on data collection, use, and disclosure are changing rapidly. Failing to understand how the laws affect a company’s personal data assets can result in media exposés, regulatory investigations, Congressional hearings and lawsuits. This session will provide guidance on “privacy by design” compliance and practical tips to avoid becoming a target of scrutiny.
Opposites attract, and that’s the case with Hadoop and analytic databases. Both have a role to play in your Big Data projects. This session explores the various approaches to cementing the bond between Hadoop and your analytic database, how SAP customers are integrating Hadoop into BI and advanced analytic environments, and why you’ll want to do that too.
You can't manage what you don't measure, and that matters for big data initiatives too. This session will help you answer questions including: How far can big data take us from a business perspective? How do we compare our organization to others with respect to big data as a business enabler? How far can we push big data to transform our value creation processes?
We will describe the BigData Top100 List initiative, a new, open, community-based effort for benchmarking big data systems.
MapReduce, Hadoop, and other “NoSQL” big data approaches opened opportunities for data scientists in every industry to develop new data-intensive applications. But what about the more traditional SQL users or analysts? How can they unlock insights through standard business intelligence (BI) tools or ANSI SQL access?
For centuries, business has been about scale. Business students are taught that economies of scale are the only long-term sustainable advantage, because with scale you can control markets, set prices, own channels, influence regulators, and so on. Thanks to software and big data, however, scale’s importance is waning.
While the industry has been busy abandoning the relational database and calling it a fundamentally limited technology, several trends are conspiring to revive the good old RDBMS. While it might not resemble the MySQL or Oracle database you are running today, this talk will explore how hardware trends, software trends, and industry research point to SQL, structure, and ACID at scale.
Enterprises are moving forward with the vision of creating a central repository of all enterprise data stored inexpensively and processed efficiently in Hadoop. Only a fraction have yet been successful. This session will explore the pitfalls of implementing the Hadoop Data Reservoir and the requirements that lead to success.
A quick look into the 20 years of work that went into Gangnam Style's overnight success story.
In this talk, I will introduce the IPython Notebook, an open-source, web-based interactive computing environment for Python and other languages. By enabling the data scientist to build documents that combine code, text, formulas, visualizations, images and video the Notebook creates a foundation for data science that is interactive, repeatable, documented and sharable.
Just the basics: you've probably heard about data mining and think you need a PhD to do it. Clever stuff with numbers. Predictions. Clusters. Algorithms. The 9 Laws explain the "why" behind the basic steps you can take to succeed as a data miner, and show that this is primarily a business discipline, not a branch of computer science.
This talk discusses the market needs that are giving birth to the "scientific database", what these systems have to offer that is currently lacking in either the data management or statistical worlds, and how scientific databases will co-exist and co-evolve with Hadoop and other leading big data platforms.
Data science can power incredible innovation, but the most important insights typically aren't known ahead of time. This makes it challenging to manage schedules, expectations, and goals. At Decide, data science is core to our product. This talk will share lessons learned from both sides, and provide the audience with strategies to improve process and communication in their own teams.
The Victory Lab presents a secret history of modern American politics, pulling back the curtain on the tactics and strategies used by some of the era's most important figures, including Barack Obama and Mitt Romney, with iconoclastic insights into human decision-making, marketing and how analytics can put any business on the road to victory.
Big data tools made it possible to gain extremely valuable insight from large scale analysis of web data, but until recently few people had access to the data. Now tools like Grep the Web and increased raw access to web data grant anyone the power to do such analysis. This presentation addresses practical applications of web data analysis that you can incorporate into your research or products.
This talk examines the notion of a "workflow" as a general abstraction for common use cases encountered in Data Science, particularly for building Enterprise apps. Patterns of workflows provide recipes for integrating different frameworks, plus the means for optimizing large-scale apps. We review this approach in the context of a sample app based on the Cascading open source project.
This hands-on session will show how a dataset turns into a story, the narrative process the Guardian's team goes through, the tools used and the lessons learned.
The key takeaway from this session will be an understanding of the third generation of tools for realizing machine learning algorithms; examples of these tools include Twister, HaLoop, and GraphLab. Attendees will also understand why second generation tools such as Mahout have not implemented some of the machine learning algorithms for big data. The session will also cover real-life use cases.
Program Chairs, Edd Dumbill and Alistair Croll, welcome you to the second day of keynotes.
Birds of a Feather (BoF) sessions are informal roundtable discussions happening during lunch on Wed 2/27 and Thu 2/28. You can join any BoF table or start your own with a topic of your choice. The BoF sign-up board will be near the Registration area.
Office Hours are your chance to meet face-to-face with Strata Conference presenters. Drop in to discuss their sessions, ask questions, or make suggestions.
Microsoft partner Ascribe is using Microsoft’s Big Data solutions to turn emergencies into actionable data.
All is quiet on the log file front, and yet the system is down. What next? Three parts practical know-how (“here’s my toolbox”) and one part position paper (“must-haves for comprehensibility”), this talk will cover the tricks of the trade for debugging distributed systems. Motivated by experience gained diagnosing Hadoop, we’ll dig into the JVM, Linux esoterica, and outlier visualization.
Every month Birchbox delivers a box of samples to each of its subscribers. Boxes are targeted to subscribers based on their profile, history, and behavior. In this talk we discuss the mathematics behind allocating samples to customers (aka solving for happiness).
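The sample-allocation problem described above is a form of the classic assignment problem; at toy scale you can solve it by brute force. A minimal sketch (the subscriber names and happiness scores are invented):

```python
from itertools import permutations

# "Solving for happiness" in miniature: assign 3 samples to 3 subscribers
# to maximize total predicted happiness (all scores invented).
happiness = {
    ("ann", "lipstick"): 5, ("ann", "shampoo"): 2, ("ann", "lotion"): 3,
    ("bo",  "lipstick"): 1, ("bo",  "shampoo"): 4, ("bo",  "lotion"): 2,
    ("cy",  "lipstick"): 3, ("cy",  "shampoo"): 3, ("cy",  "lotion"): 5,
}
subs = ["ann", "bo", "cy"]
samples = ["lipstick", "shampoo", "lotion"]

# Try every one-to-one assignment and keep the happiest (fine for 3 items;
# real allocators use algorithms like Hungarian matching or LP relaxations).
best = max(
    permutations(samples),
    key=lambda p: sum(happiness[(s, item)] for s, item in zip(subs, p)),
)
print(dict(zip(subs, best)))
```

Brute force is factorial in the number of subscribers, which is exactly why allocating at Birchbox's scale requires real optimization machinery rather than enumeration.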
Billions of mobile phones worldwide leave vast volumes of geolocated data traces on the networks of operators. We present Smart Steps, a product created by Telefonica to provide insights to retailers on footfall volumes and trends across entire countries, turning these billions of data points into information that enables businesses to make decisions like where to open a shop or opening times.
Opower, the global leader in the field of energy information and analysis, works with 80 utility companies worldwide to give families context, insights, and advice about how to save energy. With access to an unprecedented (and still growing) amount of energy data—currently drawn from 50 million US homes—Opower is uncovering unique trends in how people are using energy at home.
More than ever before, students are using the Internet to study, leaving behind a trail of valuable data. How can we leverage this data to improve education?
The human eye can detect infinitesimal patterns in the world around us. Shouldn’t we make use of this amazing skill when recognizing patterns or detecting anomalies in big data? In this session we’ll explore why rendering every pixel is a challenge with big data and look at how these limitations can be overcome.
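One standard answer to the rendering limitation is to aggregate before drawing: bin the data so each pixel column carries one value, instead of shipping every point to the screen. A minimal NumPy sketch with synthetic data:

```python
import numpy as np

# 10 million points won't fit on a roughly 2-million-pixel display,
# so reduce the data to one aggregate per horizontal pixel first.
rng = np.random.default_rng(0)
x = rng.normal(size=10_000_000)

# 1920 bins: one count per pixel column on a full-HD screen.
counts, edges = np.histogram(x, bins=1920)
print(counts.sum())  # every point contributes to exactly one bin
```

The renderer then draws 1,920 bars rather than 10 million marks, and the eye still sees the shape of the distribution.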
Learn how Neustar has expanded their data warehouse capacity, increased agility for data analysis, reduced costs, and enabled new data products. Discuss challenges and opportunities in capturing hundreds of terabytes of compact binary network data, ad hoc analysis, integration with a scale-out relational database, more agile data development, and building new products integrating multiple big data sets.
HBase is one of the more popular open source NoSQL databases that have cropped up over the last few years. Building applications that use HBase effectively is challenging. This tutorial is geared towards teaching the basics of building applications using HBase and covers concepts that a developer should know while using HBase as a backend store for their application.
From markup languages like SVG to OpenGL based APIs like WebGL, the browser provides several ways for creating visualizations. In this talk we'll show some web based visualizations we worked on for different projects and for Twitter, and show what standards were used to create them. We'll dissect each example showing what was used not only for rendering but also for data handling and interaction.
SimpleSearch is the search engine for Hadoop, enabling companies to easily explore and analyze Big Data in real-time.
In this talk, EA CTO Rajat Taneja will dive into the challenges and complexities facing the gaming industry, discuss how to harness the power of data, and share examples of how technologies like machine learning and predictive analytics have been put in place to improve the customer experience.
Extending much of the hard work around big data, we'll focus on how to take all these powerful tools and empower organizations to drive actionable decisions and strategies from data. We'll share what we've found exploring how human psychology, collaborative dynamics, gamification, and design can be utilized to not only improve what we're doing now, but drive where we are going.
Strata Program Chairs, Edd Dumbill and Alistair Croll, welcome you to the first day of keynotes.
Classic data science problems involve finding stationary patterns in big datasets. However, in adversarial settings, enemies deliberately shift their approach to avoid detection. They can challenge learning systems by randomizing behavior, hiding tracks, lacing traffic and more. Successful application of machine learning requires new approaches to feature engineering, training and classification.
Real-world examples of utility companies around the world using Hadoop to optimize their services and changing Hadoop in the process.
From politicians to marketers, everyone tries to influence. Data analytics of traditional as well as social media data has made it easier to spot deliberate attempts to skew public opinion. The talk will give insights into new measurements by analyzing large events such as the London Olympics. Those measures will help to expose increasingly sophisticated attempts at fake influence.
If you're a woman looking for like-minded communities to join, c'mon down to our meetup on Monday evening. In addition to great networking, you'll hear lightning pitches from groups, companies, and projects seeking new participants.
Microsoft keynote, featuring Dave Campbell, Vice President of Product Development for the SQL Server product suite.
Dealing with the flood of data that confronts researchers is the fundamental challenge of 21st century research. Citizen Science has allowed researchers within the Zooniverse to take on research problems at a scale impossible without the attention of a large community of volunteers.