In a landmark partnership, IBM and Twitter are combining advances in analytics, cloud and cognitive computing in a manner that has the potential to transform how institutions understand customers, markets and trends. Adam Kocoloski, CTO of IBM Cloud Data Services and co-founder of Cloudant, will explain how when it comes to gaining insights from Big Data, the future is brighter than we know.
In the wake of the Open Data Platform initiative announced earlier this week, Roman Shaposhnik, Director of Open Source Strategy at Pivotal and a VP of the Apache Software Foundation Incubator, will talk about how a well-defined, fully validated ODP common core platform is going to address some of the biggest customer pain points around rapid evolution and standardization in the big data area.
Starting in Hive 0.14, insert values, update, and delete have been added to Hive SQL. In addition, ACID compliant transactions have been added so that users get a consistent view of data while reading and writing. This talk will cover the intended use cases, architecture, and performance of insert, update, and delete in Hive.
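Hive's ACID support is built on a merge-on-read design: writers append delta files tagged with transaction ids, and a reader at a given snapshot sees the compacted base plus only the deltas committed before its snapshot. The following is a minimal Python sketch of that idea; the function and variable names are illustrative, not Hive APIs.

```python
# Sketch of merge-on-read snapshot isolation (illustrative, not Hive code).

def read_snapshot(base, deltas, snapshot_txn):
    """Merge base rows with the delta operations visible at snapshot_txn."""
    rows = dict(base)  # start from the compacted base file
    for txn, op, key, value in sorted(deltas):
        if txn > snapshot_txn:
            continue  # this delta committed after our snapshot; skip it
        if op in ("insert", "update"):
            rows[key] = value
        elif op == "delete":
            rows.pop(key, None)
    return rows

base = {1: "alice", 2: "bob"}
deltas = [
    (10, "update", 1, "alice2"),
    (11, "delete", 2, None),
    (12, "insert", 3, "carol"),
]

# A reader at txn 11 sees the update and delete, but not txn 12's insert.
assert read_snapshot(base, deltas, snapshot_txn=11) == {1: "alice2"}
# A reader at txn 12 also sees the insert.
assert read_snapshot(base, deltas, snapshot_txn=12) == {1: "alice2", 3: "carol"}
```

Periodic compaction in the real system folds accumulated deltas back into a new base so read-time merging stays cheap.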
30% of restaurants fail in the first year, so why would anyone go into the business? Most restaurateurs will tell you that it’s an act of love. They love hospitality; they love sharing great food; they love creating a place where people come together to share something special. Almost none of them tell you that they go into business based on data.
Hadoop is emerging as the standard for big data processing & analytics. However, as usage of Hadoop clusters grows, so do the demands of managing and monitoring these systems. In this tutorial, attendees will get an overview of all phases of successfully managing Hadoop clusters, with an emphasis on production systems.
Are you looking for a deeper understanding of how to integrate components in the Apache Hadoop ecosystem to implement data management and processing solutions? Then this tutorial is for you. We'll provide a clickstream analytics example illustrating how to architect solutions with Apache Hadoop along with providing best practices and recommendations for using Hadoop and related tools.
In this talk we are addressing the following aspects of machine translation development at eBay:
- leveraging huge amounts of transactional and behavioral data for development and evaluation of our MT systems;
- adapting evaluation metrics to reflect the eBay buyer experience and measuring translation quality and impact on the shopping experience of our international users.
In far too many organizations, data scientists and designers work in silos, and quibble about who’s more important. This is a huge missed opportunity. At Intuit, we are reimagining how our data and design teams work together to fuel innovation and surpass Intuit’s business goals. I will walk through methods we are using to bridge these two wildly different groups and share stories of success.
The Netflix Data Platform is a constantly evolving, large-scale infrastructure running in the AWS cloud. We are especially focused on performance and ease of use, with initiatives including Presto integration, Spark, and our Big Data Portal and API. This talk will dive into the various technologies we use, the motivations behind our approach, and the business benefits we get.
In many organizations, Marketing may be the most impacted by the advent of big data with new data on prospects and customers. New channels, new data types and sources, and new technologies … how did Cisco bring these all together to see a different view of customers?
We are often told that the past holds lessons on how to approach the present, but we rarely look to older technologies for inspiration. Rarer still do we look at the historical experiences of less industrialized nations to teach us about the technological problems of today.
Big data stories reveal fundamental concepts about emerging technologies, their potential impact on society and decisions that drive successful projects. Using real world examples, this talk shows key insights that inform critical choices about new technologies, including time series database tools and scalable machine learning algorithms, used to address important business and research problems.
This session will cover approaches to building real-time pipelines with MemSQL, Hadoop, and Spark, including:
How Novus built the premier financial portfolio management platform using MemSQL as a real-time data store and query engine
Introduction to the MemSQL Spark connector
Strategies for integrating Spark and Hadoop with real-time systems for transaction processing and operational analytics
In the second (afternoon) half of the Architecture Day tutorial, attendees will apply the best
practices they learned in the morning session to build a data application for sessionizing user data.
The maturation of big data technologies has enabled numerous organizations to derive insights from vast quantities of data. The next set of challenges we face involve building applications that allow us to visualize, navigate, and interpret this data. Creating intuitive user interfaces is often a cumbersome process requiring complex data transformations, integrations, and queries.
The best insight you produce is only as good as your ability to explain it. As data scientists and engineers, our task is not only to execute robust analyses, but also to convince decision-makers to act on data. Through an example-driven approach, attendees will examine features of great graphics, techniques of effective visualization, and learn to use D3.js to create their own data narrative.
Keynote with Jeffrey Heer, Co-Founder, Trifacta
MemSQL CEO Eric Frenkiel will discuss the need for simplicity in enterprise data architecture, the convergence of transactions and analytics, and what is required to operationalize Spark and Hadoop in the enterprise.
Join Microsoft’s Joseph Sirosh for a surprising conversation about a farmer's dilemma, a professor's ingenuity and how cloud, data and devices came together to fundamentally re-imagine an age-old way of doing business.
Even the most data-driven organizations still incorporate “art” into their decision-making process. Values, culture, social norms, and biases influence decisions as much as the data. This isn’t always a bad thing—data can sometimes fail to tell the whole story. And, by combining data with the intellectual assets that reside in the heads of employees we can create new capabilities.
DJ Patil (White House Office of Science and Technology Policy)
Data Science, where are we going? What impact can we expect?
When your company stores some of the most sensitive customer data that exists, how do you build game changing big data innovations while maintaining customer trust and loyalty? Combine the two groups responsible for that vision--legal and data science--and unite them toward a common goal! We'll discuss how Intuit turned the typical data-legal model on its head to boost data-driven innovation.
Vendors and pundits suggest plug-n-play options for Hadoop security - do this and in <20 mins, your petabytes of data are now secure. What happens when PowerPoint approaches fail in a real-world enterprise deployment? In this session, we will review techniques that worked, controls that completely failed, and the business processes we had to stand up.
Designing data visualizations presents us with unique and interesting challenges: how to tell a compelling story; how to deliver important information in a forthright, clear format; and how to make visualizations beautiful and engaging. In this talk, Julie will share a few disruptive designs and connect those back to vizipedia, her compiled data visualization library.
Poppy Crum (Dolby Laboratories | Stanford University)
Our experience of the sensory world does not need to be constrained by our physical limitations. When navigating the environment our senses interact to perceive a robust non-veridical experience. Understanding these interactions and being able to define them perceptually and algorithmically allows technological developments that can facilitate sensory enhancement and optimization.
Open data is quickly gaining momentum, and when applied as data for good, it becomes a much more powerful concept that we should all consider as good data stewards. Everyone from organizations to cities is starting to share data, such as traffic conditions or climate sensor readings, and allowing others to use this open data to improve quality of life.
Businesses are moving from large-scale batch data analysis to large-scale real-time data analysis. Apache Storm has emerged as one of the most popular platforms for this purpose.
This talk covers proven design patterns for real-time stream processing: patterns that have been vetted in large-scale production deployments processing tens of billions of events and tens of terabytes of data per day.
Data is changing our world. Predictions using massive data have not only improved many products; in some industries, they have also disrupted business models and created new ones. What does an organization need to do to generate a new competitive advantage out of data?
Data products are poised to go mainstream, but only if they are designed well. Most data products are designed by developers for developers. This talk discusses methods from Stanford's D.School used by companies like Yahoo!, Samsung, and Audi to design break-out products. These principles can help developers get beyond technology and design products for everyday users.
Dr. Prith Banerjee, Managing Director of Global Technology Research and Development, Accenture, will present the Accenture Tech Vision 2015 and discuss how organizations are driving value from big data.
Storytelling is not about raising someone’s IQ, it’s about raising their blood pressure. Stories engage emotions rather than intellect, making “storytelling with data” a poor metaphor for data visualization when our goal is to communicate clearly.
Allstate’s foundation is data. We extract value from our data by applying machine learning to make data-driven decisions. In this session, we discuss Allstate’s drive for better business results by using machine learning on Hadoop.
I will introduce the USA’s next big astronomy project (LSST) and describe how this telescope requires massive data stream analytics – to discover and respond to exotic, rapidly changing events in the Universe. I will discuss parallels between big data astronomy and Decision Science-as-a-Service for Business, Cybersecurity Information and Event Management, and Marketing Automation using Hadoop.
Writing efficient Spark programs requires a deeper understanding of Spark internals. In this talk, we present practical tips for writing better Spark programs for the beginner or intermediate Spark programmer.
This panel discussion will focus on how organizations can find value, equity and business opportunities in their data supply chain. The modern enterprise data supply chain allows organizations to move, manage and mobilize an ever-increasing amount of data across the organization for consumption by people and things.
While we frequently talk about how to build interesting products on top of machine and event data, the reality is that collecting, organizing, providing access to, and managing this data is where most people get stuck. In this session, we’ll describe one such system, in detail, handling terabytes an hour of event-oriented data, providing real time streaming, search, and SQL access to data.
Impala is the massively parallel analytic database delivering interactive performance on Hadoop.
In this half-day tutorial, we'll walk you through hands-on exercises, taking you from zero to up and running with Impala.
What happens if you take everything that is happening in your company--every click, every impression, every database change, every application log--and make it all available as a real-time stream of well structured data? Companies such as LinkedIn have done this experiment and this talk will describe how this changes the way data is thought about and put to use in an organization.
As Hadoop and the surrounding projects & vendors mature, their impact on the data management sector is growing. Amr will talk about his views on how that impact will change over the next five years. How central will Hadoop be to the data center of 2020? What industries will benefit most? Which technologies are at risk of displacement or encroachment?
Big data is exiting its buzzword phase, and we are seeing applications that use big data infrastructure to power everyday lives. This is a discussion from the front lines, with panelists from industry and startups describing real, deployed applications powered by big data that happily hide the elephant behind beautiful interfaces.
Parquet is a columnar format designed to be efficient and interoperable across the Hadoop ecosystem. Its integration in most processing frameworks and serialization models makes it easy to use in existing ETL and processing pipelines, while giving flexibility of choice on the query engine.
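The efficiency claim for columnar formats rests on column pruning: because each field is stored contiguously, a query touching one column can skip the bytes of every other column. This tiny Python sketch contrasts the two layouts; it is an illustration of the principle, not Parquet's actual on-disk encoding.

```python
# Illustrative sketch of row-oriented vs column-oriented storage.

rows = [
    {"user": "a", "price": 10, "country": "US"},
    {"user": "b", "price": 25, "country": "DE"},
    {"user": "c", "price": 40, "country": "US"},
]

# Row-oriented: summing "price" still walks every whole record.
row_sum = sum(r["price"] for r in rows)

# Column-oriented: pivot once at write time, then scan only one column.
columns = {k: [r[k] for r in rows] for k in rows[0]}
col_sum = sum(columns["price"])

assert row_sum == col_sum == 75
```

In the real format the contiguous column chunks also compress far better, since values of one type and domain sit next to each other.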
To get value out of today’s big and fast data, organizations must evolve beyond traditional analytic cycles that are heavy with data transformation and schema management...
The exponential growth of digitally stored data and the transition of data science from academia to real world applications hold the promise of improving nearly every aspect of our lives.
Wearables contribute to big data, and the resulting insights are already delivering significant gains in key industries.
Entirely new industries are forming as the result of business model innovations. But discovering these disruptive ideas is, still, largely a matter of trial and error. We need faster, more effective ways of testing out new business model designs.
HBase can be a good solution for hierarchical time series data. And we can access the data using both R and Python. This case study is a sanitized version of a solution we brought to a client that provided real business value—without requiring significant investment or time. We show how to move to a simple, scalable NoSQL solution without alienating the scientists who work with the data.
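One common row-key design for hierarchical time series in HBase encodes the hierarchy path plus a time bucket in the key, so a single prefix scan retrieves one node's series in time order. The sketch below models that with a plain Python dict standing in for the table; the key scheme and helper names are illustrative assumptions, not the talk's actual solution.

```python
# Sketch of an HBase-style row-key design for hierarchical time series.

def row_key(path, ts_hour):
    # e.g. ("us", "ca", "sf") at hour 2015031814 -> b"us/ca/sf#2015031814"
    return "/".join(path).encode() + b"#" + str(ts_hour).encode()

store = {}  # stand-in for an HBase table (scans return keys in sorted order)

def put(path, ts_hour, value):
    store[row_key(path, ts_hour)] = value

def scan_prefix(path):
    # HBase prefix scan: all cells whose row key starts with this path.
    p = "/".join(path).encode() + b"#"
    return [v for k, v in sorted(store.items()) if k.startswith(p)]

put(("us", "ca", "sf"), 2015031814, 42)
put(("us", "ca", "sf"), 2015031815, 58)
put(("us", "ny", "nyc"), 2015031814, 7)

assert scan_prefix(("us", "ca", "sf")) == [42, 58]
```

Because HBase stores rows sorted by key, this layout turns "give me this series for these hours" into one cheap contiguous scan, and both R and Python clients can issue it.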
As the Apache Spark userbase grows, the developer community is working to adapt it for ever-wider use cases. 2014 saw fast adoption of Spark in the enterprise and major improvements in its performance, scalability and standard libraries.
If you have Hadoop questions, bring them to Ryan, Joey, and Tom. They’ll explain the Hadoop ecosystem, as well as how to get started with Hadoop using the Kite SDK.
Not all big data problems require big cluster solutions. Doradus OLAP compresses data into compact shards, yielding fast analytical queries using little disk even for big data sets. Learn how Doradus leverages OLAP techniques, columnar storage, and Cassandra to yield sophisticated query features while using amazingly little disk space.
Finding anomalies is essential for a wide range of applications, including cybersecurity, event detection and health and status monitoring. Anomaly techniques that scale successfully to large datasets tend to integrate machine learning with good data engineering. We discuss three case studies and extract eight techniques that have proved effective for detecting anomalies in large scale systems.
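One of the simplest detection techniques that scales to large datasets is a robust z-score: flag points whose deviation from the median, scaled by the median absolute deviation (MAD), exceeds a threshold. This is a generic sketch of that standard method, not any of the talk's specific case studies.

```python
# Robust z-score anomaly detection using median and MAD.

def median(xs):
    s, n = sorted(xs), len(xs)
    return s[n // 2] if n % 2 else (s[n // 2 - 1] + s[n // 2]) / 2

def anomalies(values, threshold=3.5):
    m = median(values)
    mad = median([abs(v - m) for v in values])
    if mad == 0:
        return [v for v in values if v != m]
    # 0.6745 rescales the MAD to be comparable to a standard deviation
    return [v for v in values if abs(0.6745 * (v - m) / mad) > threshold]

readings = [10, 11, 9, 10, 12, 10, 11, 250]
assert anomalies(readings) == [250]
```

Median and MAD are preferred over mean and standard deviation here because the anomalies themselves would otherwise inflate the baseline they are measured against.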
Retail buyers are the backbone of the industry’s profitability. These individuals drive organizational goals with their performance. Many decisions are made by intuition and “gut” feeling, where predictive analytics would have made significant improvements in outcomes. This session takes real world experiences and shows how to transform retail performance through data driven buying decisions.
Learn how tools based on nation-wide job market data can help both students and institutions improve outcomes from the job market level down to curriculum and course choice.
Andreas Mueller (NYU, scikit-learn),
Jennifer Klay (Cal Poly San Luis Obispo),
Peter Wang (Continuum Analytics),
Travis Oliphant (Continuum Analytics, Inc.),
Andy Terrel (Bold Metrics),
Matthew Rocklin (Continuum),
William McKinney (Cloudera),
Stefan van der Walt (UC Berkeley),
Jonathan Frederic (IPython),
Kyle Kelley (Netflix)
Python has become an increasingly important part of the data engineering and analytic tool landscape. PyData at Strata provides in-depth coverage of the tools and techniques gaining traction with the data audience, including IPython Notebook, NumPy/matplotlib for visualization, SciPy, scikit-learn, and how to scale Python performance, including how to handle large, distributed data sets.
What’s important about a technology is what you can use it to do. We’ve looked at what a number of groups are doing with Apache Hadoop and NoSQL in production, and we’d like to relay what worked well for them and what did not...
In this session, we will show you how easy it is to spin up a 32 node Storm cluster and give all attendees a free unlimited 30-day pass to deploy your own Hadoop cluster on Microsoft Azure.
In this from-the-trenches, DevOps-focused talk we explore operational issues in running Hadoop on top of Docker containers in a production, multi-tenant setup.
With Hadoop's native Docker support still in the works and Docker being more of a development tool, a production deployment of the two together is like swimming in treacherous waters... Here's a lantern and a lifeboat to the rescue.
Learn how SAS applications use YARN in order to be a good citizen in a busy Hadoop cluster. Best practices and customer examples for several different user scenarios will be shared and discussed.
This talk discusses how to do real-time analytics with a SQL-like query language. We will discuss the role of Complex Event Processing in real-time analytics, and then a scalable CEP engine that lets users write their queries in a declarative SQL-like CEP query language while executing them on a graph of CEP nodes deployed on top of Apache Storm.
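The kind of query such a CEP language expresses is, at its core, an aggregate over a sliding window of the event stream with a condition that fires an alert. Below is a minimal Python sketch of that execution model; the function, threshold, and event shape are illustrative assumptions, and a real engine would distribute this logic across Storm nodes.

```python
# Sketch of a sliding-window CEP query: "alert when the average of the
# last 3 events exceeds 100" (roughly, SELECT ts WHERE avg(window) > 100).
from collections import deque

def window_alerts(events, size=3, threshold=100):
    window, alerts = deque(maxlen=size), []
    for ts, value in events:
        window.append(value)  # deque drops the oldest value automatically
        if len(window) == size and sum(window) / size > threshold:
            alerts.append(ts)
    return alerts

stream = [(1, 90), (2, 95), (3, 100), (4, 120), (5, 130)]
assert window_alerts(stream) == [4, 5]
```

The appeal of the declarative layer is that analysts state the window and predicate while the engine decides how to partition and parallelize the evaluation.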
Consumers are widely adopting wearable technology – Deloitte predicts there will be 100 million wearable cameras, smartwatches, fitness trackers and other gadgets on the market by 2020. With this mass adoption of wearable devices comes a new data ecosystem that must be protected. Embracing the protection of this new, intricate data ecosystem is imperative to the success of the wearable industry.
As the number of ways to discover and listen to music increases, Shazam's data becomes even more powerful in predicting music tastes/fashions. Labels/artists/radio stations increasingly look to Shazam to predict what the next big hit or summer smash will be. Shazam also uses its usage data to create new product opportunities.
Big data is the sexy new frontier for many businesses but it’s expensive to stand up in an organization and expensive to buy from an external vendor. What is the most fundamental way to demonstrate that data science matters to the organization? This session covers the meaningful data consumption metric that every data science group needs to track.
Each smartphone generates huge heaps of data - up to hundreds of megabytes per day. Apart from location, all sorts of information on behavior and environmental conditions are seamlessly collected in the background by our devices. We will show how to harvest the data and how to tell the story of our everyday lives from the billions of data points that pile up continuously.
As the leading source of intelligent information, Thomson Reuters delivers must-have insight to the world’s financial and risk, legal, tax and accounting, intellectual property and science and media professionals, supported by the world’s most trusted news organization.
In this session you will hear of some of the fascinating use cases for SQL in Hadoop based on real-world customer examples. You will learn some of the innovative techniques that have emerged to overcome limitations of the Hadoop platform that enable features one expects in a proven mature database.
Privacy laws governing a company’s obligations around data collection, use, and disclosure are changing rapidly. Failing to understand how the laws affect a company’s personal data assets can result in media exposés, regulatory investigations, Congressional hearings and lawsuits. This session will provide guidance on “privacy by design” compliance and practical tips to avoid becoming a target of scrutiny.
Processing data from social media streams and sensor devices in real time is becoming increasingly prevalent, and there are plenty of open source solutions to choose from. To help practitioners decide what to use when, we compare three popular Apache stream-processing projects: Apache Storm, Apache Spark, and Apache Samza.
People and startups altering the fabric of things through hardware, data science, and entrepreneurial vision.
The shape and business of IoT is shifting. Learn about key startups making technological advances and surprising intellectual leaps. We aren't yet indistinguishable from magic. But these people are getting us there.
Advances in data science empower leaders to make better decisions for society. By using new kinds of information unavailable during the last several millennia of government, we can avoid mistakes of the past. We will discuss how data and statistical inference are informing how we manage the global climate rationally, a defining policy challenge for our generation.
There are many confusing and painful things about setting up and operating a 900-node Hadoop cluster used as the centerpiece of many of Spotify's big data initiatives. We'll go over a few interesting stories and frustrations that have influenced the direction of our architectural choices, and the lessons we've learned from them.
The most frustrating part of data science is when customers don’t “get it”: endless revisions, recommendations not implemented, or data products not adopted. Exciting new research in neurology, cognitive psychology, and behavioral economics has a lot to say about why. We’ll explore the findings and implications for designing more successful “human-data interfaces.”
If we want to use data to understand human behavior and to design successful interventions to change that behavior, social scientists and data scientists will need to work together. However, the two often approach problems differently and speak strikingly different languages. This talk will present success stories and tips for productive collaboration between social scientists and data scientists.
Once just the realm of Java jockeys and data scientists, Hadoop has become a mainstream tool for business analysts with the rapid proliferation of SQL-on-Hadoop solutions. But there are pitfalls that can plague implementations as IT teams get their first exposure to production Hadoop environments. We’ll discuss the most common pitfalls companies face and how to get around them.
Twitter's users generate tens of billions of tweet views per day. Aggregating these events in real time - in a robust enough way to incorporate into our products - presents a massive scaling challenge. In this talk I'll introduce TSAR (the TimeSeries AggregatoR), a robust, flexible, and scalable service for real-time event aggregation designed to solve this problem and a range of similar ones.
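The core trick behind real-time event aggregation at this scale is to bucket each event into fixed time windows per key and keep running counts, so that any windowed query becomes a constant-time lookup instead of a scan. The sketch below shows that idea in plain Python; the bucket size and function names are illustrative assumptions, not TSAR's API.

```python
# Sketch of time-bucketed event aggregation (the idea behind services
# like a timeseries aggregator, not its actual implementation).
from collections import defaultdict

BUCKET = 60  # one-minute buckets, in seconds

counts = defaultdict(int)

def record(tweet_id, ts):
    # Fold the event into its (key, time-bucket) counter.
    counts[(tweet_id, ts // BUCKET)] += 1

def views(tweet_id, bucket):
    # Windowed query: O(1) lookup, no scan over raw events.
    return counts[(tweet_id, bucket)]

for ts in (0, 10, 59, 60, 61):
    record("t1", ts)

assert views("t1", 0) == 3  # events at seconds 0, 10, 59
assert views("t1", 1) == 2  # events at seconds 60, 61
```

In a distributed deployment the counter map is sharded by key, and coarser windows (hourly, daily) are rolled up from the fine-grained buckets.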
Apache Spark is a popular engine for large scale analytics. This talk will give insights into tuning and debugging a production Spark deployment. It will start with details about Spark internals and an overview of the runtime behavior of a Spark application. I'll explain how to diagnose performance bottlenecks and get the best performance out of Spark jobs.
Social service organizations have a tough job when it comes to using data to drive social impact. With “world saving” goals and large-scale impact, it's crucial these organizations leverage a variety of data streams and do more with less. But pulling in multiple data streams and leveraging partners can be tricky; this session walks through some of the ins and outs, using United Way as one example.
Getting the full value from data often requires the combination of stream processing on new events combined with large scale historical analysis. While both these activities are served by Spark’s execution framework, leveraging multiple persistence layers is key to efficiently and extensibly enabling these use cases.
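A common way to combine the two activities is to serve queries by merging precomputed batch aggregates with real-time increments accumulated since the batch cutoff. This is a deliberately tiny Python sketch of that serving pattern under assumed names; in practice both layers would be Spark jobs writing to separate persistence stores.

```python
# Sketch of merging a batch (historical) layer with a streaming layer.

batch_counts = {"clicks": 1_000_000}   # result of the nightly batch job
stream_counts = {"clicks": 1_234}      # events since the batch cutoff

def total(metric):
    # A query sees historical + real-time data as one number.
    return batch_counts.get(metric, 0) + stream_counts.get(metric, 0)

assert total("clicks") == 1_001_234
```

When the next batch run lands, its output absorbs the streamed events and the real-time counters reset, keeping the fast layer small.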
We call lots of things "data visualization," from a news interactive, to spreadsheets, to an infographic counting calories. These surface similarities hide deep differences in what it means to interact with data. In this talk, we cross disciplines—from data science to design—to enliven our techniques and encourage us to try new methods for creating visualizations.
As the Global Head of Investment Risk at Credit Suisse, Anne Johnson takes data quality very seriously. A single misplaced number can put billions of dollars of client assets at risk. Find out some of the challenges that Anne and her team face in governing the integrity of their data, and the new ways they are thinking about data integration and quality.
Spark is an open-source computation platform for Big Data. All the major Hadoop vendors have embraced Spark as a replacement for MapReduce, because Spark offers much better performance, a more powerful and productive API, and support for event processing. Spark's secrets for success are the Scala programming language and Functional Programming. We'll explore why.
YARN and Mesos are often positioned as competitors for managing datacenter resources, but in reality they can work together to seamlessly share those resources. Why force IT to choose between these two great technologies when we can show you how they work in concert?
The next generation of MapReduce, YARN, has widely touted job throughput and Apache Hadoop cluster utilization benefits. Less known are the pitfalls littering the migration path to YARN. Learn from our extensive field experience to avoid those pitfalls and get your YARN cluster configured right the first time.