The 2013 update to IBM’s Big Data Analytics Survey examines in depth the key components (people, process, technology, culture, leadership and governance) required for organizations to excel at deriving value from their information assets (structured/unstructured, streaming/static, big data/little data) in a digital landscape that includes big data, mobile and cloud technologies.
At Intel, we envision a future in which every organization in the world can use new sources of data to enhance its operational intelligence, fostering discoveries and innovation in science, industry, and medicine.
Abstractions are what enable us to think clearly about complex systems. In this talk, we will see how some simple abstractions, such as Monoids, can be used to structure analytics platforms.
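By way of illustration (my sketch, not material from the talk): a monoid is just a type with an associative combine operation and an identity element, and that associativity is exactly what lets partial aggregates be computed on separate machines and merged in any grouping:

```python
# A monoid: an associative combine() plus an identity element.
# Associativity means shard-level partial results can be merged
# in any order, which is why monoids suit distributed aggregation.
from functools import reduce

class SumMonoid:
    identity = 0

    @staticmethod
    def combine(a, b):
        return a + b

def aggregate(monoid, values):
    return reduce(monoid.combine, values, monoid.identity)

# Aggregate two "shards" independently, then merge the partials:
shard1 = aggregate(SumMonoid, [1, 2, 3])
shard2 = aggregate(SumMonoid, [4, 5])
total = SumMonoid.combine(shard1, shard2)
print(total)  # 15
```

The same shape works for max, set union, or approximate sketches such as HyperLogLog; only `identity` and `combine` change.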
There are many practical details involved in building an anomaly detection system. In this presentation, I will describe the major classes of these systems and show you how to build anomaly detectors for three common problems:
* Determining when an event rate shifts
* Determining when new topics appear in content streams
* Determining when systems with defined inputs and outputs act strangely
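As a minimal sketch of the first bullet (my illustration, not the speaker's code), an event-rate shift can be flagged by comparing each window's count against a trailing baseline, using the Poisson rule of thumb that the standard deviation of a count is roughly the square root of its mean:

```python
import math

def rate_shift_alerts(counts, baseline=20, z=4.0):
    """Flag indices of windows whose event count deviates from the
    trailing-baseline mean by more than z Poisson standard deviations."""
    alerts = []
    for i in range(baseline, len(counts)):
        mean = sum(counts[i - baseline:i]) / baseline
        std = math.sqrt(mean) if mean > 0 else 1.0
        if abs(counts[i] - mean) > z * std:
            alerts.append(i)
    return alerts

# Steady rate of ~10 events per window, then a sudden burst:
counts = [10, 11, 9, 12, 10] * 5 + [40]
print(rate_shift_alerts(counts, baseline=10))  # [25]
```

Real systems layer on details such as seasonality correction and alert suppression, but the compare-against-expected-rate core is the same.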
The Hadoop 2.0 revolution is in full force! Organizations, companies, and users are gearing up for the move from 1.0 to 2.0. In this talk, we will discuss what Hadoop 2.0 is about, what YARN is, what features HDFS2 unlocks, and what it means to move to 2.0. We'll discuss this major migration from 1.0 to 2.0 from various perspectives - admins, frameworks, end users & data processing platforms.
In this talk, we'll explore how Apache Hadoop has rapidly evolved to become the new foundation for enterprise analytics - the enterprise data hub - and learn about the state-of-the-art in deploying a modern data warehouse on top of the Hadoop stack.
Google's "Omega" research found that 80% of cluster jobs are batch jobs, while 60% of cluster resources go to services. Batch is simple, services are hard, and mixing workloads is key to building efficient distributed apps.
This talk examines case studies of Mesos workloads: ranging from Twitter (100% on prem) to Airbnb (100% cloud). How did they leverage "data center OS" building blocks for orders of magnitude gains at scale?
Join Paxata’s Nenshad Bardoliwalla for a look at the new breed of data preparation tools that use semantic algorithms to detect data types, apply machine learning to find hidden patterns, and link related columns of data automatically.
Many organizations are jumping on the analytics bandwagon and hiring their own in-house analytics teams. This talk addresses the dos and don'ts of building a data team, including some surprising skill sets you will need on your data team, where to find them, how to organize them, and how to manage knowledge discovery to drive business optimization at the corporate level.
In this talk, we discuss the challenges associated with data center operations management and provide details on how the CloudPhysics big data platform solves these problems and enables capabilities that were previously not possible.
Smart meters may be the most visible element of the so-called smart grid, but how smart is the grid if the plants producing the energy are dumb?
To ensure the integrity of the grid, every stage of our electrical power infrastructure – including generation, transmission and distribution – has to get "smart." Sophisticated sensors connected to big data analytics are key to keeping the power flowing.
In this session, Accenture’s Narendra Mulani takes us beneath the streets of one of the world’s biggest cities and shows how big data architectures, data science algorithms, and advanced visualizations tackle the management challenges of large-scale, infrastructure-heavy networks such as water utilities, and how analytics can replace capital to extend the effectiveness of current infrastructure.
If Big Data is the grand challenge of our time, most analytic effort is like ground control: the hard work behind the scenes that enables ambitious analysis to occur.
In this session, we will share the results of our study, a price-performance comparison of a bare-metal Hadoop cluster and cloud-based Hadoop clusters.
How does the world change when big data reaches a billion people? What happens when anyone, from farmers to criminal investigators, gains the power to quickly derive meaningful insights from vast and varied data sources? Join Quentin Clark, Microsoft Corporate Vice President, who will highlight how simple, familiar tools and cutting-edge cloud technologies are bringing big data to all.
3-Hours: What are the essential components of a data platform? This tutorial will explain how the various parts of the Hadoop and big data ecosystem fit together in production to create a data platform supporting batch, interactive and realtime analytical workloads.
The United States Patent and Trademark Office wanted a simple, lightweight, yet modern and rich discovery interface for Chinese patent data. This is how we did it.
3-Hours: Apache HBase is a distributed, column-oriented, key-value store for Apache Hadoop (via integration with HDFS). In this tutorial, you will learn the basic elements of building a real-time application that uses Apache HBase as a persistent data store.
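To picture the data model the tutorial builds on (a plain-Python sketch of HBase's logical layout, not real client code): a table is a sorted map from row key to cells named by column-family:qualifier, and the sort order is what makes range scans cheap:

```python
# HBase's logical data model, sketched with plain dicts:
# row-key -> {b'family:qualifier': value}, with rows kept in
# sorted key order so that range scans are efficient.

table = {}

def put(row, cells):
    table.setdefault(row, {}).update(cells)

def get(row):
    return table.get(row, {})

def scan(start, stop):
    # Rows are sorted by key, so a scan is just a key-range walk.
    return [(k, table[k]) for k in sorted(table) if start <= k < stop]

put(b'user#001', {b'info:name': b'Ada', b'info:city': b'London'})
put(b'user#002', {b'info:name': b'Grace'})

print(get(b'user#001')[b'info:name'])            # b'Ada'
print([k for k, _ in scan(b'user#', b'user$')])  # [b'user#001', b'user#002']
```

Designing row keys so related rows sort together is the central schema decision this model implies.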
Attend this session to learn how you can take advantage of the new economics of data. This session will present examples of how leading organizations are evolving their enterprise data architectures to bring together the data warehouse, Hadoop, and data discovery platforms so that all users can benefit from all analytics on all data.
Today, we are facing a looming job skills gap in any area that uses Big Data. McKinsey estimates there will be a need for 1.5 million data-savvy managers and analysts in the next 5 years, with this number increasing exponentially around the world. Can digital education be the catalyst to start closing this gap?
Crowdsourcing can be an effective way to collect massive amounts of data to enable deeper analysis in many situations. Explore the foundational steps that can lead to successfully crowdsourcing data through the lenses of the International Barcode of Life and Technical University of Munich (TUM) ProteomicsDB projects. SAP is proud to be involved with driving the success of both these projects.
Crossing the Chasm has been a key reference point for high-tech marketing since its publication in 1991, but a lot has changed since then, especially with the rise of cloud computing, software as a service, mobile endpoints, big data analytics, and viral marketing.
The emerging Internet Of Things (IOT) enables us to build smart systems. We already have the sensory and motor parts of these systems available, but we don't have the brain. This is where data science comes into the picture! I will talk about how we are using big data technologies in conjunction with data science here at Pivotal to build the digital brain that makes a system smart.
90-Minutes: Data analysts routinely report spending more time "wrangling" their data than performing analysis per se. In this tutorial we focus on the ever-present yet oft-overlooked challenges of Data Transformation, including discovery, structure, content and curation. We emphasize recent approaches that jointly emphasize interaction and inference, leveraging both human acuity and...
You have data. You’ve hired data scientists. Now, how do you structure your teams? Do you keep the data scientists together to allow them to learn from each other? Or do you assign them individually to project teams so they can share their knowledge and become closer to the business? Intuit experimented with both ways and learned what it takes to get great outcomes.
Data science algorithms (think machine learning, clustering, outlier detection) often get conflated with the industry-standard tools and programming languages that run them. In this tutorial, John Foreman will use only spreadsheets to build models from his book Data Smart to demonstrate exactly how data science techniques work step-by-step.
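In that same spirit (a hypothetical illustration, not an example from the book), here is one k-means iteration written as explicit steps, each the kind of arithmetic a spreadsheet column would hold:

```python
# One iteration of k-means on 1-D data, written out step by step --
# the same arithmetic a spreadsheet would hold in its cells.

points = [1.0, 1.5, 2.0, 8.0, 9.0, 9.5]
centroids = [1.0, 9.0]

# Step 1: assign each point to its nearest centroid.
assign = [min(range(len(centroids)), key=lambda c: abs(p - centroids[c]))
          for p in points]

# Step 2: move each centroid to the mean of its assigned points.
centroids = [
    sum(p for p, a in zip(points, assign) if a == c) /
    sum(1 for a in assign if a == c)
    for c in range(len(centroids))
]

print(assign)     # [0, 0, 0, 1, 1, 1]
print(centroids)  # [1.5, 8.833333333333334]
```

Repeating the two steps until the assignments stop changing is the whole algorithm; nothing here is beyond a column of formulas.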
With increased road congestion around the globe and growing amounts of car data, we need more intelligent analytical methods to beat the traffic. This talk presents our work on traffic velocity and travel disruption analytics. We describe our approach in detail, how we went from idea to implemented algorithm, and how our methods can be applied to gain deep insight into influential factors.
Humans are constantly curious, and learning should be about making new discoveries. With big data, we have the potential to take formal learning, which is taught, and combine it with informal learning, which is experienced, to create personalized learning paths for every individual.
In this talk, Dr. Amr Awadallah will present the Enterprise Data Hub (EDH) as the new foundation for the modern information architecture. Built with Apache Hadoop at the core, the EDH is an extremely scalable, flexible, and fault-tolerant data processing system designed to put data at the center of your business.
A well-designed domain specific language makes all parts of the data science process easier. In this talk I'll discuss two DSLs implemented in R that make data manipulation and visualisation both easier to describe and faster to compute.
We will discuss the strategic significance of infrastructure core services (compute, storage, network, and comprehensive security) required for robust big data solutions. Also the strategic significance of Hadoop 2.0, Hadoop/NoSQL convergence, and the critical need for effective modeling, query formulation, and data analysis capabilities as Hadoop becomes an enterprise platform for big data.
This five-minute keynote will provide a quick overview of some of the more surprising things Hadoop is capable of.
How Comcast Turns Big Data into Real-Time Operational Insights
While the first big data systems made a new class of applications possible, organizations must now compete on the speed and sophistication with which they can draw value from data. Future data processing platforms will need not just to scale cost-effectively, but to allow ever more real-time analysis and to support both simple queries and today's most sophisticated analytics algorithms.
Twitter's Observability stack collects, processes, monitors and visualizes over 170 million real-time time series from all service and system components. This session covers how the stack is built and scales to enable developers and reliability engineers to build fault-tolerant distributed services. In this talk, you will learn what works and what doesn’t, from architecture to implementation.
Big Data without analytics is just data, but how do you perform the analytics? In this session, learn how In-Hadoop analytics is changing the game for the possibilities of Hadoop.
Alistair Croll, Strata Program Chair
edo Interactive shares how they drive agile, improved decision-making by complementing native Hadoop technologies with analytical databases, ETL optimization, and data visualization solutions from vendors such as Pentaho.
Open data, for many local governments, is an experiment they are not willing to risk. In these lean times, we continuously look for ways to add value and lower expenses. The City of Palo Alto has ventured into this space and has learned lessons, as well as discovered proven approaches, that will help transform open data from a trial into a successful business model for local agencies.
Organizations of all types and sizes are experiencing an explosion of machine log data whose literally inhuman diversity and scale overwhelms traditional analysis tools and techniques. We will discuss how machine learning can complement human expertise, enabling the extraction of valuable and actionable insights from log data.
With more than 45 million users and over 40,000 petitions created every month, Change.org is the biggest online platform for social change around the world. This talk is about how both bleeding edge and simple machine learning algorithms are used at Change.org to connect users to petitions and social issues which are most relevant to them.
What if data doesn't need to be big? Many use cases are served as well, or nearly as well, by a Small Data mindset, storage, processing, and algorithms. This talk presents ideas and options you might not have considered for reducing big problems to comparatively small and cheap ones.
We will discuss Rackspace’s vision for Data-as-a-Service, and provide a few key questions that could help you complement your technical analysis when choosing a database service. Along the way, we will also discuss parts of the portfolio of data services available at Rackspace, including SQL, MongoDB, Redis and Hadoop-based solutions.
Storing massive data is one challenge. Making it useful throughout all levels of a company in real time is quite another. The ability to intuitively sort, sift and analyze data through touch and gesture is here.
We will review several case studies of how companies are creating intuitive, data-driven cultures through Cloudera Search, leveraging Impala coupled with Zoomdata visualization.
We have developed some open-source tools for building and scaling systems for real-time data analysis with data music videos and data gastronomification. We'll discuss the theory behind these two data analysis methods, and then we'll present case studies on how our tools are used to enable business analytics and instill a data-driven culture.
Neural Networks (also known as Deep Learning) are biologically inspired machine learning models. In this talk, I will explain what neural networks are, how they work, and how they were used to achieve the recent record-breaking performance on speech recognition and visual object recognition.
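For a feel of the mechanics (a toy sketch, not the speaker's models): a one-hidden-layer network trained by backpropagation and gradient descent can learn XOR, the classic function no single-layer model can fit:

```python
import numpy as np

# Tiny one-hidden-layer network trained by backpropagation on XOR.
rng = np.random.default_rng(0)
X = np.array([[0, 0], [0, 1], [1, 0], [1, 1]], dtype=float)
y = np.array([[0], [1], [1], [0]], dtype=float)

W1 = rng.normal(size=(2, 8)); b1 = np.zeros(8)   # input -> hidden
W2 = rng.normal(size=(8, 1)); b2 = np.zeros(1)   # hidden -> output
sigmoid = lambda z: 1.0 / (1.0 + np.exp(-z))

for _ in range(10000):
    # Forward pass.
    h = sigmoid(X @ W1 + b1)
    out = sigmoid(h @ W2 + b2)
    # Backward pass: gradients of the squared error at each layer.
    d_out = (out - y) * out * (1 - out)
    d_h = (d_out @ W2.T) * h * (1 - h)
    # Gradient-descent updates.
    W2 -= h.T @ d_out; b2 -= d_out.sum(axis=0)
    W1 -= X.T @ d_h;   b1 -= d_h.sum(axis=0)

print(np.round(out).ravel())
```

The record-breaking speech and vision systems use the same forward/backward machinery, just with many more layers, better nonlinearities, and vastly more data.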
Rodney is probably the most influential skateboarder in history. He’ll gladly discuss how to balance the analytic methods that help us learn with the internal feel for what we are learning. You’ll get tips for perfecting your Heelflip, and discover how skateboarders perform death-defying stunts without, you know, dying.
Open data has been established in government circles, particularly with the launch of open data government initiatives such as Data.gov and Data.gov.uk. These efforts are grounded in an overall philosophy that data should be available to use and share without restrictions, and that government data has value as public infrastructure. But what about all of that commercial data – what role do...
Going from raw data to reproducible and production-ready machine-learning in data pipelines and applications is an unsolved problem, leaving businesses with their valuable data unused. New algorithms and frameworks aim to improve the situation and this talk will introduce some of these using examples of real-world machine learning projects.
We feel safer in big numbers, and we believe that numbers don't lie. But numbers don't actually speak for themselves - people speak for them.
NewSQL has followed quickly on the heels of NoSQL - providing the scale-out of NoSQL along with SQL and ACID guarantees. We'll discuss NewSQL with customer examples and contrast it with SQL-on-Hadoop implementations.
PostgreSQL is an advanced open source database known for its reliability. It also features a rich extension ecosystem that enables features like semi-structured data types, new SQL operators, and a columnar data store. This talk examines extensions available to PostgreSQL users and how CitusDB turns PostgreSQL into a scalable data platform for addressing real world analytics problems.
At GeoPoll we are building a mobile integration platform to poll millions around the world via their own mobile phones. We do this by integrating with mobile carriers in places like Afghanistan and Congo to target users by location, make messages free, & pay users directly. This is hard. We have learned many dos and don'ts which we would like to share.
Today companies have no idea what makes their best employees tick, or why one team outperforms another that has the exact same processes. With the explosion of sensors in the workplace, however, we can now discover these best practices in real time. Using real-world case studies, we'll discuss how this fundamentally changes how people work, manage, and change.
The gap between legendary and anonymity in sports is often less than a 1% performance difference in elite sports. Thus, finding the core, modifiable variables that determine performance and tweaking them ever so slightly can alchemize silver medals into gold ones.
Join industry analyst Susan Etlinger as she demonstrates how leading brands are deriving actionable intelligence from a holistic view of social and enterprise data, the challenges and opportunities in doing so, and the criteria required to achieve social data intelligence maturity.
Combine your best algorithms and smartest data architecture, and what do you get? Without humans, you have an expensive, high tech brick. Humans generate data, which is used by and for humans to achieve human goals. If you want your data department to earn its keep by showing real value, you must build your social systems as meticulously as you build your pipeline.
Visualization is a weak link in big data tools: shoving 1MM rows into standard charts breaks their visual design and kills interactivity. In our mission to scale charts, we built the Superconductor language. It automatically compiles declarative visualizations into GPU code (WebCL+WebGL). This talk will explore how we're redesigning and optimizing core charts like heat maps and line graphs.
The better we tune our practice, the more practice will make perfect.
This presentation highlights the business processes, data architecture and analytic tools AutoTrader.com has put in place to enable robust analysis across subject areas, yielding improvements in the consumer experience and ultimately in customer value.
Everyone wants to know, how do I get the most value out of my data? Data requires time and money for proper ingestion, transformation, and analysis, so there better be some concrete ROI. While there are many internal uses for this data (the importance of which cannot be overstated), in this “Insight Economy,” companies must realize that the value of data extends outside the organization as well.
Netflix is a data-driven company. While "data-driven" is often no more than a lofty buzzword, we'll discuss how we make it a reality. We'll dive into the technologies we use and the philosophies underpinning how we get things done. We'll cover our "cloud native" data infrastructure, our use and contributions to open source software, and our open and enabling data environment.
Learn how and why it is now possible for Apache Hadoop to serve as a virtual Enterprise Data Warehouse (EDW) framework for native Big Data (stored in HDFS) - making it no longer necessary to move that data into the EDW at great expense simply for analysis. In this session, attendees will get an architect-level view of the solution and explore an example configuration and benchmark numbers.
Millions of sensors measure the oceans and atmosphere to guide decisions made in the offshore, renewables, maritime, and other industries. The fast and responsive big data solution created by Marinexplore allows many organizations to plan their ocean activities alongside others for the first time. The resulting new workflows reduce financial, safety, and environmental risks through improved decision making.
In our highly data-driven environment, businesses are essentially becoming semi-autonomous agents, constantly competing for resources, customers and talent.
Creating value from big, messy data sets can be a daunting task. The session introduces the Sidekick Pattern: using small, curated data to increase the value of Big Data. Drawing on lessons from data science for Jawbone’s UP fitness tracker, we will see how smart selection of data sidekicks can accelerate analysis, solve cold start problems, and simplify complicated data pipelines.
We're failing at big data, and bigger technology isn't helping. Complex infrastructure shouldn't justify complicated experiences. Let's apply the principles of consumer app culture to enterprise decision-making in a way that goes beyond dashboards. Let's use design thinking and metadata to connect people to information in a world where complexity is inevitable and technology alone is insufficient.
Why have powerful tools if you aren't asking the right questions? Good questions trump shiny tools, but our community has done little to improve how we train people in the "soft side" of data science. We will show how to borrow ideas from design, the humanities, consulting practices to structure problems and improve the questions we ask of our data.
Using the right tool for the job, understanding how the right data helps make better decisions, and having a sound data infrastructure are needed before big data will come to your rescue. I'll tell a few stories of marketers failing at data, and one or two about the rare client who does it right.
This presentation discusses how we used complex event processing (CEP) and MapReduce based technologies to track and process data from a soccer match as part of the annual DEBS event processing challenge while achieving throughput in excess of 100,000 events/sec.
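To make the pattern concrete (a toy illustration, not the DEBS solution itself): at its core, CEP maintains continuous queries over a stream, such as a per-key sliding-window aggregate that updates on every arriving event:

```python
from collections import defaultdict, deque

# Toy CEP operator: per-key sliding-window sum over a timestamped
# event stream -- the building block of queries like "distance run
# by each player in the last 60 seconds".

class SlidingSum:
    def __init__(self, window):
        self.window = window
        self.events = defaultdict(deque)   # key -> deque of (ts, value)
        self.totals = defaultdict(float)

    def on_event(self, key, ts, value):
        q = self.events[key]
        q.append((ts, value))
        self.totals[key] += value
        # Evict events that have fallen out of the window.
        while q and q[0][0] <= ts - self.window:
            _, old_value = q.popleft()
            self.totals[key] -= old_value
        return self.totals[key]

op = SlidingSum(window=60)
op.on_event('player7', ts=0, value=5.0)
op.on_event('player7', ts=30, value=3.0)
print(op.on_event('player7', ts=70, value=2.0))  # 5.0 (ts=0 evicted)
```

Because each event touches only a deque append and a few evictions, this per-event cost stays constant, which is how such operators sustain six-figure events-per-second rates.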
In the haste to build and ship product, metrics to measure effectiveness and learn from user behavior can't be left behind. The heart of TripIt is parsing travel itineraries into trips that users can access anywhere, on web, mobile or tablet. TripIt uses data to navigate and prioritize support for a wide range of travel confirmation templates.