In the era of M2M communication and the Internet of Things, on top of the traditional 3Vs of Big Data - Volume, Variety and Velocity - we need to be able to process ephemeral data produced by dispersed sources, data that must be organized and distributed to multiple services. Real-time response, security, compliance, compatibility and breaking down silos all require new approaches to data management.
Analytics is useless if it doesn't lead to action. It is often desirable to put a computer in control of decision making. In this talk I'll discuss bandit algorithms, a class of decision making algorithms that solve a simple but widely applicable decision problem, and have found application in ad serving, content recommendation, and more.
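The simplest member of this class is worth sketching. Below is a minimal epsilon-greedy bandit in Python - a toy simulation with made-up click-through rates, not any production ad-serving code:

```python
import random

def epsilon_greedy(reward_fn, n_arms, n_rounds, epsilon=0.1, seed=0):
    """With probability epsilon pull a random arm (explore); otherwise
    pull the arm with the best observed mean reward (exploit)."""
    rng = random.Random(seed)
    counts = [0] * n_arms
    totals = [0.0] * n_arms
    for _ in range(n_rounds):
        if rng.random() < epsilon or 0 in counts:
            arm = rng.randrange(n_arms)          # explore
        else:
            arm = max(range(n_arms),
                      key=lambda a: totals[a] / counts[a])  # exploit
        counts[arm] += 1
        totals[arm] += reward_fn(arm, rng)
    return counts, totals

# Toy ad-serving example: two ads with hypothetical click-through rates.
ctr = [0.05, 0.12]
counts, totals = epsilon_greedy(
    lambda arm, rng: 1.0 if rng.random() < ctr[arm] else 0.0,
    n_arms=2, n_rounds=5000)
```

Over enough rounds the algorithm concentrates its pulls on the better-performing ad while still spending a fraction of traffic exploring - the core trade-off every bandit method manages.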
This tutorial gives a first introduction to running Big Data analyses in the statistical software R, which brings together the latest Big Data technologies and the latest high-level statistical methods. Bring your laptop, use your web browser to access an RStudio-based analysis platform in the cloud, and leave with plenty of new ideas for efficient Big Data analysis with R.
Big Data is of great interest for official statistics. Results obtained by analyzing large amounts of Dutch traffic-loop detection records, mobile phone data and Dutch social media messages are discussed to illustrate this.
The "smart grid" isn't just about smart meters. Every stage of our electrical power infrastructure has to be "smart," including generation, transmission and distribution. Sophisticated sensors connected to software platforms that continuously gather, visualize and analyze massive amounts of data in real time to produce actionable insights are critical to optimizing our energy assets.
Data science may seem like a revolutionary new field, but it is merely the latest incarnation of a tradition as old as we are: storytelling. And because it is part of such an inherently human practice, it is most valuable when it takes humanity into account. This talk explores how to use data and the techniques associated with data to build things that matter, by looking back to look forward.
It has been said by many that 80% of data science is scrubbing data. In this talk we'll cover how you can use Cascalog to scrub, transform, manipulate and mangle data into the formats you need, fix what is wrong and filter out what is broken. Clojure and Cascalog together provide fantastic tools for this. Learn how to use Hadoop with the messy data that exists in the real world.
Communicating Data Clearly describes how to draw clear, concise, accurate graphs that are easier to understand than many of the graphs one sees today. The tutorial emphasizes how to avoid common mistakes that produce confusing or even misleading graphs. Graphs for one, two, three, and many variables are covered as well as general principles for creating effective graphs.
How do you do data journalism when you are not the Guardian, the New York Times or the Washington Post? You don't need a data team, developers, much time or any funding to get started and produce data journalism that grabs headlines and engages readers. This workshop will focus on quick-start techniques for getting going and making the most of limited resources.
What questions would you ask if you had a Facebook-like graph of what your customers like, what they bought, and what they viewed? This is what we built at uSwitch by transforming flat data from Hadoop into Neo4j. This talk will walk through how we bridged big data and linked data technologies, and the results of that amalgamation.
The NHS produces an amazing amount of detailed raw data about health, prescribing, doctors, hospitals, and so on. The data's a great resource for data scientists to experiment with and learn on - it's very rich, interesting, and important to society.
This session will discuss the available datasets and work through some example analyses of the data from different perspectives.
How do we know what we know? Increasingly discoveries are made from computed data, possibly sourced from the internet. If we are to trust these discoveries, how conclusions are reached is critical. Examples from work in Big Data analytics infrastructure for life sciences and social media analysis will illustrate the key issues.
As data scientists, we are surrounded by uncertainty: data is noisy, missing, wrong or inherently uncertain. In this talk I want to introduce a branch of statistics called Bayesian reasoning, which is a unifying, consistent, logical and practically successful way of handling uncertainty. In short, I'd like to convince people that Bayes' rule is the E=mc² of data science.
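Bayes' rule itself fits in a few lines. As a minimal illustration with made-up numbers (a rare condition affecting 1% of a population, a test with 95% sensitivity and a 5% false-positive rate):

```python
def posterior(prior, likelihood, evidence_given_not):
    """Bayes' rule:
    P(H|E) = P(E|H)P(H) / (P(E|H)P(H) + P(E|~H)P(~H))."""
    num = likelihood * prior
    return num / (num + evidence_given_not * (1.0 - prior))

# Hypothetical numbers: rare condition, imperfect diagnostic test.
p = posterior(prior=0.01, likelihood=0.95, evidence_given_not=0.05)
```

Despite the accurate-sounding test, the posterior probability of actually having the condition after a positive result is only about 16% - exactly the kind of counter-intuitive but logically forced conclusion Bayesian reasoning delivers.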
Keynote by Tim Kelsey, National Director for Patients and Information, National Health Service.
For a two-sided marketplace like Airbnb, the search engine is the main driver of the health of the business. We developed an open-source technology stack and a set of analytical methods to optimize the search experience for our users and search conversion for our business. We’ll discuss the tools we use for data crunching, analysis and reporting, as well as our thoughts on experimental design.
The economy is in a mess. But good data can help fix it. Timely analysis of large data sets is beginning to provide insight into what's really happening to business growth, employment and prosperity. We'll look at some of the most exciting examples of how Big Data is changing the way we look at the economy, and how governments and businesses can use them to their advantage.
Gavin Starks, CEO, Open Data Institute (ODI).
Attendees will leave this session with a deeper understanding of how organizations are using Hadoop to solve real business problems today, and how recent advancements in the Hadoop ecosystem are expanding the platform's capabilities to serve larger enterprise requirements for a virtual EDW.
In this tutorial we'll use the Cloudera Development Kit (CDK) to build a Java web app that logs application events to Hadoop, and then run ad hoc and scheduled queries against the collected data.
Even if one has big data, sometimes there is a lack of key data. This is a problem for predictive analytics: if there is only a limited amount of training material (e.g. user ratings, categorized documents), then it is hard to generate accurate models. The talk introduces new semi-supervised learning methods to overcome this problem by utilizing the vast amount of unlabeled data.
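One of the simplest semi-supervised ideas - self-training, where a model pseudo-labels the unlabeled points it is most confident about and is then refit on them - can be sketched with a toy one-dimensional nearest-centroid "model" (all data and thresholds here are hypothetical, not the methods from the talk):

```python
def centroid_fit(points, labels):
    """Tiny 1-D nearest-centroid 'model': the mean of each class."""
    cents = {}
    for lab in set(labels):
        xs = [x for x, l in zip(points, labels) if l == lab]
        cents[lab] = sum(xs) / len(xs)
    return cents

def predict(cents, x):
    """Assign x to the class with the nearest centroid."""
    return min(cents, key=lambda lab: abs(x - cents[lab]))

def self_train(labeled, unlabeled, rounds=3, margin=1.0):
    """Fit on labeled data, pseudo-label unlabeled points far from
    the decision boundary (i.e. 'confident' ones), and refit."""
    pts = [x for x, _ in labeled]
    labs = [l for _, l in labeled]
    pool = list(unlabeled)
    for _ in range(rounds):
        cents = centroid_fit(pts, labs)
        boundary = sum(cents.values()) / len(cents)
        confident = [x for x in pool if abs(x - boundary) > margin]
        for x in confident:
            pts.append(x)
            labs.append(predict(cents, x))
        pool = [x for x in pool if x not in confident]
    return centroid_fit(pts, labs)

cents = self_train(labeled=[(0.0, "a"), (10.0, "b")],
                   unlabeled=[1.0, 2.0, 8.5, 9.0, 4.9])
```

With only one labeled example per class, the pseudo-labeled points pull the centroids toward the true class means - the basic mechanism by which unlabeled data compensates for scarce training material.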
We're getting better all the time. See how the Cato Institute used responsive design and D3.js to show how human development indicators improve as economic freedom spreads.
The UK Government team behind the GOV.UK website talk about their work on the Performance Platform: a suite of services, and a cultural shift, taking people away from immensely detailed value-stream maps of call-centre and paper processes (which might be inherently 5-day-long journeys) towards something digital, lightweight, fast and pleasant to use.
To make the right decision, you need the right data. As the complexity and abundance of data increase, communicating the results of data analysis becomes more challenging. Grounding our talk in the pharma R&D arena, we illustrate how animated and interactive graphics can streamline communication about complex data analyses and inform decision making.
Dat aims to bring a distributed collaboration flow to big data. Git and GitHub have done it for source code, but we don't yet have a social data solution.
The Lean Startup model showed a generation of founders how to launch companies smarter and faster. At the core of this model is a constant cycle of building, measuring, and learning. In this session, we'll look at the "measure" part of this cycle, and how organizations of all sizes can use data to build a better business faster.
Bookies were "crowdsourcing" long before the Internet made the term popular. Gambling lines, however, don't necessarily represent who should win a game; rather, they represent how much risk the betting public is willing to stomach on its teams. There can be short-term gaps between expectations and outcomes. Exploit the "line bias" and win (maybe?).
"In order to classify documents, simply first convert them to vectors, train, test and finally apply the model." Sounds easy - in theory. Converting documents to vectors usually is the tricky part. This talk walks you through the steps necessary to convert your text documents into feature vectors that Mahout classifiers can use including a few anecdotes on drafting domain specific features.
NICE Systems is a leading provider of Customer Experience Management software, providing real-time offer management and predictive analytics applications based on HBase. We have recently migrated to HBase from our own custom-built data store, and in this session we will share the challenges we overcame in getting HBase to meet our demanding performance requirements.
Fast read and write performance and scalability of distributed in-memory clusters is making it possible to retrain machine learning algorithms in real-time. The application of such algorithms to risk, infrastructure security and other areas can be transformative.
We hear stories of how big data is unprecedented and about the latest disruptive products to hit the market, products that are totally different and will change everything. Yet looking at the underlying concepts, most of these aren’t all that new and the ones that are new are being explained in the terms of the old, in the same way cars were described as “horseless carriages.”
Predictive Analytics has emerged as one of the primary use cases for Hadoop, leveraging various Machine Learning techniques to increase revenue or reduce costs. In this talk we provide real-world use cases from several different industries, and then discuss the open source technologies available to companies wishing to implement Predictive Analytics with Hadoop.
How do you identify duplicate data, and why is it important? What do you do with such data when you find it? Data matching using the mathematics of probability has been around since the 1950s. But how does it actually work? What is the mathematics behind it? How do probabilities allow us to identify duplicate entries?
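The classic probabilistic formulation from that era is the Fellegi-Sunter model: each compared field contributes a log-odds weight depending on whether it agrees between the two records. A minimal sketch, with entirely hypothetical m-probabilities (chance a field agrees given a true match) and u-probabilities (chance it agrees by coincidence):

```python
import math

def match_weight(fields_agree, m_probs, u_probs):
    """Fellegi-Sunter style score: add log2(m/u) for each agreeing
    field and log2((1-m)/(1-u)) for each disagreeing one. High
    positive totals suggest a duplicate; low totals a non-match."""
    w = 0.0
    for agree, m, u in zip(fields_agree, m_probs, u_probs):
        w += math.log2(m / u) if agree else math.log2((1 - m) / (1 - u))
    return w

# Hypothetical m/u values for surname, birth year and postcode.
m = [0.95, 0.90, 0.85]
u = [0.01, 0.05, 0.001]
w_match = match_weight([True, True, True], m, u)    # all fields agree
w_differ = match_weight([False, False, False], m, u)  # none agree
```

Record pairs scoring above an upper threshold are declared duplicates, those below a lower threshold non-matches, and the grey zone in between goes to clerical review - which is how the probabilities turn into decisions in practice.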
People want more out of Hive. They want it to be fast, useful, and connect to their tools. Work is being done to reduce start up time, improve the optimizer, extend it to use Tez, process records 50x faster, add support for functions like RANK, add subqueries, and add standard SQL datatypes. We will review this work plus show current benchmarks.
Apache Hadoop became popular for its specialization in executing MapReduce programs. However, it has been hard to leverage existing Hadoop infrastructure for other processing paradigms such as real-time streaming, graph processing and message passing. That was true until the introduction of Apache Hadoop YARN in Apache Hadoop 2.0.
This tutorial will describe how to process real-time streams using the open-source Storm framework. We will define Storm's core concepts whilst focusing on creating a simple topology that counts, in real time, keywords and hashtags seen in Twitter's public (1%) feed.
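The counting logic of such a topology can be sketched in a few lines of Python - this mimics a splitter bolt feeding a counting bolt over a tiny hard-coded "stream", not Storm's actual Java API:

```python
import re
from collections import Counter

def split_bolt(tweet):
    """Emit hashtags and plain keywords from one tweet, like a
    Storm splitter bolt would."""
    return re.findall(r"#?\w+", tweet.lower())

def run_topology(stream):
    """Feed each tweet through the splitter into a counting bolt
    that maintains running totals."""
    counts = Counter()
    for tweet in stream:
        counts.update(split_bolt(tweet))
    return counts

counts = run_topology(["Loving #BigData at #strataconf",
                       "More #BigData talks please"])
```

In a real Storm topology the splitter and counter run as separate bolts distributed across a cluster, with the stream grouped by word so each counter instance owns a disjoint slice of the counts.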
To keep analyzing more data, and faster, we need a secret weapon: cheating. In this brief survey, learn how you may be doing too much work in your analytics and learning processes, and how giving up a little accuracy can gain a lot of performance, with examples from Apache Hadoop, Mahout, and ML tools from Cloudera.
In this presentation Chris will look at seven MapReduce techniques that he enjoys playing with – the bits that are fun, exciting, and that can provide valuable insight into your big data. With code examples (and pointers to where you can find them) for web, image and text processing, he'll show things that can quickly extend your analysis beyond traditional data mining.
As big data analytics evolves beyond simple batch jobs, there is a need for both lower-latency processing (interactive queries and stream processing) and more complex analytics (e.g. machine learning, graph algorithms). This talk will introduce Spark and Shark, popular open source projects from Berkeley that address this need through an optimized runtime engine and in-memory computing capabilities.
In this talk Felienne will summarize her recently completed PhD research on spreadsheet structure visualization, spreadsheet smells and clone detection, and present a sneak peek into the future of spreadsheet research at Delft University.
To feed LinkedIn's data-driven products, we need to run a complex graph of ETL workflows that deliver the right data to the right systems reliably on a 24x7 basis. To achieve this goal, we have developed a metadata system that captures process dependencies, data dependencies, and execution histories -- this system also lays the foundation for a combined dataflow and workflow engine.
Big data has proved its worth in a number of industries, but it's not the size or the storage that makes the difference. The organisations delivering the most value are the ones that have realised the need to drive analytics into the heart of their decision-making process.
As technology further pervades enterprises, each generates more data. Once harnessed, this data can enhance business, enabling growth. A new home for data has arrived to better support this: the Enterprise Data Hub, with Apache Hadoop at its center. Doug will discuss the trends that drive this and speculate on where they lead.
This talk will discuss how particle physics research can inform the field of data science. The importance of blind analyses and machine learning algorithms will be discussed as tools for filtering growing bodies of data as the big data trend continues.
Maps are powerful tools for people to learn from data. In this project, we combine large-scale data processing with Hadoop and data visualization through CartoDB to make over six years of bi-monthly deforestation data accessible in an interactive map on the web. This talk will tell the story of how large-scale data paired with visualization can make data accessible in important new ways.
How can business and IT users easily explore, analyse and visualise data in Hadoop? Learn about alternatives to manually writing jobs or setting up predefined schemas and how a leading enterprise used Splunk and their Hadoop distribution to empower them with new access to Hadoop data. See how they got up and running in under an hour and enabled their developers to start writing big data apps.
Being good is hard. Being evil is much more fun and gets you paid a lot more. We give a survey of the field of doing high-impact evil with data and analysis. We will look at some of the simplest things you can do to make the maximum (negative) impact on your friends, your business and the world. If you happen to learn something about doing good with data that will be your problem.
How can companies use social and business data together to gain insight? See how Tableau's native Google BigQuery connector links seamlessly to live data in BigQuery and creates interactive visualizations without writing a single line of code. Find out how to share your results on the web and mobile in minutes.
I will discuss and showcase polyglot persistence and the lambda architecture. Based on real-world examples and case studies from our customer base and the wider community of practitioners, including the financial, energy and media industries and the realm of public data sources (government data), we will elaborate on the opportunities and challenges of these new data management and processing memes.