Schedule: Data Science sessions

Location: 122-123
Garrett Grolemund (RStudio)
Average rating: ****.
(4.21, 14 ratings)
This tutorial will teach you how to streamline your code and your thinking when doing data science. Analysts often spend over 80% of their time preparing and exploring data sets before they begin more formal analysis work. In this tutorial, I will introduce a set of principles -- and R packages -- that make this work easier and faster.
Location: 113
Simon Worgan (Jagex Ltd), Samuel Kerrien (RESEREC)
Average rating: ***..
(3.29, 14 ratings)
We will detail the development of a bi-directional event stream recommendation system in RuneScape, a massively multiplayer online game. By capturing a feature-rich relationship between player and content, we were able to train different 'flavours' of recommendation. Delivered in real time, these 'flavours' balance engagement, monetisation and enjoyment according to shifting business needs.
Location: 113
Hossein Falaki (Databricks Inc.)
Average rating: ****.
(4.07, 14 ratings)
Apache Spark enables interactive analysis of big data by reducing query latency to the range of human interactions through caching. Additionally, Spark’s unified programming model and diverse programming interfaces enable smooth integration with popular visualization tools, such as ggplot and matplotlib. We can use these to perform visual exploratory big data analysis with Spark.
Location: 113
Sean Owen (Cloudera)
Average rating: ****.
(4.00, 20 ratings)
Apache Spark is a popular new paradigm for computation on Hadoop. It's particularly effective for iterative algorithms relevant to data science like clustering, which can be used to detect anomalies in data. Curious? Get a taste of Spark MLlib, Scala and k-means clustering in this walkthrough of anomaly detection as applied to network intrusion, using the KDD Cup '99 data set.
Location: 113
Jeroen Janssens (Data Science Workshops)
Average rating: ***..
(3.73, 11 ratings)
The Data Science Toolbox is a new, open source virtual environment for data science. Its mission is to: (1) get data scientists started in a matter of minutes, (2) enable teachers and authors to offer a custom virtual environment for their students and readers, and (3) encourage researchers to set up reproducible experiments. We'll discuss its importance, its technology, and its future.
Location: 113
Aaron Davidson (Databricks)
Average rating: ****.
(4.73, 11 ratings)
Apache Spark lets users build unified data analytics pipelines that combine diverse processing types. In this talk, we will leverage the versatility of Spark to combine SQL, machine learning, and real-time stream processing into a complete data pipeline, written as a single, short program that we will build up throughout the session.
Location: 127-128
Get certified as a Spark Developer at Strata + Hadoop World in Barcelona.
Location: 115
Garrett Grolemund (RStudio)
Average rating: ****.
(4.78, 18 ratings)
The ggvis package makes it easy to create interactive data graphics with R, with a declarative syntax similar to that of ggplot2. Like ggplot2, ggvis uses concepts from the grammar of graphics, but it also adds the ability to create interactive graphics and deliver them over the web.
Location: 113
Alex Dorman (Magnetic), Michal Laclavik (Magnetic)
Average rating: ***..
(3.00, 2 ratings)
The need to categorize short text strings arises in many domains: online advertising, search engines, social networking, etc. In this session, we will share strategies for categorizing large volumes of queries and keywords in the advertising space, our successes with open document collections (Wikipedia, DBpedia, Freebase), and details on our solution using Hadoop and Solr.
Location: 115
Average rating: ****.
(4.73, 11 ratings)
Linking data to create broader data sets can dramatically improve analysis results, but what if the data sets lack common identifiers? Similarly, duplicates in data are very common and can seriously skew analysis results. This talk covers common techniques from record linkage research for solving both problems, as well as an open source tool implementing those techniques, and real-world examples.
Location: 115
Mikio Braun (Zalando SE)
Average rating: ****.
(4.19, 16 ratings)
Processing high-volume event streams in real time poses considerable challenges. Drawing on our work with social media data and real-time user interaction data, we discuss how to build such systems, starting with a single computer. We have distilled this experience into a number of real-time data analysis patterns, which solve key aspects of such systems.
Location: 115
Shawn Scully (Dato)
Average rating: ***..
(3.95, 19 ratings)
One of the most exciting areas in Big Data is the development of new data products: predictive applications used to drive product recommendations, predict machine failures, forecast airfare, social match-make, identify fraud, predict disease outbreaks, and repurpose pharmaceuticals. In this talk, I’ll share the trends we’re seeing in predictive application development, show how to....
Location: 115
Ofer Ron (LivePerson)
Average rating: ***..
(3.69, 13 ratings)
Many people assume that researching/designing a predictive modeling algorithm is the hard part of building a predictive modeling system over Big Data. We will focus on the far less romantic infrastructure needed to support a system, by reviewing the necessary components and the common pitfalls encountered when trying to automate both horizontally and vertically scalable systems.
Location: 115
Ted Dunning (MapR Technologies)
Average rating: ****.
(4.88, 17 ratings)
Computing various quantities such as medians or the number of unique elements requires a lot of time or a lot of memory or both. It is, however, possible to get really close to the right answer with much less time and much less memory. Such algorithms can be simpler than you might expect. I will describe these and show how they can be applied to applications like anomaly detection.
Location: 114
Tomas Petricek (University of Cambridge)
Average rating: ****.
(4.33, 6 ratings)
The world of data is inherently diverse and "messy". Wouldn't it be nice if your programming language were aware of the external data sources that you are accessing? In this talk, we look at doing data science with F#, which provides a unique way of integrating external data sources and libraries. You can access data, but also Matlab scripts or R packages, all from a single environment.