Presented By O'Reilly and Cloudera
Make Data Work
5–7 May, 2015 • London, UK

Scalable machine learning

Mikio Braun (Zalando)
11:45–12:25 Thursday, 7/05/2015
Data Science
Location: King's Suite - Balmoral

Prerequisite Knowledge

Some knowledge of the existing Big Data tool landscape; no prior knowledge of advanced data analysis is required.

Description

Big Data platforms like Hadoop have matured significantly in the past few years. It is now possible to handle huge amounts of data in a scalable fashion, for both storage and processing. Newer projects like Apache Spark overcome some of the issues of MapReduce-based batch processing, allowing users to implement complex learning and data analysis methods.

However, truly scalable implementations of complex data analysis algorithms remain challenging. So far, such approaches have relied less on massive parallelization than on clever algorithmic tricks that lead to fast, approximate algorithms. Moreover, algorithms like stochastic gradient descent, which can deal with huge amounts of data, are notoriously hard to parallelize.
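To make the parallelization difficulty concrete, here is a minimal sketch of stochastic gradient descent for a one-dimensional linear model. It is an illustration only, not material from the talk; the function name, the toy data, and the learning rate are all assumptions chosen for the example.

```python
import random

def sgd_linear_regression(samples, lr=0.05, epochs=100, seed=0):
    """Fit y ~ w*x + b with plain stochastic gradient descent (illustrative)."""
    rng = random.Random(seed)
    w, b = 0.0, 0.0
    data = list(samples)
    for _ in range(epochs):
        rng.shuffle(data)  # visit the samples in a fresh random order
        for x, y in data:
            # Each update reads the w and b written by the previous update.
            # This chain of dependent steps is what resists naive parallelization.
            err = (w * x + b) - y
            w -= lr * err * x
            b -= lr * err
    return w, b

# Hypothetical noise-free toy data drawn from y = 2x + 1
toy = [(i / 10.0, 2.0 * (i / 10.0) + 1.0) for i in range(-10, 11)]
w, b = sgd_linear_regression(toy)
```

Each pass touches one sample at a time, so memory use stays constant regardless of data size, but every step depends on the parameters produced by the step before it.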

In this talk, I will review classic approaches to large-scale learning as well as some recent developments in the field. These include Google’s DistBelief, an approach to parallelizing deep learning using concepts like parameter servers, and the use of approximate algorithms like Count-Min sketches as building blocks for fast and scalable machine learning algorithms. The talk will thus attempt to identify the key concepts that will guide the field in the coming years.
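For readers unfamiliar with the Count-Min sketch, the following is a minimal sketch of the data structure, assuming string items and salted BLAKE2 hashes as the per-row hash functions. The class name, the default width and depth, and the example words are assumptions made for illustration and are not taken from the talk.

```python
import hashlib

class CountMinSketch:
    """Approximate frequency counter using fixed memory (illustrative)."""

    def __init__(self, width=272, depth=5):
        self.width = width
        self.depth = depth
        self.table = [[0] * width for _ in range(depth)]

    def _index(self, item, row):
        # Salting the hash with the row number gives one hash function per row.
        digest = hashlib.blake2b(
            item.encode("utf-8"),
            digest_size=8,
            salt=row.to_bytes(8, "little"),
        ).digest()
        return int.from_bytes(digest, "little") % self.width

    def add(self, item, count=1):
        # Increment one counter in every row.
        for row in range(self.depth):
            self.table[row][self._index(item, row)] += count

    def estimate(self, item):
        # Collisions can only inflate a counter, so the minimum across rows
        # is the tightest estimate and never undercounts.
        return min(
            self.table[row][self._index(item, row)]
            for row in range(self.depth)
        )

sketch = CountMinSketch()
for word in ["spark"] * 3 + ["hadoop"] * 2 + ["flink"]:
    sketch.add(word)
spark_estimate = sketch.estimate("spark")
```

The table size is fixed up front, so memory does not grow with the number of distinct items. The price is that estimates may exceed the true count when items collide, which is the kind of speed-for-accuracy trade the abstract refers to.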

Mikio Braun

Zalando

Mikio Braun is co-founder of streamdrill, a startup focused on approximate approaches for real-time big data, and a postdoctoral researcher at TU Berlin, Germany. He holds a Ph.D. in Machine Learning and worked in research for a number of years before becoming interested in putting research results to good use in industry. His current interests focus on anything to do with real-time data analysis, in particular using approximate approaches that go beyond scaling.