Big Data platforms like Hadoop have matured significantly in the past few years. Now it is possible to deal with huge amounts of data in a scalable fashion both for storage and processing. Newer projects like Apache Spark overcome some of the issues of MapReduce-based batch processing, allowing users to implement complex learning and data analysis methods.
However, truly scalable implementations of complex data analysis algorithms remain challenging. So far, such approaches have relied less on massive parallelization than on clever algorithmic tricks that yield fast, approximate algorithms. Algorithms such as stochastic gradient descent can deal with huge amounts of data, but they are notoriously hard to parallelize because each update depends on the result of the previous one.
In this talk, I will review classic approaches to large-scale learning as well as some recent developments in the field, such as Google’s DistBelief, an approach to parallelizing deep learning using concepts like parameter servers, and the use of approximate algorithms like Count-Min sketches as building blocks for fast and scalable machine learning algorithms. The talk will thus attempt to identify the key concepts that will guide the field in the coming years.
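As a rough illustration of the kind of approximate building block mentioned above, the following is a minimal Count-Min sketch in Python. It is a sketch under stated assumptions, not the implementation used by streamdrill or any other system: the class name `CountMinSketch`, the `width` and `depth` defaults, and the choice of salted SHA-256 as the hash family are all illustrative.

```python
import hashlib


class CountMinSketch:
    """Approximate frequency counter using a fixed depth x width table."""

    def __init__(self, width=1000, depth=5):
        # width and depth are illustrative defaults, not tuned values.
        self.width = width
        self.depth = depth
        self.table = [[0] * width for _ in range(depth)]

    def _index(self, item, row):
        # Derive one hash function per row by salting the item with the row number.
        digest = hashlib.sha256(f"{row}:{item}".encode("utf-8")).digest()
        return int.from_bytes(digest[:8], "big") % self.width

    def add(self, item, count=1):
        # Increment one counter in every row.
        for row in range(self.depth):
            self.table[row][self._index(item, row)] += count

    def estimate(self, item):
        # Collisions can only inflate a counter, so the minimum over all rows
        # is the tightest available upper bound on the true count.
        return min(
            self.table[row][self._index(item, row)] for row in range(self.depth)
        )


# Hypothetical event stream, used only to exercise the sketch.
sketch = CountMinSketch(width=1000, depth=5)
events = ["spark"] * 50 + ["hadoop"] * 20 + ["storm"] * 3
for event in events:
    sketch.add(event)

print(sketch.estimate("spark"), sketch.estimate("hadoop"), sketch.estimate("storm"))
```

Memory use is fixed at `width * depth` counters regardless of how many distinct items arrive. The estimate never undercounts; it can overcount when items collide in every row, which is the trade-off that makes the structure fast and scalable.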
Mikio Braun is co-founder of streamdrill, a startup focused on approximate approaches for real-time big data, and a postdoctoral researcher at TU Berlin, Germany. He holds a Ph.D. in Machine Learning and worked in research for a number of years before becoming interested in putting research results to good use in industry. His current interests focus on anything to do with real-time data analysis, in particular using approximate approaches beyond scaling.