Developers, Data Scientists, Researchers
Familiarity with Python and Jupyter notebooks is assumed. Some basic experience with pandas and scikit-learn will be helpful.
Students should bring a laptop. If they want to work on their own computer for the first half (recommended), they should install Anaconda (or install all of the necessary libraries themselves). Alternatively, we will provide a cloud-hosted solution. Their laptop should be able to reach external websites and connect to ports 8888 and 8787.
Gain hands-on experience with a variety of tools, including the standard library, Spark, and Dask, along with a general understanding of how to think about parallel data analysis and how to choose the right tool for the job.
The Python data science stack, including NumPy, pandas, scikit-learn, and other libraries, is efficient and intuitive for data scientists. However, it was designed for data that fits in memory and runs on a single core. This tutorial teaches you to parallelize and scale your Python data science workloads to multi-core machines and multi-machine clusters. We cover a variety of tools, including the standard library, Spark, and Dask. This comparative approach will help students understand how to think broadly about parallel applications and how to choose the right tool for the job.
In this hands-on tutorial, students will work through guided exercises presented as Jupyter notebooks. They will start on their own personal computers for the first half of the tutorial and switch to a cloud-hosted cluster for the second half, gaining hands-on experience with a distributed machine (we will provide all setup).
Students will walk away from this comparative tutorial with both hands-on experience using a few parallel computing tools and an understanding of how to choose the right tool for the job.
Basic introduction, outline, and background
Part One: Programming paradigms on a single machine
1. Data ingestion with embarrassingly parallel map
2. Flexible algorithms with futures
3. Big Data collections
4. Capstone exercise with machine learning
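As a flavor of the single-machine material above, an embarrassingly parallel map and a futures-based variant can both be sketched with the standard library's `concurrent.futures` (an illustrative sketch only; the `parse` workload is a hypothetical stand-in, not from the tutorial's notebooks):

```python
from concurrent.futures import ThreadPoolExecutor, as_completed

def parse(n):
    # Stand-in for a per-file ingestion task (hypothetical workload)
    return n * n

with ThreadPoolExecutor(max_workers=4) as pool:
    # Embarrassingly parallel map: apply parse to each input independently.
    # For CPU-bound work, ProcessPoolExecutor is the usual drop-in swap.
    results = list(pool.map(parse, range(8)))

    # Futures allow more flexible algorithms: submit tasks individually
    # and handle each one as it finishes, in completion order.
    futures = [pool.submit(parse, n) for n in range(8)]
    finished = sorted(f.result() for f in as_completed(futures))
```

`pool.map` preserves input order, while `as_completed` yields futures in whatever order they finish, which is why the futures-based results are sorted here.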
Part Two: Applying lessons to a cluster
1. Replay the previous exercises on a cluster and see that the programming
   techniques from Part One carry over to distributed systems.
2. Learn about the costs of communication and how performance
characteristics can motivate different algorithms
3. Branch out to one of a few domain-specific exercises, experimenting
with previous lessons
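One reason the single-machine techniques carry over is that executor-style APIs look the same locally and on a cluster. A minimal sketch, assuming a toy chunked dataset: the code below runs against a local thread pool, and a distributed executor such as `dask.distributed.Client` exposes a compatible submit/result pattern, so the same function can target a cluster.

```python
from concurrent.futures import ThreadPoolExecutor

def total(executor, chunked_data):
    # Each chunk is summed on a worker; moving chunks to workers has a
    # communication cost, which is why larger chunks often perform better.
    futures = [executor.submit(sum, chunk) for chunk in chunked_data]
    return sum(f.result() for f in futures)

chunks = [range(0, 5), range(5, 10)]  # toy stand-in for partitioned data
with ThreadPoolExecutor() as pool:
    answer = total(pool, chunks)
```

The same `total` function could be handed a cluster-backed executor without modification, which is the carry-over the second half of the tutorial demonstrates.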
Conclusion. Recap of lessons learned and links for more information.
Cloud resources generously donated by Google.
Matthew Rocklin is an open source software developer focusing on efficient computation and parallel computing, primarily within the Python ecosystem. He has contributed to many of the PyData libraries and today works on Dask, a framework for parallel computing. Matthew holds a PhD in computer science from the University of Chicago, where he focused on numerical linear algebra, task scheduling, and computer algebra.
Ben is a data scientist and developer at Continuum Analytics. He has several years of experience with Python and is passionate about any and all forms of data. Currently he spends his time thinking about usability of large data systems and infrastructure problems as they relate to data management and analysis.