The Python data science stack, which includes NumPy, pandas, and scikit-learn, is efficient and intuitive for data scientists. However, it was designed to run on data that fits in memory, using only a single core. Matthew Rocklin and Ben Zaitlen demonstrate how to parallelize and scale your Python workloads to multicore machines and multimachine clusters using a variety of tools, including the standard library, Spark, and Dask.
Using guided exercises in Jupyter notebooks, you’ll gain hands-on experience with parallel computing tools so you understand how to choose the right tool for the job.
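As a taste of the single-machine material, here is a minimal sketch of parallelizing a CPU-bound loop with the standard library's concurrent.futures module; the function costly and its inputs are placeholders, not taken from the tutorial notebooks.

    from concurrent.futures import ProcessPoolExecutor

    def costly(x):
        # Placeholder for a CPU-bound task that dominates runtime
        return sum(i * i for i in range(x))

    if __name__ == "__main__":
        inputs = range(10_000, 10_010)

        # Serial baseline: one core, one task at a time
        serial = [costly(x) for x in inputs]

        # Parallel version: spread the same tasks across all available cores
        with ProcessPoolExecutor() as pool:
            parallel = list(pool.map(costly, inputs))

        assert serial == parallel

Swapping ProcessPoolExecutor for ThreadPoolExecutor is a one-line change, which is part of why choosing the right tool for a given workload matters.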
Programming paradigms on a single machine
Applying lessons to a cluster (see the sketch after this outline)
Wrap-up and Q&A
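To illustrate the cluster portion, the sketch below shows how Dask lets pandas-style code run across many machines by connecting a Client to a scheduler. The scheduler address, file path, and column names are hypothetical placeholders under the assumption of a running dask-scheduler and a set of CSV files.

    from dask.distributed import Client
    import dask.dataframe as dd

    # Hypothetical address; in practice this points at a dask-scheduler
    # process running somewhere on the cluster
    client = Client("tcp://scheduler-address:8786")

    # Hypothetical path; dask.dataframe treats many CSVs as one logical frame
    df = dd.read_csv("data/2017-*.csv")

    # Familiar pandas-style syntax; .compute() triggers execution on the cluster
    result = df.groupby("name").amount.mean().compute()
    print(result)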
Cloud resources generously donated by Google.
Matthew Rocklin is an open source software developer at Anaconda focusing on efficient computation and parallel computing, primarily within the Python ecosystem. He has contributed to many of the PyData libraries and today works on Dask, a framework for parallel computing. Matthew holds a PhD in computer science from the University of Chicago, where he focused on numerical linear algebra, task scheduling, and computer algebra.
Ben Zaitlen is a data scientist and developer at Anaconda. He has several years of experience with Python and is passionate about any and all forms of data. Currently, he spends his time thinking about the usability of large data systems and about the infrastructure problems that arise in data management and analysis.