Your easy move to serverless computing and radically simplified data processing

Gil Vernik (IBM)

1:15pm–1:55pm Wednesday, September 25, 2019

Location: 1E 07/08

Data Engineering and Architecture

Secondary topics: Cloud Platforms and SaaS, Data Integration and Data Processing


Level

Beginner

Suppose you wrote Python code for Monte Carlo simulations to analyze financial data. The general process involves writing the code and running a simulation over a small set of data to test it. Assuming this all goes smoothly, you now must run the same code at a massive scale, with parallelism, on terabytes of data, doing millions of Monte Carlo simulations. Clearly you’d prefer not to learn the intricacies of setting up virtual machines, suffer their long setup times, or become an expert in scaling up Python code. This is exactly where serverless computing could come to the rescue. With serverless computing, you don’t need to set up the computing environment, and you pay only for the resources your application actually consumes rather than for prepurchased units of capacity. Here you’ll learn how to easily gain these benefits.

Gil Vernik takes a deep dive into how serverless computing can be easily used for a broad range of scenarios, like high-performance computing (HPC), Monte Carlo simulations, and data preprocessing for AI. You’ll focus on how to connect existing code and frameworks to serverless without the painful process of starting from scratch or learning new skills. The approach is based on the open source PyWren framework, which introduces serverless computing with minimal effort; its new fusion with serverless computing brings automated scalability and the use of existing frameworks into the picture. You can simply write a Python function and provide an input pointing to the dataset in a storage bucket. PyWren then automatically scales and executes the user function as a serverless action at massive scale.
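The programming model described above (one user function, applied once per object in a bucket) can be sketched locally. This is a minimal sketch under stated assumptions, not PyWren's actual API: Python's standard `concurrent.futures.ThreadPoolExecutor` stands in for the serverless executor, an in-memory dictionary stands in for the storage bucket, and the names `FAKE_BUCKET`, `process_object`, `map_over_bucket`, and `mean_price` are hypothetical.

```python
from concurrent.futures import ThreadPoolExecutor

# Hypothetical in-memory stand-in for objects in a storage bucket.
FAKE_BUCKET = {
    "prices/part-0.csv": "10.0\n12.0\n11.0\n",
    "prices/part-1.csv": "9.0\n13.0\n",
    "prices/part-2.csv": "14.0\n",
}


def process_object(key):
    """User function: runs once per object and returns a partial result."""
    values = [float(line) for line in FAKE_BUCKET[key].splitlines() if line]
    return sum(values), len(values)


def map_over_bucket(func, keys, max_workers=4):
    """Local stand-in for a serverless map: one concurrent call per key."""
    with ThreadPoolExecutor(max_workers=max_workers) as pool:
        return list(pool.map(func, keys))


def mean_price():
    """Fan out over all objects, then combine the partial sums and counts."""
    partials = map_over_bucket(process_object, sorted(FAKE_BUCKET))
    total = sum(s for s, _ in partials)
    count = sum(n for _, n in partials)
    return total / count


if __name__ == "__main__":
    print(mean_price())
```

The user function knows nothing about how it is scheduled, which is the point of the model: in a serverless setting, the executor would dispatch each call as a separate action rather than a local thread, and the function body would stay the same.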

Gil demonstrates how this capability allowed IBM to run a broad range of scenarios over serverless, including Monte Carlo simulations to predict future stock prices and hyperparameter optimizations for ML models. IBM managed to complete the entire Monte Carlo simulation for stock price prediction in about 90 seconds with 1,000 concurrent invocations, compared to 247 minutes with almost 100% CPU utilization running the same flow on a laptop with 4 CPU cores. He’ll also show you how to combine TensorFlow and serverless for the data-preparation phases. Existing TensorFlow code can be easily adapted to benefit from serverless with only minimal code modifications and without users having to learn serverless architectures and deployments.
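To make the stock price scenario concrete, here is a minimal sketch of a Monte Carlo price forecast split into independent batches. The abstract does not say which price model IBM used; geometric Brownian motion is assumed here as a common textbook choice, and the function names and all parameter values (starting price, drift, volatility, batch sizes) are hypothetical.

```python
import math
import random
import statistics


def simulate_final_price(s0, mu, sigma, days, rng):
    """Simulate one price path under geometric Brownian motion (assumed model)."""
    dt = 1.0 / 252  # one trading day as a fraction of a year
    price = s0
    for _ in range(days):
        z = rng.gauss(0.0, 1.0)
        price *= math.exp((mu - 0.5 * sigma ** 2) * dt + sigma * math.sqrt(dt) * z)
    return price


def run_batch(seed, n_paths, s0=100.0, mu=0.05, sigma=0.2, days=252):
    """One independent batch; each batch could map to one serverless invocation."""
    rng = random.Random(seed)  # per-batch seed keeps batches independent
    return [simulate_final_price(s0, mu, sigma, days, rng) for _ in range(n_paths)]


def forecast(n_batches=8, paths_per_batch=250):
    """Run all batches (sequentially here) and summarize the final prices."""
    finals = []
    for seed in range(n_batches):
        finals.extend(run_batch(seed, paths_per_batch))
    return statistics.mean(finals), statistics.stdev(finals)


if __name__ == "__main__":
    mean, spread = forecast()
    print(f"mean final price: {mean:.2f}, standard deviation: {spread:.2f}")
```

Because the batches share no state, they can run in any order or all at once, which is what makes this kind of workload a natural fit for many concurrent invocations. For scale, the timings quoted above (247 minutes, or 14,820 seconds, versus about 90 seconds) correspond to roughly a 165-fold reduction in wall-clock time.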

Prerequisite knowledge

A basic understanding of Python, big data storage solutions like cloud object storage, and serverless computing

What you'll learn

Learn how to connect existing code and frameworks to serverless without the painful process of starting from scratch or learning new skills, and how serverless computing may provide great benefit for different HPC flows, Monte Carlo simulations, big data, and AI processing frameworks

Gil Vernik

IBM

Gil Vernik is a researcher in the Storage Clouds, Security, and Analytics Group at IBM, where he works with Apache Spark, Hadoop, object stores, and NoSQL databases. Gil has more than 25 years of experience as a code developer on both the server side and client side and is fluent in Java, Python, Scala, C/C++, and Erlang. He holds a PhD in mathematics from the University of Haifa and held a postdoctoral position in Germany.