Your easy move to serverless computing and radically simplified data processing
Who is this presentation for?
Anyone who wants to explore serverless computing for data processing
Suppose a data scientist writes Python code for Monte Carlo simulations to analyze financial data. The general process involves writing the code and running a simulation over a small set of data to test it. Assuming this all goes smoothly, how does the data scientist now run the same code at massive scale, with parallelism, on terabytes of data, performing millions of Monte Carlo simulations? Clearly, she would prefer not to learn the intricacies of setting up virtual machines, suffer their long setup times, or become an expert in scaling up Python code. This is exactly where serverless computing comes to the rescue. With serverless computing, the data scientist does not need to set up the computing environment and pays only for the resources her application actually consumes, rather than pre-purchased units of capacity. In this talk we show how to easily gain these benefits.
In this talk we address the challenge of how serverless computing can easily be used for a broad range of scenarios, such as HPC, Monte Carlo simulations, and data preprocessing for AI. We focus on how to connect existing code and frameworks to serverless without the painful process of starting from scratch or learning new skills. To achieve this, we build on the open source PyWren framework, which brings serverless computing within reach with minimal effort, adding automated scalability and support for existing frameworks. Users simply write a Python function and provide an input pointing to the dataset in a storage bucket. PyWren then does the magic, automatically scaling and executing the user function as serverless actions at massive scale.
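To illustrate the programming model, here is a minimal sketch in the style of the original PyWren API (`pywren.default_executor`, `map`, and futures). The `estimate_pi` task is an illustrative example of ours, not code from the talk, and the serverless invocation is shown as a commented sketch since it requires a configured PyWren deployment:

```python
import random

def estimate_pi(num_samples):
    """One Monte Carlo task: estimate pi from num_samples random points
    in the unit square. Each call is independent, so it parallelizes
    naturally across serverless invocations."""
    inside = sum(
        1
        for _ in range(num_samples)
        if random.random() ** 2 + random.random() ** 2 <= 1.0
    )
    return 4.0 * inside / num_samples

# Locally, this runs as a plain loop:
local_results = [estimate_pi(10_000) for _ in range(4)]

# With PyWren, the same unchanged function is fanned out as serverless
# actions (sketch only; requires a configured PyWren runtime):
#
#   import pywren
#   pwex = pywren.default_executor()
#   futures = pwex.map(estimate_pi, [10_000] * 1000)
#   results = [f.result() for f in futures]
```

The key point is that the user function stays ordinary Python; only the executor call changes between a local loop and a thousand-way serverless fan-out.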
We will demonstrate how this capability allowed us to run a broad range of scenarios over serverless, including Monte Carlo simulations to predict future stock prices and hyperparameter optimization for ML models. As we will show, we completed an entire Monte Carlo simulation for stock price prediction in about 90 seconds with 1,000 concurrent invocations, compared to 247 minutes at nearly 100% CPU utilization running the same flow on a laptop with 4 CPU cores. We also show how to combine TensorFlow and serverless for the data preparation phases. Existing TensorFlow code can easily be adapted to benefit from serverless, with only minimal code modifications and without users having to learn serverless architectures and deployments.
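As a rough sketch of the kind of stock-price simulation described above, the snippet below runs Monte Carlo paths under a geometric Brownian motion model using only the standard library. The model choice and all parameter values are our illustrative assumptions, not the exact model or data from the talk; each call is a self-contained batch that could be one serverless invocation, with results averaged across invocations:

```python
import math
import random

def simulate_price(s0, mu, sigma, days, paths, seed=None):
    """Monte Carlo estimate of the expected stock price after `days`
    trading days, using geometric Brownian motion (an illustrative
    assumption, not the talk's exact model)."""
    rng = random.Random(seed)
    dt = 1.0 / 252  # one trading day, in years
    total = 0.0
    for _ in range(paths):
        price = s0
        for _ in range(days):
            z = rng.gauss(0.0, 1.0)  # daily random shock
            price *= math.exp((mu - 0.5 * sigma**2) * dt
                              + sigma * math.sqrt(dt) * z)
        total += price
    return total / paths

# One batch of 1,000 paths over a 30-day horizon; at scale, many such
# batches would run concurrently as serverless actions.
estimate = simulate_price(s0=100.0, mu=0.05, sigma=0.2,
                          days=30, paths=1000, seed=42)
```

Because each batch depends only on its arguments and its own random seed, scaling from one batch on a laptop to a thousand concurrent invocations requires no change to the simulation code itself.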
Prerequisite knowledge
Attendees should have minimal knowledge of Python and of big data storage solutions such as cloud object storage, plus very minimal knowledge of serverless computing.
What you'll learn
Gil Vernik is a researcher in IBM’s Storage Clouds, Security, and Analytics group, where he works with Apache Spark, Hadoop, object stores, and NoSQL databases. Gil has more than 25 years of experience as a code developer on both the server side and client side and is fluent in Java, Python, Scala, C/C++, and Erlang. He holds a PhD in mathematics from the University of Haifa and held a postdoctoral position in Germany.
View a complete list of Strata Data Conference contacts