Building and maintaining complex distributed systems
June 19–20, 2017: Training
June 20–22, 2017: Tutorials & Conference
San Jose, CA

A hands-on data science crash course for modeling and predicting the behavior of (large) distributed systems

Bart De Vylder (CoScale)
1:30pm–5:00pm Tuesday, June 20, 2017
Systems Engineering
Location: LL20 A/B
Level: Intermediate
Average rating: 3.00 out of 5 (3 ratings)

Who is this presentation for?

  • DevOps engineers, researchers, and anyone who wants to learn how to get started with data science and apply it to distributed systems

Prerequisite knowledge

  • A working knowledge of Python (equivalent to the first five chapters of the Python tutorial)

Materials or downloads needed in advance

As announced, we plan to run the tutorial online, so you only need a browser to participate. However, because of the large number of participants, we'd like to have a backup in case we run into bandwidth or latency problems at the conference. If you have time ahead of the conference, please install the following:
  • git
  • Anaconda (any Python version is fine)
You can then run the tutorial from your own laptop and save some bandwidth.

What you'll learn


Data science is a hot topic. However, the high number of available software libraries, languages, and platforms is often overwhelming for those who want to get started in the field. Bart De Vylder offers a practical introduction that goes beyond the hype, exploring data analysis and modeling techniques applied to the behavior of distributed systems.

Using hosted IPython notebooks and a real-world dataset of monitoring data from a nontrivial distributed application, consisting of both stateful and stateless services communicating over a message bus, Bart walks you through the Python scientific ecosystem (NumPy, SciPy, and scikit-learn) and demonstrates data visualization techniques that help you interpret the data and the models built from it. Bart discusses data clustering techniques, such as those that automatically discover which servers or containers are running in a load-balanced fashion, and shows you how to apply correlation analysis and dimensionality reduction. Modern monitoring systems easily capture tens of thousands of metrics, but many of these metrics are highly correlated and don’t convey much extra information. Applying dimensionality reduction to discover these correlations automatically helps in understanding and visualizing the data and is one step in preparing and modeling the data.
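
As a rough illustration of the correlation and dimensionality reduction idea described above, here is a minimal sketch using NumPy only. It is not taken from the tutorial materials: the metric names (cpu_a, cpu_b, requests, disk_io, disk_queue) and the synthetic data are assumptions made up for this example. Five metrics are generated from two hidden drivers, so a correlation matrix should reveal two groups, and a principal component analysis (computed here through an SVD of the standardized data) should show that two components carry most of the variance. scikit-learn's PCA class offers the same decomposition with a higher-level interface.

```python
import numpy as np

rng = np.random.default_rng(0)
n_samples = 500

# Two hidden drivers (hypothetical): request load and disk activity.
load = rng.normal(size=n_samples)
disk = rng.normal(size=n_samples)


def noise():
    """Small independent measurement noise for one metric."""
    return 0.1 * rng.normal(size=n_samples)


# Hypothetical metric names; each column is one monitored metric.
names = ["cpu_a", "cpu_b", "requests", "disk_io", "disk_queue"]
X = np.column_stack([
    load + noise(),        # cpu_a follows the request load
    load + noise(),        # cpu_b follows the same load
    2.0 * load + noise(),  # requests is the load on another scale
    disk + noise(),        # disk_io follows disk activity
    0.5 * disk + noise(),  # disk_queue follows disk activity
])

# Pairwise Pearson correlation between metrics (columns).
corr = np.corrcoef(X, rowvar=False)

# PCA via SVD of the standardized data: the squared singular values
# are proportional to the variance captured by each component.
Z = (X - X.mean(axis=0)) / X.std(axis=0)
_, s, _ = np.linalg.svd(Z, full_matrices=False)
explained = s**2 / np.sum(s**2)

print(names)
print(np.round(corr, 2))
print(np.round(explained, 3))
```

In this synthetic setup the first two components account for nearly all of the variance, which is the sense in which many correlated metrics "don't convey much extra information". Real monitoring data is noisier, so the cutoff is rarely this clean.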

Bart also outlines supervised machine-learning techniques to model data and touches on the important concepts of overfitting and cross-validation, considering the advantages and disadvantages of both simple linear techniques and more advanced ones. Bart then shows how to put these models into action and make predictions, discussing techniques for performing what-if analyses related to capacity planning (e.g., which resource will be the next bottleneck if the number of web requests keeps increasing?) and robustness (e.g., what is the impact on a service’s SLA if a node drops out?).
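
The overfitting, cross-validation, and what-if ideas above can be sketched in a few lines of NumPy. This is an illustrative example under stated assumptions, not the tutorial's own code: the data is synthetic (CPU utilization that grows linearly with request rate plus noise), and the helper names rmse, train_rmse, and cv_rmse are hypothetical. A straight line stands in for the "simple linear technique" and a degree-10 polynomial stands in for a more flexible model. The flexible model always fits the training data at least as well, but its held-out error under k-fold cross-validation is noticeably worse than its training error, which is the signature of overfitting. scikit-learn's cross_val_score provides the same kind of evaluation for its estimators.

```python
import numpy as np

rng = np.random.default_rng(1)

# Synthetic, hypothetical monitoring data: request rate vs. CPU utilization.
requests = rng.uniform(100, 1000, size=40)                 # requests per second
cpu = 5 + 0.06 * requests + rng.normal(scale=3, size=40)   # percent
x = requests / 1000.0  # rescaled to keep the polynomial fit well conditioned


def rmse(y_true, y_pred):
    """Root mean squared error."""
    return float(np.sqrt(np.mean((y_true - y_pred) ** 2)))


def train_rmse(x, y, degree):
    """Error on the same data the polynomial was fitted to."""
    coeffs = np.polyfit(x, y, degree)
    return rmse(y, np.polyval(coeffs, x))


def cv_rmse(x, y, degree, k=5, seed=0):
    """Mean held-out RMSE over k shuffled folds."""
    idx = np.random.default_rng(seed).permutation(len(x))
    folds = np.array_split(idx, k)
    scores = []
    for i, test in enumerate(folds):
        train = np.concatenate([f for j, f in enumerate(folds) if j != i])
        coeffs = np.polyfit(x[train], y[train], degree)
        scores.append(rmse(y[test], np.polyval(coeffs, x[test])))
    return float(np.mean(scores))


for degree in (1, 10):
    print(degree, train_rmse(x, cpu, degree), cv_rmse(x, cpu, degree))

# What-if for capacity planning: extrapolate the linear model to estimate
# the request rate at which CPU utilization would reach 100%.
slope, intercept = np.polyfit(x, cpu, 1)
saturation_rps = 1000.0 * (100.0 - intercept) / slope
print(round(saturation_rps))
```

The extrapolation in the last step is only as trustworthy as the assumption that the linear trend continues beyond the observed range, which is one reason to prefer the simpler model for this kind of what-if question.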

Bart ends with a challenging problem on the given dataset, to be solved using one of the discussed techniques, with a nice prize for the attendee with the best solution.

Bart De Vylder


Bart De Vylder is a data scientist at CoScale. Previously, Bart was active in software engineering and architecture, with a focus on distributed systems. His interests lie in machine learning and building reliable, scalable data processing systems. Bart holds a PhD in artificial intelligence from the Free University of Brussels.