July 20–24, 2015
Portland, OR

Microservices, containers, and machine learning

Paco Nathan (derwen.ai)
10:40am–11:20am Thursday, 07/23/2015
Data Portland 256

Prerequisite Knowledge

Some programming experience in Python and SQL will help in understanding the code. Familiarity with Linux and shell scripting will help with the services and containers section. If you are already familiar with text analytics, this talk will hopefully introduce some relatively novel approaches. Realistically, the biggest requirement is some experience using email forums for open source communities.


In this presentation, an open source developer community considers itself algorithmically. The talk shows how to surface data insights from the developer email forums for just about any Apache open source project, using techniques from natural language processing, machine learning, graph algorithms, and time series analysis. As an example, we use data from the Apache Spark email list archives to better understand its community; the same code can be applied to many other communities.

Exsto is an open source project that demonstrates Apache Spark workflow examples for SQL-based ETL (Spark SQL), machine learning (MLlib), and graph algorithms (GraphX). It surfaces insights about developer communities from their email forums. Natural language processing services in Python (based on NLTK, TextBlob, WordNet, etc.) are containerized and used to crawl and parse email archives. These services produce JSON data sets, and we then run machine learning on a Spark cluster to surface insights such as:

  • What are the trending topic summaries?
  • Who are the leaders in the community for various topics?
  • Who discusses most frequently with whom?
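
As a rough illustration of the third question, the sketch below counts how often pairs of people reply to one another. It is a minimal, single-machine sketch in plain Python, not Exsto's code: the record fields (`id`, `sender`, `reply_to`) and the sample messages are hypothetical stand-ins for whatever the parsed JSON actually contains, and the talk itself runs this kind of analysis on Spark.

```python
from collections import Counter

# Hypothetical parsed-email records; these field names are assumptions
# made for illustration, not the schema used by Exsto.
messages = [
    {"id": "m1", "sender": "alice", "reply_to": None},
    {"id": "m2", "sender": "bob", "reply_to": "m1"},
    {"id": "m3", "sender": "alice", "reply_to": "m2"},
    {"id": "m4", "sender": "carol", "reply_to": "m1"},
    {"id": "m5", "sender": "bob", "reply_to": "m3"},
]


def reply_pairs(msgs):
    """Count replies exchanged between each unordered pair of senders."""
    sender_of = {m["id"]: m["sender"] for m in msgs}
    pairs = Counter()
    for m in msgs:
        # Look up who wrote the message being replied to, if it is known.
        parent = sender_of.get(m["reply_to"])
        if parent is None or parent == m["sender"]:
            continue  # thread starter, unknown parent, or self-reply
        pairs[tuple(sorted((m["sender"], parent)))] += 1
    return pairs


pair_counts = reply_pairs(messages)
for pair, count in pair_counts.most_common():
    print(pair, count)
```

The pair counts can serve as edge weights in a graph of the community, which is the kind of structure a graph algorithm would then rank or cluster.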

This talk shows how to use cloud-based notebooks for organizing and running the analytics and visualizations. It reviews how and why the graph analytics and machine learning algorithms generalize patterns within the data, based on open source implementations of two advanced approaches, Word2Vec and TextRank. The talk also illustrates best practices for leveraging functional programming for big data.
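
To give a feel for the TextRank idea mentioned above, here is a simplified sketch in plain Python. It builds a word co-occurrence graph and ranks words with a PageRank-style iteration. It is an assumption-laden illustration rather than the implementation used in the talk: the function name, the tiny stopword list, the window size, and the sample text are all made up for the example, and it skips steps such as part-of-speech filtering and lemmatization.

```python
import re
from collections import defaultdict

# Tiny illustrative stopword list (an assumption; real pipelines use fuller lists).
STOPWORDS = {"a", "an", "and", "the", "of", "to", "in", "for", "on", "is", "with"}


def textrank_keywords(text, window=2, damping=0.85, iterations=50, top_n=3):
    """Rank words by a PageRank-style score over a co-occurrence graph."""
    words = [w for w in re.findall(r"[a-z]+", text.lower()) if w not in STOPWORDS]

    # Link each word to the words that follow it within the window (undirected).
    neighbors = defaultdict(set)
    for i, word in enumerate(words):
        for other in words[i + 1 : i + 1 + window]:
            if other != word:
                neighbors[word].add(other)
                neighbors[other].add(word)

    # Fixed number of update rounds; each word shares its score among its neighbors.
    scores = {w: 1.0 for w in neighbors}
    for _ in range(iterations):
        scores = {
            w: (1 - damping)
            + damping * sum(scores[n] / len(neighbors[n]) for n in neighbors[w])
            for w in neighbors
        }
    return sorted(scores, key=scores.get, reverse=True)[:top_n]


sample = "spark cluster spark sql spark streaming spark mllib"
keywords = textrank_keywords(sample)
print(keywords)
```

Words that co-occur with many other words accumulate higher scores, which is why the approach can summarize topics without labeled training data.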


Paco Nathan


Paco Nathan is an O’Reilly author (Just Enough Math and Enterprise Data Workflows with Cascading) and a “player/coach” who’s led innovative data teams building large-scale apps. He is director of community evangelism for Apache Spark with Databricks, and an advisor to Amplify Partners. Paco is an expert in machine learning, cluster computing, and enterprise use cases for big data. His interests include Spark, Ag+Data, open data, Mesos, PMML, Cascalog, Scalding, Clojure, Python, Chatbots, and NLP.
