Presented By O'Reilly and Cloudera
Make Data Work
Sept 29–Oct 1, 2015 • New York, NY

Data science for Wall Street

Sean Owen (Cloudera), Juliet Hougland (Cloudera), Sandy Ryza (Clover Health)
9:00am–12:30pm Tuesday, 09/29/2015
Data Science & Advanced Analytics
Location: 3D 03/10
Audience level: Advanced
Average rating: 2.96 (24 ratings)

Materials or downloads needed in advance

Attendees are expected to bring a laptop with working wireless networking. An internet connection and access to a cluster with data will be supplied at the venue.


Other industries are catching on to what Wall Street has known for years – the collection of data and application of analytic methods can provide enormous value to enterprises.

In this tutorial, attendees will get a taste of how large-scale data science techniques and technologies developed for the consumer internet can be applied in the world of finance. Attendees will enrich stock tick data with Wikipedia page view traffic data as well as the text of the pages. We will guide an exploration of the relationship between traffic on Wikipedia pages and the movement of stock prices.

In this tutorial attendees will learn how to:

  • Clean and transform data sets in Spark
  • Join disparate data sets (text and time series) by defining conformed dimensions
  • Use MLlib and other statistical libraries to build and evaluate models; model types covered may include anomaly detection, time series forecasting, and textual analysis
  • Use Hue to run Spark jobs and visualize the results
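
As a rough illustration of the join step, the sketch below aligns daily closing prices with daily Wikipedia page views on a shared (ticker, day) key and then measures how the two series move together. It is plain Python rather than Spark, and it is not taken from the tutorial materials: the sample rows, the `page_to_ticker` mapping, and the function names are all hypothetical, chosen only to show what a conformed dimension looks like in practice.

```python
from datetime import date
from math import sqrt

# Hypothetical sample data; values and field layout are assumptions for illustration.
ticks = [  # (ticker, day, closing price)
    ("ACME", date(2015, 9, 1), 10.0),
    ("ACME", date(2015, 9, 2), 10.5),
    ("ACME", date(2015, 9, 3), 10.2),
    ("ACME", date(2015, 9, 4), 11.0),
]
page_views = [  # (Wikipedia page title, day, view count)
    ("Acme_Corporation", date(2015, 9, 1), 1200),
    ("Acme_Corporation", date(2015, 9, 2), 1500),
    ("Acme_Corporation", date(2015, 9, 3), 1300),
    ("Unrelated_Page", date(2015, 9, 4), 9000),
]
# Hypothetical lookup that conforms page titles to ticker symbols.
page_to_ticker = {"Acme_Corporation": "ACME"}


def join_on_ticker_and_day(ticks, page_views, page_to_ticker):
    """Inner-join prices and page views on the conformed (ticker, day) key."""
    views_by_key = {}
    for page, day, views in page_views:
        ticker = page_to_ticker.get(page)
        if ticker is None:
            continue  # page has no known ticker, so it cannot be joined
        key = (ticker, day)
        views_by_key[key] = views_by_key.get(key, 0) + views
    joined = []
    for ticker, day, close in ticks:
        key = (ticker, day)
        if key in views_by_key:
            joined.append((ticker, day, close, views_by_key[key]))
    return joined


def pearson(xs, ys):
    """Pearson correlation coefficient of two equal-length sequences."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    return cov / sqrt(var_x * var_y)


joined = join_on_ticker_and_day(ticks, page_views, page_to_ticker)
closes = [row[2] for row in joined]
views = [row[3] for row in joined]
r = pearson(closes, views)
print(f"{len(joined)} joined rows, correlation = {r:.3f}")
```

On a cluster the same idea would typically be expressed as a keyed join between two distributed data sets, with the page-to-ticker lookup applied before the join so that both sides share the same key.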

Sean Owen


Sean Owen is director of data science at Cloudera in London. Before Cloudera, he founded Myrrix Ltd. (now the Oryx project) to commercialize large-scale real-time recommender systems on Hadoop. He is an Apache Spark committer, was a committer and VP for Apache Mahout, and is the coauthor of Advanced Analytics with Spark and Mahout in Action. Previously, Sean was a senior engineer at Google.


Juliet Hougland


Juliet Hougland is a data scientist at Cloudera and a contributor, committer, and maintainer for the Sparkling Pandas project. Her commercial applications of data science include developing predictive maintenance models for oil and gas pipelines at Deep Signal, and designing and building a platform for real-time model application, data storage, and model building at WibiData. Juliet was the technical editor for Learning Spark by Karau et al. and Advanced Analytics with Spark by Ryza et al. She holds an MS in applied mathematics from the University of Colorado, Boulder, and graduated Phi Beta Kappa from Reed College with a BA in math-physics.


Sandy Ryza

Clover Health

Sandy Ryza is a senior data scientist at Clover Health. He was previously at Cloudera doing engineering and data science. He is an author of O’Reilly’s Advanced Analytics with Spark, as well as a Spark committer and member of the Hadoop project management committee. He graduated Phi Beta Kappa from Brown University.