Integrating Big Data into a Programming Language

Tomas Petricek (University of Cambridge)

The world of data is inherently diverse and “messy”. As a data scientist, you have to work with many different tools and languages to get all the data you need into a usable form before you can even start doing the interesting part of your job. Wouldn’t it be nice if your programming language were aware of the external data sources that you are accessing?

In this talk, we look at doing data science with F#, an open-source, cross-platform programming language that provides a unique way of integrating external data sources and tools into a single environment. This means that you can access not only data but also Matlab scripts and R statistical and visualization packages, all from one place. Interactive coding with rich editor support is the F# way of communicating your ideas and exploring data in Hadoop Hive, third-party REST services, and open-government data in CSV, XML or even HTML formats.

In the live-coding part of the talk, you’ll see how to combine data from Hadoop, CSV files and JSON-based REST services, explore and visualize the dataset interactively, and create a transparent and reproducible research report documenting the work.

F# has been used heavily in the finance and insurance industries, and it is gaining traction in other areas, including bioinformatics. This talk looks at recent F# open-source libraries and tools for data science, developed collaboratively by the speaker at the University of Cambridge, the open-source F# community, and industrial partners such as BlueMountain Capital.


Tomas Petricek

University of Cambridge

Tomas is a computer scientist, book author and open-source developer. He is the lead developer of several F# data-science libraries (Deedle and F# Data), but he also contributed to the design of the F# language itself as an intern and independent consultant. He is the author of a popular book called “Real-World Functional Programming” and is currently editing a collection of practical F# case studies.

Tomas is a PhD student at the University of Cambridge, working on types for understanding context usage in programming languages. He is a founder of DualNotion Ltd., where he provides training and consulting services. He recently spent three months in New York, working on F# tools for data science at BlueMountain Capital.