Presented By O’Reilly and Cloudera
Make Data Work
21–22 May 2018: Training
22–24 May 2018: Tutorials & Conference
London, UK

Schedule: Data Integration and Data Pipelines sessions

Machine learning applications rely on data. The first step is to bring together existing data sources and, when appropriate, enrich them with other data sets. In most cases, data needs to be refined and prepared before it’s ready for analytic applications. This series of talks showcases some modern approaches to data preparation, data integration, and the creation and maintenance of data pipelines.

14:05–14:45 Wednesday, 23 May 2018
Data science and machine learning
Location: Capital Suite 15/16 Level: Intermediate
Ihab Ilyas (University of Waterloo)
Average rating: ****.
(4.40, 5 ratings)
Machine learning tools promise to help solve data curation problems. While the principles are well understood, the engineering details of configuring and deploying ML techniques are the biggest hurdle. Ihab Ilyas provides insight into various techniques and discusses how machine learning, human expertise, and problem semantics can collectively deliver a scalable, high-accuracy solution.
16:35–17:15 Wednesday, 23 May 2018
Big data and data science in the cloud, Data science and machine learning
Location: Capital Suite 13 Level: Intermediate
Sergey Ermolin (Intel), Olga Ermolin (MLS Listings)
Average rating: ****.
(4.00, 1 rating)
Aggregating geospecific real estate databases produces duplicate entries for properties located near geographical boundaries. Sergey Ermolin and Olga Ermolin detail an approach for identifying these duplicates by analyzing the images that accompany real estate listings, using a transfer learning Siamese architecture based on the VGG-16 CNN topology.
11:15–11:55 Thursday, 24 May 2018
Data engineering and architecture
Location: S11B Level: Intermediate
Irene Gonzálvez (Spotify)
Average rating: ***..
(3.88, 8 ratings)
Irene Gonzálvez shares Spotify's process for ensuring data quality, covering why and how the company became aware of its importance, the products it has developed, and future strategy.
12:05–12:45 Thursday, 24 May 2018
Big data and data science in the cloud, Data engineering and architecture
Location: Capital Suite 8/9 Level: Intermediate
Adesh Rao (Qubole), Abhishek Somani (Qubole)
Average rating: ***..
(3.00, 2 ratings)
Adesh Rao and Abhishek Somani share a framework for materialized views in SQL-on-Hadoop engines that automatically suggests, creates, uses, invalidates, and refreshes views created on top of data for optimal performance and strict correctness.
14:55–15:35 Thursday, 24 May 2018
Data science and machine learning, Expo Hall
Location: Expo Hall Level: Beginner
Stamatis Stefanakos (D ONE AG)
Average rating: ****.
(4.33, 3 ratings)
Switzerland-based startup WinJi capitalizes on two current megatrends: big data and renewable energy. Stamatis Stefanakos offers an overview of WinJi's TruePower Asset Management Platform, covering the overall architecture and the motivation behind it, the physics behind the data, and the business case.
14:55–15:35 Thursday, 24 May 2018
Eugene Kirpichov (Google)
Average rating: ****.
(4.50, 2 ratings)
Apache Beam offers users a novel programming model in which the classic batch-streaming dichotomy is erased, and it ships with a rich set of I/O connectors to popular storage systems. Eugene Kirpichov explains why Beam has made these connectors flexible and modular; a key component of this is Splittable DoFn, a novel programming model primitive that unifies data ingestion between batch and streaming.