Presented By O'Reilly and Cloudera
Make Data Work
31 May–1 June 2016: Training
1 June–3 June 2016: Conference
London, UK

Architecting a data platform

John Akred (Silicon Valley Data Science), Stephen O'Sullivan (Data Whisperers)
13:30–17:00 Wednesday, 1/06/2016
Enterprise adoption
Location: Capital Suite 12 Level: Intermediate
Average rating: 3.39 (18 ratings)

Prerequisite knowledge

Attendees should be familiar with big data technologies, data warehouses, and enterprise data systems.

Description

What are the essential components of a data platform? John Akred and Stephen O’Sullivan explain how the various parts of the Hadoop and big data ecosystems fit together in production to create a data platform supporting batch, interactive, and real-time analytical workloads.

By tracing the flow of data from source to output, John and Stephen explore the options and considerations for components, including:

  • Acquisition: from internal and external data sources
  • Ingestion: offline and real-time processing
  • Storage
  • Analytics: batch and interactive
  • Providing data services: exposing data to applications

Other topics include:

  • Tool selection
  • The function of the major Hadoop components and other big data technologies such as Spark and Kafka
  • Integration with legacy systems

John Akred

Silicon Valley Data Science

With over 15 years in advanced analytical applications and architecture, John Akred is dedicated to helping organizations become more data driven. As CTO of Silicon Valley Data Science, John combines deep expertise in analytics and data science with business acumen and dynamic engineering leadership.


Stephen O'Sullivan

Data Whisperers

A leading expert on big data architectures, Stephen O’Sullivan has 25 years of experience creating scalable, high-availability data and application solutions. A veteran of Silicon Valley Data Science, @WalmartLabs, Sun, and Yahoo, Stephen is an independent adviser to enterprises on all things data.

Comments on this page are now closed.

Comments

Stephen O'Sullivan
28/05/2016 23:14 BST

David Pardoe – We will cover something similar during the talk. Feel free to ask your question during the talk if we need to go further. Also, I’m doing office hours on Thursday at 11:15am.

Stephen O'Sullivan
28/05/2016 23:11 BST

Naveen Siddareddygari – We will cover some of that. Feel free to ask your question during the talk. Also, I’m doing office hours on Thursday at 11:15am.

David Pardoe
27/05/2016 10:18 BST

Hi there,

I am building a data science platform for a (very large) recruiting business. We have a mixture of structured and unstructured data that resides in different systems. We are already exploring Elasticsearch and are putting a considerable amount of data into JSON documents with an Elasticsearch index sitting on top (if those are the correct terms!). We also have structured (i.e., relational) data underpinning operational systems and used for MI cubes (hence a summary of the operational systems). I am particularly interested in architectures that will allow us to access the data across these different types of data stores in single queries (e.g., Apache Drill), primarily for advanced data science activities (machine learning, predictive model training and deployment, etc.).

Will you cover any of these types of things?

Naveen Siddareddygari
27/05/2016 2:38 BST

Hello,
We set some guidelines for architecting our current platform, as below:
Metadata (data lineage, dependencies)
Re-use (Master data -XREF, business rules)
Modularity (Business rules, environments)
Workflow orchestration (data driven, error reprocessing, deterministic outcome)

The above have worked well, but we are looking at the Big Data stack and trying to figure out what's the best way to achieve this.

The biggest items are workflow or job dependencies that impact how master data is built, and late-arriving reference data that triggers reprocessing of data. Big data is immutable, so we are looking forward to design options. Thanks

John Akred
18/05/2016 4:39 BST

Hi Clemens – we can definitely dig into this a bit during the tutorial! Please feel free to remind us, as we will ask for questions at the end of major sections.

Clemens Valiente
17/05/2016 11:13 BST

Hi John & Stephen,
looking forward to the tutorial!
I would have one question that hopefully fits this tutorial. How would you design a backup solution accounting for a possible cluster failure for your data architecture?