Mar 15–18, 2020

Data engineering workshop

Sunday, March 15—Monday, March 16
Location: Winchester 1/2

Participants should plan to attend both days of this training course. Note: to attend training courses, you must be registered for a Platinum or Training pass; the course does not include access to tutorials on Monday.

Jorge Villamariona outlines how organizations that use a single platform to process all types of big data workloads can manage growth and complexity, react faster to customer needs, and improve collaboration, all at the same time. You'll use Apache Spark and Hive to build an end-to-end solution that addresses business challenges common in retail and ecommerce.

What you'll learn, and how you can apply it

  • Learn how to ingest data, build data pipelines, and deploy analytics and machine learning applications using popular data processing engines such as Apache Spark and Hive

Who is this presentation for?

Data engineers, data architects, developers

Prerequisites:




  • Experience with object-oriented programming and writing SQL
  • A Gmail account (for sign up and account enablement)
  • A basic understanding of big data, Apache Spark, Apache Hive, Spark SQL, and cloud computing (useful but not required)

Hardware and/or installation requirements:

  • A WiFi-enabled laptop (no tablets or smartphones)


Day 1

Data ingestion

  • Learn to ingest structured data
  • Learn to ingest semistructured data

Data exploration

  • Explore data with Spark
  • Explore data with Hive
  • Build dashboards

Data batch pipelines

  • Build batch data pipelines
  • Orchestrate data pipelines with Airflow

Processing your data

  • Join structured and unstructured datasets
  • Derive value from your joined datasets

Day 2

Optimizing your cloud platform

  • Autoscaling rules
  • Managing heterogeneous clusters
  • Estimating cost

Fill your data lake with batch and streaming data

  • Learn how to take advantage of batch and streaming datasets

Data mining

  • Learn how to use Spark MLlib for data mining
  • Build a simple recommendation engine

Contest awards

About your instructor

Jorge Villamariona is a senior technical marketing engineer on the product marketing team at Qubole. Over the years, Jorge has acquired extensive experience in relational databases, business intelligence, big data engines, ETL, and CRM systems. He enjoys complex data challenges and helping customers gain greater insight and value from their existing data.

Conference registration

Get the Platinum pass or the Training pass to add this course to your package. Early Price ends February 7.

