Sep 23–26, 2019

Protect your private data in your Hadoop clusters with ORC column encryption

Owen O'Malley (Cloudera)
3:45pm–4:25pm Thursday, September 26, 2019
Location: 1E 09
Average rating: 4.00 (4 ratings)

Who is this presentation for?

  • Data engineers and software engineers




Fine-grained, column-level data protection in data lake environments has become a mandatory requirement for demonstrating compliance with local and international regulations across many industries. ORC is a self-describing, type-aware columnar file format designed for Hadoop workloads that provides optimized streaming reads along with integrated support for finding required rows quickly.

Owen O’Malley dives into the progress the Apache community has made in adding fine-grained column-level encryption natively to the ORC format, which also provides the ability to mask or redact data on write while protecting sensitive column metadata, such as statistics, to avoid information leakage. The column encryption capabilities are fully compatible with the Hadoop Key Management Server (KMS) and use the KMS to manage master keys, providing the additional flexibility to use and manage keys centrally on a per-column basis. Owen also demonstrates an end-to-end scenario that shows how to leverage this capability.
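As a rough sketch of how this looks in practice: Apache ORC exposes column encryption and masking through table properties (`orc.encrypt` and `orc.mask`) that can be set when creating a Hive table. The table, column, and KMS key names below are hypothetical illustrations, not taken from the talk.

```sql
-- Hypothetical customer table; "pii" names a master key managed in the Hadoop KMS.
CREATE TABLE customers (
  customer_id BIGINT,
  name        STRING,
  ssn         STRING,
  email       STRING
)
STORED AS ORC
TBLPROPERTIES (
  -- Encrypt the ssn and email columns with the KMS master key "pii"
  "orc.encrypt" = "pii:ssn,email",
  -- Readers without key access see masked values instead of failing:
  -- ssn is nullified, email is replaced by its redacted form
  "orc.mask"    = "nullify:ssn;redact:email"
);
```

Users whose credentials can fetch the "pii" key through the KMS read the real values; everyone else transparently receives the masked versions, which is what lets a single file serve both privileged and unprivileged readers.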

Prerequisite knowledge

  • Experience with either Hive or Spark to process big data

What you'll learn

  • Learn how to use ORC column encryption to protect your sensitive data

Owen O'Malley


Owen O’Malley is a cofounder and technical fellow at Cloudera, formerly Hortonworks. Cloudera’s software includes Hadoop and the large ecosystem of big data tools that enterprises need for their data analytics. Owen has been working on Hadoop since the beginning of 2006 at Yahoo, was the first committer added to the project, and used Hadoop to set the Gray sort benchmark records in 2008 and 2009. He was the architect of MapReduce and of Hadoop security, and later of Hive. He’s driving the development of the ORC file format and the addition of ACID transactions to Hive.

Comments on this page are now closed.


kishore veeragandham | Bigdata Administrator
09/29/2019 9:27am EDT

Hello, I was at Strata last week but could not attend this session. I currently have a performance bottleneck when I enable ACID transactions in Hive, due to high cardinality among other things. Please share your presentation; for that matter, any help toward solving the bottleneck is appreciated. On a standard Hive table I get the count in 0.6 seconds; the same query on a transactional table (with the data inserted into a new table) takes 3.46 seconds.

