Presented By O'Reilly and Cloudera
Make Data Work
September 26–27, 2016: Training
September 27–29, 2016: Tutorials & Conference
New York, NY

Building data lakes in the cloud

Alex Bordei (Bigstep)
1:15pm–1:55pm Wednesday, 09/28/2016
Enterprise adoption
Location: River Pavilion Level: Intermediate
Average rating: 3.67 (6 ratings)

Prerequisite knowledge

  • Familiarity with enterprise IT challenges and big data technologies

What you'll learn

  • Understand how building a data lake in the cloud differs from building one on premises

Description

    Every industry has both proven and potential data lake use cases. As enterprise data warehouses (EDWs) prove increasingly inefficient in the face of new business needs, cloud-based data lakes have been gaining popularity with enterprises looking to close the technology gap. Cloud data lakes are purpose-built to meet the data management requirements of the evolving enterprise landscape.

    Alex Bordei walks you through the steps required to build a data lake in the cloud and connect it to on-premises environments, covering best practices in architecting cloud data lakes and key aspects such as performance, security, data lineage, and data maintenance. The technologies presented range from basic HDFS storage to real-time processing with Spark Streaming.

    Topics include:
    Why enterprises should build data lakes in the cloud
    The main drivers for enterprise adoption of the data lake have been the need for agility and for customized, enterprise-wide access to datasets, data streams, and data analysis tools. However, more and more companies have started using cloud data lakes as prototyping workbenches and have embraced the researcher mindset in order to build fully functional data laboratories in the cloud. A cloud data lake is a convenient way to bypass both the tedious integration and configuration of big data applications and the costly acquisition and tuning of the underlying on-premises infrastructure. It also offers the opportunity to experiment with an ever-growing array of big data technologies.

    Solutions for securely extending the on-premises network in the cloud
    To protect against unauthorized access, the data lake uses network authentication protocols such as Kerberos, and it encrypts data both in transit across networks and at rest. Security measures suited to cloud data lakes must also cover efficient backup protocols. Ideally, replication is configured on a per-file basis so users can decide the extent to which the most sensitive data is safeguarded against loss.
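
    As an illustration of per-file replication, here is a minimal sketch that assumes the data lake stores files in HDFS and that replication is set with the standard `hdfs dfs -setrep` shell command. The sensitivity tiers, the replication factors, and the file paths are hypothetical and are not taken from the talk; the code only builds the command lines and does not contact a cluster.

```python
from typing import Dict, List

# Hypothetical sensitivity tiers mapped to HDFS replication factors.
# These values are assumptions for illustration, not recommendations.
REPLICATION_BY_TIER: Dict[str, int] = {
    "public": 2,
    "internal": 3,
    "restricted": 5,
}


def setrep_command(path: str, tier: str) -> List[str]:
    """Build the `hdfs dfs -setrep` argument list for a single file.

    The -w flag asks the command to wait until replication completes.
    """
    factor = REPLICATION_BY_TIER[tier]
    return ["hdfs", "dfs", "-setrep", "-w", str(factor), path]


def replication_plan(files: Dict[str, str]) -> List[List[str]]:
    """Return one command per file, ordered by path for reproducibility."""
    return [setrep_command(path, tier) for path, tier in sorted(files.items())]


if __name__ == "__main__":
    # Hypothetical files and their sensitivity tiers.
    example_files = {
        "/lake/finance/ledger.parquet": "restricted",
        "/lake/web/clickstream.log": "public",
    }
    for command in replication_plan(example_files):
        print(" ".join(command))
```

    On a host with a configured Hadoop client, each argument list could be passed to `subprocess.run` to apply the setting; that step is omitted here so the sketch runs without a cluster.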

    Integration solutions for multiple Active Directory domains and multiple secure Hadoop environments
    The data lake should integrate easily with any corporate Active Directory (LDAP) service or third-party authentication method. Identity services integration is crucial when building a data lake in the cloud. As data provisioning, management, and governance become easier and safer, cloud-based Hadoop architectures better mirror and seamlessly integrate with on-premises architectures.

    Solutions for increasing performance
    Despite its impact on the IT landscape and its enthusiastic adoption across industry sectors, the virtualized cloud is far from being the best underlying architecture solution for data lakes—and for big data projects in general. A fairly new breed of cloud, the bare-metal cloud, offers a much better environment in terms of performance, isolation, and flexibility. Platforms offering such environments provide the high computation power and security of bare metal with the full flexibility of the cloud.

    Software solutions typically used for data lakes
    From concept to deployment, creating a production-ready enterprise cloud data lake should take minutes, not months. Every hardware connection should be software-defined, and every software component should be ready for deployment, scaling, and connecting to a data source. Along with the data lake’s powerful processing capabilities, its software stack is the main aspect that differentiates a cloud data lake from large-scale storage repositories such as an enterprise data warehouse.

    Alex Bordei


    Alex Bordei has been developing infrastructure products for over nine years. Before becoming Bigstep’s product manager, he was one of the core developers for Hostway Corporation’s provisioning platform. He then focused on defining and developing products for Hostway’s EMEA market and was one of the pioneers of virtualization in the company. After successfully launching two public clouds based on VMware software, he created the first prototype of Bigstep’s Full Metal Cloud in 2011. He now focuses on making the Full Metal Cloud the highest-performance cloud in the world for big data applications. Twitter: @alexandrubordei

    Comments on this page are now closed.


    Arun Boghra
    09/30/2016 12:23pm EDT

    Can you provide the slides?
    We were told in the session that slides would be provided.

    09/28/2016 9:29am EDT

    Hi — will the slides be available on here after the talk?