Every industry has both proven and potential data lake use cases. As enterprise data warehouses (EDWs) prove increasingly inefficient at meeting new business needs, cloud-based data lakes have been gaining popularity with enterprises looking to close the technology gap. Cloud data lakes are purpose-built to meet the data management requirements of the evolving enterprise landscape.
Alex Bordei walks you through the steps required to build a data lake in the cloud and connect it to on-premises environments, covering best practices in architecting cloud data lakes and key aspects such as performance, security, data lineage, and data maintenance. The technologies presented range from basic HDFS storage to real-time processing with Spark Streaming.
Why enterprises should build data lakes in the cloud
The main drivers for enterprise adoption of the data lake have been the need for agility and for custom, enterprise-wide access to datasets, data streams, and data analysis tools. However, more and more companies have started using cloud data lakes as prototyping workbenches, embracing a researcher mindset to build fully functional data laboratories in the cloud. A cloud data lake offers a convenient way to bypass the tedious integration and configuration of big data applications and the costly acquisition and tuning of on-premises infrastructure. It also offers the opportunity to experiment with an ever-growing array of big data technologies.
Solutions for securely extending the on-premises network in the cloud
To protect against unauthorized access, the data lake uses network authentication protocols such as Kerberos, and it encrypts data both in transit and at rest. Security measures suited to cloud data lakes must also cover efficient backup protocols. Ideally, replication is configured on a per-file basis so users can decide how strongly the most sensitive data is safeguarded against loss.
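As a rough illustration of per-file replication, the sketch below builds `hdfs dfs -setrep` commands for individual files according to a sensitivity tier. The tier names, replication factors, and file paths are hypothetical assumptions chosen for the example, not values from the session. The code only constructs the command strings and does not run them.

```python
import shlex

# Hypothetical sensitivity tiers mapped to HDFS replication factors.
# These names and numbers are illustrative assumptions only.
REPLICATION_BY_TIER = {"low": 2, "standard": 3, "critical": 5}


def setrep_command(path, tier):
    """Build (but do not run) an `hdfs dfs -setrep` command for one file."""
    if tier not in REPLICATION_BY_TIER:
        raise ValueError(f"unknown sensitivity tier: {tier!r}")
    factor = REPLICATION_BY_TIER[tier]
    # -w asks the command to wait until replication completes.
    return ["hdfs", "dfs", "-setrep", "-w", str(factor), path]


def plan_replication(files):
    """Return one shell-ready command string per (path, tier) pair."""
    return [shlex.join(setrep_command(path, tier)) for path, tier in files]


if __name__ == "__main__":
    # Hypothetical paths, used only to show the shape of the output.
    plan = plan_replication([
        ("/lake/raw/clickstream.log", "low"),
        ("/lake/finance/ledger.parquet", "critical"),
    ])
    for line in plan:
        print(line)
```

Keeping the tier-to-factor mapping in one table means a change in backup policy touches a single place, and generating the commands before running them lets an operator review the plan first.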
Integration solutions for multiple Active Directory domains and multiple secure Hadoop environments
The data lake should easily integrate any corporate Active Directory (LDAP) or third-party authentication method. Identity services integration is crucial when building a data lake in the cloud. As data provisioning, management, and governance become easier and safer, cloud-based Hadoop architectures better mirror and seamlessly integrate with on-premises architectures.
Solutions for increasing performance
Despite its impact on the IT landscape and its enthusiastic adoption across industry sectors, the virtualized cloud is far from the best underlying architecture for data lakes, or for big data projects in general. A fairly new breed of cloud, the bare-metal cloud, offers a much better environment in terms of performance, isolation, and flexibility. Platforms offering such environments provide the high computation power and security of bare metal with the full flexibility of the cloud.
Software solutions typically used for data lakes
From concept to deployment, creating a production-ready enterprise cloud data lake should take minutes, not months. Every hardware connection should be software-defined, and every software component should be ready for deployment, scaling, and connecting to a data source. Along with the data lake’s powerful processing capabilities, its software stack is the main aspect that differentiates a cloud data lake from large-scale storage repositories such as an enterprise data warehouse.
Alex Bordei has been developing infrastructure products for over nine years. Before becoming Bigstep’s Product Manager, he was one of the core developers for Hostway Corporation’s provisioning platform. He then focused on defining and developing products for Hostway’s EMEA market and was one of the pioneers of virtualization in the company. After successfully launching two public clouds based on VMware software, he created the first prototype of Bigstep’s Full Metal Cloud in 2011. He now focuses on making the Full Metal Cloud the highest-performance cloud in the world for big data applications. Twitter: @alexandrubordei