Around the time Hadoop turned 10, longtime users began realizing they lacked viable methods for working with it. Mature IT groups continue to be appalled by the governance-free data dumping and lack of audit trail common with Hadoop, and business users are frustrated by the low value and trust they get from Hadoop data. Philip Russom explains how a data lake can improve the role of Hadoop in data-driven business management. With the right end-user tools, a data lake can enable self-service data practices that wring business value from big data and that modernize and extend programs for data warehousing, analytics, data integration, and other data-driven solutions.
When designed well, a data lake is an effective data-driven design pattern for capturing a wide range of data types, both old and new, at large scale. By definition, a data lake is optimized for the quick ingestion of raw, detailed source data, plus on-the-fly processing of such data for exploration, analytics, and operations. Even so, traditional, higher-latency data practices are possible too.
There are two broad types of data lakes, based on which data platform is used: Hadoop-based data lakes and relational data lakes. Today, Hadoop is far more common than relational databases as a lake platform. However, a quarter of survey respondents say that their production lake spans both. And those platforms may be on-premises, in the cloud, or both. Hence, some data lakes are multiplatform and hybrid, like many other modern data ecosystems today. Furthermore, data lakes rarely stand alone; most are integrated tightly with larger hybrid environments.
Organizations are adopting the data lake design pattern (whether on Hadoop or a relational database) because lakes provision the kind of raw data that users need for data exploration and discovery-oriented forms of advanced analytics. A data lake can also be a consolidation point for both new and traditional data, thereby enabling analytic correlations across all data. A recent survey by TDWI has found data lakes in most mainstream industries, including finance, insurance, and even data-sensitive healthcare. Since these organizations and departments are getting value from data lakes, you can too.
Philip Russom is the research director for data management at TDWI, where, as an industry analyst, he oversees many of the company’s research-oriented publications, services, and events. A well-known figure in data warehousing, business intelligence, data management, and analytics, Philip has published 550+ research reports, magazine articles, opinion columns, speeches, and webinars. Previously, he was an industry analyst covering BI at Forrester Research and Giga Information Group; ran his own business as an independent industry analyst and BI consultant; was contributing editor to leading IT magazines; and held technical and marketing positions for various database vendors.
©2017, O'Reilly Media, Inc.