There are a number of tools that make it easy to implement a data lake. However, most lack the essential features that prevent your data lake from turning into a data swamp.
Naghman Waheed and Brian Arnold offer an overview of Monsanto’s Data Historian platform, a cloud-based data platform built entirely from open source components that lets users efficiently ingest, process, store, and access datasets without compromising ease of use, governance, or security. The platform was conceived to provide Monsanto with a simple tool for moving files that reside on local computer drives and file shares into a central repository. Besides a user-friendly file ingestion interface, the original tool also gathered metadata both through user input and automatic parsing of files, and the uploaded content was immediately made available via an API. From those humble beginnings, Data Historian has turned into a full-blown, well-managed data lake and is continuously being enhanced with new features.
Data Historian provides batch, streaming, and API-based ingestion in addition to simple file ingestion. Metadata is collected as data is ingested, making datasets immediately searchable in other tools such as Monsanto’s enterprise metadata management system and the enterprise data catalog. The data in Data Historian can be accessed via an API or SQL queries. Security on datasets is controlled through an existing entitlement workflow based on virtual directory services. Even though the system is relatively young, it is already being used by several predictive models that query data out of Data Historian using an access API. In addition, descriptive analytics have been enabled via ODBC/JDBC connectivity, allowing traditional BI tools to interact with the datasets directly, thus increasing the utility of the platform.
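To illustrate the idea of collecting metadata at ingestion time so that a dataset is searchable right away, here is a minimal in-memory sketch. It is not Data Historian's actual implementation: the `Catalog` and `DatasetRecord` names, the tag keys, and the sample file are all hypothetical, and a real platform would persist records in a metadata store rather than a Python dictionary.

```python
import hashlib
from dataclasses import dataclass, field
from datetime import datetime, timezone


@dataclass
class DatasetRecord:
    """Metadata captured for one ingested dataset (hypothetical schema)."""
    name: str
    size_bytes: int
    sha256: str
    ingested_at: datetime
    tags: dict = field(default_factory=dict)


class Catalog:
    """Toy catalog: metadata is recorded in the same step as ingestion."""

    def __init__(self):
        self._records = {}

    def ingest(self, name, payload, **user_tags):
        # Automatic metadata (size, checksum, timestamp) is derived from the
        # payload; user-supplied tags are stored alongside it.
        record = DatasetRecord(
            name=name,
            size_bytes=len(payload),
            sha256=hashlib.sha256(payload).hexdigest(),
            ingested_at=datetime.now(timezone.utc),
            tags=dict(user_tags),
        )
        self._records[name] = record
        return record

    def search(self, **criteria):
        # A dataset matches when every requested tag has the requested value.
        return [
            r for r in self._records.values()
            if all(r.tags.get(k) == v for k, v in criteria.items())
        ]


catalog = Catalog()
catalog.ingest("yield_2017.csv", b"plot,yield\n1,42\n", owner="agronomy", crop="corn")
matches = catalog.search(crop="corn")
```

Because the record is written during `ingest`, the call to `search` finds the dataset without any separate indexing step.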
Like other data lake platforms, Data Historian has numerous other features, such as scheduling and monitoring data loads, archiving data to low-cost storage, automated data deletion based on company data retention policies, and capturing and reporting platform adoption rate metrics, to name a few. The platform has been built using open source software, including Hadoop and AWS EMR as a processing engine, Sqoop for batch data loads, Oozie for scheduling, Hive and Presto for query processing, Lambda for event triggering, and S3, Glacier, RDS, and DynamoDB for data storage. The platform is also fully integrated with AKAN and VDS (virtual directory service) and utilizes the OAuth 2.0 security model.
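The archiving and retention features mentioned above can be pictured as a simple age-based rule. The sketch below is an assumption for illustration only: the `retention_action` function, the `Action` enum, and both thresholds (archive after one year, delete after seven) are hypothetical and do not reflect Monsanto's actual retention policy or how the platform enforces it.

```python
from datetime import date
from enum import Enum


class Action(Enum):
    KEEP = "keep"        # leave the dataset in primary storage
    ARCHIVE = "archive"  # move the dataset to low-cost storage
    DELETE = "delete"    # remove the dataset under the retention policy


def retention_action(last_modified, today,
                     archive_after_days=365,
                     delete_after_days=7 * 365):
    """Pick an action from a dataset's age in days (thresholds are assumed)."""
    age_days = (today - last_modified).days
    if age_days >= delete_after_days:
        return Action.DELETE
    if age_days >= archive_after_days:
        return Action.ARCHIVE
    return Action.KEEP


today = date(2018, 1, 1)
recent = retention_action(date(2017, 12, 1), today)
aging = retention_action(date(2016, 6, 1), today)
expired = retention_action(date(2010, 1, 1), today)
```

A scheduled job could evaluate a rule like this for each dataset and then trigger the corresponding storage operation.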
Naghman and Brian explain how Monsanto built this platform, focusing on the technical design and the various phases of the system build. They also cover the technical architecture and share insights into why the team chose certain open source components to instantiate the platform, along with the lessons learned along the way. Finally, they explain how the system is being used to provide analytics on top of datasets loaded into the system.
Naghman Waheed is the data platforms lead at Bayer Crop Science, where he’s responsible for defining and establishing enterprise architecture and direction for data platforms. Naghman is an experienced IT professional with over 25 years of work devoted to the delivery of data solutions spanning numerous business functions, including supply chain, manufacturing, order to cash, finance, and procurement. Throughout his 20+ year career at Bayer, Naghman has held a variety of positions in the data space, ranging from designing several large-scale data warehouses to defining a data strategy for the company and leading various data teams. His broad range of experience includes managing global IT data projects, establishing enterprise information architecture functions, defining enterprise architecture for SAP systems, and creating numerous information delivery solutions. Naghman holds a BA in computer science from Knox College, a BS in electrical engineering from Washington University, an MS in electrical engineering and computer science from the University of Illinois, and an MBA and a master’s degree in information management, both from Washington University.
Brian Arnold is the lead architect for the Data Historian platform at Monsanto, where he is responsible for guiding the technical direction and implementation of the platform. Previously, he assisted in implementing Monsanto’s enterprise Kafka platform. Brian has 10 years of experience as an IT professional, working on a large-scale ecommerce website and implementing various big data applications. Brian is passionate about big data, the cloud, data science, and functional programming and is experienced in building recommendation system platforms and enterprise data lakes. Brian holds a BS in computer engineering with a minor in mathematics from Marquette University.
©2018, O’Reilly UK Ltd • All trademarks and registered trademarks appearing on oreilly.com are the property of their respective owners.