Presented By O’Reilly and Cloudera
Make Data Work
21–22 May 2018: Training
22–24 May 2018: Tutorials & Conference
London, UK

You call it Data Lake, we call it Data Historian

Naghman Waheed (Monsanto), Brian Arnold (Monsanto)
16:35–17:15 Thursday, 24 May 2018

Who is this presentation for?

Data Managers, Data Cloud Engineers and Architects

Prerequisite knowledge

- AWS cloud and services
- Data management
- Metadata management
- Understanding of APIs

What you'll learn

- A data lake can be an extremely useful platform, especially when deployed with appropriate guard rails.
- Open source technology is more mature than may be obvious at first sight.
- Modern architecture dictates that data and scalable architecture go hand in hand.

Description

Data Historian is a cloud-based data platform built entirely from open source components. It gives users the ability to efficiently ingest, process, store, and access data sets. The platform's inception dates back about 16 months, when the business made a simple request: a tool for moving files that reside on local computer drives and file shares into a central repository. Besides a user-friendly file ingestion interface, the tool also gathered metadata, both through user input and through automatic parsing of files. Finally, the uploaded content was immediately made available via an API. From those humble beginnings, fast forward a year, and that simple file ingestion tool has turned into a full-blown, well-managed data lake that is continuously being enhanced with new features.
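The metadata step described above, which combines user input with automatic parsing of an uploaded file, can be sketched roughly as follows. This is a minimal illustration and not Monsanto's implementation: the function name `collect_metadata`, the field names, and the choice of CSV as the file format are all assumptions made for the example.

```python
import csv
import io
from datetime import datetime, timezone


def collect_metadata(filename, content, user_fields):
    """Merge user-supplied metadata with fields parsed from CSV text.

    Hypothetical helper: the real platform's metadata schema is not
    described in the abstract.
    """
    reader = csv.reader(io.StringIO(content))
    header = next(reader, [])           # first row is treated as column names
    row_count = sum(1 for _ in reader)  # remaining rows are data rows
    parsed = {
        "filename": filename,
        "columns": header,
        "row_count": row_count,
        "ingested_at": datetime.now(timezone.utc).isoformat(),
    }
    # User-entered fields are added alongside the automatically parsed ones.
    return {**parsed, **user_fields}


record = collect_metadata(
    "yield.csv",
    "field_id,crop,yield\n1,corn,180\n2,soy,55\n",
    {"owner": "agronomy-team", "description": "Sample yield data"},
)
```

A record like this could then be indexed so that the data set is searchable as soon as the upload completes.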

Data Historian provides batch, streaming, and API-based ingestion in addition to simple file ingestion. Metadata is collected at the time of ingestion, making data sets immediately searchable in other tools such as Monsanto’s enterprise metadata management system and its enterprise data catalog. Data in Data Historian can be accessed via an API or through SQL queries. Security on data sets is controlled through an existing entitlement workflow based on virtual directory services. Even though the system is relatively young, it is already used by several predictive models, which query data out of Data Historian using the access API. In addition, descriptive analytics have been enabled via ODBC/JDBC connectivity, which lets traditional BI tools interact with the data sets directly and thus increases the utility of the platform.

Like other data lake platforms, Data Historian has numerous other features, such as scheduling and monitoring data loads, archiving data to low-cost storage, automated data deletion based on company data retention policies, and capturing and reporting platform adoption metrics, to name a few. The platform has been built using open source software, including Hadoop and AWS EMR as a processing engine, Sqoop for batch data loads, Oozie for scheduling, Hive and Presto for query processing, Lambda for event triggering, and S3, Glacier, RDS, and DynamoDB for data storage. The platform is also fully integrated with AKAN and VDS (virtual directory service) and utilizes the OAuth 2.0 security model.
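The archiving and retention-based deletion mentioned above amounts to sorting data sets by age against policy thresholds. The sketch below shows one way that decision could look. The function name `plan_retention`, the data set names, and the thresholds (365 days to archive, 2,555 days to delete) are hypothetical. The abstract does not state the actual policies or how the platform applies them.

```python
from datetime import date


def plan_retention(datasets, today, archive_after_days, delete_after_days):
    """Sort data sets into keep / archive / delete buckets by age in days.

    `datasets` maps a data set name to the date it was loaded.
    """
    plan = {"keep": [], "archive": [], "delete": []}
    for name, loaded_on in datasets.items():
        age_days = (today - loaded_on).days
        if age_days >= delete_after_days:
            plan["delete"].append(name)    # past the retention limit
        elif age_days >= archive_after_days:
            plan["archive"].append(name)   # candidate for low-cost storage
        else:
            plan["keep"].append(name)
    return plan


plan = plan_retention(
    {
        "recent_load": date(2018, 5, 1),
        "last_year_load": date(2017, 5, 1),
        "legacy_load": date(2010, 1, 1),
    },
    today=date(2018, 5, 24),
    archive_after_days=365,
    delete_after_days=2555,
)
```

In a deployment like the one described, the "archive" bucket would presumably map to a move into low-cost storage such as Glacier, but that mapping is an assumption here.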

In this talk, Naghman Waheed and Brian Arnold explain how Monsanto built this platform, focusing on the technical design and the various phases of the system build. They cover the technical architecture, share insights into why the team chose certain open source components to build the platform, and discuss lessons learned along the way. They also highlight the value derived from the new platform through examples of how the system is being used to provide analytics on top of the data sets loaded to date.

Naghman Waheed

Monsanto

Naghman Waheed leads the Data Platforms team at Monsanto and is responsible for defining and establishing enterprise architecture and direction for data platforms. Naghman is an experienced IT professional with over 25 years of work devoted to the delivery of data solutions spanning numerous business functions, including supply chain, manufacturing, order-to-cash, finance, and procurement. Throughout his 20+-year career at Monsanto, Naghman has held a variety of positions in the data space, ranging from designing several large-scale data warehouses to defining a data strategy for the company and leading various data teams. His broad range of experience includes managing global IT data projects, establishing enterprise information architecture functions, defining enterprise architecture for SAP systems, and creating numerous information delivery solutions. Naghman holds a BA in computer science from Knox College, a BS in electrical engineering from Washington University, an MS in electrical engineering and computer science from the University of Illinois, and an MBA and a master’s degree in information management, both from Washington University.

Brian Arnold

Monsanto

Brian Arnold is the lead architect for the Data Historian platform at Monsanto and is responsible for guiding the platform's technical direction and implementation. Brian has 10 years of experience as an IT professional, working on a large-scale ecommerce website and implementing various big data applications. Drawing on numerous big data and cloud technologies, he has built recommendation system platforms as well as enterprise data lakes. While at Monsanto, he assisted in implementing the company's enterprise Kafka platform and is now focused on leading a team of engineers to build Data Historian. Brian is passionate about big data technologies, the cloud, data science, and functional programming. Brian holds a BS in computer engineering with a minor in mathematics from Marquette University.
