Sep 23–26, 2019

Finding your needle in a haystack

Naghman Waheed (Bayer Crop Science), John Cooper (Bayer)
5:25pm–6:05pm Wednesday, September 25, 2019
Location: 1A 23/24
Average rating: *****
(5.00, 2 ratings)

Who is this presentation for?

  • Managers, cloud architects, cloud engineers, and data stewards




The need for a robust metadata and knowledge management system was a long-standing gap in the data environment of Bayer’s Crop Science division. As new systems were introduced and others retired, the complexity of the data ecosystem increased significantly. Finding datasets and processes, and understanding their nature and meaning, was often difficult and limited to a select few. To remedy the situation, the data platform architecture and engineering team set out to create a scalable metadata and knowledge platform named Haystack. The result is an easy-to-use system that’s now used across the globe to collect technical and business metadata and to organize a business glossary for all data systems at the company.

Naghman Waheed and John Cooper dive into the technical design and build of the entire system. They explore the technical architecture, how and why Bayer chose certain open source components, and the lessons learned along the way. You’ll discover the value derived from the new platform through examples of how the system streamlines gathering metadata from business and technical users, making it simple for everyone to search for and learn about datasets that exist within Bayer.

The system was designed with several key architecture and engineering principles in mind. Instantiated in the AWS cloud and built using only open source components, the system is fully scalable for both processing and storage needs. Integration with existing key data systems and ease of information entry are among the key features of the overall design. Its key components include MediaWiki as an information storage engine, Kafka producers and consumers that move metadata in and out of Haystack, and an Elasticsearch cluster integrated with MediaWiki’s search engine. Moreover, what started as a small Slack bot to answer simple queries within Haystack has evolved into a multiplatform AI that can use machine learning and natural language processing to interpret queries and retrieve information, resulting in a unique personal experience for the end user.
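
As a rough illustration of the Kafka-to-MediaWiki flow described above, the following Python sketch turns one metadata message into the parameters of a MediaWiki Action API edit request. It is not Haystack's actual code: the message schema, the `Dataset:` title prefix, and the function names are assumptions made for this example. Only the `action=edit` parameters (`title`, `text`, `token`, `format`) come from the standard MediaWiki API.

```python
import json


def render_wikitext(record: dict) -> str:
    """Render a dataset metadata record as a simple wikitext page body.

    The record layout (name, description, columns) is a hypothetical schema.
    """
    lines = [f"== {record['name']} ==", record.get("description", ""), ""]
    lines.append("=== Columns ===")
    for col in record.get("columns", []):
        # Bold column name followed by its type, one bullet per column.
        lines.append(f"* '''{col['name']}''' ({col['type']})")
    return "\n".join(lines)


def build_edit_request(message: bytes, csrf_token: str) -> dict:
    """Map one metadata message (JSON bytes, e.g. a Kafka message value)
    to the form parameters of a MediaWiki `action=edit` API call."""
    record = json.loads(message)
    return {
        "action": "edit",
        "title": f"Dataset:{record['name']}",  # assumed naming convention
        "text": render_wikitext(record),
        "token": csrf_token,
        "format": "json",
    }


# Example message, as a consumer might receive it from a metadata topic.
example = json.dumps({
    "name": "field_trials",
    "description": "Plot-level trial results.",
    "columns": [{"name": "plot_id", "type": "string"}],
}).encode("utf-8")

params = build_edit_request(example, csrf_token="TOKEN")
```

In a real pipeline the resulting dictionary would be sent as a POST to the wiki's `api.php` endpoint by a consumer process, and MediaWiki's search integration would then index the page in Elasticsearch.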

Prerequisite knowledge

  • Familiarity with AWS Cloud and services, data management, and metadata management

What you'll learn

  • Discover how metadata and knowledge management systems can significantly aid the data stewardship function, how the agility and scalability of cloud solutions can be a competitive advantage for your business, and how using open source components can allow you to build innovative solutions

Naghman Waheed

Bayer Crop Science

Naghman Waheed is the data platforms lead at Bayer Crop Science, where he’s responsible for defining and establishing enterprise architecture and direction for data platforms. Naghman is an experienced IT professional with over 25 years of work devoted to the delivery of data solutions spanning numerous business functions, including supply chain, manufacturing, order to cash, finance, and procurement. Throughout his 20+ year career at Bayer, Naghman has held a variety of positions in the data space, ranging from designing several large-scale data warehouses to defining a data strategy for the company and leading various data teams. His broad range of experience includes managing global IT data projects, establishing enterprise information architecture functions, defining enterprise architecture for SAP systems, and creating numerous information delivery solutions. Naghman holds a BA in computer science from Knox College, a BS in electrical engineering from Washington University, an MS in electrical engineering and computer science from the University of Illinois, and an MBA and a master’s degree in information management, both from Washington University.


John Cooper


John Cooper is a technical architect at Bayer.
