Feature engineering is a critical and time-consuming activity in the development and deployment of any modelling pipeline. The difficulty grows as data science teams incorporate new data sources at a scale far larger than they have previously handled. Furthermore, the transition to production environments is fraught with complexity, as these pipelines are exposed to the dynamic and fragile world of ongoing data feeds, data corrections, and evolving data models.
In this talk we will introduce Ivory, a new open-source, Hadoop-based data store that seeks to address these challenges. Ivory is a scalable and extensible data store for storing facts and extracting features. It is optimised specifically for the feature engineering stages of modelling pipelines, simultaneously simplifying and adding rigour to them.
This session will walk through an example of how Ivory can be used in a typical data scientist’s workflow, and then how that workflow extends to migrating pipelines into production. It will introduce the basic concepts of Ivory, such as repositories, the dictionary, its fact-based data model, and virtual features. It will also demonstrate the benefits of Ivory being an immutable data store and the unique opportunities that immutability creates.
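To make the idea of a fact-based, immutable data model more concrete, here is a minimal Python sketch. It is not Ivory's API or storage format: the `Fact` record, the `DICTIONARY` mapping, the attribute names, and the `snapshot` function are all hypothetical names chosen for illustration. The sketch assumes a fact is a value recorded for an entity and attribute at a point in time, and that a feature is extracted by taking the latest fact on or before a chosen date.

```python
from dataclasses import dataclass
from datetime import date


@dataclass(frozen=True)  # frozen: a fact is never modified once recorded
class Fact:
    entity: str
    attribute: str
    value: object
    as_of: date


# Hypothetical dictionary: the attributes the store knows about and their types.
DICTIONARY = {"account.balance": float, "account.plan": str}

# Facts are only ever appended; a correction is a new fact, not an update.
FACTS = (
    Fact("cust-1", "account.balance", 100.0, date(2014, 1, 1)),
    Fact("cust-1", "account.balance", 250.0, date(2014, 3, 1)),
    Fact("cust-1", "account.plan", "basic", date(2014, 2, 1)),
    Fact("cust-2", "account.balance", 40.0, date(2014, 2, 15)),
)


def snapshot(facts, when):
    """Return the latest value per (entity, attribute) dated on or before `when`."""
    latest = {}
    # Walk facts in date order so later facts overwrite earlier ones.
    for fact in sorted(facts, key=lambda f: f.as_of):
        if fact.as_of <= when and fact.attribute in DICTIONARY:
            latest[(fact.entity, fact.attribute)] = fact.value
    return latest


if __name__ == "__main__":
    for key, value in sorted(snapshot(FACTS, date(2014, 2, 20)).items()):
        print(key, value)
```

Because the facts are never overwritten in this sketch, calling `snapshot` with the same date always yields the same feature values, which is one way an immutable store can make feature extraction reproducible.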
Ben is a co-founder and the CTO of Ambiata, a startup focused on creating products that allow organisations to take a more scientific and automated approach to business. At Ambiata he has led the deployment of large-scale machine learning systems into enterprises in industries such as finance, telecommunications, retail, and insurance. Before Ambiata, Ben led an engineering and research team at NICTA and started the open source Scoobi project.