Many companies now use machine learning, and ML teams are growing along with the complexity of ML projects. In this environment, establishing a well-defined and manageable process has become a central issue. Versioning ML models and datasets is an essential first step toward such a process.
Although source code versioning tools are mature and software engineering best practices are well defined, these tools and practices don’t fit well into the ML workflow. For reproducibility, ML requires managing models and large dataset files and tying them to the code that produced them, a task that traditional tools like Git handle poorly.
Dmitry Petrov and Ivan Shcheklein explore open source tools for ML model and dataset versioning, from traditional Git to tools like Git-LFS and Git-annex and the ML project-specific tool Data Version Control (DVC.org).
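To make the problem concrete, here is a minimal sketch of one general idea behind this family of tools: keep the large file out of the repository and commit a small pointer file that records its content hash. This illustrates the concept only. The JSON layout, the `.pointer.json` suffix, and the function names `file_digest`, `write_pointer`, and `is_unchanged` are hypothetical and are not the on-disk format or API of Git-LFS, Git-annex, or DVC.

```python
import hashlib
import json
import tempfile
from pathlib import Path


def file_digest(path: Path, chunk_size: int = 1 << 20) -> str:
    """Return the SHA-256 hex digest of a file, reading it in chunks."""
    h = hashlib.sha256()
    with path.open("rb") as f:
        for chunk in iter(lambda: f.read(chunk_size), b""):
            h.update(chunk)
    return h.hexdigest()


def write_pointer(data_path: Path) -> Path:
    """Write a small pointer file next to data_path (hypothetical format).

    The pointer is tiny and text-based, so it can be committed to Git
    alongside the code, while the data file itself is stored elsewhere.
    """
    pointer = {
        "path": data_path.name,
        "sha256": file_digest(data_path),
        "size": data_path.stat().st_size,
    }
    pointer_path = data_path.with_name(data_path.name + ".pointer.json")
    pointer_path.write_text(json.dumps(pointer, indent=2) + "\n")
    return pointer_path


def is_unchanged(data_path: Path, pointer_path: Path) -> bool:
    """Check whether the data file still matches the recorded hash."""
    pointer = json.loads(pointer_path.read_text())
    return pointer["sha256"] == file_digest(data_path)


if __name__ == "__main__":
    # Demo on a throwaway file; a real dataset would be far larger.
    with tempfile.TemporaryDirectory() as tmp:
        data = Path(tmp) / "train.csv"
        data.write_bytes(b"a,b\n1,2\n")
        ptr = write_pointer(data)
        print(ptr.read_text())
        print("unchanged:", is_unchanged(data, ptr))
```

Because the pointer changes whenever the data changes, a code commit that includes it identifies exactly which version of the dataset or model it was run against.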
Dmitry Petrov is a creator of the open-source tool Data Version Control (DVC.org), a building block for MLOps infrastructure.
Dmitry is a former data scientist at Microsoft with a Ph.D. in Computer Science. Today, he is based in San Francisco working on tools for machine learning and data versioning as a Co-Founder and CEO of Iterative.AI.
Ivan Shcheklein is cofounder and CTO at Iterative AI, where he’s working on tools for data scientists. Previously, he was team lead for the open source project Sedna.org and cofounded the Tweeted Times (acquired by Yandex in 2011). He holds an MS in CS.
©2019, O'Reilly Media, Inc.