Versioning Machine Learning repositories

by Rainer Kern // Apr 23, 2020

Around the globe, individuals, corporations and organizations of all sizes are building Machine Learning (ML) models to produce reliable predictors for real-world problems. Most modern methods build these models by training algorithms on data, in the hope that the training data matches the distribution of real-world data. The general idea is that, with enough data, the algorithms will be able to generalize from the cases they have seen and apply what they learned to similar, unseen scenarios. To reach good results, the usual way of working is to iterate on the composition of the training data set. There are many approaches, such as synthesizing, transforming and balancing data. To keep this article focused, we will simply take the tediousness of data manipulation in Machine Learning as given.

Data manipulation means many steps of iterative change. To record, collaborate on and explain a trained model, one needs to track these changes efficiently. In software development, this is most commonly done through distributed version control, and Git is the tool of choice for millions of developers who want to work efficiently and collaboratively.

The case for Git

Git is a version control system trusted by millions of users worldwide. It allows programmers to collaborate on their source code, track changes to it and improve the quality of their work. Git is reliable, flexible and distributed. In Git terms, distributed means that every collaborator's computer holds a copy of the repository's complete history, including all versions of all files in all branches of the project.

This approach works very well for mainly text-based projects: source files have a small footprint and, being mostly text, compress very well. The result is manageable repository sizes, fast transfers and easy handling overall.

Despite its usefulness in modern software development, Git has significant drawbacks for ML projects. Providing the complete repository history on every development node, arguably Git's biggest strength, makes hosting and transferring large volumes of data impractical or outright infeasible.
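The difference in footprint between text and binary content can be made concrete with a small experiment. The sketch below is an illustration, not a benchmark. It uses gzip, which is based on the same DEFLATE compression as the zlib library Git uses for its objects. Random bytes stand in for binary assets such as images or model weights, which is an assumption made for this example, and the generated text is far more repetitive than real source code. File names and sizes are arbitrary.

```shell
# Compare how well repetitive text and random binary data compress.
workdir="$(mktemp -d)"

# Roughly 1 MB of source-like text (far more repetitive than real code).
i=0
while [ "$i" -lt 20000 ]; do
  echo "def train(model, data): return model.fit(data)"
  i=$((i + 1))
done > "$workdir/text.txt"

# 1 MB of random bytes as a stand-in for binary data.
head -c 1000000 /dev/urandom > "$workdir/binary.bin"

# Size in bytes of each file after gzip compression.
text_size=$(gzip -c "$workdir/text.txt" | wc -c | tr -d ' ')
binary_size=$(gzip -c "$workdir/binary.bin" | wc -c | tr -d ' ')

echo "text compressed to:   $text_size bytes"
echo "binary compressed to: $binary_size bytes"
```

The text file should shrink to a small fraction of its size, while the random data stays at about 1 MB. Every stored version of such a binary file therefore adds close to its full size to the history that each collaborator has to download.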

Enter Git LFS

Thanks to the entire Git community, there is a very practical solution for ML projects. It is called Git Large File Storage, LFS for short. It stores large files outside the actual Git repository. The repository contains only a reference to each file, and the client downloads the file automatically when it is included in a checkout.
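To make the idea of a reference concrete: what Git stores in place of a tracked large file is a small text pointer with a spec version line, the SHA-256 hash of the content and its size in bytes. The sketch below rebuilds such a pointer by hand, purely for illustration. In real use the Git LFS client generates it during `git add`. The file name and content are made up, and `sha256sum` assumes GNU coreutils (on macOS, `shasum -a 256` is the usual equivalent).

```shell
# Build the text of a Git LFS pointer for a file, for illustration only.
workdir="$(mktemp -d)"

# A tiny stand-in for a large binary file (hypothetical content).
printf 'pretend these are model weights' > "$workdir/model.bin"

# The pointer records the SHA-256 of the content and its size in bytes.
oid=$(sha256sum "$workdir/model.bin" | cut -d ' ' -f 1)
size=$(wc -c < "$workdir/model.bin" | tr -d ' ')

pointer=$(printf 'version https://git-lfs.github.com/spec/v1\noid sha256:%s\nsize %s\n' "$oid" "$size")
echo "$pointer"
```

However large the real file is, the pointer stays at three short lines, which is why the repository history remains small.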

This way, only the files a developer actually needs are transferred to and stored on their machine. Git LFS works seamlessly: it hooks into the existing Git workflow and handles the separate uploading, downloading and referencing of large files transparently to the user.
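In day-to-day use, the workflow looks roughly like the sketch below, which runs in a throwaway repository. The `*.bin` pattern and the file name are placeholders chosen for this example. It assumes that both `git` and the Git LFS client are installed and skips the steps otherwise.

```shell
# Sketch of a Git LFS workflow in a scratch repository.
repo="$(mktemp -d)"
cd "$repo" || exit 1

if git lfs version >/dev/null 2>&1; then
  git init --quiet
  git lfs install --local           # set up LFS filters and hooks for this repository only
  git lfs track "*.bin"             # hypothetical pattern: handle all .bin files via LFS
  printf 'pretend these are model weights' > model.bin
  git add .gitattributes model.bin  # from here on, the usual Git commands apply
  cat .gitattributes                # the tracking rules live in this ordinary text file
  lfs_available=yes
else
  echo "git-lfs not found; install it first"
  lfs_available=no
fi
```

After this, `git commit` and `git push` work as usual, and the content of the large file is uploaded to the LFS store during the push.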

Most hosted Git providers, such as GitHub, GitLab, Bitbucket and Azure Repos, support Git LFS natively.

In our case, we use GitLab's open-source technology to host Machine Learning repositories. For your convenience, we have activated Git LFS for all ML project repositories. If needed, you can always disable it in your project settings. You can also check whether Git LFS is installed on your machine by running `git lfs version` in your terminal.

To install Git LFS for your local Git client, download it from https://git-lfs.github.com/ or use a package manager. On Ubuntu and Debian, run `sudo apt-get install git-lfs`. With Homebrew, run `brew install git-lfs`. With MacPorts, run `port install git-lfs`. On Windows, download the installer.

Further Resources

https://git-scm.com/

https://git-lfs.github.com/

https://dvc.org/