9 Versioning

The approach in this text makes a hard cut between the data product and doing research to improve the product. The former should be treated as software, all best practices of software engineering apply to it. We have looked at automation and testing in the previous chapters, now we turn to versioning and version control. A version is a distinct snapshot of the working data product, to which a version number is assigned. Versioning makes it easier to trace the evolution of the product. In a machine learning model for instance, you can trace a performance measure as you go along. It is more than just assigning numbers, it is a clear indication that the product has been improved and that you close off a line of research.

9.1 Old Code

As you move from one version to the next, a part of the code written earlier might become obsolete. You might be tempted to keep this code around because it might be needed in future times. I would advise against that. As the project progresses the amount of unused legacy code grows, as well as the amount of used product code, because the product tends to become more sophisticated as you move on. It will become harder to figure out what part of the code is actually used, making it more difficult to reason about the product. Also, the package namespace starts to become clogged. It happened more than once to me that I had created two functions with the same name when there was some old version I was no longer aware of lingering. Happy debugging!

I think the product should only contain code what is used in the current version. This also implies that we should not do versioning within the code. This is something I have done a lot in the past and also seen with other data scientists. Whether versioned or not, you might be inclined to leave old versions of a product, such as a meta-function to call a model or to create a report, in place. Rather than versioning in your code, use the version control system. Keep your code base clean and as small as possible. If you really want to keep your code around, and don’t want to rely on version control rollbacks to retrieve old results, keep an archive folder outside the /R folder so it is not part of the namespace.

9.2 Version Control

Data scientists more than software engineers tend to work in solitude. Fully using version control systems can then be considered as overkill. In practice, they might just be used as an external backup of the code. Write some code, push it to master, write some more code, push it to master. Whether working alone or together, it is a good idea to use version control for, well, version control. (If you are not yet comfortable with using git, Jenny Bryan has your back with this extensive resource.)

9.2.1 Master is Production

To move away from the external backup use of git, you should use branches instead of only using master. It is good practice to reserve master for finished versions of the product only. In ‘normal’ software engineering a new feature is developed on a fresh branch. Once the feature is finished, tested and approved in a code review, the feature branch is merged to master. When there is continuous integration in place the merge to master is deployment, the product is automatically updated when master is.

Remember that the workflow in this text makes a distinction between research tasks and software tasks. Research tasks are performed completely outside the /R folder, they do not alter the product. Whether the tested hypothesis in the task proved to improve the product or not, you probably want to merge the vignette that explores the hypothesis to master to keep an overview of the work you did. This is fine, as long as you indeed did not touch the /R folder. This is where the product is defined and where the versioning happens. As soon as the research task did indicate the product can be improved, start a software task to adjust the product. Merging the software task branch to master is to create a new version of the product, but merging a research task branch is not.

As an example, the git graph below is a simple example of the start of a new machine learning project. (The graph is read top to bottom.) After initiation there is the first setup of the project. A first query to the source database is created and we split into train and test. We then create a simple base rate model, this is v1 of the product. Next, a research story to check if some algorithm can improve our current product is started. We do data prep and training and analyse the results. Alas, the algorithm did not improve the simple model too much, and we decide it is not worth the trouble. Still the branch of the research task is merged to master so we have an overview of what we did in the past in the vignettes folder. We try another algorithm. This one does a lot better and we decide to use it in the product instead of the simple model. First, the research task branch in which this was explored is merged to master, then we create a new branch for the software task. As soon as this work is finished and the branch is merged to master we release a new version.

9.2.2 Working Together

Because branching is used, several data scientists can work on different aspects of the model simultaneously. Creating separate branches to explore different hypotheses. It is advisable, however, to work on separate parts of the project in parallel. However, working on two or more software tasks at the same time is not advisable. This might give you headaches when you both try to merge to master, because the same files are modified. Also from a versioning perspective it is cleaner to make sequential changes. You can keep better track of the evolution of your product.

Software engineers rarely work alone. They check each others work by doing code reviews. Version control systems can even be configured such that a branch can only be merged to master when the changes are approved by at least one or two colleagues. Working together with multiple data scientists on the same project can greatly enhance the quality of the product. You can challenge each others ideas and you can look critically at each others work. Using version control properly will greatly improve your cooperation.