7 Automation
R users tend to run code by hand: press a button, see results, press some more buttons, see some more results. R is designed with interactivity in mind; it enables the user to have a direct Q&A with the data. This is an unparalleled feature of the R language, making it uniquely suitable for drawing quick insights and conclusions from data. A data product, however, should not be run by hand; it should run automatically. From start to finish, no human should be needed to obtain and expose the results. There are at least three advantages to automation.
7.0.1 Efficiency
The obvious reason data science products should be automated is that it saves a lot of time. It takes extra effort to set up an automated pipeline, but once it is in place only monitoring and maintenance are needed. If, instead, each report or model update has to be created by manually running the code, the data scientist must always schedule time to do so. This is not only a waste of time, but also a waste of focus. Once you have reached a stage where the product is of satisfactory quality and can run without interference for a while, you should be able to move on to a new project without thinking about the completed one too much. However, if the product is run manually, you have to keep spending part of your attention on the completed product. Continuously switching between projects is cognitively strenuous and not satisfying.
Not automating also creates a risk for the continuity of the product. It is not healthy to be completely dependent on the knowledge and actions of a single data scientist, because it is not guaranteed they will continue to work at the organisation for the lifetime of the product. A well-automated product can be maintained by anybody with the right skills. If the project has to be continued by a different person, it will be a lot easier to maintain when automation is in place.
7.0.2 Automation is Documentation
A data science product is software, not a random walk of code. This means it is built bottom-up: small, unit-testable functions are combined to create more complex actions. The higher-level functions can then be used together in even higher-level functions to create specific parts of the pipeline, such as cleaning the data, preparing features or training a model. It does not, and should not, matter where each of the lower- and higher-level functions is created; as soon as you load the R package into memory, all the functions are available. The functions are created to be applied in a certain order, of course. For instance, first you load the data, then you clean it, prepare it and apply an algorithm to it. By completely automating the product, it is immediately clear how everything should play together.
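To make this concrete, here is a minimal sketch of such a bottom-up structure. All function and column names (`load_sales`, `clean_sales`, `prepare_features`, `train_model`, `amount`) are hypothetical; they merely illustrate small functions being combined into higher-level steps.

```r
# Low-level, unit-testable functions (all names here are hypothetical)
load_sales <- function(path) {
  read.csv(path, stringsAsFactors = FALSE)
}

clean_sales <- function(sales) {
  sales[!is.na(sales$amount), ]
}

prepare_features <- function(sales) {
  sales$log_amount <- log(sales$amount)
  sales
}

# Higher-level function that combines the lower-level steps in order
build_training_data <- function(path) {
  sales <- load_sales(path)
  sales <- clean_sales(sales)
  prepare_features(sales)
}

# An even higher-level part of the pipeline: training the model
train_model <- function(training_data) {
  lm(log_amount ~ ., data = training_data)
}
```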
7.0.3 Reproducibility
You can’t be Agile without being completely reproducible. Results on the same input data should be the same each time the product is run, independent of when and by whom it is done. Small bugs in the code or unexpected values in the data can be compensated for on the fly when the code is run manually. This is undesirable, because it implies that the same problems will keep popping up in future runs. Even if you are the sole data scientist on the project, you might forget how you resolved them the last time. Rather, bugs and unexpected values should always be resolved systematically. This means that as soon as you hit something, you have to invest time in finding the source of the error, rather than quickly fixing it and moving on. As with the good practices discussed in the next chapter, not cutting corners will cost more time and effort in the short run, but will pay off in both time and confidence in the product quality. Automation is a great way to force yourself to be completely reproducible, because results will only be produced when the product is fault free.
7.1 How to Automate
We have discussed why we should automate, but not how to do it. You could build one meta function that comprises all the intermediate steps of the product. Calling that function sets everything in motion, capturing intermediate results and sending them on to the next step of the pipeline.
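A minimal sketch of such a meta function, reusing the hypothetical step functions from the example above:

```r
# Hypothetical meta function that runs the whole product end to end
run_pipeline <- function(path) {
  training_data <- build_training_data(path)
  model         <- train_model(training_data)
  saveRDS(model, "model.rds")  # expose the result, here simply saved to disk
  invisible(model)
}
```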
But as usual in R, someone else has already done the hard work for you. I suggest using Will Landau’s fantastic drake package, instead of figuring out how to set up the pipeline yourself. It caches the data after the completion of each step, so when you make changes to the product you only have to rerun the parts that are affected by the change. Having all the intermediate data automatically stored has the additional benefit that they are readily available when doing research for model improvements.
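As an illustration, the same hypothetical steps could be organised into a drake plan along these lines (the data path is made up for the example):

```r
library(drake)

# Declare the pipeline as a plan; drake infers the dependencies between targets
plan <- drake_plan(
  raw      = load_sales(file_in("data/sales.csv")),
  clean    = clean_sales(raw),
  features = prepare_features(clean),
  model    = train_model(features)
)

# Build (or rebuild) only the targets that are outdated
make(plan)

# Cached intermediate results can be read back at any time
readd(features)
```

Because drake tracks the dependencies between targets, a later call to `make(plan)` only rebuilds the targets whose code or upstream data has changed, while the cached intermediate results stay available for further research.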