10 Exploratory Tasks

We have explored the aspects of creating a data product in Agile data science workflow. The second part, the exploratory part, does not need so much attention, because by definition you can do what you like in here. One aspect deserves some consideration.

10.1 (Un)Reproducible Research

In the chapter on versioning it was advocated that you should not keep old code lingering as you move on. Rather use version control systems to keep track of the changes you are making to the product. Research tasks are explored in vignettes, these are kept around as the project goes along. It is likely that in research reports code from the software product and data from the pipeline is used. These are readily available, enabling quick exploration of hypotheses. However, as you move to next versions, the software product and the data will be updated. This means that the reproducibility of almost all research reports will be lost at some point.

It is up to you and the characteristics of the project if this is considered a problem. If you want to keep all reports to be reproducible at any time, there is a considerable amount of extra work to the research tasks. Data has to be stored separately and the source code used should be frozen somewhere outside the \R folder. I prefer to accept that the reports will get out of sync with the data and the source code. The conclusions are what matter most, and the code is always around to check if they were derived properly, even if it can no longer be run. If you really have to rerun the research report to verify results, you can always checkout the commit of the report and rerun the pipeline. The source code and state of the data at that “time point” are then available. I tend to number my research scripts to the version of the data product of that moment. This will make it easy to understand the context in which the research was done at a later time point.