A bullet point overview of this text:
- Agile is a reaction to Waterfall, the standard for doing software development up until the nineties.
- Waterfall is often slow, projects can take years to finish.
- Agile wants to deliver software continuously, then improve it constantly, instead of one big delivery.
- The Agile Manifesto is a set of four values and twelve principles.
- Workflows are serving the Agile philosophy and are not holy, change them if they don’t serve you.
- Scrum works with sprints, in which the team commits to a set of tasks to complete in that time.
- Scrum tasks are formulated in user stories that make clear how the completion of it adds value.
- In Kanban tasks are gathered on a backlog, the most important task is done next.
- Central metric is the cycle time, how long does it last on average to complete a task.
- Scrum is strict in its commitments and ceremonies, Kanban is flexible.
- Ineffective workflows are one of the causes why many data science projects are currently failing.
- Early delivery of a minimal product or model will keep stakeholders excited, will make you fail fast and will get you feedback sooner.
- You should work closely with business people, if you want to work effectively.
- Every possible improvement to the product should be deployed right away.
- Improvements only count when they have left your system and others can benefit from them.
- Adopt the best practices from software engineering to your data project to assure high-quality code.
- Continuously reflect on the effectiveness of the workflow you are following.
- Data science is different from regular software engineering, because it needs to exploit relationships yet to be discovered in the data.
- This makes it unfeasible to give a tight planning on what will change to the product, as is done in Scrum.
- A two-way workflow is proposed in which a hard split is made between the product and exploratory research.
- Tasks are added to a Kanban board and are limited on doing exploration or changing the product.
- For exploration it might be a good idea to scope tasks as the amount of time a data scientist can spend on it.
- If there is a need for exploring the subject more after it, create a new task for it.
- Working Agile is what matters, not adhering to a specific process. If something does not work for your product, change it immediately.
- The R package structure is very suitable to organise a two-way workflow with exploration and product.
- Create low level and meta functions in the package, apply in a pipeline.
- Use Rmarkdown in the Vignettes folder to do data explorations.
- The data product should be automated as much as possible.
- Automation saves time and energy in the long run, because the data scientist does not have to rerun the same thing over and over.
- Automation serves as documentation, because it makes clear how the product pieces play together.
- Automation assures reproducibility, if it is not reproducible it will not run.
- R has the fantastic
drakepackage for automation.
- Code should fail fast and informatively if you want to be able to continuously deliver.
- Because R is weakly typed it is a good idea to implement type checking at the start of functions.
- If your function depends on the presence of certain columns in a data frame, check for it at the start of the function.
- It is also a good idea to check all the implicit assumptions you have to the possible values of your data.
- The short term investments for unit testing pay off in the long run.
- Unit tests capture all the assumptions the programmer had at the moment of writing the code. They are an externalised memory of all the cases your code should handle.
- Because unit testing will slow you down and have you think through all possible situations, it will make you write better code right away.
- Use versioning for the data product, increment the version number each time the product is meaningfully changed.
- Delete code that is no longer part of the product, use version control for versioning instead of your code base.
- Use branches to change the product. Get a “merge into master is pushing to production” mindset.
- Research branches should not contain changes to the product, so they can be merged to master without creating a new version.
- Using branches is essential if you work with multiple data scientists on a project at the same time.
- It will take a lot of effort to keep old research results in the vignettes reproducible as the project progresses. It is up to the data scientist and the project specifics if this is considered as a problem or not.
- You could number your research scripts to the version of the data product at that moment. You can rollback the code if you have to rerun the analysis.