4 Agile Data Science
4.1 Data Science Waterfall
As discussed in the previous chapter, Agile is a response to the Waterfall methodology that was widely adopted in the eighties and nineties. Many projects using Waterfall did not deliver because of their long duration. In data science, as far as I am aware, there are no such formal and widespread methodologies. Yet there are many data science projects that do not deliver. An ineffective workflow is by no means the only cause of this, but I am convinced it is one of the main culprits. Just like software development under Waterfall, data science projects can take many months or even years before the results are in production. More often than not, the plug is pulled before that stage is reached. Data science is especially susceptible to what I call local optimisation. The data scientist might optimise endlessly to get the best predictions possible, or to have all the envisioned dashboard functionalities in place, before sharing any results. In the meantime, stakeholders might lose interest and confidence in the successful completion of the project. Moreover, with this way of working, defects in the project setup remain hidden for a long time. The code might be poorly organised, making the product unfit for the “real world”. Or there might be ambiguity about what to predict or which data sources are in scope, due to a lack of communication between stakeholders, business people and data scientists. Whatever the reason, adhering to the principles of Agile can make you more productive and efficient. Here we take the time to interpret the twelve principles in the data science context.
4.2 The Twelve Principles in the Data Science Context
- Our highest priority is to satisfy the customer through early and continuous delivery of valuable software.
Just as Waterfall prescribes a complete and fault-free product delivered at once, data scientists might be inclined to only release a machine learning model to production once they are confident its predictions are spot on. This principle was a revolutionary break from Waterfall: you should not wait to release software until it is perfect. Instead, get it out in the open when it is just good enough. This is called the MVP (Minimal Viable Product). After the MVP is released, the user interaction with it is closely monitored, to establish where the biggest room for improvement lies. The team tackles the biggest current problem, creating a slightly improved version, which is again released right away. This cycle of release, monitor, improve, release is repeated many times, so that the product gets better and better. There is no clear definition of done; instead, there is ongoing debate about possible improvements and whether the corresponding investments are worthwhile.
The machine learning equivalent of this would be a Minimal Viable Model, a model that is just good enough to put into action. Other data science projects, such as reporting, ETL and apps, might have more software elements in them and less data uncertainty. For these, an MVP can be defined. Releasing something that is barely good enough might feel scary and counterintuitive, given the high standards you hold yourself to, but it is preferable to optimising at length before releasing, for at least the following reasons:
- It will keep stakeholders excited. Managers and users of the model who commissioned the data science project are eager to see results. As the project drags on without any output, they are likely to lose interest and confidence that the project will end well. Eventually they might pull the plug or put the project on hold before anything meaningful has come out. If they can interact with the results soon, even if those are far from perfect, their enthusiasm and confidence will remain high.
- You will fail fast. There is a wide array of reasons a data science project might fail: the business problem appears not to be translatable into a data science problem, data quality is low, there is not enough historical data, or the necessary relationships simply don’t exist. If there are such problems, you will not be able to create an MVM. This is a clear sign the project is probably not going to be successful, and you can consider stopping it early.
- You will get feedback sooner. Let’s say you build a churn model which the sales department uses for customer retention. You agree with stakeholders what the model should look like and start to build an MVM. As soon as they start using the MVM, they find out that the prediction interval is too short: some customers already have a contract with a competing party at the moment of intervention. Instead of further optimising this model, you first change the model structure so that it predicts further ahead. Or the user of a Shiny app realises, as soon as she sees the app with the first variable implemented, that the ratio you agreed on is not what she wanted. Before including the twelve other variables, you first fix the ratio so that it is what she wants.
What the MVM looks like is, of course, project-dependent, but it might make sense to define it in terms of regular model quality statistics, such as recall or mean absolute error. In some situations there is a natural baseline that the MVM should outperform, such as a business rule that the model is meant to replace. Data science products that are not model-based might be captured with an MVP instead of an MVM. For both an MVM and an MVP, there is the possibility of performing a targeted release, i.e. releasing the product only to a subset of your target audience, such as a particular geographical area or users of a certain age.
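To make this concrete, below is a minimal sketch in R of such a baseline comparison, on simulated stand-in data. The column names, the 0.5 cutoff and the choice of recall are illustrative assumptions, not part of any particular project.

```r
# Hypothetical check: is the candidate model good enough to be an MVM?
# `scored` is simulated stand-in data; in practice this would be a held-out set.
set.seed(1)
scored <- data.frame(
  churned    = rbinom(1000, 1, 0.2),   # observed outcome (1 = churned)
  model_prob = runif(1000),            # churn probability from the candidate model
  rule_flag  = rbinom(1000, 1, 0.3)    # 1 if the current business rule flags the customer
)

# Recall: the share of actual churners that were flagged
recall <- function(actual, flagged) {
  sum(actual == 1 & flagged == 1) / sum(actual == 1)
}

model_flag <- as.integer(scored$model_prob > 0.5)  # illustrative cutoff

recall(scored$churned, model_flag)        # candidate model
recall(scored$churned, scored$rule_flag)  # business-rule baseline: the bar the MVM must clear
```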
- Welcome changing requirements, even late in development. Agile processes harness change for the customer’s competitive advantage.
This principle comes naturally to data science, since the success of a project typically depends on the data relationships discovered during the project. The Waterfall approach, in which every step of the project is planned upfront, would be hideous even to the biggest lover of process in the data science context. Keep in mind that flexibility should not only be exercised towards the assumptions about your data or the models and algorithms you use. Changing requirements can also concern the framing of the business problem or the way the results are exposed. The very reason we release early is that customers can put the product to the test and, through interaction, discover what they really want. Developing the product is a discovery for them as much as it is for you.
- Deliver working software frequently, from a couple of weeks to a couple of months, with a preference to the shorter timescale.
Whereas the first principle is about the philosophy of early deployment and iteration, this one is about the frequency of deploying updates. The Scrum framework is very strict about the amount of time that can be spent until the next release. The team commits itself to making certain changes to the product in a fixed period, typically two weeks. At the end of this period the improvements, small as they might be, are deployed. The Scrum mindset is not fully applicable to data science, especially when there is a lot of data uncertainty, as we will explore in the next chapter. However, it is good to keep in mind that every improvement to the model should be deployed as soon as it is ready. This creates momentum and excitement among customers, stakeholders, your teammates and yourself.
- Business people and developers must work together daily throughout the project.
Too often, data scientists operate in isolation, especially when projects are of a more explorative nature. Having no connection to the business makes it unlikely your work will ever affect anything outside your computer. Stakeholder management can be done by a representative of the technical team, such as a product owner. If there is no such role on the team, stakeholder management is the responsibility of the data scientist. A second way the business should be involved is in providing information about the underlying processes. Unless the data scientist has extensive knowledge of the business, she needs somebody to validate the generated results and answer questions about surprising findings. Finally, you might need to involve a person with technical knowledge of the data, typically a DBA, who can explain the structure and peculiarities of the data you work with.
- Build projects around motivated individuals. Give them the environment and support they need, and trust them to get the job done.
This principle is the antithesis of Waterfall in its purest form. Instead of meticulously describing how the job should be done, just set the goals of the project and leave it up to the team to figure out how these goals should be attained. Data scientists typically already enjoy this type of freedom, for the sheer reason that stakeholders don’t really understand how the products are built. Still, business people can become overly involved in the process: they can have strong opinions on which predictors to use, how the target should be defined, or what the best way is to visualise something in a dashboard. Take their advice to heart, but also trust your instincts. You know about overfitting, multicollinearity, non-converging algorithms, the theory behind data visualisation and many other topics the business has no intuition for. If you think a different approach will yield better results, rely on your expertise and take the time to explain why (in layman’s terms, of course).
- The most efficient and effective method of conveying information to and within a development team is face-to-face conversation.
A data science project is rarely done end-to-end by one single person. Data might be made available by a DBA, a back-end developer might expose the product on the website, a front-end developer builds the interface for interacting with the predictions, etc. Where possible, work with these people directly; it will speed up decision making and improve alignment. Communication by email or chat programs is often slow. Make an effort to be in the same room as your direct colleagues, for at least a part of the project time.
- Working software is the primary measure of progress.
If your product has not left your system, you have not attained any results yet. As long as your Shiny app only runs on your laptop, it does not exist. As long as the machine learning model’s predictions are not reaching anyone, it does not exist. You have not added any value as long as your product is not “out there”. Only when the update is fully implemented and the predictions are ready to be consumed by the business has there been true improvement. Only when you have shipped your MVM to the company server and Sales is basing decisions on it have you delivered something. Sometimes a model improvement reported in research scripts does not hold when it is implemented in the full model pipeline: the research was done on just a subset of the data that was conveniently available, or the new feature was tested in isolation, without yet a sense of its multicollinearity with the features already in the model. There is only one true measure of how well we are currently doing, and it is what is currently exposed to others as the product.
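One cheap guard against this is to only trust numbers produced by the full pipeline on the full data. The sketch below illustrates the idea; fit_pipeline(), the mtcars data and the AIC comparison are stand-in assumptions for your actual pipeline code, data and evaluation criterion.

```r
# Evaluate a candidate feature inside the same pipeline that produces the
# production model, on the full data, rather than in an isolated research script.
fit_pipeline <- function(formula, data) {
  # Stand-in for your real preprocessing + model code
  glm(formula, data = data, family = binomial())
}

full_data <- mtcars

current   <- fit_pipeline(am ~ mpg,      full_data)  # the model as it runs today
candidate <- fit_pipeline(am ~ mpg + wt, full_data)  # the model with the new feature

# Compare on a single honest criterion (AIC here; held-out error in practice)
AIC(candidate) < AIC(current)
```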
- Agile processes promote sustainable development. The sponsors, developers, and users should be able to maintain a constant pace indefinitely.
The deadline way of doing data science: the stakeholders and the data scientist meet and agree to have a first version of the project ready at some future moment. The sponsors forget about the project until right before the deadline, busy with attending meetings and writing memos. The data scientist goes to work; with ample time before the deadline, there are many things that can be explored. The result is an array of research scripts and intermediate results. Then, as the deadline comes near, all this separate research must somehow come together. Pulling an all-nighter, the data scientist manages to deliver a result, which is presented to the sponsors. The project is then continued, a new deadline is set, and the cycle starts over.
Don’t do deadlines!
They are a recipe for hastily created, non-reproducible results. They promote a workflow of taking it easy at first, stressing out as the deadline comes near, and exhaustion after it has passed. Instead, set small goals that are quick to attain, update the product if the results are favourable, then set a new small goal. This will result in better quality code, reproducibility and a happier team. Moreover, it will result in a product that is constantly improved, which excites sponsors and users.
- Continuous attention to technical excellence and good design enhances agility.
Most data scientists have much to learn from software engineers when it comes to standards and rigour. What makes data science unique is that a part of the code is written for personal use. It is not meant to ship with the final product; it will never leave the data scientist’s system. Still, it is very much part of the project. Cleaning the data, splitting it into train and validation sets, running the algorithms that produce the models, researching relationships in the data and many more steps are often for the data scientist’s eyes only. It is tempting to cut corners when you are the sole user of your own code. Why go to the trouble of writing unit tests and documentation for your functions?
Because continuously improving your product in short-cycled deliveries is only feasible if the entire project is of high quality. If you expand a poorly designed product, it is much more likely that the whole thing comes tumbling down as soon as you start to make adjustments. This topic will receive a lot of attention in the second part of this text, because you simply cannot be Agile if your code is of poor quality.
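To give a flavour of what this looks like for “personal” code, here is a minimal sketch using the testthat package; the clean_revenue() helper and its test are hypothetical examples, not taken from a real project.

```r
library(testthat)

# A "personal" helper from a research script: strip currency formatting and
# return numeric revenue. Documented and tested, even though it never ships.
clean_revenue <- function(x) {
  as.numeric(gsub("[^0-9.]", "", x))
}

test_that("clean_revenue handles currency symbols and plain numbers", {
  expect_equal(clean_revenue("$1,200.50"), 1200.50)
  expect_equal(clean_revenue("300"), 300)
})
```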
- Simplicity–the art of maximizing the amount of work not done–is essential.
A machine learning project’s goal is often straightforward: predict y as well as you can, so that some business goal can be achieved. Unlike in software development, there is not much choice in which features should and should not be included in the final product (features as in characteristics, not as in predictors). The options for how to arrive at predicting y, however, are abundant. The biggest challenge is often “what should I explore next?”. Should we explore another database, in which we might find new predictors, or should we try a different model on the current predictors, which involves some additional preprocessing of the data?
We can roughly estimate how much work it would be to explore each option. It is, however, very hard to predict how much value each would add. A good rule of thumb: when in doubt, choose the option with the fewest unknown components. Choose an algorithm you know well over one you have never used in practice. Only tap into an unknown data source if you are convinced that the options on the well-known database are exhausted. Data science is a field of rapid developments, and it is often tempting to seize the opportunity to dive into a new technique or algorithm. Be critical before doing so: is there really no way to obtain similar results with something already familiar to you?
- The best architectures, requirements, and designs emerge from self-organizing teams.
This is another principle that is a clear antidote to Waterfall. Instead of meticulously planning every aspect of the project upfront, let the developers come up with the most important project designs as they go. It is impossible to foresee all the aspects of a software project before implementing it, so trying to come up with them before writing code guarantees going back and forth between the planning and implementation stages. Due to the iterative nature of building data science products and the uncertainty we face upfront, this principle comes quite naturally in our context.
- At regular intervals, the team reflects on how to become more effective, then tunes and adjusts its behavior accordingly.
Every development process has its inefficiencies, whether they are unclear goals, poor communication or misplaced priorities. Reflecting on the process forces you to look critically at all aspects of the project. Inefficiencies quickly become project “features” once they have been around for a while; the sooner they are tackled, the better.
Even when your team does not follow an official methodology such as Scrum or Kanban, plan regular reflection meetings. And even if you are the only data scientist, or the only developer, on the team, address your technical issues in these meetings as well. Maybe you have wanted to refactor a certain part of the project for a while but are unsure whether it is worth the time. Business colleagues might not understand the technical aspects of the problem, but they can still challenge you on your choices.
4.3 In Summary
After this lengthy point-by-point exploration, let me conclude with what I think is the essence of Agile Data Science:
Deploy products as early as possible; don’t lose yourself in endless exploration. Update the product continuously, instead of saving everything until some deadline is near. Hold high standards for all the code in the project, even if it doesn’t get shipped along with the product. Have stakeholders, colleagues and customers closely involved at each stage; don’t retreat into isolation.