12 Case Study: Building the Valuecheck
My current employer funda is the market place for selling and buying homes in the Netherlands. A long standing company wish was to use the data of recent house sales for predicting the current value of all houses in the Netherlands. We knew that home owners used the asking prices of neighboring houses published on funda to keep track of local market trends. We wanted to facilitate them to translate the recent sale prices in an official estimate of their specific house. They no longer had to look at the houses that are offered, our statistical model had already done that. Moreover, they did no longer had to make the translation of offered houses to their own house informally, the model determined which characteristics of a house mattered and which did not. A final advantage was that we could use of the selling prices instead of the asking prices, these are not shown on the website. So the product reflected what their house could be sold for, instead of what the typical asking price of their home was.
To create what eventually would become the Valuecheck, a colleague data scientist and I joined an existing Scrum team. This team comprised of two back end developers, a front end developer, a UX designer, and a product owner.
12.1 Trying Scrum
The team had a lot of experience with Scrum and had all the workflows in place, so it made sense to try to fit our tasks into this framework. At first this worked out quite well, because the first set of tasks we had to complete were essentially software tasks. Setting up a server, build a first query so we had a modelling set, splitting into train and test, and doing some data cleaning. These tasks were well scopeable, we could estimate the time we needed to complete them quite accurately and they had a clear definition of done. Then the model building started and we got more and more trouble fitting the tasks into the tight Scrum methodology. We could not tell what the model would like in two weeks time, it depended on the relationships we would find in the data. We certainly could not give estimates for what the model quality would be then.
12.2 Informing the Business
Our product owner informed management about the progress of both the product and the model. In consultation with them he decided how we would roll out the product. We had to provide him with the information required to make such a decision. A Shiny dashboard appeared to be the way to go. In this dashboard we could show the basic model performance, reflected in an agreed-upon statistic. Moreover, the regional performance was shown on a map, making it clear where the model was performing well and where it was doing poorly. After a model update, the data frame with cross-validated scores underlying the dashboard was replaced to show the new situation.
12.3 Moving to Kanban
Having scopeable tasks is essential for building proper Scrum sprints. As a team you have to commit to what you are going to complete in the upcoming two weeks. No longer being able to do that, we could not really be part of the Scrum rhythm anymore. We found the alternative in moving the data science tasks to a separate Kanban board, stepping out of the Scrum cycles. The circular nature of data science, as discussed in Chapter 5, does not lend itself well for tight planning. We started with a Kanban board with six lanes to do - test hypothesis - code review hypothesis - update model - code review update model - done. This reflected a full cycle of researching a hypothesis and updating the software. It worked quite well, however, a portion of the tasks did not reach the finish line, because the hypothesis tested appeared not to improve the model. This problem does not exist if the research part and the software part are split in separate tasks, the software task is only created if the research indicated the model could be improved. We never did this during this project, but this insight improved our workflow in subsequent projects.
12.4 Building an MVM for the MVP
Building a predictive model that is part of a dedicated product is both challenging and rewarding. Too often data science projects are initiated as a proof of concept, without a clear vision on how to implement if the prediction can be successfully done. Knowing from the start that the model is going to be used is very motivating. On the other hand, this means that you need constant alignment with the team that develops the product around the predictions. The houses offered for sale on funda have many characteristics filed, giving us a rich feature set to work with. However, as an MVP we wanted to present the users with an estimation, without them having to fill in all kinds of characteristics of their house. Developing a product from static house predictions is far less complex and time consuming than from a dynamic model with adjustable inputs, both from modelling and a software perspective. This implied that we could only use features that are freely available for every house in the Netherlands. Luckily, this was true for the two most important features, location and time. Also the surface area of the houses were available in a public database. From this we started to build our initial prediction models. First using a simple regression model to create a baseline. We have a preference for statistical models over machine learning algorithms, because they not only give us predictions, but also insight. However, we needed some decent predictions fast, and it was clear we needed to exploit some nonlinear relationships. We therefore used ensemble methods that gave superior predictions over regression models.
Already it was decided, we would only release the MVP in geographical areas in which the MVM performed well enough. This is called a soft launch, release the product without giving it too much noise for a selected group. Even then we did not quite make the minimal performance goals we set ourselves. However, we could include a categorical feature and simply export the predictions for every level of the feature for every house, as long as there were not too many levels. The type of the house (apartment, one of several Dutch types of houses) appeared another crucial predictor. Finally, we wanted to show lower and upper bounds to a prediction, not only giving a point estimate. After some research we were able to this with random forests, that were trained on the desired quantiles. Predictions were exported in csv files, a front end and back end were built around these. Doing a prediction on the website was just a simple look-up.
12.5 Improving the Product
From the start users could provide us with feedback, using a simple thumbs up, thumbs down and if they wished subsequent comments. Of course you want your work to be liked, but as I quickly learned, in this stage the best feedback is negative feedback. You know the product is barely good enough at this point in time, both from a software and a data science perspective. Negative feedback indicates people care about the product and it bothers them it does not fully meet their expectations. Moreover the feedback can point to directions that gives the biggest satisfaction jump when improved. If it appeared that the users did not care about the product in the first place, the project could be killed and little resource was wasted on it. Fortunately, users did care, so we went ahead and started improving.
Using an interactive model instead of the static MVM would improve the product in several ways. It enabled us to use features in the model that were not publicly available, the user could enter them. Also, we knew that the data on house surface area was not of consistent quality. Comparing them to the “real” data in our database for houses that were placed on our website indicated that they could be off in both directions. In fact, for the MVM we used a correction to predict the “real” surface area based on the public data. When the product changed to interactive, the user could correct the prefilled information if necessary. Finally, it improved transparency and user experience. Having an interactive product meant we had to bring the model to the product, not just the predictions.
12.5.2 Changing the Model
From the start we saw the problem we were trying to solve would have a natural fit with Bayesian regression models. Price trends are hugely important in the housing market. With the prior - posterior structure of the Bayesian framework, we could easily update each month. Making time an implicit variable, rather than a feature we use for modelling. Secondly, we always wanted to give a lower and upper bound to the predictions, these are part of the posterior of Bayesian predictions. The only problem was, we did not have hands-on experience with Bayesian modelling, we just knew the theory behind it. After extensive discussion if we did not see another way of significantly improving the model, we decided to go for it. This was a gamble, because we needed a few weeks to refresh our knowledge and to get our feet wet with Stan. The modelling was basically placed on hold then. Fortunately, it paid off. After some experimenting we found a good model structure for our problem, improving the model especially at locations we did not do so well at before. Moreover, we could now start to add more features to the model. This improved both the accuracy of the predictions, the confidence bounds around our estimates shrunk, and improved user experience significantly.
12.6 Productionising the Interactive Model
Up to this moment in my career, productionising data science products meant building a shiny dashboard to interact with results or exporting plain text files. My background is in statistics, not software engineering, I could not tell what was required to expose a model to the millions of visitors to our website. Luckily our data engineer could help. I (my data science colleague was working on a new project) exported the posteriors of the parameters in flat files. He built a python API that took the feature values as inputs and returned the lower and upper bounds and the point estimate for the requested house. Instead of me telling him how the model scoring should be done, I built the same functionality in R. He then copied that functionality to python and added his caching, load balancing and garbage collecting magic.
Up until this point I am not sure if it is possible to create an R API that is up to the task. Sometimes it is argued by python evangelicals that you should only use python because you can do everything in one language. Doing it in R first and then in python causes double work, which is a waste. I beg to differ. First of all, the majority of the work did not have to be re-implemented, data prepping, model training and model updating are done on train data only. It is only the scoring module we implemented both in R and python, which is just a fraction of the entire R code base. Even this is not a waste, rather it served as double bookkeeping. A number of small bugs were caught because the python API did not return the exact same results as the R module.
12.7 Thank You!
Building the Valuecheck was an amazing experience in which I learned so much. About working with Agile, about productionising software and about cooperating closely with a software engineering team. It has substantially changed my views on what it means to do data science. A major thanks for goes to Daniar, Oriol, Stephan, Rick, Sander, Riccardo, and Marco. You are awesome!