That’s so Random

Overengineering in ML - business life is not a Kaggle competition

2020-10-14T16:41:00+00:00

“Overengineering is the act of designing a product to be more robust or have more features than often necessary for its intended use, or for a process to be unnecessarily complex or inefficient.” This is how the Wikipedia page on overengineering starts. It is the diligent engineer, wanting to make sure that every possible feature is incorporated in the product, who creates an overengineered product. We find overengineering in real-world products as well as in software. It is a relevant concept in data science as well. First of all, because software engineering is very much a part of data science. We should be careful not to create dashboards, reports and other products that are too complex and contain more information than the user can stomach. But maybe there is a second, more subtle lesson in overengineering for data scientists. We might create machine learning models that predict too well. Sounds funny? Let me explain what I mean by it.

In machine learning, theoretically at least, there is an optimal model given the available data in the train set. It is the one that gives the best predictions on new data, the one that has just the right level of complexity. It is not so simple that it misses predictive relationships between features and target (aka it is not underfitting), but it is also not so complex that it incorporates random noise in the train set (aka it is not overfitting). The gold standard within machine learning is to hold out a part of the train set to represent new data, to gauge where on the bias-variance continuum the predictor is. Either by using a test set, by using cross-validation, or, ideally, by using both.
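To make the idea of gauging complexity on held-out data concrete, here is a minimal sketch in base R. The simulated data, the polynomial models and the helper `cv_rmse()` are assumptions made for illustration only. The sketch estimates the out-of-sample error by k-fold cross-validation for models of increasing complexity.

```r
set.seed(42)

# Simulated data (an assumption for illustration): a smooth signal plus noise
n   <- 200
x   <- runif(n, -2, 2)
dat <- data.frame(x = x, y = sin(2 * x) + rnorm(n, sd = 0.3))

# k-fold cross-validated RMSE of a polynomial regression of a given degree
cv_rmse <- function(degree, data, k = 5) {
  # Randomly assign every row to one of k folds
  folds <- sample(rep(seq_len(k), length.out = nrow(data)))
  fold_rmse <- vapply(seq_len(k), function(i) {
    train <- data[folds != i, ]
    valid <- data[folds == i, ]   # held out, plays the role of new data
    fit   <- lm(y ~ poly(x, degree), data = train)
    sqrt(mean((valid$y - predict(fit, newdata = valid))^2))
  }, numeric(1))
  mean(fold_rmse)
}

degrees <- 1:10
scores  <- vapply(degrees, cv_rmse, numeric(1), data = dat)
data.frame(degree = degrees, cv_rmse = round(scores, 3))
```

Low degrees should show a high error because they underfit. Beyond some degree the held-out error stops improving, which is the signal that extra complexity no longer buys anything.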

Machine learning competitions, like the ones on Kaggle, challenge data scientists to find the model that is as close to the theoretical optimum as possible. Since different models and machine learning algorithms typically excel in different areas, oftentimes the optimal result is attained by combining them in what is called an ensemble. It is not uncommon for ML competitions to be won by multiple contestants who joined forces and combined their models into one big super model.

In the ML competition context, there is no such thing as “predicting too well”. Predicting as well as you can is the sole goal of these competitions. However, in real-world applications this is not the case, in my opinion. There the objective is (or maybe should be) creating as much business value as possible. With this goal in mind we should realize that optimizing machine learning models comes with costs. Obviously, there is the salary of the data scientist(s) involved. The closer you come to the optimal model, the harder you’ll need to scrape for improvement. Most likely, there will be diminishing returns, in terms of gained prediction accuracy, on the time spent as the project progresses.

But costs can also be in the complexity of the implementation. I don’t mean the model complexity here, but the complexity of the product as a whole. The amount of code written might increase sharply when more complex features are introduced. Or using a more involved model might require the training to run on multiple cores, or might increase the training time, say, fivefold. Making your product more complex makes it more vulnerable to bugs and more difficult to maintain in the future. Although the predictions of a more complex model might be (slightly) better, its business value might actually be lower than that of a simpler solution, because of this vulnerability.

The strange-sounding statement in the introduction of this blog, “We might create machine learning models that predict too well”, might make more sense now. Too much time and money can be invested, creating a product that is too complex and performs too well for the business needs it serves. In other words, we are overengineering the machine learning solution.

Fighting overengineering

There are at least two ways that will help you not to overengineer a machine learning product. First of all, by building a product incrementally. It is probably no surprise, coming from a proponent of working in an agile way, that I think starting small and simple is the way to go. If the predictions are not up to par with the business requirements, see where the biggest improvement can be made in the least amount of time, adding the least amount of complexity to the product. Then, assess again and start another cycle if needed, until you arrive at a solution that is just good enough for the business need. We could call this Occam’s model, the simplest possible solution that fulfills the requirements.
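The incremental cycle can be sketched in a few lines of base R. Everything named here is hypothetical: the candidate steps, the made-up error values returned by the stand-in `evaluate_step()`, and the error threshold that represents the business requirement.

```r
# Hypothetical product versions, ordered from simplest to most complex
steps <- c("baseline mean", "linear model", "gradient boosting", "stacked ensemble")

# Stand-in for building and cross-validating a version of the product.
# The error values are made up for illustration.
evaluate_step <- function(step) {
  made_up_errors <- c("baseline mean"     = 12.4,
                      "linear model"      = 8.1,
                      "gradient boosting" = 6.9,
                      "stacked ensemble"  = 6.7)
  unname(made_up_errors[step])
}

# Add complexity one step at a time and stop as soon as the requirement is met
build_incrementally <- function(steps, evaluate, max_error) {
  for (step in steps) {
    error <- evaluate(step)
    message(step, ": error = ", error)
    if (error <= max_error) return(step)
  }
  NA_character_  # requirement not met by any step
}

chosen <- build_incrementally(steps, evaluate_step, max_error = 7.0)
chosen
```

With these made-up numbers the loop stops at the third step. The slightly better ensemble is never built, because the requirement was already met.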

Secondly, by realising that the call on whether the predictions are good enough to meet business needs is a business decision, not a data science choice. If you have someone on your team who is responsible for allocation of resources, planning, etc. (PO, manager, business lead, whatever they are called), it should be predominantly their call whether there is need for further improvement. The question these people ask data scientists is too often “Is the model good enough already?”, where it should be “What is the current performance of the model?”. As a data scientist, in the midst of optimisation, you might not be the best judge of good enough. Our ideas for further optimisation and general perfectionism could cloud our judgement. Rather, we should make it our job to inform the business people as best as we can about the current performance, and leave the final call to them.

Using {drake} for Machine Learning

2020-05-14T15:00:00+00:00

A few weeks ago, Miles McBain took us on a tour through his project organisation in this blogpost. Not surprisingly, given Miles’ frequent shoutouts about the package, it is completely centered around drake. About a year ago on Twitter, he convinced me to take this package for a spin. I was immediately sold. It cured a number of pains I had had over the years in machine learning projects: storing intermediate results, reproducibility, having a single version of the truth, forgetting in which order steps should be applied, etc. Following Miles’ example, I’d like to share my drake-centered workflow. As I found out from reading his blog, there is a great deal of overlap in our workflows.

This blog post highlights the differences between my workflow and the one described by Miles, and my additions to it. It is assumed you have read his blog post. I just highlight a few of the benefits he mentioned, with which I totally agree.

drake figures out which of the targets needs to be recalculated, this saves heaps of time.
abstracting away steps in functions will make your plan a clear and concise overview of your entire workflow.
with R’s endless options to interact with other languages and platforms, the plan does not only serve as the concert master of the R part of the project, but it can direct all aspects of it start to finish.

Building machine learning models with `drake`

Where Miles’ workflow focuses on delivering insights on short notice with R markdown, most of the projects I am involved in are about the delivery of machine learning models. They typically span many months, in which the model is incrementally improved until it reaches a satisfactory level of prediction (or we fail, and the plug is pulled). As described at length in Agile Data Science with R, I have adopted a two-way approach to model development. Either I am researching how the model can be improved, or I am improving the model. You might wonder what the difference is. The former is quick-and-dirty, aimed at testing a hypothesis as quickly as possible. For the latter, there are high standards for code quality, reproducibility and building a coherent product.

In terms of building a coherent product, to me, there is a time before and a time after drake. One of the biggest struggles I had over the years was how to manage data flows through all the different stages of data fetching, wrangling, determining cases in scope, model-related preprocessing, modelling, evaluating, predicting, etc. I have adopted many systems and naming conventions, but this stress was never alleviated. Out of the box, drake takes care of this. It is immediately clear how the output of one step serves as input for the next. Even more than the cache management, the dependency management is what makes drake such a breakthrough in workflow to me. The entire project is staged around the pipeline(s); it is the heart of the product. From here, everything branches out. It is also a build check to me. As long as it runs start to finish without interference, the work is sufficiently automated.

Using R packages with `drake`

I always organize my work in an R package structure, even though I rarely actually build the package. A great benefit of this is that you always have all your functions available in memory. Within the package folder, all functions are simply loaded by devtools::load_all(), which I often follow by a utils function called settings() that will set all the necessary settings and load the dependencies. A second reason I love to work with packages is that they force you to develop every step in the form of functions. As Miles stipulated, functions are not just a means to reduce repetition, they are also great for abstracting away complexity. The combination of functions and a drake plan expresses your workflow as a narrative, without being bothered by the technical details of implementation. The drake plans I keep outside the /R folder, so they are not part of the package. They are part of scripts that run inside the package folder, so that all the code and dependencies needed are loaded by specifying devtools::load_all() at the prework argument when calling drake::make().

I adopted a drake + R packages workflow because of its potential for creating robust data science products with high reproducibility. It did not disappoint on this. However, there appeared to be a second, equally important benefit that I did not anticipate. It enables me to test hypotheses for improvements way faster than I did before. Not only are all the functions used readily available by using devtools::load_all(), all the different stages of data preparation are also stored in drake’s cache. Before drake, a significant amount of time had to be spent determining in what shape or form the data should be in order to test an idea, and subsequently getting it in that form (often by just copying-and-pasting large code chunks from different scripts). This was messy and time consuming. Now, all the data stages are neatly laid out in the plan, and we can just grab the appropriate intermediate result from the cache using drake::readd() or drake::loadd(). Moreover, after we have made the modifications to the data according to the hypothesis, we can plug the data back into the pipeline to have all subsequent steps run on it. The cross-validated model performance is always part of the pipeline, so we can quickly figure out if the changes made result in an improvement in the relevant model quality statistics.
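A minimal sketch of the workflow in this section could look as follows. The step functions, the toy data and the in-sample RMSE (standing in for a cross-validated statistic) are all assumptions made for illustration. In a real project the functions would live in the package’s /R folder, and prework would be quote(devtools::load_all()) rather than the placeholder used here.

```r
library(drake)

# Hypothetical step functions; in the package setup described above these
# would be loaded by devtools::load_all() instead of being defined inline.
fetch_data   <- function() data.frame(x = 1:20, y = 3 * (1:20) + c(-1, 1))
prepare_data <- function(raw) raw[raw$x > 2, ]
fit_model    <- function(dat) lm(y ~ x, data = dat)
model_rmse   <- function(model) sqrt(mean(residuals(model)^2))

# The plan reads as a narrative of the workflow
plan <- drake_plan(
  raw         = fetch_data(),
  prepared    = prepare_data(raw),
  model       = fit_model(prepared),
  performance = model_rmse(model)
)

# Placeholder prework so the sketch runs outside a package folder
make(plan, prework = quote(library(stats)))

# Testing an idea: grab an intermediate result from the cache, modify it,
# and run the downstream steps on the modified data.
prepared_alt    <- readd(prepared)
prepared_alt$x2 <- prepared_alt$x^2   # hypothetical extra feature
model_alt       <- lm(y ~ x + x2, data = prepared_alt)
c(current = readd(performance), idea = model_rmse(model_alt))
```

Calling make(plan) a second time without changing anything should rebuild nothing, since drake only reruns the targets whose dependencies have changed.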

Fire queries to databases and call other languages with `drake`

As mentioned before, we use drake as the concert master to manage not only the R part of the project, but all the steps start to finish. This always involves calls to databases or clusters, and oftentimes modelling tools and languages such as Stan and H2O. Something I struggled with for a while is how to ensure the sequence of the steps if no objects are returned to R. Oftentimes, we have a sequence of queries, each storing its output so the next query uses the previous one’s result. Only the last result is fetched to R. In order to execute the queries in the right order, the drake steps have to depend on an earlier target. An effective way to do this is wrapping all the steps that don’t return results to R in a function that simply returns Sys.time(). That is stored in a target, which is used as an input of a subsequent step. Not only does it tell drake on which target a step is dependent, you are also creating logging as part of your pipeline, telling you when the external code ran for the last time.
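The Sys.time() trick can be sketched as below. The functions `run_external_step()` and `fetch_result()` are hypothetical stand-ins: in a real project the first would fire a query or call an external tool, and the second would fetch the final table into R.

```r
library(drake)

# Hypothetical stand-in for a step that runs outside R and returns nothing
# useful. The `...` only accepts upstream targets, so that drake sees the
# dependency; the returned timestamp doubles as a log entry.
run_external_step <- function(label, ...) {
  message("running: ", label)
  Sys.time()
}

# Hypothetical stand-in for fetching the final result into R
fetch_result <- function(...) {
  data.frame(status = "fetched")
}

plan <- drake_plan(
  step_1_done = run_external_step("create staging table"),
  step_2_done = run_external_step("aggregate staging table", step_1_done),
  result      = fetch_result(step_2_done)
)

make(plan)

# The cached timestamps tell when each external step last ran
readd(step_1_done)
readd(step_2_done)
```

Because step_2_done mentions step_1_done in its command, drake builds it after the first step, even though no data flows between the two in R.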

A word of thanks

I cannot stop being amazed by the fantastic packages that keep being created for this language. Creating a tool that is so complex and still works so smoothly, I cannot imagine the number of hours and the sweat it took to create drake. A major thanks goes to Will Landau and all the people of rOpenSci who helped him create it. I would not have given drake a try if I had not heard Miles McBain repeatedly promote it. drake is a revolutionary package in my opinion, and it needs more people like Miles who help promote it and convince people to take the first hurdle.

Some More Thoughts on Impostering

2020-01-07T16:30:00+00:00

Two years ago, I wrote about meta-learning to fight imposter feelings. In that blog I made a distinction between impostering because you don’t feel you are up to the job, and impostering because you feel you ought to know something which you don’t. The meta-learning blog focuses on how you define yourself as a data scientist and what, as a consequence, you decide to learn (and more importantly what not). Staying sane while doing data science is something that always has my interest. Imposter feelings are a major foe to the joy this work can bring. I came across two more insights on the topic that I found worth sharing. The first is intellectual humility, which I learned about in the book Superforecasting: The Art and Science of Prediction by Philip Tetlock and Dan Gardner. The second is seeing impostering as a learning alarm, thereby turning it into something positive.

Intellectual Humility

Superforecasting is not about data science, but about forecasting single events, as typically done by intelligence agencies. I thought it was a very interesting read overall, but what I want to highlight here is its treatment of the concept of intellectual humility.

Just like the forecasters, data scientists are faced with complex problems to which there is no perfect solution. Whether doing machine learning, statistical modelling or an exploratory data analysis, we try to paint an overall picture from incomplete information. We have to look carefully at nonlinear relationships, interaction effects and weak correlations, all of which are very difficult for the human mind to conceive. Often the task ahead is daunting, and when we are not careful it can quickly induce feelings of being incapable. Tetlock discusses that the best forecasters are very aware of their limitations and the possibility that their judgement might be off. But at the same time these forecasters do not doubt that they are the person for the job:

The humility required for good judgement is not self-doubt - the sense that you are untalented, unintelligent, or unworthy. It is intellectual humility. It is the recognition that reality is profoundly complex, that seeing things clearly is a constant struggle, when it can be done at all, and that human judgement must therefore be riddled with mistakes. This is true for fools and geniuses alike. So it’s quite possible to think highly of yourself and be intellectually humble. In fact, this combination can be wonderfully fruitful. Intellectual humility compels the careful reflection necessary for good judgement; confidence in one’s abilities inspires determined action. Superforecasting, p. 228-229

I usually refrain from a You got this! approach to fight imposter feelings, because I honestly don’t know if you do. I don’t know it for a reader I have never spoken to, and frankly I often don’t know it for myself. Therefore, I typically focus on the ought to know part. However, what I realized when reading Tetlock was that being puzzled by the problem you are working on is by no means an indication that you are a phoney. Acknowledging that data science is freaking, freaking hard is not a sign of weakness, it is a sign of realism. Don’t look at data science problems as something you ought to solve on a whim. They are mysteries, and only by hard work and perseverance can you chip away some of that mystery. You might even get to insights that are useful.

Learning Alarm

Now turning to my favorite ought to know part of impostering. The feeling of shame, when someone else knows stuff that you don’t. You, the data scientist, do not know this simple thing and therefore you deserve to be fired, to be stripped of all your diplomas and to work as a store clerk for the rest of your life. Turn that feeling of shame into a learning alarm. Be excited that you discovered something new, something you can add to your knowledge stack.

Focusing on the shame part of ought-to-know will turn your attention inwards. Chances are that you are so busy giving yourself a hard time that you don’t use the opportunity to learn. It is completely useless to ruminate on whether you should have known this already. Fact is that you don’t. Instead, turn your attention outwards. If it is another person’s ability that induced the feeling, don’t shy away or, even worse, pretend that you know (yes, I have done this myself). Pick their brain! Otherwise, Google, read blogs, practice, whatever is needed to master the new material.

Thanks for Reading

I realize these two topics are a little less applicable than the earlier blog and are a little more philosophical in nature. Still, I hope you find them as practical. They serve as default responses for situations data scientists can be confronted with; being puzzled by the problem they are trying to solve and not knowing something they think they should.

The Psychology of Flame Wars

2019-06-26T15:20:00+00:00

Within the data science community we see quite some flame wars. For those who don’t know what I am talking about, there are different ways of doing data science. There are the two major languages R and Python, with their own implementations for analysing data. Then within R there are the different flavours of using the base language or applying the functions of the tidyverse. Within the tidyverse there is the dplyr package to do data wrangling, whose functionality greatly overlaps with that of the data.table package. Each of these choices is wildly contested by users of both of the options. Oftentimes, these debates are first presented as objective comparisons between two options in which the favoured option clearly stands out. This then evokes fierce responses from the other camp, and before you know it we are down to the point where we call each other’s choices not elegant or even ugly.

I hope to convince you that these debates are useless by looking into some of the underlying psychological principles that make us vulnerable to this type of quarreling.

Cognitive Ease

The main reason I think the discussion of the merits of different implementations is fruitless is that the two camps can never fully understand each other. Sure, objective comparisons can be made in computation speed and to some extent functionality. We can read from the authors and maintainers what the motivation is for implementing something in a certain way. But the true proof of the pudding remains in the eating, that is, in enabling the user to effectively put it to use. Before you can effectively and joyously apply a complex system (which all these implementations are), you need to spend countless hours sweating and swearing, reading documents and googling error messages. Day after day, hour after hour, you will fight your way into mastery of one of the systems. To be most effective and consistent, almost everybody will have a go-to system in which day-to-day tasks will be done. You don’t roll a die in the morning to decide if this is going to be an R or a Python day. You don’t switch from data.table to dplyr in the middle of an analysis without a very good reason. You stick with what you know, because it will give you the answer you are after the quickest. There is a major path-dependency here. You initially start with one of the systems, and each time you use it your understanding and appreciation grow. Because of this you will keep using the system, and so your love affair begins. Before you know it, you have a cognitive lock-in to your weapon-of-choice.

A big part of the appreciation for the system comes from your understanding of it. This understanding relieves the large cognitive strain of doing data science a little bit. In the phenomenal Thinking, Fast and Slow by Daniel Kahneman a full chapter is devoted to the topic of cognitive ease. It is shown that you get a good feeling out of things that are relatively easy to you. There is a very good evolutionary explanation for this: your brain consumes massive amounts of energy, so parsimonious use of it is rewarded by feeling great. It is your body telling you, keep doing this not so hard thing, you are going to last a long time this way. Because of the time you spent developing skills in your favourite system, looking at code from this system will give you a lot more cognitive ease than looking at code from a system you are far less familiar with. Understanding code from the unfamiliar system at the same level as the familiar one will require spending a lot of cognitive resources. This will be accompanied by negative emotions such as frustration, feeling tired, and even anger. It is then a very understandable, but also a very silly, mistake to take these emotions as an indication that the unfamiliar system is poorly implemented or even ugly. It is ignorance, not the software being bad, that caused these emotions.

From cognitive psychology we turn to social psychology. In the seventies Henri Tajfel developed the theory of social identity. Part of the way we view ourselves is determined by the groups we are part of. If a group we are part of does well by some measure, we will personally start to feel better about ourselves, even when we did not have any part in the achievement. Just look at the crowds that celebrate at a victory parade of the local sports team. They did not spend a minute on the field, they might not even have been at the stadium supporting the players, and still they experience the victory as theirs as well. Using a software system for a good part of your waking hours, day after day, will inevitably lead to that system being part of your social identity. Gradually you are not just using the software, you are becoming the software as well.

On its own this is not a bad thing; as humans we need this sense of belonging. However, it will also change your behaviour, and oftentimes not for the good. In order to boost your self-esteem you want your “team” to be on the winning end. Each year when Stack Overflow shares the rises and falls in the use of software languages, users of both R and Python jubilantly share the results (both are growing year after year). While this is an objective development in which you had no part other than being one of the users, discussions about the merits of software systems have active user involvement. Probably the easiest way to make your team look better is to make the other team look worse. Just think about the tiring debates between fans of different sports teams about which is the best. Or the endless mud throwing in political races. Flame wars are no different: by mocking the other system we celebrate our own, and we are part of the winning team.

Now, here is the good news. In sports and politics, the different parties are also objectively in a competition. They play a zero-sum game, in which one’s victory must mean the other’s defeat. We as a data science community are not in a zero-sum game, and we often seem to forget it. Even when the ‘competing’ system is on the rise you can do your job effectively and with joy in the one you prefer. Instead of mocking each other we should be thankful for the wealth of options we have to do our jobs. When our primary system does not offer the functionality we are looking for, we might find it in the other. The different systems can also influence each other positively: functionality in other systems might inspire authors to make theirs more complete.

Conclusion

I have looked into two psychological mechanisms that I think stir up flame wars. I hope it will make you think again before posting a comment on Twitter or starting a heated discussion with a colleague that leads nowhere. We have several options to do our daily jobs, and each of them has proven itself in practice. Each is used by at least tens of thousands of analysts and programmers, who use them to bring real value to real organisations. Mocking one of them is not only harmful, it is disrespectful. Only the brightest and most determined of our peers could create systems that are so complex, complete and fault free. They have committed thousands of hours, often unpaid, to serve the community because they care. Mocking their labour because you don’t properly understand the system they designed, or because you want to feel better about yourself, is ignorant, and you should refrain from it.

padr is updated

2019-06-12T12:50:00+00:00

Yesterday v.0.5.0 of the padr package hit CRAN. You will find the main changes in the thicken function, which has gained two new arguments. First of all, following an idea by Adam Stone, you can now drop the original datetime variable from the data frame by using drop = TRUE. This argument defaults to FALSE to ensure backwards compatibility. Without setting drop to TRUE the datetime variable will be returned alongside the added, thickened variable:

library(padr)
thicken(coffee, interval = "hour")

##            time_stamp amount     time_stamp_hour
## 1 2016-07-07 09:11:21   3.14 2016-07-07 09:00:00
## 2 2016-07-07 09:46:48   2.98 2016-07-07 09:00:00
## 3 2016-07-09 13:25:17   4.11 2016-07-09 13:00:00
## 4 2016-07-10 10:45:11   3.14 2016-07-10 10:00:00

Now, with drop = TRUE the original datetime variable will no longer be returned:

thicken(coffee, interval = "hour", drop = TRUE)

##   amount     time_stamp_hour
## 1   3.14 2016-07-07 09:00:00
## 2   2.98 2016-07-07 09:00:00
## 3   4.11 2016-07-09 13:00:00
## 4   3.14 2016-07-10 10:00:00

Secondly, thicken has gained the ties_to_earlier argument. By default when the rounding argument in thicken is set to “up” and the original observation is equal to a value in the higher interval variable, the observation is mapped to the next value in the new variable. (For example 2019-04-14 13:00:00 would be mapped to 2019-04-14 14:00:00 when rounding is “up” and interval is “hour”.) This can be undesired. When this argument is set to TRUE tied observations are mapped to their own value (thus to one value earlier in the new variable). For completeness this argument also works when rounding is “down”. Then, when original and new value are equal, the original value is mapped to the previous value of the higher level interval variable. (For example 2019-04-14 13:00:00 will be mapped to 2019-04-14 12:00:00 when the interval is hour). Feature request by github user stribstrib.
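The tie behaviour can be tried out with a small sketch. The two-row data frame and the column name ts_hour are made up for illustration; the first observation falls exactly on the hour, so it is a tie, while the second does not.

```r
library(padr)

# Made-up observations: one exactly on the hour (a tie) and one in between
obs <- data.frame(
  ts = as.POSIXct(c("2019-04-14 13:00:00", "2019-04-14 13:30:00"), tz = "UTC")
)

# Default behaviour when rounding up
up_default <- thicken(obs, interval = "hour", colname = "ts_hour",
                      rounding = "up")

# Tied observations are mapped to their own value instead
up_ties <- thicken(obs, interval = "hour", colname = "ts_hour",
                   rounding = "up", ties_to_earlier = TRUE)

up_default
up_ties
```

Only the first row should differ between the two results, since the second observation is not a tie and is rounded up to the next hour in both cases.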

Finally, along with some minor bug fixes, there is a major bug fix for an issue that was reported by github user levi-nagy. thicken preserves missing values; missing values in the original datetime column are also found in the newly added variable. The missing values were placed in the wrong position, however. They were placed at the original position + the number of NAs already seen in the original datetime variable, instead of at the NA position where they are supposed to be. Only the first missing value was in the correct position, all the others had an unwanted offset. This is now fixed, all the missing values are in the correct place in the thickened variable.

Predictability of Tennis Grand Slams

2019-05-26T14:00:00+00:00

The European tennis season is in full swing, with Roland Garros starting today and Wimbledon taking place in a few weeks. For a sports buff like me, it is the essence of summer (together with the Tour de France). Time to dive into some tennis data. As a follower of both the men’s and the women’s tour, it occurred to me that in the latter the tournaments are less predictable. My gut feel was that in the men’s matches the favourite wins the match more frequently than in the women’s matches. Of course gut feels are what makes the world go round, unless you are a data scientist. So let’s analyse all the matches that were played at the four Grand Slam tournaments over the past forty years.

The Seeding System

But first we need to think of a way to measure if the favourite indeed won. Who the favourite is to win a match might depend on who you ask, so how do we determine the favourite? Luckily, there is an element of objectivity in each tournament that is played. The top players of that moment according to the general rankings are seeded. The world number one is seeded first, the number two second, up until the world number 32. The seeds are placed in the schedule in such a way that the strongest players only meet in the final rounds of the tournament (if they beat the weaker ones first, of course). The numbers one and two cannot meet before the final. They cannot meet the numbers three and four before the semi-finals, and will not play against the numbers five up until eight before the quarter-finals. The higher your seed, the more weaker players you’ll meet during the tournament before you meet a stronger player. From your seed you are expected to win all the rounds until you meet a stronger player, so we could say a seeded player ought to lose in a certain round (except the number one seed, who ought to win the tournament).

Building an upset score

If seeded players lose in a round that is before the “ought-to-lose round” belonging to their seed, we can say they underperformed. A way to measure this underperformance is to award an upset point for every round a player loses earlier than their “ought-to-lose round”. The number one seed ought to win the tournament: if he or she loses the final it will result in one upset point, a loss in the semi-final yields two upset points, a loss in the quarter-final three, etcetera. For the number two seed a loss in the semi-final yields one upset point, a loss in the quarter-final two, the round before three, etcetera. In the table below all the possible upset points for the sixteen highest seeded players are displayed.

Seed	F	SF	QF	R16	R32	R64	R128
1	1	2	3	4	5	6	7
2	0	1	2	3	4	5	6
3-4	0	0	1	2	3	4	5
5-8	0	0	0	1	2	3	4
9-16	0	0	0	0	1	2	3

The total upset score of a tournament, our measure of its (un)predictability, is simply the sum of all the individual upset points. Because until 2001 there were only sixteen seeds, instead of 32, we keep using only the first sixteen seeds after that year as well. The tournament upset scores can then range from 0 (every seeded player lost in their “ought-to-lose round” and the first seed won the whole thing) to 63 (every seeded player lost in the first round).
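The table can be condensed into a small function, sketched here in base R. The helper name `upset_points_for()` and the closed-form expression are my own shorthand for the table, not part of the analysis code further down. It uses the fact that seeds 1, 2, 3-4, 5-8 and 9-16 ought to lose one round earlier each.

```r
# Rounds of a Grand Slam, from first to last
rounds <- c("R128", "R64", "R32", "R16", "QF", "SF", "F")

# Upset points for a seeded player (seeds 1 to 16) losing in a given round
upset_points_for <- function(seed, round_lost) {
  stopifnot(seed >= 1, seed <= 16, round_lost %in% rounds)
  # Position of the round in which the seed ought to lose; 8 means the
  # player ought to win the tournament (seed 1).
  ought_to_lose <- 8 - ceiling(log2(seed))
  # One point per round lost too early, never negative
  pmax(0, ought_to_lose - match(round_lost, rounds))
}

upset_points_for(1, "F")    # 1
upset_points_for(3, "QF")   # 1
upset_points_for(12, "QF")  # 0, seeds 9-16 ought to lose in the R16

# Maximum tournament score: all sixteen seeds lose in the first round
sum(upset_points_for(1:16, "R128"))  # 63
```

The last line reproduces the upper bound of 63 mentioned above: 7 + 6 + 2 × 5 + 4 × 4 + 8 × 3.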

The data

Jeff Sackmann maintains a database of all the tennis matches played on both the atp (men’s) and wta (women’s) tour. It is on github, so I simply cloned the two git repositories in which the data are maintained, using the following commands.

git clone https://github.com/JeffSackmann/tennis_atp
git clone https://github.com/JeffSackmann/tennis_wta

Resulting in two folders with a csv file for every year. Many thanks to Jeff for making our lives so easy here.

The results

In the graph below the trends are shown for the four Grand Slam tournaments, since 1980, both men (atp) and women (wta). Rather than looking at individual events, I am interested in overall trends. That is why a smoothed curve is used, rather than a line or dot plot.

Indeed the men’s events are more predictable than the women’s, but to my great surprise this is only the case for the last ten years. In the thirty years before that the top women players consistently outperformed the top men players at the big tournaments. However, with players like Federer, Nadal and Djokovic starting to dominate on every surface, the probability of seeing an upset rapidly decreased at the men’s events in recent years. Meanwhile, the women’s tour rapidly grew in unpredictability over the last decade. Many things to explore here. What happened in the late nineties on the men’s tour? Why do we see so many upsets at the women’s? Can we forecast the number of upsets we will see in the upcoming tournaments? I leave those questions for someone else to answer, or maybe future-me. Here is the full code of this analysis (if you want to reproduce it, first set data_path to the folder where you stored the data).

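library(tidyverse)

# Placeholder: folder holding the cloned tennis_atp and tennis_wta repositories.
# It must end with a slash, because paths are glued together below.
data_path <- "path/to/data/"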

gs_data_one_year <- function(year, tour = c("atp", "wta")) {
  read_csv(str_glue("{data_path}tennis_{tour}/{tour}_matches_{year}.csv"),
           col_types =  cols(loser_seed = col_integer())) %>% 
    filter(tourney_level == "G") %>% 
    mutate(year = year, tour = tour) %>% 
    select(loser_seed, round, tourney_name, year, tour) %>% 
    mutate(tourney_name = case_when(
      tourney_name == "French Open" ~ "Roland Garros",
      tourney_name == "Us Open"     ~ "US Open",
      TRUE                          ~ tourney_name
    ))
}

the_data <- bind_rows(
  map_dfr(1980:2018, ~gs_data_one_year(.x, "atp")),
  map_dfr(1980:2018, ~gs_data_one_year(.x, "wta"))
)

upset_scores_df <- function(round_char, 
                            upset_points,
                            top_only = 16) {
  stopifnot(length(upset_points) == 6)
  tibble(
    round = round_char,
    loser_seed = 32:1,
    upset_points = rep(upset_points, c(16, 8, 4, 2, 1, 1))
  ) %>% 
    filter(loser_seed <= top_only)
}

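# The rounds are paired by position with the upset points below, so this
# relies on unique() returning them in draw order (R128, R64, R32, R16, QF,
# SF, F), i.e. on the order in which the rounds first appear in the data.
all_rounds <- unique(the_data$round)
all_upset_points <- list(2:7, 1:6, 0:5, c(rep(0, 2), 1:4), c(rep(0, 3), 1:3), 
                         c(rep(0, 4), 1:2), c(rep(0, 5), 1))
stopifnot(length(all_rounds) == length(all_upset_points))

upset_df <- map2_dfr(all_rounds, all_upset_points, upset_scores_df,
                     top_only = 16)

# Join on seed and round explicitly; losers outside the top 16 get NA points,
# which sum() drops below.
upset_set <- left_join(the_data, upset_df, by = c("loser_seed", "round")) %>% 
  group_by(tourney_name, year, tour) %>% 
  summarise(upset_score = sum(upset_points, na.rm = TRUE)) %>% 
  ungroup() 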

upset_set %>% 
  ggplot(aes(year, upset_score, col = tour)) +
  geom_smooth() +
  facet_wrap(~tourney_name) +
  labs(x        = "Year", 
       y        = "Upset score",
       title    = "Predictability at tennis Grand Slams",
       subtitle = "A comparison across time, tournaments and tours") +
  theme_bw() +
  scale_color_manual(values = c("forestgreen", "firebrick"))

Code and Data in a large Machine Learning project

2019-03-18T14:00:00+00:00

We did a large machine learning project at work recently. It involved two data scientists, two backend engineers and a data engineer, all working on-and-off on the R code during the project. The project had many aspects that were interesting and new to me, among them doing data science in an agilish way, keeping track of the different model versions, and dealing with directories, data and code on different machines. I planned to do a series of write-ups this summer, describing each of them, but then this happened:

Let me know if you write this up somewhere and I could summarize and/or link to it. I think it would be good to have an overview of different approaches to the Path Problem.
— Jenny Bryan (@JennyBryan) February 28, 2019

Compliant as I am, here is already the story on the latter topic.

We knew upfront that the model we were trying to create would take many iterations of improvement before it was production worthy. This implied that we were going to create a lot of code and a lot of data files. If these were not organized properly, we could easily drown in the ocean we were about to create. We had a large server at our disposal that could do the heavy lifting. But because we sometimes needed all the cores for training for a prolonged period of time, we also worked on our local machines.

The server was our principal machine for building the project, because it had a lot more RAM and cores than our local machines and because it was the central place where data was stored (more about that later). The first challenge we had to overcome was how to work on the server simultaneously on different aspects of the project. Ideally every exploration and adjustment to the model went in its own git branch, so we could use all the best practices of software development, like doing code reviews before merging to master. Working in parallel on the same machine on different aspects made this really hard to do. Then a DevOps engineer we discussed our challenges with came up with the simplest solution ever: just give every user their own code folder on the server, just like every user has a code folder on their own machine. All of a sudden everything worked smoothly. Such a simple solution proved to be a turning point in the project.

On to the organization of the code itself. From past experiences we knew that reproducibility of results was absolutely vital, both for the quality of the model and for the retention of our mental health. Therefore, we decided that from day one we would use the R package structure to develop the code. This has two major advantages over placing scripts in regular Rstudio projects. First, it will not build if you place R code in the scripts that is not a function or a method, which enforces writing code that is independent of the state of the user’s machine. Second, by using devtools::load_all() you have all functionality at your disposal at every step of the analysis. You don’t have to load or run certain scripts first before you can go to work.

But what about doing explorative analysis? You cannot get much insight by just writing functions. Well, R packages already have a very convenient solution in place in the form of Vignettes. These are normally used to write examples of how the package should be used, for example this one for dplyr. One way to write a Vignette is in an Rmarkdown file, a format ideal for data exploration because it allows for mixing text with code. We were very strict about the code quality in the R scripts, but we called the Vignettes the Feyerabend files (after the epistemologist who claimed that anything goes as valid science). You can mess about there as much as you want, as long as the results and insights are subsequently transferred to the R scripts. This allowed for very quick hypothesis testing.

Then, finally, the data. Our principal data source was the company database. Since the queries to produce an analysis set took a long time to run, we needed to store the results locally. A couple of smaller data sets, such as the IDs of all the cases in the train set, were used so regularly that it was most convenient to have them at our immediate disposal at all times. We included them in the R package as data files (just like packages from CRAN have datasets shipped with them). However, most files were too large to hold in memory all the time, and we certainly did not want to have them in version control.

As mentioned, each user had their own code folder on the server, and sometimes we had to work locally as well. While syncing code was easy with version control, syncing data was hard if everybody kept data inside their own folder without checking it in. On the server this was relatively easy to overcome by using a single data folder outside the code folders. To make sure we could also sync the data locally, we made strict arrangements about the creation of data files. Every single one of them had to be produced by a function in the R folder of the package. This included all queries to the database. Although this caused some overhead, the reproducibility and clarity it gave us made it well worth it. We put the code that produced the data in version control, not the data itself. Data files could then be created on every system independently.

Finally, saving and loading the data in a uniform way. How did we deal with what keeps Jenny awake at night? In the utils file of the package we created functions for writing and reading. Before loading or saving, these functions check the name of the system and the name of the user. They then load from or save to the folder belonging to that system or user. Here is an example of the structure for saving as .Rds files.

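save_as_rds <- function(file, 
                        filename) {
  
  # Identify the machine and the user running the code
  node <- Sys.info()["nodename"]
  user <- Sys.info()["user"]
  
  # The server has one shared data folder; locally every user has their own
  if (node == "server_node_name") {
    path <- "path/to_the_data/on_the/server"
  } else if (user == "user1") {
    path <- "path/for/user1"
  } else if (user == "user2") {
    path <- "user2/has_data/stored/here"
  } else {
    # Fail early instead of hitting an undefined `path` below
    stop("No data path configured for user '", user, "' on node '", node, "'.")
  }
  
  file_path <- file.path(path, filename)
  saveRDS(file, file_path)
}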

Every user added their name and local path to these functions. Throughout the code we only used these functions, so we were never bothered with changing directories.
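
The reading side can follow the same pattern. Below is a minimal sketch, not our actual utils code: resolve_data_path() and read_from_rds() are hypothetical names, and the node name, user names and paths are the same placeholders as in the saving example. Pulling the lookup into one helper means each user only has to register their path once.

```r
# Hypothetical helper: find the data folder for the current machine and user.
# The node name, user names and paths are placeholders.
resolve_data_path <- function(node = Sys.info()[["nodename"]],
                              user = Sys.info()[["user"]]) {
  if (node == "server_node_name") {
    "path/to_the_data/on_the/server"
  } else if (user == "user1") {
    "path/for/user1"
  } else if (user == "user2") {
    "user2/has_data/stored/here"
  } else {
    stop("No data path configured for user '", user, "' on node '", node, "'.")
  }
}

# Hypothetical reading counterpart of save_as_rds().
# `path` is only looked up when it is not supplied by the caller.
read_from_rds <- function(filename, path = resolve_data_path()) {
  readRDS(file.path(path, filename))
}
```

Calling read_from_rds("some_file.Rds") would then pick the right folder on the server as well as on each registered laptop.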

Every project is different, but I think the challenges to developing a complicated model with a team are universal. Hopefully, these practical solutions can help you when you find yourself in such a situation. Of course I am very interested in your best practices. Post a reply or send an email.

Using Rstudio Jobs for training many models in parallel

2019-02-26T17:00:00+00:00

Recently, Rstudio added the Jobs feature, which allows you to run R scripts in the background. Computations are done in a separate R session that is not interactive, but just runs the script. In the meantime your regular R session stays live, so you can do other work while waiting for the Job to complete. Instead of refreshing your Twitter for the 156th time, you can stay productive! (I am actually writing this blog in Rmarkdown while I am waiting for my results to come in.) The number of Jobs you can spin up is not limited to one. As each new Job is started on a different core, you can start as many Jobs as your system has cores (although leaving one idle is a good idea for other processes, like your interactive R session).

Recently, I needed to train many Bayesian models on subsets of a large dataset. The subsets varied greatly in size, with most of the models needing a few minutes to train, but the ones trained on the larger subsets took up to half an hour. With just a single Job the whole thing would have lasted over fifteen hours. Luckily, I have a server at my disposal with many cores and a lot of RAM. I chose to use Jobs for running the model training in parallel for two reasons. First, as mentioned, it allows me to do other R work while waiting for the results to come in. Second, I train the models with rstan which allows for using multiple cores for each model, each chain gets its own core. So, we have parallelization within the parallel Jobs. I could only get this to work efficiently with Jobs. Packages for parallelization did not seem to be able to handle this parallel within parallel. (Disclaimer, I am no expert on parallelization, a more knowledgeable person might figure it out. I stopped digging once I got what I wanted with Jobs).

Here, I share the steps I took to get the whole thing running.

Create the Jobs script

First you should capture everything that happens within a Job in an R script. You could choose to import your current global environment when starting the Job, but be careful with this when starting multiple Jobs. If you have large objects in your environment, they will be loaded into every Job you start and you might run out of memory. Importing the global environment also poses a threat to the reproducibility of the outcome, because the Job can only run when the required objects are present in the global environment. Rather, I suggest you make a script that is completely self-contained. Everything that is needed within the Job (dependencies, loaded data, your own functions) should be in the script, or sourced by the script. To run the same Job multiple times on different data, include a parameter in the script that differs over the Jobs. More about that later.

Divide your work

Within each Job, the computations will run sequentially. If you want the parallel Jobs to finish in approximately the same time, you should divide the work they have to do into approximately equal chunks. The number of Jobs you can run simultaneously is limited by the number of cores you have available and the RAM on your system. The first is easy to detect, just run

parallel::detectCores()

If you have 8 cores on your machine, you can spin up to 7 Jobs in parallel, provided you have enough memory available. Figuring out how much memory your Job consumes is a bit of trial and error. If you run a task for the first time, it might be a good idea to run a test Job and check on its memory use. If you are on a unix system, you can see how much memory the Job consumes by using the top command on the command line. This will show all running processes and the percentage of the available RAM they use. If the test Job consumes about 15% of your available memory, you could spin up about 4 to 5 parallel Jobs, assuming that the amount of memory used is approximately equal for each part of the data. Always allow some room for fluctuation. I don’t let the Jobs consume more than 75% of the available memory. If your Jobs are very volatile in their memory use you might even want to be more conservative, since it is one strike and you’re out. (I know absolutely nothing about Windows. If you are on Windows and not sure how to monitor the memory use, you have to do some Googling, I am afraid.)
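
The two limits can be combined in a small calculation. This is only a sketch of the rule of thumb above: max_parallel_jobs() and its argument names are made up for this post, and the memory share per Job is whatever you observed in your own test Job.

```r
# Hypothetical helper: upper bound on the number of parallel Jobs.
# n_cores:         number of cores, e.g. from parallel::detectCores()
# mem_pct_per_job: share of RAM (in %) one test Job was observed to use
# mem_pct_cap:     share of RAM (in %) all Jobs together may use
max_parallel_jobs <- function(n_cores, mem_pct_per_job, mem_pct_cap = 75) {
  stopifnot(n_cores >= 1, mem_pct_per_job > 0, mem_pct_cap > 0)
  by_cores  <- n_cores - 1                          # keep one core idle
  by_memory <- floor(mem_pct_cap / mem_pct_per_job) # stay under the memory cap
  max(0, min(by_cores, by_memory))
}

# The example from the text: 8 cores, a test Job using about 15% of the RAM
max_parallel_jobs(n_cores = 8, mem_pct_per_job = 15)
```

Treat the result as a ceiling. If memory use fluctuates a lot between parts of the data, starting one Job fewer is the safer choice.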

After you have decided how many Jobs you can run in parallel, it is time to split up the work. I needed to train a great number of models on datasets varying in size. Giving every Job an equal number of models to train is a suboptimal strategy, because if several larger sets end up in the same partition you have to wait a long time for one of the Jobs while the others finished long ago. I used the number of rows of each subset as an indication of the time it would take for the model trained on it to complete. With the following little helper function you can create approximately equal chunks of the data. x is a vector with IDs for every group on which a model is trained, w is a vector with weights (the number of rows), and g is the number of desired groups, which is equal to the number of Jobs you’ll start.

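assign_to_chunk <- function(x, w, g) {
  stopifnot(length(x) == length(w),
            is.numeric(w),
            length(g) == 1,
            g %% 1 == 0,
            g >= 1)
  # Ideal cumulative weight at which each of the first g - 1 chunks should end
  # (seq_len() keeps this empty when g is 1)
  splits <- (sum(w) / g) * seq_len(g - 1)
  total_sizes <- cumsum(w)
  # For every split, take the position whose cumulative weight is closest to it
  end_points <- vapply(splits, function(s) which.min(abs(total_sizes - s)),
                       integer(1))
  start_points <- c(1, end_points + 1)
  end_points   <- c(end_points, length(x))
  # Elements are assigned in their given order: chunk 1 first, then chunk 2, ...
  chunk_assignment <- rep(1:g, end_points - start_points + 1)
  data.frame(x, chunk_assignment)
}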
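
To see what the function does, here are its steps worked out by hand for a small case. The weights are made-up numbers, and the example only covers two chunks.

```r
# Made-up example: six subsets with their number of rows, split over two Jobs
w <- c(10, 20, 30, 40, 50, 50)
g <- 2

# Ideal cumulative weight at which the first chunk should end: 200 / 2 = 100
splits <- (sum(w) / g) * seq_len(g - 1)

# Running total of rows: 10 30 60 100 150 200
total_sizes <- cumsum(w)

# Position whose running total is closest to the ideal split
end_point <- which.min(abs(total_sizes - splits))

# Rows per chunk when the first `end_point` subsets go to chunk 1
chunk_of_subset <- rep(1:2, times = c(end_point, length(w) - end_point))
chunk_rows <- tapply(w, chunk_of_subset, sum)
chunk_rows
```

The first four subsets end up in chunk 1 and the last two in chunk 2, so both Jobs get 100 rows to work on even though they train a different number of models.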

If you have a large dataset, like I did, it is a good idea to partition data physically into separate files. Otherwise each Job has to hold the complete data set in its memory, wasting RAM on data records it doesn’t need.
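
A minimal sketch of such a physical split, using base R only. The toy data frame, the column names group_id and value, and the temporary output folder are assumptions for this illustration. The chunks data frame mimics the output of the helper function above.

```r
# Toy stand-ins (assumptions): the full data and the chunk assignment per group
full_data <- data.frame(
  group_id = rep(c("a", "b", "c", "d"), times = c(5, 3, 2, 4)),
  value    = seq_len(14)
)
chunks <- data.frame(x = c("a", "b", "c", "d"),
                     chunk_assignment = c(1, 1, 2, 2))

# Temporary folder for the illustration; use your own data folder in practice
out_dir <- file.path(tempdir(), "partitioned")
dir.create(out_dir, showWarnings = FALSE)

# Look up the chunk of every row, then write one csv file per chunk
row_chunk  <- chunks$chunk_assignment[match(full_data$group_id, chunks$x)]
parts      <- split(full_data, row_chunk)
part_files <- file.path(out_dir, paste0("filename_part", names(parts), ".csv"))
for (i in seq_along(parts)) {
  write.csv(parts[[i]], part_files[i], row.names = FALSE)
}
```

Each Job then only reads the file that belongs to its own part.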

Getting the whole thing to run

Now all that is left is setting your system on fire by using its full capacity. There are two ways you can get your Job to run on different data sets. The first is to give your Job script a parameter variable, of which you change the value before starting the next Job. Say you have split your data into five parts, called filename_part1.csv up to filename_part5.csv. You could then start your Job script with

part <- 1
# Keep the loaded data in an object so the rest of the script can use it
dat <- read.csv(paste0("location_to_file/filename_part", part, ".csv"))

Next, you start the Job manually in the Jobs pane in Rstudio. For the next part you change part to the value 2 and start another Job, and so on until all five Jobs are started. This is a little tedious and error-prone. A more elegant solution was proposed by Rstudio’s Jonathan McPherson, in which the rstudioapi::jobRunScript function is used to kick off the Job. In a second script you set the part variable before starting each Job, and you start the Job with this function instead of through the pane. Because the Job now gets part from the imported global environment, the Job script itself should no longer assign it.

part <- 1 
rstudioapi::jobRunScript(path = "path_to_Jobs_script", importEnv = TRUE)
part <- 2
rstudioapi::jobRunScript(path = "path_to_Jobs_script", importEnv = TRUE)

# ...

part <- 5
rstudioapi::jobRunScript(path = "path_to_Jobs_script", importEnv = TRUE)

You now only have to run this second script to have all the Jobs started at once. This is even more convenient if you have several variables whose values vary over the Jobs. A downside of this approach is that you have to import the global environment in every one of them. As mentioned, you must then make sure the Job is reproducible and the total memory of your system is not exceeded.
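
One possible way around importing the global environment, which I have not used in the project described here, is to generate a tiny wrapper script per part and start each wrapper with importEnv = FALSE. The sketch below rests on a few assumptions: the wrapper folder and file names are made up, "path_to_Jobs_script" is the same placeholder as above (an absolute path is the safest choice, since a Job may start in a different working directory), and the Job script no longer assigns part itself.

```r
# Placeholder path, as in the snippets above; point this at your own Job script
job_script <- "path_to_Jobs_script"
parts <- 1:5

# Write one small wrapper script per part into a temporary folder.
# Each wrapper sets `part` and then sources the Job script.
wrapper_dir <- file.path(tempdir(), "job_wrappers")
dir.create(wrapper_dir, showWarnings = FALSE)

wrappers <- vapply(parts, function(part) {
  wrapper <- file.path(wrapper_dir, sprintf("job_part%d.R", part))
  writeLines(c(sprintf("part <- %d", part),
               sprintf('source("%s")', job_script)),
             wrapper)
  wrapper
}, character(1))

# Only launch when running inside Rstudio with the rstudioapi package installed
if (requireNamespace("rstudioapi", quietly = TRUE) && rstudioapi::isAvailable()) {
  for (wrapper in wrappers) {
    rstudioapi::jobRunScript(path = wrapper, importEnv = FALSE)
  }
}
```

Every Job then starts from a clean environment, at the cost of a few generated files.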

A major thanks to the Rstudio team for providing this awesome functionality. It is a great productivity boost to me!

Dealing with failed projects

2018-11-22T09:20:00+00:00

Recently, I came up with Thoen’s law. It is an empirical one, based on several years of doing data science projects in different organisations. Here it is: the probability that you have worked on a data science project that failed approaches one very quickly as the number of projects done grows. I think many of us, far more than we as a community like to admit, will deal with projects that don’t meet their objectives. This blog does not explore why data science projects have a high risk of failing. Jacqueline Nolis already did this adequately. Rather, I’ll look for strategies for dealing with projects that are failing. Disappointing as they may be, failed projects are inherently part of the novel and challenging discipline that data science is in many organisations. The following strategies might reduce the probability of failure, but that is not the main point. Their objective is to prevent failing in silence, after too long a period of project time in which you try to figure things out on your own. They shift failure from the silent personal domain to the public collective one, hopefully reducing stress and blame, both from yourself and from others.

Make failing an option from the start

At the beginning of a project the levels of enthusiasm and optimism are always at their peak. Especially in data science projects. Isn’t data the new oil? This is the time we are finally going to dig into that well and leverage our data in unprecedented ways! No setbacks have been experienced yet. There is only one road ahead and it will lead us to success. Probably at this stage you, the data scientist, are already well aware of a number of project risks. You might want to keep these concerns to yourself, as you don’t want to come across as negative, or worse, as someone who is not up to the job ahead. Please don’t. If you foresee possible problems at this stage and you don’t speak out, they can come back like a boomerang when the problems actually occur. Rather, invite all stakeholders to perform a risk analysis together. As a group, you list the requirements for a successful outcome, and try to identify what can get in the way of them. These requirements differ from project to project, of course. However, usual suspects are having enough history in the database, having data of adequate quality, being able to join data from different sources, having an organization that is ready to adopt the project, and having a strong enough relationship between relevant variables in the first place. Doing a risk analysis serves two purposes. Obviously, by describing possible problems up front, it is more likely they can be prevented or mitigated. Moreover, it can subtly shift the jubilant mood that is so typical for stakeholders at the start of a data science project to a more realistic view of the project, making them realise there is no guaranteed success.

Plan realistically and put in f$@k-up slack

Doing data science well takes time. Time to properly understand the problem, time to write good quality code, time to figure out the relationships in your data. Whether it is by your boss, a project manager, a client, or your colleague, you are going to be asked how long it is going to take you to complete (a part of) the job. Trying to please when having to give an estimate will almost certainly backfire later on. Try to list all the components that are part of the total job ahead and make a realistic estimate of how much time it will take you to do each properly. Next, and this is crucial, add f$@k-up slack to it. This is a percentage of time that the project is going to last longer because you are going to f$@k-up. How am I so sure you will? Because data science is hard and you are human. Junior, medior, senior, we all f$@k-up. Numbers don’t add up because we programmed something incorrectly, taking us a day and a half to find the error. We thought we understood the data, but we didn’t, so some part of the analysis needs to be redone. We finally have a fancy server to train on the full set, but it keeps running out of memory while it shouldn’t. You can f$@k-up in so many ways that saying you will somehow is a pretty safe bet. I think adding ten to twenty percent to the project time for unforeseen f$@k-ups is certainly not too much.

Keep stakeholders in the loop

This is as obvious as it is often postponed or not done at all. A good project manager often asks the data scientist how things are going, and communicates progress to stakeholders. If you are in a project in which there is no project manager, make it your responsibility to inform the stakeholders at due times. Be disciplined and meet with them, or if this cannot be arranged email them, at preset moments. Don’t delay updates until you have better or more news than you currently do. There is a big pitfall in letting thoughts like “I am sure the model will improve by x when I try this new fancy algorithm, just need to get it running, shouldn’t take long” postpone your updates. Don’t be apologetic in the updates, try to be as factual as you can be. “We tried this and it gave no performance enhancement, we are now off to try this.” If a project is then heading towards failure, the stakeholders have been aware of this all along, which makes actual failure easier to accept than when they are confronted with it unexpectedly.

Write a final report

Often failed projects are never really completed. There is always new stuff left to try. Maybe a different algorithm, or a different data source. All the time that was in the original planning has been used up, the project goals are not yet met, stakeholders start losing interest and the project is left to linger, leaving you dissatisfied and maybe unwilling to give up. Writing a report that describes why the project was not a success is a good way to officially close it. Try to write down as meticulously as you can what the project goals were and why they were not attained. Again, being factual is the way to go: quantify, quantify, quantify. “For 101,221 out of the 567,436 customers in database A, there was no record in database B. So, for 17.8% of all our customers this crucial predictor was not available.” If you think there is still life in the project, include recommendations for restarting it. The final report informs the stakeholders and forces you to objectively assess the failure, thereby reducing self-blame for an unfinished, unsuccessful project.

Why your S3 method isn’t working

2018-06-15T15:54:00+00:00

Over the last few years I noticed the following happening with a number of people. One of those people was actually yours truly, a few years back. Person is aware of S3 methods in R through regular use of the print, plot and summary functions and decides to give it a go in own work. Creates a function that assigns a class to its output and then implements a bunch of methods to work on the class. Strangely, some of these methods appear to be working as expected, while others throw an error. After a confusing and painful debugging session, person throws hands in the air and continues working without S3 methods, which was working fine in the first place. This is a real pity, because all the person is overlooking is a very small step in the S3 chain: the generic function.

A nonworking method

So we have a function doing all kinds of complicated stuff. It outputs a list with several elements. We assign an S3 class to it before returning, so we can subsequently implement a number of methods¹. Let’s just make something up here.

my_func <- function(x) {
  ret <- list(dataset = x, 
              d = 42, 
              y = rnorm(10), 
              z = c('a', 'b', 'a', 'c'))
  class(ret) <- "myS3"
  ret
}
out <- my_func(mtcars)

Perfect, now let’s implement a print method. Rather than outputting the whole list, we just want to know the most vital information when printing.

print.myS3 <- function(x, ...) {
  cat("Original dataset has", nrow(x$dataset), "rows and",
      ncol(x$dataset), "columns\n", 
      "d is", x$d)
  # print methods conventionally return their input invisibly
  invisible(x)
}
out

## Original dataset has 32 rows and 11 columns
##  d is 42

Ha, that is working! Now we do a mean method, that gives us the mean of the y variable.

mean.myS3 <- function(x, ...) {
  mean(x$y)
}
mean(out)

## [1] 0.2631094

Works too! And finally we do a count_letters method. It takes z from the output and counts how often each letter occurs.

count_letters.myS3 <- function(x) {
  # Use the object passed in, not the global `out`
  table(x$z)
}
count_letters(out)

## Error in count_letters(out): could not find function "count_letters"

What do you mean “could not find function”? It is right there! Maybe we made a typo. Mmmm, no, it doesn’t seem so. Maybe, mmmm, let’s look into this… Half an hour, a bit of swearing and feelings of stupidity later. Pfff, let’s not bother about S3, we were happy with just using functions in the first place.

Generics

Now why are print and mean working just fine, but count_letters isn’t? Let’s look under the hood of print and mean.

print

## function (x, ...) 
## UseMethod("print")
## <bytecode: 0x7ff0f7069200>
## <environment: namespace:base>

mean

## function (x, ...) 
## UseMethod("mean")
## <bytecode: 0x7ff0f5ce3428>
## <environment: namespace:base>

They look exactly the same! They call the UseMethod function on their own function name. Looking into the help file of UseMethod, it all of a sudden starts to make sense.

“When a function calling UseMethod(“fun”) is applied to an object with class attribute c(“first”, “second”), the system searches for a function called fun.first and, if it finds it, applies it to the object. If no such function is found a function called fun.second is tried. If no class name produces a suitable function, the function fun.default is used, if it exists, or an error results.”

So by calling print and mean on the myS3 object we were not calling the method itself. Rather, we call the general functions print and mean (the generics) and they call the function UseMethod. This function then does the method dispatch: it looks up the method belonging to the class of the object the function was called on. We were just lucky the print and mean generics were already in place and called our methods. However, the count_letters function indeed doesn’t exist (as the error message tells us). Only the count_letters method exists, under the name count_letters.myS3, for objects of class myS3. We just learned that methods are not meant to be called directly, but are invoked by generics. All we need to do, thus, is build a generic for count_letters and we are all set.

count_letters <- function(x) {
  UseMethod("count_letters")
}
count_letters(out)

## 
## a b c 
## 2 1 1
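
The help file quoted above also mentions fun.default. As a small, self-contained sketch of that part of the dispatch: the error message and the toy object are made up for this illustration, and the generic and method are repeated (with a ... argument added) so the snippet runs on its own.

```r
# The generic: only hands the call over to method dispatch
count_letters <- function(x, ...) {
  UseMethod("count_letters")
}

# The method for objects of class myS3
count_letters.myS3 <- function(x, ...) {
  table(x$z)
}

# Fallback for every other class, giving a clearer error than the built-in one
count_letters.default <- function(x, ...) {
  stop("count_letters() is only implemented for objects of class 'myS3'.",
       call. = FALSE)
}

# A toy object with the class attribute set
obj <- structure(list(z = c("a", "b", "a", "c")), class = "myS3")
count_letters(obj)
```

Calling count_letters() on anything that is not of class myS3, say count_letters(1), now ends up in the default method.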

¹ It is actually ill-advised to assign an S3 class directly to an output. Rather use a constructor, see 16.3.1 of Advanced R for the how and why.