- Overengineering in ML - business life is not a Kaggle competition “Overengineering is the act of designing a product to be more robust or have more features than often necessary for its intended use, or for a process to be unnecessarily complex or inefficient.” This is how the Wikipedia page on overengineering starts. It is the diligent engineer, determined to incorporate every possible feature into the product, who creates an overengineered product. We find overengineering in real-world products as well as in software. It is a relevant concept in data science too. First of all, because software engineering is very much a part of data science: we should be careful not to create dashboards, reports and other products that are too complex and contain more information than the user can stomach. But maybe there is a second, more subtle lesson in overengineering for data scientists. We might create machine learning models that predict too well. Sounds funny? Let me explain what I mean by that.
- Using {drake} for Machine Learning A few weeks ago, Miles McBain took us for a tour through his project organisation in this blogpost. Not surprisingly, given Miles’ frequent shoutouts about the package, it is completely centered around drake. About a year ago, on Twitter, he convinced me to take the package for a spin. I was immediately sold. It cured a number of pains I had accumulated over the years in machine learning projects: storing intermediate results, reproducibility, having a single version of the truth, forgetting in which order steps should be applied, etc. In addition to Miles’ write-up, I’d like to share my own drake-centered workflow. As I found out from reading his post, there is a great deal of overlap between our workflows.
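For readers who have not seen drake before, a minimal sketch of a plan might look like the following. The helper functions and the file `data/raw.csv` (assumed to contain a numeric column `y`) are made up for illustration and are not taken from Miles’ or my actual projects:

```r
library(drake)

# Placeholder helpers standing in for real project code (hypothetical)
clean_data     <- function(df) na.omit(df)
train_model    <- function(df) lm(y ~ ., data = df)
evaluate_model <- function(fit, df) summary(fit)$r.squared

# Each target is rebuilt only when its code or an upstream target changes
plan <- drake_plan(
  raw_data = read.csv(file_in("data/raw.csv")),
  clean    = clean_data(raw_data),
  model    = train_model(clean),
  metrics  = evaluate_model(model, clean)
)

make(plan)       # runs the plan, caching every intermediate result
readd(metrics)   # pulls a cached target back from the drake cache
```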
- Some More Thoughts on Impostering Two years ago, I wrote about meta-learning to fight imposter feelings. In that blog I made a distinction between impostering because you don’t feel up to the job, and impostering because you feel you ought to know something which you don’t. The meta-learning blog focuses on how you define yourself as a data scientist and what, as a consequence, you decide to learn (and, more importantly, what not). Staying sane while doing data science is something that always has my interest. Imposter feelings are a major foe to the joy this work can bring. I came across two more insights on the topic that I found worth sharing. The first is intellectual humility, which I learned about in the book Superforecasting: The Art and Science of Prediction by Philip Tetlock. The second is seeing impostering as a learning alarm, thereby turning it into something positive.
- The Psychology of Flame Wars Within the data science community we see quite a few flame wars. For those who don’t know what I am talking about: there are different ways of doing data science. There are the two major languages, R and Python, each with its own implementations for analysing data. Within R there are the different flavours of using the base language or applying the functions of the tidyverse. Within the tidyverse there is the dplyr package for data wrangling, whose functionality overlaps greatly with that of the data.table package. Each of these choices is hotly contested by users on both sides. Oftentimes, these debates are first presented as objective comparisons between two options in which the favoured option clearly stands out. This then evokes fierce responses from the other camp, and before you know it we are down to the point where we call each other’s choices inelegant or even ugly.
- padr is updated Yesterday v0.5.0 of the padr package hit CRAN. You will find the main changes in the thicken function, which has gained two new arguments. First of all, following an idea by Adam Stone, you can now drop the original datetime variable from the data frame by setting drop = TRUE. This argument defaults to FALSE to ensure backwards compatibility. Without setting drop to TRUE, the datetime variable is returned alongside the added, thickened variable:
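A quick sketch of the difference. The example data frame is made up for illustration; the `interval` and the new `drop` argument are the only thicken arguments used here:

```r
library(padr)

# A toy data frame with a single datetime column
df <- data.frame(
  obs_time = as.POSIXct(c("2018-10-01 10:15:00", "2018-10-02 14:30:00")),
  value    = c(3, 7)
)

# Default behaviour: the original datetime variable is kept
thicken(df, interval = "day")

# With the new argument the original datetime variable is dropped
thicken(df, interval = "day", drop = TRUE)
```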
- Predictability of Tennis Grand Slams The European tennis season is in full swing, with Roland Garros starting today and Wimbledon taking place in a few weeks. For a sports buff like me, it is the essence of summer (together with the Tour de France). Time to dive into some tennis data. As a follower of both the men’s and the women’s tour, it occurred to me that the latter’s tournaments are less predictable. My gut feeling was that the favourite wins more often in men’s matches than in women’s matches. Of course, gut feeling is what makes the world go round, unless you are a data scientist. So let’s analyse all the matches that were played at the four Grand Slam tournaments over the past forty years.
- Code and Data in a large Machine Learning project We did a large machine learning project at work recently. It involved two data scientists, two backend engineers and a data engineer, all working on-and-off on the R code during the project. The project had many aspects that were new and interesting to me, among them doing data science in an agile-ish way, keeping track of the different model versions, and dealing with directories, data and code on different machines. I planned to do a series of write-ups this summer, describing each of them, but then this happened
- Using RStudio Jobs for training many models in parallel Recently, RStudio added the Jobs feature, which allows you to run R scripts in the background. Computations are done in a separate R session that is not interactive, but just runs the script. In the meantime your regular R session stays live, so you can do other work while waiting for the Job to complete. Instead of refreshing your Twitter feed for the 156th time, you can stay productive! (I am actually writing this blog in Rmarkdown while I am waiting for my results to come in.) The number of jobs you can spin up is not limited to one. As each new job is started on a different core, you can start as many jobs as your system has cores (although leaving one core idle is a good idea, for other processes such as your interactive R session).
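Besides the Jobs button in the IDE, jobs can also be launched programmatically via the rstudioapi package. A minimal sketch, assuming a handful of training scripts exist (the file names are hypothetical and this only works from within RStudio):

```r
library(rstudioapi)

# One hypothetical training script per model; each job runs in its own
# non-interactive R session while the interactive session stays free
scripts <- c("train_model_1.R", "train_model_2.R", "train_model_3.R")

for (script in scripts) {
  jobRunScript(path = script, name = script, workingDir = getwd())
}
```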
- Dealing with failed projects Recently, I came up with Thoen’s law. It is an empirical one, based on several years of doing data science projects in different organisations. Here it is: The probability that you have worked on a data science project that failed approaches one very quickly as the number of projects you have done grows. I think many of us, far more than we as a community like to admit, will deal with projects that don’t meet their objectives. This blog does not explore why data science projects have a high risk of failing. Jacqueline Nolis already did this adequately. Rather, I’ll look at strategies for dealing with projects that are failing. Disappointing as they may be, failed projects are an inherent part of the novel and challenging discipline that data science is in many organisations. The following strategies might reduce the probability of failure, but that is not their main point. Their objective is to prevent failing in silence, after spending too much project time trying to figure things out on your own. They shift failure from the silent, personal domain to the public, collective one, hopefully reducing stress and blame from yourself and others.
- Why your S3 method isn't working Over the last few years I noticed the following happening to a number of people. One of those people was actually yours truly, a few years back. The person is aware of S3 methods in R through regular use of the print, plot and summary functions and decides to give them a go in their own work. They create a function that assigns a class to its output and then implement a bunch of methods to work on that class. Strangely, some of these methods appear to work as expected, while others throw an error. After a confusing and painful debugging session, the person throws their hands in the air and continues working without S3 methods, which was working fine in the first place. This is a real pity, because all the person is overlooking is a very small step in the S3 chain: the generic function.
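To make that missing step concrete, here is a minimal sketch. The class name `my_analysis` and the verb `describe()` are made up for illustration:

```r
# Constructor that assigns a class to its output
new_analysis <- function(x) {
  structure(list(result = x), class = "my_analysis")
}

# print() already has a generic in base R, so this method just works
print.my_analysis <- function(x, ...) {
  cat("Analysis of", length(x$result), "values\n")
  invisible(x)
}

# A custom verb needs its own generic before any method will dispatch;
# forgetting this step is what makes the "other" methods throw errors
describe <- function(x, ...) UseMethod("describe")

describe.my_analysis <- function(x, ...) {
  cat("Mean result:", mean(x$result), "\n")
  invisible(x)
}

a <- new_analysis(c(1, 2, 3))
print(a)     # dispatches via the existing print generic
describe(a)  # dispatches only because we defined the describe generic
```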