- Overengineering in ML - business life is not a Kaggle competition “Overengineering is the act of designing a product to be more robust or have more features than often necessary for its intended use, or for a process to be unnecessarily complex or inefficient.” This is how the Wikipedia page on overengineering starts. It is the diligent engineer, the one who wants to make sure that every possible feature is incorporated in the product, who creates an overengineered product. We find overengineering in real-world products as well as in software, and it is a relevant concept in data science too. First of all, because software engineering is very much a part of data science: we should be careful not to create dashboards, reports and other products that are too complex and contain more information than the user can stomach. But maybe there is a second, more subtle lesson in overengineering for data scientists. We might create machine learning models that predict too well. Sounds funny? Let me explain what I mean by it.
- Using {drake} for Machine Learning A few weeks ago, Miles McBain took us on a tour through his project organisation in this blogpost. Not surprisingly, given Miles’ frequent shoutouts about the package, it is completely centered around drake. About a year ago on twitter, he convinced me to take this package for a spin. I was immediately sold. It cured a number of pains I had accumulated over the years in machine learning projects: storing intermediate results, reproducibility, having a single version of the truth, forgetting in which order steps should be applied, and so on. Like Miles, I’d like to share my drake-centered workflow. As I found out from reading his blog, there is a great deal of overlap between our workflows.
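To give a flavour of what a drake-centered workflow looks like, here is a minimal sketch of a plan; the targets are toy stand-ins (mtcars, a simple lm), not the project's actual steps:

```r
library(drake)

plan <- drake_plan(
  raw     = mtcars,                         # stand-in for loading the raw data
  trained = lm(mpg ~ wt + hp, data = raw),  # model fitting step
  metrics = summary(trained)$r.squared      # evaluation step
)

make(plan)        # builds only the targets that are out of date
loadd(metrics)    # load a cached target into the interactive session
readd(trained)    # or read one directly
```

drake works out the dependencies between targets from the code itself, so rerunning make() after a change only rebuilds what is affected.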
- Some More Thoughts on Impostering Two years ago, I wrote about meta-learning to fight imposter feelings. In that blog I made a distinction between impostering because you don’t feel you are up to the job, and impostering because you feel you ought to know something which you don’t. The meta-learning blog focuses on how you define yourself as a data scientist and what, as a consequence, you decide to learn (and more importantly, what not). Staying sane while doing data science is something that always has my interest, and imposter feelings are a major foe to the joy this work can bring. I came across two more insights on the topic that I found worth sharing. The first is intellectual humility, which I learned about in the book Superforecasting: The Art and Science of Prediction by Philip Tetlock. The second is seeing impostering as a learning alarm, thereby turning it into something positive.
- The Psychology of Flame Wars Within the data science community we see quite a few flame wars. For those who don’t know what I am talking about: there are different ways of doing data science. There are the two major languages, R and python, each with their own implementations for analysing data. Within R there are the different flavours of using the base language or applying the functions of the tidyverse. Within the tidyverse there is the dplyr package for data wrangling, whose functionality overlaps greatly with that of the data.table package. Each of these choices is fiercely contested by users of both options. Oftentimes, these debates are first presented as objective comparisons between two options in which the favoured option clearly stands out. This then evokes fierce responses from the other camp, and before you know it we are down to the point where we call each other’s choices inelegant or even ugly.
- padr is updated Yesterday v0.5.0 of the padr package hit CRAN. You will find the main changes in the thicken function, which has gained two new arguments. First of all, following an idea from Adam Stone, you can now drop the original datetime variable from the data frame by using drop = TRUE. This argument defaults to FALSE to ensure backwards compatibility. Without setting drop to TRUE, the datetime variable is returned alongside the added, thickened variable:
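The following made-up example illustrates both behaviours (the data frame and values are invented):

```r
library(padr)

df <- data.frame(
  tm  = as.POSIXct(c("2018-10-04 10:15:32", "2018-10-04 10:47:18",
                     "2018-10-04 11:05:02"), tz = "UTC"),
  val = c(3, 7, 2)
)

# default behaviour: the original datetime variable tm is kept
thicken(df, interval = "hour")

# new in v0.5.0: drop the original datetime variable
thicken(df, interval = "hour", drop = TRUE)
```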
- Predictability of Tennis Grand Slams The European tennis season is in full swing, with Roland Garros starting today and Wimbledon taking place in a few weeks. For a sports buff like me, it is the essence of summer (together with the Tour de France). Time to dive into some tennis data. As a follower of both the men’s and the women’s tour, it occurred to me that the latter’s tournaments are less predictable. My gut feeling was that in men’s matches the favourite wins more frequently than in women’s matches. Of course, gut feelings are what make the world go round, unless you are a data scientist. So let’s analyse all the matches that were played at the four Grand Slam tournaments over the past forty years.
- Code and Data in a large Machine Learning project We did a large machine learning project at work recently. It involved two data scientists, two backend engineers and a data engineer, all working on-and-off on the R code during the project. The project had many aspects that were interesting and new to me, among them doing data science in an agile-ish way, keeping track of the different model versions, and dealing with directories, data and code on different machines. I planned to do a series of write-ups this summer, describing each of them, but then this happened
- Using RStudio Jobs for training many models in parallel Recently, RStudio added the Jobs feature, which allows you to run R scripts in the background. Computations are done in a separate R session that is not interactive but just runs the script. In the meantime your regular R session stays live, so you can do other work while waiting for the Job to complete. Instead of refreshing your Twitter feed for the 156th time, you can stay productive! (I am actually writing this blog in R Markdown while waiting for my results to come in.) The number of jobs you can spin up is not limited to one. As each new job is started on a different core, you can start as many jobs as your system has cores (although leaving one idle is a good idea for other processes, such as your interactive R session).
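Besides the Jobs pane in the IDE, Jobs can also be launched from code via the rstudioapi package. A hedged sketch (the training script file names are hypothetical):

```r
library(rstudioapi)

# hypothetical training scripts, one model per script
scripts <- c("train_rf.R", "train_xgb.R", "train_glmnet.R")

for (script in scripts) {
  jobRunScript(
    path       = script,
    name       = paste("Training:", script),
    workingDir = getwd()
  )
}
```

Each call starts one background Job, so the three models train in parallel while the interactive session stays free.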
- Dealing with failed projects Recently, I came up with Thoen’s law. It is an empirical one, based on several years of doing data science projects in different organisations. Here it is: the probability that you have worked on a data science project that failed approaches one very quickly as the number of projects you have done grows. I think many of us, far more than we as a community like to admit, deal with projects that don’t meet their objectives. This blog does not explore why data science projects have a high risk of failing; Jacqueline Nolis has already done that adequately. Rather, I’ll look at strategies for dealing with projects that are failing. Disappointing as they may be, failed projects are inherently part of the novel and challenging discipline that data science is in many organisations. The following approach might reduce the probability of failure, but that is not its main point. Its objective is to prevent failing in silence after too long a period of project time in which you try to figure things out on your own. It shifts failure from the silent, personal domain to the public, collective one, hopefully reducing stress and blame from yourself and others.
- Why your S3 method isn't working Throughout the last years I have noticed the following happening with a number of people; one of them was actually yours truly a few years back. The person is aware of S3 methods in R through regular use of the print, plot and summary functions, and decides to give them a go in their own work. They create a function that assigns a class to its output and then implement a bunch of methods to work on that class. Strangely, some of these methods appear to work as expected, while others throw an error. After a confusing and painful debugging session, the person throws their hands in the air and continues working without S3 methods, which was working fine in the first place. This is a real pity, because all the person is overlooking is a very small step in the S3 chain: the generic function.
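A minimal example of the missing piece: print() and friends already have generics in base R, but for a verb of your own you have to write the generic yourself with UseMethod() before your method can be found.

```r
# constructor that assigns a class to its output
new_scores <- function(x) structure(list(values = x), class = "scores")

# the generic: without this, calling summarize_scores() fails with
# "could not find function"
summarize_scores <- function(x, ...) UseMethod("summarize_scores")

# the method the generic dispatches to for objects of class "scores"
summarize_scores.scores <- function(x, ...) mean(x$values)

summarize_scores(new_scores(c(1, 2, 3)))
#> [1] 2
```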
- A recipe for recipes If you build statistical or machine learning models, the recipes package can be useful for data preparation. A recipe object is a container that holds all the steps that should be performed to go from the raw data set to the set that is fed into the model algorithm. Once your recipe is ready, it can be executed on a data set in one go, performing all the steps. Not only on the train set on which the recipe was created, but also on new data, such as test sets and data that should be scored by the model. This ensures that new data gets the exact same preparation as the train set, and can thus be validly scored by the learned model. The author of recipes, Max Kuhn, has provided abundant material to familiarize yourself with the richness of the package, see here, here, and here. I will not dwell on how to use the package. Rather, I’d like to share what in my opinion is a good way to create new steps and checks for the package. Use of the package is probably intuitive; developing new steps and checks, however, does require a little more understanding of the package’s inner workings. With this procedure, or recipe if you like, I hope you will find that adding (and maybe even contributing) your own steps and checks becomes easier and more organized. (Together with Max I added the checks framework to the package. Where steps are transformations of variables, checks are assertions about them: if a check passes, nothing happens; if it fails, it breaks the bake method of the recipe and throws an informative error.)
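For readers new to the package, a brief sketch of how a recipe with a step and a check is used (my own minimal example; the blog itself is about writing new steps and checks):

```r
library(recipes)

rec <- recipe(mpg ~ ., data = mtcars) %>%
  check_missing(all_predictors()) %>%              # a check: breaks bake() if NAs turn up
  step_center(all_numeric(), -all_outcomes()) %>%  # steps: transform the variables
  step_scale(all_numeric(), -all_outcomes())

prepped <- prep(rec, training = mtcars)      # estimate means and sds on the train set
baked   <- bake(prepped, new_data = mtcars)  # apply the same preparation to any data
```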
- Gold diggers at the Olympics It came to a close two weeks ago: over two weeks of obsessively watching everything that moves on skis, skates or boards. My tiny country of the Netherlands won about half of all the medals at the speed skating events. Since there are many medals to be won there, this made it end up high in the medal table, which makes it look like a winter sports nation where we are all skating buffs. (Sadly, we can hardly skate at all anymore these last years. Stop global warming everyone, the Dutch need their ice!) To me, this begged the question: are other top nations also high in the rankings because they dominate a single sport, or are they all-round snow and ice eating badasses?
- Blog about something you just learned Great effort has recently been made to encourage the not-so-experienced as well to jump into the water and blog about data science. Some of the community’s hot shots gracefully draw attention to blogs of first-timers on twitter to give them extra exposure. Most of the blogging newbies write about an analysis they did on a topic they care about. That sounds like the obvious thing to do. However, I think there is a second option that is not considered by many: writing about a topic you just learned. This might seem strange; why would you want to tell the world about something you are by no means an expert on? Won’t you just make a fool of yourself by pretending to know stuff you just picked up? I don’t think so. Let me give you four reasons why I think this is actually a good idea.
- Curb your imposterism, start meta-learning Recently, the imposter syndrome has received a lot of attention. Even seasoned programmers admit they suffer from feelings of anxiety and low self-esteem. Some share their personal stories, which can be comforting for those suffering in silence. Here, I focus on a method that has helped me grow confidence in recent years. It is a simple, yet very effective way to deal with being overwhelmed by the many things a data scientist can acquaint him or herself with.
- Make your own color palettes with paletti Last week I blogged about the dutchmasters color palettes package, which was inspired by the wonderful ochRe package. As mentioned there, I shamelessly copied the package: I replaced the list with character vectors containing hex colors and did a find-and-replace to make it dutchmasters instead of ochRe. This was pretty ugly. I realized that if we refactored the ochRe functions, creating functions that create the functions, there would no longer be a need to copy-paste and find-and-replace. So that is what I did. I refactored and expanded ochRe’s core into paletti (a name chosen because I liked the ring of it). You can grab it from GitHub with devtools::install_github("edwinth/paletti").
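The refactoring idea, functions that create functions, in a minimal sketch (this is illustrative and not the exact paletti API; the hex codes are made up):

```r
library(ggplot2)

# a factory: feed it any vector of hex colors, get back a ggplot2 fill scale
make_scale_fill <- function(hex_colors) {
  function(...) {
    discrete_scale("fill", "custom",
                   palette = function(n) unname(hex_colors[seq_len(n)]), ...)
  }
}

my_palette    <- c("#8B3A2B", "#D9B382", "#2E4A62", "#6B7F59", "#A39171",
                   "#4A3B35", "#C2B49A")
scale_fill_my <- make_scale_fill(my_palette)

ggplot(mpg, aes(class, fill = class)) +
  geom_bar() +
  scale_fill_my()
```

With a factory like this, swapping in a new palette means writing one new vector, not copy-pasting a whole package.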
- Color palettes derived from the Dutch masters Among tulip fields, canals and cheese sampling, the museums of the Netherlands are one of its biggest tourist attractions. And for very good reasons! During the seventeenth century, known as the Dutch Golden Age, there was an abundance of talented painters. If you ever have the chance to visit the Rijksmuseum, you will be in awe of the landscapes, households and portraits, painted with incredible detail and beautiful colors.
- A two-stage workflow for data science projects If you are a data scientist who primarily works with R, chances are you had no formal training in software development. I certainly did not pick up many skills in that direction during my statistics master’s. For years my workflow was basically: load a dataset and hack away at it. In the best case my R script came to some kind of conclusion or final data set, but usually it just ended abruptly. Complex projects could result in a great number of scripts and data exports. Needless to say, reproducibility was typically low and stress could run high when some delivery went wrong.
- padr version 0.4.0 now on CRAN I am happy to share that the latest version of padr just hit CRAN. This new version comprises bug fixes, performance improvements and new functions for formatting datetime variables. But above all, it introduces the custom paradigm that enables you to do asymmetric analysis.
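A hedged sketch of what the custom paradigm looks like, assuming the thicken_cust() function this release introduces; the data and the asymmetric grid (office hours only) are made up:

```r
library(padr)

df <- data.frame(
  tm  = as.POSIXct(c("2018-01-01 09:15:00", "2018-01-01 13:40:00",
                     "2018-01-02 10:05:00"), tz = "UTC"),
  val = c(1, 4, 2)
)

# an asymmetric spanning: hourly points, but only between 09:00 and 17:00
spanned <- seq(as.POSIXct("2018-01-01 09:00", tz = "UTC"),
               as.POSIXct("2018-01-02 17:00", tz = "UTC"), by = "hour")
spanned <- spanned[as.integer(format(spanned, "%H")) %in% 9:17]

# map each observation to the nearest point on the asymmetric grid
thicken_cust(df, spanned, colname = "tm_office_hour")
```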
- A ggplot-based Marimekko/Mosaic plot One of my first baby steps into the open source world was when I answered this SO question over four years ago. Recently I revisited the post and saw that Z.Lin had done a very nice and more modern implementation, using dplyr and facetting in ggplot2. I decided to merge her ideas with mine to create a general function that makes MM plots. I also added two features: adding counts, proportions or percentages as text to the cells, and highlighting cells by a condition.
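A hedged sketch of the core idea with geom_rect (not the blog's general function): column widths proportional to group totals, fills showing the within-group distribution.

```r
library(dplyr)
library(ggplot2)

seg <- mtcars %>%
  count(cyl = factor(cyl), gear = factor(gear)) %>%
  group_by(cyl) %>%
  mutate(prop = n / sum(n), total = sum(n)) %>%
  ungroup()

# x positions: one column per cyl level, width proportional to its total count
x_pos <- seg %>%
  distinct(cyl, total) %>%
  mutate(xmax = cumsum(total), xmin = xmax - total)

seg %>%
  left_join(x_pos, by = c("cyl", "total")) %>%
  group_by(cyl) %>%
  mutate(ymax = cumsum(prop), ymin = ymax - prop) %>%
  ungroup() %>%
  ggplot(aes(xmin = xmin, xmax = xmax, ymin = ymin, ymax = ymax, fill = gear)) +
  geom_rect(colour = "white") +
  scale_x_continuous(breaks = (x_pos$xmin + x_pos$xmax) / 2, labels = x_pos$cyl) +
  labs(x = "cyl", y = "proportion of gear within cyl")
```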
- Non-standard evaluation, how tidy eval builds on base R As with many aspects of the tidyverse, its non-standard evaluation (NSE) implementation is not something entirely new, but built on top of base R. What makes this one so challenging to get your mind around is that the Honorable Doctor Sir Lord General and friends brought concepts to the realm of the mortals that many of us had no, or only a vague, understanding of. Earlier, I gave an overview of the most common actions in tidy eval. Although appreciated by many, it left me unsatisfied, because it made clear to me that I did not really understand NSE, neither in base R nor in tidy eval. Therefore, I bit the bullet and really studied it for a few evenings, starting with base R NSE and later learning what tidy eval actually adds to it. I decided to share the things I learned in this rather lengthy blog. I think it captures the essentials of NSE, although it surely is incomplete and might even be erroneous in places. Still, I hope you find it worthwhile and that it will help you understand NSE better and apply it with more confidence.
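To make the relationship concrete, a minimal sketch of the base R building blocks and their tidy eval counterpart (my own toy functions, not taken from the blog):

```r
# base R NSE: capture an unevaluated expression, then evaluate it in the
# data frame, falling back to the calling environment for other names
with_base <- function(df, expr) {
  captured <- substitute(expr)
  eval(captured, envir = df, enclos = parent.frame())
}
with_base(mtcars, mpg / wt)[1:3]

# tidy eval: enquo() captures the expression *and* its environment (a quosure),
# eval_tidy() evaluates it with the data frame as a data mask
library(rlang)
with_tidy <- function(df, expr) {
  eval_tidy(enquo(expr), data = df)
}
with_tidy(mtcars, mpg / wt)[1:3]
```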
- Tidy evaluation, most common actions Tidy evaluation is a bit challenging to get your head around. Even after reading Programming with dplyr several times, I still struggle when creating functions from time to time. I made a small summary of the most common actions I perform, so I don’t have to dig through the vignettes and Stack Overflow over and over. Each is accompanied by a minimal example of how to implement it. I thought others might find this useful too, so here it is in a blog post. This list is meant to be a living thing, so additions and improvements are most welcome. Please do a PR on this file or send an email.
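As a taster, the action I need most often: passing bare column names into a function that calls dplyr (a minimal sketch of my own; the list itself covers more cases).

```r
library(dplyr)

mean_by <- function(df, group_var, value_var) {
  group_var <- enquo(group_var)   # capture the bare column name
  value_var <- enquo(value_var)
  df %>%
    group_by(!!group_var) %>%                      # unquote with !!
    summarise(mean_value = mean(!!value_var))
}

mean_by(mtcars, cyl, mpg)
```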
- Span Dates and Times without Overhead I am working on v0.4.0 of the padr package this summer. Two new features that will be added are wrappers around seq.Date and seq.POSIXt. Since it is going to take a while before the new release is on CRAN, I am going ahead with an early presentation of these functions. Date and datetime parsing in base R is powerful and comprehensive, but also tedious, and it can slow you down in your programming or analysis. Luckily, good wrappers and alternatives exist, not least the ymd{_h}{m}{s} suite from lubridate and Dirk Eddelbuettel’s anytime. These functions remove much of the overhead of date and datetime parsing, allowing for quick formatting of vectors in all kinds of formats. They also alleviate the pain of using seq.Date() and seq.POSIXt() a little, because the from and to arguments have to be parsed dates or datetimes. Take the following example.
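A hedged reconstruction of the kind of example meant here (not the blog's exact code):

```r
# base R: both endpoints have to be parsed before seq.Date will accept them
seq(as.Date("2017-07-01", format = "%Y-%m-%d"),
    as.Date("2017-07-10", format = "%Y-%m-%d"), by = "day")

# lubridate takes away part of the parsing overhead
library(lubridate)
seq(ymd(20170701), ymd(20170710), by = "day")
```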
- Quickly Check your ID Variables Virtually every dataset has them: id variables that link a record to a subject and/or time point. Often one column, or a combination of columns, forms the unique id of a record. For instance, the combination of patient_id and visit_id, or ip_address and visit_time. The first step in almost all of my analyses is checking the uniqueness of a variable, or a combination of variables. If it is not unique, many assumptions about the data may be wrong, or there may be data quality issues. Since I do this so often, I decided to make a little wrapper around this procedure. The unique_id function will return TRUE if the evaluated variables are indeed the unique key to a record. If not, it will return all the records for which the id variable(s) are duplicated, so we can pinpoint the problem right away. It uses dplyr v0.7.1, so make sure that it is loaded.
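A hedged sketch of what such a wrapper could look like (not necessarily the blog's exact implementation):

```r
library(dplyr)

unique_id <- function(x, ...) {
  id_set <- x %>% select(...)
  if (sum(duplicated(id_set)) == 0) {
    TRUE
  } else {
    # return every id combination that occurs more than once
    id_set %>%
      group_by_all() %>%
      filter(n() > 1) %>%
      ungroup()
  }
}

mtcars %>% unique_id(cyl, gear)   # not unique: returns the duplicated combinations
```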
- Check Data Quality with padr The padr package was designed to prepare datetime data for analysis; that is, to take raw, timestamped data and quickly convert it into a tidy format that can be analyzed with all the tidyverse tools. Recently, a colleague and I discovered a second use for the package that I had not anticipated: checking data quality. Every analysis should include a check that the data are as expected. In the case of timestamped data, observations are sometimes missing due to technical malfunction of the system that produced them. Here are two examples that show how pad and thicken can be leveraged to quickly detect problems in timestamped data.
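The gist of the idea in a hedged sketch (the sensor data frame and its column names are hypothetical):

```r
library(padr)
library(dplyr)

# sensor: a data frame with a timestamp column `tm`, one row per measurement
sensor %>%
  thicken(interval = "hour", colname = "tm_hour") %>%  # map each timestamp to its hour
  count(tm_hour) %>%                                   # observations per hour
  pad(interval = "hour") %>%                           # insert rows for missing hours
  fill_by_value(n, value = 0) %>%                      # missing hours get a count of 0
  filter(n == 0)                                       # the hours the system produced nothing
```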
- Here is the new padr I am very happy to announce v0.3.0 of the padr package, which was introduced in January. As requested by many, you are now able to use intervals whose unit is different from 1. In earlier versions the eight interval values only allowed for a single unit (e.g. year, day, hour). Now you can use any time period that is accepted by seq.Date or seq.POSIXt (e.g. 2 months, 6 hours, 5 minutes) in both thicken and pad. From now on, get_interval tests for both the interval and the unit of the interval of the datetime variable.
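A quick sketch of the new interval specification (assuming a data frame df with a datetime column):

```r
library(padr)

thicken(df, interval = "6 hours")  # units other than 1 now work
pad(df, interval = "2 months")
```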
- Binning Outliers in a Histogram The functionality has been implemented in the ggoutlier_hist function of the ggoutlier package, which you can find at https://github.com/EdwinTh/ggoutlier. Please use that as your source; it is where the code will be maintained.
- Preparing Datetime Data for Analysis with padr and dplyr Two months ago padr was introduced, followed by an improved version that allows applying pad at the group level. See the introduction blogs or vignette("padr") for more package information. In this blog I give four more elaborate examples of how to go from raw data to insight with padr, dplyr and ggplot2. They might serve as recipes for time series problems you want to solve. The emergency dataset is available in padr, and contains about a year of emergency data from Montgomery County, PA. For this blog I only use the twelve most prevalent emergencies.
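To give a flavour of such a recipe, a hedged sketch (not necessarily one of the four examples in the blog): daily counts of the twelve most prevalent emergencies.

```r
library(padr)
library(dplyr)
library(ggplot2)

top_12 <- emergency %>%
  count(title, sort = TRUE) %>%
  slice(1:12) %>%
  pull(title)

emergency %>%
  filter(title %in% top_12) %>%
  thicken(interval = "day", colname = "day") %>%  # add a date variable
  count(title, day) %>%                           # events per emergency per day
  pad(group = "title") %>%                        # insert the days without events
  fill_by_value(n, value = 0) %>%
  ggplot(aes(day, n)) +
  geom_line() +
  facet_wrap(~title, scales = "free_y")
```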
- Tree-based univariate testing When building a predictive model, it is a good idea to do a univariate analysis before throwing the whole bunch of predictors into a complex algorithm. This way we get a feel for the potential contribution of each predictor. When a lot of predictors are available, one can often make a first selection and only use the predictors that show univariate predictive power. Not having to sift through loads of unpromising variables can speed up learning algorithms significantly. Besides, knowing the individual predictive value can improve the data scientist’s discussions with business people, especially when combined with a correlation analysis of the predictors. Due to multicollinearity, variables with known predictive value might not end up in the model, which can lead to distrust of the data scientist’s work. It is essential to report properly why some variables did not make it into the final model.
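A hedged sketch of the idea with rpart (an illustration of tree-based univariate testing, not necessarily the blog's exact method): fit a shallow tree per predictor and use its cross-validated error as a univariate score.

```r
library(rpart)

univariate_tree_score <- function(df, outcome) {
  predictors <- setdiff(names(df), outcome)
  sapply(predictors, function(p) {
    fit <- rpart(reformulate(p, outcome), data = df,
                 control = rpart.control(maxdepth = 2))
    # lowest cross-validated relative error; values near 1 mean no signal
    min(fit$cptable[, "xerror"])
  })
}

# lower scores indicate more univariate predictive power
sort(univariate_tree_score(mtcars, "mpg"))
```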
- padr::pad does now do group padding A few weeks ago padr was introduced on CRAN, allowing you to quickly get datetime data ready for analysis. If you have missed this, see the introduction blog or vignette("padr") for a general introduction. In v0.2.0 the pad function is extended with a group argument, which makes your life a lot easier when you want to do padding within groups.
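A quick sketch (the data frame sales, with a date column and a store column, is hypothetical):

```r
library(padr)

# before v0.2.0 you had to split, pad and recombine yourself;
# now pad fills the date gaps within each store separately
padded <- pad(sales, group = "store")
```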
- A wrapper around nested ifelse The ifelse function is the way to do a vectorised if-then-else in R. It is one of the first cool things I learned to do in R a few years back, picked up from Norman Matloff’s The Art of R Programming. When you have more than one if-then statement, you just nest multiple ifelse calls before you reach the final else.
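A minimal example of the nesting the post refers to:

```r
x <- c(5, 12, 25, 60)
ifelse(x < 10, "low",
       ifelse(x < 20, "medium",
              ifelse(x < 50, "high", "extreme")))
#> [1] "low"     "medium"  "high"    "extreme"
```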
- Introducing padr I am happy to introduce the padr package, which is now available on CRAN. If you frequently work with data containing a timestamp, especially automatically created data, you might find this package helpful. It solves two problems you can be confronted with when preparing datetime data for analysis. First, data is often recorded at too low a level for your analysis. For instance, the timestamp records the moment up to the second, while you want to do the analysis at an hourly level. Second, when no events took place there are typically no data records. This is sensible from a storage perspective, but often unhelpful for analyzing the data. When calculating a moving average, for example, you want missing observations to have the value 0; you don’t want them to be absent from your set.
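A hedged sketch of how the two main functions address these problems (the sales data frame, with a timestamp column tm, is hypothetical):

```r
library(padr)
library(dplyr)

sales %>%
  thicken(interval = "hour") %>%  # problem 1: add an hour-level datetime variable (tm_hour)
  count(tm_hour) %>%              # aggregate to the hourly level
  pad() %>%                       # problem 2: insert rows for the hours without sales
  fill_by_value(n, value = 0)     # give those hours the value 0
```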
- Building a column selecter Maybe the following sounds familiar. You have a large data set with many, many columns, most of which are irrelevant to you. Typically a dump from a database, or the full set extracted from an API. Several times I found myself spending the better part of an afternoon going back and forth between a view of the data, where I tried to figure out which columns to keep, and an R session, where I wrote the code for creating the subset of columns. Wouldn’t it be nice to have an app in which you could just click the columns you would like to keep? This seemed a perfect opportunity to get my feet wet with Shiny gadgets, something I had wanted to do since I first heard about them at useR2016.
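A hedged sketch of such a gadget (my own minimal version, not the blog's implementation), built with shiny and miniUI:

```r
library(shiny)
library(miniUI)

column_selector <- function(df) {
  ui <- miniPage(
    gadgetTitleBar("Select the columns to keep"),
    miniContentPanel(
      checkboxGroupInput("cols", label = NULL,
                         choices = names(df), selected = names(df))
    )
  )
  server <- function(input, output, session) {
    observeEvent(input$done, {
      stopApp(df[, input$cols, drop = FALSE])  # return the subsetted data frame
    })
  }
  runGadget(ui, server)
}

# usage (interactive): smaller_set <- column_selector(mtcars)
```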
- Designing our bathroom with R R has been an indispensable tool since I started working with it about five years ago. Of course, in my day job as a data scientist I couldn’t live without it, but it has also proved to be a great aid in private life. Recently we bought our first house, and R came to the rescue several times in the process. We compared the impact of different mortgages on our finances in ten and twenty years’ time, and I kept an eye on our spending through a Shiny app (I’ll admit the latter would have been less time-consuming if I had done it in Excel, like normal people).