8 Code that Fails
Your code will fail. Not once, not twice, but hundreds, thousands of times over the course of a large data science project. Everybody’s code fails, many times. Your code, my code, Hadley Wickham’s code. Not because we are bad at what we do, but because we cannot foresee all the possible ways the code will be used or all the possible data inputs that the code will be confronted with at the moment we write it. Getting better at programming does not mean you will write fault-free code (although you will make fewer errors as you grow more experienced), but writing code that fails fast and informatively. Writing such code means that you will spend more time developing than with the random walk method described in the previous chapter. This is an investment that will pay off big time as your project grows in complexity.
But before we look into how to write code that fails fast and informatively, why is this important for Agile Data Science? High quality code is part of the twelve principles, but this is a means to an end. The essence of Agile is continuous delivery. Updating and shipping your product every time you make progress can only be done when your code is adjustable. As the project advances, the end-to-end product will grow in complexity. Sooner rather than later it will reach a point in which you cannot have a full overview of all the aspects of the product anymore. When introducing a new feature you want to make sure all the other elements in the code base keep doing what they were designed for. If not, you want to be clearly informed of what is going wrong so you can either adjust the newly introduced feature or modify the existing elements so it works with the new feature. If every time you want to make a change the whole thing comes tumbling down without informing you what goes wrong, you cannot achieve the objective of fast, continuous delivery.
8.1 Assumptions to your Data
Automation means not calling functions yourself anymore.
You just flip the switch and the whole thing is set in motion.
This means that you no longer get the immediate feedback you are used to have when calling the functions yourself.
When something goes wrong the whole thing just breaks with an error message.
When you are lucky the error message is informative and you are quick to find the bug.
When you are not, you may be confronted with object of type 'closure' is not subsettable
, and you may have a grumpy afternoon ahead.
Best to not depend on luck, but make sure you will always get a proper indication of what went wrong.
8.1.1 Type Checking
Most languages using functional programming are strongly typed. This means that for every argument that a function takes, its datatype is specified at function creation, as well as the datatype that the output takes. When the function is then called it is first checked if the data on which it is called is of the correct type. If not, it will break immediately and will tell the user why. R is weakly typed, the data type of the function arguments are not specified. When a function is called it just has a go on the objects that are fed to it. The function only breaks when somewhere in the body another function gets called with an invalid data type. Take the following for instance:
add_two_numbers <- function(x, y) {
print("Reaching this")
print("and this")
print("Doing some random other stuff")
x + y
}
Now, what will happen if we accidentally call it on a string?
It will only break when we hit the +
operator, all the code before it runs.
The error message it gives us is not so informative that we immediately figure out what went wrong.
## [1] "Reaching this"
## [1] "and this"
## [1] "Doing some random other stuff"
## Error in x + y: non-numeric argument to binary operator
Pfff, what is a binary operator again?
When this function is part of a framework in which higher order functions call lower order functions you may be up for an hour or two of sifting through your code to locate the bug.
Fortunately type checking is very easily added to a function by asserting all argument data types in stopifnot
.
add_two_numbers <- function(x, y) {
stopifnot(is.numeric(x), is.numeric(y))
print("Reaching this")
print("and this")
print("Doing some random other stuff")
x + y
}
add_two_numbers(42, "MacGyver")
## Error in add_two_numbers(42, "MacGyver"): is.numeric(y) is not TRUE
We are now specifically informed which argument does not meet the assumptions and what the name of the sub function is.
8.1.2 Column Validation
Central to most data products will be the modification of data frames. Most functions you’ll create are very specific to your project and will be called only once. For these cases you probably want to save yourself the overhead of making each function completely generic by parameterising the column names that are used. Say you want to add the log of the target to the data frame and you create the following function:
Now what if target
was mistakenly removed from x
in the step preceding this one in the product?
Will it throw an informative error?
## Error in log(x$target): non-numeric argument to mathematical function
Uhh, no, not exactly. It gives the impression we tried to call the function on a non-numeric object, while actually the object is completely missing. It is a good idea to check if all the columns that are required for the operations in the function are present before starting them. I use this little function:
df_has_cols <- function(x, cols) {
parent_frame <- sys.parent()
calling_function <- sys.call(parent_frame)[[1]]
stopifnot(is.data.frame(x))
if(!all(cols %in% colnames(x))) {
not_present <- setdiff(cols, colnames(x))
stop(paste(not_present, collapse = ", "), " missing from the data frame in function ", calling_function)
}
}
which is added to each function that uses a data frame
add_log_target <- function(x) {
df_has_cols(x, "target")
x$target_log <- log(x$target)
x
}
add_log_target(mtcars)
## Error in df_has_cols(x, "target"): target missing from the data frame in function add_log_target
8.1.3 Data Validation
Where the first two checks are about the type and structure of the inputs, data validation tests the assumptions to the data itself.
Variables that cannot contain missing values, variables that must be strictly positive, or characters that can only take a limited number of values.
If you have any such assumptions in your data, you do best to check them before you unleash internal functions on it.
You could use external packages such as validate
or recipes
, or you can write them yourself.
The stopifnot
function can be used to check any assumption you have, as long as you can create a single logical for it.
In the example below, since we are taking the log of target
and it is the target variable it must be strictly positive and non-missing.
8.2 Unit Testing
The above type checking, column validation, and data validation test assumptions to the data going into the functions. To data scientists learning about these it usually makes a lot of sense, because most who worked on a larger product without these have been bitten quite a bit. Moreover, these measures are quick to implement by just inserting one or two lines of code at the beginning of a function. Unit testing your code does not come so naturally to data scientists, unless they have a background in software development. Writing unit tests might seem as a pointless and time consuming exercise at first. This is because its benefits are not immediately obvious to someone who never applied or used them. Still, you should make it part of your software development routine, even when you are the only one using the software you write.
If you are unsure what unit tests are or how to do them in R, please read the chapter in R packages carefully.
8.2.1 Externalising Assumptions
Reading someone else’s code is demanding, even when written by an excellent programmer. At the moment of writing the programmer is fully submerged in the problem, accommodating for all the inputs she foresees the code can be confronted with. Reading her code is tough because you have to recover from it what the problems are it tried to solve. After you wrap up a part of the project and move on to work on something else, the wrapped up code quickly becomes as if written by someone else. Unit tests capture all the cases the programmer envisioned the unit of code should handle. It is a service to the future collaborator, who very well could be the same person, so she doesn’t have to mentally fight her way (back) into the code. As soon as something is changed in the function such that it no longer does everything it was designed to do, the unit test informs us. A workflow without unit tests means that changes to code need to be interactively checked against, as was done when the code was written. The code fails, but it might be unclear why it fails, leaving the programmer with quite a puzzle. An even worse scenario would be if the code does not fail, for now. The exception that should be handled happens not to be in the data that is used to interactively check the modification. All seems fine and the code is adjusted. Then when the edge case does appear again in the data the product breaks mysteriously.
From the above it should be clear that to work in an Agile way automated code testing should be in place. Agile’s core objective, continuous delivery, can only be obtained if modifications to the product can be made quickly. Omitting testing will result in fingers crossed programming, which is neither effective nor fun. At first it might seem that skipping unit tests will get you to delivery faster. However, as time progresses and the product grows in complexity, a unit tester will be able to keep the same steady pace in delivering new features, where the system of those who cut corners will start grinding to a halt because they have to keep digging in old code that no longer works because of the newly introduced features.
In the visual representation of building a data science product below each feature to be added is represented by a color. On top is the data scientist who does unit test, below the one who doesn’t. The unit tester takes quite some time to implement the first two features, the non-unit tester delivers them quicker. After implementing the third feature, however, some code that was written for the first feature keeps failing and he doesn’t know why. When the fourth is implemented, these problems arise again, in a slightly different form. When he finally manages to control it the code of the fourth feature gets unstable, asking for his attention. Meanwhile, the unit tester soldiers on. Sometimes she needs to make small adjustments to the old features, to make them click with new ones. In general however, she can uphold Agile’s promise of continuous, equally paced, delivery.
Of course this example is contrived. After adding a new feature it gradually blends into the code base of the product. You will probably not say “I am going to revisit feature 2 now”, but rather “I will look into this error that keeps showing up now and then when we run it on new data”. Still, I think it is a good idea to keep this picture in mind, use unit tests and you won’t be solving the same problems over and over again.
8.2.2 Writing Better Code
An additional benefit of unit testing is that it will improve the code quality directly. A point also made by Hadley Wickham in the first section of the chapter on unit testing. In order to be unit testable the code should comprised of many small parts, instead of big chunks. Like a broken machine that is easier to fix when components can be replaced separately, code comprised of many small components is easier to debug, optimise and maintain.
The main reason people are not eager to write unit tests, because at first sight it seems redundant and time consuming work. I hope by now you are convinced it is certainly not redundant. As you get the hang of it, you will notice that being slowed down does actually make you stop and think. What are the edge cases that this function can meet and how do you want it to behave when it does? By thinking through the scenarios before you put the function to work you can circumvent many problems upfront.