If you build statistical or machine learning models, the recipes
package can be useful for data preparation. A recipe
object is a container that holds all the steps that should be performed to go from the raw data set to the set that is fed into model a algorithm. Once your recipe is ready it can be executed on a data set at once, to perform all the steps. Not only on the train set on which the recipe was created, but also on new data, such as test sets and data that should be scored by the model. This assures that new data gets the exact same preparation as the train set, and thus can be validly scored by the learned model. The author of recipes
, Max Kuhn, has provided abundant material to familiarize yourself with the richness of the package, see here, here, and here. I will not dwell on how to use the package. Rather, I’d like to share what in my opinion is a good way to create new steps
and checks
to the package1. Use of the package is probably intuitive. Developing new steps
and checks
, however, does require a little more understanding of the package inner workings. With this procedure, or recipe if you like, I hope you will find adding (and maybe even contributing) your own steps
and checks
becomes easier and more organized.
S3 classes in recipes
Lets build a very simple recipe:
library(tidyverse)
library(recipes)
rec <- recipe(mtcars) %>%
step_center(everything()) %>%
step_scale(everything()) %>%
check_missing(everything())
rec %>% class()
## [1] "recipe"
As mentioned a recipe is a container for the steps and checks. It is a list of class recipe
on which the prep
, bake
, print
and tidy
methods do the work as described in their respective documentation files. The steps
and checks
added to the recipe are stored inside this list. As you can see below, each of them are of their own subclass, as well as of the generic step
or check
classes.
rec$steps %>% map(class)
## [[1]]
## [1] "step_center" "step"
##
## [[2]]
## [1] "step_scale" "step"
##
## [[3]]
## [1] "check_missing" "check"
Each subclass has the same four methods defined as the recipe
class. As one of the methods is called on the recipe, it will call the same method on all the steps and checks that are added to the recipe. (Exception is prep.recipe
, which does not only callprep
on all the steps and checks, but also bake
). This means that when implementing a new step
or check
, you should provide these four methods. Additionally, we’ll need the function that is called by the user to add it to the recipe object and a constructor (if you are not sure what this is, see 15.3.1 of Advanced R by Hadley Wickham).
The recipes for recipes
When writing a new step
or check
you will probably be inclined to copy-paste an existing step and start tweaking it from the top. Thus, first writing the function, then the constructor, and then the methods one by one. I think this is suboptimal and can get messy quickly. My preferred way is to start by not bothering about recipes
at all, but write a general function that does the preparation work on a single vector. Once you are happy with this function you sit and think about which arguments to this function should be provided with upfront. These should be added as arguments to the function that is called by the user. Next you think about which function arguments are statistics that should be learned on the train set in the prep
part of the recipe
. You’ll then go on and do the constructor that is called by both the main function and the prep
method. Only then you’ll write the function that is called by the user. You’ll custom step
or check
is completed by writing the four methods, since the functionality is already created, these are mostly bookkeeping.
For both checks and steps I made a skeleton available, in which all the “overhead” that should be in a new step or check is present. This is more convenient than copy-paste and existing example and try to figure out what is step specific and should be erased. You’ll find the skeletons on github. Note that for creating your own steps and checks you should clone the package source code, since the helper functions that are used are not exported by the package.
Putting it to practice
We are going to do two examples in which the recipe for recipes is applied.
Example 1: A signed log
First up is a signed-log (inspired by Practical Data Science with R), which is taking the log over the absolute value of a variable, multiplied by its original sign. Thus, enabling us to take logs over negative values. If a variable is between -1 and 1, we’ll set it to 0, otherwise things get messy.
1) preparing the function
This is what the function on a vector could look like, if we did not bother about recipes
:
signed_log <- function(x, base = exp(1)) {
ifelse(abs(x) < 1,
0,
sign(x) * log(abs(x), base = base))
}
2) think about the arguments
The only argument of the function is base
, that should be provided upfront when adding the step to a recipe object. There are no statistics to be learned in the prep
step.
3) the constructor
Now we are going to write the first of the recipes functions. This is the constructor that produces new instances of the object of class step_signed_log
. The first four arguments are present in each step or check, they are therefore part of the skeletons. The terms
argument will hold the information on which columns the step should be performed. For role
, train
and skip
, please see the documentation in one of the skeletons. base
is step_signed_log
specific, as used in 1). In prep.step_signed_log
the tidyselect arguments will be converted to a character vector holding the actual column names. columns
will store these names in the step_signed_log
object. This container argument will not be necessary if the columns names are also present in another way. For instance, step_center
has the argument means
, that will be populated by the means of the variables of the train set by its prep
method. In the bake
method the names of the columns to be prepared are already provided by the names of the means and there is need to use the columns
argument.
step_signed_log_new <-
function(terms = NULL,
role = NA,
skip = FALSE,
trained = FALSE,
base = NULL,
columns = NULL) {
step(
subclass = "signed_log",
terms = terms,
role = role,
skip = skip,
trained = trained,
base = base,
columns = columns
)
}
4) the function to add it to the recipe
Next up is the function that will be called by the user to add the step to its recipe. You’ll see the internal helper function add_step
is called. It will expand the recipe with the step_signed_log
object that is produced by the constructor we just created.
step_signed_log <-
function(recipe,
...,
role = NA,
skip = FALSE,
trained = FALSE,
base = exp(1),
columns = NULL) {
add_step(
recipe,
step_signed_log_new(
terms = ellipse_check(...),
role = role,
skip = skip,
trained = trained,
base = base,
columns = columns
)
)
}
5) the prep
method
As recognized in 2) we don’t have to do much in the prep
method of this particular step, since the preparation of new sets does not depend on statistics learned on the train set. The only thing we do here is applying the internal function terms_select
function to replace the tidyselect
selections, by the actual names of the columns on which step_signed_log
should be applied. We call the constructor again, indicating that the step is trained and we supplying the column names at the columns
argument.
prep.step_signed_log <- function(x,
training,
info = NULL,
...) {
col_names <- terms_select(x$terms, info = info)
step_signed_log_new(
terms = x$terms,
role = x$role,
skip = x$skip,
trained = TRUE,
base = x$base,
columns = col_names
)
}
6) the bake
method
We are now ready to apply the baking function, designed at 1), inside the recipes framework. We loop through the variables, apply the function and return the updated data frame.
bake.step_signed_log <- function(object,
newdata,
...) {
col_names <- object$columns
for (i in seq_along(col_names)) {
col <- newdata[[ col_names[i] ]]
newdata[, col_names[i]] <-
ifelse(abs(col) < 1,
0,
sign(col) * log(abs(col), base = object$base))
}
as_tibble(newdata)
}
7) the print
method
This assures pretty printing of the recipe
object to which step_signed_log
is added. You use the internal printer
function with a message specific for the step.
print.step_signed_log <-
function(x, width = max(20, options()$width - 30), ...) {
cat("Taking the signed log for ", sep = "")
printer(x$columns, x$terms, x$trained, width = width)
invisible(x)
}
8) the tidy
method
Finally, tidy
will add a line for this step to the data frame when the tidy
method is called on a recipe.
tidy.step_signed_log <- function(x, ...) {
if (is_trained(x)) {
res <- tibble(terms = x$columns)
} else {
res <- tibble(terms = sel2char(x$terms))
}
res
}
Lets do a quick check to see if it works as expected
recipe(data_frame(x = 1)) %>%
step_signed_log(x) %>%
prep() %>%
bake(data_frame(x = -3:3))
## # A tibble: 7 x 1
## x
## <dbl>
## 1 -1.0986123
## 2 -0.6931472
## 3 0.0000000
## 4 0.0000000
## 5 0.0000000
## 6 0.6931472
## 7 1.0986123
Example 2: A range check
Model predictions might be invalid when the range of a variable in new data is shifted from the range of the variable in the train set. Lets do a second example in which we check if the range of a numeric variable is approximately equal to the range of the variable in the train set. We do so by checking if the variable’s minimum value in the new data is not smaller than its minimum value in the train set. The variable’s maximum value in the test set should not exceed the maximum in the train set. We allow for some slack (proportion of the variable range in the train set) to account for natural variation.
1) preparing the function
As mentioned, checks
are about throwing informative errors if assumptions are not met. This is a function we could apply on new variables, without bothering about recipes
:
range_check_func <- function(x,
lower,
upper,
slack_prop = 0.05,
colname = "x") {
min_x <- min(x)
max_x <- max(x)
slack <- (upper - lower) * slack_prop
if (min_x < (lower - slack) & max_x > (upper + slack)) {
stop("min ", colname, " is ", min_x, ", lower bound is ", lower - slack,
"\n", "max x is ", max_x, ", upper bound is ", upper + slack,
call. = FALSE)
} else if (min_x < (lower - slack)) {
stop("min ", colname, " is ", min_x, ", lower bound is ", lower - slack,
call. = FALSE)
} else if (max_x > (upper + slack)) {
stop("max ", colname, " is ", max_x, ", upper bound is ", upper + slack,
call. = FALSE)
}
}
2) think about the arguments
The slack_prop
is a choice that the user of the check should make upfront. This is thus an argument of check_range
. Then there are two statistics to be learned in the prep
method: lower
and upper
. These should be arguments of the function and the constructor as well. However, when calling the function these are always NULL
, they are container arguments that will filled when calling prep.check_range
.
3) the constructor
We start again with the four arguments present in every step or check. Subsequently we add the three arguments that we recognized to be part of the check.
check_range_new <-
function(terms = NULL,
role = NA,
skip = FALSE,
trained = FALSE,
lower = NULL,
upper = NULL,
slack_prop = NULL) {
check(subclass = "range",
terms = terms,
role = role,
trained = trained,
lower = lower,
upper = upper,
slack_prop = slack_prop)
}
4) the function to add it to the recipe
As we know by now, it is just calling the constructor and adding it to the recipe.
check_range <-
function(recipe,
...,
role = NA,
skip = FALSE,
trained = FALSE,
lower = NULL,
upper = NULL,
slack_prop = 0.05) {
add_check(
recipe,
check_range_new(
terms = ellipse_check(...),
role = role,
skip = skip,
trained = trained,
lower = lower,
upper = upper,
slack_prop = slack_prop
)
)
}
5) the prep
method
Here the method is getting a lot more interesting, because we actually have work to do. We are calling vapply
on each of the columns the check should be applied on, to derive the minimum and maximum. Again the constructor is called and the learned statistics are now populating the lower
and upper
arguments.
prep.check_range <-
function(x,
training,
info = NULL,
...) {
col_names <- terms_select(x$terms, info = info)
lower_vals <- vapply(training[ ,col_names], min, c(min = 1),
na.rm = TRUE)
upper_vals <- vapply(training[ ,col_names], max, c(max = 1),
na.rm = TRUE)
check_range_new(
x$terms,
role = x$role,
trained = TRUE,
lower = lower_vals,
upper = upper_vals,
slack_prop = x$slack_prop
)
}
6) the bake
method
The hard work has been done already. We just get the columns on which to apply the check and check them with the function we created at 1).
bake.check_range <- function(object,
newdata,
...) {
col_names <- names(object$lower)
for (i in seq_along(col_names)) {
colname <- col_names[i]
range_check_func(newdata[[ colname ]],
object$lower[colname],
object$upper[colname],
object$slack_prop,
colname)
}
as_tibble(newdata)
}
7) the print
method
print.check_range <-
function(x, width = max(20, options()$width - 30), ...) {
cat("Checking range of ", sep = "")
printer(names(x$lower), x$terms, x$trained, width = width)
invisible(x)
}
8) the tidy
method
tidy.check_range <- function(x, ...) {
if (is_trained(x)) {
res <- tibble(terms = x$columns)
} else {
res <- tibble(terms = sel2char(x$terms))
}
res
}
Again, we check quickly if it works
cr1 <- data_frame(x = 0:100)
cr2 <- data_frame(x = -1:101)
cr3 <- data_frame(x = -6:100)
cr4 <- data_frame(x = 0:106)
recipe_cr <- recipe(cr1) %>% check_range(x) %>% prep()
cr2_baked <- recipe_cr %>% bake(cr2)
cr3_baked <- recipe_cr %>% bake(cr3)
## Error: min x is -6, lower bound is -5
cr4_baked <- recipe_cr %>% bake(cr4)
## Error: max x is 106, upper bound is 105
Conclusion
If you like to add your own data preparation steps and data checks to the recipes
package, I advise you to do this in a structured way so you are not distracting by the bookkeeping while implementing the functionality. I propose eight subsequent parts to develop a new step or check:
1) Create a function on a vector that could be applied in the bake
method, but does not bother about recipes
yet.
2) Recognize which function arguments should be provided upfront and which should be learned in the prep
method.
3) Create a constructor in the form step_<name>_new
or check_<name>_new
.
4) Create the actual function to add the step or check to a recipe, in the form step_<name>
or check_<name>
.
5) Write the prep
method.
6) Write the bake
method.
7) Write the print
method.
8) Write the tidy
method.
As mentioned, the source code is maintained here. Make sure to clone the latest version to access the package internal functions.
-
Together with Max I added the
checks
framework to the package. Wheresteps
are transformations of variables,checks
are assertions of them. If a check passes nothing happens. If it fails, it will break thebake
method of the recipe and throws an informative error. ↩