Using RStudio Jobs for training many models in parallel

By Edwin Thoen

Recently, RStudio added the Jobs feature, which allows you to run R scripts in the background. Computations are done in a separate R session that is not interactive, but just runs the script. In the meantime your regular R session stays live, so you can do other work while waiting for the Job to complete. Instead of refreshing your Twitter for the 156th time, you can stay productive! (I am actually writing this blog in Rmarkdown while I am waiting for my results to come in.) The number of Jobs you can spin up is not limited to one. As each new Job is started on a different core, you can start as many Jobs as your system has cores (although leaving one idle is a good idea for other processes, like your interactive R session).

Recently, I needed to train many Bayesian models on subsets of a large dataset. The subsets varied greatly in size: most of the models needed only a few minutes to train, but the ones trained on the larger subsets took up to half an hour. With just a single Job the whole thing would have taken over fifteen hours. Luckily, I have a server at my disposal with many cores and a lot of RAM. I chose to use Jobs for running the model training in parallel for two reasons. First, as mentioned, it allows me to do other R work while waiting for the results to come in. Second, I train the models with rstan, which can use multiple cores per model: each chain gets its own core. So we have parallelization within the parallel Jobs. I could only get this to work efficiently with Jobs; packages for parallelization did not seem to be able to handle this parallel-within-parallel setup. (Disclaimer: I am no expert on parallelization, so a more knowledgeable person might figure it out. I stopped digging once I got what I wanted with Jobs.)

Here, I share the steps I took to get the whole thing running.

Create the Jobs script

First, capture everything the Job should do in an R script. You could choose to import your current global environment when starting the Job, but be careful with this when starting multiple Jobs. If you have large objects in your environment, they will be loaded into every Job you start and you might run out of memory. Importing the global environment also poses a threat to the reproducibility of the outcome, because the Job can only run when the required objects are present in the global environment. Rather, I suggest you make a script that is completely self-contained. Everything that is needed within the Job (dependencies, loaded data, your own functions) should be in the script or sourced by the script. To run the same Job multiple times on different data, include a parameter in the script that differs over the Jobs. More about that later.
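
For illustration, the skeleton of such a self-contained Job script could look something like the sketch below. The file names and the fit_models() helper are made up for the example; the part parameter is the one discussed later on.

# Jobs_script.R: self-contained, loads its own dependencies, functions and data
library(rstan)

# your own functions, sourced by the Job itself (hypothetical file name)
source("model_functions.R")

# the part of the data this Job works on; more on this parameter later
part <- 1
subset_data <- read.csv(paste0("location_to_file/filename_part", part, ".csv"))

# do the actual work and write the result to disk
# (fit_models() is a stand-in for whatever model training you do)
fits <- fit_models(subset_data)
saveRDS(fits, paste0("fits_part", part, ".rds"))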

Divide your work

Within each Job, the computations will run sequentially. If you want the parallel Jobs to finish at approximately the same time, you should divide the work they have to do into approximately equal chunks. The number of Jobs you can run simultaneously is limited by the number of cores you have available and by the RAM on your system. The first is easy to determine: just run

parallel::detectCores()

If you have 8 cores on your machine, you can spin up 7 parallel Jobs, provided you have enough memory available. Figuring out how much memory a Job consumes is a bit of trial and error. If you run a task for the first time, it might be a good idea to run a test Job and check on its memory use. If you are on a unix system, you can see how much memory the Job consumes with the top command on the command line, which shows all running processes and the percentage of the available RAM they use. If the test Job consumes about 15% of your available memory, you could spin up about 4 to 5 parallel Jobs, assuming that the amount of memory used is approximately equal for each part of the data. Always allow for some room for fluctuation; I don’t let the Jobs consume more than 75% of the available memory. If your Jobs are very volatile in their memory use, you might want to be even more conservative, since it is one strike and you’re out. (I know absolutely nothing about Windows; if you are on Windows and not sure how to monitor memory use, you will have to do some Googling, I am afraid.)
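
To make the arithmetic explicit, here is a minimal sketch using the example figures from above (15% per Job, 75% budget):

n_cores     <- parallel::detectCores()
mem_per_job <- 0.15   # share of RAM the test Job used, observed with top
mem_budget  <- 0.75   # don't let all Jobs together use more than 75% of RAM
n_jobs <- min(n_cores - 1,                    # leave one core for the interactive session
              floor(mem_budget / mem_per_job))
n_jobs                                        # at most 5 Jobs in this example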

After you have decided how many Jobs you can run in parallel, it is time to split up the work. I needed to train a great number of models on datasets varying in size. Splitting up the subsets into chunks with an equal number of models to train is a suboptimal strategy, because if several larger sets end up in the same partition, you have to wait a long time for one of the Jobs while the others have long finished. I used the number of rows of each subset as an indication of the time it would take for the model trained on it to complete. With the following little helper function you can create approximately equal chunks of the data: x is a vector with an id for every group on which a model is trained, w is a vector with weights (the number of rows per group), and g is the desired number of chunks, which is equal to the number of Jobs you’ll start.

assign_to_chunk <- function(x, w, g) {
  stopifnot(length(x) == length(w),
            is.numeric(w),
            length(g) == 1,
            g %% 1 == 0)
  # the ideal cut points: each chunk gets about 1/g of the total weight
  splits <- (sum(w) / g) * (1:(g - 1))
  total_sizes <- cumsum(w)
  # for each cut point, find the position where the cumulative weight is closest
  end_points <- sapply(splits, function(s) which.min(abs(total_sizes - s)))
  start_points <- c(1, end_points + 1)
  end_points   <- c(end_points, length(x))
  # assign each id to its chunk
  chunk_assignment <- rep(1:g, end_points - start_points + 1)
  data.frame(x, chunk_assignment)
}
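
As a made-up example, suppose there are six subsets with known row counts that should be divided over three Jobs:

subset_ids <- paste0("subset_", 1:6)
n_rows     <- c(1000, 200, 5000, 300, 2500, 800)
assign_to_chunk(subset_ids, n_rows, 3)
# subsets 1 and 2 end up in chunk 1, 3 and 4 in chunk 2, 5 and 6 in chunk 3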

If you have a large dataset, like I did, it is a good idea to physically partition the data into separate files. Otherwise each Job has to hold the complete data set in memory, wasting RAM on data records it doesn’t need.
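
A sketch of that step, assuming the complete data sits in a data frame called full_data with a grouping column id, and chunks is the output of assign_to_chunk() above:

# write each chunk to its own file, so a Job only has to read the part it needs
for (i in unique(chunks$chunk_assignment)) {
  ids_in_chunk <- chunks$x[chunks$chunk_assignment == i]
  chunk_data   <- full_data[full_data$id %in% ids_in_chunk, ]
  write.csv(chunk_data, paste0("location_to_file/filename_part", i, ".csv"),
            row.names = FALSE)
}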

Getting the whole thing to run

Now all that is left is setting your system on fire by using its full capacity. There are two ways to get your Job to run on different data sets. The first is to give your Job script a parameter variable whose value you change before starting the next Job. Say you have split your data into five parts, called filename_part1.csv up until filename_part5.csv. You could then start your Job script with

part <- 1
read.csv(paste0("location_to_file/filename_part", part, ".csv"))

Next, you start the Job manually in the Jobs pane in RStudio. For the next part you change part to the value 2 and start another Job, and so on until you have started five Jobs. This is a little tedious and error prone. A more elegant solution was proposed by RStudio’s Jonathan McPherson, in which the rstudioapi::jobRunScript function is used to kick off the Job. In a second script you start each Job with this function instead of through the pane, updating the part variable before each call.

part <- 1 
rstudioapi::jobRunScript(path = "path_to_Jobs_script", importEnv = TRUE)
part <- 2
rstudioapi::jobRunScript(path = "path_to_Jobs_script", importEnv = TRUE)

# ...

part <- 5
rstudioapi::jobRunScript(path = "path_to_Jobs_script", importEnv = TRUE)

You now only have to run this second script to have all the Jobs started at once. This is even more convenient if you have several variables whose values vary over the Jobs. A downside of this approach is that you have to import the global environment into every one of them. As mentioned, you must then make sure the Job is reproducible and that the total memory of your system is not exceeded.
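
If you have more parts, or several variables that vary over the Jobs, a small loop at the top level of your session keeps this launcher script short. This sketch relies on the same assumption as the example above: part sits in the global environment when the Job is started, so it is passed along by importEnv = TRUE.

for (part in 1:5) {
  rstudioapi::jobRunScript(path = "path_to_Jobs_script", importEnv = TRUE)
}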

A major thanks to the RStudio team for providing this awesome functionality. It is a great productivity boost for me!
