Pad the datetime column of a data frame

pad fills the gaps in an incomplete datetime variable by determining the interval of the data and which time points are missing. It inserts a record for each missing time point. All other variables in the data frame get a missing value at the padded rows.

pad(x, interval = NULL, start_val = NULL, end_val = NULL, by = NULL,
  group = NULL, break_above = 1)

Arguments

x	A data frame containing at least one variable of class `Date`, `POSIXct` or `POSIXlt`.
interval	The interval of the returned datetime variable. Any character string that would be accepted by `seq.Date()` or `seq.POSIXt()`. When NULL, the interval will be equal to the interval of the datetime variable. When specified, it can only be lower than the interval and step size of the input data. See Details.
start_val	An object of class `Date`, `POSIXct` or `POSIXlt` that specifies the start of the returned datetime variable. If NULL it will use the lowest value of the input variable.
end_val	An object of class `Date`, `POSIXct` or `POSIXlt` that specifies the end of the returned datetime variable. If NULL it will use the highest value of the input variable.
by	Only needs to be specified when `x` contains multiple variables of class `Date`, `POSIXct` or `POSIXlt`. Indicates which variable to use for padding.
group	Optional character vector that specifies the grouping variable(s). Padding will take place within the different groups. When interval is not specified, it will be determined by applying `get_interval` on the datetime variable as a whole, ignoring groups (see last example).
break_above	Numeric value that indicates the number of rows in millions above which the function will break. Safety net for situations where the interval is different than expected and padding yields a very large dataframe, possibly overflowing memory.

Value

The data frame x with the datetime variable padded. All non-grouping variables in the data frame will have missing values at the rows that are padded. The result will always be sorted on the datetime variable. If group is not NULL, the result is sorted on the grouping variable(s) first, then on the datetime variable.

Details

The interval of a datetime variable is the time unit at which the observations occur. The eight intervals in padr are, from high to low: year, quarter, month, week, day, hour, min, and sec. Since padr v0.3.0 the interval is no longer limited to a single unit (intervals like 5 minutes, 6 hours, or 10 days are possible). pad will figure out the interval of the input variable and the step size, and will fill the gaps for the instances that would be expected from the interval and step size, but are missing in the input data. Note that when start_val and/or end_val are specified, they are concatenated with the datetime variable before the interval is determined.
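One way to picture the step size is as the largest step that lands on every observation. The base-R sketch below illustrates that idea for daily data and shows how concatenating a start_val before detection can change the result. It is an illustration only and not padr's implementation; the helper `gcd` and the variable names are hypothetical.

```r
# Hypothetical helper (not part of padr): greatest common divisor.
gcd <- function(a, b) if (b == 0) a else gcd(b, a %% b)

day <- as.Date(c("2016-04-01", "2016-04-03", "2016-04-07"))

# Differences between consecutive observations are 2 and 4 days,
# so the largest step that hits every observation is 2 days.
step_days <- Reduce(gcd, as.integer(diff(sort(day))))
step_days
#> [1] 2

# Concatenating a start value first adds a difference of 1 day,
# which lowers the detected step to 1 day.
start_val <- as.Date("2016-03-31")
step_with_start <- Reduce(gcd, as.integer(diff(sort(c(start_val, day)))))
step_with_start
#> [1] 1

# Instances expected from the interval and step size, and the ones missing.
expected <- seq(min(day), max(day), by = paste(step_days, "day"))
missing_days <- expected[!expected %in% day]
missing_days
#> [1] "2016-04-05"
```

In this sketch the missing instance is the one a padding step would insert as a new row.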

Rows with missing values in the datetime variables will be retained. However, they will be moved to the end of the returned data frame.
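The behaviour described above can be approximated in base R, assuming a known interval of one day: set aside the rows without a datetime, join the remaining observations onto the full sequence of expected days, and append the set-aside rows at the end. This is a sketch under those assumptions, not how pad is implemented, and all object names are illustrative.

```r
df <- data.frame(day = as.Date(c("2016-04-03", NA, "2016-04-01")),
                 some_value = c(4, 5, 3))

# Separate rows with and without a datetime value.
has_dt <- !is.na(df$day)
observed <- df[has_dt, ]

# Every day between the first and last observation.
full_range <- data.frame(day = seq(min(observed$day), max(observed$day),
                                   by = "day"))

# Left join: days without an observation get NA in the other columns.
# merge() sorts the result on the key column by default.
padded <- merge(full_range, observed, by = "day", all.x = TRUE)

# Rows with a missing datetime are kept and moved to the end.
result <- rbind(padded, df[!has_dt, ])
rownames(result) <- NULL
result
#>          day some_value
#> 1 2016-04-01          3
#> 2 2016-04-02         NA
#> 3 2016-04-03          4
#> 4       <NA>          5
```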

Examples

library(padr)
simple_df <- data.frame(day = as.Date(c('2016-04-01', '2016-04-03')),
                        some_value = c(3, 4))
pad(simple_df)
#> pad applied on the interval: 2 day
#>          day some_value
#> 1 2016-04-01          3
#> 2 2016-04-03          4
pad(simple_df, interval = "day")
#>          day some_value
#> 1 2016-04-01          3
#> 2 2016-04-02         NA
#> 3 2016-04-03          4

library(dplyr) # for the pipe operator, arrange(), group_by() and do()
month <- seq(as.Date('2016-04-01'), as.Date('2017-04-01'),
              by = 'month')[c(1, 4, 5, 7, 9, 10, 13)]
month_df <- data.frame(month = month,
                       y = runif(length(month), 10, 20) %>% round)
# forward fill the padded values with tidyr's fill
month_df %>% pad %>% tidyr::fill(y)
#> pad applied on the interval: month
#>         month  y
#> 1  2016-04-01 16
#> 2  2016-05-01 16
#> 3  2016-06-01 16
#> 4  2016-07-01 10
#> 5  2016-08-01 14
#> 6  2016-09-01 14
#> 7  2016-10-01 17
#> 8  2016-11-01 17
#> 9  2016-12-01 15
#> 10 2017-01-01 12
#> 11 2017-02-01 12
#> 12 2017-03-01 12
#> 13 2017-04-01 16

# or fill all y with 0
month_df %>% pad %>% fill_by_value(y)
#> pad applied on the interval: month
#>         month  y
#> 1  2016-04-01 16
#> 2  2016-05-01  0
#> 3  2016-06-01  0
#> 4  2016-07-01 10
#> 5  2016-08-01 14
#> 6  2016-09-01  0
#> 7  2016-10-01 17
#> 8  2016-11-01  0
#> 9  2016-12-01 15
#> 10 2017-01-01 12
#> 11 2017-02-01  0
#> 12 2017-03-01  0
#> 13 2017-04-01 16

# padding a data.frame on group level
day_var <- seq(as.Date('2016-01-01'), length.out = 12, by = 'month')
x_df_grp <- data.frame(grp1 = rep(LETTERS[1:3], each = 4),
                       grp2 = letters[1:2],
                       y    = runif(12, 10, 20) %>% round(0),
                       date = sample(day_var, 12, TRUE)) %>%
  arrange(grp1, grp2, date)

# pad by one grouping var
x_df_grp %>% pad(group = 'grp1')
#> pad applied on the interval: month
#>    grp1 grp2  y       date
#> 1     A    a 18 2016-06-01
#> 2     A    b 13 2016-06-01
#> 3     A <NA> NA 2016-07-01
#> 4     A <NA> NA 2016-08-01
#> 5     A <NA> NA 2016-09-01
#> 6     A <NA> NA 2016-10-01
#> 7     A    a 10 2016-11-01
#> 8     A    b 15 2016-11-01
#> 9     B    a 19 2016-05-01
#> 10    B    b 16 2016-05-01
#> 11    B    b 16 2016-05-01
#> 12    B    a 15 2016-06-01
#> 13    C    b 18 2016-02-01
#> 14    C <NA> NA 2016-03-01
#> 15    C <NA> NA 2016-04-01
#> 16    C <NA> NA 2016-05-01
#> 17    C <NA> NA 2016-06-01
#> 18    C <NA> NA 2016-07-01
#> 19    C    b 19 2016-08-01
#> 20    C <NA> NA 2016-09-01
#> 21    C <NA> NA 2016-10-01
#> 22    C    a 20 2016-11-01
#> 23    C    a 11 2016-11-01

# pad by two grouping vars
x_df_grp %>% pad(group = c('grp1', 'grp2'), interval = "month")
#> Warning: datetime variable does not vary for 2 of the groups, no padding applied on this / these group(s)
#>    grp1 grp2  y       date
#> 1     A    a 18 2016-06-01
#> 2     A    a NA 2016-07-01
#> 3     A    a NA 2016-08-01
#> 4     A    a NA 2016-09-01
#> 5     A    a NA 2016-10-01
#> 6     A    a 10 2016-11-01
#> 7     A    b 13 2016-06-01
#> 8     A    b NA 2016-07-01
#> 9     A    b NA 2016-08-01
#> 10    A    b NA 2016-09-01
#> 11    A    b NA 2016-10-01
#> 12    A    b 15 2016-11-01
#> 13    B    a 19 2016-05-01
#> 14    B    a 15 2016-06-01
#> 15    B    b 16 2016-05-01
#> 16    B    b 16 2016-05-01
#> 17    C    a 20 2016-11-01
#> 18    C    a 11 2016-11-01
#> 19    C    b 18 2016-02-01
#> 20    C    b NA 2016-03-01
#> 21    C    b NA 2016-04-01
#> 22    C    b NA 2016-05-01
#> 23    C    b NA 2016-06-01
#> 24    C    b NA 2016-07-01
#> 25    C    b 19 2016-08-01

# When the group argument is used, the interval is determined over all the
# observations, ignoring the groups.
x <- data.frame(dt_var = as.Date(c("2017-01-01", "2017-03-01", "2017-05-01",
                                   "2017-01-01", "2017-02-01", "2017-04-01")),
                id = rep(1:2, each = 3), val = round(rnorm(6)))
pad(x, group = "id")
#> pad applied on the interval: month
#>       dt_var id val
#> 1 2017-01-01  1   0
#> 2 2017-02-01  1  NA
#> 3 2017-03-01  1   1
#> 4 2017-04-01  1  NA
#> 5 2017-05-01  1   0
#> 6 2017-01-01  2   0
#> 7 2017-02-01  2  -1
#> 8 2017-03-01  2  NA
#> 9 2017-04-01  2   1
# applying pad with do, the interval is determined individually for each group
x %>% group_by(id) %>% do(pad(.))
#> pad applied on the interval: 2 month
#> pad applied on the interval: month
#> # A tibble: 7 x 3
#> # Groups:   id [3]
#>   dt_var        id   val
#>   <date>     <int> <dbl>
#> 1 2017-01-01     1     0
#> 2 2017-03-01     1     1
#> 3 2017-05-01     1     0
#> 4 2017-01-01     2     0
#> 5 2017-02-01     2    -1
#> 6 2017-03-01    NA    NA
#> 7 2017-04-01     2     1
