pad fills the gaps in incomplete datetime variables by determining the
interval of the data and which instances are missing. It inserts a record
for each missing time point. All other variables in the data frame get a
missing value at the padded rows.
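The idea can be sketched in base R, without padr. This is only an illustration of the concept, not padr's implementation, and it assumes the interval ("day") is already known rather than detected from the data:

```r
# Two observations with one day missing in between.
simple_df <- data.frame(day = as.Date(c("2016-04-01", "2016-04-03")),
                        some_value = c(3, 4))

# Every time point expected at a daily interval (interval assumed, not detected).
full_days <- data.frame(day = seq(min(simple_df$day), max(simple_df$day),
                                  by = "day"))

# A left join onto the complete sequence leaves NA in the other columns
# of the rows that were added.
padded <- merge(full_days, simple_df, by = "day", all.x = TRUE)
print(padded)
```

The row for 2016-04-02 is new and has NA for some_value, which mirrors what pad does for padded rows.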
pad(x, interval = NULL, start_val = NULL, end_val = NULL, by = NULL, group = NULL, break_above = 1)
Argument | Description |
---|---|
x | A data frame containing at least one variable of class |
interval | The interval of the returned datetime variable. Any character string that would be accepted by |
start_val | An object of class |
end_val | An object of class |
by | Only needs to be specified when |
group | Optional character vector that specifies the grouping variable(s). Padding will take place within the different groups. When interval is not specified, it will be determined applying |
break_above | Numeric value that indicates the number of rows in millions above which the function will break. Safety net for situations where the interval is different than expected and padding yields a very large data frame, possibly overflowing memory. |
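To see why a safety net like break_above is useful, the rough base R sketch below counts the rows a padded result would have at two different intervals. The date range is made up for illustration, and the comparison against break_above only mimics the documented threshold; how padr performs the check internally is not shown here.

```r
# Hypothetical range: one full (leap) year of data.
start <- as.POSIXct("2016-01-01 00:00:00", tz = "UTC")
end   <- as.POSIXct("2016-12-31 23:59:59", tz = "UTC")

# Rows after padding at the "sec" interval versus the "day" interval.
n_rows_sec <- as.numeric(difftime(end, start, units = "secs")) + 1
n_rows_day <- length(seq(as.Date(start), as.Date(end), by = "day"))

break_above <- 1  # in millions of rows, as in the argument description
exceeds_sec <- n_rows_sec > break_above * 1e6
exceeds_day <- n_rows_day > break_above * 1e6

print(c(sec = n_rows_sec, day = n_rows_day))
print(c(sec = exceeds_sec, day = exceeds_day))
```

At the "sec" interval the year expands to more than 31 million rows, while at the "day" interval it is only 366, so an unexpectedly low interval is the case the threshold guards against.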
The data frame x with the datetime variable padded. All non-grouping
variables in the data frame will have missing values at the rows that are
padded. The result will always be sorted on the datetime variable. If group
is not NULL, the result is sorted on the grouping variable(s) first, then on
the datetime variable.
The interval of a datetime variable is the time unit at which the
observations occur. The eight intervals in padr are, from high to low, year,
quarter, month, week, day, hour, min, and sec. Since padr v.0.3.0 the
interval is no longer limited to a single unit (intervals like 5 minutes,
6 hours, or 10 days are possible). pad will figure out the interval of the
input variable and the step size, and will fill the gaps for the instances
that would be expected from the interval and step size but are missing in
the input data.
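As a rough illustration of "interval plus step size", the base R sketch below derives a step in minutes from the gaps between observations and lists the instances that are expected but absent. The gcd helper is hypothetical and this is not how padr determines the interval; it only shows the kind of result the text describes.

```r
# Hypothetical helper, not part of padr: greatest common divisor.
gcd <- function(a, b) if (b == 0) a else gcd(b, a %% b)

times <- as.POSIXct(c("2016-04-01 10:00:00",
                      "2016-04-01 10:05:00",
                      "2016-04-01 10:20:00"), tz = "UTC")

# Gaps between consecutive observations, in minutes: 5 and 15.
gaps_min <- as.numeric(diff(times), units = "mins")

# The largest step that fits every gap is 5 minutes.
step <- Reduce(gcd, gaps_min)

# Instances expected at a "5 min" interval, and the ones that are missing.
expected <- seq(min(times), max(times), by = paste(step, "min"))
missing_points <- expected[!expected %in% times]
print(missing_points)
```

Here 10:10 and 10:15 are the time points a padding step would have to insert.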
Note that when start_val and/or end_val are specified, they are concatenated
with the datetime variable before the interval is determined.
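The consequence of this concatenation can be sketched in base R. The dates and the start and end values below are made up for illustration, and the sketch only inspects the gaps; it does not call padr.

```r
dates     <- as.Date(c("2016-04-02", "2016-04-04"))
start_val <- as.Date("2016-04-01")  # illustrative value
end_val   <- as.Date("2016-04-05")  # illustrative value

# Gaps in the datetime variable on its own: a single gap of 2 days.
gaps_alone <- as.numeric(diff(dates), units = "days")

# Gaps after concatenating start_val and end_val: 1, 2 and 1 days.
combined      <- sort(unique(c(start_val, dates, end_val)))
gaps_combined <- as.numeric(diff(combined), units = "days")

print(gaps_alone)
print(gaps_combined)
```

On its own the variable is consistent with a two-day step, but once the start and end values are included only a one-day step fits all gaps, so specifying them can change the interval that is determined.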
Rows with missing values in the datetime variables will be retained. However, they will be moved to the end of the returned data frame.
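The resulting row order resembles what base R's order gives with na.last = TRUE. The sketch below uses a made-up data frame and only illustrates the ordering, not padr's internals.

```r
df <- data.frame(day = as.Date(c("2016-04-03", NA, "2016-04-01")),
                 some_value = c(4, 9, 3))

# Sort on the datetime variable; rows with a missing date go last.
sorted <- df[order(df$day, na.last = TRUE), ]
print(sorted)
```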
simple_df <- data.frame(day = as.Date(c('2016-04-01', '2016-04-03')),
                        some_value = c(3,4))
pad(simple_df)
#>          day some_value
#> 1 2016-04-01          3
#> 2 2016-04-03          4
pad(simple_df, interval = "day")
#>          day some_value
#> 1 2016-04-01          3
#> 2 2016-04-02         NA
#> 3 2016-04-03          4

library(dplyr) # for the pipe operator
month <- seq(as.Date('2016-04-01'), as.Date('2017-04-01'),
             by = 'month')[c(1, 4, 5, 7, 9, 10, 13)]
month_df <- data.frame(month = month,
                       y = runif(length(month), 10, 20) %>% round)

# forward fill the padded values with tidyr's fill
month_df %>% pad %>% tidyr::fill(y)
#>         month  y
#> 1  2016-04-01 16
#> 2  2016-05-01 16
#> 3  2016-06-01 16
#> 4  2016-07-01 10
#> 5  2016-08-01 14
#> 6  2016-09-01 14
#> 7  2016-10-01 17
#> 8  2016-11-01 17
#> 9  2016-12-01 15
#> 10 2017-01-01 12
#> 11 2017-02-01 12
#> 12 2017-03-01 12
#> 13 2017-04-01 16

# or fill all y with 0
month_df %>% pad %>% fill_by_value(y)
#>         month  y
#> 1  2016-04-01 16
#> 2  2016-05-01  0
#> 3  2016-06-01  0
#> 4  2016-07-01 10
#> 5  2016-08-01 14
#> 6  2016-09-01  0
#> 7  2016-10-01 17
#> 8  2016-11-01  0
#> 9  2016-12-01 15
#> 10 2017-01-01 12
#> 11 2017-02-01  0
#> 12 2017-03-01  0
#> 13 2017-04-01 16

# padding a data.frame on group level
day_var <- seq(as.Date('2016-01-01'), length.out = 12, by = 'month')
x_df_grp <- data.frame(grp1 = rep(LETTERS[1:3], each = 4),
                       grp2 = letters[1:2],
                       y    = runif(12, 10, 20) %>% round(0),
                       date = sample(day_var, 12, TRUE)) %>%
  arrange(grp1, grp2, date)

# pad by one grouping var
x_df_grp %>% pad(group = 'grp1')
#>    grp1 grp2  y       date
#> 1     A    a 18 2016-06-01
#> 2     A    b 13 2016-06-01
#> 3     A <NA> NA 2016-07-01
#> 4     A <NA> NA 2016-08-01
#> 5     A <NA> NA 2016-09-01
#> 6     A <NA> NA 2016-10-01
#> 7     A    a 10 2016-11-01
#> 8     A    b 15 2016-11-01
#> 9     B    a 19 2016-05-01
#> 10    B    b 16 2016-05-01
#> 11    B    b 16 2016-05-01
#> 12    B    a 15 2016-06-01
#> 13    C    b 18 2016-02-01
#> 14    C <NA> NA 2016-03-01
#> 15    C <NA> NA 2016-04-01
#> 16    C <NA> NA 2016-05-01
#> 17    C <NA> NA 2016-06-01
#> 18    C <NA> NA 2016-07-01
#> 19    C    b 19 2016-08-01
#> 20    C <NA> NA 2016-09-01
#> 21    C <NA> NA 2016-10-01
#> 22    C    a 20 2016-11-01
#> 23    C    a 11 2016-11-01

# pad by two grouping vars
x_df_grp %>% pad(group = c('grp1', 'grp2'), interval = "month")
#> Warning: datetime variable does not vary for 2 of the groups, no padding applied on this / these group(s)
#>    grp1 grp2  y       date
#> 1     A    a 18 2016-06-01
#> 2     A    a NA 2016-07-01
#> 3     A    a NA 2016-08-01
#> 4     A    a NA 2016-09-01
#> 5     A    a NA 2016-10-01
#> 6     A    a 10 2016-11-01
#> 7     A    b 13 2016-06-01
#> 8     A    b NA 2016-07-01
#> 9     A    b NA 2016-08-01
#> 10    A    b NA 2016-09-01
#> 11    A    b NA 2016-10-01
#> 12    A    b 15 2016-11-01
#> 13    B    a 19 2016-05-01
#> 14    B    a 15 2016-06-01
#> 15    B    b 16 2016-05-01
#> 16    B    b 16 2016-05-01
#> 17    C    a 20 2016-11-01
#> 18    C    a 11 2016-11-01
#> 19    C    b 18 2016-02-01
#> 20    C    b NA 2016-03-01
#> 21    C    b NA 2016-04-01
#> 22    C    b NA 2016-05-01
#> 23    C    b NA 2016-06-01
#> 24    C    b NA 2016-07-01
#> 25    C    b 19 2016-08-01

# Using the group argument, the interval is determined over all the
# observations, ignoring the groups.
x <- data.frame(dt_var = as.Date(c("2017-01-01", "2017-03-01", "2017-05-01",
                                   "2017-01-01", "2017-02-01", "2017-04-01")),
                id = rep(1:2, each = 3),
                val = round(rnorm(6)))
pad(x, group = "id")
#>       dt_var id val
#> 1 2017-01-01  1   0
#> 2 2017-02-01  1  NA
#> 3 2017-03-01  1   1
#> 4 2017-04-01  1  NA
#> 5 2017-05-01  1   0
#> 6 2017-01-01  2   0
#> 7 2017-02-01  2  -1
#> 8 2017-03-01  2  NA
#> 9 2017-04-01  2   1

# applying pad with do, the interval is determined individually for each group
x %>% group_by(id) %>% do(pad(.))
#> # A tibble: 7 x 3
#> # Groups:   id [3]
#>   dt_var        id   val
#>   <date>     <int> <dbl>
#> 1 2017-01-01     1     0
#> 2 2017-03-01     1     1
#> 3 2017-05-01     1     0
#> 4 2017-01-01     2     0
#> 5 2017-02-01     2    -1
#> 6 2017-03-01    NA    NA
#> 7 2017-04-01     2     1