padr version 0.4.0 now on CRAN

By Edwin Thoen

I am happy to share that the latest version of padr just hit CRAN. This new version comprises bug fixes, performance improvements and new functions for formatting datetime variables. But above all, it introduces the custom paradigm that enables you to do asymmetric analysis.

Improvements to existing functions

thicken used to get slowish when the data frame got large. Several adjustments resulted in a considerable performance gain. It is now approximately 6 times faster than in version 0.3.0.

Both pad and thicken used to break with uninformative errors when the datetime variable contained missing values. Both functions now allow for missing values. thicken leaves a record with a missing datetime value where it is and enters a missing value in the column it adds to the data frame. pad moves all records with a missing datetime value to the end of the data frame and otherwise ignores them when padding. Both functions throw a warning when the datetime variable has missing values.
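A minimal sketch of this behaviour, using two toy data frames that are made up for illustration (they are not part of padr). The column name ts_hour assumes thicken's default naming, which matches the time_stamp_3_hour column used later in this post.

```r
library(padr)

# Toy data (hypothetical): one record has a missing datetime value
obs <- data.frame(
  ts  = as.POSIXct(c("2016-01-01 08:10:00", NA, "2016-01-01 11:45:00"),
                   tz = "UTC"),
  val = 1:3
)

# thicken keeps the record in place and puts NA in the added column;
# a warning about the missing values is expected
thickened <- thicken(obs, "hour")
thickened

# Toy daily data (hypothetical) with a gap and a missing date
daily <- data.frame(
  day = as.Date(c("2016-01-01", NA, "2016-01-04")),
  val = 1:3
)

# pad fills the gap and moves the record with the missing date to the end;
# again a warning is expected
padded <- pad(daily, interval = "day")
padded
```

The interval is passed explicitly to pad here, because with only two non-missing dates the interval padr infers on its own may not be the daily one intended.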

New functions

span_date and span_time are wrappers around seq.Date and seq.POSIXt and were previously described in this blog post. Two new functions are introduced to reformat datetime variables. These can be especially useful for showing the data in a plot or a graph. center_interval shifts the datetime point from the beginning of the interval to the (approximate) center. Bar, line and dot plots in particular will reflect the data more accurately after centering.

library(tidyverse)
library(padr)
jan_first <- emergency %>% 
  filter(as.Date(time_stamp, tz = "EST") == as.Date("2016-01-01")) %>% 
  thicken("3 hour") %>% 
  count(time_stamp_3_hour) 
ggplot(jan_first, aes(time_stamp_3_hour, n)) + geom_bar(stat = "identity")


jan_first %>% 
  mutate(ts_center = center_interval(time_stamp_3_hour)) %>% 
  ggplot(aes(ts_center, n)) + geom_bar(stat = "identity")

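To see what centering does to the values themselves, rather than to a plot, here is a small sketch on a hand-made vector of 3-hour interval starts. The vector is invented for illustration and built with base R's seq.

```r
library(padr)

# Four consecutive 3-hour interval starts (hypothetical data)
starts <- seq(as.POSIXct("2016-01-01 00:00:00", tz = "UTC"),
              by = "3 hour", length.out = 4)

# Shift each point from the start of its interval towards the center
centered <- center_interval(starts)

# Compare the original and the centered points side by side
data.frame(starts, centered)

# Size of the shift that was applied to each point
difftime(centered, starts, units = "mins")
```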

Second, there is the format_interval function. It creates a character vector that describes each interval by applying strftime to both the start and the end point of the interval.

jan_first %>% 
  mutate(hours = format_interval(time_stamp_3_hour,
                                 start_format = "%b%d %H",
                                 end_format   = "%H",
                                 sep = "-")) %>% 
  ggplot(aes(hours, n)) + geom_bar(stat = "identity")

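Outside of a plot, the labels can be inspected directly. The sketch below reuses the arguments from the example above on an invented vector of 3-hour interval starts. The format strings are ordinary strftime patterns.

```r
library(padr)

# Four consecutive 3-hour interval starts (hypothetical data)
starts <- seq(as.POSIXct("2016-01-01 00:00:00", tz = "UTC"),
              by = "3 hour", length.out = 4)

# One label per interval: formatted start, separator, formatted end
labels <- format_interval(starts,
                          start_format = "%H:%M",
                          end_format   = "%H:%M",
                          sep          = " - ")
labels
```

Because the result is a character vector, ggplot2 treats it as a discrete variable, which is why the bars in the plot above are evenly spaced regardless of interval length.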

When using the interval “week”, one might want to start the weeks on a different day than Sunday. In thicken, the start_val should then be set to a date that falls on the desired weekday. Finding this date by hand is a bit tedious, which is why closest_weekday is introduced.

thicken(coffee, "week", 
        start_val = closest_weekday(coffee$time_stamp, 6))
##            time_stamp amount time_stamp_week
## 1 2016-07-07 09:11:21   3.14      2016-07-02
## 2 2016-07-07 09:46:48   2.98      2016-07-02
## 3 2016-07-09 13:25:17   4.11      2016-07-09
## 4 2016-07-10 10:45:11   3.14      2016-07-09
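To check which date closest_weekday picks, it can be called on its own. This sketch uses the coffee data set from the example above and base R's format to confirm the weekday. The numbering, with 0 for Sunday through 6 for Saturday, is the same one used by "%w" in strftime.

```r
library(padr)

# The Saturday (wday = 6) that closest_weekday selects for the coffee data
start_sat <- closest_weekday(coffee$time_stamp, 6)
start_sat

# Weekday number of the returned date: 0 = Sunday, ..., 6 = Saturday
format(start_sat, "%w")
```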

The custom suite

thicken and pad are highly automated, because they assume all datetime points to be equally spaced. Internally they span vectors of the desired interval. However, it can be useful to have asymmetrical periods between datetime points, especially when there are periods in which the number of observations is consistently lower, such as night hours or weekend days. The functions thicken_cust and pad_cust work like the original functions, except that the user has to provide their own spanning vector to which the observations are mapped.
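To make the idea of mapping onto an asymmetric spanning vector concrete, here is a sketch in base R. It is not how padr implements thicken_cust. It assumes that each observation is assigned to the latest spanning point at or before it, and all timestamps are invented for illustration.

```r
# Asymmetrically spaced spanning points (hypothetical): 3, 6 and 3 hours apart
spanned <- as.POSIXct(c("2016-01-01 07:00:00", "2016-01-01 10:00:00",
                        "2016-01-01 16:00:00", "2016-01-01 19:00:00"),
                      tz = "UTC")

# Observations to map onto the spanning points (hypothetical)
obs <- as.POSIXct(c("2016-01-01 08:15:00", "2016-01-01 12:40:00",
                    "2016-01-01 18:05:00", "2016-01-01 23:30:00"),
                  tz = "UTC")

# For each observation, the index of the last spanning point at or before it
idx <- findInterval(as.numeric(obs), as.numeric(spanned))

# The spanning point each observation falls under
mapped <- spanned[idx]
data.frame(obs, mapped)
```

An observation earlier than the first spanning point would get index 0 here and drop out of the subset, so the spanning vector should start at or before the first observation.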

Let's do an analysis of vehicle accidents in the emergency data set. We want to distinguish between morning rush hour (7-10), daytime (10-16), evening rush hour (16-19) and nighttime (19-7). This is the full analysis.

accidents_span <-
  span_around(emergency$time_stamp, "hour", start_shift = "2 hour") %>% 
  subset_span(list(hour = c(7, 10, 16, 19)))

emergency %>% 
  filter(title == "Traffic: VEHICLE ACCIDENT -") %>% 
  thicken_cust(accidents_span, "hour") %>% 
  count(hour) %>% 
  pad_cust(accidents_span) %>%
  fill_by_value() %>% 
  mutate(day    = as.Date(hour, tz = "EST"),
         period = format_interval(hour, start_format = "%H", sep = "-")) %>% 
  ggplot(aes(day, n, col = period)) + 
  geom_line() +
  facet_wrap(~period)


The helper functions span_around and subset_span are used for building the spanning vector. The first takes the original datetime variable and spans a variable of the requested interval around it. This saves you the manual trouble of finding the min and the max of the variable and determining which points lie just before and just after them in order to build a variable of the interval. subset_span then keeps only the datetime points in its input that meet the criteria given in a list. In the example these are the hours 7, 10, 16, and 19. In total there are eight different datetime parts you can subset on.
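The two helpers can also be inspected step by step on a small input. The sketch below uses two invented timestamps. It relies only on the span_around and subset_span calls shown in the analysis above, without the optional start_shift.

```r
library(padr)

# Two observations a day and a half apart (hypothetical data)
x <- as.POSIXct(c("2016-01-01 08:15:00", "2016-01-02 21:40:00"), tz = "UTC")

# Step 1: an hourly vector spanned around the observations
hourly <- span_around(x, "hour")
range(hourly)

# Step 2: keep only the points at hours 7, 10, 16 and 19
periods <- subset_span(hourly, list(hour = c(7, 10, 16, 19)))
periods
```

Without a start_shift, the first point left after subsetting can fall after the first observation, which is presumably why the analysis above shifts the start back by two hours.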

The remainder of the analysis is very similar to a regular padr analysis. Instead of an interval to thicken and pad to, you use the asymmetrically spaced accidents_span variable in both thicken_cust and pad_cust. thicken_cust will then map each observation to a datetime point in the spanned vector. pad_cust will insert a row for each datetime point that is in the spanned vector but not in the datetime variable.
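The padding step can be pictured as a left join of the spanning vector against the aggregated data. The base R sketch below is only a conceptual illustration, not padr's implementation of pad_cust, and the counts are invented.

```r
# Asymmetrically spaced spanning points (hypothetical)
spanned <- as.POSIXct(c("2016-01-01 07:00:00", "2016-01-01 10:00:00",
                        "2016-01-01 16:00:00", "2016-01-01 19:00:00"),
                      tz = "UTC")

# Aggregated counts (hypothetical); the 16:00 period has no observations
counts <- data.frame(period = spanned[c(1, 2, 4)], n = c(3L, 1L, 2L))

# Keep every spanning point, so missing periods appear as rows with NA
padded <- merge(data.frame(period = spanned), counts, all.x = TRUE)

# Replace the NA counts by zero, the role fill_by_value plays in the analysis
padded$n[is.na(padded$n)] <- 0L
padded
```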

Thanks

This release completes the initial plans I had for padr. For the next release I plan on refactoring and further increasing performance. Do you find anything still missing from padr? Did you find a bug or an inconsistency? Please let me know by sending an email or filing an issue on the GitHub page.
