Quickly Check your ID Variables

By Edwin Thoen

Virtually every dataset has them: id variables that link a record to a subject and/or a time point. Often one column, or a combination of columns, forms the unique id of a record, for instance the combination of patient_id and visit_id, or ip_address and visit_time. The first step in most of my analyses is checking the uniqueness of a variable or a combination of variables. If it is not unique, my assumptions about the data may be wrong, or there may be data quality issues. Since I do this so often, I decided to write a little wrapper around this procedure. The unique_id function returns TRUE if the evaluated variables are indeed the unique key to a record. If not, it returns all the records for which the id variable(s) are duplicated, so we can pinpoint the problem right away. It uses dplyr v.0.7.1, so make sure that it is loaded.

library(dplyr)
some_df <- data_frame(a = c(1, 2, 3, 3, 4), b = 101:105, val = round(rnorm(5), 1))
some_df %>% unique_id(a)
## # A tibble: 2 x 3
##       a     b   val
##   <dbl> <int> <dbl>
## 1     3   103  -0.4
## 2     3   104  -0.9
some_df %>% unique_id(a, b)
## [1] TRUE
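
For comparison, the same question can be answered by hand with dplyr's count(). This is only a sketch of the manual check that the wrapper replaces, not the function's implementation. It rebuilds the example data as a plain data.frame without the random val column, and the names dup_a and dup_ab are made up for this illustration.

```r
library(dplyr)

# Same id columns as in the example above, without the random `val` column
some_df <- data.frame(a = c(1, 2, 3, 3, 4), b = 101:105)

# Count rows per value of `a` and keep the values that occur more than once
dup_a <- some_df %>% count(a) %>% filter(n > 1)
dup_a

# The same for the combination of `a` and `b`;
# zero remaining rows means the combination is a unique key
dup_ab <- some_df %>% count(a, b) %>% filter(n > 1)
nrow(dup_ab) == 0
```

This tells you which ids are duplicated, but not which records they belong to. Returning those records is what unique_id adds.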

Here you find the source code of the function. You can also obtain it by installing the package accompanying this blog using devtools::install_github("edwinth/thatssorandom").

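unique_id <- function(x, ...) {
  # Keep only the candidate id columns
  id_set <- x %>% select(...)
  id_set_dist <- id_set %>% distinct()
  if (nrow(id_set) == nrow(id_set_dist)) {
    # No duplicated combinations: the columns uniquely identify each record
    TRUE
  } else {
    # Collect every id combination that occurs more than once
    non_unique_ids <- id_set %>% 
      filter(id_set %>% duplicated()) %>% 
      distinct()
    # Return all records carrying those ids, sorted by the id columns;
    # suppressMessages() hides the "Joining, by = ..." note from inner_join()
    suppressMessages(
      inner_join(non_unique_ids, x) %>% arrange(...)
    )
  }
}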
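
Because the function returns either TRUE or a data frame, a script that only needs a yes/no answer can wrap the call in isTRUE(). The sketch below shows this with a dependency-free stand-in written in base R. The name unique_id_base and its cols argument (a character vector of column names) are hypothetical and not part of the package. It is one possible way to get the same behaviour, not the author's code.

```r
# Hypothetical base R counterpart of unique_id();
# `cols` is a character vector of column names
unique_id_base <- function(x, cols) {
  ids <- x[, cols, drop = FALSE]
  # Flag every row whose id combination occurs more than once,
  # looking from both ends so the first occurrence is flagged too
  dup <- duplicated(ids) | duplicated(ids, fromLast = TRUE)
  if (!any(dup)) {
    return(TRUE)
  }
  out <- x[dup, , drop = FALSE]
  # Sort the offending records by the id columns
  out[do.call(order, unname(as.list(out[cols]))), , drop = FALSE]
}

some_df <- data.frame(a = c(1, 2, 3, 3, 4), b = 101:105)

# isTRUE() is FALSE for a data frame, so it works as a guard in scripts
isTRUE(unique_id_base(some_df, "a"))
isTRUE(unique_id_base(some_df, c("a", "b")))

# Without the guard, the duplicated records themselves are returned
unique_id_base(some_df, "a")
```

The same isTRUE() guard should work with unique_id itself, since its return value follows the same TRUE-or-records pattern.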