Virtually every dataset has them: id variables that link a record to a subject and/or a time point. Often one column, or a combination of columns, forms the unique id of a record. For instance, the combination of patient_id and visit_id, or ip_address and visit_time. The first step in most of my analyses is checking the uniqueness of a variable, or of a combination of variables. If it is not unique, many of my assumptions about the data may be wrong, or there may be data quality issues. Since I do this so often, I decided to make a little wrapper around this procedure. The unique_id function will return TRUE if the evaluated variables are indeed the unique key to a record. If not, it will return all the records for which the id variable(s) are duplicated, so we can pinpoint the problem right away. It uses dplyr v.0.7.1, so make sure that it is loaded.
library(dplyr)
some_df <- data_frame(a = c(1, 2, 3, 3, 4), b = 101:105, val = round(rnorm(5), 1))
some_df %>% unique_id(a)
## # A tibble: 2 x 3
##       a     b   val
##   <dbl> <int> <dbl>
## 1     3   103  -0.4
## 2     3   104  -1.4
some_df %>% unique_id(a, b)
## [1] TRUE
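For comparison, the same kind of check can be sketched in base R without the wrapper. This is only a minimal illustration, not part of the package: it rebuilds a similar example data frame (without the random val column) and relies on the base functions anyDuplicated() and duplicated(). The variable names key_a_unique, key_ab_unique, key and dup_rows are made up for this sketch.

```r
# Example data, mirroring the columns a and b used above
some_df <- data.frame(a = c(1, 2, 3, 3, 4), b = 101:105)

# anyDuplicated() returns 0 when no row repeats,
# otherwise the index of the first repeated row
key_a_unique  <- anyDuplicated(some_df["a"]) == 0
key_ab_unique <- anyDuplicated(some_df[c("a", "b")]) == 0

# Keep every row that shares a repeated key, not only the later copies
key <- some_df["a"]
dup_rows <- some_df[duplicated(key) | duplicated(key, fromLast = TRUE), ]
```

The fromLast = TRUE pass is what brings in the first occurrence of each repeated key; duplicated() on its own flags only the second and later copies.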
Here is the source code of the function. You can also obtain it by installing the package accompanying this blog using devtools::install_github("edwinth/thatssorandom").
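unique_id <- function(x, ...) {
  # Keep only the candidate key columns
  id_set <- x %>% select(...)
  id_set_dist <- id_set %>% distinct()
  if (nrow(id_set) == nrow(id_set_dist)) {
    # No key combination occurs more than once
    TRUE
  } else {
    # Key combinations that occur more than once, one row each
    non_unique_ids <- id_set %>%
      filter(id_set %>% duplicated()) %>%
      distinct()
    # Return all full records carrying a duplicated key, sorted by key
    suppressMessages(
      inner_join(non_unique_ids, x) %>% arrange(...)
    )
  }
}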
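The same behaviour can also be reached by grouping on the key and keeping the groups with more than one row. The sketch below is an alternative way to get there, not the code shipped in the package; the name unique_id_sketch is hypothetical, and it assumes a dplyr version in which bare column names can be passed through ... to group_by() and arrange().

```r
library(dplyr)

# Hypothetical alternative to unique_id(): group on the candidate key
# and keep every group that holds more than one row.
unique_id_sketch <- function(x, ...) {
  dups <- x %>%
    group_by(...) %>%
    filter(n() > 1) %>%
    ungroup() %>%
    arrange(...)
  # Same contract as unique_id(): TRUE when the key is unique,
  # otherwise the offending records
  if (nrow(dups) == 0) TRUE else dups
}

some_df <- data.frame(a = c(1, 2, 3, 3, 4), b = 101:105)

res <- unique_id_sketch(some_df, a)     # the two records with a == 3
ok  <- unique_id_sketch(some_df, a, b)  # TRUE

# The failing case returns a data frame rather than FALSE,
# so isTRUE() is a safe way to guard a script on the result.
stopifnot(isTRUE(unique_id_sketch(some_df, a, b)))
```

One difference from the original is column order: this version keeps the columns in the order of the input, whereas the join-based function puts the id columns first.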