Section 4 Problems when using dplyr::full_join
## Warning: package 'powerjoin' was built under R version 4.3.1
## Warning: package 'tidyverse' was built under R version 4.3.1
## Warning: package 'purrr' was built under R version 4.3.1
## ── Attaching core tidyverse packages ──────────────────────────────────────────────── tidyverse 2.0.0 ──
## ✔ dplyr 1.1.2 ✔ readr 2.1.4
## ✔ forcats 1.0.0 ✔ stringr 1.5.1
## ✔ lubridate 1.9.2 ✔ tibble 3.2.1
## ✔ purrr 1.0.2 ✔ tidyr 1.3.0
## ── Conflicts ────────────────────────────────────────────────────────────────── tidyverse_conflicts() ──
## ✖ dplyr::filter() masks stats::filter()
## ✖ dplyr::lag() masks stats::lag()
## ℹ Use the conflicted package (<http://conflicted.r-lib.org/>) to force all conflicts to become errors
4.1 Introduction
This part will explore some problems that we may cope with when using
dplyr::full_join
, and it is whypowerchoice::power_full_join
may come in handy.
- Assume that we have two people:
- name = John, age = 30, sex = “M”, treatment = NA (we don’t know John’s treatment)
- name = Marry, age = 45, sex = “F”, treatment = “A”
However, we have multiple data sets that contain incomplete information about these two people.
Let us consider the data that we have and the problems that we might have to cope with while trying to get data that contains as much information as possible from these two people by joining our available data sets.
4.2 Using dplyr::full_join
- First, let us consider the below data set. With this data set, we will get the same result while using
dfs %>% reduce(full_join)
anddfs %>% reduce(full_join, by= name)
. We do not see any problem here, and we collect all the information from our available data sets.
dfs <- list(
first = tibble(name = "John", age = 30),
second = tibble(name = c("John", "Mary"), sex = c("M", "F")),
third = tibble(name = "Mary", treatment = "A")
)
dfs %>% reduce(full_join)
## Joining with `by = join_by(name)`
## Joining with `by = join_by(name)`
## # A tibble: 2 × 4
## name age sex treatment
## <chr> <dbl> <chr> <chr>
## 1 John 30 M <NA>
## 2 Mary NA F A
## # A tibble: 2 × 4
## name age sex treatment
## <chr> <dbl> <chr> <chr>
## 1 John 30 M <NA>
## 2 Mary NA F A
- However, let us consider when our dfs list has one more row (the fourth row) with the name and age of Mary.
dfs <- list(
first = tibble(name = "John", age = 30),
second = tibble(name = c("John", "Mary"), sex = c("M", "F")),
third = tibble(name = "Mary", treatment = "A"),
fourth = tibble(name = "Mary", age = 45)
)
The full_join
without identifying the key by="name"
may think that there are two people with the same name, “Mary”.
## Joining with `by = join_by(name)`
## Joining with `by = join_by(name)`
## Joining with `by = join_by(name, age)`
## # A tibble: 3 × 4
## name age sex treatment
## <chr> <dbl> <chr> <chr>
## 1 John 30 M <NA>
## 2 Mary NA F A
## 3 Mary 45 <NA> <NA>
The full_join
with identifying the key by="name"
creates extra columns when we have duplicated column names
## # A tibble: 2 × 5
## name age.x sex treatment age.y
## <chr> <dbl> <chr> <chr> <dbl>
## 1 John 30 M <NA> NA
## 2 Mary NA F A 45
We may think about how to delete these extra columns by doing like below code chunk. However, by doing this, we lost the information about Mary’s age (which was available in the age.y column above).
## # A tibble: 2 × 4
## name age sex treatment
## <chr> <dbl> <chr> <chr>
## 1 John 30 M <NA>
## 2 Mary NA F A
4.3 Using powerjoin::power_full_join
Solve the problem using powerjoin.
Let’s consider the dfs list with one more row. The problems we have when using dplyr::full_join
and how powerjoin::power_full_join
can be helpful.
dfs <- list(
first = tibble(name = "John", age = 30),
second = tibble(name = c("John", "Mary"), sex = c("M", "F")),
third = tibble(name = "Mary", treatment = "A"),
fourth = tibble(name = "Mary", age = 45),
fifth = tibble(name = "Mary", sex = "F")
)
## Joining with `by = join_by(name)`
## Joining with `by = join_by(name)`
## Joining with `by = join_by(name, age)`
## Joining with `by = join_by(name, sex)`
## # A tibble: 3 × 4
## name age sex treatment
## <chr> <dbl> <chr> <chr>
## 1 John 30 M <NA>
## 2 Mary NA F A
## 3 Mary 45 <NA> <NA>
## # A tibble: 2 × 6
## name age.x sex.x treatment age.y sex.y
## <chr> <dbl> <chr> <chr> <dbl> <chr>
## 1 John 30 M <NA> NA <NA>
## 2 Mary NA F A 45 F
The powerjoin package helps us collect all available information.
## # A tibble: 2 × 4
## name treatment age sex
## <chr> <chr> <dbl> <chr>
## 1 John <NA> 30 M
## 2 Mary A 45 F
4.4 What if we have conflict data sets
Now, consider that we have two data sets for Mary that have different values for “age” (the fourth and fifth), with ages equal to 45 and 65, respectively.
dfs <- list(
first = tibble(name = "John", age = 30),
second = tibble(name = c("John", "Mary"), sex = c("M", "F")),
third = tibble(name = "Mary", treatment = "A"),
fourth = tibble(name = "Mary", age = 45),
fifth = tibble(name = "Mary", age = 65)
)
Then the argument conflict = coalesce_xy
will take the first available value (age = 45), while conflict = coalesce_yx
will take the second available value (age = 65).
## # A tibble: 2 × 4
## name sex treatment age
## <chr> <chr> <chr> <dbl>
## 1 John M <NA> 30
## 2 Mary F A 45
## # A tibble: 2 × 4
## name sex treatment age
## <chr> <chr> <chr> <dbl>
## 1 John M <NA> 30
## 2 Mary F A 65