Section 4 Problems when using dplyr::full_join

library(powerjoin)

## Warning: package 'powerjoin' was built under R version 4.3.1

library(tidyverse)

## Warning: package 'tidyverse' was built under R version 4.3.1

## Warning: package 'purrr' was built under R version 4.3.1

## ── Attaching core tidyverse packages ──────────────────────────────────────────────── tidyverse 2.0.0 ──
## ✔ dplyr     1.1.2     ✔ readr     2.1.4
## ✔ forcats   1.0.0     ✔ stringr   1.5.1
## ✔ lubridate 1.9.2     ✔ tibble    3.2.1
## ✔ purrr     1.0.2     ✔ tidyr     1.3.0
## ── Conflicts ────────────────────────────────────────────────────────────────── tidyverse_conflicts() ──
## ✖ dplyr::filter() masks stats::filter()
## ✖ dplyr::lag()    masks stats::lag()
## ℹ Use the conflicted package (<http://conflicted.r-lib.org/>) to force all conflicts to become errors

4.1 Introduction

This part will explore some problems that we may cope with when using dplyr::full_join, and it is why powerchoice::power_full_join may come in handy.

Assume that we have two people:
- name = John, age = 30, sex = “M”, treatment = NA (we don’t know John’s treatment)
- name = Marry, age = 45, sex = “F”, treatment = “A”

However, we have multiple data sets that contain incomplete information about these two people.

Let us consider the data that we have and the problems that we might have to cope with while trying to get data that contains as much information as possible from these two people by joining our available data sets.

4.2 Using dplyr::full_join

First, let us consider the below data set. With this data set, we will get the same result while using dfs %>% reduce(full_join) and dfs %>% reduce(full_join, by= name). We do not see any problem here, and we collect all the information from our available data sets.

dfs <- list(
   first = tibble(name = "John", age = 30),
   second = tibble(name = c("John", "Mary"), sex = c("M", "F")),
   third = tibble(name = "Mary", treatment = "A")
)

dfs %>% reduce(full_join)

## Joining with `by = join_by(name)`
## Joining with `by = join_by(name)`

## # A tibble: 2 × 4
##   name    age sex   treatment
##   <chr> <dbl> <chr> <chr>    
## 1 John     30 M     <NA>     
## 2 Mary     NA F     A

dfs %>% reduce(full_join, by="name")

## # A tibble: 2 × 4
##   name    age sex   treatment
##   <chr> <dbl> <chr> <chr>    
## 1 John     30 M     <NA>     
## 2 Mary     NA F     A

However, let us consider when our dfs list has one more row (the fourth row) with the name and age of Mary.

dfs <- list(
   first = tibble(name = "John", age = 30),
   second = tibble(name = c("John", "Mary"), sex = c("M", "F")),
   third = tibble(name = "Mary", treatment = "A"),
   fourth = tibble(name = "Mary", age = 45)
)

The full_join without identifying the key by="name" may think that there are two people with the same name, “Mary”.

dfs %>% reduce(full_join)

## Joining with `by = join_by(name)`
## Joining with `by = join_by(name)`
## Joining with `by = join_by(name, age)`

## # A tibble: 3 × 4
##   name    age sex   treatment
##   <chr> <dbl> <chr> <chr>    
## 1 John     30 M     <NA>     
## 2 Mary     NA F     A        
## 3 Mary     45 <NA>  <NA>

The full_join with identifying the key by="name"creates extra columns when we have duplicated column names

dfs %>% reduce(full_join, by = "name")

## # A tibble: 2 × 5
##   name  age.x sex   treatment age.y
##   <chr> <dbl> <chr> <chr>     <dbl>
## 1 John     30 M     <NA>         NA
## 2 Mary     NA F     A            45

We may think about how to delete these extra columns by doing like below code chunk. However, by doing this, we lost the information about Mary’s age (which was available in the age.y column above).

dfs %>% reduce(full_join, by = "name", suffix = c("", ".y")) %>% select(-ends_with(".y"))

## # A tibble: 2 × 4
##   name    age sex   treatment
##   <chr> <dbl> <chr> <chr>    
## 1 John     30 M     <NA>     
## 2 Mary     NA F     A

4.3 Using powerjoin::power_full_join

Solve the problem using powerjoin.

Let’s consider the dfs list with one more row. The problems we have when using dplyr::full_join and how powerjoin::power_full_join can be helpful.

dfs <- list(
   first = tibble(name = "John", age = 30),
   second = tibble(name = c("John", "Mary"), sex = c("M", "F")),
   third = tibble(name = "Mary", treatment = "A"),
   fourth = tibble(name = "Mary", age = 45), 
   fifth = tibble(name = "Mary", sex = "F")
)

dfs %>% reduce(full_join)

## Joining with `by = join_by(name)`
## Joining with `by = join_by(name)`
## Joining with `by = join_by(name, age)`
## Joining with `by = join_by(name, sex)`

## # A tibble: 3 × 4
##   name    age sex   treatment
##   <chr> <dbl> <chr> <chr>    
## 1 John     30 M     <NA>     
## 2 Mary     NA F     A        
## 3 Mary     45 <NA>  <NA>

dfs %>% reduce(full_join, by = "name")

## # A tibble: 2 × 6
##   name  age.x sex.x treatment age.y sex.y
##   <chr> <dbl> <chr> <chr>     <dbl> <chr>
## 1 John     30 M     <NA>         NA <NA> 
## 2 Mary     NA F     A            45 F

The powerjoin package helps us collect all available information.

dfs %>% power_full_join(by= "name", conflict = coalesce_xy)

## # A tibble: 2 × 4
##   name  treatment   age sex  
##   <chr> <chr>     <dbl> <chr>
## 1 John  <NA>         30 M    
## 2 Mary  A            45 F

4.4 What if we have conflict data sets

Now, consider that we have two data sets for Mary that have different values for “age” (the fourth and fifth), with ages equal to 45 and 65, respectively.

dfs <- list(
   first = tibble(name = "John", age = 30),
   second = tibble(name = c("John", "Mary"), sex = c("M", "F")),
   third = tibble(name = "Mary", treatment = "A"),
   fourth = tibble(name = "Mary", age = 45), 
   fifth = tibble(name = "Mary", age = 65)
)

Then the argument conflict = coalesce_xy will take the first available value (age = 45), while conflict = coalesce_yx will take the second available value (age = 65).

dfs %>% power_full_join(by= "name", conflict = coalesce_xy)

## # A tibble: 2 × 4
##   name  sex   treatment   age
##   <chr> <chr> <chr>     <dbl>
## 1 John  M     <NA>         30
## 2 Mary  F     A            45

dfs %>% power_full_join(by= "name", conflict = coalesce_yx)

## # A tibble: 2 × 4
##   name  sex   treatment   age
##   <chr> <chr> <chr>     <dbl>
## 1 John  M     <NA>         30
## 2 Mary  F     A            65

4.5 References

https://github.com/moodymudskipper/powerjoin