Description
library(discord)
library(tidyverse)
The sample data built into the discord package contains 1200 rows of single-entered data from the NlsyLinks package, with height and weight for kin pairs. The column ‘extended_id’ is not a unique identifier for kin pairs, but rather a family (or similar grouping) identifier. For a family with three kin, we would see something like the following (from NlsyLinks):
tibble(
  ExtendedID = c(1, 1, 1, 2),
  SubjectTag_S1 = c(101, 101, 102, 201),
  SubjectTag_S2 = c(102, 103, 103, 202),
  R = c(.5, .25, .25, .5),
  RelationshipPath = rep("Gen2Siblings", 4)
)
#> # A tibble: 4 × 5
#> ExtendedID SubjectTag_S1 SubjectTag_S2 R RelationshipPath
#> <dbl> <dbl> <dbl> <dbl> <chr>
#> 1 1 101 102 0.5 Gen2Siblings
#> 2 1 101 103 0.25 Gen2Siblings
#> 3 1 102 103 0.25 Gen2Siblings
#> 4 2 201 202 0.5 Gen2Siblings
discord_data() requires the id variable to be a “unique kinship pair identifier”, so the extended id from NlsyLinks will not work. As a result, discord_data() can return multiple rows per kin pair. Consider the ‘Gen1Housemates’ subset of sample_data, which has 233 pairs with overlapping extended ids:
sample_data %>%
  filter(relationship_path == "Gen1Housemates") %>%
  count(extended_id, sort = TRUE) %>%
  tibble() # to print nicely
#> # A tibble: 128 × 2
#> extended_id n
#> <int> <int>
#> 1 221 6
#> 2 300 6
#> 3 490 6
#> 4 516 6
#> 5 520 6
#> 6 58 3
#> 7 63 3
#> 8 74 3
#> 9 85 3
#> 10 110 3
#> # … with 118 more rows
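The same overlap can be checked directly before calling discord_data(). The following is a minimal sketch in base R that uses the toy NlsyLinks-style rows from the example above rather than sample_data itself; the object names `toy_pairs` and `has_dup_family` are made up for illustration.

```r
# Toy pairs mirroring the NlsyLinks example: family 1 has three kin,
# so it contributes three pairs that all share ExtendedID 1.
toy_pairs <- data.frame(
  ExtendedID    = c(1, 1, 1, 2),
  SubjectTag_S1 = c(101, 101, 102, 201),
  SubjectTag_S2 = c(102, 103, 103, 202)
)

# anyDuplicated() returns 0 when every value is unique, otherwise the index
# of the first duplicate. A non-zero result means the column cannot serve as
# a unique pair identifier.
has_dup_family <- anyDuplicated(toy_pairs$ExtendedID) > 0
has_dup_family
```

If this check is TRUE for the column passed as `id`, the input has more than one pair per id value.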
Calling discord_data() leads to additional rows being returned:
sample_data %>%
  filter(relationship_path == "Gen1Housemates") %>%
  discord_data(
    outcome = "height",
    predictors = "weight",
    sex = NULL,
    race = NULL,
    demographics = "none",
    id = "extended_id",
    pair_identifiers = c("_s1", "_s2")
  ) %>%
  tibble() # to print nicely
#> # A tibble: 623 × 9
#> id height_1 height_2 height_diff height_mean weight_1 weight_2 weight_diff
#> <int> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl>
#> 1 3 2.13 1.02 1.11 1.58 0.719 -0.252 0.971
#> 2 5 -1.74 -2.35 0.607 -2.05 1.08 -1.19 2.27
#> 3 13 1.03 0.274 0.759 0.654 0.486 0.203 0.283
#> 4 17 -0.384 -0.732 0.347 -0.558 -0.381 -0.853 0.472
#> 5 20 -0.839 -1.23 0.392 -1.03 0.163 -0.498 0.661
#> 6 23 0.945 0.634 0.311 0.790 0.203 0.716 -0.513
#> 7 27 0.281 -0.498 0.779 -0.109 -0.519 -0.787 0.268
#> 8 29 0.623 -0.0958 0.719 0.264 -0.418 0.639 -1.06
#> 9 34 -0.396 -0.850 0.453 -0.623 0.908 -0.318 1.23
#> 10 37 1.02 -0.479 1.50 0.270 -0.671 -0.580 -0.0908
#> # … with 613 more rows, and 1 more variable: weight_mean <dbl>
However, if we specify a unique id ourselves, we get the expected output (note the number of rows in the print-out):
sample_data %>%
  filter(relationship_path == "Gen1Housemates") %>%
  mutate(unique_id = row_number()) %>%
  discord_data(
    outcome = "height",
    predictors = "weight",
    sex = NULL,
    race = NULL,
    demographics = "none",
    id = "unique_id",
    pair_identifiers = c("_s1", "_s2")
  ) %>%
  tibble() # to print nicely
#> # A tibble: 233 × 9
#> id height_1 height_2 height_diff height_mean weight_1 weight_2 weight_diff
#> <int> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl>
#> 1 1 2.13 1.02 1.11 1.58 0.719 -0.252 0.971
#> 2 2 -1.74 -2.35 0.607 -2.05 1.08 -1.19 2.27
#> 3 3 1.03 0.274 0.759 0.654 0.486 0.203 0.283
#> 4 4 -0.384 -0.732 0.347 -0.558 -0.381 -0.853 0.472
#> 5 5 -0.839 -1.23 0.392 -1.03 0.163 -0.498 0.661
#> 6 6 0.945 0.634 0.311 0.790 0.203 0.716 -0.513
#> 7 7 0.281 -0.498 0.779 -0.109 -0.519 -0.787 0.268
#> 8 8 0.623 -0.0958 0.719 0.264 -0.418 0.639 -1.06
#> 9 9 -0.396 -0.850 0.453 -0.623 0.908 -0.318 1.23
#> 10 10 1.02 -0.479 1.50 0.270 -0.671 -0.580 -0.0908
#> # … with 223 more rows, and 1 more variable: weight_mean <dbl>
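Where subject tags are available, a pair identifier could also be derived from the two members of each pair instead of the row number. This is only a sketch on the toy NlsyLinks-style rows shown earlier: whether sample_data carries equivalent subject-tag columns is an assumption, and `toy_pairs` and `pair_id` are hypothetical names.

```r
# Toy pairs mirroring the NlsyLinks example (not the actual sample_data).
toy_pairs <- data.frame(
  ExtendedID    = c(1, 1, 1, 2),
  SubjectTag_S1 = c(101, 101, 102, 201),
  SubjectTag_S2 = c(102, 103, 103, 202)
)

# Build an order-independent pair id: smaller tag first, larger tag second.
toy_pairs$pair_id <- paste(
  pmin(toy_pairs$SubjectTag_S1, toy_pairs$SubjectTag_S2),
  pmax(toy_pairs$SubjectTag_S1, toy_pairs$SubjectTag_S2),
  sep = "_"
)

# Unlike ExtendedID, the derived id has no duplicates in this toy data.
pair_id_is_unique <- anyDuplicated(toy_pairs$pair_id) == 0
pair_id_is_unique
```

Compared with row_number(), an id built this way stays stable if the rows are reordered or filtered.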