Output of `discord_data()` may return multiple rows per kin-pair if 'id' column has non-unique values #6

jdtrat · 2021-08-19T19:18:34Z

library(discord)
library(tidyverse)

The sample data built into the discord package contains 1200 rows of single-entered data from the NlsyLinks package containing height and weight for kin pairs. The column ‘extended_id’ is not a unique identifier for kin pairs, but rather a family (or similar grouping) identifier. For a family with three kin, we would see something like the following (from NlsyLinks):

tibble(
  ExtendedID = c(1, 1, 1, 2),
  SubjectTag_S1 = c(101, 101, 102, 201),
  SubjectTag_S2 = c(102, 103, 103, 202),
  R = c(.5, .25, .25, .5),
  RelationshipPath = rep("Gen2Siblings", 4)
)
#> # A tibble: 4 × 5
#>   ExtendedID SubjectTag_S1 SubjectTag_S2     R RelationshipPath
#>        <dbl>         <dbl>         <dbl> <dbl> <chr>           
#> 1          1           101           102  0.5  Gen2Siblings    
#> 2          1           101           103  0.25 Gen2Siblings    
#> 3          1           102           103  0.25 Gen2Siblings    
#> 4          2           201           202  0.5  Gen2Siblings

discord_data() requires the id variable to be a “unique kinship pair identifier”, so the extended id from NlsyLinks will not work. When the id is shared across pairs, discord_data() can return multiple rows per kin pair. Consider the ‘Gen1Housemates’ subset of sample_data, which has 233 pairs spread over only 128 extended ids:

sample_data %>%
  filter(relationship_path == "Gen1Housemates") %>%
  count(extended_id, sort = TRUE) %>%
  tibble() # to print nicely
#> # A tibble: 128 × 2
#>    extended_id     n
#>          <int> <int>
#>  1         221     6
#>  2         300     6
#>  3         490     6
#>  4         516     6
#>  5         520     6
#>  6          58     3
#>  7          63     3
#>  8          74     3
#>  9          85     3
#> 10         110     3
#> # … with 118 more rows

Calling discord_data() with extended_id as the id returns 623 rows rather than the expected 233:

sample_data %>%
  filter(relationship_path == "Gen1Housemates") %>%
  discord_data(
    outcome = "height",
    predictors = "weight",
    sex = NULL,
    race = NULL,
    demographics = "none",
    id = "extended_id",
    pair_identifiers = c("_s1", "_s2")
  ) %>%
  tibble() # to print nicely
#> # A tibble: 623 × 9
#>       id height_1 height_2 height_diff height_mean weight_1 weight_2 weight_diff
#>    <int>    <dbl>    <dbl>       <dbl>       <dbl>    <dbl>    <dbl>       <dbl>
#>  1     3    2.13    1.02         1.11        1.58     0.719   -0.252      0.971 
#>  2     5   -1.74   -2.35         0.607      -2.05     1.08    -1.19       2.27  
#>  3    13    1.03    0.274        0.759       0.654    0.486    0.203      0.283 
#>  4    17   -0.384  -0.732        0.347      -0.558   -0.381   -0.853      0.472 
#>  5    20   -0.839  -1.23         0.392      -1.03     0.163   -0.498      0.661 
#>  6    23    0.945   0.634        0.311       0.790    0.203    0.716     -0.513 
#>  7    27    0.281  -0.498        0.779      -0.109   -0.519   -0.787      0.268 
#>  8    29    0.623  -0.0958       0.719       0.264   -0.418    0.639     -1.06  
#>  9    34   -0.396  -0.850        0.453      -0.623    0.908   -0.318      1.23  
#> 10    37    1.02   -0.479        1.50        0.270   -0.671   -0.580     -0.0908
#> # … with 613 more rows, and 1 more variable: weight_mean <dbl>

However, if we specify a unique id ourselves, we get the expected output (note the number of rows in the print-out):

sample_data %>%
  filter(relationship_path == "Gen1Housemates") %>%
  mutate(unique_id = row_number()) %>%
  discord_data(
    outcome = "height",
    predictors = "weight",
    sex = NULL,
    race = NULL,
    demographics = "none",
    id = "unique_id",
    pair_identifiers = c("_s1", "_s2")
  ) %>%
  tibble() # to print nicely
#> # A tibble: 233 × 9
#>       id height_1 height_2 height_diff height_mean weight_1 weight_2 weight_diff
#>    <int>    <dbl>    <dbl>       <dbl>       <dbl>    <dbl>    <dbl>       <dbl>
#>  1     1    2.13    1.02         1.11        1.58     0.719   -0.252      0.971 
#>  2     2   -1.74   -2.35         0.607      -2.05     1.08    -1.19       2.27  
#>  3     3    1.03    0.274        0.759       0.654    0.486    0.203      0.283 
#>  4     4   -0.384  -0.732        0.347      -0.558   -0.381   -0.853      0.472 
#>  5     5   -0.839  -1.23         0.392      -1.03     0.163   -0.498      0.661 
#>  6     6    0.945   0.634        0.311       0.790    0.203    0.716     -0.513 
#>  7     7    0.281  -0.498        0.779      -0.109   -0.519   -0.787      0.268 
#>  8     8    0.623  -0.0958       0.719       0.264   -0.418    0.639     -1.06  
#>  9     9   -0.396  -0.850        0.453      -0.623    0.908   -0.318      1.23  
#> 10    10    1.02   -0.479        1.50        0.270   -0.671   -0.580     -0.0908
#> # … with 223 more rows, and 1 more variable: weight_mean <dbl>
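
As a minimal sketch of the distinction above, the toy NlsyLinks-style rows from earlier can be used to check whether a candidate id column is unique per pair. The `pair_id` column and the idea of pasting the two subject tags together are assumptions for illustration, not something discord does itself; only base R is used.

```r
# Toy data mirroring the NlsyLinks example above (one row per kin pair).
links <- data.frame(
  ExtendedID = c(1, 1, 1, 2),
  SubjectTag_S1 = c(101, 101, 102, 201),
  SubjectTag_S2 = c(102, 103, 103, 202)
)

# anyDuplicated() returns 0 when every value is distinct.
family_id_is_unique <- anyDuplicated(links$ExtendedID) == 0

# Hypothetical pair-level id built from the two subject tags.
links$pair_id <- paste(links$SubjectTag_S1, links$SubjectTag_S2, sep = "_")
pair_id_is_unique <- anyDuplicated(links$pair_id) == 0
```

Here the family-level id fails the check because family 1 contributes three pairs, while the pair-level id passes.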


Specifically, this commit:

* Adds an internal function for checking whether id columns are supplied and/or valid. If no column name is supplied, row-wise ids are added quietly. If a column name is supplied but does not contain unique values for each kin-pair, a warning message states the problem and the row-wise ids are used instead.
* Updates `check_discord_errors()` with the new id default and enhances the unit tests for it.
* Updates the documentation for `discord_data()` and `discord_regression()`.
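
The id-checking behaviour described in this commit could look roughly like the sketch below. This is not the package's actual internal function: the name `resolve_pair_ids()`, its signature, the warning text, and the error raised for a missing column are all assumptions made for illustration, written in base R.

```r
# Hypothetical helper; NOT the real internal function in discord.
resolve_pair_ids <- function(data, id = NULL) {
  row_ids <- seq_len(nrow(data))

  # No id column supplied: quietly fall back to row-wise ids.
  if (is.null(id)) {
    return(row_ids)
  }

  # Assumed behaviour: a column name that is absent from the data is an error.
  if (!id %in% names(data)) {
    stop("Column '", id, "' not found in data.", call. = FALSE)
  }

  # Supplied column repeats across kin pairs: warn and use row-wise ids.
  if (anyDuplicated(data[[id]]) > 0) {
    warning(
      "Column '", id, "' does not uniquely identify kin pairs; ",
      "using row-wise ids instead.",
      call. = FALSE
    )
    return(row_ids)
  }

  # Supplied column is already unique per pair: use it as is.
  data[[id]]
}

pairs <- data.frame(
  extended_id = c(1, 1, 1, 2),
  pair_id = c(11, 12, 13, 21)
)

ids_default <- resolve_pair_ids(pairs)
ids_unique <- resolve_pair_ids(pairs, id = "pair_id")
ids_fallback <- suppressWarnings(resolve_pair_ids(pairs, id = "extended_id"))
```

Falling back to row-wise ids with a warning keeps one output row per input pair, at the cost of dropping the family-level id the caller passed in.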

Fix #6 re: non-unique 'id' column.

jdtrat self-assigned this Aug 19, 2021

jdtrat added the bug label Aug 19, 2021

jdtrat mentioned this issue Aug 20, 2021

Fix #6 re: non-unique 'id' column. #7

Merged

smasongarrison closed this as completed in #7 Aug 25, 2021

smasongarrison added a commit that referenced this issue Aug 25, 2021

Merge pull request #7 from R-Computing-Lab/add-checks

b8eb5b7

Fix #6 re: non-unique 'id' column.
