Output of discord_data() may return multiple rows per kin-pair if 'id' column has non-unique values #6

Closed
jdtrat opened this issue Aug 19, 2021 · 0 comments · Fixed by #7

jdtrat commented Aug 19, 2021

library(discord)
library(tidyverse)

The sample data built into the discord package has 1200 rows of single-entered data from the NlsyLinks package, with height and weight for each kin pair. The column ‘extended_id’ identifies a family (or similar grouping), not a kin pair, so it is not unique per pair. A family with three kin yields three pairs that share one extended id, something like the following (from NlsyLinks):

tibble(
  ExtendedID = c(1, 1, 1, 2),
  SubjectTag_S1 = c(101, 101, 102, 201),
  SubjectTag_S2 = c(102, 103, 103, 202),
  R = c(.5, .25, .25, .5),
  RelationshipPath = rep("Gen2Siblings", 4)
)
#> # A tibble: 4 × 5
#>   ExtendedID SubjectTag_S1 SubjectTag_S2     R RelationshipPath
#>        <dbl>         <dbl>         <dbl> <dbl> <chr>           
#> 1          1           101           102  0.5  Gen2Siblings    
#> 2          1           101           103  0.25 Gen2Siblings    
#> 3          1           102           103  0.25 Gen2Siblings    
#> 4          2           201           202  0.5  Gen2Siblings
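
As a quick check on a table shaped like the one above, we can test whether a candidate id column repeats across rows. This is only a sketch on toy data: the `pair_id` column built from the two subject tags is a hypothetical name introduced here, not something discord or NlsyLinks provides.

```r
# Toy pair-level table mirroring the NlsyLinks-style example above.
links <- data.frame(
  ExtendedID    = c(1, 1, 1, 2),
  SubjectTag_S1 = c(101, 101, 102, 201),
  SubjectTag_S2 = c(102, 103, 103, 202)
)

# anyDuplicated() returns 0 when every value is unique,
# otherwise the index of the first repeated value.
family_id_repeats <- anyDuplicated(links$ExtendedID) > 0

# Hypothetical pair-level key: combine the two subject tags.
links$pair_id <- paste(links$SubjectTag_S1, links$SubjectTag_S2, sep = "_")
pair_id_repeats <- anyDuplicated(links$pair_id) > 0

family_id_repeats  # TRUE: the family id is shared by several pairs
pair_id_repeats    # FALSE: each pair has its own key
```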

discord_data() requires the id variable to be a “unique kinship pair identifier”, so the extended id from NlsyLinks will not work. When it is used anyway, discord_data() can return multiple rows per kin pair. Consider the ‘Gen1Housemates’ subset of sample_data, where 233 pairs share only 128 distinct extended ids:

sample_data %>%
  filter(relationship_path == "Gen1Housemates") %>%
  count(extended_id, sort = TRUE) %>%
  tibble() # to print nicely
#> # A tibble: 128 × 2
#>    extended_id     n
#>          <int> <int>
#>  1         221     6
#>  2         300     6
#>  3         490     6
#>  4         516     6
#>  5         520     6
#>  6          58     3
#>  7          63     3
#>  8          74     3
#>  9          85     3
#> 10         110     3
#> # … with 118 more rows

Calling discord_data() with id = "extended_id" returns 623 rows, far more than the 233 pairs in the input:

sample_data %>%
  filter(relationship_path == "Gen1Housemates") %>%
  discord_data(
    outcome = "height",
    predictors = "weight",
    sex = NULL,
    race = NULL,
    demographics = "none",
    id = "extended_id",
    pair_identifiers = c("_s1", "_s2")
  ) %>%
  tibble() # to print nicely
#> # A tibble: 623 × 9
#>       id height_1 height_2 height_diff height_mean weight_1 weight_2 weight_diff
#>    <int>    <dbl>    <dbl>       <dbl>       <dbl>    <dbl>    <dbl>       <dbl>
#>  1     3    2.13    1.02         1.11        1.58     0.719   -0.252      0.971 
#>  2     5   -1.74   -2.35         0.607      -2.05     1.08    -1.19       2.27  
#>  3    13    1.03    0.274        0.759       0.654    0.486    0.203      0.283 
#>  4    17   -0.384  -0.732        0.347      -0.558   -0.381   -0.853      0.472 
#>  5    20   -0.839  -1.23         0.392      -1.03     0.163   -0.498      0.661 
#>  6    23    0.945   0.634        0.311       0.790    0.203    0.716     -0.513 
#>  7    27    0.281  -0.498        0.779      -0.109   -0.519   -0.787      0.268 
#>  8    29    0.623  -0.0958       0.719       0.264   -0.418    0.639     -1.06  
#>  9    34   -0.396  -0.850        0.453      -0.623    0.908   -0.318      1.23  
#> 10    37    1.02   -0.479        1.50        0.270   -0.671   -0.580     -0.0908
#> # … with 613 more rows, and 1 more variable: weight_mean <dbl>

However, if we specify a unique id ourselves, we get the expected output of one row per pair (note the 233 rows in the print-out):

sample_data %>%
  filter(relationship_path == "Gen1Housemates") %>%
  mutate(unique_id = row_number()) %>%
  discord_data(
    outcome = "height",
    predictors = "weight",
    sex = NULL,
    race = NULL,
    demographics = "none",
    id = "unique_id",
    pair_identifiers = c("_s1", "_s2")
  ) %>%
  tibble() # to print nicely
#> # A tibble: 233 × 9
#>       id height_1 height_2 height_diff height_mean weight_1 weight_2 weight_diff
#>    <int>    <dbl>    <dbl>       <dbl>       <dbl>    <dbl>    <dbl>       <dbl>
#>  1     1    2.13    1.02         1.11        1.58     0.719   -0.252      0.971 
#>  2     2   -1.74   -2.35         0.607      -2.05     1.08    -1.19       2.27  
#>  3     3    1.03    0.274        0.759       0.654    0.486    0.203      0.283 
#>  4     4   -0.384  -0.732        0.347      -0.558   -0.381   -0.853      0.472 
#>  5     5   -0.839  -1.23         0.392      -1.03     0.163   -0.498      0.661 
#>  6     6    0.945   0.634        0.311       0.790    0.203    0.716     -0.513 
#>  7     7    0.281  -0.498        0.779      -0.109   -0.519   -0.787      0.268 
#>  8     8    0.623  -0.0958       0.719       0.264   -0.418    0.639     -1.06  
#>  9     9   -0.396  -0.850        0.453      -0.623    0.908   -0.318      1.23  
#> 10    10    1.02   -0.479        1.50        0.270   -0.671   -0.580     -0.0908
#> # … with 223 more rows, and 1 more variable: weight_mean <dbl>
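
To illustrate why a repeated id can inflate the row count, here is a toy sketch in base R. It assumes the extra rows come from matching pair members on the id alone; that is an assumption made for illustration, not a description of what discord_data() does internally. The data and column values are made up.

```r
# Toy pair-level data: family 1 has three kin pairs, family 2 has one.
pairs <- data.frame(
  extended_id = c(1, 1, 1, 2),
  height_s1   = c(170, 170, 165, 180),
  height_s2   = c(165, 160, 160, 175)
)

# Split into one table per pair member, keyed only by the family id.
s1 <- pairs[, c("extended_id", "height_s1")]
s2 <- pairs[, c("extended_id", "height_s2")]

# Matching on a non-unique key pairs every s1 row with every s2 row
# that shares the id: 3 * 3 rows for family 1 plus 1 * 1 for family 2.
inflated <- merge(s1, s2, by = "extended_id")
nrow(inflated)      # 10

# With a row-wise id, each row matches only itself.
pairs$unique_id <- seq_len(nrow(pairs))
s1u <- pairs[, c("unique_id", "height_s1")]
s2u <- pairs[, c("unique_id", "height_s2")]
one_per_pair <- merge(s1u, s2u, by = "unique_id")
nrow(one_per_pair)  # 4
```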
jdtrat self-assigned this Aug 19, 2021
jdtrat added the bug label Aug 19, 2021
jdtrat added a commit that referenced this issue Aug 20, 2021
Specifically, this commit:

* Adds an internal function for checking whether id columns are supplied and/or valid. If no column name is supplied, row-wise ids are added quietly. If a column name is supplied but does not contain unique values for each kin-pair, a warning message states the problem and the row-wise ids will be used.
* Updates `check_discord_errors()` with new id default and enhances unit-tests for this.
* Updates documentation for `discord_data()` and `discord_regression()`.
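
The id-checking behaviour described in the first bullet could look roughly like the sketch below. The function name `check_pair_ids()`, its arguments, the warning text, and the error for a missing column are all assumptions made for illustration; the actual internal function added in the commit may differ.

```r
# Hypothetical helper; NOT the internal function from the commit.
check_pair_ids <- function(data, id = NULL) {
  row_ids <- seq_len(nrow(data))

  # No id column named: add row-wise ids quietly.
  if (is.null(id)) {
    data$id <- row_ids
    return(data)
  }

  # Assumption: a named column that does not exist is an error.
  if (!(id %in% names(data))) {
    stop("Column '", id, "' not found in data.", call. = FALSE)
  }

  # Named column is not unique per kin pair: warn, then fall back
  # to row-wise ids.
  if (anyDuplicated(data[[id]]) > 0) {
    warning(
      "Column '", id, "' is not unique per kin pair; ",
      "using row-wise ids instead.",
      call. = FALSE
    )
    data[[id]] <- row_ids
  }

  data
}

# Usage on toy data with a repeated family id.
toy <- data.frame(extended_id = c(1, 1, 2), height_s1 = c(170, 165, 180))
quiet  <- check_pair_ids(toy)                                    # adds `id`
warned <- suppressWarnings(check_pair_ids(toy, "extended_id"))   # replaces ids
```

Falling back to row-wise ids rather than stopping keeps existing calls working, at the cost of discarding the values in the supplied column.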
smasongarrison added a commit that referenced this issue Aug 25, 2021
Fix #6 re: non-unique 'id' column.