Output of `discord_data()` may return multiple rows per kin-pair if 'id' column has non-unique values

``` r
library(discord)
library(tidyverse)
```

The sample data built into the discord package contains 1200 rows of single-entered data from the NlsyLinks package, with height and weight for kin pairs. The column `extended_id` is not a *unique* identifier for kin pairs; it is a family (or similar grouping) identifier. For a family with three kin, we would see something like the following (from [NlsyLinks](https://github.com/nlsy-links/NlsyLinks)):

``` r
tibble(
  ExtendedID = c(1, 1, 1, 2),
  SubjectTag_S1 = c(101, 101, 102, 201),
  SubjectTag_S2 = c(102, 103, 103, 202),
  R = c(.5, .25, .25, .5),
  RelationshipPath = rep("Gen2Siblings", 4)
)
#> # A tibble: 4 × 5
#>   ExtendedID SubjectTag_S1 SubjectTag_S2     R RelationshipPath
#>        <dbl>         <dbl>         <dbl> <dbl> <chr>           
#> 1          1           101           102  0.5  Gen2Siblings    
#> 2          1           101           103  0.25 Gen2Siblings    
#> 3          1           102           103  0.25 Gen2Siblings    
#> 4          2           201           202  0.5  Gen2Siblings
```
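As a quick check on the example above, the following base R sketch confirms that `ExtendedID` repeats across rows while the combination of the two subject tags does not. The `pair_id` column is a hypothetical name introduced here for illustration; it is not part of NlsyLinks or discord.

``` r
# Same toy kin-pair rows as above, as a plain data frame
kin <- data.frame(
  ExtendedID = c(1, 1, 1, 2),
  SubjectTag_S1 = c(101, 101, 102, 201),
  SubjectTag_S2 = c(102, 103, 103, 202)
)

# anyDuplicated() returns 0 when every value (or row) is unique
family_id_repeats <- anyDuplicated(kin$ExtendedID) > 0
tag_pair_repeats <- anyDuplicated(kin[, c("SubjectTag_S1", "SubjectTag_S2")]) > 0

# Hypothetical pair-level identifier built from the two subject tags
kin$pair_id <- paste(kin$SubjectTag_S1, kin$SubjectTag_S2, sep = "_")

family_id_repeats # TRUE: the family id is shared by several pairs
tag_pair_repeats  # FALSE: each pair of tags occurs once
```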

`discord_data()` requires the id variable to be a “unique kinship pair identifier”, so the extended id from NlsyLinks will not work. When the id repeats, the output of `discord_data()` can contain multiple rows per kin-pair. Consider the ‘Gen1Housemates’ subset of `sample_data`. It contains 233 pairs spread across only 128 distinct extended ids, so many pairs share an id:

``` r
sample_data %>%
  filter(relationship_path == "Gen1Housemates") %>%
  count(extended_id, sort = TRUE) %>%
  tibble() # to print nicely
#> # A tibble: 128 × 2
#>    extended_id     n
#>          <int> <int>
#>  1         221     6
#>  2         300     6
#>  3         490     6
#>  4         516     6
#>  5         520     6
#>  6          58     3
#>  7          63     3
#>  8          74     3
#>  9          85     3
#> 10         110     3
#> # … with 118 more rows
```

Calling `discord_data()` with `id = "extended_id"` returns 623 rows, far more than the 233 pairs in the input:

``` r
sample_data %>%
  filter(relationship_path == "Gen1Housemates") %>%
  discord_data(
    outcome = "height",
    predictors = "weight",
    sex = NULL,
    race = NULL,
    demographics = "none",
    id = "extended_id",
    pair_identifiers = c("_s1", "_s2")
  ) %>%
  tibble() # to print nicely
#> # A tibble: 623 × 9
#>       id height_1 height_2 height_diff height_mean weight_1 weight_2 weight_diff
#>    <int>    <dbl>    <dbl>       <dbl>       <dbl>    <dbl>    <dbl>       <dbl>
#>  1     3    2.13    1.02         1.11        1.58     0.719   -0.252      0.971 
#>  2     5   -1.74   -2.35         0.607      -2.05     1.08    -1.19       2.27  
#>  3    13    1.03    0.274        0.759       0.654    0.486    0.203      0.283 
#>  4    17   -0.384  -0.732        0.347      -0.558   -0.381   -0.853      0.472 
#>  5    20   -0.839  -1.23         0.392      -1.03     0.163   -0.498      0.661 
#>  6    23    0.945   0.634        0.311       0.790    0.203    0.716     -0.513 
#>  7    27    0.281  -0.498        0.779      -0.109   -0.519   -0.787      0.268 
#>  8    29    0.623  -0.0958       0.719       0.264   -0.418    0.639     -1.06  
#>  9    34   -0.396  -0.850        0.453      -0.623    0.908   -0.318      1.23  
#> 10    37    1.02   -0.479        1.50        0.270   -0.671   -0.580     -0.0908
#> # … with 613 more rows, and 1 more variable: weight_mean <dbl>
```
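The 623 rows are consistent with a many-to-many match on the id, where an id shared by n pairs contributes n × n output rows. Whether `discord_data()` performs such a join internally is an assumption here, not something confirmed from its source; the base R sketch below uses `merge()` on toy data only to illustrate the arithmetic.

``` r
# Toy pair-level data: family 1 has three pairs, family 2 has one
pairs <- data.frame(
  extended_id = c(1, 1, 1, 2),
  height_s1 = c(170, 170, 165, 180),
  height_s2 = c(165, 160, 160, 175)
)

# Hypothetical stand-in for a join keyed on a non-unique id:
# every _s1 row is matched with every _s2 row that shares the id
s1 <- pairs[, c("extended_id", "height_s1")]
s2 <- pairs[, c("extended_id", "height_s2")]
joined <- merge(s1, s2, by = "extended_id")

# Number of pairs per id, and the row count a many-to-many join predicts
n_per_id <- table(pairs$extended_id)
nrow(joined)    # 10 rows from 4 input pairs
sum(n_per_id^2) # 3^2 + 1^2 = 10
```

Applied to the counts above, one breakdown that fits the printed totals (128 ids, 233 pairs, 5 ids with 6 pairs each) is 40 ids with 3 pairs and 83 ids with 1 pair, which gives 5 × 36 + 40 × 9 + 83 × 1 = 623.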

However, if we supply a unique id ourselves, we get the expected output of one row per kin-pair (note the 233 rows in the print-out):

``` r
sample_data %>%
  filter(relationship_path == "Gen1Housemates") %>%
  mutate(unique_id = row_number()) %>%
  discord_data(
    outcome = "height",
    predictors = "weight",
    sex = NULL,
    race = NULL,
    demographics = "none",
    id = "unique_id",
    pair_identifiers = c("_s1", "_s2")
  ) %>%
  tibble() # to print nicely
#> # A tibble: 233 × 9
#>       id height_1 height_2 height_diff height_mean weight_1 weight_2 weight_diff
#>    <int>    <dbl>    <dbl>       <dbl>       <dbl>    <dbl>    <dbl>       <dbl>
#>  1     1    2.13    1.02         1.11        1.58     0.719   -0.252      0.971 
#>  2     2   -1.74   -2.35         0.607      -2.05     1.08    -1.19       2.27  
#>  3     3    1.03    0.274        0.759       0.654    0.486    0.203      0.283 
#>  4     4   -0.384  -0.732        0.347      -0.558   -0.381   -0.853      0.472 
#>  5     5   -0.839  -1.23         0.392      -1.03     0.163   -0.498      0.661 
#>  6     6    0.945   0.634        0.311       0.790    0.203    0.716     -0.513 
#>  7     7    0.281  -0.498        0.779      -0.109   -0.519   -0.787      0.268 
#>  8     8    0.623  -0.0958       0.719       0.264   -0.418    0.639     -1.06  
#>  9     9   -0.396  -0.850        0.453      -0.623    0.908   -0.318      1.23  
#> 10    10    1.02   -0.479        1.50        0.270   -0.671   -0.580     -0.0908
#> # … with 223 more rows, and 1 more variable: weight_mean <dbl>
```
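Until the function guards against this itself, a caller could check the id column before calling `discord_data()`. The helper below is a minimal sketch: `check_unique_id()` is a hypothetical name and is not part of the discord package.

``` r
# Hypothetical helper (not part of discord): stop early if `id` is not unique
check_unique_id <- function(data, id) {
  n_dup <- sum(duplicated(data[[id]]))
  if (n_dup > 0) {
    stop(
      sprintf("Column '%s' has %d duplicated value(s); supply a unique pair id.", id, n_dup),
      call. = FALSE
    )
  }
  invisible(data) # return the data unchanged so the check can sit in a pipe
}

# Toy data: three pairs share extended_id 1
toy <- data.frame(extended_id = c(1, 1, 1, 2))
toy$unique_id <- seq_len(nrow(toy))

# The non-unique column triggers the error; capture its message for inspection
msg <- tryCatch(
  check_unique_id(toy, "extended_id"),
  error = function(e) conditionMessage(e)
)

# The unique column passes and the data comes back unchanged
checked <- check_unique_id(toy, "unique_id")
```

Because the helper returns its input invisibly, it can be placed in the pipeline between `mutate(unique_id = row_number())` and `discord_data()`.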
