Author: Ludvig R. Olsen ([email protected])
License: MIT
Started: October 2016
R package: Methods for dividing data into groups. Create balanced partitions and cross-validation folds. Perform time series windowing and general grouping and splitting of data. Balance existing groups with up- and downsampling.
Main functions:
group_factor()
group()
splt()
partition()
fold()
balance()
Other tools:
find_starts()
differs_from_previous()
all_groups_identical()
%staircase%
%primes%
CRAN version:
```r
install.packages("groupdata2")
```
Development version:
```r
install.packages("devtools")
devtools::install_github("LudvigOlsen/groupdata2")
```
groupdata2 contains a number of vignettes with relevant use cases and descriptions.
```r
vignette(package = "groupdata2")        # for an overview
vignette("introduction_to_groupdata2")  # begin here
```
group_factor():
Returns a factor with group numbers, e.g. (1,1,1,2,2,2,3,3,3).
This can be used to subset, aggregate, group_by, etc.
Create equally sized groups by setting force_equal = TRUE.
Randomize the grouping factor by setting randomize = TRUE.
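As a small illustration of the grouping factor described above, the following sketch splits a nine-element vector into three groups. The input vector and the choice of `method = "n_dist"` are assumptions made for the example.

```r
library(groupdata2)

# Nine elements divided into three groups
v <- 1:9
g <- group_factor(v, n = 3, method = "n_dist")

# Group number of each element
as.integer(g)
#> [1] 1 1 1 2 2 2 3 3 3

# The factor can be used directly for aggregation, e.g. the sum per group
tapply(v, g, sum)
```

The factor has one entry per element, so it can be passed to `split()`, `tapply()`, `aggregate()` and similar functions.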
group():
Returns the given data as a data frame with an added grouping factor made with group_factor(). The data frame is grouped by the grouping factor for easy use with dplyr pipelines.
splt():
Creates the specified groups with group_factor() and splits the given data by the grouping factor with base::split. Returns the splits in a list.
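A minimal sketch of splitting a data frame into a list of smaller data frames, as described above. The toy data frame is hypothetical and only serves to show the shape of the result.

```r
library(groupdata2)

# Hypothetical toy data with six rows
df_small <- data.frame(x = 1:6, y = letters[1:6])

# Split into three groups; the result is a list with one element per group
parts <- splt(df_small, n = 3, method = "n_dist")

length(parts)
#> [1] 3

# Number of rows in each split
vapply(parts, nrow, integer(1))
```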
partition():
Creates (optionally) balanced partitions (e.g. training/test sets). Balance partitions on one categorical variable and/or one numerical variable. Make sure that all datapoints sharing an ID are in the same partition.
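The following is a hedged sketch of a train/test split that keeps each participant in a single partition. The data frame, the 20% proportion and the names `test_set` and `train_set` are assumptions made for illustration; `p`, `cat_col`, `id_col` and `list_out` are arguments of partition().

```r
library(groupdata2)

set.seed(1)

# Hypothetical toy data: 20 participants with 3 rows each,
# 10 participants per diagnosis
df_p <- data.frame(
  participant = factor(rep(1:20, each = 3)),
  diagnosis = rep(c("a", "b"), each = 30),
  score = sample(1:100, 60, replace = TRUE)
)

# One partition with about 20% of the participants in each diagnosis;
# the remaining rows form the second partition
parts <- partition(df_p, p = 0.2, cat_col = "diagnosis",
                   id_col = "participant", list_out = TRUE)

test_set <- parts[[1]]
train_set <- parts[[2]]

# Participants shared between the partitions (expected to be none)
intersect(as.character(test_set$participant),
          as.character(train_set$participant))
```

Because whole IDs are assigned to partitions, the row proportions only approximate `p` when IDs have different numbers of rows.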
fold():
Creates (optionally) balanced folds for use in cross-validation. Balance folds on one categorical variable and/or one numerical variable. Ensure that all datapoints sharing an ID are in the same fold. Create multiple unique fold columns at once, e.g. for repeated cross-validation.
balance():
Uses up- and/or downsampling to fix the group sizes to the min, max, mean, or median group size or to a specific number of rows. Balancing can also happen on the ID level, e.g. to ensure the same number of IDs in each category.
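There are currently 9 methods available. They can be divided into 5 categories: specifying a group size (greedy), specifying a number of groups (n_dist, n_fill, n_last, n_rand), specifying a list (l_sizes, l_starts), specifying a step size (staircase), and specifying a starting value (primes).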
Examples of group sizes are based on a vector with 57 elements.
Method: greedy
Divides up the data greedily given a specified group size.
E.g. group sizes: 10, 10, 10, 10, 10, 7
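The group sizes above can be reproduced with a few lines of base R. This is only a sketch of the arithmetic, not the package's implementation, and `greedy_sizes()` is a hypothetical helper name.

```r
# Full groups of `size`, followed by one smaller group holding the remainder
greedy_sizes <- function(n_elements, size) {
  full <- n_elements %/% size
  rest <- n_elements %% size
  c(rep(size, full), if (rest > 0) rest)
}

greedy_sizes(57, 10)
#> [1] 10 10 10 10 10  7
```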
Method: n_dist
Divides the data into a specified number of groups and distributes excess data points across groups.
E.g. group sizes: 11, 11, 12, 11, 12
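One way to arrive at these sizes is to give element i the group number ceiling(i * n_groups / n_elements). The base R sketch below uses that rule in integer arithmetic; it reproduces the sizes listed above but is an assumption about the arithmetic, not the package's code, and `n_dist_sizes()` is a hypothetical helper name.

```r
# Assign each index to a group with ceiling(i * n_groups / n_elements),
# written with integer division to avoid floating point rounding
n_dist_sizes <- function(n_elements, n_groups) {
  idx <- seq_len(n_elements)
  group_id <- (idx * n_groups - 1) %/% n_elements + 1
  tabulate(group_id, nbins = n_groups)
}

n_dist_sizes(57, 5)
#> [1] 11 11 12 11 12
```

The same rule applied to 12 elements and 5 groups gives the sizes 2, 2, 3, 2, 3.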
Method: n_fill
Divides the data into a specified number of groups and fills up groups with excess data points from the beginning.
E.g. group sizes: 12, 12, 11, 11, 11
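A base R sketch of the same arithmetic: every group gets the integer quotient, and the first groups each receive one of the leftover data points. `n_fill_sizes()` is a hypothetical helper name, and the code is an illustration rather than the package's implementation.

```r
# Base size for every group, plus one extra element
# for each of the first `extra` groups
n_fill_sizes <- function(n_elements, n_groups) {
  base <- n_elements %/% n_groups
  extra <- n_elements %% n_groups
  base + as.integer(seq_len(n_groups) <= extra)
}

n_fill_sizes(57, 5)
#> [1] 12 12 11 11 11
```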
Method: n_last
Divides the data into a specified number of groups. The algorithm finds the most equal group sizes possible, using all data points. Only the last group is able to differ in size.
E.g. group sizes: 11, 11, 11, 11, 13
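The sizes above follow from giving every group except the last the integer quotient and letting the last group absorb whatever is left. A base R sketch of that calculation, with the hypothetical helper name `n_last_sizes()`:

```r
# All groups but the last get the base size; the last group takes the rest
n_last_sizes <- function(n_elements, n_groups) {
  base <- n_elements %/% n_groups
  c(rep(base, n_groups - 1), n_elements - base * (n_groups - 1))
}

n_last_sizes(57, 5)
#> [1] 11 11 11 11 13
```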
Method: n_rand
Divides the data into a specified number of groups. Excess data points are placed randomly in groups (only 1 per group).
E.g. group sizes: 12, 11, 11, 11, 12
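A base R sketch of the random placement, assuming the excess data points are handed to distinct, randomly chosen groups. `n_rand_sizes()` is a hypothetical helper name, and which groups receive an extra element depends on the seed.

```r
# Base size for every group; the leftover elements go to
# randomly chosen groups, at most one per group
n_rand_sizes <- function(n_elements, n_groups) {
  base <- n_elements %/% n_groups
  extra <- n_elements %% n_groups
  sizes <- rep(base, n_groups)
  lucky <- sample.int(n_groups, extra)  # sampled without replacement
  sizes[lucky] <- sizes[lucky] + 1
  sizes
}

set.seed(1)
rand_sizes <- n_rand_sizes(57, 5)

# Two groups have 12 elements and three have 11, in random order
sort(rand_sizes)
#> [1] 11 11 11 12 12
```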
Method: l_sizes
Uses a list / vector of group sizes to divide up the data.
Excess data points are placed in an extra group.
E.g. n = c(11, 11) returns group sizes: 11, 11, 35
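A short base R sketch of the sizes this produces: the requested sizes are kept, and anything left over forms one extra group. `l_sizes_sizes()` is a hypothetical helper name used only for this illustration.

```r
# Requested sizes, followed by an extra group for the remaining elements
l_sizes_sizes <- function(n_elements, sizes) {
  rest <- n_elements - sum(sizes)
  c(sizes, if (rest > 0) rest)
}

l_sizes_sizes(57, c(11, 11))
#> [1] 11 11 35
```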
Method: l_starts
Uses a list of starting positions to divide up the data.
Starting positions are values in a vector (e.g. a column in a data frame).
Skip to a specific nth appearance of a value by using c(value, skip_to).
E.g. n = c(11, 15, 27, 43) returns group sizes: 10, 4, 12, 16, 15
This is identical to n = list(11, 15, c(27, 1), 43), where 1 specifies that we want the first appearance of 27 after the previous value 15.
If passing n = "auto", starting positions are found automatically with find_starts().
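The sizes in the example can be checked in base R for the simple case where every start value occurs exactly once, as in the vector 1:57. The sketch does not handle repeated values or the c(value, skip_to) form, and it is not the package's implementation.

```r
v <- 1:57
starts <- c(11, 15, 27, 43)

# Position of each start value (values are unique in this vector)
pos <- match(starts, v)

# The first element always opens the first group
bounds <- unique(c(1, pos))

# Group sizes are the distances between consecutive group starts
start_sizes <- diff(c(bounds, length(v) + 1))
start_sizes
#> [1] 10  4 12 16 15
```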
Method: staircase
Uses step_size to divide up the data. Group size increases by 1 step for every group, until there is no more data.
E.g. group sizes: 5, 10, 15, 20, 7
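A base R sketch of the staircase sizes: group k asks for k * step_size elements, and the last group takes whatever remains. `staircase_sizes()` is a hypothetical helper name, and the loop is an illustration rather than the package's code.

```r
# Group k has size k * step_size; the final group is truncated
# to the number of elements that are left
staircase_sizes <- function(n_elements, step_size) {
  sizes <- numeric(0)
  left <- n_elements
  k <- 1
  while (left > 0) {
    size <- min(k * step_size, left)
    sizes <- c(sizes, size)
    left <- left - size
    k <- k + 1
  }
  sizes
}

staircase_sizes(57, 5)
#> [1]  5 10 15 20  7
```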
Method: primes
Creates groups with sizes corresponding to prime numbers.
Starts at n (a prime number). Increases to the next prime number until there is no more data.
E.g. group sizes: 5, 7, 11, 13, 17, 4
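A base R sketch of the prime-sized groups: start at a prime, move to the next prime for each new group, and let the last group take the remaining elements. `is_prime()` and `prime_sizes()` are hypothetical helper names, and the trial-division primality test is chosen only for brevity.

```r
# Trial division: x is prime if no integer in 2..floor(sqrt(x)) divides it
is_prime <- function(x) {
  x >= 2 && all(x %% seq_len(max(1, floor(sqrt(x))))[-1] != 0)
}

# Group sizes follow consecutive primes, starting at `start_at`;
# the final group is truncated to the elements that are left
prime_sizes <- function(n_elements, start_at) {
  stopifnot(is_prime(start_at))
  sizes <- numeric(0)
  left <- n_elements
  p <- start_at
  while (left > 0) {
    size <- min(p, left)
    sizes <- c(sizes, size)
    left <- left - size
    p <- p + 1
    while (!is_prime(p)) p <- p + 1
  }
  sizes
}

prime_sizes(57, 5)
#> [1]  5  7 11 13 17  4
```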
There are currently 4 methods for balancing on ID level in balance().
ID method: n_ids
Balances on ID level only. It makes sure there is the same number of IDs in each category. This might lead to a different number of rows between categories.
ID method: n_rows_c
Attempts to level the number of rows per category, while only removing/adding entire IDs. This is done with repetition and by iteratively picking the ID with the number of rows closest to the lacking/excess number of rows in the category.
ID method: distributed
Distributes the lacking/excess rows equally between the IDs. If the number to distribute cannot be equally divided, some IDs will have 1 row more/less than the others.
ID method: nested
Balances the IDs within their categories, meaning that all IDs in a category will have the same number of rows.
```r
# Attach packages
library(groupdata2)
library(dplyr)
library(knitr)

# Create data frame
df <- data.frame("x" = c(1:12),
                 "species" = rep(c('cat', 'pig', 'human'), 4),
                 "age" = sample(c(1:100), 12))

# Using group()
group(df, n = 5, method = 'n_dist') %>%
  kable()
```
x | species | age | .groups |
---|---|---|---|
1 | cat | 68 | 1 |
2 | pig | 39 | 1 |
3 | human | 1 | 2 |
4 | cat | 34 | 2 |
5 | pig | 87 | 3 |
6 | human | 43 | 3 |
7 | cat | 14 | 3 |
8 | pig | 82 | 4 |
9 | human | 59 | 4 |
10 | cat | 51 | 5 |
11 | pig | 85 | 5 |
12 | human | 21 | 5 |
```r
# Using group() with dplyr pipeline to get mean age
df %>%
  group(n = 5, method = 'n_dist') %>%
  dplyr::summarise(mean_age = mean(age)) %>%
  kable()
```
.groups | mean_age |
---|---|
1 | 53.50000 |
2 | 17.50000 |
3 | 48.00000 |
4 | 70.50000 |
5 | 52.33333 |
```r
# Using group() with 'l_starts' method
# Starts group at the first 'cat',
# then skips to the second appearance of "pig" after "cat",
# then starts at the following "cat".
df %>%
  group(n = list("cat", c("pig", 2), "cat"),
        method = 'l_starts',
        starts_col = "species") %>%
  kable()
```
x | species | age | .groups |
---|---|---|---|
1 | cat | 68 | 1 |
2 | pig | 39 | 1 |
3 | human | 1 | 1 |
4 | cat | 34 | 1 |
5 | pig | 87 | 2 |
6 | human | 43 | 2 |
7 | cat | 14 | 3 |
8 | pig | 82 | 3 |
9 | human | 59 | 3 |
10 | cat | 51 | 3 |
11 | pig | 85 | 3 |
12 | human | 21 | 3 |
```r
# Create data frame
df <- data.frame(
  "participant" = factor(rep(c('1', '2', '3', '4', '5', '6'), 3)),
  "age" = rep(c(20, 33, 27, 21, 32, 25), 3),
  "diagnosis" = rep(c('a', 'b', 'a', 'b', 'b', 'a'), 3),
  "score" = c(10, 24, 15, 35, 24, 14, 24, 40, 30, 50, 54, 25, 45, 67, 40, 78, 62, 30))
df <- df %>% arrange(participant)
df$session <- rep(c('1', '2', '3'), 6)
```

```r
# Using fold()
# First set seed to ensure reproducibility
set.seed(1)

# Use fold() with cat_col, num_col and id_col
df_folded <- fold(df, k = 3, cat_col = 'diagnosis',
                  num_col = "age",
                  id_col = 'participant')
#> 'old_name' and 'new_name' were identical.
#> 'old_name' and 'new_name' were identical.

# Show df_folded ordered by folds
df_folded %>%
  arrange(.folds) %>%
  kable()
```
participant | age | diagnosis | score | session | .folds |
---|---|---|---|---|---|
1 | 20 | a | 10 | 1 | 1 |
1 | 20 | a | 24 | 2 | 1 |
1 | 20 | a | 45 | 3 | 1 |
2 | 33 | b | 24 | 1 | 1 |
2 | 33 | b | 40 | 2 | 1 |
2 | 33 | b | 67 | 3 | 1 |
5 | 32 | b | 24 | 1 | 2 |
5 | 32 | b | 54 | 2 | 2 |
5 | 32 | b | 62 | 3 | 2 |
6 | 25 | a | 14 | 1 | 2 |
6 | 25 | a | 25 | 2 | 2 |
6 | 25 | a | 30 | 3 | 2 |
3 | 27 | a | 15 | 1 | 3 |
3 | 27 | a | 30 | 2 | 3 |
3 | 27 | a | 40 | 3 | 3 |
4 | 21 | b | 35 | 1 | 3 |
4 | 21 | b | 50 | 2 | 3 |
4 | 21 | b | 78 | 3 | 3 |
```r
# Show distribution of diagnoses and participants
df_folded %>%
  group_by(.folds) %>%
  count(diagnosis, participant) %>%
  kable()
```
.folds | diagnosis | participant | n |
---|---|---|---|
1 | a | 1 | 3 |
1 | b | 2 | 3 |
2 | a | 6 | 3 |
2 | b | 5 | 3 |
3 | a | 3 | 3 |
3 | b | 4 | 3 |
```r
# Show age representation in folds
# Notice that we would get a more even distribution if we had more data.
# As age is fixed per ID, we only have 3 ages per category to balance with.
df_folded %>%
  group_by(.folds) %>%
  summarize(mean_age = mean(age),
            sd_age = sd(age)) %>%
  kable()
```
.folds | mean_age | sd_age |
---|---|---|
1 | 26.5 | 7.120393 |
2 | 28.5 | 3.834058 |
3 | 24.0 | 3.286335 |
Notice that we now have the opportunity to include the session variable and/or use participant as a random effect in our model when doing cross-validation, as any participant will only appear in one fold.
We also have a balance in the representation of each diagnosis, which could give us better, more consistent results.
```r
# Let's first unbalance the dataset by removing some rows
df_b <- df %>%
  arrange(diagnosis) %>%
  filter(!row_number() %in% c(5, 7, 8, 13, 14, 16, 17, 18))

# Show distribution of diagnoses and participants
df_b %>%
  count(diagnosis, participant) %>%
  kable()
```
diagnosis | participant | n |
---|---|---|
a | 1 | 3 |
a | 3 | 2 |
a | 6 | 1 |
b | 2 | 3 |
b | 4 | 1 |
```r
# First set seed to ensure reproducibility
set.seed(1)

# Downsampling by diagnosis
balance(df_b, size = "min", cat_col = "diagnosis") %>%
  count(diagnosis, participant) %>%
  kable()
```
diagnosis | participant | n |
---|---|---|
a | 1 | 2 |
a | 3 | 1 |
a | 6 | 1 |
b | 2 | 3 |
b | 4 | 1 |
```r
# Downsampling the IDs
balance(df_b, size = "min", cat_col = "diagnosis",
        id_col = "participant", id_method = "n_ids") %>%
  count(diagnosis, participant) %>%
  kable()
```
diagnosis | participant | n |
---|---|---|
a | 1 | 3 |
a | 3 | 2 |
b | 2 | 3 |
b | 4 | 1 |