The goal of tidytable is to be a tidy interface to data.table.
- tidyverse-like syntax with data.table speed
- rlang compatibility (see the rlang example below)
- Includes functions that dtplyr is missing, including many tidyr functions
Note: tidytable functions do not use data.table’s
modify-by-reference, and instead use the copy-on-modify principles
followed by the tidyverse and base R.
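The difference between the two semantics can be seen with plain data.table. The sketch below is only an illustration of modify-by-reference versus working on a copy; the variable names are made up for the example, and it does not show how tidytable is implemented internally.

```r
library(data.table)

dt <- data.table(x = c(1, 2, 3))

# data.table's := modifies the table in place: no copy is made,
# so every name bound to the same object sees the new column.
alias <- dt
alias[, double_x := x * 2]
by_reference_changed_original <- "double_x" %in% names(dt)

# Taking an explicit copy first leaves the original untouched,
# which is the behaviour copy-on-modify code relies on.
original <- data.table(x = c(1, 2, 3))
modified <- copy(original)[, double_x := x * 2]
copy_left_original_intact <- !("double_x" %in% names(original))
```

Both flags come out TRUE: the in-place update is visible through dt, while original keeps its single column.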
Install the released version from CRAN with:
install.packages("tidytable")
Or install the development version from GitHub with:
# install.packages("devtools")
devtools::install_github("markfairbanks/tidytable")
Enhanced selection support is denoted by ES. See the enhanced selection examples below.
- dt(): Pipeable data.table syntax. See the dt() example below
- dt_get_dummies()
- %notin%

- dt_arrange()
- dt_filter()
- dt_mutate(): _if()/_at()/_all()/_across() - ES
  - The _across() helper is new and can replace both _if() and _at(). See the dt_mutate_across() examples below
- dt_select() - ES
- dt_summarize(): Group by specifications called inside. See ?dt_summarize

- dt_bind_cols() & dt_bind_rows()
- dt_case(): Similar to dplyr::case_when(). See ?dt_case() for syntax
- dt_count() - ES
- dt_distinct() - ES
- Joins: dt_left_join(), dt_inner_join(), dt_right_join(), dt_full_join(), & dt_anti_join()
- dt_pull()
- dt_relocate()
- dt_rename(): _if()/_at()/_all()/_across() - ES
- Select helpers: dt_starts_with(), dt_ends_with(), dt_contains(), dt_everything()
- dt_slice(): _head()/_tail()/_max()/_min()
  - The slice_*() helpers are like dt_top_n(), but are slightly easier to use
- dt_top_n()

- dt_drop_na() - ES
- dt_fill(): Works on character/factor/logical types (data.table::nafill() does not) - ES
- dt_group_split() - ES
- Nesting: dt_group_nest() - ES & dt_unnest_legacy()
- dt_pivot_longer() - ES & dt_pivot_wider() - ES
- dt_replace_na()
- dt_separate()

- dt_map(), dt_map2(), dt_map_*() variants, & dt_map2_*() variants
The code chunk below shows the tidytable syntax:
library(data.table)
library(tidytable)
example_dt <- data.table(x = c(1,2,3), y = c(4,5,6), z = c("a","a","b"))
example_dt %>%
  dt_select(x, y, z) %>%
  dt_filter(x < 4, y > 1) %>%
  dt_arrange(x, y) %>%
  dt_mutate(double_x = x * 2,
            double_y = y * 2)
#>        x     y     z double_x double_y
#>    <dbl> <dbl> <chr>    <dbl>    <dbl>
#> 1:     1     4     a        2        8
#> 2:     2     5     a        4       10
#> 3:     3     6     b        6       12
Group by calls are done from inside any function that has group by
functionality (e.g. dt_summarize() & dt_mutate())
- A single column can be passed with by = z
- Multiple columns can be passed with by = list(y, z)
example_dt %>%
  dt_summarize(avg_x = mean(x),
               count = .N,
               by = z)
#>        z avg_x count
#>    <chr> <dbl> <int>
#> 1:     a   1.5     2
#> 2:     b   3.0     1
Enhanced selection allows you to mix predicates like is.double with
normal selection. Some examples:
example_dt <- data.table(a = c(1,2,3),
                         b = c(4,5,6),
                         c = c("a","a","b"),
                         d = c("a","b","c"))
example_dt %>%
  dt_select(is.numeric, d)
#>        a     b     d
#>    <dbl> <dbl> <chr>
#> 1:     1     4     a
#> 2:     2     5     b
#> 3:     3     6     c
You can also use this format to drop columns:
example_dt %>%
  dt_select(-is.numeric)
#>        c     d
#>    <chr> <chr>
#> 1:     a     a
#> 2:     a     b
#> 3:     b     c
Currently supported:
is.numeric/is.integer/is.double/is.character/is.factor
Enhanced selection allows the user to replace dt_mutate_if() &
dt_mutate_at() with one helper - dt_mutate_across().
Using _across() instead of _if():
example_dt <- data.table(a = c(1,1,1),
                         b = c(1,1,1),
                         c = c("a","a","b"),
                         d = c("a","b","c"))
example_dt %>%
  dt_mutate_across(is.numeric, as.character)
#>        a     b     c     d
#>    <chr> <chr> <chr> <chr>
#> 1:     1     1     a     a
#> 2:     1     1     a     b
#> 3:     1     1     b     c
Using _across() instead of _at():
example_dt %>%
  dt_mutate_across(c(a, b), ~ .x + 1)
#>        a     b     c     d
#>    <dbl> <dbl> <chr> <chr>
#> 1:     2     2     a     a
#> 2:     2     2     a     b
#> 3:     2     2     b     c
These two approaches can be combined in one call:
example_dt <- data.table(dbl_col1 = c(1.0,1.0,1.0),
                         dbl_col2 = c(1.0,1.0,1.0),
                         int_col1 = c(1L,1L,1L),
                         int_col2 = c(1L,1L,1L),
                         char_col1 = c("a","a","a"),
                         char_col2 = c("b","b","b"))
example_dt %>%
  dt_mutate_across(c(is.double, int_col1), ~ .x + 1)
#>    dbl_col1 dbl_col2 int_col1 int_col2 char_col1 char_col2
#>       <dbl>    <dbl>    <dbl>    <int>     <chr>     <chr>
#> 1:        2        2        2        1         a         b
#> 2:        2        2        2        1         a         b
#> 3:        2        2        2        1         a         b
rlang quoting/unquoting can be used to write custom functions with
tidytable functions.
Note that quosures are not compatible with data.table, so enexpr()
must be used instead of enquo().
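To make the distinction concrete, the short sketch below compares what the two rlang capture functions return. The wrapper names capture_expr() and capture_quo() are hypothetical helpers written for this illustration: enexpr() yields a bare expression, whereas enquo() wraps the same expression in a quosure that also carries an environment.

```r
library(rlang)

# Hypothetical helpers: each captures its argument unevaluated
capture_expr <- function(col) enexpr(col)
capture_quo  <- function(col) enquo(col)

captured_expr <- capture_expr(x)
captured_quo  <- capture_quo(x)

# enexpr() returns the bare symbol `x`
expr_is_symbol <- is_symbol(captured_expr)

# enquo() returns a quosure (expression + environment)
quo_is_quosure <- is_quosure(captured_quo)

# The expression inside the quosure is the same symbol
same_expression <- identical(quo_get_expr(captured_quo), captured_expr)
```

The bare symbol is what can be spliced into a data.table call, which is why the custom functions in this section capture their arguments with enexpr().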
library(rlang)
example_dt <- data.table(x = c(1,1,1), y = c(1,1,1), z = c("a","a","b"))
add_one <- function(.data, new_name, add_col) {
  new_name <- enexpr(new_name)
  add_col <- enexpr(add_col)
  .data %>%
    dt_mutate(!!new_name := !!add_col + 1)
}
example_dt %>%
  add_one(x_plus_one, x)
#>        x     y     z x_plus_one
#>    <dbl> <dbl> <chr>      <dbl>
#> 1:     1     1     a          2
#> 2:     1     1     a          2
#> 3:     1     1     b          2
example_df <- data.table(x = 1:10, y = c(rep("a", 6), rep("b", 4)), z = c(rep("a", 6), rep("b", 4)))
find_mean <- function(.data, grouping_cols, col) {
  grouping_cols <- enexpr(grouping_cols)
  col <- enexpr(col)
  .data %>%
    dt_summarize(avg = mean(!!col),
                 by = !!grouping_cols)
}
example_df %>%
  find_mean(grouping_cols = list(y, z), col = x)
#>        y     z   avg
#>    <chr> <chr> <dbl>
#> 1:     a     a   3.5
#> 2:     b     b   8.5
The dt() function makes regular data.table syntax pipeable, so you
can easily mix tidytable syntax with data.table syntax:
example_dt <- data.table(x = c(1,2,3), y = c(4,5,6), z = c("a", "a", "b"))
example_dt %>%
  dt(, list(x, y, z)) %>%
  dt(x < 4 & y > 1) %>%
  dt(order(x, y)) %>%
  dt(, ':='(double_x = x * 2,
            double_y = y * 2)) %>%
  dt(, list(avg_x = mean(x)), by = z)
#>        z avg_x
#>    <chr> <dbl>
#> 1:     a   1.5
#> 2:     b   3.0
Below are some speed comparisons of various functions. More functions will get added to the speed comps over time.
A few notes:
- Comparing times from separate functions won’t be very useful. For example, the summarize() tests were performed on a different dataset from case_when().
- setDTthreads(4) was used for data.table & tidytable timings.
- Modify-by-reference was used in data.table when being compared to dt_mutate() & dplyr::mutate()
- dt_fill() & tidyr::fill() both work with character/factor/logical columns, whereas data.table::nafill() does not. Testing only included numeric columns due to this constraint.
- Currently data.table doesn’t have its own case_when() translation, so a multiple nested fifelse() was used.
- All tests can be found in the source code of the README.
- pandas comparisons are in the process of being added - more will be added soon.
- Lastly I’d like to mention that these tests were not rigorously created to cover all angles equally. They are just meant to be used as general insight into the performance of these packages.
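For readers unfamiliar with the nested fifelse() approach mentioned in the notes, here is a minimal sketch of the pattern. The thresholds, labels, and variable names are invented for illustration; this is not the benchmark code, which lives in the README source.

```r
library(data.table)

x <- c(1, 5, 10)

# Each fifelse() handles one condition and defers the remaining cases
# to the next fifelse() in its `no` argument, like a chain of
# case_when() branches evaluated in order.
size <- fifelse(x < 3, "small",
        fifelse(x < 8, "medium",
                "large"))
```

Here size is the character vector "small", "medium", "large", one label per element of x.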
all_marks
#> # A tibble: 13 x 6
#>    function_tested tidyverse tidytable data.table pandas tidytable_vs_tidyverse
#>    <chr>           <chr>     <chr>     <chr>      <chr>  <chr>
#>  1 arrange         393.5ms   34.2ms    38.7ms     297ms  8.7%
#>  2 case_when       517ms     179ms     163ms      307ms  34.6%
#>  3 distinct        86.4ms    18.4ms    16.3ms     287ms  21.3%
#>  4 fill            127.9ms   36.3ms    32.5ms     146ms  28.4%
#>  5 filter          283ms     222ms     222ms      656ms  78.4%
#>  6 inner_join      70.9ms    60.2ms    59.7ms     <NA>   84.9%
#>  7 left_join       66.8ms    46.2ms    48.8ms     <NA>   69.2%
#>  8 mutate          69.3ms    53.9ms    74.4ms     85.2ms 77.8%
#>  9 nest            59ms      14.7ms    11.3ms     <NA>   24.9%
#> 10 pivot_longer    186.6ms   37ms      30.7ms     <NA>   19.8%
#> 11 pivot_wider     904ms     219ms     219ms      <NA>   24.2%
#> 12 summarize       483ms     176ms     160ms      780ms  36.4%
#> 13 unnest          181.01ms  8.18ms    7.88ms     <NA>   4.5%
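The tidytable_vs_tidyverse column appears to be the tidytable time expressed as a percentage of the tidyverse time, so lower values mean a larger speedup. That reading is inferred from the numbers rather than stated in the text; the sketch below reproduces the first three rows under that assumption, using only base R and timings copied from the table.

```r
# Timings in milliseconds, copied from the first three rows of the table
tidyverse_ms <- c(arrange = 393.5, case_when = 517, distinct = 86.4)
tidytable_ms <- c(arrange = 34.2,  case_when = 179, distinct = 18.4)

# Assumed definition: tidytable time as a percentage of tidyverse time
pct <- round(tidytable_ms / tidyverse_ms * 100, 1)
pct_label <- paste0(pct, "%")
```

Under this assumption pct_label is "8.7%", "34.6%", "21.3%", matching the values shown for arrange, case_when, and distinct.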