Building on data.frame serialization provided by
fst, prt offers an interface for
working with partitioned data.frames, saved as individual fst files.
You can install the development version of prt from GitHub by running
source("https://install-github.me/nbenn/prt")Alternatively, if you have the remotes package available, the latest
release is available by calling install_github() as
# install.packages("remotes")
remotes::install_github("nbenn/prt@*release")Creating a prt object can be done either by calling new_prt() on a
list of previously created fst files or by coercing a data.frame
object to prt using as_prt().
tmp <- tempfile()
dir.create(tmp)
flights <- as_prt(nycflights13::flights, n_chunks = 2L, dir = tmp)
#> fstcore package v0.9.14
#> (OpenMP was not detected, using single threaded mode)
print(flights)
#> # A prt: 336,776 × 19
#> # Partitioning: [168,388, 168,388] rows
#> year month day dep_time sched_dep_t…¹ dep_delay arr_time sched_arr_…²
#> <int> <int> <int> <int> <int> <dbl> <int> <int>
#> 1 2013 1 1 517 515 2 830 819
#> 2 2013 1 1 533 529 4 850 830
#> 3 2013 1 1 542 540 2 923 850
#> 4 2013 1 1 544 545 -1 1004 1022
#> 5 2013 1 1 554 600 -6 812 837
#> …
#> 336,772 2013 9 30 NA 1455 NA NA 1634
#> 336,773 2013 9 30 NA 2200 NA NA 2312
#> 336,774 2013 9 30 NA 1210 NA NA 1330
#> 336,775 2013 9 30 NA 1159 NA NA 1344
#> 336,776 2013 9 30 NA 840 NA NA 1020
#> # ℹ 336,771 more rows
#> # ℹ abbreviated names: ¹sched_dep_time, ²sched_arr_time
#> # ℹ 11 more variables: arr_delay <dbl>, carrier <chr>, flight <int>,
#> # tailnum <chr>, origin <chr>, dest <chr>, air_time <dbl>, distance <dbl>,
#> # hour <dbl>, minute <dbl>, time_hour <dttm>In case a prt object is created from a data.frame, the specified
number of files is written to the directory of choice (a newly created
directory within tempdir() by default).
list.files(tmp)
#> [1] "1.fst" "2.fst"Subsetting and printing is closely modeled after tibble and behavior
that deviates from that of tibble will most likely be considered a bug
(please report). Some design
choices that do set a prt object apart from a tibble include the use
of data.tables for any result of a subsetting operation and the
complete disregard for row.names.
In addition to standard subsetting operations involving the functions
`[`(), `[[`() and
`$`(), the base generic function subset() is
implemented for the prt class, enabling subsetting operations using
non-standard evaluation. Combined with random access to tables stored as
fst files, this can make data access more efficient in cases where
only a subset of the data is of interest.
jan <- flights[flights$month == 1, ]
identical(jan, subset(flights, month == 1))
#> [1] TRUE
print(jan)
#> year month day dep_time sched_dep_time dep_delay arr_time sched_arr_time
#> 1: 2013 1 1 517 515 2 830 819
#> 2: 2013 1 1 533 529 4 850 830
#> 3: 2013 1 1 542 540 2 923 850
#> 4: 2013 1 1 544 545 -1 1004 1022
#> 5: 2013 1 1 554 600 -6 812 837
#> ---
#> 27000: 2013 1 31 NA 1325 NA NA 1505
#> 27001: 2013 1 31 NA 1200 NA NA 1430
#> 27002: 2013 1 31 NA 1410 NA NA 1555
#> 27003: 2013 1 31 NA 1446 NA NA 1757
#> 27004: 2013 1 31 NA 625 NA NA 934
#> arr_delay carrier flight tailnum origin dest air_time distance hour
#> 1: 11 UA 1545 N14228 EWR IAH 227 1400 5
#> 2: 20 UA 1714 N24211 LGA IAH 227 1416 5
#> 3: 33 AA 1141 N619AA JFK MIA 160 1089 5
#> 4: -18 B6 725 N804JB JFK BQN 183 1576 5
#> 5: -25 DL 461 N668DN LGA ATL 116 762 6
#> ---
#> 27000: NA MQ 4475 N730MQ LGA RDU NA 431 13
#> 27001: NA MQ 4658 N505MQ LGA ATL NA 762 12
#> 27002: NA MQ 4491 N734MQ LGA CLE NA 419 14
#> 27003: NA UA 337 <NA> LGA IAH NA 1416 14
#> 27004: NA UA 1497 <NA> LGA IAH NA 1416 6
#> minute time_hour
#> 1: 15 2013-01-01 05:00:00
#> 2: 29 2013-01-01 05:00:00
#> 3: 40 2013-01-01 05:00:00
#> 4: 45 2013-01-01 05:00:00
#> 5: 0 2013-01-01 06:00:00
#> ---
#> 27000: 25 2013-01-31 13:00:00
#> 27001: 0 2013-01-31 12:00:00
#> 27002: 10 2013-01-31 14:00:00
#> 27003: 46 2013-01-31 14:00:00
#> 27004: 25 2013-01-31 06:00:00A subsetting operation on a prt object yields a data.table. If the
full table is of interest, a prt-specific implementation of the
as.data.table() generic is available.
unlink(tmp, recursive = TRUE)