zap is an alternate serialization framework for R objects. It features
high compression at fast speeds.
Two aims for this package:
- Provide an alternate serialization framework to the one built into R.
- Write highly compressed data quickly by leveraging contextual information.
- `zap_read()`, `zap_write()`: read/write objects to raw vectors and files
- `zap_version()`: the version of the set of data transformations used internally
- `zap_opts()`: a way of building more detailed configuration options to use with `zap()`
- `zap_count()`: a fast, simple count of the bytes needed to hold the uncompressed output of `zap_write()` (i.e. when `compress = "none"`)
Speed and compression performance are very dependent on the data being serialized.
The characteristics of any floating point data will have a big
influence, and it is worth trying the other floating point transformations,
e.g. `zap_write(x, dbl = "shuffle")`.
For small data, there is less of a difference between the different serialization options.
You can install the latest development version from GitHub with:
``` r
# install.packages('remotes')
remotes::install_github('coolbutuseless/zap')
```

Pre-built source/binary versions can also be installed from R-universe.
The graph below compares different serialization/compression options in R.
The x-axis is compression ratio - the size of the original data
relative to the size of the compressed data. Bigger compression ratios are
better. Both zap and xz are able to highly compress this data.
The y-axis is compression speed - how quickly is the data compressed
and written to file. Saving with zap is comparable in speed to saving
the data uncompressed, and is much faster than xz.
``` r
zap_write(diamonds, dst = "diamonds.zap", compress = "zstd")
```

| expression | min | median | itr/sec | size (bytes) |
|---|---|---|---|---|
| saveRDS(compress=FALSE) | 2.39ms | 2.87ms | 286.5 | 3452964 |
| qs2::qs_save() | 8.02ms | 8.42ms | 118.6 | 570855 |
| zap_write() | 3.68ms | 4.16ms | 234.9 | 329681 |
| saveRDS(zstd) | 43.79ms | 44.64ms | 22.1 | 524483 |
| saveRDS(xz) | 815.38ms | 815.38ms | 1.2 | 332468 |
Writing ‘diamonds’ to file
``` r
zap_read("diamonds.zap")
```

| expression | min | median | itr/sec | size (bytes) |
|---|---|---|---|---|
| readRDS(compress=FALSE) | 1.8ms | 1.93ms | 520.7 | 3452964 |
| qs2::qs_read() | 3.25ms | 3.4ms | 292.3 | 570855 |
| zap_read() | 1.97ms | 2.31ms | 436.6 | 329681 |
| readRDS(gzip) | 3.45ms | 3.56ms | 280.2 | 513651 |
| readRDS(zstd) | 4.54ms | 4.72ms | 211.4 | 524483 |
| readRDS(xz) | 15ms | 15.29ms | 65.5 | 332468 |
Reading ‘diamonds’ from file
When `verbosity = 64`, a data.frame of object information is returned
instead of the serialized object.
The columns of this data.frame:

- `depth` recursion depth
- `type` the SEXP type
- `start`/`end` the locations of this data in the uncompressed zap stream
- `altrep` was this item ALTREP?
- `rserialize` did this item use the fallback R serialization infrastructure?
``` r
zap_write(mtcars, verbosity = 64)
#> # A tibble: 20 × 7
#> depth type start end altrep rserialize attrs
#> <int> <fct> <int> <int> <lgl> <lgl> <lgl>
#> 1 1 VECSXP 5 3512 FALSE FALSE TRUE
#> 2 2 REALSXP 7 274 FALSE FALSE TRUE
#> 3 2 REALSXP 274 541 FALSE FALSE TRUE
#> 4 2 REALSXP 541 808 FALSE FALSE TRUE
#> 5 2 REALSXP 808 1075 FALSE FALSE TRUE
#> 6 2 REALSXP 1075 1342 FALSE FALSE TRUE
#> 7 2 REALSXP 1342 1609 FALSE FALSE TRUE
#> 8 2 REALSXP 1609 1876 FALSE FALSE TRUE
#> 9 2 REALSXP 1876 2143 FALSE FALSE TRUE
#> 10 2 REALSXP 2143 2410 FALSE FALSE TRUE
#> 11 2 REALSXP 2410 2677 FALSE FALSE TRUE
#> 12 2 REALSXP 2677 2944 FALSE FALSE TRUE
#> 13 2 LISTSXP 2944 3489 FALSE FALSE TRUE
#> 14 3 SYMSXP 2946 2956 FALSE FALSE TRUE
#> 15 3 STRSXP 2956 3013 FALSE FALSE TRUE
#> 16 3 SYMSXP 3013 3027 FALSE FALSE TRUE
#> 17 3 STRSXP 3027 3454 FALSE FALSE TRUE
#> 18 3 SYMSXP 3454 3464 FALSE FALSE TRUE
#> 19 3 STRSXP 3464 3487 FALSE FALSE TRUE
#> 20 2 STRSXP 3489 3512 FALSE FALSE TRUE
```

R objects are collections of SEXP elements. For the purposes of explaining this package, consider SEXP objects to be broken into 3 classes.
- Atomic vectors e.g. integers, strings
- Containers e.g. lists, environments, closures, data.frames
- Everything else e.g. bytecode, dots
Atomic vectors are the core things users would consider “data” in R: collections of integers, floating point numbers and logical values are what we actually want to compute with.
Containers allow R to organise atomic vectors into logical units - e.g. a data.frame is a collection of (mostly) atomic vectors. A list is a collection of other arbitrary R objects.
Everything else encompasses all the other details of the R language
that don’t need to be considered often by the user, e.g. the compiled
bytecode representation of a function, or the actual `...` object used in
function calls.
Both zap and R serialize objects the same way:
- walk along each object
- serialize the atomic vectors it contains
- serialize its attributes
- for any nested containers within the object, recurse into them and serialize their contents
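The walk above can be sketched in Python (lists and dicts stand in for R containers; this is an illustration of the traversal order only, not zap's actual implementation, and attribute handling is omitted):

```python
def walk(obj, depth=0, out=None):
    """Sketch of the serialization walk: record each node, then recurse
    into containers. Mirrors the `depth` column in zap's verbose output."""
    out = [] if out is None else out
    if isinstance(obj, (list, dict)):
        out.append((depth, "container"))
        children = obj.values() if isinstance(obj, dict) else obj
        for child in children:
            walk(child, depth + 1, out)
    else:
        out.append((depth, "atomic"))
    return out

walk([1, [2, 3]])
# [(0, 'container'), (1, 'atomic'), (1, 'container'), (2, 'atomic'), (2, 'atomic')]
```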
Using R’s built-in serialization, the raw bytes are compressed using
general-purpose compressors such as gzip, xz and zstd.
These compressors look for redundancies/patterns and calculate optimal ways of using a smaller number of bits to represent common structures in the data.
Where zap differs is that it includes an extra layer of data-dependent
transformations prior to compression.
Using the custom serialization mechanism in zap, contextual
information about the bytes is known as part of the process, e.g. when
serializing an integer vector, we know that each set of 4 bytes is a
32-bit signed integer represented in two’s-complement form.
Knowing the type of data that is in the bytes allows us to do some fast, low-memory transformations on the bytes which will:
- losslessly reduce the number of bits required to represent each value
- make the data much more amenable to compression by the standard compressors
The following sections give a high-level overview of the transformations. To find out more details, the interested reader is directed to the C source code to read the implementation.
Note: these transformations were arrived at after some trial and error.
They are, by no means, the final answer to the best transformations to
use. See the Future work section below for some ideas on how this
package might be extended/improved.
Logical data may be packed:
- Take all the lowest bits of each logical value (this is the only bit which indicates if the value is TRUE or FALSE). Create a bitstream with 1-bit for each value.
- Encode the locations of `NA` values in an auxiliary bitstream (1 bit for each value).

Each logical was originally stored in a 32-bit data type, and is now represented by just 2 bits (one in each bitstream).
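As a rough illustration (in Python; `pack_logicals` is a hypothetical name, and real code would pack these bits into 64-bit words rather than lists), the two bitstreams might be built like this:

```python
def pack_logicals(values):
    """Split R-style logicals (True/False/None-for-NA) into two bitstreams:
    one value bit and one NA-marker bit per element. Shown as lists of
    0/1 ints; a real implementation packs them into 64-bit words."""
    value_bits = [1 if v is True else 0 for v in values]  # NA contributes 0 here
    na_bits    = [1 if v is None else 0 for v in values]  # 1 marks an NA position
    return value_bits, na_bits

pack_logicals([True, False, None, True])
# ([1, 0, 0, 1], [0, 0, 1, 0])
```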
Two transformations are available for integers:

- `zzshuf` ZigZag encoding with delta, then byte shuffling
- `delta_frame` frame-of-reference coding of the deltas (difference between consecutive elements)
- ZigZag encoding to recode integers without a sign bit
- Take the difference between consecutive numbers.
- Shuffle the bytes within the vector such that it is more likely
zeros will be next to each other.
- Each integer is 4 bytes, i.e. ABCD, ABCD, ABCD, …
- Reorder bytes to: AAA.., BBB…, CCC…, DDD…
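A minimal Python sketch of this pipeline. Note an assumption: the list above applies ZigZag before delta, but the sketch below takes the delta first so every value handed to ZigZag stays small; consult zap's C source for the actual order. 32-bit values are assumed throughout.

```python
def zzshuf_sketch(xs):
    """Delta, then ZigZag (signed -> unsigned), then byte shuffle.
    Assumes all values and deltas fit in 32 bits."""
    deltas = [xs[0]] + [b - a for a, b in zip(xs, xs[1:])]
    zz = [(d << 1) ^ (d >> 63) for d in deltas]       # 0,-1,1,-2 -> 0,1,2,3
    raw = b"".join(v.to_bytes(4, "little") for v in zz)
    # gather byte 0 of every value, then byte 1, ... so zero bytes cluster
    return b"".join(raw[i::4] for i in range(4))

zzshuf_sketch([100, 101, 103, 100])
# deltas [100, 1, 2, -3] -> zigzag [200, 2, 4, 5]
# shuffled: b'\xc8\x02\x04\x05' followed by 12 zero bytes
```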
General overview of frame-of-reference coding:
- Take the difference between consecutive numbers
- If the largest difference is >= 4096, use `zzshuf` instead
- Find the number of bits needed to encode the largest difference
- Encode all differences with this number of bits
- Pack these low-bit representations of differences into 64-bit integers
- Encode the locations of `NA` values in an auxiliary bitstream (1 bit for each number).
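A simplified Python sketch of the packing step. Assumptions: the input is non-decreasing so all deltas are non-negative, the first value and the NA bitstream are stored elsewhere (omitted here), and `pack_deltas` is a hypothetical name:

```python
def pack_deltas(xs):
    """Frame-of-reference sketch: bit-pack consecutive differences into
    64-bit words, never letting a value straddle a word boundary."""
    deltas = [b - a for a, b in zip(xs, xs[1:])]
    largest = max(deltas, default=0)
    if largest >= 4096:
        return None                       # zap falls back to zzshuf here
    width = max(largest.bit_length(), 1)  # bits needed for the largest delta
    per_word = 64 // width
    words = []
    for i in range(0, len(deltas), per_word):
        word = 0
        for j, d in enumerate(deltas[i:i + per_word]):
            word |= d << (j * width)
        words.append(word)
    return width, words

pack_deltas([0, 3, 10, 10])
# deltas [3, 7, 0], width 3 -> (3, [59])   since 59 == 3 | 7 << 3
```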
Factors may be packed:
- The number of levels of a factor is known without having to calculate anything
- If the number of levels >= 4096, just encode the factor as an integer using `zzshuf`
- Find the number of bits needed to encode the maximum level
- Encode all factors with this number of bits
- Pack these bits into 64-bit integers
- NA values are encoded as zero (since zero is not a valid factor level)
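Sketched in Python (a hypothetical helper; the real bit layout lives in zap's C code):

```python
def pack_factor(codes, n_levels):
    """Factor packing sketch: NA stored as 0 (never a valid level), and
    level codes bit-packed into 64-bit words without straddling."""
    if n_levels >= 4096:
        return None                      # zap encodes as integer via zzshuf
    width = max(n_levels.bit_length(), 1)
    words, word, used = [], 0, 0
    for c in codes:                      # c is 1..n_levels, or None for NA
        v = 0 if c is None else c
        if used + width > 64:            # start a fresh word, never straddle
            words.append(word)
            word, used = 0, 0
        word |= v << used
        used += width
    words.append(word)
    return width, words

pack_factor([1, None, 3], n_levels=5)
# width 3, one word: 1 | 0 << 3 | 3 << 6 == 193  ->  (3, [193])
```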
Encode a character vector as a mega string by writing out:
- The total length of all strings
- The concatenation of all the nul-terminated strings (where length is encoded implicitly by the position of the nul-bytes)
- Encode the locations of `NA` values in an auxiliary bitstream (1 bit for each string).
This approach avoids encoding a separate length for each individual character string.
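A Python sketch of this layout (a hypothetical helper; whether the stored total length includes the nul terminators is my assumption, not confirmed by the source):

```python
def pack_strings(strs):
    """Mega-string sketch: one total length, all strings concatenated
    with nul terminators, plus an NA bitstream. An NA contributes a
    bit to the NA stream but no bytes to the blob (assumption)."""
    na_bits = [1 if s is None else 0 for s in strs]
    blob = b"".join(s.encode("utf-8") + b"\x00" for s in strs if s is not None)
    return len(blob), blob, na_bits

pack_strings(["ab", None, "c"])
# (5, b'ab\x00c\x00', [0, 1, 0])
```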
Floating point compression is notoriously difficult, and the best transformation to apply is heavily dependent on the characteristics of the data.
Three transformations are available for doubles:

- `shuffle` byte shuffle
- `delta_shuffle` delta and byte shuffle
- `alp` Adaptive Lossless floating-Point compression
- Given each double is an 8-byte sequence: ABCDEFGH, ABCDEFGH, …
- Reorder the bytes to be: AA…, BB…, CC…, DD…, EE…, FF.., GG…, HH…
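In Python, the shuffle looks like the following (illustrative only). Because the high bytes of doubles hold sign/exponent bits that are often similar across a vector, grouping them produces long runs the compressor can exploit:

```python
import struct

def shuffle_doubles(xs):
    """Byte shuffle for doubles: gather byte 0 of every value, then
    byte 1, ..., then byte 7 (little-endian layout assumed)."""
    raw = struct.pack(f"<{len(xs)}d", *xs)
    return b"".join(raw[i::8] for i in range(8))

shuffle_doubles([1.0, 2.0])
# 12 zero bytes, then the sign/exponent bytes: b'\xf0\x00\x3f\x40'
```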
- Treat every double precision float as an unsigned 64-bit integer
- Take the difference between consecutive values
- Given each value is an 8-byte sequence: ABCDEFGH, ABCDEFGH, …
- Reorder the bytes to be: AA…, BB…, CC…, DD…, EE…, FF.., GG…, HH…
This method is adapted from Afroozeh et al., *ALP: Adaptive Lossless floating-Point Compression*. The original C++ code on GitHub and a Rust implementation are both available. Note: the Rust version is faster than the C++ one.
- Examine a sample of the values
- Determine if these numbers represent floating point numbers with a finite number of decimal places
- If many numbers fail this criterion, fall back to the `shuffle` or `delta_shuffle` technique
- Determine the powers of 10 that best convert the numbers to integer form
- Convert numbers to integer form
- Apply differencing and byte shuffling
- Any individual values which were not successfully encoded are
  stored in an auxiliary stream of “patches” to be applied when
  un-transforming the data. These include `NA`, `NaN` and `Inf`, as well as any floating point value not convertible to an integer.
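A very rough Python sketch of the exponent search (zap's real ALP samples the data, chooses exponents heuristically, and patches individual failures rather than giving up; `alp_sketch` is purely illustrative):

```python
import math

def alp_sketch(xs):
    """Find a power of 10 that converts every finite value to an integer
    exactly (verified by round-tripping the division). Returns
    (exponent, ints), or None to signal a fallback to shuffle /
    delta_shuffle. Non-finite values would be patched in real ALP."""
    finite = [x for x in xs if math.isfinite(x)]
    for e in range(16):
        scale = 10.0 ** e
        ints = [round(x * scale) for x in finite]
        if all(n / scale == x for n, x in zip(ints, finite)):
            return e, ints
    return None

alp_sketch([0.1, 0.2])
# (1, [1, 2]) - one decimal place suffices and round-trips exactly
```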
Each of the data types which support transformation (integer, logical, factor, double, character) allows up to 256 possible transformations.
There is room here for experimentation, new transformations and heuristics to choose the “optimal” transformation based upon data characteristics.
There are some SEXPs which are still serialized using R’s built-in
mechanism, and then those raw bytes are inserted into the zap output.
It would be nice if zap managed to handle all SEXPs without resorting
to R.
Current SEXPs which use R’s serialization mechanism:
- BCODESXP - bytecode representations
- DOTSXP - representation of the `...` object
- SPECIALSXP
- BUILTINSXP
- PROMSXP(?)
- ANYSXP - not seen in real objects?
- EXTPTRSXP
- WEAKREFSXP
- S4SXP
Current bit packing occurs within unsigned 64-bit integers, and a packed element will never cross from one 64-bit integer to the next.
E.g. If it is known that all factor values fit into 10-bit integers, then 6 factor values will be packed into a single 64-bit destination - leaving 4 bits unused.
This packing is inefficient, but was easy to code.
It would be worth trialling a general purpose packing routine which can pack bits compactly at all sizes, with no wastage.
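The wastage is easy to quantify (hypothetical Python helpers, just to show the arithmetic):

```python
def words_current(n, width):
    """Word count with zap's current scheme: values never straddle a
    64-bit word boundary, so each word holds floor(64 / width) values."""
    per_word = 64 // width
    return -(-n // per_word)            # ceil(n / per_word)

def words_compact(n, width):
    """Word count for a fully compact packer that allows straddling."""
    return -(-(n * width) // 64)        # ceil(total_bits / 64)

# 32 ten-bit values: 6 words today (4 bits wasted per word) vs 5 if compact
```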
- Try StreamVByte for integer compression. I did experiment with this early on, and may have discounted it too quickly. Does it offer any speed/compression advantages over simple “delta + shuffle” once we add zstd compression?
- Port Rust version of ALP to R