zap is an alternate serialization framework for R objects. It features
high compression at fast speeds.
Two aims for this package:
- Provide an alternate serialization framework to the one built into R.
- Write highly compressed data quickly by leveraging contextual information.
- `zap_read()`, `zap_write()`: read/write objects to raw vectors and files
- `zap_version()`: the version of the set of data transformations used internally
- `zap_opts()`: a way of building more detailed configuration options to use with `zap()`
- `zap_count()`: a fast, simple count of the bytes needed to hold the uncompressed output of `zap_write()` (i.e. when `compress = "none"`)
Speed and compression performance are very dependent on the data being serialized.
The characteristics of any floating point data will have a big
influence, and it is worth trying the other floating point transformations,
e.g. `zap_write(x, dbl = "shuffle")`.
For small data, there is less of a difference between the different serialization options.
You can install the latest development version from GitHub with:
``` r
# install.packages('remotes')
remotes::install_github('coolbutuseless/zap')
```

Pre-built source/binary versions can also be installed from R-universe.
The graph below compares different serialization/compression options in R.
The x-axis is compression ratio - the size of the original data
relative to the size of the compressed data. Bigger compression ratios are
better. Both zap and xz are able to highly compress this data.
The y-axis is compression speed - how quickly is the data compressed
and written to file. Saving with zap is comparable in speed to saving
the data uncompressed, and is much faster than xz.
``` r
zap_write(diamonds, dst = "diamonds.zap", compress = "zstd")
```

| expression | min | median | itr/sec | size (bytes) |
|---|---|---|---|---|
| saveRDS(compress=FALSE) | 2.39ms | 2.87ms | 286.5 | 3452964 |
| qs2::qs_save() | 8.02ms | 8.42ms | 118.6 | 570855 |
| zap_write() | 3.68ms | 4.16ms | 234.9 | 329681 |
| saveRDS(zstd) | 43.79ms | 44.64ms | 22.1 | 524483 |
| saveRDS(xz) | 815.38ms | 815.38ms | 1.2 | 332468 |
Writing ‘diamonds’ to file
``` r
zap_read("diamonds.zap")
```

| expression | min | median | itr/sec | size (bytes) |
|---|---|---|---|---|
| readRDS(compress=FALSE) | 1.8ms | 1.93ms | 520.7 | 3452964 |
| qs2::qs_read() | 3.25ms | 3.4ms | 292.3 | 570855 |
| zap_read() | 1.97ms | 2.31ms | 436.6 | 329681 |
| readRDS(gzip) | 3.45ms | 3.56ms | 280.2 | 513651 |
| readRDS(zstd) | 4.54ms | 4.72ms | 211.4 | 524483 |
| readRDS(xz) | 15ms | 15.29ms | 65.5 | 332468 |
Reading ‘diamonds’ from file
When `verbosity = 64`, a data.frame of object information is returned
instead of the serialized object.
The columns of this data.frame:

- `depth` recursion depth
- `type` the SEXP type
- `start`/`end` the locations of this data in the uncompressed zap stream
- `altrep` was this item ALTREP?
- `rserialize` did this item use the fallback R serialization infrastructure?
``` r
zap_write(mtcars, verbosity = 64)
#> # A tibble: 20 × 7
#> depth type start end altrep rserialize attrs
#> <int> <fct> <int> <int> <lgl> <lgl> <lgl>
#> 1 1 VECSXP 5 3512 FALSE FALSE TRUE
#> 2 2 REALSXP 7 274 FALSE FALSE TRUE
#> 3 2 REALSXP 274 541 FALSE FALSE TRUE
#> 4 2 REALSXP 541 808 FALSE FALSE TRUE
#> 5 2 REALSXP 808 1075 FALSE FALSE TRUE
#> 6 2 REALSXP 1075 1342 FALSE FALSE TRUE
#> 7 2 REALSXP 1342 1609 FALSE FALSE TRUE
#> 8 2 REALSXP 1609 1876 FALSE FALSE TRUE
#> 9 2 REALSXP 1876 2143 FALSE FALSE TRUE
#> 10 2 REALSXP 2143 2410 FALSE FALSE TRUE
#> 11 2 REALSXP 2410 2677 FALSE FALSE TRUE
#> 12 2 REALSXP 2677 2944 FALSE FALSE TRUE
#> 13 2 LISTSXP 2944 3489 FALSE FALSE TRUE
#> 14 3 SYMSXP 2946 2956 FALSE FALSE TRUE
#> 15 3 STRSXP 2956 3013 FALSE FALSE TRUE
#> 16 3 SYMSXP 3013 3027 FALSE FALSE TRUE
#> 17 3 STRSXP 3027 3454 FALSE FALSE TRUE
#> 18 3 SYMSXP 3454 3464 FALSE FALSE TRUE
#> 19 3 STRSXP 3464 3487 FALSE FALSE TRUE
#> 20 2 STRSXP 3489 3512 FALSE FALSE TRUE
```

R objects are collections of SEXP elements. For the purposes of explaining this package, consider SEXP objects to be broken into 3 classes.
- Atomic vectors e.g. integers, strings
- Containers e.g. lists, environments, closures, data.frames
- Everything else e.g. bytecode, dots
Atomic vectors are the core things users would consider “data” in R: collections of integers, floating point numbers and logical values are what we actually want to compute with.
Containers allow R to organise atomic vectors into logical units - e.g. a data.frame is a collection of (mostly) atomic vectors. A list is a collection of other arbitrary R objects.
Everything else encompasses all the other details of the R language
that don’t need to be considered often by the user, e.g. the compiled
bytecode representation of a function, or the actual `...` object used in
function calls.
Both zap and R serialize objects the same way:
- walk along each object
- serialize the atomic vectors it contains
- serialize its attributes
- for any nested containers within the object, recurse into them and serialize their contents
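The walk above can be sketched in Python (lists and dicts stand in for R containers; this is an illustration of the traversal order only, not zap's actual implementation, and attribute handling is omitted):

```python
def walk(obj, depth=0, out=None):
    """Sketch of the serialization walk: record each node, then recurse
    into containers. Mirrors the `depth` column in zap's verbose output."""
    out = [] if out is None else out
    if isinstance(obj, (list, dict)):
        out.append((depth, "container"))
        children = obj.values() if isinstance(obj, dict) else obj
        for child in children:
            walk(child, depth + 1, out)
    else:
        out.append((depth, "atomic"))
    return out

walk([1, [2, 3]])
# [(0, 'container'), (1, 'atomic'), (1, 'container'), (2, 'atomic'), (2, 'atomic')]
```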
Using R’s built-in serialization, the raw bytes are compressed using
general-purpose compressors such as gzip, xz and zstd.
These compressors look for redundancies/patterns and calculate optimal ways of using a smaller number of bits to represent common structures in the data.
Where zap differs is that it includes an extra layer of data-dependent
transformations prior to compression.
Using the custom serialization mechanism in zap, contextual
information about the bytes is known as part of the process, e.g. when
serializing an integer vector, we know that each set of 4 bytes is a
32-bit signed integer represented in two’s-complement form.
Knowing the type of data that is in the bytes allows us to do some fast, low-memory transformations on the bytes which will:
- losslessly reduce the number of bits required to represent each value
- make the data much more amenable to compression by the standard compressors
The following sections give a high-level overview of the transformations. To find out more details, the interested reader is directed to the C source code to read the implementation.
Note: these transformations were arrived at after some trial and error.
They are, by no means, the final answer to the best transformations to
use. See the Future work section below for some ideas on how this
package might be extended/improved.
Logical data may be packed:
- Take all the lowest bits of each logical value (this is the only bit which indicates if the value is TRUE or FALSE). Create a bitstream with 1-bit for each value.
- Encode the locations of `NA` values in an auxiliary bitstream (1 bit for each value).

Each logical was originally stored in a 32-bit data type, and is now represented by just 2 bits (one in each bitstream).
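As a rough illustration (in Python; `pack_logicals` is a hypothetical name, and real code would pack these bits into 64-bit words rather than lists), the two bitstreams might be built like this:

```python
def pack_logicals(values):
    """Split R-style logicals (True/False/None-for-NA) into two bitstreams:
    one value bit and one NA-marker bit per element. Shown as lists of
    0/1 ints; a real implementation packs them into 64-bit words."""
    value_bits = [1 if v is True else 0 for v in values]  # NA contributes 0 here
    na_bits    = [1 if v is None else 0 for v in values]  # 1 marks an NA position
    return value_bits, na_bits

pack_logicals([True, False, None, True])
# ([1, 0, 0, 1], [0, 0, 1, 0])
```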
Two transformations are available for integers:

- `zzshuf` ZigZag encoding with delta, then byte shuffling
- `delta_frame` frame-of-reference coding of the deltas (difference between consecutive elements)
- ZigZag encoding to recode integers without a sign bit
- Take the difference between consecutive numbers.
- Shuffle the bytes within the vector such that it is more likely
zeros will be next to each other.
- Each integer is 4 bytes, i.e. ABCD, ABCD, ABCD, …
- Reorder bytes to: AAA.., BBB…, CCC…, DDD…
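A minimal Python sketch of this pipeline. Note an assumption: the list above applies ZigZag before delta, but the sketch below takes the delta first so every value handed to ZigZag stays small; consult zap's C source for the actual order. 32-bit values are assumed throughout.

```python
def zzshuf_sketch(xs):
    """Delta, then ZigZag (signed -> unsigned), then byte shuffle.
    Assumes all values and deltas fit in 32 bits."""
    deltas = [xs[0]] + [b - a for a, b in zip(xs, xs[1:])]
    zz = [(d << 1) ^ (d >> 63) for d in deltas]       # 0,-1,1,-2 -> 0,1,2,3
    raw = b"".join(v.to_bytes(4, "little") for v in zz)
    # gather byte 0 of every value, then byte 1, ... so zero bytes cluster
    return b"".join(raw[i::4] for i in range(4))

zzshuf_sketch([100, 101, 103, 100])
# deltas [100, 1, 2, -3] -> zigzag [200, 2, 4, 5]
# shuffled: b'\xc8\x02\x04\x05' followed by 12 zero bytes
```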
General overview of frame-of-reference coding:
- Take the difference between consecutive numbers
- If the largest difference is >= 4096, use `zzshuf` instead
- Find the number of bits needed to encode the largest difference
- Encode all differences with this number of bits
- Pack these low-bit representations of differences into 64-bit integers
- Encode the locations of `NA` values in an auxiliary bitstream (1 bit for each number).
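A simplified Python sketch of the packing step. Assumptions: the input is non-decreasing so all deltas are non-negative, the first value and the NA bitstream are stored elsewhere (omitted here), and `pack_deltas` is a hypothetical name:

```python
def pack_deltas(xs):
    """Frame-of-reference sketch: bit-pack consecutive differences into
    64-bit words, never letting a value straddle a word boundary."""
    deltas = [b - a for a, b in zip(xs, xs[1:])]
    largest = max(deltas, default=0)
    if largest >= 4096:
        return None                       # zap falls back to zzshuf here
    width = max(largest.bit_length(), 1)  # bits needed for the largest delta
    per_word = 64 // width
    words = []
    for i in range(0, len(deltas), per_word):
        word = 0
        for j, d in enumerate(deltas[i:i + per_word]):
            word |= d << (j * width)
        words.append(word)
    return width, words

pack_deltas([0, 3, 10, 10])
# deltas [3, 7, 0], width 3 -> (3, [59])   since 59 == 3 | 7 << 3
```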
Factors may be packed:
- The number of levels of a factor is known without having to calculate anything
- If the number of levels >= 4096, just encode the factor as an integer using `zzshuf`
- Find the number of bits needed to encode the maximum level
- Encode all factors with this number of bits
- Pack these bits into 64-bit integers
- NA values are encoded as zero (since zero is not a valid factor level)
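Sketched in Python (a hypothetical helper; the real bit layout lives in zap's C code):

```python
def pack_factor(codes, n_levels):
    """Factor packing sketch: NA stored as 0 (never a valid level), and
    level codes bit-packed into 64-bit words without straddling."""
    if n_levels >= 4096:
        return None                      # zap encodes as integer via zzshuf
    width = max(n_levels.bit_length(), 1)
    words, word, used = [], 0, 0
    for c in codes:                      # c is 1..n_levels, or None for NA
        v = 0 if c is None else c
        if used + width > 64:            # start a fresh word, never straddle
            words.append(word)
            word, used = 0, 0
        word |= v << used
        used += width
    words.append(word)
    return width, words

pack_factor([1, None, 3], n_levels=5)
# width 3, one word: 1 | 0 << 3 | 3 << 6 == 193  ->  (3, [193])
```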
Encode a character vector as a mega string by writing out:
- The total length of all strings
- The concatenation of all the nul-terminated strings (where length is encoded implicitly by the position of the nul-bytes)
- Encode the locations of `NA` values in an auxiliary bitstream (1 bit for each string).
This approach avoids encoding a separate length for each individual character string.
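A Python sketch of this layout (a hypothetical helper; whether the stored total length includes the nul terminators is my assumption, not confirmed by the source):

```python
def pack_strings(strs):
    """Mega-string sketch: one total length, all strings concatenated
    with nul terminators, plus an NA bitstream. An NA contributes a
    bit to the NA stream but no bytes to the blob (assumption)."""
    na_bits = [1 if s is None else 0 for s in strs]
    blob = b"".join(s.encode("utf-8") + b"\x00" for s in strs if s is not None)
    return len(blob), blob, na_bits

pack_strings(["ab", None, "c"])
# (5, b'ab\x00c\x00', [0, 1, 0])
```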
Floating point compression is notoriously difficult, and the best transformation to apply is heavily dependent on the characteristics of the data.
Three transformations are available for doubles:

- `shuffle` byte shuffle
- `delta_shuffle` delta and byte shuffle
- `alp` Adaptive Lossless floating-Point compression
- Given each double is an 8-byte sequence: ABCDEFGH, ABCDEFGH, …
- Reorder the bytes to be: AA…, BB…, CC…, DD…, EE…, FF.., GG…, HH…
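In Python, the shuffle looks like the following (illustrative only). Because the high bytes of doubles hold sign/exponent bits that are often similar across a vector, grouping them produces long runs the compressor can exploit:

```python
import struct

def shuffle_doubles(xs):
    """Byte shuffle for doubles: gather byte 0 of every value, then
    byte 1, ..., then byte 7 (little-endian layout assumed)."""
    raw = struct.pack(f"<{len(xs)}d", *xs)
    return b"".join(raw[i::8] for i in range(8))

shuffle_doubles([1.0, 2.0])
# 12 zero bytes, then the sign/exponent bytes: b'\xf0\x00\x3f\x40'
```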
- Treat every double precision float as an unsigned 64-bit integer
- Take the difference between consecutive values
- Given each value is an 8-byte sequence: ABCDEFGH, ABCDEFGH, …
- Reorder the bytes to be: AA…, BB…, CC…, DD…, EE…, FF.., GG…, HH…
This method is adapted from Afroozeh et al., *ALP: Adaptive Lossless floating-Point Compression*. The original C++ code on GitHub and a Rust implementation are both available. Note: the Rust version is faster than the C++ one.
- Examine a sample of the values
- Determine if these numbers represent floating point numbers with a finite number of decimal places
- If many numbers fail this criterion, fall back to the `shuffle` or `delta_shuffle` technique
- Determine the powers of 10 that best convert the numbers to integer form
- Convert numbers to integer form
- Apply differencing and byte shuffling
- Any individual values which were not successfully encoded are
  stored in an auxiliary stream of “patches” to be applied when
  un-transforming the data. These include `NA`, `NaN` and `Inf`, as well as any floating point value not convertible to an integer.
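A very rough Python sketch of the exponent search (zap's real ALP samples the data, chooses exponents heuristically, and patches individual failures rather than giving up; `alp_sketch` is purely illustrative):

```python
import math

def alp_sketch(xs):
    """Find a power of 10 that converts every finite value to an integer
    exactly (verified by round-tripping the division). Returns
    (exponent, ints), or None to signal a fallback to shuffle /
    delta_shuffle. Non-finite values would be patched in real ALP."""
    finite = [x for x in xs if math.isfinite(x)]
    for e in range(16):
        scale = 10.0 ** e
        ints = [round(x * scale) for x in finite]
        if all(n / scale == x for n, x in zip(ints, finite)):
            return e, ints
    return None

alp_sketch([0.1, 0.2])
# (1, [1, 2]) - one decimal place suffices and round-trips exactly
```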
Each of the data types which support transformation (integer, logical, factor, double, character) allows up to 256 possible transformations.
There is room here for experimentation, new transformations and heuristics to choose the “optimal” transformation based upon data characteristics.
There are some SEXPs which are still serialized using R’s built-in
mechanism, and then those raw bytes are inserted into the zap output.
It would be nice if zap managed to handle all SEXPs without resorting
to R.
Current SEXPs which use R’s serialization mechanism:
- BCODESXP - bytecode representations
- DOTSXP - representation of the `...` object
- SPECIALSXP
- BUILTINSXP
- PROMSXP(?)
- ANYSXP - not seen in real objects?
- EXTPTRSXP
- WEAKREFSXP
- S4SXP
Current bit packing occurs within unsigned 64-bit integers, and a packed element will never cross from one 64-bit integer to the next.
E.g. If it is known that all factor values fit into 10-bit integers, then 6 factor values will be packed into a single 64-bit destination - leaving 4 bits unused.
This packing is inefficient, but was easy to code.
It would be worth trialling a general purpose packing routine which can pack bits compactly at all sizes, with no wastage.
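The wastage is easy to quantify (hypothetical Python helpers, just to show the arithmetic):

```python
def words_current(n, width):
    """Word count with zap's current scheme: values never straddle a
    64-bit word boundary, so each word holds floor(64 / width) values."""
    per_word = 64 // width
    return -(-n // per_word)            # ceil(n / per_word)

def words_compact(n, width):
    """Word count for a fully compact packer that allows straddling."""
    return -(-(n * width) // 64)        # ceil(total_bits / 64)

# 32 ten-bit values: 6 words today (4 bits wasted per word) vs 5 if compact
```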
- Try StreamVByte for integer compression. I did experiment with this early on, and may have discounted it too quickly. Does it offer any speed/compression advantages over simple “delta + shuffle” once we add zstd compression?
- Port Rust version of ALP to R