Fast object serialization with high compression

License

Unknown, MIT licenses found

Licenses found

Unknown
LICENSE
MIT
LICENSE.md

coolbutuseless/zap

zap

zap is an alternate serialization framework for R objects. It features high compression at fast speeds.

Two aims for this package:

  1. Provide an alternate serialization framework to the one built into R.
  2. Write highly compressed data quickly by leveraging contextual information.

What’s in the box

  • zap_read(), zap_write() to read/write objects to raw vectors and files
  • zap_version() the version of the set of data transformations used internally
  • zap_opts() a way of building more detailed configuration options to use with zap()
  • zap_count() a fast simple count of the bytes needed to hold the uncompressed output of zap_write() (i.e. when compress = "none")

Caveats

Speed and compression performance are very dependent on the data being serialized.

The characteristics of any floating point data have a big influence, so it is worth trying other floating point transformations, e.g. zap_write(x, dbl = "shuffle").

For small data, there is less of a difference between the different serialization options.

Installation

You can install the latest development version from GitHub with:

# install.packages('remotes')
remotes::install_github('coolbutuseless/zap')

Pre-built source/binary versions can also be installed from R-universe.

Example: Writing diamonds to file

The graph below compares different serialization/compression options in R.

The x-axis is compression ratio: the size of the original data relative to the size of the compressed data. Bigger compression ratios are better. Both zap and xz are able to highly compress this data.

The y-axis is compression speed: how quickly the data is compressed and written to file. Saving with zap is comparable in speed to saving the data uncompressed, and is much faster than xz.

zap_write(diamonds, dst = "diamonds.zap", compress = "zstd")

expression               min       median    itr/sec  size (bytes)
saveRDS(compress=FALSE)  2.39ms    2.87ms    286.5    3452964
qs2::qs_save()           8.02ms    8.42ms    118.6    570855
zap_write()              3.68ms    4.16ms    234.9    329681
saveRDS(zstd)            43.79ms   44.64ms   22.1     524483
saveRDS(xz)              815.38ms  815.38ms  1.2      332468

Writing ‘diamonds’ to file

Example: Reading diamonds from file

zap_read("diamonds.zap")

expression               min     median   itr/sec  size (bytes)
readRDS(compress=FALSE)  1.8ms   1.93ms   520.7    3452964
qs2::qs_read()           3.25ms  3.4ms    292.3    570855
zap_read()               1.97ms  2.31ms   436.6    329681
readRDS(gzip)            3.45ms  3.56ms   280.2    513651
readRDS(zstd)            4.54ms  4.72ms   211.4    524483
readRDS(xz)              15ms    15.29ms  65.5     332468

Reading ‘diamonds’ from file

Verbose output

When verbosity = 64, a data.frame of object information is returned instead of the serialized object.

  • depth recursion depth
  • type the SEXP type
  • start/end are the locations of this data in the uncompressed zap stream
  • altrep was this item ALTREP?
  • rserialize did this item use the fallback R serialization infrastructure?

zap_write(mtcars, verbosity = 64)
#> # A tibble: 20 × 7
#>    depth type    start   end altrep rserialize attrs
#>    <int> <fct>   <int> <int> <lgl>  <lgl>      <lgl>
#>  1     1 VECSXP      5  3512 FALSE  FALSE      TRUE 
#>  2     2 REALSXP     7   274 FALSE  FALSE      TRUE 
#>  3     2 REALSXP   274   541 FALSE  FALSE      TRUE 
#>  4     2 REALSXP   541   808 FALSE  FALSE      TRUE 
#>  5     2 REALSXP   808  1075 FALSE  FALSE      TRUE 
#>  6     2 REALSXP  1075  1342 FALSE  FALSE      TRUE 
#>  7     2 REALSXP  1342  1609 FALSE  FALSE      TRUE 
#>  8     2 REALSXP  1609  1876 FALSE  FALSE      TRUE 
#>  9     2 REALSXP  1876  2143 FALSE  FALSE      TRUE 
#> 10     2 REALSXP  2143  2410 FALSE  FALSE      TRUE 
#> 11     2 REALSXP  2410  2677 FALSE  FALSE      TRUE 
#> 12     2 REALSXP  2677  2944 FALSE  FALSE      TRUE 
#> 13     2 LISTSXP  2944  3489 FALSE  FALSE      TRUE 
#> 14     3 SYMSXP   2946  2956 FALSE  FALSE      TRUE 
#> 15     3 STRSXP   2956  3013 FALSE  FALSE      TRUE 
#> 16     3 SYMSXP   3013  3027 FALSE  FALSE      TRUE 
#> 17     3 STRSXP   3027  3454 FALSE  FALSE      TRUE 
#> 18     3 SYMSXP   3454  3464 FALSE  FALSE      TRUE 
#> 19     3 STRSXP   3464  3487 FALSE  FALSE      TRUE 
#> 20     2 STRSXP   3489  3512 FALSE  FALSE      TRUE

zap Technical details

R object structure

R objects are collections of SEXP elements. For the purposes of explaining this package, consider SEXP objects to be broken up into 3 classes:

  1. Atomic vectors e.g. integers, strings
  2. Containers e.g. lists, environments, closures, data.frames
  3. Everything else e.g. bytecode, dots

Atomic Vectors are the core things users would consider data in R. Collections of integers, floating point numbers, logical values are what we actually want to compute with.

Containers allow R to organise atomic vectors into logical units - e.g. a data.frame is a collection of (mostly) atomic vectors. A list is a collection of other arbitrary R objects.

Everything else encompasses all the other details of the R language that users rarely need to consider, e.g. the compiled bytecode representation of a function, or the actual ... object used in function calls.

Serializing R objects

Both zap and R serialize objects the same way:

  • walk along each object
  • serialize the atomic vectors it contains
  • serialize its attributes
  • for any nested containers within the object, recurse into them and serialize their contents
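
The walk described above can be illustrated with a small base R sketch. This is not zap's code: walk_object() is a hypothetical helper that only records the depth and typeof() of each element it visits (R-level type names, not SEXP names), and it ignores environments, closures and other non-list containers.

```r
# Hypothetical helper (not part of zap): visit an object, its contents and
# its attributes, recording the recursion depth and R-level type of each.
walk_object <- function(x, depth = 1L) {
  rows <- list(data.frame(depth = depth, type = typeof(x),
                          stringsAsFactors = FALSE))
  # Containers: recurse into every element. Atomic vectors are leaves.
  if (is.list(x)) {
    for (el in x) {
      rows <- c(rows, list(walk_object(el, depth + 1L)))
    }
  }
  # Attributes (names, class, levels, ...) are visited after the contents
  for (a in attributes(x)) {
    rows <- c(rows, list(walk_object(a, depth + 1L)))
  }
  do.call(rbind, rows)
}

walk_object(list(a = 1:3, b = list(c = "x")))
```

A real serializer would write bytes at each step instead of collecting rows, but the traversal order is the part this sketch is meant to show.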

Compressing R objects

Using R’s built-in serialization, raw bytes are compressed using general-purpose compressors such as gzip, xz and zstd.

These compressors look for redundancies/patterns and calculate optimal ways of using a smaller number of bits to represent common structures in the data.

Where zap differs is that it includes an extra layer of data-dependent transformations prior to compression.

zap includes lightweight transforms for improving compression

Using the custom serialization mechanism in zap, contextual information about bytes is known as part of the process, e.g. when serializing an integer vector, we know that each set of 4 bytes is a 32-bit signed integer represented in two's-complement form.

Knowing the type of data that is in the bytes allows us to do some fast, low-memory transformations on the bytes which will:

  • losslessly reduce the number of bits required to represent each value
  • make the data much more amenable to compression by the standard compressors

The following sections give a high-level overview of the transformations. To find out more details, the interested reader is directed to the C source code to read the implementation.

Note: these transformations were arrived at after some trial and error. They are, by no means, the final answer to the best transformations to use. See the Future work section below for some ideas on how this package might be extended/improved.

zap Transformations

Logical transformation

Logical data may be packed:

  1. Take all the lowest bits of each logical value (this is the only bit which indicates if the value is TRUE or FALSE). Create a bitstream with 1-bit for each value.
  2. Encode the locations of NA values in an auxiliary bitstream (1-bit for each value).

Each logical was originally stored in a 32-bit data type, and is now represented by just 2 bits (one in each bitstream).
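
As a rough illustration of the two-bitstream idea, here is a minimal base R sketch. pack_logical() and unpack_logical() are hypothetical names, not zap functions, and the layout (padding to whole bytes with packBits()) is an assumption made for the example rather than zap's on-disk format.

```r
# Hypothetical helpers: pack a logical vector into a value bitstream and
# an NA bitstream, one bit per element in each.
pack_logical <- function(x) {
  pad <- (-length(x)) %% 8L            # packBits() needs a multiple of 8 bits
  value_bits <- c(!is.na(x) & x, logical(pad))   # TRUE only where x is TRUE
  na_bits    <- c(is.na(x), logical(pad))        # TRUE only where x is NA
  list(n      = length(x),
       values = packBits(value_bits, type = "raw"),
       nas    = packBits(na_bits, type = "raw"))
}

unpack_logical <- function(p) {
  idx <- seq_len(p$n)                  # drop the padding bits
  out <- as.logical(rawToBits(p$values))[idx]
  out[as.logical(rawToBits(p$nas))[idx]] <- NA
  out
}

x <- c(TRUE, FALSE, NA, TRUE, TRUE, FALSE, NA, NA, FALSE, TRUE)
p <- pack_logical(x)
identical(unpack_logical(p), x)
```

Ten logicals occupy 40 bytes in memory; in this sketch they fit in two bytes per bitstream.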

Integer transformations

  1. zzshuf ZigZag encoding with delta, then byte shuffling
  2. delta_frame Frame-of-reference coding of the deltas (difference between consecutive elements)

Integer: zzshuf ZigZag encoding with byte shuffling

  1. ZigZag encoding to recode integers without a sign bit
  2. Take the difference between consecutive numbers.
  3. Shuffle the bytes within the vector such that it is more likely zeros will be next to each other.
    • Each integer is 4 bytes i.e. ABCD, ABCD, ABCD, …
    • Reorder bytes to: AAA…, BBB…, CCC…, DDD…
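
The three steps can be sketched in base R as follows. This is only an illustration under several assumptions: the function names are hypothetical, unsigned 32-bit arithmetic is emulated with doubles (R has no unsigned integer type), the byte order within each group is chosen arbitrarily (lowest byte first), and NA values are not handled.

```r
# Hypothetical helpers illustrating zigzag -> delta -> byte shuffle.
zigzag   <- function(x) { x <- as.numeric(x); ifelse(x >= 0, 2 * x, -2 * x - 1) }
unzigzag <- function(z) ifelse(z %% 2 == 0, z / 2, -(z + 1) / 2)

zzshuf_sketch <- function(x) {
  z <- zigzag(x)                            # 1. remove the sign bit
  d <- c(z[1], diff(z)) %% 2^32             # 2. deltas, wrapped like uint32
  # 3. n x 4 matrix of bytes; reading it column by column emits the
  #    lowest byte of every value first, then the next byte, and so on
  bytes <- outer(d, 0:3, function(v, k) (v %/% 256^k) %% 256)
  as.raw(as.vector(bytes))
}

unzzshuf_sketch <- function(bytes) {
  m <- matrix(as.numeric(as.integer(bytes)), ncol = 4)
  d <- as.vector(m %*% 256^(0:3))           # reassemble each 4-byte value
  # cumsum() in doubles stays exact for the short vectors used here
  as.integer(unzigzag(cumsum(d) %% 2^32))
}

x <- c(100L, 101L, 103L, 102L, -5L, 1000000L)
identical(unzzshuf_sketch(zzshuf_sketch(x)), x)
```

For slowly changing data most deltas are small, so the three high-byte groups are long runs of identical bytes, which is what a downstream compressor such as zstd handles well.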

Integer: delta_frame Frame-of-reference coding of deltas

General overview of frame-of-reference coding

  1. Take the difference between consecutive numbers
  2. If the largest difference is >= 4096, use zzshuf instead
  3. Find the number of bits to encode the largest difference
  4. Encode all differences with this number of bits
  5. Pack these low-bit representations of differences into 64-bit integers
  6. Encode the locations of NA values in an auxiliary bitstream (1-bit for each number).
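
A minimal base R sketch of this scheme is shown below. The function names are hypothetical, the smallest delta is used as the frame of reference (an assumption, since the overview above does not say how negative deltas are handled), and the NA bitstream is left out. Each 64-bit word is represented as 8 raw bytes because R has no native 64-bit integer type.

```r
# Hypothetical helpers: frame-of-reference packing of integer deltas.
delta_frame_pack <- function(x) {
  stopifnot(is.integer(x), !anyNA(x), length(x) >= 2)
  d    <- diff(x)
  offs <- d - min(d)                      # assumption: reference = smallest delta
  if (max(offs) >= 4096) return(NULL)     # too wide: the README uses zzshuf here
  bits     <- max(1L, as.integer(ceiling(log2(max(offs) + 1))))
  per_word <- 64L %/% bits                # values never straddle a word
  n_words  <- ceiling(length(offs) / per_word)
  # bits x n matrix holding the low `bits` bits of every offset
  b <- matrix(as.integer(intToBits(offs)), nrow = 32)[seq_len(bits), , drop = FALSE]
  b <- cbind(b, matrix(0L, bits, n_words * per_word - ncol(b)))
  words <- matrix(0L, 64, n_words)        # one column per 64-bit word
  words[seq_len(bits * per_word), ] <- b
  list(first = x[1], ref = min(d), bits = bits, n = length(d),
       packed = packBits(as.logical(words), type = "raw"))
}

delta_frame_unpack <- function(p) {
  per_word <- 64L %/% p$bits
  words <- matrix(as.integer(rawToBits(p$packed)), nrow = 64)
  b <- matrix(words[seq_len(p$bits * per_word), ], nrow = p$bits)
  b <- b[, seq_len(p$n), drop = FALSE]    # drop padding values
  offs <- as.integer(colSums(b * 2^(seq_len(p$bits) - 1)))
  as.integer(cumsum(c(p$first, offs + p$ref)))
}

x <- c(10L, 12L, 15L, 15L, 20L, 21L, 30L)
p <- delta_frame_pack(x)
identical(delta_frame_unpack(p), x)
```

In this example the six deltas need 4 bits each, so all of them fit into a single 64-bit word.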

Factor transformation

Factors may be packed:

  1. The number of levels of a factor is known without having to calculate anything
  2. If number of levels >= 4096, just encode factor as an integer using zzshuf
  3. Find the number of bits to encode the maximum level
  4. Encode all factors with this number of bits
  5. Pack these bits into 64-bit integers
  6. NA values are encoded as zero (since zero is not a valid factor level)
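
The bit-width calculation for factors can be sketched as below. factor_plan() is a hypothetical helper, not a zap function; it only works out the integer codes, the bits per value and how many values fit in one 64-bit word, and leaves out the actual packing and the 4096-level fallback.

```r
# Hypothetical helper: how many bits does each value of a factor need?
factor_plan <- function(f) {
  stopifnot(is.factor(f))
  codes <- as.integer(f)
  codes[is.na(codes)] <- 0L       # NA takes the otherwise unused zero code
  # codes span 0..nlevels(f), so nlevels(f) + 1 distinct values are needed
  bits <- max(1L, as.integer(ceiling(log2(nlevels(f) + 1))))
  list(codes = codes, bits = bits, per_word = 64L %/% bits)
}

f <- factor(c("a", "b", NA, "c", "a"), levels = c("a", "b", "c", "d", "e"))
factor_plan(f)
```

A factor with 5 levels needs 3 bits per value in this sketch, so 21 values share each 64-bit word instead of occupying 32 bits apiece.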

Character transformation

Encode a character vector as a mega string by writing out:

  1. The total length of all strings
  2. The concatenation of all the nul-terminated strings (where length is encoded implicitly by the position of the nul-bytes)
  3. Encode the locations of NA values in an auxiliary bitstream (1-bit for each string).

This approach avoids encoding a separate length for each individual character string.
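
A minimal base R sketch of the mega string idea follows. The helper names are hypothetical, the total length is assumed to count the nul terminators, NA strings are written as empty strings, and the NA locations are kept as a plain logical vector rather than a packed bitstream.

```r
# Hypothetical helpers: concatenate nul-terminated strings into one raw vector.
pack_strings <- function(x) {
  is_na <- is.na(x)
  x[is_na] <- ""                       # NA is recorded separately
  bytes <- unlist(lapply(enc2utf8(x), function(s) c(charToRaw(s), as.raw(0))))
  list(total = length(bytes), bytes = bytes, na = is_na)
}

unpack_strings <- function(p) {
  ends   <- which(p$bytes == as.raw(0))        # each nul byte closes one string
  starts <- c(1L, head(ends, -1) + 1L)
  out <- vapply(seq_along(ends), function(i) {
    rawToChar(p$bytes[seq_len(ends[i] - starts[i]) + starts[i] - 1L])
  }, character(1))
  Encoding(out) <- "UTF-8"
  out[p$na] <- NA_character_
  out
}

x <- c("apple", "", NA, "kiwi")
p <- pack_strings(x)
identical(unpack_strings(p), x)
```

The sketch does not handle zero-length character vectors; it is only meant to show that string boundaries can be recovered from the nul bytes alone.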

Floating point transformation

Floating point compression is notoriously difficult, and the best transformation to apply is heavily dependent on the characteristics of the data.

  1. shuffle Byte shuffle
  2. delta_shuffle Delta and byte shuffle
  3. alp Adaptive Lossless floating Point compression

Floating point: shuffle byte shuffle

  1. Given each double is an 8-byte sequence: ABCDEFGH, ABCDEFGH, …
  2. Reorder the bytes to be: AA…, BB…, CC…, DD…, EE…, FF…, GG…, HH…
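
Byte shuffling of doubles can be sketched in base R with writeBin() and a matrix transpose. shuffle_dbl() and unshuffle_dbl() are hypothetical names, and the little-endian byte order is an assumption made for the example.

```r
# Hypothetical helpers: group byte 1 of every double, then byte 2, and so on.
shuffle_dbl <- function(x) {
  bytes <- writeBin(x, raw(), size = 8, endian = "little")
  # 8 x n matrix: one column per double. Transposing and flattening
  # emits row 1 (the first byte of every value), then row 2, ...
  as.vector(t(matrix(bytes, nrow = 8)))
}

unshuffle_dbl <- function(bytes) {
  orig <- as.vector(t(matrix(bytes, ncol = 8)))
  readBin(orig, what = "double", n = length(bytes) / 8,
          size = 8, endian = "little")
}

x <- c(1, 2, 4)
shuffle_dbl(x)
identical(unshuffle_dbl(shuffle_dbl(x)), x)
```

For these three values the six low-order byte groups are all zero, and only the last two groups (exponent and top of the mantissa) carry information.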

Floating point: delta_shuffle delta and byte shuffle

  1. Treat every double precision float as an unsigned 64-bit integer
  2. Take the difference between consecutive values
  3. Given each value is an 8-byte sequence: ABCDEFGH, ABCDEFGH, …
  4. Reorder the bytes to be: AA…, BB…, CC…, DD…, EE…, FF…, GG…, HH…

Floating point: alp Adaptive Lossless floating Point compression

This method is adapted from Afroozeh et al., "ALP: Adaptive Lossless floating-Point Compression". The original C++ code on GitHub and a Rust implementation are available. Note: the Rust version is faster than c

  1. Examine a sample of the values
  2. Determine if these numbers represent floating point numbers with a finite number of decimal places
  3. If many numbers fail this criterion, then fall back to the shuffle or delta_shuffle technique
  4. Determine powers of 10 to best convert numbers to integer form
  5. Convert numbers to integer form
  6. Apply differencing and byte shuffling
  7. Any individual values which were not successfully encoded are stored in an auxiliary stream of “patches” to be applied when un-transforming the data. These include NA, NaN, Inf as well as any floating point value not convertible to an integer.
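
The core idea of steps 4, 5 and 7 can be sketched in base R as below. This is a heavily simplified illustration, not the ALP algorithm from the paper or zap's implementation: alp_sketch() and alp_unsketch() are hypothetical names, every value is examined instead of a sample, a single power of 10 is chosen, and the differencing and byte shuffling of step 6 are omitted.

```r
# Hypothetical helpers: store decimal-like doubles as integers plus patches.
alp_sketch <- function(x, max_exp = 8L) {
  # A value is encodable at exponent e if scaling, rounding and scaling
  # back reproduces it exactly and the integer fits in 32 bits
  encodable <- function(e) {
    i <- round(x * 10^e)
    is.finite(x) & abs(i) < 2^31 & (i / 10^e) == x
  }
  hits <- vapply(0:max_exp, function(e) sum(encodable(e)), numeric(1))
  e    <- which.max(hits) - 1L          # smallest exponent with the most hits
  ok   <- encodable(e)
  ints <- as.integer(ifelse(ok, round(x * 10^e), 0))
  list(exp = e, ints = ints,
       patch_idx = which(!ok), patch_val = x[!ok])   # exceptions kept verbatim
}

alp_unsketch <- function(p) {
  out <- p$ints / 10^p$exp
  out[p$patch_idx] <- p$patch_val
  out
}

x <- c(1.25, 3.5, 10.75, NA, 0.1, pi, Inf)
p <- alp_sketch(x)
identical(alp_unsketch(p), x)
```

Here NA, pi and Inf end up as patches, while the remaining values become small integers that the integer transformations can then process.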

Future work

Each data type which supports transformation (integer, logical, factor, double, character) has room for up to 256 possible transformations.

There is room here for experimentation, new transformations and heuristics to choose the “optimal” transformation based upon data characteristics.

Remove need for R’s serialize()

There are some SEXPs which are still serialized using R’s built-in mechanism, and then those raw bytes are inserted into the zap output. It would be nice if zap managed to handle all SEXPs without resorting to R.

Current SEXPs which use R’s serialization mechanism:

  • BCODESXP - bytecode representations
  • DOTSXP - representation of the ... object
  • SPECIALSXP
  • BUILTINSXP
  • PROMSXP(?)
  • ANYSXP - not seen in real objects?
  • EXTPTRSXP
  • WEAKREFSXP
  • S4SXP

Bitstream

Current bit packing occurs within unsigned 64-bit integers, and a packed element will never cross from one 64-bit integer to the next.

E.g. if it is known that all factor values fit into 10-bit integers, then 6 factor values will be packed into a single 64-bit destination, leaving 4 bits unused.

This packing is inefficient, but was easy to code.

It would be worth trialling a general purpose packing routine which can pack bits compactly at all sizes, with no wastage.
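
The cost of the current word-aligned packing can be estimated with a little arithmetic. The two functions below are hypothetical helpers written for this comparison; they only compute sizes and do not pack anything.

```r
# Hypothetical helpers: bytes needed to store n values of `bits` bits each.

# Current scheme: values never cross a 64-bit word boundary
padded_bytes <- function(n, bits) 8 * ceiling(n / (64 %/% bits))

# Idealised compact scheme: bits are packed back to back
compact_bytes <- function(n, bits) ceiling(n * bits / 8)

padded_bytes(600, 10)
compact_bytes(600, 10)
```

With 10-bit values the padded layout spends 64 bits on every 60 bits of payload. Whether that overhead still matters after a compressor such as zstd has been applied is something this arithmetic cannot answer.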

Integer transformations

  • Try StreamVByte for integer compression. I did experiment with this early on, and may have discounted it too quickly. Does it offer any speed/compression advantages over simple “delta + shuffle” once we add zstd compression?

Floating point transformation
