jzip

Near-lossless compression for unit-norm embedding vectors using spherical coordinates, 1.5x compression ratio.

Note

Near-lossless means the reconstruction error is below 1e-7 (the float32 machine epsilon). So indistinguishable at float32 precision, but we still can't call it bit-exact. Read our paper for more details.

Build

Requires zstd library.

# macOS
brew install zstd

# Ubuntu/Debian
apt install libzstd-dev

# Build
make

Usage

# Compress: input.bin (N vectors of D dimensions) -> output.jz
jzip -c input.bin output.jz N D [LEVEL]

# Decompress: output.jz -> output.bin
jzip -d output.jz output.bin

LEVEL is the zstd compression level (1-22, default: 1). Higher levels are slower with negligible compression gain because the spherical transform already minimizes entropy.

Example

# Compress 1000 vectors of 384 dimensions
jzip -c embeddings.bin compressed.jz 1000 384

# Compress with higher zstd level (slower, same ratio)
jzip -c embeddings.bin compressed.jz 1000 384 19

# Decompress
jzip -d compressed.jz restored.bin

Python/NumPy Integration

Export embeddings to binary format:

import numpy as np

# Ensure float32 and unit-normalized
embeddings = embeddings.astype(np.float32)
embeddings = embeddings / np.linalg.norm(embeddings, axis=1, keepdims=True)

# Save to binary
embeddings.tofile('embeddings.bin')
n, d = embeddings.shape  # use these for jzip -c

With sentence-transformers:

from sentence_transformers import SentenceTransformer
import numpy as np

model = SentenceTransformer('all-MiniLM-L6-v2')
embeddings = model.encode(texts).astype(np.float32)
embeddings = embeddings / np.linalg.norm(embeddings, axis=1, keepdims=True)
embeddings.tofile('embeddings.bin')
# jzip -c embeddings.bin out.jz {len(texts)} {model.get_sentence_embedding_dimension()}

Load decompressed embeddings back:

restored = np.fromfile('restored.bin', dtype=np.float32).reshape(n, d)

File Format

Input: raw float32 binary (N x D floats, row-major)

Output: 16-byte header + zstd-compressed spherical angles

The input embeddings must be unit-normalized (L2 norm = 1).

Algorithm

Convert Cartesian coordinates to spherical angles (N x D -> N x D-1)
Transpose angle matrix to group same-position angles
Byte-shuffle to group IEEE 754 exponent bytes
Compress with zstd

Decompression reverses these steps. Reconstruction error is ~7e-8, below float32 machine epsilon.

Name		Name	Last commit message	Last commit date
Latest commit History 9 Commits
.gitignore		.gitignore
Makefile		Makefile
README.md		README.md
jzip.c		jzip.c
pipeline.png		pipeline.png

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Repository files navigation

jzip

Build

Usage

Example

Python/NumPy Integration

File Format

Algorithm

About

Uh oh!

Releases

Packages

Contributors 2

Uh oh!

Languages

jina-ai/jzip-compressor

Folders and files

Latest commit

History

Repository files navigation

jzip

Build

Usage

Example

Python/NumPy Integration

File Format

Algorithm

About

Resources

Uh oh!

Stars

Watchers

Forks

Releases

Packages 0

Contributors 2

Uh oh!

Languages

Packages