Compression for unit-norm embedding vectors using spherical coordinates

jina-ai/jzip-compressor

jzip

Near-lossless compression for unit-norm embedding vectors using spherical coordinates, with a 1.5x compression ratio.

Note

Near-lossless means the reconstruction error is below 1e-7, which is under the float32 machine epsilon (about 1.19e-7). The output is therefore indistinguishable from the input at float32 precision, but it is not guaranteed to be bit-exact. Read our paper for more details.

[Figure: jzip pipeline]

Build

Requires the zstd library.

# macOS
brew install zstd

# Ubuntu/Debian
apt install libzstd-dev

# Build
make

Usage

# Compress: input.bin (N vectors of D dimensions) -> output.jz
jzip -c input.bin output.jz N D [LEVEL]

# Decompress: output.jz -> output.bin
jzip -d output.jz output.bin

LEVEL is the zstd compression level (1-22, default: 1). Higher levels are slower and yield negligible additional compression, because the spherical transform has already removed most of the redundancy that zstd could exploit.

Example

# Compress 1000 vectors of 384 dimensions
jzip -c embeddings.bin compressed.jz 1000 384

# Compress with higher zstd level (slower, same ratio)
jzip -c embeddings.bin compressed.jz 1000 384 19

# Decompress
jzip -d compressed.jz restored.bin
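
When the commands above are driven from a script, it can help to build the argument lists in one place. The following is a minimal sketch, not part of jzip itself: the function names are hypothetical, it assumes the built `jzip` binary is reachable under the name passed as `binary`, and it assumes jzip reports failure through a non-zero exit status.

```python
import subprocess


def jzip_compress_args(src, dst, n, d, level=None, binary="jzip"):
    """Build the argument list for `jzip -c input output N D [LEVEL]`."""
    if n <= 0 or d <= 0:
        raise ValueError("N and D must be positive")
    args = [binary, "-c", str(src), str(dst), str(n), str(d)]
    if level is not None:
        # The README documents zstd levels 1-22.
        if not 1 <= level <= 22:
            raise ValueError("zstd level must be in 1..22")
        args.append(str(level))
    return args


def jzip_decompress_args(src, dst, binary="jzip"):
    """Build the argument list for `jzip -d input output`."""
    return [binary, "-d", str(src), str(dst)]


def run_jzip(args):
    """Run jzip; check=True raises CalledProcessError on a non-zero exit.

    Assumption: jzip signals errors through its exit status.
    """
    subprocess.run(args, check=True)
```

For example, `run_jzip(jzip_compress_args("embeddings.bin", "compressed.jz", 1000, 384))` would issue the first command shown above.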

Python/NumPy Integration

Export embeddings to binary format:

import numpy as np

# Ensure float32 and unit-normalized
embeddings = embeddings.astype(np.float32)
embeddings = embeddings / np.linalg.norm(embeddings, axis=1, keepdims=True)

# Save to binary
embeddings.tofile('embeddings.bin')
n, d = embeddings.shape  # use these for jzip -c

With sentence-transformers:

from sentence_transformers import SentenceTransformer
import numpy as np

model = SentenceTransformer('all-MiniLM-L6-v2')
embeddings = model.encode(texts).astype(np.float32)
embeddings = embeddings / np.linalg.norm(embeddings, axis=1, keepdims=True)
embeddings.tofile('embeddings.bin')
n, d = embeddings.shape
print(f"jzip -c embeddings.bin out.jz {n} {d}")  # command to run in a shell

Load decompressed embeddings back:

restored = np.fromfile('restored.bin', dtype=np.float32).reshape(n, d)
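
To confirm the near-lossless claim on your own data, you can compare the original and restored files directly. This is a small sketch with a hypothetical helper name; it assumes both files hold `n * d` float32 values in row-major order, as described for the input format.

```python
import numpy as np


def max_reconstruction_error(original_path, restored_path, n, d):
    """Return the largest absolute element-wise difference between two files."""
    original = np.fromfile(original_path, dtype=np.float32).reshape(n, d)
    restored = np.fromfile(restored_path, dtype=np.float32).reshape(n, d)
    # Subtract in float64 so the comparison itself adds no float32 rounding.
    diff = original.astype(np.float64) - restored.astype(np.float64)
    return float(np.max(np.abs(diff)))
```

The result can then be compared against `np.finfo(np.float32).eps`, which is the float32 machine epsilon.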

File Format

Input: raw float32 binary (N x D floats, row-major)

Output: 16-byte header + zstd-compressed spherical angles

The input embeddings must be unit-normalized (L2 norm = 1).
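
Because the input file carries no header, a wrong N, D, or a non-normalized array would go unnoticed until the results look off. The sketch below checks the two requirements stated above before compressing. The helper name and the tolerance `tol` are assumptions chosen for illustration, not values taken from jzip.

```python
import os

import numpy as np


def check_jzip_input(path, n, d, tol=1e-5):
    """Validate a raw float32 file; return the max deviation of row norms from 1."""
    expected = n * d * 4  # float32 is 4 bytes per value
    actual = os.path.getsize(path)
    if actual != expected:
        raise ValueError(f"expected {expected} bytes for {n} x {d} float32, got {actual}")
    data = np.fromfile(path, dtype=np.float32).reshape(n, d)
    # Compute norms in float64 so the check is not limited by float32 rounding.
    norms = np.linalg.norm(data.astype(np.float64), axis=1)
    deviation = float(np.max(np.abs(norms - 1.0)))
    if deviation > tol:
        raise ValueError(f"rows are not unit-norm (max deviation {deviation:.3g})")
    return deviation
```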

Algorithm

  1. Convert Cartesian coordinates to spherical angles (N x D -> N x (D-1))
  2. Transpose angle matrix to group same-position angles
  3. Byte-shuffle to group IEEE 754 exponent bytes
  4. Compress with zstd

Decompression reverses these steps. Reconstruction error is ~7e-8, below float32 machine epsilon.
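
The four steps can be sketched in NumPy as follows. This is an illustration of the idea, not jzip's implementation or file format: the function names are hypothetical, the angle convention is the standard hyperspherical one, storing angles as float32 is an assumption, there is no 16-byte header, and `zlib` from the Python standard library stands in for zstd.

```python
import zlib

import numpy as np


def to_angles(x):
    """Step 1: (N, D) unit vectors -> (N, D-1) spherical angles."""
    x = np.asarray(x, dtype=np.float64)
    # tail[:, i] = sqrt(x_i^2 + ... + x_{D-1}^2), via a reversed cumulative sum
    tail = np.sqrt(np.cumsum(x[:, ::-1] ** 2, axis=1))[:, ::-1]
    angles = np.arctan2(tail[:, 1:], x[:, :-1])     # polar angles in [0, pi]
    angles[:, -1] = np.arctan2(x[:, -1], x[:, -2])  # last angle keeps the sign
    return angles


def from_angles(angles):
    """Inverse of step 1: (N, D-1) angles -> (N, D) unit vectors."""
    angles = np.asarray(angles, dtype=np.float64)
    n, m = angles.shape
    sin_prod = np.cumprod(np.sin(angles), axis=1)   # running product of sines
    x = np.empty((n, m + 1))
    x[:, 0] = np.cos(angles[:, 0])
    x[:, 1:m] = sin_prod[:, :-1] * np.cos(angles[:, 1:])
    x[:, m] = sin_prod[:, -1]
    return x


def byte_shuffle(a):
    """Step 3: store byte 0 of every float32, then byte 1, and so on."""
    raw = np.ascontiguousarray(a, dtype=np.float32).view(np.uint8).reshape(-1, 4)
    return np.ascontiguousarray(raw.T).tobytes()


def byte_unshuffle(buf, shape):
    """Inverse of step 3: regroup the four byte planes into float32 values."""
    raw = np.frombuffer(buf, dtype=np.uint8).reshape(4, -1)
    return np.ascontiguousarray(raw.T).view(np.float32).reshape(shape)


def compress(x):
    n, d = x.shape
    angles = to_angles(x).astype(np.float32)   # assumption: float32 angles
    grouped = np.ascontiguousarray(angles.T)   # step 2: (D-1, N), same-position angles adjacent
    payload = zlib.compress(byte_shuffle(grouped), 6)  # step 4, zlib as a stand-in for zstd
    return n, d, payload


def decompress(n, d, payload):
    grouped = byte_unshuffle(zlib.decompress(payload), (d - 1, n))
    return from_angles(grouped.T).astype(np.float32)
```

Transposing before the byte shuffle places angles from the same position next to each other, so the bytes that vary least across vectors end up in long runs that a general-purpose compressor can exploit.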
