Seismic

Seismic is a highly efficient data structure for fast retrieval over learned sparse embeddings written in Rust 🦀. Designed with scalability and performance in mind, Seismic makes querying learned sparse representations seamless.

Details on how to use Seismic's core engine in Rust 🦀 can be found in docs/RustUsage.md.

The instructions below explain how to use it through the Python API.

⚡ Installation

The easiest way to use Seismic is via its Python API, which can be installed in two different ways:

Via pip, which is the simplest option:

pip install pyseismic-lsr

Via Rust compilation, which allows deeper hardware optimizations:

RUSTFLAGS="-C target-cpu=native" pip install --no-binary :all: pyseismic-lsr

Check docs/PythonUsage.md for more details.

🚀 Quick Start

Given a collection as a jsonl file, you can quickly index it by running

from seismic import SeismicIndex

json_input_file = "" # Your data collection

index = SeismicIndex.build(json_input_file)
print("Number of documents:", index.len)
print("Avg number of non-zero components:", index.nnz / index.len)
print("Dimensionality of the vectors:", index.dim)

index.print_space_usage_byte()

and then use Seismic to quickly retrieve results for your queries

import numpy as np

# Tokens are stored as fixed-width Unicode strings; NumPy truncates longer tokens.
MAX_TOKEN_LEN = 30

string_type = f"U{MAX_TOKEN_LEN}"

query = {"a": 3.5, "certain": 3.5, "query": 0.4}
query_id = "0"
query_components = np.array(list(query.keys()), dtype=string_type)
query_values = np.array(list(query.values()), dtype=np.float32)

results = index.search(
    query_id=query_id,
    query_components=query_components,
    query_values=query_values,
    k=10, 
    query_cut=3, 
    heap_factor=0.8,
)
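The conversion from a token-to-weight dictionary into the two arrays passed to search can be wrapped in a small helper. The sketch below is only an illustration: to_query_arrays is a hypothetical name, not part of the Seismic API, and it assumes you would rather get an error than have NumPy silently truncate an overly long token.

```python
import numpy as np


def to_query_arrays(query, max_token_len=30):
    """Split a {token: weight} dict into (components, values) arrays.

    Hypothetical helper, not part of the Seismic API.
    """
    too_long = [token for token in query if len(token) > max_token_len]
    if too_long:
        # A fixed-width 'U' dtype would silently cut these tokens short.
        raise ValueError(
            f"tokens longer than {max_token_len} characters: {too_long}"
        )
    components = np.array(list(query.keys()), dtype=f"U{max_token_len}")
    values = np.array(list(query.values()), dtype=np.float32)
    return components, values


query_components, query_values = to_query_arrays(
    {"a": 3.5, "certain": 3.5, "query": 0.4}
)
```

The two returned arrays can then be passed as query_components and query_values in the search call shown above.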

📥 Download the Datasets

Embeddings in jsonl format for several encoders and datasets, together with the corresponding query representations, can be downloaded from this HuggingFace repository.

As an example, the Splade embeddings for MSMARCO can be downloaded and extracted by running the following commands.

wget "https://huggingface.co/datasets/tuskanny/seismic-msmarco-splade/resolve/main/documents.tar.gz?download=true" -O documents.tar.gz

tar -xvzf documents.tar.gz

or by using the HuggingFace dataset download tool.
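The two shell commands above can also be sketched in Python with the standard library only. This is a minimal illustration, not a tool shipped with Seismic: the function names download and extract are hypothetical, and the URL is the one used in the wget command.

```python
import tarfile
import urllib.request
from pathlib import Path

# Same URL as in the wget command above.
URL = (
    "https://huggingface.co/datasets/tuskanny/seismic-msmarco-splade"
    "/resolve/main/documents.tar.gz?download=true"
)


def download(url, archive_path):
    """Fetch the archive at `url` and store it at `archive_path`."""
    urllib.request.urlretrieve(url, archive_path)
    return Path(archive_path)


def extract(archive_path, output_dir="."):
    """Unpack a gzip-compressed tar archive into `output_dir`."""
    output_dir = Path(output_dir)
    output_dir.mkdir(parents=True, exist_ok=True)
    with tarfile.open(archive_path, "r:gz") as archive:
        archive.extractall(output_dir)
    return output_dir
```

Calling extract(download(URL, "documents.tar.gz")) mirrors the wget and tar steps.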

📄 Data Format

Documents and queries should have the following format. Each line should be a JSON-formatted string with the following fields:

id: must represent the ID of the document as an integer.
content: the original content of the document, as a string. This field is optional.
vector: a dictionary where each key represents a token, and its corresponding value is the score, e.g., {"dog": 2.45}.

This is the standard output format of several libraries to train sparse models, such as learned-sparse-retrieval.
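As an illustration of the format described above, the following sketch checks a few records and serializes them as jsonl. The toy documents and the helper names validate_record and to_jsonl are hypothetical; they are not part of Seismic, and the checks cover only the three fields listed above.

```python
import json

# Hypothetical toy documents; real vectors come from a learned sparse encoder.
documents = [
    {"id": 0, "content": "a dog barks", "vector": {"dog": 2.45, "barks": 1.1}},
    {"id": 1, "vector": {"cat": 1.9}},  # "content" is optional
]


def validate_record(record):
    """Check one record against the fields listed above and return it."""
    if not isinstance(record.get("id"), int):
        raise ValueError("'id' must be an integer")
    if "content" in record and not isinstance(record["content"], str):
        raise ValueError("'content', when present, must be a string")
    vector = record.get("vector")
    if not isinstance(vector, dict):
        raise ValueError("'vector' must be a dictionary")
    for token, score in vector.items():
        if not isinstance(token, str) or not isinstance(score, (int, float)):
            raise ValueError(f"invalid vector entry: {token!r}: {score!r}")
    return record


def to_jsonl(records):
    """Serialize records as one JSON object per line."""
    return "".join(json.dumps(validate_record(r)) + "\n" for r in records)


jsonl_text = to_jsonl(documents)
```

Writing jsonl_text to a file gives a collection with one document per line.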

The script convert_json_to_inner_format.py converts files in this format into the Seismic inner format.

python scripts/convert_json_to_inner_format.py --document-path /path/to/document.jsonl --query-path /path/to/queries.jsonl --output-dir /path/to/output

This will generate a data directory at the /path/to/output path, with documents.bin and queries.bin binary files inside.

If you download the NQ dataset from the HuggingFace repo, you need to specify --input-format nq as it uses a slightly different format.

🪏 Resources

Check out our docs folder for detailed guides:

BestResults.md - A detailed guide on how to replicate results with optimized configurations.
RustUsage.md - How to use Seismic directly in Rust.
PythonUsage.md - How to use the Seismic Python API.
RunExperiments.md - How to run custom experiments.
TomlInstructions.md - TOML configuration reference.

🏆 Best Results

Seismic is an approximate algorithm designed for high-performance retrieval over learned sparse representations. We provide pre-optimized configurations for several common datasets, e.g., MSMARCO. Check the detailed documentation in docs/BestResults.md and the available optimized configurations in experiments/best_configs.

🧩 Seismic Integration

Seismic is used in several modern libraries:

OpenSearch Project - From version 3.3.0.0, Seismic is available as an approximate nearest neighbor search algorithm for learned sparse representations. Here, Seismic has been re-implemented in Java.
HuggingFace SentenceTransformers - Seismic is available as a search algorithm. This integration uses the Seismic code (Python API) made available in this GitHub repo.
FlashRAG - Seismic is available as a search option in RAG pipelines. This integration uses the Seismic code (Python API) made available in this GitHub repo.

📚 Bibliography

Sebastian Bruch, Franco Maria Nardini, Cosimo Rulli, and Rossano Venturini. "Efficient Inverted Indexes for Approximate Retrieval over Learned Sparse Representations." Proc. ACM SIGIR. 2024.
Sebastian Bruch, Franco Maria Nardini, Cosimo Rulli, and Rossano Venturini. "Pairing Clustered Inverted Indexes with κ-NN Graphs for Fast Approximate Retrieval over Learned Sparse Representations." Proc. ACM CIKM. 2024.
Sebastian Bruch, Franco Maria Nardini, Cosimo Rulli, Rossano Venturini, and Leonardo Venuta. "Investigating the Scalability of Approximate Sparse Retrieval Algorithms to Massive Datasets." Proc. ECIR. 2025.

Citation License

The source code in this repository is subject to the following citation license:

By downloading and using this software, you agree to cite the under-noted papers in any kind of material you produce where it was used to conduct a search or experimentation, whether be it a research paper, dissertation, article, poster, presentation, or documentation. By using this software, you have agreed to the citation license.

SIGIR 2024

@inproceedings{bruch2024seismic,
  author    = {Bruch, Sebastian and Nardini, Franco Maria and Rulli, Cosimo and Venturini, Rossano},
  title     = {Efficient Inverted Indexes for Approximate Retrieval over Learned Sparse Representations},
  booktitle = {Proceedings of the 47th International {ACM} {SIGIR} {C}onference on Research and Development in Information Retrieval ({SIGIR})},
  pages     = {152--162},
  publisher = {{ACM}},
  year      = {2024},
  url       = {https://doi.org/10.1145/3626772.3657769},
  doi       = {10.1145/3626772.3657769}
}

CIKM 2024

@inproceedings{bruch2024pairing,
  author    = {Bruch, Sebastian and Nardini, Franco Maria and Rulli, Cosimo and Venturini, Rossano},
  title     = {Pairing Clustered Inverted Indexes with $\kappa$-NN Graphs for Fast Approximate Retrieval over Learned Sparse Representations},
  booktitle = {Proceedings of the 33rd International {ACM} {C}onference on {I}nformation and {K}nowledge {M}anagement ({CIKM})},
  pages     = {3642--3646},
  publisher = {{ACM}},
  year      = {2024},
  url       = {https://doi.org/10.1145/3627673.3679977},
  doi       = {10.1145/3627673.3679977}
}

ECIR 2025

@inproceedings{bruch2025investigating,
  author    = {Bruch, Sebastian and Nardini, Franco Maria and Rulli, Cosimo and Venturini, Rossano and Venuta, Leonardo},
  title     = {Investigating the Scalability of Approximate Sparse Retrieval Algorithms to Massive Datasets},
  booktitle = {Advances in Information Retrieval},
  pages     = {437--445},
  publisher = {Springer Nature Switzerland},
  year      = {2025},
  url       = {https://doi.org/10.1007/978-3-031-88714-7_43},
  doi       = {10.1007/978-3-031-88714-7_43}
}
