📖 Infini-gram mini

This repo hosts the source code of the infini-gram mini search engine, which is described in this paper: Infini-gram mini: Exact n-gram Search at the Internet Scale with FM-Index.

To learn more about infini-gram mini:

Paper: https://arxiv.org/abs/2506.12229
Project Home: https://infini-gram-mini.io/
Web Interface: https://infini-gram-mini.io/demo
API Endpoint: https://infini-gram-mini.io/api_doc
Code: https://infini-gram-mini.io/code
Benchmark Contamination Monitoring System: https://infini-gram-mini.io/bulletin

Overview

Infini-gram mini is a search engine over the largest body of text indexed in the open-source community (as of May 2025). It can count the occurrences of arbitrarily long strings in 45.6 TB of text corpora and retrieve their containing documents within seconds.

Infini-gram mini is powered by indexes based on FM-Index. This repo contains everything you need for constructing an infini-gram mini index of a text corpus and performing queries on this index.
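
To give a feel for how an FM-Index can count a string without scanning the text, here is a toy sketch of backward search in pure Python. It is not the code used in this repo (the engine is a C++ backend built on SDSL); the function names `build_fm_index` and `count_occurrences` are made up for illustration, and the naive suffix sort only works for tiny inputs that do not contain the `\0` character.

```python
def build_fm_index(text):
    """Build a toy FM-index: the C table and the occurrence table over the BWT."""
    text += "\0"  # sentinel that sorts before every other character
    # Naive suffix array: fine for a demo, far too slow for real corpora.
    sa = sorted(range(len(text)), key=lambda i: text[i:])
    # Burrows-Wheeler transform: the character preceding each sorted suffix.
    bwt = [text[i - 1] for i in sa]
    # c_table[c] = number of characters in the text strictly smaller than c.
    c_table, total = {}, 0
    for ch in sorted(set(text)):
        c_table[ch] = total
        total += text.count(ch)
    # occ[c][i] = number of occurrences of c in bwt[:i].
    occ = {ch: [0] * (len(bwt) + 1) for ch in c_table}
    for i, ch in enumerate(bwt):
        for key in occ:
            occ[key][i + 1] = occ[key][i] + (1 if key == ch else 0)
    return c_table, occ, len(bwt)


def count_occurrences(index, pattern):
    """Count occurrences of a non-empty pattern with backward search."""
    c_table, occ, n = index
    lo, hi = 0, n  # half-open range of suffix-array ranks
    for ch in reversed(pattern):
        if ch not in c_table:
            return 0
        lo = c_table[ch] + occ[ch][lo]
        hi = c_table[ch] + occ[ch][hi]
        if lo >= hi:
            return 0
    return hi - lo


index = build_fm_index("abracadabra")
print(count_occurrences(index, "abra"))  # 2
```

The search touches one table entry per pattern character, so its cost depends on the length of the query rather than the size of the corpus.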

Getting Started

To query a local index (e.g., pile-train), you need to initialize an engine with the corresponding index and invoke its methods. Below is a step-by-step example.

1. Initialize the engine

Create an engine instance using the appropriate index directory. You can configure:

Whether the index stays on disk (load_to_ram=False, uses less RAM but is slower), or is fully loaded into memory (load_to_ram=True, uses more RAM but is faster).
Whether to return metadata for each result (get_metadata=True).

from src.engine import InfiniGramMiniEngine

engine = InfiniGramMiniEngine(index_dirs=["../index/v2_piletrain"], load_to_ram=False, get_metadata=True)

2. Counting a query

To count the occurrences of the string "natural language processing" in the Pile-train corpus:

query = "natural language processing"
engine.count(query)
#83,470

3. Retrieving a matching document

First, call find() to get information about where the query occurs.

engine.find(query)
# {"cnt":83470, "segment_by_shard":[[442381579355,442381620985],[443017902435,443017944275]]}

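segment_by_shard is a list of [start, end] rank ranges, one per shard, covering the occurrences of the query. A rank taken from a shard's range is what get_doc_by_rank expects as its rank argument.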
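
As a quick sanity check on the output above, the per-shard ranges can be related back to cnt. The sketch below copies the example result by hand and assumes each range is half-open, so that end - start is the number of matches in that shard; that reading is inferred from the example numbers, not from documented behavior.

```python
# Result of engine.find(query), copied from the example above.
result = {
    "cnt": 83470,
    "segment_by_shard": [
        [442381579355, 442381620985],
        [443017902435, 443017944275],
    ],
}

# Assumption: each range is half-open, so its size is end - start.
per_shard = [end - start for start, end in result["segment_by_shard"]]
print(per_shard)  # [41630, 41840]

# The per-shard sizes add up to the total count.
assert sum(per_shard) == result["cnt"]
```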

Then, to retrieve a text snippet around the first occurrence in shard 0:

engine.get_doc_by_rank(s=0, rank=442381579355, max_ctx_len=20)
# {"disp_len":67, "doc_ix":48649509, "doc_len":813513, "metadata":{"path": "06.jsonl", "linenum": 6526203, "metadata": {"meta": {"pile_set_name": "HackerNews"}}}, "needle_offset":20, "text":"Research Engineer \\- natural language processing\n\n    \n    \n      - "}
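
To look at more than one occurrence, the same call can be repeated for further ranks. The helper below is a hypothetical sketch, not part of the repo: `first_snippets` is a made-up name, and it assumes that the position of a range in segment_by_shard is the shard number `s` and that every rank from start up to (but not including) end is a valid `rank` argument, as the single example above suggests.

```python
def first_snippets(engine, find_result, max_docs=3, max_ctx_len=20):
    """Collect up to max_docs results of get_doc_by_rank, walking shards in order."""
    docs = []
    for shard, (start, end) in enumerate(find_result["segment_by_shard"]):
        for rank in range(start, end):
            if len(docs) >= max_docs:
                return docs
            # Same keyword arguments as in the example call above.
            docs.append(
                engine.get_doc_by_rank(s=shard, rank=rank, max_ctx_len=max_ctx_len)
            )
    return docs
```

With a real engine this would be called as `first_snippets(engine, engine.find(query))`. Keeping max_docs small matters, since a frequent query can match tens of thousands of ranks.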

Customizing the engine

If you modify the C++ backend of the engine, follow the steps below to recompile and use your custom version:

1. Prerequisites

Make sure you have the following installed:

A C++ compiler with support for -std=c++17
The pybind11 Python package:
```
pip install pybind11
```

2. Compilation

From the engine folder, compile with the following command:

c++ -std=c++17 -O3 -shared -fPIC $(python3 -m pybind11 --includes) src/cpp_engine.cpp -o src/cpp_engine$(python3-config --extension-suffix) -I../sdsl/include -L../sdsl/lib -lsdsl -ldivsufsort -ldivsufsort64 -pthread
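
The output file name in the command above depends on the Python interpreter, because of the `$(python3-config --extension-suffix)` part. As a small check after compiling, the sketch below prints the file name this interpreter expects and whether it exists. It assumes you run it from the engine folder and that `python3-config` belongs to the same interpreter that runs the script.

```python
import sysconfig
from pathlib import Path

# Same value that `python3-config --extension-suffix` prints for this interpreter.
suffix = sysconfig.get_config_var("EXT_SUFFIX")

# Path of the compiled module, relative to the engine folder.
expected = Path("src") / f"cpp_engine{suffix}"
print(expected, "exists" if expected.exists() else "is missing")
```

If the file is reported missing even though compilation succeeded, the module was probably built against a different Python installation.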

3. Import the engine

Once compiled, you can import and use the customized engine in Python:

from engine.src import InfiniGramMiniEngine

Indexing new datasets

1. Prerequisites

Run the following commands to create a dedicated conda environment with an old version of GCC:

conda create -n infini-gram-mini
conda activate infini-gram-mini
conda install -c conda-forge isl=0.12.2 mpc=1.0.3 mpfr=3.1.4
export LD_LIBRARY_PATH=/path-to-your-conda-installation/envs/infini-gram-mini/lib:$LD_LIBRARY_PATH
conda install psi4::gcc-5=5.2.0

2. Run the indexing script

Go to src/ and run python indexing.py with the appropriate arguments.

For the full workflow of downloading a dataset and indexing it, refer to our scripts such as index_v2_dclm.py and index_v2_cc.py.

Citation

If you find infini-gram mini useful, please kindly cite our paper:

@misc{xu2025infinigramminiexactngram,
      title={Infini-gram mini: Exact n-gram Search at the Internet Scale with FM-Index}, 
      author={Hao Xu and Jiacheng Liu and Yejin Choi and Noah A. Smith and Hannaneh Hajishirzi},
      year={2025},
      eprint={2506.12229},
      archivePrefix={arXiv},
      primaryClass={cs.CL},
      url={https://arxiv.org/abs/2506.12229}, 
}
