Crate vicinity

Expand description

vicinity: Approximate Nearest Neighbor Search primitives.

Provides standalone implementations of state-of-the-art ANN algorithms:

§Which Index Should I Use?

Situation	Recommendation	Feature
General Purpose (Best Recall/Speed)	`hnsw::HNSWIndex`	`hnsw` (default)
Billion-Scale (Memory Constrained)	`ivf_pq::IVFPQIndex`	`ivf_pq`
Flat Graph (Simpler graph, often worth trying for modern embeddings)	`nsw::NSWIndex`	`nsw`
Attribute Filtering	`hnsw::filtered`	`hnsw`
Out-of-Core (SSD-based)	`diskann`	`diskann` (experimental)

Default features: hnsw, innr (SIMD).

Start with HNSW. It’s the industry standard for a reason. It offers the best trade-off between search speed and recall for datasets that fit in RAM.
Use IVF-PQ if your dataset is too large for RAM (e.g., > 10M vectors on a laptop). It compresses vectors (32x-64x) but has lower recall than HNSW.
Try NSW (Flat) if you want a simpler graph, or you are benchmarking on modern embeddings (hundreds/thousands of dimensions). Recent empirical work suggests the hierarchy may provide less incremental value in that regime (see arXiv:2412.01940). Note: HNSW is the more common default in production systems, so it’s still a safe first choice.
Use DiskANN (experimental) if you have an NVMe SSD and 1B+ vectors.

# Minimal (HNSW + SIMD)
vicinity = "0.1"

# With quantization support
vicinity = { version = "0.1", features = ["ivf_pq"] }

Flat vs hierarchical graphs: Munyampirwa et al. (2024) empirically argue that, on high-dimensional datasets, a flat small-world graph can match HNSW’s recall/latency benefits because “hub” nodes provide routing power without explicit hierarchy (arXiv:2412.01940). This doesn’t make HNSW “wrong” — it just means NSW is often a worthwhile baseline to benchmark.
Memory: for modern embeddings, the raw vector store (n × d × 4 bytes) can dominate. The extra hierarchy layers and graph edges still matter, but you should measure on your actual (n, d, M, ef) and memory layout.
Quantization: IVF-PQ and related techniques trade recall for memory. vicinity exposes IVF-PQ under the ivf_pq feature, but you should treat parameter selection as workload- dependent (benchmark recall@k vs latency vs memory).

Distance concentration: in high dimensions, nearest-neighbor distances can become less discriminative; see Beyer et al. (1999), “When Is Nearest Neighbor Meaningful?” (DOI: 10.1007/s007780050006).
Hubness: some points appear as nearest neighbors for many queries (“hubs”); see Radovanović et al. (2010), “Hubs in Space”.
Benchmarking: for real comparisons, report recall@k vs latency/QPS curves and include memory and build time. When in doubt, use the ann-benchmarks datasets and methodology: http://ann-benchmarks.com/.

For a curated bibliography covering HNSW/NSW/NSG/DiskANN/PQ/OPQ/ScaNN and related phenomena, see doc/references.md in the repo.

adaptive: Adaptive computation patterns for approximate nearest neighbor search.
ann: Unified Approximate Nearest Neighbor (ANN) search algorithms.
benchmark: Benchmark utilities for ANN evaluation.
classic: Classic ANN methods implementation.
compression: Lossless compression for vector IDs in ANN indexes.
diskann: DiskANN: Billion-scale ANN on a single machine with SSD.
distance: Distance metrics for dense vectors.
error: Error types for vicinity.
evoc: EVōC (Embedding Vector Oriented Clustering) implementation.
filtering: Filtering and lightweight faceting support for vector search.
hnsw: Hierarchical Navigable Small World (HNSW) approximate nearest neighbor search.
ivf_pq: IVF-PQ: Inverted File with Product Quantization.
lid: Local Intrinsic Dimensionality (LID) estimation.
matryoshka: Matryoshka Embedding Support
nsw: Flat Navigable Small World (NSW) graph.
partitioning: Partitioning/clustering interface for ANN methods.
persistence: Disk persistence for vicinity indexes.
pq_simd: SIMD-accelerated Product Quantization distance computation.
quantization: Vector quantization: compress vectors while preserving distance.
scann: ScaNN: Google’s algorithm for Maximum Inner Product Search (MIPS).
simd: Vector operations with SIMD acceleration.
sng: OPT-SNG: Auto-tuned Sparse Neighborhood Graph.
streaming: Streaming updates for vector indices.
vamana: Vamana approximate nearest neighbor search.