
Crate vicinity

vicinity: Approximate Nearest Neighbor Search primitives.

Provides standalone implementations of state-of-the-art ANN algorithms.

§Which Index Should I Use?

| Situation | Recommendation | Feature |
|---|---|---|
| General purpose (best recall/speed) | `hnsw::HNSWIndex` | `hnsw` (default) |
| Billion-scale (memory constrained) | `ivf_pq::IVFPQIndex` | `ivf_pq` |
| Flat graph (simpler, often worth trying for modern embeddings) | `nsw::NSWIndex` | `nsw` |
| Attribute filtering | `hnsw::filtered` | `hnsw` |
| Out-of-core (SSD-based) | `diskann` | `diskann` (experimental) |

Default features: hnsw, innr (SIMD).

§Recommendation Logic

  1. Start with HNSW. It’s the industry standard for a reason. It offers the best trade-off between search speed and recall for datasets that fit in RAM.

  2. Use IVF-PQ if your dataset is too large for RAM (e.g., > 10M vectors on a laptop). It compresses vectors (32x-64x) but has lower recall than HNSW.

  3. Try NSW (Flat) if you want a simpler graph, or you are benchmarking on modern embeddings (hundreds/thousands of dimensions). Recent empirical work suggests the hierarchy may provide less incremental value in that regime (see arXiv:2412.01940). Note: HNSW is the more common default in production systems, so it’s still a safe first choice.

  4. Use DiskANN (experimental) if you have an NVMe SSD and 1B+ vectors.

```toml
# Minimal (HNSW + SIMD)
vicinity = "0.1"

# With quantization support
vicinity = { version = "0.1", features = ["ivf_pq"] }
```

§Notes (evidence-backed)

  • Flat vs hierarchical graphs: Munyampirwa et al. (2024) empirically argue that, on high-dimensional datasets, a flat small-world graph can match HNSW’s recall/latency benefits because “hub” nodes provide routing power without explicit hierarchy (arXiv:2412.01940). This doesn’t make HNSW “wrong” — it just means NSW is often a worthwhile baseline to benchmark.

  • Memory: for modern embeddings, the raw vector store (n × d × 4 bytes) can dominate. The extra hierarchy layers and graph edges still matter, but you should measure on your actual (n, d, M, ef) and memory layout.

  • Quantization: IVF-PQ and related techniques trade recall for memory. vicinity exposes IVF-PQ under the ivf_pq feature, but you should treat parameter selection as workload-dependent (benchmark recall@k vs latency vs memory).
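
The memory point above can be made concrete with a back-of-envelope calculation. A minimal sketch (the vector count and dimensionality below are illustrative, not vicinity defaults):

```rust
// Back-of-envelope estimate of the raw f32 vector store (n × d × 4 bytes).
// Graph edges, hierarchy layers, and allocator overhead add to this;
// always measure on your actual (n, d, M, ef) configuration.
fn raw_vector_bytes(n: usize, d: usize) -> usize {
    n * d * std::mem::size_of::<f32>()
}

fn main() {
    // e.g. 10M vectors of 768-dim embeddings
    let bytes = raw_vector_bytes(10_000_000, 768);
    println!("{:.1} GiB", bytes as f64 / f64::powi(1024.0, 3)); // ≈ 28.6 GiB
}
```

Even before any graph structure is built, 10M modern embeddings already approach the RAM of a typical laptop, which is what motivates the IVF-PQ recommendation above.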
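
The 32x-64x compression figure quoted earlier follows from standard product quantization with 8-bit codebooks, where a d-dimensional f32 vector (4·d bytes) is replaced by m one-byte subquantizer codes. A sketch of the arithmetic (the m values are illustrative choices, not vicinity defaults):

```rust
// Compression ratio for standard PQ with 8-bit codebooks:
// original size (4 * d bytes) over code size (m bytes).
// m (the number of subquantizers) is a workload-dependent tuning choice.
fn pq_compression_ratio(d: usize, m: usize) -> f64 {
    (4 * d) as f64 / m as f64
}

fn main() {
    let d = 768; // e.g. a common embedding width
    println!("m = 96 -> {}x", pq_compression_ratio(d, 96)); // 32x
    println!("m = 48 -> {}x", pq_compression_ratio(d, 48)); // 64x
}
```

Halving m doubles the compression but coarsens each subquantizer's view of the vector, which is the recall/memory trade-off noted above.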

§Background (kept short; pointers to sources)

  • Distance concentration: in high dimensions, nearest-neighbor distances can become less discriminative; see Beyer et al. (1999), “When Is Nearest Neighbor Meaningful?” (DOI: 10.1007/s007780050006).

  • Hubness: some points appear as nearest neighbors for many queries (“hubs”); see Radovanović et al. (2010), “Hubs in Space”.

  • Benchmarking: for real comparisons, report recall@k vs latency/QPS curves and include memory and build time. When in doubt, use the ann-benchmarks datasets and methodology: http://ann-benchmarks.com/.
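
As a concrete illustration of the recall@k metric mentioned above (the ID lists are made up for illustration, not output from any vicinity index):

```rust
use std::collections::HashSet;

// recall@k: fraction of the true k nearest neighbors that appear in the
// approximate result list. Method-agnostic; works for any index type.
fn recall_at_k(retrieved: &[u64], ground_truth: &[u64], k: usize) -> f64 {
    let truth: HashSet<&u64> = ground_truth.iter().take(k).collect();
    let hits = retrieved
        .iter()
        .take(k)
        .filter(|id| truth.contains(id))
        .count();
    hits as f64 / k as f64
}

fn main() {
    // Approximate result vs exact ground truth for one query
    let approx = [7_u64, 3, 9, 1, 4];
    let exact = [3_u64, 7, 1, 2, 5];
    println!("recall@5 = {}", recall_at_k(&approx, &exact, 5)); // 0.6
}
```

In a real benchmark you would average this over many queries and report it against latency/QPS at each parameter setting, per the ann-benchmarks methodology.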

For a curated bibliography covering HNSW/NSW/NSG/DiskANN/PQ/OPQ/ScaNN and related phenomena, see doc/references.md in the repo.

§Re-exports

pub use ann::traits::ANNIndex;
pub use distance::DistanceMetric;
pub use error::Result;
pub use error::RetrieveError;

§Modules

adaptive
Adaptive computation patterns for approximate nearest neighbor search.
ann
Unified Approximate Nearest Neighbor (ANN) search algorithms.
benchmark
Benchmark utilities for ANN evaluation.
classic
Classic ANN methods implementation.
compression
Lossless compression for vector IDs in ANN indexes.
diskann
DiskANN: Billion-scale ANN on a single machine with SSD.
distance
Distance metrics for dense vectors.
error
Error types for vicinity.
evoc
EVōC (Embedding Vector Oriented Clustering) implementation.
filtering
Filtering and lightweight faceting support for vector search.
hnsw
Hierarchical Navigable Small World (HNSW) approximate nearest neighbor search.
ivf_pq
IVF-PQ: Inverted File with Product Quantization.
lid
Local Intrinsic Dimensionality (LID) estimation.
matryoshka
Matryoshka Embedding Support
nsw
Flat Navigable Small World (NSW) graph.
partitioning
Partitioning/clustering interface for ANN methods.
persistence
Disk persistence for vicinity indexes.
pq_simd
SIMD-accelerated Product Quantization distance computation.
quantization
Vector quantization: compress vectors while preserving distance.
scann
ScaNN: Google’s algorithm for Maximum Inner Product Search (MIPS).
simd
Vector operations with SIMD acceleration.
sng
OPT-SNG: Auto-tuned Sparse Neighborhood Graph.
streaming
Streaming updates for vector indices.
vamana
Vamana approximate nearest neighbor search.