OctaneDB is a lightweight, high-performance Python vector database library built with modern Python and optimized algorithms. It's perfect for AI/ML applications requiring fast similarity search with HNSW indexing and flexible storage options.
- Fast HNSW indexing for approximate nearest neighbor search
- Sub-millisecond query response times for typical workloads
- Efficient insertion with configurable batch sizes
- Optimized memory usage with HDF5 compression
- HNSW (Hierarchical Navigable Small World) for ultra-fast approximate search
- FlatIndex for exact similarity search
- Configurable parameters for performance tuning
- Automatic index optimization
- Automatic text-to-vector conversion using sentence-transformers
- Multiple embedding models (all-MiniLM-L6-v2, all-mpnet-base-v2, etc.)
- GPU acceleration support (CUDA)
- Batch processing for improved performance
- In-memory for maximum speed
- Persistent file-based storage
- Hybrid mode for best of both worlds
- HDF5 format for efficient compression
- Multiple distance metrics: Cosine, Euclidean, Dot Product, Manhattan, Chebyshev, Jaccard
- Advanced metadata filtering with logical operators
- Batch search operations
- Text-based search with automatic embedding
pip install octanedbfrom octanedb import OctaneDB
# Initialize with text embedding support
db = OctaneDB(
dimension=384, # Will be auto-set by embedding model
embedding_model="all-MiniLM-L6-v2"
)
# Create a collection
collection = db.create_collection("documents")
db.use_collection("documents")
# Add text documents (ChromaDB-compatible!)
result = db.add(
ids=["doc1", "doc2"],
documents=[
"This is a document about pineapple",
"This is a document about oranges"
],
metadatas=[
{"category": "tropical", "color": "yellow"},
{"category": "citrus", "color": "orange"}
]
)
# Search by text query
results = db.search_text(
query_text="fruit",
k=2,
filter="category == 'tropical'",
include_metadata=True
)
for doc_id, distance, metadata in results:
print(f"Document: {db.get_document(doc_id)}")
print(f"Distance: {distance:.4f}")
print(f"Metadata: {metadata}")Here's a complete working example that demonstrates OctaneDB's core functionality:
from octanedb import OctaneDB
# Initialize database with text embeddings
db = OctaneDB(
dimension=384, # sentence-transformers default dimension
storage_mode="in-memory",
enable_text_embeddings=True,
embedding_model="all-MiniLM-L6-v2" # Lightweight model
)
# Create a collection
db.create_collection("fruits")
db.use_collection("fruits")
# Add some fruit documents
fruits_data = [
{"id": "apple", "text": "Apple is a sweet and crunchy fruit that grows on trees.", "category": "temperate"},
{"id": "banana", "text": "Banana is a yellow tropical fruit rich in potassium.", "category": "tropical"},
{"id": "mango", "text": "Mango is a sweet tropical fruit with a large seed.", "category": "tropical"},
{"id": "orange", "text": "Orange is a citrus fruit with a bright orange peel.", "category": "citrus"}
]
for fruit in fruits_data:
db.add(
ids=[fruit["id"]],
documents=[fruit["text"]],
metadatas=[{"category": fruit["category"], "type": "fruit"}]
)
# Simple text search
results = db.search_text(query_text="sweet", k=2, include_metadata=True)
print("Sweet fruits:")
for doc_id, distance, metadata in results:
print(f" • {doc_id}: {metadata.get('document', 'N/A')[:50]}...")
# Text search with filter
results = db.search_text(
query_text="fruit",
k=2,
filter="category == 'tropical'",
include_metadata=True
)
print("\nTropical fruits:")
for doc_id, distance, metadata in results:
print(f" • {doc_id}: {metadata.get('document', 'N/A')[:50]}...")# Batch text search
query_texts = ["machine learning", "artificial intelligence", "data science"]
batch_results = db.search_text_batch(
query_texts=query_texts,
k=5,
include_metadata=True
)
# Change embedding models
db.change_embedding_model("all-mpnet-base-v2") # Higher quality, 768 dimensions
# Get available models
models = db.get_available_models()
print(f"Available models: {models}")# Use pre-computed embeddings
custom_embeddings = np.random.randn(100, 384).astype(np.float32)
result = db.add(
ids=[f"vec_{i}" for i in range(100)],
embeddings=custom_embeddings,
metadatas=[{"source": "custom"} for _ in range(100)]
)# Optimize for speed vs. accuracy
db = OctaneDB(
dimension=384,
m=8, # Fewer connections = faster, less accurate
ef_construction=100, # Lower = faster build
ef_search=50 # Lower = faster search
)# Persistent storage
db = OctaneDB(
dimension=384,
storage_path="./data",
embedding_model="all-MiniLM-L6-v2"
)
# Save and load
db.save("./my_database.h5")
loaded_db = OctaneDB.load("./my_database.h5")# Complex filters
results = db.search_text(
query_text="technology",
k=10,
filter={
"$and": [
{"category": "tech"},
{"$or": [
{"year": {"$gte": 2020}},
{"priority": "high"}
]}
]
}
)-
Empty search results: Make sure to call
include_metadata=Truein your search methods to get metadata back. -
Query engine warnings: The query engine for complex filters is under development. For now, use simple string filters like
"category == 'tropical'". -
Index not built: The index is automatically built when needed, but you can manually trigger it with
collection._build_index()if needed. -
Text embeddings not working: Ensure you have
sentence-transformersinstalled:pip install sentence-transformers
# This will work correctly:
results = db.search_text(
query_text="fruit",
k=2,
filter="category == 'tropical'",
include_metadata=True # Important!
)
# Process results correctly:
for doc_id, distance, metadata in results:
print(f"ID: {doc_id}, Distance: {distance:.4f}")
if metadata:
print(f" Document: {metadata.get('document', 'N/A')}")
print(f" Category: {metadata.get('category', 'N/A')}")Test Environment:
- Hardware: Intel i5-1300H, 16GB RAM, SSD storage
- Dataset: 100K vectors, 384 dimensions (float32)
- Index Type: HNSW with default parameters (m=16, ef_construction=200, ef_search=100)
- Distance Metric: Cosine similarity
- Storage Mode: In-memory
Performance Results:
| Operation | Performance | Notes |
|---|---|---|
| Vector Insertion | 2,800-3,500 vectors/sec | Single-threaded insertion with metadata |
| Index Build Time | 45-60 seconds | HNSW index construction for 100K vectors |
| Single Query Search | 0.5-2.0 milliseconds | k=10 nearest neighbors |
| Batch Search (100 queries) | 150-200 queries/sec | k=10 per query |
| Memory Usage | ~1.5GB | Including vectors, metadata, and HNSW index |
| Storage Efficiency | ~15MB on disk | HDF5 compression for 100K vectors |
Performance Tuning Options:
- Faster Build: Reduce
ef_construction(trades accuracy for speed) - Faster Search: Reduce
ef_search(trades accuracy for speed) - Memory Optimization: Use
m=8instead ofm=16(fewer connections) - Storage Mode: In-memory for speed, persistent for data persistence
Benchmark Code:
# Run performance benchmarks using CLI
octanedb benchmark --count 100000 --dimension 384
# Or use the comprehensive Python benchmarking script
python benchmark_octanedb.py --vectors 100000 --dimension 384 --runs 5
# Or use the Python API directly
from octanedb import OctaneDB
db = OctaneDB(dimension=384)
# ... run your own benchmarksNote: Performance varies based on hardware, dataset characteristics, and HNSW parameters. These numbers represent typical performance on the specified hardware configuration.
OctaneDB
├── Core (OctaneDB)
│ ├── Collection Management
│ ├── Text Embedding Engine
│ └── Storage Manager
├── Collections
│ ├── Vector Storage (HDF5)
│ ├── Metadata Management
│ └── Index Management
├── Indexing
│ ├── HNSW Index
│ ├── Flat Index
│ └── Distance Metrics
├── Text Processing
│ ├── Sentence Transformers
│ ├── GPU Acceleration
│ └── Batch Processing
└── Storage
├── HDF5 Vectors
├── Msgpack Metadata
└── Compression
pip install octanedbpip install octanedb[gpu]git clone https://github.com/RijinRaju/octanedb.git
cd octanedb
pip install -e .- Python: 3.8+
- Core Dependencies: NumPy, h5py, msgpack, tqdm
- Text Embeddings: sentence-transformers, transformers, torch
- Optional: CUDA for GPU acceleration, matplotlib, pandas, seaborn for benchmarking
We welcome contributions! Please see our Contributing Guide for details.
git clone https://github.com/RijinRaju/octanedb.git
cd octanedb
pip install -e ".[dev]"
pytest tests/This project is licensed under the MIT License - see the LICENSE file for details.
- HNSW Algorithm: Based on the Hierarchical Navigable Small World paper
- Sentence Transformers: For text embedding capabilities
- HDF5: For efficient vector storage
- NumPy: For fast numerical operations
AI-Assisted Development: This codebase was extensively developed with the assistance of Large Language Models (LLMs). The LLM assistance included:
- Initial project structure
- Core algorithm implementations (HNSW indexing, vector operations)
- Documentation
- Performance optimization suggestions