Nearest neighbor search for Rails
Supports:
- Postgres (cube and pgvector)
- SQLite (sqlite-vec) - experimental
- MariaDB 11.6 Vector - experimental
- MySQL 9 (searching requires HeatWave) - experimental
Add this line to your application’s Gemfile:
gem "neighbor"

For Postgres, Neighbor supports two extensions: cube and pgvector. cube ships with Postgres, while pgvector supports more dimensions and approximate nearest neighbor search.
For cube, run:
rails generate neighbor:cube
rails db:migrate

For pgvector, install the extension and run:
rails generate neighbor:vector
rails db:migrate

For SQLite, add this line to your application’s Gemfile:
gem "sqlite-vec"

And run:
rails generate neighbor:sqlite

Create a migration
class AddEmbeddingToItems < ActiveRecord::Migration[7.2]
  def change
    # cube
    add_column :items, :embedding, :cube

    # pgvector and MySQL
    add_column :items, :embedding, :vector, limit: 3 # dimensions

    # sqlite-vec and MariaDB
    add_column :items, :embedding, :binary
  end
end

Add to your model
class Item < ApplicationRecord
  has_neighbors :embedding
end

Update the vectors
item.update(embedding: [1.0, 1.2, 0.5])

Get the nearest neighbors to a record
item.nearest_neighbors(:embedding, distance: "euclidean").first(5)

Get the nearest neighbors to a vector
Item.nearest_neighbors(:embedding, [0.9, 1.3, 1.1], distance: "euclidean").first(5)

Records returned from nearest_neighbors will have a neighbor_distance attribute
nearest_item = item.nearest_neighbors(:embedding, distance: "euclidean").first
nearest_item.neighbor_distance

See the additional docs below for cube, pgvector, sqlite-vec, MariaDB, and MySQL, or check out some examples.
Supported distance values for cube are:
- euclidean
- cosine
- taxicab
- chebyshev
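As a rough illustration of what these four metrics measure, here is a plain-Ruby sketch. It does not use Neighbor or the database (which computes the distances itself); the `DistanceSketch` module and its method names are hypothetical and exist only for this example.

```ruby
# Plain-Ruby sketch of the four distances (not Neighbor's implementation).
module DistanceSketch
  module_function

  # Straight-line (L2) distance
  def euclidean(a, b)
    Math.sqrt(a.zip(b).sum { |x, y| (x - y)**2 })
  end

  # 1 minus the cosine similarity
  def cosine(a, b)
    dot = a.zip(b).sum { |x, y| x * y }
    norm = ->(v) { Math.sqrt(v.sum { |x| x * x }) }
    1 - dot / (norm.(a) * norm.(b))
  end

  # Sum of absolute differences (L1)
  def taxicab(a, b)
    a.zip(b).sum { |x, y| (x - y).abs }
  end

  # Largest absolute difference in any one dimension
  def chebyshev(a, b)
    a.zip(b).map { |x, y| (x - y).abs }.max
  end
end

a = [1.0, 2.0, 3.0]
b = [4.0, 6.0, 3.0]
DistanceSketch.euclidean(a, b) # => 5.0
DistanceSketch.taxicab(a, b)   # => 7.0
DistanceSketch.chebyshev(a, b) # => 4.0
```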
For cosine distance with cube, vectors must be normalized before being stored.
class Item < ApplicationRecord
  has_neighbors :embedding, normalize: true
end

For inner product with cube, see this example.
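One way to see why normalization matters for cosine distance: for unit-length vectors, the squared Euclidean distance equals twice the cosine distance, so both produce the same neighbor ordering. The sketch below checks that identity in plain Ruby. The helper names are hypothetical, and whether Neighbor relies on exactly this identity internally is an assumption here.

```ruby
# Sketch: on normalized vectors, euclidean(a, b)**2 == 2 * cosine_distance(a, b)
def normalize(v)
  norm = Math.sqrt(v.sum { |x| x * x })
  v.map { |x| x / norm }
end

def euclidean(a, b)
  Math.sqrt(a.zip(b).sum { |x, y| (x - y)**2 })
end

def cosine_distance(a, b)
  dot = a.zip(b).sum { |x, y| x * y }
  norm = ->(v) { Math.sqrt(v.sum { |x| x * x }) }
  1 - dot / (norm.(a) * norm.(b))
end

a = normalize([1.0, 1.2, 0.5])
b = normalize([0.9, 1.3, 1.1])

lhs = euclidean(a, b)**2
rhs = 2 * cosine_distance(a, b)
puts((lhs - rhs).abs < 1e-9) # prints true
```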
The cube type can have up to 100 dimensions by default. See the Postgres docs for how to increase this.
For cube, it’s a good idea to specify the number of dimensions to ensure all records have the same number.
class Item < ApplicationRecord
  has_neighbors :embedding, dimensions: 3
end

Supported distance values for pgvector are:
- euclidean
- inner_product
- cosine
- taxicab
- hamming
- jaccard
The vector type can have up to 16,000 dimensions, and vectors with up to 2,000 dimensions can be indexed.
The halfvec type can have up to 16,000 dimensions, and half vectors with up to 4,000 dimensions can be indexed.
The bit type can have up to 83 million dimensions, and bit vectors with up to 64,000 dimensions can be indexed.
The sparsevec type can have up to 16,000 non-zero elements, and sparse vectors with up to 1,000 non-zero elements can be indexed.
Add an approximate index to speed up queries. Create a migration with:
class AddIndexToItemsEmbedding < ActiveRecord::Migration[7.2]
  def change
    add_index :items, :embedding, using: :hnsw, opclass: :vector_l2_ops
    # or
    add_index :items, :embedding, using: :ivfflat, opclass: :vector_l2_ops
  end
end

Use :vector_cosine_ops for cosine distance and :vector_ip_ops for inner product.
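Since the operator class has to match the distance used in queries for the index to help, it can be handy to keep the pairing in one place. The lookup below is a hypothetical helper, not part of Neighbor, and covers only the three operator classes named above.

```ruby
# Hypothetical lookup table (not a Neighbor API): maps the distance passed to
# nearest_neighbors to the pgvector operator class for the index.
OPCLASS_FOR_DISTANCE = {
  "euclidean" => :vector_l2_ops,
  "cosine" => :vector_cosine_ops,
  "inner_product" => :vector_ip_ops
}.freeze

def opclass_for(distance)
  OPCLASS_FOR_DISTANCE.fetch(distance) do
    raise ArgumentError, "no operator class listed for #{distance.inspect}"
  end
end

opclass_for("cosine") # => :vector_cosine_ops

# In a migration, the result could be passed along as:
#   add_index :items, :embedding, using: :hnsw, opclass: opclass_for("cosine")
```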
Set the size of the dynamic candidate list with HNSW
Item.connection.execute("SET hnsw.ef_search = 100")

Or the number of probes with IVFFlat
Item.connection.execute("SET ivfflat.probes = 3")

Use the halfvec type to store half-precision vectors
class AddEmbeddingToItems < ActiveRecord::Migration[7.2]
  def change
    add_column :items, :embedding, :halfvec, limit: 3 # dimensions
  end
end

Index vectors at half precision for smaller indexes
class AddIndexToItemsEmbedding < ActiveRecord::Migration[7.2]
  def change
    # the indexed expression is a halfvec, so it needs a halfvec operator class
    add_index :items, "(embedding::halfvec(3)) halfvec_l2_ops", using: :hnsw
  end
end

Get the nearest neighbors
Item.nearest_neighbors(:embedding, [0.9, 1.3, 1.1], distance: "euclidean", precision: "half").first(5)

Use the bit type to store binary vectors
class AddEmbeddingToItems < ActiveRecord::Migration[7.2]
  def change
    add_column :items, :embedding, :bit, limit: 3 # dimensions
  end
end

Get the nearest neighbors by Hamming distance
Item.nearest_neighbors(:embedding, "101", distance: "hamming").first(5)

Use expression indexing for binary quantization
class AddIndexToItemsEmbedding < ActiveRecord::Migration[7.2]
  def change
    add_index :items, "(binary_quantize(embedding)::bit(3)) bit_hamming_ops", using: :hnsw
  end
end

Use the sparsevec type to store sparse vectors
class AddEmbeddingToItems < ActiveRecord::Migration[7.2]
  def change
    add_column :items, :embedding, :sparsevec, limit: 3 # dimensions
  end
end

Get the nearest neighbors
embedding = Neighbor::SparseVector.new({0 => 0.9, 1 => 1.3, 2 => 1.1}, 3)
Item.nearest_neighbors(:embedding, embedding, distance: "euclidean").first(5)

Supported distance values for sqlite-vec are:
- euclidean
- cosine
- taxicab
- hamming
For sqlite-vec, it’s a good idea to specify the number of dimensions to ensure all records have the same number.
class Item < ApplicationRecord
  has_neighbors :embedding, dimensions: 3
end

You can also use virtual tables
class AddEmbeddingToItems < ActiveRecord::Migration[7.2]
  def change
    # Rails < 8
    execute <<~SQL
      CREATE VIRTUAL TABLE items USING vec0(
        embedding float[3] distance_metric=L2
      )
    SQL

    # Rails 8+
    create_virtual_table :items, :vec0, [
      "embedding float[3] distance_metric=L2"
    ]
  end
end

Use distance_metric=cosine for cosine distance
You can optionally ignore any shadow tables that are created
ActiveRecord::SchemaDumper.ignore_tables += [
  "items_chunks", "items_rowids", "items_vector_chunks00"
]

Create a model with rowid as the primary key
class Item < ApplicationRecord
  self.primary_key = "rowid"

  has_neighbors :embedding, dimensions: 3
end

Get the k nearest neighbors
Item.where("embedding MATCH ?", [1, 2, 3].to_s).where(k: 5).order(:distance)

Filter by primary key
Item.where(rowid: [2, 3]).where("embedding MATCH ?", [1, 2, 3].to_s).where(k: 5).order(:distance)

Use the type option for int8 vectors
class Item < ApplicationRecord
  has_neighbors :embedding, dimensions: 3, type: :int8
end

Use the type option for binary vectors
class Item < ApplicationRecord
  has_neighbors :embedding, dimensions: 8, type: :bit
end

Get the nearest neighbors by Hamming distance
Item.nearest_neighbors(:embedding, "\x05", distance: "hamming").first(5)

Supported distance values for MariaDB are:
- euclidean
- cosine
- hamming
For cosine distance with MariaDB, vectors must be normalized before being stored.
class Item < ApplicationRecord
  has_neighbors :embedding, normalize: true
end

Vector columns must use null: false to add a vector index
class CreateItems < ActiveRecord::Migration[7.2]
  def change
    create_table :items do |t|
      t.binary :embedding, null: false
      t.index :embedding, type: :vector
    end
  end
end

Use the bigint type to store binary vectors
class AddEmbeddingToItems < ActiveRecord::Migration[7.2]
  def change
    add_column :items, :embedding, :bigint
  end
end

Note: Binary vectors can have up to 64 dimensions
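To make the integer representation concrete, here is a plain-Ruby sketch of packing a bit string into an integer and comparing two values by Hamming distance (the number of differing bits). The helper names are hypothetical, and the database performs the real comparison.

```ruby
# Sketch: treat each dimension as one bit of an integer stored in a bigint column.
def bits_to_integer(bits)
  raise ArgumentError, "at most 64 dimensions" if bits.length > 64

  bits.to_i(2)
end

# XOR leaves a 1 wherever the two values differ; count those bits.
def hamming(a, b)
  (a ^ b).to_s(2).count("1")
end

a = bits_to_integer("101") # => 5
b = bits_to_integer("110") # => 6
hamming(a, b)              # => 2
```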
Get the nearest neighbors by Hamming distance
Item.nearest_neighbors(:embedding, 5, distance: "hamming").first(5)

Supported distance values for MySQL are:
- euclidean
- cosine
- hamming
Note: The DISTANCE() function is only available on HeatWave
Use the binary type to store binary vectors
class AddEmbeddingToItems < ActiveRecord::Migration[7.2]
  def change
    add_column :items, :embedding, :binary
  end
end

Get the nearest neighbors by Hamming distance
Item.nearest_neighbors(:embedding, "\x05", distance: "hamming").first(5)

Examples:
- Embeddings with OpenAI
- Binary embeddings with Cohere
- Sentence embeddings with Informers
- Hybrid search with Informers
- Sparse search with Transformers.rb
- Recommendations with Disco
For embeddings with OpenAI, generate a model
rails generate model Document content:text embedding:vector{1536}
rails db:migrate

And add has_neighbors
class Document < ApplicationRecord
  has_neighbors :embedding
end

Create a method to call the embeddings API
require "json"
require "net/http"

def fetch_embeddings(input)
  url = "https://api.openai.com/v1/embeddings"
  headers = {
    "Authorization" => "Bearer #{ENV.fetch("OPENAI_API_KEY")}",
    "Content-Type" => "application/json"
  }
  data = {
    input: input,
    model: "text-embedding-3-small"
  }

  # tap(&:value) raises unless the response is 2xx
  response = Net::HTTP.post(URI(url), data.to_json, headers).tap(&:value)
  JSON.parse(response.body)["data"].map { |v| v["embedding"] }
end

Pass your input
input = [
  "The dog is barking",
  "The cat is purring",
  "The bear is growling"
]
embeddings = fetch_embeddings(input)

Store the embeddings
documents = []
input.zip(embeddings) do |content, embedding|
  documents << {content: content, embedding: embedding}
end
Document.insert_all!(documents)

And get similar documents
document = Document.first
document.nearest_neighbors(:embedding, distance: "cosine").first(5).map(&:content)

See the complete code
For binary embeddings with Cohere, generate a model
rails generate model Document content:text embedding:bit{1024}
rails db:migrate

And add has_neighbors
class Document < ApplicationRecord
  has_neighbors :embedding
end

Create a method to call the embed API
require "json"
require "net/http"

def fetch_embeddings(input, input_type)
  url = "https://api.cohere.com/v1/embed"
  headers = {
    "Authorization" => "Bearer #{ENV.fetch("CO_API_KEY")}",
    "Content-Type" => "application/json"
  }
  data = {
    texts: input,
    model: "embed-english-v3.0",
    input_type: input_type,
    embedding_types: ["ubinary"]
  }

  # tap(&:value) raises unless the response is 2xx
  response = Net::HTTP.post(URI(url), data.to_json, headers).tap(&:value)
  # each embedding is an array of bytes; expand every byte to 8 bits and join into one bit string
  JSON.parse(response.body)["embeddings"]["ubinary"].map { |e| e.map { |v| v.chr.unpack1("B*") }.join }
end

Pass your input
input = [
  "The dog is barking",
  "The cat is purring",
  "The bear is growling"
]
embeddings = fetch_embeddings(input, "search_document")

Store the embeddings
documents = []
input.zip(embeddings) do |content, embedding|
  documents << {content: content, embedding: embedding}
end
Document.insert_all!(documents)

Embed the search query
query = "forest"
query_embedding = fetch_embeddings([query], "search_query")[0]

And search the documents
Document.nearest_neighbors(:embedding, query_embedding, distance: "hamming").first(5).map(&:content)

See the complete code
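The byte-to-bit-string step inside fetch_embeddings can be hard to read at a glance, so here it is in isolation with a plain-Ruby Hamming distance for comparison. The helper names and the 2-byte sample embeddings are made up for illustration; a 1024-bit embedding arrives as 128 bytes.

```ruby
# Sketch of the conversion used in fetch_embeddings above.
# Each unsigned byte (0..255) expands to 8 bits, most significant bit first.
def bytes_to_bit_string(bytes)
  bytes.map { |v| v.chr.unpack1("B*") }.join
end

# Number of positions where two equal-length bit strings differ
def hamming(a, b)
  a.chars.zip(b.chars).count { |x, y| x != y }
end

doc = bytes_to_bit_string([5, 255])   # => "0000010111111111"
query = bytes_to_bit_string([4, 254]) # => "0000010011111110"
hamming(doc, query)                   # => 2
```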
You can generate embeddings locally with Informers.
Generate a model
rails generate model Document content:text embedding:vector{384}
rails db:migrate

And add has_neighbors
class Document < ApplicationRecord
  has_neighbors :embedding
end

Load a model
model = Informers.pipeline("embedding", "sentence-transformers/all-MiniLM-L6-v2")

Pass your input
input = [
  "The dog is barking",
  "The cat is purring",
  "The bear is growling"
]
embeddings = model.(input)

Store the embeddings
documents = []
input.zip(embeddings) do |content, embedding|
  documents << {content: content, embedding: embedding}
end
Document.insert_all!(documents)

And get similar documents
document = Document.first
document.nearest_neighbors(:embedding, distance: "cosine").first(5).map(&:content)

See the complete code
You can use Neighbor for hybrid search with Informers.
Generate a model
rails generate model Document content:text embedding:vector{768}
rails db:migrate

And add has_neighbors and a scope for keyword search
class Document < ApplicationRecord
  has_neighbors :embedding

  scope :search, ->(query) {
    where("to_tsvector(content) @@ plainto_tsquery(?)", query)
      .order(Arel.sql("ts_rank_cd(to_tsvector(content), plainto_tsquery(?)) DESC", query))
  }
end

Create some documents
Document.create!(content: "The dog is barking")
Document.create!(content: "The cat is purring")
Document.create!(content: "The bear is growling")

Generate an embedding for each document
embed = Informers.pipeline("embedding", "Snowflake/snowflake-arctic-embed-m-v1.5")
embed_options = {model_output: "sentence_embedding", pooling: "none"} # specific to embedding model
Document.find_each do |document|
  embedding = embed.(document.content, **embed_options)
  document.update!(embedding: embedding)
end

Perform keyword search
query = "growling bear"
keyword_results = Document.search(query).limit(20).load_async

And semantic search in parallel (the query prefix is specific to the embedding model)
query_prefix = "Represent this sentence for searching relevant passages: "
query_embedding = embed.(query_prefix + query, **embed_options)
semantic_results =
  Document.nearest_neighbors(:embedding, query_embedding, distance: "cosine").limit(20).load_async

To combine the results, use Reciprocal Rank Fusion (RRF)
Neighbor::Reranking.rrf(keyword_results, semantic_results).first(5)

Or a reranking model
rerank = Informers.pipeline("reranking", "mixedbread-ai/mxbai-rerank-xsmall-v1")
results = (keyword_results + semantic_results).uniq
rerank.(query, results.map(&:content)).first(5).map { |v| results[v[:doc_id]] }

See the complete code
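For intuition about what Reciprocal Rank Fusion does with the keyword and semantic result lists, here is a plain-Ruby sketch. It is not Neighbor's implementation: the `rrf_sketch` name, the returned hash shape, and the constant `k: 60` (a commonly used default for RRF) are assumptions made for this example.

```ruby
# Sketch of Reciprocal Rank Fusion: each list contributes 1 / (k + rank)
# for every item it contains, with rank starting at 1.
def rrf_sketch(*rankings, k: 60)
  scores = Hash.new(0.0)
  rankings.each do |ranking|
    ranking.each_with_index do |item, index|
      scores[item] += 1.0 / (k + index + 1)
    end
  end

  # Highest combined score first
  scores.sort_by { |_, score| -score }.map { |item, score| {result: item, score: score} }
end

keyword = ["bear", "dog", "cat"]
semantic = ["bear", "cat", "owl"]
rrf_sketch(keyword, semantic).map { |v| v[:result] }
# => ["bear", "cat", "dog", "owl"]
```

Items that appear in both lists accumulate score from each, which is why "cat" ranks above "dog" here even though "dog" placed higher in the keyword list.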
You can generate sparse embeddings locally with Transformers.rb.
Generate a model
rails generate model Document content:text embedding:sparsevec{30522}
rails db:migrate

And add has_neighbors
class Document < ApplicationRecord
  has_neighbors :embedding
end

Load a model to generate embeddings
class EmbeddingModel
  def initialize(model_id)
    @model = Transformers::AutoModelForMaskedLM.from_pretrained(model_id)
    @tokenizer = Transformers::AutoTokenizer.from_pretrained(model_id)
    @special_token_ids = @tokenizer.special_tokens_map.map { |_, token| @tokenizer.vocab[token] }
  end

  def embed(input)
    feature = @tokenizer.(input, padding: true, truncation: true, return_tensors: "pt", return_token_type_ids: false)
    output = @model.(**feature)[0]
    values = Torch.max(output * feature[:attention_mask].unsqueeze(-1), dim: 1)[0]
    values = Torch.log(1 + Torch.relu(values))
    values[0.., @special_token_ids] = 0
    values.to_a
  end
end

model = EmbeddingModel.new("opensearch-project/opensearch-neural-sparse-encoding-v1")

Pass your input
input = [
  "The dog is barking",
  "The cat is purring",
  "The bear is growling"
]
embeddings = model.embed(input)

Store the embeddings
documents = []
input.zip(embeddings) do |content, embedding|
  documents << {content: content, embedding: Neighbor::SparseVector.new(embedding)}
end
Document.insert_all!(documents)

Embed the search query
query = "forest"
query_embedding = model.embed([query])[0]

And search the documents
Document.nearest_neighbors(:embedding, Neighbor::SparseVector.new(query_embedding), distance: "inner_product").first(5).map(&:content)

See the complete code
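To show what a sparse representation buys you, the sketch below keeps only the non-zero entries of a dense array and computes an inner product over the indices both vectors share. It is plain Ruby with hypothetical helper names and tiny made-up vectors, and it does not reflect how Neighbor::SparseVector or pgvector store data internally.

```ruby
# Sketch: dense array -> {index => value} for non-zero entries only
def to_sparse(dense)
  dense.each_with_index.reject { |value, _| value.zero? }.to_h { |value, index| [index, value] }
end

# Only indices present in both vectors contribute to the inner product
def sparse_inner_product(a, b)
  a.sum(0.0) { |index, value| value * b.fetch(index, 0.0) }
end

doc = to_sparse([0.0, 1.5, 0.0, 2.0])   # keeps indices 1 and 3
query = to_sparse([0.5, 2.0, 0.0, 0.0]) # keeps indices 0 and 1
sparse_inner_product(doc, query)        # => 3.0
```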
You can use Neighbor for online item-based recommendations with Disco. We’ll use MovieLens data for this example.
Generate a model
rails generate model Movie name:string factors:cube
rails db:migrate

And add has_neighbors
class Movie < ApplicationRecord
  has_neighbors :factors, dimensions: 20, normalize: true
end

Fit the recommender
data = Disco.load_movielens
recommender = Disco::Recommender.new(factors: 20)
recommender.fit(data)

Store the item factors
movies = []
recommender.item_ids.each do |item_id|
  movies << {name: item_id, factors: recommender.item_factors(item_id)}
end
Movie.create!(movies)

And get similar movies
movie = Movie.find_by(name: "Star Wars (1977)")
movie.nearest_neighbors(:factors, distance: "cosine").first(5).map(&:name)

See the complete code for cube and pgvector
View the changelog
Everyone is encouraged to help improve this project. Here are a few ways you can help:
- Report bugs
- Fix bugs and submit pull requests
- Write, clarify, or fix documentation
- Suggest or add new features
To get started with development:
git clone https://github.com/ankane/neighbor.git
cd neighbor
bundle install
# Postgres
createdb neighbor_test
bundle exec rake test:postgresql
# SQLite
bundle exec rake test:sqlite
# MariaDB
docker run -e MARIADB_ALLOW_EMPTY_ROOT_PASSWORD=1 -e MARIADB_DATABASE=neighbor_test -p 3307:3306 quay.io/mariadb-foundation/mariadb-devel:11.6-vector-preview
bundle exec rake test:mariadb
# MySQL
docker run -e MYSQL_ALLOW_EMPTY_PASSWORD=1 -e MYSQL_DATABASE=neighbor_test -p 3306:3306 mysql:9
bundle exec rake test:mysql