picosearch

Minimalistic full-text search implemented in TypeScript.

📝 See the spec for more details.

Features:

  • Search across multiple fields (using BM25F) with optional text highlighting
  • Fast autocomplete via prefix or fuzzy matching
  • Built-in language analyzers for English and German
  • Native support for local-first applications via persistent storage drivers and syncing via HTTPS (see spec)
  • Patch API to efficiently update the search index with offline-generated updates
  • Extensible with custom language analyzers and tokenizers
  • JSON serializable index

Not yet implemented:

  • Snippets
  • Search in nested fields of documents
  • Patch updates for modifying or deleting documents

Installation

yarn add @picosearch/picosearch

Quick Start

Basic usage is straightforward. First initialize a new instance and index some documents:

import { Picosearch } from '@picosearch/picosearch';

type MyDoc = {
  id: string;
  text: string;
  additionalText: string;
};

const documents: MyDoc[] = [
  { id: '1', text: 'The quick brown fox', additionalText: 'A speedy canine' },
  { id: '2', text: 'Jumps over the lazy dog', additionalText: 'High leap' },
  { id: '3', text: 'Bright blue sky', additionalText: 'Clear and sunny day' },
];

const pico = new Picosearch<MyDoc>({ language: 'english' });
pico.insertMultipleDocuments(documents);

Please note that currently a document must be flat, may only contain string values, and must have an id field (also a string)!

After indexing, you can use searchDocuments() or autocomplete(). For example:

pico.searchDocuments('fox').then(console.log)
//[
//  {
//    id: '1',
//    score: 0.5430196556466306,
//    doc: {
//      id: '1',
//      text: 'The quick brown fox',
//      additionalText: 'A speedy canine'
//    }
//  }
//]
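
Autocomplete works similarly. Here is a minimal sketch, assuming autocomplete() is asynchronous like searchDocuments() and takes the partial query as its argument (the exact shape of the suggestions depends on the configured matching mode):

pico.autocomplete('qui').then(console.log)
// suggestions derived from the indexed tokens, e.g. something like ['quick']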

Advanced

Fetching documents on-demand

By default, documents are stored inside the search index, which can make it large. Picosearch lets you exclude them via keepDocuments: false and offers two ways to fetch documents only when they are needed at query time.

Option 1: Fetch documents via URL

const pico = new Picosearch<MyDoc>({ 
  language: 'english',
  keepDocuments: false,
  fetchDocumentUrl: 'https://my-files/docs/{id}.json'
});

Make sure that the URL contains {id} as a placeholder.

Option 2: Custom handler

const pico = new Picosearch<MyDoc>({ 
  language: 'english',
  keepDocuments: false,
  getDocumentById: async (id: string) => { ... }
});
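
For instance, a sketch of a custom handler that loads each document from your own backend (the endpoint URL, response format, and error handling are illustrative assumptions, not part of the library):

const pico = new Picosearch<MyDoc>({
  language: 'english',
  keepDocuments: false,
  getDocumentById: async (id: string): Promise<MyDoc> => {
    // hypothetical endpoint that serves one JSON document per id
    const res = await fetch(`https://example.com/docs/${id}`);
    if (!res.ok) throw new Error(`Document ${id} not found`);
    return (await res.json()) as MyDoc;
  },
});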

Persistent Storage

You can persist your search index to local storage. The natively supported storage drivers for the browser are localstorage and indexeddb.

const pico = new Picosearch({ language: 'english', storageDriver: 'localstorage' });
pico.insertMultipleDocuments(documents);
await pico.persist()

Check out the docs for how to specify extra options for these. You can also create a custom storage driver by implementing the IStorageDriver interface and passing it to the constructor like this:

const pico = new Picosearch({ language: 'english', storageDriver: { type: 'custom', driver: MyCustomDriver } });
pico.insertMultipleDocuments(documents);
await pico.persist()

To load an existing index from storage, simply call .sync():

const pico = new Picosearch({ language: 'english', storageDriver: 'localstorage' });
await pico.sync()

Syncing

Picosearch allows for simple syncing of pre-built indexes via HTTPS. You can specify a URL for the index and a URL pattern for patches. Check out the spec to learn how it works under the hood.

const indexUrl = 'https://example.com/index.json';
const patchUrl = 'https://example.com/patches/v{version}.json';

const searchIndex = new Picosearch({
  indexUrl,
  patchUrl
});
await searchIndex.sync()

Call .sync() again later to check whether new updates are available in the remote storage. You can create patches by adding new documents to the latest index via .createPatch(), then upload them with the correct version to your file server.
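
A rough sketch of that workflow, assuming .createPatch() returns the patch to upload once the new documents have been inserted (the call is awaited defensively in case it is asynchronous; the new document and file name are illustrative):

// sync to the latest published index, add new documents, then create a patch
const pico = new Picosearch<MyDoc>({ indexUrl, patchUrl });
await pico.sync();
pico.insertMultipleDocuments([
  { id: '4', text: 'A brand new document', additionalText: 'Added later' },
]);
const patch = await pico.createPatch();
// upload the patch as e.g. patches/v2.json so the next .sync() on clients picks it up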

Note that you can combine syncing and persisting; .sync() already takes care of persisting if a storage driver is configured.

Example: Example Deployment

Custom Language Preprocessing

You can also provide a custom tokenizer (for splitting a document into words/tokens) and a custom analyzer (for processing a single token before it is indexed). Just implement the types Tokenizer and Analyzer and pass your implementations to the constructor. Example:

import {
  Picosearch,
  type Analyzer,
  type Tokenizer,
} from '@picosearch/picosearch';

const myTokenizer: Tokenizer = (doc: string): string[] => doc.split(' ');

const myAnalyzer: Analyzer = (token: string): string =>
  // when the analyzer returns '', it is removed
  ['and', 'I'].includes(token) ? '' : token.toLowerCase();

const pico = new Picosearch({
  tokenizer: myTokenizer,
  analyzer: myAnalyzer,
});

JSON Serialization

Indexes can be exported to and imported from JSON. This is useful, for example, for performing the more compute-heavy indexing offline when the search runtime is in the browser. Important: if any constructor options are functions (e.g., getDocumentById, analyzer, tokenizer), you must pass the same functions again when importing and must not change any other constructor options. Here's an example:

import { Picosearch } from '@picosearch/picosearch';
const getDocById = async (id: string) => ...
const pico = new Picosearch<Doc>({ getDocumentById: getDocById, keepDocuments: true });
// ...index documents

const jsonIndex = pico.toJSON() 

// non-function options like `keepDocuments` are preserved in the JSON index
const fromSerialized = new Picosearch<Doc>({ getDocumentById: getDocById, jsonIndex });
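
For instance, a sketch of the offline-build/in-browser-search split mentioned above, reusing MyDoc and documents from the Quick Start (the file name and hosting path are illustrative; the JSON.stringify/JSON.parse round trip preserves the index whether toJSON() returns a string or an object):

// build step (Node): index the documents offline and write the JSON index to a file
import { writeFileSync } from 'node:fs';
import { Picosearch } from '@picosearch/picosearch';

const builder = new Picosearch<MyDoc>({ language: 'english' });
builder.insertMultipleDocuments(documents);
writeFileSync('index.json', JSON.stringify(builder.toJSON()));

// runtime (browser): fetch the pre-built index and search it
const jsonIndex = await (await fetch('/index.json')).json();
const pico = new Picosearch<MyDoc>({ jsonIndex });
console.log(await pico.searchDocuments('fox'));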

Benchmark

The CI/CD pipeline includes a benchmarking step to ensure there are no performance regressions. It currently validates against three datasets of the BEIR benchmark. The results are checked to be the same as or slightly higher than the BM25 baseline (thanks to multi-field matching).

                              scidocs    nfcorpus    scifact
Picosearch+English (BM25F)    15.6%      32.9%       69.0%
Baseline (BM25) [1]           15.8%      32.5%       66.5%

[1] Thakur, N., Reimers, N., Rücklé, A., Srivastava, A., & Gurevych, I. (2021). BEIR: A heterogeneous benchmark for zero-shot evaluation of information retrieval models. Ubiquitous Knowledge Processing Lab (UKP-TUDA), Department of Computer Science, Technische Universität Darmstadt. Retrieved from https://arxiv.org/pdf/2104.08663
