Merged
8 changes: 4 additions & 4 deletions docs/prebuilt-indexes.md
@@ -21,10 +21,10 @@ The output of the command will be:
```
Index statistics
----------------
-documents: 57359
-documents (non-empty): 57030
-unique terms: 77273
-total terms: 2133799
+documents: 3204
+documents (non-empty): 3204
+unique terms: 14363
+total terms: 320968
```

Note that unless the underlying index was built with the `-optimize` option (i.e., merging all index segments into a single segment), `unique_terms` will show -1.
106 changes: 53 additions & 53 deletions src/test/java/io/anserini/doc/GeneratePrebuiltIndexesDocTest.java
@@ -58,81 +58,81 @@ public void generateDocs() throws IOException {

StringBuilder md = new StringBuilder();
md.append("""
# Anserini: Prebuilt Indexes
# Anserini: Prebuilt Indexes

Anserini ships with a number of prebuilt indexes.
This means that various indexes (inverted indexes, HNSW indexes, etc.) for common collections used in NLP and IR research have already been built and just need to be downloaded (from UWaterloo/Hugging Face servers), which Anserini will handle automatically for you.
Anserini ships with a number of prebuilt indexes.
This means that various indexes (inverted indexes, HNSW indexes, etc.) for common collections used in NLP and IR research have already been built and just need to be downloaded (from UWaterloo/Hugging Face servers), which Anserini will handle automatically for you.

Bindings for the available prebuilt indexes are in [`io.anserini.index.IndexInfo`](https://github.com/castorini/anserini/blob/master/src/main/java/io/anserini/index/IndexInfo.java) and below.
For example, if you specify `-index msmarco-v1-passage`, Anserini will know that you mean the Lucene index of the MS MARCO V1 passage corpus.
It will then download the index from the servers and cache locally.
All of this happens automagically!
Bindings for the available prebuilt indexes are in [`io.anserini.index.IndexInfo`](https://github.com/castorini/anserini/blob/master/src/main/java/io/anserini/index/IndexInfo.java) and below.
For example, if you specify `-index msmarco-v1-passage`, Anserini will know that you mean the Lucene index of the MS MARCO V1 passage corpus.
It will then download the index from the servers and cache locally.
All of this happens automagically!

## Getting Started
## Getting Started

To download a prebuilt index and view its statistics, you can use the following command:
To download a prebuilt index and view its statistics, you can use the following command:

```bash
bin/run.sh io.anserini.index.IndexReaderUtils -index cacm -stats
```
```bash
bin/run.sh io.anserini.index.IndexReaderUtils -index cacm -stats
```

The output of the command will be:
The output of the command will be:

```
Index statistics
----------------
documents: 3204
documents (non-empty): 3204
unique terms: 14363
total terms: 320968
```
```
Index statistics
----------------
documents: 3204
documents (non-empty): 3204
unique terms: 14363
total terms: 320968
```

Note that unless the underlying index was built with the `-optimize` option (i.e., merging all index segments into a single segment), `unique_terms` will show -1.
Nope, that's not a bug.
Note that unless the underlying index was built with the `-optimize` option (i.e., merging all index segments into a single segment), `unique_terms` will show -1.
Nope, that's not a bug.

## Managing Indexes
## Managing Indexes

The downloaded index will by default be in `~/.cache/pyserini/indexes/`.
(Yes, `pyserini`; this is so prebuilt indexes from both Pyserini and Anserini can live in the same location.)
You can specify a custom cache directory by setting the environment variable `$ANSERINI_INDEX_CACHE` or the system property `anserini.index.cache`.
The downloaded index will by default be in `~/.cache/pyserini/indexes/`.
(Yes, `pyserini`; this is so prebuilt indexes from both Pyserini and Anserini can live in the same location.)
You can specify a custom cache directory by setting the environment variable `$ANSERINI_INDEX_CACHE` or the system property `anserini.index.cache`.

Another helpful tip is to download and manage the indexes by hand.
All relevant information is stored in [`IndexInfo`](https://github.com/castorini/anserini/blob/master/src/main/java/io/anserini/index/IndexInfo.java).
For example, `msmarco-v1-passage` can be downloaded from:
Another helpful tip is to download and manage the indexes by hand.
All relevant information is stored in [`IndexInfo`](https://github.com/castorini/anserini/blob/master/src/main/java/io/anserini/index/IndexInfo.java).
For example, `msmarco-v1-passage` can be downloaded from:

```
https://huggingface.co/datasets/castorini/prebuilt-indexes-msmarco-v1/resolve/main/passage/original/lucene-inverted/tf/lucene-inverted.msmarco-v1-passage.20221004.252b5e.tar.gz
```
```
https://huggingface.co/datasets/castorini/prebuilt-indexes-msmarco-v1/resolve/main/passage/original/lucene-inverted/tf/lucene-inverted.msmarco-v1-passage.20221004.252b5e.tar.gz
```

and has an MD5 checksum of `678876e8c99a89933d553609a0fd8793`.
You can download, verify, and put anywhere you want.
With `-index /path/to/index/` you'll get exactly the same output as `-index msmarco-v1-passage`, except now you've got fine-grained control over managing the index.
and has an MD5 checksum of `678876e8c99a89933d553609a0fd8793`.
You can download, verify, and put anywhere you want.
With `-index /path/to/index/` you'll get exactly the same output as `-index msmarco-v1-passage`, except now you've got fine-grained control over managing the index.

By manually managing the indexes, you can share indexes between multiple users to conserve space.
The schema of the index location in `~/.cache/pyserini/indexes/` is the tarball name (after unpacking), followed by a dot and the checksum, so `msmarco-v1-passage` lives in the following location:
By manually managing the indexes, you can share indexes between multiple users to conserve space.
The schema of the index location in `~/.cache/pyserini/indexes/` is the tarball name (after unpacking), followed by a dot and the checksum, so `msmarco-v1-passage` lives in the following location:

```
~/.cache/pyserini/indexes/lucene-inverted.msmarco-v1-passage.20221004.252b5e.678876e8c99a89933d553609a0fd8793
```
```
~/.cache/pyserini/indexes/lucene-inverted.msmarco-v1-passage.20221004.252b5e.678876e8c99a89933d553609a0fd8793
```

You can download the index once, put in a common location, and have each user symlink to the actual index location.
Source would conform to the schema above, target would be where your index actually resides.
You can download the index once, put in a common location, and have each user symlink to the actual index location.
Source would conform to the schema above, target would be where your index actually resides.

## Recovering from Partial Downloads
## Recovering from Partial Downloads

A common issue is recovering from partial downloads, for example, if you abort the downloading of a large index tarball.
In the standard flow, Anserini downloads the tarball from the servers, verifies the checksum, and then unpacks the tarball.
If this process is interrupted, you'll end up in an inconsistent state.
A common issue is recovering from partial downloads, for example, if you abort the downloading of a large index tarball.
In the standard flow, Anserini downloads the tarball from the servers, verifies the checksum, and then unpacks the tarball.
If this process is interrupted, you'll end up in an inconsistent state.

To recover, go to `~/.cache/pyserini/indexes/` or your custom cache directory and remove any tarballs (i.e., `.tar.gz` files).
If there are any partially unpacked indexes, remove those also.
Then start over (e.g., rerun the command you were running before).
To recover, go to `~/.cache/pyserini/indexes/` or your custom cache directory and remove any tarballs (i.e., `.tar.gz` files).
If there are any partially unpacked indexes, remove those also.
Then start over (e.g., rerun the command you were running before).

## Available Prebuilt Indexes
## Available Prebuilt Indexes

Below is a summary of the prebuilt indexes that are currently available.
Below is a summary of the prebuilt indexes that are currently available.

Note that this page is automatically generated from [this script](../src/test/java/io/anserini/doc/GeneratePrebuiltIndexesDocTest.java), so do not modify this page directly; modify the script instead.
Note that this page is automatically generated from [this script](../src/test/java/io/anserini/doc/GeneratePrebuiltIndexesDocTest.java), so do not modify this page directly; modify the script instead.

""");

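The "Managing Indexes" text in the generated docs describes two things that are easy to get wrong by hand: the cache directory name (tarball name without `.tar.gz`, then a dot, then the MD5 checksum) and verifying the checksum of a manually downloaded tarball. The following is a minimal sketch of both, not Anserini's actual implementation. The class `PrebuiltIndexCacheSketch` and its methods `cacheDirName` and `md5Hex` are hypothetical names introduced for illustration; only standard JDK APIs are used.

```java
import java.io.IOException;
import java.io.InputStream;
import java.nio.file.Files;
import java.nio.file.Path;
import java.security.MessageDigest;
import java.security.NoSuchAlgorithmException;

// Hypothetical helper class; it is not part of Anserini.
final class PrebuiltIndexCacheSketch {
  private static final String TARBALL_SUFFIX = ".tar.gz";

  // Derives the cache directory name described in the docs:
  // <tarball name without .tar.gz> + "." + <MD5 checksum>.
  static String cacheDirName(String tarballUrl, String md5) {
    String fileName = tarballUrl.substring(tarballUrl.lastIndexOf('/') + 1);
    String baseName = fileName.endsWith(TARBALL_SUFFIX)
        ? fileName.substring(0, fileName.length() - TARBALL_SUFFIX.length())
        : fileName;
    return baseName + "." + md5;
  }

  // Computes the lowercase hex MD5 digest of a file, reading it in chunks
  // so that large tarballs are never loaded into memory at once.
  static String md5Hex(Path file) throws IOException, NoSuchAlgorithmException {
    MessageDigest digest = MessageDigest.getInstance("MD5");
    byte[] buffer = new byte[8192];
    try (InputStream in = Files.newInputStream(file)) {
      int read;
      while ((read = in.read(buffer)) != -1) {
        digest.update(buffer, 0, read);
      }
    }
    StringBuilder hex = new StringBuilder();
    for (byte b : digest.digest()) {
      hex.append(String.format("%02x", b));
    }
    return hex.toString();
  }
}
```

Under these assumptions, comparing `md5Hex` of the downloaded tarball against the checksum listed in `IndexInfo` corresponds to the "verify" step, and `cacheDirName` gives the name a per-user symlink inside `~/.cache/pyserini/indexes/` would need in order to match the schema.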