castorini · lilyjge · Jul 31, 2025 · Jul 31, 2025
diff --git a/docs/prebuilt-indexes.md b/docs/prebuilt-indexes.md
@@ -21,10 +21,10 @@ The output of the command will be:
 ```
 Index statistics
 ----------------
-documents:             57359
-documents (non-empty): 57030
-unique terms:          77273
-total terms:           2133799
+documents:             3204
+documents (non-empty): 3204
+unique terms:          14363
+total terms:           320968
 ```
 
 Note that unless the underlying index was built with the `-optimize` option (i.e., merging all index segments into a single segment), `unique_terms` will show -1.

diff --git a/src/test/java/io/anserini/doc/GeneratePrebuiltIndexesDocTest.java b/src/test/java/io/anserini/doc/GeneratePrebuiltIndexesDocTest.java
@@ -58,81 +58,81 @@ public void generateDocs() throws IOException {
 
     StringBuilder md = new StringBuilder();
     md.append("""
-      # Anserini: Prebuilt Indexes
+    # Anserini: Prebuilt Indexes
 
-      Anserini ships with a number of prebuilt indexes.
-      This means that various indexes (inverted indexes, HNSW indexes, etc.) for common collections used in NLP and IR research have already been built and just needs to be downloaded (from UWaterloo/Hugging Face servers), which Anserini will handle automatically for you.
+    Anserini ships with a number of prebuilt indexes.
+    This means that various indexes (inverted indexes, HNSW indexes, etc.) for common collections used in NLP and IR research have already been built and just needs to be downloaded (from UWaterloo/Hugging Face servers), which Anserini will handle automatically for you.
 
-      Bindings for the available prebuilt indexes are in [`io.anserini.index.IndexInfo`](https://github.com/castorini/anserini/blob/master/src/main/java/io/anserini/index/IndexInfo.java) and below.
-      For example, if you specify `-index msmarco-v1-passage`, Anserini will know that you mean the Lucene index of the MS MARCO V1 passage corpus.
-      It will then download the index from the servers and cache locally.
-      All of this happens automagically!
+    Bindings for the available prebuilt indexes are in [`io.anserini.index.IndexInfo`](https://github.com/castorini/anserini/blob/master/src/main/java/io/anserini/index/IndexInfo.java) and below.
+    For example, if you specify `-index msmarco-v1-passage`, Anserini will know that you mean the Lucene index of the MS MARCO V1 passage corpus.
+    It will then download the index from the servers and cache locally.
+    All of this happens automagically!
 
-      ## Getting Started
+    ## Getting Started
 
-      To download a prebuilt index and view its statistics, you can use the following command:
+    To download a prebuilt index and view its statistics, you can use the following command:
 
-      ```bash
-      bin/run.sh io.anserini.index.IndexReaderUtils -index cacm -stats
-      ```
+    ```bash
+    bin/run.sh io.anserini.index.IndexReaderUtils -index cacm -stats
+    ```
 
-      The output of the command will be:
+    The output of the command will be:
 
-      ```
-      Index statistics
-      ----------------
-      documents:             3204
-      documents (non-empty): 3204
-      unique terms:          14363
-      total terms:           320968
-      ```
+    ```
+    Index statistics
+    ----------------
+    documents:             3204
+    documents (non-empty): 3204
+    unique terms:          14363
+    total terms:           320968
+    ```
 
-      Note that unless the underlying index was built with the `-optimize` option (i.e., merging all index segments into a single segment), `unique_terms` will show -1.
-      Nope, that's not a bug.
+    Note that unless the underlying index was built with the `-optimize` option (i.e., merging all index segments into a single segment), `unique_terms` will show -1.
+    Nope, that's not a bug.
 
-      ## Managing Indexes
+    ## Managing Indexes
 
-      The downloaded index will by default be in `~/.cache/pyserini/indexes/`.
-      (Yes, `pyserini`; this is so prebuilt indexes from both Pyserini and Anserini can live in the same location.)
-      You can specify a custom cache directory by setting the environment variable `$ANSERINI_INDEX_CACHE` or the system property `anserini.index.cache`.
+    The downloaded index will by default be in `~/.cache/pyserini/indexes/`.
+    (Yes, `pyserini`; this is so prebuilt indexes from both Pyserini and Anserini can live in the same location.)
+    You can specify a custom cache directory by setting the environment variable `$ANSERINI_INDEX_CACHE` or the system property `anserini.index.cache`.
 
-      Another helpful tip is to download and manage the indexes by hand.
-      All relevant information is stored in [`IndexInfo`](https://github.com/castorini/anserini/blob/master/src/main/java/io/anserini/index/IndexInfo.java).
-      For example, `msmarco-v1-passage` can be downloaded from:
+    Another helpful tip is to download and manage the indexes by hand.
+    All relevant information is stored in [`IndexInfo`](https://github.com/castorini/anserini/blob/master/src/main/java/io/anserini/index/IndexInfo.java).
+    For example, `msmarco-v1-passage` can be downloaded from:
 
-      ```
-      https://huggingface.co/datasets/castorini/prebuilt-indexes-msmarco-v1/resolve/main/passage/original/lucene-inverted/tf/lucene-inverted.msmarco-v1-passage.20221004.252b5e.tar.gz
-      ```
+    ```
+    https://huggingface.co/datasets/castorini/prebuilt-indexes-msmarco-v1/resolve/main/passage/original/lucene-inverted/tf/lucene-inverted.msmarco-v1-passage.20221004.252b5e.tar.gz
+    ```
 
-      and has an MD5 checksum of `678876e8c99a89933d553609a0fd8793`.
-      You can download, verify, and put anywhere you want.
-      With `-index /path/to/index/` you'll get exactly the same output as `-index msmarco-v1-passage`, except now you've got fine-grained control over managing the index.
+    and has an MD5 checksum of `678876e8c99a89933d553609a0fd8793`.
+    You can download, verify, and put anywhere you want.
+    With `-index /path/to/index/` you'll get exactly the same output as `-index msmarco-v1-passage`, except now you've got fine-grained control over managing the index.
 
-      By manually managing the indexes, you can share indexes between multiple users to conserve space.
-      The schema of the index location in `~/.cache/pyserini/indexes/` is the tarball name (after unpacking), followed by a dot and the checksum, so `msmarco-v1-passage` lives in following location:
+    By manually managing the indexes, you can share indexes between multiple users to conserve space.
+    The schema of the index location in `~/.cache/pyserini/indexes/` is the tarball name (after unpacking), followed by a dot and the checksum, so `msmarco-v1-passage` lives in following location:
 
-      ```
-      ~/.cache/pyserini/indexes/lucene-inverted.msmarco-v1-passage.20221004.252b5e.678876e8c99a89933d553609a0fd8793
-      ```
+    ```
+    ~/.cache/pyserini/indexes/lucene-inverted.msmarco-v1-passage.20221004.252b5e.678876e8c99a89933d553609a0fd8793
+    ```
 
-      You can download the index once, put in a common location, and have each user symlink to the actual index location.
-      Source would conform to the schema above, target would be where your index actually resides.
+    You can download the index once, put in a common location, and have each user symlink to the actual index location.
+    Source would conform to the schema above, target would be where your index actually resides.
 
-      ## Recovering from Partial Downloads
+    ## Recovering from Partial Downloads
 
-      A common issue is recovering from partial downloads, for example, if you abort the downloading of a large index tarball.
-      In the standard flow, Anserini downloads the tarball from the servers, verifies the checksum, and then unpacks the tarball.
-      If this process is interrupted, you'll end up in an inconsistent state.
+    A common issue is recovering from partial downloads, for example, if you abort the downloading of a large index tarball.
+    In the standard flow, Anserini downloads the tarball from the servers, verifies the checksum, and then unpacks the tarball.
+    If this process is interrupted, you'll end up in an inconsistent state.
 
-      To recover, go to `~/.cache/pyserini/indexes/` or your custom cache directory and remove any tarballs (i.e., `.tar.gz` files).
-      If there are any partially unpacked indexes, remove those also.
-      Then start over (e.g., rerun the command you were running before).
+    To recover, go to `~/.cache/pyserini/indexes/` or your custom cache directory and remove any tarballs (i.e., `.tar.gz` files).
+    If there are any partially unpacked indexes, remove those also.
+    Then start over (e.g., rerun the command you were running before).
 
-      ## Available Prebuilt Indexes
+    ## Available Prebuilt Indexes
 
-      Below is a summary of the prebuilt indexes that are currently available.
+    Below is a summary of the prebuilt indexes that are currently available.
 
-      Note that this page is automatically generated from [this script](../src/test/java/io/anserini/doc/GeneratePrebuiltIndexesDocTest.java), so do not modify this page directly; modify the script instead.
+    Note that this page is automatically generated from [this script](../src/test/java/io/anserini/doc/GeneratePrebuiltIndexesDocTest.java), so do not modify this page directly; modify the script instead.
 
     """);