castorini · lilyjge · Aug 16, 2025 · Aug 14, 2025 · Aug 16, 2025 · Aug 16, 2025
diff --git a/docs/regressions/regressions-bright-aops.splade-v3.onnx.md b/docs/regressions/regressions-bright-aops.splade-v3.onnx.md
@@ -0,0 +1,84 @@
+# Anserini Regressions: BRIGHT &mdash; AoPS
+
+**Model**: [SPLADE-v3](https://arxiv.org/abs/2403.06789) (using ONNX for on-the-fly query encoding)
+
+This page documents regression experiments for [BRIGHT &mdash; AoPS](https://brightbenchmark.github.io/) using [SPLADE-v3](https://arxiv.org/abs/2403.06789).
+The model itself can be downloaded [here](https://huggingface.co/naver/splade-v3).
+See the [official SPLADE repo](https://github.com/naver/splade) and the following paper for more details:
+
+> Carlos Lassance, Hervé Déjean, Thibault Formal, and Stéphane Clinchant. [SPLADE-v3: New baselines for SPLADE.](https://arxiv.org/abs/2403.06789) _arXiv:2403.06789_.
+
+In these experiments, we are using ONNX to perform query encoding on the fly.
+
+The exact configurations for these regressions are stored in [this YAML file](../../src/main/resources/regression/bright-aops.splade-v3.onnx.yaml).
+Note that this page is automatically generated from [this template](../../src/main/resources/docgen/templates/bright-aops.splade-v3.onnx.template) as part of Anserini's regression pipeline, so do not modify this page directly; modify the template instead and build Anserini to rebuild the documentation.
+
+From one of our Waterloo servers (e.g., `orca`), the following command will perform the complete regression, end to end:
+
+```
+python src/main/python/run_regression.py --index --verify --search --regression bright-aops.splade-v3.onnx
+```
+
+All the BRIGHT corpora, encoded by the SPLADE-v3 model, are available for download:
+
+```bash
+wget https://huggingface.co/datasets/castorini/collections-bright/resolve/main/bright-splade-v3.tar -P collections/
+tar xvf collections/bright-splade-v3.tar -C collections/
+```
+
+The tarball is 1.5 GB and has MD5 checksum `434cd776b5c40f8112d2bf888c58a516`.
+After downloading and unpacking the corpora, the `run_regression.py` command above should work without any issue.
+
+## Indexing
+
+Typical indexing command:
+
+```
+bin/run.sh io.anserini.index.IndexCollection \
+  -threads 16 \
+  -collection JsonVectorCollection \
+  -input /path/to/bright-aops \
+  -generator DefaultLuceneDocumentGenerator \
+  -index indexes/lucene-inverted.bright-aops.splade-v3/ \
+  -impact -pretokenized \
+  >& logs/log.bright-aops &
+```
+
+The path `/path/to/bright-aops/` should point to the corpus downloaded above.
+The important indexing options to note here are `-impact -pretokenized`: the first tells Anserini not to encode BM25 doclengths into Lucene's norms (encoding them is the default behavior), and the second tells it not to apply any additional tokenization to the pre-encoded tokens.
+For additional details, see the explanation of [common indexing options](../../docs/common-indexing-options.md).
+
+## Retrieval
+
+Topics and qrels are stored [here](https://github.com/castorini/anserini-tools/tree/master/topics-and-qrels), which is linked to the Anserini repo as a submodule.
+
+After indexing has completed, you should be able to perform retrieval as follows:
+
+```
+bin/run.sh io.anserini.search.SearchCollection \
+  -index indexes/lucene-inverted.bright-aops.splade-v3/ \
+  -topics tools/topics-and-qrels/topics.bright-aops.tsv.gz \
+  -topicReader TsvString \
+  -output runs/run.bright-aops.splade-v3-onnx.topics.bright-aops.txt \
+  -impact -pretokenized -removeQuery -hits 1000 -encoder SpladeV3 &
+```
+
+Evaluation can be performed using `trec_eval`:
+
+```
+bin/trec_eval -c -m ndcg_cut.10 tools/topics-and-qrels/qrels.bright-aops.txt runs/run.bright-aops.splade-v3-onnx.topics.bright-aops.txt
+bin/trec_eval -c -m recall.100 tools/topics-and-qrels/qrels.bright-aops.txt runs/run.bright-aops.splade-v3-onnx.topics.bright-aops.txt
+bin/trec_eval -c -m recall.1000 tools/topics-and-qrels/qrels.bright-aops.txt runs/run.bright-aops.splade-v3-onnx.topics.bright-aops.txt
+```
+
+## Effectiveness
+
+With the above commands, you should be able to reproduce the following results:
+
+| **nDCG@10**                                                                                                  | **SPLADE-v3**|
+|:-------------------------------------------------------------------------------------------------------------|-----------|
+| BRIGHT: AoPS                                                                                                 | 0.0692    |
+| **R@100**                                                                                                    | **SPLADE-v3**|
+| BRIGHT: AoPS                                                                                                 | 0.2602    |
+| **R@1000**                                                                                                   | **SPLADE-v3**|
+| BRIGHT: AoPS                                                                                                 | 0.4644    |
diff --git a/docs/regressions/regressions-bright-biology.splade-v3.onnx.md b/docs/regressions/regressions-bright-biology.splade-v3.onnx.md
@@ -0,0 +1,84 @@
+# Anserini Regressions: BRIGHT &mdash; Biology
+
+**Model**: [SPLADE-v3](https://arxiv.org/abs/2403.06789) (using ONNX for on-the-fly query encoding)
+
+This page documents regression experiments for [BRIGHT &mdash; Biology](https://brightbenchmark.github.io/) using [SPLADE-v3](https://arxiv.org/abs/2403.06789).
+The model itself can be downloaded [here](https://huggingface.co/naver/splade-v3).
+See the [official SPLADE repo](https://github.com/naver/splade) and the following paper for more details:
+
+> Carlos Lassance, Hervé Déjean, Thibault Formal, and Stéphane Clinchant. [SPLADE-v3: New baselines for SPLADE.](https://arxiv.org/abs/2403.06789) _arXiv:2403.06789_.
+
+In these experiments, we are using ONNX to perform query encoding on the fly.
+
+The exact configurations for these regressions are stored in [this YAML file](../../src/main/resources/regression/bright-biology.splade-v3.onnx.yaml).
+Note that this page is automatically generated from [this template](../../src/main/resources/docgen/templates/bright-biology.splade-v3.onnx.template) as part of Anserini's regression pipeline, so do not modify this page directly; modify the template instead and build Anserini to rebuild the documentation.
+
+From one of our Waterloo servers (e.g., `orca`), the following command will perform the complete regression, end to end:
+
+```
+python src/main/python/run_regression.py --index --verify --search --regression bright-biology.splade-v3.onnx
+```
+
+All the BRIGHT corpora, encoded by the SPLADE-v3 model, are available for download:
+
+```bash
+wget https://huggingface.co/datasets/castorini/collections-bright/resolve/main/bright-splade-v3.tar -P collections/
+tar xvf collections/bright-splade-v3.tar -C collections/
+```
+
+The tarball is 1.5 GB and has MD5 checksum `434cd776b5c40f8112d2bf888c58a516`.
+After downloading and unpacking the corpora, the `run_regression.py` command above should work without any issue.
+
+## Indexing
+
+Typical indexing command:
+
+```
+bin/run.sh io.anserini.index.IndexCollection \
+  -threads 16 \
+  -collection JsonVectorCollection \
+  -input /path/to/bright-biology \
+  -generator DefaultLuceneDocumentGenerator \
+  -index indexes/lucene-inverted.bright-biology.splade-v3/ \
+  -impact -pretokenized \
+  >& logs/log.bright-biology &
+```
+
+The path `/path/to/bright-biology/` should point to the corpus downloaded above.
+The important indexing options to note here are `-impact -pretokenized`: the first tells Anserini not to encode BM25 doclengths into Lucene's norms (encoding them is the default behavior), and the second tells it not to apply any additional tokenization to the pre-encoded tokens.
+For additional details, see the explanation of [common indexing options](../../docs/common-indexing-options.md).
+
+## Retrieval
+
+Topics and qrels are stored [here](https://github.com/castorini/anserini-tools/tree/master/topics-and-qrels), which is linked to the Anserini repo as a submodule.
+
+After indexing has completed, you should be able to perform retrieval as follows:
+
+```
+bin/run.sh io.anserini.search.SearchCollection \
+  -index indexes/lucene-inverted.bright-biology.splade-v3/ \
+  -topics tools/topics-and-qrels/topics.bright-biology.tsv.gz \
+  -topicReader TsvString \
+  -output runs/run.bright-biology.splade-v3-onnx.topics.bright-biology.txt \
+  -impact -pretokenized -removeQuery -hits 1000 -encoder SpladeV3 &
+```
+
+Evaluation can be performed using `trec_eval`:
+
+```
+bin/trec_eval -c -m ndcg_cut.10 tools/topics-and-qrels/qrels.bright-biology.txt runs/run.bright-biology.splade-v3-onnx.topics.bright-biology.txt
+bin/trec_eval -c -m recall.100 tools/topics-and-qrels/qrels.bright-biology.txt runs/run.bright-biology.splade-v3-onnx.topics.bright-biology.txt
+bin/trec_eval -c -m recall.1000 tools/topics-and-qrels/qrels.bright-biology.txt runs/run.bright-biology.splade-v3-onnx.topics.bright-biology.txt
+```
+
+## Effectiveness
+
+With the above commands, you should be able to reproduce the following results:
+
+| **nDCG@10**                                                                                                  | **SPLADE-v3**|
+|:-------------------------------------------------------------------------------------------------------------|-----------|
+| BRIGHT: Biology                                                                                              | 0.2095    |
+| **R@100**                                                                                                    | **SPLADE-v3**|
+| BRIGHT: Biology                                                                                              | 0.5602    |
+| **R@1000**                                                                                                   | **SPLADE-v3**|
+| BRIGHT: Biology                                                                                              | 0.8883    |
diff --git a/docs/regressions/regressions-bright-earth-science.splade-v3.onnx.md b/docs/regressions/regressions-bright-earth-science.splade-v3.onnx.md
@@ -0,0 +1,84 @@
+# Anserini Regressions: BRIGHT &mdash; Earth Science
+
+**Model**: [SPLADE-v3](https://arxiv.org/abs/2403.06789) (using ONNX for on-the-fly query encoding)
+
+This page documents regression experiments for [BRIGHT &mdash; Earth Science](https://brightbenchmark.github.io/) using [SPLADE-v3](https://arxiv.org/abs/2403.06789).
+The model itself can be downloaded [here](https://huggingface.co/naver/splade-v3).
+See the [official SPLADE repo](https://github.com/naver/splade) and the following paper for more details:
+
+> Carlos Lassance, Hervé Déjean, Thibault Formal, and Stéphane Clinchant. [SPLADE-v3: New baselines for SPLADE.](https://arxiv.org/abs/2403.06789) _arXiv:2403.06789_.
+
+In these experiments, we are using ONNX to perform query encoding on the fly.
+
+The exact configurations for these regressions are stored in [this YAML file](../../src/main/resources/regression/bright-earth-science.splade-v3.onnx.yaml).
+Note that this page is automatically generated from [this template](../../src/main/resources/docgen/templates/bright-earth-science.splade-v3.onnx.template) as part of Anserini's regression pipeline, so do not modify this page directly; modify the template instead and build Anserini to rebuild the documentation.
+
+From one of our Waterloo servers (e.g., `orca`), the following command will perform the complete regression, end to end:
+
+```
+python src/main/python/run_regression.py --index --verify --search --regression bright-earth-science.splade-v3.onnx
+```
+
+All the BRIGHT corpora, encoded by the SPLADE-v3 model, are available for download:
+
+```bash
+wget https://huggingface.co/datasets/castorini/collections-bright/resolve/main/bright-splade-v3.tar -P collections/
+tar xvf collections/bright-splade-v3.tar -C collections/
+```
+
+The tarball is 1.5 GB and has MD5 checksum `434cd776b5c40f8112d2bf888c58a516`.
+After downloading and unpacking the corpora, the `run_regression.py` command above should work without any issue.
+
+## Indexing
+
+Typical indexing command:
+
+```
+bin/run.sh io.anserini.index.IndexCollection \
+  -threads 16 \
+  -collection JsonVectorCollection \
+  -input /path/to/bright-earth-science \
+  -generator DefaultLuceneDocumentGenerator \
+  -index indexes/lucene-inverted.bright-earth-science.splade-v3/ \
+  -impact -pretokenized \
+  >& logs/log.bright-earth-science &
+```
+
+The path `/path/to/bright-earth-science/` should point to the corpus downloaded above.
+The important indexing options to note here are `-impact -pretokenized`: the first tells Anserini not to encode BM25 doclengths into Lucene's norms (encoding them is the default behavior), and the second tells it not to apply any additional tokenization to the pre-encoded tokens.
+For additional details, see the explanation of [common indexing options](../../docs/common-indexing-options.md).
+
+## Retrieval
+
+Topics and qrels are stored [here](https://github.com/castorini/anserini-tools/tree/master/topics-and-qrels), which is linked to the Anserini repo as a submodule.
+
+After indexing has completed, you should be able to perform retrieval as follows:
+
+```
+bin/run.sh io.anserini.search.SearchCollection \
+  -index indexes/lucene-inverted.bright-earth-science.splade-v3/ \
+  -topics tools/topics-and-qrels/topics.bright-earth-science.tsv.gz \
+  -topicReader TsvString \
+  -output runs/run.bright-earth-science.splade-v3-onnx.topics.bright-earth-science.txt \
+  -impact -pretokenized -removeQuery -hits 1000 -encoder SpladeV3 &
+```
+
+Evaluation can be performed using `trec_eval`:
+
+```
+bin/trec_eval -c -m ndcg_cut.10 tools/topics-and-qrels/qrels.bright-earth-science.txt runs/run.bright-earth-science.splade-v3-onnx.topics.bright-earth-science.txt
+bin/trec_eval -c -m recall.100 tools/topics-and-qrels/qrels.bright-earth-science.txt runs/run.bright-earth-science.splade-v3-onnx.topics.bright-earth-science.txt
+bin/trec_eval -c -m recall.1000 tools/topics-and-qrels/qrels.bright-earth-science.txt runs/run.bright-earth-science.splade-v3-onnx.topics.bright-earth-science.txt
+```
+
+## Effectiveness
+
+With the above commands, you should be able to reproduce the following results:
+
+| **nDCG@10**                                                                                                  | **SPLADE-v3**|
+|:-------------------------------------------------------------------------------------------------------------|-----------|
+| BRIGHT: Earth Science                                                                                        | 0.2670    |
+| **R@100**                                                                                                    | **SPLADE-v3**|
+| BRIGHT: Earth Science                                                                                        | 0.5776    |
+| **R@1000**                                                                                                   | **SPLADE-v3**|
+| BRIGHT: Earth Science                                                                                        | 0.8127    |
diff --git a/docs/regressions/regressions-bright-economics.splade-v3.onnx.md b/docs/regressions/regressions-bright-economics.splade-v3.onnx.md
@@ -0,0 +1,84 @@
+# Anserini Regressions: BRIGHT &mdash; Economics
+
+**Model**: [SPLADE-v3](https://arxiv.org/abs/2403.06789) (using ONNX for on-the-fly query encoding)
+
+This page documents regression experiments for [BRIGHT &mdash; Economics](https://brightbenchmark.github.io/) using [SPLADE-v3](https://arxiv.org/abs/2403.06789).
+The model itself can be downloaded [here](https://huggingface.co/naver/splade-v3).
+See the [official SPLADE repo](https://github.com/naver/splade) and the following paper for more details:
+
+> Carlos Lassance, Hervé Déjean, Thibault Formal, and Stéphane Clinchant. [SPLADE-v3: New baselines for SPLADE.](https://arxiv.org/abs/2403.06789) _arXiv:2403.06789_.
+
+In these experiments, we are using ONNX to perform query encoding on the fly.
+
+The exact configurations for these regressions are stored in [this YAML file](../../src/main/resources/regression/bright-economics.splade-v3.onnx.yaml).
+Note that this page is automatically generated from [this template](../../src/main/resources/docgen/templates/bright-economics.splade-v3.onnx.template) as part of Anserini's regression pipeline, so do not modify this page directly; modify the template instead and build Anserini to rebuild the documentation.
+
+From one of our Waterloo servers (e.g., `orca`), the following command will perform the complete regression, end to end:
+
+```
+python src/main/python/run_regression.py --index --verify --search --regression bright-economics.splade-v3.onnx
+```
+
+All the BRIGHT corpora, encoded by the SPLADE-v3 model, are available for download:
+
+```bash
+wget https://huggingface.co/datasets/castorini/collections-bright/resolve/main/bright-splade-v3.tar -P collections/
+tar xvf collections/bright-splade-v3.tar -C collections/
+```
+
+The tarball is 1.5 GB and has MD5 checksum `434cd776b5c40f8112d2bf888c58a516`.
+After downloading and unpacking the corpora, the `run_regression.py` command above should work without any issue.
+
+## Indexing
+
+Typical indexing command:
+
+```
+bin/run.sh io.anserini.index.IndexCollection \
+  -threads 16 \
+  -collection JsonVectorCollection \
+  -input /path/to/bright-economics \
+  -generator DefaultLuceneDocumentGenerator \
+  -index indexes/lucene-inverted.bright-economics.splade-v3/ \
+  -impact -pretokenized \
+  >& logs/log.bright-economics &
+```
+
+The path `/path/to/bright-economics/` should point to the corpus downloaded above.
+The important indexing options to note here are `-impact -pretokenized`: the first tells Anserini not to encode BM25 doclengths into Lucene's norms (encoding them is the default behavior), and the second tells it not to apply any additional tokenization to the pre-encoded tokens.
+For additional details, see the explanation of [common indexing options](../../docs/common-indexing-options.md).
+
+## Retrieval
+
+Topics and qrels are stored [here](https://github.com/castorini/anserini-tools/tree/master/topics-and-qrels), which is linked to the Anserini repo as a submodule.
+
+After indexing has completed, you should be able to perform retrieval as follows:
+
+```
+bin/run.sh io.anserini.search.SearchCollection \
+  -index indexes/lucene-inverted.bright-economics.splade-v3/ \
+  -topics tools/topics-and-qrels/topics.bright-economics.tsv.gz \
+  -topicReader TsvString \
+  -output runs/run.bright-economics.splade-v3-onnx.topics.bright-economics.txt \
+  -impact -pretokenized -removeQuery -hits 1000 -encoder SpladeV3 &
+```
+
+Evaluation can be performed using `trec_eval`:
+
+```
+bin/trec_eval -c -m ndcg_cut.10 tools/topics-and-qrels/qrels.bright-economics.txt runs/run.bright-economics.splade-v3-onnx.topics.bright-economics.txt
+bin/trec_eval -c -m recall.100 tools/topics-and-qrels/qrels.bright-economics.txt runs/run.bright-economics.splade-v3-onnx.topics.bright-economics.txt
+bin/trec_eval -c -m recall.1000 tools/topics-and-qrels/qrels.bright-economics.txt runs/run.bright-economics.splade-v3-onnx.topics.bright-economics.txt
+```
+
+## Effectiveness
+
+With the above commands, you should be able to reproduce the following results:
+
+| **nDCG@10**                                                                                                  | **SPLADE-v3**|
+|:-------------------------------------------------------------------------------------------------------------|-----------|
+| BRIGHT: Economics                                                                                            | 0.1604    |
+| **R@100**                                                                                                    | **SPLADE-v3**|
+| BRIGHT: Economics                                                                                            | 0.4478    |
+| **R@1000**                                                                                                   | **SPLADE-v3**|
+| BRIGHT: Economics                                                                                            | 0.7804    |