This repository contains the SURE semantic-understanding evaluation data and scripts for studying how dense retrieval models behave under different semantic perturbations. It includes processed FIQA, MS MARCO, and NQ datasets, along with Python scripts for encoding, scoring, reranking, and computing SURE metrics.
SURE/
├── code/
│   ├── encode_corpus.py                      # Encode TSV corpora into JSONL embeddings
│   ├── encode_query.py                       # Encode queries into pickle files
│   ├── calculate_passage_variants_score.py   # Score queries against semantic passage variants
│   └── eval_retriever.py                     # Generate rank files and compute SURE metrics
├── data/
│   ├── fiqa/
│   ├── msmarco/
│   └── nq/
└── results/                                  # Default experiment output directory
Each dataset directory contains the following files:
- query-*.tsv: Original queries, formatted as query_id<TAB>query_text.
- query-*-semantic-equivalence.tsv: Queries used for the semantic-equivalence task.
- qrels-*.tsv: Relevance labels, formatted as query_id<TAB>0<TAB>passage_id<TAB>label.
- passages.jsonl: Original passages with id and contents fields.
- key-sentences.jsonl: Key sentences from relevant passages.
- key-sentences-with-noise.jsonl: Key sentences with added noise.
- summaries.jsonl: Summaries of relevant passages.
- expansions.jsonl: Expanded versions of relevant passages.
- passages-with-keywords-replaced.jsonl: Passages with keywords replaced.
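
The file formats above can be read with the standard library alone. The following is a minimal sketch, not code from the repository: the function names are hypothetical, and it assumes the TSV files have no header row and that qrels labels are integers.

```python
import csv
import json
from collections import defaultdict


def load_queries(path):
    """Read query_id<TAB>query_text rows into a dict (assumes no header row)."""
    queries = {}
    with open(path, encoding="utf-8", newline="") as handle:
        for row in csv.reader(handle, delimiter="\t", quoting=csv.QUOTE_NONE):
            if len(row) >= 2:
                queries[row[0]] = row[1]
    return queries


def load_qrels(path):
    """Read query_id<TAB>0<TAB>passage_id<TAB>label rows into nested dicts."""
    qrels = defaultdict(dict)
    with open(path, encoding="utf-8", newline="") as handle:
        for row in csv.reader(handle, delimiter="\t", quoting=csv.QUOTE_NONE):
            if len(row) >= 4:
                query_id, _, passage_id, label = row[:4]
                # Assumption: labels are integer relevance grades.
                qrels[query_id][passage_id] = int(label)
    return dict(qrels)


def load_passages(path):
    """Read a JSONL file with id and contents fields into {id: contents}."""
    passages = {}
    with open(path, encoding="utf-8") as handle:
        for line in handle:
            if line.strip():
                record = json.loads(line)
                passages[record["id"]] = record["contents"]
    return passages
```

For example, `load_queries("data/fiqa/query-191.tsv")` would return a dict keyed by query ID, and `load_passages` can be pointed at `passages.jsonl` in a dataset directory. Whether the variant files such as summaries.jsonl share the id/contents layout is not stated here, so check a sample line before reusing `load_passages` on them.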
SURE currently supports three evaluation tasks:
- semantic-precision: Compares the ranking relationship among key sentences, original retrieval results, and noisy key sentences.
- semantic-abstraction: Compares the ranking relationship among summaries, original retrieval results, and expanded passages.
- semantic-equivalence: Compares the ranking relationship between original retrieval results and keyword-replaced passages.
Python 3.9+ is recommended. Install a GPU-enabled PyTorch build if needed.
pip install numpy pandas torch sentence-transformers vllm

If you need to generate the initial top-2000 retrieval results with FAISS/Pyserini, install the corresponding dependencies separately. The evaluation script expects the default initial TREC ranking file at results/{dataset}-faiss-pyserini/res/{model}-2000.txt.
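
To check ahead of time that the initial ranking file is where the evaluation script expects it, the path template can be expanded with a few lines of Python. This is an illustrative sketch: `default_trec_path` and its `results_root` and `depth` parameters are hypothetical names, not part of the scripts.

```python
from pathlib import Path


def default_trec_path(dataset, model, results_root="results", depth=2000):
    """Expand results/{dataset}-faiss-pyserini/res/{model}-2000.txt."""
    return Path(results_root) / f"{dataset}-faiss-pyserini" / "res" / f"{model}-{depth}.txt"


path = default_trec_path("fiqa", "qwen3-embedding-4b")
print(path.as_posix())  # results/fiqa-faiss-pyserini/res/qwen3-embedding-4b-2000.txt
if not path.exists():
    print("Initial ranking not found; generate it before running eval_retriever.py")
```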
The scripts include several built-in model aliases, such as:
- qwen3-embedding-4b
- qwen3-embedding-8b
- gte-qwen2-1.5b
- gte-qwen2-7b
- e5-mistral-7b-instruct
- repllama
The default model paths are placeholders such as /path/to/your/.... Pass --model-path at runtime to use a local model path, or edit MODEL_CONFIGS / PRESETS in the scripts.
encode_corpus.py encodes a TSV file into a JSONL embedding file. Each output line contains id, contents, and vector.
python code/encode_corpus.py \
--model-name qwen3-embedding-4b \
--model-path /path/to/your/Qwen3-Embedding-4B \
--input-file data/fiqa/query-191.tsv \
--output-file results/fiqa/query-embeddings.jsonl

Common options:
- --id-column / --text-column: Select the ID and text columns in the TSV input.
- --batch-size: Override the default batch size.
- --shard-index / --num-shards: Split the input into contiguous shards for parallel encoding.
- --append: Append to the output file instead of overwriting it.
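
Contiguous sharding means that each shard covers one unbroken range of input rows. The sketch below shows one common way to compute those ranges; it is an assumption about how --shard-index and --num-shards might map to rows, and the script may distribute leftover rows differently. `shard_bounds` is a hypothetical helper.

```python
def shard_bounds(num_rows, shard_index, num_shards):
    """Return the [start, end) row range covered by one contiguous shard."""
    if num_shards < 1 or not 0 <= shard_index < num_shards:
        raise ValueError("shard_index must satisfy 0 <= shard_index < num_shards")
    base, extra = divmod(num_rows, num_shards)
    # The first `extra` shards take one additional row each.
    start = shard_index * base + min(shard_index, extra)
    end = start + base + (1 if shard_index < extra else 0)
    return start, end


rows = [f"row-{i}" for i in range(10)]
for index in range(3):
    start, end = shard_bounds(len(rows), index, 3)
    print(index, start, end, rows[start:end])
```

With 10 rows and 3 shards this yields the ranges [0, 4), [4, 7), and [7, 10), so every row is encoded exactly once across the shards.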
encode_query.py encodes query TSV files into pickle files containing a DataFrame with id, text, and embedding columns.
python code/encode_query.py \
--model-name gte-qwen2-1.5b \
--model-path /path/to/your/gte-qwen2-1.5b-instruct \
--input-file data/fiqa/query-191.tsv \
--output-file results/fiqa/query-embeddings.pkl

calculate_passage_variants_score.py reads queries, qrels, and task-specific candidate texts, then computes embedding similarities between each query and its semantic passage variants.
python code/calculate_passage_variants_score.py semantic-precision \
--dataset fiqa \
--preset qwen3-embedding-4b \
--model-path /path/to/your/Qwen3-Embedding-4B \
--output-root results \
--overwrite

You can also run all datasets at once:
python code/calculate_passage_variants_score.py semantic-abstraction \
--dataset all \
--preset gte-qwen2-1.5b \
--model-path /path/to/your/gte-qwen2-1.5b-instruct \
--backend sentence-transformers \
--output-root results \
--overwrite

Default output paths:
- semantic-precision: results/{dataset}/scores-semantic-precision/{model}/{model}-ks_all_scores.tsv
- semantic-abstraction: results/{dataset}/scores-semantic-abstraction/{model}-semantic-abstraction-sim.tsv
- semantic-equivalence: results/{dataset}/scores-semantic-equivalence/{model}-rpkw-sim.tsv
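
The scores in these files are embedding similarities between a query and each of its passage variants. The similarity measure is not specified here, so the sketch below assumes cosine similarity, a common choice for dense retrievers. The function names and the toy vectors are illustrative only.

```python
import math


def cosine_similarity(a, b):
    """Cosine similarity between two equal-length vectors (0.0 if either is all zeros)."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0.0 or norm_b == 0.0:
        return 0.0
    return dot / (norm_a * norm_b)


def score_variants(query_vec, variant_vecs):
    """Score one query embedding against a dict of named variant embeddings."""
    return {name: cosine_similarity(query_vec, vec) for name, vec in variant_vecs.items()}


# Toy embeddings, for illustration only.
scores = score_variants(
    [1.0, 0.0],
    {"key_sentence": [1.0, 0.0], "key_sentence_with_noise": [1.0, 1.0]},
)
for name, value in scores.items():
    print(name, round(value, 4))
```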
eval_retriever.py merges semantic-variant scores with an initial TREC ranking, generates retriever-rank and gold-rank files, and computes SURE metrics.
python code/eval_retriever.py \
--model qwen3-embedding-4b \
--task semantic-precision \
--dataset fiqa \
--output-root results/fiqa \
--trec-file results/fiqa-faiss-pyserini/res/qwen3-embedding-4b-2000.txt \
--print-json

If --score-file is not provided, the script automatically searches under results/{dataset} using the task and model name.
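
One way to picture that lookup is to expand the default output paths listed for calculate_passage_variants_score.py. The sketch below does only that; how eval_retriever.py actually searches is not documented here, so treat `default_score_path` and `SCORE_TEMPLATES` as hypothetical names for illustration.

```python
from pathlib import Path

# Templates copied from the default output paths of the scoring script.
SCORE_TEMPLATES = {
    "semantic-precision": "{dataset}/scores-semantic-precision/{model}/{model}-ks_all_scores.tsv",
    "semantic-abstraction": "{dataset}/scores-semantic-abstraction/{model}-semantic-abstraction-sim.tsv",
    "semantic-equivalence": "{dataset}/scores-semantic-equivalence/{model}-rpkw-sim.tsv",
}


def default_score_path(task, dataset, model, results_root="results"):
    """Expand the default score-file location for a task, dataset, and model."""
    if task not in SCORE_TEMPLATES:
        raise ValueError(f"unknown task: {task}")
    return Path(results_root) / SCORE_TEMPLATES[task].format(dataset=dataset, model=model)


print(default_score_path("semantic-equivalence", "fiqa", "qwen3-embedding-4b").as_posix())
```

If the expanded path does not exist, passing --score-file explicitly avoids relying on the automatic search.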
Common options:
- --score-file: Explicit semantic-variant score TSV.
- --trec-file: Initial top-2000 TREC ranking file.
- --gold-trec-file: TREC file used for gold-rank calculation.
- --skip-rank / --skip-gold / --skip-metric: Skip selected stages.
- --qids-file: Compute metrics only on a specified query set.
Output metrics:
- rdc: Rank distribution consistency, measuring how well the retriever ranking distribution matches the gold ranking distribution.
- roc: Rank order consistency, measuring how well the retriever ranking order matches the gold ranking order.
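
The merge of semantic-variant scores with an initial TREC ranking can be illustrated with a small example. This sketch makes two assumptions that are not confirmed by the scripts: that the ranking file uses the standard six-column TREC run layout (query ID, Q0, document ID, rank, score, run tag), and that a variant's rank is the position it would take if inserted into the list by score. It does not compute rdc or roc.

```python
from collections import defaultdict


def parse_trec_run(lines):
    """Parse 'qid Q0 docid rank score tag' lines into {qid: [(docid, score), ...]}."""
    run = defaultdict(list)
    for line in lines:
        parts = line.split()
        if len(parts) < 6:
            continue  # skip blank or malformed lines
        qid, _, docid, _, score, _ = parts[:6]
        run[qid].append((docid, float(score)))
    return dict(run)


def variant_rank(ranked, variant_score):
    """1-based rank a variant would take if inserted into the ranking by score."""
    return 1 + sum(1 for _, score in ranked if score > variant_score)


# Toy run, for illustration only.
run = parse_trec_run([
    "q1 Q0 d1 1 0.90 demo",
    "q1 Q0 d2 2 0.80 demo",
    "q1 Q0 d3 3 0.50 demo",
])
print(variant_rank(run["q1"], 0.85))  # 2
```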
A full experiment usually follows these steps:
- Encode the original corpus and queries with a dense retriever.
- Generate top-2000 TREC ranking files for each query using FAISS/Pyserini or another retrieval tool.
- Run calculate_passage_variants_score.py to compute SURE semantic-variant similarities.
- Run eval_retriever.py to generate rank files and compute SURE metrics.
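
The last two steps can be chained from a small driver script. The sketch below only assembles the command lines, using flags that appear in the examples in this README. It assumes the --preset and --model names are the same alias and that the initial ranking already exists at the default location. `build_commands` is a hypothetical helper.

```python
import shlex


def build_commands(dataset, task, model, model_path, results_root="results"):
    """Assemble the scoring and evaluation commands for one dataset, task, and model."""
    score_cmd = [
        "python", "code/calculate_passage_variants_score.py", task,
        "--dataset", dataset,
        "--preset", model,  # assumption: preset alias equals the model name
        "--model-path", model_path,
        "--output-root", results_root,
    ]
    trec_file = f"{results_root}/{dataset}-faiss-pyserini/res/{model}-2000.txt"
    eval_cmd = [
        "python", "code/eval_retriever.py",
        "--model", model,
        "--task", task,
        "--dataset", dataset,
        "--output-root", f"{results_root}/{dataset}",
        "--trec-file", trec_file,
        "--print-json",
    ]
    return [score_cmd, eval_cmd]


for cmd in build_commands(
    "fiqa", "semantic-precision", "qwen3-embedding-4b", "/path/to/your/Qwen3-Embedding-4B"
):
    print(shlex.join(cmd))
```

To execute the commands instead of printing them, each list can be passed to `subprocess.run(cmd, check=True)`, which stops the pipeline at the first failing step.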
- The vLLM backend defaults to tensor_parallel_size=4, dtype=bfloat16, and gpu_memory_utilization=0.7; adjust these values for your GPU setup.
- The sentence-transformers backend uses CUDA by default in the relevant scripts; change the device option or script configuration if needed.
- Data files are expected to be UTF-8 encoded.
- Output directories are created automatically. When scoring semantic variants, reruns skip already-written queries unless --overwrite is set.
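
The skip-on-rerun behavior can be sketched as follows. This is an illustration of the idea rather than the script's code: it assumes the first column of the score TSV is the query ID, and `already_scored` and `pending_queries` are hypothetical names.

```python
import csv
import os


def already_scored(path):
    """Collect query IDs from an existing score TSV (first column assumed to be the query ID)."""
    if not os.path.exists(path):
        return set()
    with open(path, encoding="utf-8", newline="") as handle:
        return {row[0] for row in csv.reader(handle, delimiter="\t") if row}


def pending_queries(query_ids, path, overwrite=False):
    """Return the query IDs that still need scoring, preserving input order."""
    done = set() if overwrite else already_scored(path)
    return [qid for qid in query_ids if qid not in done]
```

With `overwrite=True` every query is returned, which mirrors passing --overwrite on the command line.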