We provide an effective solution to unsupervised domain adaptation for ranking models. Prior work proposed several synthetic data generation frameworks (InPars, DocGen-RL, and Promptagator) that generate target-domain examples to fine-tune a ranker pre-trained on MS-MARCO. However, the fine-tuned ranker often performs worse than the zero-shot baseline, and the data generation process is computationally expensive. We therefore propose DUQGen, a cost-effective solution that consistently improves over the zero-shot baselines and, in most cases, substantially improves over the existing baselines.
The DUQGen framework consists of two stages: data augmentation (generation) and fine-tuning.
Prepare the target corpus documents in a JSONL file with the format below.
{"docid": <document-id-1>, "doctext": <document-text-1>}
{"docid": <document-id-2>, "doctext": <document-text-2>}
...
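As a minimal sketch of preparing this file, the snippet below writes a small corpus to JSONL using the `docid` and `doctext` keys shown above. The function name `write_collection_jsonl`, the sample documents, and the output path are illustrative assumptions and are not part of the repository.

```python
import json
import os
import tempfile


def write_collection_jsonl(documents, path):
    """Write (docid, doctext) pairs to a JSONL file, one JSON object per line."""
    with open(path, "w", encoding="utf-8") as f:
        for docid, doctext in documents:
            record = {"docid": docid, "doctext": doctext}
            f.write(json.dumps(record, ensure_ascii=False) + "\n")


# Hypothetical example documents; replace with your own target corpus.
example_docs = [
    ("d1", "Aspirin reduces the risk of heart attack."),
    ("d2", "BM25 is a bag-of-words ranking function."),
]
collection_path = os.path.join(tempfile.gettempdir(), "collection.jsonl")
write_collection_jsonl(example_docs, collection_path)
```

The resulting file can then be passed as `--collection_data_filepath` in the next step.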
Then run the below command with the script found in data_preparation/target_representation.
python generate_document_embedding.py \
    --collection_data_filepath <path_to_document_collection_file.jsonl> \
    --save_collection_embedding_filepath <path_to_save_document_embeddings_file.pt> \
    --cache_dir <path_to_download_models_cache_directory>
Run the below command with the script found in data_preparation/target_representation.
python sample_target_collection_documents.py \
    --dataset_name <dataset-name> \
    --collection_text_filepath <path_to_document_collection_file.jsonl> \
    --collection_embedding_filepath <path_to_document_embeddings_file.pt> \
    --save_sampled_documents_filepath <path_to_save_sampled_documents_file.jsonl>
Run the below command with the script found in data_preparation/query_generation. We used Llama-2-7B-chat from Hugging Face for query generation. Prompt templates for all BEIR target datasets can be found in the data_preparation/prompt_templates folder.
python query_generation.py \
    --model_name_or_path meta-llama/Llama-2-7b-chat-hf \
    --prompt_template_filepath ../prompt_templates/template_<dataset>.yaml \
    --sampled_documents_filepath <path_to_sampled_documents_file.jsonl> \
    --save_generated_queries_filepath <path_to_save_generated_queries.jsonl> \
    --cache_dir <path_to_download_models_cache_directory>
Run the below command with the script found in data_preparation/hardnegative_mining. We used Contriever to retrieve an initial top-100 ranked list and picked the bottom x (x=4) documents as the hard negatives.
python train_data_generation.py \
    --dataset_name <dataset-name> \
    --generated_queries_filepath <path_to_generated_queries.jsonl> \
    --save_traindata_filepath <path_to_save_generated_traindata.jsonl>
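The bottom-x selection described above can be sketched as follows. This is only an illustration under stated assumptions: the function name `pick_hard_negatives` is hypothetical, the ranked list is assumed to be sorted from most to least relevant, and excluding the positive document (the one the query was generated from) is an assumption rather than confirmed behavior of `train_data_generation.py`.

```python
def pick_hard_negatives(ranked_docids, positive_docid, top_k=100, num_negatives=4):
    """Return the lowest-ranked `num_negatives` documents from the top-k list.

    ranked_docids: doc ids sorted from most to least relevant by the first-stage retriever.
    positive_docid: the document the query was generated from (excluded; an assumption).
    """
    if num_negatives <= 0:
        return []
    # Keep only the top-k candidates and drop the positive document if present.
    candidates = [d for d in ranked_docids[:top_k] if d != positive_docid]
    # The tail of the list holds the lowest-ranked candidates.
    return candidates[-num_negatives:]


# Hypothetical ranked list: "doc0" is ranked first, "doc119" last.
ranking = [f"doc{i}" for i in range(120)]
negatives = pick_hard_negatives(ranking, positive_docid="doc0")
print(negatives)  # ['doc96', 'doc97', 'doc98', 'doc99']
```

Taking negatives from the tail of the top-100 list rather than the head lowers the chance of picking an unlabeled relevant document, while the candidates remain related to the query.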
We fine-tuned both MonoT5-3B and ColBERT, named DUQGen-reranker and DUQGen-retriever respectively. However, our approach can be applied to any ranking model.
Run the below command with the script found in monot5_finetuning.
python train_monot5.py \
    --base_modename_or_path castorini/monot5-3b-msmarco-10k \
    --train_data_filepath <input_training_data_file_path.jsonl> \
    --save_model_path <directory_to_save_trained_model> \
    --cache_dir <path_to_download_models_cache_directory>
Run the below command with the bash script found in colbert_finetuning. All the variables to change are described in the bash file itself. The script was developed to run on a SLURM system, but it can also be run with a plain bash call.
sbatch train_colbert.sh
We need to generate or format test data before running re-ranking and dense retrieval. Please refer to https://github.com/thakur-nandan/beir-ColBERT to prepare test data for the ColBERT dense retriever. To generate top-100 BM25 re-ranking data, run the below command with the script found in data_preparation/testdata_preparation.
python generate_bm25_reranking_data.py \
    --dataset_name <dataset-name> \
    --save_bm25_results_filepath <file_to_save_bm25_results.txt> \
    --save_testdata_filepath <file_to_save_test_data.json> \
    --save_qrels_filepath <file_to_save_qrel_data_in_treceval_format.txt>
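For reference, a qrels file in trec_eval format has four whitespace-separated columns per line: query id, an iteration field (conventionally `0`), document id, and relevance label. The sketch below formats such lines from a nested dictionary. The function name `format_qrels_lines`, the example judgments, and the nested-dictionary input layout are assumptions for illustration and do not describe the internals of `generate_bm25_reranking_data.py`.

```python
def format_qrels_lines(qrels):
    """Format {query_id: {doc_id: relevance}} as trec_eval qrels lines."""
    lines = []
    for qid, judged in qrels.items():
        for docid, rel in judged.items():
            # Columns: query id, iteration (unused, conventionally 0), doc id, relevance.
            lines.append(f"{qid} 0 {docid} {int(rel)}")
    return lines


# Hypothetical relevance judgments for two queries.
example_qrels = {"q1": {"d3": 1, "d7": 0}, "q2": {"d1": 2}}
qrels_lines = format_qrels_lines(example_qrels)
print("\n".join(qrels_lines))
```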
Run the below command with the script found in monot5_finetuning.
python test_monot5.py \
    --model_name_or_checkpoint_path <path_directory_to_saved_model/checkpoint-*> \
    --save_predictions_fn <file_to_save_predictions.json> \
    --test_filename_path <input_file_having_test_data.json> \
    --qrle_filename <qrel_file_saved_in_treceval_format.txt> \
    --cache_dir <path_to_download_models_cache_directory>
Run the below command with the bash script found in colbert_finetuning. All the variables to change are described in the bash file itself. The script was developed to run on a SLURM system, but it can also be run with a plain bash call.
sbatch test_colbert.sh
To be released soon ...
To cite our 🦆 DUQGen in your work,
coming soon ...
- Ramraj Chandradevan ([email protected])
- Kaustubh Dhole ([email protected])
- Eugene Agichtein ([email protected])