This is the code of the paper KnowTrace: Bootstrapping Iterative Retrieval-Augmented Generation with Structured Knowledge Tracing for KDD 2025.
A self-bootstrapping RAG framework featuring a unique design perspective of structured knowledge tracing.
- python == 3.9.19
- numpy == 1.26.4
- datasets == 2.20.0
- requests == 2.32.3
- peft == 0.9.0
- networkx == 3.2.1
- openai == 0.28.0
- beir == 2.0.0
- fastapi == 0.111.0
- uvicorn == 0.30.1
- torch == 2.3.0
- transformers == 4.42.3
- elasticsearch == 7.9.1
1. Download a MHQA dataset such as HotpotQA, where each entry contains a question and an answer:
{"question": "<input question text>", "answer": "<target answer text>"}
Note that we focus on the challenging open-domain setting, and do not use the supporting context provided by the raw dataset.
2. Download Wikipedia corpus for retrieval, and then process:
python ./retriever/process_wiki.py --input_dir xx --output_path xx
1. For BM25, install and start Elasticsearch first.
wget https://artifacts.elastic.co/downloads/elasticsearch/elasticsearch-7.10.2-linux-x86_64.tar.gz
wget https://artifacts.elastic.co/downloads/elasticsearch/elasticsearch-7.10.2-linux-x86_64.tar.gz.sha512
shasum -a 512 -c elasticsearch-7.10.2-linux-x86_64.tar.gz.sha512
tar -xzf elasticsearch-7.10.2-linux-x86_64.tar.gz
cd elasticsearch-7.10.2/
./bin/elasticsearch # start the server
python ./retriever/retrieval_server.py --data_name xx --corpus_path xx --retrieval_method xx --topk xx --port 8001
3. Activate LLM server if using an open-source LLM such as LLaMA3-8B-Instruct:
python ./local_llama.py --model_path xx --port 1051
python ./main.py --dataset_path xx --base_llm xx --step_num xx
python ./main.py --dataset_path xx --base_llm xx --step_num xx --collect_data True --exploration_path xx --completion_path xx
python ./finetune_peft.py --dataset_name xx --model_name xx --num_train_epochs xx --per_device_train_batch_size xx --gradient_accumulation_steps xx --learning_rate xx --lora_r xx --lora_alpha xx --lora_dropout xx --output_dir xx
We refer to the code of SuRe and ReAct. Thanks for their contributions.