PAGER is a page-driven autonomous knowledge representation framework designed for organizing and utilizing knowledge in Retrieval-Augmented Generation (RAG) scenarios. Specifically, PAGER first prompts an LLM to construct a structured cognitive outline for a given query, consisting of multiple slots, each representing a distinct knowledge dimension. Guided by these slots, PAGER then iteratively retrieves and refines relevant documents, populating each slot with pertinent information, and ultimately constructs a coherent "knowledge page" that serves as contextual input for answer generation. Experimental results on multiple knowledge-intensive benchmarks with various backbone models show that PAGER consistently outperforms all RAG baselines across evaluation metrics.
```bash
conda create --name PAGER python==3.11
conda activate PAGER
git clone https://github.com/OpenBMB/PAGER.git
cd PAGER
pip install -r requirements.txt
```

This section provides a step-by-step guide to reproducing our results.
You can download the evaluation dataset required for this experiment from here. We use the Wiki dataset provided by FlashRAG as the retrieval corpus, and the downloading and processing procedures are detailed here.
Move the downloaded evaluation dataset and the processed corpus into the `evaluation_dataset` folder.
Encode the downloaded corpus with the Qwen3-Embedding-0.6B model and store the resulting embeddings as `wiki.npy` in the `embedding` directory by running the following script:
```bash
bash bash/embed.sh
```
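The encoding step is conceptually equivalent to the sketch below. This is an illustration, not the exact contents of `bash/embed.sh`: the corpus path, the `contents` field name, and the batch size are assumptions.

```python
# Conceptual sketch of the corpus encoding step; the corpus path, field
# name, and batch size are assumptions, not the contents of bash/embed.sh.
import json
import numpy as np
from sentence_transformers import SentenceTransformer

model = SentenceTransformer("Qwen/Qwen3-Embedding-0.6B")

# Assumed JSONL corpus layout: one {"contents": ...} record per line.
with open("evaluation_dataset/wiki.jsonl") as f:
    passages = [json.loads(line)["contents"] for line in f]

embeddings = model.encode(passages, batch_size=64, normalize_embeddings=True)
np.save("embedding/wiki.npy", embeddings.astype(np.float32))
```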
For the encoded Wikipedia embeddings `wiki.npy`, build a FAISS index to be used in subsequent retrieval; the constructed index is saved as `wiki.index` in the `embedding` directory. Run the following script:

```bash
bash bash/index.sh
```
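Index construction amounts to loading `wiki.npy` and building a FAISS index over it. A minimal version, assuming normalized embeddings and inner-product search (the index type actually used by `bash/index.sh` may differ, e.g. an IVF or HNSW index for speed):

```python
# Minimal FAISS index construction over the precomputed embeddings.
import faiss
import numpy as np

embeddings = np.load("embedding/wiki.npy").astype(np.float32)
faiss.normalize_L2(embeddings)  # cosine similarity via inner product

index = faiss.IndexFlatIP(embeddings.shape[1])
index.add(embeddings)
faiss.write_index(index, "embedding/wiki.index")
print(f"indexed {index.ntotal} passages")
```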
Next, run the `vllm_emb.sh` script to deploy the Qwen3-Embedding-0.6B model on the GPU and keep it running in the background:

```bash
bash bash/vllm_emb.sh
```
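Once the server is up, you can sanity-check it through vLLM's OpenAI-compatible embeddings endpoint. The host and port below are vLLM's defaults and may differ from what `vllm_emb.sh` configures:

```python
# Quick sanity check against the vLLM embedding server; host and port
# are assumptions based on vLLM's defaults.
import requests

resp = requests.post(
    "http://localhost:8000/v1/embeddings",
    json={"model": "Qwen/Qwen3-Embedding-0.6B", "input": ["hello world"]},
)
resp.raise_for_status()
vector = resp.json()["data"][0]["embedding"]
print(len(vector))  # embedding dimensionality
```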
Afterward, run the `ret_serve.sh` script to deploy the retrieval service in the background:

```bash
bash bash/ret_serve.sh
```
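The service's exact HTTP interface is defined in `src/retriever/`; under the hood, a retrieval call is equivalent to embedding the query and probing the FAISS index, roughly as follows (endpoint and paths as assumed above):

```python
# What the retrieval service does under the hood (illustrative only;
# the real endpoint and payload are defined in src/retriever/).
import faiss
import numpy as np
import requests

index = faiss.read_index("embedding/wiki.index")

def embed(text: str) -> np.ndarray:
    resp = requests.post(
        "http://localhost:8000/v1/embeddings",
        json={"model": "Qwen/Qwen3-Embedding-0.6B", "input": [text]},
    )
    return np.asarray(resp.json()["data"][0]["embedding"], dtype=np.float32)

query_vec = embed("Who wrote The Brothers Karamazov?")[None, :]
faiss.normalize_L2(query_vec)
scores, doc_ids = index.search(query_vec, 5)  # top-5 passage ids
print(doc_ids[0], scores[0])
```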
Construct an outline for each question in the evaluation dataset. Run the following script; the generated outlines are stored in the `output_data/outline` directory:

```bash
bash bash/construct_outline.sh
```
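Outline construction boils down to one structured prompt per question. The exact prompt lives in `src/construct_outline.py`; the version below is an illustrative approximation, with `llm` as a hypothetical stand-in for the backbone model:

```python
# Illustrative outline prompt; the actual prompt used by
# bash/construct_outline.sh is defined in src/construct_outline.py.
def construct_outline(question: str, llm) -> str:
    prompt = (
        "Draft a cognitive outline for answering the question below. "
        "List the distinct knowledge dimensions (slots) needed, one per line.\n"
        f"Question: {question}"
    )
    return llm(prompt)  # free-text outline, parsed in the next step
```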
Afterward, extract the structured outline from the raw outline to serve as the initialized page, also stored in the `output_data/outline` directory:

```bash
bash bash/extract_outline.sh
```
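Conceptually, extraction turns the free-text outline into a machine-readable page skeleton with one empty slot per outline entry. The schema below is an assumption, not the exact format written to `output_data/outline`:

```python
# Conceptual page initialization; the real schema is produced by
# src/extract_outline.py and may differ.
import json
import re

def init_page(question: str, outline_text: str) -> dict:
    # Treat each non-empty outline line as one slot, stripping list markers.
    slots = [re.sub(r"^\s*(?:[-*]|\d+[.)])\s*", "", line).strip()
             for line in outline_text.splitlines() if line.strip()]
    return {"question": question, "slots": {s: "" for s in slots}}

page = init_page("Who wrote The Brothers Karamazov?",
                 "- Author identity\n- Background of the novel")
print(json.dumps(page, indent=2))
```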
After obtaining the initialized page, iteratively fill it with knowledge until the final knowledge representation is produced; the results are stored in the `output_data/page` directory:

```bash
bash bash/construct_page.sh
```
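The filling stage is the iterative loop at the core of PAGER: each slot issues retrieval queries, and the LLM distills the returned passages into that slot. A sketch, with `retrieve` and `llm` again standing in for the deployed services:

```python
# Sketch of iterative page filling; `retrieve` and `llm` are hypothetical
# stand-ins for the retrieval service and the backbone model.
def fill_page(page: dict, retrieve, llm, rounds: int = 2) -> dict:
    question = page["question"]
    for slot, notes in page["slots"].items():
        for _ in range(rounds):
            docs = retrieve(f"{question} {slot}", top_k=5)
            notes = llm(
                f"Question: {question}\nSlot: {slot}\nCurrent notes: {notes}\n"
                "Refine the notes using these passages:\n" + "\n".join(docs)
            )
        page["slots"][slot] = notes
    return page
```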
Using the pages generated in the `output_data/page` directory, answer the given questions and store the resulting answers in the `output_data/infer` directory:

```bash
bash bash/infer_page.sh
```
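Answer generation conditions the backbone model on the linearized page rather than on raw retrieved documents; roughly (the prompt wording is an assumption, see `src/infer_page.py` for the real one):

```python
# Illustrative answer generation with the knowledge page as context.
def answer_with_page(page: dict, llm) -> str:
    context = "\n\n".join(f"[{slot}]\n{notes}"
                          for slot, notes in page["slots"].items())
    return llm(
        "Answer the question using the knowledge page below.\n\n"
        f"Knowledge page:\n{context}\n\nQuestion: {page['question']}\nAnswer:"
    )
```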
Finally, run the following script to evaluate the accuracy of the answers:

```bash
bash bash/evaluate_infer.sh
```
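The accuracy computation follows standard open-domain QA practice: normalize prediction and gold answers, then compare. A minimal exact-match scorer (the metric actually reported is implemented in `src/evaluate_infer.py` and may differ in details):

```python
# Minimal exact-match scorer in the usual open-domain QA style.
import re
import string

def normalize(text: str) -> str:
    text = text.lower()
    text = re.sub(r"\b(a|an|the)\b", " ", text)  # drop articles
    text = text.translate(str.maketrans("", "", string.punctuation))
    return " ".join(text.split())

def exact_match(prediction: str, gold_answers: list[str]) -> bool:
    return any(normalize(prediction) == normalize(g) for g in gold_answers)

print(exact_match("The Fyodor Dostoevsky.", ["Fyodor Dostoevsky"]))  # True
```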
```
PAGER/
├── README.md
├── requirements.txt
├── output_data/   # Sample outputs generated by PAGER
├── figs/          # README figures
├── bash/          # Scripts used to run the experiments
└── src/
    ├── retriever/            # Deploy the retrieval service
    ├── construct_outline.py  # Construct the outline
    ├── construct_page.py     # Construct the page
    ├── extract_outline.py    # Initialize the page
    ├── infer_page.py         # Answer the questions with the pages
    └── evaluate_infer.py     # Evaluate the performance of the answers
```
Our work is built on the following codebases, and we are deeply grateful for their contributions.
We would appreciate a citation if you find our paper relevant and useful to your research!
```bibtex
@article{li2026structuredknowledgerepresentationcontextual,
  title={Structured Knowledge Representation through Contextual Pages for Retrieval-Augmented Generation},
  author={Xinze Li and Zhenghao Liu and Haidong Xin and Yukun Yan and Shuo Wang and Zheni Zeng and Sen Mei and Ge Yu and Maosong Sun},
  year={2026},
  url={https://arxiv.org/abs/2601.09402},
}
```
If you have questions, suggestions, or bug reports, please email: