OpenBMB/PAGER
Structured Knowledge Representation through Contextual Pages for Retrieval-Augmented Generation

Xinze Li1, Zhenghao Liu1, Haidong Xin1, Yukun Yan2, Shuo Wang2, Zheni Zeng3, Sen Mei2, Ge Yu1, Maosong Sun2

1Northeastern University, 2Tsinghua University, 3Nanjing University

📖 Introduction

PAGER is a page-driven autonomous knowledge representation framework designed for organizing and utilizing knowledge in Retrieval-Augmented Generation (RAG) scenarios. Specifically, PAGER first prompts an LLM to construct a structured cognitive outline for a given query, consisting of multiple slots, each representing a distinct knowledge dimension. Guided by these slots, PAGER then iteratively retrieves and refines relevant documents, populating each slot with pertinent information, and ultimately constructs a coherent "knowledge page" that serves as contextual input for answer generation. Experimental results on multiple knowledge-intensive benchmark tasks and across various backbone models show that PAGER consistently outperforms all RAG baselines across evaluation metrics.
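The loop described above can be sketched in a few lines. This is an illustrative outline only, not the repository's implementation: `llm` and `retriever` are hypothetical stand-ins for the real model and retrieval service.

```python
# Minimal sketch of the PAGER loop: outline -> slot filling -> page.
# `llm` and `retriever` are hypothetical callables, not the actual
# PAGER components (see src/ for the real implementation).

def build_knowledge_page(query, llm, retriever, max_rounds=3):
    # 1. Prompt the LLM for a cognitive outline: a list of slots,
    #    each naming one knowledge dimension of the query.
    slots = llm(f"Outline the knowledge dimensions for: {query}")

    # 2. Iteratively retrieve evidence and fill each slot.
    page = {slot: [] for slot in slots}
    for _ in range(max_rounds):
        for slot in slots:
            if not page[slot]:  # fill only slots still missing evidence
                page[slot].extend(retriever(f"{query} {slot}"))

    # 3. Linearize the filled slots into a coherent "knowledge page"
    #    used as context for answer generation.
    return "\n\n".join(
        f"## {slot}\n" + "\n".join(page[slot]) for slot in slots
    )
```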

βš™οΈ Setup

conda create --name PAGER python==3.11
conda activate PAGER
git clone https://github.com/OpenBMB/PAGER.git
cd PAGER
pip install -r requirements.txt

🔧 Reproduction Guide

This section provides a step-by-step guide to reproduce our results.

1. Evaluation Dataset Download

You can download the evaluation dataset required for this experiment from here. We use the Wiki dataset provided by FlashRAG as the retrieval corpus, and the downloading and processing procedures are detailed here.

Move the downloaded evaluation dataset and the processed corpus into the evaluation_dataset folder.

2. Deploying a Retrieval Model Service

2.1. Encode the Wikipedia Corpus:

Encode the downloaded corpus with the Qwen3-Embedding-0.6B model and store the resulting embeddings as wiki.npy in the embedding directory by running the following script.

bash bash/embed.sh
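Conceptually, `embed.sh` batch-encodes the corpus passages and saves the resulting matrix as `wiki.npy`. A minimal sketch of that step, where `encode` is a placeholder for the real Qwen3-Embedding-0.6B model:

```python
import numpy as np

def encode_corpus(passages, encode, out_path="wiki.npy", batch_size=256):
    """Batch-encode passages and save a (num_passages, dim) float32 matrix.

    `encode` is a placeholder for the actual embedding model; it maps
    a list of strings to a list of equal-length vectors.
    """
    chunks = []
    for i in range(0, len(passages), batch_size):
        batch = passages[i:i + batch_size]
        chunks.append(np.asarray(encode(batch), dtype=np.float32))
    matrix = np.concatenate(chunks, axis=0)
    np.save(out_path, matrix)  # written to the embedding directory in practice
    return matrix
```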

2.2. Build an Index for the Encoded Embeddings:

Build a FAISS index over the encoded Wikipedia embeddings (wiki.npy) for use in subsequent retrieval. The constructed index is saved as wiki.index in the embedding directory. Run the following script.

bash bash/index.sh

2.3. Deploy the Embedding Model Service:

Next, you need to run the vllm_emb.sh script to deploy the Qwen3-Embedding-0.6B model to the GPU and run it in the background.

bash bash/vllm_emb.sh

2.4. Deploy the Retrieval Service:

Afterward, you need to run the ret_serve.sh script to deploy the retrieval service in the background.

bash bash/ret_serve.sh

3. Construct the Structured Page

3.1. Construct the Outline:

Construct an outline for the questions in the evaluation dataset. You need to run the following script and store the generated outline in the output_data/outline directory.

bash bash/construct_outline.sh

Afterward, extract the structured outline from the generated output to serve as the initialized page, and store it in the output_data/outline directory.

bash bash/extract_outline.sh
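The extracted outline serves as an empty page skeleton. The on-disk format is defined by `src/extract_outline.py`; as a purely illustrative sketch (the `question` and `slots` field names here are assumptions, not the repository's actual schema), initialization might look like:

```python
import json

def init_page(outline_record):
    """Turn one outline record into an initialized page: each slot
    maps to an (initially empty) list of evidence passages.

    The keys "question" and "slots" are hypothetical; consult
    src/extract_outline.py for the real schema.
    """
    return {
        "question": outline_record["question"],
        "page": {slot: [] for slot in outline_record["slots"]},
    }

record = json.loads('{"question": "Who wrote Hamlet?", "slots": ["Author", "Work"]}')
page = init_page(record)
```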

3.2. Construct the Page:

After obtaining the initialized page, you need to iteratively fill it with knowledge until the final knowledge representations are produced, and store the results in the output_data/page directory.

bash bash/construct_page.sh

4. Page Inference

4.1. Infer the Answers.

After obtaining the pages generated in the output_data/page directory, you need to use the generated pages to answer the given questions and store the resulting answers in the output_data/infer directory.

bash bash/infer_page.sh

4.2. Evaluate the Accuracy of the Answers.

Finally, you can run the following script to evaluate the accuracy of the answers.

bash bash/evaluate_infer.sh
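A common accuracy metric for such QA benchmarks is normalized exact match, sketched below; the repository's actual metric lives in `src/evaluate_infer.py` and may differ.

```python
import re
import string

def normalize(text):
    # SQuAD-style normalization: lowercase, strip punctuation,
    # drop articles, and collapse whitespace.
    text = text.lower()
    text = "".join(ch for ch in text if ch not in string.punctuation)
    text = re.sub(r"\b(a|an|the)\b", " ", text)
    return " ".join(text.split())

def exact_match(prediction, gold_answers):
    # Score 1 if the normalized prediction matches any reference answer.
    return int(any(normalize(prediction) == normalize(g) for g in gold_answers))
```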

πŸ“ Repository Structure

PAGER/
├── README.md
├── requirements.txt
├── output_data/               # Sample outputs generated by PAGER
├── figs/                      # README figures
├── bash/                      # Scripts used to run the experiments
└── src/
    ├── retriever/             # Deploy the retrieval service
    ├── construct_outline.py   # Construct the outline
    ├── construct_page.py      # Construct the page
    ├── extract_outline.py     # Initialize the page
    ├── infer_page.py          # Run inference on the questions
    └── evaluate_infer.py      # Evaluate the accuracy of the answers

📄 Acknowledgement

Our work is built on the following codebases, and we are deeply grateful for their contributions.

🥰 Citation

Please cite our paper if you find it useful for your research!

@article{li2026structuredknowledgerepresentationcontextual,
  title={Structured Knowledge Representation through Contextual Pages for Retrieval-Augmented Generation}, 
  author={Xinze Li and Zhenghao Liu and Haidong Xin and Yukun Yan and Shuo Wang and Zheni Zeng and Sen Mei and Ge Yu and Maosong Sun},
  year={2026},
  url={https://arxiv.org/abs/2601.09402}, 
}

📧 Contact

If you have questions, suggestions, or bug reports, please email:
