Construction-Disclosure-Documents-Reviewing-System

This is the official repo for our paper "Generative Knowledge-Guided Review System for Construction Disclosure Documents" (Advanced Engineering Informatics, 2025).

Data

  • Example data files are in ./example/.

Weights

The pretrained weights can be downloaded from google_drive.
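
Once downloaded, the checkpoint can be sanity-checked before use. The snippet below is a minimal sketch that assumes the file is a plain PyTorch state_dict (the filename matches the default output of train_extract.py described in the Train section); adjust the path to wherever you saved the weights.

# Minimal sketch: inspect the downloaded checkpoint.
# Assumption: the file is a plain PyTorch state_dict saved by train_extract.py.
import torch

state_dict = torch.load('pretrain-model.pth', map_location='cpu')
print(f"{len(state_dict)} parameter tensors loaded")
for name in list(state_dict)[:5]:
    print(name, tuple(state_dict[name].shape))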

Train

You can train the extraction module with the following commands:

# Train with default parameters
python train_extract.py -i dataset.csv

# Custom output file and training parameters
python train_extract.py -i dataset.csv -o my_model.pth -e 100 -l 1e-5 -b 32

# Use different BERT model
python train_extract.py -i dataset.csv --model_name bert-base-multilingual-cased

Command Line Arguments

| Argument | Short | Default | Description |
| --- | --- | --- | --- |
| --input | -i | Required | Input CSV file path |
| --output | -o | pretrain-model.pth | Output model file path |
| --epochs | -e | 200 | Number of training epochs |
| --learning_rate | -l | 5e-6 | Learning rate |
| --batch_size | -b | 16 | Batch size |
| --split_ratio | -s | 0.9 | Train/validation split ratio |
| --max_length | -m | 512 | Maximum sequence length |
| --weight_decay | -w | 0.01 | Weight decay |
| --warmup_steps | | 0 | Number of warmup steps |
| --print_interval | | 20 | Interval at which the F1 score is printed |
| --model_name | | bert-base-chinese | BERT model name |

Dataset Format

The input CSV file should contain the following columns:

| Column | Description | Required | Example |
| --- | --- | --- | --- |
| Query | The input text/query to be processed | Yes | "How to train a machine learning model?" |
| max | Highest-priority chunk/span to extract | Yes | "machine learning model" |
| mid | Medium-priority chunk/span to extract | No (may be empty) | "train" |
| lit | Low-priority chunk/span to extract | No (may be empty) | "How to" |

Sample CSV Structure

Query,max,mid,lit
"How to train a machine learning model?","machine learning model","train","How to"
"What is deep learning?","deep learning",,
"Explain neural networks","neural networks","Explain",

Inference

Retrieval inference example.

from CDDRS import GKGR

source_knowledge_base = 'path_to_knowledge_base'
query = 'your_retrieval_query'
retrieval_result = GKGR(
    query, 
    source_knowledge_base, 
    topk=3, 
    llm='deepseek', 
    api='your_own_deepseek_api', 
    base_url='https://api.deepseek.com'
)

Parameters

Core Parameters

| Parameter | Type | Default | Description |
| --- | --- | --- | --- |
| query | str | Required | The search query text |
| source_knowledge_base | str | Required | Path to the document directory |
| topk | int | 3 | Number of top results to return |
| llm | str | 'deepseek-chat' | LLM model name ('gpt-4o', 'deepseek-chat', etc.) |
| api | str | 'your-api-key' | API key for the LLM service |
| base_url | str | 'https://api.deepseek.com/v1' | API base URL |
| embedding_model | str | './models/bge-m3' | Path to the embedding model |
| bert_model_path | str | 'pretrain_model.pth' | Path to the BERT query-expansion model |
| chunk_size | int | 512 | Maximum document chunk size for processing |
| retrieval_mode | str | 'gkgr' | Retrieval mode: 'vector', 'kg', or 'gkgr' |

Advanced Parameters

| Parameter | Type | Default | Description |
| --- | --- | --- | --- |
| force_reinit | bool | False | Force reinitialization of cached instances |
| fusion_weights | List[float] | [0.6, 0.4] | Weights for combining vector and KG retrieval |
| expansion_weights | List[float] | [0.5, 0.3, 0.2] | Weights for the original and expanded queries |
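
Combining the core and advanced options, a fuller call might look like the sketch below. The keyword names are taken directly from the tables above, but treating them all as keyword arguments of GKGR is an assumption; adapt it to the actual signature if it differs.

# Sketch: a fuller GKGR call combining core and advanced parameters.
# Keyword names follow the parameter tables above; values are illustrative.
from CDDRS import GKGR

result = GKGR(
    'your_retrieval_query',
    'path_to_knowledge_base',
    topk=5,
    llm='deepseek-chat',
    api='your_own_deepseek_api',
    base_url='https://api.deepseek.com/v1',
    embedding_model='./models/bge-m3',
    bert_model_path='pretrain_model.pth',
    chunk_size=512,
    retrieval_mode='gkgr',              # 'vector', 'kg', or 'gkgr'
    fusion_weights=[0.6, 0.4],          # vector vs. KG retrieval
    expansion_weights=[0.5, 0.3, 0.2],  # original query vs. expanded queries
)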

Test

from utils.test import retrieve_test, generate_test

annotated_files = 'retrieve_files_with_annotated'  # path to retrieval files with annotations
metric = ['MRR', 'Acc']  # for generation testing, use ['F1']
test_results = retrieve_test(annotated_files, metric, mode='retrieve')
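
For the generation side, the analogous call presumably goes through generate_test. The sketch below assumes it shares retrieve_test's signature, with 'F1' as the metric and a 'generate' mode string (both assumptions); the file path is hypothetical.

# Sketch: generation-quality test, mirroring the retrieval example above.
# Assumption: generate_test shares retrieve_test's signature.
from utils.test import generate_test

generated_files = 'generated_files_with_annotated'  # hypothetical path
gen_results = generate_test(generated_files, ['F1'], mode='generate')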

Cite

@article{XIAO2025103618,
  title = {Generative knowledge-guided review system for construction disclosure documents},
  journal = {Advanced Engineering Informatics},
  volume = {68},
  pages = {103618},
  year = {2025},
  issn = {1474-0346},
  doi = {10.1016/j.aei.2025.103618},
  url = {https://www.sciencedirect.com/science/article/pii/S1474034625005117},
  author = {Hongru Xiao and Jiankun Zhuang and Bin Yang and Jiale Han and Yantao Yu and Songning Lai},
  keywords = {Construction documents review, Large language model (LLM), Knowledge-guided retrieval, Natural Language Processing (NLP)}
}
