📄 arXiv:2503.12303
📁 Google Drive (Additional Resources, e.g., Paper, Training Data, Model Checkpoints)
Recent progress in (multimodal) large language models ((M)LLMs) has shifted focus from pre-training to inference-time compute scaling and post-training optimization, driven by concerns over limited high-quality real-world data. However, these strategies alone are insufficient for advancing model capabilities.
We hypothesize that effective model improvement requires a strong synergy among pre-training, inference-time compute scaling, and post-training optimization.
In this paper, we validate this hypothesis in the context of multimodal pre-training for foundation MLLM construction. We introduce Self-Improving Cognition (SICOG), a self-learning framework for constructing next-generation foundation MLLMs by imparting multimodal knowledge and enhancing their systematic cognitive capabilities through multimodal pre-training with self-generated data.
Specifically, we introduce Chain-of-Description (CoD), a step-by-step visual understanding method to improve comprehensive perception, and integrate structured chain-of-thought (CoT) reasoning to support in-depth multimodal reasoning.
SICOG first equips a base model with systematic perception and reasoning using minimal external supervision. The enhanced model then generates candidate image captions and CoT-style reasoning responses for unlabeled images and image-question pairs across diverse tasks, which are curated through a self-consistency mechanism.
These curated samples are subsequently used for large-scale multimodal pre-training, completing a self-learning cycle that strengthens the model’s cognitive foundation.
Extensive experiments demonstrate that SICOG produces next-generation foundation MLLMs with substantially improved multimodal cognition, outperforming prevailing pre-training approaches. These findings empirically establish SICOG as a promising framework for realizing a complete self-improving paradigm.
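At a glance, the self-learning cycle can be summarized as the following sketch; the step functions are passed in as parameters because they are illustrative placeholders, not the repository's API.

# Conceptual outline of the SICOG cycle described above (illustrative only).
def sicog_cycle(base_model, seed_annotations, unlabeled_images, unlabeled_qa,
                finetune, generate_candidates, self_consistency_filter, pretrain):
    # Step 1: develop systematic perception/reasoning with minimal supervision.
    cognition_model = finetune(base_model, seed_annotations)
    # Step 2: self-generate candidate captions and CoT-style responses.
    captions = generate_candidates(cognition_model, unlabeled_images)
    responses = generate_candidates(cognition_model, unlabeled_qa)
    # Step 3: curate the candidates via self-consistency.
    curated = self_consistency_filter(captions + responses)
    # Step 4: large-scale multimodal pre-training on the curated data.
    return pretrain(base_model, curated)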
To replicate the results, initialize the Python environment with the following commands:
conda create -n SICOG python=3.10
conda activate SICOG
pip install imgaug
pip install openpyxl
pip install --upgrade pip # enable PEP 660 support
pip install torch==2.1.2
pip install -e .
pip install -e ".[train]"
pip install flash-attn --no-build-isolation

Download the following checkpoints and place them in the /model_weight/ directory: CLIP-ViT-L/14-336, Qwen2-7B-Instruct.
Caption Data Generation
Navigate to the ./training_data_annotation directory and download the Allava-VFLAN 35k split:
export ORIGINAL_CAPTION_FILE={path_to_original_caption_data}
export IMAGE_DIR={path_to_corresponding_images}
export OUTPUT_COD_DIR={path_to_cod_data}
export OUTPUT_DD_DIR={path_to_dd_data}
bash revise_cod.sh

Mix the generated caption data to create the 35k_annotated_caption_data.
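A minimal merge sketch, assuming each generated split is a JSON list of samples (all file names here are placeholders, not repo outputs); the same merge applies to the reasoning splits below.

import json

# Hypothetical merge of the CoD and detailed-description caption splits
# into one annotated caption file; the paths are placeholders.
def merge_json(paths, out_path):
    merged = []
    for path in paths:
        with open(path) as f:
            merged.extend(json.load(f))
    with open(out_path, "w") as f:
        json.dump(merged, f, indent=2)

merge_json(["cod_data.json", "dd_data.json"], "35k_annotated_caption_data.json")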
Reasoning Data Generation
Download the LLaVA-CoT 35k split and replace its structured tags (e.g., <SUMMARY>, <REASONING>) with the corresponding sentences:
export ORIGINAL_COT_FILE={path_to_original_reasoning_data}
export OUTPUT_COT_DIR={path_to_revised_cot_data}
export OUTPUT_SHORT_DIR={path_to_revised_short_answer_data}
bash revise_cot.sh

Combine the generated reasoning data to create the 35k_annotated_reasoning_data.
Experiments were conducted on 4 nodes × 8 NVIDIA A100 80GB GPUs (32 GPUs in total).
- Developing Systematic Multimodal Cognition With Minimal Annotations
Multimodal Perception Development
export FINETUNE_LEARNING_RATE=2e-5
export FINETUNE_BATCH=4
export ACCU_STEPS=1
export PREV_STAGE_CHECKPOINT={base_model_checkpoint}
export EXPERIMENT_NAME={stage0_ckpt_perception}
export DATA_PATH={path_to_annotated_caption_data}
# Stage 0: Warmup for Perception Ability
bash training_scripts/uhdv1-finetune-qwen-stage0-warmup.sh

Multimodal Reasoning Development
export FINETUNE_LEARNING_RATE=2e-5
export FINETUNE_BATCH=4
export ACCU_STEPS=1
export PREV_STAGE_CHECKPOINT={base_model_checkpoint}
export EXPERIMENT_NAME={stage0_ckpt_reasoning}
export DATA_PATH={path_to_annotated_reasoning_data}
# Stage 0: Warmup for Reasoning Ability
bash training_scripts/uhdv1-finetune-qwen-stage0-warmup.sh

- Generating Candidate Captions and Responses for Pre-Training Data
Please refer to ./recap_scripts for the recaptioning scripts (comments are provided in the files).
Caption: You can change the type of prompt; 'cod-default' asks the model to describe the image step-by-step, while 'default' asks it to provide a detailed caption directly.
export EXPERIMENT_NAME={recaption_job_cod}
export MODEL_PATH={stage0_ckpt_perception}
export DATA_PATH={unlabeled_image_data_path}
# Change the sampling parameters to generate multiple candidates (see the sketch after these commands)
export TEMPERATURE=0.7
export BEAMS=1
export TOP_P=0.9
export PROMPT="cod-default" # Step-by-step description
bash recap_scripts/recap_inference.sh

export EXPERIMENT_NAME={recaption_job_default}
export PROMPT="default" # Detailed description
bash recap_scripts/recap_inference.sh
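To produce multiple candidates per image, one option is to rerun the inference script with different sampling settings, as suggested by the comment above. A minimal driver sketch follows; the loop and naming scheme are illustrative, not part of the repository.

import os
import subprocess

# Rerun recap_inference.sh with several temperatures, giving each run its
# own experiment name so the outputs are kept separate.
for temp in ["0.2", "0.7", "1.0"]:
    env = dict(os.environ)
    env["TEMPERATURE"] = temp
    env["EXPERIMENT_NAME"] = f"recaption_job_cod_temp_{temp}"
    subprocess.run(["bash", "recap_scripts/recap_inference.sh"],
                   env=env, check=True)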
Reasoning: You can change the type of prompt; 'cot-default' asks the model to answer the question step-by-step, while 'default' asks it to answer directly.

export EXPERIMENT_NAME={reasoning_job_cot}
export MODEL_PATH={stage0_ckpt_reasoning}
export DATA_PATH={unlabeled_image_question_pair_data_path}
export SPLIT={data_split}
export TEMPERATURE=0.7
export BEAMS=1
export TOP_P=0.9
export PROMPT="cot-default" # Step-by-step reasoning
bash recap_scripts/reasoning_inference.sh

export EXPERIMENT_NAME={reasoning_job_default}
export PROMPT="default" # Direct answer
bash recap_scripts/reasoning_inference.sh

- Curating Self-Generated Pre-Training Data via Self-Consistency-Guided Quality Evaluation
# Caption Quality Evaluation
export RAW_DATASET={path_to_generated_caption_data}
export MODEL_NAME_OR_PATH={stage0_ckpt_perception}
bash quality_evaluation/scripts/recaption_quality_eval.sh

# Reasoning Quality Evaluation
export RAW_DATASET={path_to_generated_reasoning_data}
export MODEL_NAME_OR_PATH={stage0_ckpt_reasoning}
bash quality_evaluation/scripts/reasoning_quality_eval.sh

Combine the filtered caption and reasoning data to create self_generated_pre_training_data.
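Conceptually, self-consistency-guided curation keeps a response only when its final answer agrees with the majority of the sampled candidates. The quality_evaluation scripts implement the actual procedure; the toy sketch below only illustrates the voting idea, assuming each candidate carries an extracted final_answer field (an assumption for illustration).

from collections import Counter

def most_consistent(candidates):
    """Return the candidate whose final answer matches the majority vote,
    or None when no majority exists (the example is then discarded)."""
    answers = [c["final_answer"] for c in candidates]
    majority_answer, votes = Counter(answers).most_common(1)[0]
    if votes <= len(candidates) // 2:
        return None
    return next(c for c in candidates if c["final_answer"] == majority_answer)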
- Constructing the Next-Generation MLLM Through Multimodal Pre-Training
Please refer to ./training_scripts for the pre-training and fine-tuning scripts (comments are provided in the files). An example is attached below.
Please ensure that the total batch size is 256 during the alignment phase and 128 during the fine-tuning phase.
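For reference, the total batch size equals the per-device batch size × the number of GPUs × ACCU_STEPS; for example, a per-device batch of 8 on the 32 GPUs noted above with ACCU_STEPS=1 gives 8 × 32 × 1 = 256. Adjust the variables below so the products match these totals on your hardware.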
export PRETRAIN_LEARNING_RATE=2e-4
export FINETUNE_LEARNING_RATE=2e-5
export PRETRAIN_BATCH=8
export FINETUNE_BATCH=4
export ACCU_STEPS=2

export EXPERIMENT_NAME={stage1-align}
export DATA_PATH={path_to_alignment_data}
# stage1 Modality Alignment
bash training_scripts/uhdv1-pretrain-qwen-stage1-align.sh

export EXPERIMENT_NAME={stage15-multimodal-pre-training}
export DATA_PATH={path_to_self_generated_pre_training_data}
export MM_PROJECTOR_CHECKPOINT={stage1-mm-projector-checkpoint}
export LLM_CKPT_DIR={Qwen-checkpoint}
# stage15 Self-learning via Multimodal Pre-Training
bash training_scripts/uhdv1-finetune-qwen-stage15-sft.sh

export EXPERIMENT_NAME={stage2-sft}
export DATA_PATH={path_to_instruction_tuning_data}
export PREV_STAGE_CHECKPOINT={stage15_checkpoint}
# stage2 Visual Instruction-Tuning
bash training_scripts/uhdv1-finetune-qwen-stage2-sft.sh

To reproduce the DPO variants for systematic cognition development, you can use the following example script to generate noisy reasoning data.
export EXPERIMENT_NAME={noisy_reasoning_generation}
export MODEL_PATH={stage0_ckpt_reasoning}
export DATA_PATH={unlabeled_image_question_pair_data_path}
export SPLIT={data_split}
export AUGMENT_METHODS={method_to_corrupt_image} # e.g., "gaussian_blur", "color_jitter", etc.
export TEMPERATURE=0.7
export BEAMS=1
export TOP_P=0.9
export PROMPT="cot-default" # Step-by-step reasoning
bash recap_scripts/reasoning_inference.sh
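For reference, the sketch below shows what the image corruption behind AUGMENT_METHODS might look like, using the imgaug package installed earlier; the specific augmenters and parameters are illustrative, not the repository's exact pipeline. Responses generated on corrupted images can then serve as the rejected side of DPO preference pairs.

import numpy as np
from PIL import Image
import imgaug.augmenters as iaa

# Illustrative corruption methods (a hypothetical mapping, not the repo's).
CORRUPTIONS = {
    "gaussian_blur": iaa.GaussianBlur(sigma=(1.0, 3.0)),
    "color_jitter": iaa.AddToHueAndSaturation((-40, 40)),
}

def corrupt(image_path, method, out_path):
    # Load the image, apply the chosen corruption, and save the noisy copy.
    image = np.asarray(Image.open(image_path).convert("RGB"))
    augmented = CORRUPTIONS[method].augment_image(image)
    Image.fromarray(augmented).save(out_path)

corrupt("example.jpg", "gaussian_blur", "example_blurred.jpg")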
If any problem occurs, you can refer to the LLaVA-UHD repository for solutions.

VLMEvalKit: We use VLMEvalKit to evaluate our model's performance on the general benchmarks. You can follow the VLMEvalKit setup instructions to run the evaluation.
- Thanks to the contributors of the LLaVA-NeXT project (listed alphabetically by first name): Bo Li, Dong Guo, Feng Li, Hao Zhang, Kaichen Zhang, Renrui Zhang, Yuanhan Zhang, led by Chunyuan Li and with guidance and help from Haotian Liu.
If you find SICOG useful for your research and applications, please cite using this BibTeX:
@article{zhang2025will,
title={Will Pre-Training Ever End? A First Step Toward Next-Generation Foundation MLLMs via Self-Improving Systematic Cognition},
author={Zhang, Xiaoying and Peng, Da and Zhang, Yipeng and Guo, Zonghao and Wu, Chengyue and Chen, Chi and Ke, Wei and Meng, Helen and Sun, Maosong},
journal={arXiv preprint arXiv:2503.12303},
year={2025}
}