A compact vision-language model that you can pretrain and finetune on a single consumer GPU, such as an NVIDIA RTX 4090 with 24 GB of VRAM.
- 08/23/2025: Created a new model based on Qwen3-0.6B-base with SigLIP2-so400m: keeeeenw/MicroLlava-Qwen3-0.6B-base-siglip2-so400m. This model has ~1B parameters and achieves a 78.5 VQAv2 score, on par with the original LLaVA 1.5 (7B).
- 08/23/2025: Added Qwen3 support to TinyLLaVA_Factory, including:
- A new chat template for Qwen3 integration
- Training and evaluation scripts with hyperparameters for a single Nvidia 4090
- Various compatibility fixes such as transformers upgrade required for the new Qwen3-0.6B-base model
- 08/17/2025: The Hugging Face repo was renamed to https://huggingface.co/keeeeenw/MicroLlava.
- 08/17/2025: Improved the VQAv2 test-dev average score from 44.01% to 56.91% by upgrading the vision tower from SigLIP to SigLIP2.
- 08/09/2025: Initial version of MicroLlava released.
```python
from transformers import AutoTokenizer, AutoModelForCausalLM

# Load model from Hugging Face
hf_path = 'keeeeenw/MicroLlava'
model = AutoModelForCausalLM.from_pretrained(hf_path, trust_remote_code=True)
# model.cuda()  # Enable CUDA if needed - model runs fairly quickly on CPU

# Setup tokenizer
config = model.config
tokenizer = AutoTokenizer.from_pretrained(
    hf_path,
    use_fast=False,
    model_max_length=config.tokenizer_model_max_length,
    padding_side=config.tokenizer_padding_side
)

# Run inference
prompt = "What are the things I should be cautious about when I visit here?"
image_url = "https://llava-vl.github.io/static/images/view.jpg"
output_text, generation_time = model.chat(
    prompt=prompt,
    image=image_url,
    tokenizer=tokenizer
)

print(f'Model output: {output_text}')
print(f'Generation time: {generation_time}')
```

| Component | Details |
|---|---|
| Framework | Transformers + PyTorch |
| Language Model | MicroLlama (~300M parameters) |
| Vision Encoder | SigLIP2-SO400M |
| Training Hardware | Single NVIDIA RTX 4090 |
| Checkpoint Format | SafeTensors |
| License | Apache 2.0 |
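
As a quick sanity check on the component sizes above, you can count parameters directly on the loaded model. This is an illustrative sketch: the per-component attribute names (`language_model`, `vision_tower`, `connector`) are an assumption based on the TinyLLaVA Factory layout and may not match the checkpoint exactly.

```python
from transformers import AutoModelForCausalLM

# Load the checkpoint (remote code provides the TinyLLaVA-style wrapper class).
model = AutoModelForCausalLM.from_pretrained("keeeeenw/MicroLlava", trust_remote_code=True)

def count_params(module) -> float:
    """Return the number of parameters in a module, in millions."""
    return sum(p.numel() for p in module.parameters()) / 1e6

print(f"Total parameters: {count_params(model):.1f}M")

# Per-component breakdown. The attribute names are assumptions (TinyLLaVA Factory
# convention) and are skipped gracefully if the checkpoint names them differently.
for name in ("language_model", "vision_tower", "connector"):
    submodule = getattr(model, name, None)
    if submodule is not None:
        print(f"{name}: {count_params(submodule):.1f}M")
```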
- Single GPU Training: Train on consumer hardware without DeepSpeed
- Fast Training: Pretraining takes ~5 hours and finetuning ~12 hours on an RTX 4090
- Compact: Only ~300M language model parameters
- Vision-Language Tasks: Visual question answering and image captioning
- Easy Iteration: Well suited for research and experimentation
Results with the SigLIP2 vision tower:

| Question Type | Accuracy |
|---|---|
| Yes/No | 72.32% |
| Number | 43.89% |
| Other | 46.65% |
| Overall | 56.91% |
Evaluated on VQAv2 test-dev split
Previous results with the original SigLIP vision tower:

| Question Type | Accuracy |
|---|---|
| Yes/No | 65.08% |
| Number | 28.97% |
| Other | 29.32% |
| Overall | 44.01% |
Evaluated on VQAv2 test-dev split
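
For context on how the Overall numbers above are produced, VQAv2 scores each predicted answer with the standard VQA accuracy, roughly min(#annotators who gave that answer / 3, 1). The snippet below is a simplified illustration of that metric (the official evaluation additionally normalizes answers and averages over annotator subsets); the reported results come from the official evaluation, not from this sketch.

```python
def vqa_accuracy(predicted_answer: str, human_answers: list[str]) -> float:
    """Simplified VQA accuracy: a prediction counts as fully correct when at
    least 3 of the 10 human annotators gave exactly the same answer."""
    matches = sum(1 for answer in human_answers if answer == predicted_answer)
    return min(matches / 3.0, 1.0)

# Example: 2 of 10 annotators agree with the prediction -> accuracy ~0.67.
annotations = ["blue", "blue", "light blue", "navy"] + ["dark blue"] * 6
print(vqa_accuracy("blue", annotations))
```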
Evaluations on additional benchmarks are not yet available, including:
- the VQAv2 test set (instead of test-dev)
- datasets from the TinyLLaVA evaluation suite

Community contributions with benchmark results are welcome and encouraged.
This model is based on TinyLLaVA Factory with optimizations for single GPU training.
- Pretraining: ~5 hours on LAION-CC-SBU-558K
- Finetuning: ~12 hours on TinyLLaVA datasets
Pretraining Hyperparameters:
- gradient_accumulation_steps: 2 → 8
- learning_rate: 1e-3 → 2.5e-4
- warmup_ratio: 0.03 → 0.06
- bfloat16: True after the SigLIP2 upgrade (improved stability)
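
For reference, here is how those adjusted pretraining values map onto Hugging Face `TrainingArguments`. The actual runs are driven by the TinyLLaVA_Factory training scripts, so treat this purely as an illustrative sketch; `output_dir` and `per_device_train_batch_size` are placeholders, not values from the original recipe.

```python
from transformers import TrainingArguments

# Illustrative mapping of the adjusted pretraining hyperparameters onto
# Hugging Face TrainingArguments. The real training uses the TinyLLaVA_Factory
# scripts; output_dir and the per-device batch size are placeholders.
pretrain_args = TrainingArguments(
    output_dir="./checkpoints/microllava-pretrain",  # placeholder path
    per_device_train_batch_size=8,                   # placeholder, not from the recipe
    gradient_accumulation_steps=8,                   # 2 -> 8
    learning_rate=2.5e-4,                            # 1e-3 -> 2.5e-4
    warmup_ratio=0.06,                               # 0.03 -> 0.06
    bf16=True,                                       # enabled after the SigLIP2 upgrade
)
```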
Finetuning:
- Precision: bfloat16 (improved stability)
- Same major hyperparameters as the original TinyLLaVA
- Clone the training repository:

  ```bash
  git clone https://github.com/keeeeenw/TinyLLaVA_Factory.git
  cd TinyLLaVA_Factory
  ```

- Follow the training guides in the repository for pretraining and finetuning steps.
- Research: Vision-language experimentation on limited hardware
- Education: Learning VLM concepts and implementations
- Prototyping: Quick iteration for domain-specific applications
- Finetuning: Starting point for specialized vision-language tasks
- Small model size may limit complex reasoning capabilities
- OCR performance may be limited compared to larger models
- Performance varies with image quality and domain
- Minimal safety filtering - implement safeguards for production use
Warning: This model should not be used for safety-critical applications without thorough human review and additional safeguards.
- MicroLlama - The base language model
- TinyLLaVA Factory - Training framework
- SigLIP2 - Vision encoder
```bibtex
@misc{wang2025microllava,
    title  = {MicroLLaVA: a TinyLLaVA based VLM with MicroLlama 300M for single GPU training},
    author = {Zixiao Ken Wang},
    year   = {2025},
    url    = {https://huggingface.co/keeeeenw/MicroLlava}
}
```

We welcome contributions! Please see our Contributing Guidelines for details. Areas of particular interest:
- Additional evaluation benchmarks
- Performance optimizations
- Documentation improvements
- Example applications
This project is licensed under the Apache License 2.0 - see the LICENSE file for details.
Special thanks to:
- TinyLLaVA Factory team for the training framework
- SigLIP2 authors for the efficient vision encoder
- LAION community for the pretraining datasets
- Hugging Face for model hosting and tools
⭐ Star this repository if you find it useful! ⭐
For questions and support, please open an issue or check out the Hugging Face model page.