BLIP-2 implementation for training vision-language models: a Q-Former bridges frozen encoders and any LLM. Colab-ready notebooks, including a Mixture-of-Experts (MoE) variant.
computer-vision deep-learning pytorch transformer colab moe image-captioning llama image-to-text vlm multimodal mixture-of-experts vision-language-model blip2 siglip q-former
Updated Dec 19, 2025 · Jupyter Notebook
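
The description above names the core architecture: a Q-Former whose learnable query tokens cross-attend to features from a frozen image encoder, with the resulting query outputs projected into the LLM's embedding space as soft prompts. As a rough illustration only (not this repo's code), a minimal PyTorch sketch of that bridging module might look like the following; the class name `QFormerSketch`, all dimensions (e.g. a SigLIP-like 1152-dim encoder, a 4096-dim LLM), and the layer counts are assumptions.

```python
import torch
import torch.nn as nn


class QFormerSketch(nn.Module):
    """Minimal Q-Former stand-in: learnable queries cross-attend to frozen
    image features, then a linear head maps them into the LLM's embedding
    space. Hyperparameters are illustrative, not the repo's config."""

    def __init__(self, num_queries=32, q_dim=768, vision_dim=1152, llm_dim=4096):
        super().__init__()
        # Learnable query tokens shared across all images.
        self.queries = nn.Parameter(torch.randn(1, num_queries, q_dim) * 0.02)
        # Cross-attention from queries to (frozen) vision features.
        self.cross_attn = nn.MultiheadAttention(
            q_dim, num_heads=12, kdim=vision_dim, vdim=vision_dim, batch_first=True
        )
        # One self-attention block so queries can exchange information.
        self.self_block = nn.TransformerEncoderLayer(q_dim, nhead=12, batch_first=True)
        # Projection into the LLM embedding space (soft prompts).
        self.to_llm = nn.Linear(q_dim, llm_dim)

    def forward(self, image_feats):
        # image_feats: (B, num_patches, vision_dim) from a frozen encoder (e.g. SigLIP).
        q = self.queries.expand(image_feats.size(0), -1, -1)
        q, _ = self.cross_attn(q, image_feats, image_feats)  # queries attend to the image
        q = self.self_block(q)                               # queries attend to each other
        return self.to_llm(q)                                # (B, num_queries, llm_dim)


if __name__ == "__main__":
    # Fake frozen-encoder output: 2 images, 256 patch tokens, 1152 dims (assumed shapes).
    feats = torch.randn(2, 256, 1152)
    soft_prompts = QFormerSketch()(feats)
    print(soft_prompts.shape)  # torch.Size([2, 32, 4096]); prepend to the LLM's input embeddings
```

In the BLIP-2 recipe both the vision encoder and the LLM stay frozen, so only the Q-Former and the projection layer receive gradients; the MoE variant mentioned in the description would replace dense feed-forward layers with routed experts, which this sketch does not attempt to show.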