Thanks for your great work!
I can't reproduce the reported performance when fine-tuning LLaVA on the LingoQA dataset.

Could you please help me align the experiment settings?
- For the single-image input, I selected the middle frame of the image sequence. Is that correct?
- During training, is a multi-round or single-round QA format adopted?
- During fine-tuning, do you follow the standard LLaVA recipe, i.e. freeze the visual encoder weights and fine-tune both the projection layer and the full LLM?
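For the first point, this is how I currently pick the middle frame (a minimal sketch; `frames` is assumed to be the ordered list of images for one sample). Please let me know if your indexing differs, especially for even-length sequences where "middle" is ambiguous:

```python
def select_middle_frame(frames):
    """Pick the middle frame of an ordered image sequence.

    For even-length sequences this takes the later of the two
    central frames (index len // 2); tell me if you use the
    earlier one (index (len - 1) // 2) instead.
    """
    if not frames:
        raise ValueError("empty frame sequence")
    return frames[len(frames) // 2]


# e.g. a 5-frame sequence -> index 2; a 4-frame sequence -> index 2
print(select_middle_frame([0, 1, 2, 3, 4]))
print(select_middle_frame([0, 1, 2, 3]))
```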