In the paper's egg-cooking example, the model generates interleaved image-text output containing 5 images (~5K tokens), but Chameleon is trained with a 4K context. So I don't think it can generate such good outputs. I ran some experiments and found that it cannot generate 5 images.
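To make the token arithmetic explicit, here is a rough sketch. It assumes about 1024 tokens per image (inferred from ~5K tokens for 5 images) and reads "4K" as 4096; both constants and the helper names are my own assumptions for illustration, not values taken from the Chameleon code.

```python
# Rough token-budget check. All constants are assumptions, not read from the model config.
TOKENS_PER_IMAGE = 1024  # assumption: ~5K tokens / 5 images
CONTEXT_LENGTH = 4096    # assumption: "4K" context read as 4096 tokens


def image_tokens(num_images, tokens_per_image=TOKENS_PER_IMAGE):
    """Tokens consumed by the images alone, ignoring any text."""
    return num_images * tokens_per_image


def max_images(context_length=CONTEXT_LENGTH,
               tokens_per_image=TOKENS_PER_IMAGE,
               text_tokens=0):
    """Largest number of whole images that fits after reserving text tokens."""
    return max(0, (context_length - text_tokens) // tokens_per_image)


needed = image_tokens(5)
print(needed, needed > CONTEXT_LENGTH)  # 5120 True
print(max_images())                     # 4
print(max_images(text_tokens=200))      # 3
```

Under these assumptions, 5 images alone already exceed the context, and once any text is interleaved only 3 images fit.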