Description
Hi team,
Thanks for the excellent work on Anole!
I'm currently exploring multimodal generation and training using Anole. I understand that Anole uses its own transformer library for model loading and distributed inference. I’d like to ask:
If I want to continue leveraging Anole’s multimodal capabilities (both image and text) for inference and fine-tuning, is it strictly necessary to use the anole/transformer library, or is it possible to use Hugging Face's transformers library instead?
In particular:
If I comment out these two lines in Hugging Face's transformers library (in modeling_chameleon.py):
https://github.com/huggingface/transformers/blob/30567c28e81be1ba09249aa5589b8227653ab073/src/transformers/models/chameleon/modeling_chameleon.py

```python
image_tokens = self.model.vocabulary_mapping.image_tokens
logits[:, :, image_tokens] = torch.finfo(logits.dtype).min
```

can I then load the GAIR/Anole-7b-v0.1 checkpoint and keep using Hugging Face's native interfaces, such as ChameleonForConditionalGeneration, for multimodal generation with both image and text tokens?
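For concreteness, this is roughly the usage I have in mind. It is only a minimal, untested sketch: it assumes the GAIR/Anole-7b-v0.1 weights are in (or have been converted to) the Hugging Face Chameleon format, that the two masking lines above are commented out in a local copy of transformers, and the prompt and generation settings are purely illustrative.

```python
# Minimal sketch (untested). Assumes:
#  - the checkpoint is loadable in the Hugging Face Chameleon format,
#  - the two image-token masking lines in modeling_chameleon.py are removed locally.
import torch
from transformers import ChameleonForConditionalGeneration, ChameleonProcessor

model_id = "GAIR/Anole-7b-v0.1"  # assumed to be HF-format weights
processor = ChameleonProcessor.from_pretrained(model_id)
model = ChameleonForConditionalGeneration.from_pretrained(
    model_id, torch_dtype=torch.bfloat16, device_map="auto"
)

# Text-only prompt; the goal is for generate() to emit image tokens freely
# once the logit masking is no longer applied.
prompt = "Please draw a picture of a snowman."
inputs = processor(text=prompt, return_tensors="pt").to(model.device)

output_ids = model.generate(**inputs, max_new_tokens=1024, do_sample=True)

# Decoding here only shows the raw token stream; turning generated image
# tokens back into pixels would still require the VQ image decoder (not shown).
print(processor.batch_decode(output_ids, skip_special_tokens=False)[0])
```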
Any clarification would be greatly appreciated. Thanks in advance!