How did you incorporate the generated visual tokens into the input? Did you just feed their token IDs through the embedding layer?