Inference with multiple non-text modalities at once

Hello, I have a question. The current code seems to support only one non-text modality plus text per inference call. Is it possible to input multiple modalities (such as audio, video, and text) in a single inference call?