Why does DeformableDETR take samples with shape [batch_size x 3 x H x W] instead of [batch_size x T x 3 x H x W]?

I thought the model takes multiple frames as input.
![Image](https://github.com/user-attachments/assets/e54565c4-8c7b-4334-aa51-7bce57bb0c54)
The Vid_multi dataset certainly returns multiple images:
![Image](https://github.com/user-attachments/assets/ca78d264-42af-4317-abb2-02db5fa4f4be)
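
To make the shape question concrete, here is a minimal sketch of the two layouts. The sizes and variable names are made up for illustration, and folding the frame axis into the batch axis is only my guess at how a 4-D input could still carry several frames; I am not claiming this is what the repository actually does.

```python
import torch

# Hypothetical sizes, chosen only for illustration.
batch_size, num_frames, height, width = 2, 4, 64, 96

# The layout I expected the model to receive: one clip per batch entry.
clip = torch.randn(batch_size, num_frames, 3, height, width)

# One possible way to match the documented 4-D input (an assumption):
# merge the frame axis T into the batch axis.
folded = clip.reshape(batch_size * num_frames, 3, height, width)

# The fold is reversible, so per-clip grouping could be restored later.
restored = folded.reshape(batch_size, num_frames, 3, height, width)

print(tuple(clip.shape), tuple(folded.shape))
```

If something like this happens inside the dataset or the collate function, then the `batch_size` in the docstring would effectively mean `batch_size * T`. Is that the intended reading?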