I thought it takes multiple frames as input. The Vid_multi dataset certainly returns multiple images.