Fix video indexing bugs in dataset aggregation with multi video datasets #2550
michel-aractingi merged 6 commits into main
Pull request overview
This PR fixes two critical bugs in video indexing during dataset aggregation that caused incorrect frame access when loading data from aggregated multi-video datasets. The fixes ensure that episodes correctly reference their destination video files and that timestamps remain within the bounds of their respective video durations.
- Implements per-source-to-destination file mapping to ensure episodes point to the correct video files
- Changes video offset tracking from global to per-destination-file to prevent timestamp overflow
- Adds defensive type conversion for numpy integers used as dictionary keys
Let me know if it works for you @brysonjones @nicholas-maselli @andras-makany @Grigorij-Dudnik
Hi @michel-aractingi, I can confirm that the error described in #2212 that I was facing is resolved by this MR. I've tested this with a few different large datasets (4 video views and 2k episodes) and didn't hit issues when merging any of them. Thank you for working on this!
Hi @michel-aractingi, I've also had the problem on a personal dataset and discovered this PR. I can confirm that it indeed solves my issue with timestamps.
Tested - ACT policy got trained on a filtered and glued dataset. Btw, I previously tested the fix by @andras-makany; it was also working.
Awesome! My apologies, my GitHub notifications were set incorrectly so I didn't see this.
What this does
Fix critical bugs in src/lerobot/datasets/aggregate.py that caused incorrect video frame indexing when aggregating multiple datasets. These bugs resulted in episodes pointing to the wrong video files or having timestamps that exceeded the actual video durations, causing failures when the dataloader attempted to access a frame.
1. Episodes assigned to wrong destination video files
In update_meta_data(), all episodes from a source dataset were assigned the same destination chunk/file indices, regardless of which destination file their source video was actually written to.
After the fix, each source video file is mapped to the destination file it was written to, so episodes point to the correct video files.
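The per-source-to-destination mapping can be illustrated with a minimal sketch. This is not the code in aggregate.py: the function names, the `(chunk, file)` tuple keys, the `src`/`dst` episode fields, and the duration-based rollover rule are all assumptions made for illustration only.

```python
# Hypothetical sketch, not the actual aggregate.py implementation.

def assign_destinations(src_videos, max_duration_s):
    """Map each source (chunk, file) key to the destination (chunk, file)
    its video is written to.

    src_videos: list of ((chunk, file), duration_s) in write order.
    Assumption: a new destination file is started when adding the next
    video would exceed max_duration_s.
    """
    mapping = {}
    dst_file = 0
    used = 0.0
    for key, duration in src_videos:
        if used > 0 and used + duration > max_duration_s:
            dst_file += 1  # roll over to the next destination file
            used = 0.0
        mapping[key] = (0, dst_file)  # single destination chunk, for simplicity
        used += duration
    return mapping


def remap_episodes(episodes, mapping):
    """Give each episode the destination of its own source file,
    rather than one shared destination for the whole source dataset."""
    return [{**ep, "dst": mapping[ep["src"]]} for ep in episodes]


src_videos = [((0, 0), 500.0), ((0, 1), 500.0), ((0, 2), 500.0), ((0, 3), 500.0)]
mapping = assign_destinations(src_videos, max_duration_s=1000.0)
episodes = remap_episodes(
    [{"episode_index": 0, "src": (0, 0)}, {"episode_index": 1, "src": (0, 2)}],
    mapping,
)
```

The point of the lookup in `remap_episodes` is that two episodes from the same source dataset can land in different destination files, which a single shared index cannot express.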
2. Video offsets tracked globally instead of per-destination-file
When concatenating videos from multiple source datasets into destination files, offsets were tracked as a running total across ALL files. This caused episodes to have timestamps that exceeded their actual video file's duration.
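Example of the bug:
- Source A is written to dst-file-0 (duration: 500s)
- Source B is written to dst-file-0 (total: 1000s)
- Source C is written to dst-file-1 (duration: 500s)
- Source D is written to dst-file-1
With the bug, Source D's offset would be ~1500s (total of A+B+C) instead of 500s (just C's duration in dst-file-1).
After the fix, offsets are tracked per destination file, so Source D's offset is 500s.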
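The difference between global and per-destination-file offset tracking can be sketched as follows. The function names and the `(source, dst_file, duration_s)` tuples are hypothetical, and the 500s durations for Sources B and D are assumptions chosen to match the example; this is not the code in aggregate.py.

```python
from collections import defaultdict

# Hypothetical sketch, not the actual aggregate.py implementation.

def offsets_global(writes):
    """Buggy behaviour: one running total across ALL destination files."""
    total = 0.0
    offsets = {}
    for source, _dst_file, duration_s in writes:
        offsets[source] = total
        total += duration_s
    return offsets


def offsets_per_destination(writes):
    """Fixed behaviour: a separate running total for each destination file."""
    next_offset = defaultdict(float)  # dst_file -> seconds already written
    offsets = {}
    for source, dst_file, duration_s in writes:
        offsets[source] = next_offset[dst_file]
        next_offset[dst_file] += duration_s
    return offsets


# (source, destination file, duration in seconds), in write order.
writes = [
    ("A", "dst-file-0", 500.0),
    ("B", "dst-file-0", 500.0),  # assumed duration
    ("C", "dst-file-1", 500.0),
    ("D", "dst-file-1", 500.0),  # assumed duration
]

buggy = offsets_global(writes)
fixed = offsets_per_destination(writes)
```

With these inputs, `buggy["D"]` is 1500.0, which lies beyond the 1000s of video in dst-file-1, while `fixed["D"]` is 500.0 and stays inside it.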
This issue has been cited in: #2328, #2212, #2438