
Fix video indexing bugs in dataset aggregation with multi-video datasets #2550

Merged

michel-aractingi merged 6 commits into main from fix/dataset_aggr on Dec 5, 2025

Conversation

@michel-aractingi (Contributor)

What this does

Fixes critical bugs in src/lerobot/datasets/aggregate.py that caused incorrect video frame indexing when aggregating multiple datasets. These bugs resulted in episodes pointing to the wrong video files or having timestamps that exceeded the actual video durations, causing failures when the dataloader attempted to access a frame.

1. Episodes assigned to wrong destination video files

In update_meta_data(), all episodes from a source dataset were assigned the same destination chunk/file indices, regardless of which destination file their source video was actually written to.

```python
# All episodes get the SAME destination file - the LAST one used
df[orig_chunk_col] = video_idx["chunk"]
df[orig_file_col] = video_idx["file"]
```

After the fix:

```python
# Track which destination file each source file was written to
videos_idx[key]["src_to_dst"][(src_chunk_idx, src_file_idx)] = (chunk_idx, file_idx)

# Map each episode to its CORRECT destination file
for idx in df.index:
    src_key = (int(df.at[idx, "_orig_chunk"]), int(df.at[idx, "_orig_file"]))
    dst_chunk, dst_file = src_to_dst.get(src_key, (video_idx["chunk"], video_idx["file"]))
    df.at[idx, orig_chunk_col] = dst_chunk
    df.at[idx, orig_file_col] = dst_file
```
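To make the remapping concrete, here is a minimal standalone sketch of the same idea on toy data (illustrative only, not the actual aggregate.py code; the DataFrame columns and the pre-seeded src_to_dst dictionary are hypothetical stand-ins):

```python
import pandas as pd

# Hypothetical mapping built while copying videos: two source files that
# ended up in different destination files.
src_to_dst = {(0, 0): (0, 0), (0, 1): (0, 1)}

# Three episodes; the first two come from source file (0, 0), the third
# from source file (0, 1).
df = pd.DataFrame({"_orig_chunk": [0, 0, 0], "_orig_file": [0, 0, 1]})
df["dst_chunk"] = -1
df["dst_file"] = -1

# Map each episode to the destination file its OWN source video went to,
# instead of blanket-assigning the last destination file used.
for idx in df.index:
    src_key = (int(df.at[idx, "_orig_chunk"]), int(df.at[idx, "_orig_file"]))
    dst_chunk, dst_file = src_to_dst[src_key]
    df.at[idx, "dst_chunk"] = dst_chunk
    df.at[idx, "dst_file"] = dst_file

print(df)  # the third episode correctly points at destination file (0, 1)
```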

2. Video offsets tracked globally instead of per-destination-file

When concatenating videos from multiple source datasets into destination files, offsets were tracked as a running total across ALL files. This caused episodes to have timestamps that exceeded their actual video file's duration.

Example of the bug:

  • Source A video → written to dst-file-0 (duration: 500s)
  • Source B video → concatenated to dst-file-0 (total: 1000s)
  • Source C video → rotated to new dst-file-1 (duration: 500s)
  • Source D video → concatenated to dst-file-1

With the bug, Source D's offset would be ~1500s (total of A+B+C) instead of 500s (just C's duration in dst-file-1).

```python
current_offset = video_idx["latest_duration"]  # Global running total

# Append to existing file - offset is WRONG (includes videos from other files)
videos_idx[key]["src_to_offset"][(src_chunk_idx, src_file_idx)] = current_offset
```

After the fix:

```python
# Track duration of EACH destination file separately
if "dst_file_durations" not in videos_idx[key]:
    videos_idx[key]["dst_file_durations"] = {}

dst_file_durations = video_idx["dst_file_durations"]
dst_key = (chunk_idx, file_idx)

# Append to existing destination file
# Offset is the current duration of THIS SPECIFIC destination file
current_dst_duration = dst_file_durations.get(dst_key, 0)
videos_idx[key]["src_to_offset"][(src_chunk_idx, src_file_idx)] = current_dst_duration
# Update duration of this destination file
dst_file_durations[dst_key] = current_dst_duration + src_duration
```
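The A-D example above can be replayed with a few lines of standalone toy code (a sketch of the bookkeeping only, not the actual implementation):

```python
# (source, destination file index, duration in seconds)
writes = [("A", 0, 500), ("B", 0, 500), ("C", 1, 500), ("D", 1, 500)]

# Buggy: one global running total across ALL destination files.
global_offset, buggy_offsets = 0, {}
for name, dst, duration in writes:
    buggy_offsets[name] = global_offset
    global_offset += duration

# Fixed: one running duration PER destination file.
dst_file_durations, fixed_offsets = {}, {}
for name, dst, duration in writes:
    fixed_offsets[name] = dst_file_durations.get(dst, 0)
    dst_file_durations[dst] = fixed_offsets[name] + duration

print(buggy_offsets["D"])  # 1500 -> exceeds what dst-file-1 actually contains
print(fixed_offsets["D"])  # 500  -> just C's duration within dst-file-1
```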

This issue has been cited in #2328, #2212, and #2438.

Copilot AI review requested due to automatic review settings · November 30, 2025

Copilot AI left a comment

Pull request overview

This PR fixes two critical bugs in video indexing during dataset aggregation that caused incorrect frame access when loading data from aggregated multi-video datasets. The fixes ensure that episodes correctly reference their destination video files and that timestamps remain within the bounds of their respective video durations.

  • Implements per-source-to-destination file mapping to ensure episodes point to the correct video files
  • Changes video offset tracking from global to per-destination-file to prevent timestamp overflow
  • Adds defensive type conversion for numpy integers used as dictionary keys
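On the last point: pandas cell accessors return numpy scalars rather than Python ints, and while numpy integers hash like ints (so dictionary lookups still work), they can trip up later steps such as JSON serialization. Below is a small illustration of why the int() casts in the fix are a sensible precaution; the serialization motive is my assumption, not stated in the PR:

```python
import json
import numpy as np
import pandas as pd

df = pd.DataFrame({"_orig_chunk": [0]})
raw = df.at[0, "_orig_chunk"]   # numpy.int64, not a Python int

mapping = {(raw, raw): (0, 0)}
print(mapping[(0, 0)])          # works: numpy ints hash/compare like ints

print(json.dumps({"chunk": int(raw)}))  # fine after the cast
# json.dumps({"chunk": raw}) would raise:
# TypeError: Object of type int64 is not JSON serializable
```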


@michel-aractingi (Contributor, Author)

Let me know if it works for you @brysonjones @nicholas-maselli @andras-makany @Grigorij-Dudnik

@brysonjones (Contributor)

Hi @michel-aractingi, I can confirm that the error described in #2212 that I was facing is resolved by this MR.

I've tested this with a few different large datasets (4 video views and 2k episodes) and didn't hit issues merging any of them.

Thank you for working on this!

@alexcbb (Contributor) commented Dec 1, 2025

Hi @michel-aractingi, I've had the problem on a personal dataset as well and discovered this PR. I can confirm that it indeed solves my issue with timestamps.

@Grigorij-Dudnik

Tested - an ACT policy trained successfully on a filtered and glued dataset.

Btw, I previously tested the fix by @andras-makany; it was also working.

@michel-aractingi merged commit 0217e1e into main on Dec 5, 2025 (10 checks passed)
@michel-aractingi deleted the fix/dataset_aggr branch on December 5, 2025 15:09
@nicholas-maselli

> Let me know if it works for you @brysonjones @nicholas-maselli @andras-makany @Grigorij-Dudnik

Awesome! My apologies, my GitHub notifications were set incorrectly so I didn't see this.
