
Fix video indexing bugs in dataset aggregation with multi-video datasets #2550

Merged

michel-aractingi merged 6 commits into main from fix/dataset_aggr on Dec 5, 2025

Conversation

@michel-aractingi (Contributor)

What this does

Fixes critical bugs in src/lerobot/datasets/aggregate.py that caused incorrect video frame indexing when aggregating multiple datasets. These bugs resulted in episodes pointing to the wrong video files or having timestamps that exceeded the actual video durations, causing failures when the dataloader attempted to access a frame.

1. Episodes assigned to wrong destination video files

In update_meta_data(), all episodes from a source dataset were assigned the same destination chunk/file indices, regardless of which destination file their source video was actually written to.

```python
# All episodes get the SAME destination file - the LAST one used
df[orig_chunk_col] = video_idx["chunk"]
df[orig_file_col] = video_idx["file"]
```

After the fix:

```python
# Track which destination file each source file was written to
videos_idx[key]["src_to_dst"][(src_chunk_idx, src_file_idx)] = (chunk_idx, file_idx)

# Map each episode to its CORRECT destination file
for idx in df.index:
    src_key = (int(df.at[idx, "_orig_chunk"]), int(df.at[idx, "_orig_file"]))
    dst_chunk, dst_file = src_to_dst.get(src_key, (video_idx["chunk"], video_idx["file"]))
    df.at[idx, orig_chunk_col] = dst_chunk
    df.at[idx, orig_file_col] = dst_file
```
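To make the remapping concrete, here is a minimal standalone sketch of the same idea on toy data (illustrative only, not the actual aggregate.py code; the DataFrame columns and the pre-seeded src_to_dst dictionary are hypothetical stand-ins):

```python
import pandas as pd

# Hypothetical mapping built while copying videos: two source files that
# ended up in different destination files.
src_to_dst = {(0, 0): (0, 0), (0, 1): (0, 1)}

# Three episodes; the first two come from source file (0, 0), the third
# from source file (0, 1).
df = pd.DataFrame({"_orig_chunk": [0, 0, 0], "_orig_file": [0, 0, 1]})
df["dst_chunk"] = -1
df["dst_file"] = -1

# Map each episode to the destination file its OWN source video went to,
# instead of blanket-assigning the last destination file used.
for idx in df.index:
    src_key = (int(df.at[idx, "_orig_chunk"]), int(df.at[idx, "_orig_file"]))
    dst_chunk, dst_file = src_to_dst[src_key]
    df.at[idx, "dst_chunk"] = dst_chunk
    df.at[idx, "dst_file"] = dst_file

print(df)  # the third episode correctly points at destination file (0, 1)
```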

2. Video offsets tracked globally instead of per-destination-file

When concatenating videos from multiple source datasets into destination files, offsets were tracked as a running total across ALL files. This caused episodes to have timestamps that exceeded their actual video file's duration.

Example of the bug:

  • Source A video → written to dst-file-0 (duration: 500s)
  • Source B video → concatenated to dst-file-0 (total: 1000s)
  • Source C video → rotated to new dst-file-1 (duration: 500s)
  • Source D video → concatenated to dst-file-1

With the bug, Source D's offset would be ~1500s (total of A+B+C) instead of 500s (just C's duration in dst-file-1).

```python
current_offset = video_idx["latest_duration"]  # Global running total

# Append to existing file - offset is WRONG (includes videos from other files)
videos_idx[key]["src_to_offset"][(src_chunk_idx, src_file_idx)] = current_offset
```

After the fix:

```python
# Track duration of EACH destination file separately
if "dst_file_durations" not in videos_idx[key]:
    videos_idx[key]["dst_file_durations"] = {}

dst_file_durations = video_idx["dst_file_durations"]
dst_key = (chunk_idx, file_idx)

# Append to existing destination file
# Offset is the current duration of THIS SPECIFIC destination file
current_dst_duration = dst_file_durations.get(dst_key, 0)
videos_idx[key]["src_to_offset"][(src_chunk_idx, src_file_idx)] = current_dst_duration
# Update duration of this destination file
dst_file_durations[dst_key] = current_dst_duration + src_duration
```
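The A-D example above can be replayed with a few lines of standalone toy code (a sketch of the bookkeeping only, not the actual implementation):

```python
# (source, destination file index, duration in seconds)
writes = [("A", 0, 500), ("B", 0, 500), ("C", 1, 500), ("D", 1, 500)]

# Buggy: one global running total across ALL destination files.
global_offset, buggy_offsets = 0, {}
for name, dst, duration in writes:
    buggy_offsets[name] = global_offset
    global_offset += duration

# Fixed: one running duration PER destination file.
dst_file_durations, fixed_offsets = {}, {}
for name, dst, duration in writes:
    fixed_offsets[name] = dst_file_durations.get(dst, 0)
    dst_file_durations[dst] = fixed_offsets[name] + duration

print(buggy_offsets["D"])  # 1500 -> exceeds what dst-file-1 actually contains
print(fixed_offsets["D"])  # 500  -> just C's duration within dst-file-1
```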

This issue has been cited in #2328, #2212, and #2438.

Copilot AI review requested due to automatic review settings · November 30, 2025

Copilot AI left a comment

Pull request overview

This PR fixes two critical bugs in video indexing during dataset aggregation that caused incorrect frame access when loading data from aggregated multi-video datasets. The fixes ensure that episodes correctly reference their destination video files and that timestamps remain within the bounds of their respective video durations.

  • Implements per-source-to-destination file mapping to ensure episodes point to the correct video files
  • Changes video offset tracking from global to per-destination-file to prevent timestamp overflow
  • Adds defensive type conversion for numpy integers used as dictionary keys
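On the last point: pandas cell accessors return numpy scalars rather than Python ints, and while numpy integers hash like ints (so dictionary lookups still work), they can trip up later steps such as JSON serialization. Below is a small illustration of why the int() casts in the fix are a sensible precaution; the serialization motive is my assumption, not stated in the PR:

```python
import json
import numpy as np
import pandas as pd

df = pd.DataFrame({"_orig_chunk": [0]})
raw = df.at[0, "_orig_chunk"]   # numpy.int64, not a Python int

mapping = {(raw, raw): (0, 0)}
print(mapping[(0, 0)])          # works: numpy ints hash/compare like ints

print(json.dumps({"chunk": int(raw)}))  # fine after the cast
# json.dumps({"chunk": raw}) would raise:
# TypeError: Object of type int64 is not JSON serializable
```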


@michel-aractingi (Contributor, Author)

Let me know if it works for you @brysonjones @nicholas-maselli @andras-makany @Grigorij-Dudnik

@brysonjones (Contributor)

Hi @michel-aractingi, I can confirm that the error described in #2212 that I was facing is resolved by this MR.

I've tested this with a few different large datasets (4 video views and 2k episodes) and didn't hit issues merging any of them.

Thank you for working on this!

@alexcbb (Contributor) commented Dec 1, 2025

Hi @michel-aractingi, I've had the problem on a personal dataset as well and discovered this PR. I can confirm that it indeed solves my issue with timestamps.

@Grigorij-Dudnik

Tested - an ACT policy trained successfully on a filtered and glued dataset.

Btw, I previously tested the fix by @andras-makany; it was also working.

@michel-aractingi merged commit 0217e1e into main on Dec 5, 2025 (10 checks passed)
@michel-aractingi deleted the fix/dataset_aggr branch on December 5, 2025 15:09
@nicholas-maselli

> Let me know if it works for you @brysonjones @nicholas-maselli @andras-makany @Grigorij-Dudnik

Awesome! My apologies, my GitHub notifications were set incorrectly so I didn't see this.
