
Conversation

@gitttt-1234 (Collaborator) commented Mar 18, 2025

This PR fixes a bug in computing the length of custom datasets in the sleap_nn.data.custom_datasets.BaseDataset class. The length was previously computed from self.cache, but the cache is not initialized when existing numpy chunks are reused, so len(dataset) came out as 0. This PR instead builds an lf_idx_list of user-labelled frame indices from the labels object and computes the dataset length from that index list.
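For context, here is a minimal, hypothetical sketch of the approach (the names _get_lf_idx_list, lf_idx_list, and user_instances_only appear in the PR and review; the method bodies and surrounding class are assumptions, not the actual sleap-nn source):

```python
# Hedged sketch of the fix described above -- illustrative only.
class BaseDataset:
    def __init__(self, labels, user_instances_only: bool = True):
        self.labels = labels
        self.user_instances_only = user_instances_only
        # Build the index list up front so the dataset length is known
        # even when cached numpy chunks are reused and the cache is
        # never (re)populated.
        self.lf_idx_list = self._get_lf_idx_list()

    def _get_lf_idx_list(self) -> list:
        """Return indices of labeled frames that belong in the dataset."""
        lf_idx_list = []
        for lf_idx, lf in enumerate(self.labels):
            # Optionally restrict each frame to user-labeled instances.
            if (
                self.user_instances_only
                and lf.user_instances is not None
                and len(lf.user_instances) > 0
            ):
                lf.instances = lf.user_instances
            # Skip frames with no instances left after filtering.
            if not lf.instances:
                continue
            lf_idx_list.append(lf_idx)
        return lf_idx_list
```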

Summary by CodeRabbit

  • New Features
    • Introduced enhanced dataset filtering to process only the frames with relevant annotations, improving efficiency.
  • Refactor
    • Streamlined data caching and record counting to align with the updated filtering logic, ensuring consistent performance.

@coderabbitai coderabbitai bot (Contributor) commented Mar 18, 2025

Walkthrough

The pull request adds a new method, _get_lf_idx_list, to the BaseDataset class in sleap_nn/data/custom_datasets.py. This method filters labeled frames on user-specified conditions and returns the list of their indices. The constructor is updated to initialize lf_idx_list using this method, and _fill_cache and __len__ now use this filtered list to populate the cache and to report the number of valid labeled frames.

Changes

sleap_nn/.../custom_datasets.py:
  • Added _get_lf_idx_list in BaseDataset for filtering labeled frame indices.
  • Updated constructor to initialize lf_idx_list.
  • Modified _fill_cache to iterate over lf_idx_list.
  • Adjusted __len__ to return the length of lf_idx_list.

Sequence Diagram(s)

```mermaid
sequenceDiagram
    participant Client
    participant BaseDataset
    participant FilterMethod as _get_lf_idx_list
    note over BaseDataset: Dataset initialization
    Client->>BaseDataset: Instantiate dataset
    BaseDataset->>FilterMethod: Call _get_lf_idx_list()
    FilterMethod-->>BaseDataset: Return filtered index list (lf_idx_list)
    BaseDataset->>BaseDataset: Set lf_idx_list attribute

    note over BaseDataset: Cache population
    Client->>BaseDataset: Call _fill_cache()
    BaseDataset->>BaseDataset: Iterate over lf_idx_list
    BaseDataset->>Client: Process each valid labeled frame
```

Poem

I'm a bunny coding in the night,
Hopping through indices with delight.
_get_lf_idx_list lights up the trail,
Filtering frames without fail.
Cache and length now dance in tune,
A rabbit's code sings under the moon.
🐇✨


@coderabbitai coderabbitai bot left a comment

Actionable comments posted: 0

🧹 Nitpick comments (1)
sleap_nn/data/custom_datasets.py (1)

91-105: Excellent implementation of the new method for determining valid labeled frames

This is the core fix for the issue described in the PR. The method correctly filters labeled frames to user instances when configured, and skips frames with no instances, producing a reliable index list for the dataset.

Consider simplifying the nested if statements for better readability:

```diff
-            # Filter to user instances
-            if self.data_config.user_instances_only:
-                if lf.user_instances is not None and len(lf.user_instances) > 0:
-                    lf.instances = lf.user_instances
+            # Filter to user instances
+            if self.data_config.user_instances_only and lf.user_instances is not None and len(lf.user_instances) > 0:
+                lf.instances = lf.user_instances
```
🧰 Tools
🪛 Ruff (0.8.2)

96-97: Use a single if statement instead of nested if statements

(SIM102)

📜 Review details

Configuration used: CodeRabbit UI
Review profile: CHILL
Plan: Pro

📥 Commits

Reviewing files that changed from the base of the PR and between 7c45d4e and 6e116d6.

📒 Files selected for processing (1)
  • sleap_nn/data/custom_datasets.py (7 hunks)
🧰 Additional context used
🧬 Code Definitions (1)
sleap_nn/data/custom_datasets.py (1)
sleap_nn/inference/predictors.py (4)
  • data_config (213:214)
  • data_config (526:534)
  • data_config (965:970)
  • data_config (1339:1344)
🪛 Ruff (0.8.2)
sleap_nn/data/custom_datasets.py

96-97: Use a single if statement instead of nested if statements

(SIM102)

⏰ Context from checks skipped due to timeout of 90000ms (4)
  • GitHub Check: Tests (macos-14, Python 3.9)
  • GitHub Check: Tests (windows-latest, Python 3.9)
  • GitHub Check: Tests (ubuntu-latest, Python 3.9)
  • GitHub Check: Lint
🔇 Additional comments (8)
sleap_nn/data/custom_datasets.py (8)

73-73: LGTM: Added initialization of lf_idx_list

This change initializes lf_idx_list in the BaseDataset constructor, which will be used to track valid labeled frames.


122-124: Good modification to use the filtered index list

This change properly uses lf_idx_list to iterate over valid labeled frames instead of all frames, which is more efficient and ensures that only valid frames are processed.


167-170: Fixed cache key management for numpy chunks

Using the loop index rather than the frame index for filenames ensures sequential file numbering and proper access during loading, which is important for consistent behavior.
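As a hedged illustration of the two comments above (enumerate-based keying over lf_idx_list is inferred from the review; the _process_frame helper, the cache_dir attribute, and the file format are assumptions), the cache-filling loop presumably looks something like this method of the class:

```python
import numpy as np

def _fill_cache(self):
    # Iterate only over the filtered labeled-frame indices.
    for count, lf_idx in enumerate(self.lf_idx_list):
        sample = self._process_frame(self.labels[lf_idx])  # hypothetical helper
        # Key chunk files by the loop index (0, 1, 2, ...) rather than
        # lf_idx, so files are numbered sequentially and can be reloaded
        # by position later.
        np.savez_compressed(f"{self.cache_dir}/sample_{count}.npz", **sample)
```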


172-172: Consistent cache key handling for in-memory caching

This change ensures that the in-memory cache also uses a sequential index for keys, maintaining consistency with the file-based approach.


183-183: Key fix for the dataset length issue

This change addresses the core problem identified in the PR description. Now __len__ returns the actual number of valid labeled frames rather than depending on the cache, which solves the issue where length would be 0 when numpy chunks are reused.
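A minimal sketch of the corrected __len__ (the attribute name comes from the PR; everything else is illustrative):

```python
def __len__(self) -> int:
    # Report length from the precomputed index list rather than from
    # self.cache, so it stays correct when numpy chunks are reused.
    return len(self.lf_idx_list)
```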


597-599: Correctly applying the same pattern in CentroidDataset class

The same approach for iterating through valid labeled frames is now applied in the CentroidDataset class, ensuring consistency across the codebase.


649-652: Consistent cache management in CentroidDataset

The filename and cache key generation has been updated to match the changes in the BaseDataset class, maintaining consistency across classes.


654-654: In-memory cache consistency in CentroidDataset

This ensures that in-memory caching in CentroidDataset follows the same pattern as BaseDataset.

@codecov codecov bot commented Mar 18, 2025

Codecov Report

All modified and coverable lines are covered by tests ✅

Project coverage is 97.01%. Comparing base (75bf31b) to head (6e116d6).
Report is 3 commits behind head on main.

Additional details and impacted files
```diff
@@           Coverage Diff           @@
##             main     #160   +/-   ##
=======================================
  Coverage   97.00%   97.01%
=======================================
  Files          46       46
  Lines        4945     4961   +16
=======================================
+ Hits         4797     4813   +16
  Misses        148      148
```

☔ View full report in Codecov by Sentry.

gitttt-1234 merged commit 34aa25d into main on Mar 18, 2025
7 checks passed
gitttt-1234 deleted the divya/fix-custom-datasets branch on March 18, 2025 at 18:59