
Conversation

jjsjann123
Collaborator

The logical domain is not the right target for validating the reference TensorView.

We care about transformation propagation and coverage, so we should have used the initial loop domain instead. Specifically, this makes it easier to schedule opaque indexing operations, or sparse update operations like scatter, because the logical domain of the large buffer isn't relevant for scheduling.

@jjsjann123
Collaborator Author

!test


github-actions bot commented Sep 5, 2025

Description

  • Use initial loop domain instead of logical domain for coverage checks

  • Improve scheduling support for opaque indexing and sparse operations

  • Fix domain mapping validation for broadcast and transformed dimensions

  • Enable safer propagation of iteration domains in fusion scheduling


Changes walkthrough 📝

Relevant files
Bug fix
domain_map.cpp
Switch to initial loop domain for domain map coverage       

csrc/scheduler/tools/domain_map.cpp

  • Replace getLogicalDomain with getInitialLoopDomain in coverage checks
  • Update domain mapping logic for input and target ID validation
  • Preserve broadcast dimension handling with updated domain source
  • Fix propagation safety for transformed and merged broadcast domains
  • +6/-6     

    PR Reviewer Guide 🔍

    Here are some key observations to aid the review process:

    🧪 No relevant tests
    ⚡ Recommended focus areas for review

    Consistency Check

    The PR replaces getLogicalDomain() with getInitialLoopDomain() in multiple places within areAllInputIdsMappedTo and areAllTargetIdsCoveredBy functions. However, it is unclear if all uses of the logical domain were intended to be replaced, especially in contexts where transformation propagation relies on logical structure. This change may affect correctness in scheduling scenarios involving complex transformations.

    for (auto in_id : input_tv->getInitialLoopDomain()) {
      if (canIgnoreIndexedInputDomainID(input_tv, in_id, ca_map_)) {
        continue;
      }
    
      // Permissive map is required for the transpose scheduler to support cases
      // like T0[I0, b] + T1[b, I1]
      auto concrete =
          ca_map_.getConcreteMappedID(in_id, IdMappingMode::PERMISSIVE);
    
      if (!concrete->isBroadcast() && !in_id->isReduction()) {
        in_concrete_ids.insert(concrete);
      }
    }
    
    // Erase all input concrete IDs mapped to the output domain
    // Ignore unresolved broadcast dimensions
    eraseifInputMappedThroughRootDomainAndIndexing(
        in_concrete_ids, tv->getInitialLoopDomain());

    Broadcast Handling Comment

    The comment suggests that broadcast handling should ideally be moved inside a loop using get_source_iter_domains, but this was not implemented due to an open issue. The PR updates the comment to reference getInitialLoopDomain(), but does not address the underlying technical debt or clarify whether this change impacts broadcast handling safety.

    // `get_source_iter_domains(target_tv->getInitialLoopDomain())` and skip
    // broadcast source IDs. currently we have the issue that split/merge does

    @jjsjann123 jjsjann123 mentioned this pull request Sep 5, 2025
    @jjsjann123
    Collaborator Author

    CI seems to have some issues here. I cannot see the failing tests....

    On Naoya's comment here: #5114 (comment)
    I'm suspecting TransformPropagator also needs to switch from the logical domain to the initial loop domain, following the same logic as here. But that felt like a large hammer... I'll double-check after looking at the failing log.

    @jjsjann123 jjsjann123 changed the title Use initial loop domain in domain_map [DoNotReview] Use initial loop domain in domain_map Sep 8, 2025
    @jjsjann123
    Collaborator Author

    I'm hitting two kinds of failures:

    1. Multi-device tests seem to be failing all over the place. I haven't dug in yet to see what's missing there.
    2. More concerning (but somewhat expected), scatter-related ops are failing because no scheduler can be found to handle them.
