Conversation

@jjsjann123 (Collaborator) commented Sep 18, 2025

Stacked PRs

Breaking original PR #5170 into three:
#5186 Fix allocation logic: non-divisible split
#5185 Fix allocation logic: unconnected alloc/logical <- this one
#5184 Allow non-device split on allocation domain

This PR

Context

The PreprocessGroupedMatmulInputSf op has:

  1. an unconnected logical and allocation domain.
  2. a larger allocation size, because the extra padding is represented via arithmetic operations on the extent directly (see the illustration below).
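
As a worked illustration (the alignment value is assumed for the example, not taken from the op): a logical extent of i0 = 100 with a padding alignment of 128 gives an allocation extent of ceilDiv(100, 128) * 128 = 128, so the allocated buffer is larger than the 100-element logical tensor.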

The existing allocation logic allocates a buffer matching the logical sizes/strides. This is not the right behavior, because the allocation domain could have a larger extent. But we cannot simply use the allocation sizes/strides either, because consumers of the tensor expect a tensor matching the logical size.

We updated the logic to use the allocation domain for buffer allocation, and then slice into the buffer using the logical domain to produce a correctly sized output.
For PreprocessGroupedMatmulInputSf, because there is no correct way to slice into the buffer for indexing, we give up on producing correct strides and use naive (contiguous) strides instead. This is safe because we do not run indexing logic on the output.
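
A minimal standalone sketch of the allocate-then-restride idea using ATen directly (the shapes, strides, and dtype below are made-up illustrations, not the op's actual values):

    #include <ATen/ATen.h>

    int main() {
      // Assumed example: logical shape {100, 16}, allocation padded to {128, 16}.
      auto opts = at::TensorOptions().dtype(at::kFloat);
      // Allocate the full padded buffer using the allocation sizes/strides.
      at::Tensor buf = at::empty_strided({128, 16}, {16, 1}, opts);
      // Restride in place so consumers see the logical shape; the underlying
      // storage still holds the padded 128 x 16 extent.
      buf.as_strided_({100, 16}, {16, 1});
      // buf.sizes() is now {100, 16}, while the storage covers 128 * 16 floats.
      return 0;
    }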

Code change

  1. Refactor buffer allocation to use the allocation domain instead of the logical domain.
  2. Fix the special path taken when projection from the allocation domain back to the logical domain is not possible: we now compute the correct extents instead of returning the allocation buffer as-is. This allows the layout op to return a tensor with the correct logical size, while still allocating a buffer large enough to accommodate the padding requirement.

github-actions bot commented Sep 18, 2025

Review updated until commit 87afb60

Description

  • Use allocation domain for buffer allocation

  • Slice allocated buffer to logical size

  • Fix stride computation for unconnected domains

  • Ensure correct output shape in layout op


Changes walkthrough 📝

Relevant files

Bug fix
csrc/runtime/allocations.cpp: Use allocation domain for buffer allocation and reshape to logical (+34/-9)

  • Allocate tensor using allocation sizes/strides when available
  • Reshape allocated tensor to logical sizes/strides via as_strided_
  • Handle case when allocation and logical domains differ
  • Compute correct logical strides with naive layout when projection fails

Tests
tests/cpp/test_layout_op.cpp: Validate logical shape and padded dimensions in layout test (+6/-0)

  • Add validation for output logical shape
  • Compute padded dimensions for k and m
  • Reshape output tensor using padded sizes and strides
  • Ensure test matches updated allocation behavior

PR Reviewer Guide 🔍

Here are some key observations to aid the review process:

🧪 PR contains tests
⚡ Recommended focus areas for review

Allocation Logic

The allocation logic now uses allocation domains for buffer allocation and then applies logical strides via as_strided. However, the safety of using naive strides in cases where projection is not possible should be validated, especially regarding memory access patterns and potential out-of-bounds accesses.

    at::Tensor alloc_tensor;
    if (!out_info.shape_info.allocation_sizes.empty()) {
      // allocate based on allocation size & stride and restride with logical
      // size & stride afterwards.
      alloc_tensor = at::native::empty_strided_cuda(
          out_info.shape_info.allocation_sizes,
          out_info.shape_info.allocation_strides,
          out_info.type,
          c10::nullopt,
          device,
          c10::nullopt);
      alloc_tensor = alloc_tensor.as_strided_(
          out_info.shape_info.logical_sizes,
          out_info.shape_info.logical_strides);
    } else {
      alloc_tensor = at::native::empty_strided_cuda(
          out_info.shape_info.logical_sizes,
          out_info.shape_info.logical_strides,
          out_info.type,
          c10::nullopt,
          device,
          c10::nullopt);
    }
Stride Computation

When frontier_set does not match logical_set, a new stride is computed using a reverse enumeration of logical dimensions. This fallback may impact performance or correctness if the tensor is later used in operations expecting proper strides, even if indexing is not used initially.

    std::vector<int64_t> logical_sizes(logical.size(), 0);
    std::vector<int64_t> logical_strides(logical.size(), 0);
    int64_t cur_stride = 1;
    for (const auto&& [i, id] : enumerate(logical) | std::views::reverse) {
      int64_t cur_size = ee.evaluate(id->extent()).as<int64_t>();
      logical_sizes[i] = cur_size;
      logical_strides[i] = cur_stride;
      cur_stride *= cur_size;
    }
    return tensor.as_strided(logical_sizes, logical_strides);
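
For intuition, the naive layout here is plain row-major: strides are built right-to-left as running products of the sizes. A minimal standalone sketch of the same computation (the names are illustrative, not nvFuser's):

    #include <cstdint>
    #include <vector>

    // Compute contiguous row-major strides for the given sizes, walking
    // dimensions from innermost to outermost, as the fallback above does.
    std::vector<int64_t> naiveStrides(const std::vector<int64_t>& sizes) {
      std::vector<int64_t> strides(sizes.size(), 0);
      int64_t cur_stride = 1;
      for (int64_t i = static_cast<int64_t>(sizes.size()) - 1; i >= 0; --i) {
        strides[i] = cur_stride;
        cur_stride *= sizes[i];
      }
      return strides;
    }

    // Example: sizes {100, 16} give strides {16, 1};
    // sizes {4, 100, 16} give strides {1600, 16, 1}.
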
Shape Validation

The test modifies the output tensor with as_strided_ using padded dimensions, which may interfere with the validation of logical shape correctness. It should be ensured that this mutation does not mask potential issues in the logical-to-allocation domain transformation.

    out.as_strided_({padded_m, padded_k}, {padded_k, 1});
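
For reference, one plausible shape of the padded-dimension computation in the test (the round-up helper and alignment constants here are assumptions for illustration, not taken from tests/cpp/test_layout_op.cpp):

    #include <cstdint>

    // Hypothetical helper: round x up to the next multiple of `multiple`.
    int64_t roundUp(int64_t x, int64_t multiple) {
      return (x + multiple - 1) / multiple * multiple;
    }

    // Illustrative only; the real alignment values come from the layout op:
    //   int64_t padded_m = roundUp(m, 128);
    //   int64_t padded_k = roundUp(k, 4);
    //   out.as_strided_({padded_m, padded_k}, {padded_k, 1});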

@jjsjann123 changed the title from "Fix output buffer size for PreprocessGroupedMatmulInputSf" to "Fix allocation logic: unconnected alloc/logical" on Sep 18, 2025
@jjsjann123 force-pushed the jj/allocation_for_layout_op_PR_1 branch from c64d299 to 33d0ce3 on September 18, 2025 21:40
@jjsjann123 marked this pull request as ready for review on September 18, 2025 21:41
@jjsjann123 force-pushed the jj/allocation_for_layout_op_PR_1 branch from 33d0ce3 to 17df15a on September 19, 2025 22:55
@jjsjann123 (Collaborator, Author) commented:

    !test

@jjsjann123 force-pushed the jj/allocation_for_layout_op_PR_1 branch from 17df15a to f9acfc3 on September 22, 2025 21:50
@jjsjann123 (Collaborator, Author) commented:

    !test

@jjsjann123 force-pushed the jj/allocation_for_layout_op_PR_1 branch from f9acfc3 to 87afb60 on September 23, 2025 17:39
@jjsjann123 (Collaborator, Author) commented:

    !test
