Conversation

@jjsjann123 (Collaborator) commented Sep 18, 2025

Stacked PRs

Breaking original PR #5170 into three:
#5186 Fix allocation logic: non-divisible split
#5185 Fix allocation logic: unconnected alloc/logical <- this one
#5184 Allow non-device split on allocation domain

This PR

Context

The PreprocessGroupedMatmulInputSf op has:

  1. an unconnected logical and allocation domain.
  2. a larger allocation size, because the extra padding is represented via arithmetic operations on the extent directly (see the illustration below).
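
As a worked illustration (the alignment value is assumed for the example, not taken from the op): a logical extent of i0 = 100 with a padding alignment of 128 gives an allocation extent of ceilDiv(100, 128) * 128 = 128, so the allocated buffer is larger than the 100-element logical tensor.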

The existing allocation logic allocates a buffer matching the logical sizes/strides. This is not the right behavior, because the allocation domain could have a larger extent. But we cannot simply use the allocation sizes/strides either, because consumers of the tensor expect a tensor matching the logical size.

We updated the logic to use the allocation domain for buffer allocation, and then slice into the buffer using the logical domain to produce a correctly sized output.
For PreprocessGroupedMatmulInputSf, because there is no correct way to slice into the buffer for indexing, we give up on producing correct strides and use naive (contiguous) strides instead. This is safe because we do not run indexing logic on the output.
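
A minimal standalone sketch of the allocate-then-restride idea using ATen directly (the shapes, strides, and dtype below are made-up illustrations, not the op's actual values):

    #include <ATen/ATen.h>

    int main() {
      // Assumed example: logical shape {100, 16}, allocation padded to {128, 16}.
      auto opts = at::TensorOptions().dtype(at::kFloat);
      // Allocate the full padded buffer using the allocation sizes/strides.
      at::Tensor buf = at::empty_strided({128, 16}, {16, 1}, opts);
      // Restride in place so consumers see the logical shape; the underlying
      // storage still holds the padded 128 x 16 extent.
      buf.as_strided_({100, 16}, {16, 1});
      // buf.sizes() is now {100, 16}, while the storage covers 128 * 16 floats.
      return 0;
    }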

Code change

  1. Refactor buffer allocation to use the allocation domain instead of the logical domain.
  2. Fix the special path taken when projection from the allocation domain back to the logical domain is not possible: we now compute the correct extents instead of returning the allocation buffer as-is. This allows the layout op to return a tensor with the correct logical size, while still allocating a buffer large enough to accommodate the padding requirement.

github-actions bot commented Sep 18, 2025

Review updated until commit 87afb60

Description

  • Use allocation domain for buffer allocation

  • Slice allocated buffer to logical size

  • Fix stride computation for unconnected domains

  • Ensure correct output shape in layout op


Changes walkthrough 📝

Relevant files

Bug fix
csrc/runtime/allocations.cpp: Use allocation domain for buffer allocation and reshape to logical (+34/-9)

  • Allocate tensor using allocation sizes/strides when available
  • Reshape allocated tensor to logical sizes/strides via as_strided_
  • Handle case when allocation and logical domains differ
  • Compute correct logical strides with naive layout when projection fails

Tests
tests/cpp/test_layout_op.cpp: Validate logical shape and padded dimensions in layout test (+6/-0)

  • Add validation for output logical shape
  • Compute padded dimensions for k and m
  • Reshape output tensor using padded sizes and strides
  • Ensure test matches updated allocation behavior

PR Reviewer Guide 🔍

Here are some key observations to aid the review process:

🧪 PR contains tests
⚡ Recommended focus areas for review

Allocation Logic

The allocation logic now uses allocation domains for buffer allocation and then applies logical strides via as_strided. However, the safety of using naive strides in cases where projection is not possible should be validated, especially regarding memory access patterns and potential out-of-bounds accesses.

    at::Tensor alloc_tensor;
    if (!out_info.shape_info.allocation_sizes.empty()) {
      // allocate based on allocation size & stride and restride with logical
      // size & stride afterwards.
      alloc_tensor = at::native::empty_strided_cuda(
          out_info.shape_info.allocation_sizes,
          out_info.shape_info.allocation_strides,
          out_info.type,
          c10::nullopt,
          device,
          c10::nullopt);
      alloc_tensor = alloc_tensor.as_strided_(
          out_info.shape_info.logical_sizes,
          out_info.shape_info.logical_strides);
    } else {
      alloc_tensor = at::native::empty_strided_cuda(
          out_info.shape_info.logical_sizes,
          out_info.shape_info.logical_strides,
          out_info.type,
          c10::nullopt,
          device,
          c10::nullopt);
    }
Stride Computation

When frontier_set does not match logical_set, a new stride is computed using a reverse enumeration of logical dimensions. This fallback may impact performance or correctness if the tensor is later used in operations expecting proper strides, even if indexing is not used initially.

    std::vector<int64_t> logical_sizes(logical.size(), 0);
    std::vector<int64_t> logical_strides(logical.size(), 0);
    int64_t cur_stride = 1;
    for (const auto&& [i, id] : enumerate(logical) | std::views::reverse) {
      int64_t cur_size = ee.evaluate(id->extent()).as<int64_t>();
      logical_sizes[i] = cur_size;
      logical_strides[i] = cur_stride;
      cur_stride *= cur_size;
    }
    return tensor.as_strided(logical_sizes, logical_strides);
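
For intuition, the naive layout here is plain row-major: strides are built right-to-left as running products of the sizes. A minimal standalone sketch of the same computation (the names are illustrative, not nvFuser's):

    #include <cstdint>
    #include <vector>

    // Compute contiguous row-major strides for the given sizes, walking
    // dimensions from innermost to outermost, as the fallback above does.
    std::vector<int64_t> naiveStrides(const std::vector<int64_t>& sizes) {
      std::vector<int64_t> strides(sizes.size(), 0);
      int64_t cur_stride = 1;
      for (int64_t i = static_cast<int64_t>(sizes.size()) - 1; i >= 0; --i) {
        strides[i] = cur_stride;
        cur_stride *= sizes[i];
      }
      return strides;
    }

    // Example: sizes {100, 16} give strides {16, 1};
    // sizes {4, 100, 16} give strides {1600, 16, 1}.
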
Shape Validation

The test modifies the output tensor with as_strided_ using padded dimensions, which may interfere with the validation of logical shape correctness. It should be ensured that this mutation does not mask potential issues in the logical-to-allocation domain transformation.

    out.as_strided_({padded_m, padded_k}, {padded_k, 1});
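
For reference, one plausible shape of the padded-dimension computation in the test (the round-up helper and alignment constants here are assumptions for illustration, not taken from tests/cpp/test_layout_op.cpp):

    #include <cstdint>

    // Hypothetical helper: round x up to the next multiple of `multiple`.
    int64_t roundUp(int64_t x, int64_t multiple) {
      return (x + multiple - 1) / multiple * multiple;
    }

    // Illustrative only; the real alignment values come from the layout op:
    //   int64_t padded_m = roundUp(m, 128);
    //   int64_t padded_k = roundUp(k, 4);
    //   out.as_strided_({padded_m, padded_k}, {padded_k, 1});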

@jjsjann123 changed the title from "Fix output buffer size for PreprocessGroupedMatmulInputSf" to "Fix allocation logic: unconnected alloc/logical" on Sep 18, 2025
@jjsjann123 force-pushed the jj/allocation_for_layout_op_PR_1 branch from c64d299 to 33d0ce3 on September 18, 2025 21:40
@jjsjann123 marked this pull request as ready for review on September 18, 2025 21:41
@jjsjann123 force-pushed the jj/allocation_for_layout_op_PR_1 branch from 33d0ce3 to 17df15a on September 19, 2025 22:55
@jjsjann123 (Collaborator, Author) commented:

    !test

@jjsjann123 force-pushed the jj/allocation_for_layout_op_PR_1 branch from 17df15a to f9acfc3 on September 22, 2025 21:50
@jjsjann123 (Collaborator, Author) commented:

    !test

@jjsjann123 force-pushed the jj/allocation_for_layout_op_PR_1 branch from f9acfc3 to 87afb60 on September 23, 2025 17:39
@jjsjann123 (Collaborator, Author) commented:

    !test
