Conversation

@mobicham (Contributor) commented on Jun 13, 2025:

Fixes generate.py benchmarking with gemlite.

Also, the current code gives OOM on smaller GPUs. By putting the weights on CPU first, we can avoid this issue.
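For context, a minimal sketch of the loading pattern the PR relies on (not the actual generate.py code; `load_then_move` and the model construction are placeholders): stage the checkpoint on CPU with mmap, then move the single materialized copy to the target device.

```python
import torch

def load_then_move(checkpoint_path, model, device="cuda", dtype=torch.float16):
    # Stage the weights on CPU; mmap avoids eagerly reading every storage,
    # so no full extra copy of the state dict ever lives on the GPU.
    state_dict = torch.load(
        str(checkpoint_path), mmap=True, weights_only=True, map_location="cpu"
    )
    # assign=True (PyTorch >= 2.1) reuses the loaded tensors directly
    # instead of copying them into pre-allocated parameters.
    model.load_state_dict(state_dict, assign=True)
    # Only now move the one materialized copy to the target device.
    return model.to(device=device, dtype=dtype)
```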

@pytorch-bot commented on Jun 13, 2025:

🔗 Helpful Links

🧪 See artifacts and rendered test results at hud.pytorch.org/pr/pytorch/ao/2372

Note: Links to docs will display an error until the docs builds have been completed.

✅ You can merge normally! (1 Unrelated Failure)

As of commit 862e0cb with merge base 6243040:

BROKEN TRUNK - The following job failed but was already present on the merge base:

👉 Rebase onto the `viable/strict` branch to avoid these failures

This comment was automatically generated by Dr. CI and updates every 15 minutes.

@facebook-github-bot added the CLA Signed label on Jun 13, 2025
@jerryzh168 added the topic: bug fix label on Jun 16, 2025
@jerryzh168 merged commit 7a846d5 into pytorch:main on Jun 24, 2025 (18 of 20 checks passed)
The review thread below is on this change in torchao/_models/llama/generate.py:

```diff
 def _load_model(checkpoint_path, device, precision):
     checkpoint = torch.load(
-        str(checkpoint_path), mmap=True, weights_only=True, map_location=device
+        str(checkpoint_path), mmap=True, weights_only=True, map_location="cpu"
```
@jerryzh168 (Contributor) commented:

Actually, is the map_location change correct?

@mobicham (Contributor, Author) commented:

Yes, without this you can't load a Llama3-8B fp16 checkpoint on a 24GB GPU; you just get OOM. This is the part I mentioned:

> Also, the current code gives OOM on smaller GPUs. By putting the weights on CPU first, we can avoid this issue.
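Rough back-of-the-envelope arithmetic (my own estimate, not stated in the thread) for why 24 GB is not enough when the checkpoint is loaded straight to the GPU:

```python
params = 8e9           # approximate Llama3-8B parameter count
bytes_per_param = 2    # fp16
one_copy_gb = params * bytes_per_param / 1e9   # ~16 GB for one copy of the weights
# map_location="cuda" puts one copy on the GPU; model.to(device) can briefly
# hold a second, so peak ~= 32 GB, which exceeds a 24 GB card.
# Staging on CPU first keeps only one GPU-resident copy (~16 GB).
print(f"one copy ~= {one_copy_gb:.0f} GB, two copies ~= {2 * one_copy_gb:.0f} GB")
```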

@jerryzh168 (Contributor) commented on Jun 24, 2025:

Oh OK, but do we need to change it somewhere, or does it have to run on CPU?

I think if the user requests to run on some device but it fails, we should probably not silently change the device, but instead ask the user to use a different device.

@mobicham (Contributor, Author) commented:

No, it still runs on GPU; there's a model.to(device): https://github.com/mobicham/ao/blob/862e0cbd90f8cc5f992ae66e779022912fa4d93a/torchao/_models/llama/generate.py#L245-L256

The issue is that loading the weights via map_location='cuda' and then doing model.to(device) for some reason uses more VRAM, leading to OOM. Maybe there's a cleaner way of doing it.
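If someone wants to verify the peak-memory difference themselves, a minimal sketch (my own, not part of the PR; `checkpoint_path`, `build_model`, and the dtype are placeholders) could compare the two loading orders:

```python
import torch

def peak_load_gb(checkpoint_path, build_model, device="cuda", map_location="cpu"):
    """Peak GPU memory (GB) for: torch.load -> load_state_dict -> .to(device)."""
    torch.cuda.empty_cache()
    torch.cuda.reset_peak_memory_stats(device)
    state_dict = torch.load(
        checkpoint_path, mmap=True, weights_only=True, map_location=map_location
    )
    model = build_model()  # fresh, unquantized model on CPU
    model.load_state_dict(state_dict)
    model = model.to(device=device, dtype=torch.float16)
    return torch.cuda.max_memory_allocated(device) / 1e9

# Expectation: map_location="cuda" peaks at roughly two copies of the weights,
# map_location="cpu" at roughly one, which is the gap the PR exploits.
```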

@jerryzh168 (Contributor) commented on Jun 24, 2025:

Oh OK, makes sense. This should be OK: some quantized formats do not support loading on CPU, but for full/half-precision models it should be fine.
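If that caveat ever needs to be handled explicitly, one hypothetical guard (not in the PR; `_resolve_map_location` is an invented name, reusing the `precision` argument that _load_model already takes) would stage only full/half-precision checkpoints on CPU:

```python
import torch

def _resolve_map_location(precision, device):
    # Full/half-precision checkpoints can always be staged on CPU first
    # and moved to the target device afterwards.
    if precision in (torch.float32, torch.float16, torch.bfloat16):
        return "cpu"
    # Some quantized formats must be materialized directly on the target device.
    return device
```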

liangel-02 pushed a commit that referenced this pull request on Aug 25, 2025:

* fix get_plain() with FMA mode
* update
* fix in_features/out_feature meta-data mismatch
* update gemlite slice test
* add packing_bitwidth support
* add packing_bitwidth support and cleanup
* update default gemlite layout
* cleanup
* fix symmetric use-case and relax _same_meta_data
* _copy() meta data
* fix (4,) in autoquant
* Add dynamic mode in gemlite layout
* mode explanation
* use weights_only instead of static
* generate fix
* remove set_packing_bitwidth

Signed-off-by: mobicham <[email protected]>