
[1/n][Optimus][Auto-AC] Support activation quantization without scaling #148380


Closed
wants to merge 1 commit

Conversation

@mengluy0125 (Contributor) commented Mar 3, 2025

Summary: We enable activation quantization in the forward pass, and users can customize the dtype they want to quantize to.

Test Plan:

unit test

buck2 test 'fbcode//mode/dev-nosan' fbcode//caffe2/test/inductor:quantization -- test_activation_quantization_aten

Buck UI: https://www.internalfb.com/buck2/776d3911-bb86-4ac8-a527-540cf1510b9d
Test UI: https://www.internalfb.com/intern/testinfra/testrun/4785074873051017
Network: Up: 4.3MiB Down: 42MiB (reSessionID-fef7e727-68b1-4645-a519-5652854df38d)
Executing actions. Remaining 0/4 6.7s exec time total
Command: test. Finished 2 local
Time elapsed: 3:11.5s
Tests finished: Pass 2. Fail 0. Fatal 0. Skip 0. Build failure 0

E2E

How to enable (you can override the dtype; if nothing is given, the default is fp8):

post_grad_fusion_options={
            "activation_quantization_aten_pass": {"quant_type": "torch.float8_e5m2"}
        },
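
For illustration, a minimal sketch (not from this PR) of how the option above might be applied in a standalone script, assuming the pass is consumed via `torch._inductor.config.post_grad_fusion_options`; the toy model is purely illustrative:

```
import torch
import torch._inductor.config as inductor_config

# Turn on the activation-quantization pass before compiling (hypothetical standalone usage).
inductor_config.post_grad_fusion_options = {
    "activation_quantization_aten_pass": {"quant_type": "torch.float8_e5m2"}
}

# Illustrative model; fp8 casts generally need a recent GPU.
model = torch.nn.Sequential(torch.nn.Linear(64, 64), torch.nn.ReLU()).cuda()
compiled = torch.compile(model)
compiled(torch.randn(8, 64, device="cuda")).sum().backward()
```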

Differential Revision: D70522237

cc @voznesenskym @penguinwu @EikanWang @jgong5 @Guobing-Chen @XiaobingSuper @zhuhaozhe @blzheng @wenzhe-nrv @jiayisunx @ipiszy @yf225 @chenyang78 @kadeng @muchulee8 @amjames @chauhang @aakhundov


pytorch-bot bot commented Mar 3, 2025

🔗 Helpful Links

🧪 See artifacts and rendered test results at hud.pytorch.org/pr/148380

Note: Links to docs will display an error until the docs builds have been completed.

✅ You can merge normally! (1 Unrelated Failure)

As of commit 7e92ff3 with merge base d2ee606:

UNSTABLE - The following job is marked as unstable, possibly due to flakiness on trunk:

This comment was automatically generated by Dr. CI and updates every 15 minutes.

@facebook-github-bot (Contributor)

This pull request was exported from Phabricator. Differential Revision: D70522237

pytorch-bot bot pushed a commit that referenced this pull request Mar 6, 2025
Summary:

We enable activation quantization in the forward pass, and users can customize the dtype they want to quantize to.

Test Plan:
# unit test

```
buck2 test 'fbcode//mode/dev-nosan' fbcode//caffe2/test/inductor:quantization -- test_activation_quantization_aten
```

Buck UI: https://www.internalfb.com/buck2/fc8469a3-54f7-425d-9b1f-e54840c0793a
Test UI: https://www.internalfb.com/intern/testinfra/testrun/3377699989853946
Network: Up: 10KiB  Down: 0B  (reSessionID-ab248457-6ac0-4b72-96da-0d3c427e260a)
Executing actions. Remaining     0/3                                                                                        0.7s exec time total
Command: test.     Finished 2 local                                                                                                                                                              
Time elapsed: 1:03.1s
Tests finished: Pass 2. Fail 0. Fatal 0. Skip 0. Build failure 0

# E2E

### How to enable (you can override the dtype and clamp range; if nothing is given, the default is fp8)

```
post_grad_fusion_options={
            "activation_quantization_aten_pass": {"quant_type": torch.float8_e5m2, "clamp_min": -57344.0, "clamp_max": 57344.0}
        },
```
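
To make "without scaling" concrete, here is a hedged illustration (not the pass implementation) of what quantizing and restoring a saved activation amounts to when no scale is stored, using the fp8 e5m2 clamp range above:

```
import torch

def quantize_activation(x: torch.Tensor) -> torch.Tensor:
    # Clamp to the finite range of float8_e5m2, then cast; no per-tensor scale is computed.
    return x.clamp(-57344.0, 57344.0).to(torch.float8_e5m2)

def dequantize_activation(x_fp8: torch.Tensor, dtype: torch.dtype = torch.bfloat16) -> torch.Tensor:
    # With no stored scale, dequantization is just an upcast back to the working dtype.
    return x_fp8.to(dtype)
```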

Differential Revision: D70522237
@facebook-github-bot (Contributor)

This pull request was exported from Phabricator. Differential Revision: D70522237

pytorch-bot bot pushed a commit that referenced this pull request Apr 9, 2025
Summary:

We enable activation quantization in the forward pass, and users can customize the dtype they want to quantize to.

Test Plan:
# unit test

```
buck2 test 'fbcode//mode/dev-nosan' fbcode//caffe2/test/inductor:quantization -- test_activation_quantization_aten
```

Buck UI: https://www.internalfb.com/buck2/9a53c909-d3ea-479a-874e-cc917999ca88
Test UI: https://www.internalfb.com/intern/testinfra/testrun/12384899050440719
Network: Up: 62KiB  Down: 81KiB  (reSessionID-913ca82d-c395-4492-818e-6e004df37f87)
Executing actions. Remaining     0/4                                                                                                   6.1s exec time total
Command: test.     Finished 2 local
Time elapsed: 3:22.9s
Tests finished: Pass 2. Fail 0. Fatal 0. Skip 0. Build failure 0

# E2E

### how to enable

```
post_grad_fusion_options={
            "activation_quantization_aten_pass": {"quant_type": "torch.float8_e5m2"},
        },
```
See D51860030 for how to set the config under dynamo_config_map.

Note: you can change the quant_type; if nothing is given, the default type torch.float8_e5m2 will be used for quantization.

#### If you use FSDP

- You may also need to set inline_inbuilt_nn_modules to true for models that use FSDP (see D70023488 for the config setting); a sketch of setting these knobs in Python follows this list.
- Remove UNSAFE_SKIP_FSDP_MODULE_GUARDS=1 (context: https://fb.workplace.com/groups/1075192433118967/permalink/1629608671010671/)
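
A hedged sketch (not from this PR) of setting these knobs directly in Python, assuming the existing torch._dynamo and torch._inductor config entries; the exact plumbing through dynamo_config_map depends on the launcher:

```
import torch._dynamo.config as dynamo_config
import torch._inductor.config as inductor_config

# Per the FSDP note above, built-in nn.Module inlining may be required.
dynamo_config.inline_inbuilt_nn_modules = True

# Enable the activation-quantization pass; quant_type defaults to fp8 e5m2 if omitted.
inductor_config.post_grad_fusion_options = {
    "activation_quantization_aten_pass": {"quant_type": "torch.float8_e5m2"}
}
```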

```
buck2 run mode/opt //aps_models/ads/icvr:icvr_launcher -- mode=mast_omnifm_v1-5_mwb launcher.max_retries=3 data_loader.dataset.batch_size=8 launcher.data_project=oncall_ads_model_platform launcher.fbl_entitlement=ads_global_tc_training_efficiency_qps max_ind_range=1 launcher.num_workers=8 data_loader.reading_service.num_remote_dpp_workers=30 data_loader.dataset.num_batches=100 trainer.gpu_tracer.wait=50 trainer.gpu_tracer.active=3 trainer.gpu_tracer.overhead_detection=10 launcher.tags=[ads_ranking_taxonomy_mc_qps_optimization]
```

aps-512_8_remove_fsdp_guards-92ae3972ba

Differential Revision: D70522237
@facebook-github-bot (Contributor)

This pull request was exported from Phabricator. Differential Revision: D70522237

mengluy0125 added a commit to mengluy0125/benchmark that referenced this pull request Apr 9, 2025
Summary:
X-link: pytorch/pytorch#148380

We enable activation quantization in the forward pass, and users can customize the dtype they want to quantize to.

Differential Revision: D70522237
mengluy0125 added a commit to mengluy0125/pytorch that referenced this pull request Apr 9, 2025
@facebook-github-bot (Contributor)

This pull request was exported from Phabricator. Differential Revision: D70522237

mengluy0125 added a commit to mengluy0125/benchmark that referenced this pull request Apr 9, 2025
mengluy0125 added a commit to mengluy0125/pytorch that referenced this pull request Apr 10, 2025
…ng (pytorch#148380)

Summary:
X-link: pytorch/benchmark#2607


We enable activation quantization in the forward pass, and users can customize the dtype they want to quantize to.

Test Plan:
# unit test

```
buck2 test 'fbcode//mode/dev-nosan' fbcode//caffe2/test/inductor:quantization -- test_activation_quantization_aten
```

Buck UI: https://www.internalfb.com/buck2/9a53c909-d3ea-479a-874e-cc917999ca88
Test UI: https://www.internalfb.com/intern/testinfra/testrun/12384899050440719
Network: Up: 62KiB  Down: 81KiB  (reSessionID-913ca82d-c395-4492-818e-6e004df37f87)
Executing actions. Remaining     0/4                                                                                                   6.1s exec time total
Command: test.     Finished 2 local
Time elapsed: 3:22.9s
Tests finished: Pass 2. Fail 0. Fatal 0. Skip 0. Build failure 0

# E2E

### how to enable

```
post_grad_fusion_options={
            "activation_quantization_aten_pass": {"quant_type": "torch.float8_e5m2"},
        },
```
See D51860030 for how to set the config under dynamo_config_map.

Note: you can change the quant_type; if nothing is given, the default type torch.float8_e5m2 will be used for quantization.

#### If you use FSDP

- You may also need to set inline_inbuilt_nn_modules to true for models that use FSDP (see D70023488 to check the config setting)
- Remove UNSAFE_SKIP_FSDP_MODULE_GUARDS=1
(context: https://fb.workplace.com/groups/1075192433118967/permalink/1629608671010671/)

```
buck2 run mode/opt //aps_models/ads/icvr:icvr_launcher -- mode=mast_omnifm_v1-5_mwb launcher.max_retries=3 data_loader.dataset.batch_size=8 launcher.data_project=oncall_ads_model_platform launcher.fbl_entitlement=ads_global_tc_training_efficiency_qps max_ind_range=1 launcher.num_workers=8 data_loader.reading_service.num_remote_dpp_workers=30 data_loader.dataset.num_batches=100 trainer.gpu_tracer.wait=50 trainer.gpu_tracer.active=3 trainer.gpu_tracer.overhead_detection=10 launcher.tags=[ads_ranking_taxonomy_mc_qps_optimization]
```

aps-512_8_remove_fsdp_guards-92ae3972ba

tlparse: https://manifold.edge.x2p.facebook.net/v0/read/tree/logs/aps-512_8_remove_fsdp_guards-92ae3972ba/attempt_0/version_0/rank_0/index.html?bucketName=tlparse_reports&apiKey=tlparse_reports-key&withPayload=1&timeoutMsec=10000

Differential Revision: D70522237
@facebook-github-bot (Contributor)

This pull request was exported from Phabricator. Differential Revision: D70522237

mengluy0125 added a commit to mengluy0125/benchmark that referenced this pull request Apr 10, 2025
mengluy0125 added a commit to mengluy0125/pytorch that referenced this pull request Apr 10, 2025
@facebook-github-bot (Contributor)

This pull request was exported from Phabricator. Differential Revision: D70522237

mengluy0125 added a commit to mengluy0125/benchmark that referenced this pull request Apr 10, 2025
@eellison (Contributor) left a comment

existing comments not addressed..

Comment on lines 1896 to 1928
@functools.lru_cache(None)
def _warn_complex_not_supported():
    warnings.warn(
        "Torchinductor does not support code generation for complex operators. Performance may be worse than eager."
    )


# There are some types (CPU) which we accept as input but not as
# output.
def unsupported_input_tensor(t: torch.Tensor, parent=None, node=None):
    "Do not support reading or writing to this tensor"
    if t.is_complex():
        # Complex views are supported with IR ComplexView
        if parent and parent.target in (
            torch.ops.aten.view.dtype,
            torch.ops.prims.convert_element_type.default,
        ):
            return False
        _warn_complex_not_supported()
        return True

    if t.is_meta:
        return True

    if t.dtype == torch.float8_e8m0fnu:
        if not node:
            return True

        # allow bitcast, views, memory movement, but not arithmetic
        # TODO: delete once triton adds native support
        return not (
            isinstance(parent.target, torch._ops.OpOverload)
            and parent.target
            in (
                aten.view.dtype,
                aten.cat.default,
                aten._scaled_mm.default,
            )
            or (isinstance(node.target, torch._ops.OpOverload) and is_view(node.target))
        )

    return False


def unsupported_output_tensor(t: torch.Tensor, parent=None, node=None):
    "Do not support writing tensor but can read from it"
    if unsupported_input_tensor(t, parent):
        return True
    return t.is_cpu and config.disable_cpp_codegen


def fallback_node_due_to_unsupported_type(node: torch.fx.Node, allow_cpu_inputs=True):

    bwd_module: fx.GraphModule,
    static_lifetime_input_nodes: Optional[OrderedSet[fx.Node]] = None,
):
    GraphTransformObserver = functools.partial(

nit: this is a function, not a class - I would stick with function/variable naming, e.g. graph_transform_observer?

@mengluy0125 (Contributor Author) replied:

I followed the naming convention in the codebase to keep it consistent lol @bdhirsh

GraphTransformObserver = functools.partial(

@mengluy0125 (Contributor Author)

To improve readability, I will make a separate PR to refactor the code and resolve the circular dependency problem. @eellison @bdhirsh

pytorch-bot bot pushed a commit that referenced this pull request May 1, 2025
…ng (#148380)

Summary:
X-link: pytorch/benchmark#2607


We enable activation quantization in the forward pass, and users can customize the dtype they want to quantize to.

Test Plan:
# unit test

```
buck2 test 'fbcode//mode/dev-nosan' fbcode//caffe2/test/inductor:quantization -- test_activation_quantization_aten
```

Buck UI: https://www.internalfb.com/buck2/9a53c909-d3ea-479a-874e-cc917999ca88
Test UI: https://www.internalfb.com/intern/testinfra/testrun/12384899050440719
Network: Up: 62KiB  Down: 81KiB  (reSessionID-913ca82d-c395-4492-818e-6e004df37f87)
Executing actions. Remaining     0/4                                                                                                   6.1s exec time total
Command: test.     Finished 2 local
Time elapsed: 3:22.9s
Tests finished: Pass 2. Fail 0. Fatal 0. Skip 0. Build failure 0

# E2E

### how to enable

```
post_grad_fusion_options={
            "activation_quantization_aten_pass": {"quant_type": "torch.float8_e5m2"},
        },
```
See D51860030 for how to set the config under dynamo_config_map.

Note: you can change the quant_type; if nothing is given, the default type torch.float8_e5m2 will be used for quantization.

#### If you use FSDP

- You may also need to set inline_inbuilt_nn_modules to true for models that use FSDP (see D70023488 to check the config setting)
- Remove UNSAFE_SKIP_FSDP_MODULE_GUARDS=1
(context: https://fb.workplace.com/groups/1075192433118967/permalink/1629608671010671/)

```
buck2 run mode/opt //aps_models/ads/icvr:icvr_launcher -- mode=mast_omnifm_v1-5_mwb launcher.max_retries=3 data_loader.dataset.batch_size=8 launcher.data_project=oncall_ads_model_platform launcher.fbl_entitlement=ads_global_tc_training_efficiency_qps max_ind_range=1 launcher.num_workers=8 data_loader.reading_service.num_remote_dpp_workers=30 data_loader.dataset.num_batches=100 trainer.gpu_tracer.wait=50 trainer.gpu_tracer.active=3 trainer.gpu_tracer.overhead_detection=10 launcher.tags=[ads_ranking_taxonomy_mc_qps_optimization]
```

aps-512_8_remove_fsdp_guards-92ae3972ba

tlparse: https://manifold.edge.x2p.facebook.net/v0/read/tree/logs/aps-512_8_remove_fsdp_guards-92ae3972ba/attempt_0/version_0/rank_0/index.html?bucketName=tlparse_reports&apiKey=tlparse_reports-key&withPayload=1&timeoutMsec=10000


baseline w/o fp8 quantization: aps-mengluy_remove_fsdp-ce75b306fa

w/ fp8 quantization: aps-mengluy_remove_fsdp_fp8-96541deec4

### QPS

 {F1977040587}

### memory

baseline
{F1977040640}

memory snapshot:
https://www.internalfb.com/ai_infra/zoomer/profiling-run/insights?profilingRunID=1767027467197075&tab=INSIGHTS&primarySubtab=Memory%20Analysis&secondarySubtab=Memory%20Snapshot

with fp8
{F1977040641}

memory snapshot:
https://www.internalfb.com/ai_infra/zoomer/profiling-run/insights?profilingRunID=639378375763157&tab=INSIGHTS&primarySubtab=Memory%20Analysis&secondarySubtab=Memory%20Snapshot


### conclusion:

- ~9% QPS improvement; peak memory reduced from 82.01 to 78.97.

- For NE, we need a longer verification run; WIP with the scaling version.

Differential Revision: D70522237
@facebook-github-bot (Contributor)

This pull request was exported from Phabricator. Differential Revision: D70522237

mengluy0125 added a commit to mengluy0125/benchmark that referenced this pull request May 1, 2025
mengluy0125 added a commit to mengluy0125/pytorch that referenced this pull request May 1, 2025
@facebook-github-bot (Contributor)

This pull request was exported from Phabricator. Differential Revision: D70522237

mengluy0125 added a commit to mengluy0125/benchmark that referenced this pull request May 1, 2025
mengluy0125 pushed a commit to mengluy0125/pytorch that referenced this pull request May 1, 2025
mengluy0125 pushed a commit to mengluy0125/pytorch that referenced this pull request May 1, 2025
mengluy0125 pushed a commit to mengluy0125/pytorch that referenced this pull request May 1, 2025
@facebook-github-bot (Contributor)

This pull request was exported from Phabricator. Differential Revision: D70522237

mengluy0125 added a commit to mengluy0125/benchmark that referenced this pull request May 7, 2025
Summary:

X-link: pytorch/pytorch#148380

We enable activation quantization in the forward pass, and users can customize the dtype they want to quantize to.

Reviewed By: avicizhu

Differential Revision: D70522237
mengluy0125 pushed a commit to mengluy0125/pytorch that referenced this pull request May 7, 2025
facebook-github-bot pushed a commit to pytorch/benchmark that referenced this pull request May 8, 2025
Summary:
Pull Request resolved: #2607

X-link: pytorch/pytorch#148380

We enable activation quantization in the forward pass, and users can customize the dtype they want to quantize to.

Reviewed By: Hahu803, avicizhu

Differential Revision: D70522237

fbshipit-source-id: 9c501506e8bd40a1199fafb2e28e6384e7df4786
@facebook-github-bot (Contributor)

@pytorchbot merge

(Initiating merge automatically since Phabricator Diff has merged)

@pytorchmergebot (Collaborator)

Merge started

Your change will be merged once all checks pass (ETA 0-4 Hours).

Learn more about merging in the wiki.

Questions? Feedback? Please reach out to the PyTorch DevX Team

Advanced Debugging: check the merge workflow status here.

Comment on lines +338 to +340
return size_in_mb >= torch._inductor.config.post_grad_fusion_options[
    "activation_quantization_aten_pass"
].get("size_in_mb", 100)

This is going to add shape guards on every tensor. We probably want this only if statically true
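
A hedged sketch of the kind of guard-free check the reviewer suggests (not code from this PR), assuming the symbolic-shapes helper `statically_known_true`, which evaluates a condition without installing guards:

```
from torch.fx.experimental.symbolic_shapes import statically_known_true

def exceeds_size_threshold(size_in_mb, threshold_mb: int = 100) -> bool:
    # Quantize only when the comparison is statically known to hold;
    # symbolic/unknown sizes return False rather than adding a shape guard.
    return statically_known_true(size_in_mb >= threshold_mb)
```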
