[1/n][Optimus][Auto-AC] Support activation quantization without scaling #148380

mengluy0125 · 2025-03-03T23:09:10Z

Summary: We enable the activation quantization in the forward pass, and users can customize the dtype they want to quantize.

Test Plan:

unit test

buck2 test 'fbcode//mode/dev-nosan' fbcode//caffe2/test/inductor:quantization -- test_activation_quantization_aten

Buck UI: https://www.internalfb.com/buck2/776d3911-bb86-4ac8-a527-540cf1510b9d
Test UI: https://www.internalfb.com/intern/testinfra/testrun/4785074873051017
Network: Up: 4.3MiB Down: 42MiB (reSessionID-fef7e727-68b1-4645-a519-5652854df38d)
Executing actions. Remaining 0/4 6.7s exec time total
Command: test. Finished 2 local
Time elapsed: 3:11.5s
Tests finished: Pass 2. Fail 0. Fatal 0. Skip 0. Build failure 0

E2E

how to enable (you can overrite the dtype, if nothing given, the default is fp8)

post_grad_fusion_options={
            "activation_quantization_aten_pass": {"quant_type": "torch.float8_e5m2"}
        },

Differential Revision: D70522237

cc @voznesenskym @penguinwu @EikanWang @jgong5 @Guobing-Chen @XiaobingSuper @zhuhaozhe @blzheng @wenzhe-nrv @jiayisunx @ipiszy @yf225 @chenyang78 @kadeng @muchulee8 @amjames @chauhang @aakhundov

pytorch-bot · 2025-03-03T23:09:14Z

🔗 Helpful Links

🧪 See artifacts and rendered test results at hud.pytorch.org/pr/148380

📄 Preview Python docs built from this PR
📄 Preview C++ docs built from this PR
❓ Need help or want to give feedback on the CI? Visit the bot commands wiki or our office hours

Note: Links to docs will display an error until the docs builds have been completed.

✅ You can merge normally! (1 Unrelated Failure)

As of commit 7e92ff3 with merge base d2ee606 ():

UNSTABLE - The following job is marked as unstable, possibly due to flakiness on trunk:

inductor / unit-test / cuda12.6-py3.10-gcc9-sm86 / test (inductor_cpp_wrapper, 1, 2, lf.ephemeral.linux.g5.4xlarge.nvidia.gpu) (gh) (#152916)
[ FAILED ] AotInductorTest.BasicTestCuda

This comment was automatically generated by Dr. CI and updates every 15 minutes.

facebook-github-bot · 2025-03-03T23:13:08Z

This pull request was exported from Phabricator. Differential Revision: D70522237

facebook-github-bot · 2025-03-03T23:13:14Z

This pull request was exported from Phabricator. Differential Revision: D70522237

Summary: We enable the activation quantization in the forward pass, and users can customize the dtype they want to quantize. Test Plan: # unit test ``` buck2 test 'fbcode//mode/dev-nosan' fbcode//caffe2/test/inductor:quantization -- test_activation_quantization_aten ``` Buck UI: https://www.internalfb.com/buck2/fc8469a3-54f7-425d-9b1f-e54840c0793a Test UI: https://www.internalfb.com/intern/testinfra/testrun/3377699989853946 Network: Up: 10KiB Down: 0B (reSessionID-ab248457-6ac0-4b72-96da-0d3c427e260a) Executing actions. Remaining 0/3 0.7s exec time total Command: test. Finished 2 local Time elapsed: 1:03.1s Tests finished: Pass 2. Fail 0. Fatal 0. Skip 0. Build failure 0 # E2E ### how to enable (you can overrite the dtype and clamp range, if nothing given, the default is fp8) ``` post_grad_fusion_options={ "activation_quantization_aten_pass": {"quant_type": torch.float8_e5m2, "clamp_min": -57344.0, "clamp_max": 57344.0} }, ``` Differential Revision: D70522237

facebook-github-bot · 2025-03-06T21:35:16Z

This pull request was exported from Phabricator. Differential Revision: D70522237

Summary: We enable the activation quantization in the forward pass, and users can customize the dtype they want to quantize. Test Plan: # unit test ``` buck2 test 'fbcode//mode/dev-nosan' fbcode//caffe2/test/inductor:quantization -- test_activation_quantization_aten ``` Buck UI: https://www.internalfb.com/buck2/9a53c909-d3ea-479a-874e-cc917999ca88 Test UI: https://www.internalfb.com/intern/testinfra/testrun/12384899050440719 Network: Up: 62KiB Down: 81KiB (reSessionID-913ca82d-c395-4492-818e-6e004df37f87) Executing actions. Remaining 0/4 6.1s exec time total Command: test. Finished 2 local Time elapsed: 3:22.9s Tests finished: Pass 2. Fail 0. Fatal 0. Skip 0. Build failure 0 # E2E ### how to enable ``` post_grad_fusion_options={ "activation_quantization_aten_pass": {"quant_type": "torch.float8_e5m2"}, }, ``` see D51860030 to check how to set the config under dynamo_config_map Note: you can change the quant_type, if nothing gives, then the default type torch.float8_e5m2 will be used to quantize #### If you use FSDP - You may also need to set inline_inbuilt_nn_modules to true for models that use FSDP (see D70023488 to check the config setting) - Remove UNSAFE_SKIP_FSDP_MODULE_GUARDS=1 (context: https://fb.workplace.com/groups/1075192433118967/permalink/1629608671010671/) ``` buck2 run mode/opt //aps_models/ads/icvr:icvr_launcher -- mode=mast_omnifm_v1-5_mwb launcher.max_retries=3 data_loader.dataset.batch_size=8 launcher.data_project=oncall_ads_model_platform launcher.fbl_entitlement=ads_global_tc_training_efficiency_qps max_ind_range=1 launcher.num_workers=8 data_loader.reading_service.num_remote_dpp_workers=30 data_loader.dataset.num_batches=100 trainer.gpu_tracer.wait=50 trainer.gpu_tracer.active=3 trainer.gpu_tracer.overhead_detection=10 launcher.tags=[ads_ranking_taxonomy_mc_qps_optimization] ``` aps-512_8_remove_fsdp_guards-92ae3972ba Differential Revision: D70522237

facebook-github-bot · 2025-04-09T18:40:41Z

This pull request was exported from Phabricator. Differential Revision: D70522237

Summary: X-link: pytorch/pytorch#148380 We enable the activation quantization in the forward pass, and users can customize the dtype they want to quantize. Differential Revision: D70522237

Summary: X-link: pytorch/benchmark#2607 We enable the activation quantization in the forward pass, and users can customize the dtype they want to quantize. Test Plan: # unit test ``` buck2 test 'fbcode//mode/dev-nosan' fbcode//caffe2/test/inductor:quantization -- test_activation_quantization_aten ``` Buck UI: https://www.internalfb.com/buck2/9a53c909-d3ea-479a-874e-cc917999ca88 Test UI: https://www.internalfb.com/intern/testinfra/testrun/12384899050440719 Network: Up: 62KiB Down: 81KiB (reSessionID-913ca82d-c395-4492-818e-6e004df37f87) Executing actions. Remaining 0/4 6.1s exec time total Command: test. Finished 2 local Time elapsed: 3:22.9s Tests finished: Pass 2. Fail 0. Fatal 0. Skip 0. Build failure 0 # E2E ### how to enable ``` post_grad_fusion_options={ "activation_quantization_aten_pass": {"quant_type": "torch.float8_e5m2"}, }, ``` see D51860030 to check how to set the config under dynamo_config_map Note: you can change the quant_type, if nothing gives, then the default type torch.float8_e5m2 will be used to quantize #### If you use FSDP - You may also need to set inline_inbuilt_nn_modules to true for models that use FSDP (see D70023488 to check the config setting) - Remove UNSAFE_SKIP_FSDP_MODULE_GUARDS=1 (context: https://fb.workplace.com/groups/1075192433118967/permalink/1629608671010671/) ``` buck2 run mode/opt //aps_models/ads/icvr:icvr_launcher -- mode=mast_omnifm_v1-5_mwb launcher.max_retries=3 data_loader.dataset.batch_size=8 launcher.data_project=oncall_ads_model_platform launcher.fbl_entitlement=ads_global_tc_training_efficiency_qps max_ind_range=1 launcher.num_workers=8 data_loader.reading_service.num_remote_dpp_workers=30 data_loader.dataset.num_batches=100 trainer.gpu_tracer.wait=50 trainer.gpu_tracer.active=3 trainer.gpu_tracer.overhead_detection=10 launcher.tags=[ads_ranking_taxonomy_mc_qps_optimization] ``` aps-512_8_remove_fsdp_guards-92ae3972ba Differential Revision: D70522237

facebook-github-bot · 2025-04-09T20:55:16Z

This pull request was exported from Phabricator. Differential Revision: D70522237

Summary: X-link: pytorch/pytorch#148380 We enable the activation quantization in the forward pass, and users can customize the dtype they want to quantize. Differential Revision: D70522237

…ng (pytorch#148380) Summary: X-link: pytorch/benchmark#2607 We enable the activation quantization in the forward pass, and users can customize the dtype they want to quantize. Test Plan: # unit test ``` buck2 test 'fbcode//mode/dev-nosan' fbcode//caffe2/test/inductor:quantization -- test_activation_quantization_aten ``` Buck UI: https://www.internalfb.com/buck2/9a53c909-d3ea-479a-874e-cc917999ca88 Test UI: https://www.internalfb.com/intern/testinfra/testrun/12384899050440719 Network: Up: 62KiB Down: 81KiB (reSessionID-913ca82d-c395-4492-818e-6e004df37f87) Executing actions. Remaining 0/4 6.1s exec time total Command: test. Finished 2 local Time elapsed: 3:22.9s Tests finished: Pass 2. Fail 0. Fatal 0. Skip 0. Build failure 0 # E2E ### how to enable ``` post_grad_fusion_options={ "activation_quantization_aten_pass": {"quant_type": "torch.float8_e5m2"}, }, ``` see D51860030 to check how to set the config under dynamo_config_map Note: you can change the quant_type, if nothing gives, then the default type torch.float8_e5m2 will be used to quantize #### If you use FSDP - You may also need to set inline_inbuilt_nn_modules to true for models that use FSDP (see D70023488 to check the config setting) - Remove UNSAFE_SKIP_FSDP_MODULE_GUARDS=1 (context: https://fb.workplace.com/groups/1075192433118967/permalink/1629608671010671/) ``` buck2 run mode/opt //aps_models/ads/icvr:icvr_launcher -- mode=mast_omnifm_v1-5_mwb launcher.max_retries=3 data_loader.dataset.batch_size=8 launcher.data_project=oncall_ads_model_platform launcher.fbl_entitlement=ads_global_tc_training_efficiency_qps max_ind_range=1 launcher.num_workers=8 data_loader.reading_service.num_remote_dpp_workers=30 data_loader.dataset.num_batches=100 trainer.gpu_tracer.wait=50 trainer.gpu_tracer.active=3 trainer.gpu_tracer.overhead_detection=10 launcher.tags=[ads_ranking_taxonomy_mc_qps_optimization] ``` aps-512_8_remove_fsdp_guards-92ae3972ba tlparse: https://manifold.edge.x2p.facebook.net/v0/read/tree/logs/aps-512_8_remove_fsdp_guards-92ae3972ba/attempt_0/version_0/rank_0/index.html?bucketName=tlparse_reports&apiKey=tlparse_reports-key&withPayload=1&timeoutMsec=10000 Differential Revision: D70522237

facebook-github-bot · 2025-04-10T01:18:12Z

This pull request was exported from Phabricator. Differential Revision: D70522237

Summary: X-link: pytorch/pytorch#148380 We enable the activation quantization in the forward pass, and users can customize the dtype they want to quantize. Differential Revision: D70522237

…ng (pytorch#148380) Summary: X-link: pytorch/benchmark#2607 We enable the activation quantization in the forward pass, and users can customize the dtype they want to quantize. Test Plan: # unit test ``` buck2 test 'fbcode//mode/dev-nosan' fbcode//caffe2/test/inductor:quantization -- test_activation_quantization_aten ``` Buck UI: https://www.internalfb.com/buck2/9a53c909-d3ea-479a-874e-cc917999ca88 Test UI: https://www.internalfb.com/intern/testinfra/testrun/12384899050440719 Network: Up: 62KiB Down: 81KiB (reSessionID-913ca82d-c395-4492-818e-6e004df37f87) Executing actions. Remaining 0/4 6.1s exec time total Command: test. Finished 2 local Time elapsed: 3:22.9s Tests finished: Pass 2. Fail 0. Fatal 0. Skip 0. Build failure 0 # E2E ### how to enable ``` post_grad_fusion_options={ "activation_quantization_aten_pass": {"quant_type": "torch.float8_e5m2"}, }, ``` see D51860030 to check how to set the config under dynamo_config_map Note: you can change the quant_type, if nothing gives, then the default type torch.float8_e5m2 will be used to quantize #### If you use FSDP - You may also need to set inline_inbuilt_nn_modules to true for models that use FSDP (see D70023488 to check the config setting) - Remove UNSAFE_SKIP_FSDP_MODULE_GUARDS=1 (context: https://fb.workplace.com/groups/1075192433118967/permalink/1629608671010671/) ``` buck2 run mode/opt //aps_models/ads/icvr:icvr_launcher -- mode=mast_omnifm_v1-5_mwb launcher.max_retries=3 data_loader.dataset.batch_size=8 launcher.data_project=oncall_ads_model_platform launcher.fbl_entitlement=ads_global_tc_training_efficiency_qps max_ind_range=1 launcher.num_workers=8 data_loader.reading_service.num_remote_dpp_workers=30 data_loader.dataset.num_batches=100 trainer.gpu_tracer.wait=50 trainer.gpu_tracer.active=3 trainer.gpu_tracer.overhead_detection=10 launcher.tags=[ads_ranking_taxonomy_mc_qps_optimization] ``` aps-512_8_remove_fsdp_guards-92ae3972ba tlparse: https://manifold.edge.x2p.facebook.net/v0/read/tree/logs/aps-512_8_remove_fsdp_guards-92ae3972ba/attempt_0/version_0/rank_0/index.html?bucketName=tlparse_reports&apiKey=tlparse_reports-key&withPayload=1&timeoutMsec=10000 Differential Revision: D70522237

facebook-github-bot · 2025-04-10T03:06:07Z

This pull request was exported from Phabricator. Differential Revision: D70522237

Summary: X-link: pytorch/pytorch#148380 We enable the activation quantization in the forward pass, and users can customize the dtype they want to quantize. Differential Revision: D70522237

eellison

existing comments not addressed..

eellison · 2025-04-28T19:14:50Z

torch/_inductor/lowering.py

-@functools.lru_cache(None)
-def _warn_complex_not_supported():
-    warnings.warn(
-        "Torchinductor does not support code generation for complex operators. Performance may be worse than eager."
-    )
-
-
-# There are some types (CPU) which we accept as input but not as
-# output.
-def unsupported_input_tensor(t: torch.Tensor, parent=None, node=None):
-    "Do not support reading or writing to this tensor"
-    if t.is_complex():
-        # Complex views are supported with IR ComplexView
-        if parent and parent.target in (
-            torch.ops.aten.view.dtype,
-            torch.ops.prims.convert_element_type.default,
-        ):
-            return False
-        _warn_complex_not_supported()
-        return True
-
-    if t.is_meta:
-        return True
-
-    if t.dtype == torch.float8_e8m0fnu:
-        if not node:
-            return True
-
-        # allow bitcast, views, memory movement, but not arithmetic
-        # TODO: delete once triton adds native support
-        return not (
-            isinstance(parent.target, torch._ops.OpOverload)
-            and parent.target
-            in (
-                aten.view.dtype,
-                aten.cat.default,
-                aten._scaled_mm.default,
-            )
-            or (isinstance(node.target, torch._ops.OpOverload) and is_view(node.target))
-        )
-
-    return False
-
-
-def unsupported_output_tensor(t: torch.Tensor, parent=None, node=None):
-    "Do not support writing tensor but can read from it"
-    if unsupported_input_tensor(t, parent):
-        return True
-    return t.is_cpu and config.disable_cpp_codegen
-
-
-def fallback_node_due_to_unsupported_type(node: torch.fx.Node, allow_cpu_inputs=True):


bdhirsh · 2025-04-30T18:25:44Z

torch/_functorch/partitioners.py

+    bwd_module: fx.GraphModule,
+    static_lifetime_input_nodes: Optional[OrderedSet[fx.Node]] = None,
+):
+    GraphTransformObserver = functools.partial(


nit: this is a function, not a class - I would stick with function/variable naming, e.g. graph_transform_observer?

I followed the naming convention in the codebase to keep it consistent lol @bdhirsh

pytorch/torch/_inductor/fx_passes/joint_graph.py

Line 525 in ad9e209

GraphTransformObserver = functools.partial(

mengluy0125 · 2025-04-30T21:12:01Z

To improve the readability, I will make a separate PR to refactor the code to resolve the circular dependency problem @eellison @bdhirsh

…ng (#148380) Summary: X-link: pytorch/benchmark#2607 We enable the activation quantization in the forward pass, and users can customize the dtype they want to quantize. Test Plan: # unit test ``` buck2 test 'fbcode//mode/dev-nosan' fbcode//caffe2/test/inductor:quantization -- test_activation_quantization_aten ``` Buck UI: https://www.internalfb.com/buck2/9a53c909-d3ea-479a-874e-cc917999ca88 Test UI: https://www.internalfb.com/intern/testinfra/testrun/12384899050440719 Network: Up: 62KiB Down: 81KiB (reSessionID-913ca82d-c395-4492-818e-6e004df37f87) Executing actions. Remaining 0/4 6.1s exec time total Command: test. Finished 2 local Time elapsed: 3:22.9s Tests finished: Pass 2. Fail 0. Fatal 0. Skip 0. Build failure 0 # E2E ### how to enable ``` post_grad_fusion_options={ "activation_quantization_aten_pass": {"quant_type": "torch.float8_e5m2"}, }, ``` see D51860030 to check how to set the config under dynamo_config_map Note: you can change the quant_type, if nothing gives, then the default type torch.float8_e5m2 will be used to quantize #### If you use FSDP - You may also need to set inline_inbuilt_nn_modules to true for models that use FSDP (see D70023488 to check the config setting) - Remove UNSAFE_SKIP_FSDP_MODULE_GUARDS=1 (context: https://fb.workplace.com/groups/1075192433118967/permalink/1629608671010671/) ``` buck2 run mode/opt //aps_models/ads/icvr:icvr_launcher -- mode=mast_omnifm_v1-5_mwb launcher.max_retries=3 data_loader.dataset.batch_size=8 launcher.data_project=oncall_ads_model_platform launcher.fbl_entitlement=ads_global_tc_training_efficiency_qps max_ind_range=1 launcher.num_workers=8 data_loader.reading_service.num_remote_dpp_workers=30 data_loader.dataset.num_batches=100 trainer.gpu_tracer.wait=50 trainer.gpu_tracer.active=3 trainer.gpu_tracer.overhead_detection=10 launcher.tags=[ads_ranking_taxonomy_mc_qps_optimization] ``` aps-512_8_remove_fsdp_guards-92ae3972ba tlparse: https://manifold.edge.x2p.facebook.net/v0/read/tree/logs/aps-512_8_remove_fsdp_guards-92ae3972ba/attempt_0/version_0/rank_0/index.html?bucketName=tlparse_reports&apiKey=tlparse_reports-key&withPayload=1&timeoutMsec=10000 baseline w/o fp8 quantization: aps-mengluy_remove_fsdp-ce75b306fa w/ fp8 quantization: aps-mengluy_remove_fsdp_fp8-96541deec4 ### QPS {F1977040587} ### memory baseline {F1977040640} memory snapshot: https://www.internalfb.com/ai_infra/zoomer/profiling-run/insights?profilingRunID=1767027467197075&tab=INSIGHTS&primarySubtab=Memory%20Analysis&secondarySubtab=Memory%20Snapshot with fp8 {F1977040641} memory snapshot: https://www.internalfb.com/ai_infra/zoomer/profiling-run/insights?profilingRunID=639378375763157&tab=INSIGHTS&primarySubtab=Memory%20Analysis&secondarySubtab=Memory%20Snapshot ### conclusion: - ~9% qps improvement, reduces peak memory from 82.01 to 78.97. - for NE, we need have longer verification, WIP with scaling version. Differential Revision: D70522237

facebook-github-bot · 2025-05-01T04:20:33Z

This pull request was exported from Phabricator. Differential Revision: D70522237

Summary: X-link: pytorch/pytorch#148380 We enable the activation quantization in the forward pass, and users can customize the dtype they want to quantize. Differential Revision: D70522237

…ng (pytorch#148380) Summary: X-link: pytorch/benchmark#2607 We enable the activation quantization in the forward pass, and users can customize the dtype they want to quantize. Test Plan: # unit test ``` buck2 test 'fbcode//mode/dev-nosan' fbcode//caffe2/test/inductor:quantization -- test_activation_quantization_aten ``` Buck UI: https://www.internalfb.com/buck2/9a53c909-d3ea-479a-874e-cc917999ca88 Test UI: https://www.internalfb.com/intern/testinfra/testrun/12384899050440719 Network: Up: 62KiB Down: 81KiB (reSessionID-913ca82d-c395-4492-818e-6e004df37f87) Executing actions. Remaining 0/4 6.1s exec time total Command: test. Finished 2 local Time elapsed: 3:22.9s Tests finished: Pass 2. Fail 0. Fatal 0. Skip 0. Build failure 0 # E2E ### how to enable ``` post_grad_fusion_options={ "activation_quantization_aten_pass": {"quant_type": "torch.float8_e5m2"}, }, ``` see D51860030 to check how to set the config under dynamo_config_map Note: you can change the quant_type, if nothing gives, then the default type torch.float8_e5m2 will be used to quantize #### If you use FSDP - You may also need to set inline_inbuilt_nn_modules to true for models that use FSDP (see D70023488 to check the config setting) - Remove UNSAFE_SKIP_FSDP_MODULE_GUARDS=1 (context: https://fb.workplace.com/groups/1075192433118967/permalink/1629608671010671/) ``` buck2 run mode/opt //aps_models/ads/icvr:icvr_launcher -- mode=mast_omnifm_v1-5_mwb launcher.max_retries=3 data_loader.dataset.batch_size=8 launcher.data_project=oncall_ads_model_platform launcher.fbl_entitlement=ads_global_tc_training_efficiency_qps max_ind_range=1 launcher.num_workers=8 data_loader.reading_service.num_remote_dpp_workers=30 data_loader.dataset.num_batches=100 trainer.gpu_tracer.wait=50 trainer.gpu_tracer.active=3 trainer.gpu_tracer.overhead_detection=10 launcher.tags=[ads_ranking_taxonomy_mc_qps_optimization] ``` aps-512_8_remove_fsdp_guards-92ae3972ba tlparse: https://manifold.edge.x2p.facebook.net/v0/read/tree/logs/aps-512_8_remove_fsdp_guards-92ae3972ba/attempt_0/version_0/rank_0/index.html?bucketName=tlparse_reports&apiKey=tlparse_reports-key&withPayload=1&timeoutMsec=10000 baseline w/o fp8 quantization: aps-mengluy_remove_fsdp-ce75b306fa w/ fp8 quantization: aps-mengluy_remove_fsdp_fp8-96541deec4 ### QPS {F1977040587} ### memory baseline {F1977040640} memory snapshot: https://www.internalfb.com/ai_infra/zoomer/profiling-run/insights?profilingRunID=1767027467197075&tab=INSIGHTS&primarySubtab=Memory%20Analysis&secondarySubtab=Memory%20Snapshot with fp8 {F1977040641} memory snapshot: https://www.internalfb.com/ai_infra/zoomer/profiling-run/insights?profilingRunID=639378375763157&tab=INSIGHTS&primarySubtab=Memory%20Analysis&secondarySubtab=Memory%20Snapshot ### conclusion: - ~9% qps improvement, reduces peak memory from 82.01 to 78.97. - for NE, we need have longer verification, WIP with scaling version. Differential Revision: D70522237

facebook-github-bot · 2025-05-01T05:29:06Z

This pull request was exported from Phabricator. Differential Revision: D70522237

Summary: X-link: pytorch/pytorch#148380 We enable the activation quantization in the forward pass, and users can customize the dtype they want to quantize. Differential Revision: D70522237

…ng (pytorch#148380) Summary: X-link: pytorch/benchmark#2607 Pull Request resolved: pytorch#148380 We enable the activation quantization in the forward pass, and users can customize the dtype they want to quantize. Test Plan: # unit test ``` buck2 test 'fbcode//mode/dev-nosan' fbcode//caffe2/test/inductor:quantization -- test_activation_quantization_aten ``` Buck UI: https://www.internalfb.com/buck2/9a53c909-d3ea-479a-874e-cc917999ca88 Test UI: https://www.internalfb.com/intern/testinfra/testrun/12384899050440719 Network: Up: 62KiB Down: 81KiB (reSessionID-913ca82d-c395-4492-818e-6e004df37f87) Executing actions. Remaining 0/4 6.1s exec time total Command: test. Finished 2 local Time elapsed: 3:22.9s Tests finished: Pass 2. Fail 0. Fatal 0. Skip 0. Build failure 0 # E2E ### how to enable ``` post_grad_fusion_options={ "activation_quantization_aten_pass": {"quant_type": "torch.float8_e5m2"}, }, ``` see D51860030 to check how to set the config under dynamo_config_map Note: you can change the quant_type, if nothing gives, then the default type torch.float8_e5m2 will be used to quantize #### If you use FSDP - You may also need to set inline_inbuilt_nn_modules to true for models that use FSDP (see D70023488 to check the config setting) - Remove UNSAFE_SKIP_FSDP_MODULE_GUARDS=1 (context: https://fb.workplace.com/groups/1075192433118967/permalink/1629608671010671/) ``` buck2 run mode/opt //aps_models/ads/icvr:icvr_launcher -- mode=mast_omnifm_v1-5_mwb launcher.max_retries=3 data_loader.dataset.batch_size=8 launcher.data_project=oncall_ads_model_platform launcher.fbl_entitlement=ads_global_tc_training_efficiency_qps max_ind_range=1 launcher.num_workers=8 data_loader.reading_service.num_remote_dpp_workers=30 data_loader.dataset.num_batches=100 trainer.gpu_tracer.wait=50 trainer.gpu_tracer.active=3 trainer.gpu_tracer.overhead_detection=10 launcher.tags=[ads_ranking_taxonomy_mc_qps_optimization] ``` aps-512_8_remove_fsdp_guards-92ae3972ba tlparse: https://manifold.edge.x2p.facebook.net/v0/read/tree/logs/aps-512_8_remove_fsdp_guards-92ae3972ba/attempt_0/version_0/rank_0/index.html?bucketName=tlparse_reports&apiKey=tlparse_reports-key&withPayload=1&timeoutMsec=10000 baseline w/o fp8 quantization: aps-mengluy_remove_fsdp-ce75b306fa w/ fp8 quantization: aps-mengluy_remove_fsdp_fp8-96541deec4 ### QPS {F1977040587} ### memory baseline {F1977040640} memory snapshot: https://www.internalfb.com/ai_infra/zoomer/profiling-run/insights?profilingRunID=1767027467197075&tab=INSIGHTS&primarySubtab=Memory%20Analysis&secondarySubtab=Memory%20Snapshot with fp8 {F1977040641} memory snapshot: https://www.internalfb.com/ai_infra/zoomer/profiling-run/insights?profilingRunID=639378375763157&tab=INSIGHTS&primarySubtab=Memory%20Analysis&secondarySubtab=Memory%20Snapshot ### conclusion: - ~9% qps improvement, reduces peak memory from 82.01 to 78.97. - for NE, we need have longer verification, WIP with scaling version. Differential Revision: D70522237

…ng (pytorch#148380) Summary: X-link: pytorch/benchmark#2607 We enable the activation quantization in the forward pass, and users can customize the dtype they want to quantize. Test Plan: # unit test ``` buck2 test 'fbcode//mode/dev-nosan' fbcode//caffe2/test/inductor:quantization -- test_activation_quantization_aten ``` Buck UI: https://www.internalfb.com/buck2/9a53c909-d3ea-479a-874e-cc917999ca88 Test UI: https://www.internalfb.com/intern/testinfra/testrun/12384899050440719 Network: Up: 62KiB Down: 81KiB (reSessionID-913ca82d-c395-4492-818e-6e004df37f87) Executing actions. Remaining 0/4 6.1s exec time total Command: test. Finished 2 local Time elapsed: 3:22.9s Tests finished: Pass 2. Fail 0. Fatal 0. Skip 0. Build failure 0 # E2E ### how to enable ``` post_grad_fusion_options={ "activation_quantization_aten_pass": {"quant_type": "torch.float8_e5m2"}, }, ``` see D51860030 to check how to set the config under dynamo_config_map Note: you can change the quant_type, if nothing gives, then the default type torch.float8_e5m2 will be used to quantize #### If you use FSDP - You may also need to set inline_inbuilt_nn_modules to true for models that use FSDP (see D70023488 to check the config setting) - Remove UNSAFE_SKIP_FSDP_MODULE_GUARDS=1 (context: https://fb.workplace.com/groups/1075192433118967/permalink/1629608671010671/) ``` buck2 run mode/opt //aps_models/ads/icvr:icvr_launcher -- mode=mast_omnifm_v1-5_mwb launcher.max_retries=3 data_loader.dataset.batch_size=8 launcher.data_project=oncall_ads_model_platform launcher.fbl_entitlement=ads_global_tc_training_efficiency_qps max_ind_range=1 launcher.num_workers=8 data_loader.reading_service.num_remote_dpp_workers=30 data_loader.dataset.num_batches=100 trainer.gpu_tracer.wait=50 trainer.gpu_tracer.active=3 trainer.gpu_tracer.overhead_detection=10 launcher.tags=[ads_ranking_taxonomy_mc_qps_optimization] ``` aps-512_8_remove_fsdp_guards-92ae3972ba tlparse: https://manifold.edge.x2p.facebook.net/v0/read/tree/logs/aps-512_8_remove_fsdp_guards-92ae3972ba/attempt_0/version_0/rank_0/index.html?bucketName=tlparse_reports&apiKey=tlparse_reports-key&withPayload=1&timeoutMsec=10000 baseline w/o fp8 quantization: aps-mengluy_remove_fsdp-ce75b306fa w/ fp8 quantization: aps-mengluy_remove_fsdp_fp8-96541deec4 ### QPS {F1977040587} ### memory baseline {F1977040640} memory snapshot: https://www.internalfb.com/ai_infra/zoomer/profiling-run/insights?profilingRunID=1767027467197075&tab=INSIGHTS&primarySubtab=Memory%20Analysis&secondarySubtab=Memory%20Snapshot with fp8 {F1977040641} memory snapshot: https://www.internalfb.com/ai_infra/zoomer/profiling-run/insights?profilingRunID=639378375763157&tab=INSIGHTS&primarySubtab=Memory%20Analysis&secondarySubtab=Memory%20Snapshot ### conclusion: - ~9% qps improvement, reduces peak memory from 82.01 to 78.97. - for NE, we need have longer verification, WIP with scaling version. Reviewed By: avicizhu Differential Revision: D70522237

facebook-github-bot · 2025-05-07T21:13:05Z

This pull request was exported from Phabricator. Differential Revision: D70522237

Summary: X-link: pytorch/pytorch#148380 We enable the activation quantization in the forward pass, and users can customize the dtype they want to quantize. Reviewed By: avicizhu Differential Revision: D70522237

…ng (pytorch#148380) Summary: X-link: pytorch/benchmark#2607 Pull Request resolved: pytorch#148380 We enable the activation quantization in the forward pass, and users can customize the dtype they want to quantize. Test Plan: # unit test ``` buck2 test 'fbcode//mode/dev-nosan' fbcode//caffe2/test/inductor:quantization -- test_activation_quantization_aten ``` Buck UI: https://www.internalfb.com/buck2/9a53c909-d3ea-479a-874e-cc917999ca88 Test UI: https://www.internalfb.com/intern/testinfra/testrun/12384899050440719 Network: Up: 62KiB Down: 81KiB (reSessionID-913ca82d-c395-4492-818e-6e004df37f87) Executing actions. Remaining 0/4 6.1s exec time total Command: test. Finished 2 local Time elapsed: 3:22.9s Tests finished: Pass 2. Fail 0. Fatal 0. Skip 0. Build failure 0 # E2E ### how to enable ``` post_grad_fusion_options={ "activation_quantization_aten_pass": {"quant_type": "torch.float8_e5m2"}, }, ``` see D51860030 to check how to set the config under dynamo_config_map Note: you can change the quant_type, if nothing gives, then the default type torch.float8_e5m2 will be used to quantize #### If you use FSDP - You may also need to set inline_inbuilt_nn_modules to true for models that use FSDP (see D70023488 to check the config setting) - Remove UNSAFE_SKIP_FSDP_MODULE_GUARDS=1 (context: https://fb.workplace.com/groups/1075192433118967/permalink/1629608671010671/) ``` buck2 run mode/opt //aps_models/ads/icvr:icvr_launcher -- mode=mast_omnifm_v1-5_mwb launcher.max_retries=3 data_loader.dataset.batch_size=8 launcher.data_project=oncall_ads_model_platform launcher.fbl_entitlement=ads_global_tc_training_efficiency_qps max_ind_range=1 launcher.num_workers=8 data_loader.reading_service.num_remote_dpp_workers=30 data_loader.dataset.num_batches=100 trainer.gpu_tracer.wait=50 trainer.gpu_tracer.active=3 trainer.gpu_tracer.overhead_detection=10 launcher.tags=[ads_ranking_taxonomy_mc_qps_optimization] ``` aps-512_8_remove_fsdp_guards-92ae3972ba tlparse: https://manifold.edge.x2p.facebook.net/v0/read/tree/logs/aps-512_8_remove_fsdp_guards-92ae3972ba/attempt_0/version_0/rank_0/index.html?bucketName=tlparse_reports&apiKey=tlparse_reports-key&withPayload=1&timeoutMsec=10000 baseline w/o fp8 quantization: aps-mengluy_remove_fsdp-ce75b306fa w/ fp8 quantization: aps-mengluy_remove_fsdp_fp8-96541deec4 ### QPS {F1977040587} ### memory baseline {F1977040640} memory snapshot: https://www.internalfb.com/ai_infra/zoomer/profiling-run/insights?profilingRunID=1767027467197075&tab=INSIGHTS&primarySubtab=Memory%20Analysis&secondarySubtab=Memory%20Snapshot with fp8 {F1977040641} memory snapshot: https://www.internalfb.com/ai_infra/zoomer/profiling-run/insights?profilingRunID=639378375763157&tab=INSIGHTS&primarySubtab=Memory%20Analysis&secondarySubtab=Memory%20Snapshot ### conclusion: - ~9% qps improvement, reduces peak memory from 82.01 to 78.97. - for NE, we need have longer verification, WIP with scaling version. Differential Revision: D70522237

Summary: Pull Request resolved: #2607 X-link: pytorch/pytorch#148380 We enable the activation quantization in the forward pass, and users can customize the dtype they want to quantize. Reviewed By: Hahu803, avicizhu Differential Revision: D70522237 fbshipit-source-id: 9c501506e8bd40a1199fafb2e28e6384e7df4786

facebook-github-bot · 2025-05-08T04:36:43Z

@pytorchbot merge

(Initiating merge automatically since Phabricator Diff has merged)

pytorchmergebot · 2025-05-08T04:38:37Z

Merge started

Your change will be merged once all checks pass (ETA 0-4 Hours).

Learn more about merging in the wiki.

Questions? Feedback? Please reach out to the PyTorch DevX Team

Advanced Debugging

Check the merge workflow status
here

eellison · 2025-05-08T13:50:04Z

torch/_functorch/partitioners.py

+    return size_in_mb >= torch._inductor.config.post_grad_fusion_options[
+        "activation_quantization_aten_pass"
+    ].get("size_in_mb", 100)


This is going to add shape guards on every tensor. We probably want this only if statically true

pytorch-bot bot added ciflow/inductor module: inductor labels Mar 3, 2025

facebook-github-bot added the fb-exported label Mar 3, 2025

mengluy0125 force-pushed the export-D70522237 branch from d4097e7 to 059a135 Compare March 3, 2025 23:13

mengluy0125 added release notes: inductor release notes: quantization release notes category labels Mar 3, 2025

mengluy0125 force-pushed the export-D70522237 branch from 059a135 to dc3fa73 Compare March 6, 2025 21:34

mengluy0125 force-pushed the export-D70522237 branch from dc3fa73 to 5fb19c8 Compare April 9, 2025 18:40

pytorch-bot bot added the module: dynamo label Apr 9, 2025

mengluy0125 mentioned this pull request Apr 9, 2025

[1/n][Optimus][Auto-AC] Support activation quantization without scaling pytorch/benchmark#2607

Closed

mengluy0125 force-pushed the export-D70522237 branch from 5fb19c8 to 6130aad Compare April 9, 2025 20:55

mengluy0125 force-pushed the export-D70522237 branch from 6130aad to cf45f4d Compare April 10, 2025 01:18

mengluy0125 force-pushed the export-D70522237 branch from cf45f4d to 8101b09 Compare April 10, 2025 03:05

eellison reviewed Apr 28, 2025

View reviewed changes

bdhirsh reviewed Apr 30, 2025

View reviewed changes

mengluy0125 force-pushed the export-D70522237 branch from 60bf612 to 7dfc078 Compare May 1, 2025 04:20

mengluy0125 force-pushed the export-D70522237 branch from 7dfc078 to 6c1323a Compare May 1, 2025 05:28

mengluy0125 force-pushed the export-D70522237 branch from 6c1323a to 7e92ff3 Compare May 7, 2025 21:12

Mingming-Ding approved these changes May 7, 2025

View reviewed changes

Hahu803 approved these changes May 7, 2025

View reviewed changes

pytorchmergebot added the merging label May 8, 2025

pytorchmergebot closed this in 2d25e4d May 8, 2025

pytorchmergebot added Merged and removed merging labels May 8, 2025

eellison reviewed May 8, 2025

View reviewed changes

[1/n][Optimus][Auto-AC] Support activation quantization without scaling #148380

[1/n][Optimus][Auto-AC] Support activation quantization without scaling #148380

Uh oh!

Conversation

mengluy0125 commented Mar 3, 2025 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

unit test

E2E

how to enable (you can overrite the dtype, if nothing given, the default is fp8)

Uh oh!

pytorch-bot bot commented Mar 3, 2025 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

🔗 Helpful Links

🧪 See artifacts and rendered test results at hud.pytorch.org/pr/148380

✅ You can merge normally! (1 Unrelated Failure)

Uh oh!

facebook-github-bot commented Mar 3, 2025

Uh oh!

facebook-github-bot commented Mar 3, 2025

Uh oh!

facebook-github-bot commented Mar 6, 2025

Uh oh!

facebook-github-bot commented Apr 9, 2025

Uh oh!

facebook-github-bot commented Apr 9, 2025

Uh oh!

facebook-github-bot commented Apr 10, 2025

Uh oh!

facebook-github-bot commented Apr 10, 2025

Uh oh!

eellison left a comment

Choose a reason for hiding this comment

Uh oh!

eellison Apr 28, 2025

Choose a reason for hiding this comment

Uh oh!

bdhirsh Apr 30, 2025

Choose a reason for hiding this comment

Uh oh!

mengluy0125 May 1, 2025

Choose a reason for hiding this comment

Uh oh!

mengluy0125 commented Apr 30, 2025

Uh oh!

facebook-github-bot commented May 1, 2025

Uh oh!

facebook-github-bot commented May 1, 2025

Uh oh!

facebook-github-bot commented May 7, 2025

Uh oh!

facebook-github-bot commented May 8, 2025

Uh oh!

pytorchmergebot commented May 8, 2025

Merge started

Uh oh!

eellison May 8, 2025

Choose a reason for hiding this comment

Uh oh!

Uh oh!

mengluy0125 commented Mar 3, 2025 •

edited

Loading

pytorch-bot bot commented Mar 3, 2025 •

edited

Loading