[1/n][Optimus][Auto-AC] Support activation quantization without scaling #148380
🔗 Helpful Links
🧪 See artifacts and rendered test results at hud.pytorch.org/pr/148380
Note: Links to docs will display an error until the docs builds have been completed.
✅ You can merge normally! (1 Unrelated Failure) As of commit 7e92ff3 with merge base d2ee606. UNSTABLE - The following job is marked as unstable, possibly due to flakiness on trunk.
This comment was automatically generated by Dr. CI and updates every 15 minutes.
This pull request was exported from Phabricator. Differential Revision: D70522237
d4097e7 → 059a135 (Compare)
059a135 → dc3fa73 (Compare)
Summary: We enable activation quantization in the forward pass, and users can customize the dtype they want to quantize to.

Test Plan:

# unit test
```
buck2 test 'fbcode//mode/dev-nosan' fbcode//caffe2/test/inductor:quantization -- test_activation_quantization_aten
```
Buck UI: https://www.internalfb.com/buck2/fc8469a3-54f7-425d-9b1f-e54840c0793a
Test UI: https://www.internalfb.com/intern/testinfra/testrun/3377699989853946
Network: Up: 10KiB Down: 0B (reSessionID-ab248457-6ac0-4b72-96da-0d3c427e260a)
Executing actions. Remaining 0/3 0.7s exec time total
Command: test. Finished 2 local
Time elapsed: 1:03.1s
Tests finished: Pass 2. Fail 0. Fatal 0. Skip 0. Build failure 0

# E2E
### how to enable (you can override the dtype and clamp range; if nothing is given, the default is fp8)
```
post_grad_fusion_options={
    "activation_quantization_aten_pass": {"quant_type": torch.float8_e5m2, "clamp_min": -57344.0, "clamp_max": 57344.0}
},
```
Differential Revision: D70522237
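For context, here is a minimal sketch of what "activation quantization without scaling" amounts to for a single saved activation: clamp to the representable range and cast down, then cast back before the backward pass consumes it. The helper names are hypothetical and this is not the Optimus pass itself; the dtype and clamp values mirror the defaults shown in the config above.

```python
import torch

def quantize_activation(x, dtype=torch.float8_e5m2,
                        clamp_min=-57344.0, clamp_max=57344.0):
    # No per-tensor scale: values are clamped to the fp8 range and cast directly.
    return torch.clamp(x, clamp_min, clamp_max).to(dtype)

def dequantize_activation(x_q, orig_dtype=torch.bfloat16):
    # Cast back before the activation is consumed in the backward pass.
    return x_q.to(orig_dtype)

act = torch.randn(1024, 1024, dtype=torch.bfloat16)
act_q = quantize_activation(act)        # stored at 1 byte per element
act_hat = dequantize_activation(act_q)  # lossy reconstruction used by backward
```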
dc3fa73 → 5fb19c8 (Compare)
Summary: We enable activation quantization in the forward pass, and users can customize the dtype they want to quantize to.

Test Plan:

# unit test
```
buck2 test 'fbcode//mode/dev-nosan' fbcode//caffe2/test/inductor:quantization -- test_activation_quantization_aten
```
Buck UI: https://www.internalfb.com/buck2/9a53c909-d3ea-479a-874e-cc917999ca88
Test UI: https://www.internalfb.com/intern/testinfra/testrun/12384899050440719
Network: Up: 62KiB Down: 81KiB (reSessionID-913ca82d-c395-4492-818e-6e004df37f87)
Executing actions. Remaining 0/4 6.1s exec time total
Command: test. Finished 2 local
Time elapsed: 3:22.9s
Tests finished: Pass 2. Fail 0. Fatal 0. Skip 0. Build failure 0

# E2E
### how to enable
```
post_grad_fusion_options={
    "activation_quantization_aten_pass": {"quant_type": "torch.float8_e5m2"},
},
```
See D51860030 for how to set the config under dynamo_config_map.
Note: you can change the quant_type; if nothing is given, the default torch.float8_e5m2 is used for quantization.

#### If you use FSDP
- You may also need to set inline_inbuilt_nn_modules to true for models that use FSDP (see D70023488 for the config setting).
- Remove UNSAFE_SKIP_FSDP_MODULE_GUARDS=1 (context: https://fb.workplace.com/groups/1075192433118967/permalink/1629608671010671/).
```
buck2 run mode/opt //aps_models/ads/icvr:icvr_launcher -- mode=mast_omnifm_v1-5_mwb launcher.max_retries=3 data_loader.dataset.batch_size=8 launcher.data_project=oncall_ads_model_platform launcher.fbl_entitlement=ads_global_tc_training_efficiency_qps max_ind_range=1 launcher.num_workers=8 data_loader.reading_service.num_remote_dpp_workers=30 data_loader.dataset.num_batches=100 trainer.gpu_tracer.wait=50 trainer.gpu_tracer.active=3 trainer.gpu_tracer.overhead_detection=10 launcher.tags=[ads_ranking_taxonomy_mc_qps_optimization]
```
aps-512_8_remove_fsdp_guards-92ae3972ba

Differential Revision: D70522237
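A hedged sketch of wiring this config into torch.compile outside the internal launcher: the config key comes from the summary above, while the toy model, device guard, and shapes are illustrative assumptions rather than part of the PR.

```python
import torch
from torch._inductor import config as inductor_config

inductor_config.post_grad_fusion_options = {
    "activation_quantization_aten_pass": {"quant_type": "torch.float8_e5m2"},
}

device = "cuda" if torch.cuda.is_available() else "cpu"
model = torch.nn.Sequential(
    torch.nn.Linear(1024, 4096),
    torch.nn.ReLU(),
    torch.nn.Linear(4096, 1024),
).to(device=device, dtype=torch.bfloat16)

compiled = torch.compile(model)
out = compiled(torch.randn(8, 1024, device=device, dtype=torch.bfloat16))
out.sum().backward()  # activations saved for backward are what the pass targets
```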
5fb19c8 → 6130aad (Compare)
6130aad → cf45f4d (Compare)
cf45f4d → 8101b09 (Compare)
existing comments not addressed..
torch/_inductor/lowering.py
Outdated
```python
@functools.lru_cache(None)
def _warn_complex_not_supported():
    warnings.warn(
        "Torchinductor does not support code generation for complex operators. Performance may be worse than eager."
    )


# There are some types (CPU) which we accept as input but not as
# output.
def unsupported_input_tensor(t: torch.Tensor, parent=None, node=None):
    "Do not support reading or writing to this tensor"
    if t.is_complex():
        # Complex views are supported with IR ComplexView
        if parent and parent.target in (
            torch.ops.aten.view.dtype,
            torch.ops.prims.convert_element_type.default,
        ):
            return False
        _warn_complex_not_supported()
        return True

    if t.is_meta:
        return True

    if t.dtype == torch.float8_e8m0fnu:
        if not node:
            return True

        # allow bitcast, views, memory movement, but not arithmetic
        # TODO: delete once triton adds native support
        return not (
            isinstance(parent.target, torch._ops.OpOverload)
            and parent.target
            in (
                aten.view.dtype,
                aten.cat.default,
                aten._scaled_mm.default,
            )
            or (isinstance(node.target, torch._ops.OpOverload) and is_view(node.target))
        )

    return False


def unsupported_output_tensor(t: torch.Tensor, parent=None, node=None):
    "Do not support writing tensor but can read from it"
    if unsupported_input_tensor(t, parent):
        return True
    return t.is_cpu and config.disable_cpp_codegen


def fallback_node_due_to_unsupported_type(node: torch.fx.Node, allow_cpu_inputs=True):
```
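As a small illustration of how these predicates behave for plain tensors (parent and node omitted), assuming the definitions shown in the diff above are importable as written; True means Inductor falls back rather than generating code for the tensor.

```python
import torch
# Assumes the function above lives in torch/_inductor/lowering.py as shown in the diff.
from torch._inductor.lowering import unsupported_input_tensor

complex_t = torch.randn(4, dtype=torch.complex64)
meta_t = torch.empty(4, device="meta")
fp32_t = torch.randn(4)

print(unsupported_input_tensor(complex_t))  # True (warns once about complex codegen)
print(unsupported_input_tensor(meta_t))     # True
print(unsupported_input_tensor(fp32_t))     # False: handled by Inductor codegen
```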
torch/_functorch/partitioners.py
Outdated
```python
    bwd_module: fx.GraphModule,
    static_lifetime_input_nodes: Optional[OrderedSet[fx.Node]] = None,
):
    GraphTransformObserver = functools.partial(
```
nit: this is a function, not a class - I would stick with function/variable naming, e.g. graph_transform_observer?
I followed the naming convention in the codebase to keep it consistent lol @bdhirsh
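A self-contained sketch (not the PR's code) of the pattern being discussed: functools.partial returns a plain callable, which is why the lowercase function-style name was suggested; log_pass and the bound argument are hypothetical.

```python
import functools

def log_pass(graph_name: str, pass_name: str, subsystem: str) -> None:
    print(f"[{subsystem}] running {pass_name} on {graph_name}")

# Partially apply the keyword argument; the result behaves like an ordinary function.
GraphTransformObserver = functools.partial(log_pass, subsystem="partitioner")
GraphTransformObserver("bwd_graph", "activation_quantization")
```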
60bf612 → 7dfc078 (Compare)
[1/n][Optimus][Auto-AC] Support activation quantization without scaling (#148380)

Summary: X-link: pytorch/benchmark#2607

We enable activation quantization in the forward pass, and users can customize the dtype they want to quantize to.

Test Plan:

# unit test
```
buck2 test 'fbcode//mode/dev-nosan' fbcode//caffe2/test/inductor:quantization -- test_activation_quantization_aten
```
Buck UI: https://www.internalfb.com/buck2/9a53c909-d3ea-479a-874e-cc917999ca88
Test UI: https://www.internalfb.com/intern/testinfra/testrun/12384899050440719
Network: Up: 62KiB Down: 81KiB (reSessionID-913ca82d-c395-4492-818e-6e004df37f87)
Executing actions. Remaining 0/4 6.1s exec time total
Command: test. Finished 2 local
Time elapsed: 3:22.9s
Tests finished: Pass 2. Fail 0. Fatal 0. Skip 0. Build failure 0

# E2E
### how to enable
```
post_grad_fusion_options={
    "activation_quantization_aten_pass": {"quant_type": "torch.float8_e5m2"},
},
```
See D51860030 for how to set the config under dynamo_config_map.
Note: you can change the quant_type; if nothing is given, the default torch.float8_e5m2 is used for quantization.

#### If you use FSDP
- You may also need to set inline_inbuilt_nn_modules to true for models that use FSDP (see D70023488 for the config setting).
- Remove UNSAFE_SKIP_FSDP_MODULE_GUARDS=1 (context: https://fb.workplace.com/groups/1075192433118967/permalink/1629608671010671/).
```
buck2 run mode/opt //aps_models/ads/icvr:icvr_launcher -- mode=mast_omnifm_v1-5_mwb launcher.max_retries=3 data_loader.dataset.batch_size=8 launcher.data_project=oncall_ads_model_platform launcher.fbl_entitlement=ads_global_tc_training_efficiency_qps max_ind_range=1 launcher.num_workers=8 data_loader.reading_service.num_remote_dpp_workers=30 data_loader.dataset.num_batches=100 trainer.gpu_tracer.wait=50 trainer.gpu_tracer.active=3 trainer.gpu_tracer.overhead_detection=10 launcher.tags=[ads_ranking_taxonomy_mc_qps_optimization]
```
aps-512_8_remove_fsdp_guards-92ae3972ba

tlparse: https://manifold.edge.x2p.facebook.net/v0/read/tree/logs/aps-512_8_remove_fsdp_guards-92ae3972ba/attempt_0/version_0/rank_0/index.html?bucketName=tlparse_reports&apiKey=tlparse_reports-key&withPayload=1&timeoutMsec=10000

baseline w/o fp8 quantization: aps-mengluy_remove_fsdp-ce75b306fa
w/ fp8 quantization: aps-mengluy_remove_fsdp_fp8-96541deec4

### QPS
{F1977040587}

### memory
baseline
{F1977040640}
memory snapshot: https://www.internalfb.com/ai_infra/zoomer/profiling-run/insights?profilingRunID=1767027467197075&tab=INSIGHTS&primarySubtab=Memory%20Analysis&secondarySubtab=Memory%20Snapshot

with fp8
{F1977040641}
memory snapshot: https://www.internalfb.com/ai_infra/zoomer/profiling-run/insights?profilingRunID=639378375763157&tab=INSIGHTS&primarySubtab=Memory%20Analysis&secondarySubtab=Memory%20Snapshot

### conclusion
- ~9% QPS improvement; peak memory reduced from 82.01 to 78.97.
- For NE, we need a longer verification run; WIP with the scaling version.

Differential Revision: D70522237
7dfc078 → 6c1323a (Compare)
6c1323a → 7e92ff3 (Compare)
Summary: Pull Request resolved: #2607
X-link: pytorch/pytorch#148380

We enable activation quantization in the forward pass, and users can customize the dtype they want to quantize to.

Reviewed By: Hahu803, avicizhu

Differential Revision: D70522237

fbshipit-source-id: 9c501506e8bd40a1199fafb2e28e6384e7df4786
@pytorchbot merge (Initiating merge automatically since Phabricator Diff has merged)
Merge started. Your change will be merged once all checks pass (ETA 0-4 hours). Learn more about merging in the wiki. Questions? Feedback? Please reach out to the PyTorch DevX Team.
```python
    return size_in_mb >= torch._inductor.config.post_grad_fusion_options[
        "activation_quantization_aten_pass"
    ].get("size_in_mb", 100)
```
This is going to add shape guards on every tensor. We probably want this only if statically true
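A minimal sketch of the guard-free check this comment suggests, assuming a hypothetical should_quantize helper and threshold handling; statically_known_true returns False instead of installing a new shape guard when the comparison depends on dynamic shapes.

```python
import torch
from torch.fx.experimental.symbolic_shapes import statically_known_true

def should_quantize(t: torch.Tensor, threshold_mb: float = 100.0) -> bool:
    # numel() may be symbolic under dynamic shapes; keep the arithmetic symbolic too.
    size_in_mb = t.numel() * t.element_size() / (1024 * 1024)
    # Only quantize when the comparison is provable without adding a guard.
    return statically_known_true(size_in_mb >= threshold_mb)
```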
Summary: We enable activation quantization in the forward pass, and users can customize the dtype they want to quantize to.
Test Plan:
unit test
Buck UI: https://www.internalfb.com/buck2/776d3911-bb86-4ac8-a527-540cf1510b9d
Test UI: https://www.internalfb.com/intern/testinfra/testrun/4785074873051017
Network: Up: 4.3MiB Down: 42MiB (reSessionID-fef7e727-68b1-4645-a519-5652854df38d)
Executing actions. Remaining 0/4 6.7s exec time total
Command: test. Finished 2 local
Time elapsed: 3:11.5s
Tests finished: Pass 2. Fail 0. Fatal 0. Skip 0. Build failure 0
E2E
how to enable (you can override the dtype; if nothing is given, the default is fp8)
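For reference, a config block matching the one shown earlier in this conversation; the dtype and clamp range are optional, and the key names come from the summaries above.

```python
import torch

post_grad_fusion_options = {
    "activation_quantization_aten_pass": {
        "quant_type": torch.float8_e5m2,  # optional; defaults to fp8 (e5m2)
        "clamp_min": -57344.0,            # optional clamp range for e5m2
        "clamp_max": 57344.0,
    },
}
```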
Differential Revision: D70522237
cc @voznesenskym @penguinwu @EikanWang @jgong5 @Guobing-Chen @XiaobingSuper @zhuhaozhe @blzheng @wenzhe-nrv @jiayisunx @ipiszy @yf225 @chenyang78 @kadeng @muchulee8 @amjames @chauhang @aakhundov