torch_tensorrt.runtime#
Functions#
- torch_tensorrt.runtime.set_multi_device_safe_mode(mode: bool) _MultiDeviceSafeModeContextManager[source]#
Sets the runtime (Python-only and default) into multi-device safe mode.
When multiple devices are available on the system, the runtime needs additional device checks to execute safely. These checks can affect performance, so they are opt-in. Setting this mode also suppresses the warning about running unsafely in a multi-device context.
- Parameters:
mode (bool) – Enable (True) or disable (False) multi-device checks
Example
with torch_tensorrt.runtime.set_multi_device_safe_mode(True):
    results = trt_compiled_module(*inputs)
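The opt-in behaviour described above can be pictured with a small pure-Python sketch. It assumes the context manager sets the flag when constructed and restores the previous value on exit. `RuntimeStateSketch` and `SafeModeSketch` are hypothetical stand-ins for illustration, not names from torch_tensorrt.

```python
class RuntimeStateSketch:
    """Hypothetical holder for the runtime's multi-device safe mode flag."""

    def __init__(self) -> None:
        self.multi_device_safe_mode = False


class SafeModeSketch:
    """Hypothetical context manager: sets the flag at once, restores it on exit."""

    def __init__(self, state: RuntimeStateSketch, mode: bool) -> None:
        self._state = state
        # Remember the old value so it can be restored later (an assumption).
        self._old_mode = state.multi_device_safe_mode
        state.multi_device_safe_mode = mode

    def __enter__(self) -> "SafeModeSketch":
        return self

    def __exit__(self, exc_type, exc_value, traceback) -> None:
        self._state.multi_device_safe_mode = self._old_mode


state = RuntimeStateSketch()
with SafeModeSketch(state, True):
    inside = state.multi_device_safe_mode  # checks are enabled in this block
after = state.multi_device_safe_mode  # previous setting is back in effect
```

Scoping the flag to a `with` block keeps the cost of the extra checks limited to the calls that need them.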
- torch_tensorrt.runtime.enable_cudagraphs(compiled_module: GraphModule | Module) _CudagraphsContextManager[source]#
- torch_tensorrt.runtime.enable_pre_allocated_outputs(module: GraphModule) _PreAllocatedOutputContextManager[source]#
Runtime backend#
Execution uses the C++ runtime engine when it is installed in the build; otherwise the
Python runtime engine is used. There is no separate process-wide backend switch
in torch_tensorrt.runtime.
Classes#
- class torch_tensorrt.runtime.TorchTensorRTModule(serialized_engine: bytes | None = None, input_binding_names: ~typing.List[str] | None = None, output_binding_names: ~typing.List[str] | None = None, *, name: str = '', settings: ~torch_tensorrt.dynamo._settings.CompilationSettings = CompilationSettings(workspace_size=0, min_block_size=5, torch_executed_ops=set(), pass_through_build_failures=False, max_aux_streams=None, version_compatible=False, optimization_level=None, truncate_double=False, use_fast_partitioner=True, enable_experimental_decompositions=False, device=Device(type=DeviceType.GPU, gpu_id=0), require_full_compilation=False, disable_tf32=False, assume_dynamic_shape_support=False, sparse_weights=False, engine_capability=<EngineCapability.STANDARD: 1>, num_avg_timing_iters=1, dla_sram_size=1048576, dla_local_dram_size=1073741824, dla_global_dram_size=536870912, dryrun=False, hardware_compatible=False, timing_cache_path='/tmp/torch_tensorrt_engine_cache/timing_cache.bin', runtime_cache_path='/tmp/torch_tensorrt_engine_cache/runtime_cache.bin', dynamic_shapes_kernel_specialization_strategy='lazy', cuda_graph_strategy='disabled', lazy_engine_init=False, cache_built_engines=False, reuse_cached_engines=False, use_fp32_acc=False, refit_identical_engine_weights=False, strip_engine_weights=False, immutable_weights=True, enable_weight_streaming=False, enable_cross_compile_for_windows=False, tiling_optimization_level='none', l2_limit_for_tiling=-1, use_distributed_mode_trace=False, offload_module_to_cpu=False, enable_autocast=False, autocast_low_precision_type=None, autocast_excluded_nodes=set(), autocast_excluded_ops=set(), autocast_max_output_threshold=512, autocast_max_depth_of_reduction=None, autocast_calibration_dataloader=None, enable_resource_partitioning=False, cpu_memory_budget=None, dynamically_allocate_resources=False, decompose_attention=False, attn_bias_is_causal=True), weight_name_map: dict[~typing.Any, ~typing.Any] | None = None, 
requires_output_allocator: bool = False, requires_native_multidevice: bool = False, symbolic_shape_expressions: ~typing.Dict[str, ~typing.List[~typing.Dict[str, ~typing.Any]]] | None = None)[source]#
Bases: Module
nn.Module that runs a TensorRT engine inside PyTorch.
When the C++ Torch-TensorRT runtime is available, execution uses torch.classes.tensorrt.Engine and torch.ops.tensorrt.execute_engine. When only the Python runtime is available, a Python TRTEngine is registered under the same tensorrt::execute_engine op so that the same compiled graph works with either runtime transparently.
Supports torch.save / torch.load via get_extra_state / set_extra_state.
Single runtime module for TensorRT engines. Dispatches to the C++ or Python execution implementation depending on whether the C++ extension is available. See Python vs C++ runtime.
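The dispatch described above, with one op name and an implementation chosen by availability, can be sketched in plain Python. The registry dictionary and the two `run_*` functions are hypothetical illustrations. They do not reproduce how torch_tensorrt registers its ops.

```python
from typing import Callable, Dict, List, Tuple

# Hypothetical op table: op name -> implementation.
OpImpl = Callable[[List[float]], Tuple[str, List[float]]]
OP_NAME = "tensorrt::execute_engine"


def register_execute_engine(registry: Dict[str, OpImpl], cpp_available: bool) -> None:
    """Bind the single op name to the C++ stand-in if present, else the Python one."""

    def run_cpp(inputs: List[float]) -> Tuple[str, List[float]]:
        return ("cpp", list(inputs))

    def run_python(inputs: List[float]) -> Tuple[str, List[float]]:
        return ("python", list(inputs))

    registry[OP_NAME] = run_cpp if cpp_available else run_python


def call_compiled_graph(registry: Dict[str, OpImpl], inputs: List[float]) -> Tuple[str, List[float]]:
    """The caller only knows the op name, so it works with either backend."""
    return registry[OP_NAME](inputs)


with_cpp: Dict[str, OpImpl] = {}
register_execute_engine(with_cpp, cpp_available=True)

python_only: Dict[str, OpImpl] = {}
register_execute_engine(python_only, cpp_available=False)
```

Because the caller looks up the op by name, the same compiled graph runs unchanged whichever implementation was registered.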
- __init__(serialized_engine: bytes | None = None, input_binding_names: ~typing.List[str] | None = None, output_binding_names: ~typing.List[str] | None = None, *, name: str = '', settings: ~torch_tensorrt.dynamo._settings.CompilationSettings = CompilationSettings(workspace_size=0, min_block_size=5, torch_executed_ops=set(), pass_through_build_failures=False, max_aux_streams=None, version_compatible=False, optimization_level=None, truncate_double=False, use_fast_partitioner=True, enable_experimental_decompositions=False, device=Device(type=DeviceType.GPU, gpu_id=0), require_full_compilation=False, disable_tf32=False, assume_dynamic_shape_support=False, sparse_weights=False, engine_capability=<EngineCapability.STANDARD: 1>, num_avg_timing_iters=1, dla_sram_size=1048576, dla_local_dram_size=1073741824, dla_global_dram_size=536870912, dryrun=False, hardware_compatible=False, timing_cache_path='/tmp/torch_tensorrt_engine_cache/timing_cache.bin', runtime_cache_path='/tmp/torch_tensorrt_engine_cache/runtime_cache.bin', dynamic_shapes_kernel_specialization_strategy='lazy', cuda_graph_strategy='disabled', lazy_engine_init=False, cache_built_engines=False, reuse_cached_engines=False, use_fp32_acc=False, refit_identical_engine_weights=False, strip_engine_weights=False, immutable_weights=True, enable_weight_streaming=False, enable_cross_compile_for_windows=False, tiling_optimization_level='none', l2_limit_for_tiling=-1, use_distributed_mode_trace=False, offload_module_to_cpu=False, enable_autocast=False, autocast_low_precision_type=None, autocast_excluded_nodes=set(), autocast_excluded_ops=set(), autocast_max_output_threshold=512, autocast_max_depth_of_reduction=None, autocast_calibration_dataloader=None, enable_resource_partitioning=False, cpu_memory_budget=None, dynamically_allocate_resources=False, decompose_attention=False, attn_bias_is_causal=True), weight_name_map: dict[~typing.Any, ~typing.Any] | None = None, requires_output_allocator: bool = False, 
requires_native_multidevice: bool = False, symbolic_shape_expressions: ~typing.Dict[str, ~typing.List[~typing.Dict[str, ~typing.Any]]] | None = None)[source]#
Takes a name, target device, serialized TensorRT engine, and binding names / order and constructs a PyTorch torch.nn.Module around it. Uses the Torch-TensorRT runtime extension to run the engines.
If binding names are not provided, the engine binding names are assumed to follow this convention:
- [symbol].[index in input / output array]
ex. [x.0, x.1, x.2] -> [y.0]
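The naming convention above can be illustrated with two small helpers. Both function names are hypothetical and are not part of torch_tensorrt. They only show how names of the form `symbol.index` are built and read.

```python
from typing import List


def default_binding_names(symbol: str, count: int) -> List[str]:
    """Build names such as x.0, x.1, x.2 for `count` bindings."""
    return [f"{symbol}.{index}" for index in range(count)]


def binding_index(name: str) -> int:
    """Return the position encoded after the last dot of a binding name."""
    symbol, _, index = name.rpartition(".")
    if not symbol or not index.isdigit():
        raise ValueError(f"{name!r} does not follow the symbol.index convention")
    return int(index)


inputs = default_binding_names("x", 3)
outputs = default_binding_names("y", 1)
```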
- Parameters:
serialized_engine (bytes) – Serialized TensorRT engine as a bytes object
input_binding_names (List[str]) – List of input TensorRT engine binding names in the order they would be passed to the TRT modules
output_binding_names (List[str]) – List of output TensorRT engine binding names in the order they should be returned
- Keyword Arguments:
name (str) – Name for module
settings (CompilationSettings) – Settings used to compile engine, assumes engine was built with default compilation settings if object not passed
weight_name_map (dict) – Mapping of engine weight name to state_dict weight name
requires_output_allocator (bool) – Boolean flag indicating if the converter creates operators which require an Output Allocator to run (e.g. data dependent operators)
requires_native_multidevice (bool) – Boolean flag indicating if the converter creates operators which require multiple devices to run (e.g. multi-device collective operations)
symbolic_shape_expressions (Dict[str, List[Dict[str, Any]]]) – Symbolic shape expressions for the engine bindings
Example
import io

with io.BytesIO() as engine_bytes:
    engine_bytes.write(trt_engine.serialize())
    engine_str = engine_bytes.getvalue()

trt_module = TorchTensorRTModule(
    engine_str,
    input_binding_names=["x"],
    output_binding_names=["output"],
    name="my_module",
    # Wrap the current CUDA device index in a Device, as the settings expect
    settings=CompilationSettings(
        device=torch_tensorrt.Device(gpu_id=torch.cuda.current_device())
    ),
)
- Parameters:
serialized_engine – Raw TRT engine bytes (None if restoring state only).
input_binding_names – Input tensor names in forward order.
output_binding_names – Output tensor names in return order.
name – Logical name for logging and serialization.
settings – Compilation/runtime settings (device, lazy init, cross-compile, etc.).
weight_name_map – Engine weight name to state_dict key mapping (refit).
requires_output_allocator – Engine needs TRT dynamic output allocation.
symbolic_shape_expressions – Optional symbolic shape metadata from compile.
- disable_profiling() None[source]#
Disable engine profiling and clear the profiling flag on this module.
- dump_layer_info() None[source]#
Dump layer information encoded by the TensorRT engine in this module to STDOUT
- enable_profiling(profiling_results_dir: str | None = None, profile_format: str = 'perfetto') None[source]#
Enable engine profiling (optional path prefix and format for tracing output).
- forward(*inputs: Any) Tensor | Tuple[Tensor, ...][source]#
Run the TensorRT engine on GPU tensors (non-tensor args are cast to CUDA tensors).
Note: callers are responsible for ensuring the engine has been set up; the hot path intentionally omits a self.engine is None guard so that a properly-bound module avoids the per-call attribute check.
- get_engine() <torch.ScriptClass object at 0x7fe68b9edb70>[source]#
Return the underlying engine, raising if it has not been set up.
Used by every engine-accessing method except the hot forward path, which intentionally skips the check to avoid per-call overhead.
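The split between a checked accessor and an unchecked hot path can be sketched as follows. `EngineHolderSketch` is a hypothetical class, and the callable used as the "engine" is a stand-in for illustration only.

```python
from typing import Callable, Optional


class EngineHolderSketch:
    """Hypothetical module-like object holding an optional engine callable."""

    def __init__(self, engine: Optional[Callable[[int], int]] = None) -> None:
        self.engine = engine

    def get_engine(self) -> Callable[[int], int]:
        # Checked accessor: gives a clear error when setup has not happened.
        if self.engine is None:
            raise RuntimeError("Engine has not been set up")
        return self.engine

    def forward(self, value: int) -> int:
        # Hot path: no None check. With no engine, this fails with a
        # less descriptive TypeError from calling None.
        return self.engine(value)


ready = EngineHolderSketch(engine=lambda v: v * 2)
result = ready.forward(3)
```

The trade-off is that a module whose engine was never set up fails in `forward` with a generic error rather than the descriptive one from `get_engine`.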
- get_extra_state() Tuple[str, List[str | bytes] | None, List[str], List[str]][source]#
Return any extra state to include in the module’s state_dict.
Implement this and a corresponding set_extra_state() for your module if you need to store extra state. This function is called when building the module’s state_dict().
Note that extra state should be picklable to ensure working serialization of the state_dict. We only provide backwards compatibility guarantees for serializing Tensors; other objects may break backwards compatibility if their serialized pickled form changes.
- Returns:
Any extra state to store in the module’s state_dict
- Return type:
object
- get_layer_info() str[source]#
Get a JSON string containing the layer information encoded by the TensorRT engine in this module
- Returns:
A JSON string which contains the layer information of the engine encapsulated in this module
- Return type:
str
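Since the return value is a JSON string, it can be decoded with the standard library before inspection. The sketch below makes no assumption about the schema. The sample payload and its keys are hypothetical, and real output depends on the engine.

```python
import json
from typing import Any, List


def top_level_keys(layer_info_json: str) -> List[str]:
    """Decode a JSON string and list its top-level keys, if it is an object."""
    parsed: Any = json.loads(layer_info_json)
    if not isinstance(parsed, dict):
        return []
    return sorted(parsed)


# Hypothetical payload standing in for the string returned by get_layer_info().
sample = '{"Layers": [{"Name": "conv1"}], "Bindings": ["x", "output"]}'
keys = top_level_keys(sample)
```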
- set_extra_state(state: Tuple[str, List[str | bytes] | None, List[str], List[str]]) None[source]#
Set extra state contained in the loaded state_dict.
This function is called from load_state_dict() to handle any extra state found within the state_dict. Implement this function and a corresponding get_extra_state() for your module if you need to store extra state within its state_dict.
- Parameters:
state (dict) – Extra state from the state_dict
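The get_extra_state / set_extra_state pairing can be sketched without PyTorch as a picklable tuple that round-trips between two objects. The tuple layout below mirrors the documented type `Tuple[str, List[str | bytes] | None, List[str], List[str]]`. The meaning given to each field (name, engine info, input names, output names) is an assumption made for illustration, and `ExtraStateSketch` is a hypothetical class.

```python
import pickle
from typing import List, Optional, Tuple, Union

ExtraState = Tuple[str, Optional[List[Union[str, bytes]]], List[str], List[str]]


class ExtraStateSketch:
    """Hypothetical object that saves and restores its state as one tuple."""

    def __init__(self) -> None:
        self.name = ""
        self.engine_info: Optional[List[Union[str, bytes]]] = None
        self.input_names: List[str] = []
        self.output_names: List[str] = []

    def get_extra_state(self) -> ExtraState:
        # Field order is an assumption; only plain picklable values are used.
        return (self.name, self.engine_info, self.input_names, self.output_names)

    def set_extra_state(self, state: ExtraState) -> None:
        self.name, self.engine_info, self.input_names, self.output_names = state


source = ExtraStateSketch()
source.name = "my_module"
source.engine_info = ["metadata", b"\x00\x01"]
source.input_names = ["x"]
source.output_names = ["output"]

# Round-trip through pickle, as serialization of a state_dict would require.
restored = ExtraStateSketch()
restored.set_extra_state(pickle.loads(pickle.dumps(source.get_extra_state())))
```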
- setup_engine() None[source]#
Set up the engine for a module which has deferred engine setup.
Sets up the TensorRT engine for this module if setup has been deferred. If the engine has already been set up, returns without changing anything. Assumes that the serialized engine and settings have already been passed to the module.
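Deferred, idempotent setup of this kind can be sketched in a few lines. `LazyEngineSketch` is hypothetical, and the tuple stored as the "engine" is a placeholder for real deserialization.

```python
from typing import Optional, Tuple


class LazyEngineSketch:
    """Hypothetical module that may postpone building its engine."""

    def __init__(self, serialized_engine: bytes, lazy_engine_init: bool = False) -> None:
        self.serialized_engine = serialized_engine
        self.engine: Optional[Tuple[str, int]] = None
        self.setup_calls = 0  # counts real setups, for demonstration only
        if not lazy_engine_init:
            self.setup_engine()

    def setup_engine(self) -> None:
        # Return early if the engine already exists, so repeat calls are harmless.
        if self.engine is not None:
            return
        self.setup_calls += 1
        # Placeholder for deserializing the engine from its bytes.
        self.engine = ("deserialized", len(self.serialized_engine))


deferred = LazyEngineSketch(b"engine-bytes", lazy_engine_init=True)
before = deferred.engine
deferred.setup_engine()
deferred.setup_engine()  # second call is a no-op
```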
- property pre_allocated_outputs: Any#
Pre-allocated output tensors currently held by the underlying engine.