torch_tensorrt.runtime

Functions

torch_tensorrt.runtime.set_multi_device_safe_mode(mode: bool) -> _MultiDeviceSafeModeContextManager

Sets the runtime (Python-only and default) into multi-device safe mode.

When multiple devices are available on the system, the runtime needs additional device checks to execute safely. These checks can affect performance, so they are opt-in. This function can also be used to suppress the warning about running unsafely in a multi-device context.

Parameters:

mode (bool) – Enable (True) or disable (False) multi-device checks

Example

import torch_tensorrt

with torch_tensorrt.runtime.set_multi_device_safe_mode(True):
    results = trt_compiled_module(*inputs)

torch_tensorrt.runtime.enable_cudagraphs(compiled_module: GraphModule | Module) -> _CudagraphsContextManager

torch_tensorrt.runtime.get_cudagraphs_mode() -> bool

torch_tensorrt.runtime.get_whole_cudagraphs_mode() -> bool

torch_tensorrt.runtime.set_cudagraphs_mode(mode: bool) -> None

torch_tensorrt.runtime.enable_pre_allocated_outputs(module: GraphModule) -> _PreAllocatedOutputContextManager

torch_tensorrt.runtime.weight_streaming(module: GraphModule) -> _WeightStreamingContextManager

torch_tensorrt.runtime.enable_output_allocator(module: GraphModule) -> _OutputAllocatorContextManager
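
The signatures above come without usage notes. As a minimal sketch, enable_cudagraphs might be used as follows, assuming that it works as a context manager (as its return type suggests) and that the object it yields can be called like the compiled module. The wrapper name run_with_cudagraphs is hypothetical and not part of torch_tensorrt.

```python
from typing import Any, Sequence


def run_with_cudagraphs(trt_module: Any, inputs: Sequence[Any]) -> Any:
    """Run one forward pass inside the enable_cudagraphs context manager."""
    # Imported lazily so this sketch can be defined without a GPU build.
    import torch_tensorrt

    with torch_tensorrt.runtime.enable_cudagraphs(trt_module) as cudagraphs_module:
        # Assumption: the context manager yields a callable module.
        return cudagraphs_module(*inputs)
```

The other context-manager functions listed above (for example enable_pre_allocated_outputs) presumably follow the same with-statement pattern, but check their source before relying on that.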

Runtime backend

Execution uses the C++ runtime engine when it is installed in the build; otherwise the Python runtime engine is used. There is no separate process-wide backend switch in torch_tensorrt.runtime.

Classes

class torch_tensorrt.runtime.TorchTensorRTModule(serialized_engine: bytes | None = None, input_binding_names: ~typing.List[str] | None = None, output_binding_names: ~typing.List[str] | None = None, *, name: str = '', settings: ~torch_tensorrt.dynamo._settings.CompilationSettings = CompilationSettings(workspace_size=0, min_block_size=5, torch_executed_ops=set(), pass_through_build_failures=False, max_aux_streams=None, version_compatible=False, optimization_level=None, truncate_double=False, use_fast_partitioner=True, enable_experimental_decompositions=False, device=Device(type=DeviceType.GPU, gpu_id=0), require_full_compilation=False, disable_tf32=False, assume_dynamic_shape_support=False, sparse_weights=False, engine_capability=<EngineCapability.STANDARD: 1>, num_avg_timing_iters=1, dla_sram_size=1048576, dla_local_dram_size=1073741824, dla_global_dram_size=536870912, dryrun=False, hardware_compatible=False, timing_cache_path='/tmp/torch_tensorrt_engine_cache/timing_cache.bin', runtime_cache_path='/tmp/torch_tensorrt_engine_cache/runtime_cache.bin', dynamic_shapes_kernel_specialization_strategy='lazy', cuda_graph_strategy='disabled', lazy_engine_init=False, cache_built_engines=False, reuse_cached_engines=False, use_fp32_acc=False, refit_identical_engine_weights=False, strip_engine_weights=False, immutable_weights=True, enable_weight_streaming=False, enable_cross_compile_for_windows=False, tiling_optimization_level='none', l2_limit_for_tiling=-1, use_distributed_mode_trace=False, offload_module_to_cpu=False, enable_autocast=False, autocast_low_precision_type=None, autocast_excluded_nodes=set(), autocast_excluded_ops=set(), autocast_max_output_threshold=512, autocast_max_depth_of_reduction=None, autocast_calibration_dataloader=None, enable_resource_partitioning=False, cpu_memory_budget=None, dynamically_allocate_resources=False, decompose_attention=False, attn_bias_is_causal=True), weight_name_map: dict[~typing.Any, ~typing.Any] | None = None, 
requires_output_allocator: bool = False, requires_native_multidevice: bool = False, symbolic_shape_expressions: ~typing.Dict[str, ~typing.List[~typing.Dict[str, ~typing.Any]]] | None = None)

Bases: Module

nn.Module that runs a TensorRT engine inside PyTorch.

When the C++ Torch-TensorRT runtime is available, execution uses torch.classes.tensorrt.Engine and torch.ops.tensorrt.execute_engine. When only the Python runtime is available, a Python TRTEngine is registered under the same tensorrt::execute_engine op so that the same compiled graph works with either runtime transparently.

Supports torch.save / torch.load via get_extra_state / set_extra_state.

Single runtime module for TensorRT engines. Dispatches to the C++ or Python execution implementation depending on whether the C++ extension is available. See Python vs C++ runtime.

__init__(serialized_engine: bytes | None = None, input_binding_names: ~typing.List[str] | None = None, output_binding_names: ~typing.List[str] | None = None, *, name: str = '', settings: ~torch_tensorrt.dynamo._settings.CompilationSettings = CompilationSettings(workspace_size=0, min_block_size=5, torch_executed_ops=set(), pass_through_build_failures=False, max_aux_streams=None, version_compatible=False, optimization_level=None, truncate_double=False, use_fast_partitioner=True, enable_experimental_decompositions=False, device=Device(type=DeviceType.GPU, gpu_id=0), require_full_compilation=False, disable_tf32=False, assume_dynamic_shape_support=False, sparse_weights=False, engine_capability=<EngineCapability.STANDARD: 1>, num_avg_timing_iters=1, dla_sram_size=1048576, dla_local_dram_size=1073741824, dla_global_dram_size=536870912, dryrun=False, hardware_compatible=False, timing_cache_path='/tmp/torch_tensorrt_engine_cache/timing_cache.bin', runtime_cache_path='/tmp/torch_tensorrt_engine_cache/runtime_cache.bin', dynamic_shapes_kernel_specialization_strategy='lazy', cuda_graph_strategy='disabled', lazy_engine_init=False, cache_built_engines=False, reuse_cached_engines=False, use_fp32_acc=False, refit_identical_engine_weights=False, strip_engine_weights=False, immutable_weights=True, enable_weight_streaming=False, enable_cross_compile_for_windows=False, tiling_optimization_level='none', l2_limit_for_tiling=-1, use_distributed_mode_trace=False, offload_module_to_cpu=False, enable_autocast=False, autocast_low_precision_type=None, autocast_excluded_nodes=set(), autocast_excluded_ops=set(), autocast_max_output_threshold=512, autocast_max_depth_of_reduction=None, autocast_calibration_dataloader=None, enable_resource_partitioning=False, cpu_memory_budget=None, dynamically_allocate_resources=False, decompose_attention=False, attn_bias_is_causal=True), weight_name_map: dict[~typing.Any, ~typing.Any] | None = None, requires_output_allocator: bool = False, 
requires_native_multidevice: bool = False, symbolic_shape_expressions: ~typing.Dict[str, ~typing.List[~typing.Dict[str, ~typing.Any]]] | None = None)

Takes a name, target device, serialized TensorRT engine, and binding names / order, and constructs a PyTorch torch.nn.Module around it. Uses the Torch-TensorRT runtime extension to run the engine.

If binding names are not provided, the engine binding names are assumed to follow this convention:

  • [symbol].[index in input / output array]
    • ex. [x.0, x.1, x.2] -> [y.0]
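
The default naming convention above can be illustrated with a small helper. This is only a sketch: default_binding_names is a hypothetical function, not part of torch_tensorrt, and it does nothing more than generate names of the form [symbol].[index].

```python
from typing import List


def default_binding_names(symbol: str, count: int) -> List[str]:
    """Build names of the form "<symbol>.<index>" (hypothetical helper)."""
    return [f"{symbol}.{i}" for i in range(count)]


# Three inputs and one output, matching the example in the text above.
input_names = default_binding_names("x", 3)
output_names = default_binding_names("y", 1)
print(input_names, "->", output_names)
```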

Parameters:
  • serialized_engine (bytes) – Serialized TensorRT engine as bytes

  • input_binding_names (List[str]) – List of input TensorRT engine binding names in the order they would be passed to the TRT module

  • output_binding_names (List[str]) – List of output TensorRT engine binding names in the order they should be returned

Keyword Arguments:
  • name (str) – Name for module

  • settings (CompilationSettings) – Settings used to compile the engine. If not passed, the engine is assumed to have been built with default compilation settings

  • weight_name_map (dict) – Mapping of engine weight name to state_dict weight name

  • requires_output_allocator (bool) – Boolean flag indicating if the converter creates operators which require an Output Allocator to run (e.g. data dependent operators)

  • requires_native_multidevice (bool) – Boolean flag indicating if the converter creates operators which require multiple devices to run (e.g. multi-device collective operations)

  • symbolic_shape_expressions (Dict[str, List[Dict[str, Any]]]) – Symbolic shape expressions for the engine bindings

Example

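import io

import torch
from torch_tensorrt import Device
from torch_tensorrt.dynamo import CompilationSettings
from torch_tensorrt.runtime import TorchTensorRTModule

# trt_engine is an already-built TensorRT engine
with io.BytesIO() as engine_bytes:
    engine_bytes.write(trt_engine.serialize())
    engine_str = engine_bytes.getvalue()

trt_module = TorchTensorRTModule(
    engine_str,
    input_binding_names=["x"],
    output_binding_names=["output"],
    name="my_module",
    # Target the currently selected CUDA device
    settings=CompilationSettings(device=Device(gpu_id=torch.cuda.current_device())),
)

Parameters: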
  • serialized_engine – Raw TRT engine bytes (None if restoring state only).

  • input_binding_names – Input tensor names in forward order.

  • output_binding_names – Output tensor names in return order.

  • name – Logical name for logging and serialization.

  • settings – Compilation/runtime settings (device, lazy init, cross-compile, etc.).

  • weight_name_map – Engine weight name to state_dict key mapping (refit).

  • requires_output_allocator – Engine needs TRT dynamic output allocation.

  • symbolic_shape_expressions – Optional symbolic shape metadata from compile.

disable_profiling() -> None

Disable engine profiling and clear the profiling flag on this module.

dump_layer_info() -> None

Dump layer information encoded by the TensorRT engine in this module to STDOUT.

enable_profiling(profiling_results_dir: str | None = None, profile_format: str = 'perfetto') -> None

Enable engine profiling (optional path prefix and format for tracing output).
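
One way to pair enable_profiling with disable_profiling is a small wrapper that guarantees profiling is switched off again. This is a sketch, not part of the library: run_with_profiling is a hypothetical helper, and it relies only on the two method signatures documented here.

```python
from typing import Any, Optional, Sequence


def run_with_profiling(
    trt_module: Any,
    inputs: Sequence[Any],
    profiling_results_dir: Optional[str] = None,
) -> Any:
    """Run one forward pass with engine profiling enabled, then turn it off."""
    trt_module.enable_profiling(profiling_results_dir=profiling_results_dir)
    try:
        return trt_module(*inputs)
    finally:
        # Always clear the profiling flag, even if the forward pass raises.
        trt_module.disable_profiling()
```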

forward(*inputs: Any) -> Tensor | Tuple[Tensor, ...]

Run the TensorRT engine on GPU tensors (non-tensor args are cast to CUDA tensors).

Note: callers are responsible for ensuring the engine has been set up; the hot path intentionally omits a self.engine is None guard so that a properly-bound module avoids the per-call attribute check.

get_engine() -> torch.ScriptClass

Return the underlying engine, raising if it has not been set up.

Used by every engine-accessing method except the hot forward path, which intentionally skips the check to avoid per-call overhead.

get_extra_state() -> Tuple[str, List[str | bytes] | None, List[str], List[str]]

Return any extra state to include in the module’s state_dict.

Implement this and a corresponding set_extra_state() for your module if you need to store extra state. This function is called when building the module’s state_dict().

Note that extra state should be picklable to ensure working serialization of the state_dict. We only provide backwards compatibility guarantees for serializing Tensors; other objects may break backwards compatibility if their serialized pickled form changes.

Returns:

Any extra state to store in the module’s state_dict

Return type:

object

get_layer_info() -> str

Get a JSON string containing the layer information encoded by the TensorRT engine in this module.

Returns:

A JSON string which contains the layer information of the engine encapsulated in this module

Return type:

str
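
Because get_layer_info returns a JSON string, it can be parsed with the standard json module. The sketch below assumes a top-level "Layers" list whose entries are either plain strings or objects with a "Name" field. That schema is an assumption, and the sample string is a hypothetical stand-in for real output, so inspect the actual string before depending on specific keys.

```python
import json
from typing import Any, Dict, List


def layer_names(layer_info_json: str) -> List[str]:
    """Extract layer names from a layer-information JSON string.

    Assumes a top-level "Layers" list whose entries are either plain
    strings or objects with a "Name" field.
    """
    info: Dict[str, Any] = json.loads(layer_info_json)
    names: List[str] = []
    for layer in info.get("Layers", []):
        if isinstance(layer, str):
            names.append(layer)
        else:
            names.append(layer.get("Name", "<unnamed>"))
    return names


# Hypothetical sample standing in for trt_module.get_layer_info().
sample = '{"Layers": ["conv1", {"Name": "relu1", "LayerType": "Activation"}]}'
print(layer_names(sample))
```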

set_extra_state(state: Tuple[str, List[str | bytes] | None, List[str], List[str]]) -> None

Set extra state contained in the loaded state_dict.

This function is called from load_state_dict() to handle any extra state found within the state_dict. Implement this function and a corresponding get_extra_state() for your module if you need to store extra state within its state_dict.

Parameters:

state (Tuple[str, List[str | bytes] | None, List[str], List[str]]) – Extra state from the state_dict

setup_engine() -> None

Set up the engine for a module whose engine setup was deferred.

Sets up the TensorRT engine for this module if setup has been deferred. If the engine has already been set up, returns without changing anything. Assumes that the serialized engine and settings have already been passed to the module.
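
Since forward skips the engine check, a caller working with deferred setup may want to make the engine ready explicitly before the first call. The helper below is a hypothetical sketch built only on the documented behaviour of setup_engine (safe to call repeatedly) and get_engine (raises if the engine is missing).

```python
from typing import Any


def ensure_engine_ready(trt_module: Any) -> Any:
    """Set up a deferred engine (a no-op if already set up) and return it."""
    trt_module.setup_engine()
    # get_engine() raises if the engine is still missing, so a failure
    # surfaces here rather than inside the unchecked forward path.
    return trt_module.get_engine()
```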

property pre_allocated_outputs: Any

Pre-allocated output tensors currently held by the underlying engine.