
Conversation


@briancoutinho (Contributor) commented Sep 18, 2025

Credit to @zli669 for implementing this change; I am just setting it up here for contribution to Kineto.

Overview

Adds the capability to dynamically load plugins for Kineto (#1121). The core idea is to have plugin modules that can be built as shared object (.so) files. Kineto loads all .so files found in a specified directory:

export KINETO_PLUGIN_LIB_DIR_PATH=/foo/bar/

These plugins can then register themselves with Kineto, start/stop profiling, and return trace events to Kineto.
Please look at DynamicPluginTest.cpp for example usage of this API.

The PR also enables counter events that can show up in the trace.

Details

Companion changes

First, let's address some simpler changes:

  1. Support for Performance Counter Events in Chrome Trace Format: Adds a new activity type, GPU_PM_COUNTER, enabling a stream of performance counter events to be logged by any plugin in Kineto.
  2. Environment Variable KINETO_DISABLE_CUPTI: Disables CUPTI on NVIDIA GPUs to support non-CUPTI-based profiling modules. This helps avoid conflicts with CUPTI, as NVIDIA support is currently closely coupled with it (see the usage sketch after this list).
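
As a usage sketch (the plugin directory path below is a placeholder):

# Directory that Kineto scans for plugin .so files
export KINETO_PLUGIN_LIB_DIR_PATH=/path/to/plugins/
# Disable CUPTI so a non-CUPTI plugin can profile instead
export KINETO_DISABLE_CUPTI=1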

Dynamic Plugin changes

The dynamic plugin interface is a generalization of Kineto's internal IActivityProfiler interface. The support is implemented in three parts, listed here in the recommended review order:

1) Plugin C Interface: libkineto/include/KinetoDynamicPluginInterface.h

The plugin shared object must implement KinetoPlugin_register(), which provides function pointers for the plugin's trace functions: create(), destroy(), start(), stop(), and processEvents(). Kineto drives the profiling session through these function pointers.

To ensure ABI compatibility and avoid C++ compiler mismatches, a pure C interface is used. This interface includes C versions of key structures and enums like ActivityType, ProfileEvent, Flows, etc.

Lastly, to transfer trace data from the plugin to Kineto, an opaque C object called KinetoTraceBuilder is used. The key idea is that the trace builder handles generating events and transferring them to Kineto's activity profiler. This ensures that NO dynamic memory is passed between the plugin shared object and Kineto.
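
For illustration, here is a minimal sketch of what a plugin shared object might look like. The callback-table layout and all names below are hypothetical stand-ins; the real declarations live in libkineto/include/KinetoDynamicPluginInterface.h.

// Hypothetical plugin sketch; all names are illustrative stand-ins for the
// real declarations in KinetoDynamicPluginInterface.h.
#include <cstdio>

struct KinetoTraceBuilder; // opaque trace builder, owned by Kineto

extern "C" {

// Assumed registration table: function pointers Kineto calls to drive
// the plugin's profiling lifecycle.
struct KinetoPlugin_Callbacks {
  void* (*create)();                // allocate per-session plugin state
  void (*destroy)(void* session);   // free it
  void (*start)(void* session);     // begin collecting events
  void (*stop)(void* session);      // stop collecting
  void (*processEvents)(void* session, KinetoTraceBuilder* builder);
};

static void* sessionCreate() { return new int(0); }
static void sessionDestroy(void* s) { delete static_cast<int*>(s); }
static void sessionStart(void* /*s*/) { std::puts("[plugin] start"); }
static void sessionStop(void* /*s*/) { std::puts("[plugin] stop"); }
static void sessionProcessEvents(void* /*s*/, KinetoTraceBuilder* /*b*/) {
  // Emit collected events through the builder's C API here. The builder
  // copies event data into Kineto's activity profiler, so no dynamically
  // allocated memory crosses the plugin/Kineto boundary.
}

// Kineto resolves this symbol with dlsym() after dlopen()ing the .so.
int KinetoPlugin_register(KinetoPlugin_Callbacks* cb) {
  cb->create = sessionCreate;
  cb->destroy = sessionDestroy;
  cb->start = sessionStart;
  cb->stop = sessionStop;
  cb->processEvents = sessionProcessEvents;
  return 0; // 0 = success (assumed convention)
}

} // extern "C"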

2) Shim to Internal Plugin Interface: libkineto/src/dynamic_plugin/PluginProfiler.h

The PluginProfiler class implements the IActivityProfiler interface used internally in Kineto. It controls the profiling session and generates session trace data using the function pointers from the shared object.
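
Conceptually, the shim stores the plugin's function-pointer table and forwards Kineto's internal calls to it. Below is a rough sketch reusing the hypothetical KinetoPlugin_Callbacks table from the sketch above; the stub base class is a simplified stand-in, not the real IActivityProfiler signature.

// Simplified stand-in for Kineto's internal interface; the real one is
// declared in libkineto/include/IActivityProfiler.h.
class IActivityProfilerStub {
 public:
  virtual ~IActivityProfilerStub() = default;
  virtual void start() = 0;
  virtual void stop() = 0;
};

// Shim: adapts the plugin's C function-pointer table to the C++ interface.
class PluginProfilerSketch : public IActivityProfilerStub {
 public:
  explicit PluginProfilerSketch(KinetoPlugin_Callbacks cb)
      : cb_(cb), session_(cb_.create()) {}
  ~PluginProfilerSketch() override { cb_.destroy(session_); }

  void start() override { cb_.start(session_); }
  void stop() override { cb_.stop(session_); }

 private:
  KinetoPlugin_Callbacks cb_; // filled in by KinetoPlugin_register()
  void* session_;             // opaque per-session state owned by the plugin
};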

2.2) Trace Event Builder: libkineto/src/dynamic_plugin/PluginTraceBuilder.h

Implements the TraceEvent builder used by the shared object plugin.

3) Plugin Loader: libkineto/src/dynamic_plugin/PluginLoader.h

Finally, the plugin loader handles the discovery of shared object plugins and dynamically loads them into the address space using standard techniques.
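
For reference, a minimal sketch of that standard pattern, assuming POSIX dlopen()/dlsym() and the KinetoPlugin_register entry point described above (link with -ldl; error handling simplified):

#include <dlfcn.h>
#include <filesystem>
#include <iostream>
#include <string>

// Scan a directory for .so files and call each plugin's registration hook.
// A sketch only; the real logic lives in PluginLoader.h.
void loadPluginsSketch(const std::string& dirPath) {
  for (const auto& entry : std::filesystem::directory_iterator(dirPath)) {
    if (entry.path().extension() != ".so") continue;

    dlerror(); // clear any stale error state
    void* handle = dlopen(entry.path().c_str(), RTLD_LAZY);
    if (!handle) {
      std::cerr << "dlopen failed: " << dlerror() << "\n";
      continue;
    }

    // Resolve the well-known registration entry point.
    void* sym = dlsym(handle, "KinetoPlugin_register");
    if (!sym) {
      std::cerr << "symbol not found: " << dlerror() << "\n";
      dlclose(handle);
      continue;
    }
    // Kineto would now cast sym to the registration function type and
    // invoke it with its callback table.
  }
}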

Testing

Unit Test

Added a DynamicPluginTest that exercises the PluginProfiler class and its creation/start/stop functions, as well as the trace event builder. Note that the unit test does not perform any dynamic loading.

Build

$> mkdir build && cd build
$> cmake .. && make

Test

$> ctest -R 'DynamicPlugin.*'
Test project /localhome/local-bcoutinho/kineto/kineto_brian/libkineto/build
    Start 15: DynamicPluginTest.PluginProfilerLifecycle
1/2 Test #15: DynamicPluginTest.PluginProfilerLifecycle ...   Passed    0.00 sec
    Start 16: DynamicPluginTest.EventBuilderProcessing
2/2 Test #16: DynamicPluginTest.EventBuilderProcessing ....   Passed    0.00 sec

100% tests passed, 0 tests failed out of 2

End-to-end test with Always On Profiling Plugin

Build PyTorch and import this branch into third_party/kineto. Base version tested on ``.
I used a simple test program. Normal operation uses CUPTI:

python3 ~/torch_samples/test_profiling_simple.py

Run using the plugin:

$> KINETO_LOG_LEVEL=0 KINETO_PLUGIN_LIB_DIR_PATH=$AON_KINETO_PLUGIN_LIB_DIR_PATH KINETO_DISABLE_CUPTI=1 python3 ~/torch_samples/test_profiling_simple.py
Using device: cuda
INFO:2025-10-03 21:48:26 384669:384669 init.cpp:164] Setting initCupti = 0 from environment KINETO_DISABLE_CUPTI=1

INFO:2025-10-03 21:48:26 384669:384669 PluginLoader.h:127] Found symbol KinetoPlugin_register() from /home/bcoutinho/aon_kineto_plugin_build/c36614826/libAonKinetoPlugin.so
*********************************************
***** [AON] KinetoPlugin_register(). *******
*********************************************
INFO:2025-10-03 21:48:26 384669:384669 CuptiActivityProfiler.cpp:244] CUDA versions. CUPTI: 130001; Runtime: 13000; Driver: 13000
*********************************************
******** [AON] [PLUGIN] query(). *********
*********************************************
Adding supported activities  Log file: /tmp/libkineto_activities_384669.json
  Trace start time: 2025-10-03 21:48:33
  Trace duration: 500ms
  Warmup duration: 5s
  Max GPU buffer size: 128MB
  Enabled activities: cpu_op,user_annotation,gpu_user_annotation,gpu_memcpy,gpu_memset,kernel,external_correlation,cuda_runtime,cuda_driver,cpu_instant_event,python_function,overhead,xpu_runtime,privateuse1_runtime,privateuse1_driver
INFO:2025-10-03 21:48:26 384669:384669 CuptiActivityProfiler.cpp:1017] [Profiler = AON Profiler] Evaluating whether to run child profiler.
*********************************************
******** [AON] [PLUGIN] create(). ********
*********************************************
INFO:2025-10-03 21:48:29 384669:384669 CuptiActivityProfiler.cpp:1025] [Profiler = AON Profiler] Running child profiler AON Profiler for 500 ms
...
INFO:2025-10-03 21:48:29 384669:384669 CuptiActivityProfiler.cpp:1242] Starting child profiler session
*********************************************
******** [AON] [PLUGIN] start(). *********
*********************************************
INFO:2025-10-03 21:48:29 384669:384669 CuptiActivityProfiler.cpp:1279] Stopping child profiler session
*********************************************
******** [AON] [PLUGIN] stop(). **********
*********************************************
STAGE:2025-10-03 21:48:29 384669:384669 ActivityProfilerController.cpp:396] Completed Stage: Collection
INFO:2025-10-03 21:48:29 384669:384669 CuptiActivityProfiler.cpp:293] Processing 1 CPU buffers
INFO:2025-10-03 21:48:29 384669:384669 CuptiActivityProfiler.cpp:394] Processing child profiler trace
*********************************************
***** [AON] [PLUGIN] processEvents(). *******
*********************************************
...

I can open the trace and find events added by the plugin:

  {
    "ph": "X", "cat": "kernel", "name": "_Z13gemmk1_kernelIifLi256ELi5ELb0ELb0ELb0ELb0E30cublasGemvTensorStridedBatchedIKfES2_S0_IfEfLi0EEv18cublasGemmk1ParamsIT0_T7_T8_T9_T10_N8biasTypeINS8_10value_typeES9_E4typeEE<<<(1, 1, 1), (256, 1, 1)>>>", "pid": 0, "tid": 7,
    "ts": 0.000, "dur": 0.000,
    "args": {
      "isImmediateLaunch": true, "hesCorrelationId": 2155942657, "apiCorrelationId": 8459009
    }
  },


meta-cla bot commented Sep 18, 2025

Hi @briancoutinho!

Thank you for your pull request.

We require contributors to sign our Contributor License Agreement, and yours needs attention.

You currently have a record in our system, but the CLA is no longer valid and will need to be resubmitted.

Process

In order for us to review and merge your suggested changes, please sign at https://code.facebook.com/cla. If you are contributing on behalf of someone else (e.g., your employer), the individual CLA may not be sufficient and your employer may need to sign the corporate CLA.

Once the CLA is signed, our tooling will perform checks and validations. Afterwards, the pull request will be tagged with CLA signed. The tagging process may take up to 1 hour after signing. Please give it that time before contacting us about it.

If you have received this in error or have any questions, please contact us at [email protected]. Thanks!

Comment on lines 21 to 66
enum KinetoPlugin_ProfileEventType {
  KINETO_PLUGIN_PROFILE_EVENT_TYPE_INVALID = 0,
  KINETO_PLUGIN_PROFILE_EVENT_TYPE_CPU_OP, // cpu side ops
  KINETO_PLUGIN_PROFILE_EVENT_TYPE_USER_ANNOTATION,
  KINETO_PLUGIN_PROFILE_EVENT_TYPE_GPU_USER_ANNOTATION,
  KINETO_PLUGIN_PROFILE_EVENT_TYPE_GPU_MEMCPY,
  KINETO_PLUGIN_PROFILE_EVENT_TYPE_GPU_MEMSET,
  KINETO_PLUGIN_PROFILE_EVENT_TYPE_CONCURRENT_KERNEL, // on-device kernels
  KINETO_PLUGIN_PROFILE_EVENT_TYPE_EXTERNAL_CORRELATION,
  KINETO_PLUGIN_PROFILE_EVENT_TYPE_CUDA_RUNTIME, // host side cuda runtime events
  KINETO_PLUGIN_PROFILE_EVENT_TYPE_CUDA_DRIVER, // host side cuda driver events
  KINETO_PLUGIN_PROFILE_EVENT_TYPE_CPU_INSTANT_EVENT, // host side point-like events
  KINETO_PLUGIN_PROFILE_EVENT_TYPE_PYTHON_FUNCTION,
  KINETO_PLUGIN_PROFILE_EVENT_TYPE_OVERHEAD, // CUPTI-induced overhead events sampled from its overhead API
  KINETO_PLUGIN_PROFILE_EVENT_TYPE_CUDA_SYNC, // synchronization events between runtime and kernels
  KINETO_PLUGIN_PROFILE_EVENT_TYPE_GPU_PM_COUNTER, // GPU PM counters
  KINETO_PLUGIN_PROFILE_EVENT_NUM_TYPES
};

This part is the most problematic in my opinion. Since it is a fixed set, it would be impossible to create a new plugin without contributing to the PyTorch code and changing this file to add the new enums the plugin supports.
There would then be an enum in the PyTorch code that is used nowhere in the codebase, since it would be used only by the plugin.

How about a more dynamic approach? Use an integer here instead of fixed enums. The lower range 0..N would map to the fixed enums, and higher values would be assigned dynamically to plugins:

0 to A-1 : fixed enums
A to B-1 : 'enums' for the 1st plugin
B to C-1 : 'enums' for the 2nd plugin
...

During initialization, each plugin would get a base address in the enum address space: A for the 1st plugin, B for the 2nd plugin, and so on.
In return, it would provide a list of strings naming its supported activities, which would be converted, in the order provided, into its enum range.
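
A sketch of that registry idea (all names here are hypothetical, illustrating the proposal rather than any existing Kineto API):

#include <cstdint>
#include <string>
#include <unordered_map>
#include <vector>

// Event-type values below kFixedTypeCount are the fixed enums; each plugin
// is handed the next contiguous block for its own activity names.
class EventTypeRegistry {
 public:
  static constexpr uint32_t kFixedTypeCount = 64; // fixed enums occupy 0..63

  // Assigns the plugin a base value; its i-th activity gets value base + i.
  uint32_t registerPlugin(const std::vector<std::string>& activityNames) {
    const uint32_t base = next_;
    for (uint32_t i = 0; i < activityNames.size(); ++i) {
      names_[base + i] = activityNames[i];
    }
    next_ += static_cast<uint32_t>(activityNames.size());
    return base;
  }

  const std::string& name(uint32_t type) const { return names_.at(type); }

 private:
  uint32_t next_ = kFixedTypeCount;
  std::unordered_map<uint32_t, std::string> names_;
};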

If you prefer to keep a fixed list, please add KINETO_PLUGIN_PROFILE_EVENT_TYPE_XPU_RUNTIME here.

@briancoutinho (Contributor Author) replied:

Actually, we should try to make the ActivityTypes more generic so that we have a smaller number of event types in the list, and every new platform does not have to add its own versions of runtime and *PU events. That way, new plugins should rarely need to add enums.

For now, I will update the PR to add all activity types to the header file.

@briancoutinho (Contributor Author) replied:

I have now added all the activity types here for completeness.

  KINETO_PLUGIN_PROFILE_EVENT_TYPE_GPU_MEMSET,
  KINETO_PLUGIN_PROFILE_EVENT_TYPE_CONCURRENT_KERNEL, // on-device kernels
  KINETO_PLUGIN_PROFILE_EVENT_TYPE_EXTERNAL_CORRELATION,
  KINETO_PLUGIN_PROFILE_EVENT_TYPE_CUDA_RUNTIME, // host side cuda runtime events
@tsocha commented:

These events should be more generic, in my opinion.

Why is CUDA hardcoded?
What about Intel's XPU, AMD's ROCm, Google's TPU, some custom FPGA, etc.?

@briancoutinho (Contributor Author) replied:

@tsocha I completely agree with you. Note that this enum reflects the C++ enums in ActivityType.h:
https://github.com/pytorch/kineto/blob/main/libkineto/include/ActivityType.h#L19-L28

For historical reasons the activities are named after CUDA, but ideally they should be named generically rather than after each GPU's runtime. I don't think we can fix that in this PR, but I would do something like:

enum class ActivityType {
  // ...
  GPU_RUNTIME,
  CUDA_RUNTIME = GPU_RUNTIME,
  XPU_RUNTIME = GPU_RUNTIME,
  MTIA_RUNTIME = GPU_RUNTIME,
  // ...
};

cc @sraikund16, does that make sense long term?

@sraikund16 (Contributor) replied:

@briancoutinho Agreed, this should be easier to scale

briancoutinho force-pushed the bcoutinho/dynamic_plugin branch from 3237912 to 0641549 on September 24, 2025 at 00:31
briancoutinho marked this pull request as ready for review on September 26, 2025 at 00:50
@briancoutinho (Contributor Author) commented:

cc @sraikund16 @davidberard98 this is ready for review.
I have some internal tests pending, but that should not impact the code in Kineto.

@briancoutinho (Contributor Author) commented:

@sraikund16 / @aaronenyeshi / @sanrise Gentle nudge, please help with review.

@briancoutinho (Contributor Author) commented:

@sraikund16 Please help with review:)

@briancoutinho (Contributor Author) commented:

please

// This file handles the pure C plugin profiler interface and converts it to
// the internal profiler interface

class PluginProfilerSession : public IActivityProfilerSession {


Do you plan to expose toggleCollectionDynamic functionality, so plugin profilers can react to it?

@briancoutinho (Contributor Author) replied:

Not currently, since this is not yet part of the IActivityProfilerSession interface.
However, the structures here can be extended as long as we keep the same field order:
https://github.com/pytorch/kineto/blob/main/libkineto/include/IActivityProfiler.h
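
To illustrate that append-only extension rule (the struct and field names below are hypothetical): a plugin compiled against an older layout keeps working because existing fields keep their offsets.

// Hypothetical example of extending a C interface struct without breaking
// ABI: new fields are only ever appended, never inserted or reordered.
struct KinetoPlugin_StartParams_V1 {
  int traceDurationMs;
};

struct KinetoPlugin_StartParams_V2 {
  int traceDurationMs;       // same offset as in V1
  int enabledActivityCount;  // appended in V2; old plugins simply ignore it
};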

briancoutinho force-pushed the bcoutinho/dynamic_plugin branch from c72bef6 to e9d5cf1 on October 10, 2025 at 19:11
meta-cla bot added the cla signed label on Oct 14, 2025
@briancoutinho (Contributor Author) commented:

Hi @malfet, curious if we could get some feedback on this PR. Guessing folks are busy, but is there anyone among the maintainers at Meta who can help import this and get it to the next stage?
My CLA signup just went through today; let me know if there is anything else needed from my side.

// Clear error state
dlerror();

void *pHandle = dlopen(libFilePath.c_str(), RTLD_LAZY);

@australopitek commented:

Do you plan to add support for Windows?

@briancoutinho (Contributor Author) replied:

@australopitek Not planning to in this PR, but it should be an easy addition on top of it. Personally, I don't have a Windows setup to test on right now.

@briancoutinho (Contributor Author) commented:

@australopitek Still waiting to hear back from the maintainers at Meta. Though Kineto is part of the PyTorch Foundation now, the codebase is not GitHub-first from what I know; it needs to be imported by Meta to land. Hoping the maintainers at Meta get back on this.

@australopitek If you describe your use case for this interface too, that might help make a stronger case. Thanks!

@australopitek commented:

> @australopitek Still waiting to hear back from the maintainers at Meta. Though Kineto is part of the PyTorch Foundation now, the codebase is not GitHub-first from what I know; it needs to be imported by Meta to land. Hoping the maintainers at Meta get back on this.
>
> @australopitek If you describe your use case for this interface too, that might help make a stronger case. Thanks!

@briancoutinho In response to my comment https://github.com/pytorch/kineto/pull/1148/files#r2454233130, we want to move XpuptiProfiler to an external Intel repo so that it is independent of the Kineto repository. Currently, XpuptiProfiler is responsible for various traces (runtime, kernel, memset, etc.).
To maintain full functionality, the plugin profiler should know about the requested activities; otherwise it will always report all kinds of traces, even if only runtime traces are requested.

@sraikund16 (Contributor) commented:

Taking a look; for some reason I was not getting notifications on this PR.


meta-codesync bot commented Oct 27, 2025

@sraikund16 has imported this pull request. If you are a Meta employee, you can view this in D85576028.

facebook-github-bot (Contributor) commented:

@briancoutinho has updated the pull request. You must reimport the pull request before landing.

briancoutinho force-pushed the bcoutinho/dynamic_plugin branch from d257ac2 to 676135f on October 29, 2025 at 23:31
facebook-github-bot (Contributor) commented:

@briancoutinho has updated the pull request. You must reimport the pull request before landing.

@briancoutinho (Contributor Author) commented Oct 29, 2025:

@australopitek Updated the start trace API to pass the enabled activity types, which should help configure plugins 👍 Also see DynamicPluginTests.cpp, which verifies this works.

Updates

  1. Updated the start trace API to pass the enabled activity types.
  2. Exposed addDeviceInfo() in the plugin's trace builder API, so plugins can create new device rows.
  3. Rebased onto main.

@briancoutinho (Contributor Author) commented:

@sraikund16 Thanks for checking in 👍
You will probably need to add new TARGET files.
Let me know if I can help with the format changes too, or please feel free to push changes to the PR :)

Comment on lines +242 to +247
  // [in] Enabled activity types.
  KinetoPlugin_ProfileEventType *pEnabledActivityTypes;

  // [in] Max length of pEnabledActivityTypes.
  size_t enabledActivityTypesMaxLen;


I wonder if KinetoPlugin_ProfilerCreate_Params isn't a better place for these params. Both the Cupti profiler and the Xpupti profiler enable activities in the configure phase, which happens before start: https://github.com/pytorch/kineto/blob/main/libkineto/src/CuptiActivityProfiler.cpp#L1118
