Conversation

@jacobhinkle (Collaborator) commented Sep 23, 2025

This PR allows us to optionally compile in dynamic runtime support for nvMatmulHeuristics. Users will still need to pip install nvidia-matmul-heuristics and then set their LD_LIBRARY_PATH to the installed location (on my machine it is $HOME/.local/lib/python3.12/site-packages/nvidia/nvMatmulHeuristics/lib). If the user doesn't do this, we gracefully print a warning and fall back to the default CutlassParams, avoiding the dynamic-linker error we would hit if we had linked against libnvMatmulHeuristics.so at compile time.

I have not yet verified that this improves performance. This PR is so far just an exploration of the mechanics of calling nvmmh dynamically.

NOTE: CMakeLists.txt changes are not yet pushed...


Description

  • Integrate dynamic loading of nvMatmulHeuristics library

  • Initialize and validate nvMatmulHeuristics at runtime

  • Fetch kernel configurations using nvMatmulHeuristics

  • Fallback to default params if nvMMH is unavailable


Changes walkthrough 📝

Relevant files
Enhancement
cutlass.cpp
Add dynamic nvMatmulHeuristics integration in Cutlass scheduler

csrc/scheduler/cutlass.cpp

  • Added dynamic loading of libnvMatmulHeuristics.so using dlopen
  • Implemented initNVMMH() to load symbols and check version compatibility
  • Integrated nvMatmulHeuristics API calls to query kernel configurations
  • Set CutlassParams from heuristic results if available, else fall back
  • +171/-0

    PR Reviewer Guide 🔍

    Here are some key observations to aid the review process:

    🧪 No relevant tests
    ⚡ Recommended focus areas for review

    Possible Issue

    The PR uses TORCH_WARN_ONCE for version mismatch between compiled and runtime nvMatmulHeuristics, but continues execution despite requiring exactly matching versions. This may lead to undefined behavior if the versions are incompatible.

      TORCH_WARN_ONCE(
          "nvFuser was compiled against nvMatmulHeuristics version ",
          NVMMH_VERSION_MAJOR,
          ".",
          NVMMH_VERSION_MINOR,
          " but found nvMatmulHeuristics shared library version ",
          lib_major,
          ".",
          lib_minor,
          ". Exactly matching versions are required");
    }
    Performance Risk

    Hardcoded problem dimensions (M=N=K=8192) are used when querying nvMatmulHeuristics, which may result in suboptimal or invalid kernel configurations for actual runtime tensor sizes.

    constexpr uint32_t M = 8192, N = 8192, K = 8192, Batch = 1;
    nvmmhMatmulProblem_t problem{M, N, K, Batch, layout};
    Missing Cleanup

    If nvMatmulHeuristicsCreate succeeds but a subsequent call fails, the created handle is not destroyed before returning, potentially causing resource leaks.

    NVMMH_SAFE_CALL(nvMatmulHeuristicsCreate(&handle));

    Comment on lines +98 to +104
    #ifdef HAS_NVMMH_INCLUDE

    #include <nvMatmulHeuristics.h>

    static thread_local void* nvmmh_handle = nullptr;

    namespace nvmmh_func {
    @jacobhinkle (Collaborator, Author) commented:

    All this stuff probably belongs in another file nvmmh.{cpp,h}.
