Thanks to visit codestin.com
Credit goes to github.com

Skip to content

[StaticDataLayout][PGO] Add profile format for static data layout, and the classes to operate on the profiles. #138170

New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Open
wants to merge 6 commits into
base: main
Choose a base branch
from

Conversation

mingmingl-llvm
Copy link
Contributor

Context: For https://discourse.llvm.org/t/rfc-profile-guided-static-data-partitioning/83744#p-336543-background-3, we propose to profile memory loads and stores via hardware events, symbolize the addresses of binary static data sections and feed the profile back into compiler for data partitioning.

This change adds the profile format for static data layout, and the classes to operate on it.

The profile and its format

  1. Conceptually, a piece of data (call it a symbol) is represented by its symbol name or its content hash. The former applies to majority of data whose mangled name remains relatively stable over binary releases, and the latter applies to string literals (with name patterns like .str.<N>[.llvm.<hash>].
    • The symbols with samples are hot data. The number of hot symbols is small relative to all symbols. The profile tracks its sampled counts and locations. Sampled counts come from hardware events, and locations come from debug information in the profiled binary. The symbols without samples are cold data. The number of such cold symbols is large. The profile tracks its representation (the name or content hash).
    • Based on a preliminary study, debug information coverage for data symbols is partial and best-effort. In the LLVM IR, global variables with source code correspondence may or may not have debug information. Therefore the location information is optional in the profiles.
  2. The profile-and-compile cycle is similar to SamplePGO. Profiles are sampled from production binaries, and used in next binary releases. Known cold symbols and new hot symbols can both have zero sampled counts, so the profile records known cold symbols to tell the two for next compile.

In the profile's serialization format, strings are concatenated together and compressed. Individual records stores the index.

A separate PR will connect this class to InstrProfReader/Writer via MemProfReader/Writer.

@llvmbot llvmbot added PGO Profile Guided Optimizations llvm:adt labels May 1, 2025
@llvmbot
Copy link
Member

llvmbot commented May 1, 2025

@llvm/pr-subscribers-llvm-adt

Author: Mingming Liu (mingmingl-llvm)

Changes

Context: For https://discourse.llvm.org/t/rfc-profile-guided-static-data-partitioning/83744#p-336543-background-3, we propose to profile memory loads and stores via hardware events, symbolize the addresses of binary static data sections and feed the profile back into compiler for data partitioning.

This change adds the profile format for static data layout, and the classes to operate on it.

The profile and its format

  1. Conceptually, a piece of data (call it a symbol) is represented by its symbol name or its content hash. The former applies to majority of data whose mangled name remains relatively stable over binary releases, and the latter applies to string literals (with name patterns like .str.&lt;N&gt;[.llvm.&lt;hash&gt;].
    • The symbols with samples are hot data. The number of hot symbols is small relative to all symbols. The profile tracks its sampled counts and locations. Sampled counts come from hardware events, and locations come from debug information in the profiled binary. The symbols without samples are cold data. The number of such cold symbols is large. The profile tracks its representation (the name or content hash).
    • Based on a preliminary study, debug information coverage for data symbols is partial and best-effort. In the LLVM IR, global variables with source code correspondence may or may not have debug information. Therefore the location information is optional in the profiles.
  2. The profile-and-compile cycle is similar to SamplePGO. Profiles are sampled from production binaries, and used in next binary releases. Known cold symbols and new hot symbols can both have zero sampled counts, so the profile records known cold symbols to tell the two for next compile.

In the profile's serialization format, strings are concatenated together and compressed. Individual records stores the index.

A separate PR will connect this class to InstrProfReader/Writer via MemProfReader/Writer.


Patch is 27.53 KiB, truncated to 20.00 KiB below, full version: https://github.com/llvm/llvm-project/pull/138170.diff

7 Files Affected:

  • (modified) llvm/include/llvm/ADT/MapVector.h (+2)
  • (added) llvm/include/llvm/ProfileData/DataAccessProf.h (+156)
  • (modified) llvm/include/llvm/ProfileData/InstrProf.h (+11-5)
  • (modified) llvm/lib/ProfileData/CMakeLists.txt (+1)
  • (added) llvm/lib/ProfileData/DataAccessProf.cpp (+246)
  • (modified) llvm/lib/ProfileData/InstrProf.cpp (+2-6)
  • (modified) llvm/unittests/ProfileData/MemProfTest.cpp (+161)
diff --git a/llvm/include/llvm/ADT/MapVector.h b/llvm/include/llvm/ADT/MapVector.h
index c11617a81c97d..fe0d106795c34 100644
--- a/llvm/include/llvm/ADT/MapVector.h
+++ b/llvm/include/llvm/ADT/MapVector.h
@@ -57,6 +57,8 @@ class MapVector {
     return std::move(Vector);
   }
 
+  ArrayRef<value_type> getArrayRef() const { return Vector; }
+
   size_type size() const { return Vector.size(); }
 
   /// Grow the MapVector so that it can contain at least \p NumEntries items
diff --git a/llvm/include/llvm/ProfileData/DataAccessProf.h b/llvm/include/llvm/ProfileData/DataAccessProf.h
new file mode 100644
index 0000000000000..2cce4945fddd5
--- /dev/null
+++ b/llvm/include/llvm/ProfileData/DataAccessProf.h
@@ -0,0 +1,156 @@
+//===- DataAccessProf.h - Data access profile format support ---------*- C++
+//-*-===//
+//
+// Part of the LLVM Project, under the Apache License v2.0 with LLVM Exceptions.
+// See https://llvm.org/LICENSE.txt for license information.
+// SPDX-License-Identifier: Apache-2.0 WITH LLVM-exception
+//
+//===----------------------------------------------------------------------===//
+//
+// This file contains support to construct and use data access profiles.
+//
+// For the original RFC of this pass please see
+// https://discourse.llvm.org/t/rfc-profile-guided-static-data-partitioning/83744
+//
+//===----------------------------------------------------------------------===//
+
+#ifndef LLVM_PROFILEDATA_DATAACCESSPROF_H_
+#define LLVM_PROFILEDATA_DATAACCESSPROF_H_
+
+#include "llvm/ADT/DenseMap.h"
+#include "llvm/ADT/DenseMapInfoVariant.h"
+#include "llvm/ADT/MapVector.h"
+#include "llvm/ADT/STLExtras.h"
+#include "llvm/ADT/SetVector.h"
+#include "llvm/ADT/SmallVector.h"
+#include "llvm/ADT/StringRef.h"
+#include "llvm/ProfileData/InstrProf.h"
+#include "llvm/Support/Allocator.h"
+#include "llvm/Support/Error.h"
+#include "llvm/Support/StringSaver.h"
+
+#include <cstdint>
+#include <variant>
+
+namespace llvm {
+
+namespace data_access_prof {
+// The location of data in the source code.
+struct DataLocation {
+  // The filename where the data is located.
+  StringRef FileName;
+  // The line number in the source code.
+  uint32_t Line;
+};
+
+// The data access profiles for a symbol.
+struct DataAccessProfRecord {
+  // Represents a data symbol. The semantic comes in two forms: a symbol index
+  // for symbol name if `IsStringLiteral` is false, or the hash of a string
+  // content if `IsStringLiteral` is true. Required.
+  uint64_t SymbolID;
+
+  // The access count of symbol. Required.
+  uint64_t AccessCount;
+
+  // True iff this is a record for string literal (symbols with name pattern
+  // `.str.*` in the symbol table). Required.
+  bool IsStringLiteral;
+
+  // The locations of data in the source code. Optional.
+  llvm::SmallVector<DataLocation> Locations;
+};
+
+/// Encapsulates the data access profile data and the methods to operate on it.
+/// This class provides profile look-up, serialization and deserialization.
+class DataAccessProfData {
+public:
+  // SymbolID is either a string representing symbol name, or a uint64_t
+  // representing the content hash of a string literal.
+  using SymbolID = std::variant<StringRef, uint64_t>;
+  using StringToIndexMap = llvm::MapVector<StringRef, uint64_t>;
+
+  DataAccessProfData() : saver(Allocator) {}
+
+  /// Serialize profile data to the output stream.
+  /// Storage layout:
+  /// - Serialized strings.
+  /// - The encoded hashes.
+  /// - Records.
+  Error serialize(ProfOStream &OS) const;
+
+  /// Deserialize this class from the given buffer.
+  Error deserialize(const unsigned char *&Ptr);
+
+  /// Returns a pointer of profile record for \p SymbolID, or nullptr if there
+  /// isn't a record. Internally, this function will canonicalize the symbol
+  /// name before the lookup.
+  const DataAccessProfRecord *getProfileRecord(const SymbolID SymID) const;
+
+  /// Returns true if \p SymID is seen in profiled binaries and cold.
+  bool isKnownColdSymbol(const SymbolID SymID) const;
+
+  /// Methods to add symbolized data access profile. Returns error if duplicated
+  /// symbol names or content hashes are seen. The user of this class should
+  /// aggregate counters that corresponds to the same symbol name or with the
+  /// same string literal hash before calling 'add*' methods.
+  Error addSymbolizedDataAccessProfile(SymbolID SymbolID, uint64_t AccessCount);
+  Error addSymbolizedDataAccessProfile(
+      SymbolID SymbolID, uint64_t AccessCount,
+      const llvm::SmallVector<DataLocation> &Locations);
+  Error addKnownSymbolWithoutSamples(SymbolID SymbolID);
+
+  /// Returns a iterable StringRef for strings in the order they are added.
+  auto getStrings() const {
+    ArrayRef<std::pair<StringRef, uint64_t>> RefSymbolNames(
+        StrToIndexMap.begin(), StrToIndexMap.end());
+    return llvm::make_first_range(RefSymbolNames);
+  }
+
+  /// Returns array reference for various internal data structures.
+  inline ArrayRef<
+      std::pair<std::variant<StringRef, uint64_t>, DataAccessProfRecord>>
+  getRecords() const {
+    return Records.getArrayRef();
+  }
+  inline ArrayRef<StringRef> getKnownColdSymbols() const {
+    return KnownColdSymbols.getArrayRef();
+  }
+  inline ArrayRef<uint64_t> getKnownColdHashes() const {
+    return KnownColdHashes.getArrayRef();
+  }
+
+private:
+  /// Serialize the symbol strings into the output stream.
+  Error serializeStrings(ProfOStream &OS) const;
+
+  /// Deserialize the symbol strings from \p Ptr and increment \p Ptr to the
+  /// start of the next payload.
+  Error deserializeStrings(const unsigned char *&Ptr,
+                           const uint64_t NumSampledSymbols,
+                           uint64_t NumColdKnownSymbols);
+
+  /// Decode the records and increment \p Ptr to the start of the next payload.
+  Error deserializeRecords(const unsigned char *&Ptr);
+
+  /// A helper function to compute a storage index for \p SymbolID.
+  uint64_t getEncodedIndex(const SymbolID SymbolID) const;
+
+  // `Records` stores the records and `SymbolToRecordIndex` maps a symbol ID to
+  // its record index.
+  MapVector<SymbolID, DataAccessProfRecord> Records;
+
+  // Use MapVector to keep input order of strings for serialization and
+  // deserialization.
+  StringToIndexMap StrToIndexMap;
+  llvm::SetVector<uint64_t> KnownColdHashes;
+  llvm::SetVector<StringRef> KnownColdSymbols;
+  // Keeps owned copies of the input strings.
+  llvm::BumpPtrAllocator Allocator;
+  llvm::UniqueStringSaver saver;
+};
+
+} // namespace data_access_prof
+} // namespace llvm
+
+#endif // LLVM_PROFILEDATA_DATAACCESSPROF_H_
diff --git a/llvm/include/llvm/ProfileData/InstrProf.h b/llvm/include/llvm/ProfileData/InstrProf.h
index 2d011c89f27cb..8a6be22bdb1a4 100644
--- a/llvm/include/llvm/ProfileData/InstrProf.h
+++ b/llvm/include/llvm/ProfileData/InstrProf.h
@@ -357,6 +357,12 @@ void createPGONameMetadata(GlobalObject &GO, StringRef PGOName);
 /// the duplicated profile variables for Comdat functions.
 bool needsComdatForCounter(const GlobalObject &GV, const Module &M);
 
+/// \c NameStrings is a string composed of one of more possibly encoded
+/// sub-strings. The substrings are separated by 0 or more zero bytes. This
+/// method decodes the string and calls `NameCallback` for each substring.
+Error readAndDecodeStrings(StringRef NameStrings,
+                           std::function<Error(StringRef)> NameCallback);
+
 /// An enum describing the attributes of an instrumented profile.
 enum class InstrProfKind {
   Unknown = 0x0,
@@ -493,6 +499,11 @@ class InstrProfSymtab {
 public:
   using AddrHashMap = std::vector<std::pair<uint64_t, uint64_t>>;
 
+  // Returns the canonial name of the given PGOName. In a canonical name, all
+  // suffixes that begins with "." except ".__uniq." are stripped.
+  // FIXME: Unify this with `FunctionSamples::getCanonicalFnName`.
+  static StringRef getCanonicalName(StringRef PGOName);
+
 private:
   using AddrIntervalMap =
       IntervalMap<uint64_t, uint64_t, 4, IntervalMapHalfOpenInfo<uint64_t>>;
@@ -528,11 +539,6 @@ class InstrProfSymtab {
 
   static StringRef getExternalSymbol() { return "** External Symbol **"; }
 
-  // Returns the canonial name of the given PGOName. In a canonical name, all
-  // suffixes that begins with "." except ".__uniq." are stripped.
-  // FIXME: Unify this with `FunctionSamples::getCanonicalFnName`.
-  static StringRef getCanonicalName(StringRef PGOName);
-
   // Add the function into the symbol table, by creating the following
   // map entries:
   // name-set = {PGOFuncName} union {getCanonicalName(PGOFuncName)}
diff --git a/llvm/lib/ProfileData/CMakeLists.txt b/llvm/lib/ProfileData/CMakeLists.txt
index eb7c2a3c1a28a..67a69d7761b2c 100644
--- a/llvm/lib/ProfileData/CMakeLists.txt
+++ b/llvm/lib/ProfileData/CMakeLists.txt
@@ -1,4 +1,5 @@
 add_llvm_component_library(LLVMProfileData
+  DataAccessProf.cpp
   GCOV.cpp
   IndexedMemProfData.cpp
   InstrProf.cpp
diff --git a/llvm/lib/ProfileData/DataAccessProf.cpp b/llvm/lib/ProfileData/DataAccessProf.cpp
new file mode 100644
index 0000000000000..cf538d6a1b28e
--- /dev/null
+++ b/llvm/lib/ProfileData/DataAccessProf.cpp
@@ -0,0 +1,246 @@
+#include "llvm/ProfileData/DataAccessProf.h"
+#include "llvm/ADT/DenseMapInfoVariant.h"
+#include "llvm/ADT/STLExtras.h"
+#include "llvm/ProfileData/InstrProf.h"
+#include "llvm/Support/Compression.h"
+#include "llvm/Support/Endian.h"
+#include "llvm/Support/Errc.h"
+#include "llvm/Support/Error.h"
+#include "llvm/Support/StringSaver.h"
+#include "llvm/Support/raw_ostream.h"
+#include <sys/types.h>
+
+namespace llvm {
+namespace data_access_prof {
+
+// If `Map` has an entry keyed by `Str`, returns the entry iterator. Otherwise,
+// creates an owned copy of `Str`, adds a map entry for it and returns the
+// iterator.
+static MapVector<StringRef, uint64_t>::iterator
+saveStringToMap(MapVector<StringRef, uint64_t> &Map,
+                llvm::UniqueStringSaver &saver, StringRef Str) {
+  auto [Iter, Inserted] = Map.try_emplace(saver.save(Str), Map.size());
+  return Iter;
+}
+
+const DataAccessProfRecord *
+DataAccessProfData::getProfileRecord(const SymbolID SymbolID) const {
+  auto Key = SymbolID;
+  if (std::holds_alternative<StringRef>(SymbolID))
+    Key = InstrProfSymtab::getCanonicalName(std::get<StringRef>(SymbolID));
+
+  auto It = Records.find(Key);
+  if (It != Records.end())
+    return &It->second;
+
+  return nullptr;
+}
+
+bool DataAccessProfData::isKnownColdSymbol(const SymbolID SymID) const {
+  if (std::holds_alternative<uint64_t>(SymID))
+    return KnownColdHashes.count(std::get<uint64_t>(SymID));
+  return KnownColdSymbols.count(std::get<StringRef>(SymID));
+}
+
+Error DataAccessProfData::addSymbolizedDataAccessProfile(SymbolID Symbol,
+                                                         uint64_t AccessCount) {
+  uint64_t RecordID = -1;
+  bool IsStringLiteral = false;
+  SymbolID Key;
+  if (std::holds_alternative<uint64_t>(Symbol)) {
+    RecordID = std::get<uint64_t>(Symbol);
+    Key = RecordID;
+    IsStringLiteral = true;
+  } else {
+    StringRef SymbolName = std::get<StringRef>(Symbol);
+    if (SymbolName.empty())
+      return make_error<StringError>("Empty symbol name",
+                                     llvm::errc::invalid_argument);
+
+    StringRef CanonicalName = InstrProfSymtab::getCanonicalName(SymbolName);
+    Key = CanonicalName;
+    RecordID = saveStringToMap(StrToIndexMap, saver, CanonicalName)->second;
+    IsStringLiteral = false;
+  }
+
+  auto [Iter, Inserted] = Records.try_emplace(
+      Key, DataAccessProfRecord{RecordID, AccessCount, IsStringLiteral});
+  if (!Inserted)
+    return make_error<StringError>("Duplicate symbol or string literal added. "
+                                   "User of DataAccessProfData should "
+                                   "aggregate count for the same symbol. ",
+                                   llvm::errc::invalid_argument);
+
+  return Error::success();
+}
+
+Error DataAccessProfData::addSymbolizedDataAccessProfile(
+    SymbolID SymbolID, uint64_t AccessCount,
+    const llvm::SmallVector<DataLocation> &Locations) {
+  if (Error E = addSymbolizedDataAccessProfile(SymbolID, AccessCount))
+    return E;
+
+  auto &Record = Records.back().second;
+  for (const auto &Location : Locations)
+    Record.Locations.push_back(
+        {saveStringToMap(StrToIndexMap, saver, Location.FileName)->first,
+         Location.Line});
+
+  return Error::success();
+}
+
+Error DataAccessProfData::addKnownSymbolWithoutSamples(SymbolID SymbolID) {
+  if (std::holds_alternative<uint64_t>(SymbolID)) {
+    KnownColdHashes.insert(std::get<uint64_t>(SymbolID));
+    return Error::success();
+  }
+  StringRef SymbolName = std::get<StringRef>(SymbolID);
+  if (SymbolName.empty())
+    return make_error<StringError>("Empty symbol name",
+                                   llvm::errc::invalid_argument);
+  StringRef CanonicalSymName = InstrProfSymtab::getCanonicalName(SymbolName);
+  KnownColdSymbols.insert(CanonicalSymName);
+  return Error::success();
+}
+
+Error DataAccessProfData::deserialize(const unsigned char *&Ptr) {
+  uint64_t NumSampledSymbols =
+      support::endian::readNext<uint64_t, llvm::endianness::little>(Ptr);
+  uint64_t NumColdKnownSymbols =
+      support::endian::readNext<uint64_t, llvm::endianness::little>(Ptr);
+  if (Error E = deserializeStrings(Ptr, NumSampledSymbols, NumColdKnownSymbols))
+    return E;
+
+  uint64_t Num =
+      support::endian::readNext<uint64_t, llvm::endianness::little>(Ptr);
+  for (uint64_t I = 0; I < Num; ++I)
+    KnownColdHashes.insert(
+        support::endian::readNext<uint64_t, llvm::endianness::little>(Ptr));
+
+  return deserializeRecords(Ptr);
+}
+
+Error DataAccessProfData::serializeStrings(ProfOStream &OS) const {
+  OS.write(StrToIndexMap.size());
+  OS.write(KnownColdSymbols.size());
+
+  std::vector<std::string> Strs;
+  Strs.reserve(StrToIndexMap.size() + KnownColdSymbols.size());
+  for (const auto &Str : StrToIndexMap)
+    Strs.push_back(Str.first.str());
+  for (const auto &Str : KnownColdSymbols)
+    Strs.push_back(Str.str());
+
+  std::string CompressedStrings;
+  if (!Strs.empty())
+    if (Error E = collectGlobalObjectNameStrings(
+            Strs, compression::zlib::isAvailable(), CompressedStrings))
+      return E;
+  const uint64_t CompressedStringLen = CompressedStrings.length();
+  // Record the length of compressed string.
+  OS.write(CompressedStringLen);
+  // Write the chars in compressed strings.
+  for (auto &c : CompressedStrings)
+    OS.writeByte(static_cast<uint8_t>(c));
+  // Pad up to a multiple of 8.
+  // InstrProfReader could read bytes according to 'CompressedStringLen'.
+  const uint64_t PaddedLength = alignTo(CompressedStringLen, 8);
+  for (uint64_t K = CompressedStringLen; K < PaddedLength; K++)
+    OS.writeByte(0);
+  return Error::success();
+}
+
+uint64_t DataAccessProfData::getEncodedIndex(const SymbolID SymbolID) const {
+  if (std::holds_alternative<uint64_t>(SymbolID))
+    return std::get<uint64_t>(SymbolID);
+
+  return StrToIndexMap.find(std::get<StringRef>(SymbolID))->second;
+}
+
+Error DataAccessProfData::serialize(ProfOStream &OS) const {
+  if (Error E = serializeStrings(OS))
+    return E;
+  OS.write(KnownColdHashes.size());
+  for (const auto &Hash : KnownColdHashes)
+    OS.write(Hash);
+  OS.write((uint64_t)(Records.size()));
+  for (const auto &[Key, Rec] : Records) {
+    OS.write(getEncodedIndex(Rec.SymbolID));
+    OS.writeByte(Rec.IsStringLiteral);
+    OS.write(Rec.AccessCount);
+    OS.write(Rec.Locations.size());
+    for (const auto &Loc : Rec.Locations) {
+      OS.write(getEncodedIndex(Loc.FileName));
+      OS.write32(Loc.Line);
+    }
+  }
+  return Error::success();
+}
+
+Error DataAccessProfData::deserializeStrings(const unsigned char *&Ptr,
+                                             uint64_t NumSampledSymbols,
+                                             uint64_t NumColdKnownSymbols) {
+  uint64_t Len =
+      support::endian::readNext<uint64_t, llvm::endianness::little>(Ptr);
+
+  // With M=NumSampledSymbols and N=NumColdKnownSymbols, the first M strings are
+  // symbols with samples, and next N strings are known cold symbols.
+  uint64_t StringCnt = 0;
+  std::function<Error(StringRef)> addName = [&](StringRef Name) {
+    if (StringCnt < NumSampledSymbols)
+      saveStringToMap(StrToIndexMap, saver, Name);
+    else
+      KnownColdSymbols.insert(saver.save(Name));
+    ++StringCnt;
+    return Error::success();
+  };
+  if (Error E =
+          readAndDecodeStrings(StringRef((const char *)Ptr, Len), addName))
+    return E;
+
+  Ptr += alignTo(Len, 8);
+  return Error::success();
+}
+
+Error DataAccessProfData::deserializeRecords(const unsigned char *&Ptr) {
+  SmallVector<StringRef> Strings = llvm::to_vector(getStrings());
+
+  uint64_t NumRecords =
+      support::endian::readNext<uint64_t, llvm::endianness::little>(Ptr);
+
+  for (uint64_t I = 0; I < NumRecords; ++I) {
+    uint64_t ID =
+        support::endian::readNext<uint64_t, llvm::endianness::little>(Ptr);
+
+    bool IsStringLiteral =
+        support::endian::readNext<uint8_t, llvm::endianness::little>(Ptr);
+
+    uint64_t AccessCount =
+        support::endian::readNext<uint64_t, llvm::endianness::little>(Ptr);
+
+    SymbolID SymbolID;
+    if (IsStringLiteral)
+      SymbolID = ID;
+    else
+      SymbolID = Strings[ID];
+    if (Error E = addSymbolizedDataAccessProfile(SymbolID, AccessCount))
+      return E;
+
+    auto &Record = Records.back().second;
+
+    uint64_t NumLocations =
+        support::endian::readNext<uint64_t, llvm::endianness::little>(Ptr);
+
+    Record.Locations.reserve(NumLocations);
+    for (uint64_t J = 0; J < NumLocations; ++J) {
+      uint64_t FileNameIndex =
+          support::endian::readNext<uint64_t, llvm::endianness::little>(Ptr);
+      uint32_t Line =
+          support::endian::readNext<uint32_t, llvm::endianness::little>(Ptr);
+      Record.Locations.push_back({Strings[FileNameIndex], Line});
+    }
+  }
+  return Error::success();
+}
+} // namespace data_access_prof
+} // namespace llvm
diff --git a/llvm/lib/ProfileData/InstrProf.cpp b/llvm/lib/ProfileData/InstrProf.cpp
index 88621787c1dd9..254f941acde82 100644
--- a/llvm/lib/ProfileData/InstrProf.cpp
+++ b/llvm/lib/ProfileData/InstrProf.cpp
@@ -573,12 +573,8 @@ Error InstrProfSymtab::addVTableWithName(GlobalVariable &VTable,
   return Error::success();
 }
 
-/// \c NameStrings is a string composed of one of more possibly encoded
-/// sub-strings. The substrings are separated by 0 or more zero bytes. This
-/// method decodes the string and calls `NameCallback` for each substring.
-static Error
-readAndDecodeStrings(StringRef NameStrings,
-                     std::function<Error(StringRef)> NameCallback) {
+Error readAndDecodeStrings(StringRef NameStrings,
+                           std::function<Error(StringRef)> NameCallback) {
   const uint8_t *P = NameStrings.bytes_begin();
   const uint8_t *EndP = NameStrings.bytes_end();
   while (P < EndP) {
diff --git a/llvm/unittests/ProfileData/MemProfTest.cpp b/llvm/unittests/ProfileData/MemProfTest.cpp
index 3e430aa4eae58..f6362448a5734 100644
--- a/llvm/unittests/ProfileData/MemProfTest.cpp
+++ b/llvm/unittests/ProfileData/MemProfTest.cpp
@@ -10,14 +10,17 @@
 #include "llvm/ADT/DenseMap.h"
 #include "llvm/ADT/MapVector.h"
 #include "llvm/ADT/STLForwardCompat.h"
+#include "llvm/ADT/SmallVector.h"
 #include "llvm/DebugInfo/DIContext.h"
 #include "llvm/DebugInfo/Symbolize/SymbolizableModule.h"
 #include "llvm/IR/Value.h"
 #include "llvm/Object/ObjectFile.h"
+#include "llvm/ProfileData/DataAccessProf.h"
 #include "llvm/ProfileData/MemProfData.inc"
 #include "llvm/ProfileData/MemProfReader.h"
 #include "llvm/ProfileData/MemProfYAML.h"
 #include "llvm/Support/raw_ostream.h"
+#include "gmock/gmock-more-matchers.h"
 #include "gmock/gmock.h"
 #include "gtest/gtest.h"
 
@@ -36,6 +39,8 @@ using ::llvm::StringRef;
 using ::llvm::object::SectionedAddress;
 using ::llvm::symbolize::SymbolizableModule;
 using ::testing::ElementsAre;
+using ::testing::ElementsAreArray;
+using ::testing::HasSubstr;
 using ::testing::IsEmpty;
 using ::testing::Pair;
 using ::testing::Return;
@@ -747,6 +752,162 @@ TEST(MemProf, YAMLParser) {
                                ElementsAre(0x3000)))));
 }
 
+static std::string ErrorToString(Error E) {
+  std::string ErrMsg;
...
[truncated]

@llvmbot
Copy link
Member

llvmbot commented May 1, 2025

@llvm/pr-subscribers-pgo

Author: Mingming Liu (mingmingl-llvm)

Changes

Context: For https://discourse.llvm.org/t/rfc-profile-guided-static-data-partitioning/83744#p-336543-background-3, we propose to profile memory loads and stores via hardware events, symbolize the addresses of binary static data sections and feed the profile back into compiler for data partitioning.

This change adds the profile format for static data layout, and the classes to operate on it.

The profile and its format

  1. Conceptually, a piece of data (call it a symbol) is represented by its symbol name or its content hash. The former applies to majority of data whose mangled name remains relatively stable over binary releases, and the latter applies to string literals (with name patterns like .str.&lt;N&gt;[.llvm.&lt;hash&gt;].
    • The symbols with samples are hot data. The number of hot symbols is small relative to all symbols. The profile tracks its sampled counts and locations. Sampled counts come from hardware events, and locations come from debug information in the profiled binary. The symbols without samples are cold data. The number of such cold symbols is large. The profile tracks its representation (the name or content hash).
    • Based on a preliminary study, debug information coverage for data symbols is partial and best-effort. In the LLVM IR, global variables with source code correspondence may or may not have debug information. Therefore the location information is optional in the profiles.
  2. The profile-and-compile cycle is similar to SamplePGO. Profiles are sampled from production binaries, and used in next binary releases. Known cold symbols and new hot symbols can both have zero sampled counts, so the profile records known cold symbols to tell the two for next compile.

In the profile's serialization format, strings are concatenated together and compressed. Individual records stores the index.

A separate PR will connect this class to InstrProfReader/Writer via MemProfReader/Writer.


Patch is 27.53 KiB, truncated to 20.00 KiB below, full version: https://github.com/llvm/llvm-project/pull/138170.diff

7 Files Affected:

  • (modified) llvm/include/llvm/ADT/MapVector.h (+2)
  • (added) llvm/include/llvm/ProfileData/DataAccessProf.h (+156)
  • (modified) llvm/include/llvm/ProfileData/InstrProf.h (+11-5)
  • (modified) llvm/lib/ProfileData/CMakeLists.txt (+1)
  • (added) llvm/lib/ProfileData/DataAccessProf.cpp (+246)
  • (modified) llvm/lib/ProfileData/InstrProf.cpp (+2-6)
  • (modified) llvm/unittests/ProfileData/MemProfTest.cpp (+161)
diff --git a/llvm/include/llvm/ADT/MapVector.h b/llvm/include/llvm/ADT/MapVector.h
index c11617a81c97d..fe0d106795c34 100644
--- a/llvm/include/llvm/ADT/MapVector.h
+++ b/llvm/include/llvm/ADT/MapVector.h
@@ -57,6 +57,8 @@ class MapVector {
     return std::move(Vector);
   }
 
+  ArrayRef<value_type> getArrayRef() const { return Vector; }
+
   size_type size() const { return Vector.size(); }
 
   /// Grow the MapVector so that it can contain at least \p NumEntries items
diff --git a/llvm/include/llvm/ProfileData/DataAccessProf.h b/llvm/include/llvm/ProfileData/DataAccessProf.h
new file mode 100644
index 0000000000000..2cce4945fddd5
--- /dev/null
+++ b/llvm/include/llvm/ProfileData/DataAccessProf.h
@@ -0,0 +1,156 @@
+//===- DataAccessProf.h - Data access profile format support ---------*- C++
+//-*-===//
+//
+// Part of the LLVM Project, under the Apache License v2.0 with LLVM Exceptions.
+// See https://llvm.org/LICENSE.txt for license information.
+// SPDX-License-Identifier: Apache-2.0 WITH LLVM-exception
+//
+//===----------------------------------------------------------------------===//
+//
+// This file contains support to construct and use data access profiles.
+//
+// For the original RFC of this pass please see
+// https://discourse.llvm.org/t/rfc-profile-guided-static-data-partitioning/83744
+//
+//===----------------------------------------------------------------------===//
+
+#ifndef LLVM_PROFILEDATA_DATAACCESSPROF_H_
+#define LLVM_PROFILEDATA_DATAACCESSPROF_H_
+
+#include "llvm/ADT/DenseMap.h"
+#include "llvm/ADT/DenseMapInfoVariant.h"
+#include "llvm/ADT/MapVector.h"
+#include "llvm/ADT/STLExtras.h"
+#include "llvm/ADT/SetVector.h"
+#include "llvm/ADT/SmallVector.h"
+#include "llvm/ADT/StringRef.h"
+#include "llvm/ProfileData/InstrProf.h"
+#include "llvm/Support/Allocator.h"
+#include "llvm/Support/Error.h"
+#include "llvm/Support/StringSaver.h"
+
+#include <cstdint>
+#include <variant>
+
+namespace llvm {
+
+namespace data_access_prof {
+// The location of data in the source code.
+struct DataLocation {
+  // The filename where the data is located.
+  StringRef FileName;
+  // The line number in the source code.
+  uint32_t Line;
+};
+
+// The data access profiles for a symbol.
+struct DataAccessProfRecord {
+  // Represents a data symbol. The semantic comes in two forms: a symbol index
+  // for symbol name if `IsStringLiteral` is false, or the hash of a string
+  // content if `IsStringLiteral` is true. Required.
+  uint64_t SymbolID;
+
+  // The access count of symbol. Required.
+  uint64_t AccessCount;
+
+  // True iff this is a record for string literal (symbols with name pattern
+  // `.str.*` in the symbol table). Required.
+  bool IsStringLiteral;
+
+  // The locations of data in the source code. Optional.
+  llvm::SmallVector<DataLocation> Locations;
+};
+
+/// Encapsulates the data access profile data and the methods to operate on it.
+/// This class provides profile look-up, serialization and deserialization.
+class DataAccessProfData {
+public:
+  // SymbolID is either a string representing symbol name, or a uint64_t
+  // representing the content hash of a string literal.
+  using SymbolID = std::variant<StringRef, uint64_t>;
+  using StringToIndexMap = llvm::MapVector<StringRef, uint64_t>;
+
+  DataAccessProfData() : saver(Allocator) {}
+
+  /// Serialize profile data to the output stream.
+  /// Storage layout:
+  /// - Serialized strings.
+  /// - The encoded hashes.
+  /// - Records.
+  Error serialize(ProfOStream &OS) const;
+
+  /// Deserialize this class from the given buffer.
+  Error deserialize(const unsigned char *&Ptr);
+
+  /// Returns a pointer of profile record for \p SymbolID, or nullptr if there
+  /// isn't a record. Internally, this function will canonicalize the symbol
+  /// name before the lookup.
+  const DataAccessProfRecord *getProfileRecord(const SymbolID SymID) const;
+
+  /// Returns true if \p SymID is seen in profiled binaries and cold.
+  bool isKnownColdSymbol(const SymbolID SymID) const;
+
+  /// Methods to add symbolized data access profile. Returns error if duplicated
+  /// symbol names or content hashes are seen. The user of this class should
+  /// aggregate counters that corresponds to the same symbol name or with the
+  /// same string literal hash before calling 'add*' methods.
+  Error addSymbolizedDataAccessProfile(SymbolID SymbolID, uint64_t AccessCount);
+  Error addSymbolizedDataAccessProfile(
+      SymbolID SymbolID, uint64_t AccessCount,
+      const llvm::SmallVector<DataLocation> &Locations);
+  Error addKnownSymbolWithoutSamples(SymbolID SymbolID);
+
+  /// Returns a iterable StringRef for strings in the order they are added.
+  auto getStrings() const {
+    ArrayRef<std::pair<StringRef, uint64_t>> RefSymbolNames(
+        StrToIndexMap.begin(), StrToIndexMap.end());
+    return llvm::make_first_range(RefSymbolNames);
+  }
+
+  /// Returns array reference for various internal data structures.
+  inline ArrayRef<
+      std::pair<std::variant<StringRef, uint64_t>, DataAccessProfRecord>>
+  getRecords() const {
+    return Records.getArrayRef();
+  }
+  inline ArrayRef<StringRef> getKnownColdSymbols() const {
+    return KnownColdSymbols.getArrayRef();
+  }
+  inline ArrayRef<uint64_t> getKnownColdHashes() const {
+    return KnownColdHashes.getArrayRef();
+  }
+
+private:
+  /// Serialize the symbol strings into the output stream.
+  Error serializeStrings(ProfOStream &OS) const;
+
+  /// Deserialize the symbol strings from \p Ptr and increment \p Ptr to the
+  /// start of the next payload.
+  Error deserializeStrings(const unsigned char *&Ptr,
+                           const uint64_t NumSampledSymbols,
+                           uint64_t NumColdKnownSymbols);
+
+  /// Decode the records and increment \p Ptr to the start of the next payload.
+  Error deserializeRecords(const unsigned char *&Ptr);
+
+  /// A helper function to compute a storage index for \p SymbolID.
+  uint64_t getEncodedIndex(const SymbolID SymbolID) const;
+
+  // `Records` stores the records and `SymbolToRecordIndex` maps a symbol ID to
+  // its record index.
+  MapVector<SymbolID, DataAccessProfRecord> Records;
+
+  // Use MapVector to keep input order of strings for serialization and
+  // deserialization.
+  StringToIndexMap StrToIndexMap;
+  llvm::SetVector<uint64_t> KnownColdHashes;
+  llvm::SetVector<StringRef> KnownColdSymbols;
+  // Keeps owned copies of the input strings.
+  llvm::BumpPtrAllocator Allocator;
+  llvm::UniqueStringSaver saver;
+};
+
+} // namespace data_access_prof
+} // namespace llvm
+
+#endif // LLVM_PROFILEDATA_DATAACCESSPROF_H_
diff --git a/llvm/include/llvm/ProfileData/InstrProf.h b/llvm/include/llvm/ProfileData/InstrProf.h
index 2d011c89f27cb..8a6be22bdb1a4 100644
--- a/llvm/include/llvm/ProfileData/InstrProf.h
+++ b/llvm/include/llvm/ProfileData/InstrProf.h
@@ -357,6 +357,12 @@ void createPGONameMetadata(GlobalObject &GO, StringRef PGOName);
 /// the duplicated profile variables for Comdat functions.
 bool needsComdatForCounter(const GlobalObject &GV, const Module &M);
 
+/// \c NameStrings is a string composed of one of more possibly encoded
+/// sub-strings. The substrings are separated by 0 or more zero bytes. This
+/// method decodes the string and calls `NameCallback` for each substring.
+Error readAndDecodeStrings(StringRef NameStrings,
+                           std::function<Error(StringRef)> NameCallback);
+
 /// An enum describing the attributes of an instrumented profile.
 enum class InstrProfKind {
   Unknown = 0x0,
@@ -493,6 +499,11 @@ class InstrProfSymtab {
 public:
   using AddrHashMap = std::vector<std::pair<uint64_t, uint64_t>>;
 
+  // Returns the canonial name of the given PGOName. In a canonical name, all
+  // suffixes that begins with "." except ".__uniq." are stripped.
+  // FIXME: Unify this with `FunctionSamples::getCanonicalFnName`.
+  static StringRef getCanonicalName(StringRef PGOName);
+
 private:
   using AddrIntervalMap =
       IntervalMap<uint64_t, uint64_t, 4, IntervalMapHalfOpenInfo<uint64_t>>;
@@ -528,11 +539,6 @@ class InstrProfSymtab {
 
   static StringRef getExternalSymbol() { return "** External Symbol **"; }
 
-  // Returns the canonial name of the given PGOName. In a canonical name, all
-  // suffixes that begins with "." except ".__uniq." are stripped.
-  // FIXME: Unify this with `FunctionSamples::getCanonicalFnName`.
-  static StringRef getCanonicalName(StringRef PGOName);
-
   // Add the function into the symbol table, by creating the following
   // map entries:
   // name-set = {PGOFuncName} union {getCanonicalName(PGOFuncName)}
diff --git a/llvm/lib/ProfileData/CMakeLists.txt b/llvm/lib/ProfileData/CMakeLists.txt
index eb7c2a3c1a28a..67a69d7761b2c 100644
--- a/llvm/lib/ProfileData/CMakeLists.txt
+++ b/llvm/lib/ProfileData/CMakeLists.txt
@@ -1,4 +1,5 @@
 add_llvm_component_library(LLVMProfileData
+  DataAccessProf.cpp
   GCOV.cpp
   IndexedMemProfData.cpp
   InstrProf.cpp
diff --git a/llvm/lib/ProfileData/DataAccessProf.cpp b/llvm/lib/ProfileData/DataAccessProf.cpp
new file mode 100644
index 0000000000000..cf538d6a1b28e
--- /dev/null
+++ b/llvm/lib/ProfileData/DataAccessProf.cpp
@@ -0,0 +1,246 @@
+#include "llvm/ProfileData/DataAccessProf.h"
+#include "llvm/ADT/DenseMapInfoVariant.h"
+#include "llvm/ADT/STLExtras.h"
+#include "llvm/ProfileData/InstrProf.h"
+#include "llvm/Support/Compression.h"
+#include "llvm/Support/Endian.h"
+#include "llvm/Support/Errc.h"
+#include "llvm/Support/Error.h"
+#include "llvm/Support/StringSaver.h"
+#include "llvm/Support/raw_ostream.h"
+#include <sys/types.h>
+
+namespace llvm {
+namespace data_access_prof {
+
+// If `Map` has an entry keyed by `Str`, returns the entry iterator. Otherwise,
+// creates an owned copy of `Str`, adds a map entry for it and returns the
+// iterator.
+static MapVector<StringRef, uint64_t>::iterator
+saveStringToMap(MapVector<StringRef, uint64_t> &Map,
+                llvm::UniqueStringSaver &saver, StringRef Str) {
+  auto [Iter, Inserted] = Map.try_emplace(saver.save(Str), Map.size());
+  return Iter;
+}
+
+const DataAccessProfRecord *
+DataAccessProfData::getProfileRecord(const SymbolID SymbolID) const {
+  auto Key = SymbolID;
+  if (std::holds_alternative<StringRef>(SymbolID))
+    Key = InstrProfSymtab::getCanonicalName(std::get<StringRef>(SymbolID));
+
+  auto It = Records.find(Key);
+  if (It != Records.end())
+    return &It->second;
+
+  return nullptr;
+}
+
+bool DataAccessProfData::isKnownColdSymbol(const SymbolID SymID) const {
+  if (std::holds_alternative<uint64_t>(SymID))
+    return KnownColdHashes.count(std::get<uint64_t>(SymID));
+  return KnownColdSymbols.count(std::get<StringRef>(SymID));
+}
+
+Error DataAccessProfData::addSymbolizedDataAccessProfile(SymbolID Symbol,
+                                                         uint64_t AccessCount) {
+  uint64_t RecordID = -1;
+  bool IsStringLiteral = false;
+  SymbolID Key;
+  if (std::holds_alternative<uint64_t>(Symbol)) {
+    RecordID = std::get<uint64_t>(Symbol);
+    Key = RecordID;
+    IsStringLiteral = true;
+  } else {
+    StringRef SymbolName = std::get<StringRef>(Symbol);
+    if (SymbolName.empty())
+      return make_error<StringError>("Empty symbol name",
+                                     llvm::errc::invalid_argument);
+
+    StringRef CanonicalName = InstrProfSymtab::getCanonicalName(SymbolName);
+    Key = CanonicalName;
+    RecordID = saveStringToMap(StrToIndexMap, saver, CanonicalName)->second;
+    IsStringLiteral = false;
+  }
+
+  auto [Iter, Inserted] = Records.try_emplace(
+      Key, DataAccessProfRecord{RecordID, AccessCount, IsStringLiteral});
+  if (!Inserted)
+    return make_error<StringError>("Duplicate symbol or string literal added. "
+                                   "User of DataAccessProfData should "
+                                   "aggregate count for the same symbol. ",
+                                   llvm::errc::invalid_argument);
+
+  return Error::success();
+}
+
+Error DataAccessProfData::addSymbolizedDataAccessProfile(
+    SymbolID SymbolID, uint64_t AccessCount,
+    const llvm::SmallVector<DataLocation> &Locations) {
+  if (Error E = addSymbolizedDataAccessProfile(SymbolID, AccessCount))
+    return E;
+
+  auto &Record = Records.back().second;
+  for (const auto &Location : Locations)
+    Record.Locations.push_back(
+        {saveStringToMap(StrToIndexMap, saver, Location.FileName)->first,
+         Location.Line});
+
+  return Error::success();
+}
+
+Error DataAccessProfData::addKnownSymbolWithoutSamples(SymbolID SymbolID) {
+  if (std::holds_alternative<uint64_t>(SymbolID)) {
+    KnownColdHashes.insert(std::get<uint64_t>(SymbolID));
+    return Error::success();
+  }
+  StringRef SymbolName = std::get<StringRef>(SymbolID);
+  if (SymbolName.empty())
+    return make_error<StringError>("Empty symbol name",
+                                   llvm::errc::invalid_argument);
+  StringRef CanonicalSymName = InstrProfSymtab::getCanonicalName(SymbolName);
+  KnownColdSymbols.insert(CanonicalSymName);
+  return Error::success();
+}
+
+Error DataAccessProfData::deserialize(const unsigned char *&Ptr) {
+  uint64_t NumSampledSymbols =
+      support::endian::readNext<uint64_t, llvm::endianness::little>(Ptr);
+  uint64_t NumColdKnownSymbols =
+      support::endian::readNext<uint64_t, llvm::endianness::little>(Ptr);
+  if (Error E = deserializeStrings(Ptr, NumSampledSymbols, NumColdKnownSymbols))
+    return E;
+
+  uint64_t Num =
+      support::endian::readNext<uint64_t, llvm::endianness::little>(Ptr);
+  for (uint64_t I = 0; I < Num; ++I)
+    KnownColdHashes.insert(
+        support::endian::readNext<uint64_t, llvm::endianness::little>(Ptr));
+
+  return deserializeRecords(Ptr);
+}
+
+Error DataAccessProfData::serializeStrings(ProfOStream &OS) const {
+  OS.write(StrToIndexMap.size());
+  OS.write(KnownColdSymbols.size());
+
+  std::vector<std::string> Strs;
+  Strs.reserve(StrToIndexMap.size() + KnownColdSymbols.size());
+  for (const auto &Str : StrToIndexMap)
+    Strs.push_back(Str.first.str());
+  for (const auto &Str : KnownColdSymbols)
+    Strs.push_back(Str.str());
+
+  std::string CompressedStrings;
+  if (!Strs.empty())
+    if (Error E = collectGlobalObjectNameStrings(
+            Strs, compression::zlib::isAvailable(), CompressedStrings))
+      return E;
+  const uint64_t CompressedStringLen = CompressedStrings.length();
+  // Record the length of compressed string.
+  OS.write(CompressedStringLen);
+  // Write the chars in compressed strings.
+  for (auto &c : CompressedStrings)
+    OS.writeByte(static_cast<uint8_t>(c));
+  // Pad up to a multiple of 8.
+  // InstrProfReader could read bytes according to 'CompressedStringLen'.
+  const uint64_t PaddedLength = alignTo(CompressedStringLen, 8);
+  for (uint64_t K = CompressedStringLen; K < PaddedLength; K++)
+    OS.writeByte(0);
+  return Error::success();
+}
+
+uint64_t DataAccessProfData::getEncodedIndex(const SymbolID SymbolID) const {
+  if (std::holds_alternative<uint64_t>(SymbolID))
+    return std::get<uint64_t>(SymbolID);
+
+  return StrToIndexMap.find(std::get<StringRef>(SymbolID))->second;
+}
+
+Error DataAccessProfData::serialize(ProfOStream &OS) const {
+  if (Error E = serializeStrings(OS))
+    return E;
+  OS.write(KnownColdHashes.size());
+  for (const auto &Hash : KnownColdHashes)
+    OS.write(Hash);
+  OS.write((uint64_t)(Records.size()));
+  for (const auto &[Key, Rec] : Records) {
+    OS.write(getEncodedIndex(Rec.SymbolID));
+    OS.writeByte(Rec.IsStringLiteral);
+    OS.write(Rec.AccessCount);
+    OS.write(Rec.Locations.size());
+    for (const auto &Loc : Rec.Locations) {
+      OS.write(getEncodedIndex(Loc.FileName));
+      OS.write32(Loc.Line);
+    }
+  }
+  return Error::success();
+}
+
+Error DataAccessProfData::deserializeStrings(const unsigned char *&Ptr,
+                                             uint64_t NumSampledSymbols,
+                                             uint64_t NumColdKnownSymbols) {
+  uint64_t Len =
+      support::endian::readNext<uint64_t, llvm::endianness::little>(Ptr);
+
+  // With M=NumSampledSymbols and N=NumColdKnownSymbols, the first M strings are
+  // symbols with samples, and next N strings are known cold symbols.
+  uint64_t StringCnt = 0;
+  std::function<Error(StringRef)> addName = [&](StringRef Name) {
+    if (StringCnt < NumSampledSymbols)
+      saveStringToMap(StrToIndexMap, saver, Name);
+    else
+      KnownColdSymbols.insert(saver.save(Name));
+    ++StringCnt;
+    return Error::success();
+  };
+  if (Error E =
+          readAndDecodeStrings(StringRef((const char *)Ptr, Len), addName))
+    return E;
+
+  Ptr += alignTo(Len, 8);
+  return Error::success();
+}
+
+Error DataAccessProfData::deserializeRecords(const unsigned char *&Ptr) {
+  SmallVector<StringRef> Strings = llvm::to_vector(getStrings());
+
+  uint64_t NumRecords =
+      support::endian::readNext<uint64_t, llvm::endianness::little>(Ptr);
+
+  for (uint64_t I = 0; I < NumRecords; ++I) {
+    uint64_t ID =
+        support::endian::readNext<uint64_t, llvm::endianness::little>(Ptr);
+
+    bool IsStringLiteral =
+        support::endian::readNext<uint8_t, llvm::endianness::little>(Ptr);
+
+    uint64_t AccessCount =
+        support::endian::readNext<uint64_t, llvm::endianness::little>(Ptr);
+
+    SymbolID SymbolID;
+    if (IsStringLiteral)
+      SymbolID = ID;
+    else
+      SymbolID = Strings[ID];
+    if (Error E = addSymbolizedDataAccessProfile(SymbolID, AccessCount))
+      return E;
+
+    auto &Record = Records.back().second;
+
+    uint64_t NumLocations =
+        support::endian::readNext<uint64_t, llvm::endianness::little>(Ptr);
+
+    Record.Locations.reserve(NumLocations);
+    for (uint64_t J = 0; J < NumLocations; ++J) {
+      uint64_t FileNameIndex =
+          support::endian::readNext<uint64_t, llvm::endianness::little>(Ptr);
+      uint32_t Line =
+          support::endian::readNext<uint32_t, llvm::endianness::little>(Ptr);
+      Record.Locations.push_back({Strings[FileNameIndex], Line});
+    }
+  }
+  return Error::success();
+}
+} // namespace data_access_prof
+} // namespace llvm
diff --git a/llvm/lib/ProfileData/InstrProf.cpp b/llvm/lib/ProfileData/InstrProf.cpp
index 88621787c1dd9..254f941acde82 100644
--- a/llvm/lib/ProfileData/InstrProf.cpp
+++ b/llvm/lib/ProfileData/InstrProf.cpp
@@ -573,12 +573,8 @@ Error InstrProfSymtab::addVTableWithName(GlobalVariable &VTable,
   return Error::success();
 }
 
-/// \c NameStrings is a string composed of one of more possibly encoded
-/// sub-strings. The substrings are separated by 0 or more zero bytes. This
-/// method decodes the string and calls `NameCallback` for each substring.
-static Error
-readAndDecodeStrings(StringRef NameStrings,
-                     std::function<Error(StringRef)> NameCallback) {
+Error readAndDecodeStrings(StringRef NameStrings,
+                           std::function<Error(StringRef)> NameCallback) {
   const uint8_t *P = NameStrings.bytes_begin();
   const uint8_t *EndP = NameStrings.bytes_end();
   while (P < EndP) {
diff --git a/llvm/unittests/ProfileData/MemProfTest.cpp b/llvm/unittests/ProfileData/MemProfTest.cpp
index 3e430aa4eae58..f6362448a5734 100644
--- a/llvm/unittests/ProfileData/MemProfTest.cpp
+++ b/llvm/unittests/ProfileData/MemProfTest.cpp
@@ -10,14 +10,17 @@
 #include "llvm/ADT/DenseMap.h"
 #include "llvm/ADT/MapVector.h"
 #include "llvm/ADT/STLForwardCompat.h"
+#include "llvm/ADT/SmallVector.h"
 #include "llvm/DebugInfo/DIContext.h"
 #include "llvm/DebugInfo/Symbolize/SymbolizableModule.h"
 #include "llvm/IR/Value.h"
 #include "llvm/Object/ObjectFile.h"
+#include "llvm/ProfileData/DataAccessProf.h"
 #include "llvm/ProfileData/MemProfData.inc"
 #include "llvm/ProfileData/MemProfReader.h"
 #include "llvm/ProfileData/MemProfYAML.h"
 #include "llvm/Support/raw_ostream.h"
+#include "gmock/gmock-more-matchers.h"
 #include "gmock/gmock.h"
 #include "gtest/gtest.h"
 
@@ -36,6 +39,8 @@ using ::llvm::StringRef;
 using ::llvm::object::SectionedAddress;
 using ::llvm::symbolize::SymbolizableModule;
 using ::testing::ElementsAre;
+using ::testing::ElementsAreArray;
+using ::testing::HasSubstr;
 using ::testing::IsEmpty;
 using ::testing::Pair;
 using ::testing::Return;
@@ -747,6 +752,162 @@ TEST(MemProf, YAMLParser) {
                                ElementsAre(0x3000)))));
 }
 
+static std::string ErrorToString(Error E) {
+  std::string ErrMsg;
...
[truncated]

/// symbol names or content hashes are seen. The user of this class should
/// aggregate counters that corresponds to the same symbol name or with the
/// same string literal hash before calling 'add*' methods.
Error addSymbolizedDataAccessProfile(SymbolID SymbolID, uint64_t AccessCount);
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

shorten the name --> addDataAccessProfile

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Given the semantics of the method, may be change the name to 'setDataAccessProfile'

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

done.

/// aggregate counters that corresponds to the same symbol name or with the
/// same string literal hash before calling 'add*' methods.
Error addSymbolizedDataAccessProfile(SymbolID SymbolID, uint64_t AccessCount);
Error addSymbolizedDataAccessProfile(
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

document location parameter.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

done.

Error addSymbolizedDataAccessProfile(
SymbolID SymbolID, uint64_t AccessCount,
const llvm::SmallVector<DataLocation> &Locations);
Error addKnownSymbolWithoutSamples(SymbolID SymbolID);
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

is it the same as setting the symbol access count to zero?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

yes, and the idea is similar to SamplePGO's function symbol list.

public:
// SymbolID is either a string representing symbol name, or a uint64_t
// representing the content hash of a string literal.
using SymbolID = std::variant<StringRef, uint64_t>;
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

For the use of StringRef, is there a risk of use-after-free error (if the caller passes in a Ref to short lived string)?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

good question. This class keeps owned copy of strings (inside saver member of type llvm::UniqueStringSaver) and references the owned copies as opposed to caller's copy. FWIW, I moved saver to the beginning of class member list to make sure it's destructed after other class members that references the strings.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Probably add a comment here and in the below type that the StringRef is owned by this class's saver object.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

done.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Also fixed a potential use-after-free bug in L69 with std::tie(Key, RecordID) = *saveStringToMap(StrToIndexMap, Saver, *CanonicalName); in the updated change. Basically Key should use the owned copy.

bool IsStringLiteral;

// The locations of data in the source code. Optional.
llvm::SmallVector<DataLocation> Locations;
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

IIUC, we can have many instances of DataAccessProfRecord. Do we know the most common number of entries here? If we don't, we might want to start out with llvm::SmallVector<DataLocation, 0>, which still saves space relative to std::vector<...>. I'm afraid that the inlined storage goes wasted (and multiplied by the number of DataAccessProfRecord entries) if the real usage is way off the number of inlined entries.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Do we know the most common number of entries here? If we don't, we might want to start out with llvm::SmallVector<DataLocation, 0>

I'd expect that many records have small single-digit number of locations, and definitely within the range of unsigned, which is type for SmallVector<T, 0> size and capacity per https://llvm.org/docs/ProgrammersManual.html#llvm-adt-smallvector-h.

Done.

Comment on lines 41 to 42
return KnownColdHashes.count(std::get<uint64_t>(SymID));
return KnownColdSymbols.count(std::get<StringRef>(SymID));
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

nit: Could we use contains instead of count here?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

done.

Comment on lines 66 to 67
auto [Iter, Inserted] = Records.try_emplace(
Key, DataAccessProfRecord{RecordID, AccessCount, IsStringLiteral});
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

nit: With try_emplace, you should be able to drop the value type and expect the constructor arguments to be perfectly forwarded like:

  auto [Iter, Inserted] = Records.try_emplace(Key, RecordID, AccessCount, IsStringLiteral);

By the way, when I try your PR, I get:

llvm/lib/ProfileData/DataAccessProf.cpp:67:71: error: missing field 'Locations' initializer [-Werror,-Wmissing-field-initializers]
      Key, DataAccessProfRecord{RecordID, AccessCount, IsStringLiteral});
                                                                      ^

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The updated PR adds a constructor for DataAccessProfRecord and does Records.try_emplace(Key, RecordID, AccessCount, IsStringLiteral); as suggested. I also re-configured my cmake to set -DCMAKE_C_COMPILER and -DCMAKE_CXX_COMPILER to clang in the environment. Now I no longer see missing field 'Locations' initializer.

In the future when C++20 is available for LLVM developer, we can possibly just use designated initializer here.

return deserializeRecords(Ptr);
}

Error DataAccessProfData::serializeStrings(ProfOStream &OS) const {
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

nit: Could we name this something like serializeSymbolNames to indicate that we are serializing symbol names? The same applies to deserializeString.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

sounds good to make the names more targeted. Renamed to {serialize,deserialize}SymbolsAndFilenames.

Copy link
Contributor Author

@mingmingl-llvm mingmingl-llvm left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

thanks for reviews! PTAL.

bool IsStringLiteral;

// The locations of data in the source code. Optional.
llvm::SmallVector<DataLocation> Locations;
Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Do we know the most common number of entries here? If we don't, we might want to start out with llvm::SmallVector<DataLocation, 0>

I'd expect that many records have small single-digit number of locations, and definitely within the range of unsigned, which is type for SmallVector<T, 0> size and capacity per https://llvm.org/docs/ProgrammersManual.html#llvm-adt-smallvector-h.

Done.

/// symbol names or content hashes are seen. The user of this class should
/// aggregate counters that corresponds to the same symbol name or with the
/// same string literal hash before calling 'add*' methods.
Error addSymbolizedDataAccessProfile(SymbolID SymbolID, uint64_t AccessCount);
Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

done.

Comment on lines 41 to 42
return KnownColdHashes.count(std::get<uint64_t>(SymID));
return KnownColdSymbols.count(std::get<StringRef>(SymID));
Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

done.

Comment on lines 66 to 67
auto [Iter, Inserted] = Records.try_emplace(
Key, DataAccessProfRecord{RecordID, AccessCount, IsStringLiteral});
Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The updated PR adds a constructor for DataAccessProfRecord and does Records.try_emplace(Key, RecordID, AccessCount, IsStringLiteral); as suggested. I also re-configured my cmake to set -DCMAKE_C_COMPILER and -DCMAKE_CXX_COMPILER to clang in the environment. Now I no longer see missing field 'Locations' initializer.

In the future when C++20 is available for LLVM developer, we can possibly just use designated initializer here.

return deserializeRecords(Ptr);
}

Error DataAccessProfData::serializeStrings(ProfOStream &OS) const {
Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

sounds good to make the names more targeted. Renamed to {serialize,deserialize}SymbolsAndFilenames.

/// aggregate counters that corresponds to the same symbol name or with the
/// same string literal hash before calling 'add*' methods.
Error addSymbolizedDataAccessProfile(SymbolID SymbolID, uint64_t AccessCount);
Error addSymbolizedDataAccessProfile(
Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

done.

Error addSymbolizedDataAccessProfile(
SymbolID SymbolID, uint64_t AccessCount,
const llvm::SmallVector<DataLocation> &Locations);
Error addKnownSymbolWithoutSamples(SymbolID SymbolID);
Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

yes, and the idea is similar to SamplePGO's function symbol list.

public:
// SymbolID is either a string representing symbol name, or a uint64_t
// representing the content hash of a string literal.
using SymbolID = std::variant<StringRef, uint64_t>;
Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

good question. This class keeps owned copy of strings (inside saver member of type llvm::UniqueStringSaver) and references the owned copies as opposed to caller's copy. FWIW, I moved saver to the beginning of class member list to make sure it's destructed after other class members that references the strings.

llvm::BumpPtrAllocator Allocator;
llvm::UniqueStringSaver Saver;

// `Records` stores the records and `SymbolToRecordIndex` maps a symbol ID to
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I don't see SymbolToRecordIndex defined anywhere.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

whoops, SymbolToRecordIndex refers to deleted code. Removed it.

Originally I had

DenseMap<SymbolID, uint64_t> SymbolToRecordIndex;
SmallVector<DataAccessProfRecord> Records;

as opposed to
MapVector<SymbolID, DataAccessProfRecord>, in order to get a ArrayRef of the records.

Later I realized that it's simpler to use the latter and add MapVector::getArrayRef. Strictly speaking MapVector::getArrayRef returns ArrayRef<key, value> rather than ArrayRef<value>, but the implementation is overall simpler to have MapVector.


/// Returns a iterable StringRef for strings in the order they are added.
auto getStrings() const {
ArrayRef<std::pair<StringRef, uint64_t>> RefSymbolNames(
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Use SymbolID instead of the std::pair for consistency and simplicity

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The updated change used new MapVector::getArrayRef as suggested, and this line is simplified away.

// Represents a data symbol. The semantic comes in two forms: a symbol index
// for symbol name if `IsStringLiteral` is false, or the hash of a string
// content if `IsStringLiteral` is true. Required.
uint64_t SymbolID;
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

It's a little confusing that SymbolID is a different thing (a type) in the following class. Suggest making these different.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Indeed. I renamed the type std::variant<StringRef, uint64_t> to SymbolHandle.

@@ -57,6 +57,8 @@ class MapVector {
return std::move(Vector);
}

ArrayRef<value_type> getArrayRef() const { return Vector; }
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Probably put this into a separate prep patch with a test in llvm/unittests/ADT/MapVectorTest.cpp.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

done in #138726

public:
// SymbolID is either a string representing symbol name, or a uint64_t
// representing the content hash of a string literal.
using SymbolID = std::variant<StringRef, uint64_t>;
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Probably add a comment here and in the below type that the StringRef is owned by this class's saver object.

uint64_t Len =
support::endian::readNext<uint64_t, llvm::endianness::little>(Ptr);

// With M=NumSampledSymbols and N=NumColdKnownSymbols, the first M strings are
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Simpler to directly say "The first NumSampledSymbols strings are symbols with samples, and the next NumColdKnownSymbols are known cold symbols." ?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

done.

: SymbolID(SymbolID), AccessCount(AccessCount),
IsStringLiteral(IsStringLiteral) {}

// Represents a data symbol. The semantic comes in two forms: a symbol index
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

When would the different forms be used?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The semantic of this field depends on the IsStringLiteral field below. For a string literal, IsStringLiteral is true and SymbolID has the semantic of 'hash'; otherwise, IsStringLiteral is false and SymbolID represent an index.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I understood that, but my question was more about why in practice some would be string literals and some would be hashes. Might be useful to note this in a comment.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I understood that, but my question was more about why in practice some would be string literals and some would be hashes.

This makes sense. Added comment at L55 to explain why two forms are used.

/// This class provides profile look-up, serialization and deserialization.
class DataAccessProfData {
public:
// SymbolID is either a string representing symbol name, or a uint64_t
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Similar to question for above class - what determines which type is used?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The type is determined by the symbol itself. This variant is uint64_t for string literals (which has the unstable name pattern like .str.<N>[.llvm.<hash>]), and the canonicalized symbol name for those symbols with stable mangled names relative to source code. Updated the comment. PTAL.


// Test the various scenarios when DataAccessProfData should return error on
// invalid input.
TEST(MemProf, DataAccessProfileError) {
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Maybe the DataAccessProf tests should be moved to a dedicated DataAccessProfTest.cpp file to mirror where they are defined?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

done.

Copy link
Contributor

@kazutakahirata kazutakahirata left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

LGTM modulo some nits. Thanks for updating the patch!


Error DataAccessProfData::setDataAccessProfile(
SymbolID SymbolID, uint64_t AccessCount,
const llvm::SmallVector<DataLocation> &Locations) {
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Prefer ArrayRef<DataLocation> to const llvm::SmallVector<DataLocation> & per:

https://llvm.org/docs/ProgrammersManual.html#llvm-adt-arrayref-h

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

done.

Copy link

github-actions bot commented May 6, 2025

✅ With the latest revision this PR passed the C/C++ code formatter.

mingmingl-llvm added a commit that referenced this pull request May 6, 2025
uint64_t AccessCount =
support::endian::readNext<uint64_t, llvm::endianness::little>(Ptr);

SymbolID SymbolID;
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Use a different name for variable from the type name.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

oh I didn't notice this comment in my latest reply. Teresa suggested similarly in #138170 (comment). So this is done!

Copy link
Contributor Author

@mingmingl-llvm mingmingl-llvm left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

thanks for reviews! PTAL.

@@ -57,6 +57,8 @@ class MapVector {
return std::move(Vector);
}

ArrayRef<value_type> getArrayRef() const { return Vector; }
Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

done in #138726


// Test the various scenarios when DataAccessProfData should return error on
// invalid input.
TEST(MemProf, DataAccessProfileError) {
Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

done.

uint64_t Len =
support::endian::readNext<uint64_t, llvm::endianness::little>(Ptr);

// With M=NumSampledSymbols and N=NumColdKnownSymbols, the first M strings are
Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

done.

if (std::holds_alternative<uint64_t>(SymbolID))
return std::get<uint64_t>(SymbolID);

return StrToIndexMap.find(std::get<StringRef>(SymbolID))->second;
Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

done.


Error DataAccessProfData::setDataAccessProfile(
SymbolID SymbolID, uint64_t AccessCount,
const llvm::SmallVector<DataLocation> &Locations) {
Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

done.


/// Returns a iterable StringRef for strings in the order they are added.
auto getStrings() const {
ArrayRef<std::pair<StringRef, uint64_t>> RefSymbolNames(
Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The updated change used new MapVector::getArrayRef as suggested, and this line is simplified away.

llvm::BumpPtrAllocator Allocator;
llvm::UniqueStringSaver Saver;

// `Records` stores the records and `SymbolToRecordIndex` maps a symbol ID to
Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

whoops, SymbolToRecordIndex refers to deleted code. Removed it.

Originally I had

DenseMap<SymbolID, uint64_t> SymbolToRecordIndex;
SmallVector<DataAccessProfRecord> Records;

as opposed to
MapVector<SymbolID, DataAccessProfRecord>, in order to get a ArrayRef of the records.

Later I realized that it's simpler to use the latter and add MapVector::getArrayRef. Strictly speaking MapVector::getArrayRef returns ArrayRef<key, value> rather than ArrayRef<value>, but the implementation is overall simpler to have MapVector.

@@ -357,6 +357,12 @@ void createPGONameMetadata(GlobalObject &GO, StringRef PGOName);
/// the duplicated profile variables for Comdat functions.
bool needsComdatForCounter(const GlobalObject &GV, const Module &M);

/// \c NameStrings is a string composed of one of more possibly encoded
/// sub-strings. The substrings are separated by 0 or more zero bytes. This
Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Done by mentioning the substrings are separated by \01, which is returned by getInstrProfNameSeparator.

return Error::success();
}
StringRef SymbolName = std::get<StringRef>(SymbolID);
if (SymbolName.empty())
Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

added a static helper function getCanonicalName in this cpp file.

public:
// SymbolID is either a string representing symbol name, or a uint64_t
// representing the content hash of a string literal.
using SymbolID = std::variant<StringRef, uint64_t>;
Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Also fixed a potential use-after-free bug in L69 with std::tie(Key, RecordID) = *saveStringToMap(StrToIndexMap, Saver, *CanonicalName); in the updated change. Basically Key should use the owned copy.

Copy link
Contributor

@teresajohnson teresajohnson left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

lgtm with a couple of small comments below.

if (std::holds_alternative<StringRef>(SymbolID)) {
StringRef Name = std::get<StringRef>(SymbolID);
assert(!Name.empty() && "Empty symbol name");
Key = InstrProfSymtab::getCanonicalName(Name);
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Call the new getCanonicalName helper defined above?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Done.

This getProfileRecord suppresses name canonicalization error rather than further propagating it, mainly to make the return value type simple (i.e., no need to handle Expected<Ptr> type) for the common no-error path.

Added assertion to make the code intention clearer.

: SymbolID(SymbolID), AccessCount(AccessCount),
IsStringLiteral(IsStringLiteral) {}

// Represents a data symbol. The semantic comes in two forms: a symbol index
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I understood that, but my question was more about why in practice some would be string literals and some would be hashes. Might be useful to note this in a comment.

llvm-sync bot pushed a commit to arm/arm-toolchain that referenced this pull request May 7, 2025
…'s underlying vector (#138726)

SetVector currently has a [similar
method](https://github.com/llvm/llvm-project/blob/c956ed06dc1c1b340d0c589c472c438b9220b36d/llvm/include/llvm/ADT/SetVector.h#L90),
and llvm/llvm-project#138170 has a use case to
get an ArrayRef of MapVector's underlying vector.
GeorgeARM pushed a commit to GeorgeARM/llvm-project that referenced this pull request May 7, 2025
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
llvm:adt PGO Profile Guided Optimizations
Projects
None yet
Development

Successfully merging this pull request may close these issues.

5 participants