[clang] Constant-evaluate format strings as last resort #135864

Open
wants to merge 7 commits into
base: main

apple-fcloutier
Contributor

I asked on the forums and people were generally supportive of the idea, so:

Clang's -Wformat checker can see through an inconsistent set of operations. When Clang's initial evaluation fails, we can fall back to the recently updated constant string evaluation infrastructure for a second chance at figuring out what the format string is intended to be. This enables analyzing format strings that were built at compile time with std::string and other constexpr-capable types in C++, as long as all pieces are also constexpr-visible, along with a number of other patterns.
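
As an illustration of the kind of pattern described above, here is a minimal sketch of a format string that only exists after compile-time evaluation. `FixedString`, `concat`, `kFormat`, and `greet` are hypothetical names made up for this example; they are not part of the patch or its tests.

```cpp
#include <cassert>
#include <cstdio>
#include <cstring>

// Hypothetical fixed-size holder for a string assembled at compile time.
template <unsigned Size>
struct FixedString {
  char data[Size];
};

// Hypothetical helper: joins two literals into one null-terminated array.
template <unsigned N, unsigned M>
constexpr FixedString<N + M - 1> concat(const char (&a)[N],
                                        const char (&b)[M]) {
  FixedString<N + M - 1> out{};
  for (unsigned i = 0; i + 1 < N; ++i)
    out.data[i] = a[i]; // copy a without its terminator
  for (unsigned i = 0; i < M; ++i)
    out.data[N - 1 + i] = b[i]; // copy b, terminator included
  return out;
}

// No single literal in the source spells out the whole format string.
constexpr auto kFormat = concat("hello %s", ", %d\n");

// The arguments match "%s" and "%d", so there is nothing to diagnose here.
int greet(const char *name, int count) {
  return std::printf(kFormat.data, name, count);
}

// Small self-check of the computed string.
void demo() {
  assert(std::strcmp(kFormat.data, "hello %s, %d\n") == 0);
  assert(greet("world", 42) == 16);
}
```

A checker that only follows string literals has nothing to inspect in `greet`; one that can evaluate `kFormat.data` could compare the specifiers against the argument types.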

As a side effect, it also enables tryEvaluateString on char arrays (rather than only char pointers).
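
To make the char-array case concrete, here is a toy model of the contract that the new `tryEvaluateString(ASTContext &, std::string &)` overload documents in the diff: fill the output until a null character is found, and report failure if the characters run out first. `readUntilNull` is a hypothetical stand-in operating on plain arrays, not Clang's implementation; the array contents mirror the C test added by the patch.

```cpp
#include <cassert>
#include <cstddef>
#include <string>

// Toy model, not Clang's implementation: append characters to `into` until a
// null character is found. Returns false when the array ends first.
template <std::size_t N>
bool readUntilNull(const char (&chars)[N], std::string &into) {
  for (char c : chars) {
    if (c == '\0')
      return true; // the terminator itself is not appended
    into.push_back(c);
  }
  return false; // ran off the end without seeing a terminator
}

// 55 * 2 + 5 is 115, the ASCII code for 's', so both arrays spell "%s\n".
const char kNotTerminated[] = {'%', 55 * 2 + 5, '\n'};
const char kTerminated[] = {'%', 55 * 2 + 5, '\n', 0};

void demo() {
  std::string partial, full;
  assert(!readUntilNull(kNotTerminated, partial));
  assert(partial == "%s\n"); // partial contents survive the failure
  assert(readUntilNull(kTerminated, full));
  assert(full == "%s\n");
}
```

In the patch, a failed evaluation that still produced characters is what distinguishes the "not null-terminated" result from the "not evaluated" one.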

Radar-ID: rdar://99940060

@llvmbot added the clang (Clang issues not falling into any other category) and clang:frontend (Language frontend issues, e.g. anything involving "Sema") labels on Apr 15, 2025
@llvmbot
Member

llvmbot commented Apr 15, 2025

@llvm/pr-subscribers-clang-codegen
@llvm/pr-subscribers-clang-analysis

@llvm/pr-subscribers-clang

Author: None (apple-fcloutier)

Changes

Full diff: https://github.com/llvm/llvm-project/pull/135864.diff

7 Files Affected:

  • (modified) clang/docs/ReleaseNotes.rst (+3)
  • (modified) clang/include/clang/AST/Expr.h (+8-1)
  • (modified) clang/include/clang/Basic/DiagnosticSemaKinds.td (+2)
  • (modified) clang/lib/AST/ExprConstant.cpp (+32-8)
  • (modified) clang/lib/Sema/SemaChecking.cpp (+94-50)
  • (modified) clang/test/Sema/format-strings.c (+14)
  • (modified) clang/test/SemaCXX/format-strings.cpp (+74)
diff --git a/clang/docs/ReleaseNotes.rst b/clang/docs/ReleaseNotes.rst
index 77bf3355af9da..05566d66a65d2 100644
--- a/clang/docs/ReleaseNotes.rst
+++ b/clang/docs/ReleaseNotes.rst
@@ -265,6 +265,9 @@ related warnings within the method body.
   ``format_matches`` accepts an example valid format string as its third
   argument. For more information, see the Clang attributes documentation.
 
+- Format string checking now supports the compile-time evaluation of format
+  strings as a fallback mechanism.
+
 - Introduced a new statement attribute ``[[clang::atomic]]`` that enables
   fine-grained control over atomic code generation on a per-statement basis.
   Supported options include ``[no_]remote_memory``,
diff --git a/clang/include/clang/AST/Expr.h b/clang/include/clang/AST/Expr.h
index 20f70863a05b3..78eda8bc3c43e 100644
--- a/clang/include/clang/AST/Expr.h
+++ b/clang/include/clang/AST/Expr.h
@@ -791,7 +791,14 @@ class Expr : public ValueStmt {
                                  const Expr *PtrExpression, ASTContext &Ctx,
                                  EvalResult &Status) const;
 
-  /// If the current Expr can be evaluated to a pointer to a null-terminated
+  /// Fill \c Into with the first characters that can be constant-evaluated
+  /// from this \c Expr . When encountering a null character, stop and return
+  /// \c true (the null is not returned in \c Into ). Return \c false if
+  /// evaluation runs off the end of the constant-evaluated string before it
+  /// encounters a null character.
+  bool tryEvaluateString(ASTContext &Ctx, std::string &Into) const;
+
+  /// If the current \c Expr can be evaluated to a pointer to a null-terminated
   /// constant string, return the constant string (without the terminating
   /// null).
   std::optional<std::string> tryEvaluateString(ASTContext &Ctx) const;
diff --git a/clang/include/clang/Basic/DiagnosticSemaKinds.td b/clang/include/clang/Basic/DiagnosticSemaKinds.td
index 3cb2731488fab..4139ff2737c76 100644
--- a/clang/include/clang/Basic/DiagnosticSemaKinds.td
+++ b/clang/include/clang/Basic/DiagnosticSemaKinds.td
@@ -10170,6 +10170,8 @@ def warn_format_bool_as_character : Warning<
   "using '%0' format specifier, but argument has boolean value">,
   InGroup<Format>;
 def note_format_string_defined : Note<"format string is defined here">;
+def note_format_string_evaluated_to : Note<
+  "format string was constant-evaluated">;
 def note_format_fix_specifier : Note<"did you mean to use '%0'?">;
 def note_printf_c_str: Note<"did you mean to call the %0 method?">;
 def note_format_security_fixit: Note<
diff --git a/clang/lib/AST/ExprConstant.cpp b/clang/lib/AST/ExprConstant.cpp
index 80ece3c4ed7e1..fec92edf49096 100644
--- a/clang/lib/AST/ExprConstant.cpp
+++ b/clang/lib/AST/ExprConstant.cpp
@@ -17945,15 +17945,36 @@ bool Expr::tryEvaluateObjectSize(uint64_t &Result, ASTContext &Ctx,
 
 static bool EvaluateBuiltinStrLen(const Expr *E, uint64_t &Result,
                                   EvalInfo &Info, std::string *StringResult) {
-  if (!E->getType()->hasPointerRepresentation() || !E->isPRValue())
+  QualType Ty = E->getType();
+  if (!E->isPRValue())
     return false;
 
   LValue String;
-
-  if (!EvaluatePointer(E, String, Info))
+  QualType CharTy;
+  if (Ty->canDecayToPointerType()) {
+    if (E->isGLValue()) {
+      if (!EvaluateLValue(E, String, Info))
+        return false;
+    } else {
+      APValue &Value = Info.CurrentCall->createTemporary(
+          E, Ty, ScopeKind::FullExpression, String);
+      if (!EvaluateInPlace(Value, Info, String, E))
+        return false;
+    }
+    // The result is a pointer to the first element of the array.
+    auto *AT = Info.Ctx.getAsArrayType(Ty);
+    CharTy = AT->getElementType();
+    if (auto *CAT = dyn_cast<ConstantArrayType>(AT))
+      String.addArray(Info, E, CAT);
+    else
+      String.addUnsizedArray(Info, E, CharTy);
+  } else if (Ty->hasPointerRepresentation()) {
+    if (!EvaluatePointer(E, String, Info))
+      return false;
+    CharTy = Ty->getPointeeType();
+  } else {
     return false;
-
-  QualType CharTy = E->getType()->getPointeeType();
+  }
 
   // Fast path: if it's a string literal, search the string value.
   if (const StringLiteral *S = dyn_cast_or_null<StringLiteral>(
@@ -17995,13 +18016,16 @@ static bool EvaluateBuiltinStrLen(const Expr *E, uint64_t &Result,
   }
 }
 
-std::optional<std::string> Expr::tryEvaluateString(ASTContext &Ctx) const {
+bool Expr::tryEvaluateString(ASTContext &Ctx, std::string &StringResult) const {
   Expr::EvalStatus Status;
   EvalInfo Info(Ctx, Status, EvalInfo::EM_ConstantFold);
   uint64_t Result;
-  std::string StringResult;
+  return EvaluateBuiltinStrLen(this, Result, Info, &StringResult);
+}
 
-  if (EvaluateBuiltinStrLen(this, Result, Info, &StringResult))
+std::optional<std::string> Expr::tryEvaluateString(ASTContext &Ctx) const {
+  std::string StringResult;
+  if (tryEvaluateString(Ctx, StringResult))
     return StringResult;
   return {};
 }
diff --git a/clang/lib/Sema/SemaChecking.cpp b/clang/lib/Sema/SemaChecking.cpp
index bffd0dd461d3d..017be929ca18e 100644
--- a/clang/lib/Sema/SemaChecking.cpp
+++ b/clang/lib/Sema/SemaChecking.cpp
@@ -98,6 +98,7 @@
 #include "llvm/Support/Locale.h"
 #include "llvm/Support/MathExtras.h"
 #include "llvm/Support/SaveAndRestore.h"
+#include "llvm/Support/SmallVectorMemoryBuffer.h"
 #include "llvm/Support/raw_ostream.h"
 #include "llvm/TargetParser/RISCVTargetParser.h"
 #include "llvm/TargetParser/Triple.h"
@@ -5935,8 +5936,14 @@ static void CheckFormatString(
     llvm::SmallBitVector &CheckedVarArgs, UncoveredArgHandler &UncoveredArg,
     bool IgnoreStringsWithoutSpecifiers);
 
-static const Expr *maybeConstEvalStringLiteral(ASTContext &Context,
-                                               const Expr *E);
+enum StringLiteralConstEvalResult {
+  SLCER_NotEvaluated,
+  SLCER_NotNullTerminated,
+  SLCER_Evaluated,
+};
+
+static StringLiteralConstEvalResult
+constEvalStringAsLiteral(Sema &S, const Expr *E, const StringLiteral *&SL);
 
 // Determine if an expression is a string literal or constant string.
 // If this function returns false on the arguments to a function expecting a
@@ -5968,14 +5975,9 @@ static StringLiteralCheckType checkFormatStringExpr(
 
   switch (E->getStmtClass()) {
   case Stmt::InitListExprClass:
-    // Handle expressions like {"foobar"}.
-    if (const clang::Expr *SLE = maybeConstEvalStringLiteral(S.Context, E)) {
-      return checkFormatStringExpr(
-          S, ReferenceFormatString, SLE, Args, APK, format_idx, firstDataArg,
-          Type, CallType, /*InFunctionCall*/ false, CheckedVarArgs,
-          UncoveredArg, Offset, IgnoreStringsWithoutSpecifiers);
-    }
-    return SLCT_NotALiteral;
+    // try to constant-evaluate the string
+    break;
+
   case Stmt::BinaryConditionalOperatorClass:
   case Stmt::ConditionalOperatorClass: {
     // The expression is a literal if both sub-expressions were, and it was
@@ -6066,10 +6068,9 @@ static StringLiteralCheckType checkFormatStringExpr(
             if (InitList->isStringLiteralInit())
               Init = InitList->getInit(0)->IgnoreParenImpCasts();
           }
-          return checkFormatStringExpr(
-              S, ReferenceFormatString, Init, Args, APK, format_idx,
-              firstDataArg, Type, CallType,
-              /*InFunctionCall*/ false, CheckedVarArgs, UncoveredArg, Offset);
+          InFunctionCall = false;
+          E = Init;
+          goto tryAgain;
         }
       }
 
@@ -6142,11 +6143,9 @@ static StringLiteralCheckType checkFormatStringExpr(
                 }
                 return SLCT_UncheckedLiteral;
               }
-              return checkFormatStringExpr(
-                  S, ReferenceFormatString, PVFormatMatches->getFormatString(),
-                  Args, APK, format_idx, firstDataArg, Type, CallType,
-                  /*InFunctionCall*/ false, CheckedVarArgs, UncoveredArg,
-                  Offset, IgnoreStringsWithoutSpecifiers);
+              E = PVFormatMatches->getFormatString();
+              InFunctionCall = false;
+              goto tryAgain;
             }
           }
 
@@ -6214,20 +6213,13 @@ static StringLiteralCheckType checkFormatStringExpr(
         unsigned BuiltinID = FD->getBuiltinID();
         if (BuiltinID == Builtin::BI__builtin___CFStringMakeConstantString ||
             BuiltinID == Builtin::BI__builtin___NSStringMakeConstantString) {
-          const Expr *Arg = CE->getArg(0);
-          return checkFormatStringExpr(
-              S, ReferenceFormatString, Arg, Args, APK, format_idx,
-              firstDataArg, Type, CallType, InFunctionCall, CheckedVarArgs,
-              UncoveredArg, Offset, IgnoreStringsWithoutSpecifiers);
+          E = CE->getArg(0);
+          goto tryAgain;
         }
       }
     }
-    if (const Expr *SLE = maybeConstEvalStringLiteral(S.Context, E))
-      return checkFormatStringExpr(
-          S, ReferenceFormatString, SLE, Args, APK, format_idx, firstDataArg,
-          Type, CallType, /*InFunctionCall*/ false, CheckedVarArgs,
-          UncoveredArg, Offset, IgnoreStringsWithoutSpecifiers);
-    return SLCT_NotALiteral;
+    // try to constant-evaluate the string
+    break;
   }
   case Stmt::ObjCMessageExprClass: {
     const auto *ME = cast<ObjCMessageExpr>(E);
@@ -6248,11 +6240,8 @@ static StringLiteralCheckType checkFormatStringExpr(
           IgnoreStringsWithoutSpecifiers = true;
         }
 
-        const Expr *Arg = ME->getArg(FA->getFormatIdx().getASTIndex());
-        return checkFormatStringExpr(
-            S, ReferenceFormatString, Arg, Args, APK, format_idx, firstDataArg,
-            Type, CallType, InFunctionCall, CheckedVarArgs, UncoveredArg,
-            Offset, IgnoreStringsWithoutSpecifiers);
+        E = ME->getArg(FA->getFormatIdx().getASTIndex());
+        goto tryAgain;
       }
     }
 
@@ -6314,7 +6303,8 @@ static StringLiteralCheckType checkFormatStringExpr(
       }
     }
 
-    return SLCT_NotALiteral;
+    // try to constant-evaluate the string
+    break;
   }
   case Stmt::UnaryOperatorClass: {
     const UnaryOperator *UnaOp = cast<UnaryOperator>(E);
@@ -6331,26 +6321,79 @@ static StringLiteralCheckType checkFormatStringExpr(
       }
     }
 
-    return SLCT_NotALiteral;
+    // try to constant-evaluate the string
+    break;
   }
 
   default:
+    // try to constant-evaluate the string
+    break;
+  }
+
+  const StringLiteral *FakeLiteral = nullptr;
+  switch (constEvalStringAsLiteral(S, E, FakeLiteral)) {
+  case SLCER_NotEvaluated:
     return SLCT_NotALiteral;
+
+  case SLCER_NotNullTerminated:
+    S.Diag(Args[format_idx]->getBeginLoc(),
+           diag::warn_printf_format_string_not_null_terminated)
+        << Args[format_idx]->getSourceRange();
+    if (!InFunctionCall)
+      S.Diag(E->getBeginLoc(), diag::note_format_string_defined);
+    // Stop checking, as this might just mean we're missing a chunk of the
+    // format string and there would be other spurious format issues.
+    return SLCT_UncheckedLiteral;
+
+  case SLCER_Evaluated:
+    InFunctionCall = false;
+    E = FakeLiteral;
+    goto tryAgain;
   }
 }
 
-// If this expression can be evaluated at compile-time,
-// check if the result is a StringLiteral and return it
-// otherwise return nullptr
-static const Expr *maybeConstEvalStringLiteral(ASTContext &Context,
-                                               const Expr *E) {
+static StringLiteralConstEvalResult
+constEvalStringAsLiteral(Sema &S, const Expr *E, const StringLiteral *&SL) {
+  // As a last resort, try to constant-evaluate the format string. If it
+  // evaluates to a string literal in the first place, we can point to that
+  // string literal in source and use that.
   Expr::EvalResult Result;
-  if (E->EvaluateAsRValue(Result, Context) && Result.Val.isLValue()) {
+  if (E->EvaluateAsRValue(Result, S.Context) && Result.Val.isLValue()) {
     const auto *LVE = Result.Val.getLValueBase().dyn_cast<const Expr *>();
-    if (isa_and_nonnull<StringLiteral>(LVE))
-      return LVE;
+    if (auto *BaseSL = dyn_cast_or_null<StringLiteral>(LVE)) {
+      SL = BaseSL;
+      return SLCER_Evaluated;
+    }
   }
-  return nullptr;
+
+  // Otherwise, try to evaluate the expression as a string constant.
+  std::string FormatString;
+  if (!E->tryEvaluateString(S.Context, FormatString)) {
+    return FormatString.empty() ? SLCER_NotEvaluated : SLCER_NotNullTerminated;
+  }
+
+  std::unique_ptr<llvm::MemoryBuffer> MemBuf;
+  {
+    llvm::SmallString<80> EscapedString;
+    {
+      llvm::raw_svector_ostream OS(EscapedString);
+      OS << '"';
+      OS.write_escaped(FormatString);
+      OS << '"';
+    }
+    MemBuf.reset(new llvm::SmallVectorMemoryBuffer(std::move(EscapedString),
+                                                   "<scratch space>", true));
+  }
+
+  // Plop that string into a scratch buffer, create a string literal and then
+  // go with that.
+  auto ScratchFile = S.getSourceManager().createFileID(std::move(MemBuf));
+  SourceLocation Begin = S.getSourceManager().getLocForStartOfFile(ScratchFile);
+  QualType SLType = S.Context.getStringLiteralArrayType(S.Context.CharTy,
+                                                        FormatString.length());
+  SL = StringLiteral::Create(S.Context, FormatString,
+                             StringLiteralKind::Ordinary, false, SLType, Begin);
+  return SLCER_Evaluated;
 }
 
 StringRef Sema::GetFormatStringTypeName(Sema::FormatStringType FST) {
@@ -6973,10 +7016,11 @@ void CheckFormatHandler::EmitFormatDiagnostic(
     S.Diag(IsStringLocation ? ArgumentExpr->getExprLoc() : Loc, PDiag)
       << ArgumentExpr->getSourceRange();
 
-    const Sema::SemaDiagnosticBuilder &Note =
-      S.Diag(IsStringLocation ? Loc : StringRange.getBegin(),
-             diag::note_format_string_defined);
-
+    SourceLocation DiagLoc = IsStringLocation ? Loc : StringRange.getBegin();
+    unsigned DiagID = S.getSourceManager().isWrittenInScratchSpace(DiagLoc)
+                          ? diag::note_format_string_evaluated_to
+                          : diag::note_format_string_defined;
+    const Sema::SemaDiagnosticBuilder &Note = S.Diag(DiagLoc, DiagID);
     Note << StringRange;
     Note << FixIt;
   }
diff --git a/clang/test/Sema/format-strings.c b/clang/test/Sema/format-strings.c
index af30ad5d15fe2..a94e0619ce843 100644
--- a/clang/test/Sema/format-strings.c
+++ b/clang/test/Sema/format-strings.c
@@ -3,6 +3,11 @@
 // RUN: %clang_cc1 -fblocks -fsyntax-only -verify -Wformat-nonliteral -isystem %S/Inputs -triple=x86_64-unknown-fuchsia %s
 // RUN: %clang_cc1 -fblocks -fsyntax-only -verify -Wformat-nonliteral -isystem %S/Inputs -triple=x86_64-linux-android %s
 
+// expected-note@-5{{format string was constant-evaluated}}
+// ^^^ there will be a <scratch space> SourceLocation caused by the
+// test_consteval_init_array test, that -verify treats as if it showed up at
+// line 1 of this file.
+
 #include <stdarg.h>
 #include <stddef.h>
 #define __need_wint_t
@@ -900,3 +905,12 @@ void test_promotion(void) {
   // pointers
   printf("%s", i); // expected-warning{{format specifies type 'char *' but the argument has type 'int'}}
 }
+
+void test_consteval_init_array(void) {
+  const char buf_not_terminated[] = {'%', 55 * 2 + 5, '\n'}; // expected-note{{format string is defined here}}
+  printf(buf_not_terminated, "hello"); // expected-warning{{format string is not null-terminated}}
+
+  const char buf[] = {'%', 55 * 2 + 5, '\n', 0};
+  printf(buf, "hello"); // no-warning
+  printf(buf, 123); // expected-warning{{format specifies type 'char *' but the argument has type 'int'}}
+}
diff --git a/clang/test/SemaCXX/format-strings.cpp b/clang/test/SemaCXX/format-strings.cpp
index 48cf23999a94f..7b04ea7d8bc75 100644
--- a/clang/test/SemaCXX/format-strings.cpp
+++ b/clang/test/SemaCXX/format-strings.cpp
@@ -1,6 +1,14 @@
 // RUN: %clang_cc1 -fsyntax-only -verify -Wformat-nonliteral -Wformat-non-iso -Wformat-pedantic -fblocks %s
 // RUN: %clang_cc1 -fsyntax-only -verify -Wformat-nonliteral -Wformat-non-iso -fblocks -std=c++98 %s
 // RUN: %clang_cc1 -fsyntax-only -verify -Wformat-nonliteral -Wformat-non-iso -Wformat-pedantic -fblocks -std=c++11 %s
+// RUN: %clang_cc1 -fsyntax-only -verify -Wformat-nonliteral -Wformat-non-iso -Wformat-pedantic -fblocks -std=c++20 %s
+
+#if __cplusplus >= 202000l
+// expected-note@-6{{format string was constant-evaluated}}
+// ^^^ there will be a <scratch space> SourceLocation caused by the
+// test_constexpr_string test, that -verify treats as if it showed up at
+// line 1 of this file.
+#endif
 
 #include <stdarg.h>
 
@@ -238,3 +246,69 @@ void f(Scoped1 S1, Scoped2 S2) {
 }
 
 #endif
+
+#if __cplusplus >= 202000L
+class my_string {
+  char *data;
+  unsigned size;
+
+public:
+  template<unsigned N>
+  constexpr my_string(const char (&literal)[N]) {
+    data = new char[N+1];
+    for (size = 0; size < N; ++size) {
+      data[size] = literal[size];
+      if (data[size] == 0)
+        break;
+    }
+    data[size] = 0;
+  }
+
+  my_string(const my_string &) = delete;
+
+  constexpr my_string(my_string &&that) {
+    data = that.data;
+    size = that.size;
+    that.data = nullptr;
+    that.size = 0;
+  }
+
+  constexpr ~my_string() {
+    delete[] data;
+  }
+
+  template<unsigned N>
+  constexpr void append(const char (&literal)[N]) {
+    char *cat = new char[size + N + 1];
+    char *tmp = cat;
+    for (unsigned i = 0; i < size; ++i) {
+      *tmp++ = data[i];
+    }
+    for (unsigned i = 0; i < N; ++i) {
+      *tmp = literal[i];
+      if (*tmp == 0)
+        break;
+      ++tmp;
+    }
+    *tmp = 0;
+    delete[] data;
+    size = tmp - cat;
+    data = cat;
+  }
+
+  constexpr const char *c_str() const {
+    return data;
+  }
+};
+
+constexpr my_string const_string() {
+  my_string str("hello %s");
+  str.append(", %d");
+  return str;
+}
+
+void test_constexpr_string() {
+  printf(const_string().c_str(), "hello", 123); // no-warning
+  printf(const_string().c_str(), 123, 456); // expected-warning {{format specifies type 'char *' but the argument has type 'int'}}
+}
+#endif

@@ -3,6 +3,11 @@
// RUN: %clang_cc1 -fblocks -fsyntax-only -verify -Wformat-nonliteral -isystem %S/Inputs -triple=x86_64-unknown-fuchsia %s
// RUN: %clang_cc1 -fblocks -fsyntax-only -verify -Wformat-nonliteral -isystem %S/Inputs -triple=x86_64-linux-android %s

// expected-note@-5{{format string was constant-evaluated}}
// ^^^ there will be a <scratch space> SourceLocation caused by the
Collaborator

I have to say, I don't like this at all.

Contributor Author
@apple-fcloutier Apr 16, 2025

To clarify, in an actual diagnostic, it shows up like this:

<scratch space>:1:8: note: format string was constant-evaluated
    1 | "hello %s, %d"
      |        ^~
      |        %d

The 'format string was constant-evaluated' note could instead say 'format string was constant-evaluated to "hello %s, %d"' and not have the scratch space text, or we could simply not show the constant-evaluated string. Either alternative is worse, because we would be unable to point at the incorrect specifier in the format string. Given "format specifies type 'char *' but the argument has type 'int'", if your format string has two or three %s specifiers, there is no simple way for you to know which one the compiler is talking about.

The patch already supports the case where constant evaluation resolves to a string literal that exists in source. When that's not the case, I feel pretty strongly that we need to bring up the format string to the user somehow to show these diagnostics. Can you think of other ways to do this?
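
To show why the quoted scratch text lets the caret land on a specific specifier, here is a rough sketch, assuming a simplified escaping scheme. `quoteAsLiteral` and `columnOfSpecifier` are hypothetical helpers written for this illustration; the patch itself relies on LLVM's stream escaping and Clang's string literal location mapping rather than anything like this.

```cpp
#include <cassert>
#include <cstddef>
#include <string>

// Hypothetical helper: render a computed string the way a source literal
// would be spelled, with surrounding quotes and a few common escapes.
std::string quoteAsLiteral(const std::string &s) {
  std::string out = "\"";
  for (char c : s) {
    switch (c) {
    case '\\': out += "\\\\"; break;
    case '"':  out += "\\\""; break;
    case '\n': out += "\\n";  break;
    case '\t': out += "\\t";  break;
    default:   out += c;      break;
    }
  }
  out += '"';
  return out;
}

// Hypothetical helper: 1-based column of the n-th conversion specifier
// (counting from 0) in the quoted text, or 0 if there is none.
// A "%%" pair is a literal percent sign and is skipped.
std::size_t columnOfSpecifier(const std::string &quoted, std::size_t n) {
  std::size_t seen = 0;
  for (std::size_t i = 0; i < quoted.size(); ++i) {
    if (quoted[i] != '%')
      continue;
    if (i + 1 < quoted.size() && quoted[i + 1] == '%') {
      ++i; // literal percent sign, not a specifier
      continue;
    }
    if (seen++ == n)
      return i + 1;
  }
  return 0;
}

void demo() {
  const std::string quoted = quoteAsLiteral("hello %s, %d");
  assert(quoted == "\"hello %s, %d\"");
  assert(columnOfSpecifier(quoted, 0) == 8); // matches "<scratch space>:1:8"
}
```

With the string spelled out in a buffer, the first specifier sits at column 8, which is the position reported in the note above.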

Collaborator

That definitely clarifies a lot, do we do this anywhere else?

Contributor Author

Not that I'm aware of. Before this change, <scratch space> buffers were used only for macro expansion. (The other specially-named buffers are <built-in> and <command line>, but we really could call it anything.) As far as I know, diagnosing based on the string result of compile-time evaluation is unprecedented and we need to do something new one way or another.

Contributor

Like @shafik and @tbaederr - I have a slight concern with performance.
In general, we should avoid checking format strings when these diagnostics are not enabled. That would at least lead to less work in system headers.

Benchmarking sounds like a good idea.
But I don't have a better solution than using a scratch space.

Contributor Author

@cor3ntin I looked into this and there are about 50 distinct DiagIDs for format-related issues, which is impractical to check ahead of time and impractical to maintain as format diagnostics expand. Most/all of them can be controlled as an aggregate by the -Wformat and -Wformat=2 warning groups. This would be practical to check, but I think that the only facilities we have to check whether diagnostics are enabled are based on DiagIDs rather than groups.

For what it's worth, I'm less worried than you: when the format string is a function call, we already try to evaluate it. However, the result is discarded if the lvalue base is not a string literal. This PR expands the technique as a universal fallback, but I expect the expensive case to be function calls since I think that's the only way to get control flow (aside from statement expressions).
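
The distinction between the two function-call cases can be sketched as follows, as I read the diff. `literalFormat`, `kTable`, and `arrayFormat` are hypothetical names for illustration only.

```cpp
#include <cassert>
#include <cstring>

// Every path returns a pointer into a string literal written in the source,
// so the evaluated result can be traced back to a literal.
constexpr const char *literalFormat(bool withName) {
  return withName ? "%s: %d\n" : "%d\n";
}

// The result points into a named array instead; no string literal backs it.
constexpr char kTable[] = {'%', 'd', '\n', '\0'};
constexpr const char *arrayFormat() { return kTable; }

void demo() {
  assert(std::strcmp(literalFormat(true), "%s: %d\n") == 0);
  assert(std::strcmp(literalFormat(false), "%d\n") == 0);
  assert(std::strcmp(arrayFormat(), "%d\n") == 0);
}
```

Both calls are evaluated either way; the difference is that the second result would previously have been thrown away, whereas the fallback turns it into a checkable string.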

I can try to improve this to avoid having to evaluate the string twice, though.

Collaborator

I'm not too worried about performance, but I can see where the concerns come from. It may be worth it to put a branch up on https://llvm-compile-time-tracker.com/ to verify we're not slowing things down too much, but I also don't think it's strictly required. (Checking diagnostic IDs to see if the check is disabled would be really awkward and I think we should avoid it in this case.)

@@ -17945,15 +17945,36 @@ bool Expr::tryEvaluateObjectSize(uint64_t &Result, ASTContext &Ctx,

static bool EvaluateBuiltinStrLen(const Expr *E, uint64_t &Result,
EvalInfo &Info, std::string *StringResult) {
if (!E->getType()->hasPointerRepresentation() || !E->isPRValue())
Collaborator

I think a better early exit would be:

if (!Ty->hasPointerRepresentation() && !Ty->canDecayToPointerType())
  return false;

This would eliminate the need for the else { return false; } below.

I think leaving the if (!E->isPRValue()) separate might be cleaner.

Contributor Author

Mh, I don't love checking the same thing twice. I reorganized things a little differently to avoid the else { return false } branch.

@@ -10170,6 +10170,8 @@ def warn_format_bool_as_character : Warning<
"using '%0' format specifier, but argument has boolean value">,
InGroup<Format>;
def note_format_string_defined : Note<"format string is defined here">;
def note_format_string_evaluated_to : Note<
"format string was constant-evaluated">;
Collaborator

I am concerned that this wording will lead to further confusion over what constitutes "constant evaluation".

Firstly, this "constant evaluation" is not a constant evaluation required by the language (in terms of "when" constant evaluation is supposed to occur).

Secondly, as all constant evaluation required by the language occurs as-if in a manifestly constant-evaluated context, this "constant evaluation" does not match "how" constant evaluation required by the language would behave.

Contributor Author

The point is taken. Would you like to suggest an alternate wording?

Collaborator

> The point is taken. Would you like to suggest an alternate wording?

"format string was computed, for diagnostic purposes, to"?

Contributor Author

I noodled on it and I think it's a little awkward, but I understand why we're trying to stay away from wording with a standardized definition. How do you feel about "format string resolved to a constant string"?

Collaborator

"computed format string is"?
"format string computed to"?

Contributor Author

I find this format awkward when it doesn't have the source immediately after. For instance, clang-tidy displays the note without the <scratch buffer> contents. I like the current wording because it feels complete even when that's missing. (I know the C++ diagnostics have lots of "found this candidate", "candidate ignored because ...", "in template instantiation requested here", etc, but I think that they are a necessary evil rather than the format to strive for.)

As I understand your concerns, the main problem with my wording is that it needs an adjective to qualify "string" with that is ideally something other than "constant" because it's not used in the standard sense.

"non-literal format string evaluated/resolved/computed at compile time"? Or do you see something else in that vein that would work? There is one other diagnostic that says "compile time constant expression" (and I'm OK dropping "constant expression" for all the reasons above).

Collaborator

> "non-literal format string evaluated/resolved/computed at compile time"

"computed format string (from non-literal) for diagnostic purposes"
or just
"computed format string for diagnostic purposes"?

printf(const_string().c_str(), "hello", 123); // no-warning
printf(const_string().c_str(), 123, 456); // expected-warning {{format specifies type 'char *' but the argument has type 'int'}}
}
#endif
Collaborator

Please add a test using if consteval in a meaningful manner.

Collaborator

See #135913 for other potential considerations.

Collaborator

FWIW, I don't see any new tests for these comments.

Contributor Author

I added tests for a simple case, but I'm not sure what counts as "meaningful". With that said, I'm hitting the same problem that Hubert reported (which shipping Clang currently exhibits, and that my change does not address): https://godbolt.org/z/zTfGfGvKj

Collaborator

Yeah, I think the point @hubert-reinterpretcast was making is that this exacerbates an existing problem rather than introduces a new one. That's unfortunate, but perhaps we can live with it?

Collaborator

The context of the expression being evaluated during computation for the format string determines whether we can get a false positive/negative if we insist on getting a format string computation result.

We know whether if consteval takes its true or false branch when the context is:

  • in a manifestly constant-evaluated context or
  • outside of a manifestly constant-evaluated context and not "in"
    • a default member initializer, or
    • a constexpr function or the default arguments thereof.

For such cases, I think we should get the correct format string computation.

For the other cases, I think (at least in the long term) we should (by default) fail the format string computation attempt when if consteval is encountered.
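To make the distinction above concrete, here is a minimal sketch of the three kinds of context. The helper names (`pick_format`, `runtime_conversion`, `ambiguous_conversion`) are hypothetical and do not come from the patch. The sketch uses `__builtin_is_constant_evaluated()`, a GCC/Clang builtin, as a stand-in for C++23 `if consteval` so that it compiles in older language modes; for the purposes of this example the two should select branches the same way.

```cpp
#include <cassert>

// Hypothetical helper: returns a different format string depending on
// whether the call is being constant-evaluated.
constexpr const char *pick_format() {
  // Stand-in for: if consteval { return "%s\n"; } else { return "%d\n"; }
  return __builtin_is_constant_evaluated() ? "%s\n" : "%d\n";
}

// Context 1: manifestly constant-evaluated (initializer of a constexpr
// variable). The "consteval" branch is known to be taken.
constexpr const char *kCompileTimeFormat = pick_format();
static_assert(kCompileTimeFormat[1] == 's', "constant-evaluated branch");

// Context 2: an ordinary (non-constexpr) function, outside any manifestly
// constant-evaluated context. The run-time branch is known to be taken.
inline char runtime_conversion() {
  const char *Format = pick_format();
  return Format[1];
}

// Context 3: inside a constexpr function. Which branch is taken depends on
// how the enclosing function is eventually called, so a checker that only
// looks at this body cannot know the answer.
constexpr char ambiguous_conversion() { return pick_format()[1]; }
static_assert(ambiguous_conversion() == 's', "called at compile time");

inline void check_contexts() {
  assert(runtime_conversion() == 'd');   // run-time call
  assert(ambiguous_conversion() == 'd'); // same function, called at run time
}
```

In the third context the same expression yields `%s` or `%d` depending on the caller, which is the situation where insisting on a computed format string could produce a false positive or a false negative.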

@tbaederr
Contributor

I don't know how much format strings are actually exercised there, but don't forget to run this through the compile time tracker.

@@ -17945,15 +17945,36 @@ bool Expr::tryEvaluateObjectSize(uint64_t &Result, ASTContext &Ctx,

static bool EvaluateBuiltinStrLen(const Expr *E, uint64_t &Result,
Contributor

We probably want to rename that method to EvaluateCString or something like that, given how we use it.

@@ -3,6 +3,11 @@
// RUN: %clang_cc1 -fblocks -fsyntax-only -verify -Wformat-nonliteral -isystem %S/Inputs -triple=x86_64-unknown-fuchsia %s
// RUN: %clang_cc1 -fblocks -fsyntax-only -verify -Wformat-nonliteral -isystem %S/Inputs -triple=x86_64-linux-android %s

// expected-note@-5{{format string was constant-evaluated}}
// ^^^ there will be a <scratch space> SourceLocation caused by the
Contributor

Like @shafik and @tbaederr, I have a slight concern with performance.
In general, we should avoid checking format strings when these diagnostics are not enabled. That would at least lead to less work in system headers.

Benchmarking sounds like a good idea.
But I don't have a better solution than using a scratch space.

@@ -5935,8 +5936,14 @@ static void CheckFormatString(
llvm::SmallBitVector &CheckedVarArgs, UncoveredArgHandler &UncoveredArg,
bool IgnoreStringsWithoutSpecifiers);

static const Expr *maybeConstEvalStringLiteral(ASTContext &Context,
const Expr *E);
enum StringLiteralConstEvalResult {
Contributor

Suggested change
- enum StringLiteralConstEvalResult {
+ enum StringLiteralConstantEvaluationResult {

Comment on lines 5939 to 5946
enum StringLiteralConstEvalResult {
SLCER_NotEvaluated,
SLCER_NotNullTerminated,
SLCER_Evaluated,
};

static StringLiteralConstEvalResult
constEvalStringAsLiteral(Sema &S, const Expr *E, const StringLiteral *&SL);
Contributor

Suggested change
- enum StringLiteralConstEvalResult {
- SLCER_NotEvaluated,
- SLCER_NotNullTerminated,
- SLCER_Evaluated,
- };
- static StringLiteralConstEvalResult
- constEvalStringAsLiteral(Sema &S, const Expr *E, const StringLiteral *&SL);
+ enum StringLiteralConstEvalResult {
+ SLCER_NotEvaluated,
+ SLCER_NotNullTerminated,
+ SLCER_Evaluated,
+ };
+ static StringLiteralConstEvalResult
+ EvaluateStringAndCreateLiteral(Sema &S, const Expr *E, const StringLiteral *&SL);

Contributor

We should really reflect we might create a scratch space here.
Also, can we comment this function?

static const Expr *maybeConstEvalStringLiteral(ASTContext &Context,
const Expr *E) {
static StringLiteralConstEvalResult
constEvalStringAsLiteral(Sema &S, const Expr *E, const StringLiteral *&SL) {
Contributor

Suggested change
- constEvalStringAsLiteral(Sema &S, const Expr *E, const StringLiteral *&SL) {
+ EvaluateStringAndCreateLiteral(Sema &S, const Expr *E, const StringLiteral *&SL) {

{
llvm::raw_svector_ostream OS(EscapedString);
OS << '"';
OS.write_escaped(FormatString);
Contributor

You should not need to do any escaping here, the diagnostics engine should take care of that for you.
You probably want tests for that.

Contributor Author

@apple-fcloutier Apr 21, 2025

No, I really do need the escaping, otherwise format strings containing quotes, newlines and probably other characters will print incorrectly. For instance, the format string hello "%s" will print as "hello "%s"" in the scratch space when it needs to be "hello \"%s\"". Even if we found this to be acceptable, it would break the logic that figures out the source location of a specifier into the string literal (this has to be computed at the point there is a diagnostic to show because Clang doesn't keep source locations for individual characters in a string literal). Keep in mind that this is essentially synthesized source code, not text being piped into a diagnostic.

I don't know how to add a test for it because I don't know how to get clang -verify to surface the code in a <scratch space> file.
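As an illustration of why the escaping matters, the sketch below turns evaluated characters back into something that reads as a string literal. The helper `quote_as_literal` is hypothetical and written only for this example; the patch itself uses `llvm::raw_ostream::write_escaped`, whose exact escaping rules are not reproduced here.

```cpp
#include <cassert>
#include <string>

// Hypothetical stand-in for the escaping step: wraps Text in double quotes
// and escapes the characters that would otherwise end or corrupt the literal.
std::string quote_as_literal(const std::string &Text) {
  std::string Out = "\"";
  for (char C : Text) {
    switch (C) {
    case '"':  Out += "\\\""; break; // a bare quote would terminate the literal
    case '\\': Out += "\\\\"; break;
    case '\n': Out += "\\n";  break;
    case '\t': Out += "\\t";  break;
    default:   Out += C;      break;
    }
  }
  Out += '"';
  return Out;
}

inline void check_quote_as_literal() {
  // The format string  hello "%s"  becomes the source text  "hello \"%s\""
  assert(quote_as_literal("hello \"%s\"") == "\"hello \\\"%s\\\"\"");
}
```

Without the escaping, the synthesized text would be `"hello "%s""`, which no longer lexes as a single string literal, so offsets of specifiers inside it could not be mapped back to source locations.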

Collaborator

My big concern here is what happens to applied fix-its when the fix is in the scratch space? Do we need to suppress the fix-its in that case?

Contributor Author

Clang uses fixits in diagnostics to show what should change, which I think is useful:

<scratch space>:1:8: note: format string was constant-evaluated
    1 | "hello %s, %d"
      |        ^~
      |        %d

%d here is displayed because of the fixit attached to the diagnostics.

I ran some simple tests and this is what I get:

  • with -fdiagnostics-parseable-fixits, you do get a diagnostic entry that looks like fix-it:"<scratch space>":{1:4-1:6}:"%d"
  • with -Xclang -fix-what-you-can, clang completes with no error
  • with -Xclang -fix-what-you-can -Xclang -fixit-to-temporary, no temporary file is created in . or the source directory (which were different for the purposes of that test)

Collaborator

> Clang uses fixits in diagnostics to show what should change, which I think is useful:
>
> <scratch space>:1:8: note: format string was constant-evaluated
>     1 | "hello %s, %d"
>       |        ^~
>       |        %d
>
> %d here is displayed because of the fixit attached to the diagnostics.

It's useful information, but we have ways which try to automatically apply fixes and we need to make sure those behave reasonably.

> I ran some simple tests and this is what I get:
>
> * with `-fdiagnostics-parseable-fixits`, you do get a diagnostic entry that looks like `fix-it:"<scratch space>":{1:4-1:6}:"%d"`

That seems reasonable.

> * with `-Xclang -fix-what-you-can`, clang completes with no error

That's good.

> * with `-Xclang -fix-what-you-can -Xclang -fixit-to-temporary`, no temporary file is created in `.` or the source directory (which were different for the purposes of that test)

I suppose that's reasonable. How about with -fixit, which tries to apply the fix to the source file? Similar question if you run the test via clang-tidy and try to apply all fixes automatically.

Contributor Author

I tried with -fixit and had the same result. For clang-tidy, I tried this:

% cat /tmp/test.c
__attribute__((format(printf, 1, 2)))
int printf(const char *, ...);

int main() {
	const char buf[] = {'"', '%', 's', '"', 0};
	printf(buf, 123);
	printf("%s", 123);
}
% bin/clang-tidy --fix /tmp/test.c
....
2 warnings generated.
/tmp/test.c:6:17: warning: format specifies type 'char *' but the argument has type 'int' [clang-diagnostic-format]
    6 |     printf(buf, 123);
      |            ~~~  ^~~
note: format string resolved to a constant string
/tmp/test.c:7:18: warning: format specifies type 'char *' but the argument has type 'int' [clang-diagnostic-format]
    7 |     printf("%s", 123);
      |             ~~   ^~~
      |             %d
/tmp/test.c:7:13: note: FIX-IT applied suggested code changes
    7 |     printf("%s", 123);
      |             ^
clang-tidy applied 1 of 1 suggested fixes.

In words: it fixes the printf("%s", 123) line to use %d and leaves the other printf alone without throwing a fuss (claiming "clang-tidy applied 1 of 1 suggested fixes"). Clang-tidy shows the "format string resolved to a constant string" note but not the scratch space contents. It's not ideal, but it's quite reasonable IMO.

@@ -3,6 +3,11 @@
// RUN: %clang_cc1 -fblocks -fsyntax-only -verify -Wformat-nonliteral -isystem %S/Inputs -triple=x86_64-unknown-fuchsia %s
// RUN: %clang_cc1 -fblocks -fsyntax-only -verify -Wformat-nonliteral -isystem %S/Inputs -triple=x86_64-linux-android %s

// expected-note@-5{{format string was constant-evaluated}}
// ^^^ there will be a <scratch space> SourceLocation caused by the
Collaborator

I'm not too worried about performance, but I can see where the concerns come from. It may be worth it to put a branch up on https://llvm-compile-time-tracker.com/ to verify we're not slowing things down too much, but I also don't think it's strictly required. (Checking diagnostic IDs to see if the check is disabled would be really awkward and I think we should avoid it in this case.)

@@ -265,6 +265,9 @@ related warnings within the method body.
``format_matches`` accepts an example valid format string as its third
argument. For more information, see the Clang attributes documentation.

- Format string checking now supports the compile-time evaluation of format
Collaborator

It may help users to understand the improvement if there's a small code example showing what wasn't checked and is now correctly caught. WDYT?
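One possible shape for such an example, sketched under assumptions: the type `SmallFormat` and the functions `make_format` and `print_name` are hypothetical and modeled loosely on the `const_string().c_str()` test in this PR. The point is only to show a format string that is built at compile time rather than written as a literal at the call site; whether a particular compiler version diagnoses a mismatched call is not asserted here.

```cpp
#include <cassert>
#include <cstdio>
#include <cstring>

// Hypothetical constexpr-capable holder for a short format string.
struct SmallFormat {
  char Buf[8];
  constexpr const char *c_str() const { return Buf; }
};

// Builds "%<Conversion>\n" at compile time; unused bytes are zero-filled.
constexpr SmallFormat make_format(char Conversion) {
  return SmallFormat{{'%', Conversion, '\n', 0}};
}

constexpr SmallFormat kStringFormat = make_format('s');

// The format argument is not a string literal, but every piece of it is
// visible to constant evaluation. With the fallback described in this PR,
// a call such as printf(kStringFormat.c_str(), 123) is the kind of
// mismatch the checker is meant to be able to see.
inline int print_name(const char *Name) {
  return std::printf(kStringFormat.c_str(), Name);
}

inline void check_format_contents() {
  assert(std::strcmp(kStringFormat.c_str(), "%s\n") == 0);
}
```

Calling `print_name("clang")` prints the name followed by a newline; the well-typed call is the `// no-warning` case and an `int` argument in its place is the case the new evaluation path targets.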


// Plop that string into a scratch buffer, create a string literal and then
// go with that.
auto ScratchFile = S.getSourceManager().createFileID(std::move(MemBuf));
Collaborator

Please spell out the type.

Several users of compile-time string evaluation can meaningfully use the
special case that compile-time string evaluation resolves to a string
literal in source (for instance, to improve diagnostics). This changes
Expr::tryEvaluateString to return a StringEvalResult, which can hold
either a string literal and an offset or a std::string of evaluated
characters.
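A rough sketch of what a result type with those two states could look like is below. `StringEvalSketch` and its members are hypothetical names for illustration only; this is not the definition of `StringEvalResult` in the patch, and it uses a plain `const char *` where the real code would refer to a `StringLiteral` AST node.

```cpp
#include <cassert>
#include <cstddef>
#include <string>

// Hypothetical two-state result: either a view into a literal that exists in
// source (pointer plus offset), or an owned copy of the evaluated characters.
struct StringEvalSketch {
  const char *Literal = nullptr; // non-null only in the "literal" state
  std::size_t Offset = 0;        // where the string starts within Literal
  std::string Evaluated;         // used only in the "evaluated" state

  static StringEvalSketch fromLiteral(const char *L, std::size_t Off) {
    StringEvalSketch R;
    R.Literal = L;
    R.Offset = Off;
    return R;
  }
  static StringEvalSketch fromEvaluated(std::string Chars) {
    StringEvalSketch R;
    R.Evaluated = std::move(Chars);
    return R;
  }

  bool isLiteral() const { return Literal != nullptr; }

  // Either state can be flattened to characters; only the literal state can
  // additionally point a diagnostic at real source text.
  std::string str() const {
    return isLiteral() ? std::string(Literal + Offset) : Evaluated;
  }
};

inline void check_string_eval_sketch() {
  StringEvalSketch A = StringEvalSketch::fromLiteral("id: %d\n", 4);
  assert(A.isLiteral() && A.str() == "%d\n");
  StringEvalSketch B = StringEvalSketch::fromEvaluated("%s");
  assert(!B.isLiteral() && B.str() == "%s");
}
```

Keeping the literal state separate is what lets callers attach notes to the original source location instead of a scratch buffer when one is available.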
@llvmbot llvmbot added clang:codegen IR generation bugs: mangling, exceptions, etc. clang:analysis labels Apr 22, 2025
@apple-fcloutier
Contributor Author

I have addressed current feedback, except that I still haven't checked for perf (will follow Aaron's directions on that soon) and I'm still working out what I need to check for if consteval (confession time: I have never used it before).

@apple-fcloutier
Contributor Author

(The broken tests are because I updated the note wording and then did not update the tests. We have ongoing conversation for that so I'll fix that when we have a resolution)

@ojhunt
Contributor

ojhunt commented May 3, 2025

The pointer auth options PR would also benefit from this - it currently has an ad hoc and restricted version of this as it needed to support strings produced by builtins, and predated @cor3ntin's string evaluation change so that was not even remotely an option at the time.

I would much rather have us use a single string evaluation routine rather than the current behavior, but that would block the options PR on this one.

#136828

@AaronBallman
Collaborator

I'd appreciate it if you could fix the merge conflicts, btw; I'd like to apply the patch locally and play around with it. The code changes look pretty reasonable to me, but I had some questions as to how stuff was handled.
