kram-profile
==========

kram-profile wraps SwiftUI atop a WKWebView running the Perfetto TraceViewer.  A dev can open directories or files of traces.  Supported files are added to a list for quick viewing in Perfetto.  The app is multi-document.  Each window is a single instance of the Perfetto TraceViewer that is loaded once.  The sandboxed SwiftUI acts as the bridge to the native file system, which the TraceViewer browser sandbox lacks.

Flamegraphs are key to all profiling.  Why look at a giant table of numbers when you can see them visually?  Flamegraphs also need to be dynamic and display hover tips and details.  Fortunately there are several tools now supporting flamegraphs.  Perfetto is one such tool.

kram-profile fixes up build traces to reflect the name of the file/function.  And it demangles function names from clang.
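
The filename fixup can be sketched roughly as below.  This is a Python illustration and not the app's Swift code.  It assumes that clang's "Source" events carry the header path in `args.detail`, and the helper name `rename_source_events` is hypothetical.

```python
import os

def rename_source_events(trace):
    """Rename clang 'Source' events to the file's base name (no extension)."""
    for event in trace.get("traceEvents", []):
        detail = event.get("args", {}).get("detail")
        if event.get("name") == "Source" and detail:
            base = os.path.basename(detail)
            event["name"] = os.path.splitext(base)[0]
    return trace

# Made-up sample events for illustration only.
sample = {"traceEvents": [
    {"name": "Source", "ph": "X", "ts": 0, "dur": 1500,
     "args": {"detail": "/usr/include/c++/v1/vector"}},
    {"name": "Source", "ph": "X", "ts": 2000, "dur": 800,
     "args": {"detail": "src/KramImage.h"}},
    {"name": "Frontend", "ph": "X", "ts": 0, "dur": 4000},
]}
renamed = rename_source_events(sample)
names = [e["name"] for e in renamed["traceEvents"]]
```

Events other than "Source" are left untouched, so the rest of the trace still reads the same in Perfetto.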

Files can be dragged onto the list view, double-clicked from Finder if the file types below are associated with the app, or opened via the Open and Refresh commands.

Supported files

* .memtrace - memory report generated by scripts in the kram scripts folder.
* .trace/.perftrace - performance timings in the form of Catapult trace json files
* .json/.buildtrace - clang timing output generated using -ftime-trace
* .zip archives of above
* .gzip compressed files of above
* folders of loose files or archives
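
Handling the list above might look something like this sketch.  It is Python for illustration and not the app's implementation.  The function name `unpack_traces` and the choice to detect gzip by its magic bytes are assumptions.

```python
import gzip
import io
import zipfile

TRACE_EXTS = (".memtrace", ".trace", ".perftrace", ".json", ".buildtrace")

def unpack_traces(name, data):
    """Return {trace_name: bytes} for a loose, gzip-compressed, or zip input."""
    lower = name.lower()
    if lower.endswith(".zip"):
        with zipfile.ZipFile(io.BytesIO(data)) as archive:
            return {n: archive.read(n) for n in archive.namelist()
                    if n.lower().endswith(TRACE_EXTS)}
    if data[:2] == b"\x1f\x8b":  # gzip magic bytes
        inner = name.rsplit(".", 1)[0]  # drop the compression extension
        return {inner: gzip.decompress(data)}
    if lower.endswith(TRACE_EXTS):
        return {name: data}
    return {}  # unsupported file, skip it
```

This sketch does not handle compressed files nested inside a zip archive.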

There is a pre-built version of kram-profile for macOS 13.0 and higher.

List view 
  File type, name, duration
  Up/down arrow keys or cmd+N/cmd+shift+N to advance through list
  Hover for path of filename
  Can collapse/restore the list
  Type-search in the list
    
Navigation Title
  Filename (archive)
  Info button (memtrace) - shows max of tracks for heap size
  cmd+T  search by duration
  cmd+shift+T search by name
  
WKWebView
  Perfetto Flamegraph
  Tracknames on left
  cmd+S to search for a named entry in flamegraph
  cmd+shift+P to parse command 
  Cannot hide the tracknames
  
----------------

TODO: (x are done)
* x Fix document support, so can double click and have app open files. readFromURL like kramv.
* x Support binary Perfetto traces.  Test with Google sample code.
* x Fixup "Source" tags in clang json to use filename (no extension) from detail field
* x Find start/end time of each json file.
* x Support gzip trace files
* x Add sort by range (useful for mem/build traces)
* x Add zip archive support, can drop archive of 1+ traces
* x Tie in with the excellent ClangBuildAnalyzer tool

* Add frame type for perf traces for vsync ticker (binary format prob has it)
* Scale specific traces to a single duration.  That way the next file comes in at that scale. 
* Move away from Catapult json to own binary format.  Can then translate to json or use the Perfetto SDK to convert to protobufs.

----------------

# Profilers

Cpu Profilers. See the links for more details

* Catapult - see below
* Perfetto - see below
* Flutter (using Perfetto) https://docs.google.com/document/d/1CvAClvFfyA5R-PhYUmn5OOQtYMH4h6I0nSsKchNAySU/preview#heading=h.yr4qxyxotyw
* Optick - https://github.com/bombomby/optick
* Tracy - https://github.com/wolfpld/tracy

* ClangBuildAnalyzer - https://github.com/aras-p/ClangBuildAnalyzer
* Microprofile 
* Microprofile 2
* Microprofiler
* EasyProfiler
* VerySleepy
* LukeStackwalker
* Remotery
* geiger
* Palanteer
* Intel IACA
* Coz
* heaptrack
* hotspot
* dprofiler
* spall

* Commercial
* Telemetry - http://www.radgametools.com/telemetry.htm
* Superluminal - higher-rate sampling profiler
* Xcode Instruments - see Xcode
* AMD Code Analyst - see Xcode
* Intel Vtune -

Gpu Profilers. See for more details

* Xcode Gpu Capture
* Android Gpu Inspector - https://developer.android.com/agi
* Nvidia NSight
* Mali Shader Compiler
* Pix Profiler

Catapult
---------

This was the tracing system that Perfetto replaced.  Originally designed for Chrome profiling.  Flamegraph and track-based.  It also had a nice json API for recording thread names and profile scopes.

Perfetto
---------
* https://ui.perfetto.dev
* https://perfetto.dev/docs/visualization/deep-linking-to-perfetto-ui

This is a web-based profiling and flame-graph tool.  It's fast on desktop, and continues to evolve.  Only has second and timecode granularity which isn't enough.  For example, performance profiling for games is in milliseconds.  The team is mostly focused on Chrome profiling which apparently is in seconds.  But the visuals are nice, and it now has hover tips with size/name, and also has an Issues list that the devs are responsive to.  Flutter is using this profiler, and kram-profile does too.

Perfetto lives inside a sandbox due to the browser, so feeding files to Perfetto is one weakness.  As a result kram-profile's file list is a nice complement, and can send the file data across via JavaScript.  This is not unlike an Electron wrapper, but uses much less memory.

One limitation is that traces must be nested, so the time ranges of events on a track cannot partially overlap.  Make sure to honor this, or traces will overlap vertically and become confusing.  There is a C++ SDK to help with writing out traces, and that is a much more compact format than the json.  But more languages can write to the json format.  The Perfetto team is doing no further work on the json format.  And fields like "color" are unsupported, and Perfetto uses its own coloration for blocks instead.  This coloration is nice and consistent and tied to name.
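
A quick way to catch nesting violations before loading a trace is sketched below in Python.  This is an illustration and not part of kram-profile.  It assumes events on one track are given as `(name, start, duration)` tuples in a common time unit, and `find_overlaps` is a hypothetical helper name.

```python
def find_overlaps(events):
    """Return (outer, inner) name pairs that overlap without nesting."""
    bad = []
    stack = []  # open events as (name, end)
    # Sort by start; on ties put the longer (outer) event first.
    for name, start, dur in sorted(events, key=lambda e: (e[1], -e[2])):
        end = start + dur
        # Drop events that finished before this one starts.
        while stack and stack[-1][1] <= start:
            stack.pop()
        # The new event must end before its enclosing event does.
        if stack and end > stack[-1][1]:
            bad.append((stack[-1][0], name))
        stack.append((name, end))
    return bad

# "render" starts inside "update" but ends after it, so it is flagged.
broken = find_overlaps([("frame", 0, 100), ("update", 10, 30), ("render", 35, 50)])
clean = find_overlaps([("frame", 0, 100), ("update", 10, 30), ("render", 40, 50)])
```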

Having lots of issues trying to reuse the Perfetto web page to load more than one profile.  The web app gets into a bad state, and then won't load any content afterwards.

Orbit
---------
* https://orbitprofiler.com/

This profiler uses dynamic instrumentation of code via dtrace and trampolines.  Note that Win, macOS can use this sort of system.  Apple blocks access to dtrace on iOS, but there are mentions of ktrace.  So you inject/remove traces dynamically by patching the dll sources directly.  This used to run on macOS, Win, and Linux.  Google Stadia adopted this project, and now it is limited to Linux support.

This avoids the need to blindly instrument code or inject scopes into high-frequency routines.  But this patching is not compatible with the security theater adopted by iOS devices.

ClangBuildAnalyzer
--------
* https://github.com/aras-p/ClangBuildAnalyzer

A nice build profile aggregator.  Runs through the json timings that Clang generates, and details which headers, templates, and optimizations are slowing down builds.  Then go back and review the json files to validate the results.  Uses hierarchical and not self time, so the timings do overlap.  And timings across threads total up to more time than the overall build takes.
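
To make the hierarchical versus self time distinction concrete, here is a small Python sketch.  It is not how ClangBuildAnalyzer works internally.  It assumes properly nested `(name, start, duration)` events on a single thread, and `self_times` is a hypothetical helper name.

```python
def self_times(events):
    """Map name -> self time (own duration minus direct children)."""
    result = {}
    stack = []  # open events as (name, end)
    for name, start, dur in sorted(events, key=lambda e: (e[1], -e[2])):
        # Drop events that finished before this one starts.
        while stack and stack[-1][1] <= start:
            stack.pop()
        if stack:
            # Child time is removed from the direct parent only.
            result[stack[-1][0]] -= dur
        result[name] = result.get(name, 0) + dur
        stack.append((name, start + dur))
    return result

# Made-up timings: hierarchical durations sum to 170, self times sum to 100.
events = [("Frontend", 0, 100), ("ParseClass", 10, 30), ("InstantiateFunction", 50, 40)]
times = self_times(events)
```

Summing hierarchical durations counts the children twice, which is why the totals exceed the wall-clock time.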

Has an incremental system to snapshot and compare modstamps, and only do work on newer files.  This is some great open-source.  Aras optimized Unity builds with this, and that's a huge codebase.  I've used this to optimize kram.

Include What You Use
---------
* https://github.com/include-what-you-use/include-what-you-use

Automate the tedium of finding the minimal set of headers for C/C++ with this utility.  A third party added ObjC support, but it hasn't landed.  Seems like on large projects the includes get out of hand.  So I look forward to trying this out.

Rewrites the headers, but there are ways to keep it from removing some.  Unclear how this works with cross-platform code.  But maybe it only strips includes within the defines that it sees.  Send the CXXFLAGS used for the build to the exe along with a source file.

# Use Cases

Memory profiling
---------

VMA can dump a json file, and that can be converted using scripts/GpuMemDumpPerfetto.py.  Then open this in kram-profile to see current memory fragmentation and layout across the various Vulkan heaps.  VMA can generate a png, but it's static.  Perfetto can allow one to zoom in and see the actual names of blocks and size.

Set the Perfetto timestamp to seconds, and then 1s = 1MB.  This allows reading the timings as megabytes.  A good timescale is 64s (64MB).
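
The 1s = 1MB mapping can be sketched as follows.  This is a Python illustration and not the contents of scripts/GpuMemDumpPerfetto.py.  It assumes Catapult timestamps in microseconds and 1MB = 1024 * 1024 bytes, and `block_to_event` is a hypothetical helper name.

```python
MB = 1024 * 1024           # assumption: binary megabytes
US_PER_SECOND = 1_000_000  # Catapult ts/dur are in microseconds

def block_to_event(name, offset_bytes, size_bytes, heap_id):
    """Map a memory block to a complete event where 1s reads as 1MB."""
    return {
        "name": name,
        "ph": "X",
        "pid": 0,
        "tid": heap_id,  # one track per heap
        "ts": offset_bytes * US_PER_SECOND / MB,
        "dur": size_bytes * US_PER_SECOND / MB,
    }

# A 64MB block at a 2MB offset shows up as a 64s slice starting at 2s.
event = block_to_event("shadowAtlas", 2 * MB, 64 * MB, heap_id=1)
```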

Performance profiling
---------

Have the app write out time and duration events using the Catapult json format.  Then open these in kram-profile to optimize an application.  A good timescale is 0.1s for games.  Can then see where app performance is lost across threads and job systems.  It is harder to measure async wait gaps, since these are not nested properly.  Also good to instrument sleeps.  Not sure how to scope fibers, since these get swapped out.  There are events which aren't duration based, so use those.
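
A minimal sketch of emitting such events is below.  It is written in Python for brevity, while a real app would likely do this from native code.  The names `profile_scope` and `instant` and the small per-thread index used for `tid` are assumptions.  The "X" (complete) and "i" (instant) phases come from the Catapult trace event format.

```python
import json
import threading
import time
from contextlib import contextmanager

events = []
thread_ids = {}
lock = threading.Lock()

def _tid():
    # Hypothetical small per-thread index; call with the lock held.
    return thread_ids.setdefault(threading.get_ident(), len(thread_ids) + 1)

@contextmanager
def profile_scope(name):
    """Record a complete ("X") event; ts and dur are in microseconds."""
    start = time.perf_counter_ns()
    try:
        yield
    finally:
        dur = time.perf_counter_ns() - start
        with lock:
            events.append({"name": name, "ph": "X", "pid": 0, "tid": _tid(),
                           "ts": start / 1000, "dur": dur / 1000})

def instant(name):
    """Record a thread-scoped instant ("i") event, which has no duration."""
    with lock:
        events.append({"name": name, "ph": "i", "s": "t", "pid": 0,
                       "tid": _tid(), "ts": time.perf_counter_ns() / 1000})

with profile_scope("frame"):
    with profile_scope("update"):
        time.sleep(0.001)  # stand-in for real work
    instant("vsync")

trace_json = json.dumps({"traceEvents": events})
```

Scopes close innermost first, so events are appended in end order.  The nesting is recovered from the timestamps, not from the order in the file.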

Build profiling
---------

Clang supports -ftime-trace across all platforms.  Set that to dump the Perfetto trace files into the build directories alongside the .o files.  Then use kram-profile to open these folders.  Also see scripts/cba.sh to run ClangBuildAnalyzer on these folders to identify where build timings are slow.  Then address these by optimizing includes and using pch where possible.  A good timescale is 1s.  Files that take longer than this to build should be targeted.
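
Finding the files over the 1s mark can also be scripted.  The Python sketch below is an illustration and not scripts/cba.sh.  It assumes the trace json files hold "X" events with `ts` and `dur` in microseconds, and that clang's summary events have names starting with "Total ", which are skipped.  `build_seconds` and `slow_files` are hypothetical names.

```python
import json
from pathlib import Path

def build_seconds(trace):
    """Span of one -ftime-trace file in seconds."""
    spans = [(e["ts"], e["ts"] + e.get("dur", 0))
             for e in trace.get("traceEvents", [])
             if e.get("ph") == "X" and not e.get("name", "").startswith("Total ")]
    if not spans:
        return 0.0
    first = min(start for start, _ in spans)
    last = max(end for _, end in spans)
    return (last - first) / 1e6

def slow_files(folder, threshold_seconds=1.0):
    """List (seconds, filename) for traces slower than the threshold, slowest first."""
    results = []
    for path in Path(folder).rglob("*.json"):
        seconds = build_seconds(json.loads(path.read_text()))
        if seconds > threshold_seconds:
            results.append((seconds, path.name))
    return sorted(results, reverse=True)
```

Calling `slow_files("build")` on a build directory would return the candidates to look at first in kram-profile.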

Simd libraries, and especially files like STL with heavy template generation will often be at the top of the list.  PCH will reduce parsing time for templates, but not the instantiation.  See the Optimization section for more details.

Ideally run the traces, run CBA, reduce headers and identify pch candidates.  Then repeat, until overall timings go down.  Remember that PCH is per link, so one per DLL or app.  It also breaks isolation of headers in files, so may want a CI build not using it to catch unspecified headers.

Ninja Build
---------

This is a minimal version of Make.  But code must generate the Ninja file.  Cmake is one generator, and GN is the primary generator.  But Ninja is so simple that it's fairly easy to specify directly.  I'm experimenting with this in the hlslparser, where I wrote the Ninja files manually just to work with the syntax.
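
As an illustration of how little is needed, the Python sketch below generates a small Ninja file.  It is not the hlslparser setup.  The rule names `compile` and `link`, the default flags, and the function name `ninja_text` are assumptions.  The `depfile` and `deps = gcc` lines are standard Ninja syntax for header dependencies.

```python
def ninja_text(sources, cxx="clang++", flags="-std=c++17 -O2", output="app"):
    """Return the text of a build.ninja that compiles and links the sources."""
    lines = [
        f"cxx = {cxx}",
        f"cxxflags = {flags}",
        "",
        "rule compile",
        "  command = $cxx $cxxflags -MMD -MF $out.d -c $in -o $out",
        "  depfile = $out.d",
        "  deps = gcc",
        "",
        "rule link",
        "  command = $cxx $in -o $out",
        "",
    ]
    objects = []
    for src in sources:
        obj = src.rsplit(".", 1)[0] + ".o"
        objects.append(obj)
        lines.append(f"build {obj}: compile {src}")
    lines.append(f"build {output}: link {' '.join(objects)}")
    lines.append(f"default {output}")
    return "\n".join(lines) + "\n"

text = ninja_text(["main.cpp", "util.cpp"])
```

Paths containing spaces would need Ninja's `$ ` escaping, which this sketch omits.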

# Optimization

Unity builds
-----------

Not to be confused with the Unity game engine.  But unity builds combine several .cpp files into a single .cpp.  This works around problems with slow linkers, and multiple template and inline code instantiations.  But code and macros from one .cpp spill into the next.  To avoid this, be careful about undeffing at the bottoms of files.  kram also uses a common namespace across headers and source files.  This allows "using namespace" in both, and keeps the namespaces compartmentalized.
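
Generating the combined files is simple enough to script.  The Python sketch below is an illustration and not kram's build setup.  The batch size, the file naming, and the function name `unity_sources` are assumptions.

```python
def unity_sources(cpp_files, batch_size=8, prefix="unity"):
    """Return {unity_filename: text}, each including up to batch_size .cpp files."""
    files = sorted(cpp_files)  # stable order keeps builds reproducible
    outputs = {}
    for index in range(0, len(files), batch_size):
        batch = files[index:index + batch_size]
        lines = ["// Generated unity file, do not edit."]
        lines += [f'#include "{path}"' for path in batch]
        outputs[f"{prefix}{index // batch_size}.cpp"] = "\n".join(lines) + "\n"
    return outputs

generated = unity_sources(["b.cpp", "a.cpp", "c.cpp"], batch_size=2)
```

Smaller batches limit how far macros and statics can spill, at the cost of more translation units.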

Precompiled headers (PCH)
-----------

These are a precursor to C++ modules.  pch are universally supported across compilers, whereas we may never see C++ modules.  You get one pch per library.  So if your app is a DLL and an exe, then each could have its own pch.  Need one pch per platform and config.  Force include this since it must be the first file in each, or explicitly include a file if you want to be explicit about which files get the pch.

A pch spreads headers into files.  So the build can break if some don't use it, or configs skip it.  Occasionally fixup missing headers by disabling it.  Templates are parsed, but only specializations are instantiated.  So it may be worth defining specializations in the pch.  STL is always a top offender with vector/unordered_map, function, and others at the top.

There are broken examples of setting up pch for Makefiles all over the internet.  Maybe cmake has a valid setup, but the gist is below for gcc/clang.  Make sure to verify the parse time is gone in kram-profile by looking at the clang build profiles.

Clang has options to generate a pch .o file.  This must be linked separately into the library.  This is something MSVC pch has supported for a long time.  gcc doesn't support this.  See the link below, and the pchObj in the makefile example below.

Advanced clang pch usage
https://maskray.me/blog/2023-07-16-precompiled-headers


    # gen the .d file, written to tmp and only replaces if it changes
    cppFlags = ... 
    
    # -MMD skips system headers, use -MD to include them
    cppDepFlags = -MMD -MP

    # header must be unique to build (f.e. defines, etc)
    cppBuild = $(platform)$(config)
    
    # setup the files involved, only get 1 pch per DLL/App since
    pchSrc = Precompile.h
    pchHdrSrc = Precompile-$(cppBuild).h
    pchDeps = $(pchHdr).d
    pchHdr = $(pchHdrSrc).pch
    pchObj = $(pchHdr).o
    pchIncludesDirs = -Idir1 -Idir2
    
    # this does code gen, templates, and debuginfo into the h.pch.o file
    pchFlags = -fpch-codegen -fpch-instantiate-templates -fpch-debuginfo
             
    # important - only copy the hdr if it changes, don't want full rebuild every time
    # linux (cp -u), win (xcopy), macOS (shell compare then cp)
    $(pchHdrSrc): $(pchSrc)
        cp $< $@
        
    # this will output the .d and .pch file (gcc names its pch .gch)
    $(pchHdr): $(pchHdrSrc)
        clang++ -x c++-header $(cppFlags) $(cppDepFlags) -MF $(pchDeps) $(pchFlags) $(pchIncludesDirs) -c $< -o $@ 
        
    # this makes sure that the pch is rebuilt if hdrs within pchHdr change
    # the - sign ignores the deps file on the first run where it does not exist.
    $(pchDeps): ;
    -include $(pchDeps)
    
    # optional code to build .o from .pch 
    # must link this in with the lib/exe, don't use "-x c++" here - it's ast not C++ code
    #  speeds the build, since code isn't prepended to each .o file, and then linked.
    $(pchObj): $(pchHdr)
        clang++ $(cppFlags) -c $< -o $@
    
    ....
    
    # prefix Precompile.h.pch to each .o file
    cppPchFlags = -include-pch $(pchHdr)
   
    # now build the files
    %.o: %.cpp $(pchHdr)
        clang++ $(cppFlags) $(cppPchFlags) -c $< -o $@ 

    # link the pchObj into the lib or exe
    allObjs = *.o $(pchObj)

    $(libOrExe): $(allObjs)
        clang++ $^ -o $@
        
        
SIMD
-----------

Vector instructions are universal now via SIMD.  For 16B SIMD, ARM has Neon and x64 has SSE4.2.  AVX/2 introduce 32B, and AVX-512 is 64B registers, but Intel has stripped that from newer consumer chips, and is introducing AVX10.  So AVX2 is as safe as it gets.  Note that Apple's Rosetta 2 emulator only supports SSE 4.2 at the time of this writing.  x64 SSE is always 16B size and 16B aligned, where Neon has an 8B float32x2 and 16B float32x4.  The default allocator for macOS is 16B aligned.  x64 is 16B aligned, but x86 was 8B aligned.

Apple has a very nice SIMD (simd/simd.h) library.  This uses the gcc vector extensions so swizzles and math operators are built into the compiler.  This makes the code look more HLSL like which is a good thing.  This ships with all calls inline, but optimized 2/3/4 way transcendental calls are buried in the Accelerate library, and the implementation just calls the c stdlib functions multiple times as a fallback.  It has a nice abstraction for int, uint, float, double simd math.  One defines the maximum SIMD level supported by the app, and the library then uses the largest register size that it can for that platform.  The higher size registers work with 16B alignment, so that is what Apple uses.

Optimized debug builds
-----------

One nice aspect of C++ is that specific files can be optimized.  But to do so, calls outside the .cpp become functions instead of inlines.  But within the .cpp, they get inlined and optimized.  Setting this up on a SIMD library takes a bit of work, but then callers are running optimized math even in debug.

Also Microsoft has various debug build flags that can optimize and inline force_inline calls.  Need to find out the details for clang.  These disable Edit & Continue, but clang in Visual Studio doesn't support it anyways.

* https://learn.microsoft.com/en-us/visualstudio/debugger/how-to-debug-optimized-code?view=vs-2022

* Visual Studio
* Use /Zo instead of /Od.  Now with Edit&continue.
* /d2Zi+
* Use VS2022, it's 64-bit
* Avoid C++20, it's slower to compile
* /Ob1 allows inline of  inline, __inline, or __forceinline, and member functions in the class decls.
* disable STL bounds checking
* WIN32_LEAN_AND_MEAN
* NOMINMAX
* use clang-cl

Xcode
* make sure to deadstrip the release build, or it's huge
* Cmake uses /Ob1 for RelWithDebInfo
* use SSE4.2 for Rosetta 2, and make sure to use Neon on arm64

* https://randomascii.wordpress.com/2013/09/11/debugging-optimized-codenew-in-visual-studio-2012/

* https://dirtyhandscoding.github.io/posts/fast-debug-in-visual-c.html






