This is a mix of the previous two modes: two (potentially different) benchmark binaries are run, and a different filter is applied to each one.
As you can see, the values in the `Time` and `CPU` columns are calculated as `(new - old) / |old|`.
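
As an illustration, here is a minimal sketch of how that relative difference works out. This is not the actual `compare.py` source, just the formula spelled out:

```python
# A minimal sketch of the (new - old) / |old| formula above;
# an illustration, not the actual compare.py implementation.
def relative_difference(new: float, old: float) -> float:
    return (new - old) / abs(old)

# A run that took 100ns per iteration before and 80ns after the change:
print(relative_difference(80.0, 100.0))  # -0.2, i.e. a 20% improvement
```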

## Note: Interpreting the output

Performance measurements are an art, and performance comparisons are doubly so.
Results are often noisy, and the absolute differences are not necessarily large,
so visual inspection alone cannot tell whether two measurements actually show
a performance change. Multiple benchmark repetitions make this even harder to judge.

Thankfully, we can run statistical tests on the results
to determine whether the performance has changed in a statistically significant way.
`compare.py` uses the [Mann–Whitney U test](https://en.wikipedia.org/wiki/Mann%E2%80%93Whitney_U_test),
with the null hypothesis being that there is no difference in performance.
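
As a rough sketch of what such a test looks like in practice (an illustration using `scipy`, not the actual `compare.py` code; the timing values are made up):

```python
# Illustrative only: apply the Mann-Whitney U test to two sets of
# per-repetition timings. The numbers below are hypothetical.
from scipy.stats import mannwhitneyu

old_times = [102.1, 99.8, 101.5, 100.9, 103.2]  # ns per iteration, old binary
new_times = [91.4, 90.2, 92.7, 89.9, 91.1]      # ns per iteration, new binary

# Null hypothesis: both samples come from the same distribution,
# i.e. there is no difference in performance.
stat, p_value = mannwhitneyu(old_times, new_times, alternative="two-sided")
print(f"U = {stat}, p-value = {p_value:.4f}")
```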


Let's first look at what the different columns in the `compare.py` output above represent:

1. **Benchmark:** The name of the function being benchmarked, along with the size of the input (after the slash).

2. **Time:** The average time per operation, across all iterations.

3. **CPU:** The average CPU time per operation, across all iterations.

4. **Iterations:** The number of iterations the benchmark ran in order to get a stable estimate.

5. **Time Old and Time New:** These represent the average time it takes for a function to run in two different scenarios or versions. For example, you might be comparing how fast a function runs before and after you make some changes to it.

6. **CPU Old and CPU New:** These show the average amount of CPU time that the function uses in two different scenarios or versions. This is similar to Time Old and Time New, but focuses on CPU usage instead of overall time.


In the comparison section, the relative differences in both time and CPU time are displayed for each input size.

When comparing benchmarks, `compare.py` uses statistical tests to determine whether there is a statistically-significant difference between the measurements being compared.
The result of said statistical test is additionally communicated through color coding:
```diff
+ Green:
```
The benchmarks are _**statistically different**_. This could mean the performance has either **significantly improved** or **significantly deteriorated**. You should look at the actual performance numbers to see which is the case.
```diff
- Red:
```
The benchmarks are _**statistically similar**_. This means the performance hasn't significantly changed.

In statistical terms, **'green'** means we reject the null hypothesis that there's no difference in performance, and **'red'** means we fail to reject the null hypothesis.
This might seem counter-intuitive if you're expecting 'green' to mean 'improved performance' and 'red' to mean 'worsened performance'.
But remember, in this context:

```
'Success' means 'successfully finding a difference'.
'Failure' means 'failing to find a difference'.
```
A statistically significant difference is determined by a **p-value**, which is a measure of the probability that the observed difference could have occurred just by random chance. A smaller p-value indicates stronger evidence against the null hypothesis.

**Therefore:**
1. If the p-value is less than the chosen significance level (alpha), we reject the null hypothesis and conclude the benchmarks are significantly different.
2. If the p-value is greater than or equal to alpha, we fail to reject the null hypothesis and treat the two benchmarks as similar.
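
In code, the rule above reduces to a single comparison. A sketch, assuming a conventional alpha of 0.05 (check `compare.py --help` for the option that actually controls this in your version):

```python
# Hedged sketch of the decision rule; 0.05 is a common convention for
# alpha, not necessarily what your compare.py invocation uses.
def is_significant(p_value: float, alpha: float = 0.05) -> bool:
    # True  -> reject the null hypothesis: report a difference ("green")
    # False -> fail to reject: treat the benchmarks as similar ("red")
    return p_value < alpha
```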

Also, please note that even if we determine that there **is** a statistically-significant difference between two measurements, it does not _necessarily_ mean that the actual benchmarks being measured **are** different. The converse also holds: finding **no** statistically-significant difference between two measurements does not necessarily mean that the actual benchmarks **are not** different.



### U test

If there is a sufficient repetition count of the benchmarks, the tool can do