pa is an efficient tool for analyzing performance data based on statistical simulation, i.e., bootstrap. The performance data can come from measurements such as performance tests or benchmarks.
There are two main ways to analyze performance data with pa:
- Single version analysis: statistic (e.g., mean) + variability (confidence interval of the statistic)
- Analysis between two versions: confidence interval ratio of a statistic (e.g., mean)
Inspired by [1], pa employs a Monte-Carlo technique called bootstrap [2] to estimate the population confidence interval from a given sample. It uses hierarchical random re-sampling with replacement [3]. In the context of performance analysis, these hierarchical levels correspond to the levels at which measurements are repeated to obtain reliable results. These levels are:
- invocation
- iteration
- fork
- trial
- instance
Higher levels are composed of multiple occurrences of lower levels, e.g., an iteration (level 2) consists of many invocations (level 1), and so on.
pa requires only Go version 1.14 or higher (Install Page).
Install pa by running `go get github.com/chrstphlbr/pa`.
pa comes with a simple command line interface (optional flags in [...] with their defaults):

```
pa [-bs 10000] [-is 0] [-sl 0.01] [-st mean] [-os] [-m 1] [-tra id:id] \
  file_1 \
  [file_2 ... file_n]
```

If one file (file_1) is provided, the single version analysis is performed, i.e., the confidence intervals of a single performance experiment are computed.
If multiple files are provided, the two version analysis is performed:
the confidence intervals for both versions and the confidence interval ratio between the two versions are computed.
In the simple case, 2 files are provided, file_1 for version 1 and file_2 for version 2.
It is also possible to provide multiple files per version (of equal number) by setting the flag -m.
Note that the files MUST be sorted alphabetically by their benchmarks (see section "Input Files").
Flags:
- `-bs` defines the number of bootstrap simulations, i.e., how many random samples are taken to estimate the population distribution
- `-is` defines how many invocation samples are used (0 takes the mean across all invocations of an iteration, -1 takes all invocations, and > 0 takes that number of samples)
- `-sl` defines the significance level. The confidence level for the confidence intervals is then 1 - sl. The default is 0.01, which corresponds to a 99% confidence level
- `-st` defines the statistic for which a confidence interval is computed. The default is mean; another option is median
- `-os` defines whether the statistic, as set by `-st`, is included in the output file
- `-m` sets the number of files per version (control and test group). For example, with `-m 3` pa expects 6 files, where file_1, file_2, and file_3 belong to version 1, and file_4, file_5, and file_6 belong to version 2
- `-tra` defines the transformation(s) applied to the benchmark results (i.e., the file(s)), in the form transformer1:transformer2, where transformer1 is applied to the first (control) group and transformer2 is applied to the second (test) group (if it exists). Transformers can be one of `id` (identity, no transformation) or `f0.0` ('f' for factor, followed by a user-specified float64 value)
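For illustration, two hypothetical invocations (all file names are placeholders):

```shell
# single version analysis of one result file, with default flags
pa results_v1.csv

# two version analysis with 3 files per version, median statistic,
# 95% confidence level, and the statistic included in the output
pa -m 3 -st median -sl 0.05 -os \
  v1_a.csv v1_b.csv v1_c.csv \
  v2_a.csv v2_b.csv v2_c.csv
```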
pa expects CSV input files of the following form. For JMH benchmark results, the tool bencher can transform JMH JSON output to this CSV file format.
project;commit;benchmark;params;instance;trial;fork;iteration;mode;unit;value_count;value
The columns represent the following values:
- `project` is the project name
- `commit` is the project version, e.g., a commit hash
- `benchmark` is the name of the fully-qualified benchmark method
- `params` are the performance parameters (not the function/method parameters) of the benchmark in comma-separated form. Every parameter consists of a name and a value, separated by an equal sign (name=value). For example, JMH supports performance parameters through its `@Param` annotation
- `instance` is the name of the instance or machine (level 5)
- `trial` is the number of the trial (level 4)
- `fork` is the fork number (level 3). For example, JMH supports forks through its `@Fork` annotation
- `iteration` is the iteration number within a fork (level 2)
- `mode` is the benchmark mode. For example, JMH supports average time (`avgt`), throughput (`thrpt`), and sample time (`sample`)
- `unit` is the measurement unit of the benchmark value. Depending on the `mode`, the measurement unit can be ns/op for average time or op/s for throughput
- `value_count` is the number of invocations (level 1) in which the `value` occurred in this iteration. Every iteration can have multiple values (i.e., invocations), which are represented as a histogram. Each histogram value corresponds to one CSV row, and the number of occurrences of this value is defined by `value_count`
- `value` is the performance metric in a certain `unit`
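To make the histogram encoding concrete, here is a small Go sketch (not part of pa) that collapses the value_count-weighted rows of one iteration into that iteration's mean invocation value, i.e., the per-iteration aggregate that the default `-is 0` uses. Parsing is simplified and error handling is omitted:

```go
package main

import (
	"fmt"
	"strconv"
	"strings"
)

// iterationMean computes the mean invocation value of one iteration
// from its histogram rows, weighting each value by its value_count.
// Each row ends in ";value_count;value" per the input CSV format.
func iterationMean(rows []string) float64 {
	var sum float64
	var count int
	for _, row := range rows {
		cols := strings.Split(row, ";")
		vc, _ := strconv.Atoi(cols[len(cols)-2])
		v, _ := strconv.ParseFloat(cols[len(cols)-1], 64)
		sum += float64(vc) * v
		count += vc
	}
	return sum / float64(count)
}

func main() {
	// two histogram rows of the same iteration: 3 invocations measured
	// 100.0 ns/op and 1 invocation measured 104.0 ns/op
	rows := []string{
		"proj;abc123;bench;p=1;i1;1;1;1;avgt;ns/op;3;100.0",
		"proj;abc123;bench;p=1;i1;1;1;1;avgt;ns/op;1;104.0",
	}
	fmt.Println(iterationMean(rows)) // (3*100 + 1*104) / 4 = 101
}
```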
IMPORTANT: the input files must be sorted by benchmark and params, otherwise the tool will not work correctly.
This is because input files can be large and, therefore, pa works on file input streams.
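One possible way to establish this ordering is a plain sort on the benchmark and params columns (columns 3 and 4). Whether your files carry a header row depends on how they were produced; this sketch assumes they do not, and that no field contains a ';':

```shell
# a tiny unsorted example file (two benchmarks, out of order)
printf '%s\n' \
  'proj;abc;benchB;x=1;inst;1;1;1;avgt;ns/op;1;2.0' \
  'proj;abc;benchA;x=1;inst;1;1;1;avgt;ns/op;1;1.0' > results.csv

# sort by benchmark (column 3) and params (column 4)
sort -t ';' -k 3,3 -k 4,4 results.csv > results_sorted.csv
```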
pa writes the results in CSV form to stdout. The output can contain 3 types of CSV rows:
- rows starting with `#` are comments
- empty rows
- all other rows are data rows
The columns are:
- `benchmark` is the name of the benchmark
- `params` are the function/method parameters of the benchmark. pa does not populate this column, because the input format does not provide the function/method parameters
- `perf_params` is a comma-separated list of performance parameters. See column `params` of the input files for comparison
- `st` is the statistic the confidence interval is for. Can be "mean" or "median"
- `ci_l` is the lower bound of the confidence interval
- `ci_u` is the upper bound of the confidence interval
- `cl` is the confidence level of the confidence interval
For the single version analysis (without -os), the output file is a CSV with the following columns:
benchmark;params;perf_params;ci_l;ci_u;cl
With -os set, the statistic (as defined by -st) is additionally included:
benchmark;params;perf_params;st;ci_l;ci_u;cl
For the two version analysis (without -os), the output file is a CSV with the following columns:
benchmark;params;perf_params;v1_ci_l;v1_ci_u;v1_cl;v2_ci_l;v2_ci_u;v2_cl;ratio_ci_l;ratio_ci_u;ratio_cl
With -os set, the statistic (as defined by -st) is additionally included:
benchmark;params;perf_params;v1_st;v1_ci_l;v1_ci_u;v1_cl;v2_st;v2_ci_l;v2_ci_u;v2_cl;ratio_st;ratio_ci_l;ratio_ci_u;ratio_cl
Compared to the single version analysis, the two version analysis has three (without -os) or four (with -os) columns each for version 1 (v1), version 2 (v2), and the confidence interval of the ratio between the two versions (ratio).
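A common way to read the ratio columns, following the effect-size interpretation of [1]: if the ratio confidence interval excludes 1.0, the performance change between the two versions is statistically significant at the interval's confidence level. A tiny Go sketch of this check (the helper name is made up):

```go
package main

import "fmt"

// significantChange reports whether a ratio confidence interval
// [ciL, ciU] excludes 1.0, i.e., whether the performance change
// between the two versions is statistically significant at the
// interval's confidence level.
func significantChange(ciL, ciU float64) bool {
	return ciU < 1.0 || ciL > 1.0
}

func main() {
	// hypothetical ratio_ci_l and ratio_ci_u values from pa's output
	fmt.Println(significantChange(1.05, 1.12)) // slowdown: true
	fmt.Println(significantChange(0.97, 1.03)) // no significant change: false
}
```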
[1] T. Kalibera and R. Jones, “Quantifying performance changes with effect size confidence intervals”, University of Kent, Technical Report 4–12, June 2012. Available: URL
[2] A. C. Davison and D. V. Hinkley, “Bootstrap methods and their application”, Cambridge University Press, 1997
[3] S. Ren, H. Lai, W. Tong, M. Aminzadeh, X. Hou, and S. Lai, “Nonparametric bootstrapping for hierarchical data”, Journal of Applied Statistics, vol. 37, no. 9, pp. 1487–1498, 2010. Available: DOI