docs/BenchmarkTaskflow.html
+8 -8 Lines changed: 8 additions & 8 deletions
@@ -49,7 +49,7 @@ <h1>
49
49
<spanclass="m-breadcrumb"><ahref="install.html">Building and Installing</a> »</span>
50
50
Benchmark Taskflow
51
51
</h1>
52
-
<divclass="m-block m-default">
52
+
<navclass="m-block m-default">
53
53
<h3>Contents</h3>
54
54
<ul>
55
55
<li><ahref="#CompileAndRunBenchmarks">Compile and Run Benchmarks</a></li>
@@ -62,8 +62,8 @@ <h3>Contents</h3>
62
62
</ul>
63
63
</li>
64
64
</ul>
65
-
</div>
66
-
<sectionid="CompileAndRunBenchmarks"><h2><ahref="#CompileAndRunBenchmarks">Compile and Run Benchmarks</a></h2><p>To build the benchmark code, enable the CMake option <code>TF_BUILD_BENCHMARKS</code> to <code>ON</code> as follows:</p><preclass="m-console"><spanclass="gp">#</span>under /taskflow/build
65
+
</nav>
66
+
<sectionid="CompileAndRunBenchmarks"><h2><ahref="#CompileAndRunBenchmarks">Compile and Run Benchmarks</a></h2><p>To build the benchmark code, enable the CMake option <code>TF_BUILD_BENCHMARKS</code> to <code>ON</code> as follows:</p><preclass="m-console"><spanclass="gp">#</span>under /taskflow/build
<spanclass="go">~$ make</span></pre><p>After you successfully build the benchmark code, you can find all benchmark instances in the <code>benchmarks/</code> folder. You can run the executable of each instance in the corresponding folder.</p><preclass="m-console"><spanclass="go">~$ cd benchmarks && ls</span>
<spanclass="go"> -r,--num_rounds UINT number of rounds (default=1)</span>
88
88
<spanclass="go"> -m,--model TEXT model name tbb|omp|tf (default=tf)</span></pre><p>We currently implement the following instances that are commonly used by the parallel computing community to evaluate system performance.</p><tableclass="m-table"><thead><tr><th>Instance</th><th>Description</th></tr></thead><tbody><tr><td>binary_tree</td><td>traverses a complete binary tree</td></tr><tr><td>black_scholes</td><td>computes option pricing with the Black-Scholes model</td></tr><tr><td>graph_traversal</td><td>traverses a randomly generated directed acyclic graph</td></tr><tr><td>linear_chain</td><td>traverses a linear chain of tasks</td></tr><tr><td>mandelbrot</td><td>exploits imbalanced workloads in a Mandelbrot set</td></tr><tr><td>matrix_multiplication</td><td>multiplies two 2D matrices</td></tr><tr><td>mnist</td><td>trains a neural network-based image classifier on the MNIST dataset</td></tr><tr><td>parallel_sort</td><td>sorts a range of items</td></tr><tr><td>reduce_sum</td><td>sums a range of items using reduction</td></tr><tr><td>wavefront</td><td>propagates computations in a 2D grid</td></tr><tr><td>linear_pipeline</td><td>pipeline scheduling on a linear chain of pipes</td></tr><tr><td>graph_pipeline</td><td>pipeline scheduling on a graph of pipes</td></tr></tbody></table></section><sectionid="ConfigureRunOptions"><h2><ahref="#ConfigureRunOptions">Configure Run Options</a></h2><p>We implement consistent options for each benchmark instance. 
Common options are:</p><tableclass="m-table"><thead><tr><th>option</th><th>value</th><th>function</th></tr></thead><tbody><tr><td><code>-h</code></td><td>none</td><td>display the help message</td></tr><tr><td><code>-t</code></td><td>integer</td><td>configure the number of threads to run</td></tr><tr><td><code>-r</code></td><td>integer</td><td>configure the number of rounds to run</td></tr><tr><td><code>-m</code></td><td>string</td><td>configure the baseline models to run, tbb, omp, or tf</td></tr></tbody></table><p>You can configure the benchmarking environment by giving different options.</p><sectionid="SpecifyTheRunModel"><h3><ahref="#SpecifyTheRunModel">Specify the Run Model</a></h3><p>In addition to a Taskflow-based implementation for each benchmark instance, we have implemented two baseline models using the state-of-the-art parallel programming libraries, <ahref="https://www.openmp.org/">OpenMP</a> and <ahref="https://github.com/oneapi-src/oneTBB">Intel TBB</a>, to measure and evaluate the performance of Taskflow. You can select different implementations by passing the option <code>-m</code>.</p><preclass="m-console"><spanclass="go">~$ ./graph_traversal -m tf # run the Taskflow implementation (default)</span>
89
89
<spanclass="go">~$ ./graph_traversal -m tbb # run the TBB implementation</span>
90
-
<spanclass="go">~$ ./graph_traversal -m omp # run the OpenMP implementation</span></pre></section><sectionid="SpecifyTheNumberOfThreads"><h3><ahref="#SpecifyTheNumberOfThreads">Specify the Number of Threads</a></h3><p>You can configure the number of threads to run a benchmark instance by passing the option <code>-t</code>. The default value is one.</p><preclass="m-console"><spanclass="gp">#</span>run the Taskflow implementation using <spanclass="m">4</span> threads
91
-
<spanclass="go">~$ ./graph_traversal -m tf -t 4</span></pre><p>Depending on your environment, you may need to use <code>taskset</code> to set the CPU affinity of the running process. This allows the OS scheduler to keep the process on the same CPU(s) as long as practical, for performance reasons.</p><preclass="m-console"><spanclass="gp">#</span>pin the process to <spanclass="m">4</span> CPUs: CPU <spanclass="m">0</span>, CPU <spanclass="m">1</span>, CPU <spanclass="m">2</span>, and CPU <spanclass="m">3</span>
92
-
<spanclass="go">~$ taskset -c 0-3 ./graph_traversal -t 4 </span></pre></section><sectionid="SpecifyTheNumberOfRounds"><h3><ahref="#SpecifyTheNumberOfRounds">Specify the Number of Rounds</a></h3><p>Each benchmark instance evaluates the runtime of the implementation at different problem sizes. Each problem size corresponds to one iteration. You can configure the number of rounds per iteration by passing the option <code>-r</code>; the reported runtime is the average over those rounds.</p><preclass="m-console"><spanclass="gp">#</span>measure the runtime in an average of <spanclass="m">10</span> runs
90
+
<spanclass="go">~$ ./graph_traversal -m omp # run the OpenMP implementation</span></pre></section><sectionid="SpecifyTheNumberOfThreads"><h3><ahref="#SpecifyTheNumberOfThreads">Specify the Number of Threads</a></h3><p>You can configure the number of threads to run a benchmark instance by passing the option <code>-t</code>. The default value is one.</p><preclass="m-console"><spanclass="gp">#</span>run the Taskflow implementation using <spanclass="m">4</span> threads
91
+
<spanclass="go">~$ ./graph_traversal -m tf -t 4</span></pre><p>Depending on your environment, you may need to use <code>taskset</code> to set the CPU affinity of the running process. This allows the OS scheduler to keep the process on the same CPU(s) as long as practical, for performance reasons.</p><preclass="m-console"><spanclass="gp">#</span>pin the process to <spanclass="m">4</span> CPUs: CPU <spanclass="m">0</span>, CPU <spanclass="m">1</span>, CPU <spanclass="m">2</span>, and CPU <spanclass="m">3</span>
92
+
<spanclass="go">~$ taskset -c 0-3 ./graph_traversal -t 4 </span></pre></section><sectionid="SpecifyTheNumberOfRounds"><h3><ahref="#SpecifyTheNumberOfRounds">Specify the Number of Rounds</a></h3><p>Each benchmark instance evaluates the runtime of the implementation at different problem sizes. Each problem size corresponds to one iteration. You can configure the number of rounds per iteration by passing the option <code>-r</code>; the reported runtime is the average over those rounds.</p><preclass="m-console"><spanclass="gp">#</span>measure the runtime <spanclass="k">in</span> an average of <spanclass="m">10</span> runs
93
93
<spanclass="go">~$ ./graph_traversal -r 10</span>
94
94
<spanclass="go">|V|+|E| Runtime</span>
95
95
<spanclass="go"> 2 0.109 # the runtime value 0.109 is an average of 10 runs</span>
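The averaging that the <code>-r</code> option describes can be sketched in a few lines of C++. This is a minimal illustration only: the helper name <code>average_runtime_ms</code> and its signature are hypothetical and are not taken from the Taskflow benchmark sources, which may measure time differently.

```cpp
#include <cassert>
#include <chrono>
#include <cstddef>

// Hypothetical helper (not part of the Taskflow benchmarks): runs `work`
// `num_rounds` times and returns the mean wall-clock time per round in
// milliseconds, measured with a monotonic clock.
template <typename F>
double average_runtime_ms(F&& work, std::size_t num_rounds) {
  assert(num_rounds > 0);  // averaging over zero rounds is undefined
  double total_ms = 0.0;
  for (std::size_t r = 0; r < num_rounds; ++r) {
    auto beg = std::chrono::steady_clock::now();
    work();
    auto end = std::chrono::steady_clock::now();
    total_ms += std::chrono::duration<double, std::milli>(end - beg).count();
  }
  return total_ms / static_cast<double>(num_rounds);
}
```

Calling <code>average_runtime_ms(run_once, 10)</code>, where <code>run_once</code> is any callable that executes one problem size, would correspond to passing <code>-r 10</code> under these assumptions.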
<li><ahref="#CUDASTDDefineAnExecutionPolicy">Define an Execution Policy</a></li>
58
58
<li><ahref="#CUDASTDAllocateMemoryBufferForAlgorithms">Allocate Memory Buffer for Algorithms</a></li>
59
59
</ul>
60
-
</div>
61
-
<p>Taskflow provides standalone template methods for expressing common parallel algorithms on a GPU. Each of these methods is governed by an <em>execution policy object</em> to configure the kernel execution parameters.</p><sectionid="CUDASTDExecutionPolicyIncludeTheHeader"><h2><ahref="#CUDASTDExecutionPolicyIncludeTheHeader">Include the Header</a></h2><p>You need to include the header file, <code>taskflow/cuda/cudaflow.hpp</code>, for creating a CUDA execution policy object.</p></section><sectionid="CUDASTDParameterizePerformance"><h2><ahref="#CUDASTDParameterizePerformance">Parameterize Performance</a></h2><p>Taskflow parameterizes most CUDA algorithms in terms of <em>the number of threads per block</em> and <em>units of work per thread</em>, which can be specified in the execution policy template type, <ahref="classtf_1_1cudaExecutionPolicy.html" class="m-doc">tf::<wbr/>cudaExecutionPolicy</a>. The design is inspired by <ahref="https://moderngpu.github.io/">Modern GPU Programming</a> authored by Sean Baxter to achieve high-performance GPU computing.</p></section><sectionid="CUDASTDDefineAnExecutionPolicy"><h2><ahref="#CUDASTDDefineAnExecutionPolicy">Define an Execution Policy</a></h2><p>The following example defines an execution policy object, <code>policy</code>, which configures (1) each block to invoke <code>512</code> threads and (2) each of these <code>512</code> threads to perform <code>11</code> units of work. The block size must be a power of two. 
It is generally a good idea to specify an odd number for the second parameter to avoid bank conflicts.</p><preclass="m-code"><spanclass="n">tf</span><spanclass="o">::</span><spanclass="n">cudaExecutionPolicy</span><spanclass="o"><</span><spanclass="mi">512</span><spanclass="p">,</span><spanclass="mi">11</span><spanclass="o">></span><spanclass="n">policy</span><spanclass="p">;</span></pre><asideclass="m-note m-info"><h4>Note</h4><p>To use CUDA standard algorithms, you need to include the header <code>taskflow/cuda/cudaflow.hpp</code>.</p></aside><p>By default, the execution policy object is associated with the CUDA <em>default stream</em> (i.e., 0). The default stream can incur significant overhead due to global synchronization. You can associate an execution policy with another stream as shown below:</p><preclass="m-code"><spanclass="c1">// assign a stream to a policy at construction time</span>
<p>Taskflow provides standalone template methods for expressing common parallel algorithms on a GPU. Each of these methods is governed by an <em>execution policy object</em> to configure the kernel execution parameters.</p><sectionid="CUDASTDExecutionPolicyIncludeTheHeader"><h2><ahref="#CUDASTDExecutionPolicyIncludeTheHeader">Include the Header</a></h2><p>You need to include the header file, <code>taskflow/cuda/cudaflow.hpp</code>, for creating a CUDA execution policy object.</p></section><sectionid="CUDASTDParameterizePerformance"><h2><ahref="#CUDASTDParameterizePerformance">Parameterize Performance</a></h2><p>Taskflow parameterizes most CUDA algorithms in terms of <em>the number of threads per block</em> and <em>units of work per thread</em>, which can be specified in the execution policy template type, <ahref="classtf_1_1cudaExecutionPolicy.html" class="m-doc">tf::<wbr/>cudaExecutionPolicy</a>. The design is inspired by <ahref="https://moderngpu.github.io/">Modern GPU Programming</a> authored by Sean Baxter to achieve high-performance GPU computing.</p></section><sectionid="CUDASTDDefineAnExecutionPolicy"><h2><ahref="#CUDASTDDefineAnExecutionPolicy">Define an Execution Policy</a></h2><p>The following example defines an execution policy object, <code>policy</code>, which configures (1) each block to invoke <code>512</code> threads and (2) each of these <code>512</code> threads to perform <code>11</code> units of work. The block size must be a power of two. 
It is generally a good idea to specify an odd number for the second parameter to avoid bank conflicts.</p><preclass="m-code"><spanclass="n">tf</span><spanclass="o">::</span><spanclass="n">cudaExecutionPolicy</span><spanclass="o"><</span><spanclass="mi">512</span><spanclass="p">,</span><spanclass="w"></span><spanclass="mi">11</span><spanclass="o">></span><spanclass="w"></span><spanclass="n">policy</span><spanclass="p">;</span><spanclass="w"></span></pre><asideclass="m-note m-info"><h4>Note</h4><p>To use CUDA standard algorithms, you need to include the header <code>taskflow/cuda/cudaflow.hpp</code>.</p></aside><p>By default, the execution policy object is associated with the CUDA <em>default stream</em> (i.e., 0). The default stream can incur significant overhead due to global synchronization. You can associate an execution policy with another stream as shown below:</p><preclass="m-code"><spanclass="c1">// assign a stream to a policy at construction time</span>
<spanclass="c1">// assign another stream to the policy</span>
65
-
<spanclass="n">policy</span><spanclass="p">.</span><spanclass="n">stream</span><spanclass="p">(</span><spanclass="n">another_stream</span><spanclass="p">);</span></pre><p>All the CUDA standard algorithms in Taskflow are asynchronous with respect to the stream assigned to the execution policy. This enables high execution efficiency for large GPU workloads that call for many different algorithms. You can synchronize the execution whenever you wish by calling <code>synchronize</code>.</p><preclass="m-code"><spanclass="n">policy</span><spanclass="p">.</span><spanclass="n">synchronize</span><spanclass="p">();</span><spanclass="c1">// synchronize the associated stream</span></pre><p>The best-performing configuration for each algorithm, GPU architecture, and data type can vary significantly. You should experiment with different configurations and find the optimal tuning parameters for your applications. A default policy is given in <ahref="namespacetf.html#aa18f102977c3257b75e21fde05efdb68" class="m-doc">tf::<wbr/>cudaDefaultExecutionPolicy</a>.</p><preclass="m-code"><spanclass="n">tf</span><spanclass="o">::</span><spanclass="n">cudaDefaultExecutionPolicy</span><spanclass="n">default_policy</span><spanclass="p">;</span></pre></section><sectionid="CUDASTDAllocateMemoryBufferForAlgorithms"><h2><ahref="#CUDASTDAllocateMemoryBufferForAlgorithms">Allocate Memory Buffer for Algorithms</a></h2><p>A key difference between our CUDA standard algorithms and others (e.g., Thrust) is the <em>memory management</em>. Unlike CPU-parallel algorithms, many GPU-parallel algorithms require an extra buffer to store temporary results during the multi-phase computation, for instance, <ahref="namespacetf.html#a8a872d2a0ac73a676713cb5be5aa688c" class="m-doc">tf::<wbr/>cuda_reduce</a> and <ahref="namespacetf.html#a06804cb1598e965febc7bd35fc0fbbb0" class="m-doc">tf::<wbr/>cuda_sort</a>. 
We <em>DO NOT</em> allocate any memory during these algorithm calls but ask you to provide the memory buffer that each such algorithm requires. This decision complicates the code a little, but it gives applications the freedom to optimize memory usage; it also makes all algorithm calls capturable into a CUDA graph to improve execution efficiency.</p></section>
65
+
<spanclass="n">policy</span><spanclass="p">.</span><spanclass="n">stream</span><spanclass="p">(</span><spanclass="n">another_stream</span><spanclass="p">);</span><spanclass="w"></span></pre><p>All the CUDA standard algorithms in Taskflow are asynchronous with respect to the stream assigned to the execution policy. This enables high execution efficiency for large GPU workloads that call for many different algorithms. You can synchronize the execution whenever you wish by calling <code>synchronize</code>.</p><preclass="m-code"><spanclass="n">policy</span><spanclass="p">.</span><spanclass="n">synchronize</span><spanclass="p">();</span><spanclass="w"></span><spanclass="c1">// synchronize the associated stream</span></pre><p>The best-performing configuration for each algorithm, GPU architecture, and data type can vary significantly. You should experiment with different configurations and find the optimal tuning parameters for your applications. A default policy is given in <ahref="namespacetf.html#aa18f102977c3257b75e21fde05efdb68" class="m-doc">tf::<wbr/>cudaDefaultExecutionPolicy</a>.</p><preclass="m-code"><spanclass="n">tf</span><spanclass="o">::</span><spanclass="n">cudaDefaultExecutionPolicy</span><spanclass="w"></span><spanclass="n">default_policy</span><spanclass="p">;</span><spanclass="w"></span></pre></section><sectionid="CUDASTDAllocateMemoryBufferForAlgorithms"><h2><ahref="#CUDASTDAllocateMemoryBufferForAlgorithms">Allocate Memory Buffer for Algorithms</a></h2><p>A key difference between our CUDA standard algorithms and others (e.g., Thrust) is the <em>memory management</em>. Unlike CPU-parallel algorithms, many GPU-parallel algorithms require an extra buffer to store temporary results during the multi-phase computation, for instance, <ahref="namespacetf.html#a8a872d2a0ac73a676713cb5be5aa688c" class="m-doc">tf::<wbr/>cuda_reduce</a> and <ahref="namespacetf.html#a06804cb1598e965febc7bd35fc0fbbb0" class="m-doc">tf::<wbr/>cuda_sort</a>. 
We <em>DO NOT</em> allocate any memory during these algorithm calls but ask you to provide the memory buffer that each such algorithm requires. This decision complicates the code a little, but it gives applications the freedom to optimize memory usage; it also makes all algorithm calls capturable into a CUDA graph to improve execution efficiency.</p></section>
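To make the two policy parameters from "Define an Execution Policy" concrete, the sketch below models them in plain C++ without CUDA. The type <code>SketchPolicy</code> and its members are hypothetical stand-ins written for illustration; they only mirror the threads-per-block and work-per-thread parameters described above and are not Taskflow's implementation of <code>tf::cudaExecutionPolicy</code>. The odd-number check is stricter than the text, which only recommends it.

```cpp
#include <cassert>

// Hypothetical stand-in for an execution policy (NOT Taskflow's class).
// NT = threads per block, VT = units of work per thread.
template <unsigned NT, unsigned VT>
struct SketchPolicy {
  // The text requires the block size to be a power of two.
  static_assert(NT != 0 && (NT & (NT - 1)) == 0,
                "threads per block must be a power of two");
  // Assumption for this sketch: enforce the recommended odd work count.
  static_assert(VT % 2 == 1,
                "an odd number of work units per thread is assumed here");

  static constexpr unsigned threads_per_block = NT;
  static constexpr unsigned work_per_thread   = VT;
  static constexpr unsigned work_per_block    = NT * VT;

  // Number of blocks needed to cover n items (ceiling division).
  static constexpr unsigned num_blocks(unsigned n) {
    return (n + work_per_block - 1) / work_per_block;
  }
};

// Same parameters as the example policy: 512 threads, 11 units of work each.
using Policy512x11 = SketchPolicy<512, 11>;
static_assert(Policy512x11::work_per_block == 5632, "512 * 11 items per block");
```

Under these assumptions, one block covers 5632 items, so a workload of one million items would need 178 blocks; changing either template argument shifts that trade-off, which is why the text recommends experimenting with different configurations.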