High performance scientific computing in C++
HPC C++ Course 2024
28 October – 31 October 2024 Sandipan Mohanty Forschungszentrum Jülich, Germany
Member of the Helmholtz Association
    struct Point {
        double x{}, y{}, z{}, w{};
    };
Assuming a cache line size of 64 bytes, and that double is 8 bytes, how many elements of a vector of
Point would fit completely inside a cache line?
(A) 1
(B) 2
(C) 1 or 2, can’t be sure
(D) 8
    struct Point {
        double x{}, y{}, z{}, w{};
    };
Assuming a cache line size of 64 bytes, and that double is 8 bytes, how many cache lines must be read
when accessing a single Point ?
(A) 1
(B) 2
(C) 1 or 2, can’t be sure
(D) 8
    struct alignas(32) Point {
        double x{}, y{}, z{}, w{};
    };
Assuming a cache line size of 64 bytes, and that double is 8 bytes, how many cache lines must be read
when accessing a single Point ?
(A) 1
(B) 2
(C) 1 or 2, can’t be sure
(D) 8
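The arithmetic behind the three questions can be checked with a small sketch. It assumes 64-byte cache lines as in the questions; the names cache_line, AlignedPoint and lines_touched are made up for this illustration, and the size check assumes a platform where a struct of four doubles has no padding.

```cpp
#include <cstddef>
#include <cstdint>

// Assumption for this sketch: 64-byte cache lines, as stated in the questions.
inline constexpr std::size_t cache_line = 64;

struct Point {
    double x{}, y{}, z{}, w{};
};
struct alignas(32) AlignedPoint {
    double x{}, y{}, z{}, w{};
};

// Four doubles, no padding expected on mainstream platforms: 32 bytes.
static_assert(sizeof(Point) == 4 * sizeof(double));
static_assert(alignof(AlignedPoint) == 32);

// Number of cache lines touched by an object of `size` bytes
// whose first byte is at address `addr`.
constexpr auto lines_touched(std::uintptr_t addr, std::size_t size) -> std::size_t
{
    const auto first = addr / cache_line;
    const auto last = (addr + size - 1) / cache_line;
    return last - first + 1;
}

// A Point only has to be aligned like a double, so it may start at
// address 40 and straddle two lines.
static_assert(lines_touched(40, sizeof(Point)) == 2);
// An AlignedPoint starts at a multiple of 32 and always fits in one line.
static_assert(lines_touched(32, sizeof(AlignedPoint)) == 1);
```

With natural alignment the number of lines read depends on where the object happens to start. With alignas(32), a 32-byte object can never cross a 64-byte boundary.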
xtensor
xtensor: multi-dimensional arrays with lazy evaluation
NumPy                                xtensor
np.linspace(0., 2., 10)              xt::linspace<double>(0., 2., 10UL);
np.logspace(1., 10., 4)              xt::logspace<double>(1., 10., 4UL);
np.zeros((10, 10))                   xt::zeros<double>({10UL, 10UL});
A[1, 2]                              A(1, 2);
A.flat[4]                            A[4];
A[:, 3]                              xt::col(A, 3) or xt::view(A, xt::all(), 3);
A[:3, 3:]                            xt::view(A, xt::range(_, 3), xt::range(3, _));
np.vectorize(f)                      xt::vectorize(f);
A[A > 1.0]                           xt::filter(A, A > 1.0);
A[[1, 2], [0, 1]]                    xt::index_view(A, {{1, 0}, {2, 1}});
np.random.rand(100, 200)             xt::random::rand<double>({100, 200});
np.random.shuffle(A)                 xt::random::shuffle(A);
np.where(A < 0, A, B)                xt::where(A < 0, A, B);
np.loadtxt(file, delimiter=delim)    xt::load_csv<double>(stream);
np.linalg.svd(A)                     xt::linalg::svd(A);
np.linalg.eig(A)                     xt::linalg::eig(A);
Here _ stands for xt::placeholders::_
Syntax modelled after Python's NumPy
Evaluation is sometimes lazier than in NumPy
Exercise 3.1:
The short program examples/xtensor/xt0.cc demonstrates using xtensor with eigenvalue evaluation. The linear algebra functionality in xtensor is currently handled by an external project, xtensor-blas, which offloads some of the work to a BLAS library. To build the program, set the include path to include headers from “xtensor-stack”, i.e., xtl, xtensor, xsimd, and xtensor-blas. They can be given a common installation prefix. On JUSUF, the relevant include and library directories are already in the right paths. For linking, use
-lopenblas -lpthread -lgfortran
    #include <xtensor/xtensor.hpp>
    #include <xtensor/xarray.hpp>
    #include <xtensor/xio.hpp>
    #include <xtensor/xrandom.hpp>
    #include <xtensor-blas/xlinalg.hpp>
    #include <iostream>

    auto main() -> int
    {
        // Assign to a container: the random generator expression is lazy, and
        // would otherwise produce new numbers each time it is evaluated.
        xt::xarray<double> R = xt::random::rand<double>({4, 4});
        auto eigs = xt::linalg::eigvals(R);
        std::cout << R << "\n\n";
        std::cout << eigs << "\n";
    }
Exercise 3.2:
The program xt1.cc demonstrates creation of two random matrices using xtensor, and matrix multiplication using xtensor-blas. Test using compilation as above. xt2.cc demonstrates numerical verification that the sum of eigenvalues of a symmetric matrix is the trace of the matrix. Use these programs to familiarize yourself with xtensor.
Parallel computing
Parallel computing
Engineering (power consumption) challenges make processors with higher and higher clock rates impractical
Computers in the last 20 years have instead increased processing power by adding more hardware for parallel processing
A sequence of dependent operations on a small set of entities is ill-suited for processing with many workers
    #include <utility> // std::swap

    // Every loop iteration needs the result of the previous one
    auto gcd(unsigned s, unsigned l) -> unsigned
    {
        if (s > l)
            std::swap(s, l);
        while (s != 0) {
            auto r = l % s;
            l = s;
            s = r;
        }
        return l;
    }
Given a large amount of information to be processed, or a task with a large number of independent sub-tasks, it is possible to reduce the overall processing time.
Parallel computing
What mechanisms do we have in C++ to exploit available parallelism in hardware?
Threads, mutexes, atomic operations
RAII for resource management
Libraries to partition and assign work to workers
Templates, lambda functions, CTAD
High-level STL style algorithms abstracting common programming building blocks
Containers and allocators for more efficient (and correct) parallel processing
Threads
std::thread, std::async ... since C++11
Parallel algorithms since C++17
std::jthread, std::stop_token since C++20
std::jthread joins in the destructor
    auto calc1 = [=]() {
        auto tot1 = 0.;
        for (auto i = 0UL; i < N; ++i) {
            auto ang = 2 * i * pi / N;
            tot1 += std::cos(ang) * std::cos(ang);
        }
    };
    auto calc2 = [=]() {
        auto tot1 = 0.;
        for (auto i = 0UL; i < N; ++i) {
            auto ang = 2 * i * pi / N;
            tot1 += std::sin(ang) * std::sin(ang);
        }
    };
    std::jthread j1 { calc1 };
    std::jthread j2 { calc2 };
Threads
    auto tot = 0.;
    {
        std::jthread j1 { [&]() {
            for (auto i = 0UL; i < N; ++i) {
                auto ang = 2 * i * pi / N;
                tot += std::cos(ang) * std::cos(ang);
            }
        } };
        std::jthread j2 { [&]() {
            for (auto i = 0UL; i < N; ++i) {
                auto ang = 2 * i * pi / N;
                tot += std::sin(ang) * std::sin(ang);
            }
        } };
    }
    std::cout << "Total " << tot << "\n";
Modification of data at the same address from multiple threads can lead to “data races”
The result can be incorrect, since the load-modify-commit operations from the two threads can overlap
Threads
    auto tot = 0.;
    std::mutex totmutex;
    {
        std::jthread j1 { [&]() {
            for (auto i = 0UL; i < N; ++i) {
                auto ang = 2 * i * pi / N;
                std::scoped_lock lck { totmutex };
                tot += std::cos(ang) * std::cos(ang);
            }
        } };
        std::jthread j2 { [&]() {
            for (auto i = 0UL; i < N; ++i) {
                auto ang = 2 * i * pi / N;
                std::scoped_lock lck { totmutex };
                tot += std::sin(ang) * std::sin(ang);
            }
        } };
    }
    std::cout << "Total " << tot << "\n";
Fix 1: std::mutex : A resource which can be acquired by only one thread at a time. Must be released by the acquiring thread.
std::scoped_lock manages mutex acquisition/release using RAII
Threads
    std::atomic<double> tot {};
    {
        std::jthread j1 { [&]() {
            for (auto i = 0UL; i < N; ++i) {
                auto ang = 2 * i * pi / N;
                tot += std::cos(ang) * std::cos(ang);
            }
        } };
        std::jthread j2 { [&]() {
            for (auto i = 0UL; i < N; ++i) {
                auto ang = 2 * i * pi / N;
                tot += std::sin(ang) * std::sin(ang);
            }
        } };
    }
    std::cout << "Total " << tot << "\n";
std::atomic<T> gives us “atomic” load-modify-commit operations (operator+= on a floating point atomic requires C++20)
Threads
Even when threads write to different addresses, there can be a significant slowdown because of “false sharing”
Mitigation: alignment or padding
    struct wrapped1 {
        int val {};
    };
    // std::hardware_destructive_interference_size is declared in <new>
    struct alignas(std::hardware_destructive_interference_size) wrapped2 {
        int val {};
    };
    template <class W>
    struct func {
        void operator()(volatile W* var)
        {
            for (unsigned i = 0; i < WORKLOAD / PARALLEL; ++i) {
                var->val = var->val + 1;
            }
        }
    };
    {
        // With wrapped1, neighbouring elements share a cache line (false sharing).
        // With wrapped2, every element starts on its own cache line.
        std::array<wrapped2, PARALLEL> arr {};
        {
            std::array<std::jthread, PARALLEL> threads;
            for (unsigned i = 0U; i < PARALLEL; ++i) {
                threads[i] =
                    std::jthread(func<wrapped2>{}, &arr[i]);
            }
        }
    }
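The effect of the alignment on the memory layout can be inspected without any timing. The sketch below assumes 64-byte cache lines through the made-up constant assumed_line, instead of std::hardware_destructive_interference_size. Packed, Padded and neighbours_sharing_a_line are likewise names invented for this illustration.

```cpp
#include <array>
#include <cstddef>
#include <cstdint>

// Assumption for this sketch: 64-byte cache lines.
inline constexpr std::size_t assumed_line = 64;

struct Packed {
    int val {};
};
struct alignas(assumed_line) Padded {
    int val {};
};

static_assert(sizeof(Packed) == sizeof(int));
static_assert(sizeof(Padded) == assumed_line); // padded up to the alignment

// Counts neighbouring array elements which start in the same cache line.
template <class W, std::size_t N>
auto neighbours_sharing_a_line(const std::array<W, N>& arr) -> std::size_t
{
    std::size_t shared = 0;
    for (std::size_t i = 0; i + 1 < N; ++i) {
        const auto a = reinterpret_cast<std::uintptr_t>(&arr[i]) / assumed_line;
        const auto b = reinterpret_cast<std::uintptr_t>(&arr[i + 1]) / assumed_line;
        if (a == b) ++shared;
    }
    return shared;
}
```

For an array of Padded elements the count is zero, so threads updating different elements never write to the same cache line. For Packed elements almost all neighbours share a line.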
Parallel STL
Parallel STL
Parallel versions of the high-level building blocks such as std::sort, std::reduce etc.
C++17 parallel STL provides a way to express that something can be done in parallel, but does not mandate implementation strategy
Programs already written using algorithms will offer many opportunities for exploiting parallelism
A TBB based implementation is used since GCC 9.1. Intel and Microsoft compilers have their implementations as well.
std::sort sorts. std::sort(std::execution::par, ...) sorts in parallel
std::reduce adds up elements from a range. std::reduce(std::execution::par, ...) adds up elements in parallel
    std::sort(std::execution::par,
              points.begin(), points.end(),
              [](auto p1, auto p2) {
                  return p1.x() < p2.x();
              });
    std::for_each(std::execution::par_unseq,
                  points.begin(), points.end(),
                  [](auto & p) {
                      p.norm(1);
                  });
As of GCC 14.2, to compile programs using parallel algorithms, we need to link with libtbb and libtbbmalloc, e.g.,
G par_user.cc -ltbb -ltbbmalloc
As of Clang 19.1, parallel STL remains an experimental feature in libc++, and must be enabled through -fexperimental-library
Execution policies
std::execution::sequenced_policy : The parallel algorithm's execution may not be parallelised. Element-wise operations are indeterminately sequenced in the calling thread. The type is used to disambiguate overload resolution. Convenience instance: std::execution::seq
std::execution::parallel_policy : May be parallelised. Element-wise operations can happen in the calling thread, or on another. Relative sequencing is indeterminate. Convenience instance: std::execution::par
std::execution::parallel_unsequenced_policy : May be parallelised and vectorised. Element-wise operations can run in unspecified threads, and can be unordered in each thread. Convenience instance: std::execution::par_unseq
std::execution::unsequenced_policy (since C++20) : Only vectorised. Convenience instance: std::execution::unseq
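As a small sketch of how a policy is passed, the function below forwards any of the instances above to std::transform_reduce. The name inner_product_with is made up for this illustration and is not taken from the course examples. With GCC, linking against TBB may be needed, as described earlier.

```cpp
#include <execution>
#include <functional>
#include <numeric>
#include <utility>
#include <vector>

// Inner product of two equally long vectors. The policy only grants permission
// to parallelise or vectorise; the implementation decides what actually happens.
template <class Policy>
auto inner_product_with(Policy&& policy, const std::vector<double>& x,
                        const std::vector<double>& y) -> double
{
    return std::transform_reduce(std::forward<Policy>(policy),
                                 x.begin(), x.end(), y.begin(), 0.0,
                                 std::plus<> {},        // reduction
                                 std::multiplies<> {}); // element-wise operation
}
```

Calling inner_product_with(std::execution::seq, x, y) and inner_product_with(std::execution::par_unseq, x, y) runs the same algorithm. Only the first argument differs.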
Parallel STL examples
Exercise 3.3:
The program examples/pstl/inner_product.cc demonstrates the use of the parallel STL library,
performing a simple inner product calculation. Use -ltbb -ltbbmalloc for linking, or use the CMake file in
the directory.
Exercise 3.4:
The program examples/pstl/transform_reduce.cc creates a vector of random points in 2D, and then calculates the moment of inertia using STL algorithms. By just switching the execution policy parameter, the program can be parallelised and vectorised. Test!
Parallel STL examples
Exercise 3.5:
Parallelise the program exercises/pstl/mandelbrot0.cc using parallel STL.
Exercise 3.6:
At what size of a group of random strangers does the chance of two people sharing a birthday become greater
than 0.5? The program birthday_problem.cc solves it using a crude, brute force Monte Carlo simulation.
Parallelise it using parallel STL.
Examples in this section can be done with both GCC and Clang, with some caveats when using Clang.
clang++ -std=c++23 -stdlib=libc++ -fexperimental-library -O3 -march=native ___.cc
and
clang++ -std=c++23 -stdlib=libstdc++ -O3 -march=native ___.cc
will both work. As of October 2024, libc++ has not optimised performance when using parallel algorithms.
Threading Building Blocks
TBB: Threading Building Blocks I
Provides utilities like parallel_for , parallel_reduce to simplify the most commonly used
structures in parallel programs
Provides scalable concurrent containers such as vectors, hash tables and queues for use in multi-threaded
environments
No direct support for vector parallelism. But can be combined with auto-parallelisation and
#pragma omp simd etc or explicit SIMD with a SIMD library
Supports complex models such as pipelines, data flow and unstructured task graphs
Scalable memory allocation, avoidance of false sharing, thread local storage
Low level synchronisation tools like mutexes and atomics
Work stealing task scheduler
http://www.threadingbuildingblocks.org
Structured Parallel Programming, Michael McCool, Arch D. Robison, James Reinders
Using TBB
Public names are available under the namespaces tbb and tbb::flow
You indicate "available parallelism", scheduler may run it in parallel if resources are available
Unnecessary parallelism will be ignored
parallel invoke
A few ad hoc tasks which do not depend on each other
Runs them in parallel
Waits until all of them are finished
    void prep(Population &p);
    void iomanage();
    tbb::parallel_invoke(
        [&] {
            noise_w(0., pars.sigma, wns);
            std::copy(wns.begin(), wns.end(), wnoisemat.begin());
        },
        [&] {
            noise_phi(0., pars.sigma, phins);
            std::copy(phins.begin(), phins.end(), phinoisemat.begin());
        });
Exercise 3.7: examples/tbb/parallel_invoke.cc
Compile with
G parallel_invoke.cc -ltbb -ltbbmalloc
TBB task groups
Run an arbitrary number of callable objects in parallel
In case an exception is thrown, the task group is cancelled
    struct Equation {
        void solve();
    };

    std::list<Equation> equations;
    tbb::task_group g;
    // Iterate by reference and capture each element by reference,
    // so that solve() acts on the list elements and not on copies
    for (auto& eq : equations)
        g.run([&eq] { eq.solve(); });

    g.wait();
TBB task arena
Task arena to manage tasks, maps them to threads etc.
Number of threads in an arena limited by its concurrency level
Execute function, with a function object as argument. Returns the same thing as the function it is executing.
    auto main(int argc, char *argv[]) -> int
    {
        size_t nthreads = std::stoul(argv[1]);
        tbb::task_arena main_executor;
        main_executor.initialize(nthreads);
        main_executor.execute([&] {
            haha();
        });
    }
    void haha()
    {
        ...
        tbb::parallel_invoke(a, b, c, d, e);
    }
    void a()
    {
        tbb::parallel_for(...);
    }
Parallel for loops
Template function modelled after the for loops, like many STL algorithms
Takes a callable object as the third argument
Using lambda functions, you can expose parallelism in sections of your code
    tbb::parallel_for(first, last, f);
    // parallel equivalent of
    // for (auto i = first; i < last; ++i) f(i);

    tbb::parallel_for(first, last, stride, f);
    // parallel equivalent of
    // for (auto i = first; i < last; i += stride)
    //     f(i);

    tbb::parallel_for(first, last,
                      [captures](anything) {
                          // Code that can run in parallel
                      });
Parallel for with ranges
Splits range into smaller ranges, and applies f to them in parallel
Possible to optimize f for sub-ranges rather than a single index
Any type satisfying a few design conditions can be used as a range
Multidimensional ranges possible
    tbb::parallel_for(0, 1000000, f);
    // One parallel invocation for each i!
    tbb::parallel_for(range, f);

    // A type R can be a range if the
    // following are available
    R::R(const R &);
    R::~R();
    bool R::is_divisible() const;
    bool R::empty() const;
    R::R(R & r, split); // Split constructor
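To make the range requirements concrete, here is a library-free sketch of a type with the members listed above, together with a serial imitation of the recursive splitting. IndexRange, split_tag, for_each_leaf and leaf_sizes are names made up for this illustration. With TBB, the tag type in the splitting constructor would have to be tbb::split, and the leaves would be processed in parallel. The grain size is assumed to be at least 1.

```cpp
#include <cstddef>
#include <vector>

struct split_tag {}; // stand-in for tbb::split in this library-free sketch

class IndexRange {
public:
    IndexRange(std::size_t first, std::size_t last, std::size_t grain)
        : first_ { first }, last_ { last }, grain_ { grain } {}
    // Splitting constructor: the new range takes the upper half, r keeps the lower half
    IndexRange(IndexRange& r, split_tag)
        : first_ { r.first_ + (r.last_ - r.first_) / 2 }, last_ { r.last_ }, grain_ { r.grain_ }
    {
        r.last_ = first_;
    }
    auto is_divisible() const -> bool { return last_ - first_ > grain_; }
    auto empty() const -> bool { return first_ == last_; }
    auto begin() const -> std::size_t { return first_; }
    auto end() const -> std::size_t { return last_; }

private:
    std::size_t first_, last_, grain_;
};

// Serial imitation of the scheduler: split while divisible, then call f on each leaf
template <class F>
void for_each_leaf(IndexRange r, F&& f)
{
    if (r.is_divisible()) {
        IndexRange upper { r, split_tag {} };
        for_each_leaf(r, f);
        for_each_leaf(upper, f);
    } else if (!r.empty()) {
        f(r);
    }
}

// Sizes of the sub-ranges that f would be called with
auto leaf_sizes(std::size_t n, std::size_t grain) -> std::vector<std::size_t>
{
    std::vector<std::size_t> sizes;
    for_each_leaf(IndexRange { 0, n, grain }, [&](const IndexRange& r) {
        sizes.push_back(r.end() - r.begin());
    });
    return sizes;
}
```

For example, leaf_sizes(30, 20) yields two leaves of 15 indexes each, since 30 exceeds the grain size while 15 does not.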
Parallel for with ranges
    tbb::blocked_range<int> r{0, 30, 20};
    assert(r.is_divisible());
    tbb::blocked_range<int> s{r, tbb::split{}};
    // Splitting constructor
    assert(!r.is_divisible());
    assert(!s.is_divisible());
tbb::blocked_range<int>(0,4) represents the half-open integer range [0, 4)
tbb::blocked_range<int>(0,50,30) can be split into two ranges, [0, 25) and [25, 50)
So long as the size of the range is bigger than the "grain size" (third argument), the range is split
Parallel for with ranges
    void dasxpcy_tbb(double a, std::span<const double> x, std::span<double> y) {
        // Use size_t for the range, to match x.size() and the loop index
        tbb::parallel_for(tbb::blocked_range<size_t>(0, x.size()),
            [&](const tbb::blocked_range<size_t>& r) {
                for (size_t i = r.begin(); i != r.end(); ++i) {
                    y[i] = a * std::sin(x[i]) + std::cos(y[i]);
                }
            });
    }
parallel_for with a range uses the split constructor to split the range as far as possible, and then calls f(range), where f is the function object given to parallel_for
It is unlikely that you wrote your useful functions with ranges compatible with parallel_for as arguments
But with lambda functions, it is easy to fit the parts!
Exercise 3.8: TBB parallel for demo
The program examples/dasxpcy.cc demonstrates the use of parallel for in TBB. It is a slightly modified
version of the commonly used DAXPY demos. Instead of calculating y = a ∗ x + y for scalar a and large vectors
x and y , we calculate y = a ∗ sin(x ) + cos(y ). To compile, you need to load your compiler and TBB modules,
and use them like this:
G dasxpcy.cc -ltbb -ltbbmalloc
2D ranges
    void f(size_t i, size_t j);
    tbb::blocked_range2d<size_t> r{0, N, 0, N};
    tbb::parallel_for(r, [&](tbb::blocked_range2d<size_t> r) {
        for (auto i = r.rows().begin(); i != r.rows().end(); ++i) {
            for (auto j = r.cols().begin(); j != r.cols().end(); ++j) {
                f(i, j);
            }
        }
    });
rows() is an object with a begin() and an end() returning just the integer row values in the range.
Similarly: cols() ...
2D range can also be split
The callable object argument should assume that the original 2D range has been split many times, and we
are operating on a smaller range, whose properties can be accessed with these functions.
Parallel reductions with ranges
    T result = tbb::parallel_reduce(range, identity, subrange_reduction, combine);
range : As with parallel for
identity : Identity element of type T. The type determines the type used to accumulate the result
subrange_reduction : Functor taking a "subrange" and an initial value, returning the reduction over the subrange
combine : Functor taking two arguments of type T and returning the reduction over them. Must be associative, but not necessarily commutative.
Parallel reduce with ranges
    double inner_prod_tbb(std::span<const double> x, std::span<const double> y) {
        return tbb::parallel_reduce(
            tbb::blocked_range<size_t>(0, x.size()), // range
            double{},                                // identity
            // The initial value must be a double: a float would lose precision
            [&](const tbb::blocked_range<size_t>& r, double in) {
                return std::inner_product(x.begin() + r.begin(), x.begin() + r.end(),
                                          y.begin() + r.begin(), in);
            }, // subrange reduction
            std::plus<double>{} // combine
        );
    }
With TBB ranges, we can use blocked implementations with hopefully vectorisable calculations in subranges
Two functors are required, either of which could be lambda functions
Important to add the contribution of initial value in subrange reductions
Exercise 3.9: TBB parallel reduce
The program tbbreduce.cc is a demo program to calculate an integral using tbb::parallel_reduce
What kind of speed-up do you see relative to the serial version? Does it make sense considering the number of physical cores in your computer?
Atomic variables
"Instantaneous" updates
Lock-free synchronization
For std::atomic<T>, T can be integral, enum or pointer type, and since C++20, also floating point, std::shared_ptr and std::weak_ptr
If index.load() == k, simultaneous calls to index++ by n threads will increase index to k + n. Each thread will use a distinct value from k, k + 1, ..., k + n - 1
But it is important that we use the return value of index++ in the threads!
    std::array<double, N> A;
    std::atomic<int> index;

    void append(double val)
    {
        A[index++] = val;
    }
Enumerable thread specific
    tbb::enumerable_thread_specific<double> E;
    double Eglob = 0;
    double f(size_t i, size_t j);
    tbb::blocked_range2d<size_t> r{0, N, 0, N};
    tbb::parallel_for(r, [&](tbb::blocked_range2d<size_t> r) {
        auto & eloc = E.local();
        for (size_t i = r.rows().begin(); i != r.rows().end(); ++i) {
            for (size_t j = r.cols().begin(); j != r.cols().end(); ++j) {
                if (j > i) eloc += f(i, j);
            }
        }
    });
    Eglob = 0;
    for (auto& v : E) { Eglob += v; v = 0; }
Thread local "views" of a variable
Behaves like an STL container of those views
Member function local() gives a reference to the local view in the current thread
Any thread can access all views by treating it as an STL container
TBB allocators
Dynamic memory allocation in a multithreaded program must avoid conflicts from new calls from different
threads
Global memory lock
TBB allocators
Interface like std::allocator , so that it can be used with STL containers. E.g.,
std::vector<T, tbb::cache_aligned_allocator<T>>
tbb::scalable_allocator<T> : general purpose scalable allocator type, for rapid allocation from
multiple threads
tbb::cache_aligned_allocator<T> : Allocates with cache line alignment. As a consequence,
objects allocated in different threads are guaranteed to be in different cache lines.
Concurrent containers
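    #include <tbb/concurrent_vector.h>
    #include <tbb/parallel_for.h>

    auto v = tbb::concurrent_vector<int>(N, 0);

    // range_type is a member of the specialisation, so the element type is needed
    tbb::parallel_for(v.range(), [&](tbb::concurrent_vector<int>::range_type r) {
        //...
    });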
Random access by index
Multiple threads can grow container and add elements concurrently
Growing the container does not invalidate any iterators or indexes
Has a range() member function for use with parallel_for etc.
Linear Algebra
Linear algebra
Operations on matrices, vectors, linear systems etc.
Data parallel, simple numerical calculations
Can be hand coded, but taking proper account of available CPU instructions, memory hierarchy etc is hard
Libraries with standardized syntax for wide applicability
Excellent vendor libraries are available on HPC systems
Eigen: A C++ template library for linear algebra
Include only library. Download from http://eigen.tuxfamily.org/, unpack in a location of your choice, and use. Nothing to link.
Small fixed size to large dense/sparse matrices
Matrix operations, numerical solvers, tensors ...
Expression templates: lazy evaluation, smart removal of temporaries
Explicit vectorization
Elegant API
    // examples/Eigen/eigen1.cc
    #include <iostream>
    #include <Eigen/Dense>
    using namespace Eigen;
    using namespace std;
    int main()
    {
        MatrixXd m = MatrixXd::Random(3, 3);
        m = (m + MatrixXd::Constant(3, 3, 1.2)) * 50;
        cout << "m =" << "\n" << m << "\n";
        VectorXd v(3);
        v << 1, 2, 3;
        cout << "m * v =" << "\n" << m * v << "\n";
    }
G eigen1.cc
Eigen: matrix types
MatrixXd : matrix of arbitrary dimensions
Matrix3d : fixed size 3 × 3 matrix
Vector3d : fixed size 3d vector
Element access m(i,j)
Output std::cout << m << "\n";
Constant : MatrixXd::Constant(a,b,c)
Random : MatrixXd::Random(n,n)
Products : m * v or m1 * m2
Expressions : 3 * m * m * v1 + u * v2 + m * m * m * v3
Column major matrix : Matrix<float, 3, 10, Eigen::ColMajor>
Eigen: matrix operations
    #include <iostream>
    #include <Eigen/Dense>
    using namespace Eigen;
    auto main() -> int {
        Matrix3f A;
        Vector3f b;
        A << 1,2,3, 4,5,6, 7,8,10;
        b << 3, 3, 4;
        std::cout << "Here is the matrix A:\n" << A << "\n";
        std::cout << "Here is the vector b:\n" << b << "\n";
        Vector3f x = A.colPivHouseholderQr().solve(b);
        std::cout << "The solution is:\n" << x << "\n";
    }
Blocks m.block(start_r, start_c, nr, nc), or m.block<nr,nc>(start_r, start_c)
    // Separate snippet: here A is a self-adjoint Matrix2f, not the Matrix3f above
    SelfAdjointEigenSolver<Matrix2f> eigensolver(A);
    if (eigensolver.info() != Success) abort();
    std::cout << "Eigenvalues " << eigensolver.eigenvalues() << "\n";
Eigen: examples
Exercise 3.10:
There are a few example programs using Eigen in the folder examples/Eigen . Read the programs
eigen0.cc and eigen1.cc . To compile, use G program.cc .
Exercise 3.11:
The folder examples/Eigen contains a matrix multiplication example, matmul.cc using Eigen. Compare
with a naive version of a matrix multiplication program, matmul_naive.cc , by compiling and running both
programs. Try different matrix sizes. Then, you can use a parallel version of the Eigen matrix multiplication by
recompiling with -fopenmp .
Exercise 3.12:
The file exercises/PCA has a data file with tabular data. Each column represents all measurements of a
particular type, while each row is a different trial. In each row, the first column, xi0 , represents a pseudo-time
variable. Write a program using Eigen to perform a Principal Component Analysis on this data set, ignoring the
first column. Hint:
if $X_i = [x_{i1}, x_{i2}, \ldots, x_{im}]$ is the data of row $i$, the covariance matrix is defined as,
\[ C_{ab} = \frac{1}{n-1} \sum_{k} x_{ka} x_{kb} \]
The principal components of the data are obtained by right multiplying the data matrix by the matrix whose columns are the eigenvectors of the matrix $C_{ab}$, conventionally ordered by decreasing eigenvalues.
Lessons from matrix multiplication
Exercise 3.13:
In the examples folder, you will find a MatMul subfolder, containing a written lesson called SessionMatrix.pdf .
This file contains 8 stages organised as exercises starting with a naive implementation of a matrix type in C++,
and ending with something with reasonably respectable performance (comparable to what is possible with, e.g.,
Eigen, or other BLAS libraries) on a single node on JUSUF. It only uses concepts introduced in this course, and
does not call any linear algebra library function. Work through the exercises and test the different stages on
JUSUF!