
High performance scientific computing in C++

HPC C++ Course 2024

28 October – 31 October 2024 Sandipan Mohanty Forschungszentrum Jülich, Germany

Member of the Helmholtz Association

    struct Point {
        double x{}, y{}, z{}, w{};
    };

Assuming a cache line size of 64 bytes, and that double is 8 bytes, how many elements of a vector of
Point would fit completely inside a cache line?

(A) 1
(B) 2
(C) 1 or 2, can’t be sure
(D) 8

Member of the Helmholtz Association 28 October – 31 October 2024 Slide 1

    struct Point {
        double x{}, y{}, z{}, w{};
    };

Assuming a cache line size of 64 bytes, and that double is 8 bytes, how many cache lines must be read
when accessing a single Point?

(A) 1
(B) 2
(C) 1 or 2, can’t be sure
(D) 8

    struct alignas(32) Point {
        double x{}, y{}, z{}, w{};
    };

Assuming a cache line size of 64 bytes, and that double is 8 bytes, how many cache lines must be read
when accessing a single Point?

(A) 1
(B) 2
(C) 1 or 2, can’t be sure
(D) 8
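The three questions above come down to address arithmetic. The following is a minimal sketch, assuming 64-byte cache lines and 8-byte doubles as in the questions. The names AlignedPoint and lines_touched are illustrative and not part of the course material.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>

struct Point {
    double x{}, y{}, z{}, w{};
};

// Same layout, but every object starts at a multiple of 32 bytes
struct alignas(32) AlignedPoint {
    double x{}, y{}, z{}, w{};
};

// Assumption: 8-byte double and no padding, true on common platforms
static_assert(sizeof(Point) == 32);
static_assert(alignof(AlignedPoint) == 32);

// Number of cache lines spanned by an object of `size` bytes
// that starts at byte address `addr`
constexpr std::size_t lines_touched(std::uintptr_t addr, std::size_t size,
                                    std::size_t line = 64)
{
    return (addr + size - 1) / line - addr / line + 1;
}
```

A Point that is only 8-byte aligned may start at offset 40 within a line, in which case lines_touched(40, 32) is 2. With alignas(32) the possible offsets are 0 and 32, and the object always stays inside one line.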

xtensor


xtensor: multi-dimensional arrays with lazy evaluation
NumPy                               xtensor
np.linspace(0., 2., 10)             xt::linspace<double>(0., 2., 10UL);
np.logspace(2., 10., 4)             xt::logspace<double>(2., 10., 4UL);
np.zeros((10, 10))                  xt::zeros<double>({10UL, 10UL});
A[1, 2]                             A(1, 2);
A.flat[4]                           A[4];
A[:, 3]                             xt::col(A, 3) or xt::view(A, xt::all(), 3);
A[:3, 3:]                           xt::view(A, xt::range(_, 3), xt::range(3, _));
np.vectorize(f)                     xt::vectorize(f);
A[A > 1.0]                          xt::filter(A, A > 1.0);
A[[1, 2], [0, 1]]                   xt::index_view(A, {{1, 0}, {2, 1}});
np.random.rand(100, 200)            xt::random::rand<double>({100, 200});
np.random.shuffle(A)                xt::random::shuffle(A);
np.where(A < 0, A, B)               xt::where(A < 0, A, B);
np.loadtxt(file, delimiter=delim)   xt::load_csv<double>(stream);
np.linalg.svd(A)                    xt::linalg::svd(A);
np.linalg.eig(A)                    xt::linalg::eig(A);

Syntax modelled after Python's NumPy

Evaluation is lazy in more places than in NumPy: many operations return expressions that are computed only when assigned or accessed

Exercise 3.1:
The short program examples/xtensor/xt0.cc demonstrates using xtensor with eigenvalue evaluation. The linear
algebra functionality in xtensor is currently handled by an external project, xtensor-blas, which offloads some of
the work to a BLAS library. To build the program, set the include path to include headers from “xtensor-stack”,
i.e., xtl, xtensor, xsimd, and xtensor-blas. They can be given a common installation prefix. On JUSUF, the
relevant include and library directories are already in the right paths. For linking, use
-lopenblas -lpthread -lgfortran

    #include <xtensor/xtensor.hpp>
    #include <xtensor/xarray.hpp>
    #include <xtensor/xio.hpp>
    #include <xtensor/xrandom.hpp>
    #include <xtensor-blas/xlinalg.hpp>
    #include <iostream>

    auto main() -> int
    {
        // rand() returns a lazy generator expression. Storing it in a container
        // keeps the printed matrix and the one passed to eigvals identical.
        xt::xarray<double> R = xt::random::rand<double>({4, 4});
        auto eigs = xt::linalg::eigvals(R);
        std::cout << R << "\n\n";
        std::cout << eigs << "\n";
    }

Exercise 3.2:
The program xt1.cc demonstrates creation of two random matrices using xtensor, and matrix multiplication
using xtensor-blas. Test using compilation as above. xt2.cc demonstrates numerical verification that the sum of
eigenvalues of a symmetric matrix is the trace of the matrix. Use these programs to familiarize yourself with
xtensor.

Parallel computing

Parallel computing
Engineering (power consumption) challenges make processors with higher and higher clock rates impractical
Computers in the last 20 years have instead increased processing power by adding more hardware for parallel
processing
A sequence of dependent operations on a small set of entities is ill-suited for processing with many workers

    #include <utility> // std::swap

    auto gcd(unsigned s, unsigned l) -> unsigned
    {
        if (s > l)
            std::swap(s, l);
        while (s != 0) {
            auto r = l % s;
            l = s;
            s = r;
        }
        return l;
    }

Given a large amount of information to be processed, or a task with a large number of independent sub-tasks, it
is possible to reduce the overall processing time.

Parallel computing
What mechanisms do we have in C++ to exploit available parallelism in hardware?

Threads, mutexes, atomic operations

RAII for resource management
Libraries to partition and assign work to workers
Templates, lambda functions, CTAD
High-level STL style algorithms abstracting common programming building blocks
Containers and allocators for more efficient (and correct) parallel processing

Threads
std::thread, std::async ... since C++11
Parallel algorithms since C++17
std::jthread, std::stop_token since C++20
std::jthread joins in the destructor

    // N and pi are assumed to be defined in the enclosing scope
    auto calc1 = [=]() {
        auto tot1 = 0.;
        for (auto i = 0UL; i < N; ++i) {
            auto ang = 2 * i * pi / N;
            tot1 += std::cos(ang) * std::cos(ang);
        }
    };
    auto calc2 = [=]() {
        auto tot1 = 0.;
        for (auto i = 0UL; i < N; ++i) {
            auto ang = 2 * i * pi / N;
            tot1 += std::sin(ang) * std::sin(ang);
        }
    };
    std::jthread j1 { calc1 };
    std::jthread j2 { calc2 };

Threads
    auto tot = 0.;
    {
        std::jthread j1 { [&]() {
            for (auto i = 0UL; i < N; ++i) {
                auto ang = 2 * i * pi / N;
                tot += std::cos(ang) * std::cos(ang);
            }
        } };
        std::jthread j2 { [&]() {
            for (auto i = 0UL; i < N; ++i) {
                auto ang = 2 * i * pi / N;
                tot += std::sin(ang) * std::sin(ang);
            }
        } };
    }
    std::cout << "Total " << tot << "\n";

Modification of data at the same address from multiple threads can lead to “data races”
The result can be incorrect, since the load-modify-commit operations from the two threads can overlap

Threads
    auto tot = 0.;
    std::mutex totmutex;
    {
        std::jthread j1 { [&]() {
            for (auto i = 0UL; i < N; ++i) {
                auto ang = 2 * i * pi / N;
                std::scoped_lock lck { totmutex };
                tot += std::cos(ang) * std::cos(ang);
            }
        } };
        std::jthread j2 { [&]() {
            for (auto i = 0UL; i < N; ++i) {
                auto ang = 2 * i * pi / N;
                std::scoped_lock lck { totmutex };
                tot += std::sin(ang) * std::sin(ang);
            }
        } };
    }
    std::cout << "Total " << tot << "\n";

Fix 1: std::mutex : A resource which can be acquired by only one thread at a time. Must be released by the
acquiring thread.
std::scoped_lock manages mutex acquisition/release using RAII

Threads
    std::atomic<double> tot {};
    {
        std::jthread j1 { [&]() {
            for (auto i = 0UL; i < N; ++i) {
                auto ang = 2 * i * pi / N;
                tot += std::cos(ang) * std::cos(ang);
            }
        } };
        std::jthread j2 { [&]() {
            for (auto i = 0UL; i < N; ++i) {
                auto ang = 2 * i * pi / N;
                tot += std::sin(ang) * std::sin(ang);
            }
        } };
    }
    std::cout << "Total " << tot << "\n";

std::atomic<T> gives us “atomic” load-modify-commit operations

Threads
Even when threads write to different addresses, there can be a significant slowdown because of “false sharing”
Mitigation: alignment or padding

    #include <new> // std::hardware_destructive_interference_size

    // Neighbouring array elements share a cache line
    struct wrapped1 {
        int val {};
    };
    // Each element occupies its own cache line
    struct alignas(std::hardware_destructive_interference_size) wrapped2 {
        int val {};
    };
    template <class W>
    struct func {
        void operator()(volatile W* var)
        {
            for (unsigned i = 0; i < WORKLOAD / PARALLEL; ++i) {
                var->val = var->val + 1;
            }
        }
    };
    {
        std::array<wrapped2, PARALLEL> arr {};
        {
            std::array<std::jthread, PARALLEL> threads;
            for (unsigned i = 0U; i < PARALLEL; ++i) {
                threads[i] =
                    std::jthread(func<wrapped2>{}, &arr[i]);
            }
        }
    }

Parallel STL

Parallel STL
Parallel versions of the high-level building blocks such as std::sort, std::reduce etc.
C++17 parallel STL provides a way to express that something can be done in parallel, but does not mandate
implementation strategy
Programs already written using algorithms will offer many opportunities for exploiting parallelism
A TBB based implementation is used since GCC 9.1. Intel and Microsoft compilers have their implementations as
well.
std::sort sorts. std::sort(std::execution::par, ...) sorts in parallel
std::reduce adds up elements from a range. std::reduce(std::execution::par, ...) adds up elements in parallel

    std::sort(std::execution::par,
              points.begin(), points.end(),
              [](auto p1, auto p2) {
                  return p1.x() < p2.x();
              });
    std::for_each(std::execution::par_unseq,
                  points.begin(), points.end(),
                  [](auto & p) {
                      p.norm(1);
                  });

As of GCC 14.2, to compile programs using parallel algorithms, we need to link with libtbb and libtbbmalloc, e.g.,
G par_user.cc -ltbb -ltbbmalloc
As of Clang 19.1, parallel STL remains an experimental feature in libc++, and must be enabled through
-fexperimental-library

Execution policies
std::execution::sequenced_policy : The algorithm's execution may not be parallelised. Element-wise operations
are indeterminately sequenced in the calling thread. An instance, std::execution::seq, is usually passed to
select this overload
std::execution::parallel_policy : May be parallelised. Element-wise operations can happen in the calling
thread, or on another. Relative sequencing is indeterminate. Convenience instance: std::execution::par
std::execution::parallel_unsequenced_policy : May be parallelised and vectorised. Element-wise operations can
run in unspecified threads, and can be unordered in each thread. Convenience instance: std::execution::par_unseq
std::execution::unsequenced_policy (since C++20) : Only vectorised. Convenience instance: std::execution::unseq

Parallel STL examples
Exercise 3.3:
The program examples/pstl/inner_product.cc demonstrates the use of the parallel STL library,
performing a simple inner product calculation. Use -ltbb -ltbbmalloc for linking, or use the CMake file in
the directory.

Exercise 3.4:
The program examples/pstl/transform_reduce.cc creates a vector of random points in 2D, and then
calculates the moment of inertia using STL algorithms. Just by switching the execution policy parameter, the
program can be parallelised and vectorised. Test!
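To illustrate how the execution policy becomes a single switchable argument, here is a minimal sketch in the spirit of Exercise 3.4. It is not the course program: Point2D and moment_of_inertia are hypothetical names, and the moment of inertia is taken about the origin with unit masses.

```cpp
#include <cassert>
#include <execution>
#include <functional>
#include <numeric>
#include <utility>
#include <vector>

struct Point2D { // hypothetical point type
    double x {}, y {};
};

// Sum of squared distances from the origin (unit masses assumed).
// The policy argument decides whether the work may be parallelised.
template <class Policy>
double moment_of_inertia(Policy&& policy, const std::vector<Point2D>& pts)
{
    return std::transform_reduce(
        std::forward<Policy>(policy),
        pts.begin(), pts.end(),
        0.0,                   // initial value
        std::plus<double> {},  // reduction
        [](const Point2D& p) { return p.x * p.x + p.y * p.y; }); // transform
}
```

Calling moment_of_inertia(std::execution::seq, pts) runs sequentially, and replacing the first argument with std::execution::par or std::execution::par_unseq permits parallel execution. With GCC, the parallel policies additionally require linking with -ltbb as described earlier.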

Parallel STL examples
Exercise 3.5:
Parallelise the program exercises/pstl/mandelbrot0.cc using parallel STL.

Exercise 3.6:
At what size of a group of random strangers does the chance of two people sharing a birthday become greater
than 0.5? The program birthday_problem.cc solves it using a crude, brute force Monte Carlo simulation.
Parallelise it using parallel STL.

Examples in this section can be done with both GCC and Clang, with some caveats when using Clang.
clang++ -std=c++23 -stdlib=libc++ -fexperimental-library -O3 -march=native ___.cc
and
clang++ -std=c++23 -stdlib=libstdc++ -O3 -march=native ___.cc
will both work. As of October 2024, libc++ has not optimised performance when using parallel algorithms.

Threading Building Blocks


TBB: Threading Building Blocks I
Provides utilities like parallel_for , parallel_reduce to simplify the most commonly used
structures in parallel programs
Provides scalable concurrent containers such as vectors, hash tables and queues for use in multi-threaded
environments
No direct support for vector parallelism, but can be combined with auto-vectorisation,
#pragma omp simd etc., or explicit SIMD with a SIMD library
Supports complex models such as pipelines, data flow and unstructured task graphs
Scalable memory allocation, avoidance of false sharing, thread local storage
Low level synchronisation tools like mutexes and atomics
Work stealing task scheduler
http://www.threadingbuildingblocks.org
Structured Parallel Programming, Michael McCool, Arch D. Robinson, James Reinders


Using TBB
Public names are available under the namespaces tbb and tbb::flow
You indicate "available parallelism"; the scheduler may run it in parallel if resources are available
Parallelism that cannot be exploited is simply ignored, and the work then runs sequentially

parallel invoke

    void prep(Population &p);
    void iomanage();
    tbb::parallel_invoke(
        [&] {
            noise_w(0., pars.sigma, wns);
            std::copy(wns.begin(), wns.end(), wnoisemat.begin());
        },
        [&] {
            noise_phi(0., pars.sigma, phins);
            std::copy(phins.begin(), phins.end(), phinoisemat.begin());
        });

A few adhoc tasks which do not depend on each other
Runs them in parallel
Waits until all of them are finished

Exercise 3.7: examples/tbb/parallel_invoke.cc
Compile with
G parallel_invoke.cc -ltbb -ltbbmalloc

TBB task groups
Run an arbitrary number of callable objects in parallel
In case an exception is thrown, the task group is cancelled

    struct Equation {
        void solve();
    };

    std::list<Equation> equations;
    tbb::task_group g;
    // Iterate and capture by reference, so that the tasks solve the
    // equations stored in the list rather than temporary copies
    for (auto& eq : equations)
        g.run([&eq] { eq.solve(); });

    g.wait();

TBB task arena
Task arena to manage tasks, map them to threads etc.
Number of threads in an arena limited by its concurrency level
The execute() member function takes a function object as argument, and returns the same thing as the function
it is executing.

    auto main(int argc, char *argv[]) -> int
    {
        size_t nthreads=std::stoul(argv[1]);
        tbb::task_arena main_executor;
        main_executor.initialize(nthreads);
        main_executor.execute([&]{
            haha();
        });
    }
    void haha()
    {
        ...
        tbb::parallel_invoke(a,b,c,d,e);
    }
    void a()
    {
        tbb::parallel_for(...);
    }

Parallel for loops
Template function modelled after the for loops, like many STL algorithms
Takes a callable object as the third argument
Using lambda functions, you can expose parallelism in sections of your code

    tbb::parallel_for(first,last,f);
    // parallel equivalent of
    // for (auto i=first;i<last;++i) f(i);

    tbb::parallel_for(first,last,stride,f);
    // parallel equivalent of
    // for (auto i=first;i<last;i+=stride)
    //     f(i);

    tbb::parallel_for(first,last,
        [captures](anything){
            //Code that can run in parallel
        });

Parallel for with ranges
Splits range into smaller ranges, and applies f to them in parallel
Possible to optimize f for sub-ranges rather than a single index
Any type satisfying a few design conditions can be used as a range
Multidimensional ranges possible

    tbb::parallel_for(0,1000000,f);
    // One parallel invocation for each i!
    tbb::parallel_for(range,f);

    // A type R can be a range if the
    // following are available
    R::R(const R &);
    R::~R();
    bool R::is_divisible() const;
    bool R::empty() const;
    R::R(R & r,split); //Split constructor

Parallel for with ranges

    tbb::blocked_range<int> r{0,30,20};
    assert(r.is_divisible());
    tbb::blocked_range<int> s{r, tbb::split{}};
    //Splitting constructor: r is now [0,15), s is [15,30)
    assert(!r.is_divisible());
    assert(!s.is_divisible());

tbb::blocked_range<int>(0,4) represents the half-open integer range [0,4), i.e., 0..3

tbb::blocked_range<int>(0,50,30) can be split into two ranges, [0,25) and [25,50)
So long as the size of the range is bigger than the "grain size" (third argument), the range can be split
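The range requirements and the splitting behaviour described above can be mimicked without TBB. The sketch below is an illustration only: IndexRange, split_tag and split_all are hypothetical names, split_tag stands in for tbb::split, and split_all always splits as far as possible, whereas a real scheduler decides at run time how far to split.

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

struct split_tag {}; // hypothetical stand-in for tbb::split

// A half-open integer range [begin, end) with a grain size.
// Copy constructor and destructor are implicitly available.
class IndexRange {
public:
    IndexRange(int b, int e, std::size_t grain = 1)
        : begin_ { b }, end_ { e }, grain_ { grain } {}

    // Split constructor: the new range takes the upper half of r,
    // and r keeps the lower half
    IndexRange(IndexRange& r, split_tag)
        : begin_ { r.begin_ + (r.end_ - r.begin_) / 2 },
          end_ { r.end_ },
          grain_ { r.grain_ }
    {
        r.end_ = begin_;
    }

    bool empty() const { return begin_ >= end_; }
    std::size_t size() const
    {
        return empty() ? 0 : static_cast<std::size_t>(end_ - begin_);
    }
    bool is_divisible() const { return size() > grain_; }
    int begin() const { return begin_; }
    int end() const { return end_; }

private:
    int begin_;
    int end_;
    std::size_t grain_;
};

// Recursively split a range until no piece is divisible any more
inline void split_all(IndexRange r, std::vector<IndexRange>& leaves)
{
    if (r.is_divisible()) {
        IndexRange upper { r, split_tag {} };
        split_all(r, leaves);
        split_all(upper, leaves);
    } else {
        leaves.push_back(r);
    }
}
```

With a grain size of 30, the range [0,50) ends up as the two leaves [0,25) and [25,50), each of which would be handed to the user supplied function.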

Parallel for with ranges

    void dasxpcy_tbb(double a, std::span<const double> x, std::span<double> y) {
        // Use size_t as the index type, matching x.size()
        tbb::parallel_for(tbb::blocked_range<size_t>(0, x.size()),
            [&](const tbb::blocked_range<size_t>& r) {
                for (size_t i = r.begin(); i != r.end(); ++i) {
                    y[i] = a * std::sin(x[i]) + std::cos(y[i]);
                }
            });
    }

parallel_for with a range uses the split constructor to split the range as far as possible, and then calls
f(range), where f is the function object given to parallel_for
It is unlikely that you wrote your useful functions with ranges compatible with parallel_for as arguments
But with lambda functions, it is easy to fit the parts!

Exercise 3.8: TBB parallel for demo
The program examples/dasxpcy.cc demonstrates the use of parallel for in TBB. It is a slightly modified
version of the commonly used DAXPY demos. Instead of calculating y = a * x + y for scalar a and large vectors
x and y, we calculate y = a * sin(x) + cos(y). To compile, you need to load your compiler and TBB modules,
and use them like this:

    G dasxpcy.cc -ltbb -ltbbmalloc

2D ranges

    void f(size_t i, size_t j);

    tbb::blocked_range2d<size_t> r{0, N, 0, N};
    tbb::parallel_for(r, [&](tbb::blocked_range2d<size_t> r) {
        for (auto i = r.rows().begin(); i != r.rows().end(); ++i) {
            for (auto j = r.cols().begin(); j != r.cols().end(); ++j) {
                f(i, j);
            }
        }
    });

rows() is an object with a begin() and an end() returning just the integer row values in the range.
Similarly: cols() ...
2D range can also be split
The callable object argument should assume that the original 2D range has been split many times, and we
are operating on a smaller range, whose properties can be accessed with these functions.

Parallel reductions with ranges

    T result = tbb::parallel_reduce(range, identity, subrange_reduction, combine);

range : As with parallel for
identity : Identity element of type T. The type determines the type used to accumulate the result
subrange_reduction : Functor taking a "subrange" and an initial value, returning the reduction over the subrange
combine : Functor taking two arguments of type T and returning the reduction over them. Must be associative,
but not necessarily commutative.

Parallel reduce with ranges

    double inner_prod_tbb(std::span<const double> x, std::span<const double> y) {
        return tbb::parallel_reduce(
            tbb::blocked_range<size_t>(0, x.size()), // range
            double{}, // identity
            // The subrange is passed as a const reference. The running value
            // is a double, so that no precision is lost.
            [&](const tbb::blocked_range<size_t>& r, double in) {
                return std::inner_product(x.begin() + r.begin(), x.begin() + r.end(),
                                          y.begin() + r.begin(), in);
            }, // subrange reduction
            std::plus<double>{} // combine
        );
    }

With TBB ranges, we can use blocked implementations with hopefully vectorisable calculations in subranges
Two functors are required, either of which could be lambda functions
Important to add the contribution of the initial value in subrange reductions

Exercise 3.9: TBB parallel reduce
The program tbbreduce.cc is a demo program to calculate an integral using tbb::parallel_reduce.
What kind of speed up do you see relative to the serial version? Does it make sense considering the number of
physical cores in your computer?

Atomic variables
"Instantaneous" updates
Lock-free synchronization
For std::atomic<T>, T can be any trivially copyable type. Integral and pointer types have atomic arithmetic
such as ++ and fetch_add. Since C++20, floating point types have fetch_add and += as well, and there are
specialisations for std::shared_ptr and std::weak_ptr

    std::array<double, N> A;
    std::atomic<int> index;

    void append(double val)
    {
        A[index++] = val;
    }

If index.load() == k, simultaneous calls to index++ by n threads will increase index to k + n. Each thread will
use a distinct value in k, ..., k + n - 1
But it is important that we use the return value of index++ in the threads!

Enumerable thread specific
    tbb::enumerable_thread_specific<double> E;
    double Eglob=0;
    double f(size_t i, size_t j);
    tbb::blocked_range2d<size_t> r{0, N, 0, N};
    tbb::parallel_for(r, [&](tbb::blocked_range2d<size_t> r){
        auto & eloc = E.local();
        for (size_t i = r.rows().begin(); i != r.rows().end(); ++i) {
            for (size_t j = r.cols().begin(); j != r.cols().end(); ++j) {
                if (j > i) eloc += f(i,j);
            }
        }
    });
    Eglob = 0;
    for (auto& v : E) {Eglob += v; v = 0;}

Thread local "views" of a variable
Behaves like an STL container of those views
Member function local() gives a reference to the local view in the current thread
Any thread can access all views by treating it as an STL container

TBB allocators
Dynamic memory allocation in a multithreaded program must avoid conflicts from new calls from different
threads
A global memory lock avoids the conflicts, but serialises allocation

TBB allocators
Interface like std::allocator , so that it can be used with STL containers. E.g.,
std::vector<T, tbb::cache_aligned_allocator<T>>
tbb::scalable_allocator<T> : general purpose scalable allocator type, for rapid allocation from
multiple threads
tbb::cache_aligned_allocator<T> : Allocates with cache line alignment. As a consequence,
objects allocated in different threads are guaranteed to be in different cache lines.


Concurrent containers

    #include <tbb/concurrent_vector.h>

    auto v = tbb::concurrent_vector<int>(N, 0);

    tbb::parallel_for(v.range(), [&](tbb::concurrent_vector<int>::range_type r) {
        //...
    });

Random access by index
Multiple threads can grow container and add elements concurrently
Growing the container does not invalidate any iterators or indexes
Has a range() member function for use with parallel_for etc.

Linear Algebra


Linear algebra
Operations on matrices, vectors, linear systems etc.
Data parallel, simple numerical calculations
Can be hand coded, but taking proper account of available CPU instructions, memory hierarchy etc is hard
Libraries with standardized syntax for wide applicability
Excellent vendor libraries are available on HPC systems


Eigen: A C++ template library for linear algebra
Header-only library. Download from http://eigen.tuxfamily.org/, unpack in a location of your choice, and use.
Nothing to link.
Small fixed size to large dense/sparse matrices
Matrix operations, numerical solvers, tensors ...
Expression templates: lazy evaluation, smart removal of temporaries
Explicit vectorization
Elegant API

    // examples/Eigen/eigen1.cc
    #include <iostream>
    #include <Eigen/Dense>
    using namespace Eigen;
    using namespace std;
    int main()
    {
        MatrixXd m=MatrixXd::Random(3,3);
        m = (m + MatrixXd::Constant(3, 3, 1.2)) * 50;
        cout << "m =" << "\n" << m << "\n";
        VectorXd v(3);
        v << 1, 2, 3;
        cout << "m * v =" << "\n" << m * v << "\n";
    }

    G eigen1.cc

Eigen: matrix types
MatrixXd : matrix of double with arbitrary (run-time) dimensions
Matrix3d : fixed size 3 × 3 matrix
Vector3d : fixed size 3d vector
Element access m(i,j)
Output std::cout << m << "\n";
Constant : MatrixXd::Constant(a,b,c)
Random : MatrixXd::Random(n,n)
Products : m * v or m1 * m2
Expressions : 3 * m * m * v1 + u * v2 + m * m * m * v3
Column major matrix : Matrix<float, 3, 10, Eigen::ColMajor>


Eigen: matrix operations

    #include <iostream>
    #include <Eigen/Dense>
    using namespace Eigen;
    auto main() -> int {
        Matrix3f A;
        Vector3f b;
        A << 1,2,3, 4,5,6, 7,8,10;
        b << 3, 3, 4;
        std::cout << "Here is the matrix A:\n" << A << "\n";
        std::cout << "Here is the vector b:\n" << b << "\n";
        Vector3f x = A.colPivHouseholderQr().solve(b);
        std::cout << "The solution is:\n" << x << "\n";
    }

Blocks: m.block(start_r, start_c, nr, nc), or m.block<nr,nc>(start_r, start_c)

    // Separate snippet: here A is assumed to be a symmetric Matrix2f
    SelfAdjointEigenSolver<Matrix2f> eigensolver(A);
    if (eigensolver.info() != Success) abort();
    std::cout << "Eigenvalues " << eigensolver.eigenvalues() << "\n";

Eigen: examples
Exercise 3.10:
There are a few example programs using Eigen in the folder examples/Eigen . Read the programs
eigen0.cc and eigen1.cc . To compile, use G program.cc .

Exercise 3.11:
The folder examples/Eigen contains a matrix multiplication example, matmul.cc using Eigen. Compare
with a naive version of a matrix multiplication program, matmul_naive.cc , by compiling and running both
programs. Try different matrix sizes. Then, you can use a parallel version of the Eigen matrix multiplication by
recompiling with -fopenmp .


Exercise 3.12:
The folder exercises/PCA has a data file with tabular data. Each column represents all measurements of a
particular type, while each row is a different trial. In each row, the first column, $x_{i0}$, represents a
pseudo-time variable. Write a program using Eigen to perform a Principal Component Analysis on this data set,
ignoring the first column. Hint:
if $X_i = [x_{i1}, x_{i2}, \ldots, x_{im}]$ is the data of row $i$, the covariance matrix is defined as,

$$ C_{ab} = \frac{1}{n - 1} \sum_{k} x_{ka} x_{kb} $$

The principal components of the data are obtained by right multiplying the data matrix by the matrix whose
columns are the eigenvectors of the matrix $C_{ab}$, conventionally ordered by decreasing eigenvalues.
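As a small illustration of the covariance formula in the hint, the sum can be written out with plain loops. This sketch does not use Eigen and is not a solution to the exercise: Matrix and covariance are hypothetical names, the data is stored as a vector of rows, and the formula is applied exactly as written, which presumes that the mean of each column has already been subtracted.

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

using Matrix = std::vector<std::vector<double>>; // hypothetical row storage

// C_ab = 1/(n-1) * sum_k x_ka * x_kb, for n rows (trials) and m columns.
// Assumption: the columns of X are already mean-subtracted.
inline Matrix covariance(const Matrix& X)
{
    const std::size_t n = X.size();
    assert(n > 1);
    const std::size_t m = X.front().size();
    Matrix C(m, std::vector<double>(m, 0.0));
    for (const auto& row : X) {
        for (std::size_t a = 0; a < m; ++a) {
            for (std::size_t b = 0; b < m; ++b) {
                C[a][b] += row[a] * row[b];
            }
        }
    }
    for (auto& crow : C) {
        for (auto& c : crow) c /= static_cast<double>(n - 1);
    }
    return C;
}
```

The eigen decomposition of the resulting symmetric matrix, and the projection of the data onto the eigenvectors, are the parts left to the exercise.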

Lessons from matrix multiplication
Exercise 3.13:
In the examples folder, you will find a MatMul subfolder, containing a written lesson called SessionMatrix.pdf .
This file contains 8 stages organised as exercises starting with a naive implementation of a matrix type in C++,
and ending with something with reasonably respectable performance (comparable to what is possible with, e.g.,
Eigen, or other BLAS libraries) on a single node on JUSUF. It only uses concepts introduced in this course, and
does not call any linear algebra library function. Work through the exercises and test the different stages on
JUSUF!

05 C++ Threads
28 pages
Lab 7
No ratings yet
Lab 7
3 pages
Parallelizing The Standard Algorithms Library - Jared Hoberock - CppCon 2014
No ratings yet
Parallelizing The Standard Algorithms Library - Jared Hoberock - CppCon 2014
58 pages
Vector Addition Activity
No ratings yet
Vector Addition Activity
4 pages
Matrix Computation On The GPU
No ratings yet
Matrix Computation On The GPU
455 pages
Carbon Black Surface Area Analysis
No ratings yet
Carbon Black Surface Area Analysis
39 pages
A Deep Dive Into The Latest HPC Software
No ratings yet
A Deep Dive Into The Latest HPC Software
38 pages
Lab1 PAR
No ratings yet
Lab1 PAR
40 pages
Assignment 1spring25
No ratings yet
Assignment 1spring25
3 pages
GPU Programming with C++ AMP
No ratings yet
GPU Programming with C++ AMP
43 pages
Par Proc Book
No ratings yet
Par Proc Book
335 pages
Danyal Education: Tanjong Katong Girls' I
No ratings yet
Danyal Education: Tanjong Katong Girls' I
20 pages
High Performance Computing (HPC) Lec4
No ratings yet
High Performance Computing (HPC) Lec4
32 pages
Evo Series
No ratings yet
Evo Series
2 pages
Qalambartar (QB) For Windows and Mac: 10, 2 M Flower @
No ratings yet
Qalambartar (QB) For Windows and Mac: 10, 2 M Flower @
3 pages
Midterm
No ratings yet
Midterm
5 pages
Mysql Assignment 1
No ratings yet
Mysql Assignment 1
2 pages
E 3 (Openmp - Iii) : Matrix Multiplication
No ratings yet
E 3 (Openmp - Iii) : Matrix Multiplication
10 pages
ParProcBook PDF
No ratings yet
ParProcBook PDF
410 pages
W 20221227 1241 C++ Vector Emplace Back 24
No ratings yet
W 20221227 1241 C++ Vector Emplace Back 24
27 pages
Decomposing A Problem For Parallel Execution - Pablo Halpern - CppCon 2014
No ratings yet
Decomposing A Problem For Parallel Execution - Pablo Halpern - CppCon 2014
48 pages
Revenue Grade Metering Standards
No ratings yet
Revenue Grade Metering Standards
2 pages
09 ParallelizationRecap PDF
No ratings yet
09 ParallelizationRecap PDF
62 pages
Guidelines AdvancedWebProgramming
No ratings yet
Guidelines AdvancedWebProgramming
2 pages
02 RTVis GPGPU CUDA
No ratings yet
02 RTVis GPGPU CUDA
34 pages
Advanced OpenMP Pitfalls & Solutions
No ratings yet
Advanced OpenMP Pitfalls & Solutions
52 pages
HPC Overview
No ratings yet
HPC Overview
45 pages
Solutions To Exercises On Parallelism and Concurrency
No ratings yet
Solutions To Exercises On Parallelism and Concurrency
5 pages
4in SB12MNRX2 25 4
No ratings yet
4in SB12MNRX2 25 4
1 page
dataVAR LAAR
No ratings yet
dataVAR LAAR
1 page
Par - 1 In-Term Exam - Course 2017/18-Q2
No ratings yet
Par - 1 In-Term Exam - Course 2017/18-Q2
7 pages
Non-Invasive Cylicon (Cylinder and Cone) Antenna For Blood Glucose Monitoring
No ratings yet
Non-Invasive Cylicon (Cylinder and Cone) Antenna For Blood Glucose Monitoring
5 pages
Taskflow A Generalpurpose Parallel and Heterogeneous Task Programming System Using Modern CPP Tsungwei Huang Cppcon 2020
No ratings yet
Taskflow A Generalpurpose Parallel and Heterogeneous Task Programming System Using Modern CPP Tsungwei Huang Cppcon 2020
53 pages
Blockholders' Power & Firm Value
No ratings yet
Blockholders' Power & Firm Value
13 pages
CC218 Lec1 DiscreteMath Logic of Compound Stat
No ratings yet
CC218 Lec1 DiscreteMath Logic of Compound Stat
7 pages
Overview of Parallel Programming in C++ - Pablo Halpern - CppCon 2014
No ratings yet
Overview of Parallel Programming in C++ - Pablo Halpern - CppCon 2014
37 pages
Advances in Engineering Software: Rodrigo R. Paz, Mario A. Storti, Lisandro D. Dalcin, Hugo G. Castro, Pablo A. Kler
No ratings yet
Advances in Engineering Software: Rodrigo R. Paz, Mario A. Storti, Lisandro D. Dalcin, Hugo G. Castro, Pablo A. Kler
11 pages
Xe 62011 Open MP
No ratings yet
Xe 62011 Open MP
46 pages
Multicore Code Entwicklung
No ratings yet
Multicore Code Entwicklung
33 pages

hpcxx2024 d3

Uploaded by

hpcxx2024 d3

Uploaded by

High performance scientific computing in C++

HPC C++ Course 2024

Member of the Helmholtz Association

Member of the Helmholtz Association 28 October – 31 October 2024 Slide 1

Member of the Helmholtz Association 28 October – 31 October 2024 Slide 2

Member of the Helmholtz Association 28 October – 31 October 2024 Slide 3

Member of the Helmholtz Association 28 October – 31 October 2024 Slide 4

Syntax modelled after Python's NumPy


Threads, mutexes, atomic operations


void prep(Population &p);

A few ad hoc tasks which do not depend on each other


tbb::blocked_range<int>(0,4) represents the half-open integer range [0, 4), i.e. the values 0, 1, 2 and 3
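
To make the range notation concrete: a tbb::blocked_range is half-open, so tbb::blocked_range<int>(0,4) covers 0, 1, 2 and 3 but not 4. The sketch below is not TBB code. It is a standard-library-only illustration of how a half-open range can be halved recursively until each piece holds no more than a grain size of indices. The names IndexRange, split_into and split_range are made up for this illustration, and a real TBB partitioner may well stop splitting earlier.

```cpp
#include <cstddef>
#include <utility>
#include <vector>

// Hypothetical stand-in (not part of TBB): a half-open index range [first, second).
using IndexRange = std::pair<std::size_t, std::size_t>;

// Recursively halve [begin, end) until a piece holds at most `grain` indices.
// Assumes begin <= end and grain >= 1.
void split_into(std::size_t begin, std::size_t end, std::size_t grain,
                std::vector<IndexRange>& pieces) {
    if (end - begin <= grain) {
        pieces.push_back({begin, end});  // small enough: keep as one piece
        return;
    }
    const std::size_t mid = begin + (end - begin) / 2;  // split in the middle
    split_into(begin, mid, grain, pieces);
    split_into(mid, end, grain, pieces);
}

// Convenience wrapper; a grain of 0 is treated as 1 to guarantee termination.
std::vector<IndexRange> split_range(std::size_t begin, std::size_t end,
                                    std::size_t grain) {
    std::vector<IndexRange> pieces;
    split_into(begin, end, grain == 0 ? 1 : grain, pieces);
    return pieces;
}
```

With these definitions, split_range(0, 4, 1) produces the four pieces [0,1), [1,2), [2,3), [3,4), while split_range(0, 4, 2) stops at [0,2) and [2,4).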


void dasxpcy_tbb(double a, std::span<const double> x, std::span<double> y) {


G dasxpcy.cc -ltbb -ltbbmalloc


void f(size_t i, size_t j);


T result = tbb::parallel_reduce(range, identity, subrange_reduction, combine);

range: as with tbb::parallel_for
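
The roles of the four arguments can be illustrated without TBB. The following is a minimal sequential sketch, assuming the usual functional form in which subrange_reduction folds one sub-range into a running value that starts from identity, and combine merges two partial results. It is not the TBB implementation: chunked_reduce and inner_prod_sketch are hypothetical names, sub-ranges are passed as a pair of half-open indices instead of a tbb::blocked_range, the vectors are taken by const reference instead of std::span to keep the sketch compilable as C++17, and nothing runs in parallel here.

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

// Sequential stand-in for the (range, identity, subrange_reduction, combine)
// pattern. Pieces of at most `grain` indices are reduced independently,
// each starting from `identity`, and the partial results are merged pairwise.
template <class T, class SubrangeReduction, class Combine>
T chunked_reduce(std::size_t begin, std::size_t end, std::size_t grain,
                 T identity, SubrangeReduction subrange_reduction,
                 Combine combine) {
    assert(begin <= end && grain > 0);
    if (end - begin <= grain) {
        return subrange_reduction(begin, end, identity);
    }
    const std::size_t mid = begin + (end - begin) / 2;
    // In a parallel runtime these two halves could be processed concurrently.
    T left = chunked_reduce(begin, mid, grain, identity, subrange_reduction, combine);
    T right = chunked_reduce(mid, end, grain, identity, subrange_reduction, combine);
    return combine(left, right);
}

// Inner product written in terms of the pattern above (illustrative only).
double inner_prod_sketch(const std::vector<double>& x,
                         const std::vector<double>& y) {
    assert(x.size() == y.size());
    return chunked_reduce(
        std::size_t{0}, x.size(), std::size_t{2}, 0.0,
        // subrange_reduction: accumulate x[i] * y[i] over [lo, hi)
        [&](std::size_t lo, std::size_t hi, double acc) {
            for (std::size_t i = lo; i < hi; ++i) acc += x[i] * y[i];
            return acc;
        },
        // combine: merge two partial sums
        [](double a, double b) { return a + b; });
}
```

The identity must be neutral for combine (0.0 for addition), because every sub-range starts from it and an empty range returns it unchanged. Since partial results may be merged in any grouping, combine should be associative; for floating point sums this means the result can differ in the last bits from a strict left-to-right loop.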


double inner_prod_tbb(std::span<const double> x, std::span<const double> y) {


If index.load() == k simultaneous calls to


Thread local "views" of a variable


Random access by index


Blocks: m.block(start_r, start_c, nr, nc), or m.block<nr,nc>(start_r, start_c)
