
Conversation

@mutalibmohammed
Contributor

The main theme of this pull request is optimizing memory allocations and deallocations. Profiling showed that, on average, 10-20 allocations/deallocations occur per cycle; by reducing that number, I hope to reduce cycle time.

New language features and smaller changes were preferred over big structural changes to improve performance.

  • 6bf6aad ... 5410557 primarily change function parameters to accept const references. A const reference binds to both lvalues and rvalues, avoiding a copy at the call boundary.

  • Member initializer lists were also added or improved wherever possible. This avoids default-constructing a member and then assigning to it, which effectively runs construction twice.

Several of these changes were suggested by the static analyzer CppCheck.
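Both patterns can be shown in one minimal sketch (the class and member names here are hypothetical, not from SimEng):

```cpp
#include <cassert>
#include <string>

// Hypothetical class illustrating both patterns. The const reference binds
// to lvalues and rvalues alike without copying at the call boundary, and
// the member initializer list constructs name_ directly rather than
// default-constructing it and assigning afterwards.
class Record {
 public:
  explicit Record(const std::string& name) : name_(name) {}
  const std::string& name() const { return name_; }

 private:
  std::string name_;
};
```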

  • b0c1773 uses the C++17 method try_emplace to construct an element in place, avoiding a copy. Its syntax for passing constructor arguments is also cleaner than the pre-C++17 approach, which requires std::piecewise_construct.
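The two insertion styles can be compared on a throwaway std::map (the key and value types here are made up for the example):

```cpp
#include <cassert>
#include <map>
#include <string>
#include <tuple>
#include <utility>

std::map<int, std::string> makeTable() {
  std::map<int, std::string> table;
  // Pre-C++17: constructor arguments go through std::piecewise_construct.
  table.emplace(std::piecewise_construct, std::forward_as_tuple(1),
                std::forward_as_tuple(3, 'a'));  // constructs "aaa" in place
  // C++17: try_emplace forwards constructor arguments directly, and does
  // nothing if the key already exists.
  table.try_emplace(2, 3, 'b');  // constructs "bbb" in place
  return table;
}
```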

  • 16f2758 changes getSupportedPorts to return a reference to a vector. The aarch64::Instruction class has a member vector that stores the ports, which lives as long as the instruction instance does, so returning a reference to it avoids a copy.
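The shape of this change, using an illustrative stand-in for aarch64::Instruction (not its real declaration):

```cpp
#include <cassert>
#include <cstdint>
#include <utility>
#include <vector>

// The port list is owned by the instruction, so a returned const reference
// stays valid for as long as the instruction lives, and no vector copy is
// made on each query.
class Instruction {
 public:
  explicit Instruction(std::vector<uint16_t> ports)
      : supportedPorts_(std::move(ports)) {}
  const std::vector<uint16_t>& getSupportedPorts() const {
    return supportedPorts_;
  }

 private:
  std::vector<uint16_t> supportedPorts_;
};
```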

  • a0aebb1 adds a new CMake flag SIMENG_OPTIMIZE. This flag turns on link-time optimization (LTO), allowing the compiler to optimize across translation units. Additionally, on x86 systems, it sets the -march and -mtune compiler flags.

  • 856dc96 overloads setMemoryAddresses to accept rvalue references. This overload can reuse the resources of its parameter: the vectors created in Instruction::generateAddresses are destroyed at the end of the function anyway, so their storage can be handed over instead of copied.
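The overload pair might look roughly like this (the types and names are illustrative stand-ins, not SimEng's exact declarations):

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <utility>
#include <vector>

struct MemoryAccessTarget {
  uint64_t address;
  uint16_t size;
};

class Instruction {
 public:
  // Lvalue callers pay for a copy of the vector...
  void setMemoryAddresses(const std::vector<MemoryAccessTarget>& addresses) {
    addresses_ = addresses;
  }
  // ...but rvalue callers hand over their buffer, so an allocation made by
  // the caller is reused instead of duplicated and then freed.
  void setMemoryAddresses(std::vector<MemoryAccessTarget>&& addresses) {
    addresses_ = std::move(addresses);
  }
  std::size_t addressCount() const { return addresses_.size(); }

 private:
  std::vector<MemoryAccessTarget> addresses_;
};
```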

  • bea1f88 creates a new member flushed_ to avoid allocating a map on every flush. The sets are still allocated/deallocated every flush; a stack allocator could be beneficial there.

  • f4a1dbb and ac83072 add a second parameter, capacity, to the RegisterValue ctor that takes an array by reference. In aarch64, FP, SVE and NEON instructions use the same register bank of 256-byte registers. The updated ctor makes it easier to create a RegisterValue of the correct size.

    Instruction::execute was updated to use the new ctor. This avoids copying data and a second constructor call at the end of the function to zero-extend the result.
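A minimal sketch of the idea, assuming a simplified RegisterValue that stores its bytes in a vector (SimEng's real class differs):

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <cstring>
#include <vector>

// The ctor copies sizeof(values) bytes of data but allocates `capacity`
// zeroed bytes, so e.g. a 4-byte FP result can be placed directly into a
// 256-byte register file entry, already zero-extended.
// Assumes capacity >= sizeof(values).
class RegisterValue {
 public:
  template <typename T, std::size_t N>
  RegisterValue(const T (&values)[N], std::size_t capacity)
      : bytes_(capacity, 0) {
    std::memcpy(bytes_.data(), values, sizeof(values));
  }
  std::size_t size() const { return bytes_.size(); }

 private:
  std::vector<uint8_t> bytes_;
};
```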

The RegisterValue and Instruction classes are a major source of memory allocations; the next few commits focus on reducing them.

  • b40fde0 experiments with C++17's new std::unsynchronized_pool_resource, a general-purpose, non-thread-safe memory pool. The experiment was abandoned because it was not yet implemented in LLVM/Clang's libc++.

  • a068449 experiments with using raw pointers instead of shared_ptr. This required implementing copy and move constructors. This change resulted in a performance regression and was reverted.

  • ee4ec4d and cbd0d39 implement a simple memory pool based on a free list. Each node of the free list is stored inside a free chunk of memory itself, and the pool grows by a factor of 2 when exhausted.
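A compact sketch of such a free-list pool, under the assumptions of fixed-size chunks, no thread-safety, and a chunk size that is at least sizeof(void*) and a multiple of alignof(void*). This is not the SimEng implementation, only the technique:

```cpp
#include <cassert>
#include <cstddef>
#include <cstdlib>
#include <vector>

// Each free chunk stores the pointer to the next free chunk inside itself,
// so the free list costs no extra memory.
class FixedPool {
 public:
  explicit FixedPool(std::size_t chunkSize, std::size_t initialChunks = 8)
      : chunkSize_(chunkSize) {
    grow(initialChunks);
  }
  ~FixedPool() {
    for (void* block : blocks_) std::free(block);
  }
  void* allocate() {
    if (head_ == nullptr) grow(allocated_);  // double the pool when exhausted
    void* chunk = head_;
    head_ = *static_cast<void**>(head_);  // pop the free list
    return chunk;
  }
  void deallocate(void* chunk) {
    *static_cast<void**>(chunk) = head_;  // push onto the free list
    head_ = chunk;
  }

 private:
  void grow(std::size_t chunks) {
    char* block = static_cast<char*>(std::malloc(chunks * chunkSize_));
    blocks_.push_back(block);
    for (std::size_t i = 0; i < chunks; i++) deallocate(block + i * chunkSize_);
    allocated_ += chunks;
  }
  std::size_t chunkSize_;
  std::size_t allocated_ = 0;
  void* head_ = nullptr;
  std::vector<void*> blocks_;
};
```

Allocation and deallocation are O(1) pointer swaps; memory is only returned to the OS when the pool is destroyed.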

  • ce708b0 changes the predict and update functions in BranchPredictor to accept a reference to the shared_ptr, avoiding the cost of its atomic increments/decrements. The pointer's lifetime is expected to outlast predict and update, so this is safe.
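The cost difference can be demonstrated with a small stand-in type (Uop and the function names below are hypothetical, not SimEng's):

```cpp
#include <cassert>
#include <cstdint>
#include <memory>

struct Uop {
  uint64_t address = 0;
};

// By value: the call itself performs an atomic reference-count increment
// (and later a decrement). By const reference: no reference-count traffic,
// which is safe as long as the caller's shared_ptr outlives the call.
long countByValue(std::shared_ptr<Uop> uop) { return uop.use_count(); }
long countByRef(const std::shared_ptr<Uop>& uop) { return uop.use_count(); }
```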

These optimizations were suggested by CppCheck.
They consist of using const references in
function arguments, to avoid invoking the copy constructor,
and using member initializer lists.
This change avoids a copy of the vector being returned.
Instead of creating a new vector every cycle to count the number of
dispatched instructions, store the vector as a class member and reset it
every cycle.
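This member-reuse pattern might look like the following sketch (the names are illustrative):

```cpp
#include <algorithm>
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <vector>

// The counter vector is allocated once in the constructor; each cycle only
// resets the existing storage in place, so no per-cycle allocation occurs.
class DispatchCounter {
 public:
  explicit DispatchCounter(std::size_t ports) : dispatches_(ports, 0) {}
  void startCycle() {
    std::fill(dispatches_.begin(), dispatches_.end(), 0);  // reset, no realloc
  }
  void recordDispatch(std::size_t port) { dispatches_[port]++; }
  const std::vector<uint64_t>& counts() const { return dispatches_; }

 private:
  std::vector<uint64_t> dispatches_;
};
```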
This option turns on link-time optimization and sets the -march
and -mtune flags for all targets.
A map is used instead of a vector as it has
amortised constant time erase complexity.
The ctor that accepts a reference to an array is updated to accept
capacity as a second parameter. This allows us to create RegisterValue
from arrays without the size of RegisterValue being bound to the array
size.

Type traits were introduced in the first ctor to stop it from
accepting pointers as its first parameter. If pointers were
allowed, it would clash with the array ctor during overload resolution.
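A sketch of how such a guard can be written with std::enable_if (the signatures are illustrative, not SimEng's exact ones):

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <cstring>
#include <type_traits>
#include <vector>

// The enable_if removes the value ctor from overload resolution when T is a
// pointer, so pointer arguments cannot be mistaken for single values and
// only the reference-to-array ctor remains a candidate.
// Assumes capacity >= sizeof(values) in the array ctor.
class RegisterValue {
 public:
  template <typename T,
            typename = std::enable_if_t<!std::is_pointer<T>::value>>
  explicit RegisterValue(T value) : bytes_(sizeof(T)) {
    std::memcpy(bytes_.data(), &value, sizeof(T));
  }

  template <typename T, std::size_t N>
  RegisterValue(const T (&values)[N], std::size_t capacity)
      : bytes_(capacity, 0) {
    std::memcpy(bytes_.data(), values, sizeof(values));
  }

  std::size_t size() const { return bytes_.size(); }

 private:
  std::vector<uint8_t> bytes_;
};
```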
Previously, the registers were extended at the
end of the execute function. Now, they are
created with the right size in the first place.
We are abandoning C++17 memory pool because it is not yet
supported by LLVM/Clang.

I hypothesize that using raw pointers could give a performance
boost as we do not use any atomics. However, this means that
we need to implement our own copy/move constructors.
I have used the copy-and-swap idiom to implement them.
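A minimal, self-contained illustration of the copy-and-swap idiom on a hypothetical owning buffer (not SimEng's actual class):

```cpp
#include <cassert>
#include <cstddef>
#include <cstring>
#include <utility>

// One assignment operator taking its argument by value serves both copy and
// move assignment, and is exception-safe: any allocation happens while
// constructing the parameter, before this object is modified.
class Buffer {
 public:
  explicit Buffer(std::size_t size) : size_(size), data_(new char[size]()) {}
  Buffer(const Buffer& other)
      : size_(other.size_), data_(new char[other.size_]) {
    std::memcpy(data_, other.data_, size_);
  }
  Buffer(Buffer&& other) noexcept : Buffer() { swap(*this, other); }
  Buffer& operator=(Buffer other) noexcept {  // by value: copy or move
    swap(*this, other);
    return *this;
  }
  ~Buffer() { delete[] data_; }
  friend void swap(Buffer& a, Buffer& b) noexcept {
    std::swap(a.size_, b.size_);
    std::swap(a.data_, b.data_);
  }
  std::size_t size() const { return size_; }

 private:
  Buffer() = default;
  std::size_t size_ = 0;
  char* data_ = nullptr;
};
```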
Both the C++17 memory pool and raw-pointer experiments
were unsuccessful in providing a performance boost,
so we are reverting to the old state.
The purpose of implementing a memory pool is to
reduce the cost of memory allocations by reusing
memory. It was designed to be used primarily for
the RegisterValue class.

Code from Stanford's Bit Twiddling Hacks is used to
implement the roundUp function.
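For reference, the classic Bit Twiddling Hacks sequence for rounding up to the next power of two, assuming that is the variant this roundUp uses (the signature here is illustrative):

```cpp
#include <cassert>
#include <cstdint>

// Smearing the highest set bit of v-1 into every lower position produces a
// run of ones; adding 1 then yields the next power of two. Requires v >= 1.
uint64_t roundUp(uint64_t v) {
  v--;
  v |= v >> 1;
  v |= v >> 2;
  v |= v >> 4;
  v |= v >> 8;
  v |= v >> 16;
  v |= v >> 32;
  v++;
  return v;
}
```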
predict() and update() take a reference to the shared_ptr
instead of making a copy. This goes against the intended use of
shared_ptr; however, we deviate from that rule to improve
performance by avoiding atomic increments/decrements.
@mutalibmohammed mutalibmohammed added the performance Performance optimisation label Sep 16, 2021
@jj16791
Contributor

jj16791 commented Sep 16, 2021

#rerun tests

1 similar comment
@jj16791
Contributor

jj16791 commented Sep 16, 2021

#rerun tests

Comment on lines +5 to +9
// Temporary; until execute has been verified to work correctly.
#ifndef NDEBUG
#include <iostream>
#endif

Contributor

Still needed?

Contributor Author

I think so. I have only tested it on my end, so it would be better if the team could test as well.

@jj16791
Contributor

jj16791 commented Sep 24, 2021

#rerun tests

3 similar comments
@jj16791
Contributor

jj16791 commented Sep 24, 2021

#rerun tests

@jj16791
Contributor

jj16791 commented Sep 25, 2021

#rerun tests

@jj16791
Contributor

jj16791 commented Sep 25, 2021

#rerun tests

Contributor

@seunghun1ee seunghun1ee left a comment

Looks good to me.

Contributor

@FinnWilkinson FinnWilkinson left a comment

All looks good

@jj16791 jj16791 merged commit 25064d6 into main Oct 1, 2021
@jj16791 jj16791 deleted the optimizations branch December 16, 2021 10:21
@jj16791 jj16791 restored the optimizations branch December 16, 2021 10:21
@jj16791 jj16791 deleted the optimizations branch December 16, 2021 10:21
