-
Notifications
You must be signed in to change notification settings - Fork 22
Optimizations #191
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Optimizations #191
Conversation
These optimizations were suggested by cppcheck. The optimizations consist of using const reference in function arguments to avoid invoking the copy constructor and using intialization lists.
This change avoids a copy of the vector being returned.
Instead of creating a new vector every cycle to count the number of dispatched instructions, store the vector as a class member and reset it every cycle.
This option turns on link time optimization and sets -march and -mtune flags for all targets.
A map is used instead of a vector as it has amortised constant time erase complexity.
The ctor that accepts a reference to an array is updated to accept capacity as a second parameter. This allows us to create RegisterValue from arrays without the size of RegisterValue being bound to the array size. Type traits were introduced in the first ctor to stop it from accepting pointers as it's first parameter. If pointers were allowed, it would clash with array ctor in overload resolution.
Previously, the registers were extended at the end of the execute function. Now, they are created with the right size in the first place.
We are abandoning C++17 memory pool because it is not yet supported by LLVM/Clang. I hypothesize that using raw pointers could give a performance boost as we do not use any atomics. However, this means that we need to implement our own copy/move constructors. I have used the copy-and-swap idiom to implement them.
Using C++17 memory pool and raw pointers were unsuccesful in providing a performance boost, hence we are reverting back to the old state.
The purpose of implementing a memory pool is to reduce the cost of memory allocations by reusing memory. It was designed to be used primarily for the RegisterValue class. Code from Standford's Bit Twiddling Hacks is used to implement the roundUp function.
predict() and update() take a reference to the shared_ptr instead of making a copy. This goes against the intended use of shared_ptr, however we deviate from this rule to improve performance (avoiding atomic inc/dec).
|
#rerun tests |
1 similar comment
|
#rerun tests |
| // Temporary; until execute has been verified to work correctly. | ||
| #ifndef NDEBUG | ||
| #include <iostream> | ||
| #endif | ||
|
|
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Still needed?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I think so. I have only tested it on my end, so it would be better if the team could test as well.
|
#rerun tests |
3 similar comments
|
#rerun tests |
|
#rerun tests |
|
#rerun tests |
seunghun1ee
left a comment
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Looks good to me.
FinnWilkinson
left a comment
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
All looks good
The main theme of this pull request is optimizing memory allocations/deallocations. It was found while profiling that on average, 10-20 memory allocations/deallocations would occur per cycle. By reducing the no. of allocations, I hope to reduce cycle time.
New language features and smaller changes were preferred over big structural changes to improve performance.
6bf6aad ... 5410557 primarily change function parameters to accept const references. This allows them to accept both lvalue and rvalue references; avoiding a copy.
member initializer list were also added/improved wherever possible. This change avoids having to a call the constructor of a member twice.
b0c1773 uses a new C++17 method
try_emplaceto construct an element in place avoiding a copy. It also has better syntax to pass constructor arguments compared to pre C++17 methods that requirestd::piecewise_construct.16f2758 changes
getSupportedPortsto return a reference to a vector. Theaarch64::Instructionclass has a member vector that stores the ports, and it is alive for as long as the instruction instance is, hence we return a reference to it to avoid copying.a0aebb1 adds a new CMake flag
SIMENG_OPTIMIZE. This flag turns on Link-Time Optimizations allowing the compiler to optimize across translation units. Additionally, on x86 systems, it sets the-marchand-mtunecompiler flags.856dc96 overloads
setMemoryAddressesto accept rvalue references. This overload can reuse the resources of its parameter. The vectors created inInstruction::generateAddressesare deleted at the end of function. This presents an opportunity to reuse resources.bea1f88 creates a new member
flushed_to avoid allocating a map every flush. The sets are still allocated/deallocated every flush, perhaps using a stack allocator here could be beneficial.f4a1dbb and ac83072 adds a second parameter
capacityto theRegisterValuector that takes an array as a reference. In aarch64, FP, SVE and NEON instructions use the same register bank of 256 byte registers. The updated ctor makes it easier to createRegisterValueof the correct size.Instruction::executewas updated to use the new ctor. This avoids data copying and an extra ctor call at the end of the function when zero extending.b40fde0 experiments with using C++17's new
std::unsynchronized_pool_resource. It is a general-purpose, non thread-safe memory pool implementation. However, it was abandoned as it was not implemented in LLVM/Clang.a068449 experiments with using raw pointers instead of shared_ptr. This required implementing copy and move constructors. This change resulted in a performance regression and was reverted.
ee4ec4d and cbd0d39 implement a simple memory pool implementation that is based on a free list. Each node of the free list is stored in the free chunks of memory. The memory pool grows by a factor of 2 when exhausted.
ce708b0 changes the
predictandupdatefunctions inBranchPredictorto accept a reference to shared_ptr. This is to avoid the cost of atomic increment/decrement. The lifetime of the ptr is expected to out lastpredictandupdate, hence it is safe.