Parallelising fan-data to proj-data function in ML_norm.cxx #1168
Conversation
Before the change, the following calls were timed as:
After the change, they were much faster:
I haven't checked but this is probably fine. The tricky (i.e. virtually impossible) one would be iterate_efficiencies as that updates efficiencies continuously in the loop. We tried a "one step late" scheme but it doesn't converge (see the old proceedings by Darren Hogg).
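For context, a minimal sketch of the loop-carried dependence being described here. This is simplified and hypothetical, not STIR's actual iterate_efficiencies, but it shows why that particular loop cannot simply be handed to OpenMP:

```cpp
#include <cstddef>
#include <vector>

// Simplified sketch: each new value of efficiencies[a] is read back by later
// iterations through the denominator, so the iterations cannot be reordered or
// run concurrently without changing the result.
void iterate_efficiencies_sketch(std::vector<float>& efficiencies,
                                 const std::vector<std::vector<float>>& fan_data,
                                 const std::vector<float>& data_fan_sums)
{
  const std::size_t n = efficiencies.size();
  for (std::size_t a = 0; a < n; ++a) // cannot simply become "#pragma omp parallel for"
    {
      float denominator = 0.F;
      for (std::size_t b = 0; b < n; ++b)
        denominator += efficiencies[b] * fan_data[a][b]; // reads values written by
                                                         // earlier iterations
      if (denominator > 0.F)
        efficiencies[a] = data_fan_sums[a] / denominator; // write fed into the
                                                          // remaining iterations
    }
}
```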
Unfortunately, I think a lot of this is not thread-safe. Lines like
STIR/src/buildblock/ML_norm.cxx, lines 1138 to 1139 in fda27e0:

```cpp
fan_data(new_ra, new_a, new_rb, new_b) =
  fan_data(new_rb, new_b, new_ra, new_a) =
```
are not guaranteed to work. See for instance https://stackoverflow.com/a/41614045/15030207
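To make the hazard concrete, here is a stripped-down illustration with a toy container and toy indices (not the actual STIR loop), reproducing the same symmetric write pattern as the two quoted lines:

```cpp
#include <cstddef>
#include <vector>

// Toy stand-in for the 4D fan-data container, just to make the hazard visible.
struct ToyFanData
{
  int n;                  // same toy size for all four dimensions
  std::vector<float> buf; // flattened storage
  explicit ToyFanData(int n_)
      : n(n_), buf(static_cast<std::size_t>(n_) * n_ * n_ * n_, 0.F) {}
  float& operator()(int ra, int a, int rb, int b)
  {
    return buf[((static_cast<std::size_t>(ra) * n + a) * n + rb) * n + b];
  }
};

void unsafe_symmetrised_fill(ToyFanData& fan_data)
{
  const int n = fan_data.n;
#pragma omp parallel for collapse(2) // NOT safe as written
  for (int ra = 0; ra < n; ++ra)
    for (int a = 0; a < n; ++a)
      for (int rb = 0; rb < n; ++rb)
        for (int b = 0; b < n; ++b)
          {
            const float value = 1.F; // whatever gets computed for this bin
            // The thread that owns (ra, a) also stores into element (rb, b, ra, a),
            // which the thread owning (rb, b) may be writing at the same time,
            // i.e. a data race.
            fan_data(ra, a, rb, b) = fan_data(rb, b, ra, a) = value;
          }
}
```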
Unfortunately, using atomic will likely not work for lines like that, as they go via a class member; it only works for plain vector access. Even then it will be version-specific (but I'd be entirely fine with only parallelising for compilers that support a recent enough OpenMP). If atomic doesn't work, I think it'll need a critical section, essentially killing the speed-up. The alternative would then be creating writable variables for every thread (which would probably break all encapsulation of FanData etc.).
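As a rough illustration of that point (not STIR code, and using a hypothetical shared buffer and accumulator), the simple forms that #pragma omp atomic accepts versus the critical-section fallback:

```cpp
#include <cstddef>
#include <vector>

// Illustrative only: 'total' is a shared scalar, 'shared_bins' a shared buffer.
void atomic_vs_critical(std::vector<float>& shared_bins, float& total, int num_iters)
{
#pragma omp parallel for
  for (int i = 0; i < num_iters; ++i)
    {
      const float contribution = 1.F; // per-iteration work

      // Accepted: "x += expr" on a scalar lvalue is one of the simple update
      // forms that "#pragma omp atomic" supports.
#pragma omp atomic
      total += contribution;

      // The chained assignment through FanData::operator() has no such single
      // scalar lvalue, so the portable fallback is a critical section, which
      // serialises the loop body and removes most of the speed-up.
#pragma omp critical
      {
        shared_bins[static_cast<std::size_t>(i) % shared_bins.size()] += contribution;
      }
    }
}
```

Per-thread writable copies, as mentioned above, would avoid both the atomic and the critical section, at the cost of merging the copies afterwards and exposing the container internals to the loop.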
For the loops that use set_segment, it would be alright, as each segment is independent per thread, and set_segment is thread-safe.
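A sketch of that safe pattern, using hypothetical stand-in types and helpers rather than the actual STIR API (Segment, make_segment_from_fan_data and set_segment below are made up for illustration):

```cpp
#include <cstddef>
#include <vector>

struct Segment
{
  std::vector<float> data;
};

Segment make_segment_from_fan_data(int segment_num)
{
  // Dummy content; in the real code this would be filled from the fan data.
  return Segment{std::vector<float>(16, static_cast<float>(segment_num))};
}

void set_segment(std::vector<Segment>& proj_data, int segment_num, const Segment& seg)
{
  proj_data[static_cast<std::size_t>(segment_num)] = seg; // writes one slot only
}

void fan_data_to_proj_data_sketch(std::vector<Segment>& proj_data)
{
  const int num_segments = static_cast<int>(proj_data.size());
#pragma omp parallel for // safe: iteration 's' writes only its own segment slot
  for (int s = 0; s < num_segments; ++s)
    {
      const Segment seg = make_segment_from_fan_data(s); // thread-local work
      set_segment(proj_data, s, seg);                    // disjoint destination
    }
}
```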
Your comment makes sense. But the sections I've parallelised should be fine, I think. As you say, parallelising over segments should be alright. And in the places where we parallelise over ra and a, the threads only access entries at the ra and a locations. The only one that looks problematic is the first of the three collapse(2) parallelisations: there we indeed compute new indices and use them to write into the "work" variable.
Yes, the lines you originally quoted are in the first loop, but that loop only parallelises across segments. The first new link you sent is the one I also think is problematic, but the second link I think is fine again: it does update fan_data, but only at location (ra, a, ...), therefore ensuring that each thread writes to a different section of fan_data.
But updating ... It might be alright, as all the index access into the 4D array is read-only and we only update the 1D vectors, but how do we know that those 1D arrays are not adjacent for different segments? Of course, it seems pretty unlikely that this would generate a race condition, but I don't think we have a solid guarantee. I don't have access to any tools to check thread-safety, sadly.
Darn, I only noticed this symmetry thing now. I ran TSAN over it, and even the parallelisation across segments seems not to be thread-safe.
By the way, how reliable is TSAN (presumably with clang) for OpenMP? (I've tried it with gcc, but I had to build my own instrumented OpenMP library (or was it gcc?), so I gave up.)
I haven't used it with OpenMP before, and the more I look into it here, the less I trust it. I think it might be getting confused by the various layers of bounds checking and shared pointers when obtaining segments by sinogram.
Yeah.. by the way, there's no reason for this code:

```cpp
shared_ptr<SegmentBySinogram<float> > segment_ptr;
segment_ptr.reset(new SegmentBySinogram<float>(proj_data.get_segment_by_sinogram(bin.segment_num())));
```

The following is essentially equivalent but much clearer:

```cpp
const auto segment(proj_data.get_segment_by_sinogram(bin.segment_num()));
```

(I'm hoping I didn't write those lines.) Note that this is going to make TSAN problems disappear.
This does look much cleaner indeed! But I still get TSAN problems... I'll have to look into this some more.
I finally managed to get OpenMP working with TSAN for a very simple dummy for loop that just prints something to cout. But when parallelising the least problematic for loop, the one over segments where only the segment is modified, it already throws a lot of warnings in TSAN. Furthermore, using "atomic read" on the fan_data doesn't even compile, because the pragma expects to work on simple operations such as "v = x", not on complex functions. It even complained about the "[]"-style indexing in VectorWithOffset. I'm afraid we'll have to live with the slow serial implementation for now...
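For reference, a minimal illustration (not STIR code) of the restriction being hit. Which expressions pass is compiler-dependent, so the comments below reflect what was reported in this thread rather than a guarantee:

```cpp
#include <cstddef>
#include <vector>

// "#pragma omp atomic read" only accepts the form "v = x;" where x is a simple
// scalar lvalue; expressions the compiler treats as calls are rejected.
float atomic_read_forms(const std::vector<float>& flat, std::size_t i)
{
  float v;

  // Reported to compile in this thread: element access into a plain std::vector.
#pragma omp atomic read
  v = flat[i];

  // Reported NOT to compile as an atomic read: anything the compiler sees as a
  // call expression rather than a scalar lvalue, e.g.
  //   v = fan_data(ra, a, rb, b);      // operator() on a class
  //   v = vector_with_offset[a];       // VectorWithOffset::operator[]
  return v;
}
```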
Well, that's a bit strange. Writing to ...

I'm not surprised by the ... We don't need atomic reads anywhere. They are only necessary when someone else can write to that memory. So, in the second loop, we read ...

Updating ...
Yes, it's all very mysterious.
Yes, this is how I got it to work eventually.
I can try this a bit later this week!
As discussed, std::vector access works, but VectorWithOffset doesn't. There seems to be no way to use atomic here, and TSAN still complains about all the parallelisation (so TSAN can't be trusted for this). I have now also tried adding an atomic_read function to FanData that has a critical section on the read, but that kills performance thoroughly. Therefore, the best option for now is to parallelise only the loop that looks safe.
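For the record, a hypothetical sketch of that kind of atomic_read workaround (not the actual FanData class) and why it serialises everything:

```cpp
#include <cstddef>
#include <vector>

// Wrapping every read in a critical section makes it race-free, but all threads
// queue on the same lock, so the "parallel" loop ends up slower than serial code.
class LockedFanData
{
public:
  explicit LockedFanData(std::size_t n) : buf_(n, 0.F) {}

  float atomic_read(std::size_t i) const
  {
    float v;
#pragma omp critical(fan_data_read) // every caller serialises on this named lock
    {
      v = buf_[i];
    }
    return v;
  }

private:
  std::vector<float> buf_;
};
```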
Looks good now. Can you just add something to the release notes? Thanks!
Relates to issue: #1167