Conversation

@markus-jehl (Contributor) commented Feb 28, 2023

Relates to issue: #1167

@markus-jehl (Contributor, Author)

Before the change, the following calls were timed as:

  • make_fan_data_remove_gaps: 2.8 s
  • apply_efficiencies: 0.7 s
  • apply_geo_norm: 2.1 s
  • set_fan_data_add_gaps: 3.1 s

After the change, they were much faster:

  • make_fan_data_remove_gaps: 1 s
  • apply_efficiencies: 0.1 s
  • apply_geo_norm: 0.4 s
  • set_fan_data_add_gaps: 0.6 s

@KrisThielemans (Collaborator) left a comment

I haven't checked but this is probably fine. The tricky (i.e. virtually impossible) one would be iterate_efficiencies as that updates efficiencies continuously in the loop. We tried a "one step late" scheme but it doesn't converge (see the old proceedings by Darren Hogg).
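
(For context, a schematic of the kind of in-loop dependence being described; the names and the update formula below are simplified stand-ins, not the actual iterate_efficiencies code:)

    // Schematic only: each efficiency update reads efficiencies that earlier
    // iterations of the same sweep may already have rewritten, so the
    // iterations cannot simply be distributed over threads.
    for (int a = 0; a < num_detectors; ++a)
      {
        float denom = 0.F;
        for (int b = 0; b < num_detectors; ++b)
          denom += model(a, b) * efficiencies[b];      // may read freshly updated values
        efficiencies[a] = measured_fan_sum[a] / denom; // written inside the same sweep
      }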

@KrisThielemans (Collaborator) left a comment

Unfortunately, I think a lot of this is not thread-safe. Lines like

fan_data(new_ra, new_a, new_rb, new_b) =
fan_data(new_rb, new_b, new_ra, new_a) =

are not guaranteed to work. See for instance https://stackoverflow.com/a/41614045/15030207

Unfortunately, atomic will likely not work for lines like that, as they go via a class member; it only works for plain vector access. Even then, it will be version-specific (but I'd be entirely fine only parallelising for compilers that support a recent enough OpenMP). If atomic doesn't work, I think it'll need a critical section, essentially killing the speed-up. The alternative would be to create writable variables for every thread, which would probably break all the encapsulation of FanData etc.
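
(A minimal sketch of the two patterns being contrasted, with made-up names; per the later comments in this thread, atomic is accepted for plain std::vector element access but not for updates that go through the fan-data class, leaving a critical section as the fallback:)

    #include <vector>

    struct FanDataSketch {                    // stand-in for the real fan-data class
      std::vector<float> buf;
      int n;
      explicit FanDataSketch(int num) : buf(num * num, 0.F), n(num) {}
      float& operator()(int a, int b) { return buf[a * n + b]; }
    };

    void accumulate(std::vector<float>& effs, FanDataSketch& fan_data, int n)
    {
    #pragma omp parallel for
      for (int a = 0; a < n; ++a)
        for (int b = 0; b < n; ++b)
          {
    #pragma omp atomic                        // accepted: plain scalar update via operator[]
            effs[b] += 1.F;

            // atomic cannot be used for the next statement, as the update goes
            // through FanDataSketch::operator(); a (named) critical section
            // works, but serialises the update across all threads.
    #pragma omp critical(fan_data_update)
            fan_data(a, b) += 1.F;
          }
    }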

For the loops that use set_segment, it would be alright, as each segment is independent per thread, and set_segment is thread-safe.
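
(Schematic of that per-segment pattern; the calls are the ones mentioned in this thread or close to them, but the signatures here are simplified:)

    // Each thread owns one segment: it reads the shared data, fills its own
    // thread-local segment, and hands it over via set_segment, which (per the
    // discussion above) is assumed to serialise internally.
    #pragma omp parallel for
    for (int seg_num = proj_data.get_min_segment_num(); seg_num <= proj_data.get_max_segment_num(); ++seg_num)
      {
        auto segment = proj_data.get_empty_segment_by_sinogram(seg_num); // thread-local
        // ... fill 'segment' from fan_data, reading shared data only ...
        proj_data.set_segment(segment);
      }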

@markus-jehl (Contributor, Author)

Your comment makes sense. But the sections I've parallelised should be fine, I think. As you say, parallelising on segments should be alright. And in the places where we parallelise over ra and a, the threads only access entries at the ra and a locations. The only one that looks problematic is the first of the three collapse(2) parallelisations: there we indeed compute new indices and use them to write into the "work" variable.

@KrisThielemans (Collaborator)

I'm not sure I agree. The lines I quoted are in the 1st loop and update new_fan_data; they are problematic. There's a loop here (the 3rd one?) which updates work. The next one updates fan_data. No?

@markus-jehl (Contributor, Author)

Yes, the lines you originally quoted are in the first loop, but that loop only parallelises across segments. The first new link you sent is the one I also think is problematic, but the second link I think is fine again: it does update fan_data, but only at location (ra, a, ...), therefore ensuring that each thread writes to a different section of fan_data.

@KrisThielemans (Collaborator)

But updating fan_data is just as problematic, certainly since we update it symmetrically (fan_data(new_ra, new_a, new_rb, new_b) = fan_data(new_rb, new_b, new_ra, new_a) = ...). How do we know that another thread isn't accessing the "symmetric" version, or just a neighbouring one, with therefore the potential for memory corruption?

It might be alright as all the index access in the 4D array is read-only, and we only update the 1D vectors, but how do we know that those 1D arrays are not adjacent for different segments?

Of course, it seems pretty unlikely that this would generate a race condition, but I don't think we have a solid guarantee.
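
(A stripped-down illustration of that hazard, with a hypothetical mirror index in place of the real gap-removal index arithmetic:)

    #include <vector>

    // Two iterations i and j with mirror[i] == j (and mirror[j] == i) can run on
    // different threads and write the same two elements concurrently: a data
    // race, even if the values written happen to be identical.
    void symmetric_update(std::vector<float>& fan, const std::vector<int>& mirror)
    {
    #pragma omp parallel for
      for (int i = 0; i < static_cast<int>(fan.size()); ++i)
        {
          const float value = 0.5F * (fan[i] + fan[mirror[i]]);
          fan[i] = fan[mirror[i]] = value;
        }
    }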

I don't have access to any tools to check thread-safety sadly.

@markus-jehl (Contributor, Author)

Darn, I only noticed this symmetry issue now. I ran TSAN over it, and even the parallelisation across segments seems not to be thread-safe.

@KrisThielemans (Collaborator)

By the way, how reliable is TSAN (presumably with clang) for OpenMP? (I've tried it with gcc, but I had to build my own instrumented OpenMP library (or was it gcc?), so I gave up.)

@markus-jehl (Contributor, Author)

I haven't used it with OpenMP before, and the more I look into it here, the less I trust it. I think it might be getting confused by the various layers of bounds checking and shared pointers when obtaining segments by sinogram.

@KrisThielemans (Collaborator) commented Mar 6, 2023

Yeah... By the way, there's no reason for this code:

 shared_ptr<SegmentBySinogram<float> > segment_ptr;
 segment_ptr.reset(new SegmentBySinogram<float>(proj_data.get_segment_by_sinogram(bin.segment_num())));

The following is essentially equivalent, but much clearer:

   const auto segment(proj_data.get_segment_by_sinogram(bin.segment_num()));

(I'm hoping I didn't write those lines)

Note that this is going to make TSAN problems disappear.

@markus-jehl (Contributor, Author)

This does look much cleaner indeed! But I still get TSAN problems... I'll have to look into this some more.

@markus-jehl (Contributor, Author)

I finally managed to get OpenMP working with TSAN for a very simple dummy for loop that just prints something to cout. However, when parallelising the least problematic for loop over segments, where only the segment is modified, TSAN already throws a lot of warnings. Furthermore, using "atomic read" on fan_data doesn't even compile, because OpenMP expects it to be applied to simple operations such as "v = x", not to complex function calls. It even complained about the "[]"-style indexing in VectorWithOffset.

I'm afraid we'll have to live with the slow serial implementation for now...

@KrisThielemans (Collaborator)

> I managed to get OpenMP working with TSAN for a very simple dummy for loop just printing something to cout.

Well, that's a bit strange. Writing to cout is most definitely not thread-safe without a critical section, so it should have complained! I've tried to find some documentation on clang/TSAN/OpenMP but gave up. The only info I could find (but it's old) is that you need to build your own libgomp and then LD_PRELOAD it.

"atomic read" on the fan_data doesn't even compile, because it expects this to work on simple operations such as "v=x", not on complex functions. Even on the "[]"-style indexing in VectorWithOffset it complained

I'm not surprised by the atomic restrictions for fan_data, but I am disappointed it cannot handle VectorWithOffset::operator[]. Does it work with std::vector::operator[], by the way?

We don't need atomic reads anywhere. They are only necessary when someone else can write to that memory. So, in the second loop, we read fan_data in parallel (fine), write to a thread-local segment (fine), and then call set_segment, which should have its own internal critical section, so that should be fine as well. You can always check by adding your own critical section around the set_segment call.
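
(As a sketch of that check, reusing the simplified names from the segment-loop sketch earlier in this thread:)

    // Wrap the hand-over in an explicit named critical section; if the TSAN
    // warnings (or the results) change, set_segment's own locking was not
    // sufficient on its own.
    #pragma omp critical(set_segment_call)
    proj_data.set_segment(segment);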

Updating fan_data though is harder.

@markus-jehl (Contributor, Author)

Yes, it's all very mysterious.

> I've tried to find some documentation on clang/TSAN/OpenMP but gave up. The only info I could find (but it's old) is that you need to build your own libgomp and then LD_PRELOAD it.

Yes, this is how I got it to work eventually.

> I'm not surprised by the atomic restrictions for fan_data, but am disappointed it cannot handle VectorWithOffset::operator[]. Does it work with std::vector::operator[] by the way?
>
> We don't need atomic reads anywhere. They are only necessary when someone else can write to that memory. So, in the second loop, we read fan_data in parallel (fine), and write to a thread-local segment (fine), and then call set_segment, which should have its own internal critical section, so should be fine as well. You can always check by adding your own critical section for the set_segment.

I can try this a bit later this week!

@markus-jehl (Contributor, Author)

As discussed, atomic works with std::vector access, but not with VectorWithOffset. So there seems to be no way to use atomic here, and TSAN still complains about all the parallelisation (therefore TSAN can't be trusted here).

I have now also tried adding an atomic_read function to FanData that puts a critical section around the read, but that thoroughly kills performance.
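
(Roughly what that looks like as a sketch; the actual code on the branch may differ:)

    // Illustrative only: every read takes the same lock, so parallel readers
    // end up serialising on it and the speed-up disappears.
    float atomic_read(const std::vector<float>& buf, int index)
    {
      float value;
    #pragma omp critical(fan_data_read)
      value = buf[index];
      return value;
    }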

Therefore, the best option is to parallelise only the loop that looks safe for now.

@KrisThielemans changed the title from "Parallelising functions in ML_norm.cxx to improve performance." to "Parallelising fan-data to proj-data function in ML_norm.cxx" on Mar 29, 2023
@KrisThielemans (Collaborator) left a comment

Looks good now. Can you just add something to the release notes? Thanks!

@KrisThielemans merged commit a7e6d56 into UCL:master on Mar 30, 2023
@markus-jehl deleted the issue/1167-methods-in-ML_norm-are-not-parallelised-yet branch on January 29, 2024