[QC] optimize numpy operations #3621
to speed up computation and reduce memory consumption

Conversation
ThomasLecocq left a comment:
Hi Roman,
Are those covered by tests (originally)?
It does indeed look very simple & efficient!
obspy/signal/quality_control.py (Outdated)

```python
self.meta['sample_mean'] = full_samples.mean()
...
full_samples = np.concatenate([tr.data for tr in self.data])
self.meta['sample_median'] = np.median(full_samples)
```
Could this and the two following lines be replaced by np.percentile(full_samples, [25, 50, 75])? Is that faster? (I would suppose it would be computing the distribution only once.)
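A minimal sketch of that suggestion, with synthetic data standing in for the concatenated trace samples (the quartile unpacking is assumed from context):

```python
import numpy as np

# Synthetic stand-in for the concatenated trace samples.
full_samples = np.random.randn(1_000_000)

# One call computes all three quartiles in a single pass instead of
# partitioning the array once per statistic.
lower_quartile, median, upper_quartile = np.percentile(
    full_samples, [25, 50, 75])
```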
Benchmarks with an array of 100_000_000 elements suggest a speedup by a factor of 2 when using your suggestion; I would definitely go for this.
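A rough sketch of such a benchmark (assumed setup; the original benchmark code is not shown in this thread):

```python
import time
import numpy as np

full_samples = np.random.randn(100_000_000)

# Three separate calls, each processing the full array.
t0 = time.perf_counter()
np.median(full_samples)
np.percentile(full_samples, 25)
np.percentile(full_samples, 75)
t1 = time.perf_counter()

# A single call computing all three percentiles at once.
np.percentile(full_samples, [25, 50, 75])
t2 = time.perf_counter()

print(f"separate: {t1 - t0:.2f} s, combined: {t2 - t1:.2f} s")
```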
I also wonder whether we should make use of the fact that np.mean(full_samples**2) is already calculated and can be reused instead of computing np.std from scratch:
Instead of calling np.std(...),
squared = np.mean(full_samples**2) is stored, and then
self.meta['sample_stdev'] = np.sqrt(squared - np.mean(full_samples)**2),
as this avoids computing np.mean(full_samples**2) twice.
My benchmarks suggest that this also gives a factor of ~2 for computing the standard deviation. I suggest making this change as well before merging the code.
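A minimal sketch of this identity, Var(x) = E[x**2] - E[x]**2, again with synthetic data (note that this one-pass formula can lose precision through cancellation when the mean is large relative to the spread):

```python
import numpy as np

full_samples = np.random.randn(1_000_000)

mean = full_samples.mean()
squared = np.mean(full_samples**2)  # assumed to be computed anyway

# Standard deviation via the variance identity, reusing the mean of squares.
stdev = np.sqrt(squared - mean**2)

# Agrees with np.std() (population std, ddof=0) up to floating-point error.
assert np.isclose(stdev, np.std(full_samples))
```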
Agree!
Ok, I just updated the pull request. The updated code should reflect today's discussion.
oh and for the near future, please branch & PR against master, not maintenance - we'll get rid of this branch soon & release directly from master.
Update to further improve performance as discussed.
Looks good. Could you add a line in the changelog too? It's always nice to report on performance improvements when we release :-)
Done.
What does this PR do?
Some existing code is replaced with numpy calls. This speeds up computation and reduces memory consumption.
Why was it initiated? Any relevant Issues?
Computation was slow and memory consumption was high for large input files.
PR Checklist
- Correct base branch selected? master for new features, maintenance_... for bug fixes.
- First time contributors have added their name to CONTRIBUTORS.txt.
- Add the ready for review label when you are ready for the PR to be reviewed.