Add Cython implementation for VOTable binary converters #18454
Thank you for your contribution to Astropy! 🌌 This checklist is meant to remind the package maintainers who will review this pull request of some common things to look for.
82cd430 to 1dfea18
@@ -842,6 +828,12 @@ class Double(FloatingPoint):

    format = "f8"

    def binparse(self, read):
Seems a little repetitive across the board. Looks like the only difference is a number. Can this method be inherited but access some self._expected_len (name negotiable) that is set by the subclass?
Great suggestion. Looking back at it now, I can see there is a lot of duplication. I've moved things up to the parent Numeric class; let me know if that looks good, or if you see any adjustments that need to be made.
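For readers following the thread, the refactoring discussed above can be sketched roughly as below. This is a minimal standard-library illustration, not the code in this PR: the attribute names `_expected_len` and `_struct_format` are hypothetical, and `read(n)` is assumed to return up to `n` bytes from the stream.

```python
import io
import struct


class Numeric:
    """Hypothetical base converter: subclasses only declare their layout."""

    _struct_format = None  # big-endian struct format, set by each subclass
    _expected_len = None   # number of bytes one value occupies

    def binparse(self, read):
        # `read(n)` is assumed to return up to n bytes from the stream.
        data = read(self._expected_len)
        if len(data) != self._expected_len:
            raise EOFError(
                f"expected {self._expected_len} bytes, got {len(data)}"
            )
        (value,) = struct.unpack(self._struct_format, data)
        return value


class Double(Numeric):
    _struct_format = ">d"
    _expected_len = 8


class Int(Numeric):
    _struct_format = ">i"
    _expected_len = 4


# Usage: one shared binparse, two subclasses that differ only in their numbers.
stream = io.BytesIO(struct.pack(">d", 1.5) + struct.pack(">i", -7))
print(Double().binparse(stream.read))
print(Int().binparse(stream.read))
```

The point of the design is that the length check and the decoding live in one place, so a subclass cannot drift out of sync with its siblings.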
Once we have the benchmark down, should maybe put some ballpark values here.
Hmm, looks like @eerovaher proposed an alternative at #18455. Are you interested to have a look, @stvoutsin? Thanks, all!
astropy/astropy-benchmarks#141 is merged so theoretically next push would trigger a new benchmark run with the new relevant benchmarks. 🤞
f6f4267 to 11d9403
ba78237 to 1babe2e
Hmm benchmark job says nothing has "significantly changed". Is this expected? I thought it would say "this and that is much faster" or something.
Apologies, I think I introduced a regression while trying to fix an issue that caused failed tests in the CI.
I will also test with larger row counts in the sample data to see how the runtime difference between main and this branch scales.
I've run the astropy-benchmarks locally with 200k, 500k and 1M sample row sizes (https://github.com/astropy/astropy-benchmarks/blob/main/benchmarks/votable.py#L13). Results shown here: u/stvoutsin/binary2-cython vs main branch performance comparison
350e752 to 96b3904
Benchmark run looking good. Thanks!
SOME BENCHMARKS HAVE CHANGED SIGNIFICANTLY.
Description
This pull request attempts to improve the performance of VOTable binary parsing by implementing optimized Cython converters.
Initial benchmarks have shown performance improvements of the order:
Problem:
We’re seeing relatively slow performance when parsing large VOTables using the BINARY2 serialization format.
Profiling results from py-spy show a significant portion of time is spent in astropy.io.votable.converters.binparse. Here’s an excerpt of the profiler output:
As shown, the hot path is dominated by the binparse function in converters.py.
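To illustrate why a per-cell parse function can dominate the profile, here is a small plain-Python sketch. It is not the Cython code from this PR and not astropy's implementation: the row layout (one double, one int, one short) and the function names are hypothetical, and BINARY2 null flags are ignored. It contrasts one Python-level decode call per cell with a single call per row.

```python
import io
import struct

# Hypothetical row layout: one double, one int, one short (big-endian).
FIELD_FORMATS = [">d", ">i", ">h"]


def parse_row_per_cell(read):
    """Decode a row with one Python-level unpack call per cell."""
    values = []
    for fmt in FIELD_FORMATS:
        size = struct.calcsize(fmt)
        (value,) = struct.unpack(fmt, read(size))
        values.append(value)
    return values


# A single precompiled struct decodes the whole row in one call.
ROW_STRUCT = struct.Struct(">dih")


def parse_row_batched(read):
    """Decode the same row with a single unpack call."""
    return list(ROW_STRUCT.unpack(read(ROW_STRUCT.size)))


row = struct.pack(">dih", 2.5, 42, -3)
print(parse_row_per_cell(io.BytesIO(row).read))
print(parse_row_batched(io.BytesIO(row).read))
```

Both functions return the same values; the per-cell version simply pays the interpreter's call overhead once per cell, which adds up over millions of cells.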
Solution:
Cython converters for: double, float, int, long, short, unsignedByte, boolean, bit
Compatibility:
Should be 100% backward compatible, since the API has not been modified.
Testing:
Added comprehensive benchmarks to astropy-benchmarks showing consistent 30-50% performance improvements across different table types and data patterns. astropy/astropy-benchmarks#141
Astropy:main
u/stvoutsin/binary2-cython
For better visualization:
I've also tested parsing outside of the astropy-benchmarks PR, using a VOTable with 50,000 rows and approx. 1,000 columns (~50 million cells).
With this table the parsing took 30 seconds, down from 65 seconds in the version currently on main.
Fixes #18442
Any thoughts on this approach? Are there alternatives that I haven't considered? Perhaps someone with more Cython experience can let me know if I've made any obvious mistakes or if there are better ways of doing any of this.