XlsxWorksheet constructor is parsing the whole sheet on construction #618
For sure we can be smarter here. Not sure how much difference lazily parsing cell content would make, but something to test and benchmark. Perhaps it would be enough to skip over unnecessary things when doing the pre-scan to determine properties like FieldCount and RowCount. I've also been toying with the idea of making the properties that require us to scan the files twice optional via configuration, at least for the XML based formats. Not sure it is possible for the binary formats due to the many different ways files break the spec. |
You can estimate |
It is often missing or incorrect, so unfortunately it can't be used. |
+1 on this issue. I'd like to implement my own method of detecting the data boundaries and this step is just consuming unnecessary time since I need to go through all the data again for my logic to work. There could be just a configuration to disable it for now and allow the boundaries to be set via a property. This has also been reported previously in an issue that was closed without resolution. #585 |
I am also facing memory issues when uploading huge files. In my case the files are more than 200MB in size, so I decided to build a library to read huge Xlsx files. The library is mostly ready; if you try it today it should work fine. Please feel free to try it and share your feedback and bugs.
In the coming days I will push a few more changes to iron out any issues and add documentation. Link to repository: https://github.com/ArjunVachhani/XlsxHelper |
Just skipping reading the cell contents when figuring out the field count for an .xlsx made quite a big difference. I'll have to test if something similar can be done for the other formats and what impact that would have.
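To illustrate the idea (this is not the actual ExcelDataReader code), here is a minimal sketch of a pre-scan that looks only at the `r` attributes of `row` and `c` elements in a worksheet's XML and skips each cell's children without reading the values. The `SheetPreScan` class and its `Scan` method are hypothetical names made up for this example, and the sketch assumes column references use uppercase letters such as `B3`.

```csharp
using System;
using System.IO;
using System.Xml;

// Hypothetical helper for illustration only; not part of ExcelDataReader.
public static class SheetPreScan
{
    // Returns the highest row index and column index seen in the sheet XML.
    public static (int RowCount, int FieldCount) Scan(Stream sheetXml)
    {
        int rowCount = 0, fieldCount = 0, rowIndex = 0, column = 0;
        using var reader = XmlReader.Create(sheetXml);
        while (!reader.EOF)
        {
            if (reader.NodeType == XmlNodeType.Element && reader.LocalName == "c")
            {
                // Use the cell reference (e.g. "B3") when present,
                // otherwise assume the cell follows the previous one.
                string cellRef = reader.GetAttribute("r");
                column = cellRef != null ? ColumnNumber(cellRef) : column + 1;
                fieldCount = Math.Max(fieldCount, column);
                // Jump past the cell's children; Skip() already advances
                // to the next node, so do not call Read() afterwards.
                reader.Skip();
                continue;
            }
            if (reader.NodeType == XmlNodeType.Element && reader.LocalName == "row")
            {
                // Rows may be sparse, so prefer the explicit row index.
                rowIndex = int.TryParse(reader.GetAttribute("r"), out int r) ? r : rowIndex + 1;
                rowCount = Math.Max(rowCount, rowIndex);
                column = 0;
            }
            reader.Read();
        }
        return (rowCount, fieldCount);
    }

    // Converts the letter part of a reference such as "AB12" to a 1-based column number.
    private static int ColumnNumber(string cellRef)
    {
        int col = 0;
        foreach (char ch in cellRef)
        {
            if (ch < 'A' || ch > 'Z') break;
            col = col * 26 + (ch - 'A' + 1);
        }
        return col;
    }
}
```

Because no cell value is ever converted or stored, the scan allocates no `Cell` objects or boxed numbers, which is the kind of saving described above. How much it helps in practice still needs to be measured.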
|
Something to also look into is whether we can use a library like TurboXml when parsing XML. If the performance gains are large enough, perhaps it is worth the cost of an additional dependency and the complication of having one code path for < net 8.0 and another for >= net 8.0. That specific library won't work very well, though, because of its push nature, but maybe there are others out there that aren't quite as allocation-happy as XmlReader is. |
@appel1 Can you provide info on when we can expect v3.8 to be released? |
Related to #618

Avoid some unnecessary work in the preparing read. We can't get rid of it completely without breaking compatibility; we need to know the column count at the very least.

Before:

| Method             | Mean     | Error    | StdDev   | Gen0      | Gen1      | Gen2     | Allocated |
|------------------- |---------:|---------:|---------:|----------:|----------:|---------:|----------:|
| ReadSingleFileXslx | 76.74 ms | 0.964 ms | 0.902 ms | 3500.0000 | -         | -        | 37.88 MB  |
| ReadSingleFileXslb | 16.71 ms | 0.160 ms | 0.149 ms | 2031.2500 | 93.7500   | -        | 20.37 MB  |
| ReadSingleFileXls  | 18.54 ms | 0.268 ms | 0.251 ms | 4187.5000 | 1312.5000 | 906.2500 | 42.74 MB  |

| Method             | Mean      | Error     | StdDev    | Gen0      | Gen1     | Gen2     | Allocated |
|------------------- |----------:|----------:|----------:|----------:|---------:|---------:|----------:|
| OpenSingleFileXslx | 35.689 ms | 0.3813 ms | 0.3566 ms | 1500.0000 | 71.4286  | -        | 15.38 MB  |
| OpenSingleFileXslb | 6.197 ms  | 0.0668 ms | 0.0592 ms | 656.2500  | 31.2500  | -        | 6.56 MB   |
| OpenSingleFileXls  | 7.859 ms  | 0.0444 ms | 0.0370 ms | 1312.5000 | 437.5000 | 218.7500 | 13.49 MB  |

After:

| Method             | Mean     | Error    | StdDev   | Gen0      | Gen1      | Gen2     | Allocated |
|------------------- |---------:|---------:|---------:|----------:|----------:|---------:|----------:|
| ReadSingleFileXslx | 58.71 ms | 1.116 ms | 1.146 ms | 2666.6667 | -         | -        | 27.13 MB  |
| ReadSingleFileXslb | 15.09 ms | 0.140 ms | 0.124 ms | 1484.3750 | 62.5000   | -        | 14.95 MB  |
| ReadSingleFileXls  | 19.23 ms | 0.158 ms | 0.148 ms | 4187.5000 | 1312.5000 | 906.2500 | 42.74 MB  |

| Method             | Mean      | Error     | StdDev    | Gen0      | Gen1     | Gen2     | Allocated |
|------------------- |----------:|----------:|----------:|----------:|---------:|---------:|----------:|
| OpenSingleFileXslx | 22.213 ms | 0.2360 ms | 0.2207 ms | 750.0000  | 62.5000  | -        | 7.75 MB   |
| OpenSingleFileXslb | 5.517 ms  | 0.0686 ms | 0.0608 ms | 421.8750  | 15.6250  | -        | 4.27 MB   |
| OpenSingleFileXls  | 7.858 ms  | 0.0997 ms | 0.0833 ms | 1312.5000 | 437.5000 | 218.7500 | 13.49 MB  |
Related to #618

Avoid some unnecessary work in the preparing read. We can't get rid of it completely without breaking compatibility; we need to know the column count at the very least.

Before:

| Method             | Mean     | Error    | StdDev   | Gen0      | Gen1      | Gen2     | Allocated |
|------------------- |---------:|---------:|---------:|----------:|----------:|---------:|----------:|
| ReadSingleFileXslx | 76.74 ms | 0.964 ms | 0.902 ms | 3500.0000 | -         | -        | 37.88 MB  |
| ReadSingleFileXslb | 16.71 ms | 0.160 ms | 0.149 ms | 2031.2500 | 93.7500   | -        | 20.37 MB  |
| ReadSingleFileXls  | 18.54 ms | 0.268 ms | 0.251 ms | 4187.5000 | 1312.5000 | 906.2500 | 42.74 MB  |

| Method             | Mean      | Error     | StdDev    | Gen0      | Gen1     | Gen2     | Allocated |
|------------------- |----------:|----------:|----------:|----------:|---------:|---------:|----------:|
| OpenSingleFileXslx | 35.689 ms | 0.3813 ms | 0.3566 ms | 1500.0000 | 71.4286  | -        | 15.38 MB  |
| OpenSingleFileXslb | 6.197 ms  | 0.0668 ms | 0.0592 ms | 656.2500  | 31.2500  | -        | 6.56 MB   |
| OpenSingleFileXls  | 7.859 ms  | 0.0444 ms | 0.0370 ms | 1312.5000 | 437.5000 | 218.7500 | 13.49 MB  |

After:

| Method             | Mean     | Error    | StdDev   | Gen0      | Gen1      | Gen2     | Allocated |
|------------------- |---------:|---------:|---------:|----------:|----------:|---------:|----------:|
| ReadSingleFileXslx | 60.62 ms | 0.917 ms | 0.858 ms | 2666.6667 | -         | -        | 27.13 MB  |
| ReadSingleFileXslb | 14.74 ms | 0.118 ms | 0.105 ms | 1484.3750 | 62.5000   | -        | 14.95 MB  |
| ReadSingleFileXls  | 18.72 ms | 0.364 ms | 0.510 ms | 4187.5000 | 1312.5000 | 906.2500 | 42.74 MB  |

| Method             | Mean      | Error     | StdDev    | Gen0      | Gen1     | Gen2     | Allocated |
|------------------- |----------:|----------:|----------:|----------:|---------:|---------:|----------:|
| OpenSingleFileXslx | 21.411 ms | 0.3481 ms | 0.3256 ms | 750.0000  | 62.5000  | -        | 7.75 MB   |
| OpenSingleFileXslb | 5.587 ms  | 0.0701 ms | 0.0656 ms | 421.8750  | 15.6250  | -        | 4.27 MB   |
| OpenSingleFileXls  | 7.748 ms  | 0.1345 ms | 0.1258 ms | 1320.3125 | 437.5000 | 218.7500 | 13.49 MB  |
ExcelDataReader/src/ExcelDataReader/Core/OpenXmlFormat/XlsxWorksheet.cs
Lines 38 to 71 in 2f14bd4
So basically calling `reader.NextResult()` is contributing 50% of the time of my workload and allocating tons of `Cell`, `Cell[]` and boxed `double` for worksheets I don't even want to read.

[Image: Profiler result]
Maybe make those properties lazily calculated?
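As a rough sketch of what "lazily calculated" could mean here, the scan could be wrapped in `Lazy<T>` so that it only runs the first time `RowCount` or `FieldCount` is read. `LazySheetInfo` and the `scan` delegate are hypothetical names for this example; the real `XlsxWorksheet` is structured differently and has compatibility constraints this sketch ignores.

```csharp
using System;

// Hypothetical wrapper for illustration only; not the library's actual type.
public sealed class LazySheetInfo
{
    private readonly Lazy<(int RowCount, int FieldCount)> _dimensions;

    // 'scan' is whatever routine walks the sheet to find its dimensions.
    public LazySheetInfo(Func<(int RowCount, int FieldCount)> scan)
    {
        _dimensions = new Lazy<(int RowCount, int FieldCount)>(scan);
    }

    // The scan runs once, on first access to either property.
    public int RowCount => _dimensions.Value.RowCount;
    public int FieldCount => _dimensions.Value.FieldCount;

    // True once the scan has actually been performed.
    public bool IsScanned => _dimensions.IsValueCreated;
}
```

With something like this, moving to a worksheet that is never read would not trigger the scan at all, while callers that do ask for `RowCount` or `FieldCount` pay the cost once.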