XlsxWorksheet constructor is parsing the whole sheet on construction #618

OwnageIsMagic · 2022-11-20T17:02:20Z

ExcelDataReader/src/ExcelDataReader/Core/OpenXmlFormat/XlsxWorksheet.cs

Lines 38 to 71 in 2f14bd4

    
    while ((record = sheetStream.Read()) != null)
    {
        switch (record)
        {
            case SheetDataBeginRecord _:
                inSheetData = true;
                break;
            case SheetDataEndRecord _:
                inSheetData = false;
                break;
            case RowHeaderRecord row when inSheetData:
                rowIndexMaximum = Math.Max(rowIndexMaximum, row.RowIndex);
                break;
            case CellRecord cell when inSheetData:
                columnIndexMaximum = Math.Max(columnIndexMaximum, cell.ColumnIndex);
                break;
            case ColumnRecord column:
                columnWidths.Add(column.Column);
                break;
            case SheetFormatPrRecord sheetFormatProperties:
                if (sheetFormatProperties.DefaultRowHeight != null)
                    DefaultRowHeight = sheetFormatProperties.DefaultRowHeight.Value;
                break;
            case SheetPrRecord sheetProperties:
                CodeName = sheetProperties.CodeName;
                break;
            case MergeCellRecord mergeCell:
                cellRanges.Add(mergeCell.Range);
                break;
            case HeaderFooterRecord headerFooter:
                HeaderFooter = headerFooter.HeaderFooter;
                break;
        }
    }

So basically, calling reader.NextResult() accounts for 50% of the time in my workload and allocates tons of Cell, Cell[] and boxed double objects for worksheets I don't even want to read.

Profiler result

Maybe make those properties lazily calculated?
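As a rough illustration of that suggestion, here is a minimal sketch using `Lazy<T>`. The `LazyWorksheetSketch` class and its `scanSheet` delegate are hypothetical stand-ins invented for this example, not ExcelDataReader's actual types; the point is only that the expensive pass runs on first property access instead of in the constructor.

```csharp
using System;

// Hypothetical stand-in for a worksheet; NOT ExcelDataReader's actual class.
public sealed class LazyWorksheetSketch
{
    private readonly Lazy<(int FieldCount, int RowCount)> dimensions;

    public LazyWorksheetSketch(Func<(int FieldCount, int RowCount)> scanSheet)
    {
        // The scan delegate is stored, not invoked, so construction stays cheap.
        dimensions = new Lazy<(int FieldCount, int RowCount)>(scanSheet);
    }

    public int FieldCount => dimensions.Value.FieldCount;

    public int RowCount => dimensions.Value.RowCount;
}

public static class Program
{
    public static void Main()
    {
        int scans = 0;
        var sheet = new LazyWorksheetSketch(() =>
        {
            scans++; // stands in for the expensive full pass over the sheet
            return (FieldCount: 3, RowCount: 100);
        });

        Console.WriteLine(scans);            // 0: nothing scanned yet
        Console.WriteLine(sheet.FieldCount); // 3: first access triggers the scan
        Console.WriteLine(sheet.RowCount);   // 100: cached result is reused
        Console.WriteLine(scans);            // 1: the scan ran exactly once
    }
}
```

With this shape, a caller that skips a worksheet without touching FieldCount or RowCount never pays for the scan.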


appel1 · 2022-11-20T18:16:29Z

For sure we can be smarter here. Not sure how much difference lazily parsing cell content would make, but something to test and benchmark. Perhaps it would be enough to skip over unnecessary things when doing the pre-scan to determine properties like FieldCount and RowCount.

I've also been toying with the idea of making the properties that require us to scan the files twice optional via configuration, at least for the XML-based formats. Not sure it is possible for the binary formats due to the many different ways files break the spec.

OwnageIsMagic · 2022-11-20T19:39:50Z

You can estimate FieldCount/RowCount from the <dimension /> element. I think just skipping sheetData would be a huge win.
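A minimal sketch of reading that element with the standard `XmlReader`, for illustration only. `DimensionSketch` and `ReadDimensionRef` are hypothetical names, and the code does not reflect how ExcelDataReader parses sheets. It relies on `<dimension>` appearing before `<sheetData>` in the worksheet schema, so it can stop before any cell is read. A file may omit the element or carry a stale value, so the result is only an estimate.

```csharp
using System;
using System.IO;
using System.Xml;

public static class DimensionSketch
{
    // Returns the ref attribute of the first <dimension> element, or null if
    // the element is absent. Stops at <sheetData>, so no cell content is read.
    public static string ReadDimensionRef(TextReader sheetXml)
    {
        using (var reader = XmlReader.Create(sheetXml))
        {
            while (reader.Read())
            {
                if (reader.NodeType != XmlNodeType.Element)
                    continue;
                if (reader.LocalName == "dimension")
                    return reader.GetAttribute("ref");
                if (reader.LocalName == "sheetData")
                    return null; // <dimension> would have come before this
            }
        }

        return null;
    }

    public static void Main()
    {
        const string xml =
            "<worksheet xmlns=\"http://schemas.openxmlformats.org/spreadsheetml/2006/main\">" +
            "<dimension ref=\"A1:D20\"/>" +
            "<sheetData><row r=\"1\"><c r=\"A1\"><v>1</v></c></row></sheetData>" +
            "</worksheet>";

        Console.WriteLine(ReadDimensionRef(new StringReader(xml))); // A1:D20
    }
}
```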

appel1 · 2022-11-20T20:07:16Z

It is often missing or incorrect, so unfortunately it can't be used.

victor-gutemberg · 2023-02-16T15:01:11Z

+1 on this issue. I'd like to implement my own method of detecting the data boundaries and this step is just consuming unnecessary time since I need to go through all the data again for my logic to work.

There could be just a configuration to disable it for now and allow the boundaries to be set via a property.

This has also been reported previously in an issue that was closed without resolution: #585.

ArjunVachhani · 2023-08-19T16:59:05Z

I am also facing memory issues when uploading huge files. In my case the files are more than 200 MB in size.

So I have decided to build a library to read huge Xlsx files. The library is mostly ready. If you try it today it should work fine. Please feel free to try it and share your feedback and any bugs.
To reduce memory usage, the library does the following:

It does not load the whole worksheet; instead it streams one row at a time.
Instead of loading shared strings into RAM, it uses an indexed file to look up values quickly.

In the coming days I will push a few more changes to iron out any issues and add documentation.

Link to repository https://github.com/ArjunVachhani/XlsxHelper

appel1 · 2024-05-18T17:04:39Z

Just skipping the cell contents when figuring out the field count for an .xlsx made quite a big difference. I'll have to test whether something similar can be done for the other formats and what impact that would have.

Current implementation
|         Method |     Mean |    Error |   StdDev |      Gen0 | Allocated |
|--------------- |---------:|---------:|---------:|----------:|----------:|
| OpenSingleFile | 82.66 ms | 0.713 ms | 0.667 ms | 3000.0000 |  33.69 MB |

Skip cell content when pre-scanning
|         Method |     Mean |    Error |   StdDev |      Gen0 |     Gen1 | Allocated |
|--------------- |---------:|---------:|---------:|----------:|---------:|----------:|
| OpenSingleFile | 56.83 ms | 0.215 ms | 0.168 ms | 2222.2222 | 222.2222 |  23.11 MB |
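The idea of skipping cell content during the pre-scan can be sketched with the standard `XmlReader` as below. This is an illustrative assumption, not the change that was benchmarked above: `PreScanSketch`, `PreScan` and `ColumnIndex` are hypothetical names, and the sketch assumes every `<c>` and `<row>` element carries an `r` attribute, which the file format does not guarantee. Only the reference attributes are inspected, and `Skip()` moves past each cell's children without materializing their values.

```csharp
using System;
using System.IO;
using System.Xml;

public static class PreScanSketch
{
    // Converts the column letters of a reference such as "AB12" to a zero-based index.
    private static int ColumnIndex(string cellRef)
    {
        int index = 0;
        foreach (char ch in cellRef)
        {
            if (ch < 'A' || ch > 'Z')
                break;
            index = index * 26 + (ch - 'A' + 1);
        }

        return index - 1;
    }

    public static (int FieldCount, int RowCount) PreScan(TextReader sheetXml)
    {
        int maxColumn = -1, maxRow = -1;
        using (var reader = XmlReader.Create(sheetXml))
        {
            while (!reader.EOF)
            {
                if (reader.NodeType == XmlNodeType.Element && reader.LocalName == "c")
                {
                    string cellRef = reader.GetAttribute("r");
                    if (cellRef != null)
                        maxColumn = Math.Max(maxColumn, ColumnIndex(cellRef));

                    // Skip() lands on the node after </c>, so no Read() here.
                    reader.Skip();
                }
                else
                {
                    if (reader.NodeType == XmlNodeType.Element && reader.LocalName == "row"
                        && int.TryParse(reader.GetAttribute("r"), out int rowNumber))
                        maxRow = Math.Max(maxRow, rowNumber - 1);

                    reader.Read();
                }
            }
        }

        return (maxColumn + 1, maxRow + 1);
    }

    public static void Main()
    {
        const string xml =
            "<worksheet><sheetData>" +
            "<row r=\"1\"><c r=\"A1\"><v>1</v></c><c r=\"B1\"><v>2</v></c></row>" +
            "<row r=\"2\"><c r=\"C2\" t=\"inlineStr\"><is><t>text</t></is></c></row>" +
            "</sheetData></worksheet>";

        var (fieldCount, rowCount) = PreScan(new StringReader(xml));
        Console.WriteLine($"{fieldCount} columns, {rowCount} rows"); // 3 columns, 2 rows
    }
}
```

The parser still has to walk the skipped markup, so this reduces allocations and value parsing rather than removing the extra pass over the file.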

appel1 · 2024-05-27T12:27:01Z

Something to also look into is if we can use a library like TurboXml when parsing XML. If the performance gains are enough perhaps it is worth the cost of having an additional dependency and the complication of having one code path for < net 8.0 and another for >= net 8.0.

That specific library won't work very well though because of its push nature, but maybe there are others out there that aren't quite as allocation-happy as XmlReader is.

bl-kd-22 · 2024-10-09T21:52:57Z

@appel1 Can you provide info on when we can expect v3.8 to be released?

Related to #618. Avoid some unnecessary work when preparing to read. We can't get rid of it completely without breaking compatibility; we need to know the column count at the very least.

Before:
|             Method |     Mean |    Error |   StdDev |      Gen0 |      Gen1 |     Gen2 | Allocated |
|------------------- |---------:|---------:|---------:|----------:|----------:|---------:|----------:|
| ReadSingleFileXslx | 76.74 ms | 0.964 ms | 0.902 ms | 3500.0000 |         - |        - |  37.88 MB |
| ReadSingleFileXslb | 16.71 ms | 0.160 ms | 0.149 ms | 2031.2500 |   93.7500 |        - |  20.37 MB |
|  ReadSingleFileXls | 18.54 ms | 0.268 ms | 0.251 ms | 4187.5000 | 1312.5000 | 906.2500 |  42.74 MB |

|             Method |      Mean |     Error |    StdDev |      Gen0 |     Gen1 |     Gen2 | Allocated |
|------------------- |----------:|----------:|----------:|----------:|---------:|---------:|----------:|
| OpenSingleFileXslx | 35.689 ms | 0.3813 ms | 0.3566 ms | 1500.0000 |  71.4286 |        - |  15.38 MB |
| OpenSingleFileXslb |  6.197 ms | 0.0668 ms | 0.0592 ms |  656.2500 |  31.2500 |        - |   6.56 MB |
|  OpenSingleFileXls |  7.859 ms | 0.0444 ms | 0.0370 ms | 1312.5000 | 437.5000 | 218.7500 |  13.49 MB |

After:
|             Method |     Mean |    Error |   StdDev |      Gen0 |      Gen1 |     Gen2 | Allocated |
|------------------- |---------:|---------:|---------:|----------:|----------:|---------:|----------:|
| ReadSingleFileXslx | 58.71 ms | 1.116 ms | 1.146 ms | 2666.6667 |         - |        - |  27.13 MB |
| ReadSingleFileXslb | 15.09 ms | 0.140 ms | 0.124 ms | 1484.3750 |   62.5000 |        - |  14.95 MB |
|  ReadSingleFileXls | 19.23 ms | 0.158 ms | 0.148 ms | 4187.5000 | 1312.5000 | 906.2500 |  42.74 MB |

|             Method |      Mean |     Error |    StdDev |      Gen0 |     Gen1 |     Gen2 | Allocated |
|------------------- |----------:|----------:|----------:|----------:|---------:|---------:|----------:|
| OpenSingleFileXslx | 22.213 ms | 0.2360 ms | 0.2207 ms |  750.0000 |  62.5000 |        - |   7.75 MB |
| OpenSingleFileXslb |  5.517 ms | 0.0686 ms | 0.0608 ms |  421.8750 |  15.6250 |        - |   4.27 MB |
|  OpenSingleFileXls |  7.858 ms | 0.0997 ms | 0.0833 ms | 1312.5000 | 437.5000 | 218.7500 |  13.49 MB |

Related to #618. Avoid some unnecessary work when preparing to read. We can't get rid of it completely without breaking compatibility; we need to know the column count at the very least.

Before:
|             Method |     Mean |    Error |   StdDev |      Gen0 |      Gen1 |     Gen2 | Allocated |
|------------------- |---------:|---------:|---------:|----------:|----------:|---------:|----------:|
| ReadSingleFileXslx | 76.74 ms | 0.964 ms | 0.902 ms | 3500.0000 |         - |        - |  37.88 MB |
| ReadSingleFileXslb | 16.71 ms | 0.160 ms | 0.149 ms | 2031.2500 |   93.7500 |        - |  20.37 MB |
|  ReadSingleFileXls | 18.54 ms | 0.268 ms | 0.251 ms | 4187.5000 | 1312.5000 | 906.2500 |  42.74 MB |

|             Method |      Mean |     Error |    StdDev |      Gen0 |     Gen1 |     Gen2 | Allocated |
|------------------- |----------:|----------:|----------:|----------:|---------:|---------:|----------:|
| OpenSingleFileXslx | 35.689 ms | 0.3813 ms | 0.3566 ms | 1500.0000 |  71.4286 |        - |  15.38 MB |
| OpenSingleFileXslb |  6.197 ms | 0.0668 ms | 0.0592 ms |  656.2500 |  31.2500 |        - |   6.56 MB |
|  OpenSingleFileXls |  7.859 ms | 0.0444 ms | 0.0370 ms | 1312.5000 | 437.5000 | 218.7500 |  13.49 MB |

After:
|             Method |     Mean |    Error |   StdDev |      Gen0 |      Gen1 |     Gen2 | Allocated |
|------------------- |---------:|---------:|---------:|----------:|----------:|---------:|----------:|
| ReadSingleFileXslx | 60.62 ms | 0.917 ms | 0.858 ms | 2666.6667 |         - |        - |  27.13 MB |
| ReadSingleFileXslb | 14.74 ms | 0.118 ms | 0.105 ms | 1484.3750 |   62.5000 |        - |  14.95 MB |
|  ReadSingleFileXls | 18.72 ms | 0.364 ms | 0.510 ms | 4187.5000 | 1312.5000 | 906.2500 |  42.74 MB |

|             Method |      Mean |     Error |    StdDev |      Gen0 |     Gen1 |     Gen2 | Allocated |
|------------------- |----------:|----------:|----------:|----------:|---------:|---------:|----------:|
| OpenSingleFileXslx | 21.411 ms | 0.3481 ms | 0.3256 ms |  750.0000 |  62.5000 |        - |   7.75 MB |
| OpenSingleFileXslb |  5.587 ms | 0.0701 ms | 0.0656 ms |  421.8750 |  15.6250 |        - |   4.27 MB |
|  OpenSingleFileXls |  7.748 ms | 0.1345 ms | 0.1258 ms | 1320.3125 | 437.5000 | 218.7500 |  13.49 MB |

appel1 added the performance label Dec 3, 2022

OwnageIsMagic mentioned this issue Sep 3, 2023

Extract text from big documents faster #653

Closed

appel1 added this to the 3.8 milestone Jun 14, 2024

appel1 modified the milestones: 3.8, 4.0 Apr 19, 2025
