Very large memory being used for a 50mb excel file #525

Open

kikaragyozov opened this issue Apr 26, 2021 · 29 comments

@kikaragyozov

When I make the call to ExcelReaderFactory.CreateReader(Stream stream), where the stream contains the contents of a 50 MB Excel file, a total of 800 MB gets eaten up by that call (halting execution while it does so).

Is this using DOM? I thought only a single element/line was ever loaded in memory at a time.

@andersnm
Collaborator

Hi,

ExcelDataReader is not using DOM. And indeed, only one row is loaded in memory at a time (with some exceptions).

You're not measuring memory usage after calling AsDataSet()? A 16x increase would not be surprising in that case. Please show your code, how you measure memory usage, and preferably the file that's causing problems. If you can't share that specific file, can you create a new one in which the problem is reproducible for us too? At the very least, let us know if it's XLS, XLSX or XLSB.
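
For instance, a minimal way to measure it (sketch only; filePath and the measuring approach are placeholders, not code from this thread):

```csharp
using System;
using System.Diagnostics;
using System.IO;
using System.Text;
using ExcelDataReader;

string filePath = args[0]; // placeholder for the problematic 50 MB XLSX

// Per the ExcelDataReader README, this registration may be needed on .NET Core.
Encoding.RegisterProvider(CodePagesEncodingProvider.Instance);

long heapBefore = GC.GetTotalMemory(forceFullCollection: true);

using (var stream = File.Open(filePath, FileMode.Open, FileAccess.Read, FileShare.ReadWrite))
using (IExcelDataReader reader = ExcelReaderFactory.CreateReader(stream))
{
    long heapAfter = GC.GetTotalMemory(forceFullCollection: true);
    long workingSet = Process.GetCurrentProcess().WorkingSet64;
    Console.WriteLine($"Heap delta: {(heapAfter - heapBefore) / 1048576} MB, working set: {workingSet / 1048576} MB");
}
```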

@kikaragyozov
Author

kikaragyozov commented Apr 27, 2021

> Hi,
>
> ExcelDataReader is not using DOM. And indeed, only one row is loaded in memory at a time (with some exceptions).
>
> You're not measuring memory usage after calling AsDataSet()? A 16x increase would not be surprising in that case. Please show your code, how you measure memory usage, and preferably the file that's causing problems. If you can't share that specific file, can you create a new one in which the problem is reproducible for us too? At the very least, let us know if it's XLS, XLSX or XLSB.

The file is .xlsx.

I tried something out, and it appears there's more to this.

When I tested the code in a .NET 5 console application, by simply invoking the following on the FileStream:

var stream = File.Open(filePath, FileMode.Open, FileAccess.Read, FileShare.ReadWrite);
using IExcelDataReader reader = ExcelReaderFactory.CreateReader(stream);

I didn't see any of that insanely high memory usage.

But when I did it in an ASP.NET Core (.NET 5) project, where the file was passed through model binding via an IFormFile, and whose stream I received via a call to IFormFile.OpenReadStream(), calling the same factory method as above:

using IExcelDataReader reader = ExcelReaderFactory.CreateReader(stream);

The memory gobbles up 800 megabytes, if not a full gigabyte. I don't think it has anything to do with my specific file, but rather with how you handle a Microsoft.AspNetCore.Http.ReferenceReadStream, as returned from IFormFile.OpenReadStream().

Could you please test this out in an ASP.NET Core (.NET 5) project with a large enough file (20-50 MB) and confirm that this is indeed what causes the issue? If not, I'll try to create a GitHub repository with a file inside that you can use to reproduce it, when I get the time to do so (preferably at the end of the week).

@kikaragyozov
Author

kikaragyozov commented Apr 27, 2021

It seems that you don't even need the IFormFile model binding. Simply executing

var stream = File.Open(filePath, FileMode.Open, FileAccess.Read, FileShare.ReadWrite);
using IExcelDataReader reader = ExcelReaderFactory.CreateReader(stream);

from any action in the ASP.NET Core project (or even in the Program.cs the framework runs from) was enough to see those memory spikes.

@silkfire

You could profile your app to pinpoint exactly which method uses up this much memory and share the data here.
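
For instance, the dotnet-counters tool can show the GC heap size and the working set live while the reader runs: dotnet-counters monitor --process-id <pid> System.Runtime.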

@kikaragyozov
Author

I'm not sure how exactly to pinpoint the method, but taking a snapshot of my memory shows a whopping 64,435,728 bytes for the object type ExcelDataReader.Core.OpenXmlFormat.XlsxSST.

@andersnm
Collaborator

andersnm commented Apr 28, 2021

Can you also (rename the xlsx to zip and) unzip the XLSX and let us know the size of xl/sharedStrings.xml from inside the zip?

XlsxSST is the "Shared String Table", and is essentially a List<string> which contains all the strings in the workbook. When this object takes ~64 MB, it means there are either 64 MB / 16 bytes = 4 million strings in it, or the runtime has allocated 64 MB in total for all strings in the spreadsheet (I'm not sure if/how the .NET runtime optimizes this).

In the former case, if the spreadsheet contains 4 million strings taking up most of the 50 MB of compressed data, uncompressed at a ~5:1 ratio and doubled for the internal UTF-16 string representation, that gives approx 500 MB of allocations. Add the per-object overhead for each string and for the SST itself, and we're not far from expected behavior.
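
Restating that estimate as rough arithmetic:

```
 50 MB compressed XLSX
× ~5  (typical zip ratio for XML)    ≈ 250 MB of XML text
× 2   (UTF-16, 2 bytes per char)     ≈ 500 MB of string data
+ per-string object and SST overhead → in the ballpark of the reported usage
```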

Can you tell us a bit more about the characteristics of the workbook you are testing? How many worksheets are there, how many rows/columns, and what is the distribution of number cells vs text cells? What is the average length of the strings?

@kikaragyozov
Author

kikaragyozov commented Apr 28, 2021

> Can you also (rename the xlsx to zip and) unzip the XLSX and let us know the size of xl/sharedStrings.xml from inside the zip?
>
> XlsxSST is the "Shared String Table", and is essentially a List<string> which contains all the strings in the workbook. When this object takes ~64 MB, it means there are either 64 MB / 16 bytes = 4 million strings in it, or the runtime has allocated 64 MB in total for all strings in the spreadsheet (I'm not sure if/how the .NET runtime optimizes this).
>
> In the former case, if the spreadsheet contains 4 million strings taking up most of the 50 MB of compressed data, uncompressed at a ~5:1 ratio and doubled for the internal UTF-16 string representation, that gives approx 500 MB of allocations. Add the per-object overhead for each string and for the SST itself, and we're not far from expected behavior.
>
> Can you tell us a bit more about the characteristics of the workbook you are testing? How many worksheets are there, how many rows/columns, and what is the distribution of number cells vs text cells? What is the average length of the strings?

@andersnm, please do not forget that this "expected" behavior of seeing 500+ MB in memory only happens on ASP.NET Core, not in a console application - which should be our main concern. I'm using the same file with the same NuGet package. Everything is the same, except that one project is a simple console app while the other is an ASP.NET Core project.

The very first lines of code in both projects are identical: opening the file stream, and then opening an Excel reader via this library.

Does this not happen for you, when you test the very same file in these different environments? (Both use .NET 5)

sharedStrings.xml is 29.3 MB. The file has 244,870 rows in a single worksheet, and 39 columns. Sorry, I don't have the time right now to divulge the rest - there are string columns, numbers, and dates. No formulas.

@andersnm
Collaborator

@spiritbob Thanks for the additional information! This should be enough to attempt a repro at least. Alas, no .NET 5 here and no free disk space to install it, so it's likely going to be a while before I can repro this myself.

This all sounds very strange. I'm not sure exactly how the memory usage was measured, so I'm still keeping the door open for a "normal" explanation. Hopefully this is not some peculiar stream implementation detail bug deep inside the .NET runtime that only affects ASP.NET Core and takes a week to investigate.

@kikaragyozov
Author

> @spiritbob Thanks for the additional information! This should be enough to attempt a repro at least. Alas, no .NET 5 here and no free disk space to install it, so it's likely going to be a while before I can repro this myself.
>
> This all sounds very strange. I'm not sure exactly how the memory usage was measured, so I'm still keeping the door open for a "normal" explanation. Hopefully this is not some peculiar stream implementation detail bug deep inside the .NET runtime that only affects ASP.NET Core and takes a week to investigate.

Come the weekend, I'll try to help out by creating actual GitHub repos that (hopefully) reproduce the issue for you.

@kikaragyozov
Author

kikaragyozov commented May 12, 2021

For the life of me, I could not create a sample file that mimics the behavior of the original one I've got. Sadly, I'm unable to post the original, as the data in it is quite sensitive.

https://github.com/SpiritBob/ExcelReading/blob/main/WebProject/Program.cs
https://github.com/SpiritBob/ExcelReading/blob/main/ConsoleApp/Program.cs

Those are the only notable files in the repo. When tested with the file I've got, the memory my debugger reports as consumed in the Web project is 500-700 MB, while in the console project it reaches only 100 MB (as reported by the same debugger). Upon taking a memory snapshot after both executions, the heap size is the same in both cases (around 64 MB), which is confusing. I'm not sure what exactly to look for in order to understand where those extra 400-500 MB come from, or whether it's a display bug that should be reported to the Visual Studio debugging team.

@andersnm
Collaborator

Throwing out some random thoughts here. Just to be sure, can you confirm if the problem only happens with this particular XLSX and not with other similarly sized XLSX?

And only when a debugger is attached, or does it happen without a debugger too? Task Manager -> Details -> (right-click columns) -> Select Columns -> Peak Working Set gives an idea of memory usage during runtime.

One idea is the debugger probably loads far more debug symbol information into memory for an ASP.NET application than for a console application.

Looking at your repo, the ConsoleApp uses netcoreapp3.1:
https://github.com/SpiritBob/ExcelReading/blob/5088e0b03641a921fa218b20ddc7145710ca35f2/ConsoleApp/ConsoleApp.csproj#L5

While WebProject uses net5.0:
https://github.com/SpiritBob/ExcelReading/blob/5088e0b03641a921fa218b20ddc7145710ca35f2/WebProject/WebProject.csproj#L4

Does it reproduce in the ConsoleApp if using net5.0 here as well?

Can we see the list of files and the file sizes from the unzipped XLSX, e.g. the output from dir /s in a cmd from the unzipped XLSX directory?

@kikaragyozov
Author

kikaragyozov commented May 13, 2021

> Throwing out some random thoughts here. Just to be sure, can you confirm if the problem only happens with this particular XLSX and not with other similarly sized XLSX?

I've downloaded some public xlsx sample files, one around 50 MB in size, and none of them reproduced what my original file does. Of note: all the files I opened loaded instantly when execution passed the ExcelReaderFactory.CreateReader method call, while my file takes about 7-8 seconds for that method to return. In the meantime, in the ASP.NET Core project you can watch the memory slowly skyrocket to the reported values, whereas in the console app it reaches a certain threshold and stays there for the remainder of the long operation.

> And only when a debugger is attached, or does it happen without a debugger too? Task Manager -> Details -> (right-click columns) -> Select Columns -> Peak Working Set gives an idea of memory usage during runtime.

I ran both projects via their release executables and tracked that in Task Manager.

  • Memory for the ASP.NET Core app after creating the reader is about 240 MB (compared to ~15-20 MB when opening an extremely small file).
  • Memory for the console app after creating the reader is about 120 MB (compared to ~12-17 MB when opening an extremely small file).

Perhaps the debugger is reporting incorrectly? (500 MB, but in reality the process uses much less than that?)
I was also keeping tabs on my total RAM consumption, in case multiple processes were involved, and it always moved as expected (120 to 240 MB). If the OS is doing some caching/pre-loading under the hood (my memory was about 80% full when testing this), take that into account.

For the above tests, I changed the Console App project to .NET 5.0 as well - it made no difference.

> Can we see the list of files and the file sizes from the unzipped XLSX, e.g. the output from dir /s in a cmd from the unzipped XLSX directory?

The folder containing the unzipped .xlsx's files is called stumpy_g. The full path to that folder has been obfuscated.

ObfuscatedPath\\stumpy_g>dir /s
 Volume in drive C has no label.
 Volume Serial Number is C260-93AC

 Directory of ObfuscatedPath\\stumpy_g

05/13/2021  10:13 AM    <DIR>          .
05/13/2021  10:13 AM    <DIR>          ..
05/13/2021  10:13 AM    <DIR>          docProps
05/13/2021  10:13 AM    <DIR>          xl
01/01/1980  12:00 AM             1,168 [Content_Types].xml
05/13/2021  10:13 AM    <DIR>          _rels
               1 File(s)          1,168 bytes

 Directory of ObfuscatedPath\\stumpy_g\docProps

05/13/2021  10:13 AM    <DIR>          .
05/13/2021  10:13 AM    <DIR>          ..
01/01/1980  12:00 AM               791 app.xml
01/01/1980  12:00 AM               625 core.xml
               2 File(s)          1,416 bytes

 Directory of ObfuscatedPath\\stumpy_g\xl

05/13/2021  10:13 AM    <DIR>          .
05/13/2021  10:13 AM    <DIR>          ..
01/01/1980  12:00 AM        30,798,086 sharedStrings.xml
01/01/1980  12:00 AM            16,986 styles.xml
05/13/2021  10:13 AM    <DIR>          theme
01/01/1980  12:00 AM             1,483 workbook.xml
05/13/2021  10:13 AM    <DIR>          worksheets
05/13/2021  10:13 AM    <DIR>          _rels
               3 File(s)     30,816,555 bytes

 Directory of ObfuscatedPath\\stumpy_g\xl\theme

05/13/2021  10:13 AM    <DIR>          .
05/13/2021  10:13 AM    <DIR>          ..
01/01/1980  12:00 AM             8,390 theme1.xml
               1 File(s)          8,390 bytes

 Directory of ObfuscatedPath\\stumpy_g\xl\worksheets

05/13/2021  10:13 AM    <DIR>          .
05/13/2021  10:13 AM    <DIR>          ..
01/01/1980  12:00 AM       278,566,993 sheet1.xml
               1 File(s)    278,566,993 bytes

 Directory of ObfuscatedPath\\stumpy_g\xl\_rels

05/13/2021  10:13 AM    <DIR>          .
05/13/2021  10:13 AM    <DIR>          ..
01/01/1980  12:00 AM               698 workbook.xml.rels
               1 File(s)            698 bytes

 Directory of ObfuscatedPath\\stumpy_g\_rels

05/13/2021  10:13 AM    <DIR>          .
05/13/2021  10:13 AM    <DIR>          ..
01/01/1980  12:00 AM               588 .rels
               1 File(s)            588 bytes

     Total Files Listed:
              10 File(s)    309,395,808 bytes
              20 Dir(s)  74,978,336,768 bytes free

@andersnm
Collaborator

The two interesting files:

01/01/1980  12:00 AM        30,798,086 sharedStrings.xml
...
01/01/1980  12:00 AM       278,566,993 sheet1.xml

These files are pretty huge, and ExcelDataReader parses both in CreateReader(). I suppose this is where the 7-8 seconds go, and it accounts for ~600 MB of allocations (~300 MB of XML at 16 bits per character).

For example, another XLSX consisting mostly of images would load faster and allocate less, since there is less actual data in it to parse.

Could it be the website project is using a different garbage collector than the console project?

https://docs.microsoft.com/en-us/dotnet/standard/garbage-collection/workstation-server-gc

@kikaragyozov
Author

kikaragyozov commented May 14, 2021

> The two interesting files:
>
> 01/01/1980  12:00 AM        30,798,086 sharedStrings.xml
> ...
> 01/01/1980  12:00 AM       278,566,993 sheet1.xml
>
> These files are pretty huge, and ExcelDataReader parses both in CreateReader(). I suppose this is where the 7-8 seconds go, and it accounts for ~600 MB of allocations (~300 MB of XML at 16 bits per character).
>
> For example, another XLSX consisting mostly of images would load faster and allocate less, since there is less actual data in it to parse.
>
> Could it be the website project is using a different garbage collector than the console project?
>
> https://docs.microsoft.com/en-us/dotnet/standard/garbage-collection/workstation-server-gc

I can't believe it! This has been the cause all along! It seems a console application uses the Workstation GC by default, while an ASP.NET Core project uses the Server GC.

When I disabled Server GC in both the Console App and the ASP.NET Core App via

<ServerGarbageCollection>false</ServerGarbageCollection>

both projects are now capped at 100 MB of memory! If I enable it, both are capped at around 500-700 MB!
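
For reference, the same switch can also be set in runtimeconfig.template.json instead of the csproj property (a standard .NET option, independent of ExcelDataReader):

```json
{
  "configProperties": {
    "System.GC.Server": false
  }
}
```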

This is a shocker. I'm curious to learn why the Server GC doesn't lift even a finger to take care of that garbage. I ran a small test - queue 14 tasks on the ThreadPool that each open the same file (via ExcelDataReader) - and monitored how the memory behaves.

When it was the Workstation GC, the memory capped at about 1009 MB.
When it was the Server GC, the memory reached as high as 1.3-1.4 GB.

The Server GC finished the operation 4 seconds faster (1:16 minutes vs. 1:12 minutes).

Frankly, I thought the Server GC would beat the Workstation GC in terms of memory in that scenario, but it turned out to be the opposite.
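
A sketch of that 14-task test (filePath is a placeholder; the drain loop stands in for whatever per-row processing you do):

```csharp
using System;
using System.IO;
using System.Linq;
using System.Threading.Tasks;
using ExcelDataReader;

string filePath = args[0]; // placeholder for the large XLSX

var tasks = Enumerable.Range(0, 14).Select(_ => Task.Run(() =>
{
    using var stream = File.Open(filePath, FileMode.Open, FileAccess.Read, FileShare.ReadWrite);
    using var reader = ExcelReaderFactory.CreateReader(stream);
    do { while (reader.Read()) { } } while (reader.NextResult()); // every row of every sheet
})).ToArray();

Task.WaitAll(tasks);
```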

@appel1
Collaborator

appel1 commented May 14, 2021

The Server GC trades memory in favor of throughput. Then there's the Retain VM option that also affects how .NET frees memory back to the OS.
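
For reference, both knobs are standard .NET runtime settings; e.g. Retain VM in runtimeconfig.template.json:

```json
{
  "configProperties": {
    "System.GC.RetainVM": false
  }
}
```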

I have an idea for how we can change ExcelDataReader to allocate less, but I will have to test whether it has any real-world impact. It could be that the objects we can do something about are so short-lived that it doesn't matter.

@kikaragyozov
Author

kikaragyozov commented Aug 5, 2021

@andersnm Could this temporarily be fixed by not storing all the unzipped data / the XlsxSST in an in-memory List, but in a temporary file instead (perhaps only when the extracted size gets much bigger than the original, zipped file)?

@andersnm
Collaborator

andersnm commented Aug 5, 2021

@spiritbob I suppose using a temp file would not reduce the total allocations, but rather double them due to the added roundtrip. Alas, I don't have any other good suggestions here.

@kikaragyozov
Author

kikaragyozov commented Sep 20, 2021

@andersnm Just to confirm: this memory issue is only prevalent for xlsx files that have a large shared string table, correct? By inspecting the zip I can tell whether the file is too large to be handled and put it into a queue - see the sketch below. For the other formats that won't be possible, so I'll just take a slightly different approach in that case?
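
A sketch of that zip inspection using System.IO.Compression (the 100 MB threshold is illustrative):

```csharp
using System;
using System.IO.Compression;

// Peek at the uncompressed shared-string-table size before deciding
// whether to parse the workbook immediately or queue it for later.
static bool SstTooLarge(string xlsxPath, long maxBytes = 100L * 1024 * 1024)
{
    using ZipArchive archive = ZipFile.OpenRead(xlsxPath);
    ZipArchiveEntry sst = archive.GetEntry("xl/sharedStrings.xml");
    return sst != null && sst.Length > maxBytes; // Length = uncompressed size
}

Console.WriteLine(SstTooLarge(args[0]) ? "queue for later" : "parse now");
```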

@andersnm
Collaborator

@spiritbob The key factor is likely the combined size of the XMLs inside the XLSX file, not necessarily only the string table.

I am assuming the memory pressure occurs in the XML parser. Every tag name, text element, attribute name and attribute value inside the XML becomes an individually allocated string (although quickly freed; it's SAX-based). Internally, .NET uses 16 bits per character, so there is a 2x memory usage from that, and the XML is parsed twice, so there are 2x memory allocations from that.

Note that this excludes images - there can be large images in the spreadsheet, but these are ignored completely.

@kikaragyozov
Author

kikaragyozov commented Sep 20, 2021

> @spiritbob The key factor is likely the combined size of the XMLs inside the XLSX file, not necessarily only the string table.
>
> I am assuming the memory pressure occurs in the XML parser. Every tag name, text element, attribute name and attribute value inside the XML becomes an individually allocated string (although quickly freed; it's SAX-based). Internally, .NET uses 16 bits per character, so there is a 2x memory usage from that, and the XML is parsed twice, so there are 2x memory allocations from that.
>
> Note that this excludes images - there can be large images in the spreadsheet, but these are ignored completely.

From my internal testing, it seems that the larger the shared string table is, the more memory is used. For example: a 20 MB file with a 100 MB shared string table takes up as much as 310 MB of memory, while a 50 MB file with only a 30 MB shared string table ends up taking 120 MB. (Both figures are the end result when run with the Workstation GC.)

As for .xlsb files, which are ZIP files as well: there is no memory pressure there at all, but parsing the file takes a whopping 60-120 seconds compared to the same file in xlsx format.

Hence why I'm thinking of distinguishing the two in some way, and for now that way is by their content types:
application/vnd.openxmlformats-officedocument.spreadsheetml.sharedStrings+xml for sharedStrings in .xlsx
application/vnd.ms-excel.sharedStrings for sharedStrings in .xlsb

@andersnm
Collaborator

@spiritbob I trust your testing and agree it does sound like the string table is the main factor! To add, there was a report of bad XLSB performance on some .NET runtimes in #493 - alas, not solved yet.

@ArjunVachhani

@andersnm @appel1 ExcelDataReader loads all the shared strings into RAM. For larger files, should that be avoided?
I am thinking of loading the shared strings into a seekable file instead of RAM. When needed, we can seek to a specific record and return the content.

I have implemented it at ArjunVachhani@b571bda

I would like to hear your thoughts.
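
A rough illustration of the idea (not the linked implementation): append each shared string to a self-deleting temp file, keep only an (offset, length) index in RAM, and read strings back on demand.

```csharp
using System;
using System.Collections.Generic;
using System.IO;
using System.Text;

sealed class DiskBackedStringTable : IDisposable
{
    private readonly FileStream _file = new(Path.GetTempFileName(), FileMode.Open,
        FileAccess.ReadWrite, FileShare.None, 4096, FileOptions.DeleteOnClose);
    private readonly List<(long Offset, int Length)> _index = new();

    public void Add(string value)
    {
        byte[] bytes = Encoding.UTF8.GetBytes(value);
        _file.Seek(0, SeekOrigin.End);
        _index.Add((_file.Position, bytes.Length));
        _file.Write(bytes, 0, bytes.Length);
    }

    public string this[int index]
    {
        get
        {
            (long offset, int length) = _index[index];
            byte[] buffer = new byte[length];
            _file.Seek(offset, SeekOrigin.Begin);
            _file.ReadExactly(buffer); // .NET 7+; use a read loop on older targets
            return Encoding.UTF8.GetString(buffer);
        }
    }

    public void Dispose() => _file.Dispose();
}
```

The obvious trade-off is an extra disk seek and a fresh string allocation per lookup - exactly the complexity/performance concerns raised in the next comment.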

@appel1
Collaborator

appel1 commented Aug 15, 2023

I'll take a look, but I'm not sure it is worth it because of the added complexity. There are lots of pitfalls: read and write errors, the performance cost of reading, increased string allocations, issues with antivirus, where to write the file, configuration, cleanup, and so on.

If memory usage is a real issue then maybe we could store it more efficiently in memory somehow.

@kikaragyozov
Author

kikaragyozov commented Aug 15, 2023

@appel1 In my specific use case, we'd like to avoid pressuring the RAM at all costs. The CPU can take its time to compute, but if RAM gets overfilled, we're facing a serious issue: an application crash.

I've seen files that are just about 70 MB large, and yet consume up to a gigabyte of RAM. Now imagine a scenario where 20 such files are being processed concurrently.

Is the above suggestion perfect for this case? Yes.

Are there workarounds? Yes, but they're much more complicated than simply having it supported out of the box. You'd have to create your very own semaphore (if you'd like to preserve the order of the processed files), estimate how much memory the application can spare for these files at any one point, assign that budget to the semaphore, and start using it - see the sketch below.
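
A sketch of that kind of gate (all numbers hypothetical; note that SemaphoreSlim does not strictly guarantee FIFO ordering of waiters):

```csharp
using System;
using System.Threading;
using System.Threading.Tasks;

// Cap concurrent parses so that worst-case memory per file times the cap
// stays within the memory budget reserved for spreadsheet parsing.
const long MemoryBudget  = 2L * 1024 * 1024 * 1024; // e.g. 2 GB for parsing
const long WorstCaseFile = 1L * 1024 * 1024 * 1024; // e.g. 1 GB per large file
var gate = new SemaphoreSlim((int)(MemoryBudget / WorstCaseFile));

async Task ParseGatedAsync(Func<Task> parseFile)
{
    await gate.WaitAsync();
    try { await parseFile(); }
    finally { gate.Release(); }
}
```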

@appel1
Collaborator

appel1 commented Aug 15, 2023

Is there a sample Excel file that causes gigabytes of memory usage I can test things with?

@andersnm
Collaborator

andersnm commented Aug 15, 2023

Hi - I'm not sure loading 20 concurrent 70 MB spreadsheets is necessarily something this project should aim to support out of the box.

If there is strong demand to externalize the SST storage, one idea is to abstract it away behind an interface and let users of the library provide their own custom SST storage implementation in the reader configuration. Internally it can remain a Dictionary<> by default.

EDIT: As suggested, a queue actually sounds like a good solution. More work and rearchitecting tho :/
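
For what it's worth, a hypothetical shape for such an abstraction (all names invented here, not actual ExcelDataReader API):

```csharp
using System.Collections.Generic;

public interface ISharedStringStorage
{
    void Add(string value);  // called while parsing xl/sharedStrings.xml
    string Get(int index);   // called when a cell references the SST
}

// The default could remain in-memory, much like today:
public sealed class InMemorySharedStringStorage : ISharedStringStorage
{
    private readonly List<string> _strings = new();
    public void Add(string value) => _strings.Add(value);
    public string Get(int index) => _strings[index];
}
```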

@ArjunVachhani

@kikaragyozov I am also facing memory issues when uploading huge files. In my case the files are more than 200 MB in size.

So I have decided to build a library to read huge XLSX files. The library is mostly ready; if you try it today it should work fine. Please feel free to try it and share your feedback and bugs.
In the coming days I will push a few more changes to iron out any issues and add documentation.

Link to repository https://github.com/ArjunVachhani/XlsxHelper

@kikaragyozov
Author

kikaragyozov commented Dec 20, 2024

> @kikaragyozov I am also facing memory issues when uploading huge files. In my case the files are more than 200 MB in size.
>
> So I have decided to build a library to read huge XLSX files. The library is mostly ready; if you try it today it should work fine. Please feel free to try it and share your feedback and bugs. In the coming days I will push a few more changes to iron out any issues and add documentation.
>
> Link to repository https://github.com/ArjunVachhani/XlsxHelper

I made a really quick test with a random .XLS file, only to have it throw an exception: End of Central Directory Not Found.

@andersnm is there motivation to address this, hopefully? The final suggestion - giving consumers more control over how the SST is read/stored via an interface - is a good start. It would avoid the turmoil of having to deal with all the added complexities @appel1 described with reading it lazily.

Internally, we've made a few specialized classes that allocate a specific amount of the process's memory to be used exclusively by problematic Excel files with large shared string tables and share it fairly, along with a safe upper bound on how large the string table is allowed to be before we outright reject such a file.

ExcelDataReader is by far the best library out there in terms of performance. The only factor that, in my eyes, still deserves optimization to truly bring it to the next level is these costly string allocations taking up so much memory.

The official Open XML SDK by Microsoft has a serious memory issue as well: I believe they open the archive in Update mode, which is known to load the entire contents of the related ZipArchiveEntries into memory.

@appel1
Collaborator

appel1 commented Apr 24, 2025

I'm not opposed to abstracting away the string storage so it can be replaced if required.

Or we could optionally read the strings on demand instead. Something to experiment with, but the overhead would probably make performance quite poor.
