Very large memory being used for a 50mb excel file #525
Hi, ExcelDataReader is not using DOM. And indeed, only one row is loaded in memory at once (with some exceptions). You're not measuring memory usage after calling AsDataSet()? 16x would not be surprising in that case. Please show your code, how you measure memory usage, and preferably the file that's causing problems. If you can't share that specific file, can you create a new one in which the problem is reproducible for us too? At the very least let us know if it's XLS, XLSX or XLSB |
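For reference, a minimal sketch of the difference being described here, assuming the documented ExcelDataReader API and a made-up file name; this is not code from the thread:

```csharp
using System;
using System.IO;
using ExcelDataReader;

class Program
{
    static void Main()
    {
        // Required once on .NET Core before using ExcelDataReader.
        System.Text.Encoding.RegisterProvider(System.Text.CodePagesEncodingProvider.Instance);

        using var stream = File.Open("large.xlsx", FileMode.Open, FileAccess.Read);
        using var reader = ExcelReaderFactory.CreateReader(stream);

        // Streaming: only the current row (plus shared state such as the
        // shared string table) is held in memory at a time.
        while (reader.Read())
        {
            for (var i = 0; i < reader.FieldCount; i++)
            {
                var value = reader.GetValue(i);
            }
        }

        // By contrast, AsDataSet() materializes every cell of every sheet into
        // a DataSet, so memory usage scales with the whole workbook.
        // var dataSet = reader.AsDataSet();
    }
}
```

AsDataSet() comes from the ExcelDataReader.DataSet package and loads everything up front, which is why it can multiply memory usage well beyond the file size.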
The file is .xlsx. I tried something out, and it appears there's more to this. When I tested the code in a .NET 5 console application, simply opening the reader over the FileStream, I didn't see any of that insane memory usage. But when I did the same in an ASP.NET Core (.NET 5) project, where the file was passed in through model binding, the memory balloons to 800 megabytes, if not a gigabyte. I don't think it has anything to do with my specific file, but rather with how the stream is handled. Could you please test this out in an ASP.NET Core (.NET 5) project with a large enough file (20-50 MB) and confirm that this is indeed causing the issues? If not, I'll try to create a GitHub repository with a file inside that you can use to reproduce the issue when I get the time to do so (preferably at the end of the week). |
It seems you don't even need the model binding at all. Simply opening the reader from any action in the ASP.NET Core project (or even in the Program.cs the framework runs from) was enough to see those memory spikes. |
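A hypothetical minimal action matching this description (no model binding, a file read straight from disk; controller name, route and file name are all made up, and the CodePages encoding provider is assumed to be registered at startup as ExcelDataReader requires on .NET Core):

```csharp
using System.IO;
using ExcelDataReader;
using Microsoft.AspNetCore.Mvc;

public class ExcelDiagnosticsController : Controller
{
    [HttpGet("/read-excel")]
    public IActionResult ReadExcel()
    {
        // A file on disk; no IFormFile or model binding involved.
        using var stream = System.IO.File.Open("large.xlsx", FileMode.Open, FileAccess.Read);
        using var reader = ExcelReaderFactory.CreateReader(stream);
        while (reader.Read()) { /* discard rows */ }
        return Ok();
    }
}
```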
You could profile your app to pinpoint exactly which method uses up this much memory and share the data here. |
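A profiler is the right tool, but as a rough first pass the call can be bracketed with allocation and working-set counters. A sketch, with an assumed file name, not code from the thread:

```csharp
using System;
using System.Diagnostics;
using System.IO;
using System.Text;
using ExcelDataReader;

static class MemoryProbe
{
    static void Main()
    {
        Encoding.RegisterProvider(CodePagesEncodingProvider.Instance);

        long allocatedBefore = GC.GetAllocatedBytesForCurrentThread();

        using (var stream = File.Open("large.xlsx", FileMode.Open, FileAccess.Read))
        using (var reader = ExcelReaderFactory.CreateReader(stream))
        {
            while (reader.Read()) { } // walk the rows of the first sheet
        }

        long allocatedAfter = GC.GetAllocatedBytesForCurrentThread();
        long workingSet = Process.GetCurrentProcess().WorkingSet64;

        Console.WriteLine($"Managed bytes allocated on this thread: {(allocatedAfter - allocatedBefore) / (1024 * 1024)} MB");
        Console.WriteLine($"Current working set: {workingSet / (1024 * 1024)} MB");
    }
}
```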
I'm not sure how exactly to pinpoint the method, but taking a snapshot of my memory, it shows a whopping 64,435,728 bytes for the object type XlsxSST. |
Can you also (rename the xlsx to zip and) unzip the XLSX and let us know the size of the files inside? XlsxSST is the "Shared String Table", and is essentially a List<string> which contains all the strings in the workbook. When this object takes ~64 MB, it means there are either 64 MB / 16 = 4 million strings in it, or the runtime has allocated 64 MB in total for all strings in the spreadsheet (I'm not sure if/how the .NET runtime optimizes this). In the former case, if the spreadsheet contains 4 million strings taking up most of the 50 MB of compressed data, then uncompressing at a ~5:1 ratio and doubling that for the internal UTF-16 string representation gives approximately 500 MB of allocations. Add the per-string object overhead and the SST itself, and we're not that far from expected behavior. Can you tell a bit more about the characteristics of the workbook you are testing? How many worksheets are there, how many rows/cols, what is the distribution of number cells vs text cells? What is the average length of the strings? |
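A back-of-envelope version of that arithmetic, using assumed per-string costs (roughly what a 64-bit .NET runtime charges) and an assumed average string length; the figures are illustrative, not measurements from the reported file:

```csharp
static class SstEstimate
{
    // ~26 bytes of string object overhead on a 64-bit runtime, 2 bytes per UTF-16
    // character, plus an 8-byte reference in the List's backing array (all approximate).
    static long EstimateSstBytes(long stringCount, double averageLength) =>
        (long)(stringCount * (26 + 2 * averageLength + 8));

    static void Main()
    {
        // 4 million strings averaging 60 characters comes out to roughly 587 MB
        // of managed allocations for the shared string table alone.
        System.Console.WriteLine(EstimateSstBytes(4_000_000, 60) / (1024 * 1024) + " MB");
    }
}
```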
@andersnm , please do not forget that this "expected" behavior of seeing 500+ MB in memory only happens on the ASP.NET Core framework, not in a console application - that difference should be our main concern. I'm using the same file, with the same NuGet package. Everything is the same, except that one project is a simple console app while the other is an ASP.NET Core project. The very first lines of code in both projects are the same: opening the file stream, and then opening an excel reader via the current library. Does this not happen for you when you test the very same file in these different environments? (Both use .NET 5)
|
@spiritbob Thanks for the additional information! This should be enough to attempt a repro at least. Alas, no .NET 5 here and no free disk space to install it, so it's likely going to be a while before I can repro myself. This all sounds very strange. Not sure exactly how the memory usage was measured, so still keeping the door open for a "normal" explanation. Hopefully this is not some peculiar streams implementation detail bug deep inside the .NET runtime that only affects ASP.NET Core and takes a week to investigate. |
Come the weekend I'll try to help out by creating actual GitHub repos reproducing the issue for you (hopefully).
For the love of me, I could not create a sample file that mimics the behavior of the original one I've got. Sadly, I'm unable to post it as the data in it is quite sensitive. https://github.com/SpiritBob/ExcelReading/blob/main/WebProject/Program.cs Those are the only files worth noting in the repo. When tested with the file I've got, the memory my debugger reports as being consumed in the Web project is 500-700 MB, while in the console project it reaches only 100 MB (as reported by my debugger). Upon taking a memory snapshot after both executions, the heap size is the same, so that's kind of confusing (in both cases around 64 MB). I'm not sure what exactly I need to look for in order to understand where those extra 400-500 MB are coming from, or whether it's a display bug that needs to be reported to the Visual Studio debugging team. |
Throwing out some random thoughts here. Just to be sure, can you confirm the problem only happens with this particular XLSX and not with other similarly sized XLSX files? And only when a debugger is attached, or does it happen without a debugger too? Task Manager -> Details -> (right-click columns) -> Select Columns -> Peak Working Set gives an idea of memory usage during runtime. One idea is that the debugger probably loads far more debug symbol information into memory for an ASP.NET application than for a console application. Looking at your repo, the ConsoleApp targets netcoreapp3.1 while WebProject targets net5.0. Does it reproduce in the ConsoleApp if it uses net5.0 as well? Can we see the list of files and their sizes from the unzipped XLSX? |
I've downloaded some public .xlsx sample files, one about 50 MB in size, and none of them produced what my original file does. To note: all the files I opened loaded instantly when execution reached the CreateReader call.

I ran both projects via their release executables and tracked them in the Task Manager.

Perhaps the debugger is reporting incorrectly? (500 MB, but in reality the process uses much less than that?) For the above tests I changed the Console App project to target .NET 5.0, so nothing has changed there.

Here are the contents of the unzipped file's folder:
|
The two interesting files are the worksheet XML and the shared string table. These files are pretty huge, and ExcelDataReader parses both in full. For example, another XLSX consisting mostly of images would load faster and allocate less, since there is less actual data in it to parse. Could it be the website project is using a different garbage collector than the console project? https://docs.microsoft.com/en-us/dotnet/standard/garbage-collection/workstation-server-gc |
I can't believe it! This has been the cause all along! It seems a console application uses the Workstation GC, while the ASP.NET Core project uses the Server GC. When I disabled Server GC in both the Console App and the ASP.NET Core App via the project configuration, both projects are capped at 100 MB of memory! If I enable it, both are capped at around 500-700 MB!!! This is a shocker. I'm curious to learn why the Server GC does not lift even a finger to take care of that garbage.

I made a small test: queue 14 tasks on the ThreadPool that open up the same file (via ExcelDataReader), and monitor how the memory behaves. With the Workstation GC the memory capped at about 1009 MB. The Server GC finished the operation 4 seconds faster (1:16 vs 1:12 minutes). Frankly, I thought the Server GC would beat the Workstation GC in terms of memory in that scenario, but it seemed to be the opposite. |
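For anyone retracing this, a sketch of how to confirm the active GC mode and rerun the 14-task experiment (file name and task body assumed; the project-level switch being toggled is presumably the ServerGarbageCollection MSBuild property, i.e. the System.GC.Server runtime setting):

```csharp
using System;
using System.IO;
using System.Linq;
using System.Runtime;
using System.Threading.Tasks;
using ExcelDataReader;

class GcExperiment
{
    static async Task Main()
    {
        // Required by ExcelDataReader on .NET Core.
        System.Text.Encoding.RegisterProvider(System.Text.CodePagesEncodingProvider.Instance);

        // Confirms which collector the process actually runs under.
        Console.WriteLine($"Server GC: {GCSettings.IsServerGC}");

        // Open the same workbook 14 times in parallel, as described above.
        var tasks = Enumerable.Range(0, 14).Select(_ => Task.Run(() =>
        {
            using var stream = File.Open("large.xlsx", FileMode.Open, FileAccess.Read);
            using var reader = ExcelReaderFactory.CreateReader(stream);
            while (reader.Read()) { }
        }));
        await Task.WhenAll(tasks);

        Console.WriteLine($"Managed heap after the run: {GC.GetTotalMemory(false) / (1024 * 1024)} MB");
    }
}
```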
The Server GC trades memory in favor of throughput. Then there's the Retain VM option that also affects how .NET frees memory back to the OS. I have an idea of how we can change ExcelDataReader to allocate less, but will have to test whether it has any real-world impact. Could be that the objects we can do something about are so short-lived that it doesn't matter. |
@andersnm Could this temporarily be fixed by not storing all the unzipped data/XlsxSST in an in-memory list, but in a temporary file instead (perhaps only when the extracted size gets much bigger than the original, zipped file)? |
@spiritbob I suppose using a temp file will not reduce the total allocations, but rather double the number due to the added roundtrip. Alas I don't have any other good suggestions here. |
@andersnm Just to confirm: this memory issue is only prevalent for xlsx files that have a large shared string table, correct? By inspecting the zip I can know whether the file is too large to be handled immediately and put it into a queue. For the other formats that won't be possible, so I'll just take a slightly different approach there if that's the case? |
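A sketch of that "inspect the zip first" idea (not part of ExcelDataReader): an .xlsx file is a zip archive, so the uncompressed size of xl/sharedStrings.xml can be read cheaply before deciding whether to process or queue the file. The threshold and file name here are made up.

```csharp
using System;
using System.IO.Compression;

static class XlsxPreflight
{
    // Returns true when xl/sharedStrings.xml would expand beyond the given limit.
    static bool SharedStringTableLooksTooLarge(string path, long maxUncompressedBytes)
    {
        using var archive = ZipFile.OpenRead(path);
        var sst = archive.GetEntry("xl/sharedStrings.xml");
        return sst != null && sst.Length > maxUncompressedBytes; // Length = uncompressed size
    }

    static void Main()
    {
        // Arbitrary 100 MB threshold: queue the file for later instead of parsing it now.
        bool queueIt = SharedStringTableLooksTooLarge("large.xlsx", 100 * 1024 * 1024);
        Console.WriteLine(queueIt ? "queue for later" : "process now");
    }
}
```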
@spiritbob The key factor is likely the combined size of the XMLs inside the XLSX, not necessarily only the string table. I am assuming the memory pressure occurs in the XML parser. Every tag name, text element, attribute name and attribute value inside the XML becomes an individually allocated string (although quickly freed; it's SAX-based). Internally .NET uses 16 bits per character, so there is 2x memory usage from that, and the XML is parsed twice, so there are 2x memory allocations from that. Note that this excludes images - there can be large images in the spreadsheet, but these are ignored completely. |
From my internal testing it seems that the larger the shared string table is, the more memory is used. For example: a 20 MB file with a 100 MB shared string table takes up as much as 310 MB of memory, while a 50 MB file with only a 30 MB shared string table ends up taking 120 MB of memory. (Both figures are the end result when run with the Workstation GC.) Hence why I'm thinking of distinguishing the two in some way, and for now that way is by their content types. |
@spiritbob I trust your testing and agree it sounds indeed like the string table is the main factor! To add, there was a report on bad XLSB perf on some .NET runtimes in #493 - alas not solved yet. |
@andersnm @appel1 ExcelDataReader loads all the shared strings into RAM. For larger files, should that be avoided? I have implemented it at ArjunVachhani@b571bda. I would like to hear your thoughts. |
I'll take a look but I'm not sure it is worth it because of the added complexity. Lots of pitfalls with read and write errors, performance cost of reading, string allocation increase, issues with antivirus, where to write it, configuration, cleanup and so on. If memory usage is a real issue then maybe we could store it more efficiently in memory somehow. |
@appel1 In my specific use case, we'd like to avoid pressuring the RAM at all costs. The CPU can take its time to compute, but if RAM gets overfilled we'll be facing a serious issue of an application crash. I've seen files that are only about 70 MB large and yet consume up to a gigabyte of RAM. Now imagine a scenario where you have 20 such files being processed concurrently. Is the above suggestion perfect for this case? Yes. Are there workarounds? Yes, but they're much more complicated than simply having it supported out of the box. You'd have to create your very own semaphore (if you'd like to preserve the order of the processed files), estimate how much memory the application can spare for these files at any one point, and basically assign that budget to the semaphore and start using it. |
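A sketch of the semaphore workaround being described, assuming a hypothetical processor class and an arbitrary limit of two concurrent workbooks; note that SemaphoreSlim bounds concurrency but does not by itself guarantee FIFO ordering:

```csharp
using System.IO;
using System.Threading;
using System.Threading.Tasks;
using ExcelDataReader;

class ThrottledExcelProcessor
{
    // At most two workbooks are parsed at once, so peak memory stays bounded.
    private readonly SemaphoreSlim _gate = new SemaphoreSlim(2, 2);

    public async Task ProcessAsync(string path)
    {
        await _gate.WaitAsync();
        try
        {
            using var stream = File.Open(path, FileMode.Open, FileAccess.Read);
            using var reader = ExcelReaderFactory.CreateReader(stream);
            while (reader.Read()) { /* handle row */ }
        }
        finally
        {
            _gate.Release();
        }
    }
}
```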
Is there a sample Excel file that causes gigabytes of memory usage I can test things with? |
Hi - not sure loading 20 concurrent 70 MB spreadsheets is necessarily something this project should aim to support out of the box. If there is strong demand to externalize the SST storage, one idea is to abstract it away behind an interface and let users of the library provide their own custom SST storage implementation in the reader configuration. Internally it can remain a List<string>. EDIT: As suggested, a queue actually sounds like a good solution. More work and rearchitecting though :/ |
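A rough sketch of what such an abstraction could look like; none of these names exist in ExcelDataReader today, and the default would simply keep the current in-memory list while callers could plug in a disk- or database-backed store:

```csharp
using System.Collections.Generic;

public interface ISharedStringStorage
{
    void Add(string value);            // called while the shared string table is parsed
    string Get(int sharedStringIndex); // called when a cell references the table
}

public sealed class InMemorySharedStringStorage : ISharedStringStorage
{
    private readonly List<string> _strings = new List<string>();

    public void Add(string value) => _strings.Add(value);

    public string Get(int sharedStringIndex) => _strings[sharedStringIndex];
}
```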
@kikaragyozov I am also facing memory issues when uploading huge files. In my case the files are more than 200 MB in size, so I have decided to build a library to read huge XLSX files. The library is mostly ready; if you try it today it should work fine. Please feel free to try it and share your feedback and bugs. Link to repository: https://github.com/ArjunVachhani/XlsxHelper |
I made a really quick test with a random .XLS file, only to have it throw an exception. @andersnm, is there any motivation to address this? The final suggestion regarding giving consumers more control over how the SST is read/stored via an interface is a good start. This would avoid the turmoil of having to deal with all the added complexities @appel1 described in reading the thing lazily. Internally, we've made a few specialized classes that allocate a specific amount of memory in the process to be used exclusively by problematic Excel files with shared string tables and share it fairly, along with a safe upper bound on how large the string table is allowed to be before we outright reject such a file. ExcelDataReader is by far the best library out there currently in terms of performance. The only factor that in my eyes still deserves an optimization to truly bring it to the next level is these costly string allocations taking up so much memory. The official Open XML SDK by Microsoft has a serious issue when it comes to memory as well, as they, I believe, open the archive in Update mode, which is known to load the entire contents of related ZipArchiveEntries into memory. |
I'm not opposed to abstracting away the string storage so it can be replaced if required. Or we could optionally read the strings on demand instead. Something to experiment with but the overhead would probably make performance quite poor. |
When I make the call to ExcelReaderFactory.CreateReader(Stream stream), where the stream contains the contents of a 50 MB Excel file, a total of 800 MB gets eaten up by that call (halting execution while doing so). Is this using DOM? I thought only a single element/line was ever loaded into memory at a time.