# High: Implement streaming file operations for large files
## Issue Summary
RDMP uses synchronous file operations that load entire files into memory, causing thread blocking, memory exhaustion, and poor performance with large files (Excel, CSV, archives).
## High Impact
- Thread Pool Blocking: Synchronous file I/O blocks threads
- Memory Exhaustion: Large files loaded entirely into memory
- Poor Performance: File operations scale poorly with size
- UI Freezing: Application becomes unresponsive during file operations
## Current Problems

### 1. Excel File Processing

```csharp
// ExcelDataFlowSource.cs - Lines 80-82
using var fs = new FileStream(_fileToLoad.File.FullName, FileMode.Open);
using var workbook = new XSSFWorkbook(fs); // Loads the entire workbook!
```

**Problem:** The entire Excel workbook (possibly 100MB+) is loaded into memory.
### 2. CSV File Loading

```csharp
// DelimitedFlatFileDataFlowSource.cs
var dataTable = new DataTable();
// Load entire CSV into DataTable
dataTable.Load(csvReader);
```

**Problem:** The entire CSV file is loaded before processing begins.
### 3. Web File Downloads

```csharp
// WebFileDownloader.cs - Lines 83-98
var response = await httpClient.GetAsync(url);
response.Content.ReadAsStreamAsync().Result.CopyTo(output); // .Result blocks!
```

**Problem:** `.Result` blocks the calling thread, and because `GetAsync` is called without `HttpCompletionOption.ResponseHeadersRead`, the entire file is buffered in memory before the copy starts.
### 4. Archive Operations

```csharp
// FileUnzipper.cs - Lines 54-78
using var zipFile = ZipFile.OpenRead(filePath);
foreach (var entry in zipFile.Entries)
{
    entry.ExtractToFile(destinationPath); // Synchronous extraction
}
```

**Problem:** Entries are extracted synchronously, with no streaming, progress reporting, or cancellation.
### 5. AWS S3 Operations

```csharp
// AWSS3BucketReleaseDestination.cs
Task.Run(async () => await _s3Helper.GetBucket(BucketName)).Result; // .Result blocks!
```

**Problem:** Wrapping an async call in `Task.Run(...).Result` blocks a thread-pool thread while it waits and risks deadlocks; the operation is async in name only.
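A minimal sketch of the non-blocking replacement, assuming `_s3Helper.GetBucket(string)` returns a `Task` as the snippet above implies (the surrounding method name and return type are illustrative):

```csharp
// Sketch: make the caller async and await the helper directly.
// Assumes _s3Helper.GetBucket(string) returns Task<S3Bucket> (hypothetical signature).
public async Task<S3Bucket> GetReleaseBucketAsync(CancellationToken cancellationToken = default)
{
    cancellationToken.ThrowIfCancellationRequested();
    // No Task.Run, no .Result: awaiting frees the thread while the request is in flight.
    return await _s3Helper.GetBucket(BucketName);
}
```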
## Performance Impact (projected)
| File Size | Current Approach | Streaming Approach | Improvement |
|---|---|---|---|
| 1MB | ~100ms, 5MB memory | ~50ms, 1MB memory | 2x faster, 5x less memory |
| 10MB | ~1s, 50MB memory | ~200ms, 2MB memory | 5x faster, 25x less memory |
| 100MB | ~10s, 500MB memory | ~800ms, 5MB memory | 12.5x faster, 100x less memory |
| 1GB | ~100s, 5GB memory | ~5s, 20MB memory | 20x faster, 250x less memory |
## Recommended Solution

### 1. Streaming Excel Processing

```csharp
public class StreamingExcelReader : IAsyncEnumerable<DataRow>
{
    private readonly string _filePath;
    private readonly int _bufferSize;

    public IAsyncEnumerator<DataRow> GetAsyncEnumerator(CancellationToken cancellationToken = default)
    {
        return new StreamingExcelEnumerator(_filePath, _bufferSize, cancellationToken);
    }
}

public class StreamingExcelEnumerator : IAsyncEnumerator<DataRow>
{
    private readonly IExcelDataReader _reader; // e.g. from the ExcelDataReader package
    private readonly CancellationToken _cancellationToken;
    private DataRow _current;

    // Constructors, DisposeAsync and the CreateDataRowFromReader helper are omitted for brevity.
    public async ValueTask<bool> MoveNextAsync()
    {
        // Offload the synchronous Read() call so the caller's thread is never blocked.
        var hasRow = await Task.Run(() => _reader.Read(), _cancellationToken);
        _current = hasRow ? CreateDataRowFromReader(_reader) : null;
        return hasRow;
    }

    public DataRow Current => _current;
}
```
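Consumption then becomes a simple `await foreach` (a sketch, assuming a `(path, bufferSize)` constructor; `ProcessRow` is a hypothetical per-row handler):

```csharp
var reader = new StreamingExcelReader(@"C:\data\large.xlsx", bufferSize: 81920);
await foreach (DataRow row in reader)
{
    // Rows are materialized one at a time; the workbook is never fully in memory.
    ProcessRow(row);
}
```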
### 2. Async CSV Streaming

```csharp
public class AsyncCsvReader : IAsyncEnumerable<string[]>
{
    private readonly string _filePath;
    private readonly int _bufferSize;

    public IAsyncEnumerator<string[]> GetAsyncEnumerator(CancellationToken cancellationToken = default)
    {
        return new AsyncCsvEnumerator(_filePath, _bufferSize, cancellationToken);
    }
}

public class AsyncCsvEnumerator : IAsyncEnumerator<string[]>
{
    private readonly StreamReader _reader; // opened over a FileStream created with useAsync: true
    private string[] _current;

    // Constructors, DisposeAsync and ParseCsvLine are omitted for brevity; ParseCsvLine must
    // handle quoting and escaping (e.g. via CsvHelper) rather than a naive string.Split.
    public async ValueTask<bool> MoveNextAsync()
    {
        var line = await _reader.ReadLineAsync();
        if (line == null) return false; // end of file
        _current = ParseCsvLine(line);
        return true;
    }

    public string[] Current => _current;
}
```
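Cancellation threads through `WithCancellation` on the consumer side (a sketch, again assuming a `(path, bufferSize)` constructor):

```csharp
using var cts = new CancellationTokenSource(TimeSpan.FromMinutes(5));
await foreach (var fields in new AsyncCsvReader(@"C:\data\big.csv", 81920).WithCancellation(cts.Token))
{
    // Only one row's fields are held in memory at a time.
    Console.WriteLine(fields[0]);
}
```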
### 3. Non-blocking File Downloads

```csharp
public class AsyncFileDownloader
{
    private readonly HttpClient _httpClient;
    private readonly int _bufferSize = 81920; // 80KB buffer

    public async Task DownloadAsync(string url, string destinationPath, IProgress<long> progress = null, CancellationToken cancellationToken = default)
    {
        // ResponseHeadersRead returns as soon as headers arrive, so the body is never buffered in memory.
        using var response = await _httpClient.GetAsync(url, HttpCompletionOption.ResponseHeadersRead, cancellationToken);
        response.EnsureSuccessStatusCode();
        await using var contentStream = await response.Content.ReadAsStreamAsync(cancellationToken);
        await using var fileStream = new FileStream(destinationPath, FileMode.Create, FileAccess.Write, FileShare.None, _bufferSize, useAsync: true);
        var buffer = new byte[_bufferSize];
        long totalBytesRead = 0;
        int bytesRead;
        while ((bytesRead = await contentStream.ReadAsync(buffer, 0, buffer.Length, cancellationToken)) > 0)
        {
            await fileStream.WriteAsync(buffer, 0, bytesRead, cancellationToken);
            totalBytesRead += bytesRead;
            progress?.Report(totalBytesRead);
        }
    }
}
```
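Usage with progress reporting (a sketch; the constructor, URL and paths are illustrative):

```csharp
var downloader = new AsyncFileDownloader(new HttpClient()); // assumes a constructor taking HttpClient
var progress = new Progress<long>(bytes => Console.WriteLine($"Downloaded {bytes:N0} bytes"));
using var cts = new CancellationTokenSource(TimeSpan.FromMinutes(10));
await downloader.DownloadAsync("https://example.com/big.zip", @"C:\temp\big.zip", progress, cts.Token);
```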
### 4. Streaming Archive Operations

```csharp
// ArchiveEntry is a small DTO introduced for this sketch.
public class ArchiveEntry
{
    public string Name { get; set; }
    public long Size { get; set; }
    public DateTime LastModified { get; set; }
    public Func<string, Task> Extractor { get; set; }
}

public class StreamingArchiveExtractor
{
    public async IAsyncEnumerable<ArchiveEntry> ExtractAsync(string archivePath, [EnumeratorCancellation] CancellationToken cancellationToken = default)
    {
        await using var fileStream = new FileStream(archivePath, FileMode.Open, FileAccess.Read, FileShare.Read, bufferSize: 81920, useAsync: true);
        // ZipFile.Open only accepts a path, so wrap the stream in a ZipArchive instead.
        using var archive = new ZipArchive(fileStream, ZipArchiveMode.Read);
        foreach (var entry in archive.Entries)
        {
            cancellationToken.ThrowIfCancellationRequested();
            yield return new ArchiveEntry
            {
                Name = entry.FullName,
                Size = entry.Length, // uncompressed size (CompressedLength is the on-disk size)
                LastModified = entry.LastWriteTime.DateTime,
                // Must be invoked while enumeration is in progress: the archive is disposed afterwards.
                Extractor = destinationPath => ExtractEntryAsync(entry, destinationPath, cancellationToken)
            };
        }
    }

    private async Task ExtractEntryAsync(ZipArchiveEntry entry, string destinationPath, CancellationToken cancellationToken)
    {
        await using var entryStream = entry.Open();
        await using var fileStream = new FileStream(destinationPath, FileMode.Create, FileAccess.Write, FileShare.None, bufferSize: 81920, useAsync: true);
        await entryStream.CopyToAsync(fileStream, 81920, cancellationToken);
    }
}
```
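Callers can then enumerate and extract selectively while streaming (a sketch; paths are illustrative, and each `Extractor` must be awaited during enumeration):

```csharp
using var cts = new CancellationTokenSource();
await foreach (var entry in new StreamingArchiveExtractor().ExtractAsync(@"C:\temp\big.zip", cts.Token))
{
    Console.WriteLine($"{entry.Name}: {entry.Size:N0} bytes");
    if (entry.Name.EndsWith(".csv", StringComparison.OrdinalIgnoreCase))
        await entry.Extractor(Path.Combine(@"C:\temp\out", Path.GetFileName(entry.Name)));
}
```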
### 5. Memory-Efficient File Processing Pipeline

```csharp
public class StreamingDataPipeline
{
    // chunkSize is reserved for future batched processing; this sketch processes item by item.
    public async Task ProcessLargeFileAsync<T>(string filePath, Func<T, Task> processor, int chunkSize = 1000, CancellationToken cancellationToken = default)
    {
        await foreach (var item in ReadFileAsync<T>(filePath, cancellationToken))
        {
            await processor(item);
        }
    }

    // Not an async iterator (no yield), so it returns the selected IAsyncEnumerable directly.
    // ReadCsvAsync/ReadExcelAsync/ReadJsonAsync are the streaming readers sketched above.
    private IAsyncEnumerable<T> ReadFileAsync<T>(string filePath, CancellationToken cancellationToken = default)
    {
        // File type detection and appropriate reader selection
        var extension = Path.GetExtension(filePath).ToLowerInvariant();
        return extension switch
        {
            ".csv" => ReadCsvAsync<T>(filePath, cancellationToken),
            ".xlsx" or ".xls" => ReadExcelAsync<T>(filePath, cancellationToken),
            ".json" => ReadJsonAsync<T>(filePath, cancellationToken),
            _ => throw new NotSupportedException($"File type {extension} not supported")
        };
    }
}
```
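End to end, a consumer might look like this (a sketch; `ParseRow` is a hypothetical `Func<string[], Task>` handler):

```csharp
var pipeline = new StreamingDataPipeline();
using var cts = new CancellationTokenSource(TimeSpan.FromMinutes(30));
await pipeline.ProcessLargeFileAsync<string[]>(
    @"C:\data\extract.csv",
    row => ParseRow(row),
    chunkSize: 1000,
    cancellationToken: cts.Token);
```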
## Implementation Plan

### Phase 1 (Week 1): Core Streaming Infrastructure
- Implement StreamingExcelReader for Excel files
- Create AsyncCsvReader for CSV files
- Add AsyncFileDownloader for web downloads
### Phase 2 (Week 2): Archive and Cloud Operations
- Implement StreamingArchiveExtractor for ZIP files
- Update AWS S3 operations to be truly async
- Add progress reporting for long-running operations
### Phase 3 (Week 3): Pipeline Integration
- Update data pipeline components to use streaming
- Modify import/export operations for chunked processing
- Add cancellation token support throughout
### Phase 4 (Week 4): UI Integration and Monitoring
- Update UI components to show progress for file operations
- Add file processing monitoring and metrics
- Implement error handling and retry logic
## Acceptance Criteria
- All file operations use streaming with configurable buffer sizes
- Memory usage stays roughly constant regardless of file size (bounded by the buffer, not the file)
- UI remains responsive during file operations
- Progress reporting available for long-running operations
- Cancellation token support for all async file operations
- Performance tests show 5-20x improvement for large files (see the benchmark sketch below)
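One way the last criterion could be verified (a sketch, assuming BenchmarkDotNet; the path and the `AsyncCsvReader` constructor are illustrative):

```csharp
using System;
using System.Linq;
using System.Threading.Tasks;
using BenchmarkDotNet.Attributes;

[MemoryDiagnoser] // reports allocations, making the constant-memory criterion checkable
public class FileReadBenchmarks
{
    private const string LargeCsv = @"C:\bench\100mb.csv"; // illustrative test file

    [Benchmark(Baseline = true)]
    public void LoadWholeFile()
    {
        // Current approach: materialize every row before processing begins.
        var rows = System.IO.File.ReadAllLines(LargeCsv).Select(l => l.Split(',')).ToList();
        GC.KeepAlive(rows);
    }

    [Benchmark]
    public async Task StreamFile()
    {
        // Streaming approach: one row in memory at a time.
        await foreach (var row in new AsyncCsvReader(LargeCsv, 81920))
        {
            _ = row.Length; // stand-in for per-row processing
        }
    }
}
```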
## Areas Requiring Updates

**High Priority:**
- ExcelDataFlowSource - Streaming Excel processing
- WebFileDownloader - Async, non-blocking downloads
- FileUnzipper - Streaming archive extraction
- DelimitedFlatFileDataFlowSource - Streaming CSV processing
**Medium Priority:**
- AWSS3BucketReleaseDestination - True async S3 operations
- Export Operations - Chunked file export
- Import Operations - Streaming file import
- Backup Operations - Incremental backup processing
**Low Priority:**
- Log File Processing - Streaming log analysis
- Configuration Loading - Async config file reading
## Expected Impact
- Memory Usage: 90-95% reduction for large file operations
- Performance: 5-20x faster processing of large files
- UI Responsiveness: No more freezing during file operations
- Scalability: Handle files of any size with constant memory usage
- Resource Efficiency: Better thread pool utilization
## Related Issues
- Async database operations conversion
- Memory usage optimization
- UI responsiveness improvements
- Performance bottleneck analysis