A high-performance parallel and distributed mergesort implementation for sorting large binary files that exceed available main memory. This project explores multiple parallelization strategies including OpenMP, FastFlow, and MPI+FastFlow hybrid models.
This project implements an out-of-core mergesort algorithm capable of sorting files containing variable-length binary records. The solution leverages:
- Memory-mapped I/O for efficient file access
- Lightweight indexing to avoid loading full records into memory
- Multi-threaded parallelism using OpenMP and FastFlow
- Distributed computing via MPI for multi-node execution
- Hybrid MPI+FastFlow model combining distributed and shared-memory parallelism
- ✅ Sorts files of 32GB+ with a minimal memory footprint
- ✅ Multiple execution policies: Sequential, OpenMP, FastFlow, MPI+FastFlow
- ✅ K-way merge algorithm with min-heap optimization
- ✅ Asynchronous communication with double buffering (MPI)
- ✅ SIMD vectorization support via C++20 execution policies
The implementation follows a multi-phase approach:
- Indexing Phase: Memory-map the input file and build a lightweight index (`RecordTask`) without loading full payloads
- Chunking Phase: Divide the index into manageable chunks for parallel processing
- Parallel Sorting: Sort chunks in parallel using configurable execution policies
- Merging Phase: Apply k-way merge with min-heap to combine sorted chunks
- Output Phase: Write sorted records to output file (optional)
```cpp
// On-disk binary format
struct Record {
    uint64_t key;      // 8-byte sorting key
    uint32_t len;      // Payload length
    char payload[];    // Variable-length data
};

// Lightweight in-memory proxy (24 bytes)
struct RecordTask {
    uint64_t key;      // Sorting key
    uint32_t len;      // Payload length
    size_t foffset;    // File offset to original record
};

// Work range for parallel processing
struct WorkRange {
    size_t start_idx;
    size_t end_idx;
    const RecordTask* range_ptr;
    size_t ff_id;      // FastFlow worker ID
};
```

For a 200GB file with `PAYLOAD_MAX = 1024` bytes:
- Average record size: ~516 bytes
- Estimated record count: ~416M records
- In-memory index size: ~16GB RAM (using `RecordTask` proxies)
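To make the indexing phase concrete, here is a minimal illustrative sketch (not the project's actual code; `build_index` is a hypothetical name) that walks the mapped file once, recording only `(key, len, offset)` per record and never touching payload bytes:

```cpp
#include <cstddef>
#include <cstdint>
#include <cstring>
#include <vector>

struct RecordTask { uint64_t key; uint32_t len; size_t foffset; };

std::vector<RecordTask> build_index(const char* base, size_t file_size) {
    std::vector<RecordTask> index;
    size_t off = 0;
    while (off < file_size) {
        uint64_t key; uint32_t len;
        std::memcpy(&key, base + off, sizeof key);               // 8-byte key
        std::memcpy(&len, base + off + sizeof key, sizeof len);  // payload length
        index.push_back({key, len, off});
        off += sizeof key + sizeof len + len;                    // skip payload
    }
    return index;
}
```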
Basic single-threaded implementation for baseline comparison.
Data-parallel approach using OpenMP directives:
- Divides work into ranges, one per thread
- Uses `#pragma omp parallel for` with static scheduling
- SIMD vectorization with `std::execution::unseq`
- Parallel divide-and-conquer merge strategy
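A minimal sketch of the chunk-sorting step (illustrative only; `sort_chunks_omp` is a hypothetical name, and the real implementation pairs this with the parallel divide-and-conquer merge):

```cpp
#include <algorithm>
#include <cstddef>
#include <cstdint>
#include <vector>

struct RecordTask { uint64_t key; uint32_t len; size_t foffset; };

void sort_chunks_omp(std::vector<RecordTask>& index, size_t chunk_size) {
    const size_t n = index.size();
    // Each thread statically receives a set of chunks and sorts them in place.
    #pragma omp parallel for schedule(static)
    for (size_t start = 0; start < n; start += chunk_size) {
        const size_t end = std::min(start + chunk_size, n);
        std::sort(index.begin() + start, index.begin() + end,
                  [](const RecordTask& a, const RecordTask& b) {
                      return a.key < b.key;
                  });
    }
}
```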
Structured parallel pipeline with multiple stages:
```
Emitter → [Worker Farm] → [Merge Network] → Collector
```
- `RecordTaskSplitter`: Distributes work ranges to workers
- `RecordSortWorker`: Farm of parallel sorting workers
- `RecordTaskMerger`: Multi-level binary tree of 2-way mergers
- `RecordCollector`: Routes tasks between sorting and merge layers
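A hypothetical outline of the sorting farm stage, trimmed to its essentials (the real pipeline also wires in the multi-level merge network; `run_sorting_farm` and the simplified types are assumptions for illustration):

```cpp
#include <ff/ff.hpp>
#include <algorithm>
#include <cstdint>
#include <memory>
#include <vector>
using namespace ff;

struct RecordTask { uint64_t key; uint32_t len; size_t foffset; };
struct WorkRange  { size_t start_idx, end_idx; RecordTask* range_ptr; };

// Emitter: hands one WorkRange per chunk to the worker farm.
struct Splitter : ff_node {
    std::vector<WorkRange>& ranges;
    explicit Splitter(std::vector<WorkRange>& r) : ranges(r) {}
    void* svc(void*) override {
        for (auto& r : ranges) ff_send_out(&r);
        return EOS;                            // no more chunks
    }
};

// Worker: sorts its slice of the shared index in place.
struct SortWorker : ff_node {
    void* svc(void* task) override {
        auto* wr = static_cast<WorkRange*>(task);
        std::sort(wr->range_ptr + wr->start_idx,
                  wr->range_ptr + wr->end_idx,
                  [](const RecordTask& a, const RecordTask& b) {
                      return a.key < b.key;
                  });
        return GO_ON;                          // real worker forwards to the mergers
    }
};

int run_sorting_farm(std::vector<WorkRange>& ranges, int num_workers) {
    std::vector<std::unique_ptr<ff_node>> workers;
    for (int i = 0; i < num_workers; ++i)
        workers.push_back(std::make_unique<SortWorker>());
    ff_Farm<WorkRange> farm(std::move(workers));
    Splitter splitter(ranges);
    farm.add_emitter(splitter);
    return farm.run_and_wait_end();            // 0 on success
}
```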
Hybrid model for multi-node execution:
```
Rank 0 (Emitter) → Ranks 1..P-2 (Workers) → Rank P-1 (Collector)
```
Key Optimizations:
- Asynchronous communication (`MPI_Isend`, `MPI_Irecv`)
- Double buffering for overlapping computation and communication
- Each worker runs an internal FastFlow pipeline
- Final k-way merge at the collector
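The receive side could then progress as in this hypothetical double-buffered loop (names and the empty-message end-of-stream convention are assumptions for illustration, not the project's actual protocol):

```cpp
#include <mpi.h>
#include <vector>

// Process one buffer while the other is being filled by MPI.
void worker_receive_loop(int source, int tag, MPI_Comm comm, int buf_size) {
    std::vector<char> buf[2] = {std::vector<char>(buf_size),
                                std::vector<char>(buf_size)};
    MPI_Request reqs[2];
    MPI_Irecv(buf[0].data(), buf_size, MPI_BYTE, source, tag, comm, &reqs[0]);
    MPI_Irecv(buf[1].data(), buf_size, MPI_BYTE, source, tag, comm, &reqs[1]);
    for (;;) {
        int idx, count;
        MPI_Status st;
        MPI_Waitany(2, reqs, &idx, &st);       // whichever buffer completes first
        MPI_Get_count(&st, MPI_BYTE, &count);
        if (count == 0) break;                 // assumed end-of-stream marker
        // ... sort/process buf[idx] while the other receive is in flight ...
        MPI_Irecv(buf[idx].data(), buf_size, MPI_BYTE, source, tag, comm,
                  &reqs[idx]);                 // repost for the next chunk
    }
    // A real implementation would also cancel/drain the remaining request.
}
```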
Required dependencies:
- C++20 compatible compiler (g++ 10+)
- OpenMPI 4.0+
- OpenMP support
- FastFlow library
- R (optional, for performance plots)

```bash
# Set FastFlow path (if not in ~/fastflow)
export FF_ROOT=/path/to/fastflow
# Build all targets
make all
# Build with debug symbols
make DEBUG=1 all
# Build specific targets
make mergesort # Main sorting program
make record_gen # Test data generator
# Clean build artifacts
make clean # Remove binaries
make cleanall     # Remove binaries and test files
```

```bash
# Clone the repository
git clone https://github.com/DropB1t/spm-mergesort.git
cd spm-mergesort
# Build the project
make all
```

```bash
./record_gen <num_records> [payload_max] [output_file] [seed]
# Examples:
./record_gen 1000000 # 1M records, default settings
./record_gen 50000000 1024 data.dat # 50M records, max payload 1KB
./record_gen 10000000 512 test.dat 42    # Custom seed for reproducibility
```

Parameters:
- `num_records`: Number of records to generate (required)
- `payload_max`: Maximum payload size in bytes (default: 1024, min: 8)
- `output_file`: Output filename (default: records.dat)
- `seed`: Random seed (default: current timestamp)
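For reference, a hypothetical sketch of how the generator might lay out a single record, matching the on-disk `Record` format above (`append_record` is an assumed name; the actual generator may differ):

```cpp
#include <cstdint>
#include <cstdio>
#include <random>
#include <vector>

// Write one record: 8-byte key, 4-byte length, then the payload bytes.
void append_record(std::FILE* out, std::mt19937_64& rng,
                   uint32_t payload_min, uint32_t payload_max) {
    const uint64_t key = rng();
    std::uniform_int_distribution<uint32_t> len_dist(payload_min, payload_max);
    const uint32_t len = len_dist(rng);
    std::vector<char> payload(len, 'x');      // filler payload for illustration
    std::fwrite(&key, sizeof key, 1, out);
    std::fwrite(&len, sizeof len, 1, out);
    std::fwrite(payload.data(), 1, len, out);
}
```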
```bash
./mergesort <policy> <num_processes> <num_threads> <chunk_size> <record_count> [csv_file]
# Examples:
./mergesort OMP 1 8 1000000 10000000 # OpenMP with 8 threads
./mergesort FastFlow 1 16 5000000 50000000 # FastFlow with 16 workers
# MPI+FastFlow (requires mpirun)
mpirun -n 4 ./mergesort MPI_FF 4 8 1000000 10000000
```

Parameters:
- `policy`: Execution policy (Sequential, OMP, FastFlow, MPI_FF)
- `num_processes`: Number of MPI processes (use 1 for non-MPI policies)
- `num_threads`: Number of threads/workers per process
- `chunk_size`: Records per chunk
- `record_count`: Total records in input file
- `csv_file`: Optional CSV file for benchmark results
```bash
# Run comprehensive benchmark suite
./ms_benchmark.sh
# Custom options
./ms_benchmark.sh --executable ./mergesort --runs 5 --time 01:00:00
# Available options:
--executable PATH Path to benchmark executable
--time TIME SLURM time limit (default: 00:30:00)
--runs NUMBER Number of runs per configuration (default: 3)
--help               Show help message
```

The benchmark suite tests:
- Multiple execution policies
- Various thread/process counts
- Different chunk sizes
- Multiple data sizes (up to 50M records / 25GB)
The k-way merge optimization provides:
- Time Complexity: O(log k) for selecting minimum element
- Reduced Merge Passes: Merges k runs simultaneously vs. traditional 2-way merge
- Cache Efficiency: Better memory access patterns
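A compact sketch of the heap-driven merge (illustrative; `kway_merge` is a hypothetical name, and the actual merger streams from sorted chunks rather than materializing a full output vector):

```cpp
#include <cstdint>
#include <queue>
#include <span>
#include <vector>

struct RecordTask { uint64_t key; uint32_t len; size_t foffset; };

std::vector<RecordTask>
kway_merge(const std::vector<std::span<const RecordTask>>& runs) {
    using Head = std::pair<uint64_t, size_t>;   // (key, run index)
    std::priority_queue<Head, std::vector<Head>, std::greater<>> heap;
    std::vector<size_t> pos(runs.size(), 0);
    std::vector<RecordTask> out;
    for (size_t r = 0; r < runs.size(); ++r)    // seed heap with each run's head
        if (!runs[r].empty()) heap.push({runs[r][0].key, r});
    while (!heap.empty()) {
        const size_t r = heap.top().second;     // run holding the smallest key
        heap.pop();                             // O(log k) per element
        out.push_back(runs[r][pos[r]++]);
        if (pos[r] < runs[r].size()) heap.push({runs[r][pos[r]].key, r});
    }
    return out;
}
```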
Using `mmap()` with `POSIX_MADV_SEQUENTIAL`:
- Delegates memory management to kernel
- Proactive prefetching of sequential pages
- Aggressive release of processed pages
- Reduces page faults and improves throughput
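A minimal sketch of the mapping setup (illustrative; `map_input` is an assumed name and error handling is trimmed):

```cpp
#include <cstddef>
#include <fcntl.h>
#include <sys/mman.h>
#include <sys/stat.h>
#include <unistd.h>

const char* map_input(const char* path, size_t& size_out) {
    int fd = open(path, O_RDONLY);
    struct stat st;
    fstat(fd, &st);
    void* base = mmap(nullptr, st.st_size, PROT_READ, MAP_PRIVATE, fd, 0);
    // Ask the kernel to prefetch ahead for a sequential scan; already-processed
    // ranges can later be released with POSIX_MADV_DONTNEED.
    posix_madvise(base, st.st_size, POSIX_MADV_SEQUENTIAL);
    close(fd);  // the mapping remains valid after closing the descriptor
    size_out = st.st_size;
    return static_cast<const char*>(base);
}
```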
Project structure:

```
spm-mergesort/
├── mergesort.cpp      # Main sorting implementation
├── record_gen.cpp     # Test data generator
├── ms_benchmark.sh    # Benchmark suite script
├── plot.r             # R visualization script
├── Makefile           # Build configuration
├── README.md          # This file
└── include/
    ├── defines.hpp    # Global constants and enums
    ├── record.hpp     # Record data structures
    ├── timer.hpp      # Benchmark timing utilities
    └── utils.hpp      # Helper functions
```
Key parameters in `include/defines.hpp`:

```cpp
constexpr size_t PAYLOAD_MIN = 8;           // Minimum payload size
constexpr size_t PAYLOAD_MAX = 1024;        // Maximum payload size
constexpr size_t max_chunk_size = 1000000;  // Default chunk size
std::string INPUT_FILE = "records.dat";     // Input filename
std::string OUTPUT_FILE = "sorted.dat";     // Output filename
```

Results are saved in structured directories:
```
benchmark_results/run_<timestamp>/   # CSV files with measurements
performance_plots/run_<timestamp>/   # Generated graphs (if R available)
benchmark_logs/run_<timestamp>/      # Individual job logs
```
Each benchmark run generates:
- CSV file with detailed metrics (execution time, speedup, efficiency)
- Performance graphs (if R is installed)
- Summary report with configuration and statistics
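Speedup and efficiency presumably follow the standard definitions: `speedup(p) = T_sequential / T_parallel(p)` and `efficiency(p) = speedup(p) / p`.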
Excerpt of the double-buffered MPI communication setup:

```cpp
// Double buffering for overlapping computation and communication
std::vector<char> recv_buf[2];
std::vector<char> send_buf[2];
MPI_Request recv_reqs[2], send_reqs[2];
// Non-blocking receives
MPI_Irecv(recv_buf[0].data(), size, MPI_BYTE, source, tag, comm, &recv_reqs[0]);
MPI_Irecv(recv_buf[1].data(), size, MPI_BYTE, source, tag, comm, &recv_reqs[1]);
```

FastFlow pipeline construction (simplified pseudocode):

```cpp
RecordSortingPipeline(work_ranges, num_workers, record_tasks) {
    auto sorting_farm = build_sorting_farm(num_workers);
    pipeline->add_stage(sorting_farm);
    for (auto level : merger_levels) {
        auto merger_farm = build_merger_farm(level, num_mergers, record_tasks);
        pipeline->add_stage(merger_farm);
    }
    pipeline->add_stage(build_last_merger(final_level, record_tasks));
}
```

Chunk sorting with the C++20 unsequenced execution policy (the `foffset` tie-break keeps equal keys in file order):

```cpp
std::sort(std::execution::unseq, span.begin(), span.end(),
          [](const RecordTask& a, const RecordTask& b) {
              return std::tie(a.key, a.foffset) < std::tie(b.key, b.foffset);
          });
```

Potential optimizations and extensions:
- Decentralized Emitter: Use MPI-IO for workers to read input directly
- Parallel Collector: Multi-level tree-based merge network with multiple processes
- GPU Acceleration: Offload sorting to CUDA/OpenCL
- Compression: Add payload compression for reduced I/O
- Fault Tolerance: Checkpointing for long-running distributed jobs
Yuriy Rymarchuk
MSc in Computer Science and Networking - Parallel and Distributed Systems Course
University of Pisa
This project is licensed under the MIT License - see the LICENSE file for details.
- GitHub repository of FastFlow library