diff --git a/_posts/2024-01-08-Streaming-data-from-PostgreSQL-to-a-CSV-file.md b/_posts/2024-01-08-Streaming-data-from-PostgreSQL-to-a-CSV-file.md index df5acc8..bf463e7 100644 --- a/_posts/2024-01-08-Streaming-data-from-PostgreSQL-to-a-CSV-file.md +++ b/_posts/2024-01-08-Streaming-data-from-PostgreSQL-to-a-CSV-file.md @@ -390,7 +390,7 @@ assert check_df.l_orderkey.is_monotonic_increasing ## FastBCP -While our focus here is on Python tools, we added FastBCP as a reference regarding CPU and memory usage. FastBCP has been developed in-house by Romain Ferraton at [Architecture & Performance](https://www.architecture-performance.fr/). It is a command line tool, written in C#, that is compatible with any operating system where dotnet is installed. We used dotnet on Linux in the present case. +While our focus here is on Python tools, we added FastBCP as a reference regarding CPU and memory usage. [FastBCP](https://www.arpe.io/fastbcp) has been developed at [Architecture & Performance](https://www.architecture-performance.fr/ap-logiciels/). It is a command line tool, written in C#, that is compatible with any operating system where dotnet is installed. We used dotnet on Linux in the present case. FastBCP employs parallel threads, reading data through multiple connections by partitioning SQL on the 'l_orderkey' column, using the "random" method. This approach results in distinct CSV files, later merged into a final output. It's worth mentioning that due to its parallel settings, the resulting data in the CSV file may not be sorted. This is why the ORDER BY clause is removed from the query in this particular case. Also, the returned elapsed time take the merging phase into account. diff --git a/_posts/2025-03-31-Arnold-tongues-with-Numba-Numba-CUDA-and-Datashader.md b/_posts/2025-03-31-Arnold-tongues-with-Numba-Numba-CUDA-and-Datashader.md index d8b73f1..ca9b6e4 100644 --- a/_posts/2025-03-31-Arnold-tongues-with-Numba-Numba-CUDA-and-Datashader.md +++ b/_posts/2025-03-31-Arnold-tongues-with-Numba-Numba-CUDA-and-Datashader.md @@ -611,4 +611,25 @@ plot_phase_space(df, size_x=2_000, size_y=1_000)
- +{% if page.comments %} + + + +{% endif %} \ No newline at end of file diff --git a/_posts/2025-07-12-Git-commit-temporal-analysis.md b/_posts/2025-07-12-Git-commit-temporal-analysis.md new file mode 100644 index 0000000..6afa2a2 --- /dev/null +++ b/_posts/2025-07-12-Git-commit-temporal-analysis.md @@ -0,0 +1,546 @@ +--- +title: A Git commit temporal analysis +layout: post +comments: true +author: François Pacull +tags: +- Python +- git +- numba +- pandas +- matplotlib +- radial +- timestamp +--- + + +In this Python notebook, we are going to analyze *git commit* timestamps across multiple repositories to identify temporal patterns in a git user coding activity (me, actually). + +**Outline** +- [Imports and package versions](#imports) +- [Repository Discovery and Data Extraction](#discovery) + - [Data Collection](#collection) + - [Data Preprocessing](#preprocessing) +- [Visualizations](#visualizations) + - [Weekly Distribution](#weekly) + - [Hourly Distribution (Linear)](#hourly_linear) + - [Hourly Distribution (Polar)](#hourly_polar) + - [Temporal Heatmap](#heatmap) + +## Imports and Package Versions + +`BASE_DIR` is the root folder containing all the git repositories. The `USER_FILTERS` list contains substrings to match against git author names for filtering commits from a specific user with various names (github, gitlab from various organizations). You can adapt these two variables with your own directory and git user names. + + +```python +import os +import subprocess +from collections import Counter +from datetime import datetime + +import matplotlib.pyplot as plt +import numpy as np +import pandas as pd +import tol_colors as tc + +BASE_DIR = "/home/francois/Workspace" +USER_FILTERS = ["pacull", "djfrancesco"] +``` + +We are using Python 3.13.3 on a Linux OS: + + pandas : 2.2.3 + numpy : 2.2.6 + matplotlib: 3.10.3 + tol_colors: 2.0.0 + + +## Repository discovery and data extraction + +Here we introduce functions to recursively scan directories for git repositories, extract commit metadata using `git log` in a Python `subprocess`, specifically commit timestamps and author names, and filter commits by author name using case-insensitive substring matching. + + +```python +def is_git_repo(path): + return os.path.isdir(os.path.join(path, ".git")) + + +def get_all_git_repos(base_dir): + git_repos = [] + for root, dirs, files in os.walk(base_dir): + if is_git_repo(root): + git_repos.append(root) + dirs.clear() + return git_repos + + +def get_commits(repo_path): + try: + result = subprocess.run( + ["git", "-C", repo_path, "log", "--pretty=format:%an|%aI"], + stdout=subprocess.PIPE, + stderr=subprocess.DEVNULL, + check=True, + text=True, + ) + lines = result.stdout.strip().split("\n") + filtered_lines = [] + for line in lines: + if line: + author = line.split("|")[0].lower() + if any(u.lower() in author for u in USER_FILTERS): + filtered_lines.append(line) + return filtered_lines + except subprocess.CalledProcessError: + return [] + + +def parse_commit_times(commit_lines): + hours = [] + weekdays = [] + for line in commit_lines: + author, iso_date = line.split("|") + dt = datetime.fromisoformat(iso_date) + hours.append(dt.hour) + weekdays.append(dt.strftime("%A")) + return hours, weekdays +``` + +### Data collection + +So let's use these previous functions to iterate through the repositories, extract commit timestamps and parse them into hour-of-day and weekday components. 
+ + +```python +all_hours = [] +all_weekdays = [] + +repos = get_all_git_repos(BASE_DIR) +for repo in repos: + commits = get_commits(repo) + hours, weekdays = parse_commit_times(commits) + all_hours.extend(hours) + all_weekdays.extend(weekdays) + +print(f"Total commits found: {len(all_hours)}") +``` + + Total commits found: 7605 + +### Data preprocessing + +Now we convert the extracted data and create *frequency* dataframes for each hour of the day or day of the week. + +```python +hour_counts = Counter(all_hours) +hour_df = pd.DataFrame( + { + "hour": list(range(24)), + "commit_count": [hour_counts.get(h, 0) for h in range(24)], + } +) +hour_df = hour_df.set_index("hour") +hour_df["distrib"] = hour_df["commit_count"] / hour_df["commit_count"].sum() + +days_order = [ + "Monday", + "Tuesday", + "Wednesday", + "Thursday", + "Friday", + "Saturday", + "Sunday", +] +weekday_counts = Counter(all_weekdays) +weekday_df = pd.DataFrame( + { + "weekday": days_order, + "commit_count": [weekday_counts.get(day, 0) for day in days_order], + } +) +weekday_df = weekday_df.set_index("weekday") +weekday_df["distrib"] = weekday_df["commit_count"] / weekday_df["commit_count"].sum() +``` + + +```python +hour_df.head(3) +``` + + + + +| + | commit_count | +distrib | +
+|---|---|---|
+| hour |  |  |
+| 0 | 12 | 0.001578 |
+| 1 | 3 | 0.000394 |
+| 2 | 0 | 0.000000 |
+
+|  | commit_count | distrib |
+|---|---|---|
+| weekday |  |  |
+| Monday | 1527 | 0.200789 |
+| Tuesday | 1264 | 0.166206 |
+| Wednesday | 1291 | 0.169757 |
+
+
+
+
+
+
+| hour | 0 | 1 | ... | 22 | 23 |
+|---|---|---|---|---|---|
+| weekday |  |  |  |  |  |
+| Monday | 0.000000 | 0.0 | ... | 0.499671 | 0.092045 |
+| Tuesday | 0.039448 | 0.0 | ... | 0.197239 | 0.078895 |
+| Wednesday | 0.078895 | 0.0 | ... | 0.262985 | 0.026298 |
+3 rows × 24 columns
+
+
+
+
+### Software Versions
+- **FastTransfer**: Version 0.13.12.0 (X64 architecture, .NET 8.0.20)
+- **Operating System**: Ubuntu 24.04.3 LTS
+- **Source Engine**: DuckDB v1.3.2 (for Parquet reading and streaming)
+- **Target Database**: PostgreSQL 16.10
+
+### Hardware Configuration
+- **Compute**: 32 vCores @ 2.3 GHz with 64 GB RAM
+- **Storage**: 400 GB local NVMe where PostgreSQL's data directory resides
+- **Network**: 4 Gbps bandwidth
+- **Location**: Gravelines (GRA11) datacenter
+
+The local NVMe delivers strong sequential write performance at 1465 MiB/s (measured with fio), providing ample disk bandwidth for our data loading workloads.
+
+This configuration represents a practical mid-range setup, not the smallest instance that would struggle with parallel workloads, nor an oversized machine that would mask performance characteristics.
+
+### The Data: TPC-H Orders Table
+
+We're using the TPC-H benchmark's orders table at scale factor 10, which gives us:
+- 16 Parquet files, evenly distributed at 29.2 MiB each
+- Total dataset size: 467.8 MiB
+- 15 million rows with mixed data types (integers, decimals, dates, and varchar)
+
+The data resides in an OVH S3-compatible object storage bucket in the Gravelines region, and each file contains roughly 937,500 rows. This distribution allows us to test parallel loading strategies effectively.
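+
+As a quick sanity check on this layout, the per-file row counts can be inspected directly from Python with the `duckdb` package. This is only a sketch: it assumes the `httpfs` extension is available and that S3 credentials for the bucket are already configured (for example through environment variables); the bucket path is the same one used in the FastTransfer command below.
+
+```python
+import duckdb
+
+con = duckdb.connect()  # in-memory database, mirroring the ":memory:" source
+con.execute("INSTALL httpfs")
+con.execute("LOAD httpfs")
+
+rows_per_file = con.sql("""
+    SELECT filename, count(*) AS row_count
+    FROM read_parquet('s3://arpeiofastbcp/tpch/sf10/orders/*.parquet', filename=true)
+    GROUP BY filename
+    ORDER BY filename
+""").df()
+
+print(rows_per_file)                   # ~937,500 rows per file
+print(rows_per_file.row_count.sum())   # 15,000,000 rows in total
+```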
+
+## FastTransfer in Action: The Command That Does the Heavy Lifting
+
+Here's the actual command we use to load data:
+
+```bash
+./FastTransfer \
+ --sourceconnectiontype "duckdbstream" \
+ --sourceserver ":memory:" \
+ --query "SELECT * exclude filename from read_parquet('s3://arpeiofastbcp/tpch/sf10/orders/*.parquet', filename=true) t" \
+ --targetconnectiontype "pgcopy" \
+ --targetserver "localhost:5432" \
+ --targetuser "fasttransfer" \
+ --targetpassword "********" \
+ --targetdatabase "tpch" \
+ --targetschema "tpch_10_test" \
+ --targettable "orders" \
+ --method "DataDriven" \
+ --distributeKeyColumn "filename" \
+ --datadrivenquery "select file from glob('s3://arpeiofastbcp/tpch/sf10/orders/*.parquet')" \
+ --loadmode "Truncate" \
+ --mapmethod "Name" \
+ --batchsize 10000 \
+ --degree 16
+```
+
+Let's break down the key components and understand what each parameter does:
+
+### Source Configuration
+- **`--sourceconnectiontype "duckdbstream"`**: Uses DuckDB's memory-efficient streaming connection
+- **`--sourceserver ":memory:"`**: Runs DuckDB in-memory mode for temporary data processing without persisting to disk
+- **`--query`**: The DuckDB SQL that leverages the `read_parquet()` function to directly access Parquet files from S3, with `filename=true` to capture file origins for distribution
+
+### Target Configuration
+- **`--targetconnectiontype "pgcopy"`**: Uses PostgreSQL's native COPY protocol, a fast method for bulk loading data into PostgreSQL
+- **`--targetserver "localhost:5432"`**: Standard PostgreSQL connection details
+- **`--targetuser` and `--targetpassword`**: Database authentication credentials
+
+### Parallelization Strategy
+- **`--method "DataDriven"`**: Distributes work based on distinct values in a specified column, in our case each worker processes specific files
+- **`--distributeKeyColumn "filename"`**: Uses the filename column to assign work to workers, ensuring each file is processed by exactly one worker
+- **`--datadrivenquery`**: Overrides the default distinct value selection with an explicit file list using `glob()`, giving us precise control over work distribution
+- **`--degree 16`**: Creates 16 parallel workers. FastTransfer supports 1-1024 workers, or negative values for CPU-adaptive scaling (e.g., `-2` uses half available CPUs)
+
+### Loading Configuration
+- **`--loadmode "Truncate"`**: Clears the target table before loading, ensuring a clean slate (alternative is `"Append"` for adding to existing data)
+- **`--mapmethod "Name"`**: Maps source to target columns by name rather than position, providing flexibility when column orders differ
+- **`--batchsize 10000`**: Processes 10,000 rows per bulk copy operation (default is 1,048,576). Smaller batches can reduce memory usage but may impact throughput
+
+### About FastTransfer
+
+FastTransfer is designed specifically for efficient data movement between different database systems, particularly excelling with large datasets (>1 million cells). The tool requires the target table to pre-exist and supports various database types including ClickHouse, MySQL, Oracle, PostgreSQL, and SQL Server. Its strength lies in intelligent work distribution, whether using file-based distribution like our DataDriven approach, or other methods like CTID (PostgreSQL-specific), RangeId (numeric ranges), or Random (modulo-based distribution).
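+
+To make the work-distribution idea more concrete, here is a small, purely illustrative Python sketch of what a RangeId-style split over a numeric key could look like. This is not FastTransfer's internal code, and the key bounds and query text are made up for the example.
+
+```python
+def range_partitions(min_id: int, max_id: int, workers: int):
+    """Split the closed interval [min_id, max_id] into `workers` contiguous chunks."""
+    span = max_id - min_id + 1
+    bounds = [min_id + (span * i) // workers for i in range(workers)] + [max_id + 1]
+    return [(bounds[i], bounds[i + 1] - 1) for i in range(workers)]
+
+
+# Hypothetical bounds: each worker would get its own WHERE clause on the key.
+for lo, hi in range_partitions(1, 15_000_000, 4):
+    print(f"SELECT ... FROM orders WHERE o_orderkey BETWEEN {lo} AND {hi}")
+```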
+
+## Performance Analysis: Where Theory Meets Reality
+
+We tested four different table configurations to understand how PostgreSQL constraints and logging independently affect loading performance. Each test was run multiple times, reporting the best result to minimize noise from network variability or system background tasks.
+
+### Configuration 1: WITH PK / LOGGED
+
+Standard production table with primary key on `o_orderkey` and full WAL durability:
+
+| Degree of Parallelism | Load Time (seconds) | Speedup |
+|----------------------|---------------------|---------|
+| 1 | 50.5 | 1.0x |
+| 2 | 28.8 | 1.8x |
+| 4 | 17.8 | 2.8x |
+| 8 | 16.1 | 3.1x |
+| 16 | 19.2 | 2.6x |
+
+Peaks at 8 workers (3.1x speedup). Constraint checking and WAL logging create severe contention.
+
+### Configuration 2: WITH PK / UNLOGGED
+
+Primary key with WAL logging disabled:
+
+| Degree of Parallelism | Load Time (seconds) | Speedup |
+|----------------------|---------------------|---------|
+| 1 | 46.3 | 1.0x |
+| 2 | 25.5 | 1.8x |
+| 4 | 14.5 | 3.2x |
+| 8 | 9.3 | 5.0x |
+| 16 | 7.8 | 5.9x |
+
+Removing WAL overhead significantly improves scaling, which continues up to 16 workers thanks to reduced contention.
+
+### Configuration 3: WITHOUT PK / LOGGED
+
+No constraints, WAL logging enabled:
+
+| Degree of Parallelism | Load Time (seconds) | Speedup |
+|----------------------|---------------------|---------|
+| 1 | 45.3 | 1.0x |
+| 2 | 24.2 | 1.9x |
+| 4 | 13.2 | 3.4x |
+| 8 | 8.7 | 5.2x |
+| 16 | 8.7 | 5.2x |
+
+Better than WITH PK/LOGGED but plateaus at 8 workers due to WAL contention.
+
+### Configuration 4: WITHOUT PK / UNLOGGED
+
+Maximum performance configuration - no constraints, no WAL:
+
+| Degree of Parallelism | Load Time (seconds) | Speedup |
+|----------------------|---------------------|---------|
+| 1 | 44.5 | 1.0x |
+| 2 | 25.4 | 1.8x |
+| 4 | 13.4 | 3.3x |
+| 8 | 7.8 | 5.7x |
+| 16 | 5.1 | 8.7x |
+
+Best scaling - achieves 8.7x speedup at 16 workers, at which point limits outside PostgreSQL (discussed below) come into play.
+
+## Visual Performance Comparison
+
+
+
+The comparison reveals how primary keys and WAL logging independently bottleneck performance. WITHOUT PK/UNLOGGED achieves the best scaling (8.7x at 16 workers), while WITH PK/LOGGED caps at 3.1x. The intermediate configurations show each factor's impact: removing the primary key or disabling WAL each provide significant improvements, with their combination delivering maximum performance.
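+
+To reproduce this kind of comparison from the tables above, a short matplotlib sketch could look as follows (the load times are simply copied from the four result tables; nothing here is measured by the script itself):
+
+```python
+import matplotlib.pyplot as plt
+
+degrees = [1, 2, 4, 8, 16]
+load_times = {  # seconds, copied from the result tables above
+    "WITH PK / LOGGED": [50.5, 28.8, 17.8, 16.1, 19.2],
+    "WITH PK / UNLOGGED": [46.3, 25.5, 14.5, 9.3, 7.8],
+    "WITHOUT PK / LOGGED": [45.3, 24.2, 13.2, 8.7, 8.7],
+    "WITHOUT PK / UNLOGGED": [44.5, 25.4, 13.4, 7.8, 5.1],
+}
+
+fig, ax = plt.subplots(figsize=(8, 5))
+for label, times in load_times.items():
+    speedup = [times[0] / t for t in times]
+    ax.plot(degrees, speedup, marker="o", label=label)
+ax.set_xscale("log", base=2)
+ax.set_xticks(degrees, labels=[str(d) for d in degrees])
+ax.set_xlabel("Degree of parallelism")
+ax.set_ylabel("Speedup vs degree 1")
+ax.legend()
+plt.show()
+```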
+
+## Network and I/O Considerations
+
+Different configurations reveal different bottlenecks:
+
+- **WITH PK / LOGGED**: Constraint checking + WAL overhead limits to 3.1x
+- **WITH PK / UNLOGGED**: WAL removal allows 5.9x scaling
+- **WITHOUT PK / LOGGED**: WAL contention plateaus at 5.2x
+- **WITHOUT PK / UNLOGGED**: Best scaling at 8.7x (467.8 MiB in 5.1s ≈ 92 MB/s)
+
+At 92 MB/s, with a 4 Gbps network (~500 MB/s) and 1465 MiB/s of local NVMe write capacity, neither the network nor disk I/O is the bottleneck. The limitation could come from several sources: S3 object storage throughput, DuckDB Parquet parsing overhead, or PostgreSQL's internal coordination when multiple workers write concurrently to the same table.
+
+## Conclusion
+
+FastTransfer achieves 5.1-second load times for 467.8 MiB of Parquet data from OVH S3 to PostgreSQL, reaching 92 MB/s throughput with the WITHOUT PK/UNLOGGED configuration at degree 16. Testing four configurations reveals that primary keys and WAL logging each independently constrain performance, with optimal settings varying from degree 8 (LOGGED) to degree 16+ (UNLOGGED). The results demonstrate that cloud-based data pipelines can achieve strong performance when configuration matches use case requirements.
+
+---
+
+## About FastTransfer
+
+FastTransfer is a commercial high-performance data migration tool developed by [arpe.io](https://arpe.io). It provides parallel data transfer capabilities across multiple database platforms including PostgreSQL, MySQL, Oracle, SQL Server, ClickHouse, and DuckDB.
+
+**Key Features:**
+- Advanced parallelization strategies for optimal performance
+- Cross-platform compatibility with major databases
+- Flexible configuration for various data migration scenarios
+- Production-ready with comprehensive logging and monitoring
+
+For licensing information, support options, and to request a trial, visit the [official documentation](https://aetperf.github.io/FastTransfer-Documentation/).
diff --git a/_posts/2025-09-29-FastTransfer-Performance-with-Citus-Columnar-Storage-in-PostgreSQL.md b/_posts/2025-09-29-FastTransfer-Performance-with-Citus-Columnar-Storage-in-PostgreSQL.md
new file mode 100644
index 0000000..5ae4af7
--- /dev/null
+++ b/_posts/2025-09-29-FastTransfer-Performance-with-Citus-Columnar-Storage-in-PostgreSQL.md
@@ -0,0 +1,499 @@
+---
+title: FastTransfer Performance with Citus Columnar Storage in PostgreSQL
+layout: post
+comments: true
+author: François Pacull
+categories: [database, performance]
+tags:
+- FastTransfer
+- Citus PostgreSQL
+- Columnar storage
+- Database migration
+- PostgreSQL Docker
+- Performance benchmarks
+---
+
+## Introduction
+
+Data migration between database systems often becomes a bottleneck in modern data pipelines, particularly when dealing with analytical workloads. [FastTransfer](https://aetperf.github.io/FastTransfer-Documentation/) is a commercial data migration tool designed to address these challenges through advanced parallelization strategies. This post demonstrates FastTransfer's performance when working with PostgreSQL databases enhanced with the [Citus extension](https://docs.citusdata.com/en/v13.0/) for columnar storage.
+
+## Understanding FastTransfer
+
+FastTransfer is a command-line tool designed to address common data migration challenges. In our testing, we've found it particularly effective for scenarios where traditional migration approaches fall short.
+
+### Core Capabilities
+
+The tool offers several features that we've found valuable in production environments:
+
+- **Cross-platform compatibility**: Works with PostgreSQL, MySQL, Oracle, SQL Server, ClickHouse, DuckDB, and other major databases
+- **Advanced parallelization**: Multiple strategies for parallel data extraction and loading, allowing you to optimize for your specific use case
+- **Flexible configuration**: Fine-grained control over batch sizes, mapping methods, and load modes to tune performance
+- **Production-ready features**: Comprehensive logging, error handling, and monitoring help ensure reliable migrations
+
+### Parallelization Strategies
+
+One aspect we particularly appreciate about FastTransfer is its range of parallelization methods, accessible through the `-M, --method` option.
+
+### Citus Columnar → PostgreSQL Transfer Performance
+
+
+## Key Takeaways
+
+### Performance Summary
+
+From our benchmarks with 15 million rows:
+
+| Scenario | Best Method | Time | Speedup | Key Insight |
+|----------|------------|------|---------|-------------|
+| PostgreSQL → Citus | Ctid (8 threads) | 3.3s | 3.74x | Direct row access provides best performance |
+| Citus → PostgreSQL | RangeId UNLOGGED (8 threads) | 3.9s | 2.52x | UNLOGGED tables dramatically improve write speed |
+| Cross-compatible | RangeId (4 threads) | 5.3s | 2.29x | Good balance of performance and portability |
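+
+One way to quantify the diminishing returns discussed below is to divide each speedup by its thread count; a minimal sketch using the numbers from this summary table:
+
+```python
+results = {  # (threads, speedup) taken from the summary table above
+    "PostgreSQL → Citus, Ctid": (8, 3.74),
+    "Citus → PostgreSQL, RangeId UNLOGGED": (8, 2.52),
+    "Cross-compatible, RangeId": (4, 2.29),
+}
+
+for scenario, (threads, speedup) in results.items():
+    efficiency = speedup / threads
+    print(f"{scenario}: {efficiency:.0%} parallel efficiency")
+```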
+
+### Important Considerations
+
+1. **Storage vs Speed Trade-off**: Columnar storage reduces disk usage by 76% but adds ~20% write overhead
+2. **Diminishing Returns**: Parallelization beyond 4 threads often shows limited benefit
+3. **Method Limitations**: Not all methods work with all storage types (e.g., Ctid incompatible with columnar)
+4. **Asymmetric Performance**: Reading from columnar is faster than writing to it
+
+## Analysis and Insights
+
+After running these benchmarks, several patterns became clear that might help inform your migration strategy.
+
+### Why Ctid Typically Outperforms Other Methods
+
+In our testing, the Ctid method consistently delivered the best performance for PostgreSQL sources. This makes sense when you consider that ctid provides direct access to physical row locations, eliminating the need for sorting or complex query planning that other methods require.
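+
+For readers curious about what ctid-based partitioning looks like at the SQL level, here is a rough, illustrative sketch that splits a table into heap-page ranges and reads each range with a TID range scan (PostgreSQL 14+). This is not FastTransfer's actual implementation; the `psycopg` driver, the connection string, and the table name are all assumptions made for the example.
+
+```python
+import psycopg  # assumed PostgreSQL driver
+
+DSN = "host=localhost dbname=tpch"  # illustrative
+TABLE = "public.orders"             # illustrative
+
+with psycopg.connect(DSN) as conn:
+    # relpages is an estimate maintained by VACUUM / ANALYZE
+    n_pages = conn.execute(
+        f"SELECT relpages FROM pg_class WHERE oid = '{TABLE}'::regclass"
+    ).fetchone()[0]
+
+    workers = 8
+    step = n_pages // workers + 1
+    for i in range(workers):
+        lo, hi = i * step, min((i + 1) * step, n_pages + 1)
+        # Each worker would run a query of this shape (TID range scan):
+        print(
+            f"SELECT * FROM {TABLE} "
+            f"WHERE ctid >= '({lo},0)'::tid AND ctid < '({hi},0)'::tid"
+        )
+```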
+
+### Scalability Patterns
+
+One interesting finding from our tests relates to how parallelization efficiency changes with thread count:
+
+#### The Law of Diminishing Returns
+
+As we increased parallelism, we observed declining efficiency across all methods:
+- **Sweet Spot**: In most cases, 4 threads offered the best balance between performance and resource utilization
+- **Efficiency Cliff**: At 8 threads, efficiency often dropped below 50%, suggesting that the overhead of coordination begins to outweigh the benefits
+
+### Understanding Columnar Storage Impact
+
+Our benchmarks revealed several important considerations when working with columnar storage:
+
+#### Write Performance Trade-offs
+
+We observed that writing to columnar storage introduces approximately 19% overhead compared to standard tables (12,092 ms vs 10,141 ms). This overhead comes from several sources:
+- Compression processing (LZ4 in our configuration)
+- Data reorganization into columnar format (stripes and chunks)
+- Additional metadata management
+
+However, it's important to remember that this overhead delivers significant storage savings, in our case, a 76% reduction in disk usage.
+
+#### Read Performance Benefits
+
+Conversely, reading from columnar storage proved notably efficient:
+- Transfers from Citus to PostgreSQL completed 18% faster than the reverse direction
+- Compressed data requires less I/O bandwidth
+- Sequential reading patterns align well with columnar storage organization
+
+#### Asymmetric Performance Characteristics
+
+One surprising finding was that Citus → PostgreSQL transfers consistently outperformed PostgreSQL → Citus transfers. This asymmetry makes sense when you consider that:
+- Reading benefits from compression outweigh writing penalties
+- Standard PostgreSQL tables have highly optimized write paths
+- The combination results in better overall performance when columnar is the source
+
+#### Method Compatibility Considerations
+
+It's worth noting that not all parallelization methods work with columnar storage. The Ctid method, while excellent for standard PostgreSQL tables, isn't compatible with columnar architecture due to the different way data is organized and accessed.
+
+## Conclusion
+
+FastTransfer effectively handles migrations involving Citus columnar storage, achieving up to 76% storage savings while maintaining high transfer speeds. The choice of parallelization method significantly impacts performance, with Ntile delivering the best balance for columnar targets. These results demonstrate that columnar storage and efficient data migration are not mutually exclusive when using the right tools.
+
+---
+
+## About FastTransfer
+
+FastTransfer is a commercial high-performance data migration tool developed by [arpe.io](https://arpe.io). It provides parallel data transfer capabilities across multiple database platforms including PostgreSQL, MySQL, Oracle, SQL Server, ClickHouse, and DuckDB.
+
+**Key Features:**
+- Advanced parallelization strategies for optimal performance
+- Cross-platform compatibility with major databases
+- Flexible configuration for various data migration scenarios
+- Production-ready with comprehensive logging and monitoring
+
+For licensing information, support options, and to request a trial, visit the [official documentation](https://aetperf.github.io/FastTransfer-Documentation/).
diff --git a/_posts/2025-09-29-High-Speed-PostgreSQL-Replication-on-OVH-with-FastTransfer.md b/_posts/2025-09-29-High-Speed-PostgreSQL-Replication-on-OVH-with-FastTransfer.md
new file mode 100644
index 0000000..8c24bd3
--- /dev/null
+++ b/_posts/2025-09-29-High-Speed-PostgreSQL-Replication-on-OVH-with-FastTransfer.md
@@ -0,0 +1,140 @@
+---
+title: High-Speed PostgreSQL Replication on OVH with FastTransfer
+layout: post
+comments: true
+author: François Pacull
+categories: [database, performance]
+tags:
+- FastTransfer
+- PostgreSQL replication
+- OVH
+- High-performance
+- Database migration speed
+- TPC-H benchmark
+- 20 Gbps network
+- c3-256
+- PostgreSQL parallel transfer
+---
+
+
+## Introduction
+
+PostgreSQL-to-PostgreSQL replication at scale requires tools that can fully leverage modern cloud infrastructure and network capabilities. [FastTransfer](https://aetperf.github.io/FastTransfer-Documentation/) is a commercial data migration tool designed to maximize throughput through advanced parallelization. This post demonstrates FastTransfer's performance transferring 113GB of TPC-H data between OVH c3-256 instances over a 20 Gbit/s network.
+
+## Infrastructure Setup
+
+For our testing environment, we deployed PostgreSQL on two OVH c3-256 instances in the Paris datacenter. Here's what we're working with:
+
+- **OVH Instances**: c3-256 (256GB RAM, 128 vCores @2.3GHz, 400GB NVMe)
+- **Network**: 20 Gbit/s vrack, Paris datacenter (eu-west-par-c)
+- **OS**: Ubuntu 24
+- **PostgreSQL**: Version 16
+- **Dataset**: TPC-H SF100 lineitem table (~600M rows, ~113GB)
+
+
+
+## PostgreSQL Configuration
+
+Both PostgreSQL instances are tuned for bulk operations: 80GB `shared_buffers`, 128 parallel workers, and minimal WAL logging. Target tables are UNLOGGED with no primary keys.
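+
+The target tables here were created UNLOGGED from the start. If you are loading into an existing regular table, one way to get the same effect for the duration of the load is to toggle the table, as in the hedged sketch below (the `psycopg` driver, connection string, and schema are assumptions; note that `ALTER TABLE ... SET LOGGED` rewrites the table and generates WAL for all of it).
+
+```python
+import psycopg  # assumed driver; the same statements can be run from psql
+
+DSN = "host=10.10.0.50 dbname=tpch user=fasttransfer"  # illustrative
+
+with psycopg.connect(DSN, autocommit=True) as conn:
+    # Skip WAL during the bulk load
+    conn.execute("ALTER TABLE tpch_100.lineitem SET UNLOGGED")
+
+    # ... run FastTransfer here ...
+
+    # Restore durability afterwards (full table rewrite + WAL)
+    conn.execute("ALTER TABLE tpch_100.lineitem SET LOGGED")
+```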
+
+## Target Database Disk Performance
+
+The target PostgreSQL instance uses the native 400GB NVMe instance disk (not block storage) for database storage. This provides excellent I/O performance crucial for high-speed data ingestion:
+
+### FIO Benchmark Command
+```bash
+fio --name=seqwrite --filename=/tmp/fio-test --rw=write \
+ --bs=1M --size=8G --direct=1 --numjobs=1 --runtime=30 --group_reporting
+```
+
+### Results
+```
+Sequential Write Performance (8GB test, 1MB blocks):
+- Throughput: 1,260 MB/s (1.26 GB/s)
+- IOPS: 1,259
+- Average latency: 787 microseconds
+- 95th percentile: 1.5ms
+- 99th percentile: 2.3ms
+```
+
+The native NVMe storage delivers consistent low-latency writes with over 1.2 GB/s throughput, ensuring disk I/O is not a bottleneck for the PostgreSQL COPY operations even at peak network transfer rates.
+
+## Network Performance
+
+The private network connection between source and target instances was tested using iperf3 to verify bandwidth capacity:
+
+### iperf3 Benchmark Command
+```bash
+# On target instance
+iperf3 -s
+
+# On source instance
+iperf3 -c 10.10.0.50 -P 64 -t 30
+```
+
+### Results
+```
+Network Throughput Test (64 parallel streams, 30 seconds):
+- Average throughput: 20.5 Gbit/s
+- Total data transferred: 71.7 GB
+- Consistent performance across all streams
+```
+
+The network delivers full line-rate performance, slightly exceeding the nominal 20 Gbit/s specification. With 64 parallel TCP streams, the network provides ample bandwidth for FastTransfer's parallel data transfer operations.
+
+## FastTransfer Command
+
+FastTransfer version: 0.13.12
+
+```bash
+./FastTransfer \
+ --sourceconnectiontype "pgcopy" \
+ --sourceconnectstring "Host=localhost;Port=5432;Database=tpch;Trust Server Certificate=True;Application Name=FastTransfer;Maximum Pool Size=150;Timeout=15;Command Timeout=10800;Username=fasttransfer;Password=******" \
+ --sourceschema "tpch_100" --sourcetable "lineitem" \
+ --targetconnectiontype "pgcopy" \
+ --targetconnectstring "Host=10.10.0.50;Port=5432;Database=tpch;Trust Server Certificate=True;Application Name=FastTransfer;Maximum Pool Size=150;Timeout=15;Command Timeout=10800;Username=fasttransfer;Password=******" \
+ --targetschema "tpch_100" --targettable "lineitem" \
+ --loadmode "Truncate" --method "Ctid" --degree 128
+```
+
+Note the `Maximum Pool Size=150` setting in the connection strings, increased from the default 100 to support 128 parallel threads.
+
+## Performance Results
+
+### Transfer Time
+
+
+
+Transfer time: 749s (single thread) → 70s (128 threads)
+
+### Throughput Scaling
+
+
+
+Throughput: 145 MB/s → 1,880 MB/s (75% of 20 Gbit/s link capacity)
+
+
+## Results Summary
+
+- **113GB transferred in 70 seconds** (degree=128)
+- **1.88 GB/s peak throughput** achieved
+- **10.7x speedup** with 128 parallel connections
+- **Optimal range**: 32-64 threads for best efficiency/performance balance
+
+## Conclusion
+
+FastTransfer achieves 1.88 GB/s throughput when transferring 113GB of data between PostgreSQL instances, utilizing 75% of the available 20 Gbit/s network capacity. The 10.7x speedup with 128 parallel connections demonstrates excellent scalability on OVH's high-end infrastructure. These results confirm that FastTransfer can effectively saturate modern cloud networking for PostgreSQL-to-PostgreSQL migrations.
+
+---
+
+## About FastTransfer
+
+FastTransfer is a commercial high-performance data migration tool developed by [arpe.io](https://arpe.io). It provides parallel data transfer capabilities across multiple database platforms including PostgreSQL, MySQL, Oracle, SQL Server, ClickHouse, and DuckDB.
+
+**Key Features:**
+- Advanced parallelization strategies for optimal performance
+- Cross-platform compatibility with major databases
+- Flexible configuration for various data migration scenarios
+- Production-ready with comprehensive logging and monitoring
+
+For licensing information, support options, and to request a trial, visit the [official documentation](https://aetperf.github.io/FastTransfer-Documentation/).
diff --git a/_posts/2025-10-25-Performance-Analysis-of-Parallel-Data-Replication-Between-Two-PostgreSQL-18-Instances-on-OVH.md b/_posts/2025-10-25-Performance-Analysis-of-Parallel-Data-Replication-Between-Two-PostgreSQL-18-Instances-on-OVH.md
new file mode 100644
index 0000000..8d43b3c
--- /dev/null
+++ b/_posts/2025-10-25-Performance-Analysis-of-Parallel-Data-Replication-Between-Two-PostgreSQL-18-Instances-on-OVH.md
@@ -0,0 +1,415 @@
+---
+title: Performance Analysis of Parallel Data Replication Between Two PostgreSQL 18 Instances on OVH
+layout: post
+comments: true
+author: François Pacull
+categories: [database, performance]
+tags:
+- FastTransfer
+- PostgreSQL 18
+- Performance analysis
+- PostgreSQL replication
+- OVH
+- High-performance
+- Database migration speed
+- TPC-H benchmark
+- 20 Gbps network
+- c3-256
+- PostgreSQL parallel transfer
+---
+
+
+
+
+## Introduction
+
+Parallel data replication between PostgreSQL instances presents unique challenges at scale, particularly when attempting to maximize throughput on high-performance cloud infrastructure. [FastTransfer](https://aetperf.github.io/FastTransfer-Documentation/) is a commercial data migration tool designed to leverage advanced parallelization strategies for efficient data movement. This post provides a performance analysis of FastTransfer transferring 77GB of data between two PostgreSQL 18 instances on OVH c3-256 servers, examining CPU, disk I/O, and network bottlenecks across parallelism degrees from 1 to 128.
+
+### Test Configuration
+
+The test dataset consists of the TPC-H SF100 lineitem table (~600M rows, ~77GB), configured as an UNLOGGED table without indexes, constraints, or triggers. Both instances were tuned for bulk loading operations, with all durability features disabled, large memory allocations, and PostgreSQL 18's `io_uring` support enabled (configuration details in Appendix A). Despite this comprehensive optimization, it appears that lock contention emerges at high parallelism degrees, limiting scalability.
+
+Testing was performed at eight parallelism degrees, executed sequentially in a progressive loading pattern: 1, 2, 4, 8, 16, 32, 64, and 128, with each step doubling to systematically increase load. Each configuration was run only once rather than following standard statistical practice of multiple runs with mean, standard deviation, and confidence intervals. This single-run approach was adopted after preliminary tests showed minimal variation between successive runs, indicating stable and reproducible results under these controlled conditions.
+
+### OVH Infrastructure Setup
+
+The test environment consists of two identical OVH cloud instances designed for heavy workloads:
+
+
+
+**Figure 1: OVH Infrastructure Architecture** - The test setup consists of two identical c3-256 instances (128 vCores, 256GB RAM, 400GB NVMe) running PostgreSQL 18 on Ubuntu 24.04. The source instance contains the TPC-H SF100 lineitem table. FastTransfer orchestrates parallel data replication across a 20 Gbit/s vrack private network connection to the target instance. Both instances are located in the Paris datacenter (eu-west-par-c) for minimal network latency.
+
+**Hardware Configuration:**
+
+- **Instance Type**: OVH c3-256
+- **Memory**: 256GB RAM
+- **CPU**: 128 vCores @ 2.3 GHz
+- **Storage**:
+ - **Target**: 400GB local NVMe SSD
+ - **Source**: OVH Block Storage (high-speed-gen2 with ~2TB, Bandwidth : 1 GB/s, Performance : 20,000 IOPS)
+- **Network**: 20 Gbit/s vrack (2.5 GB/s)
+
+The source instance PostgreSQL data directory resides on attached OVH Block Storage rather than local NVMe. This asymmetric storage configuration does not affect the analysis conclusions, as the source PostgreSQL instance exhibits backpressure behavior rather than storage-limited performance.
+
+**Software Stack:**
+
+- **OS**: Ubuntu 24.04.3 LTS with Linux kernel 6.8
+- **PostgreSQL**: Version 18.0, with `io_uring`, huge pages (`vm.nr_hugepages=45000`)
+- **FastTransfer**: Version 0.13.12
+
+**Infrastructure Performance Baseline:**
+
+- **Network**: 20.5 Gbit/s (2.56 GB/s) verified with iperf3
+- **Target Disk Sequential Write**: 3,741 MB/s (FIO benchmark with 128K blocks)
+- **Target Disk Random Write**: 88.2 MB/s, 22,600 IOPS (FIO, 4K blocks)
+
+### Overall Performance
+
+FastTransfer achieves strong absolute performance, transferring 77GB in just 67 seconds at degree 128, equivalent to 1.15 GB/s sustained throughput. The parallel replication process scales continuously across all tested degrees, with total elapsed time decreasing from 878 seconds (degree 1) to 67 seconds (degree 128). The system delivers consistent real-world performance improvements even at large parallelism levels, though lock contention on the target PostgreSQL instance appears to increasingly limit scaling efficiency beyond degree 32.
+
+
+
+**Figure 2: Total Elapsed Time by Degree of Parallelism** - Wall-clock time improves continuously across all tested degrees, from 878 seconds (degree 1) to 67 seconds (degree 128). Performance gains remain positive throughout, though the rate of improvement diminishes beyond degree 32 due to increasing lock contention.
+
+## 1. CPU Usage Analysis
+
+### 1.1 Mean and Peak CPU Usage
+
+
+
+
+**Figure 3: Mean CPU Usage by Component** - Target PostgreSQL (red) dominates resource consumption at high parallelism, while source PostgreSQL (blue) reaches around 12 cores.
+
+
+
+
+**Figure 4: Peak CPU Usage by Component** - Target PostgreSQL exhibits high peak values (~6,969% at degree 128). The large spikes combined with relatively lower mean values indicate high variance, characteristic of processes alternating between lock contention and productive work.
+
+**Component Scaling Summary:**
+
+| Component | Degree 1 | Degree 128 | Speedup | Efficiency |
+| ----------------- | ---------------- | ------------------- | ------- | ---------- |
+| Source PostgreSQL | 93% | 1,175% | 11.9x | 9.3% |
+| FastTransfer | 31% | 631% | 20.1x | 15.7% |
+| Target PostgreSQL | 98% | 3,294% | 33.6x | 26.3% |
+
+Source PostgreSQL's poor scaling appears to stem from backpressure: FastTransfer's batch-and-wait protocol means source processes send a batch, then block waiting for target acknowledgment. When the target cannot consume data quickly due to lock contention, this delay propagates backward. At degree 128, the source processes collectively use only 11.7 cores (0.11 cores/process), suggesting they're waiting rather than actively working.
+
+Note also that FastTransfer uses PostgreSQL's Ctid pseudo-column for table partitioning, which doesn't allow a perfect distribution: some partitions are smaller than others, causing some processes to complete and exit before the others.
+
+### 1.2 FastTransfer
+
+
+
+**Figure 5: FastTransfer User vs System CPU** - At degree 128, FastTransfer uses 419% user CPU (66%) and 212% system CPU (34%).
+
+In the present case, FastTransfer uses PostgreSQL's binary COPY protocol for both source and target (`--sourceconnectiontype "pgcopy"` and `--targetconnectiontype "pgcopy"`). Data flows directly from the source's COPY TO BINARY through FastTransfer to the target's COPY FROM BINARY without data transformation. FastTransfer acts as an intelligent network proxy coordinating parallel streams and batch acknowledgments, which explains its relatively low CPU usage. This would be less the case if we were transferring data between distinct RDBMS types.
+
+## 2. The Lock Contention Problem: System CPU Analysis
+
+### 2.1 System CPU
+
+
+
+**Figure 6: System CPU as % of Total CPU** - Target PostgreSQL (red line) crosses the 50% warning threshold at degree 16, exceeds 70% at degree 32, and peaks at 83.9% at degree 64. At this maximum, only 16.2% of CPU time performs productive work while 83.9% appears spent on lock contention and kernel overhead.
+
+CPU time divides into two categories: User CPU (application code performing actual data insertion) and System CPU (kernel operations handling locks, synchronization, context switches, I/O). A healthy system maintains system CPU below 30%.
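+
+For reference, this user/system split can be sampled from the OS without touching PostgreSQL, for example with a small `psutil` helper along these lines (a sketch only: the process-name match is simplistic and the counters are cumulative since process start, hence the two snapshots):
+
+```python
+import time
+
+import psutil
+
+
+def postgres_cpu_split(interval: float = 5.0):
+    """Return (user_s, system_s, system_pct) accumulated by postgres processes."""
+
+    def snapshot():
+        user = system = 0.0
+        for p in psutil.process_iter(["name", "cpu_times"]):
+            t = p.info["cpu_times"]
+            if p.info["name"] == "postgres" and t is not None:
+                user += t.user
+                system += t.system
+        return user, system
+
+    u0, s0 = snapshot()
+    time.sleep(interval)
+    u1, s1 = snapshot()
+    du, ds = u1 - u0, s1 - s0
+    total = du + ds
+    return du, ds, (100.0 * ds / total if total else 0.0)
+
+
+print(postgres_cpu_split())
+```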
+
+**System CPU Progression:**
+
+| Degree | Total CPU | User CPU | System CPU | System % | Productive Work |
+| ------ | --------- | -------- | ---------- | -------- | ------------------------- |
+| 1 | 98% | 80% | 18% | 18.2% | Healthy baseline |
+| 16 | 1,342% | 496% | 846% | 63.0% | Warning threshold crossed |
+| 32 | 2,436% | 602% | 1,834% | 75.3% | High contention |
+| 64 | 4,596% | 743% | 3,854% | 83.9% | **Maximum contention** |
+| 128 | 4,230% | 1,248% | 2,982% | 70.5% | Reduced contention |
+
+At degree 64, processes appear to spend 83.9% of time managing locks rather than inserting data. By degree 128, system CPU percentage unexpectedly decreases to 70.5% for unclear reasons, though absolute performance continues to improve.
+
+### 2.2 Possible Causes of Lock Contention
+
+The target table was already optimized for bulk loading (UNLOGGED, no indexes, no constraints, no triggers), eliminating all standard overhead sources. The remaining contention therefore likely stems from PostgreSQL's fundamental architecture (a way to observe this directly is sketched after the list):
+
+1. **Shared Buffer Pool Locks**: All 128 parallel connections compete for buffer pool partition locks to read/modify/write pages.
+
+2. **Relation Extension Locks**: When the table grows, PostgreSQL requires an exclusive lock (only one process at a time).
+
+3. **Free Space Map (FSM) Locks**: All 128 writers query and update the FSM to find pages with free space, creating constant FSM thrashing.
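+
+A direct way to check which of these locks dominate is to sample wait events on the target instance while the load is running, for example with a query against `pg_stat_activity` (sketch only; the `psycopg` driver and connection string are assumptions, and this is exactly the kind of instrumentation planned for the follow-up study):
+
+```python
+import psycopg  # assumed driver
+
+QUERY = """
+SELECT wait_event_type, wait_event, count(*)
+FROM pg_stat_activity
+WHERE state = 'active'
+  AND backend_type = 'client backend'
+  AND pid <> pg_backend_pid()
+GROUP BY 1, 2
+ORDER BY 3 DESC;
+"""
+
+with psycopg.connect("host=10.10.0.50 dbname=tpch") as conn:  # illustrative
+    for row in conn.execute(QUERY):
+        print(row)  # e.g. ('LWLock', 'BufferMapping', 42)
+```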
+
+## 3. Distribution and Time Series Analysis
+
+### 3.1 CPU Distribution
+
+
+
+**Figure 7: CPU Distribution at Degree 4** - Tight, healthy distributions with small standard deviations. All components operate consistently without significant contention.
+
+
+
+**Figure 8: CPU Distribution at Degree 32** - Target PostgreSQL (red) becomes bimodal with wide spread (1000-3000% range), indicating some samples capture waiting processes while others capture active processes. Source (blue) remains relatively tight.
+
+
+
+**Figure 9: CPU Distribution at Degree 128** - Target PostgreSQL (red) spans nearly 0-10000%, indicating highly variable behavior. Some processes are nearly starved (near 0%) while others burn high CPU on lock spinning (>8000%). This wide distribution suggests lock thrashing.
+
+### 3.2 CPU Time Series
+
+
+
+**Figure 10: CPU Over Time at Degree 4** - All components show stable, smooth CPU usage with minimal oscillations throughout the test duration.
+
+
+
+**Figure 11: CPU Over Time at Degree 32** - Target PostgreSQL (red) shows increasing variability and oscillations, indicating periods of successful lock acquisition alternating with blocking periods.
+
+
+
+**Figure 12: CPU Over Time at Degree 128** - Target PostgreSQL (red) exhibits oscillations with wild CPU swings, suggesting significant lock thrashing. Source (blue) and FastTransfer (green) show variability reflecting downstream backpressure.
+
+## 4. Performance Scaling Analysis: Degrees 64 to 128
+
+### 4.1 Continued Performance Improvement at Extreme Parallelism
+
+Degree 128 achieves the best absolute performance in the test suite, completing the transfer in 67 seconds compared to 92 seconds at degree 64, a meaningful 1.37x speedup that brings total throughput to 1.15 GB/s. While this represents 68.7% efficiency for the doubling operation (rather than the theoretical 2x), the continued improvement demonstrates that the system remains functional and beneficial even at extreme parallelism levels.
+
+### 4.2 Unexpected Efficiency Improvements at Degree 128
+
+Degree 128 exhibits a counterintuitive result: lower system CPU overhead (70.5%) than degree 64 (83.9%) despite doubling parallelism, while total CPU actually decreases by 8.0% (4,596% → 4,230%). User CPU efficiency improves by 82.1% (16.2% → 29.5% of total CPU), meaning nearly double the proportion of CPU time goes to productive work rather than lock contention. The reason for these improvements remains unclear.
+
+**The Comparative Analysis:**
+
+| Metric | Degree 64 | Degree 128 | Change |
+| ----------------------- | -------------------- | -------------------- | -------------------- |
+| Elapsed Time | 92s | 67s | 1.37x speedup |
+| Total CPU | 4,596% | 4,230% | -8.0% |
+| User CPU | 743% (16.2% of total)| 1,248% (29.5% of total) | +68.0% |
+| System CPU | 3,854% (83.9% of total) | 2,982% (70.5% of total) | -22.6% |
+| Network Throughput | 1,033 MB/s mean | 1,088 MB/s mean | +5.3% |
+| Network Peak | 2,335 MB/s (93.4%) | 2,904 MB/s (116.2%) | Saturation |
+| Disk Throughput | 759 MB/s | 1,099 MB/s | +44.8% |
+
+**Open Question: Why Does Efficiency Improve at Degree 128?**
+
+The improvement from degree 64 to 128 is puzzling for several reasons:
+
+1. **Why does network bandwidth increase by 5.3%** (1,033 MB/s → 1,088 MB/s) when adding more parallelism to an already saturated network? At degree 128, network peaks at 2,904 MB/s (116.2% of capacity), yet mean throughput still increases.
+
+2. **Why does system CPU overhead decrease** from 83.9% to 70.5% despite doubling parallelism? More processes should create more lock contention, not less.
+
+3. **Why does user CPU efficiency nearly double** (16.2% → 29.5% of total) when adding 64 more processes competing for the same resources?
+
+One hypothesis is that network saturation at degree 128 acts as a pacing mechanism, rate-limiting data delivery and preventing all 128 processes from simultaneously contending for locks. However, this doesn't fully explain why network throughput itself increases, nor why the efficiency gains are so substantial. The interaction between network saturation, lock contention, and process scheduling appears more complex than initially understood.
+
+## 5. Disk I/O and Network Analysis
+
+### 5.1 Source Disk I/O Analysis
+
+The source instance has 256GB RAM with a Postgres `effective_cache_size` of 192GB, and the lineitem table is ~77GB. An important detail explains the disk behavior across test runs:
+
+Degree 1 was the first test run, with no prior warm-up or cold run to pre-load the table into cache. During this first run at degree 1, there is heavy disk activity (500 MB/s, ~50% peak utilization) as the table is loaded into memory (shared_buffers + OS page cache). At degrees 2-128, there is essentially zero disk activity; the entire table remains cached in memory from the initial degree 1 load. This explains why degree 2 is more than twice as fast as degree 1: the degree 1 run includes the initial table-loading overhead, while degree 2 benefits from the already-cached table with no disk loading required. The speedup from degree 1 to 2 therefore reflects both the doubling of parallelism and the elimination of the initial cache-loading penalty.
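+
+One way to confirm that the table stays resident in shared_buffers between runs is to query the `pg_buffercache` contrib extension on the source instance. This is a hedged sketch: it requires installing the extension (superuser), it only sees shared_buffers (not the OS page cache), and the driver and connection details are illustrative.
+
+```python
+import psycopg  # assumed driver
+
+SQL = """
+SELECT count(*) * current_setting('block_size')::bigint / 1024 / 1024 AS cached_mib
+FROM pg_buffercache
+WHERE relfilenode = pg_relation_filenode('tpch_100.lineitem');
+"""
+
+with psycopg.connect("host=localhost dbname=tpch", autocommit=True) as conn:
+    conn.execute("CREATE EXTENSION IF NOT EXISTS pg_buffercache")
+    cached_mib = conn.execute(SQL).fetchone()[0]
+    print(f"lineitem pages cached in shared_buffers: {cached_mib} MiB")
+```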
+
+
+
+**Figure 13: Source Disk Utilization Over Time** - Shows disk utilization across all test runs (vertical lines mark test boundaries for degrees 1, 2, 4, 8, 16, 32, 64, 128). At degree 1, utilization peaks at ~50% during the initial table load, then drops to near-zero. At higher degrees (2-128), utilization remains below 1% throughout, confirming the disk is idle and not limiting performance.
+
+Disk utilization measures the percentage of time the disk is busy serving I/O requests. Source disk I/O is not a bottleneck at any parallelism degree.
+
+### 5.2 Target Disk I/O Time Series
+
+
+
+**Figure 14: Target Disk Write Throughput Over Time** - Throughput exhibits bursty behavior with spikes to 2000-3759 MB/s followed by drops to near zero. Sustained baseline varies from ~100 MB/s (low degrees) to ~300 MB/s (degree 128) but never sustains disk capacity.
+
+
+
+**Figure 15: Target Disk Utilization Over Time** - Mean utilization remains below 25% across all degrees. Spikes reach 70-90% during bursts but quickly return to low baseline. This suggests disk I/O is not the bottleneck.
+
+### 5.3 Network Throughput Analysis
+
+
+
+**Figure 16: Target Network Ingress Over Time** - At degree 128, throughput plateaus at ~2,450 MB/s (98% of capacity) during active bursts, but averages only 1,088 MB/s (43.5%) due to alternating active/idle periods. At degrees 1-64, network remains well below capacity.
+
+Network saturation only occurs at degree 128 during active bursts. Therefore, the network doesn't explain the poor scaling from degree 1 through 64; target CPU lock contention remains the primary bottleneck.
+
+### 5.4 Cross-Degree Scaling Analysis
+
+
+
+**Figure 17: Mean Disk Write Throughput by Degree** - Scales from 90 MB/s (degree 1) to 1,099 MB/s (degree 128), only 12.3x improvement for 128x parallelism (9.6% efficiency).
+
+
+
+**Figure 18: Network Throughput Comparison: Source TX vs Target RX** - At degree 128, source transmits 1,684 MB/s while target receives only 1,088 MB/s, creating a 596 MB/s (35%) deficit. This suggests the target cannot keep pace with source data production, likely due to CPU lock contention.
+
+The apparent 35% violation of flow conservation is explained by TCP retransmissions. The source TX counter (measured via `sar -n DEV`) counts both original packets and retransmitted packets, while the target RX counter only counts successfully received unique packets. When the target is overloaded with CPU lock contention (83.9% system CPU at degree 64), it cannot drain receive buffers fast enough, causing packet drops that trigger TCP retransmissions. The 596 MB/s "deficit" is actually retransmitted data counted twice at the source but only once at the target, providing quantitative evidence of the target's inability to keep pace with source data production.
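+
+Retransmissions can be verified directly on the source host by diffing the kernel's TCP counters around a run. A minimal, Linux-only sketch (the counters are system-wide, so this assumes the transfer dominates the traffic during the measurement window):
+
+```python
+def tcp_counters(path: str = "/proc/net/snmp") -> dict:
+    """Return the kernel's TCP counters (InSegs, OutSegs, RetransSegs, ...)."""
+    with open(path) as f:
+        tcp_lines = [line.split() for line in f if line.startswith("Tcp:")]
+    # First "Tcp:" line holds the field names, the second one the values
+    return dict(zip(tcp_lines[0][1:], map(int, tcp_lines[1][1:])))
+
+
+before = tcp_counters()
+# ... run the transfer ...
+after = tcp_counters()
+
+sent = after["OutSegs"] - before["OutSegs"]
+retrans = after["RetransSegs"] - before["RetransSegs"]
+print(f"retransmitted segments: {retrans} ({100 * retrans / max(sent, 1):.2f}% of sent)")
+```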
+
+### 5.5 I/O Analysis Conclusions
+
+1. **Disk does not appear to be the bottleneck**: 24% average utilization at degree 128 with 76% idle capacity. PostgreSQL matches FIO peak (3,759 MB/s) but sustains only 170 MB/s average.
+
+2. **Network does not appear to be the bottleneck for degrees 1-64**: Utilization remains below 42% through degree 64. Saturation occurs only at degree 128 during active bursts (~2,450 MB/s plateau).
+
+3. **Target CPU lock contention appears to be the root cause**: Low disk utilization + network saturation only at degree 128 + poor scaling efficiency throughout + high system CPU percentage (83.9% at degree 64) all point to the same conclusion.
+
+4. **Backpressure suggests target bottleneck**: Source can produce 1,684 MB/s but target can only consume 1,088 MB/s. Source processes use only 0.11 cores/process, suggesting they're blocked waiting for target acknowledgments.
+
+## 6. Conclusions
+
+### 6.1 Performance Achievement and Bottleneck Analysis
+
+FastTransfer successfully demonstrates strong absolute performance, achieving a 13.1x speedup that reduces 77GB transfer time from approximately 15 minutes (878s) to just over 1 minute (67s). This represents practical, production-ready performance with sustained throughput of 1.15 GB/s at degree 128. The system delivers continuous performance improvements across all tested parallelism degrees, confirming that parallel replication provides meaningful benefits even when facing coordination challenges.
+
+The primary scaling limitation appears to be target PostgreSQL lock contention beyond degree 32. System CPU grows to 83.9% at degree 64, meaning only 16.2% of CPU performs productive work. Degree 128 continues to improve absolute performance (67s vs 92s) even as total CPU decreases from 4,596% to 4,230%, though the reason for this unexpected efficiency improvement remains unclear.
+
+### 6.2 Why Additional Tuning Should Not Help
+
+The target table is already optimally configured (UNLOGGED, no indexes, no constraints, no triggers). The PostgreSQL configuration includes all recommended bulk loading optimizations (80GB shared_buffers, huge pages, `io_uring`, fsync=off). Despite this, system CPU remains at 70-84% at high degrees.
+
+The bottleneck appears to be architectural rather than configurational: buffer pool partition locks, the relation extension lock, FSM access, and so on. No configuration parameter appears able to eliminate these fundamental coordination requirements.
+
+### 6.3 Future Work: PostgreSQL Instrumentation Analysis
+
+While this analysis relied on system-level metrics, a follow-up study will use PostgreSQL's internal instrumentation to provide direct evidence of lock contention and wait events. This will validate the hypotheses presented in this analysis using database engine-level metrics.
+
+
+## Appendix A: PostgreSQL Configuration
+
+Both PostgreSQL 18 instances were tuned for maximum bulk loading performance.
+
+### Target PostgreSQL Configuration (Key Settings)
+
+```ini
+# Memory allocation
+shared_buffers = 80GB # 31% of 256GB RAM
+huge_pages = on # vm.nr_hugepages=45000
+work_mem = 256MB
+maintenance_work_mem = 16GB
+
+# Durability disabled (benchmark only, NOT production)
+synchronous_commit = off
+fsync = off
+full_page_writes = off
+
+# WAL configuration (minimal for UNLOGGED)
+wal_level = minimal
+wal_buffers = 128MB
+max_wal_size = 128GB
+checkpoint_timeout = 15min
+checkpoint_completion_target = 0.5
+
+# Background writer (AGGRESSIVE)
+bgwriter_delay = 10ms # Down from default 200ms
+bgwriter_lru_maxpages = 2000 # 2x default
+bgwriter_lru_multiplier = 8.0 # 2x default
+bgwriter_flush_after = 0
+
+# I/O configuration (PostgreSQL 18 optimizations)
+backend_flush_after = 0
+effective_io_concurrency = 400 # Optimized for NVMe
+maintenance_io_concurrency = 400
+io_method = io_uring # NEW PG18: async I/O
+io_max_concurrency = 512 # NEW PG18
+io_workers = 8 # NEW PG18: up from default 3
+
+# Worker processes
+max_worker_processes = 128
+max_parallel_workers = 128
+
+# Autovacuum (PostgreSQL 18)
+autovacuum = on
+autovacuum_worker_slots = 32 # NEW PG18: runtime adjustment
+autovacuum_max_workers = 16
+autovacuum_vacuum_cost_delay = 0 # No throttling
+
+# Query tuning
+enable_partitionwise_join = on
+enable_partitionwise_aggregate = on
+random_page_cost = 1.1 # NVMe SSD
+effective_cache_size = 192GB # ~75% of RAM
+```
+
+### Source PostgreSQL Configuration (Key Settings)
+
+The source instance is optimized for fast parallel reads to support high-throughput data extraction:
+
+```ini
+# Memory allocation
+shared_buffers = 80GB # ~31% of 256GB RAM
+huge_pages = on # vm.nr_hugepages=45000
+work_mem = 256MB
+maintenance_work_mem = 4GB # Lower than target (16GB)
+
+# Durability disabled (benchmark only, NOT production)
+synchronous_commit = off
+fsync = off
+full_page_writes = off
+
+# WAL configuration
+wal_level = minimal
+wal_buffers = -1 # Auto-sized
+max_wal_size = 32GB # Smaller than target (128GB)
+checkpoint_timeout = 60min # Longer than target (15min)
+checkpoint_completion_target = 0.9
+
+# Background writer
+bgwriter_delay = 50ms # Less aggressive than target (10ms)
+bgwriter_lru_maxpages = 1000 # Half of target (2000)
+bgwriter_lru_multiplier = 4.0 # Half of target (8.0)
+bgwriter_flush_after = 2MB
+
+# I/O configuration (PostgreSQL 18 optimizations)
+backend_flush_after = 0
+effective_io_concurrency = 400 # Identical to target
+maintenance_io_concurrency = 400
+io_method = io_uring # NEW PG18: async I/O
+io_max_concurrency = 512 # NEW PG18
+io_workers = 8 # NEW PG18
+
+# Worker processes
+max_connections = 500 # Higher than target for parallel readers
+max_worker_processes = 128
+max_parallel_workers_per_gather = 64
+max_parallel_workers = 128
+
+# Query tuning (optimized for parallel reads)
+enable_partitionwise_join = on
+enable_partitionwise_aggregate = on
+random_page_cost = 1.1 # Block Storage (not NVMe)
+effective_cache_size = 192GB # ~75% of RAM
+default_statistics_target = 500
+
+# Autovacuum (PostgreSQL 18)
+autovacuum = on
+autovacuum_worker_slots = 32 # NEW PG18: runtime adjustment
+autovacuum_max_workers = 16
+autovacuum_vacuum_cost_delay = 0 # No throttling
+```
+
+### Table Configuration
+
+The target table eliminates all overhead sources:
+
+- **UNLOGGED**: No WAL write, flush, or archival overhead
+- **No indexes**: Eliminates 50-80% of bulk load cost
+- **No primary key**: No index maintenance or uniqueness checking
+- **No constraints**: No foreign key, check, or unique validation
+- **No triggers**: No trigger execution overhead
+
+This represents the absolute minimum overhead possible.
+
+---
+
+## About FastTransfer
+
+FastTransfer is a commercial high-performance data migration tool developed by [arpe.io](https://arpe.io). It provides parallel data transfer capabilities across multiple database platforms including PostgreSQL, MySQL, Oracle, SQL Server, ClickHouse, and DuckDB.
+
+**Key Features:**
+
+- Advanced parallelization strategies for optimal performance
+- Cross-platform compatibility with major databases
+- Flexible configuration for various data migration scenarios
+- Production-ready with comprehensive logging and monitoring
+
+For licensing information, support options, and to request a trial, visit the [official documentation](https://aetperf.github.io/FastTransfer-Documentation/).
diff --git a/_posts/WP_2025-05-23-Computing-and-Visualizing-Billions-of-Bohemian-Eigenvalues-with-Python.md b/_posts/WP_2025-05-23-Computing-and-Visualizing-Billions-of-Bohemian-Eigenvalues-with-Python.md
new file mode 100644
index 0000000..f17db10
--- /dev/null
+++ b/_posts/WP_2025-05-23-Computing-and-Visualizing-Billions-of-Bohemian-Eigenvalues-with-Python.md
@@ -0,0 +1,239 @@
+
+According to [www.bohemianmatrices.com/](http://www.bohemianmatrices.com/),
+
+> A family of Bohemian matrices is a distribution of random matrices where the matrix entries are sampled from a discrete set of bounded height. The discrete set must be independent of the dimension of the matrices.
+
+In our case, we sample 5x5 matrix entries from the discrete set {-1, 0, 1}. For example, here are 2 random matrices with these specifications:
+
+$$
+\begin{pmatrix}
+0 & 0 & 0 & -1 & 0 \\
+0 & 0 & 0 & 1 & -1 \\
+1 & -1 & 0 & -1 & -1 \\
+1 & -1 & 1 & 1 & 0 \\
+-1 & 0 & 1 & 0 & -1
+\end{pmatrix}
+$$
+
+$$
+\begin{pmatrix}
+-1 & 0 & 0 & 0 & 0 \\
+0 & 1 & 0 & 1 & 1 \\
+1 & -1 & 0 & -1 & -1 \\
+1 & 1 & 1 & 1 & -1 \\
+-1 & 0 & 1 & -1 & -1
+\end{pmatrix}
+$$
+
+Bohemian **eigenvalues** are the eigenvalues of a family of Bohemian matrices. Eigenvalues $\lambda$ satisfy the equation $A v = \lambda v$, where $A$ is the matrix and $v$ is a corresponding eigenvector. If the matrices are 5 by 5, the eigenvalue solver is going to return 5 complex eigenvalues.
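+
+As a quick illustration with NumPy, here are the eigenvalues of the first example matrix above (this is just a one-off check, not the batched computation used for the billion-matrix run):
+
+```python
+import numpy as np
+
+A = np.array(
+    [
+        [ 0,  0,  0, -1,  0],
+        [ 0,  0,  0,  1, -1],
+        [ 1, -1,  0, -1, -1],
+        [ 1, -1,  1,  1,  0],
+        [-1,  0,  1,  0, -1],
+    ]
+)
+print(np.linalg.eigvals(A))  # 5 (generally complex) eigenvalues
+```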
+
+So what if we want to compute all the possible eigenvalues from all these matrices? Well, that would imply $3^{5 \times 5} = 847,288,609,443$ matrices, resulting in $4.236 \times 10^{12}$ eigenvalues. Just storing all these matrices would require significant space: each 5×5 matrix has 25 entries, and at 1 byte per entry to represent {-1, 0, 1}, we get a total storage requirement of:
+$847,288,609,443$ matrices × 25 bytes/matrix = $21,182,215,236,075$ bytes ≈ 21.18 TB
+
+So instead of computing all possible matrices, we are only going to sample 1 billion matrices. The first motivation for computing these complex eigenvalues is to observe some interesting patterns when they are plotted in the complex plane. See for example the beautiful gallery on the bohemianmatrices website: [www.bohemianmatrices.com/gallery/](http://www.bohemianmatrices.com/gallery/). Actually, we are just going to reproduce one of the gallery images:
+
+
+
+
+
+
+|  | commit_count | distrib |
+|---|---|---|
+| hour |  |  |
+| 0 | 12 | 0.001578 |
+| 1 | 3 | 0.000394 |
+| 2 | 0 | 0.000000 |
+
+|  | commit_count | distrib |
+|---|---|---|
+| weekday |  |  |
+| Monday | 1527 | 0.200789 |
+| Tuesday | 1264 | 0.166206 |
+| Wednesday | 1291 | 0.169757 |
+
+
+
+
+
+
| hour | +0 | +1 | +... | +22 | +23 | +
|---|---|---|---|---|---|
| weekday | ++ | + | + | + | + |
| Monday | +0.000000 | +0.0 | +... | +0.499671 | +0.092045 | +
| Tuesday | +0.039448 | +0.0 | +... | +0.197239 | +0.078895 | +
| Wednesday | +0.078895 | +0.0 | +... | +0.262985 | +0.026298 | +
3 rows × 24 columns
+
+
+
+
+### Software Versions
+- **FastTransfer**: Version 0.13.12.0 (X64 architecture, .NET 8.0.20)
+- **Operating System**: Ubuntu 24.04.3 LTS
+- **Source Engine**: DuckDB v1.3.2 (for Parquet reading and streaming)
+- **Target Database**: PostgreSQL 16.10
+
+### Hardware Configuration
+- **Compute**: 32 vCores @ 2.3 GHz with 64 GB RAM
+- **Storage**: 400 GB local NVMe where PostgreSQL's data directory resides
+- **Network**: 4 Gbps bandwidth
+- **Location**: Gravelines (GRA11) datacenter
+
+The local NVMe delivers strong sequential write performance at 1465 MiB/s (measured with fio), providing ample disk bandwidth for our data loading workloads.
+
+This configuration represents a practical mid-range setup, not the smallest instance that would struggle with parallel workloads, nor an oversized machine that would mask performance characteristics.
+
+### The Data: TPC-H Orders Table
+
+We're using the TPC-H benchmark's orders table at scale factor 10, which gives us:
+- 16 Parquet files, evenly distributed at 29.2 MiB each
+- Total dataset size: 467.8 MiB
+- 15 million rows with mixed data types (integers, decimals, dates, and varchar)
+
+The data resides in an OVH S3-compatible object storage bucket in the Gravelines region, and each file contains roughly 937,500 rows. This distribution allows us to test parallel loading strategies effectively.
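+
+The Parquet layout can be inspected directly with DuckDB before launching the transfer. A minimal sketch, assuming the `httpfs` extension is available and the S3 endpoint and credentials for the OVH bucket are configured in the environment or as a DuckDB secret (the bucket path is the one used by the FastTransfer command below):
+
+```python
+import duckdb
+
+con = duckdb.connect()  # in-memory database, like FastTransfer's ":memory:" source
+con.execute("INSTALL httpfs")
+con.execute("LOAD httpfs")
+
+# Count rows per file across the 16 Parquet files on S3
+df = con.sql(
+    """
+    SELECT filename, count(*) AS n_rows
+    FROM read_parquet('s3://arpeiofastbcp/tpch/sf10/orders/*.parquet', filename=true)
+    GROUP BY filename
+    ORDER BY filename
+    """
+).df()
+
+print(len(df), df.n_rows.sum())  # expected: 16 files, 15 million rows in total
+```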
+
+## FastTransfer in Action: The Command That Does the Heavy Lifting
+
+Here's the actual command we use to load data:
+
+```bash
+./FastTransfer \
+ --sourceconnectiontype "duckdbstream" \
+ --sourceserver ":memory:" \
+ --query "SELECT * exclude filename from read_parquet('s3://arpeiofastbcp/tpch/sf10/orders/*.parquet', filename=true) t" \
+ --targetconnectiontype "pgcopy" \
+ --targetserver "localhost:5432" \
+ --targetuser "fasttransfer" \
+ --targetpassword "********" \
+ --targetdatabase "tpch" \
+ --targetschema "tpch_10_test" \
+ --targettable "orders" \
+ --method "DataDriven" \
+ --distributeKeyColumn "filename" \
+ --datadrivenquery "select file from glob('s3://arpeiofastbcp/tpch/sf10/orders/*.parquet')" \
+ --loadmode "Truncate" \
+ --mapmethod "Name" \
+ --batchsize 10000 \
+ --degree 16
+```
+
+Let's break down the key components and understand what each parameter does:
+
+### Source Configuration
+- **`--sourceconnectiontype "duckdbstream"`**: Uses DuckDB's memory-efficient streaming connection
+- **`--sourceserver ":memory:"`**: Runs DuckDB in-memory mode for temporary data processing without persisting to disk
+- **`--query`**: The DuckDB SQL that leverages the `read_parquet()` function to directly access Parquet files from S3, with `filename=true` to capture file origins for distribution
+
+### Target Configuration
+- **`--targetconnectiontype "pgcopy"`**: Uses PostgreSQL's native COPY protocol, a fast method for bulk loading data into PostgreSQL
+- **`--targetserver "localhost:5432"`**: Standard PostgreSQL connection details
+- **`--targetuser` and `--targetpassword`**: Database authentication credentials
+
+### Parallelization Strategy
+- **`--method "DataDriven"`**: Distributes work based on distinct values in a specified column, in our case each worker processes specific files
+- **`--distributeKeyColumn "filename"`**: Uses the filename column to assign work to workers, ensuring each file is processed by exactly one worker
+- **`--datadrivenquery`**: Overrides the default distinct value selection with an explicit file list using `glob()`, giving us precise control over work distribution
+- **`--degree 16`**: Creates 16 parallel workers. FastTransfer supports 1-1024 workers, or negative values for CPU-adaptive scaling (e.g., `-2` uses half available CPUs)
+
+### Loading Configuration
+- **`--loadmode "Truncate"`**: Clears the target table before loading, ensuring a clean slate (alternative is `"Append"` for adding to existing data)
+- **`--mapmethod "Name"`**: Maps source to target columns by name rather than position, providing flexibility when column orders differ
+- **`--batchsize 10000`**: Processes 10,000 rows per bulk copy operation (default is 1,048,576). Smaller batches can reduce memory usage but may impact throughput
+
+### About FastTransfer
+
+FastTransfer is designed specifically for efficient data movement between different database systems, particularly excelling with large datasets (>1 million cells). The tool requires the target table to pre-exist and supports various database types including ClickHouse, MySQL, Oracle, PostgreSQL, and SQL Server. Its strength lies in intelligent work distribution, whether using file-based distribution like our DataDriven approach, or other methods like CTID (PostgreSQL-specific), RangeId (numeric ranges), or Random (modulo-based distribution).
+
+## Performance Analysis: Where Theory Meets Reality
+
+We tested four different table configurations to understand how PostgreSQL constraints and logging independently affect loading performance. Each test was run multiple times, and we report the best result to minimize noise from network variability or background system tasks.
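+
+The sweep over parallelism degrees is easy to script. Below is a minimal sketch of how such runs could be driven from Python, reusing the command shown above; the harness actually used for these benchmarks is not shown in this post, and wall-clock timing here simply stands in for FastTransfer's own reported elapsed time:
+
+```python
+import shlex
+import subprocess
+import time
+
+# Same arguments as the full command above, with the degree left as a placeholder.
+CMD_TEMPLATE = """
+./FastTransfer
+  --sourceconnectiontype duckdbstream --sourceserver :memory:
+  --query "SELECT * exclude filename from read_parquet('s3://arpeiofastbcp/tpch/sf10/orders/*.parquet', filename=true) t"
+  --targetconnectiontype pgcopy --targetserver localhost:5432
+  --targetuser fasttransfer --targetpassword ********
+  --targetdatabase tpch --targetschema tpch_10_test --targettable orders
+  --method DataDriven --distributeKeyColumn filename
+  --datadrivenquery "select file from glob('s3://arpeiofastbcp/tpch/sf10/orders/*.parquet')"
+  --loadmode Truncate --mapmethod Name --batchsize 10000
+  --degree {degree}
+"""
+
+best_times = {}
+for degree in (1, 2, 4, 8, 16):
+    args = shlex.split(CMD_TEMPLATE.format(degree=degree))
+    runs = []
+    for _ in range(3):  # a few runs per degree, keep the best one
+        start = time.perf_counter()
+        subprocess.run(args, check=True)
+        runs.append(time.perf_counter() - start)
+    best_times[degree] = min(runs)
+
+for degree, seconds in best_times.items():
+    print(f"degree {degree:>2}: {seconds:6.1f}s  speedup {best_times[1] / seconds:4.1f}x")
+```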
+
+### Configuration 1: WITH PK / LOGGED
+
+Standard production table with primary key on `o_orderkey` and full WAL durability:
+
+| Degree of Parallelism | Load Time (seconds) | Speedup |
+|----------------------|---------------------|---------|
+| 1 | 50.5 | 1.0x |
+| 2 | 28.8 | 1.8x |
+| 4 | 17.8 | 2.8x |
+| 8 | 16.1 | 3.1x |
+| 16 | 19.2 | 2.6x |
+
+Peaks at 8 workers (3.1x speedup). Constraint checking and WAL logging create severe contention.
+
+### Configuration 2: WITH PK / UNLOGGED
+
+Primary key with WAL logging disabled:
+
+| Degree of Parallelism | Load Time (seconds) | Speedup |
+|----------------------|---------------------|---------|
+| 1 | 46.3 | 1.0x |
+| 2 | 25.5 | 1.8x |
+| 4 | 14.5 | 3.2x |
+| 8 | 9.3 | 5.0x |
+| 16 | 7.8 | 5.9x |
+
+Removing WAL overhead significantly improves scaling. Continues to 16 workers due to reduced contention.
+
+### Configuration 3: WITHOUT PK / LOGGED
+
+No constraints, WAL logging enabled:
+
+| Degree of Parallelism | Load Time (seconds) | Speedup |
+|----------------------|---------------------|---------|
+| 1 | 45.3 | 1.0x |
+| 2 | 24.2 | 1.9x |
+| 4 | 13.2 | 3.4x |
+| 8 | 8.7 | 5.2x |
+| 16 | 8.7 | 5.2x |
+
+Better than WITH PK/LOGGED but plateaus at 8 workers due to WAL contention.
+
+### Configuration 4: WITHOUT PK / UNLOGGED
+
+Maximum performance configuration - no constraints, no WAL:
+
+| Degree of Parallelism | Load Time (seconds) | Speedup |
+|----------------------|---------------------|---------|
+| 1 | 44.5 | 1.0x |
+| 2 | 25.4 | 1.8x |
+| 4 | 13.4 | 3.3x |
+| 8 | 7.8 | 5.7x |
+| 16 | 5.1 | 8.7x |
+
+Best scaling - achieves 8.7x speedup at 16 workers before running into the remaining bottlenecks discussed below.
+
+## Visual Performance Comparison
+
+
+
+The comparison reveals how primary keys and WAL logging independently bottleneck performance. WITHOUT PK/UNLOGGED achieves the best scaling (8.7x at 16 workers), while WITH PK/LOGGED caps at 3.1x. The intermediate configurations show each factor's impact: removing the primary key or disabling WAL each provide significant improvements, with their combination delivering maximum performance.
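+
+For readers who want to reproduce the comparison chart, the load times from the four tables above can be re-plotted directly. A small pandas/matplotlib sketch (the figure in this post was generated separately; the numbers below are simply re-entered from the tables):
+
+```python
+import matplotlib.pyplot as plt
+import pandas as pd
+
+degrees = [1, 2, 4, 8, 16]
+load_times = {  # seconds, taken from the four result tables above
+    "WITH PK / LOGGED": [50.5, 28.8, 17.8, 16.1, 19.2],
+    "WITH PK / UNLOGGED": [46.3, 25.5, 14.5, 9.3, 7.8],
+    "WITHOUT PK / LOGGED": [45.3, 24.2, 13.2, 8.7, 8.7],
+    "WITHOUT PK / UNLOGGED": [44.5, 25.4, 13.4, 7.8, 5.1],
+}
+
+df = pd.DataFrame(load_times, index=degrees)
+ax = df.plot(marker="o", logx=True, figsize=(8, 5))
+ax.set_xlabel("Degree of parallelism")
+ax.set_ylabel("Load time (s)")
+ax.set_xticks(degrees, labels=[str(d) for d in degrees])
+ax.set_title("Parquet → PostgreSQL load time by configuration")
+plt.tight_layout()
+plt.show()
+```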
+
+## Network and I/O Considerations
+
+Different configurations reveal different bottlenecks:
+
+- **WITH PK / LOGGED**: Constraint checking + WAL overhead limits to 3.1x
+- **WITH PK / UNLOGGED**: WAL removal allows 5.9x scaling
+- **WITHOUT PK / LOGGED**: WAL contention plateaus at 5.2x
+- **WITHOUT PK / UNLOGGED**: Best scaling at 8.7x (467.8 MiB in 5.1s ≈ 92 MB/s)
+
+At 92 MB/s with 4 Gbps network (~500 MB/s) and 1465 MiB/s local NVMe capacity, neither network nor disk I/O are the bottleneck. The limitation could come from several sources: S3 object storage throughput, DuckDB Parquet parsing overhead, or PostgreSQL's internal coordination when multiple workers write concurrently to the same table.
+
+## Conclusion
+
+FastTransfer achieves 5.1-second load times for 467.8 MiB of Parquet data from OVH S3 to PostgreSQL, reaching 92 MB/s throughput with WITHOUT PK/UNLOGGED configuration at degree 16. Testing four configurations reveals that primary keys and WAL logging each independently constrain performance, with optimal settings varying from degree 8 (LOGGED) to degree 16+ (UNLOGGED). The results demonstrate that cloud-based data pipelines can achieve strong performance when configuration matches use case requirements.
+
+---
+
+## About FastTransfer
+
+FastTransfer is a commercial high-performance data migration tool developed by [arpe.io](https://arpe.io). It provides parallel data transfer capabilities across multiple database platforms including PostgreSQL, MySQL, Oracle, SQL Server, ClickHouse, and DuckDB.
+
+**Key Features:**
+- Advanced parallelization strategies for optimal performance
+- Cross-platform compatibility with major databases
+- Flexible configuration for various data migration scenarios
+- Production-ready with comprehensive logging and monitoring
+
+For licensing information, support options, and to request a trial, visit the [official documentation](https://aetperf.github.io/FastTransfer-Documentation/).
\ No newline at end of file
diff --git a/_posts/WP_2025-09-29-FastTransfer-Performance-with-Citus-Columnar-Storage-in-PostgreSQL.md b/_posts/WP_2025-09-29-FastTransfer-Performance-with-Citus-Columnar-Storage-in-PostgreSQL.md
new file mode 100644
index 0000000..bb25592
--- /dev/null
+++ b/_posts/WP_2025-09-29-FastTransfer-Performance-with-Citus-Columnar-Storage-in-PostgreSQL.md
@@ -0,0 +1,485 @@
+
+## Introduction
+
+Data migration between database systems often becomes a bottleneck in modern data pipelines, particularly when dealing with analytical workloads. [FastTransfer](https://aetperf.github.io/FastTransfer-Documentation/) is a commercial data migration tool designed to address these challenges through advanced parallelization strategies. This post demonstrates FastTransfer's performance when working with PostgreSQL databases enhanced with the [Citus extension](https://docs.citusdata.com/en/v13.0/) for columnar storage.
+
+## Understanding FastTransfer
+
+FastTransfer is a command-line tool designed to address common data migration challenges. In our testing, we've found it particularly effective for scenarios where traditional migration approaches fall short.
+
+### Core Capabilities
+
+The tool offers several features that we've found valuable in production environments:
+
+- **Cross-platform compatibility**: Works with PostgreSQL, MySQL, Oracle, SQL Server, ClickHouse, DuckDB, and other major databases
+- **Advanced parallelization**: Multiple strategies for parallel data extraction and loading, allowing you to optimize for your specific use case
+- **Flexible configuration**: Fine-grained control over batch sizes, mapping methods, and load modes to tune performance
+- **Production-ready features**: Comprehensive logging, error handling, and monitoring help ensure reliable migrations
+
+### Parallelization Strategies
+
+One aspect we particularly appreciate about FastTransfer is its range of parallelization methods, accessible through the `-M, --method` option.
+
+### Citus Columnar → PostgreSQL Transfer Performance
+
+
+## Key Takeaways
+
+### Performance Summary
+
+From our benchmarks with 15 million rows:
+
+| Scenario | Best Method | Time | Speedup | Key Insight |
+|----------|------------|------|---------|-------------|
+| PostgreSQL → Citus | Ctid (8 threads) | 3.3s | 3.74x | Direct row access provides best performance |
+| Citus → PostgreSQL | RangeId UNLOGGED (8 threads) | 3.9s | 2.52x | UNLOGGED tables dramatically improve write speed |
+| Cross-compatible | RangeId (4 threads) | 5.3s | 2.29x | Good balance of performance and portability |
+
+### Important Considerations
+
+1. **Storage vs Speed Trade-off**: Columnar storage reduces disk usage by 76% but adds ~20% write overhead
+2. **Diminishing Returns**: Parallelization beyond 4 threads often shows limited benefit
+3. **Method Limitations**: Not all methods work with all storage types (e.g., Ctid incompatible with columnar)
+4. **Asymmetric Performance**: Reading from columnar is faster than writing to it
+
+## Analysis and Insights
+
+After running these benchmarks, several patterns became clear that might help inform your migration strategy.
+
+### Why Ctid Typically Outperforms Other Methods
+
+In our testing, the Ctid method consistently delivered the best performance for PostgreSQL sources. This makes sense when you consider that ctid provides direct access to physical row locations, eliminating the need for sorting or complex query planning that other methods require.
+
+### Scalability Patterns
+
+One interesting finding from our tests relates to how parallelization efficiency changes with thread count:
+
+#### The Law of Diminishing Returns
+
+As we increased parallelism, we observed declining efficiency across all methods:
+- **Sweet Spot**: In most cases, 4 threads offered the best balance between performance and resource utilization
+- **Efficiency Cliff**: At 8 threads, efficiency often dropped below 50% (see the formula just below), suggesting that the overhead of coordination begins to outweigh the benefits
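+
+Parallel efficiency here is simply the measured speedup divided by the number of threads:
+
+$$
+E(p) = \frac{S(p)}{p} = \frac{T_1}{p \, T_p}
+$$
+
+For instance, the cross-compatible RangeId result from the summary table above (2.29x speedup with 4 threads) corresponds to $E(4) = 2.29 / 4 \approx 57\%$, while the RangeId UNLOGGED result (2.52x at 8 threads) gives $E(8) \approx 32\%$, well below the 50% mark.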
+
+### Understanding Columnar Storage Impact
+
+Our benchmarks revealed several important considerations when working with columnar storage:
+
+#### Write Performance Trade-offs
+
+We observed that writing to columnar storage introduces approximately 19% overhead compared to standard tables (12,092 ms vs 10,141 ms). This overhead comes from several sources:
+- Compression processing (LZ4 in our configuration)
+- Data reorganization into columnar format (stripes and chunks)
+- Additional metadata management
+
+However, it's important to remember that this overhead delivers significant storage savings, in our case, a 76% reduction in disk usage.
+
+#### Read Performance Benefits
+
+Conversely, reading from columnar storage proved notably efficient:
+- Transfers from Citus to PostgreSQL completed 18% faster than the reverse direction
+- Compressed data requires less I/O bandwidth
+- Sequential reading patterns align well with columnar storage organization
+
+#### Asymmetric Performance Characteristics
+
+One surprising finding was that Citus → PostgreSQL transfers consistently outperformed PostgreSQL → Citus transfers. This asymmetry makes sense when you consider that:
+- The read-side benefits of compression outweigh the write-side penalties
+- Standard PostgreSQL tables have highly optimized write paths
+- The combination results in better overall performance when columnar is the source
+
+#### Method Compatibility Considerations
+
+It's worth noting that not all parallelization methods work with columnar storage. The Ctid method, while excellent for standard PostgreSQL tables, isn't compatible with columnar architecture due to the different way data is organized and accessed.
+
+## Conclusion
+
+FastTransfer effectively handles migrations involving Citus columnar storage, achieving up to 76% storage savings while maintaining high transfer speeds. The choice of parallelization method significantly impacts performance, with Ntile delivering the best balance for columnar targets. These results demonstrate that columnar storage and efficient data migration are not mutually exclusive when using the right tools.
+
+---
+
+## About FastTransfer
+
+FastTransfer is a commercial high-performance data migration tool developed by [arpe.io](https://arpe.io). It provides parallel data transfer capabilities across multiple database platforms including PostgreSQL, MySQL, Oracle, SQL Server, ClickHouse, and DuckDB.
+
+**Key Features:**
+- Advanced parallelization strategies for optimal performance
+- Cross-platform compatibility with major databases
+- Flexible configuration for various data migration scenarios
+- Production-ready with comprehensive logging and monitoring
+
+For licensing information, support options, and to request a trial, visit the [official documentation](https://aetperf.github.io/FastTransfer-Documentation/).
diff --git a/_posts/WP_2025-09-29-High-Speed-PostgreSQL-Replication-on-OVH-with-FastTransfer.md b/_posts/WP_2025-09-29-High-Speed-PostgreSQL-Replication-on-OVH-with-FastTransfer.md
new file mode 100644
index 0000000..fd7dfb1
--- /dev/null
+++ b/_posts/WP_2025-09-29-High-Speed-PostgreSQL-Replication-on-OVH-with-FastTransfer.md
@@ -0,0 +1,121 @@
+## Introduction
+
+PostgreSQL-to-PostgreSQL replication at scale requires tools that can fully leverage modern cloud infrastructure and network capabilities. [FastTransfer](https://aetperf.github.io/FastTransfer-Documentation/) is a commercial data migration tool designed to maximize throughput through advanced parallelization. This post demonstrates FastTransfer's performance transferring 113GB of TPC-H data between OVH c3-256 instances over a 20 Gbit/s network.
+
+## Infrastructure Setup
+
+For our testing environment, we deployed PostgreSQL on two OVH c3-256 instances in the Paris datacenter. Here's what we're working with:
+
+- **OVH Instances**: c3-256 (256GB RAM, 128 vCores @2.3GHz, 400GB NVMe)
+- **Network**: 20 Gbit/s vrack, Paris datacenter (eu-west-par-c)
+- **OS**: Ubuntu 24
+- **PostgreSQL**: Version 16
+- **Dataset**: TPC-H SF100 lineitem table (~600M rows, ~113GB)
+
+
+
+## PostgreSQL Configuration
+
+The PostgreSQL configuration is tuned for bulk operations: 80GB shared_buffers, 128 parallel workers, and minimal WAL logging. Target tables are UNLOGGED, with no primary keys.
+
+## Target Database Disk Performance
+
+The target PostgreSQL instance uses the native 400GB NVMe instance disk (not block storage) for database storage. This provides excellent I/O performance crucial for high-speed data ingestion:
+
+### FIO Benchmark Command
+```bash
+fio --name=seqwrite --filename=/tmp/fio-test --rw=write \
+ --bs=1M --size=8G --direct=1 --numjobs=1 --runtime=30 --group_reporting
+```
+
+### Results
+```
+Sequential Write Performance (8GB test, 1MB blocks):
+- Throughput: 1,260 MB/s (1.26 GB/s)
+- IOPS: 1,259
+- Average latency: 787 microseconds
+- 95th percentile: 1.5ms
+- 99th percentile: 2.3ms
+```
+
+The native NVMe storage delivers consistent low-latency writes with over 1.2 GB/s throughput, ensuring disk I/O is not a bottleneck for the PostgreSQL COPY operations even at peak network transfer rates.
+
+## Network Performance
+
+The private network connection between source and target instances was tested using iperf3 to verify bandwidth capacity:
+
+### iperf3 Benchmark Command
+```bash
+# On target instance
+iperf3 -s
+
+# On source instance
+iperf3 -c 10.10.0.50 -P 64 -t 30
+```
+
+### Results
+```
+Network Throughput Test (64 parallel streams, 30 seconds):
+- Average throughput: 20.5 Gbit/s
+- Total data transferred: 71.7 GB
+- Consistent performance across all streams
+```
+
+The network delivers full line-rate performance, slightly exceeding the nominal 20 Gbit/s specification. With 64 parallel TCP streams, the network provides ample bandwidth for FastTransfer's parallel data transfer operations.
+
+## FastTransfer Command
+
+FastTransfer version: 0.13.12
+
+```bash
+./FastTransfer \
+ --sourceconnectiontype "pgcopy" \
+ --sourceconnectstring "Host=localhost;Port=5432;Database=tpch;Trust Server Certificate=True;Application Name=FastTransfer;Maximum Pool Size=150;Timeout=15;Command Timeout=10800;Username=fasttransfer;Password=******" \
+ --sourceschema "tpch_100" --sourcetable "lineitem" \
+ --targetconnectiontype "pgcopy" \
+ --targetconnectstring "Host=10.10.0.50;Port=5432;Database=tpch;Trust Server Certificate=True;Application Name=FastTransfer;Maximum Pool Size=150;Timeout=15;Command Timeout=10800;Username=fasttransfer;Password=******" \
+ --targetschema "tpch_100" --targettable "lineitem" \
+ --loadmode "Truncate" --method "Ctid" --degree 128
+```
+
+Note the `Maximum Pool Size=150` setting in the connection strings, increased from the default of 100 to support 128 parallel threads.
+
+## Performance Results
+
+### Transfer Time
+
+
+
+Transfer time: 749s (single thread) → 70s (128 threads)
+
+### Throughput Scaling
+
+
+
+Throughput: 145 MB/s → 1,880 MB/s (75% of 20 Gbit/s link capacity)
+
+
+## Results Summary
+
+- **113GB transferred in 70 seconds** (degree=128)
+- **1.88 GB/s peak throughput** achieved
+- **10.7x speedup** with 128 parallel connections
+- **Optimal range**: 32-64 threads for best efficiency/performance balance
+
+## Conclusion
+
+FastTransfer achieves 1.88 GB/s throughput when transferring 113GB of data between PostgreSQL instances, utilizing 75% of the available 20 Gbit/s network capacity. The 10.7x speedup with 128 parallel connections demonstrates excellent scalability on OVH's high-end infrastructure. These results confirm that FastTransfer can effectively saturate modern cloud networking for PostgreSQL-to-PostgreSQL migrations.
+
+---
+
+## About FastTransfer
+
+FastTransfer is a commercial high-performance data migration tool developed by [arpe.io](https://arpe.io). It provides parallel data transfer capabilities across multiple database platforms including PostgreSQL, MySQL, Oracle, SQL Server, ClickHouse, and DuckDB.
+
+**Key Features:**
+- Advanced parallelization strategies for optimal performance
+- Cross-platform compatibility with major databases
+- Flexible configuration for various data migration scenarios
+- Production-ready with comprehensive logging and monitoring
+
+For licensing information, support options, and to request a trial, visit the [official documentation](https://aetperf.github.io/FastTransfer-Documentation/).
\ No newline at end of file
diff --git a/_posts/WP_2025-10-25-Performance-Analysis-of-Parallel-Data-Replication-Between-Two-PostgreSQL-18-Instances-on-OVH.md b/_posts/WP_2025-10-25-Performance-Analysis-of-Parallel-Data-Replication-Between-Two-PostgreSQL-18-Instances-on-OVH.md
new file mode 100644
index 0000000..573d7d5
--- /dev/null
+++ b/_posts/WP_2025-10-25-Performance-Analysis-of-Parallel-Data-Replication-Between-Two-PostgreSQL-18-Instances-on-OVH.md
@@ -0,0 +1,392 @@
+## Introduction
+
+Parallel data replication between PostgreSQL instances presents unique challenges at scale, particularly when attempting to maximize throughput on high-performance cloud infrastructure. [FastTransfer](https://aetperf.github.io/FastTransfer-Documentation/) is a commercial data migration tool designed to leverage advanced parallelization strategies for efficient data movement. This post provides a performance analysis of FastTransfer transferring 77GB of data between two PostgreSQL 18 instances on OVH c3-256 servers, examining CPU, disk I/O, and network bottlenecks across parallelism degrees from 1 to 128.
+
+### Test Configuration
+
+The test dataset consists of the TPC-H SF100 lineitem table (~600M rows, ~77GB), configured as an UNLOGGED table without indexes, constraints, or triggers. Both instances were tuned for bulk loading operations, with all durability features disabled, large memory allocations, and PostgreSQL 18's `io_uring` support enabled (configuration details in Appendix A). Despite this comprehensive optimization, it appears that lock contention emerges at high parallelism degrees, limiting scalability.
+
+Testing was performed at eight parallelism degrees, executed sequentially in a progressive loading pattern: 1, 2, 4, 8, 16, 32, 64, and 128, with each step doubling to systematically increase load. Each configuration was run only once rather than following standard statistical practice of multiple runs with mean, standard deviation, and confidence intervals. This single-run approach was adopted after preliminary tests showed minimal variation between successive runs, indicating stable and reproducible results under these controlled conditions.
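+
+The analysis below relies on system-level metrics sampled during each run (`sar` is mentioned for the network counters; the exact tooling used for the CPU time series is not detailed in this post). Purely as an illustration of the approach, here is a minimal sketch of how per-component CPU samples could be gathered with `psutil`:
+
+```python
+import time
+
+import psutil  # assumption: psutil is available on the monitored host
+
+
+def component(proc_name: str) -> str:
+    """Map a process name to one of the monitored components."""
+    if "postgres" in proc_name:
+        return "postgres"
+    if "FastTransfer" in proc_name:
+        return "fasttransfer"
+    return "other"
+
+
+samples = []
+for _ in range(60):  # one sample per second for a minute
+    totals = {"postgres": 0.0, "fasttransfer": 0.0, "other": 0.0}
+    for p in psutil.process_iter(["name"]):
+        try:
+            # cpu_percent() is relative to one core, matching the post's
+            # convention where e.g. 3,294% means ~33 busy cores; the first
+            # pass returns 0.0 for each process and can be discarded.
+            totals[component(p.info["name"] or "")] += p.cpu_percent(interval=None)
+        except (psutil.NoSuchProcess, psutil.AccessDenied):
+            pass
+    samples.append({"ts": time.time(), **totals})
+    time.sleep(1.0)
+```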
+
+### OVH Infrastructure Setup
+
+The test environment consists of two identical OVH cloud instances designed for heavy workloads:
+
+
+
+**Figure 1: OVH Infrastructure Architecture** - The test setup consists of two identical c3-256 instances (128 vCores, 256GB RAM, 400GB NVMe) running PostgreSQL 18 on Ubuntu 24.04. The source instance contains the TPC-H SF100 lineitem table. FastTransfer orchestrates parallel data replication across a 20 Gbit/s vrack private network connection to the target instance. Both instances are located in the Paris datacenter (eu-west-par-c) for minimal network latency.
+
+**Hardware Configuration:**
+
+- **Instance Type**: OVH c3-256
+- **Memory**: 256GB RAM
+- **CPU**: 128 vCores @ 2.3 GHz
+- **Storage**:
+ - **Target**: 400GB local NVMe SSD
+  - **Source**: OVH Block Storage (high-speed-gen2, ~2 TB, bandwidth: 1 GB/s, performance: 20,000 IOPS)
+- **Network**: 20 Gbit/s vrack (2.5 GB/s)
+
+The source instance PostgreSQL data directory resides on attached OVH Block Storage rather than local NVMe. This asymmetric storage configuration does not affect the analysis conclusions, as the source PostgreSQL instance exhibits backpressure behavior rather than storage-limited performance.
+
+**Software Stack:**
+
+- **OS**: Ubuntu 24.04.3 LTS with Linux kernel 6.8
+- **PostgreSQL**: Version 18.0, with `io_uring`, huge pages (`vm.nr_hugepages=45000`)
+- **FastTransfer**: Version 0.13.12
+
+**Infrastructure Performance Baseline:**
+
+- **Network**: 20.5 Gbit/s (2.56 GB/s) verified with iperf3
+- **Target Disk Sequential Write**: 3,741 MB/s (FIO benchmark with 128K blocks)
+- **Target Disk Random Write**: 88.2 MB/s, 22,600 IOPS (FIO, 4K blocks)
+
+### Overall Performance
+
+FastTransfer achieves strong absolute performance, transferring 77GB in just 67 seconds at degree 128, equivalent to 1.15 GB/s sustained throughput. The parallel replication process scales continuously across all tested degrees, with total elapsed time decreasing from 878 seconds (degree 1) to 67 seconds (degree 128). The system delivers consistent real-world performance improvements even at large parallelism levels, though lock contention on the target PostgreSQL instance appears to increasingly limit scaling efficiency beyond degree 32.
+
+
+
+**Figure 2: Total Elapsed Time by Degree of Parallelism** - Wall-clock time improves continuously across all tested degrees, from 878 seconds (degree 1) to 67 seconds (degree 128). Performance gains remain positive throughout, though the rate of improvement diminishes beyond degree 32 due to increasing lock contention.
+
+## 1. CPU Usage Analysis
+
+### 1.1 Mean and Peak CPU Usage
+
+
+
+
+**Figure 3: Mean CPU Usage by Component** - Target PostgreSQL (red) dominates resource consumption at high parallelism, while source PostgreSQL (blue) reaches around 12 cores.
+
+
+
+
+**Figure 4: Peak CPU Usage by Component** - Target PostgreSQL exhibits high peak values (~6,969% at degree 128). The large spikes combined with relatively lower mean values indicate high variance, characteristic of processes alternating between lock contention and productive work.
+
+**Component Scaling Summary:**
+
+| Component | Degree 1 | Degree 128 | Speedup | Efficiency |
+| ----------------- | ---------------- | ------------------- | ------- | ---------- |
+| Source PostgreSQL | 93% | 1,175% | 11.9x | 9.3% |
+| FastTransfer | 31% | 631% | 20.1x | 15.7% |
+| Target PostgreSQL | 98% | 3,294% | 33.6x | 26.3% |
+
+Source PostgreSQL's poor scaling appears to stem from backpressure: FastTransfer's batch-and-wait protocol means source processes send a batch, then block waiting for target acknowledgment. When the target cannot consume data quickly due to lock contention, this delay propagates backward. At degree 128, the source processes collectively use only 11.7 cores (0.11 cores/process), suggesting they're waiting rather than actively working.
+
+Note also that FastTransfer uses PostgreSQL's Ctid pseudo-column for table partitioning, which doesn't allow a perfectly even distribution: some partitions are smaller than others, causing some processes to complete and exit before the rest.
+
+### 1.2 FastTransfer
+
+
+
+**Figure 5: FastTransfer User vs System CPU** - At degree 128, FastTransfer uses 419% user CPU (66%) and 212% system CPU (34%).
+
+In the present case, FastTransfer uses PostgreSQL's binary COPY protocol on both sides (`--sourceconnectiontype "pgcopy"` and `--targetconnectiontype "pgcopy"`). Data flows directly from the source PostgreSQL's COPY TO BINARY through FastTransfer to the target PostgreSQL's COPY FROM BINARY without data transformation. FastTransfer acts as an intelligent network proxy coordinating parallel streams and batch acknowledgments, which explains its relatively low CPU usage. This would be less the case if we were transferring data between different RDBMS types.
+
+## 2. The Lock Contention Problem: System CPU Analysis
+
+### 2.1 System CPU
+
+
+
+**Figure 6: System CPU as % of Total CPU** - Target PostgreSQL (red line) crosses the 50% warning threshold at degree 16, exceeds 70% at degree 32, and peaks at 83.9% at degree 64. At this maximum, only 16.2% of CPU time performs productive work while 83.9% appears spent on lock contention and kernel overhead.
+
+CPU time divides into two categories: User CPU (application code performing actual data insertion) and System CPU (kernel operations handling locks, synchronization, context switches, I/O). A healthy system maintains system CPU below 30%.
+
+**System CPU Progression:**
+
+| Degree | Total CPU | User CPU | System CPU | System % | Productive Work |
+| ------ | --------- | -------- | ---------- | -------- | ------------------------- |
+| 1 | 98% | 80% | 18% | 18.2% | Healthy baseline |
+| 16 | 1,342% | 496% | 846% | 63.0% | Warning threshold crossed |
+| 32 | 2,436% | 602% | 1,834% | 75.3% | High contention |
+| 64 | 4,596% | 743% | 3,854% | 83.9% | **Maximum contention** |
+| 128 | 4,230% | 1,248% | 2,982% | 70.5% | Reduced contention |
+
+At degree 64, processes appear to spend 83.9% of time managing locks rather than inserting data. By degree 128, system CPU percentage unexpectedly decreases to 70.5% for unclear reasons, though absolute performance continues to improve.
+
+### 2.2 Possible Causes of Lock Contention
+
+The target table was already optimized for bulk loading (UNLOGGED, no indexes, no constraints, no triggers), eliminating all standard overhead sources. The remaining contention therefore likely stems from PostgreSQL's fundamental architecture (one way to observe these waits directly is sketched right after this list):
+
+1. **Shared Buffer Pool Locks**: All 128 parallel connections compete for buffer pool partition locks to read/modify/write pages.
+
+2. **Relation Extension Locks**: When the table grows, PostgreSQL requires an exclusive lock (only one process at a time).
+
+3. **Free Space Map (FSM) Locks**: All 128 writers query and update the FSM to find pages with free space, creating constant FSM thrashing.
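+
+One way to check these hypotheses is to sample `pg_stat_activity` wait events on the target while a load is running. A minimal sketch with psycopg2 (the connection string is a placeholder; interpreting specific wait event names is left to the follow-up study mentioned in Section 6.3):
+
+```python
+import time
+from collections import Counter
+
+import psycopg2  # assumption: psycopg2 installed; DSN below is a placeholder
+
+wait_counts = Counter()
+with psycopg2.connect("host=target-host dbname=tpch user=fasttransfer") as conn:
+    conn.autocommit = True
+    with conn.cursor() as cur:
+        for _ in range(300):  # ~5 minutes of 1-second samples
+            cur.execute(
+                """
+                SELECT wait_event_type, wait_event
+                FROM pg_stat_activity
+                WHERE state = 'active' AND wait_event IS NOT NULL
+                """
+            )
+            wait_counts.update(cur.fetchall())
+            time.sleep(1.0)
+
+for (event_type, event), n in wait_counts.most_common(10):
+    print(f"{event_type:10} {event:25} {n}")
+```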
+
+## 3. Distribution and Time Series Analysis
+
+### 3.1 CPU Distribution
+
+
+
+**Figure 7: CPU Distribution at Degree 4** - Tight, healthy distributions with small standard deviations. All components operate consistently without significant contention.
+
+
+
+**Figure 8: CPU Distribution at Degree 32** - Target PostgreSQL (red) becomes bimodal with wide spread (1000-3000% range), indicating some samples capture waiting processes while others capture active processes. Source (blue) remains relatively tight.
+
+
+
+**Figure 9: CPU Distribution at Degree 128** - Target PostgreSQL (red) spans nearly 0-10000%, indicating highly variable behavior. Some processes are nearly starved (near 0%) while others burn high CPU on lock spinning (>8000%). This wide distribution suggests lock thrashing.
+
+### 3.2 CPU Time Series
+
+
+
+**Figure 10: CPU Over Time at Degree 4** - All components show stable, smooth CPU usage with minimal oscillations throughout the test duration.
+
+
+
+**Figure 11: CPU Over Time at Degree 32** - Target PostgreSQL (red) shows increasing variability and oscillations, indicating periods of successful lock acquisition alternating with blocking periods.
+
+
+
+**Figure 12: CPU Over Time at Degree 128** - Target PostgreSQL (red) exhibits oscillations with wild CPU swings, suggesting significant lock thrashing. Source (blue) and FastTransfer (green) show variability reflecting downstream backpressure.
+
+## 4. Performance Scaling Analysis: Degrees 64 to 128
+
+### 4.1 Continued Performance Improvement at Extreme Parallelism
+
+Degree 128 achieves the best absolute performance in the test suite, completing the transfer in 67 seconds compared to 92 seconds at degree 64, a meaningful 1.37x speedup that brings total throughput to 1.15 GB/s. While this represents 68.7% efficiency for the doubling operation (rather than the theoretical 2x), the continued improvement demonstrates that the system remains functional and beneficial even at extreme parallelism levels.
+
+### 4.2 Unexpected Efficiency Improvements at Degree 128
+
+Degree 128 exhibits a counterintuitive result: lower system CPU overhead (70.5%) than degree 64 (83.9%) despite doubling parallelism, while total CPU actually decreases by 8.0% (4,596% → 4,230%). User CPU efficiency improves by 82.1% (16.2% → 29.5% of total CPU), meaning nearly double the proportion of CPU time goes to productive work rather than lock contention. The reason for these improvements remains unclear.
+
+**The Comparative Analysis:**
+
+| Metric | Degree 64 | Degree 128 | Change |
+| ----------------------- | -------------------- | -------------------- | -------------------- |
+| Elapsed Time | 92s | 67s | 1.37x speedup |
+| Total CPU | 4,596% | 4,230% | -8.0% |
+| User CPU | 743% (16.2% of total)| 1,248% (29.5% of total) | +68.0% |
+| System CPU | 3,854% (83.9% of total) | 2,982% (70.5% of total) | -22.6% |
+| Network Throughput | 1,033 MB/s mean | 1,088 MB/s mean | +5.3% |
+| Network Peak | 2,335 MB/s (93.4%) | 2,904 MB/s (116.2%) | Saturation |
+| Disk Throughput | 759 MB/s | 1,099 MB/s | +44.8% |
+
+**Open Question: Why Does Efficiency Improve at Degree 128?**
+
+The improvement from degree 64 to 128 is puzzling for several reasons:
+
+1. **Why does network bandwidth increase by 5.3%** (1,033 MB/s → 1,088 MB/s) when adding more parallelism to an already saturated network? At degree 128, network peaks at 2,904 MB/s (116.2% of capacity), yet mean throughput still increases.
+
+2. **Why does system CPU overhead decrease** from 83.9% to 70.5% despite doubling parallelism? More processes should create more lock contention, not less.
+
+3. **Why does user CPU efficiency nearly double** (16.2% → 29.5% of total) when adding 64 more processes competing for the same resources?
+
+One hypothesis is that network saturation at degree 128 acts as a pacing mechanism, rate-limiting data delivery and preventing all 128 processes from simultaneously contending for locks. However, this doesn't fully explain why network throughput itself increases, nor why the efficiency gains are so substantial. The interaction between network saturation, lock contention, and process scheduling appears more complex than initially understood.
+
+## 5. Disk I/O and Network Analysis
+
+### 5.1 Source Disk I/O Analysis
+
+The source instance has 256GB RAM with a Postgres `effective_cache_size` of 192GB, and the lineitem table is ~77GB. An important detail explains the disk behavior across test runs:
+
+Degree 1 was the first test run, with no prior warm-up or cold run to pre-load the table into the cache. During this first run at degree 1, there is heavy disk activity (500 MB/s, ~50% peak utilization) while the table is loaded into memory (shared_buffers + OS page cache). At degrees 2-128, there is essentially zero disk activity; the entire table remains cached in memory from the initial degree 1 load. This explains why degree 2 is more than twice as fast as degree 1: the degree 1 run includes the initial table-loading overhead, while degree 2 benefits from the already-cached table with no disk loading required. The speedup from degree 1 to 2 therefore reflects both the doubling of parallelism and the elimination of the initial cache-loading penalty.
+
+
+
+**Figure 13: Source Disk Utilization Over Time** - Shows disk utilization across all test runs (vertical lines mark test boundaries for degrees 1, 2, 4, 8, 16, 32, 64, 128). At degree 1, utilization peaks at ~50% during the initial table load, then drops to near-zero. At higher degrees (2-128), utilization remains below 1% throughout, confirming the disk is idle and not limiting performance.
+
+Disk utilization measures the percentage of time the disk is busy serving I/O requests. Source disk I/O is not a bottleneck at any parallelism degree.
+
+### 5.2 Target Disk I/O Time Series
+
+
+
+**Figure 14: Target Disk Write Throughput Over Time** - Throughput exhibits bursty behavior with spikes to 2000-3759 MB/s followed by drops to near zero. Sustained baseline varies from ~100 MB/s (low degrees) to ~300 MB/s (degree 128) but never sustains disk capacity.
+
+
+
+**Figure 15: Target Disk Utilization Over Time** - Mean utilization remains below 25% across all degrees. Spikes reach 70-90% during bursts but quickly return to low baseline. This suggests disk I/O is not the bottleneck.
+
+### 5.3 Network Throughput Analysis
+
+
+
+**Figure 16: Target Network Ingress Over Time** - At degree 128, throughput plateaus at ~2,450 MB/s (98% of capacity) during active bursts, but averages only 1,088 MB/s (43.5%) due to alternating active/idle periods. At degrees 1-64, network remains well below capacity.
+
+Network saturation only occurs at degree 128 during active bursts. Therefore, network doesn't explain poor scaling from degree 1 through 64, target CPU lock contention remains the primary bottleneck.
+
+### 5.4 Cross-Degree Scaling Analysis
+
+
+
+**Figure 17: Mean Disk Write Throughput by Degree** - Scales from 90 MB/s (degree 1) to 1,099 MB/s (degree 128), only 12.3x improvement for 128x parallelism (9.6% efficiency).
+
+
+
+**Figure 18: Network Throughput Comparison: Source TX vs Target RX** - At degree 128, source transmits 1,684 MB/s while target receives only 1,088 MB/s, creating a 596 MB/s (35%) deficit. This suggests the target cannot keep pace with source data production, likely due to CPU lock contention.
+
+The apparent 35% violation of flow conservation is explained by TCP retransmissions. The source TX counter (measured via `sar -n DEV`) counts both original packets and retransmitted packets, while the target RX counter only counts successfully received unique packets. When the target is overloaded with CPU lock contention (83.9% system CPU at degree 64), it cannot drain receive buffers fast enough, causing packet drops that trigger TCP retransmissions. The 596 MB/s "deficit" is actually retransmitted data counted twice at the source but only once at the target, providing quantitative evidence of the target's inability to keep pace with source data production.
+
+### 5.5 I/O Analysis Conclusions
+
+1. **Disk does not appear to be the bottleneck**: 24% average utilization at degree 128 with 76% idle capacity. PostgreSQL matches FIO peak (3,759 MB/s) but sustains only 170 MB/s average.
+
+2. **Network does not appear to be the bottleneck for degrees 1-64**: Utilization remains below 42% through degree 64. Saturation occurs only at degree 128 during active bursts (~2,450 MB/s plateau).
+
+3. **Target CPU lock contention appears to be the root cause**: Low disk utilization + network saturation only at degree 128 + poor scaling efficiency throughout + high system CPU percentage (83.9% at degree 64) all point to the same conclusion.
+
+4. **Backpressure suggests target bottleneck**: Source can produce 1,684 MB/s but target can only consume 1,088 MB/s. Source processes use only 0.11 cores/process, suggesting they're blocked waiting for target acknowledgments.
+
+## 6. Conclusions
+
+### 6.1 Performance Achievement and Bottleneck Analysis
+
+FastTransfer successfully demonstrates strong absolute performance, achieving a 13.1x speedup that reduces 77GB transfer time from approximately 15 minutes (878s) to just over 1 minute (67s). This represents practical, production-ready performance with sustained throughput of 1.15 GB/s at degree 128. The system delivers continuous performance improvements across all tested parallelism degrees, confirming that parallel replication provides meaningful benefits even when facing coordination challenges.
+
+The primary scaling limitation appears to be target PostgreSQL lock contention beyond degree 32. System CPU grows to 83.9% at degree 64, meaning only 16.2% of CPU performs productive work. Degree 128 continues to improve absolute performance (67s vs 92s) even as total CPU decreases from 4,596% to 4,230%, though the reason for this unexpected efficiency improvement remains unclear.
+
+### 6.2 Why Additional Tuning Is Unlikely to Help
+
+The target table is already optimally configured (UNLOGGED, no indexes, no constraints, no triggers), and the PostgreSQL configuration includes all recommended bulk loading optimizations (80GB shared_buffers, huge pages, `io_uring`, fsync=off). Despite this, system CPU remains at 70-84% at high degrees.
+
+The bottleneck appears to be architectural rather than configurational: buffer pool partition locks, the relation extension lock, and FSM access. No configuration parameter appears able to eliminate these fundamental coordination requirements.
+
+### 6.3 Future Work: PostgreSQL Instrumentation Analysis
+
+While this analysis relied on system-level metrics, a follow-up study will use PostgreSQL's internal instrumentation to provide direct evidence of lock contention and wait events. This will validate the hypotheses presented in this analysis using database engine-level metrics.
+
+
+## Appendix A: PostgreSQL Configuration
+
+Both PostgreSQL 18 instances were tuned for maximum bulk loading performance.
+
+### Target PostgreSQL Configuration (Key Settings)
+
+```ini
+# Memory allocation
+shared_buffers = 80GB # 31% of 256GB RAM
+huge_pages = on # vm.nr_hugepages=45000
+work_mem = 256MB
+maintenance_work_mem = 16GB
+
+# Durability disabled (benchmark only, NOT production)
+synchronous_commit = off
+fsync = off
+full_page_writes = off
+
+# WAL configuration (minimal for UNLOGGED)
+wal_level = minimal
+wal_buffers = 128MB
+max_wal_size = 128GB
+checkpoint_timeout = 15min
+checkpoint_completion_target = 0.5
+
+# Background writer (AGGRESSIVE)
+bgwriter_delay = 10ms # Down from default 200ms
+bgwriter_lru_maxpages = 2000 # 2x default
+bgwriter_lru_multiplier = 8.0 # 2x default
+bgwriter_flush_after = 0
+
+# I/O configuration (PostgreSQL 18 optimizations)
+backend_flush_after = 0
+effective_io_concurrency = 400 # Optimized for NVMe
+maintenance_io_concurrency = 400
+io_method = io_uring # NEW PG18: async I/O
+io_max_concurrency = 512 # NEW PG18
+io_workers = 8 # NEW PG18: up from default 3
+
+# Worker processes
+max_worker_processes = 128
+max_parallel_workers = 128
+
+# Autovacuum (PostgreSQL 18)
+autovacuum = on
+autovacuum_worker_slots = 32 # NEW PG18: runtime adjustment
+autovacuum_max_workers = 16
+autovacuum_vacuum_cost_delay = 0 # No throttling
+
+# Query tuning
+enable_partitionwise_join = on
+enable_partitionwise_aggregate = on
+random_page_cost = 1.1 # NVMe SSD
+effective_cache_size = 192GB # ~75% of RAM
+```
+
+### Source PostgreSQL Configuration (Key Settings)
+
+The source instance is optimized for fast parallel reads to support high-throughput data extraction:
+
+```ini
+# Memory allocation
+shared_buffers = 80GB # ~31% of 256GB RAM
+huge_pages = on # vm.nr_hugepages=45000
+work_mem = 256MB
+maintenance_work_mem = 4GB # Lower than target (16GB)
+
+# Durability disabled (benchmark only, NOT production)
+synchronous_commit = off
+fsync = off
+full_page_writes = off
+
+# WAL configuration
+wal_level = minimal
+wal_buffers = -1 # Auto-sized
+max_wal_size = 32GB # Smaller than target (128GB)
+checkpoint_timeout = 60min # Longer than target (15min)
+checkpoint_completion_target = 0.9
+
+# Background writer
+bgwriter_delay = 50ms # Less aggressive than target (10ms)
+bgwriter_lru_maxpages = 1000 # Half of target (2000)
+bgwriter_lru_multiplier = 4.0 # Half of target (8.0)
+bgwriter_flush_after = 2MB
+
+# I/O configuration (PostgreSQL 18 optimizations)
+backend_flush_after = 0
+effective_io_concurrency = 400 # Identical to target
+maintenance_io_concurrency = 400
+io_method = io_uring # NEW PG18: async I/O
+io_max_concurrency = 512 # NEW PG18
+io_workers = 8 # NEW PG18
+
+# Worker processes
+max_connections = 500 # Higher than target for parallel readers
+max_worker_processes = 128
+max_parallel_workers_per_gather = 64
+max_parallel_workers = 128
+
+# Query tuning (optimized for parallel reads)
+enable_partitionwise_join = on
+enable_partitionwise_aggregate = on
+random_page_cost = 1.1 # Block Storage (not NVMe)
+effective_cache_size = 192GB # ~75% of RAM
+default_statistics_target = 500
+
+# Autovacuum (PostgreSQL 18)
+autovacuum = on
+autovacuum_worker_slots = 32 # NEW PG18: runtime adjustment
+autovacuum_max_workers = 16
+autovacuum_vacuum_cost_delay = 0 # No throttling
+```
+
+### Table Configuration
+
+The target table eliminates all overhead sources:
+
+- **UNLOGGED**: No WAL write, flush, or archival overhead
+- **No indexes**: Eliminates 50-80% of bulk load cost
+- **No primary key**: No index maintenance or uniqueness checking
+- **No constraints**: No foreign key, check, or unique validation
+- **No triggers**: No trigger execution overhead
+
+This represents the absolute minimum overhead possible.
+
+---
+
+## About FastTransfer
+
+FastTransfer is a commercial high-performance data migration tool developed by [arpe.io](https://arpe.io). It provides parallel data transfer capabilities across multiple database platforms including PostgreSQL, MySQL, Oracle, SQL Server, ClickHouse, and DuckDB.
+
+**Key Features:**
+
+- Advanced parallelization strategies for optimal performance
+- Cross-platform compatibility with major databases
+- Flexible configuration for various data migration scenarios
+- Production-ready with comprehensive logging and monitoring
+
+For licensing information, support options, and to request a trial, visit the [official documentation](https://aetperf.github.io/FastTransfer-Documentation/).
diff --git a/img/2022-11-23_01/mutual_refs_02.jpg b/img/2022-11-23_01/mutual_refs_02.jpg
index a5246bd..2bf848d 100644
Binary files a/img/2022-11-23_01/mutual_refs_02.jpg and b/img/2022-11-23_01/mutual_refs_02.jpg differ
diff --git a/img/2025-07-12_01/output_17_0.png b/img/2025-07-12_01/output_17_0.png
new file mode 100644
index 0000000..91b2c71
Binary files /dev/null and b/img/2025-07-12_01/output_17_0.png differ
diff --git a/img/2025-07-12_01/output_19_0.png b/img/2025-07-12_01/output_19_0.png
new file mode 100644
index 0000000..a50b7df
Binary files /dev/null and b/img/2025-07-12_01/output_19_0.png differ
diff --git a/img/2025-07-12_01/output_21_0.png b/img/2025-07-12_01/output_21_0.png
new file mode 100644
index 0000000..4d18b4a
Binary files /dev/null and b/img/2025-07-12_01/output_21_0.png differ
diff --git a/img/2025-07-12_01/output_26_0.png b/img/2025-07-12_01/output_26_0.png
new file mode 100644
index 0000000..5eff70f
Binary files /dev/null and b/img/2025-07-12_01/output_26_0.png differ
diff --git a/img/2025-09-29_01/transfer_citus_to_pg.jpg b/img/2025-09-29_01/transfer_citus_to_pg.jpg
new file mode 100644
index 0000000..781eaac
Binary files /dev/null and b/img/2025-09-29_01/transfer_citus_to_pg.jpg differ
diff --git a/img/2025-09-29_01/transfer_pg_to_citus.jpg b/img/2025-09-29_01/transfer_pg_to_citus.jpg
new file mode 100644
index 0000000..98395ed
Binary files /dev/null and b/img/2025-09-29_01/transfer_pg_to_citus.jpg differ
diff --git a/img/2025-09-29_02/architecture.jpg b/img/2025-09-29_02/architecture.jpg
new file mode 100644
index 0000000..64800bb
Binary files /dev/null and b/img/2025-09-29_02/architecture.jpg differ
diff --git a/img/2025-09-29_02/transfer_s3_to_postgres_comparison.jpg b/img/2025-09-29_02/transfer_s3_to_postgres_comparison.jpg
new file mode 100644
index 0000000..e14631a
Binary files /dev/null and b/img/2025-09-29_02/transfer_s3_to_postgres_comparison.jpg differ
diff --git a/img/2025-09-29_03/architecture.jpg b/img/2025-09-29_03/architecture.jpg
new file mode 100644
index 0000000..d5d99b0
Binary files /dev/null and b/img/2025-09-29_03/architecture.jpg differ
diff --git a/img/2025-09-29_03/lineitem_elapsed_time.jpg b/img/2025-09-29_03/lineitem_elapsed_time.jpg
new file mode 100644
index 0000000..155509b
Binary files /dev/null and b/img/2025-09-29_03/lineitem_elapsed_time.jpg differ
diff --git a/img/2025-09-29_03/lineitem_throughput.jpg b/img/2025-09-29_03/lineitem_throughput.jpg
new file mode 100644
index 0000000..8b2c6c4
Binary files /dev/null and b/img/2025-09-29_03/lineitem_throughput.jpg differ
diff --git a/img/2025-10-25_01/2019_DEC_FR_ROUBAIX_IMAGE_0242_BD.jpg b/img/2025-10-25_01/2019_DEC_FR_ROUBAIX_IMAGE_0242_BD.jpg
new file mode 100644
index 0000000..2ff8b38
Binary files /dev/null and b/img/2025-10-25_01/2019_DEC_FR_ROUBAIX_IMAGE_0242_BD.jpg differ
diff --git a/img/2025-10-25_01/architecture.png b/img/2025-10-25_01/architecture.png
new file mode 100644
index 0000000..29d8b13
Binary files /dev/null and b/img/2025-10-25_01/architecture.png differ
diff --git a/img/2025-10-25_01/cross_degree_disk_write_mean.png b/img/2025-10-25_01/cross_degree_disk_write_mean.png
new file mode 100644
index 0000000..13c2804
Binary files /dev/null and b/img/2025-10-25_01/cross_degree_disk_write_mean.png differ
diff --git a/img/2025-10-25_01/cross_degree_network_comparison.png b/img/2025-10-25_01/cross_degree_network_comparison.png
new file mode 100644
index 0000000..2a015d6
Binary files /dev/null and b/img/2025-10-25_01/cross_degree_network_comparison.png differ
diff --git a/img/2025-10-25_01/elapsed_time_by_degree.png b/img/2025-10-25_01/elapsed_time_by_degree.png
new file mode 100644
index 0000000..6f1664e
Binary files /dev/null and b/img/2025-10-25_01/elapsed_time_by_degree.png differ
diff --git a/img/2025-10-25_01/plot_01_mean_cpu.png b/img/2025-10-25_01/plot_01_mean_cpu.png
new file mode 100644
index 0000000..2929099
Binary files /dev/null and b/img/2025-10-25_01/plot_01_mean_cpu.png differ
diff --git a/img/2025-10-25_01/plot_02_peak_cpu.png b/img/2025-10-25_01/plot_02_peak_cpu.png
new file mode 100644
index 0000000..0bb1780
Binary files /dev/null and b/img/2025-10-25_01/plot_02_peak_cpu.png differ
diff --git a/img/2025-10-25_01/plot_03_fasttransfer_user_system.png b/img/2025-10-25_01/plot_03_fasttransfer_user_system.png
new file mode 100644
index 0000000..ea430b3
Binary files /dev/null and b/img/2025-10-25_01/plot_03_fasttransfer_user_system.png differ
diff --git a/img/2025-10-25_01/plot_06_system_cpu_percentage.png b/img/2025-10-25_01/plot_06_system_cpu_percentage.png
new file mode 100644
index 0000000..7b9beca
Binary files /dev/null and b/img/2025-10-25_01/plot_06_system_cpu_percentage.png differ
diff --git a/img/2025-10-25_01/plot_10_timeseries_degree_4.png b/img/2025-10-25_01/plot_10_timeseries_degree_4.png
new file mode 100644
index 0000000..46a2693
Binary files /dev/null and b/img/2025-10-25_01/plot_10_timeseries_degree_4.png differ
diff --git a/img/2025-10-25_01/plot_11_timeseries_degree_32.png b/img/2025-10-25_01/plot_11_timeseries_degree_32.png
new file mode 100644
index 0000000..b136e05
Binary files /dev/null and b/img/2025-10-25_01/plot_11_timeseries_degree_32.png differ
diff --git a/img/2025-10-25_01/plot_12_timeseries_degree_128.png b/img/2025-10-25_01/plot_12_timeseries_degree_128.png
new file mode 100644
index 0000000..7928eb7
Binary files /dev/null and b/img/2025-10-25_01/plot_12_timeseries_degree_128.png differ
diff --git a/img/2025-10-25_01/plot_7_distribution_degree_4.png b/img/2025-10-25_01/plot_7_distribution_degree_4.png
new file mode 100644
index 0000000..a5c4252
Binary files /dev/null and b/img/2025-10-25_01/plot_7_distribution_degree_4.png differ
diff --git a/img/2025-10-25_01/plot_8_distribution_degree_32.png b/img/2025-10-25_01/plot_8_distribution_degree_32.png
new file mode 100644
index 0000000..d2ff8bf
Binary files /dev/null and b/img/2025-10-25_01/plot_8_distribution_degree_32.png differ
diff --git a/img/2025-10-25_01/plot_9_distribution_degree_128.png b/img/2025-10-25_01/plot_9_distribution_degree_128.png
new file mode 100644
index 0000000..6a4d267
Binary files /dev/null and b/img/2025-10-25_01/plot_9_distribution_degree_128.png differ
diff --git a/img/2025-10-25_01/source_disk_utilization_timeseries.png b/img/2025-10-25_01/source_disk_utilization_timeseries.png
new file mode 100644
index 0000000..6250679
Binary files /dev/null and b/img/2025-10-25_01/source_disk_utilization_timeseries.png differ
diff --git a/img/2025-10-25_01/target_disk_utilization_timeseries.png b/img/2025-10-25_01/target_disk_utilization_timeseries.png
new file mode 100644
index 0000000..6dde077
Binary files /dev/null and b/img/2025-10-25_01/target_disk_utilization_timeseries.png differ
diff --git a/img/2025-10-25_01/target_disk_write_throughput_timeseries.png b/img/2025-10-25_01/target_disk_write_throughput_timeseries.png
new file mode 100644
index 0000000..8d89552
Binary files /dev/null and b/img/2025-10-25_01/target_disk_write_throughput_timeseries.png differ
diff --git a/img/2025-10-25_01/target_network_rx_timeseries.png b/img/2025-10-25_01/target_network_rx_timeseries.png
new file mode 100644
index 0000000..83a1fb5
Binary files /dev/null and b/img/2025-10-25_01/target_network_rx_timeseries.png differ