diff --git a/_posts/2024-01-08-Streaming-data-from-PostgreSQL-to-a-CSV-file.md b/_posts/2024-01-08-Streaming-data-from-PostgreSQL-to-a-CSV-file.md index df5acc8..bf463e7 100644 --- a/_posts/2024-01-08-Streaming-data-from-PostgreSQL-to-a-CSV-file.md +++ b/_posts/2024-01-08-Streaming-data-from-PostgreSQL-to-a-CSV-file.md @@ -390,7 +390,7 @@ assert check_df.l_orderkey.is_monotonic_increasing ## FastBCP -While our focus here is on Python tools, we added FastBCP as a reference regarding CPU and memory usage. FastBCP has been developed in-house by Romain Ferraton at [Architecture & Performance](https://www.architecture-performance.fr/). It is a command line tool, written in C#, that is compatible with any operating system where dotnet is installed. We used dotnet on Linux in the present case. +While our focus here is on Python tools, we added FastBCP as a reference regarding CPU and memory usage. [FastBCP](https://www.arpe.io/fastbcp) has been developed at [Architecture & Performance](https://www.architecture-performance.fr/ap-logiciels/). It is a command line tool, written in C#, that is compatible with any operating system where dotnet is installed. We used dotnet on Linux in the present case. FastBCP employs parallel threads, reading data through multiple connections by partitioning SQL on the 'l_orderkey' column, using the "random" method. This approach results in distinct CSV files, later merged into a final output. It's worth mentioning that due to its parallel settings, the resulting data in the CSV file may not be sorted. This is why the ORDER BY clause is removed from the query in this particular case. Also, the returned elapsed time take the merging phase into account. diff --git a/_posts/2025-03-31-Arnold-tongues-with-Numba-Numba-CUDA-and-Datashader.md b/_posts/2025-03-31-Arnold-tongues-with-Numba-Numba-CUDA-and-Datashader.md index d8b73f1..ca9b6e4 100644 --- a/_posts/2025-03-31-Arnold-tongues-with-Numba-Numba-CUDA-and-Datashader.md +++ b/_posts/2025-03-31-Arnold-tongues-with-Numba-Numba-CUDA-and-Datashader.md @@ -611,4 +611,25 @@ plot_phase_space(df, size_x=2_000, size_y=1_000)
- +{% if page.comments %} + + + +{% endif %} \ No newline at end of file diff --git a/_posts/2025-07-12-Git-commit-temporal-analysis.md b/_posts/2025-07-12-Git-commit-temporal-analysis.md new file mode 100644 index 0000000..6afa2a2 --- /dev/null +++ b/_posts/2025-07-12-Git-commit-temporal-analysis.md @@ -0,0 +1,546 @@ +--- +title: A Git commit temporal analysis +layout: post +comments: true +author: François Pacull +tags: +- Python +- git +- numba +- pandas +- matplotlib +- radial +- timestamp +--- + + +In this Python notebook, we are going to analyze *git commit* timestamps across multiple repositories to identify temporal patterns in a git user coding activity (me, actually). + +**Outline** +- [Imports and package versions](#imports) +- [Repository Discovery and Data Extraction](#discovery) + - [Data Collection](#collection) + - [Data Preprocessing](#preprocessing) +- [Visualizations](#visualizations) + - [Weekly Distribution](#weekly) + - [Hourly Distribution (Linear)](#hourly_linear) + - [Hourly Distribution (Polar)](#hourly_polar) + - [Temporal Heatmap](#heatmap) + +## Imports and Package Versions + +`BASE_DIR` is the root folder containing all the git repositories. The `USER_FILTERS` list contains substrings to match against git author names for filtering commits from a specific user with various names (github, gitlab from various organizations). You can adapt these two variables with your own directory and git user names. + + +```python +import os +import subprocess +from collections import Counter +from datetime import datetime + +import matplotlib.pyplot as plt +import numpy as np +import pandas as pd +import tol_colors as tc + +BASE_DIR = "/home/francois/Workspace" +USER_FILTERS = ["pacull", "djfrancesco"] +``` + +We are using Python 3.13.3 on a Linux OS: + + pandas : 2.2.3 + numpy : 2.2.6 + matplotlib: 3.10.3 + tol_colors: 2.0.0 + + +## Repository discovery and data extraction + +Here we introduce functions to recursively scan directories for git repositories, extract commit metadata using `git log` in a Python `subprocess`, specifically commit timestamps and author names, and filter commits by author name using case-insensitive substring matching. + + +```python +def is_git_repo(path): + return os.path.isdir(os.path.join(path, ".git")) + + +def get_all_git_repos(base_dir): + git_repos = [] + for root, dirs, files in os.walk(base_dir): + if is_git_repo(root): + git_repos.append(root) + dirs.clear() + return git_repos + + +def get_commits(repo_path): + try: + result = subprocess.run( + ["git", "-C", repo_path, "log", "--pretty=format:%an|%aI"], + stdout=subprocess.PIPE, + stderr=subprocess.DEVNULL, + check=True, + text=True, + ) + lines = result.stdout.strip().split("\n") + filtered_lines = [] + for line in lines: + if line: + author = line.split("|")[0].lower() + if any(u.lower() in author for u in USER_FILTERS): + filtered_lines.append(line) + return filtered_lines + except subprocess.CalledProcessError: + return [] + + +def parse_commit_times(commit_lines): + hours = [] + weekdays = [] + for line in commit_lines: + author, iso_date = line.split("|") + dt = datetime.fromisoformat(iso_date) + hours.append(dt.hour) + weekdays.append(dt.strftime("%A")) + return hours, weekdays +``` + +### Data collection + +So let's use these previous functions to iterate through the repositories, extract commit timestamps and parse them into hour-of-day and weekday components. 
+ + +```python +all_hours = [] +all_weekdays = [] + +repos = get_all_git_repos(BASE_DIR) +for repo in repos: + commits = get_commits(repo) + hours, weekdays = parse_commit_times(commits) + all_hours.extend(hours) + all_weekdays.extend(weekdays) + +print(f"Total commits found: {len(all_hours)}") +``` + + Total commits found: 7605 + +### Data preprocessing + +Now we convert the extracted data and create *frequency* dataframes for each hour of the day or day of the week. + +```python +hour_counts = Counter(all_hours) +hour_df = pd.DataFrame( + { + "hour": list(range(24)), + "commit_count": [hour_counts.get(h, 0) for h in range(24)], + } +) +hour_df = hour_df.set_index("hour") +hour_df["distrib"] = hour_df["commit_count"] / hour_df["commit_count"].sum() + +days_order = [ + "Monday", + "Tuesday", + "Wednesday", + "Thursday", + "Friday", + "Saturday", + "Sunday", +] +weekday_counts = Counter(all_weekdays) +weekday_df = pd.DataFrame( + { + "weekday": days_order, + "commit_count": [weekday_counts.get(day, 0) for day in days_order], + } +) +weekday_df = weekday_df.set_index("weekday") +weekday_df["distrib"] = weekday_df["commit_count"] / weekday_df["commit_count"].sum() +``` + + +```python +hour_df.head(3) +``` + + + + +| + | commit_count | +distrib | +
+|---|---|---|
+| hour |  |  |
+| 0 | 12 | 0.001578 |
+| 1 | 3 | 0.000394 |
+| 2 | 0 | 0.000000 |
+
+|  | commit_count | distrib |
+|---|---|---|
+| weekday |  |  |
+| Monday | 1527 | 0.200789 |
+| Tuesday | 1264 | 0.166206 |
+| Wednesday | 1291 | 0.169757 |
+
+
+
+
+
+
+| hour | 0 | 1 | ... | 22 | 23 |
+|---|---|---|---|---|---|
+| weekday |  |  |  |  |  |
+| Monday | 0.000000 | 0.0 | ... | 0.499671 | 0.092045 |
+| Tuesday | 0.039448 | 0.0 | ... | 0.197239 | 0.078895 |
+| Wednesday | 0.078895 | 0.0 | ... | 0.262985 | 0.026298 |
+3 rows × 24 columns
+
+
+
+
+### Software Versions
+- **FastTransfer**: Version 0.13.12.0 (X64 architecture, .NET 8.0.20)
+- **Operating System**: Ubuntu 24.04.3 LTS
+- **Source Engine**: DuckDB v1.3.2 (for Parquet reading and streaming)
+- **Target Database**: PostgreSQL 16.10
+
+### Hardware Configuration
+- **Compute**: 32 vCores @ 2.3 GHz with 64 GB RAM
+- **Storage**: 400 GB local NVMe where PostgreSQL's data directory resides
+- **Network**: 4 Gbps bandwidth
+- **Location**: Gravelines (GRA11) datacenter
+
+The local NVMe delivers strong sequential write performance at 1465 MiB/s (measured with fio), providing ample disk bandwidth for our data loading workloads.
+
+This configuration represents a practical mid-range setup, not the smallest instance that would struggle with parallel workloads, nor an oversized machine that would mask performance characteristics.
+
+### The Data: TPC-H Orders Table
+
+We're using the TPC-H benchmark's orders table at scale factor 10, which gives us:
+- 16 Parquet files, evenly distributed at 29.2 MiB each
+- Total dataset size: 467.8 MiB
+- 15 million rows with mixed data types (integers, decimals, dates, and varchar)
+
+The data resides in an OVH S3-compatible object storage bucket in the Gravelines region, and each file contains roughly 937,500 rows. This distribution allows us to test parallel loading strategies effectively.
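+
+As a quick sanity check on this layout, the per-file row counts can be inspected directly from Python with the `duckdb` package. This is only a sketch: it assumes the `httpfs` extension is available and that S3 credentials for the bucket are already configured (for example through environment variables); the bucket path is the same one used in the FastTransfer command below.
+
+```python
+import duckdb
+
+con = duckdb.connect()  # in-memory database, mirroring the ":memory:" source
+con.execute("INSTALL httpfs")
+con.execute("LOAD httpfs")
+
+rows_per_file = con.sql("""
+    SELECT filename, count(*) AS row_count
+    FROM read_parquet('s3://arpeiofastbcp/tpch/sf10/orders/*.parquet', filename=true)
+    GROUP BY filename
+    ORDER BY filename
+""").df()
+
+print(rows_per_file)                   # ~937,500 rows per file
+print(rows_per_file.row_count.sum())   # 15,000,000 rows in total
+```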
+
+## FastTransfer in Action: The Command That Does the Heavy Lifting
+
+Here's the actual command we use to load data:
+
+```bash
+./FastTransfer \
+ --sourceconnectiontype "duckdbstream" \
+ --sourceserver ":memory:" \
+ --query "SELECT * exclude filename from read_parquet('s3://arpeiofastbcp/tpch/sf10/orders/*.parquet', filename=true) t" \
+ --targetconnectiontype "pgcopy" \
+ --targetserver "localhost:5432" \
+ --targetuser "fasttransfer" \
+ --targetpassword "********" \
+ --targetdatabase "tpch" \
+ --targetschema "tpch_10_test" \
+ --targettable "orders" \
+ --method "DataDriven" \
+ --distributeKeyColumn "filename" \
+ --datadrivenquery "select file from glob('s3://arpeiofastbcp/tpch/sf10/orders/*.parquet')" \
+ --loadmode "Truncate" \
+ --mapmethod "Name" \
+ --batchsize 10000 \
+ --degree 16
+```
+
+Let's break down the key components and understand what each parameter does:
+
+### Source Configuration
+- **`--sourceconnectiontype "duckdbstream"`**: Uses DuckDB's memory-efficient streaming connection
+- **`--sourceserver ":memory:"`**: Runs DuckDB in-memory mode for temporary data processing without persisting to disk
+- **`--query`**: The DuckDB SQL that leverages the `read_parquet()` function to directly access Parquet files from S3, with `filename=true` to capture file origins for distribution
+
+### Target Configuration
+- **`--targetconnectiontype "pgcopy"`**: Uses PostgreSQL's native COPY protocol, a fast method for bulk loading data into PostgreSQL
+- **`--targetserver "localhost:5432"`**: Standard PostgreSQL connection details
+- **`--targetuser` and `--targetpassword`**: Database authentication credentials
+
+### Parallelization Strategy
+- **`--method "DataDriven"`**: Distributes work based on distinct values in a specified column, in our case each worker processes specific files
+- **`--distributeKeyColumn "filename"`**: Uses the filename column to assign work to workers, ensuring each file is processed by exactly one worker
+- **`--datadrivenquery`**: Overrides the default distinct value selection with an explicit file list using `glob()`, giving us precise control over work distribution
+- **`--degree 16`**: Creates 16 parallel workers. FastTransfer supports 1-1024 workers, or negative values for CPU-adaptive scaling (e.g., `-2` uses half available CPUs)
+
+### Loading Configuration
+- **`--loadmode "Truncate"`**: Clears the target table before loading, ensuring a clean slate (alternative is `"Append"` for adding to existing data)
+- **`--mapmethod "Name"`**: Maps source to target columns by name rather than position, providing flexibility when column orders differ
+- **`--batchsize 10000`**: Processes 10,000 rows per bulk copy operation (default is 1,048,576). Smaller batches can reduce memory usage but may impact throughput
+
+### About FastTransfer
+
+FastTransfer is designed specifically for efficient data movement between different database systems, particularly excelling with large datasets (>1 million cells). The tool requires the target table to pre-exist and supports various database types including ClickHouse, MySQL, Oracle, PostgreSQL, and SQL Server. Its strength lies in intelligent work distribution, whether using file-based distribution like our DataDriven approach, or other methods like CTID (PostgreSQL-specific), RangeId (numeric ranges), or Random (modulo-based distribution).
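+
+To make the work-distribution idea more concrete, here is a small, purely illustrative Python sketch of what a RangeId-style split over a numeric key could look like. This is not FastTransfer's internal code, and the key bounds and query text are made up for the example.
+
+```python
+def range_partitions(min_id: int, max_id: int, workers: int):
+    """Split the closed interval [min_id, max_id] into `workers` contiguous chunks."""
+    span = max_id - min_id + 1
+    bounds = [min_id + (span * i) // workers for i in range(workers)] + [max_id + 1]
+    return [(bounds[i], bounds[i + 1] - 1) for i in range(workers)]
+
+
+# Hypothetical bounds: each worker would get its own WHERE clause on the key.
+for lo, hi in range_partitions(1, 15_000_000, 4):
+    print(f"SELECT ... FROM orders WHERE o_orderkey BETWEEN {lo} AND {hi}")
+```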
+
+## Performance Analysis: Where Theory Meets Reality
+
+We tested four different table configurations to understand how PostgreSQL constraints and logging independently affect loading performance. Each test was run multiple times, reporting the best result to minimize noise from network variability or system background tasks.
+
+### Configuration 1: WITH PK / LOGGED
+
+Standard production table with primary key on `o_orderkey` and full WAL durability:
+
+| Degree of Parallelism | Load Time (seconds) | Speedup |
+|----------------------|---------------------|---------|
+| 1 | 50.5 | 1.0x |
+| 2 | 28.8 | 1.8x |
+| 4 | 17.8 | 2.8x |
+| 8 | 16.1 | 3.1x |
+| 16 | 19.2 | 2.6x |
+
+Peaks at 8 workers (3.1x speedup). Constraint checking and WAL logging create severe contention.
+
+### Configuration 2: WITH PK / UNLOGGED
+
+Primary key with WAL logging disabled:
+
+| Degree of Parallelism | Load Time (seconds) | Speedup |
+|----------------------|---------------------|---------|
+| 1 | 46.3 | 1.0x |
+| 2 | 25.5 | 1.8x |
+| 4 | 14.5 | 3.2x |
+| 8 | 9.3 | 5.0x |
+| 16 | 7.8 | 5.9x |
+
+Removing WAL overhead significantly improves scaling, which continues up to 16 workers thanks to reduced contention.
+
+### Configuration 3: WITHOUT PK / LOGGED
+
+No constraints, WAL logging enabled:
+
+| Degree of Parallelism | Load Time (seconds) | Speedup |
+|----------------------|---------------------|---------|
+| 1 | 45.3 | 1.0x |
+| 2 | 24.2 | 1.9x |
+| 4 | 13.2 | 3.4x |
+| 8 | 8.7 | 5.2x |
+| 16 | 8.7 | 5.2x |
+
+Better than WITH PK/LOGGED but plateaus at 8 workers due to WAL contention.
+
+### Configuration 4: WITHOUT PK / UNLOGGED
+
+Maximum performance configuration - no constraints, no WAL:
+
+| Degree of Parallelism | Load Time (seconds) | Speedup |
+|----------------------|---------------------|---------|
+| 1 | 44.5 | 1.0x |
+| 2 | 25.4 | 1.8x |
+| 4 | 13.4 | 3.3x |
+| 8 | 7.8 | 5.7x |
+| 16 | 5.1 | 8.7x |
+
+Best scaling - achieves 8.7x speedup at 16 workers, at which point limits outside PostgreSQL (discussed below) come into play.
+
+## Visual Performance Comparison
+
+
+
+The comparison reveals how primary keys and WAL logging independently bottleneck performance. WITHOUT PK/UNLOGGED achieves the best scaling (8.7x at 16 workers), while WITH PK/LOGGED caps at 3.1x. The intermediate configurations show each factor's impact: removing the primary key or disabling WAL each provide significant improvements, with their combination delivering maximum performance.
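+
+To reproduce this kind of comparison from the tables above, a short matplotlib sketch could look as follows (the load times are simply copied from the four result tables; nothing here is measured by the script itself):
+
+```python
+import matplotlib.pyplot as plt
+
+degrees = [1, 2, 4, 8, 16]
+load_times = {  # seconds, copied from the result tables above
+    "WITH PK / LOGGED": [50.5, 28.8, 17.8, 16.1, 19.2],
+    "WITH PK / UNLOGGED": [46.3, 25.5, 14.5, 9.3, 7.8],
+    "WITHOUT PK / LOGGED": [45.3, 24.2, 13.2, 8.7, 8.7],
+    "WITHOUT PK / UNLOGGED": [44.5, 25.4, 13.4, 7.8, 5.1],
+}
+
+fig, ax = plt.subplots(figsize=(8, 5))
+for label, times in load_times.items():
+    speedup = [times[0] / t for t in times]
+    ax.plot(degrees, speedup, marker="o", label=label)
+ax.set_xscale("log", base=2)
+ax.set_xticks(degrees, labels=[str(d) for d in degrees])
+ax.set_xlabel("Degree of parallelism")
+ax.set_ylabel("Speedup vs degree 1")
+ax.legend()
+plt.show()
+```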
+
+## Network and I/O Considerations
+
+Different configurations reveal different bottlenecks:
+
+- **WITH PK / LOGGED**: Constraint checking + WAL overhead limits to 3.1x
+- **WITH PK / UNLOGGED**: WAL removal allows 5.9x scaling
+- **WITHOUT PK / LOGGED**: WAL contention plateaus at 5.2x
+- **WITHOUT PK / UNLOGGED**: Best scaling at 8.7x (467.8 MiB in 5.1s ≈ 92 MB/s)
+
+At 92 MB/s, with a 4 Gbps network (~500 MB/s) and 1465 MiB/s of local NVMe write capacity, neither the network nor disk I/O is the bottleneck. The limitation could come from several sources: S3 object storage throughput, DuckDB Parquet parsing overhead, or PostgreSQL's internal coordination when multiple workers write concurrently to the same table.
+
+## Conclusion
+
+FastTransfer achieves 5.1-second load times for 467.8 MiB of Parquet data from OVH S3 to PostgreSQL, reaching 92 MB/s throughput with the WITHOUT PK/UNLOGGED configuration at degree 16. Testing four configurations reveals that primary keys and WAL logging each independently constrain performance, with optimal settings varying from degree 8 (LOGGED) to degree 16+ (UNLOGGED). The results demonstrate that cloud-based data pipelines can achieve strong performance when configuration matches use case requirements.
+
+---
+
+## About FastTransfer
+
+FastTransfer is a commercial high-performance data migration tool developed by [arpe.io](https://arpe.io). It provides parallel data transfer capabilities across multiple database platforms including PostgreSQL, MySQL, Oracle, SQL Server, ClickHouse, and DuckDB.
+
+**Key Features:**
+- Advanced parallelization strategies for optimal performance
+- Cross-platform compatibility with major databases
+- Flexible configuration for various data migration scenarios
+- Production-ready with comprehensive logging and monitoring
+
+For licensing information, support options, and to request a trial, visit the [official documentation](https://aetperf.github.io/FastTransfer-Documentation/).
diff --git a/_posts/2025-09-29-FastTransfer-Performance-with-Citus-Columnar-Storage-in-PostgreSQL.md b/_posts/2025-09-29-FastTransfer-Performance-with-Citus-Columnar-Storage-in-PostgreSQL.md
new file mode 100644
index 0000000..5ae4af7
--- /dev/null
+++ b/_posts/2025-09-29-FastTransfer-Performance-with-Citus-Columnar-Storage-in-PostgreSQL.md
@@ -0,0 +1,499 @@
+---
+title: FastTransfer Performance with Citus Columnar Storage in PostgreSQL
+layout: post
+comments: true
+author: François Pacull
+categories: [database, performance]
+tags:
+- FastTransfer
+- Citus PostgreSQL
+- Columnar storage
+- Database migration
+- PostgreSQL Docker
+- Performance benchmarks
+---
+
+## Introduction
+
+Data migration between database systems often becomes a bottleneck in modern data pipelines, particularly when dealing with analytical workloads. [FastTransfer](https://aetperf.github.io/FastTransfer-Documentation/) is a commercial data migration tool designed to address these challenges through advanced parallelization strategies. This post demonstrates FastTransfer's performance when working with PostgreSQL databases enhanced with the [Citus extension](https://docs.citusdata.com/en/v13.0/) for columnar storage.
+
+## Understanding FastTransfer
+
+FastTransfer is a command-line tool designed to address common data migration challenges. In our testing, we've found it particularly effective for scenarios where traditional migration approaches fall short.
+
+### Core Capabilities
+
+The tool offers several features that we've found valuable in production environments:
+
+- **Cross-platform compatibility**: Works with PostgreSQL, MySQL, Oracle, SQL Server, ClickHouse, DuckDB, and other major databases
+- **Advanced parallelization**: Multiple strategies for parallel data extraction and loading, allowing you to optimize for your specific use case
+- **Flexible configuration**: Fine-grained control over batch sizes, mapping methods, and load modes to tune performance
+- **Production-ready features**: Comprehensive logging, error handling, and monitoring help ensure reliable migrations
+
+### Parallelization Strategies
+
+One aspect we particularly appreciate about FastTransfer is its range of parallelization methods, accessible through the `-M, --method` option.
+
+### Citus Columnar → PostgreSQL Transfer Performance
+
+
+## Key Takeaways
+
+### Performance Summary
+
+From our benchmarks with 15 million rows:
+
+| Scenario | Best Method | Time | Speedup | Key Insight |
+|----------|------------|------|---------|-------------|
+| PostgreSQL → Citus | Ctid (8 threads) | 3.3s | 3.74x | Direct row access provides best performance |
+| Citus → PostgreSQL | RangeId UNLOGGED (8 threads) | 3.9s | 2.52x | UNLOGGED tables dramatically improve write speed |
+| Cross-compatible | RangeId (4 threads) | 5.3s | 2.29x | Good balance of performance and portability |
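+
+One way to quantify the diminishing returns discussed below is to divide each speedup by its thread count; a minimal sketch using the numbers from this summary table:
+
+```python
+results = {  # (threads, speedup) taken from the summary table above
+    "PostgreSQL → Citus, Ctid": (8, 3.74),
+    "Citus → PostgreSQL, RangeId UNLOGGED": (8, 2.52),
+    "Cross-compatible, RangeId": (4, 2.29),
+}
+
+for scenario, (threads, speedup) in results.items():
+    efficiency = speedup / threads
+    print(f"{scenario}: {efficiency:.0%} parallel efficiency")
+```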
+
+### Important Considerations
+
+1. **Storage vs Speed Trade-off**: Columnar storage reduces disk usage by 76% but adds ~20% write overhead
+2. **Diminishing Returns**: Parallelization beyond 4 threads often shows limited benefit
+3. **Method Limitations**: Not all methods work with all storage types (e.g., Ctid incompatible with columnar)
+4. **Asymmetric Performance**: Reading from columnar is faster than writing to it
+
+## Analysis and Insights
+
+After running these benchmarks, several patterns became clear that might help inform your migration strategy.
+
+### Why Ctid Typically Outperforms Other Methods
+
+In our testing, the Ctid method consistently delivered the best performance for PostgreSQL sources. This makes sense when you consider that ctid provides direct access to physical row locations, eliminating the need for sorting or complex query planning that other methods require.
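+
+For readers curious about what ctid-based partitioning looks like at the SQL level, here is a rough, illustrative sketch that splits a table into heap-page ranges and reads each range with a TID range scan (PostgreSQL 14+). This is not FastTransfer's actual implementation; the `psycopg` driver, the connection string, and the table name are all assumptions made for the example.
+
+```python
+import psycopg  # assumed PostgreSQL driver
+
+DSN = "host=localhost dbname=tpch"  # illustrative
+TABLE = "public.orders"             # illustrative
+
+with psycopg.connect(DSN) as conn:
+    # relpages is an estimate maintained by VACUUM / ANALYZE
+    n_pages = conn.execute(
+        f"SELECT relpages FROM pg_class WHERE oid = '{TABLE}'::regclass"
+    ).fetchone()[0]
+
+    workers = 8
+    step = n_pages // workers + 1
+    for i in range(workers):
+        lo, hi = i * step, min((i + 1) * step, n_pages + 1)
+        # Each worker would run a query of this shape (TID range scan):
+        print(
+            f"SELECT * FROM {TABLE} "
+            f"WHERE ctid >= '({lo},0)'::tid AND ctid < '({hi},0)'::tid"
+        )
+```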
+
+### Scalability Patterns
+
+One interesting finding from our tests relates to how parallelization efficiency changes with thread count:
+
+#### The Law of Diminishing Returns
+
+As we increased parallelism, we observed declining efficiency across all methods:
+- **Sweet Spot**: In most cases, 4 threads offered the best balance between performance and resource utilization
+- **Efficiency Cliff**: At 8 threads, efficiency often dropped below 50%, suggesting that the overhead of coordination begins to outweigh the benefits
+
+### Understanding Columnar Storage Impact
+
+Our benchmarks revealed several important considerations when working with columnar storage:
+
+#### Write Performance Trade-offs
+
+We observed that writing to columnar storage introduces approximately 19% overhead compared to standard tables (12,092 ms vs 10,141 ms). This overhead comes from several sources:
+- Compression processing (LZ4 in our configuration)
+- Data reorganization into columnar format (stripes and chunks)
+- Additional metadata management
+
+However, it's important to remember that this overhead delivers significant storage savings, in our case, a 76% reduction in disk usage.
+
+#### Read Performance Benefits
+
+Conversely, reading from columnar storage proved notably efficient:
+- Transfers from Citus to PostgreSQL completed 18% faster than the reverse direction
+- Compressed data requires less I/O bandwidth
+- Sequential reading patterns align well with columnar storage organization
+
+#### Asymmetric Performance Characteristics
+
+One surprising finding was that Citus → PostgreSQL transfers consistently outperformed PostgreSQL → Citus transfers. This asymmetry makes sense when you consider that:
+- Reading benefits from compression outweigh writing penalties
+- Standard PostgreSQL tables have highly optimized write paths
+- The combination results in better overall performance when columnar is the source
+
+#### Method Compatibility Considerations
+
+It's worth noting that not all parallelization methods work with columnar storage. The Ctid method, while excellent for standard PostgreSQL tables, isn't compatible with columnar architecture due to the different way data is organized and accessed.
+
+## Conclusion
+
+FastTransfer effectively handles migrations involving Citus columnar storage, achieving up to 76% storage savings while maintaining high transfer speeds. The choice of parallelization method significantly impacts performance, with Ntile delivering the best balance for columnar targets. These results demonstrate that columnar storage and efficient data migration are not mutually exclusive when using the right tools.
+
+---
+
+## About FastTransfer
+
+FastTransfer is a commercial high-performance data migration tool developed by [arpe.io](https://arpe.io). It provides parallel data transfer capabilities across multiple database platforms including PostgreSQL, MySQL, Oracle, SQL Server, ClickHouse, and DuckDB.
+
+**Key Features:**
+- Advanced parallelization strategies for optimal performance
+- Cross-platform compatibility with major databases
+- Flexible configuration for various data migration scenarios
+- Production-ready with comprehensive logging and monitoring
+
+For licensing information, support options, and to request a trial, visit the [official documentation](https://aetperf.github.io/FastTransfer-Documentation/).
diff --git a/_posts/2025-09-29-High-Speed-PostgreSQL-Replication-on-OVH-with-FastTransfer.md b/_posts/2025-09-29-High-Speed-PostgreSQL-Replication-on-OVH-with-FastTransfer.md
new file mode 100644
index 0000000..8c24bd3
--- /dev/null
+++ b/_posts/2025-09-29-High-Speed-PostgreSQL-Replication-on-OVH-with-FastTransfer.md
@@ -0,0 +1,140 @@
+---
+title: High-Speed PostgreSQL Replication on OVH with FastTransfer
+layout: post
+comments: true
+author: François Pacull
+categories: [database, performance]
+tags:
+- FastTransfer
+- PostgreSQL replication
+- OVH
+- High-performance
+- Database migration speed
+- TPC-H benchmark
+- 20 Gbps network
+- c3-256
+- PostgreSQL parallel transfer
+---
+
+
+## Introduction
+
+PostgreSQL-to-PostgreSQL replication at scale requires tools that can fully leverage modern cloud infrastructure and network capabilities. [FastTransfer](https://aetperf.github.io/FastTransfer-Documentation/) is a commercial data migration tool designed to maximize throughput through advanced parallelization. This post demonstrates FastTransfer's performance transferring 113GB of TPC-H data between OVH c3-256 instances over a 20 Gbit/s network.
+
+## Infrastructure Setup
+
+For our testing environment, we deployed PostgreSQL on two OVH c3-256 instances in the Paris datacenter. Here's what we're working with:
+
+- **OVH Instances**: c3-256 (256GB RAM, 128 vCores @2.3GHz, 400GB NVMe)
+- **Network**: 20 Gbit/s vrack, Paris datacenter (eu-west-par-c)
+- **OS**: Ubuntu 24
+- **PostgreSQL**: Version 16
+- **Dataset**: TPC-H SF100 lineitem table (~600M rows, ~113GB)
+
+
+
+## PostgreSQL Configuration
+
+Both PostgreSQL instances are tuned for bulk operations: 80GB `shared_buffers`, 128 parallel workers, and minimal WAL logging. Target tables are UNLOGGED with no primary keys.
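+
+The target tables here were created UNLOGGED from the start. If you are loading into an existing regular table, one way to get the same effect for the duration of the load is to toggle the table, as in the hedged sketch below (the `psycopg` driver, connection string, and schema are assumptions; note that `ALTER TABLE ... SET LOGGED` rewrites the table and generates WAL for all of it).
+
+```python
+import psycopg  # assumed driver; the same statements can be run from psql
+
+DSN = "host=10.10.0.50 dbname=tpch user=fasttransfer"  # illustrative
+
+with psycopg.connect(DSN, autocommit=True) as conn:
+    # Skip WAL during the bulk load
+    conn.execute("ALTER TABLE tpch_100.lineitem SET UNLOGGED")
+
+    # ... run FastTransfer here ...
+
+    # Restore durability afterwards (full table rewrite + WAL)
+    conn.execute("ALTER TABLE tpch_100.lineitem SET LOGGED")
+```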
+
+## Target Database Disk Performance
+
+The target PostgreSQL instance uses the native 400GB NVMe instance disk (not block storage) for database storage. This provides excellent I/O performance crucial for high-speed data ingestion:
+
+### FIO Benchmark Command
+```bash
+fio --name=seqwrite --filename=/tmp/fio-test --rw=write \
+ --bs=1M --size=8G --direct=1 --numjobs=1 --runtime=30 --group_reporting
+```
+
+### Results
+```
+Sequential Write Performance (8GB test, 1MB blocks):
+- Throughput: 1,260 MB/s (1.26 GB/s)
+- IOPS: 1,259
+- Average latency: 787 microseconds
+- 95th percentile: 1.5ms
+- 99th percentile: 2.3ms
+```
+
+The native NVMe storage delivers consistent low-latency writes with over 1.2 GB/s throughput, ensuring disk I/O is not a bottleneck for the PostgreSQL COPY operations even at peak network transfer rates.
+
+## Network Performance
+
+The private network connection between source and target instances was tested using iperf3 to verify bandwidth capacity:
+
+### iperf3 Benchmark Command
+```bash
+# On target instance
+iperf3 -s
+
+# On source instance
+iperf3 -c 10.10.0.50 -P 64 -t 30
+```
+
+### Results
+```
+Network Throughput Test (64 parallel streams, 30 seconds):
+- Average throughput: 20.5 Gbit/s
+- Total data transferred: 71.7 GB
+- Consistent performance across all streams
+```
+
+The network delivers full line-rate performance, slightly exceeding the nominal 20 Gbit/s specification. With 64 parallel TCP streams, the network provides ample bandwidth for FastTransfer's parallel data transfer operations.
+
+## FastTransfer Command
+
+FastTransfer version: 0.13.12
+
+```bash
+./FastTransfer \
+ --sourceconnectiontype "pgcopy" \
+ --sourceconnectstring "Host=localhost;Port=5432;Database=tpch;Trust Server Certificate=True;Application Name=FastTransfer;Maximum Pool Size=150;Timeout=15;Command Timeout=10800;Username=fasttransfer;Password=******" \
+ --sourceschema "tpch_100" --sourcetable "lineitem" \
+ --targetconnectiontype "pgcopy" \
+ --targetconnectstring "Host=10.10.0.50;Port=5432;Database=tpch;Trust Server Certificate=True;Application Name=FastTransfer;Maximum Pool Size=150;Timeout=15;Command Timeout=10800;Username=fasttransfer;Password=******" \
+ --targetschema "tpch_100" --targettable "lineitem" \
+ --loadmode "Truncate" --method "Ctid" --degree 128
+```
+
+Note the `Maximum Pool Size=150` setting in the connection strings, increased from the default 100 to support 128 parallel threads.
+
+## Performance Results
+
+### Transfer Time
+
+
+
+Transfer time: 749s (single thread) → 70s (128 threads)
+
+### Throughput Scaling
+
+
+
+Throughput: 145 MB/s → 1,880 MB/s (75% of 20 Gbit/s link capacity)
+
+
+## Results Summary
+
+- **113GB transferred in 70 seconds** (degree=128)
+- **1.88 GB/s peak throughput** achieved
+- **10.7x speedup** with 128 parallel connections
+- **Optimal range**: 32-64 threads for best efficiency/performance balance
+
+## Conclusion
+
+FastTransfer achieves 1.88 GB/s throughput when transferring 113GB of data between PostgreSQL instances, utilizing 75% of the available 20 Gbit/s network capacity. The 10.7x speedup with 128 parallel connections demonstrates excellent scalability on OVH's high-end infrastructure. These results confirm that FastTransfer can effectively saturate modern cloud networking for PostgreSQL-to-PostgreSQL migrations.
+
+---
+
+## About FastTransfer
+
+FastTransfer is a commercial high-performance data migration tool developed by [arpe.io](https://arpe.io). It provides parallel data transfer capabilities across multiple database platforms including PostgreSQL, MySQL, Oracle, SQL Server, ClickHouse, and DuckDB.
+
+**Key Features:**
+- Advanced parallelization strategies for optimal performance
+- Cross-platform compatibility with major databases
+- Flexible configuration for various data migration scenarios
+- Production-ready with comprehensive logging and monitoring
+
+For licensing information, support options, and to request a trial, visit the [official documentation](https://aetperf.github.io/FastTransfer-Documentation/).
diff --git a/_posts/2025-10-25-Performance-Analysis-of-Parallel-Data-Replication-Between-Two-PostgreSQL-18-Instances-on-OVH.md b/_posts/2025-10-25-Performance-Analysis-of-Parallel-Data-Replication-Between-Two-PostgreSQL-18-Instances-on-OVH.md
new file mode 100644
index 0000000..8d43b3c
--- /dev/null
+++ b/_posts/2025-10-25-Performance-Analysis-of-Parallel-Data-Replication-Between-Two-PostgreSQL-18-Instances-on-OVH.md
@@ -0,0 +1,415 @@
+---
+title: Performance Analysis of Parallel Data Replication Between Two PostgreSQL 18 Instances on OVH
+layout: post
+comments: true
+author: François Pacull
+categories: [database, performance]
+tags:
+- FastTransfer
+- PostgreSQL 18
+- Performance analysis
+- PostgreSQL replication
+- OVH
+- High-performance
+- Database migration speed
+- TPC-H benchmark
+- 20 Gbps network
+- c3-256
+- PostgreSQL parallel transfer
+---
+
+
+
+
+## Introduction
+
+Parallel data replication between PostgreSQL instances presents unique challenges at scale, particularly when attempting to maximize throughput on high-performance cloud infrastructure. [FastTransfer](https://aetperf.github.io/FastTransfer-Documentation/) is a commercial data migration tool designed to leverage advanced parallelization strategies for efficient data movement. This post provides a performance analysis of FastTransfer transferring 77GB of data between two PostgreSQL 18 instances on OVH c3-256 servers, examining CPU, disk I/O, and network bottlenecks across parallelism degrees from 1 to 128.
+
+### Test Configuration
+
+The test dataset consists of the TPC-H SF100 lineitem table (~600M rows, ~77GB), configured as an UNLOGGED table without indexes, constraints, or triggers. Both instances were tuned for bulk loading operations, with all durability features disabled, large memory allocations, and PostgreSQL 18's `io_uring` support enabled (configuration details in Appendix A). Despite this comprehensive optimization, it appears that lock contention emerges at high parallelism degrees, limiting scalability.
+
+Testing was performed at eight parallelism degrees, executed sequentially in a progressive loading pattern: 1, 2, 4, 8, 16, 32, 64, and 128, with each step doubling to systematically increase load. Each configuration was run only once rather than following standard statistical practice of multiple runs with mean, standard deviation, and confidence intervals. This single-run approach was adopted after preliminary tests showed minimal variation between successive runs, indicating stable and reproducible results under these controlled conditions.
+
+### OVH Infrastructure Setup
+
+The test environment consists of two identical OVH cloud instances designed for heavy workloads:
+
+
+
+**Figure 1: OVH Infrastructure Architecture** - The test setup consists of two identical c3-256 instances (128 vCores, 256GB RAM, 400GB NVMe) running PostgreSQL 18 on Ubuntu 24.04. The source instance contains the TPC-H SF100 lineitem table. FastTransfer orchestrates parallel data replication across a 20 Gbit/s vrack private network connection to the target instance. Both instances are located in the Paris datacenter (eu-west-par-c) for minimal network latency.
+
+**Hardware Configuration:**
+
+- **Instance Type**: OVH c3-256
+- **Memory**: 256GB RAM
+- **CPU**: 128 vCores @ 2.3 GHz
+- **Storage**:
+ - **Target**: 400GB local NVMe SSD
+ - **Source**: OVH Block Storage (high-speed-gen2 with ~2TB, Bandwidth : 1 GB/s, Performance : 20,000 IOPS)
+- **Network**: 20 Gbit/s vrack (2.5 GB/s)
+
+The source instance PostgreSQL data directory resides on attached OVH Block Storage rather than local NVMe. This asymmetric storage configuration does not affect the analysis conclusions, as the source PostgreSQL instance exhibits backpressure behavior rather than storage-limited performance.
+
+**Software Stack:**
+
+- **OS**: Ubuntu 24.04.3 LTS with Linux kernel 6.8
+- **PostgreSQL**: Version 18.0, with `io_uring`, huge pages (`vm.nr_hugepages=45000`)
+- **FastTransfer**: Version 0.13.12
+
+**Infrastructure Performance Baseline:**
+
+- **Network**: 20.5 Gbit/s (2.56 GB/s) verified with iperf3
+- **Target Disk Sequential Write**: 3,741 MB/s (FIO benchmark with 128K blocks)
+- **Target Disk Random Write**: 88.2 MB/s, 22,600 IOPS (FIO, 4K blocks)
+
+### Overall Performance
+
+FastTransfer achieves strong absolute performance, transferring 77GB in just 67 seconds at degree 128, equivalent to 1.15 GB/s sustained throughput. The parallel replication process scales continuously across all tested degrees, with total elapsed time decreasing from 878 seconds (degree 1) to 67 seconds (degree 128). The system delivers consistent real-world performance improvements even at large parallelism levels, though lock contention on the target PostgreSQL instance appears to increasingly limit scaling efficiency beyond degree 32.
+
+
+
+**Figure 2: Total Elapsed Time by Degree of Parallelism** - Wall-clock time improves continuously across all tested degrees, from 878 seconds (degree 1) to 67 seconds (degree 128). Performance gains remain positive throughout, though the rate of improvement diminishes beyond degree 32 due to increasing lock contention.
+
+## 1. CPU Usage Analysis
+
+### 1.1 Mean and Peak CPU Usage
+
+
+
+
+**Figure 3: Mean CPU Usage by Component** - Target PostgreSQL (red) dominates resource consumption at high parallelism, while source PostgreSQL (blue) reaches around 12 cores.
+
+
+
+
+**Figure 4: Peak CPU Usage by Component** - Target PostgreSQL exhibits high peak values (~6,969% at degree 128). The large spikes combined with relatively lower mean values indicate high variance, characteristic of processes alternating between lock contention and productive work.
+
+**Component Scaling Summary:**
+
+| Component | Degree 1 | Degree 128 | Speedup | Efficiency |
+| ----------------- | ---------------- | ------------------- | ------- | ---------- |
+| Source PostgreSQL | 93% | 1,175% | 11.9x | 9.3% |
+| FastTransfer | 31% | 631% | 20.1x | 15.7% |
+| Target PostgreSQL | 98% | 3,294% | 33.6x | 26.3% |
+
+Source PostgreSQL's poor scaling appears to stem from backpressure: FastTransfer's batch-and-wait protocol means source processes send a batch, then block waiting for target acknowledgment. When the target cannot consume data quickly due to lock contention, this delay propagates backward. At degree 128, the source processes collectively use only 11.7 cores (0.11 cores/process), suggesting they're waiting rather than actively working.
+
+Note also that FastTransfer uses PostgreSQL's Ctid pseudo-column for table partitioning, which doesn't allow a perfect distribution: some partitions are smaller than others, causing some processes to complete and exit before the others.
+
+### 1.2 FastTransfer
+
+
+
+**Figure 5: FastTransfer User vs System CPU** - At degree 128, FastTransfer uses 419% user CPU (66%) and 212% system CPU (34%).
+
+In the present case, FastTransfer uses PostgreSQL's binary COPY protocol for both source and target (`--sourceconnectiontype "pgcopy"` and `--targetconnectiontype "pgcopy"`). Data flows directly from the source's COPY TO BINARY through FastTransfer to the target's COPY FROM BINARY without data transformation. FastTransfer acts as an intelligent network proxy coordinating parallel streams and batch acknowledgments, which explains its relatively low CPU usage. This would be less the case if we were transferring data between distinct RDBMS types.
+
+## 2. The Lock Contention Problem: System CPU Analysis
+
+### 2.1 System CPU
+
+
+
+**Figure 6: System CPU as % of Total CPU** - Target PostgreSQL (red line) crosses the 50% warning threshold at degree 16, exceeds 70% at degree 32, and peaks at 83.9% at degree 64. At this maximum, only 16.2% of CPU time performs productive work while 83.9% appears spent on lock contention and kernel overhead.
+
+CPU time divides into two categories: User CPU (application code performing actual data insertion) and System CPU (kernel operations handling locks, synchronization, context switches, I/O). A healthy system maintains system CPU below 30%.
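+
+For reference, this user/system split can be sampled from the OS without touching PostgreSQL, for example with a small `psutil` helper along these lines (a sketch only: the process-name match is simplistic and the counters are cumulative since process start, hence the two snapshots):
+
+```python
+import time
+
+import psutil
+
+
+def postgres_cpu_split(interval: float = 5.0):
+    """Return (user_s, system_s, system_pct) accumulated by postgres processes."""
+
+    def snapshot():
+        user = system = 0.0
+        for p in psutil.process_iter(["name", "cpu_times"]):
+            t = p.info["cpu_times"]
+            if p.info["name"] == "postgres" and t is not None:
+                user += t.user
+                system += t.system
+        return user, system
+
+    u0, s0 = snapshot()
+    time.sleep(interval)
+    u1, s1 = snapshot()
+    du, ds = u1 - u0, s1 - s0
+    total = du + ds
+    return du, ds, (100.0 * ds / total if total else 0.0)
+
+
+print(postgres_cpu_split())
+```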
+
+**System CPU Progression:**
+
+| Degree | Total CPU | User CPU | System CPU | System % | Productive Work |
+| ------ | --------- | -------- | ---------- | -------- | ------------------------- |
+| 1 | 98% | 80% | 18% | 18.2% | Healthy baseline |
+| 16 | 1,342% | 496% | 846% | 63.0% | Warning threshold crossed |
+| 32 | 2,436% | 602% | 1,834% | 75.3% | High contention |
+| 64 | 4,596% | 743% | 3,854% | 83.9% | **Maximum contention** |
+| 128 | 4,230% | 1,248% | 2,982% | 70.5% | Reduced contention |
+
+At degree 64, processes appear to spend 83.9% of time managing locks rather than inserting data. By degree 128, system CPU percentage unexpectedly decreases to 70.5% for unclear reasons, though absolute performance continues to improve.
+
+### 2.2 Possible Causes of Lock Contention
+
+The target table was already optimized for bulk loading (UNLOGGED, no indexes, no constraints, no triggers), eliminating all standard overhead sources. The remaining contention therefore likely stems from PostgreSQL's fundamental architecture (a way to observe this directly is sketched after the list):
+
+1. **Shared Buffer Pool Locks**: All 128 parallel connections compete for buffer pool partition locks to read/modify/write pages.
+
+2. **Relation Extension Locks**: When the table grows, PostgreSQL requires an exclusive lock (only one process at a time).
+
+3. **Free Space Map (FSM) Locks**: All 128 writers query and update the FSM to find pages with free space, creating constant FSM thrashing.
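+
+A direct way to check which of these locks dominate is to sample wait events on the target instance while the load is running, for example with a query against `pg_stat_activity` (sketch only; the `psycopg` driver and connection string are assumptions, and this is exactly the kind of instrumentation planned for the follow-up study):
+
+```python
+import psycopg  # assumed driver
+
+QUERY = """
+SELECT wait_event_type, wait_event, count(*)
+FROM pg_stat_activity
+WHERE state = 'active'
+  AND backend_type = 'client backend'
+  AND pid <> pg_backend_pid()
+GROUP BY 1, 2
+ORDER BY 3 DESC;
+"""
+
+with psycopg.connect("host=10.10.0.50 dbname=tpch") as conn:  # illustrative
+    for row in conn.execute(QUERY):
+        print(row)  # e.g. ('LWLock', 'BufferMapping', 42)
+```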
+
+## 3. Distribution and Time Series Analysis
+
+### 3.1 CPU Distribution
+
+
+
+**Figure 7: CPU Distribution at Degree 4** - Tight, healthy distributions with small standard deviations. All components operate consistently without significant contention.
+
+
+
+**Figure 8: CPU Distribution at Degree 32** - Target PostgreSQL (red) becomes bimodal with wide spread (1000-3000% range), indicating some samples capture waiting processes while others capture active processes. Source (blue) remains relatively tight.
+
+
+
+**Figure 9: CPU Distribution at Degree 128** - Target PostgreSQL (red) spans nearly 0-10000%, indicating highly variable behavior. Some processes are nearly starved (near 0%) while others burn high CPU on lock spinning (>8000%). This wide distribution suggests lock thrashing.
+
+### 3.2 CPU Time Series
+
+
+
+**Figure 10: CPU Over Time at Degree 4** - All components show stable, smooth CPU usage with minimal oscillations throughout the test duration.
+
+
+
+**Figure 11: CPU Over Time at Degree 32** - Target PostgreSQL (red) shows increasing variability and oscillations, indicating periods of successful lock acquisition alternating with blocking periods.
+
+
+
+**Figure 12: CPU Over Time at Degree 128** - Target PostgreSQL (red) exhibits oscillations with wild CPU swings, suggesting significant lock thrashing. Source (blue) and FastTransfer (green) show variability reflecting downstream backpressure.
+
+## 4. Performance Scaling Analysis: Degrees 64 to 128
+
+### 4.1 Continued Performance Improvement at Extreme Parallelism
+
+Degree 128 achieves the best absolute performance in the test suite, completing the transfer in 67 seconds compared to 92 seconds at degree 64, a meaningful 1.37x speedup that brings total throughput to 1.15 GB/s. While this represents 68.7% efficiency for the doubling operation (rather than the theoretical 2x), the continued improvement demonstrates that the system remains functional and beneficial even at extreme parallelism levels.
+
+### 4.2 Unexpected Efficiency Improvements at Degree 128
+
+Degree 128 exhibits a counterintuitive result: lower system CPU overhead (70.5%) than degree 64 (83.9%) despite doubling parallelism, while total CPU actually decreases by 8.0% (4,596% → 4,230%). User CPU efficiency improves by 82.1% (16.2% → 29.5% of total CPU), meaning nearly double the proportion of CPU time goes to productive work rather than lock contention. The reason for these improvements remains unclear.
+
+**The Comparative Analysis:**
+
+| Metric | Degree 64 | Degree 128 | Change |
+| ----------------------- | -------------------- | -------------------- | -------------------- |
+| Elapsed Time | 92s | 67s | 1.37x speedup |
+| Total CPU | 4,596% | 4,230% | -8.0% |
+| User CPU | 743% (16.2% of total)| 1,248% (29.5% of total) | +68.0% |
+| System CPU | 3,854% (83.9% of total) | 2,982% (70.5% of total) | -22.6% |
+| Network Throughput | 1,033 MB/s mean | 1,088 MB/s mean | +5.3% |
+| Network Peak | 2,335 MB/s (93.4%) | 2,904 MB/s (116.2%) | Saturation |
+| Disk Throughput | 759 MB/s | 1,099 MB/s | +44.8% |
+
+**Open Question: Why Does Efficiency Improve at Degree 128?**
+
+The improvement from degree 64 to 128 is puzzling for several reasons:
+
+1. **Why does network bandwidth increase by 5.3%** (1,033 MB/s → 1,088 MB/s) when adding more parallelism to an already saturated network? At degree 128, network peaks at 2,904 MB/s (116.2% of capacity), yet mean throughput still increases.
+
+2. **Why does system CPU overhead decrease** from 83.9% to 70.5% despite doubling parallelism? More processes should create more lock contention, not less.
+
+3. **Why does user CPU efficiency nearly double** (16.2% → 29.5% of total) when adding 64 more processes competing for the same resources?
+
+One hypothesis is that network saturation at degree 128 acts as a pacing mechanism, rate-limiting data delivery and preventing all 128 processes from simultaneously contending for locks. However, this doesn't fully explain why network throughput itself increases, nor why the efficiency gains are so substantial. The interaction between network saturation, lock contention, and process scheduling appears more complex than initially understood.
+
+## 5. Disk I/O and Network Analysis
+
+### 5.1 Source Disk I/O Analysis
+
+The source instance has 256GB RAM with a Postgres `effective_cache_size` of 192GB, and the lineitem table is ~77GB. An important detail explains the disk behavior across test runs:
+
+Degree 1 was the first test run, with no prior warm-up or cold run to pre-load the table into cache. During this first run at degree 1, there is heavy disk activity (500 MB/s, ~50% peak utilization) as the table is loaded into memory (shared_buffers + OS page cache). At degrees 2-128, there is essentially zero disk activity; the entire table remains cached in memory from the initial degree 1 load. This explains why degree 2 is more than twice as fast as degree 1: the degree 1 run includes the initial table-loading overhead, while degree 2 benefits from the already-cached table with no disk loading required. The speedup from degree 1 to 2 therefore reflects both the doubling of parallelism and the elimination of the initial cache-loading penalty.
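+
+One way to confirm that the table stays resident in shared_buffers between runs is to query the `pg_buffercache` contrib extension on the source instance. This is a hedged sketch: it requires installing the extension (superuser), it only sees shared_buffers (not the OS page cache), and the driver and connection details are illustrative.
+
+```python
+import psycopg  # assumed driver
+
+SQL = """
+SELECT count(*) * current_setting('block_size')::bigint / 1024 / 1024 AS cached_mib
+FROM pg_buffercache
+WHERE relfilenode = pg_relation_filenode('tpch_100.lineitem');
+"""
+
+with psycopg.connect("host=localhost dbname=tpch", autocommit=True) as conn:
+    conn.execute("CREATE EXTENSION IF NOT EXISTS pg_buffercache")
+    cached_mib = conn.execute(SQL).fetchone()[0]
+    print(f"lineitem pages cached in shared_buffers: {cached_mib} MiB")
+```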
+
+
+
+**Figure 13: Source Disk Utilization Over Time** - Shows disk utilization across all test runs (vertical lines mark test boundaries for degrees 1, 2, 4, 8, 16, 32, 64, 128). At degree 1, utilization peaks at ~50% during the initial table load, then drops to near-zero. At higher degrees (2-128), utilization remains below 1% throughout, confirming the disk is idle and not limiting performance.
+
+Disk utilization measures the percentage of time the disk is busy serving I/O requests. Source disk I/O is not a bottleneck at any parallelism degree.
+
+### 5.2 Target Disk I/O Time Series
+
+
+
+**Figure 14: Target Disk Write Throughput Over Time** - Throughput exhibits bursty behavior with spikes to 2000-3759 MB/s followed by drops to near zero. Sustained baseline varies from ~100 MB/s (low degrees) to ~300 MB/s (degree 128) but never sustains disk capacity.
+
+
+
+**Figure 15: Target Disk Utilization Over Time** - Mean utilization remains below 25% across all degrees. Spikes reach 70-90% during bursts but quickly return to low baseline. This suggests disk I/O is not the bottleneck.
+
+### 5.3 Network Throughput Analysis
+
+
+
+**Figure 16: Target Network Ingress Over Time** - At degree 128, throughput plateaus at ~2,450 MB/s (98% of capacity) during active bursts, but averages only 1,088 MB/s (43.5%) due to alternating active/idle periods. At degrees 1-64, network remains well below capacity.
+
+Network saturation only occurs at degree 128 during active bursts. Therefore, the network doesn't explain the poor scaling from degree 1 through 64; target CPU lock contention remains the primary bottleneck.
+
+### 5.4 Cross-Degree Scaling Analysis
+
+
+
+**Figure 17: Mean Disk Write Throughput by Degree** - Scales from 90 MB/s (degree 1) to 1,099 MB/s (degree 128), only 12.3x improvement for 128x parallelism (9.6% efficiency).
+
+
+
+**Figure 18: Network Throughput Comparison: Source TX vs Target RX** - At degree 128, source transmits 1,684 MB/s while target receives only 1,088 MB/s, creating a 596 MB/s (35%) deficit. This suggests the target cannot keep pace with source data production, likely due to CPU lock contention.
+
+The apparent 35% violation of flow conservation is explained by TCP retransmissions. The source TX counter (measured via `sar -n DEV`) counts both original packets and retransmitted packets, while the target RX counter only counts successfully received unique packets. When the target is overloaded with CPU lock contention (83.9% system CPU at degree 64), it cannot drain receive buffers fast enough, causing packet drops that trigger TCP retransmissions. The 596 MB/s "deficit" is actually retransmitted data counted twice at the source but only once at the target, providing quantitative evidence of the target's inability to keep pace with source data production.
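+
+Retransmissions can be verified directly on the source host by diffing the kernel's TCP counters around a run. A minimal, Linux-only sketch (the counters are system-wide, so this assumes the transfer dominates the traffic during the measurement window):
+
+```python
+def tcp_counters(path: str = "/proc/net/snmp") -> dict:
+    """Return the kernel's TCP counters (InSegs, OutSegs, RetransSegs, ...)."""
+    with open(path) as f:
+        tcp_lines = [line.split() for line in f if line.startswith("Tcp:")]
+    # First "Tcp:" line holds the field names, the second one the values
+    return dict(zip(tcp_lines[0][1:], map(int, tcp_lines[1][1:])))
+
+
+before = tcp_counters()
+# ... run the transfer ...
+after = tcp_counters()
+
+sent = after["OutSegs"] - before["OutSegs"]
+retrans = after["RetransSegs"] - before["RetransSegs"]
+print(f"retransmitted segments: {retrans} ({100 * retrans / max(sent, 1):.2f}% of sent)")
+```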
+
+### 5.5 I/O Analysis Conclusions
+
+1. **Disk does not appear to be the bottleneck**: 24% average utilization at degree 128 with 76% idle capacity. PostgreSQL matches FIO peak (3,759 MB/s) but sustains only 170 MB/s average.
+
+2. **Network does not appear to be the bottleneck for degrees 1-64**: Utilization remains below 42% through degree 64. Saturation occurs only at degree 128 during active bursts (~2,450 MB/s plateau).
+
+3. **Target CPU lock contention appears to be the root cause**: Low disk utilization + network saturation only at degree 128 + poor scaling efficiency throughout + high system CPU percentage (83.9% at degree 64) all point to the same conclusion.
+
+4. **Backpressure suggests target bottleneck**: Source can produce 1,684 MB/s but target can only consume 1,088 MB/s. Source processes use only 0.11 cores/process, suggesting they're blocked waiting for target acknowledgments.
+
+## 6. Conclusions
+
+### 6.1 Performance Achievement and Bottleneck Analysis
+
+FastTransfer successfully demonstrates strong absolute performance, achieving a 13.1x speedup that reduces 77GB transfer time from approximately 15 minutes (878s) to just over 1 minute (67s). This represents practical, production-ready performance with sustained throughput of 1.15 GB/s at degree 128. The system delivers continuous performance improvements across all tested parallelism degrees, confirming that parallel replication provides meaningful benefits even when facing coordination challenges.
+
+The primary scaling limitation appears to be target PostgreSQL lock contention beyond degree 32. System CPU grows to 83.9% at degree 64, meaning only 16.2% of CPU performs productive work. Degree 128 continues to improve absolute performance (67s vs 92s) even as total CPU decreases from 4,596% to 4,230%, though the reason for this unexpected efficiency improvement remains unclear.
+
+### 6.2 Why Additional Tuning Should Not Help
+
+The target table is already optimally configured (UNLOGGED, no indexes, no constraints, no triggers). The PostgreSQL configuration includes all recommended bulk loading optimizations (80GB shared_buffers, huge pages, `io_uring`, fsync=off). Despite this, system CPU remains at 70-84% at high degrees.
+
+The bottleneck appears to be architectural rather than configurational: buffer pool partition locks, the relation extension lock, FSM access, and so on. No configuration parameter appears able to eliminate these fundamental coordination requirements.
+
+### 6.3 Future Work: PostgreSQL Instrumentation Analysis
+
+While this analysis relied on system-level metrics, a follow-up study will use PostgreSQL's internal instrumentation to provide direct evidence of lock contention and wait events. This will validate the hypotheses presented in this analysis using database engine-level metrics.
+
+
+## Appendix A: PostgreSQL Configuration
+
+Both PostgreSQL 18 instances were tuned for maximum bulk loading performance.
+
+### Target PostgreSQL Configuration (Key Settings)
+
+```ini
+# Memory allocation
+shared_buffers = 80GB # 31% of 256GB RAM
+huge_pages = on # vm.nr_hugepages=45000
+work_mem = 256MB
+maintenance_work_mem = 16GB
+
+# Durability disabled (benchmark only, NOT production)
+synchronous_commit = off
+fsync = off
+full_page_writes = off
+
+# WAL configuration (minimal for UNLOGGED)
+wal_level = minimal
+wal_buffers = 128MB
+max_wal_size = 128GB
+checkpoint_timeout = 15min
+checkpoint_completion_target = 0.5
+
+# Background writer (AGGRESSIVE)
+bgwriter_delay = 10ms # Down from default 200ms
+bgwriter_lru_maxpages = 2000 # 2x default
+bgwriter_lru_multiplier = 8.0 # 2x default
+bgwriter_flush_after = 0
+
+# I/O configuration (PostgreSQL 18 optimizations)
+backend_flush_after = 0
+effective_io_concurrency = 400 # Optimized for NVMe
+maintenance_io_concurrency = 400
+io_method = io_uring # NEW PG18: async I/O
+io_max_concurrency = 512 # NEW PG18
+io_workers = 8 # NEW PG18: up from default 3
+
+# Worker processes
+max_worker_processes = 128
+max_parallel_workers = 128
+
+# Autovacuum (PostgreSQL 18)
+autovacuum = on
+autovacuum_worker_slots = 32 # NEW PG18: runtime adjustment
+autovacuum_max_workers = 16
+autovacuum_vacuum_cost_delay = 0 # No throttling
+
+# Query tuning
+enable_partitionwise_join = on
+enable_partitionwise_aggregate = on
+random_page_cost = 1.1 # NVMe SSD
+effective_cache_size = 192GB # ~75% of RAM
+```
+
+### Source PostgreSQL Configuration (Key Settings)
+
+The source instance is optimized for fast parallel reads to support high-throughput data extraction:
+
+```ini
+# Memory allocation
+shared_buffers = 80GB # ~31% of 256GB RAM
+huge_pages = on # vm.nr_hugepages=45000
+work_mem = 256MB
+maintenance_work_mem = 4GB # Lower than target (16GB)
+
+# Durability disabled (benchmark only, NOT production)
+synchronous_commit = off
+fsync = off
+full_page_writes = off
+
+# WAL configuration
+wal_level = minimal
+wal_buffers = -1 # Auto-sized
+max_wal_size = 32GB # Smaller than target (128GB)
+checkpoint_timeout = 60min # Longer than target (15min)
+checkpoint_completion_target = 0.9
+
+# Background writer
+bgwriter_delay = 50ms # Less aggressive than target (10ms)
+bgwriter_lru_maxpages = 1000 # Half of target (2000)
+bgwriter_lru_multiplier = 4.0 # Half of target (8.0)
+bgwriter_flush_after = 2MB
+
+# I/O configuration (PostgreSQL 18 optimizations)
+backend_flush_after = 0
+effective_io_concurrency = 400 # Identical to target
+maintenance_io_concurrency = 400
+io_method = io_uring # NEW PG18: async I/O
+io_max_concurrency = 512 # NEW PG18
+io_workers = 8 # NEW PG18
+
+# Worker processes
+max_connections = 500 # Higher than target for parallel readers
+max_worker_processes = 128
+max_parallel_workers_per_gather = 64
+max_parallel_workers = 128
+
+# Query tuning (optimized for parallel reads)
+enable_partitionwise_join = on
+enable_partitionwise_aggregate = on
+random_page_cost = 1.1 # Block Storage (not NVMe)
+effective_cache_size = 192GB # ~75% of RAM
+default_statistics_target = 500
+
+# Autovacuum (PostgreSQL 18)
+autovacuum = on
+autovacuum_worker_slots = 32 # NEW PG18: runtime adjustment
+autovacuum_max_workers = 16
+autovacuum_vacuum_cost_delay = 0 # No throttling
+```
+
+### Table Configuration
+
+The target table eliminates all overhead sources:
+
+- **UNLOGGED**: No WAL write, flush, or archival overhead
+- **No indexes**: Eliminates 50-80% of bulk load cost
+- **No primary key**: No index maintenance or uniqueness checking
+- **No constraints**: No foreign key, check, or unique validation
+- **No triggers**: No trigger execution overhead
+
+This represents the absolute minimum overhead possible.
+
+---
+
+## About FastTransfer
+
+FastTransfer is a commercial high-performance data migration tool developed by [arpe.io](https://arpe.io). It provides parallel data transfer capabilities across multiple database platforms including PostgreSQL, MySQL, Oracle, SQL Server, ClickHouse, and DuckDB.
+
+**Key Features:**
+
+- Advanced parallelization strategies for optimal performance
+- Cross-platform compatibility with major databases
+- Flexible configuration for various data migration scenarios
+- Production-ready with comprehensive logging and monitoring
+
+For licensing information, support options, and to request a trial, visit the [official documentation](https://aetperf.github.io/FastTransfer-Documentation/).
diff --git a/_posts/WP_2025-05-23-Computing-and-Visualizing-Billions-of-Bohemian-Eigenvalues-with-Python.md b/_posts/WP_2025-05-23-Computing-and-Visualizing-Billions-of-Bohemian-Eigenvalues-with-Python.md
new file mode 100644
index 0000000..f17db10
--- /dev/null
+++ b/_posts/WP_2025-05-23-Computing-and-Visualizing-Billions-of-Bohemian-Eigenvalues-with-Python.md
@@ -0,0 +1,239 @@
+
+According to [www.bohemianmatrices.com/](http://www.bohemianmatrices.com/),
+
+> A family of Bohemian matrices is a distribution of random matrices where the matrix entries are sampled from a discrete set of bounded height. The discrete set must be independent of the dimension of the matrices.
+
+In our case, we sample 5x5 matrix entries from the discrete set {-1, 0, 1}. For example, here are 2 random matrices with these specifications:
+
+$$
+\begin{pmatrix}
+0 & 0 & 0 & -1 & 0 \\
+0 & 0 & 0 & 1 & -1 \\
+1 & -1 & 0 & -1 & -1 \\
+1 & -1 & 1 & 1 & 0 \\
+-1 & 0 & 1 & 0 & -1
+\end{pmatrix}
+$$
+
+$$
+\begin{pmatrix}
+-1 & 0 & 0 & 0 & 0 \\
+0 & 1 & 0 & 1 & 1 \\
+1 & -1 & 0 & -1 & -1 \\
+1 & 1 & 1 & 1 & -1 \\
+-1 & 0 & 1 & -1 & -1
+\end{pmatrix}
+$$
+
+Bohemian **eigenvalues** are the eigenvalues of a family of Bohemian matrices. Eigenvalues $\lambda$ satisfy the equation $A v = \lambda v$, where $A$ is the matrix and $v$ is a corresponding eigenvector. If the matrices are 5 by 5, the eigenvalue solver is going to return 5 complex eigenvalues.
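+
+As a quick illustration with NumPy, here are the eigenvalues of the first example matrix above (this is just a one-off check, not the batched computation used for the billion-matrix run):
+
+```python
+import numpy as np
+
+A = np.array(
+    [
+        [ 0,  0,  0, -1,  0],
+        [ 0,  0,  0,  1, -1],
+        [ 1, -1,  0, -1, -1],
+        [ 1, -1,  1,  1,  0],
+        [-1,  0,  1,  0, -1],
+    ]
+)
+print(np.linalg.eigvals(A))  # 5 (generally complex) eigenvalues
+```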
+
+So what if we want to compute all the possible eigenvalues from all these matrices? Well, that would imply $3^{5 \times 5} = 847,288,609,443$ matrices, resulting in $4.236 \times 10^{12}$ eigenvalues. Just storing all these matrices would require significant space: each 5×5 matrix has 25 entries, and at 1 byte per entry to represent {-1, 0, 1}, we get a total storage requirement of:
+$847,288,609,443$ matrices × 25 bytes/matrix = $21,182,215,236,075$ bytes ≈ 21.18 TB
+
+So instead of computing all possible matrices, we are only going to sample 1 billion matrices. The first motivation for computing these complex eigenvalues is to observe some interesting patterns when they are plotted in the complex plane. See for example the beautiful gallery on the bohemianmatrices website: [www.bohemianmatrices.com/gallery/](http://www.bohemianmatrices.com/gallery/). Actually, we are just going to reproduce one of the gallery images:
+
+
+
+
+
+
+|  | commit_count | distrib |
+|---|---|---|
+| hour |  |  |
+| 0 | 12 | 0.001578 |
+| 1 | 3 | 0.000394 |
+| 2 | 0 | 0.000000 |
+
+|  | commit_count | distrib |
+|---|---|---|
+| weekday |  |  |
+| Monday | 1527 | 0.200789 |
+| Tuesday | 1264 | 0.166206 |
+| Wednesday | 1291 | 0.169757 |
+
+
+
+
+
+
| hour | +0 | +1 | +... | +22 | +23 | +
|---|---|---|---|---|---|
| weekday | ++ | + | + | + | + |
| Monday | +0.000000 | +0.0 | +... | +0.499671 | +0.092045 | +
| Tuesday | +0.039448 | +0.0 | +... | +0.197239 | +0.078895 | +
| Wednesday | +0.078895 | +0.0 | +... | +0.262985 | +0.026298 | +
3 rows × 24 columns
+
+
+
+
+### Software Versions
+- **FastTransfer**: Version 0.13.12.0 (X64 architecture, .NET 8.0.20)
+- **Operating System**: Ubuntu 24.04.3 LTS
+- **Source Engine**: DuckDB v1.3.2 (for Parquet reading and streaming)
+- **Target Database**: PostgreSQL 16.10
+
+### Hardware Configuration
+- **Compute**: 32 vCores @ 2.3 GHz with 64 GB RAM
+- **Storage**: 400 GB local NVMe where PostgreSQL's data directory resides
+- **Network**: 4 Gbps bandwidth
+- **Location**: Gravelines (GRA11) datacenter
+
+The local NVMe delivers strong sequential write performance at 1465 MiB/s (measured with fio), providing ample disk bandwidth for our data loading workloads.
+
+This configuration represents a practical mid-range setup, not the smallest instance that would struggle with parallel workloads, nor an oversized machine that would mask performance characteristics.
+
+### The Data: TPC-H Orders Table
+
+We're using the TPC-H benchmark's orders table at scale factor 10, which gives us:
+- 16 Parquet files, evenly distributed at 29.2 MiB each
+- Total dataset size: 467.8 MiB
+- 15 million rows with mixed data types (integers, decimals, dates, and varchar)
+
+The data resides in an OVH S3-compatible object storage bucket in the Gravelines region, and each file contains roughly 937,500 rows. This distribution allows us to test parallel loading strategies effectively.
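+
+The Parquet layout can be inspected directly with DuckDB before launching the transfer. A minimal sketch, assuming the `httpfs` extension is available and the S3 endpoint and credentials for the OVH bucket are configured in the environment or as a DuckDB secret (the bucket path is the one used by the FastTransfer command below):
+
+```python
+import duckdb
+
+con = duckdb.connect()  # in-memory database, like FastTransfer's ":memory:" source
+con.execute("INSTALL httpfs")
+con.execute("LOAD httpfs")
+
+# Count rows per file across the 16 Parquet files on S3
+df = con.sql(
+    """
+    SELECT filename, count(*) AS n_rows
+    FROM read_parquet('s3://arpeiofastbcp/tpch/sf10/orders/*.parquet', filename=true)
+    GROUP BY filename
+    ORDER BY filename
+    """
+).df()
+
+print(len(df), df.n_rows.sum())  # expected: 16 files, 15 million rows in total
+```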
+
+## FastTransfer in Action: The Command That Does the Heavy Lifting
+
+Here's the actual command we use to load data:
+
+```bash
+./FastTransfer \
+ --sourceconnectiontype "duckdbstream" \
+ --sourceserver ":memory:" \
+ --query "SELECT * exclude filename from read_parquet('s3://arpeiofastbcp/tpch/sf10/orders/*.parquet', filename=true) t" \
+ --targetconnectiontype "pgcopy" \
+ --targetserver "localhost:5432" \
+ --targetuser "fasttransfer" \
+ --targetpassword "********" \
+ --targetdatabase "tpch" \
+ --targetschema "tpch_10_test" \
+ --targettable "orders" \
+ --method "DataDriven" \
+ --distributeKeyColumn "filename" \
+ --datadrivenquery "select file from glob('s3://arpeiofastbcp/tpch/sf10/orders/*.parquet')" \
+ --loadmode "Truncate" \
+ --mapmethod "Name" \
+ --batchsize 10000 \
+ --degree 16
+```
+
+Let's break down the key components and understand what each parameter does:
+
+### Source Configuration
+- **`--sourceconnectiontype "duckdbstream"`**: Uses DuckDB's memory-efficient streaming connection
+- **`--sourceserver ":memory:"`**: Runs DuckDB in-memory mode for temporary data processing without persisting to disk
+- **`--query`**: The DuckDB SQL that leverages the `read_parquet()` function to directly access Parquet files from S3, with `filename=true` to capture file origins for distribution
+
+### Target Configuration
+- **`--targetconnectiontype "pgcopy"`**: Uses PostgreSQL's native COPY protocol, a fast method for bulk loading data into PostgreSQL
+- **`--targetserver "localhost:5432"`**: Standard PostgreSQL connection details
+- **`--targetuser` and `--targetpassword`**: Database authentication credentials
+
+### Parallelization Strategy
+- **`--method "DataDriven"`**: Distributes work based on distinct values in a specified column, in our case each worker processes specific files
+- **`--distributeKeyColumn "filename"`**: Uses the filename column to assign work to workers, ensuring each file is processed by exactly one worker
+- **`--datadrivenquery`**: Overrides the default distinct value selection with an explicit file list using `glob()`, giving us precise control over work distribution
+- **`--degree 16`**: Creates 16 parallel workers. FastTransfer supports 1-1024 workers, or negative values for CPU-adaptive scaling (e.g., `-2` uses half available CPUs)
+
+### Loading Configuration
+- **`--loadmode "Truncate"`**: Clears the target table before loading, ensuring a clean slate (alternative is `"Append"` for adding to existing data)
+- **`--mapmethod "Name"`**: Maps source to target columns by name rather than position, providing flexibility when column orders differ
+- **`--batchsize 10000`**: Processes 10,000 rows per bulk copy operation (default is 1,048,576). Smaller batches can reduce memory usage but may impact throughput
+
+### About FastTransfer
+
+FastTransfer is designed specifically for efficient data movement between different database systems, particularly excelling with large datasets (>1 million cells). The tool requires the target table to pre-exist and supports various database types including ClickHouse, MySQL, Oracle, PostgreSQL, and SQL Server. Its strength lies in intelligent work distribution, whether using file-based distribution like our DataDriven approach, or other methods like CTID (PostgreSQL-specific), RangeId (numeric ranges), or Random (modulo-based distribution).
+
+## Performance Analysis: Where Theory Meets Reality
+
+We tested four different table configurations to understand how PostgreSQL constraints and logging independently affect loading performance. Each test was run multiple times, and we report the best result to minimize noise from network variability or background system tasks.
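+
+The sweep over parallelism degrees is easy to script. Below is a minimal sketch of how such runs could be driven from Python, reusing the command shown above; the harness actually used for these benchmarks is not shown in this post, and wall-clock timing here simply stands in for FastTransfer's own reported elapsed time:
+
+```python
+import shlex
+import subprocess
+import time
+
+# Same arguments as the full command above, with the degree left as a placeholder.
+CMD_TEMPLATE = """
+./FastTransfer
+  --sourceconnectiontype duckdbstream --sourceserver :memory:
+  --query "SELECT * exclude filename from read_parquet('s3://arpeiofastbcp/tpch/sf10/orders/*.parquet', filename=true) t"
+  --targetconnectiontype pgcopy --targetserver localhost:5432
+  --targetuser fasttransfer --targetpassword ********
+  --targetdatabase tpch --targetschema tpch_10_test --targettable orders
+  --method DataDriven --distributeKeyColumn filename
+  --datadrivenquery "select file from glob('s3://arpeiofastbcp/tpch/sf10/orders/*.parquet')"
+  --loadmode Truncate --mapmethod Name --batchsize 10000
+  --degree {degree}
+"""
+
+best_times = {}
+for degree in (1, 2, 4, 8, 16):
+    args = shlex.split(CMD_TEMPLATE.format(degree=degree))
+    runs = []
+    for _ in range(3):  # a few runs per degree, keep the best one
+        start = time.perf_counter()
+        subprocess.run(args, check=True)
+        runs.append(time.perf_counter() - start)
+    best_times[degree] = min(runs)
+
+for degree, seconds in best_times.items():
+    print(f"degree {degree:>2}: {seconds:6.1f}s  speedup {best_times[1] / seconds:4.1f}x")
+```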
+
+### Configuration 1: WITH PK / LOGGED
+
+Standard production table with primary key on `o_orderkey` and full WAL durability:
+
+| Degree of Parallelism | Load Time (seconds) | Speedup |
+|----------------------|---------------------|---------|
+| 1 | 50.5 | 1.0x |
+| 2 | 28.8 | 1.8x |
+| 4 | 17.8 | 2.8x |
+| 8 | 16.1 | 3.1x |
+| 16 | 19.2 | 2.6x |
+
+Peaks at 8 workers (3.1x speedup). Constraint checking and WAL logging create severe contention.
+
+### Configuration 2: WITH PK / UNLOGGED
+
+Primary key with WAL logging disabled:
+
+| Degree of Parallelism | Load Time (seconds) | Speedup |
+|----------------------|---------------------|---------|
+| 1 | 46.3 | 1.0x |
+| 2 | 25.5 | 1.8x |
+| 4 | 14.5 | 3.2x |
+| 8 | 9.3 | 5.0x |
+| 16 | 7.8 | 5.9x |
+
+Removing WAL overhead significantly improves scaling. Continues to 16 workers due to reduced contention.
+
+### Configuration 3: WITHOUT PK / LOGGED
+
+No constraints, WAL logging enabled:
+
+| Degree of Parallelism | Load Time (seconds) | Speedup |
+|----------------------|---------------------|---------|
+| 1 | 45.3 | 1.0x |
+| 2 | 24.2 | 1.9x |
+| 4 | 13.2 | 3.4x |
+| 8 | 8.7 | 5.2x |
+| 16 | 8.7 | 5.2x |
+
+Better than WITH PK/LOGGED but plateaus at 8 workers due to WAL contention.
+
+### Configuration 4: WITHOUT PK / UNLOGGED
+
+Maximum performance configuration - no constraints, no WAL:
+
+| Degree of Parallelism | Load Time (seconds) | Speedup |
+|----------------------|---------------------|---------|
+| 1 | 44.5 | 1.0x |
+| 2 | 25.4 | 1.8x |
+| 4 | 13.4 | 3.3x |
+| 8 | 7.8 | 5.7x |
+| 16 | 5.1 | 8.7x |
+
+Best scaling - achieves 8.7x speedup at 16 workers before running into the remaining bottlenecks discussed below.
+
+## Visual Performance Comparison
+
+
+
+The comparison reveals how primary keys and WAL logging independently bottleneck performance. WITHOUT PK/UNLOGGED achieves the best scaling (8.7x at 16 workers), while WITH PK/LOGGED caps at 3.1x. The intermediate configurations show each factor's impact: removing the primary key or disabling WAL each provide significant improvements, with their combination delivering maximum performance.
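+
+For readers who want to reproduce the comparison chart, the load times from the four tables above can be re-plotted directly. A small pandas/matplotlib sketch (the figure in this post was generated separately; the numbers below are simply re-entered from the tables):
+
+```python
+import matplotlib.pyplot as plt
+import pandas as pd
+
+degrees = [1, 2, 4, 8, 16]
+load_times = {  # seconds, taken from the four result tables above
+    "WITH PK / LOGGED": [50.5, 28.8, 17.8, 16.1, 19.2],
+    "WITH PK / UNLOGGED": [46.3, 25.5, 14.5, 9.3, 7.8],
+    "WITHOUT PK / LOGGED": [45.3, 24.2, 13.2, 8.7, 8.7],
+    "WITHOUT PK / UNLOGGED": [44.5, 25.4, 13.4, 7.8, 5.1],
+}
+
+df = pd.DataFrame(load_times, index=degrees)
+ax = df.plot(marker="o", logx=True, figsize=(8, 5))
+ax.set_xlabel("Degree of parallelism")
+ax.set_ylabel("Load time (s)")
+ax.set_xticks(degrees, labels=[str(d) for d in degrees])
+ax.set_title("Parquet → PostgreSQL load time by configuration")
+plt.tight_layout()
+plt.show()
+```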
+
+## Network and I/O Considerations
+
+Different configurations reveal different bottlenecks:
+
+- **WITH PK / LOGGED**: Constraint checking + WAL overhead limits to 3.1x
+- **WITH PK / UNLOGGED**: WAL removal allows 5.9x scaling
+- **WITHOUT PK / LOGGED**: WAL contention plateaus at 5.2x
+- **WITHOUT PK / UNLOGGED**: Best scaling at 8.7x (467.8 MiB in 5.1s ≈ 92 MB/s)
+
+At 92 MB/s with 4 Gbps network (~500 MB/s) and 1465 MiB/s local NVMe capacity, neither network nor disk I/O are the bottleneck. The limitation could come from several sources: S3 object storage throughput, DuckDB Parquet parsing overhead, or PostgreSQL's internal coordination when multiple workers write concurrently to the same table.
+
+## Conclusion
+
+FastTransfer achieves 5.1-second load times for 467.8 MiB of Parquet data from OVH S3 to PostgreSQL, reaching 92 MB/s throughput with WITHOUT PK/UNLOGGED configuration at degree 16. Testing four configurations reveals that primary keys and WAL logging each independently constrain performance, with optimal settings varying from degree 8 (LOGGED) to degree 16+ (UNLOGGED). The results demonstrate that cloud-based data pipelines can achieve strong performance when configuration matches use case requirements.
+
+---
+
+## About FastTransfer
+
+FastTransfer is a commercial high-performance data migration tool developed by [arpe.io](https://arpe.io). It provides parallel data transfer capabilities across multiple database platforms including PostgreSQL, MySQL, Oracle, SQL Server, ClickHouse, and DuckDB.
+
+**Key Features:**
+- Advanced parallelization strategies for optimal performance
+- Cross-platform compatibility with major databases
+- Flexible configuration for various data migration scenarios
+- Production-ready with comprehensive logging and monitoring
+
+For licensing information, support options, and to request a trial, visit the [official documentation](https://aetperf.github.io/FastTransfer-Documentation/).
\ No newline at end of file
diff --git a/_posts/WP_2025-09-29-FastTransfer-Performance-with-Citus-Columnar-Storage-in-PostgreSQL.md b/_posts/WP_2025-09-29-FastTransfer-Performance-with-Citus-Columnar-Storage-in-PostgreSQL.md
new file mode 100644
index 0000000..bb25592
--- /dev/null
+++ b/_posts/WP_2025-09-29-FastTransfer-Performance-with-Citus-Columnar-Storage-in-PostgreSQL.md
@@ -0,0 +1,485 @@
+
+## Introduction
+
+Data migration between database systems often becomes a bottleneck in modern data pipelines, particularly when dealing with analytical workloads. [FastTransfer](https://aetperf.github.io/FastTransfer-Documentation/) is a commercial data migration tool designed to address these challenges through advanced parallelization strategies. This post demonstrates FastTransfer's performance when working with PostgreSQL databases enhanced with the [Citus extension](https://docs.citusdata.com/en/v13.0/) for columnar storage.
+
+## Understanding FastTransfer
+
+FastTransfer is a command-line tool designed to address common data migration challenges. In our testing, we've found it particularly effective for scenarios where traditional migration approaches fall short.
+
+### Core Capabilities
+
+The tool offers several features that we've found valuable in production environments:
+
+- **Cross-platform compatibility**: Works with PostgreSQL, MySQL, Oracle, SQL Server, ClickHouse, DuckDB, and other major databases
+- **Advanced parallelization**: Multiple strategies for parallel data extraction and loading, allowing you to optimize for your specific use case
+- **Flexible configuration**: Fine-grained control over batch sizes, mapping methods, and load modes to tune performance
+- **Production-ready features**: Comprehensive logging, error handling, and monitoring help ensure reliable migrations
+
+### Parallelization Strategies
+
+One aspect we particularly appreciate about FastTransfer is its range of parallelization methods, accessible through the `-M, --method` option.
+
+### Citus Columnar → PostgreSQL Transfer Performance
+
+
+## Key Takeaways
+
+### Performance Summary
+
+From our benchmarks with 15 million rows:
+
+| Scenario | Best Method | Time | Speedup | Key Insight |
+|----------|------------|------|---------|-------------|
+| PostgreSQL → Citus | Ctid (8 threads) | 3.3s | 3.74x | Direct row access provides best performance |
+| Citus → PostgreSQL | RangeId UNLOGGED (8 threads) | 3.9s | 2.52x | UNLOGGED tables dramatically improve write speed |
+| Cross-compatible | RangeId (4 threads) | 5.3s | 2.29x | Good balance of performance and portability |
+
+### Important Considerations
+
+1. **Storage vs Speed Trade-off**: Columnar storage reduces disk usage by 76% but adds ~20% write overhead
+2. **Diminishing Returns**: Parallelization beyond 4 threads often shows limited benefit
+3. **Method Limitations**: Not all methods work with all storage types (e.g., Ctid incompatible with columnar)
+4. **Asymmetric Performance**: Reading from columnar is faster than writing to it
+
+## Analysis and Insights
+
+After running these benchmarks, several patterns became clear that might help inform your migration strategy.
+
+### Why Ctid Typically Outperforms Other Methods
+
+In our testing, the Ctid method consistently delivered the best performance for PostgreSQL sources. This makes sense when you consider that ctid provides direct access to physical row locations, eliminating the need for sorting or complex query planning that other methods require.
+
+### Scalability Patterns
+
+One interesting finding from our tests relates to how parallelization efficiency changes with thread count:
+
+#### The Law of Diminishing Returns
+
+As we increased parallelism, we observed declining efficiency across all methods:
+- **Sweet Spot**: In most cases, 4 threads offered the best balance between performance and resource utilization
+- **Efficiency Cliff**: At 8 threads, efficiency often dropped below 50% (see the formula just below), suggesting that the overhead of coordination begins to outweigh the benefits
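+
+Parallel efficiency here is simply the measured speedup divided by the number of threads:
+
+$$
+E(p) = \frac{S(p)}{p} = \frac{T_1}{p \, T_p}
+$$
+
+For instance, the cross-compatible RangeId result from the summary table above (2.29x speedup with 4 threads) corresponds to $E(4) = 2.29 / 4 \approx 57\%$, while the RangeId UNLOGGED result (2.52x at 8 threads) gives $E(8) \approx 32\%$, well below the 50% mark.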
+
+### Understanding Columnar Storage Impact
+
+Our benchmarks revealed several important considerations when working with columnar storage:
+
+#### Write Performance Trade-offs
+
+We observed that writing to columnar storage introduces approximately 19% overhead compared to standard tables (12,092 ms vs 10,141 ms). This overhead comes from several sources:
+- Compression processing (LZ4 in our configuration)
+- Data reorganization into columnar format (stripes and chunks)
+- Additional metadata management
+
+However, it's important to remember that this overhead delivers significant storage savings, in our case, a 76% reduction in disk usage.
+
+#### Read Performance Benefits
+
+Conversely, reading from columnar storage proved notably efficient:
+- Transfers from Citus to PostgreSQL completed 18% faster than the reverse direction
+- Compressed data requires less I/O bandwidth
+- Sequential reading patterns align well with columnar storage organization
+
+#### Asymmetric Performance Characteristics
+
+One surprising finding was that Citus → PostgreSQL transfers consistently outperformed PostgreSQL → Citus transfers. This asymmetry makes sense when you consider that:
+- The read-side benefits of compression outweigh the write-side penalties
+- Standard PostgreSQL tables have highly optimized write paths
+- The combination results in better overall performance when columnar is the source
+
+#### Method Compatibility Considerations
+
+It's worth noting that not all parallelization methods work with columnar storage. The Ctid method, while excellent for standard PostgreSQL tables, isn't compatible with columnar architecture due to the different way data is organized and accessed.
+
+## Conclusion
+
+FastTransfer effectively handles migrations involving Citus columnar storage, achieving up to 76% storage savings while maintaining high transfer speeds. The choice of parallelization method significantly impacts performance, with Ntile delivering the best balance for columnar targets. These results demonstrate that columnar storage and efficient data migration are not mutually exclusive when using the right tools.
+
+---
+
+## About FastTransfer
+
+FastTransfer is a commercial high-performance data migration tool developed by [arpe.io](https://arpe.io). It provides parallel data transfer capabilities across multiple database platforms including PostgreSQL, MySQL, Oracle, SQL Server, ClickHouse, and DuckDB.
+
+**Key Features:**
+- Advanced parallelization strategies for optimal performance
+- Cross-platform compatibility with major databases
+- Flexible configuration for various data migration scenarios
+- Production-ready with comprehensive logging and monitoring
+
+For licensing information, support options, and to request a trial, visit the [official documentation](https://aetperf.github.io/FastTransfer-Documentation/).
diff --git a/_posts/WP_2025-09-29-High-Speed-PostgreSQL-Replication-on-OVH-with-FastTransfer.md b/_posts/WP_2025-09-29-High-Speed-PostgreSQL-Replication-on-OVH-with-FastTransfer.md
new file mode 100644
index 0000000..fd7dfb1
--- /dev/null
+++ b/_posts/WP_2025-09-29-High-Speed-PostgreSQL-Replication-on-OVH-with-FastTransfer.md
@@ -0,0 +1,121 @@
+## Introduction
+
+PostgreSQL-to-PostgreSQL replication at scale requires tools that can fully leverage modern cloud infrastructure and network capabilities. [FastTransfer](https://aetperf.github.io/FastTransfer-Documentation/) is a commercial data migration tool designed to maximize throughput through advanced parallelization. This post demonstrates FastTransfer's performance transferring 113GB of TPC-H data between OVH c3-256 instances over a 20 Gbit/s network.
+
+## Infrastructure Setup
+
+For our testing environment, we deployed PostgreSQL on two OVH c3-256 instances in the Paris datacenter. Here's what we're working with:
+
+- **OVH Instances**: c3-256 (256GB RAM, 128 vCores @2.3GHz, 400GB NVMe)
+- **Network**: 20 Gbit/s vrack, Paris datacenter (eu-west-par-c)
+- **OS**: Ubuntu 24
+- **PostgreSQL**: Version 16
+- **Dataset**: TPC-H SF100 lineitem table (~600M rows, ~113GB)
+
+
+
+## PostgreSQL Configuration
+
+The PostgreSQL configuration is tuned for bulk operations: 80GB shared_buffers, 128 parallel workers, and minimal WAL logging. Target tables are UNLOGGED, with no primary keys.
+
+## Target Database Disk Performance
+
+The target PostgreSQL instance uses the native 400GB NVMe instance disk (not block storage) for database storage. This provides excellent I/O performance crucial for high-speed data ingestion:
+
+### FIO Benchmark Command
+```bash
+fio --name=seqwrite --filename=/tmp/fio-test --rw=write \
+ --bs=1M --size=8G --direct=1 --numjobs=1 --runtime=30 --group_reporting
+```
+
+### Results
+```
+Sequential Write Performance (8GB test, 1MB blocks):
+- Throughput: 1,260 MB/s (1.26 GB/s)
+- IOPS: 1,259
+- Average latency: 787 microseconds
+- 95th percentile: 1.5ms
+- 99th percentile: 2.3ms
+```
+
+The native NVMe storage delivers consistent low-latency writes with over 1.2 GB/s throughput, ensuring disk I/O is not a bottleneck for the PostgreSQL COPY operations even at peak network transfer rates.
+
+## Network Performance
+
+The private network connection between source and target instances was tested using iperf3 to verify bandwidth capacity:
+
+### iperf3 Benchmark Command
+```bash
+# On target instance
+iperf3 -s
+
+# On source instance
+iperf3 -c 10.10.0.50 -P 64 -t 30
+```
+
+### Results
+```
+Network Throughput Test (64 parallel streams, 30 seconds):
+- Average throughput: 20.5 Gbit/s
+- Total data transferred: 71.7 GB
+- Consistent performance across all streams
+```
+
+The network delivers full line-rate performance, slightly exceeding the nominal 20 Gbit/s specification. With 64 parallel TCP streams, the network provides ample bandwidth for FastTransfer's parallel data transfer operations.
+
+## FastTransfer Command
+
+FastTransfer version: 0.13.12
+
+```bash
+./FastTransfer \
+ --sourceconnectiontype "pgcopy" \
+ --sourceconnectstring "Host=localhost;Port=5432;Database=tpch;Trust Server Certificate=True;Application Name=FastTransfer;Maximum Pool Size=150;Timeout=15;Command Timeout=10800;Username=fasttransfer;Password=******" \
+ --sourceschema "tpch_100" --sourcetable "lineitem" \
+ --targetconnectiontype "pgcopy" \
+ --targetconnectstring "Host=10.10.0.50;Port=5432;Database=tpch;Trust Server Certificate=True;Application Name=FastTransfer;Maximum Pool Size=150;Timeout=15;Command Timeout=10800;Username=fasttransfer;Password=******" \
+ --targetschema "tpch_100" --targettable "lineitem" \
+ --loadmode "Truncate" --method "Ctid" --degree 128
+```
+
+Note the `Maximum Pool Size=150` setting in the connection strings, increased from the default of 100 to support 128 parallel threads.
+
+## Performance Results
+
+### Transfer Time
+
+
+
+Transfer time: 749s (single thread) → 70s (128 threads)
+
+### Throughput Scaling
+
+
+
+Throughput: 145 MB/s → 1,880 MB/s (75% of 20 Gbit/s link capacity)
+
+
+## Results Summary
+
+- **113GB transferred in 70 seconds** (degree=128)
+- **1.88 GB/s peak throughput** achieved
+- **10.7x speedup** with 128 parallel connections
+- **Optimal range**: 32-64 threads for best efficiency/performance balance
+
+## Conclusion
+
+FastTransfer achieves 1.88 GB/s throughput when transferring 113GB of data between PostgreSQL instances, utilizing 75% of the available 20 Gbit/s network capacity. The 10.7x speedup with 128 parallel connections demonstrates excellent scalability on OVH's high-end infrastructure. These results confirm that FastTransfer can effectively saturate modern cloud networking for PostgreSQL-to-PostgreSQL migrations.
+
+---
+
+## About FastTransfer
+
+FastTransfer is a commercial high-performance data migration tool developed by [arpe.io](https://arpe.io). It provides parallel data transfer capabilities across multiple database platforms including PostgreSQL, MySQL, Oracle, SQL Server, ClickHouse, and DuckDB.
+
+**Key Features:**
+- Advanced parallelization strategies for optimal performance
+- Cross-platform compatibility with major databases
+- Flexible configuration for various data migration scenarios
+- Production-ready with comprehensive logging and monitoring
+
+For licensing information, support options, and to request a trial, visit the [official documentation](https://aetperf.github.io/FastTransfer-Documentation/).
\ No newline at end of file
diff --git a/_posts/WP_2025-10-25-Performance-Analysis-of-Parallel-Data-Replication-Between-Two-PostgreSQL-18-Instances-on-OVH.md b/_posts/WP_2025-10-25-Performance-Analysis-of-Parallel-Data-Replication-Between-Two-PostgreSQL-18-Instances-on-OVH.md
new file mode 100644
index 0000000..573d7d5
--- /dev/null
+++ b/_posts/WP_2025-10-25-Performance-Analysis-of-Parallel-Data-Replication-Between-Two-PostgreSQL-18-Instances-on-OVH.md
@@ -0,0 +1,392 @@
+## Introduction
+
+Parallel data replication between PostgreSQL instances presents unique challenges at scale, particularly when attempting to maximize throughput on high-performance cloud infrastructure. [FastTransfer](https://aetperf.github.io/FastTransfer-Documentation/) is a commercial data migration tool designed to leverage advanced parallelization strategies for efficient data movement. This post provides a performance analysis of FastTransfer transferring 77GB of data between two PostgreSQL 18 instances on OVH c3-256 servers, examining CPU, disk I/O, and network bottlenecks across parallelism degrees from 1 to 128.
+
+### Test Configuration
+
+The test dataset consists of the TPC-H SF100 lineitem table (~600M rows, ~77GB), configured as an UNLOGGED table without indexes, constraints, or triggers. Both instances were tuned for bulk loading operations, with all durability features disabled, large memory allocations, and PostgreSQL 18's `io_uring` support enabled (configuration details in Appendix A). Despite this comprehensive optimization, it appears that lock contention emerges at high parallelism degrees, limiting scalability.
+
+Testing was performed at eight parallelism degrees, executed sequentially in a progressive loading pattern: 1, 2, 4, 8, 16, 32, 64, and 128, with each step doubling to systematically increase load. Each configuration was run only once rather than following standard statistical practice of multiple runs with mean, standard deviation, and confidence intervals. This single-run approach was adopted after preliminary tests showed minimal variation between successive runs, indicating stable and reproducible results under these controlled conditions.
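+
+The analysis below relies on system-level metrics sampled during each run (`sar` is mentioned for the network counters; the exact tooling used for the CPU time series is not detailed in this post). Purely as an illustration of the approach, here is a minimal sketch of how per-component CPU samples could be gathered with `psutil`:
+
+```python
+import time
+
+import psutil  # assumption: psutil is available on the monitored host
+
+
+def component(proc_name: str) -> str:
+    """Map a process name to one of the monitored components."""
+    if "postgres" in proc_name:
+        return "postgres"
+    if "FastTransfer" in proc_name:
+        return "fasttransfer"
+    return "other"
+
+
+samples = []
+for _ in range(60):  # one sample per second for a minute
+    totals = {"postgres": 0.0, "fasttransfer": 0.0, "other": 0.0}
+    for p in psutil.process_iter(["name"]):
+        try:
+            # cpu_percent() is relative to one core, matching the post's
+            # convention where e.g. 3,294% means ~33 busy cores; the first
+            # pass returns 0.0 for each process and can be discarded.
+            totals[component(p.info["name"] or "")] += p.cpu_percent(interval=None)
+        except (psutil.NoSuchProcess, psutil.AccessDenied):
+            pass
+    samples.append({"ts": time.time(), **totals})
+    time.sleep(1.0)
+```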
+
+### OVH Infrastructure Setup
+
+The test environment consists of two identical OVH cloud instances designed for heavy workloads:
+
+
+
+**Figure 1: OVH Infrastructure Architecture** - The test setup consists of two identical c3-256 instances (128 vCores, 256GB RAM, 400GB NVMe) running PostgreSQL 18 on Ubuntu 24.04. The source instance contains the TPC-H SF100 lineitem table. FastTransfer orchestrates parallel data replication across a 20 Gbit/s vrack private network connection to the target instance. Both instances are located in the Paris datacenter (eu-west-par-c) for minimal network latency.
+
+**Hardware Configuration:**
+
+- **Instance Type**: OVH c3-256
+- **Memory**: 256GB RAM
+- **CPU**: 128 vCores @ 2.3 GHz
+- **Storage**:
+ - **Target**: 400GB local NVMe SSD
+  - **Source**: OVH Block Storage (high-speed-gen2, ~2 TB, bandwidth: 1 GB/s, performance: 20,000 IOPS)
+- **Network**: 20 Gbit/s vrack (2.5 GB/s)
+
+The source instance PostgreSQL data directory resides on attached OVH Block Storage rather than local NVMe. This asymmetric storage configuration does not affect the analysis conclusions, as the source PostgreSQL instance exhibits backpressure behavior rather than storage-limited performance.
+
+**Software Stack:**
+
+- **OS**: Ubuntu 24.04.3 LTS with Linux kernel 6.8
+- **PostgreSQL**: Version 18.0, with `io_uring`, huge pages (`vm.nr_hugepages=45000`)
+- **FastTransfer**: Version 0.13.12
+
+**Infrastructure Performance Baseline:**
+
+- **Network**: 20.5 Gbit/s (2.56 GB/s) verified with iperf3
+- **Target Disk Sequential Write**: 3,741 MB/s (FIO benchmark with 128K blocks)
+- **Target Disk Random Write**: 88.2 MB/s, 22,600 IOPS (FIO, 4K blocks)
+
+### Overall Performance
+
+FastTransfer achieves strong absolute performance, transferring 77GB in just 67 seconds at degree 128, equivalent to 1.15 GB/s sustained throughput. The parallel replication process scales continuously across all tested degrees, with total elapsed time decreasing from 878 seconds (degree 1) to 67 seconds (degree 128). The system delivers consistent real-world performance improvements even at large parallelism levels, though lock contention on the target PostgreSQL instance appears to increasingly limit scaling efficiency beyond degree 32.
+
+
+
+**Figure 2: Total Elapsed Time by Degree of Parallelism** - Wall-clock time improves continuously across all tested degrees, from 878 seconds (degree 1) to 67 seconds (degree 128). Performance gains remain positive throughout, though the rate of improvement diminishes beyond degree 32 due to increasing lock contention.
+
+## 1. CPU Usage Analysis
+
+### 1.1 Mean and Peak CPU Usage
+
+
+
+
+**Figure 3: Mean CPU Usage by Component** - Target PostgreSQL (red) dominates resource consumption at high parallelism, while source PostgreSQL (blue) reaches around 12 cores.
+
+
+
+
+**Figure 4: Peak CPU Usage by Component** - Target PostgreSQL exhibits high peak values (~6,969% at degree 128). The large spikes combined with relatively lower mean values indicate high variance, characteristic of processes alternating between lock contention and productive work.
+
+**Component Scaling Summary:**
+
+| Component | Degree 1 | Degree 128 | Speedup | Efficiency |
+| ----------------- | ---------------- | ------------------- | ------- | ---------- |
+| Source PostgreSQL | 93% | 1,175% | 11.9x | 9.3% |
+| FastTransfer | 31% | 631% | 20.1x | 15.7% |
+| Target PostgreSQL | 98% | 3,294% | 33.6x | 26.3% |
+
+Source PostgreSQL's poor scaling appears to stem from backpressure: FastTransfer's batch-and-wait protocol means source processes send a batch, then block waiting for target acknowledgment. When the target cannot consume data quickly due to lock contention, this delay propagates backward. At degree 128, the source processes collectively use only 11.7 cores (0.11 cores/process), suggesting they're waiting rather than actively working.
+
+Note also that FastTransfer uses PostgreSQL's Ctid pseudo-column for table partitioning, which doesn't allow a perfectly even distribution: some partitions are smaller than others, causing some processes to complete and exit before the rest.
+
+### 1.2 FastTransfer
+
+
+
+**Figure 5: FastTransfer User vs System CPU** - At degree 128, FastTransfer uses 419% user CPU (66%) and 212% system CPU (34%).
+
+In the present case, FastTransfer uses PostgreSQL's binary COPY protocol on both sides (`--sourceconnectiontype "pgcopy"` and `--targetconnectiontype "pgcopy"`). Data flows directly from the source PostgreSQL's COPY TO BINARY through FastTransfer to the target PostgreSQL's COPY FROM BINARY without data transformation. FastTransfer acts as an intelligent network proxy coordinating parallel streams and batch acknowledgments, which explains its relatively low CPU usage. This would be less the case if we were transferring data between different RDBMS types.
+
+## 2. The Lock Contention Problem: System CPU Analysis
+
+### 2.1 System CPU
+
+
+
+**Figure 6: System CPU as % of Total CPU** - Target PostgreSQL (red line) crosses the 50% warning threshold at degree 16, exceeds 70% at degree 32, and peaks at 83.9% at degree 64. At this maximum, only 16.2% of CPU time performs productive work while 83.9% appears spent on lock contention and kernel overhead.
+
+CPU time divides into two categories: User CPU (application code performing actual data insertion) and System CPU (kernel operations handling locks, synchronization, context switches, I/O). A healthy system maintains system CPU below 30%.
+
+**System CPU Progression:**
+
+| Degree | Total CPU | User CPU | System CPU | System % | Productive Work |
+| ------ | --------- | -------- | ---------- | -------- | ------------------------- |
+| 1 | 98% | 80% | 18% | 18.2% | Healthy baseline |
+| 16 | 1,342% | 496% | 846% | 63.0% | Warning threshold crossed |
+| 32 | 2,436% | 602% | 1,834% | 75.3% | High contention |
+| 64 | 4,596% | 743% | 3,854% | 83.9% | **Maximum contention** |
+| 128 | 4,230% | 1,248% | 2,982% | 70.5% | Reduced contention |
+
+At degree 64, processes appear to spend 83.9% of time managing locks rather than inserting data. By degree 128, system CPU percentage unexpectedly decreases to 70.5% for unclear reasons, though absolute performance continues to improve.
+
+### 2.2 Possible Causes of Lock Contention
+
+The target table was already optimized for bulk loading (UNLOGGED, no indexes, no constraints, no triggers), eliminating all standard overhead sources. The remaining contention therefore likely stems from PostgreSQL's fundamental architecture (one way to observe these waits directly is sketched right after this list):
+
+1. **Shared Buffer Pool Locks**: All 128 parallel connections compete for buffer pool partition locks to read/modify/write pages.
+
+2. **Relation Extension Locks**: When the table grows, PostgreSQL requires an exclusive lock (only one process at a time).
+
+3. **Free Space Map (FSM) Locks**: All 128 writers query and update the FSM to find pages with free space, creating constant FSM thrashing.
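+
+One way to check these hypotheses is to sample `pg_stat_activity` wait events on the target while a load is running. A minimal sketch with psycopg2 (the connection string is a placeholder; interpreting specific wait event names is left to the follow-up study mentioned in Section 6.3):
+
+```python
+import time
+from collections import Counter
+
+import psycopg2  # assumption: psycopg2 installed; DSN below is a placeholder
+
+wait_counts = Counter()
+with psycopg2.connect("host=target-host dbname=tpch user=fasttransfer") as conn:
+    conn.autocommit = True
+    with conn.cursor() as cur:
+        for _ in range(300):  # ~5 minutes of 1-second samples
+            cur.execute(
+                """
+                SELECT wait_event_type, wait_event
+                FROM pg_stat_activity
+                WHERE state = 'active' AND wait_event IS NOT NULL
+                """
+            )
+            wait_counts.update(cur.fetchall())
+            time.sleep(1.0)
+
+for (event_type, event), n in wait_counts.most_common(10):
+    print(f"{event_type:10} {event:25} {n}")
+```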
+
+## 3. Distribution and Time Series Analysis
+
+### 3.1 CPU Distribution
+
+
+
+**Figure 7: CPU Distribution at Degree 4** - Tight, healthy distributions with small standard deviations. All components operate consistently without significant contention.
+
+
+
+**Figure 8: CPU Distribution at Degree 32** - Target PostgreSQL (red) becomes bimodal with wide spread (1000-3000% range), indicating some samples capture waiting processes while others capture active processes. Source (blue) remains relatively tight.
+
+
+
+**Figure 9: CPU Distribution at Degree 128** - Target PostgreSQL (red) spans nearly 0-10000%, indicating highly variable behavior. Some processes are nearly starved (near 0%) while others burn high CPU on lock spinning (>8000%). This wide distribution suggests lock thrashing.
+
+### 3.2 CPU Time Series
+
+
+
+**Figure 10: CPU Over Time at Degree 4** - All components show stable, smooth CPU usage with minimal oscillations throughout the test duration.
+
+
+
+**Figure 11: CPU Over Time at Degree 32** - Target PostgreSQL (red) shows increasing variability and oscillations, indicating periods of successful lock acquisition alternating with blocking periods.
+
+
+
+**Figure 12: CPU Over Time at Degree 128** - Target PostgreSQL (red) exhibits oscillations with wild CPU swings, suggesting significant lock thrashing. Source (blue) and FastTransfer (green) show variability reflecting downstream backpressure.
+
+## 4. Performance Scaling Analysis: Degrees 64 to 128
+
+### 4.1 Continued Performance Improvement at Extreme Parallelism
+
+Degree 128 achieves the best absolute performance in the test suite, completing the transfer in 67 seconds compared to 92 seconds at degree 64, a meaningful 1.37x speedup that brings total throughput to 1.15 GB/s. While this represents 68.7% efficiency for the doubling operation (rather than the theoretical 2x), the continued improvement demonstrates that the system remains functional and beneficial even at extreme parallelism levels.
+
+### 4.2 Unexpected Efficiency Improvements at Degree 128
+
+Degree 128 exhibits a counterintuitive result: lower system CPU overhead (70.5%) than degree 64 (83.9%) despite doubling parallelism, while total CPU actually decreases by 8.0% (4,596% → 4,230%). User CPU efficiency improves by 82.1% (16.2% → 29.5% of total CPU), meaning nearly double the proportion of CPU time goes to productive work rather than lock contention. The reason for these improvements remains unclear.
+
+**The Comparative Analysis:**
+
+| Metric | Degree 64 | Degree 128 | Change |
+| ----------------------- | -------------------- | -------------------- | -------------------- |
+| Elapsed Time | 92s | 67s | 1.37x speedup |
+| Total CPU | 4,596% | 4,230% | -8.0% |
+| User CPU | 743% (16.2% of total)| 1,248% (29.5% of total) | +68.0% |
+| System CPU | 3,854% (83.9% of total) | 2,982% (70.5% of total) | -22.6% |
+| Network Throughput | 1,033 MB/s mean | 1,088 MB/s mean | +5.3% |
+| Network Peak | 2,335 MB/s (93.4%) | 2,904 MB/s (116.2%) | Saturation |
+| Disk Throughput | 759 MB/s | 1,099 MB/s | +44.8% |
+
+**Open Question: Why Does Efficiency Improve at Degree 128?**
+
+The improvement from degree 64 to 128 is puzzling for several reasons:
+
+1. **Why does network bandwidth increase by 5.3%** (1,033 MB/s → 1,088 MB/s) when adding more parallelism to an already saturated network? At degree 128, network peaks at 2,904 MB/s (116.2% of capacity), yet mean throughput still increases.
+
+2. **Why does system CPU overhead decrease** from 83.9% to 70.5% despite doubling parallelism? More processes should create more lock contention, not less.
+
+3. **Why does user CPU efficiency nearly double** (16.2% → 29.5% of total) when adding 64 more processes competing for the same resources?
+
+One hypothesis is that network saturation at degree 128 acts as a pacing mechanism, rate-limiting data delivery and preventing all 128 processes from simultaneously contending for locks. However, this doesn't fully explain why network throughput itself increases, nor why the efficiency gains are so substantial. The interaction between network saturation, lock contention, and process scheduling appears more complex than initially understood.
+
+## 5. Disk I/O and Network Analysis
+
+### 5.1 Source Disk I/O Analysis
+
+The source instance has 256GB RAM with a Postgres `effective_cache_size` of 192GB, and the lineitem table is ~77GB. An important detail explains the disk behavior across test runs:
+
+Degree 1 was the first test run, with no prior warm-up or cold run to pre-load the table into the cache. During this first run at degree 1, there is heavy disk activity (500 MB/s, ~50% peak utilization) while the table is loaded into memory (shared_buffers + OS page cache). At degrees 2-128, there is essentially zero disk activity; the entire table remains cached in memory from the initial degree 1 load. This explains why degree 2 is more than twice as fast as degree 1: the degree 1 run includes the initial table-loading overhead, while degree 2 benefits from the already-cached table with no disk loading required. The speedup from degree 1 to 2 therefore reflects both the doubling of parallelism and the elimination of the initial cache-loading penalty.
+
+
+
+**Figure 13: Source Disk Utilization Over Time** - Shows disk utilization across all test runs (vertical lines mark test boundaries for degrees 1, 2, 4, 8, 16, 32, 64, 128). At degree 1, utilization peaks at ~50% during the initial table load, then drops to near-zero. At higher degrees (2-128), utilization remains below 1% throughout, confirming the disk is idle and not limiting performance.
+
+Disk utilization measures the percentage of time the disk is busy serving I/O requests. Source disk I/O is not a bottleneck at any parallelism degree.
+
+### 5.2 Target Disk I/O Time Series
+
+
+
+**Figure 14: Target Disk Write Throughput Over Time** - Throughput exhibits bursty behavior with spikes to 2000-3759 MB/s followed by drops to near zero. Sustained baseline varies from ~100 MB/s (low degrees) to ~300 MB/s (degree 128) but never sustains disk capacity.
+
+
+
+**Figure 15: Target Disk Utilization Over Time** - Mean utilization remains below 25% across all degrees. Spikes reach 70-90% during bursts but quickly return to low baseline. This suggests disk I/O is not the bottleneck.
+
+### 5.3 Network Throughput Analysis
+
+
+
+**Figure 16: Target Network Ingress Over Time** - At degree 128, throughput plateaus at ~2,450 MB/s (98% of capacity) during active bursts, but averages only 1,088 MB/s (43.5%) due to alternating active/idle periods. At degrees 1-64, network remains well below capacity.
+
+Network saturation only occurs at degree 128 during active bursts. Therefore, network doesn't explain poor scaling from degree 1 through 64, target CPU lock contention remains the primary bottleneck.
+
+### 5.4 Cross-Degree Scaling Analysis
+
+
+
+**Figure 17: Mean Disk Write Throughput by Degree** - Scales from 90 MB/s (degree 1) to 1,099 MB/s (degree 128), only 12.3x improvement for 128x parallelism (9.6% efficiency).
+
+
+
+**Figure 18: Network Throughput Comparison: Source TX vs Target RX** - At degree 128, source transmits 1,684 MB/s while target receives only 1,088 MB/s, creating a 596 MB/s (35%) deficit. This suggests the target cannot keep pace with source data production, likely due to CPU lock contention.
+
+The apparent 35% violation of flow conservation is explained by TCP retransmissions. The source TX counter (measured via `sar -n DEV`) counts both original packets and retransmitted packets, while the target RX counter only counts successfully received unique packets. When the target is overloaded with CPU lock contention (83.9% system CPU at degree 64), it cannot drain receive buffers fast enough, causing packet drops that trigger TCP retransmissions. The 596 MB/s "deficit" is actually retransmitted data counted twice at the source but only once at the target, providing quantitative evidence of the target's inability to keep pace with source data production.
+
+### 5.5 I/O Analysis Conclusions
+
+1. **Disk does not appear to be the bottleneck**: 24% average utilization at degree 128 with 76% idle capacity. PostgreSQL matches FIO peak (3,759 MB/s) but sustains only 170 MB/s average.
+
+2. **Network does not appear to be the bottleneck for degrees 1-64**: Utilization remains below 42% through degree 64. Saturation occurs only at degree 128 during active bursts (~2,450 MB/s plateau).
+
+3. **Target CPU lock contention appears to be the root cause**: Low disk utilization + network saturation only at degree 128 + poor scaling efficiency throughout + high system CPU percentage (83.9% at degree 64) all point to the same conclusion.
+
+4. **Backpressure suggests target bottleneck**: Source can produce 1,684 MB/s but target can only consume 1,088 MB/s. Source processes use only 0.11 cores/process, suggesting they're blocked waiting for target acknowledgments.
+
+## 6. Conclusions
+
+### 6.1 Performance Achievement and Bottleneck Analysis
+
+FastTransfer successfully demonstrates strong absolute performance, achieving a 13.1x speedup that reduces 77GB transfer time from approximately 15 minutes (878s) to just over 1 minute (67s). This represents practical, production-ready performance with sustained throughput of 1.15 GB/s at degree 128. The system delivers continuous performance improvements across all tested parallelism degrees, confirming that parallel replication provides meaningful benefits even when facing coordination challenges.
+
+The primary scaling limitation appears to be target PostgreSQL lock contention beyond degree 32. System CPU grows to 83.9% at degree 64, meaning only 16.2% of CPU performs productive work. Degree 128 continues to improve absolute performance (67s vs 92s) even as total CPU decreases from 4,596% to 4,230%, though the reason for this unexpected efficiency improvement remains unclear.
+
+### 6.2 Why Additional Tuning Is Unlikely to Help
+
+The target table is already optimally configured (UNLOGGED, no indexes, no constraints, no triggers), and the PostgreSQL configuration includes all recommended bulk loading optimizations (80GB shared_buffers, huge pages, `io_uring`, fsync=off). Despite this, system CPU remains at 70-84% at high degrees.
+
+The bottleneck appears to be architectural rather than configurational: buffer pool partition locks, the relation extension lock, and FSM access. No configuration parameter appears able to eliminate these fundamental coordination requirements.
+
+### 6.3 Future Work: PostgreSQL Instrumentation Analysis
+
+While this analysis relied on system-level metrics, a follow-up study will use PostgreSQL's internal instrumentation to provide direct evidence of lock contention and wait events. This will validate the hypotheses presented in this analysis using database engine-level metrics.
+
+
+## Appendix A: PostgreSQL Configuration
+
+Both PostgreSQL 18 instances were tuned for maximum bulk loading performance.
+
+### Target PostgreSQL Configuration (Key Settings)
+
+```ini
+# Memory allocation
+shared_buffers = 80GB # 31% of 256GB RAM
+huge_pages = on # vm.nr_hugepages=45000
+work_mem = 256MB
+maintenance_work_mem = 16GB
+
+# Durability disabled (benchmark only, NOT production)
+synchronous_commit = off
+fsync = off
+full_page_writes = off
+
+# WAL configuration (minimal for UNLOGGED)
+wal_level = minimal
+wal_buffers = 128MB
+max_wal_size = 128GB
+checkpoint_timeout = 15min
+checkpoint_completion_target = 0.5
+
+# Background writer (AGGRESSIVE)
+bgwriter_delay = 10ms # Down from default 200ms
+bgwriter_lru_maxpages = 2000 # 2x default
+bgwriter_lru_multiplier = 8.0 # 2x default
+bgwriter_flush_after = 0
+
+# I/O configuration (PostgreSQL 18 optimizations)
+backend_flush_after = 0
+effective_io_concurrency = 400 # Optimized for NVMe
+maintenance_io_concurrency = 400
+io_method = io_uring # NEW PG18: async I/O
+io_max_concurrency = 512 # NEW PG18
+io_workers = 8 # NEW PG18: up from default 3
+
+# Worker processes
+max_worker_processes = 128
+max_parallel_workers = 128
+
+# Autovacuum (PostgreSQL 18)
+autovacuum = on
+autovacuum_worker_slots = 32 # NEW PG18: runtime adjustment
+autovacuum_max_workers = 16
+autovacuum_vacuum_cost_delay = 0 # No throttling
+
+# Query tuning
+enable_partitionwise_join = on
+enable_partitionwise_aggregate = on
+random_page_cost = 1.1 # NVMe SSD
+effective_cache_size = 192GB # ~75% of RAM
+```
+
+### Source PostgreSQL Configuration (Key Settings)
+
+The source instance is optimized for fast parallel reads to support high-throughput data extraction:
+
+```ini
+# Memory allocation
+shared_buffers = 80GB # ~31% of 256GB RAM
+huge_pages = on # vm.nr_hugepages=45000
+work_mem = 256MB
+maintenance_work_mem = 4GB # Lower than target (16GB)
+
+# Durability disabled (benchmark only, NOT production)
+synchronous_commit = off
+fsync = off
+full_page_writes = off
+
+# WAL configuration
+wal_level = minimal
+wal_buffers = -1 # Auto-sized
+max_wal_size = 32GB # Smaller than target (128GB)
+checkpoint_timeout = 60min # Longer than target (15min)
+checkpoint_completion_target = 0.9
+
+# Background writer
+bgwriter_delay = 50ms # Less aggressive than target (10ms)
+bgwriter_lru_maxpages = 1000 # Half of target (2000)
+bgwriter_lru_multiplier = 4.0 # Half of target (8.0)
+bgwriter_flush_after = 2MB
+
+# I/O configuration (PostgreSQL 18 optimizations)
+backend_flush_after = 0
+effective_io_concurrency = 400 # Identical to target
+maintenance_io_concurrency = 400
+io_method = io_uring # NEW PG18: async I/O
+io_max_concurrency = 512 # NEW PG18
+io_workers = 8 # NEW PG18
+
+# Worker processes
+max_connections = 500 # Higher than target for parallel readers
+max_worker_processes = 128
+max_parallel_workers_per_gather = 64
+max_parallel_workers = 128
+
+# Query tuning (optimized for parallel reads)
+enable_partitionwise_join = on
+enable_partitionwise_aggregate = on
+random_page_cost = 1.1 # Block Storage (not NVMe)
+effective_cache_size = 192GB # ~75% of RAM
+default_statistics_target = 500
+
+# Autovacuum (PostgreSQL 18)
+autovacuum = on
+autovacuum_worker_slots = 32 # NEW PG18: runtime adjustment
+autovacuum_max_workers = 16
+autovacuum_vacuum_cost_delay = 0 # No throttling
+```
+
+### Table Configuration
+
+The target table eliminates all overhead sources:
+
+- **UNLOGGED**: No WAL write, flush, or archival overhead
+- **No indexes**: Eliminates 50-80% of bulk load cost
+- **No primary key**: No index maintenance or uniqueness checking
+- **No constraints**: No foreign key, check, or unique validation
+- **No triggers**: No trigger execution overhead
+
+This represents the absolute minimum overhead possible.
+
+---
+
+## About FastTransfer
+
+FastTransfer is a commercial high-performance data migration tool developed by [arpe.io](https://arpe.io). It provides parallel data transfer capabilities across multiple database platforms including PostgreSQL, MySQL, Oracle, SQL Server, ClickHouse, and DuckDB.
+
+**Key Features:**
+
+- Advanced parallelization strategies for optimal performance
+- Cross-platform compatibility with major databases
+- Flexible configuration for various data migration scenarios
+- Production-ready with comprehensive logging and monitoring
+
+For licensing information, support options, and to request a trial, visit the [official documentation](https://aetperf.github.io/FastTransfer-Documentation/).
diff --git a/img/2022-11-23_01/mutual_refs_02.jpg b/img/2022-11-23_01/mutual_refs_02.jpg
index a5246bd..2bf848d 100644
Binary files a/img/2022-11-23_01/mutual_refs_02.jpg and b/img/2022-11-23_01/mutual_refs_02.jpg differ
diff --git a/img/2025-07-12_01/output_17_0.png b/img/2025-07-12_01/output_17_0.png
new file mode 100644
index 0000000..91b2c71
Binary files /dev/null and b/img/2025-07-12_01/output_17_0.png differ
diff --git a/img/2025-07-12_01/output_19_0.png b/img/2025-07-12_01/output_19_0.png
new file mode 100644
index 0000000..a50b7df
Binary files /dev/null and b/img/2025-07-12_01/output_19_0.png differ
diff --git a/img/2025-07-12_01/output_21_0.png b/img/2025-07-12_01/output_21_0.png
new file mode 100644
index 0000000..4d18b4a
Binary files /dev/null and b/img/2025-07-12_01/output_21_0.png differ
diff --git a/img/2025-07-12_01/output_26_0.png b/img/2025-07-12_01/output_26_0.png
new file mode 100644
index 0000000..5eff70f
Binary files /dev/null and b/img/2025-07-12_01/output_26_0.png differ
diff --git a/img/2025-09-29_01/transfer_citus_to_pg.jpg b/img/2025-09-29_01/transfer_citus_to_pg.jpg
new file mode 100644
index 0000000..781eaac
Binary files /dev/null and b/img/2025-09-29_01/transfer_citus_to_pg.jpg differ
diff --git a/img/2025-09-29_01/transfer_pg_to_citus.jpg b/img/2025-09-29_01/transfer_pg_to_citus.jpg
new file mode 100644
index 0000000..98395ed
Binary files /dev/null and b/img/2025-09-29_01/transfer_pg_to_citus.jpg differ
diff --git a/img/2025-09-29_02/architecture.jpg b/img/2025-09-29_02/architecture.jpg
new file mode 100644
index 0000000..64800bb
Binary files /dev/null and b/img/2025-09-29_02/architecture.jpg differ
diff --git a/img/2025-09-29_02/transfer_s3_to_postgres_comparison.jpg b/img/2025-09-29_02/transfer_s3_to_postgres_comparison.jpg
new file mode 100644
index 0000000..e14631a
Binary files /dev/null and b/img/2025-09-29_02/transfer_s3_to_postgres_comparison.jpg differ
diff --git a/img/2025-09-29_03/architecture.jpg b/img/2025-09-29_03/architecture.jpg
new file mode 100644
index 0000000..d5d99b0
Binary files /dev/null and b/img/2025-09-29_03/architecture.jpg differ
diff --git a/img/2025-09-29_03/lineitem_elapsed_time.jpg b/img/2025-09-29_03/lineitem_elapsed_time.jpg
new file mode 100644
index 0000000..155509b
Binary files /dev/null and b/img/2025-09-29_03/lineitem_elapsed_time.jpg differ
diff --git a/img/2025-09-29_03/lineitem_throughput.jpg b/img/2025-09-29_03/lineitem_throughput.jpg
new file mode 100644
index 0000000..8b2c6c4
Binary files /dev/null and b/img/2025-09-29_03/lineitem_throughput.jpg differ
diff --git a/img/2025-10-25_01/2019_DEC_FR_ROUBAIX_IMAGE_0242_BD.jpg b/img/2025-10-25_01/2019_DEC_FR_ROUBAIX_IMAGE_0242_BD.jpg
new file mode 100644
index 0000000..2ff8b38
Binary files /dev/null and b/img/2025-10-25_01/2019_DEC_FR_ROUBAIX_IMAGE_0242_BD.jpg differ
diff --git a/img/2025-10-25_01/architecture.png b/img/2025-10-25_01/architecture.png
new file mode 100644
index 0000000..29d8b13
Binary files /dev/null and b/img/2025-10-25_01/architecture.png differ
diff --git a/img/2025-10-25_01/cross_degree_disk_write_mean.png b/img/2025-10-25_01/cross_degree_disk_write_mean.png
new file mode 100644
index 0000000..13c2804
Binary files /dev/null and b/img/2025-10-25_01/cross_degree_disk_write_mean.png differ
diff --git a/img/2025-10-25_01/cross_degree_network_comparison.png b/img/2025-10-25_01/cross_degree_network_comparison.png
new file mode 100644
index 0000000..2a015d6
Binary files /dev/null and b/img/2025-10-25_01/cross_degree_network_comparison.png differ
diff --git a/img/2025-10-25_01/elapsed_time_by_degree.png b/img/2025-10-25_01/elapsed_time_by_degree.png
new file mode 100644
index 0000000..6f1664e
Binary files /dev/null and b/img/2025-10-25_01/elapsed_time_by_degree.png differ
diff --git a/img/2025-10-25_01/plot_01_mean_cpu.png b/img/2025-10-25_01/plot_01_mean_cpu.png
new file mode 100644
index 0000000..2929099
Binary files /dev/null and b/img/2025-10-25_01/plot_01_mean_cpu.png differ
diff --git a/img/2025-10-25_01/plot_02_peak_cpu.png b/img/2025-10-25_01/plot_02_peak_cpu.png
new file mode 100644
index 0000000..0bb1780
Binary files /dev/null and b/img/2025-10-25_01/plot_02_peak_cpu.png differ
diff --git a/img/2025-10-25_01/plot_03_fasttransfer_user_system.png b/img/2025-10-25_01/plot_03_fasttransfer_user_system.png
new file mode 100644
index 0000000..ea430b3
Binary files /dev/null and b/img/2025-10-25_01/plot_03_fasttransfer_user_system.png differ
diff --git a/img/2025-10-25_01/plot_06_system_cpu_percentage.png b/img/2025-10-25_01/plot_06_system_cpu_percentage.png
new file mode 100644
index 0000000..7b9beca
Binary files /dev/null and b/img/2025-10-25_01/plot_06_system_cpu_percentage.png differ
diff --git a/img/2025-10-25_01/plot_10_timeseries_degree_4.png b/img/2025-10-25_01/plot_10_timeseries_degree_4.png
new file mode 100644
index 0000000..46a2693
Binary files /dev/null and b/img/2025-10-25_01/plot_10_timeseries_degree_4.png differ
diff --git a/img/2025-10-25_01/plot_11_timeseries_degree_32.png b/img/2025-10-25_01/plot_11_timeseries_degree_32.png
new file mode 100644
index 0000000..b136e05
Binary files /dev/null and b/img/2025-10-25_01/plot_11_timeseries_degree_32.png differ
diff --git a/img/2025-10-25_01/plot_12_timeseries_degree_128.png b/img/2025-10-25_01/plot_12_timeseries_degree_128.png
new file mode 100644
index 0000000..7928eb7
Binary files /dev/null and b/img/2025-10-25_01/plot_12_timeseries_degree_128.png differ
diff --git a/img/2025-10-25_01/plot_7_distribution_degree_4.png b/img/2025-10-25_01/plot_7_distribution_degree_4.png
new file mode 100644
index 0000000..a5c4252
Binary files /dev/null and b/img/2025-10-25_01/plot_7_distribution_degree_4.png differ
diff --git a/img/2025-10-25_01/plot_8_distribution_degree_32.png b/img/2025-10-25_01/plot_8_distribution_degree_32.png
new file mode 100644
index 0000000..d2ff8bf
Binary files /dev/null and b/img/2025-10-25_01/plot_8_distribution_degree_32.png differ
diff --git a/img/2025-10-25_01/plot_9_distribution_degree_128.png b/img/2025-10-25_01/plot_9_distribution_degree_128.png
new file mode 100644
index 0000000..6a4d267
Binary files /dev/null and b/img/2025-10-25_01/plot_9_distribution_degree_128.png differ
diff --git a/img/2025-10-25_01/source_disk_utilization_timeseries.png b/img/2025-10-25_01/source_disk_utilization_timeseries.png
new file mode 100644
index 0000000..6250679
Binary files /dev/null and b/img/2025-10-25_01/source_disk_utilization_timeseries.png differ
diff --git a/img/2025-10-25_01/target_disk_utilization_timeseries.png b/img/2025-10-25_01/target_disk_utilization_timeseries.png
new file mode 100644
index 0000000..6dde077
Binary files /dev/null and b/img/2025-10-25_01/target_disk_utilization_timeseries.png differ
diff --git a/img/2025-10-25_01/target_disk_write_throughput_timeseries.png b/img/2025-10-25_01/target_disk_write_throughput_timeseries.png
new file mode 100644
index 0000000..8d89552
Binary files /dev/null and b/img/2025-10-25_01/target_disk_write_throughput_timeseries.png differ
diff --git a/img/2025-10-25_01/target_network_rx_timeseries.png b/img/2025-10-25_01/target_network_rx_timeseries.png
new file mode 100644
index 0000000..83a1fb5
Binary files /dev/null and b/img/2025-10-25_01/target_network_rx_timeseries.png differ