pgparquet
A quick and dirty CLI to read Parquet files from Google Cloud Storage (GCS) and stream them into PostgreSQL. Other storage backends may be added in the future.

The pg_parquet extension is great, but it cannot be installed on hosted PostgreSQL providers (e.g., GCP). DuckDB can read Parquet and write to PostgreSQL, but it doesn't support Google Application Default Credentials (ADC), which makes authentication more challenging.
> [!NOTE]
> This project is a prototype as I learn Rust - there may be bugs or inefficiencies. Feel free to contribute!
Features
- Streaming Processing: Efficiently streams large Parquet files without loading them entirely into memory
- High-Performance COPY: Uses PostgreSQL's COPY command for optimal bulk loading performance
- Automatic Schema Mapping: Converts Arrow/Parquet schemas to PostgreSQL table schemas
- Batch Processing: Configurable batch size for optimal performance
- Table Management: Can create tables automatically and optionally truncate before loading
- Error Handling: Comprehensive error handling with detailed logging
- Authentication: Uses Google Application Default Credentials (ADC)
Prerequisites
- Rust: Install Rust from rustup.rs
- PostgreSQL: Access to a PostgreSQL database
- Google Cloud Authentication: Set up Application Default Credentials (automatic when running in GCP)
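For local development outside GCP, one way to set up ADC is with the gcloud CLI:

```bash
# Authenticate once; the credentials are cached locally and picked up
# automatically via Application Default Credentials
gcloud auth application-default login
```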
Installation
```bash
cd pgparquet
cargo build --release
```
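The compiled binary lands in `target/release`. As a quick sanity check (assuming the standard `--help` flag):

```bash
./target/release/pgparquet --help
```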
Usage
Basic Usage
Create a new table and load a single parquet file:
```bash
pgparquet \
  --path gs://my-bucket/data/single-file.parquet \
  --database-url "postgresql://user:password@localhost:5432/mydb" \
  --table analytics.my_table \
  --create-table
```
Wipe a table and load all parquet files from a folder:
```bash
pgparquet \
  --path gs://my-bucket/data/parquet-files/ \
  --database-url "postgresql://user:password@localhost:5432/mydb" \
  --table analytics.my_table \
  --truncate
```
Command Line Options
- `--path`, `-p`: GCS path; end with `.parquet` for a single file or `/` for a folder. Examples: `gs://bucket/file.parquet` or `gs://bucket/folder/`
- `--database-url`, `-d`: PostgreSQL connection string (required)
- `--table`, `-t`: Target table name in PostgreSQL (can include a schema: `schema.table`)
- `--batch-size`: Number of records to process in each batch (default: 1000)
- `--create-table`: Create the table if it doesn't exist
- `--truncate`: Truncate the table before loading data
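For illustration, a run combining several of these options (the bucket and table names are placeholders):

```bash
pgparquet \
  --path gs://my-bucket/data/parquet-files/ \
  --database-url "postgresql://user:password@localhost:5432/mydb" \
  --table analytics.big_table \
  --create-table \
  --batch-size 5000
```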
Environment Variables
You can also use environment variables for sensitive information or the log level:
```bash
export DATABASE_URL="postgresql://user:password@localhost:5432/mydb"
export RUST_LOG=info  # Set logging level
```
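`RUST_LOG` accepts the usual Rust log levels (`error`, `warn`, `info`, `debug`, `trace`) and can also be set for a single run:

```bash
# More verbose logging for one invocation
RUST_LOG=debug pgparquet \
  --path gs://my-bucket/data/single-file.parquet \
  --database-url "$DATABASE_URL" \
  --table analytics.my_table
```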
Data Type Mapping
The tool automatically maps Arrow/Parquet data types to PostgreSQL types:
| Arrow/Parquet Type | PostgreSQL Type |
|---|---|
| Boolean | BOOLEAN |
| Int8, Int16 | SMALLINT |
| Int32 | INTEGER |
| Int64 | BIGINT |
| UInt64 | NUMERIC |
| Float32 | REAL |
| Float64 | DOUBLE PRECISION |
| Utf8, LargeUtf8 | TEXT |
| Binary, LargeBinary | BYTEA |
| Date32, Date64 | DATE |
| Time32, Time64 | TIME |
| Timestamp | TIMESTAMP |
| Decimal128, Decimal256 | NUMERIC |
| List, Struct, Map | JSONB |
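As a hypothetical illustration (the column names are made up, and nullability and precision details may differ), a Parquet file with columns `id: Int64`, `name: Utf8`, `price: Decimal128`, and `created_at: Timestamp` would map to a table equivalent to:

```bash
psql "$DATABASE_URL" -c "
  CREATE TABLE analytics.my_table (
    id         BIGINT,
    name       TEXT,
    price      NUMERIC,
    created_at TIMESTAMP
  );"
```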
Performance Considerations
- COPY Command: The tool uses PostgreSQL's COPY command which is significantly faster than INSERT statements for bulk loading
- Batch Size: Larger batch sizes can improve performance but use more memory
- Buffer Size: Data is buffered in 5MiB chunks before being sent via COPY
- Network: Ensure good network connectivity between your application and both GCS and PostgreSQL
- PostgreSQL Configuration: Consider adjusting PostgreSQL settings for bulk loading:
  - `shared_buffers`
  - `maintenance_work_mem`
  - `checkpoint_segments` (replaced by `max_wal_size` in PostgreSQL 9.5+)
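A sketch of instance-wide tuning for a bulk load (the values are illustrative, `ALTER SYSTEM` requires superuser, and hosted providers such as Cloud SQL usually expose these settings as database flags instead):

```bash
psql "$DATABASE_URL" <<'SQL'
ALTER SYSTEM SET maintenance_work_mem = '512MB';
ALTER SYSTEM SET max_wal_size = '4GB';  -- fewer checkpoints during the load
SELECT pg_reload_conf();                -- apply settings that don't need a restart
SQL
```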
Troubleshooting
Authentication Issues
- Verify your Google Cloud credentials:

  ```bash
  gcloud auth application-default print-access-token
  ```

- Check that your user or service account has the necessary permissions:
  - `storage.objects.get`
  - `storage.objects.list`
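Both permissions are included in the `roles/storage.objectViewer` role, which can be granted at the bucket level (the bucket and service account below are placeholders):

```bash
gcloud storage buckets add-iam-policy-binding gs://my-bucket \
  --member="serviceAccount:my-sa@my-project.iam.gserviceaccount.com" \
  --role="roles/storage.objectViewer"
```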
PostgreSQL Connection Issues
- Verify the connection string format:

  ```
  postgresql://[user[:password]@][host][:port][/dbname][?param1=value1&...]
  ```

- Test the connection manually:

  ```bash
  psql "postgresql://user:password@host:5432/dbname"
  ```
Performance Issues
- Monitor memory usage with larger batch sizes
- Check network latency to both GCS and PostgreSQL
- Consider running closer to your data (same region)
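Rough connectivity and latency checks (illustrative; substitute your own bucket):

```bash
time psql "$DATABASE_URL" -c "SELECT 1;"      # round trip to PostgreSQL
time gcloud storage ls gs://my-bucket/data/   # GCS listing latency
```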
License
This project is licensed under the MIT License.