This document provides a high-level introduction to Apache Cloudberry, a distributed massively parallel processing (MPP) database system. It covers the overall architecture, core subsystems, and their interactions. For detailed information about specific subsystems, refer to the subsystem pages listed in the summary table at the end of this overview.
Apache Cloudberry is an open-source distributed relational database management system derived from PostgreSQL and Greenplum. It extends PostgreSQL's single-node architecture with MPP capabilities for parallel query execution across multiple segment servers. The system maintains PostgreSQL compatibility while adding distributed transaction management, parallel query optimization, and shared-nothing data partitioning.
Sources: src/include/catalog/catversion.h:54-56, README.md architecture diagrams
Cloudberry builds on PostgreSQL 14 as its foundation, inheriting the query processing pipeline, storage engine, and transaction management systems. The catalog version identifier uses a "3" prefix to distinguish Greenplum/Cloudberry versions from PostgreSQL.
From Greenplum, Cloudberry inherits the distributed MPP architecture, including the Query Dispatcher/Query Executor (QD/QE) model, interconnect layer for data movement, and GUC synchronization mechanisms for distributed configuration management.
Sources: src/include/catalog/catversion.h:44-60, src/backend/utils/misc/guc_gp.c:1-16
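The QD/QE model can be sketched as a scatter/gather pattern: the coordinator dispatches a plan to every segment, each segment executes it against only its local slice of the data, and the coordinator merges the partial results. A minimal illustration in Python (the segment data and function names are hypothetical, not Cloudberry APIs):

```python
# Toy model of the Query Dispatcher / Query Executor (QD/QE) pattern.
# Each "segment" holds only its own slice of the table (shared-nothing).
segments = [
    [("apple", 3), ("cherry", 7)],   # segment 0's local rows
    [("banana", 5)],                 # segment 1's local rows
    [("date", 2), ("elder", 9)],     # segment 2's local rows
]

def execute_on_segment(rows, predicate):
    """QE side: run the plan against the segment's local data only."""
    return [row for row in rows if predicate(row)]

def dispatch(predicate):
    """QD side: send the plan to every segment, then gather the results."""
    partials = [execute_on_segment(rows, predicate) for rows in segments]
    return [row for part in partials for row in part]  # gather/merge

# "SELECT * FROM fruit WHERE qty > 4" executed in parallel across segments
result = dispatch(lambda row: row[1] > 4)
print(result)   # -> [('cherry', 7), ('banana', 5), ('elder', 9)]
```

The key property this mimics is that no segment ever reads another segment's rows; all cross-segment coordination happens at the dispatcher.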
Apache Cloudberry follows a shared-nothing MPP architecture with these primary layers:
| Layer | Components | Purpose |
|---|---|---|
| Client | psql, libpq, pg_dump | Connection and command submission |
| Connection | postmaster, authentication | Process forking and security |
| Coordinator (QD) | Parser, Planner, Dispatcher | Query analysis and coordination |
| Segments (QE) | Executors, Storage | Parallel data processing |
| Storage | Heap, Indexes, WAL | Data persistence and durability |
| Background | bgwriter, checkpointer, walwriter, autovacuum | Maintenance operations |
Sources: Architecture Diagram 1, src/backend/postmaster/postmaster.c:1-66
The postmaster process (src/backend/postmaster/postmaster.c:1-66) is the main server process that:

- Listens for incoming client connections
- Forks a dedicated backend process for each connection via BackendStartup()
- Launches and supervises the background maintenance processes

Each backend process runs PostgresMain() (src/backend/tcop/postgres.c:1-18) to execute the query processing loop.
Sources: src/backend/postmaster/postmaster.c:1-373, src/backend/tcop/postgres.c:113-152, Architecture Diagram 1
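The fork-per-connection model above can be sketched in a few lines of Python. This is a POSIX-only toy, not the real postmaster: the function names deliberately echo BackendStartup() and PostgresMain(), but the protocol is a single echo-style request/response over a socketpair standing in for a client connection.

```python
import os
import socket

def postgres_main(conn):
    """Simplified stand-in for PostgresMain(): serve one query, then exit."""
    query = conn.recv(1024).decode()
    conn.sendall(("result of " + query).encode())
    conn.close()

def backend_startup(conn):
    """Simplified stand-in for BackendStartup(): fork a backend per connection."""
    pid = os.fork()
    if pid == 0:              # child process: becomes the backend
        postgres_main(conn)
        os._exit(0)
    conn.close()              # parent (postmaster) drops its copy of the socket
    return pid

# emulate a single client connection with a socketpair (POSIX only)
client, server = socket.socketpair()
pid = backend_startup(server)
client.sendall(b"SELECT 1")
response = client.recv(1024).decode()
os.waitpid(pid, 0)
print(response)
```

The design point this captures is isolation: each connection gets its own process, so a crashing backend cannot corrupt the postmaster's state.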
The query processing pipeline transforms SQL text through multiple stages:
- base_yyparse() converts SQL text into a raw parse tree using the grammar rules in gram.y
- The analyzer and rewriter transform the parse tree into a query tree
- The planner() function generates execution plans (via the standard planner or the ORCA optimizer)
- The executor runs the chosen plan

Sources: src/backend/parser/gram.y:1-110, src/backend/optimizer/plan/planner.c:1-89, src/backend/optimizer/path/allpaths.c:1-22, src/backend/optimizer/plan/createplan.c:1-42, Architecture Diagram 2
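The staged shape of the pipeline can be sketched as a chain of functions. This is purely illustrative: the data structures are invented for this example, and each function is a crude stand-in for the real stage named in its docstring.

```python
# Toy query pipeline mirroring the stages above:
# parse -> plan -> execute. Structures are illustrative, not PostgreSQL's.
def parse(sql):
    """Stand-in for base_yyparse(): split text into a crude parse tree."""
    verb, _, rest = sql.partition(" ")
    return {"stmt": verb.upper(), "args": rest.split()}

def plan(tree):
    """Stand-in for planner(): pick an access path for the statement."""
    path = "SeqScan" if tree["stmt"] == "SELECT" else "ModifyTable"
    return {"path": path, "tree": tree}

def execute(plan_node):
    """Stand-in for the executor: interpret the chosen plan node."""
    return plan_node["path"] + " over " + " ".join(plan_node["tree"]["args"])

result = execute(plan(parse("select * from t")))
print(result)   # -> "SeqScan over * from t"
```

Each stage consumes the previous stage's output, which is why the real system can swap planners (standard vs. ORCA) without touching the parser or executor.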
All data modifications flow through the Write-Ahead Log (WAL) to ensure durability and enable crash recovery:
The WAL subsystem (src/backend/access/transam/xlog.c:1-13) is the foundation for ACID properties:
- XLogInsertRecord() ensures all changes are logged before data pages are modified
- XLogWrite() and XLogFlush() persist records to disk
- StartupXLOG() replays WAL records after a crash
- CreateCheckPoint() periodically flushes dirty buffers to reduce recovery time

WAL records are written to segment files in pg_wal/ with a typical size of 16MB per segment.
Sources: src/backend/access/transam/xlog.c:1-100, src/backend/storage/buffer/bufmgr.c, Architecture Diagram 3
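The log-before-data rule and crash replay can be demonstrated with a toy in-memory model. The names echo XLogInsertRecord() and StartupXLOG() but the structures (a list for the log, a dict for pages) are invented for illustration only.

```python
# Toy write-ahead log: every change is appended to the log *before* the
# data page is touched, so a crash can be repaired by replaying the log.
wal = []     # stand-in for the pg_wal/ segment files
pages = {}   # stand-in for data pages on disk

def xlog_insert(page, value):
    """Stand-in for XLogInsertRecord(): log first, then modify the page."""
    wal.append((page, value))   # the record reaches the log first...
    pages[page] = value         # ...and only then is the page changed

def startup_xlog(log):
    """Stand-in for StartupXLOG(): rebuild pages by replaying the log."""
    recovered = {}
    for page, value in log:     # replay in original order
        recovered[page] = value
    return recovered

xlog_insert("t1/page0", "row A")
xlog_insert("t1/page0", "row A v2")   # later update wins on replay
xlog_insert("t1/page1", "row B")

pages.clear()                  # simulate losing the data pages in a crash
pages = startup_xlog(wal)      # crash recovery replays the WAL
print(pages)
```

Because the log is ordered and complete, replaying it reproduces exactly the page state that existed before the simulated crash.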
Cloudberry extends PostgreSQL with distributed query execution:
- CdbDispatchPlan() (src/backend/cdb/dispatcher/cdbdisp_query.c) sends plans from the QD to the QE segments for parallel execution
- Table data is distributed across segments according to the DISTRIBUTED BY clause

Sources: src/backend/cdb/dispatcher/cdbdisp_query.c, src/backend/utils/misc/guc_gp.c:112-116, src/backend/cdb/cdbllize.c, Architecture Diagram 7
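The effect of DISTRIBUTED BY can be sketched as hash routing: each row's distribution key is hashed to pick exactly one segment. The CRC-plus-modulo scheme below is only for illustration; the real system uses its own hash algorithm and distribution policies.

```python
# Toy model of DISTRIBUTED BY: each row is routed to exactly one segment
# by hashing its distribution key. (Illustrative hash scheme only.)
import zlib

NUM_SEGMENTS = 3

def segment_for(key):
    """Map a distribution-key value to a segment id."""
    return zlib.crc32(str(key).encode()) % NUM_SEGMENTS

rows = [("alice", 10), ("bob", 20), ("carol", 30), ("dave", 40)]
segments = {seg: [] for seg in range(NUM_SEGMENTS)}
for row in rows:
    segments[segment_for(row[0])].append(row)   # DISTRIBUTED BY (name)

# Every row lives on exactly one segment; together the segments
# hold the whole table, with no row duplicated.
total = sum(len(seg_rows) for seg_rows in segments.values())
print(total)   # -> 4
```

Because routing depends only on the key, both inserts and key-equality lookups can go straight to the owning segment without consulting the others.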
The Grand Unified Configuration (GUC) system manages all server parameters:
| Component | File | Purpose |
|---|---|---|
| Core GUC | guc.c | PostgreSQL parameters |
| GP Extensions | guc_gp.c | Cloudberry/Greenplum-specific parameters |
| Configuration File | postgresql.conf | Persistent settings |
| Auto Config | postgresql.auto.conf | ALTER SYSTEM settings |
| GUC Restore List | gp_guc_restore_list | QD to QE synchronization |
Parameters can be set at multiple levels (server start, per-database, per-role, per-session) with different contexts (POSTMASTER, SIGHUP, BACKEND, USERSET).
In distributed mode, the QD maintains gp_guc_restore_list (src/backend/utils/misc/guc_gp.c:115-116) to track GUC changes that must be dispatched to QE segments.
Sources: src/backend/utils/misc/guc.c:1-18, src/backend/utils/misc/guc_gp.c:1-116, doc/src/sgml/config.sgml:1-50
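The restore-list idea can be sketched as follows: the QD records which parameters a session has changed and replays only those on each QE before dispatching work. All names and structures here are hypothetical stand-ins, not the actual guc_gp.c implementation.

```python
# Toy model of QD -> QE GUC synchronization. The list of changed
# parameter names is a stand-in for gp_guc_restore_list.
qd_gucs = {"work_mem": "4MB", "statement_timeout": "0"}
guc_restore_list = []   # names of GUCs changed in this session

def set_guc(name, value):
    """SET on the QD: apply locally and remember the change for the QEs."""
    qd_gucs[name] = value
    if name not in guc_restore_list:
        guc_restore_list.append(name)

def sync_to_qe(qe_gucs):
    """Before dispatch: replay only the changed GUCs on a QE's settings."""
    for name in guc_restore_list:
        qe_gucs[name] = qd_gucs[name]
    return qe_gucs

set_guc("work_mem", "64MB")
qe = sync_to_qe({"work_mem": "4MB", "statement_timeout": "0"})
print(qe["work_mem"])   # -> "64MB"
```

Tracking only the delta keeps dispatch cheap: unchanged parameters never need to cross the interconnect.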
PostgreSQL system catalogs store metadata about database objects:
Cloudberry adds Greenplum-specific catalogs:
The catalog version (src/include/catalog/catversion.h:59) identifies schema compatibility.
Sources: src/include/catalog/pg_proc.h:1-162, src/include/catalog/catversion.h:44-60, doc/src/sgml/catalogs.sgml:1-78
The build system supports multiple platforms:
- The primary build flow is ./configure and make, driven by configure.ac
- ./configure flags enable features like --enable-orca, --with-ssl, --enable-debug

Sources: configure.ac, GNUmakefile, src/tools/msvc/, Architecture Diagram 5
This overview introduced the major components. For detailed documentation, see:
| Subsystem | Page | Description |
|---|---|---|
| Query Planning | #3.6.2 | Standard planner and ORCA optimizer |
| Transaction Management | #3.4 | MVCC, 2PC, distributed transactions |
| Storage Engine | #3.5 | Heap tables, indexes, visibility maps |
| WAL and Recovery | #3.3 | Write-ahead logging and crash recovery |
| Client Libraries | #4.1 | libpq API and frontend/backend protocol |
| Replication | #5 | Physical and logical replication |
| Autovacuum | #6.3 | Automatic maintenance |
| Distributed Execution | #7 | QD/QE coordination and interconnect |