Data Distribution and Replication

This page describes how Apache Fluss distributes and replicates data across a cluster. It covers bucket-based data distribution, partition management, the replication model, and how Fluss ensures high availability through leader-follower replication and the In-Sync Replica (ISR) mechanism.

Overview

Fluss distributes data using a two-level hierarchy:

Partitions: Optional logical grouping of data based on partition keys (e.g., date). Each partition contains its own set of buckets fluss-server/src/main/java/org/apache/fluss/server/coordinator/CoordinatorEventProcessor.java45-48
Buckets: The fundamental unit of data distribution and replication. Each table (or partition) is divided into a configurable number of buckets fluss-server/src/main/java/org/apache/fluss/server/replica/Replica.java147-148

Data is replicated across multiple TabletServer nodes to ensure fault tolerance. Each bucket has one leader replica and zero or more follower replicas. The leader handles all read and write operations, while followers replicate data from the leader using a dedicated fetcher mechanism fluss-server/src/main/java/org/apache/fluss/server/replica/Replica.java150-153

Bucket-Based Distribution

Bucket Assignment

Every table in Fluss is divided into a fixed number of buckets, specified at table creation time. If no bucket count is given, the cluster-level default.bucket.number setting applies fluss-common/src/main/java/org/apache/fluss/config/ConfigOptions.java74-82. Records are assigned to buckets by a deterministic hash function applied to the bucket key, so records with the same bucket key always land in the same bucket. The ReplicaManager on the TabletServer manages the resulting physical data structures and their lifecycles fluss-server/src/main/java/org/apache/fluss/server/replica/ReplicaManager.java86-109
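The hash-based assignment described above can be sketched as follows. This is a minimal illustration under stated assumptions: the class BucketAssignerSketch, the method bucketFor, and the use of Arrays.hashCode are all hypothetical stand-ins, and this is not the hash function Fluss itself uses. It only shows the general property that the same key bytes always map to the same bucket index.

```java
import java.nio.charset.StandardCharsets;
import java.util.Arrays;

/** Illustrative only: a hypothetical hash-mod assigner, not the hash function Fluss uses. */
final class BucketAssignerSketch {

    /** Maps an encoded bucket key to a bucket index in the range [0, bucketCount). */
    static int bucketFor(byte[] bucketKey, int bucketCount) {
        if (bucketCount <= 0) {
            throw new IllegalArgumentException("bucketCount must be positive");
        }
        // floorMod keeps the result non-negative even when the hash is negative.
        return Math.floorMod(Arrays.hashCode(bucketKey), bucketCount);
    }

    public static void main(String[] args) {
        byte[] key = "user-42".getBytes(StandardCharsets.UTF_8);
        System.out.println("bucket = " + bucketFor(key, 4));
    }
}
```

Because the bucket count is part of the modulus, changing it would remap existing keys, which is consistent with the bucket count being fixed at table creation time.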

The following diagram shows how the CoordinatorServer and TabletServer interact to manage data distribution via state machines and metadata storage in ZooKeeper.

[Diagram: Data Distribution Code Map]

Sources: fluss-server/src/main/java/org/apache/fluss/server/coordinator/CoordinatorEventProcessor.java95-107 fluss-server/src/main/java/org/apache/fluss/server/replica/ReplicaManager.java89-109 fluss-server/src/main/java/org/apache/fluss/server/replica/Replica.java147-163 fluss-server/src/main/java/org/apache/fluss/server/kv/KvTablet.java107-150

Bucket Key Selection

Primary Key Tables: By default, the bucket key is the primary key. Every replica of a PK table holds a LogTablet for the changelog; the leader replica additionally holds a KvTablet for the current state fluss-server/src/main/java/org/apache/fluss/server/replica/Replica.java150-152
Log Tables: If no bucket key is specified, records are spread across buckets without key-based hashing. These replicas contain only a LogTablet fluss-server/src/main/java/org/apache/fluss/server/replica/Replica.java152-153
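The rule about which storage components a replica holds can be written down as a small decision function. The class and method names here are hypothetical and only restate the description above; they are not Fluss APIs.

```java
/** Hypothetical sketch; the names below are illustrative, not Fluss APIs. */
final class ReplicaStorageSketch {

    /** Lists the storage components a replica would hold under the rule described above. */
    static String components(boolean primaryKeyTable, boolean leader) {
        // Every replica keeps a log; only the leader of a primary key table adds a KV store.
        return (primaryKeyTable && leader) ? "LogTablet+KvTablet" : "LogTablet";
    }

    public static void main(String[] args) {
        System.out.println("PK table, leader:   " + components(true, true));
        System.out.println("PK table, follower: " + components(true, false));
        System.out.println("Log table, leader:  " + components(false, true));
    }
}
```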

Replication Architecture

Fluss uses a leader-based replication model. The CoordinatorServer manages the lifecycle of buckets and replicas through state transitions.

Replica States and Management

The ReplicaManager on each TabletServer handles the physical instantiation of replicas. It uses NotifyLeaderAndIsrData to transition replicas between leader and follower roles fluss-server/src/main/java/org/apache/fluss/server/replica/ReplicaManager.java74-75

Leader: Handles all read/write requests. It manages the high watermark and sequence numbers for exactly-once semantics fluss-server/src/main/java/org/apache/fluss/server/replica/Replica.java150-152
Follower: Periodically fetches data from the leader to maintain sync via ReplicaFetcherThread fluss-server/src/test/java/org/apache/fluss/server/replica/fetcher/ReplicaFetcherThreadTest.java137-142
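As a rough model of the role transition, the sketch below shows a replica reacting to a leader notification. Everything in it is an assumption made for illustration: the class ReplicaRoleSketch, the onNotify method, and the rule that a notification with an epoch not newer than the current one is ignored. It does not reproduce the fields of NotifyLeaderAndIsrData or the actual ReplicaManager logic.

```java
/** Hypothetical model of a replica switching between leader and follower roles. */
final class ReplicaRoleSketch {

    enum Role { LEADER, FOLLOWER }

    private final int localServerId;
    private Role role = Role.FOLLOWER;
    private int leaderEpoch = -1;

    ReplicaRoleSketch(int localServerId) {
        this.localServerId = localServerId;
    }

    /** Applies a notification; returns false when its epoch is stale and it is ignored. */
    boolean onNotify(int leaderId, int newLeaderEpoch) {
        if (newLeaderEpoch <= leaderEpoch) {
            return false; // assumed rule: only strictly newer epochs are accepted
        }
        leaderEpoch = newLeaderEpoch;
        role = (leaderId == localServerId) ? Role.LEADER : Role.FOLLOWER;
        return true;
    }

    Role role() {
        return role;
    }

    int leaderEpoch() {
        return leaderEpoch;
    }

    public static void main(String[] args) {
        ReplicaRoleSketch replica = new ReplicaRoleSketch(1);
        replica.onNotify(1, 0);
        System.out.println(replica.role() + " at epoch " + replica.leaderEpoch());
        replica.onNotify(2, 1);
        System.out.println(replica.role() + " at epoch " + replica.leaderEpoch());
    }
}
```

In this model a follower would start fetching from the new leader after the transition; the fetch loop itself is left out.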

[Diagram: Replica Leadership Transition]

Sources: fluss-server/src/main/java/org/apache/fluss/server/coordinator/CoordinatorEventProcessor.java112-113 fluss-server/src/main/java/org/apache/fluss/server/replica/ReplicaManager.java74-75 fluss-server/src/test/java/org/apache/fluss/server/replica/fetcher/ReplicaFetcherThreadTest.java137-142

In-Sync Replicas (ISR) Mechanism

The ISR is the set of replicas that are currently synchronized with the leader.

Fetching and ISR Maintenance

Followers pull data from the leader via ReplicaFetcherThread.

Leader Epoch: Every leadership change increments the leaderEpoch to prevent data inconsistency during network partitions fluss-server/src/main/java/org/apache/fluss/server/replica/Replica.java24
ISR Adjustment: If a follower falls behind, the leader sends an AdjustIsrRequest to the coordinator to remove it from the ISR fluss-server/src/main/java/org/apache/fluss/server/coordinator/CoordinatorEventProcessor.java51
High Watermark: The leader maintains a highWatermark, which is the maximum offset that has been replicated to all replicas in the ISR fluss-server/src/main/java/org/apache/fluss/server/replica/ReplicaManager.java139
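The two calculations behind ISR maintenance can be sketched in a few lines. This is a simplified model under stated assumptions, not the Fluss implementation: IsrSketch, highWatermark, laggingFollowers, and the time-based lag threshold maxLagMs are hypothetical names, and the high watermark is modeled as the smallest log end offset among ISR members (the leader included).

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Map;
import java.util.TreeMap;

/** Hypothetical sketch of high watermark and ISR shrink calculations. */
final class IsrSketch {

    /** High watermark modeled as the smallest log end offset among all ISR members. */
    static long highWatermark(Map<Integer, Long> isrLogEndOffsets) {
        if (isrLogEndOffsets.isEmpty()) {
            throw new IllegalArgumentException("ISR must not be empty");
        }
        long hw = Long.MAX_VALUE;
        for (long logEndOffset : isrLogEndOffsets.values()) {
            hw = Math.min(hw, logEndOffset);
        }
        return hw;
    }

    /** Followers whose last caught-up timestamp is older than maxLagMs, in server id order. */
    static List<Integer> laggingFollowers(
            Map<Integer, Long> lastCaughtUpMs, long nowMs, long maxLagMs) {
        List<Integer> lagging = new ArrayList<>();
        for (Map.Entry<Integer, Long> e : new TreeMap<>(lastCaughtUpMs).entrySet()) {
            if (nowMs - e.getValue() > maxLagMs) {
                lagging.add(e.getKey());
            }
        }
        return lagging;
    }

    public static void main(String[] args) {
        Map<Integer, Long> logEndOffsets = new TreeMap<>();
        logEndOffsets.put(1, 120L); // leader
        logEndOffsets.put(2, 100L);
        logEndOffsets.put(3, 115L);
        System.out.println("high watermark = " + highWatermark(logEndOffsets));
    }
}
```

In this model, removing a slow follower from the ISR lets the high watermark advance, since the slow follower's offset no longer bounds the minimum.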

Partition Management

For partitioned tables, Fluss supports both manual and automatic partitioning.

Validation: Partition values are validated to ensure they match supported types (e.g., STRING, BIGINT, DATE) fluss-server/src/main/java/org/apache/fluss/server/coordinator/CoordinatorEventProcessor.java72-78
Auto-Partitioning: The AutoPartitionManager in the CoordinatorServer periodically creates new partitions based on the table's partition strategy fluss-server/src/main/java/org/apache/fluss/server/coordinator/CoordinatorServer.java132
Replica Handling: Partitions are physically represented as distinct TableBucket instances, each managed as a Replica fluss-server/src/main/java/org/apache/fluss/server/replica/Replica.java148-152
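To make the auto-partitioning idea concrete, the sketch below computes which daily partitions are missing for a look-ahead window. It is a hypothetical illustration: AutoPartitionSketch, partitionsToCreate, the yyyyMMdd naming, and the day-based window are assumptions, not the AutoPartitionManager API or its configuration.

```java
import java.time.LocalDate;
import java.time.format.DateTimeFormatter;
import java.util.ArrayList;
import java.util.List;
import java.util.Set;
import java.util.TreeSet;

/** Hypothetical sketch of a periodic "create missing partitions" check. */
final class AutoPartitionSketch {

    private static final DateTimeFormatter DAY = DateTimeFormatter.ofPattern("yyyyMMdd");

    /** Returns the daily partition names for today and the following days that do not exist yet. */
    static List<String> partitionsToCreate(LocalDate today, int count, Set<String> existing) {
        List<String> missing = new ArrayList<>();
        for (int i = 0; i < count; i++) {
            String name = today.plusDays(i).format(DAY);
            if (!existing.contains(name)) {
                missing.add(name);
            }
        }
        return missing;
    }

    public static void main(String[] args) {
        Set<String> existing = new TreeSet<>();
        existing.add("20241230");
        System.out.println(partitionsToCreate(LocalDate.of(2024, 12, 30), 3, existing));
    }
}
```

Each created partition would then receive its own set of buckets, as described in the Overview.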

Rebalancing and Rack-Awareness

Rebalancing

The RebalanceManager in the coordinator tracks the number of replicas and leaders per TabletServer to identify imbalances fluss-server/src/main/java/org/apache/fluss/server/coordinator/CoordinatorServer.java36. Rebalancing moves TableBucketReplica instances to different servers to even out load. The coordinator generates a RebalanceTask, which is persisted in ZooKeeper fluss-server/src/main/java/org/apache/fluss/server/coordinator/CoordinatorEventProcessor.java115
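A count-based imbalance check of the kind described above might look like the following. This is a deliberately simple sketch, assuming a single replica-count metric and a "difference greater than one" threshold; RebalanceSketch and suggestMove are hypothetical names, and the real RebalanceManager may weigh leaders, racks, and other goals.

```java
import java.util.Map;
import java.util.Optional;
import java.util.TreeMap;

/** Hypothetical sketch of picking one replica move from per-server replica counts. */
final class RebalanceSketch {

    /** Returns {fromServer, toServer} for one suggested move, or empty when counts are balanced. */
    static Optional<int[]> suggestMove(Map<Integer, Integer> replicasPerServer) {
        // A sorted copy makes tie-breaking deterministic (lowest server id wins).
        Map<Integer, Integer> sorted = new TreeMap<>(replicasPerServer);
        Integer busiest = null;
        Integer idlest = null;
        for (Map.Entry<Integer, Integer> e : sorted.entrySet()) {
            if (busiest == null || e.getValue() > sorted.get(busiest)) {
                busiest = e.getKey();
            }
            if (idlest == null || e.getValue() < sorted.get(idlest)) {
                idlest = e.getKey();
            }
        }
        // A difference of at most one replica cannot be improved by moving a replica.
        if (busiest == null || sorted.get(busiest) - sorted.get(idlest) <= 1) {
            return Optional.empty();
        }
        return Optional.of(new int[] {busiest, idlest});
    }

    public static void main(String[] args) {
        Map<Integer, Integer> counts = new TreeMap<>();
        counts.put(1, 5);
        counts.put(2, 2);
        counts.put(3, 3);
        suggestMove(counts).ifPresent(
                m -> System.out.println("move one replica from " + m[0] + " to " + m[1]));
    }
}
```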

Rack-Aware Deployment

Fluss uses server rack information to ensure that the replicas of a bucket are not all placed on the same physical rack, so that losing one rack does not take every copy of the bucket offline.

Server Registration: The TabletServer reports its rack property during registration with ZooKeeper fluss-server/src/main/java/org/apache/fluss/server/tablet/TabletServer.java106
Constraint Enforcement: The coordinator uses this rack metadata during bucket assignment. Either all servers must have a rack configured, or none; otherwise, an InvalidServerRackInfoException is thrown fluss-server/src/main/java/org/apache/fluss/server/tablet/TabletServer.java103-105
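The all-or-none rack rule and a rack-spreading placement can be sketched as below. The names RackAwareSketch, isRackAware, and placeReplicas are hypothetical, the greedy "one server per unused rack first" strategy is an assumption rather than the coordinator's actual assignment algorithm, and IllegalStateException stands in for InvalidServerRackInfoException so that the example stays self-contained.

```java
import java.util.ArrayList;
import java.util.HashSet;
import java.util.List;
import java.util.Map;
import java.util.Set;
import java.util.TreeMap;

/** Hypothetical sketch of rack validation and rack-spreading replica placement. */
final class RackAwareSketch {

    /** Returns true if every server has a rack, false if none has; throws on a mix. */
    static boolean isRackAware(Map<Integer, String> rackByServer) {
        int withRack = 0;
        for (String rack : rackByServer.values()) {
            if (rack != null) {
                withRack++;
            }
        }
        if (withRack != 0 && withRack != rackByServer.size()) {
            // Stand-in for the InvalidServerRackInfoException mentioned above.
            throw new IllegalStateException("either all servers or none must define a rack");
        }
        return withRack != 0;
    }

    /** Picks replicationFactor servers, preferring servers on racks not used yet. */
    static List<Integer> placeReplicas(Map<Integer, String> rackByServer, int replicationFactor) {
        if (replicationFactor < 1 || replicationFactor > rackByServer.size()) {
            throw new IllegalArgumentException("replicationFactor must be in [1, server count]");
        }
        Map<Integer, String> sorted = new TreeMap<>(rackByServer);
        List<Integer> chosen = new ArrayList<>();
        if (isRackAware(rackByServer)) {
            // First pass: at most one server per rack, in server id order.
            Set<String> usedRacks = new HashSet<>();
            for (Map.Entry<Integer, String> e : sorted.entrySet()) {
                if (chosen.size() < replicationFactor && usedRacks.add(e.getValue())) {
                    chosen.add(e.getKey());
                }
            }
        }
        // Second pass: fill any remaining slots with servers not chosen yet.
        for (Integer serverId : sorted.keySet()) {
            if (chosen.size() < replicationFactor && !chosen.contains(serverId)) {
                chosen.add(serverId);
            }
        }
        return chosen;
    }

    public static void main(String[] args) {
        Map<Integer, String> racks = new TreeMap<>();
        racks.put(1, "rack-a");
        racks.put(2, "rack-a");
        racks.put(3, "rack-b");
        System.out.println(placeReplicas(racks, 2));
    }
}
```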

Sources: fluss-common/src/main/java/org/apache/fluss/config/ConfigOptions.java74-92 fluss-server/src/main/java/org/apache/fluss/server/coordinator/CoordinatorEventProcessor.java45-115 fluss-server/src/main/java/org/apache/fluss/server/replica/ReplicaManager.java74-142 fluss-server/src/main/java/org/apache/fluss/server/replica/Replica.java147-153 fluss-server/src/main/java/org/apache/fluss/server/tablet/TabletServer.java95-106 fluss-server/src/main/java/org/apache/fluss/server/coordinator/CoordinatorServer.java120-154
