[Bug fix] Fault-Domain-Aware Instance Assignment failing rebalance with minimize data movement #17799

Open
J-HowHuang wants to merge 3 commits into apache:master from J-HowHuang:fix-FD-aware-selector

J-HowHuang (Collaborator) commented Mar 3, 2026

Description

For tables that use FD_AWARE_INSTANCE_PARTITION_SELECTOR as the partition selector in their instance assignment config, rebalance is likely to fail when minimizeDataMovement=true and the instances have not changed in any pool. In that case every fault domain's candidate set is empty, so the map built in the constructor below is empty and _map.firstKey() throws java.util.NoSuchElementException:

CandidateQueue(Map<Integer, LinkedHashSet<String>> faultDomainToCandidateInstancesMap) {
  _map = new TreeMap<>();
  faultDomainToCandidateInstancesMap.entrySet().stream().filter(kv -> !kv.getValue().isEmpty())
      .forEach(kv -> _map.put(kv.getKey(), new LinkedList<>(kv.getValue())));
  _iter = _map.firstKey();
}
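
The failure mode can be shown outside Pinot. The following is a minimal sketch, not Pinot code: the class and method names (EmptyCandidateMapDemo, nonEmptyDomains, firstKeyThrows) are hypothetical, and only the filtering step and the firstKey() call mirror the constructor above.

```java
import java.util.LinkedHashSet;
import java.util.LinkedList;
import java.util.Map;
import java.util.NoSuchElementException;
import java.util.TreeMap;

// Hypothetical stand-alone demo; not part of Pinot.
class EmptyCandidateMapDemo {

  // Mirrors the filtering step of the constructor: fault domains without candidates are dropped.
  static TreeMap<Integer, LinkedList<String>> nonEmptyDomains(Map<Integer, LinkedHashSet<String>> candidates) {
    TreeMap<Integer, LinkedList<String>> map = new TreeMap<>();
    candidates.entrySet().stream()
        .filter(kv -> !kv.getValue().isEmpty())
        .forEach(kv -> map.put(kv.getKey(), new LinkedList<>(kv.getValue())));
    return map;
  }

  // Returns true if calling firstKey() on the filtered map throws NoSuchElementException.
  static boolean firstKeyThrows(Map<Integer, LinkedHashSet<String>> candidates) {
    try {
      nonEmptyDomains(candidates).firstKey();
      return false;
    } catch (NoSuchElementException e) {
      return true;
    }
  }

  public static void main(String[] args) {
    // Steady state: one fault domain, no new candidate instances.
    Map<Integer, LinkedHashSet<String>> steadyState = new TreeMap<>();
    steadyState.put(0, new LinkedHashSet<>());
    System.out.println(firstKeyThrows(steadyState)); // prints "true"
  }
}
```

TreeMap.firstKey() is documented to throw NoSuchElementException on an empty map, so any caller that may pass only empty candidate sets has to guard against this case.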

Reproduce

Run the offline quickstart. For table airlineStats_OFFLINE, remove its tierConfigs and add the following instanceAssignmentConfigMap:

"instanceAssignmentConfigMap": {
  "OFFLINE": {
    "tagPoolConfig": {
      "tag": "DefaultTenant_OFFLINE",
      "poolBased": false,
      "numPools": 1
    },
    "replicaGroupPartitionConfig": {
      "replicaGroupBased": true,
      "numInstances": 0,
      "numReplicaGroups": 1,
      "numInstancesPerReplicaGroup": 1,
      "numPartitions": 0,
      "numInstancesPerPartition": 0,
      "minimizeDataMovement": false
    },
    "partitionSelector": "FD_AWARE_INSTANCE_PARTITION_SELECTOR",
    "minimizeDataMovement": false
  }
}

Run a rebalance with minimize data movement enabled. The rebalance fails with:

{
  "jobId": "b6c56b27-450a-49c0-ac69-16835ed47fff",
  "status": "FAILED",
  "description": "Caught exception while fetching/calculating instance partitions: java.util.NoSuchElementException"
}

Controller log:

2026/03/03 12:23:37.194 INFO [InstanceTagPoolSelector] [grizzly-http-server-2] Selecting 1 instances for table: airlineStats_OFFLINE
2026/03/03 12:23:37.194 INFO [InstanceAssignmentDriver] [grizzly-http-server-2] No instance constraint is configured, using default hash-based-rotate instance constraint
2026/03/03 12:23:37.194 INFO [HashBasedRotateInstanceConstraintApplier] [grizzly-http-server-2] Rotating instances for table: airlineStats_OFFLINE with hash: 802879867
2026/03/03 12:23:37.194 INFO [FDAwareInstancePartitionSelector] [grizzly-http-server-2] Assigning 1 replica groups to 1 fault domains
2026/03/03 12:23:37.194 INFO [FDAwareInstancePartitionSelector] [grizzly-http-server-2] Warning, normalizing isn't finished yet
2026/03/03 12:23:37.195 WARN [TableRebalancer-airlineStats_OFFLINE-2b912476-fb58-4b47-907a-7db8f87d85c1] [grizzly-http-server-2] Caught exception while fetching/calculating instance partitions, aborting the rebalance
java.util.NoSuchElementException
	at java.base/java.util.TreeMap.key(TreeMap.java:1602)
	at java.base/java.util.TreeMap.firstKey(TreeMap.java:291)
	at org.apache.pinot.controller.helix.core.assignment.instance.FDAwareInstancePartitionSelector$CandidateQueue.<init>(FDAwareInstancePartitionSelector.java:255)
	at org.apache.pinot.controller.helix.core.assignment.instance.FDAwareInstancePartitionSelector$ReplicaGroupBasedAssignmentState.fill(FDAwareInstancePartitionSelector.java:395)
	at org.apache.pinot.controller.helix.core.assignment.instance.FDAwareInstancePartitionSelector.selectInstances(FDAwareInstancePartitionSelector.java:195)
	at org.apache.pinot.controller.helix.core.assignment.instance.InstanceAssignmentDriver.getInstancePartitions(InstanceAssignmentDriver.java:154)
	at org.apache.pinot.controller.helix.core.assignment.instance.InstanceAssignmentDriver.getInstancePartitions(InstanceAssignmentDriver.java:126)
	at org.apache.pinot.controller.helix.core.assignment.instance.InstanceAssignmentDriver.assignInstances(InstanceAssignmentDriver.java:69)
	at org.apache.pinot.controller.helix.core.rebalance.TableRebalancer.getInstancePartitions(TableRebalancer.java:1274)

Change

Add a check in FDAwareInstancePartitionSelector$ReplicaGroupBasedAssignmentState.fill to see whether there is any candidate instance to fill with; if there is none, skip filling.
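
A minimal sketch of such a guard, written as a stand-alone class so it can be run in isolation. FillGuardSketch, hasCandidates and the instance name server_a are hypothetical names used only for illustration; the actual change lives inside fill() in the PR.

```java
import java.util.Collection;
import java.util.LinkedHashSet;
import java.util.List;
import java.util.Map;
import java.util.TreeMap;

// Hypothetical stand-alone sketch; not the PR's code.
class FillGuardSketch {

  // True when at least one fault domain still has a candidate instance to assign.
  // An entirely empty map also yields false, because allMatch on an empty stream is true.
  static boolean hasCandidates(Map<Integer, LinkedHashSet<String>> faultDomainToCandidateInstancesMap) {
    return !faultDomainToCandidateInstancesMap.values().stream().allMatch(Collection::isEmpty);
  }

  public static void main(String[] args) {
    // Steady state: the fault domain exists but has no new candidate instances.
    Map<Integer, LinkedHashSet<String>> steadyState = new TreeMap<>();
    steadyState.put(0, new LinkedHashSet<>());
    System.out.println(hasCandidates(steadyState)); // prints "false", so filling would be skipped

    // A new instance (hypothetical name) was added to fault domain 0.
    Map<Integer, LinkedHashSet<String>> withNewInstance = new TreeMap<>();
    withNewInstance.put(0, new LinkedHashSet<>(List.of("server_a")));
    System.out.println(hasCandidates(withNewInstance)); // prints "true", so filling proceeds
  }
}
```

Checking the candidate sets up front keeps the queue construction unchanged and avoids ever calling firstKey() on an empty map.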

codecov-commenter commented Mar 3, 2026

Codecov Report

✅ All modified and coverable lines are covered by tests.
✅ Project coverage is 63.24%. Comparing base (7f70124) to head (56a7113).
⚠️ Report is 16 commits behind head on master.

Additional details and impacted files
@@             Coverage Diff              @@
##             master   #17799      +/-   ##
============================================
+ Coverage     63.21%   63.24%   +0.03%     
- Complexity     1454     1456       +2     
============================================
  Files          3183     3186       +3     
  Lines        191500   191615     +115     
  Branches      29289    29315      +26     
============================================
+ Hits         121048   121183     +135     
+ Misses        60986    60947      -39     
- Partials       9466     9485      +19     
Flag Coverage Δ
custom-integration1 100.00% <ø> (ø)
integration 100.00% <ø> (ø)
integration1 100.00% <ø> (ø)
integration2 0.00% <ø> (ø)
java-11 63.22% <100.00%> (+0.03%) ⬆️
java-21 63.22% <100.00%> (+0.03%) ⬆️
temurin 63.24% <100.00%> (+0.03%) ⬆️
unittests 63.23% <100.00%> (+0.03%) ⬆️
unittests1 55.61% <ø> (+0.03%) ⬆️
unittests2 34.14% <100.00%> (+0.01%) ⬆️

yashmayya (Contributor) left a comment

LGTM, thanks for the fix @J-HowHuang! Is it possible to add a small test for this case?

Also, IIUC, does this mean that all steady state rebalances (no instance additions / removals) with minimizeDataMovement currently fail when the instance selector is FDAwareInstancePartitionSelector?

public void fill(Map<Integer, LinkedHashSet<String>> faultDomainToCandidateInstancesMap) {
  // skip filling if there is no candidate instance, which can happen when minimize data movement is enabled and
  // no new instances are added to any pool
  if (faultDomainToCandidateInstancesMap.values().stream().allMatch(HashSet::isEmpty)) {

nit: Set::isEmpty or Collection::isEmpty would be more idiomatic rather than using HashSet::isEmpty for a LinkedHashSet.

J-HowHuang (Collaborator, Author) replied:

done


J-HowHuang (Collaborator, Author) commented:

@yashmayya Added the test and verified that it would fail without the fix.

Comment on lines +2 to +11
* Licensed to the Apache Software Foundation (ASF) under one or more contributor license agreements. See the NOTICE
* file distributed with this work for additional information regarding copyright ownership. The ASF licenses this file
* to you under the Apache License, Version 2.0 (the "License"); you may not use this file except in compliance with the
* License. You may obtain a copy of the License at
* <p>
* http://www.apache.org/licenses/LICENSE-2.0
* <p>
* Unless required by applicable law or agreed to in writing, software distributed under the License is distributed on
* an "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. See the License for the
* specific language governing permissions and limitations under the License.

Should be reverted
