Tags: Q1Liu/kafka
[LI-HOTFIX] Add Zookeeper pagination support for /brokers/topics znode (linkedin#435)
TICKET = LIKAFKA-49497
LI_DESCRIPTION = Switch from the Apache to the LI Zookeeper dependency and add a GetAllChildrenPaginated option for the /brokers/topics znode, which supports 'list topics' responses greater than 1 MB. The feature is controlled by a new li.zookeeper.pagination.enable config (default = false), with the intention that it be enabled only for critical clusters, at least until it has proven itself in battle.
EXIT_CRITERIA = When this change is accepted upstream and pulled into this repo.
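The config-gated behavior above can be sketched as follows. This is a minimal illustration, not the actual LI code: the class and method names (`TopicLister`, `listTopics`) and the page-by-page loop standing in for repeated GetAllChildrenPaginated round trips are all assumptions.

```java
import java.util.ArrayList;
import java.util.List;

class TopicLister {
    // Simulates listing znode children either in one shot (may exceed the
    // ~1 MB response limit) or page by page when pagination is enabled.
    static List<String> listTopics(List<String> allTopics, boolean paginationEnabled, int pageSize) {
        if (!paginationEnabled) {
            return new ArrayList<>(allTopics); // single getChildren response
        }
        List<String> result = new ArrayList<>();
        for (int i = 0; i < allTopics.size(); i += pageSize) {
            // each "page" would be a separate GetAllChildrenPaginated call
            result.addAll(allTopics.subList(i, Math.min(i + pageSize, allTopics.size())));
        }
        return result;
    }
}
```

Either path yields the same topic list; only the response sizes differ, which is why the feature can default to off and be enabled per cluster.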
[LI-HOTFIX] Log lastCaughtUpTime on ISR shrinkage (linkedin#432)
TICKET = N/A
EXIT_CRITERIA = When upstream also logs similar info.
[LI-HOTFIX] Add metric for total connection count (linkedin#430)
TICKET = LIKAFKA-49259
LI_DESCRIPTION = This metric shows the total client connection count on a broker. It is useful for monitoring the connection count and measuring a broker's maximum connections.
EXIT_CRITERIA = When upstream implements similar sensors.
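A total-connection-count metric of this kind reduces to a counter updated on connection lifecycle events. The sketch below is illustrative only; the class and hook names are assumptions, not the PR's actual metric wiring.

```java
import java.util.concurrent.atomic.AtomicInteger;

class ConnectionCountMetric {
    private final AtomicInteger total = new AtomicInteger();

    // Called when the broker accepts a new client connection.
    void onConnectionCreated() { total.incrementAndGet(); }

    // Called when a client connection is closed.
    void onConnectionClosed()  { total.decrementAndGet(); }

    // Current total connection count, suitable for exposing as a gauge.
    int value()                { return total.get(); }
}
```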
[LI-HOTFIX] Support full LeaderAndISR through LiCombinedControl requests (linkedin#427)
TICKET = LIKAFKA-49560
LI_DESCRIPTION = As described in the ticket, we found that LiCombinedControl requests can become disabled for newly added brokers. A newly added broker experiences the following behavior:
1. LiCombinedControl requests are enabled when the broker starts up and hosts 0 replicas.
2. LiCombinedControl requests are disabled once some replicas are assigned to the broker.
3. LiCombinedControl requests can only be re-enabled after the broker is restarted.
The root cause of step 2 is that once the LiCombinedControl request is enabled and a full LeaderAndISR request needs to be sent, the controller tries to merge the request but does not honor the full request type. As a result, brokers receive the LeaderAndISR as part of the LiCombinedControl and treat it as an incremental LeaderAndISR instead of a full request. This PR addresses the problem by fully supporting the LeaderAndISR request type within LiCombinedControl.
EXIT_CRITERIA = The same as the LiCombinedControl request.
[LI-HOTFIX] Broker to controller request should not use cached controller node (linkedin#425)
TICKET = LIKAFKA-49304
LI_DESCRIPTION = Per the Slack discussion https://linkedin-randd.slack.com/archives/C014EKBE170/p1669951054188429?thread_ts=1667860160.319959&cid=C014EKBE170, the current implementation of broker-to-controller requests results in Unauthorized errors if the cached controller node has been migrated to a different cluster and is still alive. The impact is that any broker-to-controller requests, including AlterISR requests, are blocked, resulting in permanently inconsistent ISR info. We should handle such migrated controllers gracefully and use the correct controller instead of the previously cached, obsolete one. This PR makes the following changes:
1. Replace the "li.alter.isr.enable" config with the "li.deny.alter.isr" config, since the former is no longer needed and the latter is used to construct an integration test that reproduces the problem above.
2. Change the logic in BrokerToControllerRequestThread so that we always query the latest controller node when a request is constructed. Doing this should not degrade performance, since the ControllerNodeProvider is either a MetadataCacheControllerNodeProvider or a RaftControllerNodeProvider, both of which retrieve the controller from the local cache.
EXIT_CRITERIA = When this change is accepted upstream and pulled into this repo.
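The essence of change 2 can be sketched as resolving the controller at request-construction time instead of holding a node cached earlier. This is a simplified illustration: `RequestTargetResolver` and its `Supplier`-based provider are stand-ins for the real BrokerToControllerRequestThread and ControllerNodeProvider, not their actual APIs.

```java
import java.util.function.Supplier;

class RequestTargetResolver {
    // Backed by a local metadata cache in the real implementation, so each
    // lookup is cheap and there is no performance penalty for re-resolving.
    private final Supplier<String> controllerNodeProvider;

    RequestTargetResolver(Supplier<String> provider) {
        this.controllerNodeProvider = provider;
    }

    // Resolve the controller fresh for every request, so a controller that
    // migrated away is never targeted with a stale cached address.
    String targetForNextRequest() {
        return controllerNodeProvider.get();
    }
}
```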
Add support for request TotalTimeMs latency histograms (linkedin#423)
TICKET = LIKAFKA-47556 Establish Kafka Server SLOs
LI_DESCRIPTION = This PR adds support for request TotalTimeMs latency histograms so that we can count the number of requests in different latency ranges. The bin boundaries are configurable.
EXIT_CRITERIA = N/A
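Counting requests into configurable latency bins can be sketched as below. The class name, the example boundaries, and the "one extra overflow bin" layout are assumptions for illustration, not the PR's actual configuration keys or metric names.

```java
import java.util.concurrent.atomic.AtomicLongArray;

class LatencyBinCounter {
    private final long[] boundariesMs;    // sorted upper bounds, e.g. {10, 50, 200}
    private final AtomicLongArray counts; // one extra bin for "above last boundary"

    LatencyBinCounter(long[] boundariesMs) {
        this.boundariesMs = boundariesMs.clone();
        this.counts = new AtomicLongArray(boundariesMs.length + 1);
    }

    // Record one request's TotalTimeMs into the first bin whose upper
    // bound it does not exceed; values past the last boundary overflow
    // into the final bin.
    void record(long totalTimeMs) {
        int i = 0;
        while (i < boundariesMs.length && totalTimeMs > boundariesMs[i]) i++;
        counts.incrementAndGet(i);
    }

    long count(int bin) { return counts.get(bin); }
}
```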
[LI-HOTFIX] Reject invalid replica assignment cancellations (linkedin#422)
TICKET = https://issues.apache.org/jira/browse/KAFKA-14424
LI_DESCRIPTION = When reassigning replicas, Kafka runs a sanity check to ensure all of the target replicas are alive before allowing the reassignment request to proceed. However, for a request that cancels an ongoing reassignment, there is no such check. As a result, if the original replicas are offline, the cancellation may leave partitions without any leaders. This problem has been observed multiple times in our clusters. This PR adds a sanity check to ensure all of the original replicas are online before approving the cancellation request.
EXIT_CRITERIA = When the issue is resolved in Apache Kafka and the fix is pulled in.
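The sanity check amounts to a liveness test over the original replica set. The sketch below is illustrative; `ReassignmentGuard` and `canCancel` are hypothetical names, not the actual controller code.

```java
import java.util.List;
import java.util.Set;

class ReassignmentGuard {
    // Approve a reassignment cancellation only if every broker hosting an
    // original replica is currently alive; otherwise cancelling could roll
    // the partition back to an assignment with no eligible leader.
    static boolean canCancel(List<Integer> originalReplicas, Set<Integer> liveBrokers) {
        return liveBrokers.containsAll(originalReplicas);
    }
}
```

This mirrors the check that already existed for the forward direction (target replicas must be alive before a reassignment starts).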
[LI-FIXUP] Populate the error fields of the LiCombinedControlResponse properly (linkedin#421)
TICKET = N/A
LI_DESCRIPTION = This reverts commit a2ac1c2 (linkedin#408). The original commit was incorrect and introduced a backward-incompatible schema change in LiCombinedControlResponse.json. This PR reverts the original schema change and addresses the original problem properly:
1. When the LiCombinedControl request version is below 1, the response populates the LeaderAndIsrPartitionErrors field with the error code. When the version is 1 or greater, it populates the LeaderAndIsrTopics field.
2. When the LiCombinedControl request version is below 1, the StopReplicaPartitionErrors field of the LiCombinedControlResponse should be populated according to the StopReplicaPartitionStates of the LiCombinedControlRequest. When the version is 1 or greater, the StopReplicaPartitionErrors field should be populated according to the StopReplicaTopicStates of the LiCombinedControlRequest.
EXIT_CRITERIA = The same as the LiCombinedControlRequest feature.
Exclude the fetch requests with large fetch.max.wait.ms in SizeBucketMetrics (linkedin#418)
TICKET = LIKAFKA-47556 Establish Kafka Server SLOs
LI_DESCRIPTION = This PR excludes from SizeBucketMetrics the fetch requests whose fetch.max.wait.ms is greater than the default setting. Otherwise the P999 metrics do not reflect broker performance correctly: P999 could simply equal maxWait whenever there is not enough data to immediately satisfy fetch.min.bytes and maxWait is set to a large value (e.g., 30 seconds).
EXIT_CRITERIA = N/A
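The exclusion rule is a single threshold comparison. In the sketch below, the 500 ms value is the Kafka consumer's default fetch.max.wait.ms; the class and method names are illustrative, not the PR's actual code.

```java
class SizeBucketFilter {
    // Kafka consumer default for fetch.max.wait.ms.
    static final long DEFAULT_FETCH_MAX_WAIT_MS = 500;

    // Record a fetch request in SizeBucketMetrics only when its maxWait is
    // at or below the default; a 30 s maxWait would otherwise dominate P999
    // and reflect the client's wait budget rather than broker speed.
    static boolean shouldRecord(long requestMaxWaitMs) {
        return requestMaxWaitMs <= DEFAULT_FETCH_MAX_WAIT_MS;
    }
}
```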