[FLINK-37937] Test Cassandra source in real cluster conditions to better test the split #37

echauchot · 2025-07-31T17:24:38Z

A bug in split calculation (ring fraction calculation) was uncovered by this PR. The existing split tests are run on an embedded Cassandra cluster with only one node. This leads to having ringFraction always equal to 1 (the single node hosts 100% of the data) during the tests. This masks the bug.
Test splits on an embedded cluster of 2 nodes.
Additional changes:

implement robust refresh size estimates for tests
fix timeouts configuration
upgrade to latest cassandra 4.x
PS: the tests take longer to set up because of the two-node cluster (measured on my laptop: 2min16 vs 57s), but I think the better split testing is worth it.

R: @Poorvankbhatia
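
To illustrate why a single-node cluster hides a ring fraction bug, here is a minimal sketch of how the fraction of the token ring owned by a node could be computed. It assumes the Murmur3Partitioner token space and is not the connector's actual code: the class name `RingFractionSketch` and its methods are hypothetical. With one node, the only range covers the whole ring and the result is always 1, whatever the rest of the arithmetic does.

```java
import java.math.BigDecimal;
import java.math.BigInteger;
import java.math.MathContext;
import java.util.Collections;
import java.util.List;

// Hypothetical illustration, not the connector's implementation.
class RingFractionSketch {
    // Murmur3Partitioner tokens span [-2^63, 2^63 - 1], so the ring holds 2^64 tokens.
    static final BigInteger RING_SIZE = BigInteger.ONE.shiftLeft(64);

    /** Number of tokens in the range (start, end], handling wrap-around. */
    static BigInteger rangeSize(long start, long end) {
        BigInteger size = BigInteger.valueOf(end).subtract(BigInteger.valueOf(start));
        // start == end denotes the whole ring; end < start wraps past the maximum token.
        return size.signum() > 0 ? size : size.add(RING_SIZE);
    }

    /** Fraction of the ring covered by the given (start, end] token ranges. */
    static double ringFraction(List<long[]> ranges) {
        BigInteger owned = BigInteger.ZERO;
        for (long[] range : ranges) {
            owned = owned.add(rangeSize(range[0], range[1]));
        }
        return new BigDecimal(owned)
                .divide(new BigDecimal(RING_SIZE), MathContext.DECIMAL64)
                .doubleValue();
    }

    public static void main(String[] args) {
        // Single node with token 0: its only range is (0, 0], i.e. the whole ring.
        System.out.println(ringFraction(Collections.singletonList(new long[] {0L, 0L})));
        // Two nodes with tokens 0 and 2^62: the node at token 0 owns (2^62, 0].
        System.out.println(ringFraction(Collections.singletonList(new long[] {1L << 62, 0L})));
    }
}
```

With two nodes, each node owns less than the whole ring, so any error in the fraction changes the size estimate and becomes visible to the split tests.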

echauchot · 2025-09-03T08:39:09Z

@Poorvankbhatia PTAL

Poorvankbhatia · 2025-09-07T13:59:17Z

Hey @echauchot . Thanks for the diff. It makes sense to me. Added some comments.
One issue: when I ran mvn clean install on my local Mac (on PR#37), it failed with the following error (it passes on the main branch):

org.apache.flink.connector.cassandra.source.CassandraSourceITCase.testSourceSingleSplit(TestEnvironment, DataStreamSourceExternalContext, CheckpointingMode)[1]  Time elapsed: 0.188 s  <<< ERROR!
com.datastax.driver.core.exceptions.WriteFailureException: Cassandra failure during write query at consistency LOCAL_ONE (1 responses were required but only 0 replica responded, 1 failed)

This is typically a timing/cluster-readiness problem in tests, not a logic bug, especially if CI is green. So I am unable to get the test running locally, but since the CI is green maybe there is an issue with my setup. (I tried giving Docker more CPU/RAM but couldn't get it working.)

echauchot · 2025-09-08T08:55:46Z

Yes, I get this from time to time on my laptop. I could raise the configured query timeouts and also check with the new startup conditions.

Poorvankbhatia · 2025-09-08T09:45:34Z

Yes that would be pretty helpful. 👍

echauchot · 2025-09-08T10:39:57Z

Actually, the timeouts were incorrectly applied because the conf file was misnamed. So I just applied the original timeout values and ran the tests 5 times on my laptop without hitting the issue above.

echauchot · 2025-09-08T10:43:01Z

@Poorvankbhatia thanks for your review ! I addressed all the comments, PTAL.

Poorvankbhatia

It still doesn't work on my machine somehow 😅. But I think the CI is green, so that is good.
Thank you for resolving the comments.
LGTM.

echauchot · 2025-09-09T07:59:11Z

Regarding the CI: only the name test succeeds; the build fails with a timeout when downloading the Flink binaries from https://archive.apache.org. It is incredibly slow (less than 100KB/s) and has been for some time. I'll see if another URL works better.

Regarding timeouts on write requests: I'll raise the write request timeout. Could you please test on your local machine? On mine it works well.

echauchot · 2025-09-09T08:39:56Z

One thing strikes me though: the write query uses CL=1 whereas it should use the lower CL=ANY, as org.apache.flink.connector.cassandra.CassandraTestEnvironment#builderForWriting is configured with CL=ANY. This means that somehow org.apache.flink.connector.cassandra.CassandraTestEnvironment#builderForReading (CL=1) is used for writing as well.

echauchot · 2025-09-09T11:02:41Z

Ah no, the builders are used for the old connector tests. The consistency level needs to be configured on the mapper side, in org.apache.flink.connector.testframe.external.ExternalSystemSplitDataWriter#writeRecords.

echauchot · 2025-09-09T11:19:23Z

@Poorvankbhatia I have dealt with the write timeouts: I set the consistency level to the lower ANY instead of the default (CL=1) when writing test data, and raised the write request timeout to the same value as the read request timeout.

Please test on your local machine that the flaky issue is gone.

I'll merge afterwards

echauchot · 2025-09-10T10:22:18Z

Downloading the Flink binary still times out: we cannot download it within the allowed 1h20 because archive.apache.org seems too slow (it seems to have been experiencing issues lately: https://status.apache.org/#past-incidents). The other PRs use another version of Flink (migrated to 2.0 and 2.1), so they benefit from a cached copy.

echauchot · 2025-09-10T13:22:30Z

Regarding machine load and timeouts / missing data: I have set the consistency level to ONE to be coherent between read and write requests, because CL=ANY in write requests could lead to hinted writes that would be invisible to subsequent read requests. I guess the extended 15s timeout on writes will give a write enough time to wait for 1 replica before the ack, even on loaded machines.

Please test on your local machine, as I cannot reproduce on mine.
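
The trade-off between CL=ANY and CL=ONE can be pictured with a toy model. This is only a sketch under simplifying assumptions, not Cassandra or driver code: `HintedWriteSketch`, its fields and its methods are hypothetical names. It models the one property that matters here: a write at ANY is acknowledged as soon as a hint is stored, even if no replica received the data, and a hint is not visible to reads until it is replayed.

```java
import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

// Toy model only: a deliberately simplified picture of hinted writes.
class HintedWriteSketch {
    enum ConsistencyLevel { ANY, ONE }

    private static final class Replica {
        boolean up = true;
        final Map<String, String> data = new HashMap<>();
    }

    private final List<Replica> replicas = new ArrayList<>();
    // Hints kept by the coordinator for replicas that were down at write time.
    private final Map<String, String> coordinatorHints = new HashMap<>();

    HintedWriteSketch(int replicaCount) {
        for (int i = 0; i < replicaCount; i++) {
            replicas.add(new Replica());
        }
    }

    void setUp(int replicaIndex, boolean up) {
        replicas.get(replicaIndex).up = up;
    }

    /** Returns true if the write is acknowledged at the given consistency level. */
    boolean write(String key, String value, ConsistencyLevel cl) {
        int acks = 0;
        for (Replica replica : replicas) {
            if (replica.up) {
                replica.data.put(key, value);
                acks++;
            }
        }
        if (acks >= 1) {
            return true;
        }
        if (cl == ConsistencyLevel.ANY) {
            // At ANY, a stored hint is enough to acknowledge the write.
            coordinatorHints.put(key, value);
            return true;
        }
        return false; // At ONE, at least one replica must have the data.
    }

    /** Read at ONE: value from the first live replica; hints are never consulted. */
    String readAtOne(String key) {
        for (Replica replica : replicas) {
            if (replica.up) {
                return replica.data.get(key);
            }
        }
        return null;
    }
}
```

In this model, a write at ANY made while the replica is down is acknowledged, yet a read issued once the replica is back (and before any hint replay) returns nothing. A write at ONE fails instead, which is the behaviour a test prefers: a visible failure rather than silently missing data.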

echauchot · 2025-09-12T07:59:43Z

FYI: https://lists.apache.org/thread/vb9bd4toq11g0x59x0q47dp5lqdlqgyr

echauchot · 2025-09-16T14:54:33Z

@Poorvankbhatia I have the impression that there is a build result cache issue on GitHub. The build is not finished but the previous build is displayed.

Poorvankbhatia · 2025-09-17T05:37:03Z

Yeah, it is still finding the cassandraContainer variable. Is there no way to clear these build caches? Do you see any such field in the Actions tab?

echauchot · 2025-09-17T07:28:37Z

It is as if it was showing a very old build of an old commit (before adding the second container). At least I can link to the correct GitHub Actions builds from my IDE with the GitHub plugin.

I have already checked the GitHub Actions cache and the only things I see are the flink-binaries (not related, as it is flink-core) and setup-java-Linux-x64-maven*. I'll try to remove the latter.

Anyway, I see in the regular build that the startup of both containers does not fit in the 3 min graceful timeout.

echauchot · 2025-09-17T10:49:06Z

Well, even a 10 min timeout is not enough to start 2 containers on the GitHub CI!
Locally, all tests run in 2 min including the startup time.
I might change my plans (not start in parallel, or change the wait condition) to make it pass on the GitHub CI.

Poorvankbhatia · 2025-09-17T16:47:16Z

I ran the change on my local (> 10 times) and it succeeded every time. Pretty sure it is a CI issue 😄 @echauchot

echauchot · 2025-09-18T08:51:24Z

Yes, GitHub Actions is a shared environment, so it is pretty loaded.
With the 2 containers starting in sequence instead of in parallel and a 3-minute startup timeout, it passes on the GitHub CI.
For some (strange) reason, this PR automatically links to the build of a very old commit. Here is the link to the build of the last commit of the PR: https://github.com/echauchot/flink-connector-cassandra/runs/50669333696
It is all green, so I'll merge the PR as you already approved it.
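
The "start in sequence with a per-node timeout" idea can be sketched in plain Java. This is an illustration only, not the code of this PR: `SequentialStartupSketch` and `startInSequence` are hypothetical names, and each `Runnable` stands in for whatever actually starts a container and waits for it to be ready.

```java
import java.time.Duration;
import java.util.ArrayList;
import java.util.List;
import java.util.Map;
import java.util.concurrent.ExecutionException;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.Future;
import java.util.concurrent.TimeUnit;
import java.util.concurrent.TimeoutException;

// Hypothetical helper: each Runnable is a placeholder for a real container start.
class SequentialStartupSketch {
    /**
     * Runs the start actions one after the other, each bounded by its own timeout.
     * Pass a LinkedHashMap so that the iteration order is the intended start order.
     * Returns the node names in the order they came up.
     */
    static List<String> startInSequence(Map<String, Runnable> startActions, Duration timeoutPerNode) {
        ExecutorService executor = Executors.newSingleThreadExecutor();
        List<String> started = new ArrayList<>();
        try {
            for (Map.Entry<String, Runnable> node : startActions.entrySet()) {
                Future<?> startup = executor.submit(node.getValue());
                try {
                    // Block until this node is up before moving on to the next one.
                    startup.get(timeoutPerNode.toMillis(), TimeUnit.MILLISECONDS);
                } catch (TimeoutException e) {
                    startup.cancel(true);
                    throw new IllegalStateException(
                            node.getKey() + " did not start within " + timeoutPerNode, e);
                } catch (InterruptedException e) {
                    Thread.currentThread().interrupt();
                    throw new IllegalStateException("Interrupted while starting " + node.getKey(), e);
                } catch (ExecutionException e) {
                    throw new IllegalStateException(node.getKey() + " failed to start", e.getCause());
                }
                started.add(node.getKey());
            }
            return started;
        } finally {
            executor.shutdownNow();
        }
    }
}
```

Starting the nodes one at a time avoids both of them competing for CPU on a loaded shared runner, at the cost of a longer total startup time.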

…coherent between read and write requests because CL=ANY in write requests could lead to hinted writes that would be invisible to subsequent read requests. Raise write request timeout to the same timeout as read request. Set replication factor to 2 to deal with a temporarily down Cassandra container. Put back sequential start of the 2 containers.

echauchot · 2025-09-19T10:06:36Z

@Poorvankbhatia I understood the GitHub build issue: it is not a cache issue as we supposed. In the meantime, I merged your SQL PR to main, and that PR references the old cassandraContainer variable. GitHub tests the current PR by automatically rebasing it on main. This rebase does not fail because there are no conflicts; it just adds my changes on top of yours. Hence the cassandraContainer variable left in the code and the compilation issue. I'll manually rebase onto main and replace cassandraContainer with cassandraContainer1 in your code.

Poorvankbhatia · 2025-09-19T11:02:01Z

CI is green 😄 . I think you should merge. 👏

boring-cyborg bot added the component=Connectors/Cassandra label Jul 31, 2025

Poorvankbhatia reviewed Sep 7, 2025

...r-cassandra/src/test/java/org/apache/flink/connector/cassandra/CassandraTestEnvironment.java

Poorvankbhatia reviewed Sep 7, 2025

...r-cassandra/src/test/java/org/apache/flink/connector/cassandra/CassandraTestEnvironment.java (outdated)

Poorvankbhatia reviewed Sep 7, 2025

...r-cassandra/src/test/java/org/apache/flink/connector/cassandra/CassandraTestEnvironment.java (outdated)

Poorvankbhatia reviewed Sep 7, 2025

flink-connector-cassandra/src/test/resources/cassandra.yml (outdated)

Poorvankbhatia approved these changes Sep 8, 2025

echauchot force-pushed the FLINK-37937-multiple-test-containers branch from 176baad to 623d78f on September 9, 2025 11:17

echauchot force-pushed the FLINK-37937-multiple-test-containers branch from 623d78f to 6069d83 on September 10, 2025 13:19

echauchot force-pushed the FLINK-37937-multiple-test-containers branch from 6069d83 to 98320b7 on September 12, 2025 10:48

boring-cyborg bot added the component=BuildSystem label Sep 12, 2025

echauchot force-pushed the FLINK-37937-multiple-test-containers branch from 590cb87 to 98320b7 on September 16, 2025 09:12

echauchot force-pushed the FLINK-37937-multiple-test-containers branch from 60e9162 to e706c1b on September 18, 2025 08:33

echauchot added 7 commits September 18, 2025 11:03

[FLINK-37937] Add a node to Cassandra testContainers cluster

23164ea

[FLINK-37937] Use nodetool refreshsizeestimates in addition to flush to update size estimates: flush updates the SSTables and refreshsizeestimates updates the size estimates based on them

5a2ed53

[FLINK-37937] change timeouts using cassandra.yaml instead of java options that are not interpreted by Cassandra cluster

725100e

[FLINK-37937] upgrade to latest cassandra 4.x. reformat

c3d5b0f

[FLINK-37937] Make estimatedTableSize calculated during split preparation accessible to tests and not call estimate_size during tests. In refreshSizeEstimates wait until system.size_estimates has at least a row that has non-null mean_partition_size

8857fa1
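
The "wait until system.size_estimates has a usable row" step is a poll-until-condition loop. Below is a minimal generic sketch of such a loop, not the code of this commit: `PollUntilSketch` and `waitUntil` are hypothetical names, and in the real test the condition would be a query against system.size_estimates checking for a non-null mean_partition_size.

```java
import java.time.Duration;
import java.util.function.BooleanSupplier;

// Hypothetical helper: the condition is a placeholder for the real size_estimates query.
class PollUntilSketch {
    /**
     * Polls the condition until it holds or the timeout elapses.
     * Returns true if the condition became true in time, false otherwise.
     */
    static boolean waitUntil(BooleanSupplier condition, Duration timeout, Duration pollInterval) {
        long deadline = System.nanoTime() + timeout.toNanos();
        while (true) {
            if (condition.getAsBoolean()) {
                return true;
            }
            if (System.nanoTime() - deadline >= 0) {
                return false; // Timed out without seeing the condition hold.
            }
            try {
                Thread.sleep(pollInterval.toMillis());
            } catch (InterruptedException e) {
                // Preserve the interrupt flag and give up waiting.
                Thread.currentThread().interrupt();
                return false;
            }
        }
    }
}
```

Checking the condition before the deadline guarantees at least one evaluation even with a zero timeout, and a bounded wait keeps a test from hanging forever if the estimates never appear.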

[FLINK-37937] Improve start/stop for the 2-container cluster.

80fd38a

echauchot force-pushed the FLINK-37937-multiple-test-containers branch from e706c1b to 2a05068 on September 18, 2025 09:04

[FLINK-37937] Rebase on main and reference cassandraContainer1 instead of cassandraContainer

0215935

echauchot merged commit 84c3fa0 into apache:main Sep 22, 2025
5 checks passed

echauchot deleted the FLINK-37937-multiple-test-containers branch October 3, 2025 14:37
