Conversation

@echauchot
Contributor

@echauchot echauchot commented Jul 31, 2025

This PR uncovered a bug in the split calculation (the ring fraction calculation). The existing split tests run on an embedded Cassandra cluster with only one node, so ringFraction is always equal to 1 during the tests (the single node hosts 100% of the data). This masks the bug.
This PR tests splits on an embedded cluster of 2 nodes.
Additional changes:

  • implement a robust refresh of the size estimates for tests
  • fix the timeouts configuration
  • upgrade to the latest Cassandra 4.x

PS: the tests take longer to set up because of the two-node cluster (measured on my laptop: 2 min 16 s vs 57 s), but I think the better split testing is worth it.
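
To illustrate why a single-node cluster hides this kind of bug, here is a minimal sketch. It is not the connector's code: `TokenRange`, `correctEstimate` and `buggyEstimate` are hypothetical names, the wrong variant is made up for the example, and the sketch assumes Murmur3 tokens and a table size extrapolated by dividing a node-local size by the ring fraction.

```java
import java.math.BigDecimal;
import java.math.BigInteger;
import java.math.MathContext;
import java.util.Collections;
import java.util.List;

/** Illustrative sketch only; not the connector's actual split code. */
public class RingFractionSketch {
    // Murmur3Partitioner tokens span [-2^63, 2^63 - 1]: 2^64 tokens in total.
    static final BigInteger RING_SIZE = BigInteger.valueOf(2).pow(64);

    /** Hypothetical token range (start, end]; covers the whole ring when start == end. */
    static final class TokenRange {
        final long start;
        final long end;

        TokenRange(long start, long end) {
            this.start = start;
            this.end = end;
        }

        BigInteger size() {
            BigInteger span = BigInteger.valueOf(end).subtract(BigInteger.valueOf(start));
            // A non-positive span means the range wraps around the ring.
            return span.signum() > 0 ? span : span.add(RING_SIZE);
        }
    }

    /** Fraction of the ring covered by the given ranges, between 0 and 1. */
    static double ringFraction(List<TokenRange> ownedRanges) {
        BigInteger owned = BigInteger.ZERO;
        for (TokenRange range : ownedRanges) {
            owned = owned.add(range.size());
        }
        return new BigDecimal(owned)
                .divide(new BigDecimal(RING_SIZE), MathContext.DECIMAL64)
                .doubleValue();
    }

    /** Assumed extrapolation of a node-local size to the whole table. */
    static double correctEstimate(double localBytes, double ringFraction) {
        return localBytes / ringFraction;
    }

    /** Deliberately wrong variant, made up for this example. */
    static double buggyEstimate(double localBytes, double ringFraction) {
        return localBytes * ringFraction;
    }

    public static void main(String[] args) {
        double oneNode = ringFraction(Collections.singletonList(new TokenRange(0L, 0L)));
        double twoNodes = ringFraction(Collections.singletonList(new TokenRange(Long.MIN_VALUE, 0L)));
        // With one node both variants agree, so a test cannot tell them apart.
        System.out.println(correctEstimate(1000, oneNode) + " vs " + buggyEstimate(1000, oneNode));
        // With two nodes they diverge and the bug becomes visible.
        System.out.println(correctEstimate(1000, twoNodes) + " vs " + buggyEstimate(1000, twoNodes));
    }
}
```

Any formula that is only wrong in how it uses ringFraction behaves like the correct one when the fraction is 1, which is why a second node is needed to exercise it.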

R: @Poorvankbhatia

@echauchot
Contributor Author

@Poorvankbhatia PTAL

@Poorvankbhatia
Contributor

Poorvankbhatia commented Sep 7, 2025

Hey @echauchot. Thanks for the diff. It makes sense to me. I added some comments.
One issue: when I ran mvn clean install on my local Mac (on PR#37), it failed with the error below (it passes on the main branch):

org.apache.flink.connector.cassandra.source.CassandraSourceITCase.testSourceSingleSplit(TestEnvironment, DataStreamSourceExternalContext, CheckpointingMode)[1]  Time elapsed: 0.188 s  <<< ERROR!
com.datastax.driver.core.exceptions.WriteFailureException: Cassandra failure during write query at consistency LOCAL_ONE (1 responses were required but only 0 replica responded, 1 failed)

This is typically a timing/cluster-readiness problem in tests rather than a logic bug, especially if CI is green. So I am unable to get the test running locally, but since the CI is green, maybe the issue is with my setup. (I tried giving Docker more CPU/RAM but couldn't get it working.)

@echauchot
Contributor Author

Yes, I get this from time to time on my laptop. I could raise the configured query timeouts and also see how it behaves with the new startup conditions.

@Poorvankbhatia
Contributor

Yes that would be pretty helpful. 👍

@echauchot
Contributor Author

Actually, the timeouts were incorrectly applied because of an error in the name of the conf file. So I just applied the original timeout values and ran the tests 5 times on my laptop without hitting the issue above.

@echauchot
Contributor Author

@Poorvankbhatia thanks for your review! I have addressed all the comments, PTAL.

Contributor

@Poorvankbhatia Poorvankbhatia left a comment

It still doesn't work on my machine somehow 😅. But I think the CI is green, so that is good.
Thank you for resolving the comments.
LGTM.

@echauchot
Contributor Author

Regarding the CI: only the name test succeeds; the build times out while downloading the Flink binaries from https://archive.apache.org. The download is incredibly slow (less than 100 KB/s) and has been for some time. I'll see if another URL works better.

Regarding the timeouts on write requests: I'll raise the write request timeout. Could you please test on your local machine? On mine it works well.

@echauchot
Contributor Author

One thing strikes me though: the write query uses CL=1, whereas it should use the lower CL=ANY, since org.apache.flink.connector.cassandra.CassandraTestEnvironment#builderForWriting is configured with CL=ANY. This means that somehow org.apache.flink.connector.cassandra.CassandraTestEnvironment#builderForReading (CL=1) is used for writing as well.

@echauchot
Contributor Author

Ah no, the builders are used for the old connectors' tests. The consistency level needs to be configured on the mapper side, in org.apache.flink.connector.testframe.external.ExternalSystemSplitDataWriter#writeRecords.

@echauchot echauchot force-pushed the FLINK-37937-multiple-test-containers branch from 176baad to 623d78f Compare September 9, 2025 11:17
@echauchot
Contributor Author

echauchot commented Sep 9, 2025

@Poorvankbhatia I have dealt with the write timeouts: I set the consistency level to the lower ANY instead of the default (CL=1) when writing test data, and raised the write request timeout to the same value as the read request timeout.

Please test on your local machine that the flaky issue is gone.

I'll merge afterwards

@echauchot
Contributor Author

echauchot commented Sep 10, 2025

Downloading the Flink binary still times out: we cannot download it within the allowed 1h20 because archive.apache.org seems too slow (it seems to have been experiencing issues lately: https://status.apache.org/#past-incidents). The other PRs use another version of Flink (migrated to 2.0 and 2.1), so they benefit from a cache.

@echauchot echauchot force-pushed the FLINK-37937-multiple-test-containers branch from 623d78f to 6069d83 Compare September 10, 2025 13:19
@echauchot
Contributor Author

echauchot commented Sep 10, 2025

Regarding machine load and timeouts / missing data: I have set the consistency level to ONE so that read and write requests are coherent, because CL=ANY on write requests could lead to hint writes that would be invisible to subsequent read requests. I guess the extended write timeout of 15 s gives a write enough time to wait for 1 replica before the ack, even on loaded machines.

Please test on your local machine, as I cannot reproduce on mine.
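
The reasoning about CL=ANY can be illustrated with a toy model. This is only a sketch under simplifying assumptions: `ToyCluster` is a hypothetical single-replica stand-in, not the Cassandra driver or the test framework, and it only models the documented behaviour that a write at ANY may be acknowledged once a hint is stored, while reads only see data held by replicas.

```java
import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

/** Toy model of hinted writes; not real Cassandra or driver code. */
public class HintedWriteSketch {
    enum WriteCl { ANY, ONE }

    /** Hypothetical cluster with one replica and a coordinator-side hint queue. */
    static final class ToyCluster {
        final Map<String, String> replica = new HashMap<>();
        final List<String[]> hints = new ArrayList<>();
        boolean replicaUp = true;

        /** Returns true when the write is acknowledged to the client. */
        boolean write(String key, String value, WriteCl cl) {
            if (replicaUp) {
                replica.put(key, value);
                return true;
            }
            if (cl == WriteCl.ANY) {
                // At ANY, a stored hint is enough to acknowledge the write.
                hints.add(new String[] {key, value});
                return true;
            }
            // At ONE, no replica answered, so the write fails.
            return false;
        }

        /** Reads only see replica data; hints are not readable. */
        String read(String key) {
            return replica.get(key);
        }

        /** Replays the stored hints once the replica is back. */
        void replicaRecovers() {
            replicaUp = true;
            for (String[] hint : hints) {
                replica.put(hint[0], hint[1]);
            }
            hints.clear();
        }
    }

    public static void main(String[] args) {
        ToyCluster cluster = new ToyCluster();
        cluster.replicaUp = false;
        System.out.println("ANY acked: " + cluster.write("k", "v", WriteCl.ANY));
        System.out.println("read before replay: " + cluster.read("k"));
        cluster.replicaRecovers();
        System.out.println("read after replay: " + cluster.read("k"));
    }
}
```

In this model a test that writes at ANY and reads back immediately can see missing rows, whereas a write at ONE either lands on a replica or fails visibly.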

@echauchot
Contributor Author

FYI: https://lists.apache.org/thread/vb9bd4toq11g0x59x0q47dp5lqdlqgyr

@echauchot echauchot force-pushed the FLINK-37937-multiple-test-containers branch from 6069d83 to 98320b7 Compare September 12, 2025 10:48
@echauchot echauchot force-pushed the FLINK-37937-multiple-test-containers branch from 590cb87 to 98320b7 Compare September 16, 2025 09:12
@echauchot
Contributor Author

@Poorvankbhatia I have the impression that there is a build result cache issue on GitHub. The build is not finished, but the previous build is displayed.

@Poorvankbhatia
Contributor

Poorvankbhatia commented Sep 17, 2025

Yeah, it is still finding the cassandraContainer variable. Is there no way to clear these build caches? Do you see any such field in the Actions tab?

@echauchot
Contributor Author

It is as if it were showing a very old build of an old commit (from before the second container was added). At least I can link to the correct GitHub Actions builds from my IDE with the GitHub plugin.

I have already checked the GitHub Actions cache, and the only things I see are the flink-binaries (not related, as it is flink-core) and setup-java-Linux-x64-maven*. I'll try to remove the latter.

Anyway, I see in the regular build that the startup of both containers does not fit in the 3 min graceful timeout.

@echauchot
Contributor Author

Well, even a 10 min timeout is not enough to start 2 containers on the GitHub CI!
Locally, all tests run in 2 min including the startup time.
I might change my plans (not start the containers in parallel, or change the wait condition) to make it pass on the GitHub CI.

@Poorvankbhatia
Contributor

I ran the change locally (more than 10 times) and it succeeded every time. Pretty sure it is a CI issue 😄 @echauchot

@echauchot echauchot force-pushed the FLINK-37937-multiple-test-containers branch from 60e9162 to e706c1b Compare September 18, 2025 08:33
@echauchot
Contributor Author

Yes, GitHub Actions is a shared environment, so it is pretty loaded.
With the 2 containers starting in sequence instead of in parallel and a 3-minute startup timeout, it passes on the GitHub CI.
For some (strange) reason, this PR automatically links to the build of a very old commit. Here is the link to the build of the last commit of the PR: https://github.com/echauchot/flink-connector-cassandra/runs/50669333696
It is all green, so I'll merge the PR, as you have already approved it.

…to update size estimates: flush updates the SSTables and refreshsizeestimates updates the size estimates based on them
…tions that are not interpreted by Cassandra cluster
…tion accessible to tests and not call estimate_size during tests. In refreshSizeEstimates wait until system.size_estimates has at least a row that has non-null mean_partition_size
…coherent between read and write request because CL=ANY in write requests could lead to hint writes that would be invisible to subsequent read requests. Raise write request timeout to the same timeout as read request. Put replication factor to 2 to deal with temporary down cassandra container.

Put back sequential start of the 2 containers.
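
The wait mentioned in the commit messages above (poll until system.size_estimates has at least one row with a non-null mean_partition_size) follows a common poll-with-deadline pattern. Below is a minimal sketch of that pattern, not the code of this PR: `pollUntil` is a hypothetical helper, and the condition here is a simple counter standing in for the real query against Cassandra.

```java
import java.time.Duration;
import java.util.function.BooleanSupplier;

/** Generic poll-until-ready helper; illustrative only. */
public class PollUntilSketch {

    /**
     * Re-evaluates the condition every interval until it holds or the timeout expires.
     * Returns true if the condition became true, false on timeout or interruption.
     */
    static boolean pollUntil(BooleanSupplier condition, Duration timeout, Duration interval) {
        final long deadline = System.nanoTime() + timeout.toNanos();
        while (true) {
            if (condition.getAsBoolean()) {
                return true;
            }
            if (System.nanoTime() - deadline >= 0) {
                return false;
            }
            try {
                Thread.sleep(interval.toMillis());
            } catch (InterruptedException e) {
                // Preserve the interrupt flag and give up.
                Thread.currentThread().interrupt();
                return false;
            }
        }
    }

    public static void main(String[] args) {
        // Stand-in condition: becomes true on the third check. In a real test it would
        // query system.size_estimates and look for a non-null mean_partition_size.
        int[] checks = {0};
        boolean ready = pollUntil(() -> ++checks[0] >= 3, Duration.ofSeconds(5), Duration.ofMillis(10));
        System.out.println("ready=" + ready + " after " + checks[0] + " checks");
    }
}
```

Checking the condition before checking the deadline guarantees at least one evaluation even with a very short timeout, which keeps the helper predictable on loaded CI machines.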
@echauchot echauchot force-pushed the FLINK-37937-multiple-test-containers branch from e706c1b to 2a05068 Compare September 18, 2025 09:04
@echauchot
Contributor Author

echauchot commented Sep 19, 2025

@Poorvankbhatia I understood the GitHub build issue: it is not a cache issue as we supposed. In the meantime, I merged your SQL PR to main, and that PR references the old cassandraContainer variable. GitHub tests the current PR by automatically doing a rebase on main. This rebase does not fail because there are no conflicts; it just adds my changes on top of yours. Hence the cassandraContainer variable left in the code and the compilation error. I'll manually rebase onto main and replace cassandraContainer with cassandraContainer1 in your code.

@Poorvankbhatia
Contributor

CI is green 😄 . I think you should merge. 👏

@echauchot echauchot merged commit 84c3fa0 into apache:main Sep 22, 2025
5 checks passed
@echauchot echauchot deleted the FLINK-37937-multiple-test-containers branch October 3, 2025 14:37