[FLINK-37937] Test Cassandra source in real cluster conditions to better test the split #37
Conversation
@Poorvankbhatia PTAL
Hey @echauchot. Thanks for the diff, it makes sense to me; I added some comments. I am unable to get the test running on my local machine, but since the CI is green maybe there is an issue with my setup (I tried giving Docker more CPU/RAM but couldn't get it working). It looks like a timing/cluster-readiness problem in the tests rather than a logic bug, especially since CI is green.
Yes, I see this from time to time on my laptop. I could raise the configured query timeouts and also check the new startup conditions.
Yes that would be pretty helpful. 👍
Actually, the timeouts were incorrectly applied because of an error in the name of the conf file. I reapplied the original timeout values and ran the tests 5 times on my laptop without hitting the issue above.
@Poorvankbhatia thanks for your review! I addressed all the comments, PTAL.
Poorvankbhatia left a comment:
It still doesn't work on my machine somehow 😅. But I think the CI is green, so that is good.
Thank you for resolving the comments.
LGTM.
Regarding the CI: only the name test succeeds; the rest fails with a timeout when downloading the Flink binaries from https://archive.apache.org. It is incredibly slow (less than 100 KB/s) and has been for some time; I'll see if another URL works better. Regarding timeouts on write requests: I'll raise the write request timeout; could you please test on your local machine? On mine it works well.
One thing strikes me though: the write query uses CL=1, whereas it should use the lower CL=ANY, since org.apache.flink.connector.cassandra.CassandraTestEnvironment#builderForWriting is configured with CL=ANY. This means that somehow org.apache.flink.connector.cassandra.CassandraTestEnvironment#builderForReading (CL=1) is used for writing as well.
Ah no, those builders are used for the old connector's tests. The consistency level needs to be configured on the mapper side, in org.apache.flink.connector.testframe.external.ExternalSystemSplitDataWriter#writeRecords.
Force-pushed 176baad to 623d78f
@Poorvankbhatia I have dealt with the write timeouts: set the consistency level to the lower ANY instead of the default (CL=1) when writing test data, and raised the write request timeout to match the read request timeout. Please test on your local machine that the flaky issue is gone; I'll merge afterwards.
Downloading the Flink binary still times out: we cannot download it within the allowed 1h20 because archive.apache.org seems too slow (it has been experiencing issues lately: https://status.apache.org/#past-incidents). The other PRs use another version of Flink (migrated to 2.0 and 2.1), so they benefit from a cached copy.
Force-pushed 623d78f to 6069d83
Regarding machine load and timeouts / missing data: I have set the consistency level to ONE, to be coherent between read and write requests, because CL=ANY on write requests could lead to hinted writes that would be invisible to subsequent read requests. I guess the extended 15 s timeout on writes will give a write enough time to wait for one replica's ack even on loaded machines; please test on your local machine, as I cannot reproduce on mine.
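For illustration, the alignment described above could look roughly like this. This is a sketch only, not the PR's actual code: it assumes the DataStax Java driver 3.x, and the class and method names (`TestClusterConfig`, `buildCluster`) are hypothetical.

```java
import com.datastax.driver.core.Cluster;
import com.datastax.driver.core.ConsistencyLevel;
import com.datastax.driver.core.QueryOptions;
import com.datastax.driver.core.SocketOptions;

// Hypothetical sketch: build a test Cluster with the same consistency level
// for reads and writes (CL=ONE) and a raised per-request timeout.
public class TestClusterConfig {
    static Cluster buildCluster(String contactPoint) {
        return Cluster.builder()
                .addContactPoint(contactPoint)
                .withQueryOptions(
                        // CL=ONE for both reads and writes avoids hinted writes
                        // (possible with CL=ANY) being invisible to later reads
                        new QueryOptions().setConsistencyLevel(ConsistencyLevel.ONE))
                .withSocketOptions(
                        // 15 s request timeout to tolerate loaded machines
                        new SocketOptions().setReadTimeoutMillis(15_000))
                .build();
    }
}
```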
FYI: https://lists.apache.org/thread/vb9bd4toq11g0x59x0q47dp5lqdlqgyr |
Force-pushed 6069d83 to 98320b7
Force-pushed 590cb87 to 98320b7
@Poorvankbhatia I have the impression that there is a build result cache issue on GitHub: the build is not finished, but the previous build is printed on screen.
yeah it is still finding the
It is as if it were showing a very old build on an old commit (before adding the second container). At least I can link to the correct GitHub Actions builds from my IDE with the GitHub plugin. I have already checked the GitHub Actions cache, and the only things I see are the flink-binaries (not related, as that is flink-core) and setup-java-Linux-x64-maven*; I'll try to remove the latter. Anyway, I see in the regular build that the startup of both containers does not fit in the 3 min graceful timeout.
Well, even a 10 min timeout is not enough to start 2 containers on the GitHub CI!
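The "wait for readiness until a deadline" pattern discussed here (container startup, size estimates) can be sketched as a small self-contained helper. This is a hypothetical illustration, not code from the PR; the `Await` class and its signature are assumptions.

```java
import java.time.Duration;
import java.time.Instant;
import java.util.function.BooleanSupplier;

// Hypothetical helper: poll a readiness condition until a deadline,
// the pattern used when waiting for Cassandra containers to come up.
public class Await {
    /** Returns true if the condition became true before the timeout elapsed. */
    public static boolean until(BooleanSupplier condition, Duration timeout, Duration pollInterval)
            throws InterruptedException {
        Instant deadline = Instant.now().plus(timeout);
        while (Instant.now().isBefore(deadline)) {
            if (condition.getAsBoolean()) {
                return true;
            }
            Thread.sleep(pollInterval.toMillis());
        }
        // one last check at the deadline
        return condition.getAsBoolean();
    }
}
```

With such a helper, the graceful timeout becomes a single tunable `Duration` instead of a hard-coded wait.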
I ran the change on my local (> 10 times) and it succeeded every time. Pretty sure it is a CI issue 😄 @echauchot |
Force-pushed 60e9162 to e706c1b
Yes, GitHub Actions is a shared environment, so it is pretty loaded.
…to update size estimates: flush updates the SSTables and refreshsizeestimates updates the size estimates based on them
…tions that are not interpreted by Cassandra cluster
…tion accessible to tests and not call estimate_size during tests. In refreshSizeEstimates wait until system.size_estimates has at least a row that has non-null mean_partition_size
…coherent between read and write requests because CL=ANY in write requests could lead to hint writes that would be invisible to subsequent read requests. Raise write request timeout to the same timeout as read requests. Put replication factor to 2 to deal with a temporarily down cassandra container. Put back sequential start of the 2 containers.
Force-pushed e706c1b to 2a05068
@Poorvankbhatia I understood the GitHub build issue: it is not a cache issue as we supposed. It is just that, in the meantime, I merged your SQL PR to main, and that PR references the old cassandraContainer variable. GitHub tests the current PR by automatically rebasing it on main. This rebase does not fail because there are no conflicts; it just adds my changes on top of yours. Hence the
…d of cassandraContainer
CI is green 😄 . I think you should merge. 👏 |
A bug in split calculation (ring fraction calculation) was uncovered by this PR. The existing split tests are run on an embedded Cassandra cluster with only one node. This leads to having ringFraction always equal to 1 (the single node hosts 100% of the data) during the tests. This masks the bug.
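The masking effect can be illustrated with a tiny, self-contained sketch. This is a hypothetical illustration, not the connector's actual code: the `RingFraction` class and its formula are assumptions, modeled on the Murmur3 partitioner's token ring.

```java
import java.math.BigInteger;

// Hypothetical illustration of why a single-node ring masks bugs:
// with one node the split's ring fraction is always 1.0, so an error
// in the fraction formula never shows up in the tests.
public class RingFraction {
    // Murmur3 partitioner tokens live in [-2^63, 2^63 - 1]; ring size is 2^64.
    static final BigInteger RING_SIZE = BigInteger.ONE.shiftLeft(64);

    /** Fraction of the token ring covered by the range (start, end]. */
    static double ringFraction(long start, long end) {
        BigInteger span = BigInteger.valueOf(end).subtract(BigInteger.valueOf(start));
        if (span.signum() <= 0) {
            span = span.add(RING_SIZE); // the range wraps around the ring
        }
        return span.doubleValue() / RING_SIZE.doubleValue();
    }

    public static void main(String[] args) {
        // One node owning the whole ring: fraction is (2^64 - 1)/2^64, i.e. ~1.0.
        System.out.println(ringFraction(Long.MIN_VALUE, Long.MAX_VALUE));
        // Two nodes splitting the ring evenly: each owns exactly half.
        System.out.println(ringFraction(Long.MIN_VALUE, 0L));
    }
}
```

With two nodes, fractions below 1.0 finally exercise the division, which is how this PR surfaced the bug.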
Test splits on an embedded cluster of 2 nodes.
Additional changes:
PS: tests take longer to set up because of the 2-node cluster: measured on my laptop, 2min16s vs 57s, but I think the better split testing is worth it.
R: @Poorvankbhatia