refactor s3 storage to use context manager to avoid race condition #10387
Conversation
So I've done a try with 200 test runs (takes more than an hour with 0.5 CPU 😭), and it failed 6 times, around a 3% failure rate; I'm not really happy about that. I'm going to push another commit with a workaround that uses the real file modification nano time to compare against the object. I'm not fond of it and hope we can find another solution, but this would unblock the situation. I could also cherry-pick that commit and open another PR for it, which might be cleaner. I'll also run benchmarks to compare the difference between …
Incredible investigation and great set of changes. 🥇 Really love the approach of reproducing the issue by creating certain favorable conditions and then just working with percentages 👍
Changes LGTM!
```python
# TODO: remove this with 3.3, this is for persistence reason
if not hasattr(s3_object, "internal_last_modified"):
    s3_object.internal_last_modified = s3_stored_object.last_modified
```
+1 for backwards compatibility
Motivation
As reported in #10003 (also in apache/arrow-rs#5283), we would sometimes encounter an error when doing a lot of concurrent access (read and write) to S3 on the same object.
The bug is extremely hard to reproduce: the only way I could do it was by running the LocalStack image with the Docker flag `--cpus=0.5` to simulate a constrained environment, and running the Rust test suite of `object_store` (which is really fast) forever until it would break (between 5% and 10% of the time, I would say...).

When it failed, we would get a cryptic message from the ASGI bridge. After reproducing the issue a few times, I added a bunch of debug statements in `rolo` and found the culprit: a `GetObject` request with `Content-Length` set to 1, on a key with the `RACE-` prefix, followed by a `PutObject` of length 2. The first call gets the object metadata, which still indicates a content length of 1, but then receives the full new data content `10`, so it fails.

Basically, this is a very small race condition between the state of the object and its value. I suspect the issue most probably happens between the `Get` and the `Put` call, when we return the `S3EphemeralObject` and pass it to the response handler. However, its iterator and `__iter__` are not called until the end of the chain, which is only when the read lock gets acquired. In between, a `Put` call can still snatch the write lock and modify the object's value.
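To make that window concrete, here is a minimal, self-contained sketch (illustration only, not the LocalStack code; the class and method names are invented) of how handing out a lazy iterator defers taking the lock until the body is actually streamed, leaving room for a concurrent write:

```python
import threading


class FakeStoredObject:
    """Toy stand-in for a stored S3 object protected by a lock (illustration only)."""

    def __init__(self):
        self._lock = threading.Lock()  # the real code uses a read/write lock
        self._data = b"1"

    def read_lazy(self):
        # The generator body does not run until the caller starts iterating,
        # so the lock is only acquired at the end of the handler chain.
        def iterator():
            with self._lock:
                yield self._data

        return iterator()

    def write(self, data: bytes):
        with self._lock:
            self._data = data


obj = FakeStoredObject()
chunks = obj.read_lazy()  # metadata says "length 1", but no lock is held yet
obj.write(b"10")          # a concurrent PutObject can still grab the lock here
print(b"".join(chunks))   # b"10": the body no longer matches the announced length
```

In LocalStack the chain is the ASGI response path and the lock is a read/write lock, but the shape of the problem is the same.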
Changes
Now, after this small refactor, as soon as we create one of these `S3StoredObject` subclasses, we acquire the lock (in `read` or `write` mode). The lock stays acquired for the lifetime of the object, which means we have better control over grouping actions together (modifying the metadata of the object can now be done inside the `WriteLock`).

Also, it looks much nicer now: we can use a context manager around the `S3StoredObject` by obtaining it with the `.open()` call. The caller of `open` is always responsible for closing it, and almost every single usage is now done inside a context manager.
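As an illustration, here is a minimal sketch of the pattern (not the actual provider code; the classes, the plain lock, and the `open()` signature below are simplified assumptions):

```python
import threading


class StoredObjectSketch:
    """Illustrative stand-in for an S3StoredObject-like wrapper: the lock is
    acquired as soon as the object is created and held until close()."""

    def __init__(self, storage: "StorageSketch", mode: str):
        self._storage = storage
        self._mode = mode
        self._lock = storage.lock
        self._lock.acquire()  # the real code would take a read or write lock here

    def read(self) -> bytes:
        return self._storage.data

    def write(self, data: bytes) -> None:
        if self._mode != "w":
            # guard against writing through an object opened in read mode
            raise RuntimeError("object was opened read-only")
        self._storage.data = data

    def close(self) -> None:
        self._lock.release()

    # context manager support: the caller of open() is responsible for closing
    def __enter__(self) -> "StoredObjectSketch":
        return self

    def __exit__(self, *exc) -> None:
        self.close()


class StorageSketch:
    def __init__(self) -> None:
        self.lock = threading.Lock()  # the real implementation uses a read/write lock
        self.data = b""

    def open(self, mode: str = "r") -> StoredObjectSketch:
        return StoredObjectSketch(self, mode)


storage = StorageSketch()
with storage.open(mode="w") as obj:  # write lock held for the whole block
    obj.write(b"10")
with storage.open() as obj:          # read lock held until the block exits
    print(obj.read())                # b"10"
```

Tying the lock's lifetime to the object's and releasing it in `close()` is what lets a `with` block make the critical section explicit.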
I've modified the `EphemeralS3StoredMultipart` logic to not store the full `EphemeralS3StoredObject` anymore, in order to release the lock, and to properly create an object when we need a part. This looks cleaner.

I've also added some checks to not write to a non-writable object, to avoid creating a race condition by not acquiring a write lock.
The only exception is for `GetObject`: in that case, when passing an iterator to the chain, the server is responsible for calling `.close()` on the iterator (see #8926). This is actually the fix for the issue: we now properly acquire the read lock in the provider call, and keep that lock acquired until the response is sent.

Also, special thanks to @tustvold, who has been very understanding and helpful with the reports and the really nice test suite.
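A minimal sketch of that idea (assumptions only, not the actual rolo or provider code): the read lock is taken before the iterator is handed to the response chain, and it is only released when the server calls `close()` after streaming the body.

```python
import threading
from typing import Iterator


class LockedBodyIterator:
    """Illustration: wraps response chunks and holds an already-acquired lock
    until the HTTP server calls close() after streaming the body."""

    def __init__(self, chunks: Iterator[bytes], lock: threading.Lock):
        self._chunks = chunks
        self._lock = lock  # acquired by the provider before handing this out

    def __iter__(self):
        return self

    def __next__(self) -> bytes:
        return next(self._chunks)

    def close(self) -> None:
        # called by the server once the response has been fully sent
        self._lock.release()


# usage sketch: the provider acquires the read lock, then returns the iterator
lock = threading.Lock()
lock.acquire()
body = LockedBodyIterator(iter([b"10"]), lock)
print(b"".join(body))  # streamed by the server...
body.close()           # ...which then releases the read lock
```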
Testing
I've been running the (sometimes) failing test for 40 minutes now and have encountered only one failure, so it is still not fully fixed, but I believe the window where the race condition can happen has been greatly reduced. I can still pinpoint when it happens: with the current architecture, we have to get the object metadata before the actual object content, so the object can get updated between these two actions. If anyone has an idea on how to fix this, I'm open to suggestions.
Edit: see the comment above, but I got a 3% failure rate over 200 runs, so I've updated the PR with a workaround that uses the real object modification time to do a check inside the read lock, where we can fetch a possibly updated object. This seems to have done it: I've now done 200 runs without a failure, so this is fixed by combining the new lock system with this little trick.
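For illustration, a rough sketch of what such a check could look like (assumptions only: the real workaround lives in the provider and its own storage layer; here `os.stat(...).st_mtime_ns` stands in for the "real file modification nano time" and a plain lock for the read lock):

```python
import os
import tempfile
import threading


class ObjectStateSketch:
    """Toy metadata record standing in for the S3 object state (illustration only)."""

    def __init__(self, path: str):
        self.path = path
        self.internal_last_modified = os.stat(path).st_mtime_ns


def read_checked(state: ObjectStateSketch, lock: threading.Lock) -> bytes:
    """Inside the read lock, detect that the underlying file changed after the
    metadata was captured, and refresh the metadata before serving the content."""
    with lock:
        real_mtime = os.stat(state.path).st_mtime_ns
        if real_mtime != state.internal_last_modified:
            # the object was overwritten in between: pick up the updated version
            state.internal_last_modified = real_mtime
        with open(state.path, "rb") as f:
            return f.read()


# usage sketch
with tempfile.NamedTemporaryFile(delete=False) as tmp:
    tmp.write(b"10")
print(read_checked(ObjectStateSketch(tmp.name), threading.Lock()))
```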
```bash
for i in {1..200}; do cargo test aws::tests::s3_test --features aws -- --exact; done
```