re-introduce S3 virtual host rewriter for v3 #9450
General comment about the approach: I think this is a good solution for now. I don't think we need to be too dogmatic about principles like immutability when it comes to S3, we're just making our life difficult, and if it works and doesn't break anything in any way, why not.
8fd49d0
to
119c716
Compare
Restoring the virtual host rewriting, instead of using the proxy infrastructure, seems fine considering the immense performance impact.
In the future, we might be able to move away from it again (in case we can do an in-memory proxy forwarding without the HTTP overhead), but for now this is a nice solution.
I am just not sure if this wouldn't break the legacy provider (even though it's going to be removed soon), since it's also using the virtual host rewriter, and the regex seems to be simplified quite a lot (to a degree where it catches way less, for example it seems to neglect protocols in host headers)?
I've updated the code so that the functionality for the legacy provider should be untouched, which allows us to release this as non-breaking before v3. I've run a quick test to see if the legacy provider was still working well:
# started LocalStack with `PROVIDER_OVERRIDE_S3=legacy`
$ awslocal s3 mb s3://test
make_bucket: test
$ curl http://test.s3.localhost.localstack.cloud:4566
<?xml version="1.0" encoding="UTF-8"?>
<ListBucketResult xmlns="http://s3.amazonaws.com/doc/2006-03-01/"><Name>test</Name><MaxKeys>1000</MaxKeys><IsTruncated>false</IsTruncated><Marker></Marker></ListBucketResult>
I also sneaked in a little fix to the legacy provider, because I couldn't run a test because it didn't accept the |
Awesome! Thanks for addressing the comments! The changes are now looking good, and the legacy implementation is clearly separated from the new implementation (which will make it super easy to remove with 3.0). 🚀
Motivation
We removed the `VirtualHostRewriter` from the parser with the introduction of S3v2, because it is not optimal and breaks the `Request` contract: it mutates the request during parsing.
However, we've had several issues with the proxying of requests (which should now be solved), and proxying is a problem performance-wise. I've run several benchmarks with and without this PR to show the impact of the proxying.
I agree that mutating the request is not the right thing to do, but this is temporary before we address the shortcomings we have. There are different solutions to the Virtual Hosted requests problem.
`EnvironBuilder` and rolling our own mechanism. But if we have that, we could simply copy the request and mutate the copy before parsing, avoiding a network roundtrip.
`bucket` in the path), with some kind of mechanism to get the bucket parameter from the host
Right now, the simplest seems to be the request copying then mutating mechanism. In the meantime, I believe re-introducing the `VirtualHostRewriter` is needed because of the 2x performance drop.
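To make the copy-then-mutate idea concrete, here is a minimal sketch. It assumes a simplified, hypothetical `SimpleRequest` type and a hypothetical `to_path_style` helper; neither is LocalStack's `Request` class or the `VirtualHostRewriter` from this PR, and the host is split on a plain `.s3.` marker purely for illustration.

```python
from dataclasses import dataclass, replace


@dataclass(frozen=True)
class SimpleRequest:
    """Hypothetical stand-in for an HTTP request (not LocalStack's Request)."""

    method: str
    host: str
    path: str


def to_path_style(request: SimpleRequest) -> SimpleRequest:
    """Return a path-style copy of a virtual-hosted request.

    The original request is never mutated; a new object is returned instead.
    """
    # Assumption: the bucket is everything before the first ".s3." in the host.
    bucket, marker, rest = request.host.partition(".s3.")
    if not marker or not bucket:
        # No bucket in the host, so the request is already path-style.
        return request
    return replace(request, host="s3." + rest, path=f"/{bucket}{request.path}")


original = SimpleRequest("GET", "test.s3.localhost.localstack.cloud:4566", "/key")
rewritten = to_path_style(original)
```

Because the parser would only ever see `rewritten`, the incoming request keeps its original host and path, which is the property the in-place rewriter gives up.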
Benchmarks
`12b` and `100kb` were tested for 1000 requests, `10mb` for 100.
This is a small Python benchmark against LocalStack in host mode.
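The benchmark script itself is not included here. A rough sketch of how such a measurement could be structured is shown below; `average_latency_ms` and `ITERATIONS` are hypothetical names, and the operation being timed is passed in as a callable so the harness does not assume any particular client.

```python
import time
from typing import Callable

# Iteration counts as stated in the text, keyed by object size label.
ITERATIONS = {"12b": 1000, "100kb": 1000, "10mb": 100}


def average_latency_ms(operation: Callable[[], object], iterations: int) -> float:
    """Call `operation` `iterations` times and return the mean time per call in ms."""
    start = time.perf_counter()
    for _ in range(iterations):
        operation()
    elapsed = time.perf_counter() - start
    return elapsed * 1000 / iterations
```

In practice `operation` would wrap a single S3 call against the local endpoint, for instance a closure around a boto3 `put_object` or `get_object` call with a payload of the given size.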
Note: the `GET` time for the large object seems unusually fast; I believe the previous benchmarks I had done with `warp` may have been reading the stream back.
Additional notes
I also believe that mutating the request this way was an issue when the request would then be sent to moto: the parsed request would not be the one moto had received. But we don't suffer from this issue anymore, because in v3 we never manipulate the request directly, only the `ServiceRequest`, which is a big advantage.
Notes on the Virtual Host regex
I believe we can greatly simplify the very greedy virtual host regex, because we've been using the regex from the router since v2 to "capture" virtual hosted requests, and we haven't had an issue since. Also, the current regex only allows domains that are `localhost.localstack.cloud` and `amazonaws.com`. I've run a quick test with several entries (but not multiple times, because the entries are cached) to show the difference between the two regexes when matching 19 entries, both virtual host and not:
`old regex`: 0.0522ms / 19 entries
`new regex`: 0.0228ms / 19 entries
Changes
Re-introduce the `VirtualHostRewriter` in the parser for the v3 provider.
Simplified the virtual host regex and made use of named groups; we can now extract the region from it, which might come in handy at some point in a handler. \cc @viren-nadkarni
Added a test for the new regex.
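As an illustration of the named-group approach, here is a small sketch. The pattern is hypothetical and written only for this example; the regex actually added in the PR may differ. It assumes the two domains mentioned above and an optional region segment between `s3` and the domain.

```python
import re
from typing import Optional

# Hypothetical pattern for illustration; not the regex from this PR.
VHOST_REGEX = re.compile(
    r"^(?P<bucket>.+?)\.s3\."
    r"(?:(?P<region>[a-z0-9-]+)\.)?"
    r"(?P<domain>localhost\.localstack\.cloud|amazonaws\.com)"
    r"(?::\d+)?$"
)


def parse_virtual_host(host: str) -> Optional[dict]:
    """Return the bucket, region and domain of a virtual-hosted Host header.

    Returns None when the host is not in virtual-hosted style.
    """
    match = VHOST_REGEX.match(host)
    return match.groupdict() if match else None


local = parse_virtual_host("test.s3.localhost.localstack.cloud:4566")
regional = parse_virtual_host("my-bucket.s3.us-east-1.amazonaws.com")
```

With named groups, a handler can read `regional["region"]` directly instead of relying on positional group indexes, and a path-style host such as `s3.localhost.localstack.cloud:4566` simply does not match.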