SslStream Improvements for Unix Performance #22998
/cc @stephentoub, @davidsh, @CIPop, @Priya91 PTAL
#22485 needs to be fixed for the build to work.
@dotnet-bot Test Outerloop Windows x64 Debug Build
private unsafe delegate int CreateDelegate(bio_st* bio);

[UnmanagedFunctionPointer(CallingConvention.Cdecl)]
private unsafe delegate int ReadDelegate(IntPtr bio, void* buf, int size);
Struct wrapper for these types to make it strongly typed? e.g.
struct Bio
{
    IntPtr bioPtr;
}
The wrapper should happily go through the function pointer; you can also return it from externs instead of IntPtr.
You can also add methods to the struct if you want to act on the IntPtr, and then it behaves and looks a little like a regular class/struct.
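A minimal sketch of the struct-wrapper idea suggested above, under stated assumptions: `BioHandle` and `ReadCallback` are hypothetical names and are not taken from the PR. The delegate uses `IntPtr` for the buffer so the sample compiles without `unsafe`. The only thing it demonstrates is that a single-field struct is the same size as the pointer it wraps and can appear in a callback signature.

```csharp
using System;
using System.Runtime.InteropServices;

// Hypothetical wrapper (not from the PR): one IntPtr field, so the struct is
// pointer-sized and can replace IntPtr in interop signatures.
[StructLayout(LayoutKind.Sequential)]
public readonly struct BioHandle
{
    private readonly IntPtr _ptr;

    public BioHandle(IntPtr ptr) => _ptr = ptr;

    // Helper methods make the handle read like a regular type.
    public bool IsNull => _ptr == IntPtr.Zero;
}

// Hypothetical callback shape, loosely modeled on ReadDelegate in the diff.
[UnmanagedFunctionPointer(CallingConvention.Cdecl)]
public delegate int ReadCallback(BioHandle bio, IntPtr buf, int size);

public static class Program
{
    public static void Main()
    {
        BioHandle bio = new BioHandle(new IntPtr(0x1234));
        ReadCallback read = (b, buf, size) => b.IsNull ? -1 : size;

        // The wrapper adds no size overhead over the raw pointer.
        Console.WriteLine(Marshal.SizeOf<BioHandle>() == IntPtr.Size);
        Console.WriteLine(read(bio, IntPtr.Zero, 16));
    }
}
```

The gain is compile-time only: a method that expects a `BioHandle` can no longer be handed an unrelated `IntPtr` by accident.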
I wondered about that. I originally used the BIO SafeHandle, but obviously a class can't be passed through a callback. I am happy to make that change.
@dotnet-bot Test Outerloop UWP CoreCLR x64 Debug Build
@bartonjs Quick question: does SslStream support rehandshaking at all when using OpenSSL? From looking at it, I can't see that it does, but I have been known to be wrong. It will affect ongoing changes I wanted to make.
@Drawaes I don't think you can initiate it through the .NET API, but if you connect to
_writeDelegate = Write;
_readDelegate = Read;

var name = Marshal.StringToHGlobalAnsi("Managed Bio");
Improper use of var (type name must be on the line via new, as, or hard-cast)
    bwrite = _writeDelegate,
};

var memory = Marshal.AllocHGlobal(Marshal.SizeOf<bio_method_st>());
Improper var. (Pervasive)
Removed many, many var violations (suddenly the analyzers are working in my VS :) )
}

context.InputBio.SetBio(recvBuf, recvOffset, recvCount);
context.OutputBio.SetBio(null, true);
Use named parameter calling so that the true has a purpose more obvious to a reader.
Done. I used a named param for the null as well, just to be clear, and also modified the call in the Encrypt method to use a named param for the true/false.

public int TakeBytes(out byte[] output)
{
    var bytes = _bytesWritten;
Do we need to have Write and TakeBytes be synchronized?
I am not 100% sure what you mean by synchronized here. Do you mean that TakeBytes could be called by a different thread, or that the two methods could be called at the same time? Are you concerned about an unseen cached value, or a race?
Race conditions. Like Write sees _byteArray != null right after TakeBytes set _bytesWritten to 0, but before _byteArray = null. Since neither of these methods works on/with pointers directly it might not be necessary since the class is (IIRC) described as not thread-safe.
Ahh, okay. When you call Encrypt/DoHandshake, your thread doesn't return until it has done all the write/read calls, so you will never get to GetBytes in the middle, as the methods are all synchronous in nature. I think it is fine. The locking to ensure that, by the way, is provided by your init method when you give OpenSSL the mutexes it asks for.
{
    if (_handle.IsAllocated)
    {
        _handle.Free();
Does GCHandle.Free cause IsAllocated to become false? (I don't see anything either way in the docs)
Yes, it does; though since it's a struct, it needs to be passed by ref or held as a class member for the change to be reflected (i.e. there must be a single instance).
and not made readonly :) (that has bitten me before)
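The two caveats above (single instance, not readonly) can be sketched as follows. This is an illustration only: `PinnedPair` is a hypothetical class, not code from the PR. It relies on the C# rule that calling a mutating method on a readonly struct field operates on a hidden copy, and on `GCHandle.Free` clearing the handle stored in the struct it is called on.

```csharp
using System;
using System.Runtime.InteropServices;

// Hypothetical holder class used only to contrast the two field kinds.
public sealed class PinnedPair
{
    private GCHandle _handle;                  // mutable field: Free() updates it in place
    private readonly GCHandle _readonlyHandle; // readonly field: Free() runs on a copy

    public PinnedPair(byte[] first, byte[] second)
    {
        _handle = GCHandle.Alloc(first, GCHandleType.Pinned);
        _readonlyHandle = GCHandle.Alloc(second, GCHandleType.Pinned);
    }

    // Returns IsAllocated as observed on the field after calling Free().
    public bool FreeMutable()
    {
        _handle.Free();
        return _handle.IsAllocated;
    }

    // The underlying handle is released, but the field itself never changes,
    // so the field keeps reporting that it is allocated.
    public bool FreeReadonly()
    {
        _readonlyHandle.Free();
        return _readonlyHandle.IsAllocated;
    }
}

public static class Program
{
    public static void Main()
    {
        PinnedPair pair = new PinnedPair(new byte[16], new byte[16]);
        Console.WriteLine(pair.FreeMutable());  // False: the field was cleared
        Console.WriteLine(pair.FreeReadonly()); // stale state on the readonly field
    }
}
```

In the readonly case an `if (_handle.IsAllocated) _handle.Free();` guard would not prevent a second `Free()` on an already released handle, which is presumably the trap being described.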

// This allows the write buffer to move during a multi call write, this stops us having to pin it
// across multiple calls where there is an async output to the innerstream inbetween
Ssl.SslCtxSetAcceptMovingWriteBuffer(innerContext);
Do we know this to be always safe? What guarantees does OpenSSL need on the stability of the memory, and what guarantees can we meet with regard to GC compaction?
OpenSSL requires that the memory content stays the same, but the pointer itself can move. The length must also be the same. So I feel this is fine. From the OpenSSL documentation:
Make it possible to retry SSL_write() with changed buffer location (the buffer contents must stay the same). This is not the default to avoid the misconception that non-blocking SSL_write() behaves like non-blocking write().
It's really only in place to stop silly mistakes. In this case we know we won't change the content (unless the user is silly enough to modify the input array from another thread while it is still inside the method and has not been returned). It is possible for a user to do this, but then you end up in the same situation as a user doing this while you are in the middle of writing or copying the data anyway.
I do have an idea of how I could possibly remove this, but it's not an easy change due to the APM model of the SslInternalStream; the large amount of callback chaining in there makes state a tough call.
}

[StructLayout(LayoutKind.Sequential)]
private unsafe struct bio_method_st
Native structs from OpenSSL must not have any implied mapping to their memory layout. Instead you need to make a function which takes the required parameters and does the allocation and assignment from within the shim. (Also applies to bio_st)
Native structs are gone; there is a function to set the two callbacks required. All other callbacks and the construction of the structs are now in the shim.
[Flags]
private enum BIO_TYPE
{
    BIO_TYPE_SOURCE_SINK = 0x0400,
I don't see native asserts for any of the new enum values.
Most of these will go away if I do the below and move the struct generation/allocation to the shim
I have removed all the enums as they are no longer needed with the new shim methods.
/cc @bartonjs Changes to the shim/native/interop area have been made to move the struct generation into the shim. It's possibly faster now that there are 4 callbacks per frame write (the control callback) that are no longer interop calls but run entirely in native code.
It's currently failing some tests with the shim change; I will fix it after work tonight.
@dotnet-bot Test Outerloop Linux x64 Release Build
It looks to be working locally in my tests. I suspect the test failures in the outerloop are from a change other than mine. I will double check, but other than that the code should be fine.
Failures
Debian.87.Amd64.Open
    SendAsync_RequestVersion20_ResponseVersion20IfHttp2Supported(server: https://http2.akamai.com/)
Fedora.26.Amd64.Open | fedora.25.amd64.Open
    System.Net.Sockets.Tests - Catastrophic failure
All Linux
    System.Net.WebSockets.Client.Tests.CancelTest
        ReceiveAsync_AfterCancellationDoReceiveAsync_ThrowsWebSocketException(server: ws://corefx-net.cloudapp.net/WebSocket/EchoWebSocket.ashx) https://github.com/dotnet/corefx/issues/23038
        ReceiveAsync_AfterCancellationDoReceiveAsync_ThrowsWebSocketException(server: wss://corefx-net.cloudapp.net/WebSocket/EchoWebSocket.ashx) https://github.com/dotnet/corefx/issues/23038
    System.Net.WebSockets.Client.Tests.SendReceiveTest
        SendAsync_MultipleOutstandingSendOperations_Throws(server: wss://corefx-net.cloudapp.net/WebSocket/EchoWebSocket.ashx)
Yeah, I saw that. I will check after the day job. I didn't see those locally, although I see a number of socket changes for Span have snuck into my test build, so I will rebase and see what is going on.
Think you have to run outerloop locally also
or
Though at least one of them is #23038
Yeah I suspected it was not from my change because
I haven't rebased that to my local branch, so I wouldn't see the failure. I suspect the break isn't from me, but I will take a look anyway.
The socket failure on Fedora seemed to be a timeout, but I am not sure. Anyway, I will have a crack at it when I am back in the hotseat.
I believe there are a bunch of known hangs/failures on Fedora, but @geoffkizer can comment.
FWIW, the two wss ones on Linux would be using SslStream.
You mean I can't blame your Span changes... :)
@benaadams Yeah, that would be odd. Just to confirm though, you could check the # of SslStreams that are allocated in your trace. I seem to recall some weirdness about allocating a separate buffer for handshake vs encrypt/decrypt, so maybe that's what's happening here.
Unrelated issue
Hmm... ok looks like
Since that's > 2^14 (by 32 bytes), ArrayPool rounds it up to the next power of 2 at 2^15, or 32kB.
Yeah, I just found that too. We should do something different there, but it's not urgent.
Unfortunately the SSL record size limit seems to be 16K + 5 bytes for the header. That's super annoying.
It's bigger... 5 bytes for the header and 16K for the plaintext, plus 16 bytes for the biggest trailing AEAD block, and 8 bytes for the sequence part of the nonce (the other 4 bytes are made from the master secret).
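The arithmetic in the last few comments can be checked with a short sketch. The constants are only the figures quoted above (32 bytes over 2^14, and a 5 + 16 + 8 byte per-record overhead); `RoundUpToPowerOfTwo` is a hypothetical helper that mimics the rounding described, not the ArrayPool implementation, and the shared pool is only documented to return a buffer at least as large as requested.

```csharp
using System;
using System.Buffers;

public static class RecordBufferMath
{
    // Smallest power of two that is >= size (illustrative helper).
    public static int RoundUpToPowerOfTwo(int size)
    {
        int result = 1;
        while (result < size)
        {
            result <<= 1;
        }
        return result;
    }

    public static void Main()
    {
        const int MaxPlaintext = 1 << 14;             // 16384 bytes of plaintext

        int requested = MaxPlaintext + 32;            // size mentioned in the thread
        int fullRecord = 5 + MaxPlaintext + 16 + 8;   // header + plaintext + tag + nonce part

        Console.WriteLine(RoundUpToPowerOfTwo(MaxPlaintext)); // 16384
        Console.WriteLine(RoundUpToPowerOfTwo(requested));    // 32768
        Console.WriteLine(RoundUpToPowerOfTwo(fullRecord));   // 32768

        // What the shared pool actually hands back on this runtime.
        byte[] rented = ArrayPool<byte>.Shared.Rent(requested);
        Console.WriteLine(rented.Length >= requested);
        ArrayPool<byte>.Shared.Return(rented);
    }
}
```

Any request even one byte over 16384 lands in the 32768 bucket under power-of-two rounding, so a full-size record cannot fit in a 16K buffer once the per-record overhead is included.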
As you say though, bigger fish :) There is plenty of fruit left on the tree for sure.
For my info, did you get a failure on big blocks on master again? On 2.0 RTM I couldn't repro.
@Drawaes Here are the results I get from your ContinueWith branch on my SslStreamPerf test:
I don't see the exception I was seeing before from the Bio code, but I don't get any results. Most likely I am eating an exception somewhere. I'll investigate further.
I mean on corefx/master... I just want to know I am not dealing with a bug from the current branch at the same time. For your test above, was that with loopback or in memory? It's interesting that the ContinueWith isn't helping there. I want to understand if it's an interaction with the networking stack.
Yes, I see the same behavior on corefx/master. So it's not specific to your change. I'll take that issue offline and post an issue when I have a chance to investigate further. The numbers I posted above are in-memory; I can run loopback too.
Perfect, so if we take the TCP stack out of the equation, then the ContinueWith has the effect we would expect (a slowdown), which narrows the problem a bit!
{
    asyncRequest.CompleteUser();
}
asyncRequest?.CompleteUser();
@stephentoub I have narrowed the issue down. It seems to have nothing to do with the continuation/free up above. The issue is around calling asyncRequest.CompleteUser() on the current thread. I get ~180k rps with the code as it stands and 256 connections (24-core/48-thread server). With
Task.Run(() => asyncRequest?.CompleteUser());
it jumps to > 900k rps.
This only affects Unix. To me it seems like a contention issue, because the RPS scales up to ~10 connections and then slowly goes down from there unless you put in a Task.Run().
Looking at it, could it be something in the lazy result?
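The difference between the two completion strategies discussed above can be sketched in isolation. `CompletionDispatch` and its methods are hypothetical stand-ins for the `CompleteUser()` call site, not SslStream code; the sketch shows only where the callback runs, not the throughput effect reported in the thread.

```csharp
using System;
using System.Threading;
using System.Threading.Tasks;

public static class CompletionDispatch
{
    // Runs the user callback on the calling thread, before returning.
    public static void CompleteInline(Action userCallback) => userCallback();

    // Queues the user callback to the thread pool and returns immediately.
    public static Task CompleteQueued(Action userCallback) => Task.Run(userCallback);

    public static void Main()
    {
        int callerThread = Environment.CurrentManagedThreadId;
        int inlineThread = -1;
        bool queuedOnPool = false;

        CompleteInline(() => inlineThread = Environment.CurrentManagedThreadId);
        CompleteQueued(() => queuedOnPool = Thread.CurrentThread.IsThreadPoolThread).Wait();

        // Inline completion runs on the thread that completed the operation.
        Console.WriteLine(inlineThread == callerThread);
        // Queued completion runs on a thread-pool thread instead.
        Console.WriteLine(queuedOnPool);
    }
}
```

With inline completion, whatever the completing thread holds or blocks on stays held for the whole duration of the user callback; queueing lets that thread return first, which is one plausible reason the contention disappears.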
CompleteUser is basically just calling the user callback. There shouldn't be any contention in the CompleteUser call itself (though of course, there could be).
The main difference here is likely something in the user code. To test this, you could add Task.Run to the user code so that the user callback will be enqueued, instead of run immediately, and see if that makes any difference.
Which test are you seeing this with? I'm not sure quite what the user's callback will be doing in this case.
BTW, I ran some numbers with my test app over loopback (instead of the in-memory stream). Here's what I see:
"baseline" is corefx/master (as of a couple days ago)
Cool, here is my current take. Where there is little waiting/blocking, in a pure overhead test, my changes haven't made things worse, and maybe slightly better, which is good because we have more object tracking, continuations, etc. So even in that case it sets us up for future changes. In the async/blocking case, such as the full web server and network card scenario, it's a win :) apart from this one problem. Now, the reason I am puzzled about the user code side is that it's Kestrel, and with identical code it doesn't happen on Windows. There are a few platform-specific parts in the user callback: OS wait handles/mutexes and libuv. I am not bold enough to blame either of those. I know there has been work on Kestrel on sockets, so I might try that to eliminate (or not) libuv.
What does the call stack look like at this point? Are we inside a callback from OpenSSL? And if so, might it be holding a lock?
We shouldn't be. I'm not sure how we could still be inside a callback at this point unless an exception is thrown in the managed callback, and when that has happened during development it has spat that out and stopped. I am trying to figure out tracing on Linux. What is interesting is that there is a 10x slowdown with tracing/event log on my Linux box but the CPU isn't doing much, which suggests there is some lock somewhere doing something bad, as that is disproportionate to the amount of logs I am seeing traced out. If I turn off event logs and just do a trace, I get something more reasonable. What I can glean from the logs (user and kernel stacks don't correlate, which is a pain) is that the CPUs are mostly idle for the non-ContinueWith case. And the most CPU time (apart from idle) is module 2.19.so <<libpthread-2.19.so!__lll_lock_wait closely followed by module 2.19.so <<libpthread-2.19.so!__lll_unlock_wake which kinda says it all :)
The managed stack looks something like https://gist.github.com/Drawaes/4ff5036448d87557eb735166682332ad The .NET callbacks from unmanaged code are pretty tiny, so I am not sure it can be that. I will take another look at the OpenSSL locking.
Ok, thanks. @Drawaes, have you or @benaadams tried just adding that continuation to the existing code base? e.g. changing the line at https://github.com/dotnet/corefx/blob/master/src/System.Net.Security/src/System/Net/Security/SslStreamInternal.cs#L437 from:
Task t = _sslState.InnerStream.WriteAsync(outBuffer, 0, encryptedBytes);
to:
Task t = _sslState.InnerStream.WriteAsync(outBuffer, 0, encryptedBytes).ContinueWith(p => p.GetAwaiter().GetResult());
? If you have, what were the results? If you haven't, might be interesting to see if/how much of the gains here are due to escaping whatever lock is causing a problem here, separate from the rest of the changes being made.
Funny... I am building it right now to test that :)
@stephentoub Good or bad news, depending on how you look at it. I am seeing 150k -> 650k, so I am gaining an extra ~250k with my change, but a large chunk is from this issue.... At this stage I am contemplating closing this PR and opening an issue for this. I will start working on that in isolation, and then we can come back to this in the future, as I think this is an issue at this stage (FYI 2.0 as well as 2.1 has this issue). What do you think?
@Drawaes That seems good to me.
https://github.com/dotnet/corefx/issues/23485 I now hold the dubious honour of having the most commented-on PRs in core and corefx that never got merged :P
@Drawaes FYI, my perf microbenchmark is at https://github.com/geoffkizer/netperf/tree/master/SocketPerfTest. It's pretty simple to use. As you're making additional changes here, I'd love to see the numbers you get from this microbenchmark.
Cool.. I am feeling upbeat, I finally got VTune working on my benchmark machines... so if I seem to have gone quiet, I haven't stopped looking :) I might have some interesting numbers soon (or not, but hey, proving it's not something is useful as well, right?).
#21371
Changed the OpenSSL code to use a "custom" BIO that uses the buffers from managed code directly, rather than an interim MemoryBio, to avoid extra copying.
This alone didn't provide much of a performance gain. However, I have also modified the encrypt side to write directly to the output buffer; if the content is too large, it sends what it has to the socket and returns to get more of the frame/frames.
This should set up future improvements on the read side and make it easier to react to Buffer/Span changes.
The performance difference I see on my hardware, using the TechEmpower plaintext benchmark on Ubuntu 14.04 with SSL turned on, is as below. These are unconfirmed numbers. The new and old code clearly converge as the send sizes get bigger. The numbers don't include the HTTP header, so 11 bytes == Hello World. The tests use 256 connections and pipelining with a depth of 16.