SslStream Improvements for Unix Performance #22998

Drawaes · 2017-08-06T21:33:56Z

#21371
Changed the OpenSsl code to use a "custom" bio that uses the buffers from managed code directly rather than using an interim MemoryBio to stop extra copying.

This alone didn't provide much of a performance gain; however, I have also modified the encrypt side to write directly to the output buffer. If the content is too large, it sends that to the socket and returns to get more of the frame/frames.

This should set things up for future improvements on the read side and for reacting to Buffer/Span changes.

The performance diff I see on my hardware using the TechEmpower plaintext benchmark on Ubuntu 14.04 with SSL turned on is as below; these are unconfirmed numbers. There is a clear convergence of the new and old code as the send sizes get bigger. The numbers don't include the HTTP header, so the 11 bytes == Hello World. The tests use 256 connections and pipelining with a depth of 16.

benaadams · 2017-08-06T21:39:41Z

/cc @stephentoub, @davidsh, @CIPop, @Priya91 PTAL

Drawaes · 2017-08-06T21:50:40Z

#22485 needs to be fixed for the build to work.

davidsh · 2017-08-06T22:36:00Z

@dotnet-bot Test Outerloop Windows x64 Debug Build
@dotnet-bot Test Outerloop UWP CoreCLR x64 Debug Build
@dotnet-bot Test Outerloop Linux x64 Release Build

benaadams · 2017-08-06T23:57:08Z

src/Common/src/Interop/Unix/System.Security.Cryptography.Native/Interop.BIO.Custom.cs

+        private unsafe delegate int CreateDelegate(bio_st* bio);
+
+        [UnmanagedFunctionPointer(CallingConvention.Cdecl)]
+        private unsafe delegate int ReadDelegate(IntPtr bio, void* buf, int size);


Struct wrapper for these types to make it strongly typed? e.g.

struct Bio { IntPtr bioPtr; }

The wrapper should happily go through the function pointer; can also return them from externs instead of IntPtr

Can also add methods to the struct if you want to act on the IntPtr, and then it behaves and looks a little like a regular class/struct

I wondered about that. I originally had used the BIO Safehandle, but obviously a class can't be in a callback. But I am happy to make that change
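The struct-wrapper suggestion above can be sketched roughly as follows. This is a minimal illustration, not the code from the PR: the `Bio` member names, the `IsNull` helper, and the managed-only `Read` method are hypothetical, and `buf` is typed as `IntPtr` so the sample compiles without unsafe code. It only shows that a single-field struct has the same marshalled size as the raw pointer and can appear in a delegate signature.

```csharp
using System;
using System.Runtime.InteropServices;

// Hypothetical wrapper: a single IntPtr field, so the struct is blittable
// and has the same size as the raw pointer it replaces.
internal readonly struct Bio
{
    private readonly IntPtr _bioPtr;

    public Bio(IntPtr bioPtr) => _bioPtr = bioPtr;

    // Helper members make the handle read a little like a regular type.
    public bool IsNull => _bioPtr == IntPtr.Zero;
}

internal static class Program
{
    // Same shape as the ReadDelegate in the diff, with the first parameter
    // strongly typed instead of IntPtr.
    [UnmanagedFunctionPointer(CallingConvention.Cdecl)]
    internal delegate int ReadDelegate(Bio bio, IntPtr buf, int size);

    // Stand-in callback: returns -1 for a null handle, otherwise the size.
    internal static int Read(Bio bio, IntPtr buf, int size) => bio.IsNull ? -1 : size;

    private static void Main()
    {
        ReadDelegate read = Read;
        Console.WriteLine(Marshal.SizeOf<Bio>() == IntPtr.Size);               // True
        Console.WriteLine(read(new Bio(new IntPtr(0x1000)), IntPtr.Zero, 11)); // 11
        Console.WriteLine(read(default, IntPtr.Zero, 11));                     // -1
    }
}
```

The sample only invokes the delegate from managed code; whether a by-value struct is passed identically to a bare pointer across a native boundary depends on the platform ABI and would need checking per target.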

Drawaes · 2017-08-07T01:34:04Z

@dotnet-bot Test Outerloop UWP CoreCLR x64 Debug Build

Drawaes · 2017-08-07T22:24:00Z

@bartonjs Quick question: does SslStream support rehandshaking at all when using OpenSSL? From looking at it, I can't see that it does, but I have been known to be wrong. It will affect ongoing changes I wanted to make.

bartonjs · 2017-08-07T22:44:45Z

@Drawaes I don't think you can initiate it through the .NET API, but if you connect to openssl s_server and send capital-R (renegotiate) I think it works. Or, at least, tries to work. Pretty sure it works when Windows and macOS are the clients, if nothing else. So, if it doesn't work, it probably should.

bartonjs · 2017-08-07T22:46:57Z

src/Common/src/Interop/Unix/System.Security.Cryptography.Native/Interop.BIO.Custom.cs

+            _writeDelegate = Write;
+            _readDelegate = Read;
+
+            var name = Marshal.StringToHGlobalAnsi("Managed Bio");


Improper use of var (type name must be on the line via new, as, or hard-cast)

bartonjs · 2017-08-07T22:47:09Z

src/Common/src/Interop/Unix/System.Security.Cryptography.Native/Interop.BIO.Custom.cs

+                bwrite = _writeDelegate,
+            };
+
+            var memory = Marshal.AllocHGlobal(Marshal.SizeOf<bio_method_st>());


Improper var. (Pervasive)

removed many many var violations (suddenly the analyzers are working in my VS :) )

bartonjs · 2017-08-07T22:50:33Z

src/Common/src/Interop/Unix/System.Security.Cryptography.Native/Interop.OpenSsl.cs

            }

+            context.InputBio.SetBio(recvBuf, recvOffset, recvCount);
+            context.OutputBio.SetBio(null, true);


Use named parameter calling so that the true has a purpose more obvious to a reader.

done, named param for the null as well just to be clear, also modified the one in the Encrypt method to have the named param for the true/false

bartonjs · 2017-08-07T22:55:10Z

src/Common/src/Interop/Unix/System.Security.Cryptography.Native/Interop.Ssl.cs

+
+            public int TakeBytes(out byte[] output)
+            {
+                var bytes = _bytesWritten;


Do we need to have Write and TakeBytes be synchronized?

I am not 100% sure what you mean by synchronized here. Do you mean that TakeBytes could be called by a different thread, or that the two methods could be called at the same time? Are you concerned about an unseen cached value, or a race?

Race conditions. Like Write sees _byteArray != null right after TakeBytes set _bytesWritten to 0, but before _byteArray = null. Since neither of these methods works on/with pointers directly it might not be necessary since the class is (IIRC) described as not thread-safe.

Ahh, okay. So when you call Encrypt/DoHandshake, your thread doesn't return until after it has done all write/read calls. So you will never get to TakeBytes before then, as the methods are all sync in nature. I think it is fine. The locking to ensure that, btw, is provided by your init method when you give OpenSsl the mutexes it asks for.

bartonjs · 2017-08-07T22:57:46Z

src/Common/src/Interop/Unix/System.Security.Cryptography.Native/Interop.Ssl.cs

+            {
+                if (_handle.IsAllocated)
+                {
+                    _handle.Free();


Does GCHandle.Free cause IsAllocated to become false? (I don't see anything either way in the docs)

Yes it does; though since it's a struct, it needs to be passed by ref or held as a class member for the change to be reflected (e.g. single instance)

and not made readonly :) (that has bitten me before)
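A small sketch of the behaviour discussed above, using only `GCHandle.Alloc`, `Free`, and `IsAllocated`. The `FreeAndInspect` helper is hypothetical and exists just to make the result easy to check: the instance that `Free` was called on reports `IsAllocated == false`, while a struct copy taken earlier still reports `true`.

```csharp
using System;
using System.Runtime.InteropServices;

internal static class Program
{
    // Returns (IsAllocated on the freed instance, IsAllocated on a stale copy).
    internal static (bool Original, bool Copy) FreeAndInspect()
    {
        byte[] buffer = new byte[16];
        GCHandle handle = GCHandle.Alloc(buffer, GCHandleType.Pinned);

        // GCHandle is a struct, so this is an independent snapshot of the handle value.
        GCHandle copy = handle;

        // Free clears the handle value stored in this instance only.
        handle.Free();

        // Do not call copy.Free() here: the underlying handle is already released.
        return (handle.IsAllocated, copy.IsAllocated);
    }

    private static void Main()
    {
        (bool original, bool copy) = FreeAndInspect();
        Console.WriteLine(original); // False
        Console.WriteLine(copy);     // True
    }
}
```

The readonly caveat follows the same logic, as far as I understand it: calling `Free` on a `readonly` field operates on a compiler-made copy, so the field itself would keep reporting `IsAllocated == true`.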

bartonjs · 2017-08-07T23:00:07Z

src/Common/src/Interop/Unix/System.Security.Cryptography.Native/Interop.OpenSsl.cs


+                // This allows the write buffer to move during a multi call write, this stops us having to pin it
+                // across multiple calls where there is an async output to the innerstream inbetween
+                Ssl.SslCtxSetAcceptMovingWriteBuffer(innerContext);


Do we know this to be always safe? What guarantees does OpenSSL need on the stability of the memory, and what guarantees can we meet with regard to GC compaction?

OpenSsl requires that the memory is the same (as in the content) but that the pointer itself can move. The length must also be the same. So I feel this is fine.

Make it possible to retry SSL_write() with changed buffer location (the buffer contents must stay the same). This is not the default to avoid the misconception that non-blocking SSL_write() behaves like non-blocking write().

It's really only in place to stop silly mistakes. In this case we know we won't change the content (unless the user is silly enough to modify, from another thread, the input array that they have passed into the method while the call has not yet returned). Now it is possible for a user to do this, but then you end up with the same situation as the user doing this while you are mid-write or copying the data anyway.

I do have an idea of how I can possibly remove this, but it's not an easy change due to the APM model of the SslInternalStream; the large amount of callback chaining in there makes state a tough call.

bartonjs · 2017-08-07T23:03:01Z

src/Common/src/Interop/Unix/System.Security.Cryptography.Native/Interop.BIO.Custom.cs

+        }
+
+        [StructLayout(LayoutKind.Sequential)]
+        private unsafe struct bio_method_st


Native structs from OpenSSL must not have any implied mapping to their memory layout. Instead you need to make a function which takes the required parameters and does the allocation and assignment from within the shim. (Also applies to bio_st)

Native structs are gone, there is a function to set the two callbacks required. All other callbacks and the construction of the structs is now in the shim

bartonjs · 2017-08-07T23:06:23Z

src/Common/src/Interop/Unix/System.Security.Cryptography.Native/Interop.BIO.Custom.cs

+        [Flags]
+        private enum BIO_TYPE
+        {
+            BIO_TYPE_SOURCE_SINK = 0x0400,


I don't see native asserts for any of the new enum values.

Most of these will go away if I do the below and move the struct generation/allocation to the shim

I have removed all the enums as they are no longer needed with the new shim methods.

Drawaes · 2017-08-08T02:21:27Z

/cc @bartonjs changes to the shim/native/interop area have been made to move the struct generation to the shim. It's possibly faster now that there are 4 callbacks per frame write (the control callback) that are no longer interops but all in native code.

Drawaes · 2017-08-08T11:03:37Z

It's currently failing some tests with the shim change; I will fix it after work tonight.

Drawaes · 2017-08-08T15:03:28Z

@dotnet-bot Test Outerloop Linux x64 Release Build

Drawaes · 2017-08-08T16:07:04Z

It looks to be working locally on my tests. The test failures in the outerloop, I suspect are from a different change to mine. I will double check, but other than that the code should be fine.

benaadams · 2017-08-08T16:29:24Z

Failures

Debian.87.Amd64.Open

SendAsync_RequestVersion20_ResponseVersion20IfHttp2Supported(server: https://http2.akamai.com/)

Fedora.26.Amd64.Open | fedora.25.amd64.Open

System.Net.Sockets.Tests - Catastrophic failure

All Linux

System.Net.WebSockets.Client.Tests.CancelTest

ReceiveAsync_AfterCancellationDoReceiveAsync_ThrowsWebSocketException(server: ws://corefx-net.cloudapp.net/WebSocket/EchoWebSocket.ashx) https://github.com/dotnet/corefx/issues/23038

ReceiveAsync_AfterCancellationDoReceiveAsync_ThrowsWebSocketException(server: wss://corefx-net.cloudapp.net/WebSocket/EchoWebSocket.ashx) https://github.com/dotnet/corefx/issues/23038

System.Net.WebSockets.Client.Tests.SendReceiveTest

SendAsync_MultipleOutstandingSendOperations_Throws(server: wss://corefx-net.cloudapp.net/WebSocket/EchoWebSocket.ashx)

Drawaes · 2017-08-08T17:11:40Z

Yeah I saw that. I will check after the day job. Didn't see those locally, although I see a number of socket changes for span have snuck into my test build so I will rebase and see what is going on.

benaadams · 2017-08-08T17:23:40Z

Think you have to run outerloop locally also

build-managed -Outerloop

or

 msbuild <csproj_file> /t:BuildAndTest /p:WithCategories=OuterLoop

Though at least one of them is #23038

Drawaes · 2017-08-08T17:26:45Z

Yeah, I suspected it was not from my change because:

1. Those tests aren't using TLS; they are "normal" requests.
2. This change has gone in, in the tests: "Add HttpListenerWebSocketContext.WebSocket.Receive/Close throw (commit: 5623bfc)"

I haven't rebased that to my local branch so wouldn't see the failure. I suspect the break isn't from me but I will take a look anyway.

Drawaes · 2017-08-08T17:34:15Z

The socket failure on fedora seemed to be a timeout, but I am not sure. Anyway I will have a crack when I am back in the hotseat

stephentoub · 2017-08-08T17:46:45Z

The socket failure on fedora seemed to be a timeout, but I am not sure

I believe there are a bunch of known hangs/failures on Fedora, but @geoffkizer can comment.

Yeah I suspected it was not from my change because Those tests aren't using TLS they are "normal" requests

FWIW, the two wss ones on Linux would be using SslStream.

Drawaes · 2017-08-08T17:49:56Z

You mean I can't blame your Span changes... :)

geoffkizer · 2017-08-21T18:27:03Z

@benaadams Yeah, that would be odd. Just to confirm though, you could check the # of SslStreams that are allocated in your trace.

I seem to recall some weirdness about allocating a separate buffer for handshake vs encrypt/decrypt, so maybe that's what's happening here.

benaadams · 2017-08-21T18:27:16Z

34 broken is a bit of a worry? 3%

Unrelated issue

benaadams · 2017-08-21T18:43:44Z

Hmm... ok looks like

const int ReadBufferSize = 4096 * 4 + 32; // = 16416

Since that's > 2^14 (by 32 bytes) ArrayPool rounds it up to the next power of 2 at 2^15 or 32kB
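The rounding described above can be reproduced with a few lines. This is a sketch: `RoundUpToPowerOfTwo` is a hypothetical helper that mirrors the power-of-two bucket sizing the comment describes, not the pool's actual implementation, and the documented contract of `ArrayPool<byte>.Shared.Rent` only promises an array of at least the requested length.

```csharp
using System;
using System.Buffers;

internal static class Program
{
    // Hypothetical helper: smallest power of two that is >= value.
    internal static int RoundUpToPowerOfTwo(int value)
    {
        int result = 1;
        while (result < value)
        {
            result <<= 1;
        }
        return result;
    }

    private static void Main()
    {
        const int ReadBufferSize = 4096 * 4 + 32;

        Console.WriteLine(ReadBufferSize);                      // 16416
        Console.WriteLine(RoundUpToPowerOfTwo(16384));          // 16384 (exactly 2^14)
        Console.WriteLine(RoundUpToPowerOfTwo(ReadBufferSize)); // 32768 (2^15)

        // Rent guarantees only a minimum length; the actual length is
        // whatever bucket the pool implementation picks.
        byte[] rented = ArrayPool<byte>.Shared.Rent(ReadBufferSize);
        try
        {
            Console.WriteLine(rented.Length >= ReadBufferSize); // True
        }
        finally
        {
            ArrayPool<byte>.Shared.Return(rented);
        }
    }
}
```

Being 32 bytes over 2^14 is enough to roughly double the rented size, which is the cost being pointed out here.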

geoffkizer · 2017-08-21T18:44:35Z

Yeah, I just found that too. We should do something different there, but it's not urgent.

geoffkizer · 2017-08-21T18:55:42Z

Unfortunately the Ssl record size limit seems to be 16K + 5 bytes for header. That's super annoying.

Drawaes · 2017-08-21T18:57:48Z

It's bigger... 5 bytes for the header, 16K for the plaintext, 16 bytes for the biggest AEAD block trailing, and 8 bytes for the sequence part of the nonce (the other 4 bytes are made from the master secret).
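Adding up the figures from the comment above gives a quick sanity check against the 16416-byte `ReadBufferSize` mentioned earlier. The constant names are hypothetical, and the breakdown assumes an AEAD cipher suite with an 8-byte explicit nonce and a 16-byte trailing tag, as the comment describes.

```csharp
using System;

// Hypothetical names; the byte counts come from the comment above.
internal static class TlsRecord
{
    internal const int HeaderSize = 5;           // record header
    internal const int MaxPlaintext = 16 * 1024; // 16K of plaintext
    internal const int AeadTrailer = 16;         // largest trailing AEAD block
    internal const int ExplicitNonce = 8;        // sequence part of the nonce

    internal const int MaxRecordSize =
        HeaderSize + ExplicitNonce + MaxPlaintext + AeadTrailer;
}

internal static class Program
{
    private static void Main()
    {
        const int ReadBufferSize = 4096 * 4 + 32;

        Console.WriteLine(TlsRecord.MaxRecordSize);                  // 16413
        Console.WriteLine(ReadBufferSize - TlsRecord.MaxRecordSize); // 3
    }
}
```

Under these assumptions a full record needs 16413 bytes, so a buffer of exactly 2^14 bytes would be too small for it, while 16416 leaves 3 bytes to spare.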

Drawaes · 2017-08-21T18:58:58Z

As you say though bigger fish :) there is plenty of fruit left on the tree for sure.

Drawaes · 2017-08-21T19:00:57Z

For my info did you get a failure on big blocks on master again? On 2.0 rtm I couldn't repro.

geoffkizer · 2017-08-21T19:14:04Z

@Drawaes Here are the results I get from your ContinueWith branch on my SslStreamPerf test:

MessageSize	1	16	256	4096
baseline	292919	282623	261778	196408
continuewith	253619	238937	227682	171037
Improvement	-13.4%	-15.5%	-13.0%	-12.9%

geoffkizer · 2017-08-21T19:26:27Z

@Drawaes

For my info did you get a failure on big blocks on master again?

I don't see the exception I was seeing before from the Bio code, but I don't get any results. Most likely I am eating an exception somewhere. I'll investigate further.

Drawaes · 2017-08-21T19:39:26Z

I mean on corefx/master... I just want to know I am not dealing with a bug from the current branch at the same time. For your test above, was that with loopback or in memory? It's interesting that the ContinueWith isn't helping there. I want to understand if it's an interaction with the networking stack.

geoffkizer · 2017-08-21T19:42:05Z

Yes, I see the same behavior on corefx/master. So it's not specific to your change. I'll take that issue offline and post an issue when I have a chance to investigate further.

The numbers I posted above are in-memory; I can run loopback too.

Drawaes · 2017-08-21T19:45:42Z

Perfect, so if we take the TCP stack out of the equation then the ContinueWith has the effect we would expect (a slowdown), which narrows the problem a bit!

Drawaes · 2017-08-22T00:41:06Z

src/System.Net.Security/src/System/Net/Security/SslStreamInternal.cs

-            {
-                asyncRequest.CompleteUser();
            }
+            asyncRequest?.CompleteUser();


@stephentoub I have narrowed the issue down. It seems that it is nothing to do with the continuation/free up above. The issue is around calling asyncRequest.CompleteUser() on the current thread. So ~180k rps with the code as it stands and 256 connections (24 core/48 thread server)

With

Task.Run(() => asyncRequest?.CompleteUser());

it jumps to > 900k rps.

This is only affecting Unix. To me it seems like a contention issue, because the RPS scales up to ~10 connections and then slowly goes down from there unless you put in a Task.Run().

Looking at it, could it be something in the lazy result?

CompleteUser is basically just calling the user callback. There shouldn't be any contention in the CompleteUser call itself (though of course, there could be).

The main difference here is likely something in the user code. To test this, you could add Task.Run to the user code so that the user callback will be enqueued, instead of run immediately, and see if that makes any difference.

Which test are you seeing this with? I'm not sure quite what the user's callback will be doing in this case.

geoffkizer · 2017-08-22T06:52:53Z

BTW, I ran some numbers with my test app over loopback (instead of the in-memory stream). Here's what I see:

Over loopback
MessageSize	1	16	256	4096
baseline	122440	119689	115707	92481
drawaes	123560	121826	114060	92321
Improvement	0.9%	1.8%	-1.4%	-0.2%
continuewith	24122	24699	22522	17899
Improvement	-80.3%	-79.4%	-80.5%	-80.6%

"baseline" is corefx/master (as of a couple days ago)
"drawaes" is this PR -- looks like basically a wash
"continuewith" is adding in the continuewith -- this kills performance on my test. Notably, CPU was only around 60%, so there's probably some sort of contention issue here.

Drawaes · 2017-08-22T08:35:34Z

Cool, here is my current take. Where there is little waiting/blocking, and in a pure overhead test, my changes haven't made things worse, maybe slightly better, which is good because we have more object tracking, continuations, etc.

So even in that case it sets us up for future changes. In the async/blocking case, such as the full webserver network card scenario, it's a win :) apart from this one problem.

Now the reason I am puzzled about the user code side is... it's Kestrel, and with identical code it doesn't happen on Windows.

There are a few platform-specific parts in the user callback: OS wait handles/mutexes and libuv. Neither of those am I bold enough to blame. I know there has been work on Kestrel on sockets, so I might try that to eliminate (or not) libuv.

stephentoub · 2017-08-22T10:34:44Z

This is only affecting Unix. To me it seems like a contention issue, because the RPS scales up to ~10 connections and then slowly goes down from there unless you put in a Task.Run().

What does the call stack look like at this point? Are we inside a callback from OpenSSL? And if so, might it be holding a lock?

Drawaes · 2017-08-22T10:42:51Z

We shouldn't be, not sure how we could be still inside a callback at this point unless there is an exception thrown in the managed callback, and when there has been in development it has spat that out and stopped.

I am trying to figure out tracing on Linux. What is interesting is that there is a 10x slowdown with tracing/event log on my Linux box, but the CPU isn't doing much. That suggests there is some lock somewhere doing something bad, as the slowdown is disproportionate to the amount of logs I am seeing traced out. If I turn off event logs and just do a trace, I get something more reasonable.

What I can glean from the logs (user and kernel stacks don't correlate, which is a pain) is that the CPUs are mostly idle for the non-ContinueWith case.

And the most CPU time (apart from idle) is

module 2.19.so <<libpthread-2.19.so!__lll_lock_wait

closely followed by

module 2.19.so <<libpthread-2.19.so!__lll_unlock_wake

which kinda says it all :)

Drawaes · 2017-08-22T17:21:36Z

The managed stack looks something like

https://gist.github.com/Drawaes/4ff5036448d87557eb735166682332ad

The .net callbacks from unmanaged code are pretty tiny so I am not sure it can be that. I will take another look at the openssl locking.

stephentoub · 2017-08-22T18:54:08Z

The managed stack looks something like

Ok, thanks.

@Drawaes, have you or @benaadams tried just adding that continuation to the existing code base? e.g. changing the line at https://github.com/dotnet/corefx/blob/master/src/System.Net.Security/src/System/Net/Security/SslStreamInternal.cs#L437 from:

Task t = _sslState.InnerStream.WriteAsync(outBuffer, 0, encryptedBytes);

to:

Task t = _sslState.InnerStream.WriteAsync(outBuffer, 0, encryptedBytes).ContinueWith(p => p.GetAwaiter().GetResult());

?

If you have, what were the results? If you haven't, might be interesting to see if/how much of the gains here are due to escaping whatever lock is causing a problem here, separate from the rest of the changes being made.

Drawaes · 2017-08-22T18:57:45Z

Funny... I am building it right now to test that :)

Drawaes · 2017-08-22T19:50:23Z

@stephentoub good or bad news depending on how you look at it. I am seeing 150k -> 650k, so I am gaining an extra ~250k with my change, but a large chunk is from this issue... I am contemplating at this stage: close this PR, open an issue for this, and start working on that in isolation. Then we can come back to this in the future, as I think this is an issue at this stage (FYI 2.0 as well as 2.1 has this issue).

What do you think?

geoffkizer · 2017-08-22T19:59:48Z

@Drawaes That seems good to me.

Drawaes · 2017-08-22T20:05:33Z

https://github.com/dotnet/corefx/issues/23485

I now hold the dubious honour of having the most commented-on PRs in core and corefx that never got merged :P

geoffkizer · 2017-08-23T20:41:46Z

@Drawaes FYI, my perf microbenchmark is at https://github.com/geoffkizer/netperf/tree/master/SocketPerfTest. It's pretty simple to use. As you're making additional changes here, I'd love to see the numbers you get from this microbenchmark.

Drawaes · 2017-08-23T21:42:55Z

Cool.. I am feeling upbeat, I got vtune working on my benchmark machines finally...so if I seemed to have gone quiet I haven't stopped looking :) I might have some interesting numbers (or not but hey proving it's not something is useful as well right ? ) soon.

dnfclas added the cla-already-signed label Aug 6, 2017

davidsh added area-System.Net.Security os-linux Linux OS (any supported distro) labels Aug 6, 2017

benaadams reviewed Aug 6, 2017

karelz assigned Drawaes, Priya91, geoffkizer, davidsh and CIPop Aug 7, 2017

davidsh removed their assignment Aug 7, 2017

CIPop removed their assignment Aug 7, 2017

bartonjs suggested changes Aug 7, 2017

Added reference to AppSet/AppGet from openssl

4b4b72c

Drawaes commented Aug 22, 2017

Drawaes closed this Aug 22, 2017

karelz modified the milestone: 2.1.0 Aug 28, 2017

Drawaes mentioned this pull request Oct 1, 2017

SSLStream : Removed sync lock method #24352

Closed
