This repository was archived by the owner on Jan 23, 2023. It is now read-only.

Conversation

@Drawaes Drawaes commented Aug 6, 2017

#21371
Changed the OpenSsl code to use a "custom" bio that uses the buffers from managed code directly rather than using an interim MemoryBio to stop extra copying.

This alone didn't improve performance much. However, I have also modified the encrypt side to write directly to the output buffer; if the content is too large, it sends what it has to the socket and returns to get more of the frame/frames.

This should set up future improvements on the read side and make it easier to react to Buffer/Span changes.

The performance difference I see on my hardware, using the TechEmpower plaintext benchmark on Ubuntu 14.04 with SSL turned on, is shown below. These are unconfirmed numbers. The new and old code clearly converge as the send sizes get bigger. The sizes don't include the HTTP header, so 11 bytes == Hello World. The tests use 256 connections and pipelining with a depth of 16.

[Image: new test results, 2017-08-06]

@benaadams
Member

/cc @stephentoub, @davidsh, @CIPop, @Priya91 PTAL

@Drawaes
Author

Drawaes commented Aug 6, 2017

#22485 needs to be fixed for the build to work.

@davidsh davidsh added area-System.Net.Security os-linux Linux OS (any supported distro) labels Aug 6, 2017
@davidsh
Contributor

davidsh commented Aug 6, 2017

@dotnet-bot Test Outerloop Windows x64 Debug Build
@dotnet-bot Test Outerloop UWP CoreCLR x64 Debug Build
@dotnet-bot Test Outerloop Linux x64 Release Build

private unsafe delegate int CreateDelegate(bio_st* bio);

[UnmanagedFunctionPointer(CallingConvention.Cdecl)]
private unsafe delegate int ReadDelegate(IntPtr bio, void* buf, int size);
Member

@benaadams benaadams Aug 6, 2017

Struct wrapper for these types to make it strongly typed? e.g.

struct Bio
{
    IntPtr bioPtr;
}

The wrapper should happily go through the function pointer; can also return them from externs instead of IntPtr

Can also add methods to the struct if you want to act on the IntPtr, and then it behaves and looks a little like a regular class/struct
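A minimal sketch of the wrapper idea suggested above, not the code from this PR. The `Bio` members, the `BioSketch` class and the `Read` stand-in are hypothetical names for illustration, and the buffer parameter is an `IntPtr` only so the sketch compiles without `/unsafe`. It assumes a single-field struct marshals like the pointer it wraps.

```csharp
using System;
using System.Runtime.InteropServices;

// Hypothetical wrapper: one IntPtr field, so it is blittable and has the
// same size as the raw pointer it replaces.
[StructLayout(LayoutKind.Sequential)]
public readonly struct Bio
{
    private readonly IntPtr _bioPtr;

    public Bio(IntPtr bioPtr) => _bioPtr = bioPtr;

    // Helpers can live on the struct, so call sites read like a normal type.
    public bool IsNull => _bioPtr == IntPtr.Zero;
}

public static class BioSketch
{
    // Callback signature typed with Bio instead of a bare IntPtr.
    [UnmanagedFunctionPointer(CallingConvention.Cdecl)]
    public delegate int ReadDelegate(Bio bio, IntPtr buf, int size);

    // Stand-in for the managed Read callback: reports how many bytes it "read".
    public static int Read(Bio bio, IntPtr buf, int size) => bio.IsNull ? -1 : size;

    public static void Main()
    {
        ReadDelegate read = Read;
        Console.WriteLine(Marshal.SizeOf<Bio>() == IntPtr.Size);          // True
        Console.WriteLine(read(new Bio(new IntPtr(1)), IntPtr.Zero, 16)); // 16
        Console.WriteLine(read(default(Bio), IntPtr.Zero, 16));           // -1
    }
}
```

The benefit is purely at compile time: a `Bio` cannot be passed where some other handle type is expected, which a bare `IntPtr` would allow.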

Author

I wondered about that. I originally used the BIO SafeHandle, but obviously a class can't be in a callback. I am happy to make that change.

@Drawaes
Author

Drawaes commented Aug 7, 2017

@dotnet-bot Test Outerloop UWP CoreCLR x64 Debug Build

@Drawaes
Author

Drawaes commented Aug 7, 2017

@bartonjs Quick question: does SslStream support rehandshaking at all when using OpenSSL? From looking at it, I can't see that it does, but I have been known to be wrong. It will affect ongoing changes I want to make.

@davidsh davidsh removed their assignment Aug 7, 2017
@bartonjs
Member

bartonjs commented Aug 7, 2017

@Drawaes I don't think you can initiate it through the .NET API, but if you connect to openssl s_server and send capital-R (renegotiate) I think it works. Or, at least, tries to work. Pretty sure it works when Windows and macOS are the clients, if nothing else. So, if it doesn't work, it probably should.

@CIPop CIPop removed their assignment Aug 7, 2017
_writeDelegate = Write;
_readDelegate = Read;

var name = Marshal.StringToHGlobalAnsi("Managed Bio");
Member

Improper use of var (type name must be on the line via new, as, or hard-cast)

bwrite = _writeDelegate,
};

var memory = Marshal.AllocHGlobal(Marshal.SizeOf<bio_method_st>());
Member

Improper var. (Pervasive)

Author

removed many many var violations (suddenly the analyzers are working in my VS :) )

}

context.InputBio.SetBio(recvBuf, recvOffset, recvCount);
context.OutputBio.SetBio(null, true);
Member

Use named parameter calling so that the true has a purpose more obvious to a reader.

Author

done, named param for the null as well just to be clear, also modified the one in the Encrypt method to have the named param for the true/false
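For readers unfamiliar with the convention requested in this thread, here is a small sketch of named-argument calls. The `OutputBioSketch` class and the parameter names `buffer` and `isHandshake` are assumptions made for illustration; the real `SetBio` signature is not shown in the conversation.

```csharp
using System;

public sealed class OutputBioSketch
{
    public byte[] Buffer { get; private set; }
    public bool IsHandshake { get; private set; }

    // Hypothetical parameter names; the PR's actual signature is not visible here.
    public void SetBio(byte[] buffer, bool isHandshake)
    {
        Buffer = buffer;
        IsHandshake = isHandshake;
    }

    public static void Main()
    {
        var bio = new OutputBioSketch();

        // Positional call: the reader has to look up what 'true' means.
        bio.SetBio(null, true);

        // Named call: the intent is visible at the call site.
        bio.SetBio(buffer: null, isHandshake: true);

        Console.WriteLine(bio.IsHandshake); // True
    }
}
```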


public int TakeBytes(out byte[] output)
{
var bytes = _bytesWritten;
Member

Do we need to have Write and TakeBytes be synchronized?

Author

I am not 100% sure what you mean by synchronized here. Do you mean that TakeBytes could be called by a different thread, or that the two methods could be called at the same time? Are you concerned about an unseen cached value, or a race?

Member

Race conditions. Like Write sees _byteArray != null right after TakeBytes set _bytesWritten to 0, but before _byteArray = null. Since neither of these methods works on/with pointers directly it might not be necessary since the class is (IIRC) described as not thread-safe.

Author

Ahh, okay. When you call Encrypt/DoHandshake, your thread doesn't return until it has done all the write/read calls, so you will never get to TakeBytes while a write is in progress, as the methods are all sync in nature. I think it is fine. The locking to ensure that, btw, is provided by the init method when you give OpenSSL the mutexes it asks for.

{
if (_handle.IsAllocated)
{
_handle.Free();
Member

Does GCHandle.Free cause IsAllocated to become false? (I don't see anything either way in the docs)

Member

Yes it does; though since it's a struct, it needs to be passed by ref or be a class member to reflect it (e.g. single instance)

Author

and not made readonly :) (that has bitten me before)
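The struct and readonly points in this thread can be sketched as follows. `MutableHolder`, `ReadonlyHolder` and `GcHandleSketch` are hypothetical classes for illustration, not types from the PR. The sketch relies on the C# rule that calling a mutating method on a `readonly` struct field operates on a temporary copy.

```csharp
using System;
using System.Runtime.InteropServices;

public sealed class MutableHolder
{
    // Non-readonly field: Free() mutates this exact GCHandle value.
    private GCHandle _handle = GCHandle.Alloc(new byte[16], GCHandleType.Pinned);

    public bool IsAllocated => _handle.IsAllocated;

    public void Release()
    {
        if (_handle.IsAllocated)
        {
            _handle.Free();
        }
    }
}

public sealed class ReadonlyHolder
{
    // Readonly field: Free() runs on a defensive copy, so the field keeps
    // its old value even though the underlying handle has been released.
    private readonly GCHandle _handle = GCHandle.Alloc(new byte[16], GCHandleType.Pinned);

    public bool IsAllocated => _handle.IsAllocated;

    public void Release() => _handle.Free();
}

public static class GcHandleSketch
{
    public static void Main()
    {
        var mutable = new MutableHolder();
        mutable.Release();
        Console.WriteLine(mutable.IsAllocated); // False

        var frozen = new ReadonlyHolder();
        frozen.Release();
        Console.WriteLine(frozen.IsAllocated);  // True: the field never saw the Free()
    }
}
```

In the readonly case an `IsAllocated` guard would not prevent a second `Free()` on an already released handle, which is the kind of bug the comments above warn about.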


// This allows the write buffer to move during a multi call write, this stops us having to pin it
// across multiple calls where there is an async output to the innerstream inbetween
Ssl.SslCtxSetAcceptMovingWriteBuffer(innerContext);
Member

Do we know this to be always safe? What guarantees does OpenSSL need on the stability of the memory, and what guarantees can we meet with regard to GC compaction?

Author

OpenSsl requires that the memory is the same (as in the content) but that the pointer itself can move. The length must also be the same. So I feel this is fine.

Make it possible to retry SSL_write() with changed buffer location (the buffer contents must stay the same). This is not the default to avoid the misconception that non-blocking SSL_write() behaves like non-blocking write().

It's really only in place to stop silly mistakes. In this case we know we won't change the content (unless the user is silly enough to modify the input array elsewhere, on another thread, while it is still in use by a method that has not returned). It is possible for a user to do this, but then you end up in the same situation as the user doing this while you are mid-write or copying the data anyway.

I do have an idea of how I can possibly remove this, but it's not an easy change due to the APM model of the SslInternalStream; the large amount of callback chaining in there makes state a tough call.

}

[StructLayout(LayoutKind.Sequential)]
private unsafe struct bio_method_st
Member

Native structs from OpenSSL must not have any implied mapping to their memory layout. Instead you need to make a function which takes the required parameters and does the allocation and assignment from within the shim. (Also applies to bio_st)

Author

Native structs are gone, there is a function to set the two callbacks required. All other callbacks and the construction of the structs is now in the shim

[Flags]
private enum BIO_TYPE
{
BIO_TYPE_SOURCE_SINK = 0x0400,
Member

I don't see native asserts for any of the new enum values.

Author

Most of these will go away if I do the below and move the struct generation/allocation to the shim

Author

I have removed all the enums as they are no longer needed with the new shim methods.

@Drawaes
Author

Drawaes commented Aug 8, 2017

/cc @bartonjs changes to the shim/native/interop area have been made to move the struct generation into the shim. It's possibly faster now that the 4 callbacks per frame write (the control callback) are no longer interop calls but all in native code.

@Drawaes
Author

Drawaes commented Aug 8, 2017

It's currently failing some tests with the shim change; I will fix it after work tonight.

@Drawaes
Author

Drawaes commented Aug 8, 2017

@dotnet-bot Test Outerloop Linux x64 Release Build

@Drawaes
Author

Drawaes commented Aug 8, 2017

It looks to be working locally on my tests. I suspect the test failures in the outerloop are from a different change, not mine. I will double check, but other than that the code should be fine.

@benaadams
Member

benaadams commented Aug 8, 2017

Failures

Debian.87.Amd64.Open

SendAsync_RequestVersion20_ResponseVersion20IfHttp2Supported(server: https://http2.akamai.com/)

Fedora.26.Amd64.Open | fedora.25.amd64.Open

System.Net.Sockets.Tests - Catastrophic failure

All Linux

System.Net.WebSockets.Client.Tests.CancelTest

ReceiveAsync_AfterCancellationDoReceiveAsync_ThrowsWebSocketException(server: ws://corefx-net.cloudapp.net/WebSocket/EchoWebSocket.ashx) https://github.com/dotnet/corefx/issues/23038

ReceiveAsync_AfterCancellationDoReceiveAsync_ThrowsWebSocketException(server: wss://corefx-net.cloudapp.net/WebSocket/EchoWebSocket.ashx) https://github.com/dotnet/corefx/issues/23038

System.Net.WebSockets.Client.Tests.SendReceiveTest

SendAsync_MultipleOutstandingSendOperations_Throws(server: wss://corefx-net.cloudapp.net/WebSocket/EchoWebSocket.ashx)

@Drawaes
Author

Drawaes commented Aug 8, 2017

Yeah I saw that. I will check after the day job. Didn't see those locally, although I see a number of socket changes for span have snuck into my test build so I will rebase and see what is going on.

@benaadams
Member

Think you have to run outerloop locally also

build-managed -Outerloop

or

 msbuild <csproj_file> /t:BuildAndTest /p:WithCategories=OuterLoop

Though at least one of them is #23038

@Drawaes
Author

Drawaes commented Aug 8, 2017

Yeah I suspected it was not from my change because

  1. Those tests aren't using TLS they are "normal" requests
  2. This change has gone in, in the tests
    "Add HttpListenerWebSocketContext.WebSocket.Receive/Close throw (commit: 5623bfc) (detail / githubweb)"

I haven't rebased that to my local branch so wouldn't see the failure. I suspect the break isn't from me but I will take a look anyway.

@Drawaes
Author

Drawaes commented Aug 8, 2017

The socket failure on fedora seemed to be a timeout, but I am not sure. Anyway I will have a crack when I am back in the hotseat

@stephentoub
Member

The socket failure on fedora seemed to be a timeout, but I am not sure

I believe there are a bunch of known hangs/failures on Fedora, but @geoffkizer can comment.

Yeah I suspected it was not from my change because Those tests aren't using TLS they are "normal" requests

FWIW, the two wss ones on Linux would be using SslStream.

@Drawaes
Author

Drawaes commented Aug 8, 2017

You mean I can't blame your Span changes... :)

@geoffkizer

@benaadams Yeah, that would be odd. Just to confirm though, you could check the # of SslStreams that are allocated in your trace.

I seem to recall some weirdness about allocating a separate buffer for handshake vs encrypt/decrypt, so maybe that's what's happening here.

@benaadams
Member

34 broken is a bit of a worry? 3%

Unrelated issue

@benaadams
Member

Hmm... ok looks like

const int ReadBufferSize = 4096 * 4 + 32; // = 16416

Since that's > 2^14 (by 32 bytes) ArrayPool rounds it up to the next power of 2 at 2^15 or 32kB
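The rounding described above can be checked with a short sketch. `PoolRoundingSketch` and `NextPowerOfTwo` are hypothetical helpers written for illustration; the power-of-two bucketing is the behaviour described in this comment, and `ArrayPool<T>.Rent` itself only documents that the returned array is at least the requested length, so the rented length is printed without a claimed value.

```csharp
using System;
using System.Buffers;

public static class PoolRoundingSketch
{
    // The constant quoted in the comment above.
    public const int ReadBufferSize = 4096 * 4 + 32;

    // Smallest power of two that is >= value (assumes 0 < value <= 2^30).
    public static int NextPowerOfTwo(int value)
    {
        int result = 1;
        while (result < value)
        {
            result <<= 1;
        }
        return result;
    }

    public static void Main()
    {
        Console.WriteLine(ReadBufferSize);                 // 16416
        Console.WriteLine(ReadBufferSize - (1 << 14));     // 32 bytes over 2^14
        Console.WriteLine(NextPowerOfTwo(ReadBufferSize)); // 32768

        // What the shared pool actually hands back on this runtime.
        byte[] rented = ArrayPool<byte>.Shared.Rent(ReadBufferSize);
        Console.WriteLine(rented.Length);
        ArrayPool<byte>.Shared.Return(rented);
    }
}
```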

@geoffkizer

Yeah, I just found that too. We should do something different there, but it's not urgent.

@geoffkizer

Unfortunately the Ssl record size limit seems to be 16K + 5 bytes for header. That's super annoying.

@Drawaes
Author

Drawaes commented Aug 21, 2017

It's bigger... 5 bytes for the header, 16K for the plaintext, 16 bytes for the biggest AEAD trailing block (the tag)... and 8 bytes for the sequence part of the nonce (the other 4 bytes are made from the master secret)
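Adding up the figures from this comment gives the following. This is only arithmetic on the numbers quoted in the thread, assuming an AEAD cipher suite with an 8-byte explicit nonce; `TlsRecordSizeSketch` and its constant names are hypothetical.

```csharp
using System;

public static class TlsRecordSizeSketch
{
    public const int HeaderBytes = 5;               // record header
    public const int ExplicitNonceBytes = 8;        // sequence part of the nonce
    public const int MaxPlaintextBytes = 16 * 1024; // 16K of plaintext
    public const int MaxAeadTagBytes = 16;          // trailing AEAD block

    // Constant from SslStream quoted earlier in the thread.
    public const int ReadBufferSize = 4096 * 4 + 32;

    public static int MaxRecordBytes =>
        HeaderBytes + ExplicitNonceBytes + MaxPlaintextBytes + MaxAeadTagBytes;

    public static void Main()
    {
        Console.WriteLine(MaxRecordBytes);                  // 16413
        Console.WriteLine(ReadBufferSize - MaxRecordBytes); // 3 bytes to spare
        Console.WriteLine(MaxRecordBytes > 16 * 1024);      // True: still above 2^14
    }
}
```

Under these assumptions a full record cannot fit in a 16 KB pooled buffer, so the 32 KB bucket cannot be avoided just by trimming the extra 32 bytes.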

@Drawaes
Author

Drawaes commented Aug 21, 2017

As you say though bigger fish :) there is plenty of fruit left on the tree for sure.

@Drawaes
Author

Drawaes commented Aug 21, 2017

For my info did you get a failure on big blocks on master again? On 2.0 rtm I couldn't repro.

@geoffkizer

@Drawaes Here are the results I get from your ContinueWith branch on my SslStreamPerf test:

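| MessageSize  | 1      | 16     | 256    | 4096   |
|--------------|--------|--------|--------|--------|
| baseline     | 292919 | 282623 | 261778 | 196408 |
| continuewith | 253619 | 238937 | 227682 | 171037 |
| Improvement  | -13.4% | -15.5% | -13.0% | -12.9% |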
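The Improvement row appears to be the relative change against baseline, (continuewith - baseline) / baseline. A small sketch that recomputes it from the figures above; `ImprovementSketch` and its method names are hypothetical, and the formula is inferred from the numbers rather than stated in the thread.

```csharp
using System;
using System.Globalization;

public static class ImprovementSketch
{
    // Relative change of candidate against baseline, in percent.
    public static double ImprovementPercent(double baseline, double candidate)
        => (candidate - baseline) / baseline * 100.0;

    // One decimal place, culture-independent, matching the table's style.
    public static string Format(double percent)
        => percent.ToString("F1", CultureInfo.InvariantCulture) + "%";

    public static void Main()
    {
        int[] messageSizes = { 1, 16, 256, 4096 };
        double[] baseline = { 292919, 282623, 261778, 196408 };
        double[] continueWith = { 253619, 238937, 227682, 171037 };

        for (int i = 0; i < messageSizes.Length; i++)
        {
            string change = Format(ImprovementPercent(baseline[i], continueWith[i]));
            Console.WriteLine($"{messageSizes[i],5}: {change}");
        }
    }
}
```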

@geoffkizer

@Drawaes

For my info did you get a failure on big blocks on master again?

I don't see the exception I was seeing before from the Bio code, but I don't get any results. Most likely I am eating an exception somewhere. I'll investigate further.

@Drawaes
Author

Drawaes commented Aug 21, 2017

I mean on corefx/master... I just want to know I am not dealing with a bug from the current branch at the same time. For your test above, was that with loopback or in memory? It's interesting that the ContinueWith isn't helping there. I want to understand if it's an interaction with the networking stack.

@geoffkizer

Yes, I see the same behavior on corefx/master. So it's not specific to your change. I'll take that issue offline and post an issue when I have a chance to investigate further.

The numbers I posted above are in-memory; I can run loopback too.

@Drawaes
Author

Drawaes commented Aug 21, 2017

Perfect, so if we take the TCP stack out of the equation then the ContinueWith has the effect we would expect (a slowdown), which narrows the problem a bit!

{
asyncRequest.CompleteUser();
}
asyncRequest?.CompleteUser();
Author

@stephentoub I have narrowed the issue down. It seems to have nothing to do with the continuation/free up above. The issue is around calling asyncRequest.CompleteUser() on the current thread. With the code as it stands and 256 connections (24 core/48 thread server) I see ~180k rps. If I change the call to

Task.Run(() => asyncRequest?.CompleteUser());

it jumps to > 900k rps.

This only affects Unix. To me it seems like a contention issue, because the RPS scales up to ~10 connections and then slowly goes down from there unless you put in a Task.Run().

Looking at it, could it be something in the lazy result?
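To make the two completion styles being compared concrete, here is a minimal sketch. `CompletionDispatchSketch`, `CompleteInline` and `CompleteQueued` are hypothetical stand-ins, not SslStream code. It only shows where the callback runs and says nothing about why one is faster.

```csharp
using System;
using System.Threading;
using System.Threading.Tasks;

public static class CompletionDispatchSketch
{
    // Runs the user callback on the calling thread, like a direct CompleteUser() call.
    public static void CompleteInline(Action userCallback) => userCallback();

    // Queues the user callback to the thread pool, like the Task.Run experiment.
    public static Task CompleteQueued(Action userCallback) => Task.Run(userCallback);

    public static void Main()
    {
        int callerThread = Environment.CurrentManagedThreadId;

        int inlineThread = -1;
        CompleteInline(() => inlineThread = Environment.CurrentManagedThreadId);
        Console.WriteLine(inlineThread == callerThread); // True: caller runs the callback

        bool ranOnPool = false;
        using (var done = new ManualResetEventSlim())
        {
            CompleteQueued(() =>
            {
                ranOnPool = Thread.CurrentThread.IsThreadPoolThread;
                done.Set();
            });
            done.Wait(); // block until the queued callback has finished
        }
        Console.WriteLine(ranOnPool); // True: a pool thread runs the callback
    }
}
```

With the inline form, whatever the user callback does (including taking locks) happens on the thread that completed the write; with the queued form that work moves to another thread and the completing thread returns immediately.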

CompleteUser is basically just calling the user callback. There shouldn't be any contention in the CompleteUser call itself (though of course, there could be).

The main difference here is likely something in the user code. To test this, you could add Task.Run to the user code so that the user callback will be enqueued, instead of run immediately, and see if that makes any difference.

Which test are you seeing this with? I'm not sure quite what the user's callback will be doing in this case.

@geoffkizer

BTW, I ran some numbers with my test app over loopback (instead of the in-memory stream). Here's what I see:

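Over loopback

| MessageSize  | 1      | 16     | 256    | 4096   |
|--------------|--------|--------|--------|--------|
| baseline     | 122440 | 119689 | 115707 | 92481  |
| drawaes      | 123560 | 121826 | 114060 | 92321  |
| Improvement  | 0.9%   | 1.8%   | -1.4%  | -0.2%  |
| continuewith | 24122  | 24699  | 22522  | 17899  |
| Improvement  | -80.3% | -79.4% | -80.5% | -80.6% |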

"baseline" is corefx/master (as of a couple days ago)
"drawaes" is this PR -- looks like basically a wash
"continuewith" is adding in the continuewith -- this kills performance on my test. Notably, CPU was only around 60%, so there's probably some sort of contention issue here.

@Drawaes
Author

Drawaes commented Aug 22, 2017

Cool, here is my current take. Where there is little waiting/blocking, in a pure overhead test, my changes haven't made things worse and are maybe slightly better, which is good because we have more object tracking, continuations, etc.

So even in that case it sets us up for future changes. In the async/blocking case, such as the full webserver network card scenario, it's a win :) apart from this one problem.

Now the reason I am puzzled about the user code side is... it's Kestrel, and with identical code it doesn't happen on Windows.

There are a few platform-specific parts in the user callback: OS wait handles/mutexes and libuv. I am not bold enough to blame either of those. I know there has been work on Kestrel on sockets, so I might try that to eliminate (or not) libuv.

@stephentoub
Member

This only affects Unix. To me it seems like a contention issue, because the RPS scales up to ~10 connections and then slowly goes down from there unless you put in a Task.Run().

What does the call stack look like at this point? Are we inside a callback from OpenSSL? And if so, might it be holding a lock?

@Drawaes
Author

Drawaes commented Aug 22, 2017

We shouldn't be, not sure how we could be still inside a callback at this point unless there is an exception thrown in the managed callback, and when there has been in development it has spat that out and stopped.

I am trying to figure out tracing on Linux. What is interesting is that there is a 10x slowdown with tracing/event log on my Linux machine, but the CPU isn't doing much, which suggests there is some lock somewhere doing something bad, as that is disproportionate to the amount of logs I am seeing traced out. If I turn off event logs and just do a trace, I get something more reasonable.

What I can glean from the logs (user and kernel stacks don't correlate, which is a pain) is that the CPUs are mostly idle for the non-ContinueWith case.

And the most CPU time (apart from idle) is

module 2.19.so <<libpthread-2.19.so!__lll_lock_wait

closely followed by

module 2.19.so <<libpthread-2.19.so!__lll_unlock_wake

which kinda says it all :)

@Drawaes
Author

Drawaes commented Aug 22, 2017

The managed stack looks something like

https://gist.github.com/Drawaes/4ff5036448d87557eb735166682332ad

The .net callbacks from unmanaged code are pretty tiny so I am not sure it can be that. I will take another look at the openssl locking.

@stephentoub
Member

stephentoub commented Aug 22, 2017

The managed stack looks something like

Ok, thanks.

@Drawaes, have you or @benaadams tried just adding that continuation to the existing code base? e.g. changing the line at https://github.com/dotnet/corefx/blob/master/src/System.Net.Security/src/System/Net/Security/SslStreamInternal.cs#L437 from:

Task t = _sslState.InnerStream.WriteAsync(outBuffer, 0, encryptedBytes);

to:

Task t = _sslState.InnerStream.WriteAsync(outBuffer, 0, encryptedBytes).ContinueWith(p => p.GetAwaiter().GetResult());

?

If you have, what were the results? If you haven't, might be interesting to see if/how much of the gains here are due to escaping whatever lock is causing a problem here, separate from the rest of the changes being made.

@Drawaes
Author

Drawaes commented Aug 22, 2017

Funny... I am building it right now to test that :)

@Drawaes
Author

Drawaes commented Aug 22, 2017

@stephentoub good or bad news, depending on how you look at it. I am seeing 150k -> 650k, so I am gaining an extra ~250k with my change, but a large chunk is from this issue. At this stage I am contemplating closing this PR and opening an issue for this. I will start working on that in isolation, and then we can come back to this in the future, as I think this is an issue at this stage (FYI 2.0 as well as 2.1 has this issue).

What do you think?

@geoffkizer

@Drawaes That seems good to me.

@Drawaes
Author

Drawaes commented Aug 22, 2017

https://github.com/dotnet/corefx/issues/23485

I now hold the dubious honour of having the most commented-on PRs in core and corefx that never got merged :P

@Drawaes Drawaes closed this Aug 22, 2017
@geoffkizer

@Drawaes FYI, my perf microbenchmark is at https://github.com/geoffkizer/netperf/tree/master/SocketPerfTest. It's pretty simple to use. As you're making additional changes here, I'd love to see the numbers you get from this microbenchmark.

@Drawaes
Author

Drawaes commented Aug 23, 2017

Cool.. I am feeling upbeat, I got vtune working on my benchmark machines finally...so if I seemed to have gone quiet I haven't stopped looking :) I might have some interesting numbers (or not but hey proving it's not something is useful as well right ? ) soon.

@karelz karelz modified the milestone: 2.1.0 Aug 28, 2017