Socket.Unix: reduce locking by using Interlocked operations #36008

tmds · 2020-05-07T20:01:10Z

This optimizes the common case where there is at most one
ongoing receive and one ongoing send operation.

cc @stephentoub @adamsitnik @antonfirsov @karelz
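
For readers unfamiliar with the pattern, here is a minimal sketch of the general idea, not the code in this PR: when at most one operation is in flight, a single `Interlocked.CompareExchange` can claim a slot without taking a lock. The type and member names (`SingleOperationSlot`, `TryEnter`, `Exit`) are hypothetical and do not exist in System.Net.Sockets.

```csharp
using System.Threading;

// Hypothetical illustration only; not the actual Socket.Unix implementation.
public sealed class SingleOperationSlot
{
    private const int Idle = 0;
    private const int Busy = 1;

    private int _state = Idle;

    // Atomically moves Idle -> Busy. Returns true if this caller claimed the slot,
    // false if another operation already holds it (the caller would then fall back
    // to a slower, lock-based path).
    public bool TryEnter() =>
        Interlocked.CompareExchange(ref _state, Busy, Idle) == Idle;

    // Releases the slot so that the next operation can claim it.
    public void Exit() => Volatile.Write(ref _state, Idle);
}
```

In the uncontended case this costs one atomic instruction; whether that is actually cheaper than an uncontended lock is exactly what the benchmarks would have to show.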


ghost · 2020-05-07T20:01:16Z

Tagging subscribers to this area: @dotnet/ncl
Notify danmosemsft if you want to be subscribed.

tmds · 2020-05-07T20:02:06Z

This is meant for benchmarking. No need to review until benchmarks show it is worth it.

adamsitnik · 2020-05-07T20:23:47Z

> This is meant for benchmarking.

@tmds is there any chance you could share your copy of System.Net.Sockets.dll with me? I am just lazy and would like to go the easy way without compiling your fork ;)

tmds · 2020-05-07T20:51:21Z

Adam, here it is: System.Net.Sockets.dll.tar.gz

adamsitnik · 2020-05-08T13:08:24Z

BTW, I was looking at the JSON profile today, and we spent something around 1.5% of total time in this particular lock.

tmds · 2020-05-08T14:00:16Z

It looks like slight gains on x64, and mixed gain/loss on arm64.
The numbers aren't very consistent, and they deviate a lot from 1.5% spent in the lock for JSON.
If you run these benchmarks again, will the results be similar?

adamsitnik · 2020-05-08T14:26:11Z

I've tried to run a few of them and the results seem similar (and still not super stable).

adamsitnik · 2020-05-08T14:56:34Z

I've run the benchmarks one more time, the results look very similar:

(I've interrupted the ARM run as I am finishing work for today)

tmds · 2020-05-11T11:45:50Z

There's no reason these changes should regress performance.
I don't know what is going on on arm64. Do we care about figuring it out?
Adam, can you collect a perftrace on arm64 for jsonplatform with 128 connections before and after? Maybe it tells us something.

adamsitnik · 2020-05-11T20:22:38Z

> I don't know what is going on on arm64.

From what I've seen in the profiles so far, all the Interlocked and Volatile operations, which seem to be almost immediate on x64, are not so cheap on ARM64.

A good example is this simple micro-benchmark:

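using System.Threading;
using BenchmarkDotNet.Attributes;

public class Perf_Volatile
{
    private double _location = 0;
    private double _newValue = 1;

    // Measures the cost of a volatile read of a double field.
    [Benchmark]
    public double Read_double() => Volatile.Read(ref _location);

    // Measures the cost of a volatile write to a double field.
    [Benchmark]
    public void Write_double() => Volatile.Write(ref _location, _newValue);
}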

BenchmarkDotNet=v0.12.1, OS=ubuntu 18.04
ARMv8 Processor rev 1 (v8l), 4 logical cores
.NET Core SDK=5.0.100-preview.4.20217.5
  [Host]     : .NET Core 5.0.0 (CoreCLR 5.0.20.21702, CoreFX 5.0.20.21702), Arm64 RyuJIT
  Job-ULPSAR : .NET Core 5.0.0 (CoreCLR 5.0.20.21702, CoreFX 5.0.20.21702), Arm64 RyuJIT

|       Method |     Mean |
|------------- |---------:|
|  Read_double | 10.34 ns |
| Write_double | 17.53 ns |

BenchmarkDotNet=v0.12.1, OS=Windows 10.0.18363.778 (1909/November2018Update/19H2)
Intel Xeon CPU E5-1650 v4 3.60GHz, 1 CPU, 12 logical and 6 physical cores
.NET Core SDK=5.0.100-preview.4.20217.2
  [Host]     : .NET Core 5.0.0 (CoreCLR 5.0.20.21611, CoreFX 5.0.20.21611), X64 RyuJIT
  Job-OXOUZI : .NET Core 5.0.0 (CoreCLR 5.0.20.21611, CoreFX 5.0.20.21611), X64 RyuJIT

|       Method |      Mean |
|------------- |----------:|
|  Read_double | 0.0006 ns |
| Write_double | 0.2595 ns |

@kunalspathak is this expected? Can we do anything about it?

adamsitnik · 2020-05-11T20:25:55Z

> Adam, can you collect a perftrace on arm64 for jsonplatform with 128 connections before and after?

Sure, I can get them for you tomorrow. BTW when my VPN is not working I just close everything on my PC and run the benchmark server and wrk myself. I use the wrk arguments from this file: https://github.com/aspnet/Benchmarks/blob/master/src/WrkClient/wrk.yml and it typically can give me an answer whether my change is going to improve the perf or not. I know that it is far from perfect, but it should shorten your perf feedback loop.

stephentoub · 2020-05-11T20:28:20Z

> From what I've seen in the profiles so far is that all the Interlocked and Volatile operations which seem to be almost immediate on x64 are not so cheap on ARM64

On x86/64 the architecture's memory model is strong enough that volatile operations end up serving just as a compiler barrier; the actual instruction output doesn't differ based on whether the read or write is volatile. You can see that here:
https://sharplab.io/#v2:EYLgxg9gTgpgtADwGwBYA0AXEBDAzgWwB8ABABgAJiBGAOgBUALWbAEwEsA7AcwG4BYAFDEAzJQBM5AMLkA3oPILKozhnIBZKgApYAM3Ir9ASnIBeAHzkAahAA22DGxswaAJRittMPW0P8BipX0OVTUxT29go1MLNj8AXyA=

On ARM, there's a weaker memory model, so volatile operations end up entailing actual barriers in the JIT'd instructions, e.g. dmb.

kunalspathak · 2020-05-11T20:30:30Z

Echoing what @stephentoub said, one recommendation would be to look for opportunities to reduce volatile variable accesses inside loops. See #34225 for an example.
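
As a rough illustration of that recommendation (a sketch with hypothetical names, not code taken from #34225): if the loop does not need to observe concurrent updates to the field, the volatile read can be done once before the loop, so that on ARM64 the barrier is paid once rather than on every iteration.

```csharp
using System.Threading;

// Hypothetical example type; the names are for illustration only.
public sealed class VolatileLoopSketch
{
    private int _limit;

    public VolatileLoopSketch(int limit) => _limit = limit;

    // Performs a volatile read of _limit on every iteration of the loop.
    public int SumVolatilePerIteration()
    {
        int sum = 0;
        for (int i = 0; i < Volatile.Read(ref _limit); i++)
        {
            sum += i;
        }
        return sum;
    }

    // Performs a single volatile read and reuses the local copy inside the loop.
    // Only valid when the loop does not have to see later writes to _limit.
    public int SumHoisted()
    {
        int limit = Volatile.Read(ref _limit);
        int sum = 0;
        for (int i = 0; i < limit; i++)
        {
            sum += i;
        }
        return sum;
    }
}
```

Both methods return the same result when no other thread writes `_limit`; the second simply avoids repeating the volatile access.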

tmds · 2020-05-12T08:22:13Z

> Sure, I can get them for you tomorrow.

@adamsitnik let me make a few changes before you collect traces.

tmds · 2020-05-12T12:44:33Z

@adamsitnik I pushed the change. If you collect perftraces for arm64 for jsonplatform with 128 connections before and after, maybe we'll learn why this is regressing.

This is the compiled System.Net.Sockets: System.Net.Sockets.dll.tar.gz

tmds · 2020-05-12T15:37:12Z

I got traces from Adam and took a look.
The variance between benchmark runs makes it impossible to derive anything meaningful.

This change is trying to replace something with something else, assuming it is cheaper. But I'm not really sure it is cheaper, and I don't have a good way to measure.
I'm giving up on this.

stephentoub · 2020-05-12T16:31:29Z

Thanks for trying, @tmds.

Socket.Unix: reduce locking by using Interlocked operations

3e4c6fb

This optimizes the common case where there is at most one ongoing receive and one ongoing send operation.

Dotnet-GitSync-Bot added the area-System.Net.Sockets label May 7, 2020

Cleanup

29ed35d

jaredpar mentioned this pull request May 7, 2020

OSX machines are de-provisioned during CI / PR runs leading to failures #34472

Closed

tmds mentioned this pull request May 11, 2020

perform a lock-free speculative IsReady check for small socket send operations #36214

Closed

Get rid of volatile queue reads when processing stops

8231f0e

tmds closed this May 12, 2020

karelz added this to the 5.0.0 milestone Aug 18, 2020

ghost locked as resolved and limited conversation to collaborators Dec 9, 2020
