
Significant performance difference between x += y and x = x + y on properties, differing between hardware and runtime version (7 / 8) #108227


Open
BlackxSnow opened this issue Sep 25, 2024 · 11 comments
Labels: area-CodeGen-coreclr, tenet-performance

@BlackxSnow

(I originally detailed this issue on StackOverflow, here.)

Description

The following two snippets produce wildly different benchmark results from each other, and the results also differ between machines and major runtime versions (SomeProperty is an int auto-property):

SomeProperty += i;

var propertyValue = SomeProperty;
SomeProperty = propertyValue + i;

The benchmark (below), when run on my machine, showed poor performance for the former case on .NET 7 but otherwise expected results. Two other people ran the benchmark and saw poor performance for both cases on .NET 8 but not on .NET 7. The host version did not appear to make a difference in these cases. I've included the benchmark results and system configurations below the benchmark code.

Possibly relevant, though not directly related: I've noticed (but not been able to isolate) significant performance issues when writing data through a native memory pointer obtained by mapping a Direct3D sub-resource. The problem wasn't present on .NET 8, nor on any of my colleagues' machines on .NET 7. It appears to be more strongly linked to the number of assignments through the pointer than to the amount of data assigned.

Benchmark

using System.Runtime.CompilerServices;
using BenchmarkDotNet.Attributes;
using BenchmarkDotNet.Configs;
using BenchmarkDotNet.Environments;
using BenchmarkDotNet.Jobs;

namespace Benchmarks;

[DisassemblyDiagnoser(maxDepth: 1, printSource:true)]
[Config(typeof(Config))]
public class FieldVsProperty
{
    public int Prop_ReadWrite { get; set; } = Random.Shared.Next();

    public static int N = 1000;
    
    [Benchmark]
    public int Property_ReadWrite_Write_Add()
    {
        for (int i = 0; i < N; i++)
        {
            Prop_ReadWrite += i;
        }
        return Prop_ReadWrite;
    }
    [Benchmark]
    public int Property_ReadWrite_Write_Add_Separate()
    {
        for (int i = 0; i < N; i++)
        {
            var val = Prop_ReadWrite;
            Prop_ReadWrite = val + i;
        }
        return Prop_ReadWrite;
    }
    
    private class Config : ManualConfig
    {
        public Config()
        {
            AddJob(Job.Default.WithRuntime(CoreRuntime.Core70));
            AddJob(Job.Default.WithRuntime(CoreRuntime.Core80));
        }
    }
}

Data

My machine (also ran this on my Arch Linux install, with no notable difference):

BenchmarkDotNet v0.13.12, Windows 10 (10.0.19045.4412/22H2/2022Update)
AMD Ryzen 9 7900X, 1 CPU, 24 logical and 12 physical cores
.NET SDK 8.0.302
  [Host]     : .NET 8.0.6 (8.0.624.26715), X64 RyuJIT AVX-512F+CD+BW+DQ+VL+VBMI
  Job-XRUBGY : .NET 7.0.15 (7.0.1523.57226), X64 RyuJIT AVX2
  Job-TPIWHS : .NET 8.0.6 (8.0.624.26715), X64 RyuJIT AVX-512F+CD+BW+DQ+VL+VBMI


| Method                                | Runtime  | Mean       | Error   | StdDev  | Code Size |
|-------------------------------------- |--------- |-----------:|--------:|--------:|----------:|
| Property_ReadWrite_Write_Add          | .NET 7.0 | 1,352.2 ns | 5.95 ns | 5.57 ns |      33 B |
| Property_ReadWrite_Write_Add_Separate | .NET 7.0 |   186.9 ns | 0.53 ns | 0.50 ns |      33 B |
| Property_ReadWrite_Write_Add          | .NET 8.0 |   186.8 ns | 0.45 ns | 0.40 ns |      25 B |
| Property_ReadWrite_Write_Add_Separate | .NET 8.0 |   186.9 ns | 0.40 ns | 0.35 ns |      25 B |

The two other machines:

BenchmarkDotNet v0.13.12, Windows 11 (10.0.22631.4037/23H2/2023Update/SunValley3)
12th Gen Intel Core i9-12900HK, 1 CPU, 20 logical and 14 physical cores
.NET SDK 8.0.302
  [Host]     : .NET 7.0.16 (7.0.1624.6629), X64 RyuJIT AVX2
  Job-TVXXNG : .NET 7.0.16 (7.0.1624.6629), X64 RyuJIT AVX2
  Job-YOOWAN : .NET 8.0.6 (8.0.624.26715), X64 RyuJIT AVX2


| Method                                | Runtime  | Mean       | Error   | StdDev  | Code Size |
|-------------------------------------- |--------- |-----------:|--------:|--------:|----------:|
| Property_ReadWrite_Write_Add          | .NET 7.0 |   284.5 ns | 5.59 ns | 8.19 ns |      33 B |
| Property_ReadWrite_Write_Add_Separate | .NET 7.0 |   256.1 ns | 3.04 ns | 2.54 ns |      33 B |
| Property_ReadWrite_Write_Add          | .NET 8.0 | 1,488.2 ns | 6.57 ns | 5.49 ns |      25 B |
| Property_ReadWrite_Write_Add_Separate | .NET 8.0 | 1,496.4 ns | 7.92 ns | 7.02 ns |      25 B |

BenchmarkDotNet v0.14.0, Windows 11 (10.0.22631.3880/23H2/2023Update/SunValley3)
12th Gen Intel Core i7-12650H, 1 CPU, 16 logical and 10 physical cores
.NET SDK 8.0.303
  [Host]     : .NET 8.0.7 (8.0.724.31311), X64 RyuJIT AVX2
  Job-WMSFHF : .NET 7.0.20 (7.0.2024.26716), X64 RyuJIT AVX2
  Job-HBRVHQ : .NET 8.0.7 (8.0.724.31311), X64 RyuJIT AVX2

| Method                                | Runtime  | Mean       | Error    | StdDev   | Code Size |
|-------------------------------------- |--------- |-----------:|---------:|---------:|----------:|
| Property_ReadWrite_Write_Add          | .NET 7.0 |   407.7 ns |  8.21 ns | 16.40 ns |      33 B |
| Property_ReadWrite_Write_Add_Separate | .NET 7.0 |   348.1 ns |  6.68 ns |  7.70 ns |      33 B |
| Property_ReadWrite_Write_Add          | .NET 8.0 | 1,878.6 ns | 36.54 ns | 51.22 ns |      25 B |
| Property_ReadWrite_Write_Add_Separate | .NET 8.0 | 1,870.5 ns | 37.06 ns | 42.68 ns |      25 B |

Analysis

The most notable difference is between the CPU vendors, but the data is pretty limited.
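
Dividing each mean by the loop count (N = 1000) puts the numbers on a per-iteration footing, which makes the gap easier to compare across machines. The following is a minimal sketch of that arithmetic only; the class and method names are made up for illustration and are not part of the benchmark.

```csharp
using System;

// Hypothetical helper: names are illustrative, not from the benchmark project.
public static class PerIteration
{
    // Loop count used by the benchmark (public static int N = 1000).
    public const int N = 1000;

    // Mean time of one benchmark invocation divided by the number of loop iterations.
    public static double NsPerIteration(double meanNs) => meanNs / N;

    // How many times slower the slow case is than the fast case.
    public static double Slowdown(double slowMeanNs, double fastMeanNs) => slowMeanNs / fastMeanNs;
}
```

Fed the two .NET 7.0 means from the Ryzen 9 7900X table (1,352.2 ns and 186.9 ns), this gives about 1.35 ns per iteration for the `+=` form against about 0.19 ns for the separate form, a ratio of roughly 7.2.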

@BlackxSnow BlackxSnow added the tenet-performance Performance related issue label Sep 25, 2024
@dotnet-policy-service dotnet-policy-service bot added the untriaged New issue has not been triaged by the area owner label Sep 25, 2024
@karakasa
Contributor

karakasa commented Sep 25, 2024

On .NET 8, both forms of the addition compile to a single instruction:

       add dword ptr [rbx+0x08], r15d

While on .NET 7, it's

       mov      edi, r14d
       add      edi, dword ptr [rbx+08H]
       mov      dword ptr [rbx+08H], edi

and

       mov      edi, dword ptr [rbx+08H]
       add      edi, r14d
       mov      dword ptr [rbx+08H], edi

, respectively.

https://godbolt.org/z/51ooerfGr

I'm unsure why the first one would be slower.

@huoyaoyuan huoyaoyuan added the area-CodeGen-coreclr CLR JIT compiler in src/coreclr/src/jit and related components such as SuperPMI label Sep 25, 2024
@huoyaoyuan
Member

This may be microarchitecture-specific behavior in how memory operands are handled. A tight loop also increases the chance that the branch predictor and out-of-order execution distort the result.

On my Ice Lake-SP there's almost no difference:

BenchmarkDotNet v0.13.12, Windows 11 (10.0.22631.4169/23H2/2023Update/SunValley3)
Intel Core i9-10900X CPU 3.70GHz, 1 CPU, 20 logical and 10 physical cores
.NET SDK 9.0.100-rc.1.24452.12
  [Host]     : .NET 8.0.8 (8.0.824.36612), X64 RyuJIT AVX-512F+CD+BW+DQ+VL
  Job-TIZHRT : .NET 6.0.33 (6.0.3324.36610), X64 RyuJIT AVX2
  Job-RVIHQI : .NET 8.0.8 (8.0.824.36612), X64 RyuJIT AVX-512F+CD+BW+DQ+VL
  Job-UKHFVG : .NET 9.0.0 (9.0.24.43107), X64 RyuJIT AVX-512F+CD+BW+DQ+VL

| Method                                | Runtime  | Mean     | Error     | StdDev    |
|-------------------------------------- |--------- |---------:|----------:|----------:|
| Property_ReadWrite_Write_Add          | .NET 6.0 | 1.193 us | 0.0063 us | 0.0053 us |
| Property_ReadWrite_Write_Add_Separate | .NET 6.0 | 1.201 us | 0.0034 us | 0.0030 us |
| Property_ReadWrite_Write_Add          | .NET 8.0 | 1.185 us | 0.0102 us | 0.0090 us |
| Property_ReadWrite_Write_Add_Separate | .NET 8.0 | 1.197 us | 0.0161 us | 0.0134 us |
| Property_ReadWrite_Write_Add          | .NET 9.0 | 1.213 us | 0.0114 us | 0.0101 us |
| Property_ReadWrite_Write_Add_Separate | .NET 9.0 | 1.210 us | 0.0163 us | 0.0144 us |

Manually unrolling the loop so that it updates 8 properties in a row may also bring the results closer together.
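
The comment does not include code for the unrolled variant, so the following is only one possible reading of it: eight separate auto-properties, each updated once per iteration. The class name, the property names P0 to P7, and the choice to start them at 0 (rather than Random.Shared.Next() as in the original benchmark) are all assumptions made to keep the sketch deterministic.

```csharp
// Hypothetical unrolled variant; not taken from the thread.
public class UnrolledProperties
{
    // Eight independent int auto-properties, default-initialized to 0.
    public int P0 { get; set; }
    public int P1 { get; set; }
    public int P2 { get; set; }
    public int P3 { get; set; }
    public int P4 { get; set; }
    public int P5 { get; set; }
    public int P6 { get; set; }
    public int P7 { get; set; }

    public static int N = 1000;

    // Each iteration performs eight read-modify-write updates on different
    // properties instead of one update on a single property.
    public int Add_Unrolled()
    {
        for (int i = 0; i < N; i++)
        {
            P0 += i;
            P1 += i;
            P2 += i;
            P3 += i;
            P4 += i;
            P5 += i;
            P6 += i;
            P7 += i;
        }
        // Combine all properties so none of the updates can be discarded.
        return P0 + P1 + P2 + P3 + P4 + P5 + P6 + P7;
    }
}
```

To measure it, the method would need the same [Benchmark] attribute as the methods in the original class. Whether this layout actually narrows the gap on the affected CPUs is untested here.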

@EgorBo
Member

EgorBo commented Sep 25, 2024

.NET 7.0:

| Method                                | Mean       | Error   | StdDev  |
|-------------------------------------- |-----------:|--------:|--------:|
| Property_ReadWrite_Write_Add          | 1,379.6 ns | 6.49 ns | 5.42 ns |
| Property_ReadWrite_Write_Add_Separate |   196.1 ns | 2.78 ns | 2.47 ns |

.NET 8.0 (same on .NET 9.0):

| Method                                | Mean     | Error   | StdDev  |
|-------------------------------------- |---------:|--------:|--------:|
| Property_ReadWrite_Write_Add          | 194.6 ns | 2.45 ns | 2.17 ns |
| Property_ReadWrite_Write_Add_Separate | 193.7 ns | 1.31 ns | 1.02 ns |

So it looks like everything is okay?

@EgorBo
Member

EgorBo commented Sep 25, 2024

@EgorBot -intel -arm64 --runtimes net7.0 net8.0 net9.0

using BenchmarkDotNet.Attributes;

public class FieldVsProperty
{
    public int Prop_ReadWrite { get; set; } = Random.Shared.Next();

    public static int N = 1000;

    [Benchmark]
    public int Property_ReadWrite_Write_Add()
    {
        for (int i = 0; i < N; i++)
        {
            Prop_ReadWrite += i;
        }
        return Prop_ReadWrite;
    }
    [Benchmark]
    public int Property_ReadWrite_Write_Add_Separate()
    {
        for (int i = 0; i < N; i++)
        {
            var val = Prop_ReadWrite;
            Prop_ReadWrite = val + i;
        }
        return Prop_ReadWrite;
    }
}

@huoyaoyuan
Member

I can reproduce the same regression on Raptor Lake:

BenchmarkDotNet v0.13.12, Windows 11 (10.0.26100.1742)
13th Gen Intel Core i9-13900K, 1 CPU, 32 logical and 24 physical cores
.NET SDK 9.0.100-rc.1.24452.12
  [Host]     : .NET 8.0.8 (8.0.824.36612), X64 RyuJIT AVX2
  Job-ZJTSND : .NET 6.0.33 (6.0.3324.36610), X64 RyuJIT AVX2
  Job-EGSVEF : .NET 8.0.8 (8.0.824.36612), X64 RyuJIT AVX2


| Method                                | Runtime  | Mean       | Error    | StdDev   | Code Size |
|-------------------------------------- |--------- |-----------:|---------:|---------:|----------:|
| Property_ReadWrite_Write_Add          | .NET 6.0 |   237.5 ns |  1.08 ns |  0.90 ns |      65 B |
| Property_ReadWrite_Write_Add_Separate | .NET 6.0 |   215.0 ns |  0.66 ns |  0.61 ns |      65 B |
| Property_ReadWrite_Write_Add          | .NET 8.0 | 1,254.5 ns |  3.28 ns |  2.91 ns |      25 B |
| Property_ReadWrite_Write_Add_Separate | .NET 8.0 | 1,250.5 ns | 24.55 ns | 25.21 ns |      25 B |

When affinitized to the E-cores:

BenchmarkDotNet v0.13.12, Windows 11 (10.0.26100.1742)
13th Gen Intel Core i9-13900K, 1 CPU, 32 logical and 24 physical cores
.NET SDK 9.0.100-rc.1.24452.12
  [Host]     : .NET 8.0.8 (8.0.824.36612), X64 RyuJIT AVX2
  Job-FAOOUD : .NET 6.0.33 (6.0.3324.36610), X64 RyuJIT AVX2
  Job-PUVIKJ : .NET 8.0.8 (8.0.824.36612), X64 RyuJIT AVX2

Affinity=00000000000000010000000000000000

| Method                                | Runtime  | Mean     | Error   | StdDev  | Code Size |
|-------------------------------------- |--------- |---------:|--------:|--------:|----------:|
| Property_ReadWrite_Write_Add          | .NET 6.0 | 440.0 ns | 7.06 ns | 5.52 ns |      65 B |
| Property_ReadWrite_Write_Add_Separate | .NET 6.0 | 428.0 ns | 2.43 ns | 2.03 ns |      65 B |
| Property_ReadWrite_Write_Add          | .NET 8.0 | 341.3 ns | 4.36 ns | 3.86 ns |      25 B |
| Property_ReadWrite_Write_Add_Separate | .NET 8.0 | 346.7 ns | 6.78 ns | 9.28 ns |      25 B |

Apparently something is going wrong on the Golden Cove cores: the E-cores perform much better than the P-cores!
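
The Affinity= line in the summary above is a bit mask with exactly one bit set, bit 16 counting from the right, which selects a single logical processor. A small sketch of how such a mask can be built and rendered in the same binary form follows. The helper names are hypothetical, and how the mask was actually passed to BenchmarkDotNet is not shown in the comment; the job-level affinity setting is presumably what was used.

```csharp
using System;

// Hypothetical helper: names are illustrative, not from the thread.
public static class AffinityMask
{
    // Mask with only the given zero-based logical processor selected.
    public static long ForLogicalProcessor(int index) => 1L << index;

    // Binary rendering of the mask, left-padded with zeros to one digit
    // per logical processor (32 on the machine in the summary above).
    public static string ToBinary(long mask, int logicalProcessors) =>
        Convert.ToString(mask, 2).PadLeft(logicalProcessors, '0');
}
```

For index 16 and 32 logical processors this produces the same string as the summary, 00000000000000010000000000000000. Which logical processor indices map to E-cores depends on the machine, so the index should be checked against the actual topology before reusing it.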

@colejohnson66

colejohnson66 commented Sep 25, 2024

I remember reading something in Intel's optimization guide that newer CPU models will fuse mov {reg1}, [mem]; {op} {reg1}, {reg2} into a three-operand non-destructive form of {op} {reg1}, [mem], {reg2}, which bypasses a register rename holding up retirement of {reg1}. Perhaps swapping the memory operand's position inhibits this optimization?

@BlackxSnow
Author

> This may be microarchitecture-specific behavior in how memory operands are handled. A tight loop also increases the chance that the branch predictor and out-of-order execution distort the result.
>
> On my Ice Lake-SP there's almost no difference:
>
> BenchmarkDotNet v0.13.12, Windows 11 (10.0.22631.4169/23H2/2023Update/SunValley3)
> Intel Core i9-10900X CPU 3.70GHz, 1 CPU, 20 logical and 10 physical cores
> .NET SDK 9.0.100-rc.1.24452.12
>   [Host]     : .NET 8.0.8 (8.0.824.36612), X64 RyuJIT AVX-512F+CD+BW+DQ+VL
>   Job-TIZHRT : .NET 6.0.33 (6.0.3324.36610), X64 RyuJIT AVX2
>   Job-RVIHQI : .NET 8.0.8 (8.0.824.36612), X64 RyuJIT AVX-512F+CD+BW+DQ+VL
>   Job-UKHFVG : .NET 9.0.0 (9.0.24.43107), X64 RyuJIT AVX-512F+CD+BW+DQ+VL
>
> | Method                                | Runtime  | Mean     | Error     | StdDev    |
> |-------------------------------------- |--------- |---------:|----------:|----------:|
> | Property_ReadWrite_Write_Add          | .NET 6.0 | 1.193 us | 0.0063 us | 0.0053 us |
> | Property_ReadWrite_Write_Add_Separate | .NET 6.0 | 1.201 us | 0.0034 us | 0.0030 us |
> | Property_ReadWrite_Write_Add          | .NET 8.0 | 1.185 us | 0.0102 us | 0.0090 us |
> | Property_ReadWrite_Write_Add_Separate | .NET 8.0 | 1.197 us | 0.0161 us | 0.0134 us |
> | Property_ReadWrite_Write_Add          | .NET 9.0 | 1.213 us | 0.0114 us | 0.0101 us |
> | Property_ReadWrite_Write_Add_Separate | .NET 9.0 | 1.210 us | 0.0163 us | 0.0144 us |
>
> Manually unrolling the loop so that it updates 8 properties in a row may also bring the results closer together.

Your results exclude .NET 7. I'm wondering whether you see the same bump in execution time for Add (compared to Add_Separate) that I do.

@huoyaoyuan
Member

> Your results exclude .NET 7.

I just ran it on what I have installed on my machine. 6.0 stands in for pre-8.0, which doesn't include the codegen change.

> I'm wondering whether you see the same bump in execution time for Add (compared to Add_Separate) that I do.

The behavior seems consistent for each micro-architecture.

Ice Lake-SP, Gracemont: everything looks fine.
Zen 4: significantly slower with the pre-8.0 Add codegen, fine otherwise.
Golden Cove: significantly slower with the post-8.0 codegen.

@JulieLeeMSFT JulieLeeMSFT removed the untriaged New issue has not been triaged by the area owner label Sep 25, 2024
@JulieLeeMSFT JulieLeeMSFT added this to the 10.0.0 milestone Sep 25, 2024
@JulieLeeMSFT
Member

@BruceForstall, PTAL when we get Meteor Lake laptops this year. cc @dotnet/jit-contrib.

@BlackxSnow
Author

> .NET 7.0:
>
> | Method                                | Mean       | Error   | StdDev  |
> |-------------------------------------- |-----------:|--------:|--------:|
> | Property_ReadWrite_Write_Add          | 1,379.6 ns | 6.49 ns | 5.42 ns |
> | Property_ReadWrite_Write_Add_Separate |   196.1 ns | 2.78 ns | 2.47 ns |
>
> .NET 8.0 (same on .NET 9.0):
>
> | Method                                | Mean     | Error   | StdDev  |
> |-------------------------------------- |---------:|--------:|--------:|
> | Property_ReadWrite_Write_Add          | 194.6 ns | 2.45 ns | 2.17 ns |
> | Property_ReadWrite_Write_Add_Separate | 193.7 ns | 1.31 ns | 1.02 ns |
>
> So it looks like everything is okay?

What architecture were these run on? Yours are the only results I've seen that mimic my system.

@huoyaoyuan
Member

> What architecture were these run on? Yours are the only results I've seen that mimic my system.

I remember Egor uses R9-7950X. It's also Zen 4.

@BruceForstall BruceForstall modified the milestones: 10.0.0, Future Apr 18, 2025