Thanks to visit codestin.com
Credit goes to github.com

Skip to content

Improve BlockCopy performance #117

@neuecc

Description

@neuecc

In serialization process, MessagePack for C# uses Buffer.BlockCopy heavily.
for example when serializing contractless(stirng key), property name was cached and uses block copy.
Here is benchmark of property name(Encoding.UTF8.GetBytes("MyProperty1"), length = 11) copy speed.

Method Jit Platform Mean Error Scaled Allocated
ArrayCopy LegacyJit X86 29.714 ns NA 1.31 0 B
BlockCopy LegacyJit X86 22.731 ns NA 1.00 0 B
MemoryCopy LegacyJit X86 6.256 ns NA 0.28 0 B
ByteCopy LegacyJit X86 10.119 ns NA 0.45 0 B
LongCopy LegacyJit X86 5.429 ns NA 0.24 0 B
Copy11 LegacyJit X86 2.992 ns NA 0.13 0 B
CopyBlock LegacyJit X86 15.780 ns NA 0.69 0 B
CopyBlockUnaligned LegacyJit X86 13.963 ns NA 0.61 0 B
ArrayCopy RyuJit X64 9.942 ns NA 0.93 0 B
BlockCopy RyuJit X64 10.666 ns NA 1.00 0 B
MemoryCopy RyuJit X64 3.173 ns NA 0.30 0 B
ByteCopy RyuJit X64 7.308 ns NA 0.69 0 B
LongCopy RyuJit X64 2.586 ns NA 0.24 0 B
Copy11 RyuJit X64 2.586 ns NA 0.24 0 B
CopyBlock RyuJit X64 3.406 ns NA 0.32 0 B
CopyBlockUnaligned RyuJit X64 3.304 ns NA 0.31 0 B

MemoryCopy is fast but only support from .NET 4.6.
CopyBlock, CopyBlockUnaligned is from System.Runtime.CompilerServices.Unsafe, it is using cpblk opcode directly.
https://github.com/dotnet/corefx/blob/master/src/System.Runtime.CompilerServices.Unsafe/src/System.Runtime.CompilerServices.Unsafe.il#L162
It is nice but in x86, too slow.

ByteCopy , LongCopy, Copy11 are self defined unsafe copy code.
Surprisingly, it is very fast in small size(like property name binary).

using BenchmarkDotNet.Attributes;
using BenchmarkDotNet.Configs;
using BenchmarkDotNet.Diagnosers;
using BenchmarkDotNet.Exporters;
using BenchmarkDotNet.Jobs;
using BenchmarkDotNet.Running;
using System;
using System.Runtime.CompilerServices;
using System.Text;

class Program
{
    static void Main(string[] args)
    {
        var switcher = new BenchmarkSwitcher(new[]
        {
            typeof(StandardBenchmark)
        });

        args = new string[] { "0" };

#if DEBUG
        var b = new StandardBenchmark();
#else
        switcher.Run(args);
#endif
    }
}

public class BenchmarkConfig : ManualConfig
{
    public BenchmarkConfig()
    {
        Add(MarkdownExporter.GitHub);
        Add(MemoryDiagnoser.Default);

        Add(Job.RyuJitX64.WithWarmupCount(1).WithLaunchCount(1).WithTargetCount(1),
            Job.LegacyJitX86.WithWarmupCount(1).WithLaunchCount(1).WithTargetCount(1));
    }
}

[Config(typeof(BenchmarkConfig))]
public class StandardBenchmark
{
    byte[] bytes = Encoding.UTF8.GetBytes("MyProperty1");
    byte[] to;

    public StandardBenchmark()
    {
        to = new byte[bytes.Length];
    }

    [Benchmark]
    public void ArrayCopy()
    {
        Array.Copy(bytes, 0, to, 0, to.Length);
    }

    [Benchmark(Baseline = true)]
    public void BlockCopy()
    {
        Buffer.BlockCopy(bytes, 0, to, 0, to.Length);
    }

    [Benchmark]
    public unsafe void MemoryCopy()
    {
        fixed (void* s = &bytes[0])
        fixed (void* p = &to[0])
        {
            Buffer.MemoryCopy(s, p, bytes.Length, to.Length);
        }
    }

    [Benchmark]
    public unsafe void UnsafeSimple()
    {
        fixed (byte* s = &bytes[0])
        fixed (byte* p = &to[0])
        {
            UnsafeUtility.ByteCopy(s, p, to.Length);
        }
    }

    [Benchmark]
    public unsafe void UnsafeBlock()
    {
        fixed (byte* s = &bytes[0])
        fixed (byte* p = &to[0])
        {
            UnsafeUtility.LongCopy(s, p, to.Length);
        }
    }

    [Benchmark]
    public unsafe void UnsafeOptimized()
    {
        fixed (byte* s = &bytes[0])
        fixed (byte* p = &to[0])
        {
            UnsafeUtility.Copy11(s, p);
        }
    }

    [Benchmark]
    public unsafe void CopyBlock()
    {
        fixed (void* s = &bytes[0])
        fixed (void* p = &to[0])
        {
            Unsafe.CopyBlock(p, s, (uint)bytes.Length);
        }
    }

    [Benchmark]
    public unsafe void CopyBlockUnaligned()
    {
        fixed (void* s = &bytes[0])
        fixed (void* p = &to[0])
        {
            Unsafe.CopyBlockUnaligned(p, s, (uint)bytes.Length);
        }
    }
}

public unsafe static class UnsafeUtility
{
    public static unsafe void ByteCopy(byte* src, byte* dst, int count)
    {
        for (int i = 0; i < count; i++)
        {
            *dst = *src;
            src++;
            dst++;
        }
    }

    public static unsafe void LongCopy(byte* src, byte* dst, int count)
    {
        while (count >= 8)
        {
            *(ulong*)dst = *(ulong*)src;
            dst += 8;
            src += 8;
            count -= 8;
        }
        if (count >= 4)
        {
            *(uint*)dst = *(uint*)src;
            dst += 4;
            src += 4;
            count -= 4;
        }
        if (count >= 2)
        {
            *(ushort*)dst = *(ushort*)src;
            dst += 2;
            src += 2;
            count -= 2;
        }
        if (count >= 1)
        {
            *dst = *src;
        }
    }

    // Optimized for count

    public static unsafe void Copy11(byte* src, byte* dst)
    {
        // 11 - 8 = 3
        *(ulong*)dst = *(ulong*)src;
        dst += 8;
        src += 8;

        // 3 - 2 = 1
        *(ushort*)dst = *(ushort*)src;
        dst += 2;
        src += 2;

        // 1 - 1 = 0
        *dst = *src;
    }
}

I'll write Copy1 ~ Copy32 helper and embed in IL.
I do not know which of Buffer.BlockCopy and SelfCode is faster for large size.
I'll measure and think about which to use.

Note, here is Buffer.BlockCopy code
https://github.com/dotnet/coreclr/blob/5c07c5aa98f8a088bf25099f1ab2d38b59ea5478/src/vm/comutilnative.cpp#L1366

Metadata

Metadata

Assignees

No one assigned

    Labels

    No labels
    No labels

    Type

    No type

    Projects

    No projects

    Milestone

    No milestone

    Relationships

    None yet

    Development

    No branches or pull requests

    Issue actions