Experimental: integrating the V2 serialization POC into Akka.Remote (research / not for merge)#8203

Draft
Aaronontheweb wants to merge 11 commits into akkadotnet:dev from Aaronontheweb:feature/spec4-serializer-v2

Conversation

@Aaronontheweb commented May 10, 2026

⚠️ This PR is now reframed as experimental research work. The goal here is to land enough of the V2 serialization POC inside Akka.NET's real MessageSerializer + AkkaPduCodec path that we can run RemotePingPong against it and learn what end-to-end Akka.Remote sees. This branch will not be merged as-is. Once we've absorbed the learnings, the production V2 work will land in a separate, cleaner PR.

What this PR is for

  • Validate the V2 serialization POC design against the real MessageSerializer + AkkaPduCodec wrap pipeline.
  • Get concrete benchmark numbers comparing V1 (today's path) vs V2 (the new design) for both write and read on AckAndEnvelopeContainer / RemoteEnvelope / Payload wire format.
  • Wire the V2 spike into EndpointWriter so RemotePingPong actually exercises it, and capture aggregate throughput data the per-message benchmark can't reach.

What's on the branch right now

Foundation (will likely survive into the clean PR in some form)

  • SerializerV2 — buffer-aware base class (Serialize(IBufferWriter<byte>, object) → int, Deserialize(ReadOnlySequence<byte>, string) → object, Manifest, Identifier). Virtual byte[] bridges keep legacy callsites working during transition.
  • SerializerV1Adapter — wraps Serializer / SerializerWithStringManifest as SerializerV2. Reproduces V1 manifest dispatch internally; overrides ToBinary / FromBinary to skip the buffer round trip when the inner is byte[]-native. Inner property exposes the wrapped V1.
  • SerializerV2Extensions — AsV1<T>() / TryAsV1<T>() helpers for callsites holding strongly-typed V1 references.
  • Serialization.cs — internal storage migrated to SerializerV2. HOCON + SerializationSetup V1 instances auto-wrap on registration. FindSerializerFor* returns SerializerV2. AddSerializer / AddSerializationMap get V1 + V2 overloads. Deserialize(byte[], int, string) simplified to uniform V2 dispatch.
  • MessageSerializer.cs (Akka.Remote) — single uniform Manifest() dispatch path.
  • ByteArraySerializer (ID 4) and PrimitiveSerializers (ID 17) — ported to V2-native, byte-identical wire format.

The spike (this is the experimental piece)

  • PatchingBufferWriter — IBufferWriter<byte> with PatchSpan(offset, length) for in-place length-prefix patching.
  • ProtoWire — hand-rolled Protobuf wire-format helpers: tag, varint, fixed-width 5-byte varint placeholder + patch, fixed64, length-delimited, string. Mirrors CodedOutputStream internals but operates against IBufferWriter<byte> directly.
  • V2SerializerRegistry — Type → SerializerV2 / ID → SerializerV2 lookup. Static dispatch at serialize time, ID-keyed dispatch at deserialize. No Type.GetType(manifest) reflection.
  • V2RemoteEnvelopeWriter / V2RemoteEnvelopeReader — write and read the AckAndEnvelopeContainer / RemoteEnvelope / Payload wire format directly via IBufferWriter<byte> / ReadOnlyMemory<byte>. Inner V2 serializer invoked inline against the same buffer.
  • V2ProtoBenchmarks — V1 path uses real MessageSerializer.Serialize + AkkaPduProtobuffCodec.ConstructMessage; V2 path uses the spike. Both produce wire-equivalent AckAndEnvelopeContainer bytes (verified at setup via AckAndEnvelopeContainer.Parser.ParseFrom).

Key learning so far

Per-message serialization throughput, single-threaded, hardware: AMD Ryzen 9 9900X:

| Payload     | V1 send msg/s | V2 send msg/s | Improvement |
|-------------|---------------|---------------|-------------|
| StringShort | 263,000       | 421,000       | +60%        |
| BytesSmall  | 658,000       | 2,053,000     | 3.1×        |
| BytesLarge  | 109,000       | 920,000       | 8.5×        |

Send-side allocations drop to zero for byte[] payloads. Receive-side allocations drop by 50–67%. Wire format is byte-equivalent — V1 peers parse V2 output, V2 reader parses V1 output. Full numbers in this comment.

What still needs to land on this branch

  • Move V2 spike helpers (PatchingBufferWriter, ProtoWire, V2SerializerRegistry, V2RemoteEnvelopeWriter, V2RemoteEnvelopeReader) from the benchmark project into Akka.Remote/Serialization/V2 so they can be referenced from production code.
  • Build V2SerializerRegistry from Serialization.Serialization at startup so the registry knows about all V2-native and V1-wrapped serializers.
  • Add ConstructMessageV2 to AkkaPduProtobuffCodec and wire EndpointWriter.WriteSend to call it (skipping the intermediate SerializedMessage proto object).
  • Run RemotePingPong on this hardware with V2 enabled, capture aggregate msg/sec at 1/5/10/15/20/25/30 clients (matches the dev-branch baseline in openspec/IMPLEMENTATION_ORDER.md).
  • Decide whether receive-side V2 is also worth wiring (V1 DecodeMessage parses V2 wire bytes correctly because Google.Protobuf accepts over-long varints).

What this PR will NOT become

  • This branch will not be the production V2 PR. It accumulates experimental code, partial integrations, scope churn (see commit e6b49676 for the spec-narrowing decision), and benchmark scaffolding.
  • The clean PR that follows will: pick a stable V2 surface based on learnings, integrate cleanly without leaving the spike helpers in Akka.Remote/Serialization/V2, ship the necessary subset, and have a proper review cycle.

What this PR is producing

  • Concrete perf data on whether the V2 design pays off for Akka.Remote.
  • Working integration code that proves the wire-compat story end-to-end.
  • A list of design decisions surfaced by integration (e.g. how V2SerializerRegistry initializes, where ConstructMessageV2 lives, how V1 and V2 paths coexist during transition).

Commits (current state, may keep evolving)

  1. e6b49676 — Narrow milestone scope; create serializer-v2-codegen placeholder
  2. 1d5d2a04 — SerializerV2 + SerializerV1Adapter foundation + infrastructure
  3. 1c1c36a7 — Update existing tests for V1Adapter wrapping
  4. cc4bed24 — Port ByteArraySerializer + PrimitiveSerializers to V2
  5. e1f7261f — SerializerV1AdapterSpec unit tests
  6. 6b7d8d02 — Transport-envelope benchmark + API baselines
  7. 92719443 — V2 wrap-pipeline spike (PatchingBufferWriter + AckAndEnvelopeContainer wire format)
  8. 35e109aa — V2 spike read path + benchmarks for both directions

Design docs

  • openspec/changes/serializer-v2/proposal.md
  • openspec/changes/serializer-v2/design.md
  • openspec/changes/serializer-v2-codegen/proposal.md (deferred scope)

Scope split for Milestone 2 of the 1.6 transport epic. Originally the
change carried the full V2 stack: foundation (base class, V1 adapter,
infrastructure) plus the user-facing codec story (MessagePackSerializer,
AkkaWriter/AkkaReader, attributes, the Akka.Serialization.V2 NuGet
package) plus the Roslyn source generator. That's a lot of public API
surface to lock in before the foundation has been validated by anything
downstream.

This change rewrites the openspec docs so Milestone 2 ships only the
foundation and one set of reference serializers, plus a benchmark that
proves the API earns its keep before Spec 3 builds on it:

- SerializerV2 base class (IBufferWriter<byte> / ReadOnlySequence<byte>
  primary API, virtual byte[] bridge)
- SerializerV1Adapter that wraps legacy Serializer/SerializerWithStringManifest
- Serialization.cs and MessageSerializer.cs infrastructure changes
- ByteArraySerializer + PrimitiveSerializers ported to SerializerV2
  (covers all hand-rolled primitive paths: string via UTF-8, int32/int64
  via BinaryPrimitives, byte[] passthrough; same IDs, byte-identical
  wire format)
- Standalone transport-envelope benchmark in src/benchmark/ that simulates
  EndpointWriter's serialize-frame-deserialize chain on the V2 API and
  measures V2-direct vs V1-bridge

Everything else moves to a new placeholder change (serializer-v2-codegen):
MessagePackSerializer, sealed AkkaWriter/AkkaReader, the three attributes,
the Akka.Serialization.V2 package, the Roslyn generator, and the
mechanical port of the remaining Protobuf-based internal serializers
(ClusterMessageSerializer, SystemMessageSerializer, the four
WrappedPayloadSupport serializers). Rationale captured in the new
proposal: the runtime codec API and the codegen that targets it must be
designed together, since the runtime is the generator's emission target.

Files:
- openspec/IMPLEMENTATION_ORDER.md: retitle Milestone 2 "foundation only",
  document the 2026-05-10 scope change, point at serializer-v2-codegen
- openspec/changes/serializer-v2/proposal.md: rewrite around the narrowed
  scope; explicit "Deferred to a future change" section
- openspec/changes/serializer-v2/design.md: revised decisions (single
  layer in core Akka, reference impl scoped to Primitive+ByteArray, new
  decision documenting the benchmark as the validation gate)
- openspec/changes/serializer-v2/tasks.md: restructured into 6 sections
  with explicit string/int32/int64/byte[] coverage, multi-segment input
  tests, and a "Section 6: Out of Scope (Documented Follow-On)" punch
  list for the Protobuf serializer ports
- openspec/changes/serializer-v2/specs/serializer-v2-base/spec.md:
  per-serializer requirements with byte-identical wire format scenarios
  and multi-segment input scenarios; new requirement for the benchmark
- openspec/changes/serializer-v2/specs/messagepack-serializer/spec.md:
  deleted (capability moves to serializer-v2-codegen)
- openspec/changes/serializer-v2-codegen/: new placeholder change with
  .openspec.yaml and proposal.md sketching the deferred scope and arguing
  why we don't ship the runtime layer alone first
Establishes the V2 serialization API as the new internal foundation for
Akka.NET's serialization subsystem. V1 serializers continue to work
unchanged via a transparent adapter.

New types in core Akka:

- SerializerV2 (abstract): Buffer-aware serializer base class with
  Serialize(IBufferWriter<byte>, object), Deserialize(ReadOnlySequence<byte>,
  string), Manifest(object), and Identifier. Virtual byte[] bridges
  (ToBinary/FromBinary) keep V1-style call sites working.
- SerializerV1Adapter: Wraps Serializer/SerializerWithStringManifest as a
  SerializerV2. Reproduces the V1 manifest dispatch (TypeQualifiedName for
  IncludeManifest=true plain serializers, custom manifest for
  SerializerWithStringManifest, empty otherwise) and delegates ToBinary/
  FromBinary/FromBinary(Type) directly to the inner V1 to avoid pointless
  buffer round trips. Inner property exposes the wrapped V1 instance.
- SerializerV2Extensions: AsV1<T>()/TryAsV1<T>() helpers for callers that
  hold strongly-typed references to V1 serializers (e.g. cast sites in
  tests and durable stores).

Serialization.cs changes:

- Internal storage migrated to SerializerV2 (auto-wraps V1 on registration
  from HOCON, SerializationSetup, AddSerializer, AddSerializationMap)
- FindSerializerFor / FindSerializerForType / GetSerializerById /
  GetSerializerByName return SerializerV2
- ManifestFor(SerializerV2, object) overload added (just delegates to
  Manifest()); legacy ManifestFor(Serializer, object) preserved for back
  compat
- AddSerializer / AddSerializationMap each have V1 + V2 overloads (V1
  auto-wraps)
- Deserialize(byte[], int, string) simplified to uniform V2 dispatch —
  the V1 adapter handles the type-vs-manifest dance internally

Akka.Remote.MessageSerializer.cs:

- Single uniform path for manifest dispatch via SerializerV2.Manifest();
  no more `is SerializerWithStringManifest` type check or `IncludeManifest`
  branch

Call site fixes (mechanical):

- ~25 cast sites in tests/benchmarks switched to .AsV1<T>() (covers
  Hyperion, Newtonsoft, Cluster, Sharding, ClusterClient, PubSub,
  Singleton, ReplicatedDataSerializer, custom test serializers, and
  Akka.Remote.Tests primitive/misc)
- Field declarations on Replicator._serializer and
  LocalSnapshotStore._wrapperSerializer changed to SerializerV2
- ActorSystemImpl.WarnIfJsonIsDefaultSerializer uses
  SerializerV1Adapter pattern match
- Akka.Persistence.Custom example simplified to use uniform V2
  Manifest() dispatch
- ClusterMessageSerializer.GetObjectManifest takes SerializerV2

Adapter overrides bridge methods (ToBinary, FromBinary(byte[],string),
FromBinary(byte[],Type)) to delegate to the inner V1 directly. V2-native
Deserialize materializes the ReadOnlySequence<byte> to byte[] before
calling FromBinary, since the wrapped V1 is byte[]-native — no
performance win is possible for V1, only API parity. V2-native
serializers (coming next: ByteArraySerializer + PrimitiveSerializers
ports) get the actual zero-copy benefit.

Build: 0 errors, 0 warnings on full solution.
Existing tests asserted on the V1 serializer type via
.Should().BeOfType<T>() or by direct cast. With V2 dispatch wrapping V1
serializers in SerializerV1Adapter, those assertions now see the adapter
type and fail.

Mechanical fix across the affected test files: replace
.Should().BeOfType<V1>() with .AsV1<V1>() (which throws if the V2
instance isn't a SerializerV1Adapter wrapping V1). The implicit
assertion has the same intent — verify the right V1 serializer is bound
— without requiring the test to know about the adapter wrapping.

Affected test files:
- Akka.Tests: SerializationSpec, SerializationSetupSpec, CustomSerializerSpec
- Akka.Remote.Tests: DaemonMsgCreateSerializerSpec, MessageContainerSerializerSpec, MiscMessageSerializerSpec, ProtobufSerializerSpec, SystemMessageSerializationSpec
- Akka.Cluster.Tests: ReliableDeliverySerializerSpecs
- Akka.Cluster.Tools.Tests: ClusterClientSerializerSpec
- Akka.DistributedData.Tests: ReplicatedDataSerializerSpec
- Akka.Cluster.Sharding.Tests: DDataClusterShardingConfigSpec

`using Akka.Serialization;` added where missing to bring the AsV1 /
TryAsV1 extensions into scope.

Test status:
- Akka.Tests: 1248 passing, 23 skipped, 0 failing (full suite)
- Akka.Remote.Tests serialization: 105 passing, 1 skipped, 0 failing
- Akka.Cluster.Tests serialization: 49 passing, 0 failing
- Akka.Cluster.Tools.Tests serialization: 51 passing, 0 failing
Both serializers now extend SerializerV2 directly instead of the legacy
Serializer base class, while preserving their serializer IDs (4 and 17
respectively) and producing byte-identical wire format. They are the
V2-native reference implementations used to validate the API.

ByteArraySerializer (Akka core, ID 4):

- Identity transform — the byte[] is the wire format
- Serialize(IBufferWriter<byte>, byte[]) copies the input into the writer
  (the writer contract requires it; the V2 path costs one copy)
- Deserialize(ReadOnlySequence<byte>, manifest) materializes a fresh
  byte[] via seq.ToArray() — callers may retain the returned reference,
  so we cannot alias to potentially pooled backing memory
- ToBinary/FromBinary bridges overridden to skip the buffer round trip
  and pass the byte[] through directly (V1's zero-alloc behavior is
  preserved for the bridge path)
- Manifest() returns string.Empty (no manifest needed)
- Null handling drops V1's null passthrough — Serialization.cs routes
  null through NullSerializer, so the path was unreachable

PrimitiveSerializers (Akka.Remote, ID 17):

- Covers string / int32 / int64 with the same six manifest aliases as V1
  (S/I/L plus the long-form .NET Core and .NET Framework type-name
  variants that legacy peers emit)
- String serialize: Encoding.UTF8.GetBytes(string, Span<byte>) into the
  writer's span — no intermediate byte[] allocation
- Int32/Int64 serialize: BinaryPrimitives.WriteInt*LittleEndian into
  fixed-width spans
- String deserialize: Encoding.UTF8.GetString(ReadOnlySequence<byte>) on
  net6+, handles split codepoints across multi-segment input
- Int32/Int64 deserialize: BinaryPrimitives.ReadInt*LittleEndian; falls
  back to a stack copy when the value spans a segment boundary
- ToBinary/FromBinary bridges overridden — strings use
  Encoding.UTF8.GetBytes(string), ints use BitConverter on
  little-endian platforms (matching V1 byte-for-byte) and a manual
  little-endian fallback otherwise
- use-legacy-behavior config flag preserved
- SizeHint(o) returns precise sizes for fixed-width values and
  GetMaxByteCount(string.Length) for strings — gives the
  ArrayBufferWriter inside the bridge a tight initial allocation
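To make the byte layout concrete, here is a language-agnostic sketch (in Python, via `struct`) of the wire format described above: int32/int64 as fixed-width little-endian, strings as raw UTF-8 with no length prefix. The real code uses `BinaryPrimitives` and `Encoding.UTF8` in C#; the function names here are illustrative.

```python
import struct

# Sketch of the PrimitiveSerializers wire format described above:
# int32/int64 as fixed-width little-endian, strings as raw UTF-8
# (framing is external, so no length prefix).

def encode_int32(v: int) -> bytes:
    return struct.pack("<i", v)   # 4 bytes, little-endian

def encode_int64(v: int) -> bytes:
    return struct.pack("<q", v)   # 8 bytes, little-endian

def encode_string(s: str) -> bytes:
    return s.encode("utf-8")      # no length prefix

def decode_int32(b: bytes) -> int:
    return struct.unpack("<i", b)[0]
```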

Test fixes:

- SerializationSpec.cs / SerializationSpec.AllowUnregisteredTypesSpec:
  ByteArraySerializer is V2-native now, so the previous
  AsV1<ByteArraySerializer>() pattern is invalid (the type constraint
  T : Serializer rejects V2 types). Replaced with direct
  Should().BeOfType<ByteArraySerializer>() on the result.
- PrimitiveSerializersSpec.cs: same — switched from .AsV1<PrimitiveSerializers>()
  to .Should().BeOfType<PrimitiveSerializers>().Subject (FluentAssertions
  pattern that returns the typed subject).

Test status:
- Akka.Tests serialization: 65 passing, 0 failing
- Akka.Remote.Tests PrimitiveSerializersSpec: 17 passing, 0 failing
Direct unit tests for SerializerV1Adapter that exercise the wrapping
behavior independently of Serialization's registration plumbing.
Full HOCON / V1 auto-wrap path coverage continues to live in
SerializationSpec and CustomSerializerSpec — this spec is for the
adapter's own contract.

Coverage:
- Round-trip through buffer API (Serialize/Deserialize) for plain V1
  with and without IncludeManifest, and for SerializerWithStringManifest
- Round-trip through byte[] bridge (ToBinary/FromBinary) for both
  string-manifest and Type-typed FromBinary overloads
- Manifest behavior matches V1 dispatch (TypeQualifiedName for
  IncludeManifest=true plain serializers, custom string for
  SerializerWithStringManifest, empty for IncludeManifest=false)
- Identifier preserved from inner V1
- Inner property returns the wrapped instance unchanged
- ToBinary/FromBinary bridge overrides produce byte-identical output to
  the inner V1 (so the bridge skip is correct)
- Multi-segment ReadOnlySequence<byte> input handled correctly
- AsV1<T>/TryAsV1<T> extension methods unwrap correctly and have the
  expected null-vs-throw failure semantics

Three V1 fixture classes cover the three V1 dispatch flavors; a small
ReadOnlySequenceSegment<byte> helper synthesizes multi-segment input
without depending on a Pipe.

15/15 tests pass.
Benchmark (src/benchmark/Akka.Benchmarks/Serialization/SerializerV2EnvelopeBenchmarks.cs):

Simulates what EndpointWriter will do once Spec 3 wires the Streams TCP
transport to call SerializerV2.Serialize(IBufferWriter<byte>) directly:
writes a Remote-shaped envelope to an ArrayBufferWriter<byte>, wraps the
result as a ReadOnlySequence<byte>, reads the header via
SequenceReader<byte>, hands the payload slice to
SerializerV2.Deserialize(ReadOnlySequence<byte>, manifest). Compares
against the V1-bridge path (ToBinary → byte[] → FromBinary) on the same
serializer instance and the same payload.

Envelope shape: [serializerId: int32 LE][manifestLen: int32 LE]
[manifest: utf8][payload: bytes-to-end]. Payload length is implicit —
the outer frame boundary is the boundary, matching how the real Streams
TCP transport will frame messages in Spec 3.
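A minimal sketch of that envelope shape, in Python for illustration (the benchmark itself is C#). The payload carries no length field of its own — the frame boundary delimits it, exactly as described above.

```python
import struct

# Sketch of the benchmark's envelope framing described above:
# [serializerId: int32 LE][manifestLen: int32 LE][manifest: utf8][payload].
# Payload length is implicit -- the outer frame boundary is the boundary.

def write_envelope(serializer_id: int, manifest: str, payload: bytes) -> bytes:
    m = manifest.encode("utf-8")
    return struct.pack("<ii", serializer_id, len(m)) + m + payload

def read_envelope(frame: bytes):
    serializer_id, mlen = struct.unpack_from("<ii", frame, 0)
    manifest = frame[8:8 + mlen].decode("utf-8")
    payload = frame[8 + mlen:]  # everything to the end of the frame
    return serializer_id, manifest, payload
```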

Payload matrix exercises every V2-native primitive path:
- string short (5 chars), medium (256), long (4 KB)
- int32, int64
- byte[] small (16 B), medium (1 KB), large (16 KB)

No Akka.Remote / DotNetty / socket dependencies — pure shape benchmark
to validate the V2 buffer API earns its keep on allocations and
throughput before downstream specs build on it. Reusable for Spec 3
integration: drop in a real FrameBufferWriter in place of
ArrayBufferWriter and re-run to confirm no regression.

Configured with MicroBenchmarkConfig (MemoryDiagnoser + GitHub markdown
exporter); run via the standard BenchmarkDotNet console runner.

API baselines (src/core/Akka.API.Tests/verify/):

- CoreAPISpec.ApproveCore.DotNet.verified.txt: ByteArraySerializer now
  extends SerializerV2; Serialization gets AddSerializer/AddSerializationMap
  V2 overloads, FindSerializerFor[Type] return SerializerV2, ManifestFor
  V2 overload; new public types SerializerV2, SerializerV1Adapter,
  SerializerV2Extensions.
- CoreAPISpec.ApproveRemote.DotNet.verified.txt: PrimitiveSerializers
  now extends SerializerV2 with the buffer Serialize/Deserialize methods.

Test status: 18/18 Akka.API.Tests passing; 0 errors / 0 warnings on full
solution build.
/// </summary>
/// <param name="o">The object whose manifest is requested.</param>
/// <returns>The manifest string, or <see cref="string.Empty"/> if no manifest is needed.</returns>
public abstract string Manifest(object o);

Nitpick: Would be nice to have a ReadOnlyMemory or ReadOnlySpan overload here for other use.

/// </summary>
/// <param name="o">The object whose encoded size is being estimated.</param>
/// <returns>An estimate of the encoded byte length.</returns>
public virtual int SizeHint(object o) => 256;

One minor concern with this is whether it will lead to excessive boxing for consumers who misuse structs...

/// <param name="buffer">The byte sequence containing the serialized object.</param>
/// <param name="manifest">The manifest hint, or <see cref="string.Empty"/>.</param>
/// <returns>The deserialized object.</returns>
public abstract object Deserialize(ReadOnlySequence<byte> buffer, string manifest);

There might come a time where being able to accept ReadOnlyMemory<char> manifest will be useful... so I at least want to risk bringing it up...

/// </summary>
/// <param name="buffer">The buffer to write into.</param>
/// <param name="obj">The object to serialize.</param>
public abstract void Serialize(IBufferWriter<byte> buffer, object obj);

I don't hate this but it would be nice if we could get some form of SerializeEnvelope on here to keep things uniform and avoid branching.... but maybe that's too big...

@Aaronontheweb left a comment

I think we're pretty far off from proving the concept on V2 serialization here

/// </summary>
/// <param name="o">The object whose encoded size is being estimated.</param>
/// <returns>An estimate of the encoded byte length.</returns>
public virtual int SizeHint(object o) => 256;

Need some way of signaling, for backwards compat, "we don't know what the size of this object is - no size hint available"

/// </summary>
/// <param name="buffer">The buffer to write into.</param>
/// <param name="obj">The object to serialize.</param>
public abstract void Serialize(IBufferWriter<byte> buffer, object obj);

some thoughts:

  1. Ensure WriteMessagesAsync/SaveAsync is called asynchronously in Async… #8163 - the correct fix for this type of flow-control problem in Akka.Persistence is for Serialize / Deserialize to be ValueTask-returning async functions. This naturally allows us to kick the serialization work out of band while avoiding some of the flow-control problems #8163 introduced (and had to revert in Revert Task.Yield() from AsyncWriteJournal and SnapshotStore (cherry-pick to dev) #8189).
  2. We should return some type of result here IMHO - either the length of the written bytes, a result object that includes that information, or something else. Previously we could get that information from the length of the returned byte[]; now it's hidden inside the IBufferWriter<byte>.

/// </summary>
/// <param name="obj">The object to serialize.</param>
/// <returns>A byte array containing the serialized object.</returns>
public virtual byte[] ToBinary(object obj)

Remove this - SerializerV2 doesn't need to be backwards compatible. That's a job for the SerializerV1Adapter. We're adapting V1 to V2, not V2 to V1.

/// <param name="bytes">The serialized object's bytes.</param>
/// <param name="manifest">The manifest hint, or <see cref="string.Empty"/>.</param>
/// <returns>The deserialized object.</returns>
public virtual object FromBinary(byte[] bytes, string manifest)

Remove this.

/// <param name="bytes">The serialized object's bytes.</param>
/// <param name="type">The expected runtime type, or <c>null</c> if unspecified.</param>
/// <returns>The deserialized object.</returns>
public virtual object FromBinary(byte[] bytes, Type? type)

Remove this - same comment as above.

/// <param name="buffer">The byte sequence containing the serialized object.</param>
/// <param name="manifest">The manifest hint, or <see cref="string.Empty"/>.</param>
/// <returns>The deserialized object.</returns>
public abstract object Deserialize(ReadOnlySequence<byte> buffer, string manifest);

Should be async.

/// Serializes the object and decorates serialized <see cref="IActorRef"/> instances using
/// the given <paramref name="address"/>.
/// </summary>
public byte[] ToBinaryWithAddress(Address address, object obj)

Where does this method get called usually and is there a better way of doing this than an implicit ThreadStatic variable? I believe this exists primarily for multi-transport Akka.Remote systems. Could we just require this context to be passed in explicitly in those callsites?

Per PR akkadotnet#8203 review feedback. The void return was hiding load-bearing
information — callers (especially wrapped-payload outer serializers
patching length prefixes) need to know how many bytes the Serialize call
wrote to the buffer. They could fish it out of the writer state, but
that's an indirect read that breaks if the writer is shared with other
writes happening on the same call.

This is the only API change being made before benchmarking. Other surface
critiques (async/ValueTask, bridge removal, transport-info threading)
remain held until perf data validates the basic V2 design.

Deserialize signature is unchanged — the read side doesn't have an
analogous patch-after-the-fact concern.

Affected:
- SerializerV2.Serialize: abstract int Serialize(IBufferWriter<byte>, object)
- ByteArraySerializer.Serialize returns byte[].Length
- PrimitiveSerializers.Serialize returns bytes-written per primitive type
- SerializerV1Adapter.Serialize returns inner.ToBinary(obj).Length
- Bridge ToBinary uses the returned count to size the ToArray slice
- Benchmark + V1Adapter tests adjusted (return value discarded where
  callers don't need it)

Tests: Akka.Tests serialization 80/80 passing.
Validates the SerializerV2 design against today's Akka.Remote
MessageSerializer + AkkaPduCodec wrap pipeline. Reuses the existing
Protobuf message types (AckAndEnvelopeContainer / RemoteEnvelope /
Payload) without modification — the V2 path produces byte-equivalent
wire output that Google.Protobuf parses transparently.

What the spike contains

- PatchingBufferWriter: IBufferWriter<byte> with a PatchSpan(offset, len)
  accessor for in-place length-prefix patching.
- ProtoWire: hand-rolled Protobuf wire-format primitives (tag, varint,
  fixed-width varint placeholder + patch, fixed64, string). Mirrors what
  CodedOutputStream does internally, but writes against IBufferWriter
  directly (Google.Protobuf's WriteContext.Initialize(IBufferWriter) is
  internal and not callable from user code).
- V2SerializerRegistry: Type -> SerializerV2 / ID -> SerializerV2
  lookup. Static dispatch, no reflection at serialize time, no
  Type.GetType from manifest strings at deserialize time.
- V2RemoteEnvelopeWriter: writes the full AckAndEnvelopeContainer
  pipeline directly into a PatchingBufferWriter. For each nested
  length-delimited field (envelope, payload, message bytes), reserves a
  fixed-width 5-byte varint placeholder, runs the inner write (using the
  bytes-written int return on SerializerV2.Serialize to know exactly how
  much was written), then patches the length prefix in place.

How the patching technique works around Protobuf's nested-message
length-prefix problem

Protobuf's length-delimited wire format requires the inner's byte
count to be known BEFORE the length prefix is written. Canonical
varints are minimum-width, so a placeholder can't be patched
retroactively without knowing how wide the varint will be.

The trick: write the length prefix as a FIXED-WIDTH 5-byte varint
always. 5 bytes give 35 bits of data — plenty for any uint32. Small
values are encoded as over-long varints (continuation bits set on the
first 4 bytes, value's low bits in byte 0, zeros in the rest).
Google.Protobuf's CodedInputStream accepts up to 5 bytes for a uint32
varint and OR's the data bits regardless of canonicity, so the
over-long form parses to the same value as the minimum-width form.

That lets us reserve 5 bytes, run the inner write, then patch the
varint in place using the int returned by Serialize. One pass, no
scratch buffer, no intermediate byte[].
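The trick is easiest to see in code. Below is a language-agnostic sketch in Python (the spike itself is C#): `encode_fixed5` always emits exactly 5 bytes by setting the continuation bit on the first four; `read_varint32` mirrors a permissive Protobuf reader that ORs data bits regardless of canonicity, so over-long and minimum-width forms decode identically; `demo_reserve_then_patch` shows the reserve/write/patch flow. Function names are illustrative.

```python
# Sketch of the fixed-width 5-byte varint trick described above.

def encode_fixed5(value: int) -> bytes:
    """Encode a uint32 as exactly 5 varint bytes (possibly over-long)."""
    out = bytearray()
    for _ in range(4):
        out.append(0x80 | (value & 0x7F))  # continuation bit always set
        value >>= 7
    out.append(value & 0x7F)               # final byte: no continuation
    return bytes(out)

def read_varint32(buf: bytes, pos: int = 0):
    """Permissive varint reader: accepts canonical and over-long forms."""
    result, shift = 0, 0
    while True:
        b = buf[pos]
        pos += 1
        result |= (b & 0x7F) << shift
        if not (b & 0x80):
            return result & 0xFFFFFFFF, pos
        shift += 7

def demo_reserve_then_patch(inner_bytes: bytes) -> bytes:
    """Reserve a 5-byte placeholder, run the inner write, patch in place."""
    buf = bytearray(b"\x00" * 5)                 # reserve placeholder
    buf += inner_bytes                           # inner write
    buf[0:5] = encode_fixed5(len(inner_bytes))   # patch length prefix
    return bytes(buf)
```

One pass, no scratch buffer — the only cost is up to 4 extra wire bytes per length-delimited field, as noted above.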

Wire-overhead cost: at most 4 extra bytes per length-delimited nested
field. For Akka.Remote's three nesting levels, that's max 12 bytes per
message — noise vs payload size.

Wire compat is verified at benchmark setup: V2 output is parsed via
AckAndEnvelopeContainer.Parser.ParseFrom and compared field-by-field
against the V1 output (recipient path, payload bytes, serializer ID).
Throws at setup if anything diverges.

Benchmark results (V1 = real MessageSerializer + AkkaPduProtobuffCodec,
V2 = the spike, both producing wire-equivalent AckAndEnvelopeContainer
bytes):

| Payload      | V1 time   | V2 time   | Ratio | V1 alloc | V2 alloc | Alloc |
|--------------|-----------|-----------|-------|----------|----------|-------|
| StringShort  |  3,522 ns |  2,375 ns | 0.68x |  1,512 B |    840 B | 0.56x |
| StringMedium |  5,092 ns |  4,007 ns | 0.79x |  2,776 B |  1,600 B | 0.58x |
| StringLong   | 14,142 ns | 11,991 ns | 0.87x | 21,976 B | 13,120 B | 0.60x |
| BytesSmall   |  1,665 ns |    482 ns | 0.30x |    688 B |      0 B | 0.00x |
| BytesLarge   | 10,806 ns |  1,001 ns | 0.13x | 33,432 B |      0 B | 0.00x |

byte[] payloads (the canonical wrapped-payload pattern when the inner
is a binary blob) show the dramatic win: 3-8x faster and ZERO
managed allocations. string payloads show smaller but real
improvements (13-32% faster, 40-44% fewer allocations) with some
residual allocation I haven't profiled yet (probably warmup/buffer
growth artifacts in BDN's measurement; not blocking the design
validation).

The spike is benchmark-project-only — does not touch the running
Akka.Remote infrastructure. Akka.Benchmarks already has
InternalsVisibleTo from Akka.Remote so the spike can invoke real
MessageSerializer.Serialize and AkkaPduProtobuffCodec.ConstructMessage
directly for the V1 baseline.

Files:
- src/benchmark/Akka.Benchmarks/Serialization/V2ProtoSpike.cs
- src/benchmark/Akka.Benchmarks/Serialization/V2ProtoBenchmarks.cs
Extends the V2 wrap-pipeline spike with a hand-rolled receive-side
parser that mirrors AkkaPduProtobuffCodec.DecodeMessage +
MessageSerializer.Deserialize, but without constructing any of the
intermediate Protobuf message objects.

New types

- ProtoWire read helpers: ReadVarint32, ReadTag, ReadFixed64,
  ReadLengthDelimited, ReadString, SkipField. All take
  `ref ReadOnlySpan<byte>` and advance the span past consumed bytes.
  Accept both canonical and over-long varints, so V2 can parse V1's
  wire output as well as its own.
- V2DeserializedEnvelope: result struct (RecipientPath, SenderPath,
  Seq, Payload). Mirrors what V1's AkkaPduCodec.DecodeMessage produces
  but without the IActorRef resolution step (deferred to the dispatcher).
- V2RemoteEnvelopeReader: parses AckAndEnvelopeContainer wire bytes
  directly into the result struct. Has two entry points:
    Read(ReadOnlySpan<byte>)    - for cases without a Memory backing
    Read(ReadOnlyMemory<byte>)  - zero-copy slicing for the inner-payload bytes
  The Memory overload slices the original buffer at the inner-payload
  offset and wraps as a ReadOnlySequence<byte> for the V2 inner
  serializer's Deserialize. No intermediate byte[] for the inner
  payload — V1 allocates two (the ByteString backing plus ToByteArray).
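The over-long-tolerant varint read can be sketched like so (Python for illustration; the actual helpers take `ref ReadOnlySpan<byte>` in C#):

```python
def read_varint32(buf: bytes, pos: int = 0) -> tuple[int, int]:
    """Returns (value, new_pos). ORs the data bits of up to 5 bytes
    regardless of canonicity — the same tolerance CodedInputStream has —
    so one reader handles both V1's canonical and V2's over-long prefixes."""
    result = 0
    for i in range(5):
        b = buf[pos + i]
        result |= (b & 0x7F) << (7 * i)
        if not (b & 0x80):              # continuation bit clear: last byte
            return result & 0xFFFFFFFF, pos + i + 1
    raise ValueError("malformed varint32: more than 5 bytes")

# Canonical and over-long encodings of the same value parse identically:
canonical = read_varint32(b"\x05")                  # (5, 1)
over_long = read_varint32(b"\x85\x80\x80\x80\x00")  # (5, 5)
```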

Dispatch is by integer serializer ID (registry.GetById), not by
Type.GetType(manifest) — no reflection on receive side, no
BinaryFormatter-class attack surface.

Benchmark additions

V2ProtoBenchmarks now runs both directions for every payload kind:

- V1 read: AckAndEnvelopeContainer.Parser.ParseFrom + extract proto
  fields + MessageSerializer.Deserialize (the real Akka.Remote receive
  path)
- V2 read: V2RemoteEnvelopeReader.Read using the Memory overload

Setup verifies cross-version compat: the V2 reader correctly parses
V1's canonical-varint wire bytes. Round-trip equality is asserted for
both string and byte[] payloads (byte[] via SequenceEqual, since
default object Equals on byte[] is reference equality).

Results (V1 = real Akka.Remote path, V2 = spike)

Writes:
  StringShort:  3849ns -> 2374ns (0.62x), 1512 B -> 840 B   (0.56x)
  StringMedium: 4855ns -> 3690ns (0.76x), 2776 B -> 1600 B  (0.58x)
  StringLong:  14058ns -> 11182ns (0.81x), 21976 B -> 13120 B (0.60x)
  BytesSmall:   1519ns -> 487ns  (0.32x), 688 B   -> 0 B    (0.00x)
  BytesLarge:   9204ns -> 1087ns (0.16x), 33432 B -> 0 B    (0.00x)

Reads:
  StringShort:  3418ns -> 2688ns (0.79x), 3416 B  -> 2976 B (0.87x)
  StringMedium: 9801ns -> 7787ns (0.79x), 4936 B  -> 4240 B (0.86x)
  StringLong:  18267ns -> 16190ns (0.89x), 56720 B -> 52184 B (0.92x)
  BytesSmall:   1284ns -> 1275ns (0.99x), 664 B   -> 216 B  (0.33x)
  BytesLarge:   9092ns -> 7051ns (0.78x), 33400 B -> 16584 B (0.50x)

Headlines

- byte[] writes: 3-8x faster, ZERO managed allocations
- byte[] reads: up to ~22% faster (BytesLarge; BytesSmall is within
  noise), 50-67% fewer allocations
- string writes: 19-38% faster, 40-44% fewer allocations
- string reads: 11-21% faster, modest allocation reduction (the
  result-string allocation dominates regardless of pipeline)

Both directions show real Akka.Remote-equivalent benefits. The
wrapped-payload pattern is essentially free in V2 for byte[] payloads.
@Aaronontheweb
Member Author

V2 wrap-pipeline spike — throughput numbers

Single-threaded serialization throughput on the real EndpointWriter.WriteSend path (V1 = today's MessageSerializer.Serialize + AkkaPduProtobuffCodec.ConstructMessage; V2 = spike with hand-written wire format + fixed-width varint patching + inline inner write). Wire format unchanged — V2 produces byte-equivalent AckAndEnvelopeContainer output parseable by V1 peers. Benchmark in src/benchmark/Akka.Benchmarks/Serialization/V2ProtoBenchmarks.cs.

Hardware: AMD Ryzen 9 9900X, .NET 10, BenchmarkDotNet MicroBenchmarkConfig. Numbers are per-EndpointWriter-thread; aggregate Akka.Remote throughput scales by concurrent endpoints and TCP pipelining.

Throughput (messages/sec per thread)

Send-side ceiling:

| Payload      |  V1 msg/s |  V2 msg/s | Improvement  |
|--------------|----------:|----------:|--------------|
| StringShort  |   263,000 |   421,000 | +60%         |
| StringMedium |   206,000 |   271,000 | +32%         |
| StringLong   |    71,000 |    89,000 | +26%         |
| BytesSmall   |   658,000 | 2,053,000 | +212% (3.1×) |
| BytesLarge   |   109,000 |   920,000 | +747% (8.5×) |

Receive-side ceiling:

| Payload      | V1 msg/s | V2 msg/s | Improvement |
|--------------|---------:|---------:|-------------|
| StringShort  |  292,000 |  372,000 | +27%        |
| StringMedium |  102,000 |  128,000 | +26%        |
| StringLong   |   55,000 |   62,000 | +13%        |
| BytesSmall   |  779,000 |  784,000 | +1%         |
| BytesLarge   |  110,000 |  142,000 | +29%        |

Round-trip (send + receive on the same thread):

| Payload      | V1 msg/s | V2 msg/s | Improvement  |
|--------------|---------:|---------:|--------------|
| StringShort  |  138,000 |  198,000 | +44%         |
| StringMedium |   68,000 |   87,000 | +28%         |
| StringLong   |   31,000 |   37,000 | +18%         |
| BytesSmall   |  357,000 |  569,000 | +60%         |
| BytesLarge   |   55,000 |  123,000 | +125% (2.2×) |

Why byte[] payloads dominate the win

The wrapped-payload pattern (MiscMessageSerializer, ClusterShardingMessageSerializer, DistributedPubSubMessageSerializer, ReliableDeliverySerializer) wraps an inner user payload as a bytes field inside an outer Protobuf message. V1 today: inner serializer's byte[] → ByteString.CopyFrom → outer proto → .ToByteString(). Two byte[] allocations and one Protobuf serialization layer per wrap. V2: the inner V2 serializer writes its bytes directly into the same buffer the outer is writing to, and the length prefix is patched in place via the fixed-width-varint placeholder. No intermediate byte[]. Per-send allocations drop to zero on the V2 path; receive-side cuts allocations in half because V1's ByteString + payload.Message.ToByteArray() double-allocation is gone.
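The one-pass nested write can be sketched as follows (Python for illustration; the helper name is mine, not the branch's — the real code drives an `IBufferWriter<byte>` in C#):

```python
FIXED_VARINT_PLACEHOLDER = b"\x80\x80\x80\x80\x00"  # 5-byte over-long zero

def write_length_delimited(buf: bytearray, field_number: int, write_inner) -> None:
    """Emit the tag, reserve a fixed-width length slot, write the inner
    payload straight into the same buffer, then patch the slot — no
    intermediate byte[] for the nested message."""
    buf.append((field_number << 3) | 2)   # wire type 2 = length-delimited
    start = len(buf)
    buf.extend(FIXED_VARINT_PLACEHOLDER)
    inner_start = len(buf)
    write_inner(buf)                      # inner serializer writes in place
    length = len(buf) - inner_start
    for i in range(4):                    # patch as over-long varint
        buf[start + i] = ((length >> (7 * i)) & 0x7F) | 0x80
    buf[start + 4] = (length >> 28) & 0x7F

buf = bytearray()
write_length_delimited(buf, 1, lambda b: b.extend(b"hello"))
```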

Allocation deltas (round-trip, send + receive)

| Payload      | V1 B/msg | V2 B/msg | Reduction |
|--------------|---------:|---------:|-----------|
| StringShort  |    4,928 |    3,816 | −23%      |
| StringMedium |    7,712 |    5,840 | −24%      |
| StringLong   |   78,696 |   65,304 | −17%      |
| BytesSmall   |    1,352 |      216 | −84%      |
| BytesLarge   |   66,832 |   16,584 | −75%      |

For a cluster pushing 100K msg/sec on byte[]-heavy traffic, V2 saves ~5 GB/sec of managed allocations — that's Gen-2 / pause-time relief that shows up in latency tails, not in single-thread throughput.
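The ~5 GB/sec figure checks out against the BytesLarge round-trip row above:

```python
# Back-of-envelope from the round-trip allocation table (BytesLarge row).
v1_per_msg = 66_832      # V1 bytes allocated per message, round-trip
v2_per_msg = 16_584      # V2 bytes allocated per message, round-trip
rate = 100_000           # messages per second

saved_gb_per_sec = (v1_per_msg - v2_per_msg) * rate / 1e9
# (66,832 - 16,584) × 100,000 ≈ 5.02 GB/sec of avoided managed allocation
```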

What this does NOT measure

  • End-to-end Akka.Remote throughput (RemotePingPong-style). That requires MessageSerializer integrated into the real EndpointWriter + transport path. The spike numbers tell you the per-thread serialization ceiling; whether serialization is the bottleneck in any specific workload depends on batching, network, and actor dispatch.
  • Latency under load.

Wire format

  • V2 output parses cleanly through AckAndEnvelopeContainer.Parser.ParseFrom (verified at setup).
  • V2 reader parses both canonical-varint (V1) and over-long-varint (V2) wire output (verified at setup).
  • Payload / RemoteEnvelope / AckAndEnvelopeContainer schemas unchanged. V2 nodes interoperate with V1 peers.

Next step

Contained integration pass: introduce MessageSerializerV2 + V2SerializerRegistry alongside V1 (no replacement), wire one wrapped-payload serializer end-to-end (probably MiscMessageSerializer), run RemotePingPong against the V2 path. That gives the aggregate-throughput data point the spike can't measure, while keeping V1 reversible.

Spike:

  • src/benchmark/Akka.Benchmarks/Serialization/V2ProtoSpike.cs
  • src/benchmark/Akka.Benchmarks/Serialization/V2ProtoBenchmarks.cs

@Aaronontheweb Aaronontheweb changed the title Add SerializerV2 foundation (Spec 4, Milestone 2 — foundation only) Experimental: integrating the V2 serialization POC into Akka.Remote (research / not for merge) May 11, 2026
Moves V2 spike code from the benchmark project into Akka.Remote so it can
be referenced from production code, adds ConstructMessageV2 to
AkkaPduProtobuffCodec, wires EndpointWriter.WriteSend to use it, and
validates that RemotePingPong runs through V2 successfully.

Changes

- src/core/Akka.Remote/Serialization/V2/V2Codec.cs: spike code relocated
  from the benchmark project. Types are internal. Adds Ack support to
  V2RemoteEnvelopeWriter so the Akka.Remote ack-piggyback path works.
  V2SerializerRegistry takes a Serialization instance fallback so it can
  resolve any registered serializer (V2-native or V1Adapter-wrapped)
  without explicit Register calls.
- src/core/Akka.Remote/Transport/AkkaPduCodec.cs: new ConstructMessageV2
  method on AkkaPduProtobuffCodec. Skips the V1 SerializedMessage proto
  construction and AckAndEnvelopeContainer.ToByteString() in favor of
  hand-writing the wire format via PatchingBufferWriter. ThreadStatic
  buffer for per-thread pooling — EndpointWriter actors run on dispatcher
  threads, one buffer per thread, reset between calls. Final
  ByteString.CopyFrom is the only remaining unavoidable allocation
  (matches V1's ToByteString cost).
- src/core/Akka.Remote/Endpoint.cs: EndpointWriter.WriteSend now calls
  ((AkkaPduProtobuffCodec)_codec).ConstructMessageV2(...) with the raw
  message object, bypassing SerializeMessage / SerializedMessage entirely.
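The per-thread buffer idea, sketched in Python with `threading.local` (the branch uses a `[ThreadStatic]` field on the codec; names here are mine):

```python
import threading

_local = threading.local()

def rent_thread_buffer() -> bytearray:
    """One reusable write buffer per thread, cleared (not reallocated)
    between calls — safe because each EndpointWriter runs WriteSend on
    one dispatcher thread at a time."""
    buf = getattr(_local, "buf", None)
    if buf is None:
        buf = _local.buf = bytearray()
    else:
        buf.clear()
    return buf

a = rent_thread_buffer()
a.extend(b"wire bytes")
b = rent_thread_buffer()   # same object, reset for the next message
```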

Validation

- dotnet test src/core/Akka.Remote.Tests: 362 passing, 5 skipped, 0 failing.
  V2 send produces wire-compat bytes that V1 receive parses correctly —
  end-to-end Akka.Remote round-trips work with V2 on the send side.

- RemotePingPong on AMD Ryzen 9 9900X (12 physical cores, ServerGC):

  | Clients | V1 msg/s  | V2 msg/s  | Delta |
  |--------:|----------:|----------:|------:|
  |       1 |   299,851 |   291,971 |   -3% |
  |       5 |   361,795 |   415,455 |  +15% |
  |      10 | 1,124,228 | 1,263,424 |  +12% |
  |      15 | 1,348,921 | 1,306,621 |   -3% |
  |      20 | 1,367,054 | 1,350,895 |   -1% |
  |      25 | 1,333,334 | 1,316,830 |   -1% |
  |      30 | 1,334,817 | 1,321,586 |   -1% |

  V2 wins meaningfully (12-15%) at mid client counts where serialization
  dominates EndpointWriter throughput. At high client counts (15+) the
  per-thread plateau (~1.35M msg/s) is bottlenecked by something other
  than serialization — likely GC pressure or dispatcher contention —
  and the V2 win doesn't translate. At 1 client, network round-trip
  dominates so V2's marginal serialization win is invisible.

Known limitations of this integration

- ConstructMessageV2 still calls SerializeActorRef per call, which
  allocates ActorRefData + Path string. Caching these per-association on
  the EndpointWriter would eliminate that allocation for V2 (V1 cannot
  cache because it builds the whole proto graph fresh).
- ByteString.CopyFrom at the end allocates a final byte[] for the wire
  bytes. UnsafeByteOperations.UnsafeWrap could eliminate this but
  requires sole ownership of the underlying byte[] — incompatible with
  the ThreadStatic buffer pool. Switching to ArrayPool rentals + UnsafeWrap
  would solve this but expands scope.
- Receive-side stays on V1's AckAndEnvelopeContainer.Parser.ParseFrom +
  MessageSerializer.Deserialize. V2 receive integration is a follow-on.

This is experimental work for learning purposes. The real V2 production
PR will follow once we've absorbed what this integration teaches.
@Aaronontheweb
Member Author

V2 integrated into EndpointWriter — RemotePingPong results

Wired V2 into the real EndpointWriter.WriteSend path (commit ae54f85ca). Moved spike code from the benchmark project to Akka.Remote/Serialization/V2/, added ConstructMessageV2 to AkkaPduProtobuffCodec, added Ack support to V2RemoteEnvelopeWriter, added a ThreadStatic PatchingBufferWriter pool so the buffer isn't allocated per call.

Validation: Akka.Remote.Tests

362 passing, 5 skipped, 0 failing with V2 wired into WriteSend. Wire compat verified end-to-end — V2 send produces bytes that V1 receive (AckAndEnvelopeContainer.Parser.ParseFrom) parses correctly.

RemotePingPong throughput

Hardware: AMD Ryzen 9 9900X, 12 physical / 24 logical cores, .NET 10, ServerGC.

| Clients |  V1 msg/s |  V2 msg/s | Delta |
|--------:|----------:|----------:|------:|
|       1 |   299,851 |   291,971 |   -3% |
|       5 |   361,795 |   415,455 |  +15% |
|      10 | 1,124,228 | 1,263,424 |  +12% |
|      15 | 1,348,921 | 1,306,621 |   -3% |
|      20 | 1,367,054 | 1,350,895 |   -1% |
|      25 | 1,333,334 | 1,316,830 |   -1% |
|      30 | 1,334,817 | 1,321,586 |   -1% |

(Confirms @Aaronontheweb's 1.3M msg/s figure — V1 peaks at 1.37M msg/s at 20 clients on this hardware. The dev-branch baseline of ~680K msg/s at 30 clients in IMPLEMENTATION_ORDER.md was on different hardware; the 9900X comfortably doubles that.)

What this tells us

  • V2 wins meaningfully (12-15%) at 5-10 clients where the single EndpointWriter on each direction is serialization-bound. The per-message cost reduction translates directly to throughput.
  • V2 is flat (±3%) at 1 client and 15+ clients. At 1 client, network round-trip latency dominates so serialization savings are invisible. At 15+ clients, the system hits a different plateau (~1.35M msg/s) that the per-message benchmark doesn't capture — likely GC pressure from non-serializer allocations across many threads, or actor dispatch contention. Lowering serialization cost only helps until something else takes over as the bottleneck.

Known limitations of this integration

These would all be addressed in a clean production V2 PR — they're capped here because this branch is experimental:

  • ConstructMessageV2 still calls SerializeActorRef per call, allocating ActorRefData + a path string for recipient and sender. The per-association EndpointWriter could cache these (V2 design supports it; V1 cannot because the codec builds the whole proto graph fresh inside ConstructMessage).
  • Final ByteString.CopyFrom(buffer.WrittenSpan) allocates the wire byte[]. UnsafeByteOperations.UnsafeWrap would eliminate this but requires sole ownership of the underlying byte[] — incompatible with the ThreadStatic pool. ArrayPool rentals + UnsafeWrap is the way through.
  • Receive-side stays on V1 (AckAndEnvelopeContainer.Parser.ParseFrom + MessageSerializer.Deserialize). The synthetic benchmark showed receive-side wins are smaller anyway (10-30%, allocation-driven not throughput-driven), so prioritizing send-side here was right.

Implications for the real V2 PR

The experimental integration validates two important things:

  1. V2 is correctness-compatible. Real Akka.Remote round-trips work. No wire-format incompatibility, no proto-parsing issues with over-long varints.
  2. V2's per-message wins translate to real throughput improvements where serialization is the bottleneck. Mid-load (5-10 clients) saw the largest wins; the synthetic benchmark's per-message numbers (3-8× for byte[]) compose into 12-15% aggregate throughput.

To unlock V2's full potential, the production PR would need to:

  • Cache recipient/sender wire bytes on EndpointWriter (per-association)
  • Use ArrayPool<byte>.Shared.Rent + UnsafeByteOperations.UnsafeWrap for the final wire bytes (saves one byte[] per message)
  • Integrate V2 on receive (eliminate the AckAndEnvelopeContainer.Parser proto graph allocation)
  • Tackle whatever's bottlenecking the 15+ client case (likely GC pause + dispatcher) — possibly outside V2's scope

This branch is now in a state where the next step is closing it out and starting the clean PR with the learnings baked in.

Adds DecodeMessageV2 to AkkaPduProtobuffCodec that parses the
AckAndEnvelopeContainer wire bytes directly into an AckAndMessage
(same shape V1 produces) without allocating the proto graph
(AckAndEnvelopeContainer / RemoteEnvelope / Payload / 2x ActorRefData
objects). Inner payload bytes are wrapped zero-copy via
UnsafeByteOperations.UnsafeWrap of a slice of the wire memory.

Downstream pipeline is unchanged: the AckedReceiveBuffer / DeliverAndAck /
Dispatch path operates on the V2-built AckAndMessage exactly the same
way it operates on the V1-built one. Reliable delivery (re-delivery
across reconnects, ordering, dedup) is preserved.

V2 receive parsing is inline in the codec (helpers ParseEnvelopeMetadata,
ParseAckMetadata, ParseEnvelopeFields, ParsePayloadFields,
ExtractActorRefPath) rather than reusing V2RemoteEnvelopeReader.Read,
because the reader's main entry point eagerly deserializes the inner
payload via the V2 serializer registry. Keeping the inline parser
metadata-only lets the SerializedMessage stay as the unit handed to
the dispatcher — MessageSerializer.Deserialize runs in Dispatch the same
way V1 does.

Validation

dotnet test src/core/Akka.Remote.Tests: 362 passing, 5 skipped, 0 failing.
V2 on both send AND receive doesn't break any Akka.Remote behavior,
including the reliable delivery / ack paths.

RemotePingPong on AMD Ryzen 9 9900X, .NET 10, ServerGC:

  Clients   V1 msg/s    V2 send    V2 send+recv
        1    299,851    291,971         303,031
        5    361,795    415,455         390,321
       10  1,124,228  1,263,424       1,204,094
       15  1,348,921  1,306,621       1,353,180
       20  1,367,054  1,350,895       1,359,620
       25  1,333,334  1,316,830       1,308,558
       30  1,334,817  1,321,586       1,300,109

Adding V2 on receive shows small wins at low/mid client counts (8% at 5,
7% at 10) and is within noise of V1 (-3% to +1%) at high client counts.
The plateau at ~1.35M msg/s persists in all configurations — confirming
that something downstream of serialization (DotNetty buffering, GC
pressure from other allocations, dispatcher contention, or actor
scheduling) is the binding constraint at this load level. The V2 design's
per-message savings can't translate to aggregate throughput once
serialization stops being the bottleneck.

This matches the broader Spec 3 hypothesis: realizing V2's full perf
envelope requires the Streams TCP transport rewrite. DotNetty does its
own internal buffering / copying that absorbs V2's zero-copy advantages
before they reach the wire.

Known limitations of this V2 receive integration

- MessageSerializer.Deserialize still runs in Dispatch — calls
  payload.Message.ToByteArray() which materializes the wrapped inner
  bytes. UnsafeWrap is zero-copy on the ByteString construction but the
  later .ToByteArray() copies. Eliminating this would require pushing
  V2 dispatch into Endpoint.Dispatch with a deserialize-at-dispatch path
  that uses ReadOnlySequence directly. Out of scope for this experimental
  branch.
@Aaronontheweb
Member Author

Warmed-up RemotePingPong numbers + allocation profile

Re-ran RemotePingPong with 3 iterations on both V1 and V2 paths to remove JIT-warmup variance from the comparison. Best-of-3 per client count:

| Clients | V1 best (msg/s) | V2 best (msg/s) | Improvement |
|--------:|----------------:|----------------:|------------:|
|       1 |         917,432 |         980,393 |         +7% |
|       5 |       1,253,133 |       1,362,398 |         +9% |
|      10 |       1,317,524 |       1,409,444 |         +7% |
|      15 |       1,385,042 |       1,423,150 |         +3% |
|      20 |       1,384,084 |       1,414,928 |         +2% |
|      25 |       1,328,375 |       1,424,908 |         +7% |
|      30 |       1,300,673 |       1,402,853 |         +8% |

V2 peak: ~1.42M msg/s vs V1 ~1.39M. Consistent +2% to +9% across all client counts, biggest wins at the low/serialization-dominated end (1-10 clients) and again at 25-30 clients where GC-pressure relief gives back headroom.

The single-iteration cold numbers reported earlier in this PR (V1 1.37M / V2 1.35M at 20 clients) were dominated by warmup noise — once both paths are warmed up, V2 is genuinely faster end-to-end.

Allocation profile (V2, 10s capture at 30 clients)

Captured via dotnet-trace collect --profile gc-verbose. Counters showed 6.6 GB/sec allocation rate, 72-93 Gen0 collections/sec, 4.7% of wall time in GC pauses.

Top allocators by bytes:

| %    | Bytes   | Type | Source |
|------|---------|------|--------|
| 40%  | 18.8 GB | System.Byte[] | SerializeActorRef(...).ToByteArray() ×2 per send + ByteString.CopyFrom for wire bytes + payload.Message.ToByteArray() per receive |
| 21%  | 9.7 GB  | System.String | Path strings from actorRef.Path.ToSerializationFormat() (×2 per send), UTF-8 decoded paths on receive |
| 4.4% | 2.1 GB  | Task<int> | DotNetty's async write completions |
| 4.2% | 1.9 GB  | Google.Protobuf.ByteString | Wire ByteString instances |
| 3.8% | 1.8 GB  | CodedOutputStream | Protobuf write internals |
| 3.8% | 1.8 GB  | CodedInputStream | Protobuf read internals |
| 1.5% | 0.7 GB  | DotNetty TaskCompletionSource + continuations | DotNetty send-completion plumbing |
| 1.4% | 0.65 GB | Akka.Remote.RemoteActorRef | Actor-ref allocations on receive |
| 1.4% | 0.63 GB | Akka.Remote.Transport.Message | V2 still builds these for AckedReceiveBuffer |
| 1.3% | 0.60 GB | ActorRefData proto | SerializeActorRef build (V1 inherited per-send) |
| 1.0% | 0.49 GB | Payload proto | V2's SerializedMessage build per receive |

Where V2 tightening has room

V2-fixable (10-15% allocation reduction available):

  • SerializeActorRef(...).ToByteArray() ×2 per send — cache wire bytes per association (recipient is fixed for the lifetime of an EndpointWriter; senders typically come from a small set)
  • ByteString.CopyFrom(buffer.WrittenSpan) final wire bytes — ArrayPool + UnsafeByteOperations.UnsafeWrap
  • payload.Message.ToByteArray() on receive — bypass via direct SerializerV2.Deserialize(ReadOnlySequence, manifest) on the V2 slice
  • Payload proto allocation on receive — go fully V2 on the dispatch path

Not V2-fixable (transport rewrite territory — Spec 3):

  • ~13% of allocations are DotNetty internals (Task<int>, TaskCompletionSource, UnpooledHeapByteBuffer, IOVector[], etc.) — needs the Streams TCP transport to reach.

Verdict on this experimental branch

The V2 design is validated end-to-end:

  • ✅ Wire format compatibility verified (Akka.Remote.Tests 362/362)
  • ✅ Throughput improvement is real and consistent (+2-9% across client counts, +3% peak)
  • ✅ GC pressure relief is real (4.7% pause time today; tightening targets identified to cut another 10-15%)
  • ✅ Architectural ceiling identified — Spec 3 transport rewrite is the load-bearing piece for the next aggregate gain

Ready to land this as the "experimental research artifact" and start the clean V2 PR with these learnings baked in.

/// Tag = (field_number << 3) | wire_type, varint-encoded. For field numbers 1–15
/// the tag fits in one byte.
/// </summary>
public static int WriteTag(IBufferWriter<byte> buffer, int fieldNumber, byte wireType)
Member

FWIW back in COVID I found that it was better, at least for akka PDU hotwiring, to have pre-computed bytes/bytestrings for the PDU bits.

Comment on lines +133 to +145
public static int WriteVarint32(IBufferWriter<byte> buffer, uint value)
{
    var span = buffer.GetSpan(5);
    var written = 0;
    while (value >= 0x80)
    {
        span[written++] = (byte)(value | 0x80);
        value >>= 7;
    }
    span[written++] = (byte)value;
    buffer.Advance(written);
    return written;
}
Member

This is -why- it was better to use precomputed bits for the PDU; VarInt128 does not lend itself well to minimal branching in code, unless you are brave enough to try switch statements based on range and see if that works better...
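For illustration, the range-based size selection the comment alludes to collapses to one expression (Python sketch; a C# switch over magnitude ranges would pick the width the same way before writing bytes unconditionally):

```python
def varint32_size(value: int) -> int:
    """Varint width from magnitude alone: ceil(bit_length / 7), minimum 1.
    Knowing the width up front lets a writer emit all bytes without a
    data-dependent branch per byte."""
    return max(1, (value.bit_length() + 6) // 7)
```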

Comment on lines +169 to +177
public static void PatchFixedWidthVarint(Span<byte> placeholder, uint value)
{
// 5 bytes × 7 data bits = 35 bits — plenty for a uint32.
placeholder[0] = (byte)((value & 0x7F) | 0x80);
placeholder[1] = (byte)(((value >> 7) & 0x7F) | 0x80);
placeholder[2] = (byte)(((value >> 14) & 0x7F) | 0x80);
placeholder[3] = (byte)(((value >> 21) & 0x7F) | 0x80);
placeholder[4] = (byte)((value >> 28) & 0x7F);
}
Member

curious that this logic diverges so much from the bufferwriter write
