Hi guys, our team has been using SE.Redis for over three years in production. Before I start, I just want to say thanks to all the contributors of this project.
After we switched to SE.Redis version 2, we found that some of our microservice servers were running into TimeoutBeforeWrite errors.
We didn't have enough time to dig in at that time, so we just worked around it with reconnection logic.
Recently, we built a new microservice server that uses transactions heavily and produces a lot of TimeoutBeforeWrite errors.
After some debugging, I found an edge case in transaction usage that causes TimeoutBeforeWrite.
The timeouts are caused by the combination of several natural behaviors.
- Multiple ProcessBackLog() calls can be scheduled at the same time. (Of course, only one acquires the lock and does the processing.)
- GetMessages() of TransactionMessage includes a Monitor.Wait that waits for the conditions to be completed.
- The pipe sender, the pipe reader, and ProcessBacklog() all run on the same thread pool.
Suppose we have N workers (threads) and we are waiting in GetMessages() for the transaction conditions to finish.
Normally, the pipe sends and reads are scheduled on the other N-1 workers and GetMessages() finishes immediately.
But under heavy traffic, it can happen that all N-1 remaining workers are running ProcessBackLog() and waiting to acquire the lock.
(A consumer keeps draining the backlog from 1 to 0, while many producers schedule a new ProcessBacklog() every time they push the backlog from 0 to 1.)
Since GetMessages() runs while holding the lock, none of the ProcessBackLog() calls release their threads.
Since there is no thread left to send and read through the pipe, the Monitor.Wait in GetMessages() eventually times out.
Furthermore, when this happens, most messages in the backlog are simply marked as TimeoutBeforeWrite, because we spend 5 seconds (the default timeout) inside GetMessages().
The timeout was reached before the message could be written to the output buffer, and it was not sent, command=EXEC, timeout: 5000, outbound: 0KiB, inbound: 0KiB
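To make the starvation easier to picture, here is a minimal, self-contained analogue. This is not SE.Redis code; the WriterLock / ConditionSync objects and the tiny hand-rolled worker pool are just my stand-ins for the single-writer lock, the transaction condition signal, and the dedicated SocketManager workers.

using System;
using System.Collections.Concurrent;
using System.Threading;

// Sketch of the starvation pattern: a dedicated pool with N workers, where one
// queued item waits on a condition while holding the "writer" lock, the other
// N-1 items block on that same lock, and the item that would complete the
// condition never gets a thread.
internal static class StarvationSketch
{
    private static readonly object WriterLock = new object();    // stand-in for the single-writer lock
    private static readonly object ConditionSync = new object(); // stand-in for the condition signal

    public static void Main()
    {
        const int workerCount = 4;
        var queue = new BlockingCollection<Action>();

        // Dedicated worker pool: each thread just drains the shared work queue.
        for (int i = 0; i < workerCount; i++)
        {
            new Thread(() => { foreach (var work in queue.GetConsumingEnumerable()) work(); })
                { IsBackground = true }.Start();
        }

        // 1) "GetMessages()" analogue: hold the writer lock and wait for the condition result.
        queue.Add(() =>
        {
            lock (WriterLock)
            lock (ConditionSync)
            {
                bool pulsed = Monitor.Wait(ConditionSync, TimeSpan.FromSeconds(5));
                Console.WriteLine($"GetMessages analogue finished, pulsed={pulsed}");
            }
        });
        Thread.Sleep(100); // let it grab the writer lock first

        // 2) N-1 "ProcessBacklog()" analogues: all block trying to take the writer lock.
        for (int i = 0; i < workerCount - 1; i++)
        {
            queue.Add(() => { lock (WriterLock) { /* nothing to do; just contend for the lock */ } });
        }
        Thread.Sleep(100);

        // 3) "Pipe read" analogue that would complete the condition and pulse:
        //    it is queued, but no worker is free to run it until after the timeout.
        queue.Add(() => { lock (ConditionSync) Monitor.Pulse(ConditionSync); });

        Thread.Sleep(7000); // give the waiter time to time out, then exit
    }
}

The "GetMessages" analogue times out after about 5 seconds with pulsed=False, because the work item that would pulse the condition never gets a thread, which mirrors the TimeoutBeforeWrite above.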
To reproduce this situation more easily, set the worker count to something less than the default (10). However, I could reproduce the case even with 16 workers.
My temporary workaround for this issue is to use the .NET thread pool (sketched below). After we changed the option to use the .NET thread pool, our production servers became stable. It seems the hill-climbing of the .NET thread pool helps somehow.
But you know, it's not an ideal solution, and I believe you guys could solve this issue fundamentally.
(I also submitted a simple PR to fix it.)
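For reference, the workaround looks roughly like this. It assumes the SocketManager.ThreadPool instance (which schedules onto the shared .NET thread pool) is available in the 2.x release you are on; adjust if your version exposes it differently.

using StackExchange.Redis;

// Workaround sketch: route socket I/O and backlog processing onto the shared
// .NET thread pool instead of the default dedicated SocketManager pool.
// SocketManager.ThreadPool is assumed to exist in your 2.x version.
var options = ConfigurationOptions.Parse("localhost");
options.SocketManager = SocketManager.ThreadPool;
var mux = ConnectionMultiplexer.Connect(options);

With this, the shared pool's hill-climbing can inject extra threads when the dedicated-pool starvation described above would otherwise occur.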
Here's the test code I used for debugging.
using System;
using System.Threading.Tasks;
using StackExchange.Redis;

namespace TestConsole
{
    internal static class Program
    {
        public static async Task Main()
        {
            // Use a small dedicated socket manager (4 workers) to hit the edge case faster.
            var option = ConfigurationOptions.Parse("localhost");
            option.SocketManager = new SocketManager("hello", 4);

            var client = ConnectionMultiplexer.Connect(option);
            client.GetDatabase().Ping();

            var stop = false;
            var db = client.GetDatabase();
            var start = DateTime.Now;

            // Keep spawning bursts of conditional transactions until one of them fails.
            while (!stop)
            {
                _ = Task.Run(async () =>
                {
                    for (int t = 0; t < 50; t++)
                    {
                        if (stop)
                        {
                            return;
                        }

                        try
                        {
                            var newId = Guid.NewGuid().ToString();

                            // The condition forces GetMessages() to wait for the
                            // condition result before EXEC can be written.
                            var tran = db.CreateTransaction();
                            tran.AddCondition(Condition.KeyNotExists(newId));
                            var tsk = tran.StringSetAsync(newId, "UniqueID");
                            await tran.ExecuteAsync();
                            await tsk;
                        }
                        catch (Exception e)
                        {
                            stop = true;
                            Console.WriteLine(e);
                            return;
                        }
                    }
                });

                await Task.Delay(10);
            }

            Console.ReadLine();
        }
    }
}
I hope this report is helpful. If you need more information, feel free to ask.
Thank you.