Bugfix for stuck in write method of WiFiClient and WiFiClientSecure until the remote peer closed connection #6104
To @d-a-v to give this a once-over.
I think you found a sneaky bug that has been there for a long time, thanks!
However, I would ask you to change it to a bool and adjust the if (_send_waiting == 1)
statement (which caused the infinite hang once _send_waiting reached 2) accordingly. We really want a flag here, not a count, so a bool would reduce technical debt.
if (_send_waiting == 1) {
I've updated _send_waiting to be a plain bool flag.
Thanks! I'll leave it to @d-a-v to double-check that this only needs to be a flag and not a count (in which case the |
@sislakd this is a great find and a great fix!
I had always been suspicious of these send_waiting operations but wasn't sure whether they needed fixing.
no fail, no fix .. until it fails, thanks
Well, this looks like a very sneaky bug indeed, and just from looking at the code I can already imagine a lot of related reported issues. @d-a-v if you're aware of more of these 'suspect' parts of the code, please open issues about them so they can be looked into.
Changes since 2.5.1 (to 2.5.2)

Core
----
* Add explicit Print::write(char) (esp8266#6101)

Build system
----
* Fix typo in elf2bin for QOUT binary generation (esp8266#6116)
* Support PIO Wl-T and Arduino -T linking properly (esp8266#6095)
* Allow *.cc files to be linked into flash by default (esp8266#6100)
* Use custom "ElfToBin" builder for PIO (esp8266#6091)
* Fail if generated JSON file cannot be read (esp8266#6076)
* Moved 'Dropping' print from stdout to stderr in drop_versions.py (esp8266#6071)
* Fix PIO issue when build environment contains spaces (esp8266#6119)

Libraries
----
* Remove deadlock when server is not acking our data (esp8266#6107)
* Bugfix for stuck in write method of WiFiClient and WiFiClientSecure until the remote peer closed connection (esp8266#6104)
* Re-add original SD FAT info access methods (esp8266#6092)
* Make FILE_WRITE append in SD.h wrapper (esp8266#6106)
* Drop X509 after connection, avoid hang on TLS broken (esp8266#6065)
For a couple of days I was troubleshooting unstable behavior in components built on top of WiFiClient and WiFiClientSecure, and I finally found the root cause. From time to time, a call to the write method got stuck until the remote peer closed the connection. The underlying bug appears to have been present in the code for quite a long time.
When the TCP send buffer is full, ClientContext::_write_from_source increments _send_waiting and switches context to NONOS using esp_yield. If something else calls esp_schedule (that is, not the _write_some_from_cb method of the same ClientContext instance), the loop in _write_from_source repeats, the send buffer is still full, and _send_waiting is incremented again (so from this moment on _send_waiting > 1). After that, a successful ack on the relevant connection never calls esp_schedule, because the condition in _write_some_from_cb decrements _send_waiting only when it is equal to 1.
One example of something else calling esp_schedule is when there are two or more ClientContext instances (e.g. two client connections). An ack on the other client context triggers esp_schedule and thus resumes the write on this client context while there is still no space in its TCP send buffer.
The simplest solution is to set _send_waiting to 1 instead of incrementing it. As _send_waiting is one byte, there is little point in changing it to bool.
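The failure mode described above can be reproduced in isolation. The following is only a minimal sketch, not the actual ClientContext code: the types CounterWaiter and FlagWaiter and the functions wait_for_space, on_ack, ack_wakes_writer and run_demo are hypothetical names invented for illustration. on_ack returning true stands in for the ack callback calling esp_schedule, and each extra wait_for_space call stands in for the writer being resumed by an unrelated esp_schedule while the send buffer is still full.

```cpp
#include <cassert>

// Hypothetical model of the original logic: the waiting state is a count.
struct CounterWaiter {
    int send_waiting = 0;
    // Writer finds the send buffer full and is about to yield.
    void wait_for_space() { ++send_waiting; }
    // Ack callback: returns true if it would wake the writer.
    bool on_ack() {
        if (send_waiting == 1) {
            --send_waiting;
            return true;
        }
        return false;  // a count of 2 or more never matches, so no wake-up
    }
};

// Hypothetical model of the fix: the waiting state is assigned, not incremented.
struct FlagWaiter {
    bool send_waiting = false;
    void wait_for_space() { send_waiting = true; }
    bool on_ack() {
        if (send_waiting) {
            send_waiting = false;
            return true;
        }
        return false;
    }
};

// Simulates one blocked write: the writer yields once, is resumed
// `spurious_resumes` times while the buffer is still full (yielding again
// each time), and then the real ack for this connection arrives.
template <typename Waiter>
bool ack_wakes_writer(int spurious_resumes) {
    Waiter w;
    w.wait_for_space();
    for (int i = 0; i < spurious_resumes; ++i) {
        w.wait_for_space();
    }
    return w.on_ack();
}

// The asserts document the expected outcome of each scenario.
void run_demo() {
    assert(ack_wakes_writer<CounterWaiter>(0));   // no interference: works
    assert(!ack_wakes_writer<CounterWaiter>(1));  // one stray resume: stuck
    assert(ack_wakes_writer<FlagWaiter>(0));
    assert(ack_wakes_writer<FlagWaiter>(1));      // the flag tolerates stray resumes
}
```

Calling run_demo() from a test or from main exercises both variants. In this model the counter only misbehaves once a stray resume has pushed it past 1, which matches why the hang showed up only occasionally.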