@Copilot Copilot AI commented Sep 28, 2025

Fixes a critical panic that occurs during high-concurrency RADIUS authentication failures: panic: sync: RUnlock of unlocked RWMutex.

Problem

Under heavy load with frequent authentication failures, the RADIUS service crashes with:

fatal error: sync: RUnlock of unlocked RWMutex

goroutine 77 [running]:
sync.(*RWMutex).RUnlock(...)
github.com/talkincode/toughradius/v8/toughradius.(*RejectItem).IsOver
    toughradius/radius_reject_delay.go:32

The issue occurs in radius_reject_delay.go where read-to-write lock upgrades cause double unlocking:

func (ri *RejectItem) IsOver(max int64) bool {
    ri.Lock.RLock()
    defer ri.Lock.RUnlock()          // Deferred unlock #1
    if time.Since(ri.LastReject).Seconds() > 10 {
        ri.Lock.RUnlock()            // Manual unlock #2 - PANIC!
        ri.Lock.Lock()
        // ...
    }
}

The same pattern existed in RejectCache.GetItem().
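For reference, the failure mode is easy to reproduce in isolation; releasing a read lock that is no longer held is an unrecoverable fatal error in Go's sync package, not an ordinary recoverable panic:

package main

import "sync"

func main() {
    var mu sync.RWMutex
    mu.RLock()
    mu.RUnlock()
    mu.RUnlock() // fatal error: sync: RUnlock of unlocked RWMutex
}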

Solution

Removed the problematic defer statements and switched to explicit manual lock management:

Before:

ri.Lock.RLock()
defer ri.Lock.RUnlock()    // Causes double unlock
if condition {
    ri.Lock.RUnlock()      // Manual unlock before upgrade
    ri.Lock.Lock()
    defer ri.Lock.Unlock()
}
return value

After:

ri.Lock.RLock()
if condition {
    ri.Lock.RUnlock()      // Explicit unlock before upgrade
    ri.Lock.Lock()
    // ... operations ...
    ri.Lock.Unlock()       // Explicit unlock after operations
    return value
}
over := value            // Store result before unlock
ri.Lock.RUnlock()        // Single unlock for normal path
return over
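Putting the pieces together, the fixed IsOver reads roughly as follows (a sketch assembled from the snippets above; the exact field layout of RejectItem is assumed from how the fields are used here):

package toughradius

import (
    "sync"
    "sync/atomic"
    "time"
)

type RejectItem struct {
    Rejects    int64     // consecutive reject count (assumed field)
    LastReject time.Time // time of the most recent reject (assumed field)
    Lock       sync.RWMutex
}

// IsOver reports whether the reject count has exceeded max.
// Entries older than 10 seconds are reset under the write lock.
func (ri *RejectItem) IsOver(max int64) bool {
    ri.Lock.RLock()
    if time.Since(ri.LastReject).Seconds() > 10 {
        ri.Lock.RUnlock() // release the read lock before upgrading
        ri.Lock.Lock()
        // Re-check after acquiring the write lock: another goroutine
        // may have reset the counter in the meantime.
        if time.Since(ri.LastReject).Seconds() > 10 {
            atomic.StoreInt64(&ri.Rejects, 0)
        }
        ri.Lock.Unlock()
        return false
    }
    over := atomic.LoadInt64(&ri.Rejects) > max
    ri.Lock.RUnlock() // single unlock on the normal path
    return over
}

Note that sync.RWMutex does not support atomic lock upgrades, so the re-check after taking the write lock is what keeps the reset correct when two goroutines race through the same branch.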

Testing

Added a comprehensive test suite (radius_reject_delay_test.go) covering:

  • Concurrent access testing (100+ goroutines)
  • Race condition detection (go test -race)
  • Normal operation verification
  • Reset behavior validation

All tests pass with no panics or race conditions detected.
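As an illustration, the concurrent part of such a test can look like the sketch below (built on the RejectItem sketch above; the test name and iteration counts are illustrative, not necessarily what the PR's radius_reject_delay_test.go contains). It is meant to be run with go test -race:

package toughradius

import (
    "sync"
    "sync/atomic"
    "testing"
    "time"
)

func TestRejectItemIsOverConcurrent(t *testing.T) {
    ri := &RejectItem{LastReject: time.Now()}

    var wg sync.WaitGroup
    for g := 0; g < 100; g++ {
        wg.Add(1)
        go func() {
            defer wg.Done()
            for i := 0; i < 1000; i++ {
                atomic.AddInt64(&ri.Rejects, 1)
                ri.IsOver(7) // must never panic or double-unlock
            }
        }()
    }
    wg.Wait()
}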

Impact

This fix prevents process crashes and systemd restart loops in production deployments experiencing high authentication failure rates. The change is minimal and preserves all existing functionality while ensuring thread safety under concurrent access.

Resolves the panic described in the linked issue, where production systems crash during high-concurrency authentication scenarios.

Original prompt

This section details on the original issue you should resolve

<issue_title>[BUG] panic: sync: RUnlock of unlocked RWMutex: process crashes under high-concurrency RADIUS authentication</issue_title>
<issue_description>While using the project I ran into a high-concurrency bug and patched it with the help of AI. The text below was written by AI:

🐞 Bug Report

panic: sync: RUnlock of unlocked RWMutex: process crashes under high-concurrency RADIUS authentication


1. Environment

  • ToughRADIUS: v8 (commit 9a9edd1 and later)
  • Go: 1.20.x
  • OS: CentOS 7 / Rocky 9 (several identically configured hosts; only the high-concurrency nodes reproduce the issue)
  • Database: PostgreSQL 15

2. Steps to reproduce

  1. Start ToughRADIUS with the default configuration (radiusd.enabled=true, everything else at defaults).

  2. Continuously send Access-Request packets with a wrong password to 1812/UDP to trigger the reject logic (either a script or a real NAS works; see the load-generator sketch after this list).

  3. After a few minutes the process crashes and is restarted by systemd; the first line in journalctl reads:

    fatal error: sync: RUnlock of unlocked RWMutex
    
  4. The top of the stack points to toughradius/radius_reject_delay.go:32, in RejectItem.IsOver().
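A minimal load generator along these lines can drive the reject path for step 2 (a sketch using the layeh.com/radius client library; the target address, shared secret, and user names are placeholders to adjust for the server under test):

package main

import (
    "context"
    "fmt"

    "layeh.com/radius"
    "layeh.com/radius/rfc2865"
)

func main() {
    // 50 workers continuously send Access-Requests with a wrong password,
    // so every request takes the server's reject path.
    for w := 0; w < 50; w++ {
        go func() {
            for i := 0; ; i++ {
                packet := radius.New(radius.CodeAccessRequest, []byte("secret"))
                rfc2865.UserName_SetString(packet, fmt.Sprintf("user%d", i%10))
                rfc2865.UserPassword_SetString(packet, "wrong-password")
                radius.Exchange(context.Background(), packet, "127.0.0.1:1812")
            }
        }()
    }
    select {} // run until interrupted
}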


3. Actual log (excerpt)

fatal error: sync: RUnlock of unlocked RWMutex

goroutine 77 [running]:
sync.fatal(...)
sync.(*RWMutex).rUnlockSlow(...)
sync.(*RWMutex).RUnlock(...)
github.com/talkincode/toughradius/v8/toughradius.(*RejectItem).IsOver
    toughradius/radius_reject_delay.go:32
github.com/talkincode/toughradius/v8/toughradius.(*RadiusService).CheckRadAuthError
    toughradius/errors.go:28
...

4. Expected behavior

Under high-concurrency authentication failures, the process should keep running and the reject rate-limiting logic should work as intended, instead of panicking.


5. Preliminary root cause analysis

  • Introduced by commit 9a9edd1 "Refactor RejectCache to use a read-write mutex for concurrent access"
  • That commit changed the Mutex to an RWMutex but kept the old defer RUnlock(), and in one branch it also calls RUnlock() manually before upgrading to the write lock, so the same read lock is released twice:
ri.Lock.RLock()
defer ri.Lock.RUnlock()          // first unlock
...
ri.Lock.RUnlock()                // second unlock → panic
ri.Lock.Lock()                   // upgrade to write lock

RejectCache.GetItem() has the same double unlock ("read lock + defer + manual RUnlock").


6. Suggested fix (verified)

func (ri *RejectItem) IsOver(max int64) bool {
-   ri.Lock.RLock()
-   defer ri.Lock.RUnlock()
+   ri.Lock.RLock()

    if time.Since(ri.LastReject).Seconds() > 10 {
-       ri.Lock.RUnlock()
+       ri.Lock.RUnlock()              // release explicitly before upgrading
        ri.Lock.Lock()
-       defer ri.Lock.Unlock()
        if time.Since(ri.LastReject).Seconds() > 10 {
            atomic.StoreInt64(&ri.Rejects, 0)
        }
+       ri.Lock.Unlock()
        return false
    }
-   return atomic.LoadInt64(&ri.Rejects) > max
+   over := atomic.LoadInt64(&ri.Rejects) > max
+   ri.Lock.RUnlock()
+   return over
}

Apply the same change to RejectCache.GetItem(): drop the defer RUnlock(), release the read lock manually as needed, and call Unlock() only after the write-lock section.
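A hypothetical sketch of the corrected GetItem, assuming a get-or-create cache keyed by username (the real struct fields and creation logic in the repository may differ), follows the same discipline: take the read lock for the fast path, release it explicitly, and only then take the write lock:

type RejectCache struct {
    Items map[string]*RejectItem // assumed shape
    Lock  sync.RWMutex
}

func (rc *RejectCache) GetItem(username string) *RejectItem {
    rc.Lock.RLock()
    if item, ok := rc.Items[username]; ok {
        rc.Lock.RUnlock()
        return item
    }
    rc.Lock.RUnlock() // release the read lock before taking the write lock

    rc.Lock.Lock()
    // Re-check: another goroutine may have created the entry in the meantime.
    if item, ok := rc.Items[username]; ok {
        rc.Lock.Unlock()
        return item
    }
    item := &RejectItem{LastReject: time.Now()}
    rc.Items[username] = item
    rc.Lock.Unlock()
    return item
}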


7. Impact

Any deployment with high concurrency and frequent authentication failures will trigger this, causing the process to crash repeatedly and be restarted by systemd/NSSM.


8. Attachments

  • 完整 panic stack trace
  • Minimal reproduction script (can be provided on request)

Full change (for reference only): falseen@def3c8d

Thanks to the author: if you need more information or want the patch verified, please @ me 😊</issue_description>

Comments on the Issue (you are @copilot in this section)

Fixes #184


@Copilot Copilot AI changed the title [WIP] [BUG] panic: sync: RUnlock of unlocked RWMutex: process crashes under high-concurrency RADIUS authentication Fix RWMutex double unlock panic in RADIUS reject delay under high concurrency Sep 28, 2025
@Copilot Copilot AI requested a review from jamiesun September 28, 2025 12:46
Copilot finished work on behalf of jamiesun September 28, 2025 12:46