
Conversation


@qawl987 qawl987 commented Jan 12, 2026

1. What changes were made

  • Implemented UEScheduler (scheduler.go): Introduced a worker pool architecture with hash-based task distribution to manage concurrent NGAP message processing.
  • Added ExtractUEID (ue_id_extractor.go): Created a lightweight decoder to extract UE identifiers (AMF-UE-NGAP-ID or RAN-UE-NGAP-ID) from incoming packets without full ASN.1 unmarshalling.
  • Refactored I/O Layer (service.go): Modified handleConnection to decouple SCTP reading from message processing. It now dispatches tasks to the worker pool instead of blocking on handler.HandleMessage (a sketch of this flow follows this list).
  • Updated Lifecycle Management (init.go): Added initialization and graceful shutdown routines for the scheduler within the AMF service startup/teardown sequence.
  • Enhanced Configuration: Added NgapWorkerPoolSize and NgapTaskBufferSize options to amfcfg.yaml to allow performance tuning.
  • Added Comprehensive Test Suite:
    • ue_id_extractor_test.go: Validates ID extraction across 9 scenarios, including InitialUEMessage, UplinkNASTransport, HandoverRequired, PDUSessionResourceSetupResponse, and invalid message handling.
    • scheduler_test.go (Distribution): Verifies hash consistency and uniform distribution across workers (tested with 10,000 UEs).
    • scheduler_test.go (Concurrency): Validates system stability under high load (50 concurrent goroutines submitting 5,000 tasks), ensures strict per-UE message sequentiality, and verifies graceful shutdown procedures.
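
The refactored read path can be pictured with the following minimal sketch. It is illustrative only: the ExtractUEID signature, the Task fields, and the fallback branch are inferred from the snippets discussed later in this review, not copied from the patch.

// Sketch of the decoupled dispatch in internal/ngap/service/service.go
// (names and signatures are assumptions based on this PR's discussion).
func dispatchToWorkerPool(conn net.Conn, msg []byte, handler NGAPHandler) {
    scheduler, err := ngap_internal.GetScheduler()
    if err != nil {
        // Scheduler not initialized: fall back to the previous sequential path.
        logger.NgapLog.Warnf("Scheduler not initialized, falling back to sequential processing: %v", err)
        handler.HandleMessage(conn, msg)
        return
    }

    // Lightweight peek at the NGAP PDU: yields the AMF-UE-NGAP-ID or
    // RAN-UE-NGAP-ID when present, and 0 for non-UE messages such as NGSetup.
    ueID := ngap_internal.ExtractUEID(msg)

    scheduler.DispatchTask(ngap_internal.Task{
        UEID:    ueID,
        Conn:    conn,
        Message: msg,
    })
}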

2. How it works

  • Incoming NGAP messages are immediately dispatched to a buffered channel, allowing the SCTP reader to resume listening without blocking.
  • The scheduler uses deterministic hashing (hash(UE_ID) % N) to route all messages belonging to the same UE to the same worker goroutine; a condensed sketch of this routing follows this list.
  • This guarantees Per-UE Sequentiality, ensuring that messages for a specific user are processed in order, preventing race conditions while allowing different UEs to be processed in parallel.
  • Non-UE associated messages (like NGSetup) are routed to a default worker to maintain global order where necessary.
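
Condensed into code, the routing looks roughly like this (field and function names follow the snippets quoted later in the review; shutdown handling is omitted here and discussed in the review comments below):

// Task carries one NGAP message together with the connection it arrived on.
type Task struct {
    UEID    uint64 // 0 for non-UE-associated messages such as NGSetup
    Conn    net.Conn
    Message []byte
}

// hashUEID maps a UE ID onto a fixed worker index, so every message of the
// same UE always lands in the same worker's queue.
func (s *UEScheduler) hashUEID(ueID uint64) int {
    return int(ueID % uint64(s.numWorkers))
}

// DispatchTask enqueues the task on the worker selected by the UE ID.
func (s *UEScheduler) DispatchTask(task Task) {
    s.workers[s.hashUEID(task.UEID)].Submit(task)
}

// Each worker drains its own queue sequentially; different workers run in
// parallel, giving per-UE ordering with cross-UE parallelism.
func (w *Worker) run() {
    for task := range w.taskChan {
        w.handler(task.Conn, task.Message)
    }
}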

3. Why this change is necessary

  • Eliminate Performance Bottlenecks: The previous "Per-Connection Sequential Processing Model" forced all UEs on a single gNB connection to be processed serially by a single goroutine. This caused Head-of-Line blocking and saturated a single CPU core even on powerful servers.
  • Improve Scalability: This refactor enables the AMF to utilize all available CPU cores for message processing ("Per-UE Parallelism Model"), significantly increasing throughput and responsiveness under high load without altering the existing business logic handlers.

Notes

  1. Architecture & Safety Considerations:
    To minimize changes to the core architecture, this implementation continues to utilize the global AMFContext (via sync.Map) for storing UE contexts. Consequently, a UE's InitialUEMessage (keyed by RAN ID) may be processed by Worker A, while subsequent NAS messages (keyed by AMF ID) may be processed by Worker B.
  • Safety: This approach is safe because the global context uses thread-safe sync.Map for access. Furthermore, the 5G Request-Response architecture prevents race conditions where a UE sends messages with a RAN ID and an AMF ID simultaneously.
  • Future Work: Achieving a strictly local "Per-UE Connection" model (removing global lock contention entirely) would require removing the global UePool and modifying the AMF-UE-ID allocation mechanism to bind specific IDs to specific workers.
  2. Performance Testing Tools:

Architecture & Simple testing result

  1. Original AMF (image)
  2. Parallelized NGAP AMF (image)
  3. A simple performance comparison with Open5GS across different workers and simulation environments (UERANSIM, free-ran-ue) (image)

Copilot AI review requested due to automatic review settings January 12, 2026 09:47

Copilot AI left a comment


Pull request overview

This PR implements parallel processing of NGAP messages in the AMF component by introducing a worker pool architecture with hash-based task distribution. The goal is to eliminate performance bottlenecks from the previous sequential processing model while maintaining per-UE message ordering guarantees.

Changes:

  • Introduced a UE scheduler with configurable worker pool for concurrent NGAP message processing
  • Added lightweight UE ID extraction logic to route messages to appropriate workers without full ASN.1 unmarshalling
  • Refactored connection handling to dispatch messages asynchronously through the worker pool
  • Added configuration options for worker pool size and task buffer size with graceful defaults

Reviewed changes

Copilot reviewed 7 out of 7 changed files in this pull request and generated 19 comments.

  • pkg/factory/config.go: Added configuration fields and getters for NGAP worker pool size and task buffer size
  • pkg/service/init.go: Added scheduler initialization on startup and graceful shutdown on termination
  • internal/ngap/scheduler.go: Implemented worker pool architecture with hash-based UE-to-worker routing
  • internal/ngap/ue_id_extractor.go: Implemented lightweight UE ID extraction from NGAP messages covering all major message types
  • internal/ngap/service/service.go: Modified connection handler to dispatch messages through worker pool with fallback
  • internal/ngap/scheduler_test.go: Added comprehensive tests for hash distribution, concurrency, sequentiality, and shutdown
  • internal/ngap/ue_id_extractor_test.go: Added tests for UE ID extraction across 9 different NGAP message types


Comment on lines 155 to 376
time.Sleep(3 * time.Second)

expectedTotal := numGoroutines * tasksPerGoroutine
actualProcessed := atomic.LoadInt32(&processedCount)

t.Logf("Expected %d tasks, processed %d tasks", expectedTotal, actualProcessed)
assert.Equal(t, int32(expectedTotal), actualProcessed,
"All tasks should be processed")

// Verify distribution
t.Log("Tasks processed per worker:")
for i := 0; i < numWorkers; i++ {
count := processedByWorker[i]
t.Logf(" Worker %d: %d tasks", i, count)
}
}

func TestScheduler_PerUESequentiality(t *testing.T) {
// Test that messages for the same UE are processed in order
numWorkers := 4
ueID := uint64(12345)
numMessages := 100

var processedOrder []int
var mu sync.Mutex

handler := func(conn net.Conn, msg []byte) {
// Extract message sequence number from message
seqNum := int(msg[0])
mu.Lock()
processedOrder = append(processedOrder, seqNum)
mu.Unlock()
// Small delay to test ordering
time.Sleep(1 * time.Millisecond)
}

scheduler := NewUEScheduler(numWorkers, 1000, handler)
defer scheduler.Shutdown()

// Submit messages for the same UE in order
for i := 0; i < numMessages; i++ {
task := Task{
UEID: ueID,
Conn: &mockConn{},
Message: []byte{byte(i)},
}
scheduler.DispatchTask(task)
}

// Wait for all messages to be processed
time.Sleep(2 * time.Second)

// Verify messages were processed in order
require.Equal(t, numMessages, len(processedOrder),
"All messages should be processed")

for i := 0; i < numMessages; i++ {
assert.Equal(t, i, processedOrder[i],
"Message %d should be processed in order", i)
}
}

func TestScheduler_MultipleUEsConcurrent(t *testing.T) {
// Test multiple UEs being processed concurrently
numWorkers := 8
numUEs := 20
messagesPerUE := 50

processedByUE := make(map[uint64][]int)
var mu sync.Mutex

handler := func(conn net.Conn, msg []byte) {
ueID := uint64(msg[0])
seqNum := int(msg[1])

mu.Lock()
processedByUE[ueID] = append(processedByUE[ueID], seqNum)
mu.Unlock()

time.Sleep(1 * time.Millisecond)
}

scheduler := NewUEScheduler(numWorkers, 1000, handler)
defer scheduler.Shutdown()

var wg sync.WaitGroup
wg.Add(numUEs)

// Each UE submits messages in its own goroutine
for ueIdx := 0; ueIdx < numUEs; ueIdx++ {
go func(ueID uint64) {
defer wg.Done()

for msgIdx := 0; msgIdx < messagesPerUE; msgIdx++ {
task := Task{
UEID: ueID,
Conn: &mockConn{},
Message: []byte{byte(ueID), byte(msgIdx)},
}
scheduler.DispatchTask(task)
// Small random delay between messages
time.Sleep(100 * time.Microsecond)
}
}(uint64(ueIdx))
}

wg.Wait()

// Give workers time to process
time.Sleep(3 * time.Second)

// Verify each UE's messages were processed in order
for ueID := uint64(0); ueID < uint64(numUEs); ueID++ {
messages := processedByUE[ueID]
require.Equal(t, messagesPerUE, len(messages),
"UE %d should have all messages processed", ueID)

for i := 0; i < messagesPerUE; i++ {
assert.Equal(t, i, messages[i],
"UE %d message %d should be in order", ueID, i)
}
}
}

func TestScheduler_GracefulShutdown(t *testing.T) {
// Test graceful shutdown of scheduler
numWorkers := 4

var processedCount int32
handler := func(conn net.Conn, msg []byte) {
atomic.AddInt32(&processedCount, 1)
time.Sleep(10 * time.Millisecond)
}

scheduler := NewUEScheduler(numWorkers, 100, handler)

// Submit some tasks
for i := 0; i < 50; i++ {
task := Task{
UEID: uint64(i),
Conn: &mockConn{},
Message: []byte{0x00},
}
scheduler.DispatchTask(task)
}

// Give some time for processing to start
time.Sleep(100 * time.Millisecond)

// Shutdown
scheduler.Shutdown()

// Verify some tasks were processed (not all, due to shutdown)
processed := atomic.LoadInt32(&processedCount)
t.Logf("Processed %d tasks before shutdown", processed)
assert.Greater(t, processed, int32(0),
"Some tasks should be processed before shutdown")
}

func TestScheduler_WorkerCount(t *testing.T) {
testCases := []struct {
name string
numWorkers int
expectedCount int
}{
{"Single worker", 1, 1},
{"Four workers", 4, 4},
{"Eight workers", 8, 8},
{"Auto-detect (0)", 0, -1}, // -1 means check > 0
}

for _, tc := range testCases {
t.Run(tc.name, func(t *testing.T) {
scheduler := NewUEScheduler(tc.numWorkers, 100,
func(conn net.Conn, msg []byte) {})
defer scheduler.Shutdown()

actualCount := len(scheduler.workers)
if tc.expectedCount == -1 {
assert.Greater(t, actualCount, 0,
"Auto-detected worker count should be > 0")
} else {
assert.Equal(t, tc.expectedCount, actualCount,
"Worker count should match expected")
}
})
}
}

func TestScheduler_NonUEMessage(t *testing.T) {
// Test handling of non-UE messages (UE ID = 0)
numWorkers := 4

var processedCount int32

handler := func(conn net.Conn, msg []byte) {
atomic.AddInt32(&processedCount, 1)
}

scheduler := NewUEScheduler(numWorkers, 100, handler)
defer scheduler.Shutdown()

// Submit non-UE messages (UE ID = 0)
// All should go to the same worker (determined by hash)
expectedWorkerIndex := scheduler.hashUEID(0)

for i := 0; i < 20; i++ {
task := Task{
UEID: 0, // Non-UE message
Conn: &mockConn{},
Message: []byte{0x00},
}

// Verify they all go to the same worker
workerIndex := scheduler.hashUEID(0)
assert.Equal(t, expectedWorkerIndex, workerIndex,
"All non-UE messages should route to the same worker")

scheduler.DispatchTask(task)
}

time.Sleep(500 * time.Millisecond)

Copilot AI Jan 12, 2026


The test relies on time.Sleep() calls to wait for message processing to complete (lines 155, 205, 264, 376). This makes tests non-deterministic and unnecessarily slow. Consider using synchronization primitives like channels or WaitGroups in the mock handler to signal when processing is complete, rather than arbitrary sleep durations.

Author

@qawl987 qawl987 Jan 13, 2026


Change: Refactor scheduler_test.go to use synchronization primitives instead of time.Sleep

  1. Eliminated time.Sleep: Removed all arbitrary sleep calls (time.Sleep) used to wait for async processing.
  2. Implemented sync.WaitGroup: Introduced sync.WaitGroup in the test functions and mock handlers to track the exact completion of task processing.
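
For illustration, a minimal sketch of that pattern, reusing the scheduler API and mockConn from the test file quoted above (the test name here is made up):

func TestScheduler_ProcessesAllTasks(t *testing.T) {
    const numTasks = 100

    var processed sync.WaitGroup
    processed.Add(numTasks)

    // The handler signals completion; the test no longer guesses with time.Sleep.
    handler := func(conn net.Conn, msg []byte) {
        processed.Done()
    }

    scheduler := NewUEScheduler(4, 1000, handler)
    defer scheduler.Shutdown()

    for i := 0; i < numTasks; i++ {
        scheduler.DispatchTask(Task{UEID: uint64(i), Conn: &mockConn{}, Message: []byte{0x00}})
    }

    processed.Wait() // returns as soon as the last task has been handled
}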

Comment on lines 130 to 149
// Shutdown gracefully shuts down all workers.
func (s *UEScheduler) Shutdown() {
    s.workerMutex.Lock()
    defer s.workerMutex.Unlock()

    logger.NgapLog.Info("Shutting down UE Scheduler and all workers...")

    var wg sync.WaitGroup
    for i, worker := range s.workers {
        wg.Add(1)
        go func(idx int, w *Worker) {
            defer wg.Done()
            logger.NgapLog.Infof("Closing task channel for Worker %d", idx)
            close(w.taskChan)
        }(i, worker)
    }
    wg.Wait()

    logger.NgapLog.Info("All workers shut down successfully")
}

Copilot AI Jan 12, 2026


The Worker's wg WaitGroup is added to when workers start but never waited upon during shutdown. Each worker calls wg.Done() when it exits, but there's no corresponding wg.Wait() in the shutdown sequence. This means the Shutdown() method can return before all workers have finished processing their remaining tasks and exited cleanly. The UEScheduler should store a reference to this WaitGroup and wait for all workers to complete during shutdown.

Author

@qawl987 qawl987 Jan 13, 2026


Change: Integrated sync.WaitGroup into the UEScheduler struct and ensured Shutdown() waits for all workers. Removed the unused stopChan and Stop() method.
Rationale: This ensures the AMF process does not exit until all workers have finished processing their remaining tasks.

// internal/ngap/scheduler.go

type UEScheduler struct {
    // ...
    wg sync.WaitGroup // <--- Stored in struct
}

func NewUEScheduler(...) *UEScheduler {
    // ...
    for i := 0; i < numWorkers; i++ {
        // Pass the scheduler's WG reference to workers
        scheduler.workers[i] = NewWorker(i, taskBufferSize, handler, &scheduler.wg)
    }
    return scheduler
}

func (s *UEScheduler) Shutdown() {
    s.workerMutex.Lock()
    // ... close channels ...
    s.workerMutex.Unlock()

    s.wg.Wait() // <--- Explicit wait prevents data loss
    logger.NgapLog.Info("All workers shut down successfully")
}


I’d like to double-check whether the current implementation can run into a send on closed channel issue.

Shutdown() closes w.taskChan, while UEScheduler.DispatchTask() may still be running in a different goroutine and calling worker.Submit(task). If these overlap, it seems possible for a send to occur after the channel has been closed.

Could you help confirm whether this race is already prevented by the current design, or if additional coordination (e.g. a scheduler-level stop signal or submit-side guarding) is needed?
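
For reference, a tiny standalone program (not project code) reproduces the hazard: a sender that is blocked on a full channel panics with "send on closed channel" the moment another goroutine closes that channel.

package main

import "time"

func main() {
    taskChan := make(chan int, 1)

    // Plays the role of Shutdown(): closes the channel while a send is still pending.
    go func() {
        time.Sleep(50 * time.Millisecond)
        close(taskChan)
    }()

    taskChan <- 1 // fills the buffer
    taskChan <- 2 // blocks (no consumer), then panics once the channel is closed
}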

Author


After consideration, I have decided to revert to the stopChan + select design.

Implementation Details:

  • Mechanism: taskChan is never explicitly closed (left to GC). Submit uses select to manage both shutdown behavior and normal traffic flow, ensuring backpressure is maintained.
  • Draining: A drainAndExit function is added to process residual packets during shutdown.
  • Safety: This approach resolves both the "send on closed channel" panic and "shutdown deadlock" issues.
  • Lock-Free: All mutexes have been removed. The worker structure is effectively read-only after initialization, and the shutdown process no longer depends on the specific closing order of taskChan.
  • Error Handling: defer recover is retained in the run loop strictly to handle unexpected runtime errors (e.g., nil pointer dereferences) and prevent the worker from crashing.

Performance Impact (Hot Path Analysis):
I acknowledge that using select in the Submit hot path introduces overhead. I conducted a simple test using UERANSIM with 100 concurrent UEs (-n 100) to measure the impact on registration time (calculated via log timestamps):

  • The initial PR version (which had no extra overhead in Submit): 10.292 seconds
  • Current Version (Select-based): 11.829 seconds

Result: The current safety-focused design incurs an approximate 14% performance cost.
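
A condensed sketch of this design, using the names mentioned above (stopChan, drainAndExit); treat it as an approximation of the final code rather than the patch itself:

// Submit blocks while the queue is full (preserving backpressure), but gives
// up immediately once shutdown has been signalled via stopChan.
func (w *Worker) Submit(task Task) {
    select {
    case w.taskChan <- task:
        // Enqueued normally.
    case <-w.stopChan:
        logger.NgapLog.Warnf("Worker %d stopping; dropping task for UE ID %d", w.ID, task.UEID)
    }
}

func (w *Worker) run() {
    // recover() keeps an unexpected runtime error (e.g. a nil dereference in a
    // handler) from crashing the whole AMF process.
    defer func() {
        if r := recover(); r != nil {
            logger.NgapLog.Errorf("Worker %d recovered from panic: %v", w.ID, r)
        }
    }()

    for {
        select {
        case task := <-w.taskChan:
            w.handler(task.Conn, task.Message)
        case <-w.stopChan:
            w.drainAndExit()
            return
        }
    }
}

// drainAndExit processes whatever is still buffered at shutdown.
// taskChan is never closed; it is simply left to the garbage collector.
func (w *Worker) drainAndExit() {
    for {
        select {
        case task := <-w.taskChan:
            w.handler(task.Conn, task.Message)
        default:
            return
        }
    }
}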


// Submit submits a task to this worker's queue.
func (w *Worker) Submit(task Task) {
w.taskChan <- task

Copilot AI Jan 12, 2026


The Submit method performs a blocking send on the task channel. If the worker's buffer is full and no workers are consuming (e.g., during shutdown or if a worker panics), this will block indefinitely, potentially deadlocking the system. Consider using a select statement with a timeout or a default case to handle the scenario where the buffer is full, allowing the caller to handle backpressure appropriately.

Suggested change:
- w.taskChan <- task
+ select {
+ case w.taskChan <- task:
+     // Task successfully enqueued.
+ case <-w.stopChan:
+     // Worker is stopping; avoid blocking and drop the task.
+     logger.NgapLog.Warnf("Worker %d is stopping; dropping task for UE ID %d", w.ID, task.UEID)
+ default:
+     // Channel buffer is full; avoid blocking and drop the task to prevent deadlock.
+     logger.NgapLog.Warnf("Worker %d task channel full; dropping task for UE ID %d", w.ID, task.UEID)
+ }

Author

@qawl987 qawl987 Jan 13, 2026


Change: Moved the blocking Submit call outside the workerMutex lock scope.
Rationale: This prevents a deadlock scenario where a full channel blocks Submit (holding the RLock), preventing Shutdown from acquiring the Lock. Note that I retained the blocking behavior (instead of dropping tasks) to maintain backpressure for flow control in the 5G control plane.

// internal/ngap/scheduler.go

func (s *UEScheduler) DispatchTask(task Task) {
    s.workerMutex.RLock()
    // Hash the UE ID to determine which worker should handle it
    workerIndex := s.hashUEID(task.UEID)
    worker := s.workers[workerIndex]
    s.workerMutex.RUnlock() // <--- Unlock BEFORE submitting

    logger.NgapLog.Debugf("Dispatching UE ID %d to Worker %d", task.UEID, workerIndex)
    worker.Submit(task)     // <--- Safe blocking submission
}


My understanding is that changing only DispatchTask does not fully address the issue. If w.taskChan is full, UEScheduler can still be blocked on worker.Submit(task).

I would suggest that the select-based handling inside Submit is still necessary, so that cases like a full buffer or stopChan being closed are explicitly handled. This also ensures proper behavior during worker shutdown.

Author


I have re-implemented the select-based logic within Submit. Currently, when the worker is full, the operation blocks to ensure backpressure, which effectively reduces retransmissions and prevents signaling storms. If a shutdown signal is received while the worker is full, Submit will exit immediately, leaving drainAndExit to handle the remaining packets.

scheduler, err := ngap_internal.GetScheduler()
if err != nil {
// Fallback to direct handling if scheduler is not initialized
logger.NgapLog.Warnf("Scheduler not initialized, falling back to sequential processing: %v", err)

Copilot AI Jan 12, 2026


When the scheduler is not initialized, the code falls back to sequential processing with a warning log. However, there's no mechanism to prevent continued operation in this degraded state. If scheduler initialization fails during startup, the system silently continues without parallelization. Consider making scheduler initialization a hard requirement, or at least log this as an error rather than a warning to make the degraded state more visible.

Suggested change:
- logger.NgapLog.Warnf("Scheduler not initialized, falling back to sequential processing: %v", err)
+ logger.NgapLog.Errorf("Scheduler not initialized, falling back to sequential processing: %v", err)

Comment on lines 121 to 126
    // Convert uint64 to bytes for hashing
    b := make([]byte, 8)
    for i := 0; i < 8; i++ {
        b[i] = byte(ueID >> (i * 8))
    }
    h.Write(b)

Copilot AI Jan 12, 2026


The byte conversion for hashing uses little-endian order (shifting right and taking lower bytes first). While this is not incorrect, using a standard library function like binary.LittleEndian.PutUint64() would be clearer and more maintainable. This makes the byte order explicit and follows Go conventions.
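
For reference, the standard-library form would look roughly like this (requires importing encoding/binary):

b := make([]byte, 8)
binary.LittleEndian.PutUint64(b, ueID) // explicit little-endian encoding instead of the manual shift loop
h.Write(b)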

Author


Change: Replaced fnv hashing with simple modulo arithmetic.
Rationale: fnv required memory allocation (make([]byte)) on every packet. Using modulo is zero-allocation and sufficient for load distribution.

// internal/ngap/scheduler.go

func (s *UEScheduler) hashUEID(ueID uint64) int {
    return int(ueID % uint64(s.numWorkers))
}

Comment on lines 66 to 69
// Stop gracefully stops the worker.
func (w *Worker) Stop() {
close(w.stopChan)
}

Copilot AI Jan 12, 2026


The stopChan is closed in the Stop() method but never actually used in the Shutdown() function. The Shutdown method only closes the taskChan, but doesn't call Stop() on individual workers. This means the stopChan case in the worker's run() loop is dead code. Either remove the stopChan or properly utilize it in the shutdown sequence.

Author


#194 (comment)
I removed the unused stopChan and Stop() method. Worker termination is now handled purely by closing the taskChan, which is a cleaner and more idiomatic Go pattern.

if c.Configuration != nil && c.Configuration.NgapWorkerPoolSize > 0 {
return c.Configuration.NgapWorkerPoolSize
}
return 0 // 0 means auto-detect based on CPU cores

Copilot AI Jan 12, 2026


The configuration getters return 0 for auto-detect when NgapWorkerPoolSize is not set or is 0. However, the InitScheduler function is called with this 0 value, and within InitScheduler, it checks if numWorkers <= 0 and defaults to runtime.NumCPU(). This logic is duplicated - InitScheduler already handles the auto-detect case, so the comment on line 1043 is misleading as the function doesn't directly perform auto-detection based on the returned 0.

Suggested change:
- return 0 // 0 means auto-detect based on CPU cores
+ return 0 // 0 indicates that the caller (e.g., InitScheduler) should auto-detect based on CPU cores

Author


Updated the comment in config.go to clearly explain that a return value of 0 signals the caller to perform auto-detection.

Comment on lines 199 to 201
if err := ngap.InitScheduler(workerPoolSize, taskBufferSize, ngap.Dispatch); err != nil {
logger.InitLog.Fatalf("Failed to initialize NGAP scheduler: %v", err)
}

Copilot AI Jan 12, 2026


If InitScheduler returns an error (from the initErr variable), the code calls logger.InitLog.Fatalf which terminates the program. However, InitScheduler never actually sets initErr to a non-nil value - the function always returns nil. Either remove the error return type from InitScheduler or properly implement error handling for initialization failures (e.g., if worker creation fails).

Suggested change:
- if err := ngap.InitScheduler(workerPoolSize, taskBufferSize, ngap.Dispatch); err != nil {
-     logger.InitLog.Fatalf("Failed to initialize NGAP scheduler: %v", err)
- }
+ ngap.InitScheduler(workerPoolSize, taskBufferSize, ngap.Dispatch)

Author


Change: Removed the error return from InitScheduler.
Rationale: Since invalid inputs (<=0) are now handled by applying safe defaults (auto-detect or 4096 buffer), the function cannot fail. Removing the return value clarifies the API and satisfies linter checks.

// internal/ngap/scheduler.go

func InitScheduler(numWorkers int, taskBufferSize int, handler func(conn net.Conn, msg []byte)) {
    globalSchedulerOnce.Do(func() {
        if numWorkers <= 0 { numWorkers = runtime.NumCPU() }
        if taskBufferSize <= 0 { taskBufferSize = 4096 }
        // ... init ...
    })
    // No return error needed
}

// dispatchToWorkerPool extracts the UE ID and dispatches the task to the appropriate worker.
// For non-UE messages (e.g., NGSetupRequest), it dispatches to a default worker (worker 0).
func dispatchToWorkerPool(conn net.Conn, msg []byte, handler NGAPHandler) {
// Import the ngap package to access the scheduler

Copilot AI Jan 12, 2026


The comment "Import the ngap package to access the scheduler" is misleading. The import statement is at the top of the file (line 12), not at this location. This comment should be removed or clarified to describe what this function does, not how imports work.

Suggested change:
- // Import the ngap package to access the scheduler
+ // Get the NGAP scheduler used for dispatching tasks based on UE ID

Author


Removed the confusing comment in service.go

Comment on lines 119 to 128
func (s *UEScheduler) hashUEID(ueID uint64) int {
    h := fnv.New64a()
    // Convert uint64 to bytes for hashing
    b := make([]byte, 8)
    for i := 0; i < 8; i++ {
        b[i] = byte(ueID >> (i * 8))
    }
    h.Write(b)
    return int(h.Sum64() % uint64(s.numWorkers))
}

Copilot AI Jan 12, 2026


The hash function creates a new FNV hash instance and allocates a new byte slice on every call. Since this is called for every message dispatch (potentially thousands of times per second), this creates unnecessary allocations and garbage collection pressure. Consider creating a pool of hash instances or using a simpler modulo operation directly on the uint64 value if perfect distribution isn't critical: return int(ueID % uint64(s.numWorkers)).

Author


#194 (comment)
Switched to modulo arithmetic to avoid memory allocation on every packet dispatch.

Author

qawl987 commented Jan 12, 2026

Subject: Fix deadlock, improve shutdown safety, and optimize performance.


  • internal/ngap/scheduler.go:
    • Moved worker.Submit() out of the read lock to prevent deadlocks.
    • Integrated sync.WaitGroup into UEScheduler for reliable shutdown.
    • Added recover() to worker goroutines.
    • Optimized hashUEID to avoid allocation.
    • Removed dead code (stopChan).
  • internal/ngap/ue_id_extractor.go:
    • Updated comments to accurately reflect ID extraction priority.
  • pkg/factory/config.go:
    • Clarified auto-detect comments.
  • pkg/service/init.go:
    • Aligned InitScheduler call with the simplified error handling logic.

case ngapType.ProcedureCodePathSwitchRequest:
    if msg.Value.PathSwitchRequest != nil {
        for _, ie := range msg.Value.PathSwitchRequest.ProtocolIEs.List {
            if ie.Id.Value == ngapType.ProtocolIEIDRANUENGAPID && ie.Value.RANUENGAPID != nil {


Should this be SourceAMFUENGAPID? Otherwise, it will use a worker different from that of existing RanUe.

Author


Yes, I noted this possible issue in the Notes above. Under the current global AMF Context architecture, a UE message triggers a worker switch only once. While this results in a cache miss, it does not lead to any functional errors. However, if the global Context is removed in the future, the allocation mechanisms for both amf-ue-id and ran-ue-id must be modified accordingly.

Architecture & Safety Considerations:
To minimize changes to the core architecture, this implementation continues to utilize the global AMFContext (via sync.Map) for storing UE contexts. Consequently, a UE's InitialUEMessage (keyed by RAN ID) may be processed by Worker A, while subsequent NAS messages (keyed by AMF ID) may be processed by Worker B.

  • Safety: This approach is safe because the global context uses thread-safe sync.Map for access. Furthermore, the 5G Request-Response architecture prevents race conditions where a UE sends messages with a RAN ID and an AMF ID simultaneously.
  • Future Work: Achieving a strictly local "Per-UE Connection" model (removing global lock contention entirely) would require removing the global UePool and modifying the AMF-UE-ID allocation mechanism to bind specific IDs to specific workers.


My understanding (please correct me if I’m mistaken) is that while InitialUEMessage only triggers a single worker switch and is functionally safe under the current global AMFContext + sync.Map design, PathSwitchRequest is different.

At this stage, the UE may already have ongoing traffic (e.g. previous uplink NGAP messages) keyed by AMF UE NGAP ID. If PathSwitchRequest is dispatched using the new RAN UE NGAP ID instead of SourceAMFUENGAPID, it could introduce an additional worker switch during an active UE lifecycle.

A possible scenario is:

Initial UE Message   → Worker A (RAN ID)
Uplink NAS Transport → Worker B (AMF UE ID)
Path Switch Request → Worker C (new RAN UE ID)

This could lead to multiple workers accessing the same UE context and extra worker hopping.
If my understanding differs from the actual implementation, I’m happy to discuss and clarify.

Author


Sorry, I didn't fully grasp your question at first. You are absolutely right—we should keep the same Source AMF UE NGAP ID. I've updated the code. Thanks!
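
For clarity, the corrected branch would look roughly like the following; the ngapType identifiers for the Source AMF UE NGAP ID IE and the enclosing function's return convention are assumed here, so treat it as a sketch rather than the exact patch.

case ngapType.ProcedureCodePathSwitchRequest:
    if msg.Value.PathSwitchRequest != nil {
        for _, ie := range msg.Value.PathSwitchRequest.ProtocolIEs.List {
            // Key on the Source AMF UE NGAP ID so the request lands on the same
            // worker that already handles this UE's AMF-ID-keyed traffic.
            if ie.Id.Value == ngapType.ProtocolIEIDSourceAMFUENGAPID && ie.Value.SourceAMFUENGAPID != nil {
                return uint64(ie.Value.SourceAMFUENGAPID.Value)
            }
        }
    }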
