feat: parallelize NGAP processing in AMF #194
base: main
Conversation
Pull request overview
This PR implements parallel processing of NGAP messages in the AMF component by introducing a worker pool architecture with hash-based task distribution. The goal is to eliminate performance bottlenecks from the previous sequential processing model while maintaining per-UE message ordering guarantees.
Changes:
- Introduced a UE scheduler with configurable worker pool for concurrent NGAP message processing
- Added lightweight UE ID extraction logic to route messages to appropriate workers without full ASN.1 unmarshalling
- Refactored connection handling to dispatch messages asynchronously through the worker pool
- Added configuration options for worker pool size and task buffer size with graceful defaults
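To make the routing model concrete, here is a small self-contained Go sketch of the principle only (it does not use this PR's actual `UEScheduler` types): tasks are hashed by UE ID onto a fixed per-worker queue, so messages of one UE stay ordered while different UEs proceed in parallel.

```go
package main

import (
	"fmt"
	"sync"
)

// task mirrors the idea of the PR's Task (UE ID + raw NGAP message); it is a
// local stand-in, not the PR's type.
type task struct {
	ueID uint64
	msg  []byte
}

func main() {
	const numWorkers = 4
	queues := make([]chan task, numWorkers)
	var wg sync.WaitGroup

	for i := 0; i < numWorkers; i++ {
		queues[i] = make(chan task, 16)
		wg.Add(1)
		go func(id int, q <-chan task) {
			defer wg.Done()
			// One goroutine per queue: messages of a given UE are handled
			// strictly in the order they were enqueued.
			for t := range q {
				fmt.Printf("worker %d: UE %d, %d bytes\n", id, t.ueID, len(t.msg))
			}
		}(i, queues[i])
	}

	// hash(UE_ID) % N routing: the same UE always lands on the same queue.
	dispatch := func(t task) {
		queues[t.ueID%numWorkers] <- t
	}

	for i := 0; i < 3; i++ {
		dispatch(task{ueID: 12345, msg: []byte{byte(i)}}) // same UE -> same worker, ordered
		dispatch(task{ueID: 67890, msg: []byte{byte(i)}}) // different UE -> possibly another worker
	}

	for _, q := range queues {
		close(q)
	}
	wg.Wait()
}
```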
Reviewed changes
Copilot reviewed 7 out of 7 changed files in this pull request and generated 19 comments.
| File | Description |
|---|---|
| pkg/factory/config.go | Added configuration fields and getters for NGAP worker pool size and task buffer size |
| pkg/service/init.go | Added scheduler initialization on startup and graceful shutdown on termination |
| internal/ngap/scheduler.go | Implemented worker pool architecture with hash-based UE-to-worker routing |
| internal/ngap/ue_id_extractor.go | Implemented lightweight UE ID extraction from NGAP messages covering all major message types |
| internal/ngap/service/service.go | Modified connection handler to dispatch messages through worker pool with fallback |
| internal/ngap/scheduler_test.go | Added comprehensive tests for hash distribution, concurrency, sequentiality, and shutdown |
| internal/ngap/ue_id_extractor_test.go | Added tests for UE ID extraction across 9 different NGAP message types |
internal/ngap/scheduler_test.go (Outdated)

```go
	time.Sleep(3 * time.Second)

	expectedTotal := numGoroutines * tasksPerGoroutine
	actualProcessed := atomic.LoadInt32(&processedCount)

	t.Logf("Expected %d tasks, processed %d tasks", expectedTotal, actualProcessed)
	assert.Equal(t, int32(expectedTotal), actualProcessed,
		"All tasks should be processed")

	// Verify distribution
	t.Log("Tasks processed per worker:")
	for i := 0; i < numWorkers; i++ {
		count := processedByWorker[i]
		t.Logf("  Worker %d: %d tasks", i, count)
	}
}

func TestScheduler_PerUESequentiality(t *testing.T) {
	// Test that messages for the same UE are processed in order
	numWorkers := 4
	ueID := uint64(12345)
	numMessages := 100

	var processedOrder []int
	var mu sync.Mutex

	handler := func(conn net.Conn, msg []byte) {
		// Extract message sequence number from message
		seqNum := int(msg[0])
		mu.Lock()
		processedOrder = append(processedOrder, seqNum)
		mu.Unlock()
		// Small delay to test ordering
		time.Sleep(1 * time.Millisecond)
	}

	scheduler := NewUEScheduler(numWorkers, 1000, handler)
	defer scheduler.Shutdown()

	// Submit messages for the same UE in order
	for i := 0; i < numMessages; i++ {
		task := Task{
			UEID:    ueID,
			Conn:    &mockConn{},
			Message: []byte{byte(i)},
		}
		scheduler.DispatchTask(task)
	}

	// Wait for all messages to be processed
	time.Sleep(2 * time.Second)

	// Verify messages were processed in order
	require.Equal(t, numMessages, len(processedOrder),
		"All messages should be processed")

	for i := 0; i < numMessages; i++ {
		assert.Equal(t, i, processedOrder[i],
			"Message %d should be processed in order", i)
	}
}

func TestScheduler_MultipleUEsConcurrent(t *testing.T) {
	// Test multiple UEs being processed concurrently
	numWorkers := 8
	numUEs := 20
	messagesPerUE := 50

	processedByUE := make(map[uint64][]int)
	var mu sync.Mutex

	handler := func(conn net.Conn, msg []byte) {
		ueID := uint64(msg[0])
		seqNum := int(msg[1])

		mu.Lock()
		processedByUE[ueID] = append(processedByUE[ueID], seqNum)
		mu.Unlock()

		time.Sleep(1 * time.Millisecond)
	}

	scheduler := NewUEScheduler(numWorkers, 1000, handler)
	defer scheduler.Shutdown()

	var wg sync.WaitGroup
	wg.Add(numUEs)

	// Each UE submits messages in its own goroutine
	for ueIdx := 0; ueIdx < numUEs; ueIdx++ {
		go func(ueID uint64) {
			defer wg.Done()

			for msgIdx := 0; msgIdx < messagesPerUE; msgIdx++ {
				task := Task{
					UEID:    ueID,
					Conn:    &mockConn{},
					Message: []byte{byte(ueID), byte(msgIdx)},
				}
				scheduler.DispatchTask(task)
				// Small random delay between messages
				time.Sleep(100 * time.Microsecond)
			}
		}(uint64(ueIdx))
	}

	wg.Wait()

	// Give workers time to process
	time.Sleep(3 * time.Second)

	// Verify each UE's messages were processed in order
	for ueID := uint64(0); ueID < uint64(numUEs); ueID++ {
		messages := processedByUE[ueID]
		require.Equal(t, messagesPerUE, len(messages),
			"UE %d should have all messages processed", ueID)

		for i := 0; i < messagesPerUE; i++ {
			assert.Equal(t, i, messages[i],
				"UE %d message %d should be in order", ueID, i)
		}
	}
}

func TestScheduler_GracefulShutdown(t *testing.T) {
	// Test graceful shutdown of scheduler
	numWorkers := 4

	var processedCount int32
	handler := func(conn net.Conn, msg []byte) {
		atomic.AddInt32(&processedCount, 1)
		time.Sleep(10 * time.Millisecond)
	}

	scheduler := NewUEScheduler(numWorkers, 100, handler)

	// Submit some tasks
	for i := 0; i < 50; i++ {
		task := Task{
			UEID:    uint64(i),
			Conn:    &mockConn{},
			Message: []byte{0x00},
		}
		scheduler.DispatchTask(task)
	}

	// Give some time for processing to start
	time.Sleep(100 * time.Millisecond)

	// Shutdown
	scheduler.Shutdown()

	// Verify some tasks were processed (not all, due to shutdown)
	processed := atomic.LoadInt32(&processedCount)
	t.Logf("Processed %d tasks before shutdown", processed)
	assert.Greater(t, processed, int32(0),
		"Some tasks should be processed before shutdown")
}

func TestScheduler_WorkerCount(t *testing.T) {
	testCases := []struct {
		name          string
		numWorkers    int
		expectedCount int
	}{
		{"Single worker", 1, 1},
		{"Four workers", 4, 4},
		{"Eight workers", 8, 8},
		{"Auto-detect (0)", 0, -1}, // -1 means check > 0
	}

	for _, tc := range testCases {
		t.Run(tc.name, func(t *testing.T) {
			scheduler := NewUEScheduler(tc.numWorkers, 100,
				func(conn net.Conn, msg []byte) {})
			defer scheduler.Shutdown()

			actualCount := len(scheduler.workers)
			if tc.expectedCount == -1 {
				assert.Greater(t, actualCount, 0,
					"Auto-detected worker count should be > 0")
			} else {
				assert.Equal(t, tc.expectedCount, actualCount,
					"Worker count should match expected")
			}
		})
	}
}

func TestScheduler_NonUEMessage(t *testing.T) {
	// Test handling of non-UE messages (UE ID = 0)
	numWorkers := 4

	var processedCount int32

	handler := func(conn net.Conn, msg []byte) {
		atomic.AddInt32(&processedCount, 1)
	}

	scheduler := NewUEScheduler(numWorkers, 100, handler)
	defer scheduler.Shutdown()

	// Submit non-UE messages (UE ID = 0)
	// All should go to the same worker (determined by hash)
	expectedWorkerIndex := scheduler.hashUEID(0)

	for i := 0; i < 20; i++ {
		task := Task{
			UEID:    0, // Non-UE message
			Conn:    &mockConn{},
			Message: []byte{0x00},
		}

		// Verify they all go to the same worker
		workerIndex := scheduler.hashUEID(0)
		assert.Equal(t, expectedWorkerIndex, workerIndex,
			"All non-UE messages should route to the same worker")

		scheduler.DispatchTask(task)
	}

	time.Sleep(500 * time.Millisecond)
```
Copilot AI · Jan 12, 2026
The test relies on time.Sleep() calls to wait for message processing to complete (lines 155, 205, 264, 376). This makes tests non-deterministic and unnecessarily slow. Consider using synchronization primitives like channels or WaitGroups in the mock handler to signal when processing is complete, rather than arbitrary sleep durations.
Change: Refactor scheduler_test.go to use synchronization primitives instead of time.Sleep
- Eliminated time.Sleep: Removed all arbitrary sleep calls (time.Sleep) used to wait for async processing.
- Implemented sync.WaitGroup: Introduced sync.WaitGroup in the test functions and mock handlers to track the exact completion of task processing.
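For illustration, a minimal sketch of that pattern (assuming it sits in the same test package as `scheduler_test.go`, reusing its `Task`, `NewUEScheduler`, `mockConn`, and testify imports): the handler signals a `sync.WaitGroup`, so the test waits exactly as long as processing actually takes.

```go
// Sketch only: relies on the Task/NewUEScheduler/mockConn definitions already
// present in this package's tests.
func TestScheduler_AllTasksProcessed_NoSleep(t *testing.T) {
	const numTasks = 100

	var done sync.WaitGroup
	done.Add(numTasks)

	var processedCount int32
	handler := func(conn net.Conn, msg []byte) {
		atomic.AddInt32(&processedCount, 1)
		done.Done() // signal completion instead of sleeping in the test body
	}

	scheduler := NewUEScheduler(4, 1000, handler)
	defer scheduler.Shutdown()

	for i := 0; i < numTasks; i++ {
		scheduler.DispatchTask(Task{
			UEID:    uint64(i),
			Conn:    &mockConn{},
			Message: []byte{byte(i)},
		})
	}

	done.Wait() // returns as soon as the final task has been handled
	assert.Equal(t, int32(numTasks), atomic.LoadInt32(&processedCount))
}
```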
```go
// Shutdown gracefully shuts down all workers.
func (s *UEScheduler) Shutdown() {
	s.workerMutex.Lock()
	defer s.workerMutex.Unlock()

	logger.NgapLog.Info("Shutting down UE Scheduler and all workers...")

	var wg sync.WaitGroup
	for i, worker := range s.workers {
		wg.Add(1)
		go func(idx int, w *Worker) {
			defer wg.Done()
			logger.NgapLog.Infof("Closing task channel for Worker %d", idx)
			close(w.taskChan)
		}(i, worker)
	}
	wg.Wait()

	logger.NgapLog.Info("All workers shut down successfully")
}
```
Copilot AI · Jan 12, 2026
The Worker's wg WaitGroup is added to when workers start but never waited upon during shutdown. Each worker calls wg.Done() when it exits, but there's no corresponding wg.Wait() in the shutdown sequence. This means the Shutdown() method can return before all workers have finished processing their remaining tasks and exited cleanly. The UEScheduler should store a reference to this WaitGroup and wait for all workers to complete during shutdown.
Change: Integrated sync.WaitGroup into the UEScheduler struct and ensured Shutdown() waits for all workers. Removed the unused stopChan and Stop() method.
Rationale: This ensures the AMF process does not exit until all workers have finished processing their remaining tasks.
```go
// internal/ngap/scheduler.go
type UEScheduler struct {
	// ...
	wg sync.WaitGroup // <--- Stored in struct
}

func NewUEScheduler(...) *UEScheduler {
	// ...
	for i := 0; i < numWorkers; i++ {
		// Pass the scheduler's WG reference to workers
		scheduler.workers[i] = NewWorker(i, taskBufferSize, handler, &scheduler.wg)
	}
	return scheduler
}

func (s *UEScheduler) Shutdown() {
	s.workerMutex.Lock()
	// ... close channels ...
	s.workerMutex.Unlock()

	s.wg.Wait() // <--- Explicit wait prevents data loss
	logger.NgapLog.Info("All workers shut down successfully")
}
```
I’d like to double-check whether the current implementation can run into a send on closed channel issue.
Shutdown() closes w.taskChan, while UEScheduler.DispatchTask() may still be running in a different goroutine and calling worker.Submit(task). If these overlap, it seems possible for a send to occur after the channel has been closed.
Could you help confirm whether this race is already prevented by the current design, or if additional coordination (e.g. a scheduler-level stop signal or submit-side guarding) is needed?
After consideration, I have decided to revert to the stopChan + select design.
Implementation Details:
- Mechanism: `taskChan` is never explicitly closed (it is left to the GC). `Submit` uses a `select` to manage both shutdown behavior and normal traffic flow, ensuring backpressure is maintained (see the sketch after the performance numbers below).
- Draining: A `drainAndExit` function is added to process residual packets during shutdown.
- Safety: This approach resolves both the "send on closed channel" panic and the "shutdown deadlock" issue.
- Lock-Free: All mutexes have been removed. The worker structure is effectively read-only after initialization, and the shutdown process no longer depends on the specific closing order of `taskChan`.
- Error Handling: `defer recover` is retained in the `run` loop strictly to handle unexpected runtime errors (e.g., nil pointer dereferences) and prevent the worker from crashing.
Performance Impact (Hot Path Analysis):
I acknowledge that using select in the Submit hot path introduces overhead. I conducted a simple test using UERANSIM with 100 concurrent UEs (-n 100) to measure the impact on registration time (calculated via log timestamps):
- The initial PR version (which had no extra overhead in Submit): 10.292 seconds
- Current Version (Select-based): 11.829 seconds
Result: The current safety-focused design incurs an approximate 14% performance cost.
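As a reference, here is a sketch of the described `Submit` under this design (the field names `taskChan`, `stopChan`, and `ID` are assumed from this thread; this is not the exact committed code): the send blocks while the buffer is full, preserving backpressure, and only gives up if the stop signal fires first.

```go
// Sketch of the select-based Submit described above (field names assumed).
func (w *Worker) Submit(task Task) {
	select {
	case w.taskChan <- task:
		// Normal path: blocks while the worker's buffer is full, so SCTP reading
		// naturally slows down (backpressure) instead of dropping signalling.
	case <-w.stopChan:
		// Shutdown in progress: return immediately rather than blocking forever;
		// already-buffered tasks are handled by drainAndExit.
		logger.NgapLog.Warnf("Worker %d is stopping; dropping task for UE ID %d", w.ID, task.UEID)
	}
}
```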
internal/ngap/scheduler.go (Outdated)

```go
// Submit submits a task to this worker's queue.
func (w *Worker) Submit(task Task) {
	w.taskChan <- task
```
Copilot AI · Jan 12, 2026
The Submit method performs a blocking send on the task channel. If the worker's buffer is full and no workers are consuming (e.g., during shutdown or if a worker panics), this will block indefinitely, potentially deadlocking the system. Consider using a select statement with a timeout or a default case to handle the scenario where the buffer is full, allowing the caller to handle backpressure appropriately.
Suggested change:

```diff
-	w.taskChan <- task
+	select {
+	case w.taskChan <- task:
+		// Task successfully enqueued.
+	case <-w.stopChan:
+		// Worker is stopping; avoid blocking and drop the task.
+		logger.NgapLog.Warnf("Worker %d is stopping; dropping task for UE ID %d", w.ID, task.UEID)
+	default:
+		// Channel buffer is full; avoid blocking and drop the task to prevent deadlock.
+		logger.NgapLog.Warnf("Worker %d task channel full; dropping task for UE ID %d", w.ID, task.UEID)
+	}
```
Change: Moved the blocking Submit call outside the workerMutex lock scope.
Rationale: This prevents a deadlock scenario where a full channel blocks Submit (holding the RLock), preventing Shutdown from acquiring the Lock. Note that I retained the blocking behavior (instead of dropping tasks) to maintain backpressure for flow control in the 5G control plane.
```go
// internal/ngap/scheduler.go
func (s *UEScheduler) DispatchTask(task Task) {
	s.workerMutex.RLock()
	// Hash the UE ID to determine which worker should handle it
	workerIndex := s.hashUEID(task.UEID)
	worker := s.workers[workerIndex]
	s.workerMutex.RUnlock() // <--- Unlock BEFORE submitting

	logger.NgapLog.Debugf("Dispatching UE ID %d to Worker %d", task.UEID, workerIndex)
	worker.Submit(task) // <--- Safe blocking submission
}
```
My understanding is that changing only DispatchTask does not fully address the issue. If w.taskChan is full, UEScheduler can still be blocked on worker.Submit(task).
I would suggest that the select-based handling inside Submit is still necessary, so that cases like a full buffer or stopChan being closed are explicitly handled. This also ensures proper behavior during worker shutdown.
I have re-implemented the select-based logic within Submit. Currently, when the worker is full, the operation blocks to ensure backpressure, which effectively reduces retransmissions and prevents signaling storms. If a shutdown signal is received while the worker is full, Submit will exit immediately, leaving drainAndExit to handle the remaining packets.
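A sketch of the worker loop this implies (again with assumed field names `taskChan`, `stopChan`, `handler`, and `ID`; the committed code may differ in detail): the `run` loop recovers from unexpected panics, and `drainAndExit` empties whatever is already buffered before the goroutine exits.

```go
// Sketch only; mirrors the run/drainAndExit behavior described in this thread.
func (w *Worker) run() {
	defer func() {
		if r := recover(); r != nil {
			logger.NgapLog.Errorf("Worker %d recovered from panic: %v", w.ID, r)
		}
	}()

	for {
		select {
		case task := <-w.taskChan:
			w.handler(task.Conn, task.Message)
		case <-w.stopChan:
			w.drainAndExit()
			return
		}
	}
}

func (w *Worker) drainAndExit() {
	for {
		select {
		case task := <-w.taskChan:
			// Process residual packets that were enqueued before shutdown.
			w.handler(task.Conn, task.Message)
		default:
			logger.NgapLog.Infof("Worker %d drained its queue and is exiting", w.ID)
			return
		}
	}
}
```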
```go
scheduler, err := ngap_internal.GetScheduler()
if err != nil {
	// Fallback to direct handling if scheduler is not initialized
	logger.NgapLog.Warnf("Scheduler not initialized, falling back to sequential processing: %v", err)
```
Copilot AI · Jan 12, 2026
When the scheduler is not initialized, the code falls back to sequential processing with a warning log. However, there's no mechanism to prevent continued operation in this degraded state. If scheduler initialization fails during startup, the system silently continues without parallelization. Consider making scheduler initialization a hard requirement, or at least log this as an error rather than a warning to make the degraded state more visible.
Suggested change:

```diff
-	logger.NgapLog.Warnf("Scheduler not initialized, falling back to sequential processing: %v", err)
+	logger.NgapLog.Errorf("Scheduler not initialized, falling back to sequential processing: %v", err)
```
internal/ngap/scheduler.go (Outdated)

```go
// Convert uint64 to bytes for hashing
b := make([]byte, 8)
for i := 0; i < 8; i++ {
	b[i] = byte(ueID >> (i * 8))
}
h.Write(b)
```
Copilot AI · Jan 12, 2026
The byte conversion for hashing uses little-endian order (shifting right and taking lower bytes first). While this is not incorrect, using a standard library function like binary.LittleEndian.PutUint64() would be clearer and more maintainable. This makes the byte order explicit and follows Go conventions.
Change: Replaced fnv hashing with simple modulo arithmetic.
Rationale: fnv required memory allocation (make([]byte)) on every packet. Using modulo is zero-allocation and sufficient for load distribution.
```go
// internal/ngap/scheduler.go
func (s *UEScheduler) hashUEID(ueID uint64) int {
	return int(ueID % uint64(s.numWorkers))
}
```
internal/ngap/scheduler.go (Outdated)

```go
// Stop gracefully stops the worker.
func (w *Worker) Stop() {
	close(w.stopChan)
}
```
Copilot AI · Jan 12, 2026
The stopChan is closed in the Stop() method but never actually used in the Shutdown() function. The Shutdown method only closes the taskChan, but doesn't call Stop() on individual workers. This means the stopChan case in the worker's run() loop is dead code. Either remove the stopChan or properly utilize it in the shutdown sequence.
#194 (comment)
I removed the unused stopChan and Stop() method. Worker termination is now handled purely by closing the taskChan, which is a cleaner and more idiomatic Go pattern.
pkg/factory/config.go (Outdated)

```go
if c.Configuration != nil && c.Configuration.NgapWorkerPoolSize > 0 {
	return c.Configuration.NgapWorkerPoolSize
}
return 0 // 0 means auto-detect based on CPU cores
```
Copilot AI · Jan 12, 2026
The configuration getters return 0 for auto-detect when NgapWorkerPoolSize is not set or is 0. However, the InitScheduler function is called with this 0 value, and within InitScheduler, it checks if numWorkers <= 0 and defaults to runtime.NumCPU(). This logic is duplicated - InitScheduler already handles the auto-detect case, so the comment on line 1043 is misleading as the function doesn't directly perform auto-detection based on the returned 0.
Suggested change:

```diff
-	return 0 // 0 means auto-detect based on CPU cores
+	return 0 // 0 indicates that the caller (e.g., InitScheduler) should auto-detect based on CPU cores
```
Updated the comment in config.go to clearly explain that a return value of 0 signals the caller to perform auto-detection.
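For reference, the getter would then read roughly as follows (a sketch; the getter name `GetNgapWorkerPoolSize` and the `Config` receiver are assumptions, since only the body is shown in the excerpt above):

```go
// Sketch only: identifiers outside the excerpt above are assumptions.
func (c *Config) GetNgapWorkerPoolSize() int {
	if c.Configuration != nil && c.Configuration.NgapWorkerPoolSize > 0 {
		return c.Configuration.NgapWorkerPoolSize
	}
	// 0 indicates that the caller (e.g., InitScheduler) should auto-detect
	// based on CPU cores.
	return 0
}
```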
pkg/service/init.go (Outdated)

```go
if err := ngap.InitScheduler(workerPoolSize, taskBufferSize, ngap.Dispatch); err != nil {
	logger.InitLog.Fatalf("Failed to initialize NGAP scheduler: %v", err)
}
```
Copilot AI · Jan 12, 2026
If InitScheduler returns an error (from the initErr variable), the code calls logger.InitLog.Fatalf which terminates the program. However, InitScheduler never actually sets initErr to a non-nil value - the function always returns nil. Either remove the error return type from InitScheduler or properly implement error handling for initialization failures (e.g., if worker creation fails).
Suggested change:

```diff
-	if err := ngap.InitScheduler(workerPoolSize, taskBufferSize, ngap.Dispatch); err != nil {
-		logger.InitLog.Fatalf("Failed to initialize NGAP scheduler: %v", err)
-	}
+	ngap.InitScheduler(workerPoolSize, taskBufferSize, ngap.Dispatch)
```
Change: Removed the error return from InitScheduler.
Rationale: Since invalid inputs (<=0) are now handled by applying safe defaults (auto-detect or 4096 buffer), the function cannot fail. Removing the return value clarifies the API and satisfies linter checks.
```go
// internal/ngap/scheduler.go
func InitScheduler(numWorkers int, taskBufferSize int, handler func(conn net.Conn, msg []byte)) {
	globalSchedulerOnce.Do(func() {
		if numWorkers <= 0 {
			numWorkers = runtime.NumCPU()
		}
		if taskBufferSize <= 0 {
			taskBufferSize = 4096
		}
		// ... init ...
	})
	// No return error needed
}
```
internal/ngap/service/service.go (Outdated)

```go
// dispatchToWorkerPool extracts the UE ID and dispatches the task to the appropriate worker.
// For non-UE messages (e.g., NGSetupRequest), it dispatches to a default worker (worker 0).
func dispatchToWorkerPool(conn net.Conn, msg []byte, handler NGAPHandler) {
	// Import the ngap package to access the scheduler
```
Copilot AI · Jan 12, 2026
The comment "Import the ngap package to access the scheduler" is misleading. The import statement is at the top of the file (line 12), not at this location. This comment should be removed or clarified to describe what this function does, not how imports work.
Suggested change:

```diff
-	// Import the ngap package to access the scheduler
+	// Get the NGAP scheduler used for dispatching tasks based on UE ID
```
Removed the confusing comment in service.go
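Putting the pieces of this thread together, the dispatch path with fallback looks roughly like the sketch below (the `ExtractUEID` signature and the `HandleMessage` field on `NGAPHandler` are assumptions based on the descriptions in this PR, not verified against the final code):

```go
// Sketch of dispatchToWorkerPool with the sequential fallback discussed above.
func dispatchToWorkerPool(conn net.Conn, msg []byte, handler NGAPHandler) {
	scheduler, err := ngap_internal.GetScheduler()
	if err != nil {
		// Degraded mode: no scheduler, so process the message inline (sequentially).
		logger.NgapLog.Errorf("Scheduler not initialized, falling back to sequential processing: %v", err)
		handler.HandleMessage(conn, msg)
		return
	}

	// Lightweight UE ID extraction; non-UE messages yield 0 and are routed to
	// a fixed default worker by the hash.
	ueID := ngap_internal.ExtractUEID(msg)

	scheduler.DispatchTask(ngap_internal.Task{
		UEID:    ueID,
		Conn:    conn,
		Message: msg,
	})
}
```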
```go
func (s *UEScheduler) hashUEID(ueID uint64) int {
	h := fnv.New64a()
	// Convert uint64 to bytes for hashing
	b := make([]byte, 8)
	for i := 0; i < 8; i++ {
		b[i] = byte(ueID >> (i * 8))
	}
	h.Write(b)
	return int(h.Sum64() % uint64(s.numWorkers))
}
```
Copilot AI · Jan 12, 2026
The hash function creates a new FNV hash instance and allocates a new byte slice on every call. Since this is called for every message dispatch (potentially thousands of times per second), this creates unnecessary allocations and garbage collection pressure. Consider creating a pool of hash instances or using a simpler modulo operation directly on the uint64 value if perfect distribution isn't critical: return int(ueID % uint64(s.numWorkers)).
#194 (comment)
Switched to modulo arithmetic to avoid memory allocation on every packet dispatch.
Subject: Fix deadlock, improve shutdown safety, and optimize performance.
internal/ngap/ue_id_extractor.go (Outdated)

```go
case ngapType.ProcedureCodePathSwitchRequest:
	if msg.Value.PathSwitchRequest != nil {
		for _, ie := range msg.Value.PathSwitchRequest.ProtocolIEs.List {
			if ie.Id.Value == ngapType.ProtocolIEIDRANUENGAPID && ie.Value.RANUENGAPID != nil {
```
Should this be SourceAMFUENGAPID? Otherwise, it will use a worker different from that of existing RanUe.
Yes, I noted this possible issue in the Notes above. Under the current global AMF Context architecture, a UE message triggers a worker switch only once. While this results in a cache miss, it does not lead to any functional errors. However, if the global Context is removed in the future, the allocation mechanisms for both amf-ue-id and ran-ue-id must be modified accordingly.
Architecture & Safety Considerations:
To minimize changes to the core architecture, this implementation continues to utilize the global AMFContext (via sync.Map) for storing UE contexts. Consequently, a UE's InitialUEMessage (keyed by RAN ID) may be processed by Worker A, while subsequent NAS messages (keyed by AMF ID) may be processed by Worker B.
- Safety: This approach is safe because the global context uses a thread-safe `sync.Map` for access. Furthermore, the 5G Request-Response architecture prevents race conditions where a UE sends messages with a RAN ID and an AMF ID simultaneously.
- Future Work: Achieving a strictly local "Per-UE Connection" model (removing global lock contention entirely) would require removing the global `UePool` and modifying the `AMF-UE-ID` allocation mechanism to bind specific IDs to specific workers.
My understanding (please correct me if I’m mistaken) is that while InitialUEMessage only triggers a single worker switch and is functionally safe under the current global AMFContext + sync.Map design, PathSwitchRequest is different.
At this stage, the UE may already have ongoing traffic (e.g. previous uplink NGAP messages) keyed by AMF UE NGAP ID. If PathSwitchRequest is dispatched using the new RAN UE NGAP ID instead of SourceAMFUENGAPID, it could introduce an additional worker switch during an active UE lifecycle.
A possible scenario is:
Initial UE Message → Worker A (RAN ID)
Uplink NAS Transport → Worker B (AMF UE ID)
Path Switch Request → Worker C (new RAN UE ID)
This could lead to multiple workers accessing the same UE context and extra worker hopping.
If my understanding differs from the actual implementation, I’m happy to discuss and clarify.
Sorry, I didn't fully grasp your question at first. You are absolutely right—we should keep the same Source AMF UE NGAP ID. I've updated the code. Thanks!
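For completeness, the corrected case would look roughly like this (assuming the generated `ngapType` package exposes the IE as `ProtocolIEIDSourceAMFUENGAPID` / `ie.Value.SourceAMFUENGAPID`, mirroring the `RANUENGAPID` handling shown earlier; the return shape is shown as `uint64` for illustration and should follow whatever the surrounding extractor uses):

```go
// Sketch of the SourceAMFUENGAPID-based keying for PathSwitchRequest.
case ngapType.ProcedureCodePathSwitchRequest:
	if msg.Value.PathSwitchRequest != nil {
		for _, ie := range msg.Value.PathSwitchRequest.ProtocolIEs.List {
			// Key on the AMF UE NGAP ID the UE already holds, so this message
			// lands on the same worker as the UE's earlier NGAP traffic.
			if ie.Id.Value == ngapType.ProtocolIEIDSourceAMFUENGAPID && ie.Value.SourceAMFUENGAPID != nil {
				return uint64(ie.Value.SourceAMFUENGAPID.Value)
			}
		}
	}
```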
1. What changes were made
- `UEScheduler` (`scheduler.go`): Introduced a worker pool architecture with hash-based task distribution to manage concurrent NGAP message processing.
- `ExtractUEID` (`ue_id_extractor.go`): Created a lightweight decoder to extract UE identifiers (AMF-UE-NGAP-ID or RAN-UE-NGAP-ID) from incoming packets without full ASN.1 unmarshalling.
- `service.go`: Modified `handleConnection` to decouple SCTP reading from message processing. It now dispatches tasks to the worker pool instead of blocking on `handler.HandleMessage`.
- `init.go`: Added initialization and graceful shutdown routines for the scheduler within the AMF service startup/teardown sequence.
- Added `NgapWorkerPoolSize` and `NgapTaskBufferSize` options to `amfcfg.yaml` to allow performance tuning.
- `ue_id_extractor_test.go`: Validates ID extraction across 9 scenarios, including `InitialUEMessage`, `UplinkNASTransport`, `HandoverRequired`, `PDUSessionResourceSetupResponse`, and invalid message handling.
- `scheduler_test.go` (distribution): Verifies hash consistency and uniform distribution across workers (tested with 10,000 UEs).
- `scheduler_test.go` (concurrency): Validates system stability under high load (50 concurrent goroutines submitting 5,000 tasks), ensures strict per-UE message sequentiality, and verifies graceful shutdown procedures.

2. How it works
- Routing: `hash(UE_ID) % N` is used to route all messages belonging to the same UE to the specific worker goroutine, preserving per-UE ordering.
- Non-UE messages (e.g., `NGSetup`) are routed to a default worker to maintain global order where necessary.

3. Why this change is necessary
The previous sequential processing model was a performance bottleneck in the AMF; the worker pool removes it while preserving per-UE message ordering guarantees.

Notes
To minimize changes to the core architecture, this implementation continues to utilize the global `AMFContext` (via `sync.Map`) for storing UE contexts. Consequently, a UE's `InitialUEMessage` (keyed by RAN ID) may be processed by Worker A, while subsequent NAS messages (keyed by AMF ID) may be processed by Worker B.
- Safety: This is safe because the global context uses a thread-safe `sync.Map` for access. Furthermore, the 5G Request-Response architecture prevents race conditions where a UE sends messages with a RAN ID and an AMF ID simultaneously.
- Future Work: A strictly local "Per-UE Connection" model would require removing the global `UePool` and modifying the `AMF-UE-ID` allocation mechanism to bind specific IDs to specific workers.
- Testing: UE load was generated with `sudo ./nr-ue -c ../config/free5gc-ue.yaml -n 30` (supports up to ~512 UEs, though ~100 is recommended to avoid gNB instability).

Architecture & Simple testing result