Yang Yang (corresponding author) is with The Hong Kong University of Science and Technology (Guangzhou), Guangzhou, China, also with the Peng Cheng Laboratory, Shenzhen, China, and also with the Terminus Group, Beijing, China; Jianjun Wu (corresponding author), Chenghui Peng, and Jun Wang are with Huawei Technologies Company, Ltd., Shenzhen, China; Tianjiao Chen, Juan Deng, and Guangyi Liu are with the China Mobile Research Institute, Beijing, China; Xiaofeng Tao is with the Peng Cheng Laboratory, Shenzhen, China, and also with the Beijing University of Posts and Telecommunications, Beijing, China; Wenjing Li is with the Beijing University of Posts and Telecommunications, Beijing, China; Li Yang is with ZTE Corporation, Nanjing, China; Yufeng He is with the Research Institute of China Telecom, Beijing, China; Tingting Yang is with the Peng Cheng Laboratory, Shenzhen, China; A. Hamid Aghvami is with King's College London, London, U.K.; Frank Eliassen is with the University of Oslo, Oslo, Norway; Schahram Dustdar is with TU Wien, Vienna, Austria; Dusit Niyato is with Nanyang Technological University, Singapore; Wanfei Sun is with CICT Mobile Communication Technology Company, Ltd., Beijing, China; Yang Xu is with Beijing OPPO Telecommunications Corporation, Ltd., Dongguan, China; Yannan Yuan is with vivo Mobile Communication Company, Ltd., Dongguan, China; Jiang Xie is with the University of North Carolina at Charlotte, Charlotte, NC, USA; Rongpeng Li is with Zhejiang University, Hangzhou, China; Cuiqin Dai is with the Chongqing University of Posts and Telecommunications, Chongqing, China.

Digital Object Identifier: 10.1109/MNET.2023.3321464
Date of Publication: 6 October 2023
Date of Current Version: 18 April 2024
IEEE Network • January/February 2024 • 0890-8044/23 © 2023 IEEE
(SLA) indicators—e.g., service requirement zone (SRZ) and user satisfaction ratio (USR)—to QoAIS indicators, and discusses task-level QoS assurance to meet the individual requirements of different users.
3. Compares network AI and cloud/mobile edge computing (MEC) in terms of QoAIS indicators. By providing AI execution environments closer to UEs, TONA is anticipated to have advantages such as better data privacy protection, lower latency, and lower energy consumption.
4. Lists some open issues, including distributed AI learning, mobility management, and security assurance.

Network AI

The cloud AI architecture has been widely used in the 5G era to provide centralized computing, big data analysis, and AI training and inference services, where terminals provide data, mobile networks provide communication channels, and clouds provide AI capabilities. Coordinating these independent functions and resources among multiple facilities to provide effective, flexible, smooth, and stable services while ensuring QoE is extremely difficult. For latency-sensitive ultra-reliable low-latency communication (URLLC) services, MEC deploys application servers close to base stations and therefore has lower latency than cloud AI. However, the AI platform is still deployed at the application layer. Joint optimization of connection and AI resources (i.e., computing, data, and model/algorithm) still requires cross-layer collaboration in MEC. Consequently, the preceding problems involved in cloud AI remain unresolved.

Deploying AI functions (such as cloud and MEC) at the application layer leads to low throughput, high latency, poor privacy, and high carbon emissions. To address these problems, network AI has been launched to extend computing from the cloud to edges physically closer to end users. It also provides data storage and processing functions as well as AI capabilities inside the network, achieving higher security. Although this "device-edge-cloud" architecture with edge cloud is expensive to deploy, it can support compute-intensive, latency-sensitive, security-assured, and privacy-sensitive applications such as interactive virtual reality (VR) and augmented reality (AR) games, autonomous driving, and smart manufacturing [7]. Therefore, it is becoming promising in various high-value-added application scenarios.

By introducing AI in the network, 6G network AI applies to three scenarios (shown in Fig. 1): network element (NE) intelligence, network intelligence, and service intelligence. NE intelligence is the native intelligence of single nodes, e.g., core network (CN) or radio access network (RAN) nodes. Network intelligence refers to the collaboration of multiple intelligent NEs to achieve swarm intelligence. Both NE intelligence and network intelligence can be triggered internally or externally via open interfaces. Moreover, service intelligence refers to 6G network AI being provided as a service, which is generally triggered by external services and implemented in the network, without understanding the application service logic. Put simply, NE intelligence and network intelligence provide AI services for internal network modules, and service intelligence provides AI services for external third-party applications. Here, we assume that some network units, such as base stations and UEs, will have some type of AI processor that can be used both for themselves and for third parties.
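As a deliberately simplified sketch of this taxonomy (the type names and the request descriptor are our own illustration, not from any standard, and they ignore that NE/network intelligence may also be triggered externally), the three scenarios could be modeled as:

```python
from dataclasses import dataclass
from enum import Enum, auto

class Scenario(Enum):
    NE_INTELLIGENCE = auto()       # native intelligence of a single CN/RAN node
    NETWORK_INTELLIGENCE = auto()  # multiple intelligent NEs achieving swarm intelligence
    SERVICE_INTELLIGENCE = auto()  # network AI offered as a service to third parties

@dataclass
class AIRequest:
    external: bool  # triggered from outside the network via an open interface?
    num_nodes: int  # how many NEs must collaborate to serve it

def classify(req: AIRequest) -> Scenario:
    """Map a request to one of the three 6G network-AI scenarios."""
    if req.external:
        # External services consume network AI as a service; the network
        # executes it without understanding the application logic.
        return Scenario.SERVICE_INTELLIGENCE
    if req.num_nodes > 1:
        return Scenario.NETWORK_INTELLIGENCE
    return Scenario.NE_INTELLIGENCE

print(classify(AIRequest(external=False, num_nodes=3)).name)  # NETWORK_INTELLIGENCE
```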
To support the three scenarios, the 6G
native AI network architecture should have a
unified framework for different types of AI train-
ing and inference. For example, a distributed AI
environment must be built on the 6G network.
Specifically, the 6G native AI network architecture
must be able to: (1) use various native AI capa-
bilities (e.g., connection, computing, data, and
AI training and inference capabilities) of NEs and
terminals; (2) provide on-demand AI, computing,
and data services for networks and third-party
applications; and (3) guarantee the QoAIS in het-
erogeneous, dynamic, fully distributed, and other
complex wireless environments. This is why our proposed solution shifts from a session-oriented to a task-oriented architecture to address the preceding challenges.
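A minimal sketch of requirements (1) and (2)—exposing the native capabilities of NEs and terminals through a unified view and serving them on demand—might look as follows. All class and field names are our own illustration, not a proposed API; requirement (3), QoAIS assurance under dynamic wireless conditions, is only noted in a comment:

```python
from dataclasses import dataclass, field

@dataclass
class Capabilities:
    """Native AI capabilities a node advertises (requirement 1)."""
    bandwidth_mbps: float  # connection
    tops: float            # computing: AI-processor throughput (tera-ops/s)
    datasets: list = field(default_factory=list)  # local data
    models: list = field(default_factory=list)    # loadable AI models

class Registry:
    """Unified view for on-demand AI, computing, and data services
    (requirement 2)."""
    def __init__(self):
        self.nodes = {}

    def register(self, node_id: str, caps: Capabilities) -> None:
        self.nodes[node_id] = caps

    def candidates(self, min_tops: float, needs_model: str) -> list:
        # Requirement (3) would additionally check QoAIS under dynamic,
        # heterogeneous conditions; here we filter static capabilities only.
        return [nid for nid, c in self.nodes.items()
                if c.tops >= min_tops and needs_model in c.models]

r = Registry()
r.register("gNB-1", Capabilities(10_000, 200.0, models=["cnn"]))
r.register("UE-7", Capabilities(100, 4.0, models=["cnn"]))
print(r.candidates(min_tops=50.0, needs_model="cnn"))  # ['gNB-1']
```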
FIGURE 2. Network paradigm changes.
We believe that the 6G network architecture requires the following changes in the design paradigm:
1. Change 1: The object to be managed and controlled in the network changes from sessions to tasks.
2. Change 2: The resources of the object change from one dimension to multiple dimensions, and from homogeneous to heterogeneous.
3. Change 3: The object control mechanism changes from session control to task control.
4. Change 4: The performance indicators of the object change from session QoS to task QoS.

Change 1: From Session to Task

AI tasks differ from traditional sessions in terms of technical objectives and methods.

In terms of technical purposes, a traditional communications system provides session services, typically between terminals or between terminals and application servers, to transmit user data (including voice). Conversely, network AI (i.e., NE intelligence and network intelligence) aims to provide intelligent services for networks and improve communication network efficiency. Service intelligence seeks to provide app-specific intelligent services for third parties. Thus, sessions and AI tasks have different purposes.

In terms of technical methods, to transmit user data, a traditional communications service needs to maintain a QoS assurance mechanism for user-oriented connection channels as well as their lifecycle management, such as E2E tunnels from UEs to base stations and then to the CN. This is necessary to provide QoS guarantees for data transmission. Conversely, AI is a data- and computing-intensive service. Compared with sessions, AI introduces new resources, including computing (e.g., CPU, GPU, and network processing unit (NPU)), data (generated or used by AI), and algorithms (e.g., neural network models and reinforcement learning). Thus, 6G networks need to introduce new resource management mechanisms. However, it is difficult to efficiently implement AI services on a single node due to the bottlenecks in single-point computing, data privacy protection, and ultra-large model storage. Consequently, a new collaboration mechanism in 6G networks is required to implement computing, algorithm, and data collaboration among multiple nodes. Hence, sessions and AI tasks have different technical methods.

These differences show that the session-oriented system cannot support native AI and that a new task-oriented system needs to be designed for the new resource management mechanism and multi-node collaboration mechanism. This article defines a task as the coordination of multi-node and multi-dimensional resources at the 6G network layer to achieve a given objective. For example, federated learning in the network needs the coordination of multiple nodes (the base station and multiple UEs) and the coordination of communication, AI model, and computing resources.

Change 2: From Single-Dimension to Multi-Dimension Heterogeneous Resources

The traditional wireless system establishes tunnels and allocates radio resources for data transmission. Conversely, TONA implements collaboration among the heterogeneous resources of connection, computing, data, and model/algorithm to execute AI tasks. Take an AI inference task as an example. In this case, executors need to obtain certain resources to execute the task. Specifically, the executors need to obtain computing resources, such as computing timeslots; data resources, such as data collected in real time or external data input; and algorithm resources, including a possible AI model such as a graph neural network (GNN), a convolutional neural network (CNN), or reinforcement learning.
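To make the task concept concrete, the sketch below models a task descriptor that bundles the multi-node and multi-dimensional resources described above. The structure and field names are our own illustration of the idea, not a format defined by the article:

```python
from dataclasses import dataclass, field

@dataclass
class ResourceDemand:
    """The heterogeneous resource dimensions a task may require."""
    connection_mbps: float  # connection: required bandwidth
    compute_tops: float     # computing: required AI-processor throughput
    data_sources: list = field(default_factory=list)  # data
    model: str = ""         # model/algorithm, e.g. "gnn", "cnn"

@dataclass
class Task:
    """A task coordinates multi-node, multi-dimensional resources
    at the network layer to achieve a given objective."""
    objective: str
    nodes: list             # executors, e.g. one base station plus UEs
    demand: ResourceDemand

# Example: an in-network federated-learning task spanning a base
# station and three UEs, as in the text's example.
fl_task = Task(
    objective="federated-learning",
    nodes=["gNB-1", "UE-1", "UE-2", "UE-3"],
    demand=ResourceDemand(connection_mbps=50.0, compute_tops=2.0,
                          data_sources=["local-ue-data"], model="cnn"),
)
print(len(fl_task.nodes))  # 4
```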
Change 3: From Session-Control to Task-Control

Unlike session control, task management and control in network AI includes the following functions: (1) decomposing and mapping external services to internal tasks; (2) decomposing and mapping service QoS to task QoS; and (3) providing heterogeneous and multi-node collaboration mechanisms to orchestrate and control the heterogeneous resources of multiple nodes at the infrastructure layer in real time (to implement distributed serial or parallel processing of tasks and real-time QoS assurance). For a simple service request, one service may correspond to, or be mapped as, one task. For a complex service request (e.g., an integration of multiple service flows, or a service flow with numerous calculations), one service may be mapped to multiple nodes for systematic execution.

For function (3), the execution of an AI task requires collaboration in two dimensions:

Heterogeneous Resources Collaboration: The execution of a task may require some or all of the heterogeneous resources. For example, task deployment requires configuring the heterogeneous resources, and task execution requires scheduling them in real time.

Multi-Node Collaboration: First, in a traditional communications network, connection-specific computing is mainly implemented on a single NE, and computing sharing and collaboration are not required. The emergence of AI is accompanied by large-scale AI training, large-model AI inference, and massive perceptual image processing, requiring significantly more computing than traditional networks do. Simply expanding the computing capability of each NE across the entire network would result in high deployment costs. Hence, distributed computing is needed, which completes a task collaboratively among multiple nodes through shared computing. Second, as data ownership awareness grows, data privacy protection requirements become more stringent. For example, the raw data of user equipment (UE) cannot be uploaded to networks for training. Federated learning solves this problem through collaborative learning and gradient transfer at the data layer among multiple nodes. Third, model training consumes substantial computing and storage resources to support native AI; thus, a good model needs to be shared within the network, and collaboration for models among multiple nodes is required.

Change 4: From Session-QoS to Task-QoS

Unlike previous generations of mobile networks, 6G networks are not just channels that serve traditional communications services. Different AI scenarios have different requirements for AI service quality. They demand an indicator mechanism to quantitatively or hierarchically convey user requirements while also orchestrating and controlling the comprehensive effect of AI resources. Therefore, this article proposes the quality of AI service (QoAIS).

The QoS of traditional communication networks mainly considers connection-specific performance indicators, such as the latency and throughput of communication services [8]. In addition to these traditional communication resources, 6G networks will introduce new resources such as computing, algorithms, and data, requiring an extension of the evaluation indicators. At the same time, with the implementation of "Carbon Neutrality" and "Peak Carbon Dioxide Emissions" policies, the global AI industry's attention to data security and privacy, and users' increasing requirements for network autonomy, users will focus on more than just performance indicators in the future. The requirements on aspects such as overhead, security, privacy, and autonomy will increase, and these aspects will become new dimensions for evaluating QoS. Consequently, the QoAIS indicator system needs to be extended from the existing indicators during the initial design [9].

For example, the QoAIS indicators for AI training services are as follows:
1. Efficiency: efficiency indicator boundary, training duration, generalization, reusability, robustness, explainability, consistency between the loss function and optimization objective, and fairness
2. Overhead: storage overhead, computing overhead, transmission overhead, and power consumption
3. Security: storage security, computing security, and transmission security
4. Privacy: data privacy and algorithm privacy
5. Autonomy: fully autonomous, partially autonomous, and manually controllable

QoAIS is an essential input for the network AI orchestration and management system and control functions. The orchestration and management system decomposes and maps QoAIS to generate the QoS requirements of AI tasks, and then maps the task QoS to the QoS requirements of multi-dimensional heterogeneous resources. The management, control, and user plane mechanisms are designed to ensure continuous QoAIS assurance.

Architecture and Key Technologies

This section describes the logical architecture and deployment options of TONA, and QoAIS details.

Logical Architecture of TONA

First, we introduce fundamental concepts in wireless networks. A communications system consists of a management domain and a control domain. The Operations Administration and Maintenance (OAM) deployed in the management domain is used to operate and manage NEs through non-real-time (usually within minutes) management plane signaling. The control domain is deployed on core network (CN) NEs, base stations, and terminals, and features real-time control signaling (usually within milliseconds). For example, an E2E tunnel for a voice call can be established within dozens of milliseconds by control signaling.

Unlike the centralized, homogeneous, and stable AI environment provided by the cloud, network AI faces the following technical challenges when embedded in wireless networks:
(1) AI needs to be distributed on numerous CN NEs, base stations, and UEs; therefore, it is necessary to consider how to manage the massive number of nodes efficiently in the architecture design. (2) The computing, memory, data, and algorithm capabilities of different nodes vary significantly, requiring the architecture design to also consider how to manage these heterogeneous nodes efficiently. (3) The dynamic variation of the channel status and the computing load needs to be factored into the architecture design.

To address the aforementioned challenges, TONA includes two logical functions, as shown in Fig. 3: (1) AI orchestration and management, called Network AI Management & Orchestration (NAMO); and (2) task control. NAMO decomposes and maps AI services to tasks and orchestrates the AI service flows. It is not performed in real time and is generally deployed in the management domain. Task control introduces the Task Anchor (TA), Task Scheduler (TS), and Task Executor (TE) functions in the control domain in three layers. This layered design strikes a balance between the task scope and real-time task scheduling, and effectively manages the numerous heterogeneous nodes while remaining aware of dynamic changes in heterogeneous resources (e.g., channel status and computing load).

The following describes the detailed functionalities of TA, TS, and TE.

TA manages the lifecycle of tasks (including deploying, starting, deleting, modifying, and monitoring tasks) based on task QoS requirements. It also implements collaboration among heterogeneous resources to guarantee coarse-grained QoS in the initial deployment phase.

TS controls and schedules tasks in the task execution phase. It consists of the information collection and resource management modules. Information collection requires that TS sense the computing load, data processing capabilities, algorithm models in use, and channel conditions on a plurality of nodes in real time. Based on this information, TS has a more real-time resource management capability than TA. For example, when the network environment changes, TS adjusts AI models and data processing functions, or schedules connection and computing resources in real time to achieve timely QoS assurance.

TE is responsible for task execution and possible service data interaction. For example, federated learning needs to transfer intermediate gradient information among multiple nodes.

Deployment Architectures

The statuses of TEs (e.g., CPU load, memory, electricity, and UE channel status) change in real time. As such, deploying TA and TS close to each other can reduce the management delay. According to the design logic of wireless networks, the CN and RAN need to be decoupled as much as possible. For example, the CN should be independent of RAN Radio Resource Management (RRM) and Radio Transmission Technology (RTT) algorithms. Therefore, this article recommends that TA/TS be deployed on both the RAN and the CN, named RAN TA/TS and CN TA/TS, respectively. This allows TA to manage TEs flexibly in real time.

FIGURE 4. Four deployment scenarios of TONA.

Four deployment scenarios of TONA are shown in Fig. 4 to describe the necessity and rationality of CN TA and RAN TA. These scenarios are only examples; other deployment scenarios and architectures are possible.

Assume that TA, TS, and TEs are deployed in the RAN to perform federated learning between the base station and UEs. Considering that the 6G architecture is undetermined, this article reuses the 5G RAN architecture for reference. A gNodeB is a 5G base station, which can be deployed in standalone mode or by separating the centralized unit (CU) from the distributed units (DUs). In the latter mode, the CU may be deployed on the cloud for non-real-time signaling control and data transmission, while the DUs may be deployed closer to UEs for real-time resource allocation, data transmission, and retransmission.

Scenario 1: gNodeB + UEs. In this scenario, the gNodeB serves as both TA and TS, and the UEs serve as TEs. Here, a UE is a computing provider and task executor, which accepts task assignment and scheduling from the gNodeB. The Uu interface and Radio Resource Control (RRC) layer between the gNodeB and the UE can be enhanced for task controlling and scheduling purposes.

Scenario 2: CU + DUs. In this scenario, the CU serves as both TA and TS, and the DUs serve as TEs. Here, a DU is the computing provider and task executor. The F1 interface and F1-AP layer between the CU and the DU can be enhanced for task controlling and scheduling purposes.

Scenario 3: CU + DUs + UEs. In this scenario, the CU serves as TA, the DUs as TSs, and the UEs as TEs. Here, a UE is a computing provider and task executor, and the CU is the task manager. A DU observes a task allocated by the CU to UEs, and performs heterogeneous resource scheduling and real-time QoS guarantees. This scenario separates TA from TSs. TSs are deployed lower than TA; TSs can therefore acquire the status of TE heterogeneous resources more quickly to achieve real-time task QoS monitoring and rapid adjustment of heterogeneous resources.
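The three scenarios above (the fourth falls outside this excerpt) can be restated as a small role table in code; this is purely a summary of the text, with no new semantics:

```python
# Role assignment per TONA deployment scenario, restating the text.
# TA = Task Anchor, TS = Task Scheduler, TE = Task Executor.
SCENARIOS = {
    "gNodeB + UEs":   {"TA": "gNodeB", "TS": "gNodeB", "TE": "UEs"},
    "CU + DUs":       {"TA": "CU",     "TS": "CU",     "TE": "DUs"},
    "CU + DUs + UEs": {"TA": "CU",     "TS": "DUs",    "TE": "UEs"},
}

def roles_of(node: str) -> dict:
    """Which roles a given node type plays in each scenario."""
    return {name: [r for r, n in assign.items() if n == node]
            for name, assign in SCENARIOS.items()}

print(roles_of("CU"))
# {'gNodeB + UEs': [], 'CU + DUs': ['TA', 'TS'], 'CU + DUs + UEs': ['TA']}
```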
TABLE 1. Mapping between QoAIS indicators and resource QoS indicators in the AI training service.
QoAIS indicator: performance indicator boundary, training duration, generalization, reusability, robustness, explainability, consistency with optimization objective, and fairness. This maps to the resource-specific indicators below.
- Data. Quantitative: feature redundancy, integrity, data accuracy, and data preparation duration. Non-quantitative: sample space balance, integrity, and sample distribution dynamics.
- Algorithm. Quantitative: performance indicator boundary, training duration, convergence, and optimization objective matching degree. Non-quantitative: robustness, reusability, generalization, explainability, and fairness.
- Computing. Quantitative: computing precision, duration, and efficiency. Non-quantitative: none.
- Connection. Quantitative: bandwidth and jitter, delay and jitter, bit error rate and jitter, and reliability. Non-quantitative: none.
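The two-step decomposition behind this mapping (service QoAIS to task QoS, then task QoS to per-resource QoS requirements) can be sketched as follows. The indicator names, safety margin, and budget split are our own illustrative assumptions, not values from the article:

```python
def decompose_qoais(qoais: dict) -> dict:
    """Step 1: map a service-level QoAIS to task-level QoS targets.
    Here we simply pass targets through with an assumed safety margin."""
    margin = 0.8  # illustrative: leave 20% headroom for scheduling jitter
    return {"max_latency_ms": qoais["max_latency_ms"] * margin,
            "min_accuracy": qoais["min_accuracy"]}

def map_to_resources(task_qos: dict) -> dict:
    """Step 2: map task QoS onto the four resource dimensions
    (connection, computing, data, algorithm), in the spirit of Table 1."""
    lat = task_qos["max_latency_ms"]
    return {
        "connection": {"max_delay_ms": lat / 2},    # half the budget for transport
        "computing":  {"max_duration_ms": lat / 2}, # half for execution
        "data":       {"min_integrity": 0.99},
        "algorithm":  {"min_accuracy": task_qos["min_accuracy"]},
    }

task_qos = decompose_qoais({"max_latency_ms": 10.0, "min_accuracy": 0.9})
print(map_to_resources(task_qos)["connection"]["max_delay_ms"])  # 4.0
```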
(summarized in Table 2) in meeting users' customized AI service requirements:

TABLE 2. Performance comparison of cloud/MEC AI and TONA.
- QoAIS. Cloud/MEC AI: difficult to guarantee personalized QoAIS. TONA: easy to guarantee personalized QoAIS.
- Latency. Cloud/MEC AI: higher latency due to out-of-network processing (e.g., second/minute level). TONA: lower latency due to in-network processing (e.g., millisecond level).
- Resource overhead. Cloud/MEC AI: larger transmission and computation overheads. TONA: smaller transmission and computation overheads.
- Security. Cloud/MEC AI: data privacy is ensured by the application layer. TONA: native security and privacy via in-network processing.

QoAIS Assurance

Dynamic wireless environments require joint optimization of the heterogeneous resources (the connection and the three AI resources) to achieve precise QoAIS assurance. In TONA, all heterogeneous resources are inside the network and can perceive each other. Furthermore, a real-time (within milliseconds) collaboration mechanism is designed at the control layer. Conversely, cloud/MEC AI lacks a collaboration mechanism between the communication resources and the three AI resources, meaning that these resources cannot observe each other in real time. Generally, they observe each other through the management layer (with non-real-time capability openness) or the application layer (within seconds or minutes), which cannot adapt to dynamic wireless environment changes in real time and cannot guarantee QoAIS.

Take device-cloud joint AI training as an example [10], [11]. Cloud AI cannot be aware of the connection's real-time status to adjust the heterogeneous resources, and thus cannot provide customized training solutions for users with different connection performances. Meanwhile, in TONA, the network detects environment changes (such as terminal movement, disconnection, and burst interference) in real time and quickly adjusts the joint training solution. For example, in TONA the network can change the split learning point [12], [13] to reduce the intermediate data size when the UE is far away from the base station. Thus, QoAIS can be achieved for customized AI training services.

Take device-cloud computing offloading as an example. If a terminal's local computing resources do not meet the requirements of computing-intensive services, cloud/MEC AI offloads some computation to the cloud. During the execution of computing tasks, the computing resource utilization of terminals changes in real time (within milliseconds). The non-real-time collaboration of cloud AI (within seconds or minutes) cannot trace users' computing requirements in real time, nor can it promptly offload computing to the cloud. As such, this approach fails to meet users' customized QoAIS requirements. On the other hand, in TONA, during task execution the network can detect the dynamic changes of computing loads on terminals in real time and promptly adjust the computing resource allocation, calculation precision, and serial or parallel computing mode on the network. As such, this approach can ensure QoAIS for customized computing offloading services.

Latency

TONA computing is distributed on NEs closer to UEs, or even directly on UEs to process data locally. This not only achieves real-time, low-latency AI services, but also significantly reduces data transmission. In the cloud/MEC AI mode, a large amount of data needs to be transmitted to the cloud/MEC for training, meaning that E2E data transmission takes longer to complete.

Take joint device-cloud AI inference as an example [14], [15]. Cloud/MEC AI transmits data from devices to the cloud, performs real-time training/inference, and transmits the results back to devices. The long transmission distance causes high latency, making it difficult to meet the requirements of ultra-low-latency scenarios such as the Industrial Internet of Things (IIoT), even if the application server is deployed on the MEC. By contrast, in TONA, data processing is terminated within the network; the E2E transmission latency can be as low as 1 millisecond, enabling ultra-low latency.

Overhead

TONA can optimally allocate resources through the real-time collaboration mechanism of the heterogeneous resources, maximizing the overall resource utilization and reducing the transmission and computing overheads. Conversely, because cloud/MEC AI cannot adapt to dynamic environments, it allocates resources based only on the maximum resource consumption to ensure QoAIS. As a result, the overall resource utilization is low, and the resource overhead is high.

Take joint device-cloud AI training as an example. For cloud/MEC AI, the long device-cloud distance causes large transmission overheads. On the other hand, for TONA, data is processed nearby, effectively reducing data transmission overheads.

Furthermore, cloud/MEC AI cannot measure the quality of wireless connections in real time. The different connection statuses of TEs lead to low AI efficiency and increased computing overhead. In federated learning, for example, straggler terminals may be abruptly disconnected from the network or cause a long delay. If a large amount of straggler data is discarded, the number of training samples is reduced, affecting the convergence efficiency of the current round. If a long delay occurs, the iteration time of the current round is prolonged. However, in TONA, the network can detect each UE's channel status and set a longer local training period for a straggler, reducing the total number of reports when the UE's data rate is low. This improves the overall AI training efficiency and reduces the computing overhead.

Security

TONA has native data security and privacy protection capabilities because it processes data inside the network. Unlike TONA, cloud/MEC AI protects data privacy only at the application layer.

The task concept and TONA proposed in this article support not only AI tasks, but also sensing-, computing-, and data-processing-specific tasks.
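Returning to the straggler handling described under Overhead, the idea of lengthening a slow UE's local training period so that it reports less often can be sketched as follows; the nominal rate, cap, and scaling rule are our own assumptions, not values from the article:

```python
def local_epochs(base_epochs: int, data_rate_mbps: float,
                 nominal_rate_mbps: float = 100.0, max_scale: int = 4) -> int:
    """Give stragglers (UEs with low data rates) a longer local training
    period, so they upload model updates less frequently."""
    if data_rate_mbps >= nominal_rate_mbps:
        return base_epochs
    # Scale the local period up as the rate drops, capped at max_scale x.
    scale = min(max_scale, max(1, round(nominal_rate_mbps / data_rate_mbps)))
    return base_epochs * scale

# A healthy UE keeps the base period; a straggler trains 4x longer
# locally, roughly quartering its number of reporting rounds.
print(local_epochs(2, 120.0))  # 2
print(local_epochs(2, 20.0))   # 8
```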