Efficient Unified Architecture For
ABSTRACT
As the ongoing standardization process of post-quantum schemes yields initial
outcomes, it becomes increasingly important to not only optimize standalone
implementations but also explore the potential of combining multiple schemes into a
single, unified architecture. In this article, we investigate the combination of two
National Institute of Standards and Technology (NIST)-selected schemes: the
Dilithium digital signature scheme and the Kyber key encapsulation mechanism. We
propose a novel set of optimization techniques for a unified hardware
implementation of these leading post-quantum schemes, achieving a balanced
approach between area efficiency and high performance. Our design demonstrates
superior resource efficiency and performance compared to a previously reported
unified architecture (DOI 10.1109/TCSI.2022.3219555), while also achieving results that
are better than, or comparable to, those of standalone implementations. The efficient
combined implementation of lattice-based digital signatures and key
establishment methods can be deployed to establish secure sessions in high-speed
communication networks at servers and gateways. Moreover, the
compact design, which requires few hardware resources, can be deployed directly on small,
cost-effective field programmable gate array (FPGA) platforms that serve
as security co-processors for embedded devices and in the Internet of Things.
How to cite this article Dobias P, Malina L, Hajny J. 2025. Efficient unified architecture for post-quantum cryptography: combining
Dilithium and Kyber. PeerJ Comput. Sci. 11:e2746 DOI 10.7717/peerj-cs.2746
should substitute the legacy asymmetric schemes such as Rivest–Shamir–Adleman (RSA),
Digital Signature Algorithm (DSA), Elliptic Curve Digital Signature Algorithm (ECDSA)
etc. in the next few years during the post-quantum transition period, as foreseen in reports
and recommendations (ANSSI, 2022; NSA, 2022). The deployment of quantum-resistant
schemes will also be motivated by "harvest now, decrypt later" attacks. The results of the
standardization have increased the demand for efficient and secure implementations of
these cryptographic schemes in practical applications. Software implementations are
popular for their flexibility and ease of deployment. However, the significant
computational requirements of post-quantum cryptographic (PQC) algorithms make
software-only approaches insufficient for many high-performance or resource-constrained
environments. Hardware-based implementations of PQC on Field Programmable Gate
Array (FPGA) platforms offer distinct advantages, particularly by providing lower
latencies and, therefore, higher throughput. High-speed communication and operation
can be essential for servers and gateways that manage hundreds to thousands of security
sessions and have to perform key establishment and signing/verification phases efficiently.
This use case is demonstrated in Fig. 1. Additionally, hardware acceleration improves
security by reducing the susceptibility to certain types of side-channel attacks that are often
present in software implementations. Given that ML-KEM and ML-DSA belong to the
same cryptographic family, they are well suited for unified hardware architectures, which
can reduce resource usage and streamline security processes by securing shared
components only once.
In this work, we focus on integrating these schemes into a single hardware
implementation that is compact and efficiently provides (with low hardware resource usage and low latency)
the basic security operations, i.e., key establishment and data signing/verification, that
are usually required when establishing secure communication sessions. The work addresses
two research questions (RQ): RQ1) Are there any new optimization approaches that can be
designed and applied in hardware implementations of lattice-based schemes (Kyber and
Dilithium)? RQ2) How can the lattice-based standards (Kyber and Dilithium) be efficiently
combined, and how efficient can this combination be on hardware (FPGA) platforms in
comparison with standalone hardware implementations?
The rest of this article is organized as follows: “Background” introduces the background of
lattice-based schemes and their main phases. “Optimized Unified Hardware Architecture”
presents the new optimization techniques that are proposed and deployed in our unified
hardware architecture, and addresses RQ1 and RQ2. “Results and Comparison” shows
the results and compares our solution with related works, and addresses RQ2. “Discussion”
discusses practical deployment, limitations, implementation attacks, and open
problems. In “Conclusion”, we conclude this work and present our next steps.
Related work
Since the completion of the NIST PQC standardization, the finalist and winning PQC schemes have been
implemented on various FPGA platforms, tested in various architectures, or co-designed
as Hardware/Software (HW/SW) co-processors. Related works have mostly investigated
implementation efficiency (e.g., speed, area, latency) and security (e.g., preventing side-channel
attacks, hardware trojans, fault injections, etc.).
1. We introduce novel approaches to enhance the level of resource reuse within the unified
architecture. Mainly, we use new memory management that reduces the memory used,
new polynomial arithmetic, sample, and compression units, and an efficient operation
schedule. These optimization steps yield a 50% reduction in BRAM usage in our
hardware implementation compared to Aikata et al. (2023a).
2. We present the most compact and high-performance hardware implementation of a
unified architecture that supports both Dilithium and Kyber for all security levels.
Our solution is compared with related works and also standalone HW
implementations. Our efficient solution can also be deployed in small, cost-effective,
and low-density FPGA platforms.
BACKGROUND
Although the standards refer to Dilithium and Kyber by different names (ML-DSA and ML-KEM, respectively), we
retain their original designations for consistency with existing articles. This section
provides a brief overview of these schemes, highlighting the similarities that are exploited in
the unified design.
Dilithium
Dilithium is a lattice-based digital signature scheme whose security relies on the hardness
of the module learning with errors (M-LWE) and short integer solution (SIS) problems. The
scheme works in three phases: key generation, signing and verification, each involving
multiple operations on polynomials. Below is a brief description of the key operations
involved in these phases.
Polynomial arithmetic
All arithmetic operations on polynomials are performed over the ring
$R_{8380417} = \mathbb{Z}_{8380417}[X]/(X^{256} + 1)$. To enable faster polynomial multiplication, the NTT is
used. Since $\mathbb{Z}_{8380417}$ contains a primitive 512th root of unity, the polynomial multiplication is
carried out using the complete NTT, where the coefficients in the NTT domain
correspond to zeroth-degree polynomials; thus, the multiplication requires only
256 pointwise multiplications.
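The condition for the complete NTT can be checked numerically. The following minimal Python sketch verifies that 512 divides q - 1 for the Dilithium modulus, searches for a primitive 512th root of unity, and shows the resulting coefficient-wise product. The function names are illustrative assumptions and are not taken from the authors' VHDL sources; the forward and inverse transforms themselves are omitted.

```python
Q_DILITHIUM = 8380417  # Dilithium modulus q
N = 256                # number of coefficients per polynomial


def find_primitive_root_of_unity(order: int, q: int) -> int:
    """Return an element of multiplicative order `order` modulo the prime q.

    `order` is assumed to be a power of two that divides q - 1.
    """
    if (q - 1) % order != 0:
        raise ValueError(f"no root of unity of order {order} modulo {q}")
    for x in range(2, q):
        r = pow(x, (q - 1) // order, q)
        # For a power-of-two order, r is primitive iff r^(order/2) == -1.
        if pow(r, order // 2, q) == q - 1:
            return r
    raise ValueError("no primitive root of unity found")


def pointwise_mul(a_hat, b_hat, q=Q_DILITHIUM):
    """Multiply two polynomials that are already in the complete-NTT domain."""
    return [(x * y) % q for x, y in zip(a_hat, b_hat)]


# A primitive 2N-th (512th) root exists, so X^256 + 1 splits into linear factors.
zeta = find_primitive_root_of_unity(2 * N, Q_DILITHIUM)
```

Because every NTT-domain coefficient is a zeroth-degree polynomial, `pointwise_mul` needs exactly one modular multiplication per coefficient, i.e., 256 per polynomial product.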
Coefficients sampling
Dilithium uses two forms of coefficient sampling, both of which involve rejection
sampling. The first occurs during the generation of the matrix A, where rejection sampling
is applied to the output of the SHAKE-128 hash function. The second occurs during the
generation of the secret key and error vectors, where rejection sampling is applied to the
output of the SHAKE-256 hash function. Additionally, a SampleInBall operation is used
to sample the challenge polynomial; however, this operation differs significantly and is
handled by its own dedicated unit.
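The first form of sampling can be sketched in software as follows. This is a simplified illustration, not the authors' hardware unit: it assumes that each candidate is built from three SHAKE-128 output bytes with the top bit cleared (a 23-bit value) and is accepted only if it is smaller than q. The `seed` argument is a hypothetical plain byte string; in the actual scheme, the XOF input also encodes the position of the polynomial in the matrix A.

```python
import hashlib

Q_DILITHIUM = 8380417  # Dilithium modulus q


def sample_uniform(seed: bytes, count: int = 256, q: int = Q_DILITHIUM):
    """Rejection-sample `count` coefficients in [0, q) from SHAKE-128(seed)."""
    nbytes = 3 * count  # enough if no candidate is rejected
    while True:
        # SHAKE is an XOF: a longer digest extends the same output stream.
        stream = hashlib.shake_128(seed).digest(nbytes)
        coeffs = []
        for i in range(0, nbytes, 3):
            # Three bytes, little-endian, masked to a 23-bit candidate.
            cand = int.from_bytes(stream[i:i + 3], "little") & 0x7FFFFF
            if cand < q:  # reject candidates outside [0, q)
                coeffs.append(cand)
                if len(coeffs) == count:
                    return coeffs
        nbytes *= 2  # too many rejections: request a longer stream
```

Since q is close to 2^23, only a small fraction of candidates is rejected, which is why a hardware sampler can keep up with the Keccak output with little buffering.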
Coefficients unpacking/packing
To reduce the size of keys and signatures, and consequently lower memory and bandwidth
requirements, the coefficients are packed into and unpacked from byte arrays.
During these operations, only a specific number of bits of each coefficient is used. In
particular, unpacking and packing operations are required for coefficients of 20-bit, 18-bit,
13-bit, 10-bit, 6-bit, 4-bit, and 3-bit lengths.
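A generic bit-packing routine illustrates the idea. The sketch below assumes little-endian bit order and unsigned coefficients; any mapping of signed coefficients to an unsigned range before packing is omitted, and the function names are hypothetical.

```python
def pack(coeffs, bits):
    """Pack unsigned `bits`-wide coefficients into a little-endian byte string."""
    acc = 0
    for i, c in enumerate(coeffs):
        if not 0 <= c < (1 << bits):
            raise ValueError("coefficient does not fit into the given width")
        acc |= c << (bits * i)  # coefficient i occupies bits [bits*i, bits*(i+1))
    return acc.to_bytes((bits * len(coeffs) + 7) // 8, "little")


def unpack(data, bits, count):
    """Recover `count` coefficients of width `bits` from a packed byte string."""
    acc = int.from_bytes(data, "little")
    mask = (1 << bits) - 1
    return [(acc >> (bits * i)) & mask for i in range(count)]
```

For example, a polynomial with 256 coefficients of 13 bits each occupies 256 * 13 / 8 = 416 bytes instead of the 1,024 bytes needed with 32-bit words.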
Kyber
Kyber is a lattice-based key encapsulation scheme whose security relies on the hardness of
the M-LWE problem. The scheme works in three phases: key generation, key
encapsulation and key decapsulation, each involving multiple operations on polynomials.
Below is a brief description of the key operations involved in these phases.
Polynomial arithmetic
All arithmetic operations on polynomials are performed over the ring
$R_{3329} = \mathbb{Z}_{3329}[X]/(X^{256} + 1)$. Similarly to Dilithium, the NTT is used to enable faster
polynomial multiplication. However, since $\mathbb{Z}_{3329}$ contains a primitive 256th but no primitive 512th root of unity,
the polynomial multiplication is carried out using the incomplete NTT, where the
coefficients in the NTT domain correspond to first-degree polynomials; thus, the
multiplication process is more complex.
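The extra complexity can be made concrete with a short sketch. In the incomplete NTT, each pair of NTT-domain coefficients represents a polynomial a0 + a1*X, and two such polynomials are multiplied modulo (X^2 - zeta), where zeta is a power of a 256th root of unity. The function name `basemul` and the example value of zeta below are illustrative assumptions.

```python
Q_KYBER = 3329  # Kyber modulus q

# 256 divides q - 1 but 512 does not, so the NTT stops one layer early.
HAS_256TH_ROOT = (Q_KYBER - 1) % 256 == 0
HAS_512TH_ROOT = (Q_KYBER - 1) % 512 == 0


def basemul(a, b, zeta, q=Q_KYBER):
    """Multiply a0 + a1*X by b0 + b1*X modulo (X^2 - zeta) and modulo q."""
    a0, a1 = a
    b0, b1 = b
    # X^2 is replaced by zeta, so the a1*b1 term folds into the constant term.
    c0 = (a0 * b0 + a1 * b1 * zeta) % q
    c1 = (a0 * b1 + a1 * b0) % q
    return c0, c1
```

Each of the 128 products therefore costs five modular multiplications in this schoolbook form instead of the single multiplication per coefficient needed in Dilithium.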
Coefficients sampling
Similarly to Dilithium, Kyber also uses two types of coefficient sampling, but only one uses
rejection sampling. The first occurs during the generation of the matrix A, where rejection
sampling is applied to the output of the SHAKE-128 hash function. The second is used for
sampling the secret key and error vectors, during which only a specific number of bits
(without rejection) are sampled from the output of SHAKE-256.
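The second type of sampling can be sketched as a centered binomial distribution: each coefficient consumes a fixed number of bits (2*eta) and is computed as the difference of two bit counts, so no candidate is ever rejected. The sketch below is a simplified software illustration; the `seed` argument is a hypothetical byte string, and the nonce handling of the real scheme is omitted.

```python
import hashlib


def cbd(seed: bytes, eta: int = 2, count: int = 256):
    """Sample `count` coefficients in [-eta, eta] from SHAKE-256(seed)."""
    # Exactly 2*eta bits per coefficient are needed, so the length is fixed.
    stream = hashlib.shake_256(seed).digest(2 * eta * count // 8)
    bits = int.from_bytes(stream, "little")
    low_mask = (1 << eta) - 1
    coeffs = []
    for i in range(count):
        chunk = (bits >> (2 * eta * i)) & ((1 << (2 * eta)) - 1)
        a = bin(chunk & low_mask).count("1")  # Hamming weight of the first eta bits
        b = bin(chunk >> eta).count("1")      # Hamming weight of the next eta bits
        coeffs.append(a - b)
    return coeffs
```

The fixed input length means that the sampling time is constant, in contrast to the rejection sampling used for the matrix A.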
Coefficients compression/decompression
Kyber uses compression to reduce the coefficient sizes by discarding the least significant
bits. The compression function is defined as $\lceil (2^d/q) \cdot x \rfloor \bmod 2^d$, and decompression as
$\lceil (q/2^d) \cdot x \rfloor$, where $\lceil \cdot \rfloor$ denotes rounding to the nearest integer. Decompressing a coefficient and then compressing it again always
gives the same value.
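The round-trip property can be checked with integer arithmetic only. The sketch below follows the Kyber specification, in which the decompression result is a value modulo q and rounding is to the nearest integer with ties rounded up; the function names are illustrative.

```python
Q_KYBER = 3329  # Kyber modulus q


def compress(x: int, d: int, q: int = Q_KYBER) -> int:
    """Round (2^d / q) * x to the nearest integer and reduce modulo 2^d."""
    return ((x << d) + q // 2) // q % (1 << d)


def decompress(y: int, d: int, q: int = Q_KYBER) -> int:
    """Round (q / 2^d) * y to the nearest integer; the result lies in [0, q)."""
    return (y * q + (1 << (d - 1))) >> d


def roundtrip_holds(d: int) -> bool:
    """Check compress(decompress(y)) == y for every d-bit value y."""
    return all(compress(decompress(y, d), d) == y for y in range(1 << d))
```

The opposite direction is lossy: `decompress(compress(x, d), d)` generally differs from `x` by a small rounding error, which is why compression is applied only where the scheme tolerates this noise.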
Coefficients decoding/encoding
To reduce the size of keys and ciphertexts, coefficients are encoded to and decoded from byte
arrays. During these operations, only a specific number of bits of each coefficient is
used. The decoding and encoding process must support coefficients of 12-bit, 11-bit, 10-bit,
5-bit, 4-bit, and 1-bit lengths.
Memory management
Our design employs two dual-port RAMs for coefficient storage: a main memory and a
temporary memory, both with a width of 96 bits. This configuration allows each RAM to
store either 4 Dilithium coefficients or 8 Kyber coefficients, matching the number of
coefficients processed in parallel by other units. The main memory has a depth of 4,096,
sufficient to store all necessary polynomials for any phase of Dilithium or Kyber
operations. In contrast, the temporary memory, with a depth of 512, is used to temporarily
store polynomials that are decoded from the input or sampled. This allows the
polynomial arithmetic unit to process polynomials stored in the main memory in parallel
with the preparation of new polynomials in the temporary memory. This arrangement
optimizes the operation schedule by enabling continuous utilization of the arithmetic unit,
thereby reducing idle times that would otherwise occur if the unit had to wait for
polynomial loading or sampling. The selected depth of the temporary memory ensures that
a sufficient number of polynomials is available in the memory to allow this continuous
processing.
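The stated word width and depths determine how many polynomials fit into each memory. The following sketch works this out under the assumption that a 96-bit word is split into four 24-bit lanes for Dilithium and eight 12-bit lanes for Kyber; this lane layout and all names are illustrative and may differ from the addressing used in the actual VHDL design.

```python
WORD_BITS = 96     # width of both RAMs
MAIN_DEPTH = 4096  # depth of the main memory
TEMP_DEPTH = 512   # depth of the temporary memory
N = 256            # coefficients per polynomial

# Assumed lane widths: 96 / 4 = 24 bits and 96 / 8 = 12 bits.
LANE_BITS = {"dilithium": 24, "kyber": 12}


def pack_word(coeffs, scheme):
    """Place 4 (Dilithium) or 8 (Kyber) coefficients into one 96-bit word."""
    lane = LANE_BITS[scheme]
    if len(coeffs) != WORD_BITS // lane:
        raise ValueError("wrong number of coefficients for one word")
    word = 0
    for i, c in enumerate(coeffs):
        if not 0 <= c < (1 << lane):
            raise ValueError("coefficient does not fit into its lane")
        word |= c << (lane * i)
    return word


def unpack_word(word, scheme):
    """Split a 96-bit word back into its coefficients."""
    lane = LANE_BITS[scheme]
    mask = (1 << lane) - 1
    return [(word >> (lane * i)) & mask for i in range(WORD_BITS // lane)]


def polynomials_per_memory(depth, scheme):
    """Number of whole polynomials that fit into a memory of the given depth."""
    words_per_poly = N // (WORD_BITS // LANE_BITS[scheme])
    return depth // words_per_poly
```

Under these assumptions, the main memory holds 64 Dilithium or 128 Kyber polynomials, and the temporary memory holds 8 Dilithium or 16 Kyber polynomials.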
As Dilithium requires significantly more memory to store the polynomials, the memory
optimizations were done only for Dilithium phases. On the other hand, during Kyber
phases, the additional space is used to enable efficient processing.
the previous section. This allows the intermediate results for Kyber to be stored in memory
regions that remain unused during Kyber’s computation.
Compression unit
The compression unit unifies all operations that modify the bit size of the coefficients, be it
by reducing or extending it. It performs compression and decompression of all coefficient
sizes for Kyber, and Decompose, Power2round and coefficient modification before
Figure 6 Operations schedule of vector-matrix multiplication during Dilithium and Kyber key generations.
1. The vector and matrix polynomials are sampled in an interleaved manner because the sampling of the
matrix polynomial takes slightly more cycles than the multiplication. This can be
compensated for by the depth of the temporary memory. However, if all the
multiplications happened without interleaving with the NTT operation, as was done in
previous work, the depth of the temporary memory would have to increase.
2. During the second phase of multiplication, in-place addition is performed by the
butterfly units. This is possible because, while coefficients are loaded from temporary
memory, the coefficients in the main memory can be accessed simultaneously.
The second example focuses on the Kyber scheme. Since polynomial multiplication in
Kyber occurs in two phases (with the second phase highlighted in orange), it takes more
cycles than rejection sampling, making interleaving unnecessary. Additionally, to take
advantage of in-place addition with the error vector, it is beneficial to first sample and
Minor optimizations
Some minor optimizations were also applied to reduce the total area consumption and
to remove critical paths, thereby increasing the working frequency of the design.
Data storing
As all of the internal data are processed in 64-bit transactions, we were able to use a 64-bit
wide Look-Up Table Random Access Memory (LUTRAM) to store all the necessary data
(seeds for key generation and sampling, the challenge in Dilithium, …), removing the
need for the 2,048 registers that would otherwise be required.
Results
Table 1 presents the hardware resource utilization of the primary components in our
design, as well as the total resource usage for the top-level component that implements
both the Dilithium and Kyber schemes. It is important to note that the sum of the
individual components’ resource usage is lower than the total design utilization, as some
utility components, such as memories and state machines, are not included in the
individual breakdown. The top-level component utilizes 17,138 Look-Up Tables (LUTs),
6,559 Flip-Flops (FFs), 4 Digital Signal Processing (DSP) blocks, and 12.5 Block Random Access
Memories (BRAMs), with the largest impact from the Keccak, polynomial arithmetic, and
compression units. While the Keccak and polynomial arithmetic units have the most
significant impact in previous research and are therefore the focus of most optimizations,
the impact of the compression unit in our design is notable because it combines multiple
operations. These include Kyber’s compression/decompression and Dilithium’s
power2round and decompose, which are typically reported as separate operations in
existing studies. For the Artix-7 platform, this represents 82% LUT, 16% FF, 5% DSP, and
25% BRAM utilization. The fully routed design is shown in Fig. 7, with the three components with the highest
utilization highlighted.
Table 2 presents the performance results for all security levels of Kyber and Dilithium
with our implementation targeting a working frequency of 375 MHz. The table shows both
the number of cycles and the corresponding execution time for each phase. For the
Dilithium signing phase, we report the best-case scenario, where the signature is valid in a
single iteration.
Comparison
When comparing our results with existing work, we primarily compare them with designs
of unified architectures. To the best of our knowledge, however, only the works of Aikata et al. (2023a) and Aikata et al.
(2023b) implemented hardware designs that can be compared with ours.
It is important to note that the Aikata et al. (2023b) design unifies Dilithium with
Saber, making a direct comparison not entirely fair. Therefore, we also include
comparisons with Dilithium-only and Kyber-only implementations for a broader
evaluation.
computation, while critical path removal and pipelining maximize operating frequency,
resulting in substantial performance improvements. In particular, for the highest security
level, our implementation achieves a performance improvement of 2.2x/1.9x/2.4x for
Dilithium across all phases and 2.3x/2.8x/2.5x for Kyber across all phases, along with a
reduction in area usage by 27% in terms of LUTs and 33% in terms of FFs. In the case of
Aikata et al. (2023b), we compare only the Dilithium portion of their design. Our
implementation again achieves superior resource utilization and performance, with
approximately half the BRAM usage and even more significant gains in performance.
Although differences in area utilization are less pronounced, our design still exhibits better
results. In Table 3, both implementations are compared with our work.
Zhao et al. (2021) and Wang et al. (2022) focus on reduced
area usage while maintaining reasonable performance. Zhao et al. (2021) employ a
segmented pipelining technique that reduces storage requirements and processing time.
Their BRAM usage is slightly lower than ours and their design offers better performance, but their overall area utilization
remains higher. Wang et al. (2022) implement only the
core functions in hardware, relying on software for pre-processing and post-processing.
Their design achieves minimal area usage, yet it remains marginally higher than ours.
DISCUSSION
The results in the previous section demonstrate significant advantages of the unified
design, particularly with regard to reduced area utilization. This reduction stems from
resource sharing between the DSA and KEM units, highlighting a key efficiency: in a
scenario where standalone implementations are used for DSA and KEM separately, each
would require its own resources. In our unified design, however, the implementation of
Kyber requires minimal additional area, as most resources are already allocated for
Dilithium. This essentially renders Kyber support nearly “free” in terms of area costs
within the unified framework, a substantial advantage for system design. This unified
approach is beneficial in two primary contexts. First, on high-efficiency server platforms,
where support for multiple cryptographic schemes is necessary, the unified design provides
streamlined resource use and flexibility.
Practical deployment
Server environments often require varied cryptographic algorithms, and this design
supports multiple variants without having to replicate resource blocks for each scheme
independently. In particular, servers and gateways in high-speed communication networks
can deploy this efficient solution to manage secure sessions. Second, for resource-constrained
platforms, such as IoT devices, the unified design offers an efficient means to
implement both DSA and KEM without the need for separate hardware for each function.
This approach not only reduces the area, but may also lead to lower power consumption,
which is crucial for battery-operated or low-power devices.
Implementation attacks
While the theoretical security of both cryptographic schemes is well-established,
implementation-specific vulnerabilities remain a concern. Potential threats include side-
channel attacks, as demonstrated in works such as Primas, Pessl & Mangard (2017),
Open problems
Future work could explore further optimizations, especially for constrained hardware
platforms such as small FPGA boards. One such modification could involve reducing the number
of coefficients processed in parallel to further minimize area usage. Another possible
optimization is to tailor the design to specific variants of the schemes, rather than
supporting all variants universally. These changes would enable a more application-specific
implementation that conserves both area and power.
Another open issue is an in-depth investigation of side-channel leakage, which
remains possible even though the proposed design uses parallel processing, which reduces the
signal-to-noise ratio and thus hinders simple tracing of secret data.
CONCLUSION
In this article, we proposed a set of novel optimization techniques for the unified hardware
implementation of two leading post-quantum cryptographic schemes: Dilithium and
Kyber. The optimization steps mainly concerned improved memory management and
operation scheduling, as well as more efficient components such as the polynomial arithmetic unit,
the polynomial sample unit, and the compression unit. Our design focuses on achieving a balance
between area efficiency and high performance, with the top component utilizing only 17.1k
LUTs, 6.6k FFs, four DSPs, and 12.5 BRAMs, while achieving a working frequency of 375
MHz on the high-efficiency Zynq UltraScale+ platform and 160 MHz
on the resource-constrained Artix-7 platform. These figures represent a significant
improvement over existing unified architectures, particularly with a nearly 50% reduction
in BRAM usage compared to Aikata et al. (2023a). Moreover, we demonstrate that our
unified solution is comparable with standalone implementations in terms of
hardware resources and efficiency, while saving costs by deploying only one compact
implementation, which is beneficial on smaller and more cost-efficient FPGA platforms
such as Artix-7.
ACKNOWLEDGEMENTS
During the preparation of this work, the authors used ChatGPT and Writefull to
summarize information and paraphrase it for better presentation and understanding. After
using these tools, the authors reviewed and edited the content as needed and take full
responsibility for the content of the published article.
Funding
This work is supported by the Ministry of the Interior of the Czech Republic under Grant
VJ01010008. The funders had no role in study design, data collection and analysis, decision
to publish, or preparation of the manuscript.
Grant Disclosures
The following grant information was disclosed by the authors:
Ministry of the Interior of the Czech Republic: VJ01010008.
Competing Interests
The authors declare that they have no competing interests.
Author Contributions
- Patrik Dobias conceived and designed the experiments, performed the experiments,
analyzed the data, performed the computation work, prepared figures and/or tables,
authored or reviewed drafts of the article, and approved the final draft.
- Lukas Malina conceived and designed the experiments, analyzed the data, prepared
figures and/or tables, authored or reviewed drafts of the article, and approved the final
draft.
- Jan Hajny conceived and designed the experiments, analyzed the data, prepared figures
and/or tables, authored or reviewed drafts of the article, and approved the final draft.
Data Availability
The following information was supplied regarding data availability:
The VHDL sources with Python testbenches are available at GitLab and Zenodo:
- https://gitlab.com/brno-axe/pqc/diky.
- Dobias, P., Malina, L., & Hajny, J. (2025). Efficient Unified Architecture for Post-Quantum
Cryptography: Combining Dilithium and Kyber. Zenodo. https://doi.org/10.5281/zenodo.14891518.