Architecture Explorations for Elliptic Curve Cryptography on FPGAs

A THESIS

submitted by

CHESTER REBEIRO

for the award of the degree
of
MASTER OF SCIENCE
(by Research)
This is to certify that the thesis titled Architecture Explorations for Elliptic Curve Cryptography on FPGAs, submitted by Chester Rebeiro to the Indian Institute of Technology Madras for the award of the degree of Master of Science, is a bonafide record of the research work done by him under my supervision. The contents of this thesis, in full or in parts, have not been submitted to any other Institute or University for the award of any degree or diploma.
ACKNOWLEDGEMENTS

Foremost, I would like to thank my guide Dr. Debdeep Mukhopadhyay, who shared a lot of his experience and ideas with me. I appreciate his professionalism, planning, and constant involvement in my research. I cherish the time we spent in discussions and in the laboratory poring over problems. Working under him has sharpened my research skills and increased my appetite to work in cryptography.
I am grateful to Dr. Kamakoti and Dr. Shankar Balachandran for their encour-
agement, advice, and help whenever needed. I am indebted to the RISE lab and the
Computer Science Department for offering me a fabulous environment to work and
study.
I would like to take this opportunity to acknowledge several friends and lab mates
who made my stay at IIT Madras exciting and unforgettable. I acknowledge the help
received from Noor on innumerable occasions. I would especially like to thank him
for helping me out with various tool flows. Shoaib, for the discussions that we had on
technical as well as non technical topics. Rajesh, for being so easy to connect to, and
Venkat among all things for letting me know the best Idly joints in Chennai. I thank
Pavan, Shyam, Sadgopan, Parthasarthy, and Lalit for working along with me on several
courses and assignments.
I would like to thank my wife Sharon and my parents for the love and encouragement I received. Without their support this thesis would not have been possible. I would like to thank my grandmother for her prayers and for being my role model for hard work. I would like to dedicate this thesis to her.
Chester Rebeiro
ABSTRACT
The current era has seen an explosive growth in communications. Applications like online banking, personal digital assistants, mobile communication, smartcards, etc. have
emphasized the need for security in resource constrained environments. Elliptic curve
cryptography (ECC) serves as a perfect cryptographic tool because of its short key sizes
and security comparable to that of other standard public key algorithms. However,
to match the ever increasing requirement for speed in today’s applications, hardware
acceleration of the cryptographic algorithms is a necessity. As a further challenge, the
designs have to be robust against side channel attacks.
This thesis explores efficient hardware architectures for elliptic curve cryptography
over binary Galois fields. The efficiency is largely affected by the underlying arithmetic
primitives. The thesis therefore explores FPGA designs for two of the most important
field primitives, namely multiplication and inversion. FPGAs are reconfigurable hardware platforms offering flexibility and lower costs, much like software. However, designing on FPGA platforms is challenging because of the large granularity, limited resources, and large routing delays. The smallest programmable entity in an FPGA is the look up table (LUT). The arithmetic algorithms proposed in this thesis maximize the utilization of LUTs on the FPGA.
A novel finite field multiplier based on the recursive Karatsuba algorithm is pro-
posed. The proposed multiplier combines two variants of Karatsuba, namely the gen-
eral and the simple Karatsuba multipliers. The general Karatsuba multiplier has a large gate count, but for small sized multiplications it is compact because it utilizes LUT resources efficiently. For large sized multiplications, the simple Karatsuba multiplier is efficient as it requires fewer gates. The proposed hybrid multiplier performs the initial recursions using the simple algorithm, while the final small sized multiplications are done using the general algorithm. The multiplier thus obtained has the best area-time product compared to reported literature.
The proposed primitives are organized into an elliptic curve crypto processor (ECCP), which has one of the best timings and area-time products compared to reported works. We conclude that the performance of an ECCP is significantly enhanced if the underlying primitives are carefully designed. Further, a side channel attack based on simple timing and power analysis is demonstrated on the ECCP. The construction of the ECCP is then modified to mitigate such attacks.
TABLE OF CONTENTS

ACKNOWLEDGEMENTS
ABSTRACT
LIST OF TABLES
LIST OF FIGURES
ABBREVIATIONS

1 Introduction
1.1 Motivation
1.2 Contribution of the Thesis
1.3 Organization of the Thesis

2 A Survey
2.1 Elliptic Curve Cryptography
2.2 Engineering an Elliptic Curve Crypto Processor
2.3 Hardware Accelerators for ECCP
2.3.1 FPGA Architecture
2.4 Side Channel Attacks
2.5 Related Work
2.6 Conclusion

3 Mathematical Background
3.1 Abstract Algebra
3.1.1 Groups, Rings and Fields
3.1.2 Binary Finite Fields
3.2 Elliptic Curves
3.2.1 Projective Coordinate Representation
3.3 Conclusion
6.1.2 Finite Field Arithmetic Unit
6.1.3 Control Unit
6.2 Point Arithmetic on the ECCP
6.2.1 Point Doubling
6.2.2 Point Addition
6.3 The Finite State Machine (FSM)
6.4 Performance Evaluation
6.5 Conclusion
D.1 Equations for Arithmetic in Affine Coordinates
D.1.1 Point Inversion
D.1.2 Point Addition
D.1.3 Point Doubling
D.2 Equations for Arithmetic in LD Projective Coordinates
D.2.1 Point Inversion
D.2.2 Point Addition
D.2.3 Point Doubling
LIST OF TABLES

A.1 Basepoint and Curve Constants used for Verification of the ECCP and the SR-ECCP
A.2 ECCP System Specifications on the Dini Hardware
LIST OF FIGURES
7.1 Power Trace for a Key with all 1s
7.2 Power Trace for a Key with all 0s
7.3 Power Trace when k = (B9B9)_16
7.4 Always Add Method to Prevent SPA
7.5 FSM for SR-ECCP
7.6 Register File for SR-ECCP
7.7 Power Trace when k = (B9B9)_16
ABBREVIATIONS
AU Arithmetic Unit
ASIC Application Specific Integrated Circuit
DPA Differential Power Analysis
ECC Elliptic Curve Cryptography
ECCP Elliptic Curve Crypto Processor
ECDLP Elliptic Curve Discrete Logarithm Problem
EEA Extended Euclid’s Algorithm
FPGA Field Programmable Gate Array
FSM Finite State Machine
GF Galois Field
ITA Itoh-Tsujii Algorithm
LD Lopez-Dahab
LUT Look Up Table
RSA Rivest Shamir Adleman
SPA Simple Power Analysis
SR-ECCP SPA Resistant Elliptic Curve Crypto Processor
VCD Value Change Dump
CHAPTER 1
Introduction
This era has seen an astronomical increase in communications over wired and wireless networks. Every day, thousands of transactions take place over the world wide web. Several of these transactions carry critical data that needs to be kept confidential, transactions that need to be validated, and users that must be authenticated. These requirements call for a rugged security framework to be in force.
There are two types of cryptographic algorithms: symmetric key and asymmetric key algorithms. Symmetric key cryptographic algorithms have a single key for both encryption and decryption. These are the most widely used schemes, preferred for their high speed and simplicity. However, they can be used only when the two communicating parties have agreed on the secret key. This could be a hurdle in practical cases, as it is not always easy for users to exchange keys. In asymmetric key cryptographic algorithms, two keys are involved: a private key and a public key. The private key is kept secret while the public key is known to everyone. Encryption is done with the public key, and the encrypted message can only be decrypted with the corresponding private key. The security of these algorithms depends on the hardness of deriving the private key from the public key. Although slow and highly complex, asymmetric key cryptography has immense advantages. The main advantage is that the underlying primitives used are based on well known problems such as the integer factorization and discrete logarithm problems. These problems have been studied extensively and their hardness has not been contradicted after years of research. This is unlike symmetric key cryptography, where the strength of the algorithm relies on combinatorial techniques. The security of such algorithms is not proven and does not rely on well researched problems in the literature. The most widely used asymmetric key crypto algorithm is RSA [2]. Of late, asymmetric crypto algorithms based on elliptic curves have been rapidly gaining popularity due to the higher level of security offered at lower key sizes. Several security standards have emerged which use elliptic curves for the underlying security algorithm.
1.1 Motivation
There are two schemes for developing efficient cryptographic implementations. The first focuses on implementing and optimizing the cryptographic algorithms on software platforms. This has the advantage of being low cost, as no additional hardware is required. However, the benefits obtained by this method are restricted by the architectural limitations of the microprocessor. For example, arithmetic on large numbers cannot be done as efficiently on microprocessors as it can be performed on dedicated hardware. Such arithmetic is the norm in public key cryptographic algorithms. Besides, software can very easily be tampered with, thus compromising the security of the application.
main differences occur because of the inherent differences in the libraries and the architectures. FPGAs have fixed resources, a look up table (LUT) based architecture, and larger interconnect delays. Hence a design on an FPGA must be carefully built to utilize the resources well and satisfy the timing constraints of the FPGA library. In this work we design and implement a side channel attack (SCA) resistant elliptic curve processor on an FPGA platform.
In this thesis, architectures for a public key crypto algorithm based on elliptic curves [9–11] are explored. The architectural explorations are targeted at reconfigurable platforms. The contributions of this thesis are as follows.

• The most complex finite field operation in elliptic curve cryptography (ECC) is the multiplicative inverse. The thesis proposes a novel inversion algorithm for FPGA platforms. The proposed algorithm is a generalization of the Itoh-Tsujii inversion algorithm [13]. Evidence has been furnished and supported with experimental results to show that the proposed inversion algorithm outperforms existing results. The proposed method is demonstrated to be scalable with respect to field sizes.
• The work presents the design of a high performance elliptic curve crypto processor (ECCP) for an elliptic curve over the finite field GF(2^233). The chosen elliptic curve is one of the curves selected for the digital signature standard [14]. The high performance is obtained by efficient implementations of the underlying finite field arithmetic. The processor is synthesized for Xilinx's FPGA [15] platform and is shown to be one of the fastest reported implementations on FPGA.
• Chapter 2 contains a brief introduction to ECC and covers aspects of engineering an elliptic curve processor. A survey is made of existing elliptic curve crypto processors reported in literature. The chapter also contains a brief introduction to FPGA architecture and side channel attacks.
hybrid Karatsuba multiplier is proposed for FPGA platforms and shown to have the best area-time product compared to existing works.
• Chapter 6 integrates the various finite field arithmetic primitives into an elliptic
curve crypto processor. The efficient underlying primitives result in one of the
fastest reported elliptic curve crypto processors.
• Chapter 8 has the conclusion of the thesis and future directions of research in this
area of work.
• Appendix A has details of how the correctness of the ECCP was verified and the
testing of the ECCP on a FPGA hardware platform. Appendix B has a list of the
finite fields that were used to test the scalability of the proposed inverse algorithm.
Appendix C has instructions to use XPower to obtain the power trace of an FPGA.
Appendix D has derivations for the elliptic curve arithmetic equations. Appendix
E has derivations for the gate requirements for the simple Karatsuba multiplier.
CHAPTER 2
A Survey
Definition 2.0.1 A symmetric key cryptosystem can be defined by the tuple (P, C, K, E, D) [16], where P is a finite set of plaintexts, C is a finite set of ciphertexts, K is the keyspace, a finite set of possible keys, and for each K ∈ K there is an encryption rule e_K ∈ E and a corresponding decryption rule d_K ∈ D such that d_K(e_K(x)) = x for every plaintext x ∈ P.

The keys for both encryption and decryption are the same and must be kept secret. This leads to problems related to key distribution and key management. In 1976, Diffie and Hellman [17] invented asymmetric key cryptography, which solved the problems of key distribution and management. Asymmetric algorithms use a pair of keys, one for encryption and the other for decryption.
Definition 2.0.2 A function f(x) from a set X to a set Y is called a one-way function if f(x) can efficiently be computed but the computation of f^{-1}(x) is computationally intractable.

Definition 2.0.3 A trapdoor one-way function is a one-way function f(x) if and only if there exists some supplementary information (usually the secret key) with which it becomes feasible to compute f^{-1}(x).
Thus, trapdoor one way functions rely on intractable problems in computer science. An example of an intractable problem is the integer factorization problem, which states that given an integer n, one has to obtain its prime factorization, i.e. find n = p_1^{e_1} p_2^{e_2} p_3^{e_3} · · · p_k^{e_k}, where each p_i is a prime number and e_i ≥ 1. Solving the problem of factoring the product of prime numbers is considered computationally difficult for properly selected primes of size at least 1024 bits. This forms the basic security assumption of the famous RSA algorithm [2]. Another intractable problem, the elliptic curve discrete logarithm problem (ECDLP), has given rise to new asymmetric cryptosystems based on elliptic curves.
2.1 Elliptic Curve Cryptography

Elliptic curves have been studied for over a hundred years and have been used to solve a diverse range of problems. For example, elliptic curves were used in proving Fermat's last theorem, which states that x^n + y^n = z^n has no non-zero integer solutions for x, y, and z when n > 2 [18].
The use of elliptic curves in public key cryptography was first proposed independently by Koblitz [19] and Miller [10] in the 1980s. Since then, there has been an abundance of research on the security of ECC. In the 1990s, ECC began to be accepted by several accredited organizations, and several security protocols based on ECC [14, 20, 21] were standardized. The main advantage of ECC over conventional asymmetric cryptosystems [2] is the increased security offered with smaller key sizes. For example, a 256 bit key in ECC provides the same level of security as a 3072 bit RSA key¹. The smaller key sizes lead to compact implementations and increased performance. This makes ECC well suited for low power, resource constrained devices.
An elliptic curve is the set of solutions (x, y) to Equation 2.1, together with the point at infinity (O). This equation is known as the Weierstraß equation [18].

y^2 + a_1 xy + a_3 y = x^3 + a_2 x^2 + a_4 x + a_6        (2.1)
For cryptography, the points on the elliptic curve are chosen from a large finite field. The set of points on the elliptic curve forms a group under the addition rule. The point O is the identity element of the group. The operations on the elliptic curve, i.e. the group operations, are point addition, point doubling and point inversion. Given a point P = (x, y) on the elliptic curve and a positive integer n, scalar multiplication is defined as

nP = P + P + P + · · · + P   (n times)        (2.2)

The order of the point P is the smallest positive integer n such that nP = O. The points {O, P, 2P, 3P, · · · , (n − 1)P} form a group generated by P. The group is denoted as ⟨P⟩.
¹ NIST sources.
The security of ECC is provided by the elliptic curve discrete logarithm problem (ECDLP), which is defined as follows: given a point P on the elliptic curve and another point Q ∈ ⟨P⟩, determine an integer k (0 ≤ k ≤ n) such that Q = kP. The difficulty of the ECDLP lies in calculating the value of the scalar k given the points P and Q. k is called the discrete logarithm of Q to the base P. P is the generator of the elliptic curve and is called the basepoint.

The ECDLP forms the base on which asymmetric key algorithms are built. These algorithms include the elliptic curve Diffie-Hellman key exchange, elliptic curve ElGamal public key encryption, and the elliptic curve digital signature algorithm.
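To illustrate how the ECDLP underpins such protocols, the Python sketch below outlines an elliptic curve Diffie-Hellman exchange. It is only a schematic drawn under assumptions: scalar_mult stands for the scalar multiplication defined above and is taken to be supplied elsewhere, while P and n denote the basepoint and its order.

import secrets

def ecdh_shared_secret(scalar_mult, P, n):
    # scalar_mult(k, P) must return kP on the curve; P is the basepoint
    # and n its order. Both are placeholders for the primitives above.
    d_a = secrets.randbelow(n - 1) + 1      # Alice's private key
    d_b = secrets.randbelow(n - 1) + 1      # Bob's private key
    Q_a = scalar_mult(d_a, P)               # Alice's public key
    Q_b = scalar_mult(d_b, P)               # Bob's public key
    # Each side combines its own private key with the other's public key;
    # both obtain (d_a * d_b)P. Recovering d_a from Q_a is exactly the ECDLP.
    s_a = scalar_mult(d_a, Q_b)
    s_b = scalar_mult(d_b, Q_a)
    assert s_a == s_b
    return s_a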
[Figure: Hierarchy of operations in an elliptic curve crypto processor, from the EC primitives down to scalar multiplication]
operations on the elliptic curve and the scalar multiplication influences the number of
clock cycles required for encryption.
2.3 Hardware Accelerators for ECCP

There are two platforms on which hardware accelerators are built: application specific integrated circuits (ASICs) and field programmable gate arrays (FPGAs). ASICs are one time programmable and are best suited for high volume production. ASICs can reach high frequencies of operation, and algorithms implemented on these devices have high performance. Also, ASICs are best when data protection is concerned. Once data is written into an ASIC it is extremely difficult to read back. However, ASICs suffer from high development costs and lack flexibility with respect to modifying algorithms and reconfiguring parameters [24]. Besides, production of an ASIC has to be done in fabrication units. These fabrication units are generally owned by a third party. This is not suited for cryptographic applications, where a minimum number of parties must be involved.
FPGAs are reconfigurable devices offering parallelism and flexibility on one hand while being low cost and easy to use on the other. Moreover, they have much shorter design cycle times compared to ASICs. FPGAs were initially used as prototyping devices and in high performance scientific applications, but the short time-to-market and on-site reconfigurability features have expanded their application space. These devices can now be found in various consumer electronic devices, high performance networking applications, medical electronics and space applications. The reconfigurability aspect of FPGAs also makes them suited for cryptographic applications. Reconfigurability results in flexible implementations, allowing operating modes, encryption algorithms, curve constants, etc. to be configured. FPGAs do not require sophisticated equipment for production; they can be programmed in-house. This is beneficial for cryptography as no untrusted party is involved in the production cycle.
2.3.1 FPGA Architecture

There are two main parts of the FPGA chip [25]: the input/output (I/O) blocks and the core. The I/O blocks are located around the periphery of the chip and are used to provide programmable connectivity to the chip. The core of the chip consists of programmable logic blocks and programmable routing architectures. A popular architecture for the core, called the island style architecture, is shown in Figure 2.3. Logic blocks, also called configurable logic blocks (CLBs), consist of logic circuitry for implementing logic.
Each CLB is surrounded by routing channels connected through switch blocks and connection blocks.

[Figure 2.3: Island style architecture of the FPGA core, showing programmable logic blocks, programmable routing switches, and connection switches]

[Figure 2.4: A basic logic element, comprising a four input LUT, carry and control logic, and a storage element]

A switch block connects wires in adjacent channels through programmable switches. A connection block connects the wire segments around a logic
block to its inputs and outputs. Each logic block further contains a group of basic logic elements (BLE). Each BLE has a look up table (LUT), a storage element, and combinational logic, as shown in Figure 2.4. The storage element can be configured as an edge triggered D flip-flop or as a level sensitive latch. The combinational logic generally contains logic for carry and control signal generation.

LUTs can be configured to implement logic circuitry. If the LUT has m inputs, then any m variable boolean function can be implemented. The LUT mainly contains memory to store truth tables of boolean functions and multiplexers to select the values of the memories. There have been several studies on the best configuration for the LUT. A larger LUT would result in more logic fitted into a single LUT and hence a smaller critical delay. However, a larger LUT would also require larger memory and bigger multiplexers, hence larger area. Most studies show that a 4 input LUT provides the best area-time product, though there have been a few applications where a 3 input LUT [26] and a 6 input LUT [27] have been found beneficial. Most FPGA manufacturers, including Xilinx² and Altera³, use 4 input LUTs.

² http://www.xilinx.com
³ http://www.altera.com
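The behaviour of such a LUT can be mimicked in a few lines of Python: the truth table of any four variable boolean function is stored as sixteen configuration bits, and the four inputs merely select one of them, which is what the internal multiplexer tree does. This is an illustrative model only, not vendor code.

def make_lut4(func):
    # Store the 16-entry truth table of a 4-input boolean function;
    # this table plays the role of the LUT configuration memory.
    table = [func((i >> 3) & 1, (i >> 2) & 1, (i >> 1) & 1, i & 1) & 1
             for i in range(16)]
    # Evaluation is a pure table lookup selected by the four inputs.
    return lambda f4, f3, f2, f1: table[(f4 << 3) | (f3 << 2) | (f2 << 1) | f1]

# A four input XOR, as used when combining Karatsuba partial products,
# fits into a single fully utilized LUT.
xor4 = make_lut4(lambda a, b, c, d: a ^ b ^ c ^ d)
assert xor4(1, 0, 1, 1) == 1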
2.4 Side Channel Attacks

From the mid 90's, a new research area that has gained focus is side channel cryptanalysis. This is becoming the biggest threat to modern day cryptosystems, with many of the algorithms successfully attacked. These attacks analyze unintended information leakage provided by naive implementations of a crypto algorithm.

Side channel attacks are broadly classified into passive and active attacks. In a passive attack, the functioning of the cryptographic device is not tampered with. The secret key is revealed by observing physical properties of the device, such as timing characteristics, power consumption traces, etc. In an active attack, the inputs and environment are
manipulated to force the device to behave abnormally. The secret key is then revealed by exploiting the abnormal behavior of the device [28].

The two most extensively exploited side channels are power consumption and timing. An attack based on timing analysis [3] first identifies and then monitors certain operations in the device. The time required to complete these operations leaks information about the secret key. Power consumption attacks [4] reveal the secret key by monitoring the power consumed by the device. The power consumption of a device depends on the data being manipulated and the operation being performed. There are essentially two forms of power attacks: simple power analysis and differential power analysis. An attacker using simple power analysis (SPA) requires just a single power trace. Features of the power trace are used to directly interpret the secret key. A stronger form of power attack, called differential power analysis (DPA), was first introduced by Kocher in [4]. This is a statistical technique and requires several power traces to be analyzed before the key is revealed. This class of attacks is based on the dependence of the device's power consumption on the data being processed, which in turn depends on the key.
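The reason a single trace can suffice is easiest to see on the double and add scalar multiplication used later in this thesis: the sequence of point operations mirrors the key bits. The toy Python model below counts operations per key bit instead of measuring real power; it is purely illustrative and not a description of an actual measurement setup.

def spa_toy_trace(key_bits):
    # A naive double-and-add performs one point operation (a double) for a
    # 0 bit and two (a double followed by an add) for a 1 bit.
    return [1 if bit == 0 else 2 for bit in key_bits]

def recover_key(trace):
    # Reading the operation count per iteration reveals the key directly.
    return [0 if ops == 1 else 1 for ops in trace]

key = [1, 0, 1, 1, 0, 0, 1]
assert recover_key(spa_toy_trace(key)) == key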
2.5 Related Work

There have been several reported high performance FPGA processors for elliptic curve cryptography. Various acceleration techniques have been used, ranging from efficient implementations to parallel and pipelined architectures. In [29] the Montgomery multiplier [30] is used for scalar multiplication. The finite field multiplication is performed using a digit-serial multiplier proposed in [31]. The Itoh-Tsujii algorithm is used for finite field inversion. A point multiplication over the field GF(2^167) is performed in 0.21 ms.
The implementation, although highly flexible, is slow and does not reach the speeds required for high bandwidth applications. A 239 bit point multiplication requires 12.8 ms, which is clearly very high compared to other reported implementations.
In [33], the ECC processor designed has squarers, adders, and multipliers in the data path. The authors have used a hybrid coordinate representation in affine, Jacobian, and López-Dahab form.
In [34] an end-to-end system for ECC is developed, which has a hardware imple-
mentation for ECC on an FPGA. The high performance is obtained with an optimized
field multiplier. A digit-serial shift-and-add multiplier is used for the purpose. Inversion
is done with a dedicated division circuit.
The processor presented in [35] achieves point multiplication in 0.074 ms over the field GF(2^163). However, the implementation is for a specific form of elliptic curves called Koblitz curves. On these curves, several acceleration techniques based on precomputation [36] are possible. However, our work focuses on generic curves, where such accelerations do not work.
In [37] a high speed elliptic curve processor is presented for the field GF(2^191), where point multiplication is done in 0.056 ms. A binary Karatsuba multiplier is used for the field multiplication. However, no inverse algorithm seems to be specified in the paper, making the implementation incomplete.
In [40], the finite field multiplier in the processor is prevented from becoming idle. The finite field multiplier is the bottleneck of the design; therefore, preventing it from becoming idle improves the overall performance. Our design of the ECCP is on similar lines, where the operations required for point addition and point doubling are scheduled so that the finite field multiplier is always utilized.
2.6 Conclusion

In this chapter a brief introduction to elliptic curve cryptography was given, and the hierarchy in an elliptic curve processor was presented. A review of the existing literature on elliptic curve crypto processors was made. Hardware platforms used for elliptic curve cryptography were discussed, with special focus on FPGA architectures. The vulnerability of crypto processors to side channel attacks was also presented.
CHAPTER 3
Mathematical Background
3.1 Abstract Algebra

3.1.1 Groups, Rings and Fields

Definition 3.1.1 A group, denoted by {G, ·}, is a set of elements G with a binary operation '·', such that for each ordered pair (a, b) of elements in G, the following axioms are obeyed [41, 42]:

• Closure : If a, b ∈ G, then a · b ∈ G.
• Associativity : a · (b · c) = (a · b) · c for all a, b, c ∈ G.
• Identity element : There exists an element e ∈ G such that a · e = e · a = a for all a ∈ G.
• Inverse element : For each a ∈ G there exists an element a′ ∈ G such that a · a′ = a′ · a = e.
Definition 3.1.2 A ring, denoted by {R, +, ×} or simply R, is a set of elements with two binary operations called addition and multiplication, such that for all a, b, c ∈ R the following are satisfied:

• {R, +} is an abelian group under addition.
• Closure under multiplication : if a, b ∈ R, then a · b ∈ R.
• Associativity of multiplication : a · (b · c) = (a · b) · c.
• Distributive laws : a · (b + c) = a · b + a · c and (a + b) · c = a · c + b · c.

The set of integers, rational numbers, real numbers, and complex numbers are all rings. A ring is said to be commutative if the commutative property under multiplication holds, that is, for all a, b ∈ R, a · b = b · a.
Definition 3.1.3 A field, denoted by {F, +, ×} or simply F, is a commutative ring in which the following additional properties hold:

• There exists a multiplicative identity element denoted by 1 such that for every a ∈ F, a · 1 = 1 · a = a.
• Every non zero element a ∈ F has a multiplicative inverse a^{-1} ∈ F such that a · a^{-1} = a^{-1} · a = 1.

The set of rational numbers, real numbers and complex numbers are examples of fields, while the set of integers is not. This is because the multiplicative inverse property does not hold in the case of the integers.
The above examples of fields have infinitely many elements. However, in cryptography finite fields play an important role. A finite field is also known as a Galois field and is denoted by GF(p^m). Here, p is a prime called the characteristic of the field, while m is a positive integer. The order of the finite field, that is, the number of elements in the field, is p^m. When m = 1, the resulting field is called a prime field and contains the residue classes modulo p [41].
In cryptography two of the most studied fields are finite fields of characteristic two and prime fields. Finite fields of characteristic two, denoted by GF(2^m), are also known as binary extension finite fields or simply binary finite fields. They have several advantages when compared to prime fields. Most important is the fact that modern computer systems are built on the binary number system. With m bits, all possible elements of GF(2^m) can be represented. This is not possible with prime fields (with p ≠ 2). For example, a GF(2^2) field would require 2 bits for representation and use all possible numbers generated by the two bits. A GF(3) field would also require 2 bits for representing the three elements in the field. This leaves one of the four possible numbers generated by two bits unused, leading to an inefficient representation. Another advantage of binary extension fields is the simple hardware required for the computation of some of the commonly used arithmetic operations such as addition and squaring. Addition in binary extension fields can be easily performed by a simple XOR. There is no carry generated. Squaring in this field is a linear operation and can also be done using XOR circuits. These circuits are much simpler than the addition and squaring circuits of a GF(p) field.
c or by c · a(x), where c ∈ GF(2) [43]. An irreducible polynomial of degree m with coefficients in GF(2) can be used to construct the extension field GF(2^m). All elements of the extension field can be represented by polynomials of degree at most m − 1 over GF(2). Binary finite fields are generally represented using two types of bases: the polynomial and normal basis representations.
Definition 3.1.4 Let p(x) be an irreducible polynomial of degree m over GF(2) and let α be a root of p(x). Then the set

{1, α, α^2, · · · , α^{m−1}}

forms a basis of GF(2^m) over GF(2), called the polynomial basis.

Definition 3.1.5 Let p(x) be an irreducible polynomial of degree m over GF(2), and let α be a root of p(x); then the set

{α, α^2, α^{2^2}, · · · , α^{2^{m−1}}}

if its elements are linearly independent, forms a basis of GF(2^m) over GF(2), called the normal basis.
Any element in the field GF(2^m) can be represented in terms of its basis as shown below:

a(x) = a_{m−1} α^{m−1} + · · · + a_1 α + a_0
[Figure 3.1: The squaring operation on a(x) spreads the input bits apart by inserting zeros, after which a modulo operation reduces the result]

Let a(x) and b(x) be elements of GF(2^m), written as

a(x) = Σ_{i=0}^{m−1} a_i x^i    and    b(x) = Σ_{i=0}^{m−1} b_i x^i

Their sum is then

a(x) + b(x) = Σ_{i=0}^{m−1} (a_i + b_i) x^i        (3.1)
The squaring operation on binary finite fields is as easy as addition. The square of the polynomial a(x) ∈ GF(2^m) is given by

a(x)^2 = Σ_{i=0}^{m−1} a_i x^{2i} mod p(x)        (3.2)
The squaring essentially spreads out the input bits by inserting zeroes in between two
bits as shown in Figure 3.1.
Multiplication is not as trivial as addition or squaring. The product of the two polynomials a(x) and b(x) is given by

a(x) · b(x) = Σ_{i=0}^{m−1} a_i x^i b(x) mod p(x)        (3.3)
Inversion is the most complex of all field operations. Even the best technique to
implement inversion is several times more complex than multiplication. Hence, algo-
rithms which use finite field arithmetic generally try to reduce the number of inversions
at the cost of increasing the number of multiplications.
[Figure 3.2: Reduction of a polynomial of degree up to 464 modulo the trinomial x^233 + x^74 + 1; bit positions 464, 232, 74, and 0 are marked]
root α and 1 < n < m/2. Therefore α^m + α^n + 1 = 0, so that

α^m = 1 + α^n
α^{m+1} = α + α^{n+1}
...        (3.4)

For example, consider the irreducible trinomial x^233 + x^74 + 1. The multiplication or squaring of a polynomial results in a polynomial of degree at most 464. This can be reduced as shown in Figure 3.2. The higher order terms, 233 to 464, are reduced using Equation 3.4.
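A minimal software model of these operations, assuming field elements are held as Python integers whose bits are the polynomial coefficients, helps fix the ideas: addition is a single XOR (Equation 3.1), multiplication is a shift-and-XOR product (Equation 3.3), and reduction with the trinomial x^233 + x^74 + 1 repeatedly applies Equation 3.4. The sketch is for illustration only; the thesis realizes these primitives in hardware.

M = 233
POLY = (1 << 233) | (1 << 74) | 1          # x^233 + x^74 + 1

def gf_add(a, b):
    # Equation 3.1: coefficient-wise XOR, no carries are generated.
    return a ^ b

def gf_reduce(c):
    # Fold the higher order terms back using x^233 = x^74 + 1 (Equation 3.4).
    for i in range(c.bit_length() - 1, M - 1, -1):
        if (c >> i) & 1:
            c ^= POLY << (i - M)
    return c

def gf_mul(a, b):
    # Shift-and-XOR polynomial product (Equation 3.3), followed by reduction.
    c = 0
    while b:
        if b & 1:
            c ^= a
        a <<= 1
        b >>= 1
    return gf_reduce(c)

def gf_sqr(a):
    # Squaring spreads the bits of a apart (Figure 3.1) before reduction.
    return gf_mul(a, a)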
3.2 Elliptic Curves

Definition 3.2.1 An elliptic curve E over the field GF(2^m) is given by the simplified form of the Weierstraß equation mentioned in Equation 2.1. The simplified Weierstraß equation is:

y^2 + xy = x^3 + ax^2 + b        (3.5)
The set of points on the elliptic curve along with a special point O, called the point
at infinity, form a group under addition. The identity element of the group is the point
at infinity (O). The arithmetic operations permitted on the group are point inversion,
point addition and point doubling which are described as follows.
[Figure 3.3: Geometric point addition and doubling on an elliptic curve: the chord through P and Q meets the curve at −(P+Q), and the tangent at P meets it at −2P]
Point Inversion : Let P be a point on the curve with coordinates (x_1, y_1); then the inverse of P is the point −P with coordinates (x_1, x_1 + y_1). The point −P is obtained by drawing a vertical line through P. The point at which the line intersects the curve is the inverse of P.

Point Addition : Let P and Q be two points on the curve with coordinates (x_1, y_1) and (x_2, y_2). Also, let P ≠ ±Q; then adding the two points results in a third point R = (P + Q). The addition is performed by drawing a line through P and Q as shown in Figure 3.3. The point at which the line intersects the curve is −(P + Q). The inverse of this is R = (P + Q). Let the coordinates of R be (x_3, y_3) and let λ = (y_1 + y_2)/(x_1 + x_2); then the equations for x_3 and y_3 are

x_3 = λ^2 + λ + x_1 + x_2 + a
y_3 = λ(x_1 + x_3) + x_3 + y_1        (3.6)
Point Doubling : Let P be a point on the curve with coordinates (x_1, y_1) and P ≠ −P. The double of P is the point 2 · P = (x_3, y_3), obtained by drawing a tangent to the curve through P. The inverse of the point at which the tangent intersects the curve is
Algorithm 3.1: Double and Add algorithm for scalar multiplication
Input: Basepoint P = (p_x, p_y) and scalar k = (k_{m−1} k_{m−2} · · · k_0)_2, where k_{m−1} = 1
Output: Point on the curve Q = kP
1 Q = P
2 for i = m − 2 to 0 do
3   Q = 2 · Q
4   if k_i = 1 then
5     Q = Q + P
6   end
7 end
Table 3.1: Scalar Multiplication using Double and Add to find 22P

i   k_i   Operation        Q
3   0     Double only      2P
2   1     Double and Add   5P
1   1     Double and Add   11P
0   0     Double only      22P
b
x3 = λ2 + λ + a = x1 2 +
x1 2 (3.7)
2
y3 = x1 + λx3 + x3
The fundamental algorithm for ECC is the scalar multiplication (defined in Section 2.1). The basic double and add algorithm to perform scalar multiplication is shown in Algorithm 3.1. The input to the algorithm is a basepoint P and an m bit scalar k. The result is the scalar product kP.

As an example of how Algorithm 3.1 works, consider k = 22. The binary equivalent of this is (10110)_2. Table 3.1 shows how 22P is computed.
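A direct Python transcription of Algorithm 3.1 makes the control flow explicit. The helpers point_double and point_add stand in for the group operations of Equations 3.7 and 3.6 and are assumed to be supplied elsewhere.

def double_and_add(k, P, point_double, point_add):
    # Left-to-right double and add (Algorithm 3.1). k has its most
    # significant bit set; returns Q = kP.
    bits = bin(k)[2:]                # k_{m-1} ... k_0, with k_{m-1} = 1
    Q = P
    for bit in bits[1:]:             # process bits m-2 down to 0
        Q = point_double(Q)          # every iteration performs a doubling
        if bit == '1':
            Q = point_add(Q, P)      # an addition only when the key bit is 1
    return Q

# For k = 22 = (10110)_2 the successive values of Q are 2P, 5P, 11P and 22P,
# matching Table 3.1.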
Each iteration of i performs a doubling on Q if k_i is 0, or a doubling followed by an addition if k_i is 1. The underlying operations in the addition and doubling equations use the finite field arithmetic discussed in the previous section. Both point doubling and point addition have 1 inversion (I) and 2 multiplications (M) each (from Equations 3.6 and 3.7). From this, the entire scalar multiplication for the m bit scalar k will have m(1I + 2M) from doublings and (m/2)(1I + 2M) from additions (assuming k has approximately m/2 ones on average). The overall expected running time of the scalar multiplication is therefore obtained as

t_a ≈ (3M + (3/2) I) m        (3.8)

For this expected running time, finite field addition and squaring operations have been neglected, as they are simple operations and can be considered to add no overhead to the run time.
3.2.1 Projective Coordinate Representation

The complexity of a finite field inversion is typically eight times that of a finite field multiplication in the same field [44]. Therefore, there is a huge motivation for an alternate point representation which requires fewer inversions. The two coordinate system (x, y) used in Equations 3.5, 3.6 and 3.7 discussed in the previous section is called the affine representation. It has been shown that each affine point on the elliptic curve has a one to one correspondence with a unique equivalence class in which each point is represented by three coordinates (X, Y, Z). The three coordinate system is called the projective representation [11]. In the projective representation, inversions are replaced by multiplications. The projective form of the Weierstraß equation can be obtained by replacing x with X/Z^c and y with Y/Z^d. Several projective coordinate systems have been proposed. The most commonly used are the standard, where c = 1 and d = 1, the Jacobian, with c = 2 and d = 3, and the López-Dahab (LD) coordinates [11], which have c = 1 and d = 2. The LD coordinate system [30] allows point addition using mixed coordinates, i.e. one point in affine while the other is in projective.
In LD projective coordinates, Equation 3.5 becomes

Y^2 + XYZ = X^3 + aX^2 Z^2 + bZ^4        (3.9)
Let P = (X_1, Y_1, Z_1) be an LD projective point on the elliptic curve; then the inverse of the point P is given by −P = (X_1, X_1 Z_1 + Y_1, Z_1). Also, P + (−P) = O, where O is the point at infinity. In LD projective coordinates, O is represented as (1, 0, 0).
The equations for doubling the point P in LD projective coordinates [30] result in the point 2P = (X_3, Y_3, Z_3), given by

Z_3 = X_1^2 · Z_1^2
X_3 = X_1^4 + b · Z_1^4        (3.10)
Y_3 = b · Z_1^4 · Z_3 + X_3 · (a · Z_3 + Y_1^2 + b · Z_1^4)

The equations for doubling require 5 finite field multiplications and zero inversions.
The equation in LD coordinates for adding the affine point Q = (x_2, y_2) to P, where Q ≠ ±P, is shown in Equation 3.11. The resulting point is P + Q = (X_3, Y_3, Z_3).
A = y_2 · Z_1^2 + Y_1
B = x_2 · Z_1 + X_1
C = Z_1 · B
D = B^2 · (C + a · Z_1^2)
Z_3 = C^2
E = A · C        (3.11)
X_3 = A^2 + D + E
F = X_3 + x_2 · Z_3
G = (x_2 + y_2) · Z_3^2
Y_3 = (E + Z_3) · F + G
Point addition in LD coordinates now requires 9 finite field multiplications and zero inversions. For an m bit scalar with approximately half the bits set to one, the expected running time is given by Equation 3.12. One inversion and 2 multiplications are required at the end to convert the result from projective coordinates back into affine.

t_ld ≈ m(5M + (9/2)M) + 2M + 1I
     = (9.5m + 2)M + 1I        (3.12)
The LD coordinates require several multiplications to be done but have the advantage
of requiring just one inversion. To be beneficial, the extra multiplications should have a
lower complexity than the inversions removed.
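Equation 3.11 can be transcribed directly into software to make the operation count visible. In the sketch below, gf_mul, gf_sqr, and gf_add are the field helpers sketched earlier in this chapter, the curve constant a is assumed to be available, and only the nine gf_mul calls contribute to the 9M of Equation 3.12.

def ld_mixed_add(X1, Y1, Z1, x2, y2, a, gf_mul, gf_sqr, gf_add):
    # Mixed-coordinate point addition of Equation 3.11: P = (X1, Y1, Z1)
    # in LD projective form, Q = (x2, y2) in affine form.
    Z1_sq = gf_sqr(Z1)
    A = gf_add(gf_mul(y2, Z1_sq), Y1)                    # A = y2*Z1^2 + Y1
    B = gf_add(gf_mul(x2, Z1), X1)                       # B = x2*Z1 + X1
    C = gf_mul(Z1, B)                                    # C = Z1*B
    D = gf_mul(gf_sqr(B), gf_add(C, gf_mul(a, Z1_sq)))   # D = B^2*(C + a*Z1^2)
    Z3 = gf_sqr(C)                                       # Z3 = C^2
    E = gf_mul(A, C)                                     # E = A*C
    X3 = gf_add(gf_add(gf_sqr(A), D), E)                 # X3 = A^2 + D + E
    F = gf_add(X3, gf_mul(x2, Z3))                       # F = X3 + x2*Z3
    G = gf_mul(gf_add(x2, y2), gf_sqr(Z3))               # G = (x2+y2)*Z3^2
    Y3 = gf_add(gf_mul(gf_add(E, Z3), F), G)             # Y3 = (E+Z3)*F + G
    return X3, Y3, Z3

With an inversion costing roughly eight multiplications, avoiding an inversion in every addition and doubling more than pays for the extra multiplications, which is why Equation 3.12 compares favourably with Equation 3.8 for the field sizes of interest.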
3.3 Conclusion
This chapter presented the necessary mathematical background required for this thesis. The performance of the entire elliptic curve crypto processor depends on the underlying finite field primitives; therefore, the primitives should be implemented efficiently. The next two chapters discuss implementations of two of the most dominant primitives used in ECC, namely finite field multiplication and inversion.
CHAPTER 4
The finite field multiplier forms the most important component in the elliptic curve crypto processor (ECCP). It occupies the most area on the device and also has the longest latency. The performance of the ECCP is therefore affected most by the multiplier. Finite field multiplication of two elements in the field GF(2^m) is defined as

C(x) = A(x) · B(x) mod P(x)        (4.1)

where C(x), A(x), and B(x) are in GF(2^m) and P(x) is the irreducible polynomial that generates the field GF(2^m). Implementing the multiplication requires two steps. First, the polynomial product C′(x) = A(x) · B(x) is determined; then the modulo operation is done on C′(x). This chapter deals with polynomial multiplication.
The organization of the chapter is as follows: the next section contains a brief
overview of important finite field multipliers in literature. Section 4.2 discusses the
Karatsuba algorithm in greater detail. Section 4.3 outlines some of the Karatsuba mul-
tiplication variants used for elliptic curves. Section 4.4 presents how a circuit gets
mapped to a four input LUT based FPGA. Section 4.5 analyzes how the existing Karat-
suba algorithms get mapped on to the FPGA. It also presents the proposed hybrid Karat-
suba multiplier which maximizes utilization of FPGA resources. Section 4.6 compares
the performance of the hybrid Karatsuba multiplier with existing implementations of
the Karatsuba algorithm. The final section has the conclusion.
4.1 Finite Field Multipliers for High Performance Applications
The school book method to multiply two polynomials requires m^2 AND gates to generate the partial products. The final product is formed by adding the partial products. Since we deal with binary fields, addition is easily done using XOR gates without any carries being propagated; thus (m − 1)^2 XOR gates are required to do the additions.
Another multiplier, based on the normal basis, is the Sunar-Koç [46] multiplier. The multiplier requires less hardware compared to the Massey-Omura multiplier but has similar timing requirements.
In [47], the Montgomery multiplier is adapted to binary finite fields. The multiplication in Equation 4.1 is represented by the following equation

C(x) = A(x) · B(x) · R(x)^{-1} mod P(x)        (4.2)

where R(x) is of the form x^k and is an element in the field. Also, gcd(R(x), P(x)) = 1. The division by R(x) reduces the complexity of the modular operation. For binary finite fields, division by R(x) = x^k can be easily accomplished on a computer. This multiplier is best suited for low resource environments where speed of operation is not so important [44].
The Karatsuba multiplier [12] uses a divide and conquer approach to multiply A(x) and B(x). The m term polynomials are recursively split into two. With each split, the size of the multiplication required reduces by half. This leads to a reduction in the number of AND gates required at the cost of an increase in XOR gates. This also results in the multiplier having a space complexity of O(m^{log₂ 3}) for polynomial representations of finite fields. A comparison of all available multipliers shows that only the Karatsuba multiplier has a complexity of sub-quadratic order. All other multipliers have a complexity which is quadratic. Besides this, it has been shown in [44] and [48] that the Karatsuba multiplier, if designed properly, is also the fastest.
For a high performance elliptic curve crypto processor, the finite field multiplier with the smallest delay and the least number of clock cycles is best suited. The Karatsuba multiplier, if properly designed, attains the above speed requirements and at the same time has a sub-quadratic space complexity. This makes the Karatsuba multiplier the best choice for high performance applications.
In the Karatsuba multiplier, the m bit multiplicands A(x) and B(x) represented in polynomial basis are split as shown in Equation 4.3. For brevity, the equations that follow represent the polynomials A_h(x), A_l(x), B_h(x), and B_l(x) by A_h, A_l, B_h, and B_l respectively.

A(x) = A_h x^{m/2} + A_l
B(x) = B_h x^{m/2} + B_l        (4.3)
The multiplication is then done using three m/2 bit multiplications as shown in Equation 4.4.

C(x) = A(x) · B(x)
     = A_h B_h x^m + (A_h B_l + A_l B_h) x^{m/2} + A_l B_l
     = A_h B_h x^m + ((A_h + A_l)(B_h + B_l) + A_h B_h + A_l B_l) x^{m/2} + A_l B_l        (4.4)
The Karatsuba multiplier can be applied recursively to each m/2 bit multiplication in Equation 4.4. Ideally this multiplier is best suited when m is a power of 2, as this allows the multiplicands to be broken down until they reach 2 bits. The final recursion, consisting of 2 bit multiplications, can be achieved with AND gates. Such a multiplier with m a power of 2 is called the basic Karatsuba multiplier.
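The recursion is easy to express in software. The Python sketch below multiplies GF(2) polynomials held as integers, splitting exactly as in Equation 4.4; it is a behavioural model of the basic Karatsuba multiplier (n a power of two), not the combinational structure discussed later.

def karatsuba_gf2(a, b, n):
    # Carry-less (GF(2)) Karatsuba product of two n-bit operands.
    if n <= 2:
        # 2-bit base case: the AND-gate multipliers of the final recursion.
        c = 0
        for i in range(n):
            if (a >> i) & 1:
                c ^= b << i
        return c
    half = n // 2
    mask = (1 << half) - 1
    a_l, a_h = a & mask, a >> half
    b_l, b_h = b & mask, b >> half
    low = karatsuba_gf2(a_l, b_l, half)                  # Al*Bl
    high = karatsuba_gf2(a_h, b_h, half)                 # Ah*Bh
    mid = karatsuba_gf2(a_l ^ a_h, b_l ^ b_h, half)      # (Al+Ah)(Bl+Bh)
    # Combine as in Equation 4.4: Ah*Bh*x^n + (mid+low+high)*x^(n/2) + Al*Bl.
    return (high << n) ^ ((mid ^ low ^ high) << half) ^ low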
The basic recursive Karatsuba multiplier cannot be applied directly to ECC because the binary extension fields used in standards such as [14] have a prime degree. There have been several published works which implement a modified Karatsuba algorithm for use in elliptic curves. There are two main design approaches followed. The first approach is a sequential circuit having less hardware and latency but requiring several clock cycles to produce the result. Generally, at every clock cycle the outputs are fed back into the circuit, thus reusing the hardware. The advantage of this approach is that it can be pipelined. Examples of implementations following this approach can be found in [48–51]. The second approach is a combinational circuit having large area and delay but capable of generating the result in one clock cycle. Examples of this approach can be found in [52–55]. Our proposed Karatsuba multiplier follows the second approach; therefore, in the remaining part of this section we analyze combinational circuits for Karatsuba multipliers.
The easiest method to modify the Karatsuba algorithm for elliptic curves is by padding. The padded Karatsuba multiplier extends the m bit multiplicands to 2^{⌈log₂ m⌉} bits by padding the most significant bits with zeroes. This allows the use of the basic recursive Karatsuba algorithm. The obvious drawback of this method is the extra arithmetic introduced due to the padding.
The simple Karatsuba multiplier [55] is the basic recursive Karatsuba multiplier with a small modification. If an m bit multiplication needs to be done, m being any integer, the operands are split into two polynomials as in Equation 4.3. The A_l and B_l terms have ⌈m/2⌉ bits and the A_h and B_h terms have ⌊m/2⌋ bits. The Karatsuba multiplication can then be done with two ⌈m/2⌉ bit multiplications and one ⌊m/2⌋ bit multiplication. The upper bound for the number of AND gates and XOR gates required for the simple Karatsuba multiplier is the same as that of a 2^{⌈log₂ m⌉} bit basic recursive Karatsuba multiplier. The maximum number of gates required and the time delay for an m bit simple Karatsuba multiplier are given below.
In the general Karatsuba multiplier [55], the multiplicands are split into more than two terms. For example, an m term multiplier is split into m different terms. The number of gates required is given below.

The delay of the q bit combinational circuit in terms of LUTs is given by Equation 4.8, where D_LUT is the delay of one LUT.
The percentage of under utilized LUTs in a design is determined using Equation 4.9. Here, LUT_k signifies that k inputs out of 4 are used by the design block realized by the LUT. So, LUT_2 and LUT_3 are under utilized LUTs, while LUT_4 is fully utilized.

%UnderUtilizedLUTs = (LUT_2 + LUT_3) / (LUT_2 + LUT_3 + LUT_4) × 100        (4.9)
[Figure 4.1: Combining the partial products A_l B_l, A_h B_h, and the middle term in one recursion of the Karatsuba multiplier]
such n bit multipliers required. The A and B inputs are split into two: A_h, A_l and B_h, B_l respectively, with each term having n/2 bits. n/2 two input XORs are required for the computation of A_h + A_l and of B_h + B_l respectively (Equation 4.4). Each two input XOR requires one LUT on the FPGA; thus in total n LUTs are required. Combining the partial products as shown in Figure 4.1 is the last step of the recursion. Determining the output bits n − 2 to n/2 and 3n/2 − 2 to n requires 3(n/2 − 1) two input XORs each. The output bit n − 1 requires 2 two input XORs. In all, (3n − 4) two input XORs are required to add the partial products. The number of LUTs required to combine the partial products is much lower. This is because each LUT implements a four input XOR. Each output bit n/2 to 3n/2 − 2 requires one LUT; therefore (n − 1) LUTs are required for the purpose. In total, 2n − 1 LUTs are required for each recursion on the FPGA. The final recursion has 3^{(log₂ m)−1} two bit Karatsuba multipliers. The equation for the two bit Karatsuba multiplier is shown in Equation 4.10.
for the two bit Karatsuba multiplier is shown in Equation 4.10.
C0 =A0 B0
C2 =A1 B1
This requires three LUTs on the FPGA: one for each of the output bits (C0 , C1 , C2 ).
The total number of LUTs required for the m bit recursive Karatsuba multiplication is given by Equation 4.11.

#LUT_R(m) = 3 · 3^{log₂ m − 1} + Σ_{k=0}^{log₂ m − 2} 3^k (2^{log₂ m − k + 1} − 1)        (4.11)
The delay of the recursive Karatsuba multiplier in terms of LUTs is given by Equation 4.12. The first log₂(m) − 1 recursions have a delay of 2 LUTs each. The last recursion has a delay of 1 LUT.

DELAY_R(m) = (2(log₂ m − 1) + 1) D_LUT        (4.12)
General Karatsuba Multiplier : The m bit general Karatsuba algorithm [55] is shown in Algorithm 4.1. Each iteration of i computes two output bits, C_i and C_{2m−2−i}. Computing the two output bits requires the same amount of resources on the FPGA. Lines 6 and 7 in the algorithm are executed once for every even iteration of i and are not executed for odd iterations of i. The term M_j + M_{i−j} + M_{(j,i−j)} is computed with the four inputs A_j, A_{i−j}, B_j and B_{i−j}; therefore, on the FPGA, computing the term requires one LUT. For an odd i, C_i has ⌈i/2⌉ such LUTs whose outputs have to be added. The number of LUTs required for this is obtained from Equation 4.7. An even value of i has two additional inputs, corresponding to M_{i/2}, that have to be added. The number of LUTs required for computing C_i (0 ≤ i ≤ m − 1) is given by Equation 4.15.

#LUT_{C_i} = 1                          if i = 0
           = ⌈i/2⌉ + #LUT(⌈i/2⌉)        if i is odd        (4.15)
           = i/2 + #LUT(i/2 + 2)        if i is even
Algorithm 4.1: gkmul (General Karatsuba Multiplier)
Input: A, B are multiplicands of m bits
Output: C of length 2m − 1 bits
/* Define : Mx → Ax Bx */
/* Define : M(x,y) → (Ax + Ay )(Bx + By ) */
1 begin
2 for i = 0 to m − 2 do
3 Ci = C2m−2−i = 0
4 for j = 0 to ⌊i/2⌋ do
5 if i = 2j then
6 Ci = Ci + Mj
7 C2m−2−i = C2m−2−i + Mm−1−j
8 else
9 Ci = Ci + Mj + Mi−j + M(j,i−j)
10 C2m−2−i = C2m−2−i + Mm−1−j
11 +Mm−1−i+j + M(m−1−j,m−1−i+j)
12 end
13 end
14 end
15 Cm−1 = 0
16 for j = 0 to ⌊(m − 1)/2⌋ do
17 if m − 1 = 2j then
18 Cm−1 = Cm−1 + Mj
19 else
20 Cm−1 = Cm−1 + Mj + Mm−1−j + M(j,m−1−j)
21 end
22 end
23 end
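For reference, Algorithm 4.1 translates almost line for line into Python when the coefficients are kept as lists of bits and the GF(2) additions become XORs. This functional model is only meant for checking the structure of the output bits; it is not the hardware description.

def gkmul(A, B):
    # General Karatsuba product (Algorithm 4.1) of two m-term GF(2)
    # polynomials given as bit lists A[0..m-1] and B[0..m-1].
    m = len(A)
    M = lambda x: A[x] & B[x]                            # M_x
    Mxy = lambda x, y: (A[x] ^ A[y]) & (B[x] ^ B[y])     # M_(x,y)
    C = [0] * (2 * m - 1)
    for i in range(m - 1):
        for j in range(i // 2 + 1):
            if i == 2 * j:
                C[i] ^= M(j)
                C[2 * m - 2 - i] ^= M(m - 1 - j)
            else:
                C[i] ^= M(j) ^ M(i - j) ^ Mxy(j, i - j)
                C[2 * m - 2 - i] ^= (M(m - 1 - j) ^ M(m - 1 - i + j)
                                     ^ Mxy(m - 1 - j, m - 1 - i + j))
    for j in range((m - 1) // 2 + 1):
        if m - 1 == 2 * j:
            C[m - 1] ^= M(j)
        else:
            C[m - 1] ^= M(j) ^ M(m - 1 - j) ^ Mxy(j, m - 1 - j)
    return C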
The total number of LUTs required for the general Karatsuba multiplier is given by Equation 4.16.

#LUTS_G(m) = 2 Σ_{i=0}^{m−2} #LUT_{C_i} + #LUT_{C_{m−1}}        (4.16)
When implemented in hardware, all output bits are computed simultaneously. The delay of the general Karatsuba multiplier (Equation 4.17) is equal to the delay of the output bit with the most terms. This is the output bit C_{m−1} (lines 15 to 22 of Algorithm 4.1). Equation 4.17 is obtained from Equation 4.15 with i = m − 1. The ⌈i/2⌉ computations are done with a delay of one LUT (D_LUT). Equation 4.8 is used to compute the second term of Equation 4.17.

DELAY_G(m) = D_LUT + DELAY(⌈(m − 1)/2⌉)       if m − 1 is odd        (4.17)
           = D_LUT + DELAY((m − 1)/2 + 2)      if m − 1 is even

Table 4.1: Comparison of LUT Utilization in Multipliers

  m  |        General              |         Simple
     | Gates   LUTs   LUTs Under   | Gates   LUTs   LUTs Under
     |                Utilized     |                Utilized
   2 |     7      3      66.6%     |     7      3      66.6%
   4 |    37     11      45.5%     |    33     16      68.7%
   8 |   169     53      20.7%     |   127     63      66.6%
  16 |   721    188      17.0%     |   441    220      65.0%
  29 |  2437    670      10.7%     |  1339    669      65.4%
  32 |  2977    799      11.3%     |  1447    723      63.9%
In this section we present our proposed multiplier, called the hybrid Karatsuba multiplier. We show how we combine techniques to maximize the utilization of LUTs, resulting in minimum area.

Table 4.1 compares the general and simple Karatsuba algorithms in terms of gate counts (two input XOR and AND gates), LUTs required on a Xilinx Virtex 4 FPGA, and the percentage of LUTs under utilized (Equation 4.9).
The simple Karatsuba multiplier alone is not efficient for FPGA platforms, as the number of under utilized LUTs is about 65%. For an m bit simple Karatsuba multiplier, the two bit multipliers take up approximately a third of the area (for m = 256). In a two bit multiplier, two out of the three LUTs required are under utilized (in Equation 4.10, C_0 and C_2 result in under utilized LUTs). In addition to this, around half the LUTs used for each recursion are under utilized. The under utilized LUTs result in a bloated area requirement on the FPGA.

The m-term general Karatsuba multiplier is more efficient on the FPGA for small values of m (Table 4.1) even though the gate count is significantly higher. This is because a large number of operations can be grouped in fours, which fully utilizes the LUTs. For small values of m (m < 29), the compactness obtained by the fully utilized LUTs is more prominent than the large gate count, resulting in low footprints on the FPGA. For m ≥ 29, the gate count far exceeds the efficiency obtained by the fully utilized LUTs, resulting in larger footprints with respect to the simple Karatsuba implementation.
In our proposed hybrid Karatsuba multiplier, shown in Algorithm 4.2, the m bit multiplicands are split into two parts when the number of bits is greater than or equal to the threshold of 29. The higher term has ⌊m/2⌋ bits while the lower term has ⌈m/2⌉ bits. If the number of bits of the multiplicand is less than 29, the general Karatsuba algorithm is invoked.

[Figure 4.2: Recursions in the 233 bit hybrid Karatsuba multiplier: simple Karatsuba splits of 233 into 116/117, 58/59, and 29/30 bit operands, with 14 and 15 bit general Karatsuba multipliers at the lowest level]

The general Karatsuba algorithm ensures maximum utilization of the LUTs
for the smaller bit multiplications, while the simple Karatsuba algorithm ensures the least gate count for the larger bit multiplications. For a 233 bit hybrid Karatsuba multiplier (Figure 4.2), the multiplicands are split into two terms, with A_h and B_h of 116 bits and A_l and B_l of 117 bits. The 116 bit multiplication is implemented using three 58 bit multipliers, while the 117 bit multiplication is implemented using two 59 bit multipliers and a 58 bit multiplier. The 58 and 59 bit multiplications are implemented with 29 and 30 bit multipliers, and the 29 and 30 bit multiplications are done using 14 and 15 bit general Karatsuba multipliers.
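The control structure of the hybrid multiplier can be captured in a few lines of Python: recurse with the simple Karatsuba split while the operand width is at least 29 bits, and below that threshold hand over to the small multiplier. Here a plain carry-less product stands in for the general Karatsuba leaves, so the sketch reproduces only the recursion strategy, not the LUT-level structure of Algorithm 4.2.

THRESHOLD = 29   # widths below this use the general (here: schoolbook) multiplier

def small_mul(a, b, n):
    # Stand-in for the 14 and 15 bit general Karatsuba multipliers.
    c = 0
    for i in range(n):
        if (a >> i) & 1:
            c ^= b << i
    return c

def hybrid_karatsuba(a, b, n):
    # Simple Karatsuba splits (the low halves get ceil(n/2) bits) down to
    # the 29-bit threshold; for n = 233 the first split is 117/116 bits.
    if n < THRESHOLD:
        return small_mul(a, b, n)
    lo = (n + 1) // 2
    mask = (1 << lo) - 1
    a_l, a_h = a & mask, a >> lo
    b_l, b_h = b & mask, b >> lo
    low = hybrid_karatsuba(a_l, b_l, lo)
    high = hybrid_karatsuba(a_h, b_h, n - lo)
    mid = hybrid_karatsuba(a_l ^ a_h, b_l ^ b_h, lo)
    return (high << (2 * lo)) ^ ((mid ^ low ^ high) << lo) ^ low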
The number of recursions required for an m bit hybrid Karatsuba multiplier is

r = ⌈log₂(m/29)⌉ + 1        (4.18)

The i-th recursion (0 < i < r) of the m bit multiplier has 3^i multiplications. The multipliers in this recursion have bit lengths ⌈m/2^i⌉ and ⌊m/2^i⌋. For simplicity we assume the number of gates required for the ⌊m/2^i⌋ bit multiplier is equal to that of the ⌈m/2^i⌉ bit multiplier. The total number of AND gates required is the number of AND gates for the multipliers in the final recursion (i.e. the ⌈m/2^{r−1}⌉ bit multipliers) times the number of
⌈m/2^{r−1}⌉ bit multipliers present. Using Equation 4.6,

#AND = (3^{r−1}/2) ⌈m/2^{r−1}⌉ (⌈m/2^{r−1}⌉ + 1)        (4.19)
The number of XOR gates required for the i-th recursion is 4⌈m/2^i⌉ − 4. The total number of two input XORs is the sum of the XORs required for the last recursion, #XOR_{g_{r−1}}, and the XORs required for the other recursions, #XOR_{s_i}. Using Equations 4.5 and 4.6,

#XOR = 3^{r−1} #XOR_{g_{r−1}} + Σ_{i=1}^{r−2} 3^i #XOR_{s_i}
     = 3^{r−1} (10⌈m/2^r⌉^2 − 7⌈m/2^r⌉ + 1) + Σ_{i=1}^{r−2} 3^i (4⌈m/2^i⌉ − 4)        (4.20)
The delay of the hybrid Karatsuba multiplier (Equation 4.21) is obtained by subtracting the delay of a ⌈m/2^{r−1}⌉ bit simple Karatsuba multiplier from the delay of an m bit simple Karatsuba multiplier and adding the delay of a ⌈m/2^{r−1}⌉ bit general Karatsuba multiplier.

DELAY_H(m) = DELAY_S(m) − DELAY_S(⌈m/2^{r−1}⌉) + DELAY_G(⌈m/2^{r−1}⌉)        (4.21)
The graph in Figure 4.3 compares the area-time product of the hybrid Karatsuba multiplier with the simple Karatsuba multiplier and the binary Karatsuba multipliers for increasing values of m. The simple and binary Karatsuba multipliers were reimplemented and scaled for different field sizes.

[Figure 4.3: Area × delay of the hybrid, simple, and binary Karatsuba multipliers for increasing values of m]

Table 4.2: Comparison of the Hybrid Karatsuba Multiplier with Reported FPGA Implementations

The results were obtained by synthesizing using Xilinx's ISE for a Virtex 4 FPGA. The area was determined by the number of
LUTs required for the multiplier, and the time in nanoseconds includes the I/O pad delay. The graph shows that the area-time product for the hybrid Karatsuba multiplier is lower compared to the other multipliers. The power × delay graph for the multipliers is expected to be similar to the area × delay graph of Figure 4.3.

Table 4.2 compares the hybrid Karatsuba multiplier with reported FPGA implementations of Karatsuba variants. The implementations of [48] and [50] are sequential and hence require multiple clock cycles; thus they are not suited for high performance ECC. In order
to alleviate this, we proposed a combinational Karatsuba multiplier. However to ensure
that the design operates at a high clock frequency, we perform hardware replication.
For example, in a 233 bit multiplier, 14 bit and 15 bit general Karatsuba multipliers are
replicated, since the general Karatsuba multipliers utilize LUTs efficiently. This gain is
reflected in Table 4.2.
4.7 Conclusion
In this chapter we discussed the finite field multiplication unit. We proposed a hybrid technique for implementing the Karatsuba multiplier. Our proposed design results in the best area × time product on an FPGA compared to existing works. The hybrid Karatsuba multiplier forms the most important module of our elliptic curve crypto processor. In the next chapter, we discuss finite field inversion, which also uses the hybrid Karatsuba multiplier.
CHAPTER 5
The inverse of a non zero element a in the field GF(2^m) is the element a^{-1} ∈ GF(2^m) such that a · a^{-1} = a^{-1} · a = 1. Among all finite field operations, computing the inverse of an element is the most computationally intensive. Yet it forms an integral part of many public key cryptography algorithms, including ECC. It is therefore important to have an efficient technique to find the multiplicative inverse.
This chapter is organized as follows : the next section has a brief discussion on
various multiplicative inverse algorithms and reasons out why the Itoh-Tsujii algorithm
is most suited for elliptic curve cryptography. Section 5.2 describes the Itoh-Tsujii al-
gorithm and some of the reported literature on its implementation. Section 5.3 derives
an equation to determine the number of clock cycles required to find the inverse. Sec-
tion 5.4 proposes a generalized Itoh-Tsujii algorithm and presents a special case of the
generalized version called the quad-Itoh Tsujii algorithm, which is efficient for FPGA
platforms. This section also builds a controller that implements the quad-Itoh Tsujii
algorithm. Section 5.5 has the performance evaluation of the proposed algorithm with
the best existing inverse algorithms available. The final section has the conclusion.
The most common algorithms for finding the multiplicative inverse are the extended
Euclidean algorithms (EEA) and the Itoh-Tsujii Algorithm (ITA) [13]. Generally, the
EEA and its variants, the binary EEA and Montgomery [56] inverse algorithms result
in compact hardware implementations, while the ITA is faster. The large area required
by the ITA is mainly due to the multiplication unit. All cryptographic applications need to perform finite field multiplications, hence their hardware implementations already contain a multiplier. This multiplier can be reused by the ITA for inverse computations, in which case the multiplier need not be counted in the area required by the ITA. The resulting ITA without the multiplier is as compact as the EEA, making it
an ideal choice for multiplicative inverse hardware [44].
The Itoh-Tsujii algorithm was initially proposed to find the multiplicative inverse
for normal basis representation of elements in the field GF (2m )[13]. Since then, there
have been several works that improved the original algorithm and adapted the algorithm
to other basis representations [57–59]. In [57], inversion in polynomial basis represen-
tations of field elements was presented. In [58] addition chains were used efficiently
to compute the multiplicative inverse in 27 clock cycles for an element represented in
polynomial basis in the field GF (2193 ). In [59] a parallel implementation of ITA was
proposed to generate the inverse in 20 clock cycles for the same field and basis repre-
sentation.
a^(−1) = a^(2^m − 2)    (5.1)
The naive technique of implementing a−1 requires (m−2) multiplications and (m−
1) squarings. Itoh and Tsujii in [13] reduced the number of multiplications required by
using addition chains. An addition chain [60] for n ∈ N is a sequence of integers of the form U = (u_0, u_1, u_2, · · · , u_r) satisfying the properties

• u_0 = 1
• u_r = n
• every u_i (i > 0) is the sum of two earlier elements of the sequence, u_i = u_j + u_k with j, k < i

If

β_k(a) = a^(2^k − 1) ∈ GF(2^m)

then,

a^(−1) = [β_(m−1)(a)]^2

In [59] a recursive sequence (Equation 5.2) is used with an addition chain to compute the multiplicative inverse. β_(k+j)(a) ∈ GF(2^m) can be expressed as shown in Equation 5.2. For simplicity of notation we shall represent β_k(a) by β_k.

β_(k+j)(a) = (β_j)^(2^k)·β_k = (β_k)^(2^j)·β_j    (5.2)

Table 5.1: Inverse of a ∈ GF(2^233) using generic ITA

      β_ui(a)      β_(uj+uk)(a)      Exponentiation
1     β_1(a)       -                 a
2     β_2(a)       β_(1+1)(a)        (β_1)^(2^1)·β_1 = a^(2^2 − 1)
3     β_3(a)       β_(2+1)(a)        (β_2)^(2^1)·β_1 = a^(2^3 − 1)
4     β_6(a)       β_(3+3)(a)        (β_3)^(2^3)·β_3 = a^(2^6 − 1)
5     β_7(a)       β_(6+1)(a)        (β_6)^(2^1)·β_1 = a^(2^7 − 1)
6     β_14(a)      β_(7+7)(a)        (β_7)^(2^7)·β_7 = a^(2^14 − 1)
7     β_28(a)      β_(14+14)(a)      (β_14)^(2^14)·β_14 = a^(2^28 − 1)
8     β_29(a)      β_(28+1)(a)       (β_28)^(2^1)·β_1 = a^(2^29 − 1)
9     β_58(a)      β_(29+29)(a)      (β_29)^(2^29)·β_29 = a^(2^58 − 1)
10    β_116(a)     β_(58+58)(a)      (β_58)^(2^58)·β_58 = a^(2^116 − 1)
11    β_232(a)     β_(116+116)(a)    (β_116)^(2^116)·β_116 = a^(2^232 − 1)

Computing β_232(a) is done in 10 steps with 231 squarings and 10 multiplications as shown in Table 5.1.
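The β_k recursion of Equation 5.2 and the schedule of Table 5.1 can be traced in software. The following Python sketch is our own illustration (not code from the thesis); it uses the irreducible polynomial x^233 + x^74 + 1 that the thesis adopts for GF(2^233).

M = 233
POLY = (1 << 233) | (1 << 74) | 1          # x^233 + x^74 + 1

def gf_mul(a, b):
    """Polynomial multiplication modulo POLY over GF(2)."""
    r = 0
    while b:
        if b & 1:
            r ^= a
        b >>= 1
        a <<= 1
        if a >> M:                          # reduce when degree reaches 233
            a ^= POLY
    return r

def gf_sqr(a):
    return gf_mul(a, a)

def itoh_tsujii_inverse(a):
    """a^(-1) = (beta_{m-1}(a))^2 with beta_k(a) = a^(2^k - 1)."""
    chain = [1, 2, 3, 6, 7, 14, 28, 29, 58, 116, 232]   # Brauer chain for m - 1
    beta = {1: a}                           # beta_1(a) = a
    for prev, cur in zip(chain, chain[1:]):
        k = cur - prev                      # cur = prev + k, and k occurs earlier in the chain
        t = beta[prev]
        for _ in range(k):                  # t = (beta_prev)^(2^k)
            t = gf_sqr(t)
        beta[cur] = gf_mul(t, beta[k])      # Equation 5.2
    return gf_sqr(beta[M - 1])              # the final squaring gives the inverse

# quick self-check: a * a^(-1) == 1
a = 0x123456789ABCDEF
assert gf_mul(a, itoh_tsujii_inverse(a)) == 1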
In general, if l is the length of the addition chain, finding the inverse of an element in GF(2^m) requires l − 1 multiplications and m − 1 squarings. The length of the addition chain is of the order of log2 m [60], therefore the number of multiplications required by the ITA is much smaller than that of the naive method.

In the ITA for the field GF(2^m), the number of squarings required is as high as m. Further, from Table 5.1 it may be noted that most of the squarings required are towards the end of the addition chain. The maximum number of squarings at any particular step could be as high as u_i/2. Although the circuit for a squarer is relatively simple, the large number of squarings required hampers the performance of the ITA. A straightforward implementation of the squarings would require u_i/2 clock cycles at each step. The technique used in [58] and [59] cascades u_s squarers (where u_s is an element in the addition chain), as shown in Figure 5.1, so that the output of one squarer is fed to the input of the next. If the number of squarings required is less than u_s, a multiplexer is used to tap out interim outputs. In this case the output can be obtained in one clock cycle. If the number of squarings required is greater than u_s, the output of the squaring block is fed back to obtain powers that are a multiple of u_s. For example, if u_i (u_i > u_s) squarings are needed, the output of the squarer block would be fed back ⌈u_i/u_s⌉ times. This would also require ⌈u_i/u_s⌉ clock cycles.

[Fig. 5.1: A cascade of squarer circuits (Squarer-1 to Squarer-u_s) with a multiplexer, driven by control logic, tapping the intermediate outputs.]
In addition to the squarings, each step in the ITA has exactly one multiplication
requiring one clock cycle. The total number of clock cycles required for this design,
assuming a Brauer chain, is given by Equation 5.4. The summation in the equation is
the clock cycles for the squarings at each step of the algorithm. The (l − 1) term is due
to the (l − 1) multiplications. The extra clock cycle is for the final squaring.
#ClockCycles = 1 + (l − 1) + Σ_{i=2}^{l} ⌈(u_i − u_{i−1})/u_s⌉
             = l + Σ_{i=2}^{l} ⌈(u_i − u_{i−1})/u_s⌉    (5.4)
In order to reduce the clock cycles a parallel architecture was proposed in [59]. The
reduced clock cycles is achieved at the cost of increased hardware. In the remaining
part of this section we propose a novel ITA designed for the FPGA architecture. The
proposed design, though sequential, requires the same number of clock cycles as the
parallel architecture of [59] but has better area×time product.
The equation for the square of an element a ∈ GF (2m ) is given by Equation 5.5, where
p(x) is the irreducible polynomial.
a(x)^2 = Σ_{i=0}^{m−1} a_i·x^(2i) mod p(x)    (5.5)

This is a linear equation and hence can be represented in the form of a matrix (T) as shown in the equation below.

a^2 = T · a
The matrix depends on the finite field GF(2^m) and the irreducible polynomial of the field. Exponentiation in the ITA is done with squarer circuits. We extend the ITA so that the exponentiation can be done with any 2^n circuit and not just squarers. Raising a to the power of 2^n is also linear and can be represented in the form of a matrix as shown below.

a^(2^n) = T^n(a) = T′·a

We define

α_k(a) = a^(2^(nk) − 1)    (5.6)

Theorem 5.4.1 If a ∈ GF(2^m), α_(k1)(a) = a^(2^(nk1) − 1) and α_(k2)(a) = a^(2^(nk2) − 1), then

α_(k1+k2)(a) = (α_(k1)(a))^(2^(nk2)) · α_(k2)(a)

where k1, k2, and n ∈ N.

Proof

RHS = (α_(k1)(a))^(2^(nk2)) · α_(k2)(a)
    = (a^(2^(nk1) − 1))^(2^(nk2)) · (a^(2^(nk2) − 1))
    = a^(2^(n(k1+k2)) − 2^(nk2) + 2^(nk2) − 1)
    = a^(2^(n(k1+k2)) − 1)
    = LHS

Proof When n | (m − 1),

[α_((m−1)/n)(a)]^2 = [a^(2^(n·(m−1)/n) − 1)]^2
                   = [a^(2^(m−1) − 1)]^2
                   = a^(−1)

When n ∤ (m − 1), write m − 1 = nq + r with 0 < r < n. Then

[(α_q(a))^(2^r) · β_r(a)]^2 = [(a^(2^(nq) − 1))^(2^r) · (a^(2^r − 1))]^2
                            = [a^(2^(nq+r) − 1)]^2
                            = [a^(2^(m−1) − 1)]^2
                            = a^(−1)
Table 5.2: Comparison of LUTs Required for a Squarer and Quad Circuit for GF(2^9)

We note that elliptic curves over the field GF(2^m) used for cryptographic purposes [14] have an odd m, therefore we discuss with respect to such values of m, although the results are valid for all m. In particular, we consider the case when n = 2, such that

α_k(a) = a^(4^k − 1)

To implement this we require quad circuits. To show the benefits of using a quad circuit on an FPGA instead of the conventional squarer, consider the equations for a squarer and a quad for an element b(x) ∈ GF(2^9) (Table 5.2). The irreducible polynomial for the field is x^9 + x + 1. In the table, b_0 · · · b_8 are the coefficients of b(x). The #LUTs column shows the number of LUTs required for obtaining the particular output bit.

We would expect the LUTs required by the quad circuit to be twice that of the squarer. However this is not the case. The quad circuit's LUT requirement is only 1.5 times that of the squarer. This is because the quad circuit has a lower percentage of under utilized LUTs (Equation 4.9). For example, from Table 5.2 we note that output bit 4 requires three XOR gates in the quad circuit and only one in the squarer. However, both circuits require only 1 LUT. This is also the case with output bit 8. This shows that the quad circuit is better at utilizing FPGA resources compared to the squarer. Moreover, both circuits have the same delay of one LUT. If we generate the fourth power by cascading two squarer circuits (i.e. (b(x)^2)^2), the resulting circuit would have twice the delay and require 25% more hardware resources than a single quad circuit.

Table 5.3: Comparison of Squarer and Quad Circuits on Xilinx Virtex 4 FPGA

These observations are scalable to larger fields, as shown in Table 5.3. The circuits for the finite fields GF(2^233) and GF(2^193) use the irreducible polynomials x^233 + x^74 + 1 and x^193 + x^15 + 1 respectively. They were synthesized for a Xilinx Virtex 4 FPGA. The table shows that the area saved even for large fields is about 25%, while the combinational delay of a single squarer is equal to that of the quad.
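The tap counts behind these LUT figures can be reproduced by treating squaring and raising to the fourth power as linear maps over GF(2). The Python sketch below is our own check for GF(2^9) with p(x) = x^9 + x + 1; the number of coefficients feeding each output bit determines the XOR gates, and hence the LUTs, of Table 5.2.

M, POLY = 9, (1 << 9) | (1 << 1) | 1            # x^9 + x + 1

def reduce_poly(v):
    for d in range(2 * M - 2, M - 1, -1):        # reduce degrees 16 .. 9
        if v >> d & 1:
            v ^= POLY << (d - M)
    return v

def square(v):                                   # spread the bits of v to the even positions
    out = 0
    for i in range(M):
        if v >> i & 1:
            out |= 1 << (2 * i)
    return out

def taps(power):
    """For b -> b^(2^power), count how many coefficients b_i feed each output bit."""
    rows = [0] * M
    for i in range(M):                           # propagate each basis element x^i
        img = 1 << i
        for _ in range(power):
            img = reduce_poly(square(img))
        for j in range(M):
            if img >> j & 1:
                rows[j] |= 1 << i
    return [bin(r).count('1') for r in rows]

print(taps(1))   # squarer: output bit 4 depends on 2 coefficients (one XOR gate)
print(taps(2))   # quad:    output bit 4 depends on 4 coefficients (three XOR gates)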
The overhead of the quad-ITA is the need to precompute a3 . Since we do not have a
squarer this has to be done by the multiplication block, which is present in the architec-
ture. Using the multiplication unit, cubing is accomplished in two clock cycles without
any additional hardware requirements. Similarly, the final squaring can be done in one
clock cycle by the multiplier with no additional hardware required.
Algorithm 5.1: qitmia (Quad-ITA)
Input: The element a ∈ GF(2^m) and the Brauer chain U = {1, 2, · · · , (m−1)/2, m−1}
Output: The multiplicative inverse a^(−1)
1  begin
2      l = length(U)
3      a^2 = hmul(a, a)            /* hmul: hybrid Karatsuba multiplier proposed in Algorithm 4.2 */
4      α_(u1) = a^3 = a^2 · a
5      foreach u_i ∈ U (2 ≤ i ≤ l − 1) do
6          p = u_(i−1)
7          q = u_i − u_(i−1)
8          α_(ui) = hmul((α_p)^(4^q), α_q)
9      end
10     a^(−1) = hmul(α_(u_(l−1)), α_(u_(l−1)))
11 end
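The schedule of Algorithm 5.1 can be mirrored in software. The sketch below is a Python rendering of it (ours, for illustration only): hmul is modelled by a plain shift-and-xor field multiplication, and the quadblock by repeated squaring.

M = 233
POLY = (1 << 233) | (1 << 74) | 1                 # x^233 + x^74 + 1

def hmul(a, b):                                   # stands in for the hybrid Karatsuba multiplier
    r = 0
    while b:
        if b & 1:
            r ^= a
        b >>= 1
        a <<= 1
        if a >> M:
            a ^= POLY
    return r

def quad(a, times=1):                             # a -> a^(4^times), the quadblock operation
    for _ in range(times):
        a = hmul(hmul(a, a), hmul(a, a))          # two squarings, modelled here as multiplications
    return a

def qitmia(a, U=(1, 2, 3, 6, 7, 14, 28, 29, 58, 116, 232)):
    """Algorithm 5.1: quad-ITA with a Brauer chain U for m - 1."""
    alpha = {1: hmul(hmul(a, a), a)}              # precompute alpha_1 = a^3
    for i in range(1, len(U) - 1):                # 2 <= i <= l - 1
        p, q = U[i - 1], U[i] - U[i - 1]
        alpha[U[i]] = hmul(quad(alpha[p], q), alpha[q])
    return hmul(alpha[U[-2]], alpha[U[-2]])       # the final squaring gives a^(-1)

a = 0xCAFEBABE
assert hmul(a, qitmia(a)) == 1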
For GF(2^233) the inverse is a^(−1) = [α_116(a)]^2. This requires computation of α_116(a) = a^(2^232 − 1) = a^(4^116 − 1) and then doing a squaring, a^(−1) = (α_116(a))^2. We use the same Brauer chain (Equation 5.3) as we did in
the previous example. Excluding the precomputation step, computing α116 (a) requires
9 steps. The total number of quad operations to compute α116 (a) is 115 and the number
of multiplications is 9. The precomputation step requires 2 clock cycles and the final
squaring takes one clock cycle. In all 12 multiplications are required for the inverse
operation. In general for an addition chain for m − 1 of length l, the quad-ITA requires
two additional multiplications compared to the ITA implementation of [59].
#Multiplications : l + 1    (5.7)

#QuadPowers : (m − 1)/2 − 1    (5.8)

The number of clock cycles required is given by Equation 5.9. The summation in the equation is the clock cycles required for the quadblock, while l + 1 is the clock cycles of the multiplier.

#ClockCycles = (l + 1) + Σ_{i=2}^{l−1} ⌈(u_i − u_{i−1})/u_s⌉    (5.9)

The difference in the clock cycles between the ITA of [59] (Equation 5.4) and the quad-ITA (Equation 5.9) is

⌈(u_l − u_{l−1})/u_s⌉ − 1    (5.10)
In general for addition chains used in ECC, the value of ul −ul−1 is as large as (m−1)/2
and much greater than us , therefore the clock cycles saved is significant.
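The saving predicted by Equation 5.10 is easy to check numerically; the short Python sketch below (our own, assuming the Brauer chain of Table 5.1 and u_s = 14) evaluates Equations 5.4 and 5.9 side by side.

from math import ceil

chain, us = [1, 2, 3, 6, 7, 14, 28, 29, 58, 116, 232], 14
l = len(chain)

squarer_ita = l + sum(ceil((chain[i] - chain[i-1]) / us) for i in range(1, l))          # Equation 5.4
quad_ita    = (l + 1) + sum(ceil((chain[i] - chain[i-1]) / us) for i in range(1, l-1))  # Equation 5.9

print(squarer_ita, quad_ita, squarer_ita - quad_ita)    # the difference matches Equation 5.10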
[Fig. 5.2: Quad-ITA Architecture for GF(2^233) with the Addition Chain 5.3 — hybrid Karatsuba multiplier, quadblock, register bank (Regbank), input multiplexers, and a control unit generating sel1, sel2, sel3, rcntl, qsel and en.]

The multiplier is based on the hybrid Karatsuba algorithm (Section 4.5.1). The quadblock (Figure 5.3) consists of 14 cascaded circuits, each circuit generating the fourth power of its input.

[Fig. 5.3: The quadblock: a cascade of quad circuits feeding a multiplexer controlled by the select lines qsel.]

If qin is the input to the quadblock, the powers of qin generated are qin^4, qin^(4^2), qin^(4^3), · · · , qin^(4^14). A multiplexer in the quadblock, controlled by the select lines qsel, determines which of the 14 powers gets passed on to the output. The output of the quadblock can be represented as qin^(4^qsel).

Two buffers, MOUT and QOUT, store the output of the multiplier and the quadblock respectively. At every clock cycle, either the multiplier or the quadblock (but not both) is active (if the en signal is 1 it enables the MOUT buffer, otherwise the QOUT buffer). A register bank may be used to store the result of each step (α_ui) of Algorithm 5.1. A result is stored only if it is required for later computations.
The controller is a state machine designed based on the adder chain and the number
of cascaded quad circuits in the quadblock. At every clock cycle, control signals are
generated for the multiplexer selection lines, enables to the buffers and access signals
to the register bank. As an example, consider the computations of Table 5.4. The
corresponding control signals generated by the controller are as shown in Table 5.5. The first step in the computation of a^(−1) is the determination of a^3. This takes two clock cycles. In the first clock, a is fed to both inputs of the multiplier. This is done by controlling the appropriate select lines of the multiplexers. The result, a^2, is used in the following clock along with a to produce a^3. This is stored in the register bank. The second step is the computation of α_2(a). This too requires two clock cycles. The first clock uses a^3 as the input to the quadblock to compute (α_1)^(4^1). In the next clock, this is multiplied with a^3 to produce the required output. In general, computing any step α_ui(a) = α_(uj+uk)(a) takes 1 + ⌈u_j/14⌉ clock cycles. Of this, ⌈u_j/14⌉ clock cycles are used by the quadblock, while the multiplier requires a single clock cycle. At the end of a step, the result is present in MOUT.
Table 5.5: Control Word for GF(2^233) Quad-ITA for Table 5.4

The length of the addition chain influences the number of clock cycles required to compute the inverse (Equations 5.4 and 5.9), hence proper selection of the addition chain is critical to the design. For a given m, there could be several optimal addition chains. It is required to select one chain from the available optimal chains. The amount of memory required by the addition chain can be used as a secondary selection criterion. The memory utilized by an addition chain is the registers required for storage of the results from intermediate steps. The result of step α_i(a) is stored only if it is required to be used in some other step α_j(a) with j > i + 1. Consider the addition chain in 5.11. Computing α_5(a) = α_(2+3)(a) requires α_2(a), therefore α_2(a) needs to be stored. Similarly, α_1(a), α_5(a) and α_12(a) need to be stored to compute α_3(a), α_17(a) and α_29(a) respectively. In all, four registers are required. Minimizing the number of registers is important because for cryptographic applications m is generally large, therefore each register's size is significant.

Using Brauer chains has the advantage that for every step (except the first) at least one input is read from the output of the previous step. The output of the previous step is stored in MOUT, and therefore need not be read from any register; no storage is required for it. The second input to the step would ideally be a doubling. For example, computing α_116(a) requires only α_58(a). Since α_58(a) is the result from the previous step, it is stored in MOUT. Therefore computing α_116(a) does not require any stored
values.
The number of quad circuits cascaded (us ) has an influence on the clock cycles, fre-
quency, and area requirements of the quad-ITA. Increasing the number of cascaded
blocks would reduce the number of clock cycles (Equation 5.4) required at the cost of
an increase in area and delay.
[Fig. 5.4: Computation time versus the number of cascaded quads in the quadblock, on a Xilinx Virtex 4 FPGA for GF(2^233); the Y axis is the computational time of the cascaded quad block in ns, the X axis the number of cascaded quads (2 to 20).]
Let a single quad circuit require lp LUTs and have a combinational delay of tp . For
this analysis we assume that tp includes the gate delay as well as the path delay. We also
assume that the path delay is constant. The values of lp and tp depend on the finite field
GF (2m ) and the irreducible polynomial. A cascade of us quad circuits would require
us · lp LUTs and have a delay of us · tp .
In order that the quadblock not alter the frequency of operation, us should be se-
lected such that us · tp is less than the maximum combinational delay of the entire
design. In the quad-ITA hardware, the maximum delay is from the Karatsuba multi-
plier, therefore we select us such that the delay of the quadblock is less than the delay
of the multiplier.
us · tp ≤ Delay of multiplier
However, reducing us would increase the clock cycles required. Therefore we select us
so that the quadblock delay is close to the multiplier delay.
The graph in Figure 5.4 plots the computation delay (clock period in nanoseconds
× the clock cycles) required versus the number of quads in the quad-ITA for the field
GF (2233 ). For small values of us , the delay is mainly decided by the multiplier, while
the clock cycles required is large. For large number of cascades, the delay of the quad-
block exceeds that of the multiplier, therefore the delay of the circuit is now decided by
the quadblock. Lowest computation time is obtained with around 11 cascaded quads.
For this, the delay of the quadblock is slightly lower than the multiplier. Therefore,
the critical delay is the path through the multiplier, while the clock cycles required is
around 30. Therefore, for the quad-ITA in the field GF(2^233), 11 cascaded quads result in the least computation time. However, in order to make the clock cycles required to compute the finite field inverse in GF(2^233) equal to that of the parallel implementation of [59], 14 cascaded quads are used, even though this causes a marginal increase in the computation time (which is still well below that of the parallel implementation at 0.55 µs).
[Fig. 5.5: Performance, 1/(LUTs × Delay × Clock Cycles), of the Quad-ITA and the Squarer-ITA for field sizes from 100 to 300.]

In this section we compare our work with reported finite field inverse results. We also test our design for scalability over several fields.
The graph in Figure 5.5 shows the scalability of the quad-ITA and compares it with
a squarer-ITA. The design of the squarer-ITA is similar to that of the quad-ITA (Figure
5.2) except for the quadblock. The quad circuits in the quadblock are replaced by squarer circuits. Both the quadblock and the squarer block have the same number of cascaded
circuits. The platform used for generating the graph is a Xilinx Virtex 4 FPGA. The X
axis has increasing field sizes (see the Appendix for list of finite fields), and the Y axis
has the performance metric shown below.
Performance = frequency / (Slices × ClockCycles)    (5.12)

Slices is the number of slices required on the FPGA as reported by Xilinx's ISE
synthesis tool. The graph shows that the quad-ITA has better performance compared to
the squarer-ITA for most fields.
Table 5.6 compares the quad-ITA with the best reported ITA and Montgomery in-
verse algorithms available. The FPGA used in all designs is the Xilinx Virtex E. The
quad-ITA has the best computation time and performance compared to the other im-
plementations. It may be noted that the larger area of the quad-ITA compared to [58] and [59] is because it uses distributed RAM [61] for registers, while [58] and [59] use block RAM [39]. The distributed RAM requires additional CLB resources while block
RAM does not.
Table 5.6: Comparison for Inversion on Xilinx Virtex E
This chapter discussed the finite field inverter required for the elliptic curve crypto pro-
cessor. The Itoh-Tsujii algorithm was used for the inversion. A generalized version
of the ITA was proposed that improves the utilization of FPGA resources. With this
method, we show that raising an element by a power of 4 (quad operation) on an FPGA
is more compact and faster than using squarers. Thus the quad operation forms the core
of an improved ITA algorithm called the quad-ITA. The quad-ITA takes the fewest clock cycles, has lower computation time, and has better performance than the best reported inversion implementations. The quad-ITA is used for the final inversion
required in the elliptic curve crypto processor. This is discussed in the next chapter.
CHAPTER 6
This chapter presents the construction of an elliptic curve crypto processor (ECCP)
for the NIST specified curve [14] given in Equation 6.1 over the binary finite field
GF (2233 ).
y^2 + xy = x^3 + ax^2 + b    (6.1)
The processor implements the double and add scalar multiplication algorithm described
in Algorithm 3.1. The processor (Figure 6.1), is capable of doing the elliptic curve
operations of point addition and point doubling. Point doubling is done at every iteration
of the loop in Algorithm 3.1, while point addition is done for every bit set to one in the
binary expansion of the scalar input k. The output produced as a result of the scalar multiplication is the point kP.

[Fig. 6.1: Block diagram of the processor: a register bank (Regbank) and an arithmetic unit connected by the buses A0–A3, Qin, C0, C1 and Qout, controlled by the signals c[10:25] and c[0:9], c[29:26]; the output is kP.]
The scalar multiplication implemented in the processor of Figure 6.1 is done using
the López-Dahab (LD) projective coordinate system. The LD coordinate form of the
elliptic curve over binary finite fields is
Y^2 + XYZ = X^3 + aX^2·Z^2 + bZ^4    (6.2)
In the ECCP, a is taken as 1, while b is stored in the ROM along with the basepoint
P . Equations for point doubling and point addition in LD coordinates are shown in
Equations 3.10 and 3.11 respectively.
During the initialization phase the curve constant b and the basepoint P are loaded
from the ROM into the registers after which there are two computational phases. The
first phase multiplies the scalar k to the basepoint P . The result produced by this phase
is in projective coordinates. The second phase of the computation converts the projec-
tive point result of the first phase into the affine point kP . The second phase mainly
involves an inverse computation. The inverse is computed using the quad Itoh-Tsujii
inverse algorithm proposed in Algorithm 5.1.
The next section describes in detail the ECCP. Section 6.2 describes the implemen-
tation of the elliptic curve operations in the processor. Section 6.3 presents the finite
state machine that implements Algorithm 3.1. Section 6.4 has the performance results,
[Fig. 6.2: The register file: register banks with their input multiplexers (MUXIN1, MUXIN2), output multiplexers (MUXOUT1, MUXOUT2, MUXOUT4) and the control signals c[10]–c[17], c[21]–c[25] and c[30]–c[32].]
This section describes in detail the register file, arithmetic unit and the control unit of
the elliptic curve crypto processor.
The heart of the register file (Figure 6.2) are eight registers each of size 233 bits. The
registers are used to store the results of the computations done at every clock cycle.
The registers are dual ported and arranged in three banks, RA, RB, and RC. The dual
ported RAM allows asynchronous reads on the lines out1 and out2 corresponding to the
Table 6.1: Utility of Registers in the Register Bank
Register Description
RA1 1. During initialization it is loaded with Px .
2. Stores the x coordinate of the result.
3. Also used for temporary storage.
RA2 Stores Px .
RB1 1. During initialization it is loaded with Py .
2. Stores the y coordinate of the result.
3. Also used for temporary storage.
RB2 Stores Py .
RB3 Used for temporary storage.
RB4 Stores the curve constant b.
RC1 1. During initialization it is set to 1.
2. Store z coordinate of the projective result.
3. Also used for temporary storage.
RC2 Used for temporary storage.
address on the address lines ad1 and ad2 respectively. A synchronous write of the data
on din is done to the location addressed by ad1. The we signal enables the write. On the
FPGA, the registers are implemented as distributed RAM[61]. At every clock cycle, the
register file is capable of delivering five operands (on buses A0, A1, A2, A3 and Qin)
to the arithmetic unit and able to store three results (from buses C0, C1, and Qout).
The inputs to the register file are either the arithmetic unit outputs, the curve constant (b of Equation 6.2), or the basepoint P = (Px, Py).

[Fig. 6.3: The arithmetic unit: the quadblock, squarer and adder circuits, and the hybrid Karatsuba multiplier, with input multiplexers (MUX A, MUX B) and output multiplexers (MUX C, MUX D) driven by the control signals c[0]–c[9] and c[26]–c[29].]
The arithmetic unit (Figure 6.3) is built using finite field arithmetic circuits and orga-
nized for efficient implementation of point addition (Equation 3.11) and point doubling
(Equation 3.10) in LD coordinates. The AU has 5 inputs (A0 to A3 and Qin) and 3 outputs (C0, C1, and Qout). The main components of the AU are a quadblock and a multiplier. The multiplier is based on the hybrid Karatsuba algorithm (Section 4.5.1). It is used in both phases (the scalar multiplication phase and the conversion to affine coordinates phase) of the computation. The quadblock is designed according to Figure 5.3. Here, the quadblock consists of 14 cascaded quad circuits and is capable of generating the output Qout = Qin^(4^(c[29]···c[26])). The quadblock is used only for inversion
which is done during the final phase of the computation. The AU has several adders and
squarer circuits. These circuits are small compared to the multiplier and the quadblock
and therefore contribute marginally to the overall area and latency of the processor.
6.1.3 Control Unit
At every clock cycle the control unit produces a control word. Control words are pro-
duced in a sequence depending on the type of elliptic curve operation being done. The
control word signals control the flow of data and also decide the operations performed
on the data. There are 33 control signals (c[0] to c[32]) that are generated by the control
unit. The signals c[0] to c[9] control the inputs to the finite field multiplier and the out-
puts C0 and C1 of the AU. The control lines c[26] to c[29] are used for the select lines
to the multiplexer in the quadblock (Figure 5.3). The remaining control bits are used in
the register file to read and write data to the registers. Section 6.3 has the detailed list
of all control words generated.
This section presents the implementation of LD point addition and doubling equations
on the ECCP.
The equation for doubling the point P in LD projective coordinates was shown in Equation 3.10 and is repeated here as Equation 6.3 [30]. The input required for doubling is the point P = (X1, Y1, Z1) and the output is its double 2P = (X3, Y3, Z3). The equation shows that four multiplications are required (assuming a = 1). The ECCP has just one multiplier, which is capable of doing one multiplication per clock cycle. Hence, the ECCP would require at least four clock cycles for computing the double.

X3 = X1^4 + b·Z1^4
Z3 = X1^2 · Z1^2    (6.3)
Y3 = b·Z1^4·Z3 + (Y1^2 + a·X1^2·Z1^2 + b·Z1^4)·X3
This doubling operation is mapped to the elliptic curve hardware using Algorithm
6.1.
Table 6.3: Inputs and Outputs of the Register File for Point Doubling
Clock A0 A1 A2 A3 C0 C1
1 RA1 RC1 - - RC1 RB3
2 - RB4 RB3 - RB3
3 RA1 RB3 RB1 RC1 RC2 RA1
4 RB3 RC1 - RC2 RB1 -
On the ECCP, the LD doubling algorithm can be parallelized to complete in four
clock cycles as shown in Table 6.2 [64]. The parallelization is based on the fact that the
multiplier is several times more complex than the squarer and adder circuits used. So,
in every clock cycle the multiplier is used and it produces one of the outputs of the AU.
The other AU output is produced by additions or squaring operations alone.
Table 6.3 shows the data held on the buses at every clock cycle. It also shows where the results are stored. For example, in clock cycle 1, the contents of the registers RA1 and RC1 are placed on the buses A0 and A1 respectively. Control lines in MUX A and MUX B of the AU are set such that A0^2 and A1^2 are fed to the multiplier. The output multiplexers MUX C and MUX D are set such that M and A1^4 are sent on the buses C0 and C1. These are stored in registers RC1 and RB3 respectively. Effectively, the computations done by the AU are RC1 = RA1^2 · RC1^2 and RB3 = RC1^4. Similarly the subsequent operations required for doubling as stated in Table 6.2 are performed.
A = y2 · Z1^2 + Y1
B = x2 · Z1 + X1
C = Z1 · B
D = B^2 · (C + a · Z1^2)
Z3 = C^2
E = A · C    (6.4)
X3 = A^2 + D + E
F = X3 + x2 · Z3
G = (x2 + y2) · Z3^2
Y3 = (E + Z3) · F + G

6.2.2 Point Addition

The equation for adding an affine point to a point in LD projective coordinates was shown in Equation 3.11 and repeated here in Equation 6.4. The equation adds two points P = (X1, Y1, Z1) and Q = (x2, y2) where Q ≠ ±P. The resulting point is
P + Q = (X3 , Y3 , Z3 ).
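The sequence of field operations in Equation 6.4 can be written out step by step. The Python sketch below is our transcription of those steps (not the thesis code): gf_mul is a plain shift-and-xor multiplier standing in for the hybrid Karatsuba unit, and a is taken as 1 as in the ECCP.

M = 233
POLY = (1 << 233) | (1 << 74) | 1

def gf_mul(a, b):
    r = 0
    while b:
        if b & 1:
            r ^= a
        b >>= 1
        a <<= 1
        if a >> M:
            a ^= POLY
    return r

def ld_mixed_add(X1, Y1, Z1, x2, y2, a=1):
    """Equation 6.4: add the affine point (x2, y2) to the LD projective point (X1, Y1, Z1)."""
    A = gf_mul(y2, gf_mul(Z1, Z1)) ^ Y1
    B = gf_mul(x2, Z1) ^ X1
    C = gf_mul(Z1, B)
    D = gf_mul(gf_mul(B, B), C ^ gf_mul(a, gf_mul(Z1, Z1)))
    Z3 = gf_mul(C, C)
    E = gf_mul(A, C)
    X3 = gf_mul(A, A) ^ D ^ E
    F = X3 ^ gf_mul(x2, Z3)
    G = gf_mul(x2 ^ y2, gf_mul(Z3, Z3))
    Y3 = gf_mul(E ^ Z3, F) ^ G
    return X3, Y3, Z3

Converting the result back with x = X3/Z3 and y = Y3/Z3^2 (Equation 6.5) would recover the affine sum.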
The addition operation is mapped to the elliptic curve hardware using Algorithm
6.2. Note, a is taken as 1. On the ECCP the operations in Algorithm 6.2 are scheduled
Table 6.5: Inputs and Outputs of the Register Bank for Point Addition
Clock A0 A1 A2 A3 C0 C1
1 RB2 RC1 RB1 - RB1 -
2 RA1 RC1 RA2 - RA1 -
3 RA1 - - RC1 RB3 -
4 RA1 RC1 RB3 - RA1 -
5 RA1 RB3 RB1 - RC2 RA1
6 RA1 RB3 RA2 - RC1 RB3
7 RB2 RC1 RA2 - RB1 -
8 RB3 RC1 RB1 RC2 RB1 -
efficiently to complete in eight clock cycles [64]. The scheduled operations for point addition are shown in Table 6.4, and the inputs and outputs of the registers at each clock cycle are shown in Table 6.5.
[Fig. 6.4: State machine of the ECCP: initialization states Init1–Init3 (detect leading 1), doubling states D1–D4, addition states A1–A8 (entered when ki = 1), and the inversion and conversion states I1–I24.]

Table 6.6: Inputs and Outputs of Regbank for Every State

Table 6.7: Control Words for ECCP

State  | Quadblock   | Regfile MUXIN     | Regfile MUXOUT    | Regbank signals | AU Mux C and D | AU Mux A and B
       | c29···c26   | c32 c30 c25 c24   | c31 c23 c22 c21   | c20···c10       | c9···c6        | c5···c0
Init1  | xxxx        | 1010              | 00xx              | 1x01xx001x0     | 0000           | 000000
Init2  | xxxx        | 1010              | 00xx              | 0xx1xx011x1     | xxxx           | xxxxxx
Init3  | xxxx        | 1xxx              | xxxx              | 0xx1xx110xx     | xxxx           | xxxxxx
The three phases of computation done by the ECCP, namely the initialization, scalar
multiplication and projective to affine conversion phase are implemented using the FSM
shown in Figure 6.4. The first three states of the FSM do the initialization. In these
states the curve constant and basepoint coordinates are loaded from ROM into the reg-
isters (Table 6.6). These states also detect the leading MSB in the scalar key k. After
initialization, the scalar multiplication is done. This consists of 4 states for doubling
and 8 for the point addition. The states that do the doubling are D1 · · · D4. In state
D4, a decision is made depending on the key bit ki (i is a loop counter initially set to
the position of the leading one in the key, and ki is the ith bit of the key k). If ki = 1
then a point addition is done and state A1 is entered. If ki = 0, the addition is not done
and the next key bit (corresponding to i − 1) is considered. If ki = 0 and there are no
more key bits to be considered then the complete signal is issued and it marks the end
of the scalar multiplication phase. The states that do the addition are A1 · · · A8. At the
end of the addition (state A8) state D1 is entered and the key bit ki−1 is considered. If
there are no more key bits remaining the complete signal is asserted. Table 6.7 shows
the control words generated at every state.
At the end of the scalar multiplication phase, the result obtained is in projective
coordinates and the X, Y , and Z coordinates are stored in the registers RA1 , RB1 , and
RC1 respectively. To convert the projective point to affine, the following equation is
used.
x = X · Z^(−1)
y = Y · (Z^(−1))^2    (6.5)
The inverse of Z is obtained using the quad-ITA discussed in Algorithm 5.1. The ad-
dition chain used is the Brauer chain in Equation 5.3. The processor implements the
steps given in Table 5.4. Each step in Table 5.4 gets mapped into one or more states
from I1 to I21. The number of clock cycles required to find the inverse is 21. This is fewer than the clock cycles estimated by Equation 5.9, because the inverse can be implemented more efficiently in the ECCP by utilizing the squarers present in the AU.
At the end of state I21, the inverse of Z is present in the register RC1 . The states
I22 and I23 compute the affine coordinates x and y respectively.
The number of clock cycles required for the ECCP to produce the output is computed as follows. Let the scalar k have length l and Hamming weight h; then the clock cycles required to produce the output is given by the following equation.

#ClockCycles = 3 + 4(l − h) + 12(h − 1) + 24

Three clock cycles are added for the initial states, 24 clock cycles are required for the final projective to affine conversion, 12(h − 1) cycles are required to handle the 1's in k (note that the MSB of k does not need to be considered), and 4(l − h) cycles are required for the 0's in k.
In this section we compare our work with reported GF (2m ) elliptic curve crypto pro-
cessors implemented on FPGA platforms (Table 6.8). Our ECCP was synthesized using
Xilinx’s ISE for Virtex 4 and Virtex E platforms. Since, the reported works are done on
different field sizes. We use the measure latency/bit for evaluation. Here latency is
the time required to compute kP . Latency is computed by assuming the scalar k has
half the number of bits 1. The only faster implementations are [37] and [1]. However,
[37] does not perform the final inverse computation required for converting from LD
to affine coordinates. Also, as shown in Table 6.9 our implementation has a better area
time product compared to [1], while the latency is almost equal. To compare the two
designs we scaled the area of [1] by a factor of (233/m)2 , since area of the elliptic curve
processors is mostly influenced by the multiplier which has an area of O(n2 ). The time
is scaled by a factor (233/m), since the time required is linear.
Table 6.8: Comparison of the Proposed GF(2^m) ECCP with FPGA based Published Results
This chapter integrated the previously developed finite field arithmetic blocks to form an arithmetic unit. The AU is used in an elliptic curve crypto processor to compute the scalar product kP for a NIST specified curve. Our ECCP has better timing per bit than most of the reported works; of all the works compared, only two have better timing than ours. We showed that our design has more efficient FPGA utilization compared to these works.
CHAPTER 7
The previous chapter presented the construction of an elliptic curve crypto processor.
This chapter discusses issues regarding side channel analysis of the processor. First a
side channel attack based on simple power analysis (SPA) of the ECCP is demonstrated.
Then the architecture of the ECCP is modified to reduce the threat of SPA. We call this
new architecture SPA resistant elliptic curve crypto processor (SR-ECCP).
This chapter is organized as follows : the next section demonstrates a simple power
analysis on the ECCP. Section 7.2 presents the SR-ECCP and shows how the power
traces do not reveal the key any more. The final section has the conclusion.
The state machine for the scalar multiplication in the ECCP has 12 states (Figure 6.4),
4 states (D1 · · · D4) for doubling and 8 states (A1 · · · A8) for addition. Each iteration
in the scalar multiplication handles a bit in the key starting from the most significant
one to the least significant bit. If the key bit is zero a doubling is done and no addition
is done. If the key bit is one the doubling is followed by an addition. The dissimilarity
in the way a 1 and a 0 in the key is handled makes the ECCP vulnerable to side channel
attacks as enumerated below.
• The duration of an iteration depends on the key bit. A key bit of 0 leads to a short
cycle compared to a key bit of 1. Thus measuring the duration of an iteration will
give an attacker knowledge about the key bit.
Fig. 7.1: Power Trace for a Key with all 1s
Fig. 7.2: Power Trace for a Key with all 0s
• Each state in the FSM has a unique power consumption trace. Monitoring the
power consumption trace would reveal if an addition is done thus revealing the
key bit.
To demonstrate the attack we used Xilinx's XPower tool1. Given a value change
dump (VCD) file generated from a flattened post map or post route netlist, XPower is
capable of generating a power trace for a given testbench (details on generating the
power trace is given in Appendix C).
Figures 7.1 and 7.2 are partial power traces generated for the key (F F F F F F F F )16
and (80000000)16 respectively. The graphs plots the power on the Y axis with the time
line on the X axis for a Xilinx Virtex 4 FPGA. The difference in the graphs is easily
noticeable. The spikes in Figure 7.1 occurs in state A6. This state is entered only when
a point addition is done, which in turn is done only when the key bit is 1. The spikes
are not present in Figure 7.2 as the state A6 is never entered. Therefore the spikes in
the trace can be used to identify ones in the key.
The duration between two spikes in Figure 7.1 is the time taken to do a point dou-
bling and a point addition. This is 12 clock cycles. If there are two spikes with a
distance greater than 12 clock cycles, it indicates that one or more zeroes are present in
the key. The number of zeroes (n) present can be determined by Equation 7.1. In the equation, t is the duration between the two spikes and T is the time period of the clock.

n = t/(4T) − 3    (7.1)

Fig. 7.3: Power Trace when k = (B9B9)16

1
http://www.xilinx.com/products/design_tools/logic_design/verification/xpower.htm
The number of zeroes between the leading one in k and the one due to the first spike
can be inferred by the amount of shift in the first spike.
As an example consider the power trace (Figure 7.3) for the ECCP obtained when
the key was set to (B9B9)16 . There are 9 spikes indicating 9 ones in the key (excluding
the leading one). Table 7.1 infers the key from the time duration between spikes. The
clock has a period T = 200ns.
The first spike t1 is obtained at 3506 ns. If there were no zeroes before t1, the spike should have been present at 2706 ns (this is obtained from the first spike of Figure 7.1). The shift is 800 ns, equal to four clock cycles. Therefore a 0 is present before the t1 spike.
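The bookkeeping behind Equation 7.1 and the spike-shift argument fits in a short script. The sketch below is our own reconstruction of the attack arithmetic; the spike-to-spike gaps are hypothetical placeholders, not measurements from the thesis.

T = 200e-9                                   # clock period (200 ns)
ITER_1 = 12 * T                              # a double followed by an add: 12 clock cycles
ITER_0 = 4 * T                               # a double alone: 4 clock cycles

def zeros_between(t):
    """Equation 7.1: number of 0 bits between two consecutive spikes separated by t seconds."""
    return round(t / (4 * T) - 3)

# hypothetical spike-to-spike gaps (seconds); each gap ends in a 1 bit
gaps = [ITER_1, ITER_1 + ITER_0, ITER_1, ITER_1 + 2 * ITER_0]
bits = "".join("0" * zeros_between(t) + "1" for t in gaps)
print("1" + bits)                            # prepend the leading 1 of the key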
Table 7.1: SPA for the key (B9B9)16
The key obtained from the attack is (1011100110111001)2 , and it matches the actual
key.
To harden the ECCP against SPA, the sequence of computations involved when the key
bit is 1 and when the key bit is 0 must be indistinguishable. There are several ways
to achieve this. The most common technique is to insert a dummy addition when the key bit is 0 [66]. This is shown in Figure 7.4. With this method, a doubling and an addition are always done. The value of the key bit decides if the addition should be considered. This makes the sequence for a key bit of 1 indistinguishable from that for a 0. The time for an iteration is constant, thereby reducing the threat of timing attacks. Similar power traces are seen at every iteration, thus reducing the threat of power attacks. The following
section modifies the ECCP architecture using the dummy addition to make it robust
against SPA.
[Fig. 7.4: Double-and-add-always: both the double and the addition are computed, and a multiplexer controlled by the key bit ki selects the result.]
Modifying the ECCP to incorporate ’adding always’ requires a change in the FSM and
the register file. The new FSM is as shown in Figure 7.5. Irrespective of the key bit
all states D1 · · · D4 and A1 · · · A8 are entered in every iteration. If the key bit is 1 the
result of state A8 is considered as the output of the iteration. If the key bit is 0 the result
of D4 is taken as the output. After all key bits are processed the complete signal is
asserted.
[Fig. 7.5: Modified state machine of the SR-ECCP: the addition states A1–A8 are entered in every iteration irrespective of the key bit.]
The SR-ECCP also requires a modification in the register file as shown in Figure
7.6. An additional register bank RD containing three registers is introduced. The three registers in the bank, RD1, RD2 and RD3, store the coordinates of the computed double. The outputs of the register bank are used in state A8 only when the key bit is 0. RD requires an additional input multiplexer MUXIN4 to store the doubled result. The sizes of the output multiplexers MUXOUT1, MUXOUT2 and MUXOUT3 are increased to incorporate RD's outputs.
[Fig. 7.6: The modified register file of the SR-ECCP, with the additional register bank RD (RD1–RD3), its input multiplexer MUXIN4, and the enlarged output multiplexers.]
Figure 7.7 has the power trace for the SR-ECCP for the key (B9B9)16 . This is the same
key used in the power trace of Figure 7.3. However, unlike Figure 7.3, Figure 7.7 has
no periodic spikes. Thus using a simple power analysis the key cannot be inferred from
Figure 7.7.
Fig. 7.7: Power Trace when k = (B9B9)16
The modification of the ECCP to improve its security comes at a cost of increased area,
lower frequency and increased computation time. Table 7.2 shows the overhead of the
SR-ECCP compared to the ECCP. The clock cycles is the number of clocks required
to compute kP , assuming k has 116 zeroes out of 233 and the MSB of k is 1. The
clock cycles required for the SR-ECCP is always a constant irrespective of the number
of zeroes in k.
7.3 Conclusion
This chapter demonstrated the vulnerability of the ECCP to simple power analysis.
Simulations show that the power traces of the processor leak the secret key. The vulnera-
bilities of the ECCP were fixed in the SR-ECCP, which does homogeneous operations
irrespective of the key bit. The penalty of the SR-ECCP is a larger area requirement
and lower frequency compared to the ECCP.
CHAPTER 8
The thesis explores various architectures for the construction of an elliptic curve crypto
processor for high performance applications. The most important factors contributing to the performance are the finite field multiplication and the finite field inversion. A combina-
tional multiplier is able to obtain the product in one clock cycle at the cost of increased
area and delay. In order to ensure that the primitives have a good area delay product,
the thesis suggests techniques to reduce the area time product by effectively utilizing
the available FPGA resources.
A hybrid Karatsuba multiplier is proposed for finite field multiplication, which has
been shown to possess the best area time product compared to reported Karatsuba im-
plementations. The hybrid Karatsuba multiplier is a recursive algorithm which does the
initial recursions using the simple Karatsuba multiplier [55], while the final recursion is
done using the general Karatsuba multiplier [55]. The general Karatsuba has large gate
counts, however it is more compact for small sized multiplications due to the better LUT
utilization. The simple Karatsuba multiplier is more efficient for large sized multipli-
cations. After a thorough search, a threshold of 29 was found. Multiplications smaller
than 29 bits are done using the general Karatsuba multiplier, while larger multiplications
are done with the simple Karatsuba multiplier.
The quad-Itoh Tsujii inversion algorithm proposed to find the multiplicative inverse
has the best computation time and area time product compared to works reported in
literature. This work first generalizes the Itoh-Tsujii algorithm and then shows that a
specific instance of the generalization, which uses quad circuits instead of squarers, is
more efficient on FPGAs.
An elliptic curve crypto processor is built using the proposed finite field primitives.
Except for [1], the constructed processor has better timing than all reported works.
However, the constructed processor has much better area requirements and area time
product compared to [1]. These were achieved in spite of the fact that the scalar multiplication implemented was straightforward and no parallelism or pipelining was used in the architecture.
• The focus of this work was on the implementation of efficient elliptic curve prim-
itives for ECC and its impact on the overall performance of the ECCP. Thus a
possible future work could be to combine architectural techniques like pipelining
and parallelism in the higher level scalar multiplier with techniques proposed in
this thesis.
• A simple power attack was analyzed and prevented in the side channel resistant
version of the elliptic curve crypto processor. A very interesting field of research,
would be to study the effect of the more powerful differential power analysis
(DPA) on the proposed architecture.
• To make the work proposed in this thesis usable in practice, the developed el-
liptic curve crypto processor may be incorporated in security toolkits such as
OpenSSL1. This involves the development of a communication interface for communication with the host processor, operating system device drivers and library modifications.

1
http://www.openssl.org
APPENDIX A
The elliptic curve crypto processor (ECCP) and the side channel resistant version of
the ECCP, the SR-ECCP, have to be verified for their correctness. The verification was
done for the curve given in Equation A.1.
y^2 + xy = x^3 + ax^2 + b    (A.1)

The basepoint and the values of the curve constants used are given in Table A.1. These constants were taken from NIST's digital signature specification [14] for elliptic curves over GF(2^233).
For a key (k), the scalar product kP is determined by simulation of the ECCP (or
the SR-ECCP) with Modelsim or iVerilog. Here, P is the basepoint with coordinates
(Px , Py ). The result thus obtained is verified against the result obtained by running the
Table A.1: Basepoint and Curve Constants used for Verification of the ECCP and the
SR-ECCP
Basepoint X (Px ) 233’h0FAC9DFCBAC8313BB2139F1
BB755FEF65BC391F8B36F8F8EB7371FD558B
Basepoint Y (Py ) 233’h1006A08A41903350678E585
28BEBF8A0BEFF867A7CA36716F7E01F81052
Curve constant (b) 233’h066647EDE6C332C7F8C0923
BB58213B333B20E9CE4281FE115F7D8F90AD
Curve constant (a) 1
[Fig. A.1: Test platform: the elliptic curve crypto processor on the Virtex 4 FPGA connected through a USB controller to the host PC.]
elliptic curve software with the same key k. The elliptic curve software was obtained
from the book Implementing Elliptic Curve Cryptography by Michael Rosing [67].
A Python 1 script was developed which would automatically generate a random key
k. This key is used by Rosing’s software to determine Q1 = kP . The key is also used
in the test vector of the ECCP(or SR-ECCP) to determine Q2 = kP . The python script
would then verify if Q1 = Q2 . A large number of scalar multiplications were tested
using the above mentioned procedure.
The testing of the ECCP was done using the Virtex 4 FPGA board from Dinigroup2 .
The simplified block diagram of the test platform is shown in Figure A.1. USB com-
munication software supplied by the manufacturer was used to communicate between
the PC and the hardware. Onboard devices convert the USB protocol into a proprietary
main bus protocol3 . This channel is used to configure the FPGAs as well as commu-
nicate with the elliptic curve processor. Our implementation resides in the Virtex 4 FPGA (FPGA_A). The main bus slave has eight 32 bit input registers (Rin0 · · · Rin7)
and sixteen output registers (Rout0 · · · Rout15). It also has a control register contain-
1
www.python.org
2
http://www.dinigroup.com/DN8000k10pcie.php
3
http://www.dinigroup.com/product/common/mainbus_spec.pdf
Table A.2: ECCP System Specifications on the Dini Hardware
Frequency 24MHz
Slices occupied 22526
Size on device 25%
Clock Cycles Required 1883 (average case)
Critical Path Follows the path from register bank, quadblock
MUX C and register bank again.
ing status bits such as start (to start the scalar multiplication) and done (to indicate
completion). To initialize, the 233 bit scalar k (in Algorithm 3.1) is loaded into the
registers Rin0 to Rin7. On completion the result Qx and Qy can be read from Rout0
to Rout15. The results from testing on the hardware was as expected. Table A.2 shows
the specifications of the system when used with the Dini card.
APPENDIX B
The graph in Figure 5.5 was plotted after synthesizing the quad-ITA and the squarer-
ITA for several finite fields. The following table contains the addition chains, irre-
ducible polynomials and number of cascaded quad circuits in the quadblock for each
implementation of the (quad-)ITA.
APPENDIX C

There are two forms of power dissipation for a device: static and dynamic power. Static power is the amount of power dissipated by the device when no clock is running. During this phase no signals toggle, hence the power consumed is the minimum power required
to maintain the state of the logic cell. Dynamic power is the amount of power dissipated
by the device when the clock is running. The dynamic power is considerably higher than
the static power consumed by the device, and it is generally caused when one or more
of the inputs toggle. Analysis of the instantaneous dynamic power of the device is used
in side channel attacks.
C.1 XPower
The XPower tool estimates the power consumption for a variety of Xilinx FPGA archi-
tectures. The estimation is based on the device and the number of transitions (activity
rate) of the device.
The following procedure is used to estimate the power consumed by a device using
Xilinx’s ISE and XPower.
• The developed verilog code is synthesized using the Xilinx ISE tool. The result
of synthesis is a .ngd file. This file is a netlist of primitive gates which could be
implemented on several of the Xilinx FPGAs.
• The next step is to map the primitives onto the resources available on the specific
FPGA platform. This is done by the Xilinx map tool. The output of the tool is an
.ncd file.
• The .ncd file is then passed to the place and route tool, where specific locations
on the FPGA are assigned. This tool tries to incorporate all the timing constraints
specified in the constraints file. The output of the place and route tool is an
updated .ncd file.
• In ISE, a flattened verilog netlist can be generated after the mapping or the place
and route. This verilog netlist after the mapping can be created by clicking the
generate post-map simulation model. This would create a verilog netlist called
topmodule_map.v. Also a .sdf file is created containing timing information of the
device.
• Now the flattened verilog file and the sdf along with a testbench can be simulated
in Modelsim. A value change dump file containing all the signal transitions can
be generated from the simulation. This requires the following lines to be present
in the test bench.
initial begin
    $dumpfile ("dump.vcd");  /* File to place signal activity report */
    $dumpvars;               /* Dump all signals in the design */
    $dumpon;                 /* Turn on dump */
    #100000 $dumpoff;        /* Turn off dump */
end
These lines will result in a file called dump.vcd to be generated during simulation.
The VCD file contains the activity on each signal in the design.
• The constraints file (.pcf ), the .vcd file and the .ncd file are used as inputs to
XPower. XPower can be run from command line as shown below.
xpwr topmodule_map.ncd topmodule.pcf -s dump.vcd
The result produced by xpwr is present in a text file called topmodule.txt. The
topmodule.txt file contains the instantaneous power consumption for the given
test vector.
If the .sdf file generated by ISE is used in XPower, then the power measurement
would include the power consumed due to glitches. If the post place and route verilog
netlist was used instead of the mapped netlist then more accurate power measurement
is possible.
APPENDIX D
This appendix derives the elliptic curve equations for points in affine coordinates and
López-Dahab projective coordinates.
Consider the elliptic curve E over the field GF(2^m). This is given by

y^2 + xy = x^3 + ax^2 + b    (D.1)

where a, b ∈ GF(2^m).

dF/dy = x
dF/dx = x^2 + y    (D.3)
If we consider the curve given in Equation D.1, with b = 0, then the point (0, 0)
lies on the curve. At this point dF/dy = dF/dx = 0. This forms a singular point
and cannot be included in the elliptic curve group, therefore an additional condition of
b ≠ 0 is required on the elliptic curve of Equation D.1. This condition ensures that the
curve is non singular.
D.1 Equations for Arithmetic in Affine Coordinates
Let P = (x1 , y1 ) be a point on the elliptic curve of Equation D.1. To find the inverse of
point P , a vertical line is drawn passing through P . The equation of this line is x = x1 .
The point at which this line intersects the curve is the inverse −P. The coordinates of −P are (x1, y1′). To find y1′, the point of intersection between the line and the curve must be found. Equation D.2 is represented in terms of its roots p and q as shown below. The coefficient of y is the sum of the roots. Equating the coefficients of y in Equations D.2 and D.4,
p + q = x1
p = x1 + y1
This is the y coordinate of the inverse. The inverse of the point P is therefore given by
(x1 , x1 + y1 ).
Let P = (x1 , y1 ) and Q = (x2 , y2 ) be two points on the elliptic curve. To add the two
points, a line (l) is drawn through P and Q. If P ≠ ±Q, the line intersects the curve of Equation D.1 at the point −R = (x3, y3′). The inverse of the point −R is R = (P + Q)
having coordinates (x3 , y3 ).
The slope of the line l passing through P and Q is given by

λ = (y2 − y1)/(x2 − x1)

y − y1 = λ(x − x1)
y = λ(x − x1) + y1    (D.5)

Equation D.6 is a cubic equation having three roots. Let the roots be p, q and r. These roots represent the x coordinates of the points on the line that intersect the curve (the points P, Q and −R). Equation D.6 can also be represented in terms of its roots as

(x − p)(x − q)(x − r) = 0
x^3 − (p + q + r)x^2 + · · · = 0    (D.7)

p + q + r = λ^2 + λ + a    (D.8)
Since P = (x1 , y1 ) and Q = (x2 , y2 ) lie on the line l, therefore two roots of Equation
D.6 are x1 and x2 . Substituting p = x1 and q = x2 in Equation D.8 we get the third
root; this is the x coordinate of the third point on the line which intersects the curve (i.e. −R). This point is denoted by x3, and it also represents the x coordinate of R.

x3 = λ^2 + λ + x1 + x2 + a    (D.9)
Reflecting this point about the x axis is done by substituting y3′ = x3 + y3 . This gives
the y coordinate of R, denoted by y3 .
y3 = λ(x3 + x1 ) + y1 + x3 (D.11)
Since we are working with binary finite fields, subtraction is the same as addition.
Therefore,
x3 = λ^2 + λ + x1 + x2 + a
y3 = λ(x3 + x1) + y1 + x3    (D.12)
λ = (y2 + y1)/(x2 + x1)
Let P = (x1 , y1 ) be a point on the elliptic curve. The double of P , ie. 2P , is found by
drawing a tangent t through P . This tangent intersects the curve at the point −2P =
(x3 , y3′ ). Taking the reflection of the point −2P about the X axis gives 2P = (x3 , y3 ).
First, let us look at the tangent t through P . The slope of the tangent t is obtained
by implicit differentiation of Equation D.1.
2y·(dy/dx) + x·(dy/dx) + y = 3x^2 + 2ax

Since we are using modulo 2 arithmetic,

x·(dy/dx) + y = x^2

The slope dy/dx of the line t passing through the point P is given by

λ = (x1^2 + y1)/x1    (D.13)
y + y1 = λ(x + x1 ) (D.14)
This gives,
y = λ(x + x1 ) + y1
This equation is cubic and has three roots. Of these three roots, two roots must be
equal since the line intersects the curve at exactly two points. The two equal roots are
represented by p. The sum of the three roots is (λ2 + λ + a), similar to Equation D.7.
Therefore,

p + p + r = λ^2 + λ + a
r = λ^2 + λ + a

The dissimilar root is r. This root corresponds to the x coordinate of −2P, i.e. x3. Therefore,

x3 = λ^2 + λ + a

To find the y coordinate of −2P, i.e. y3′, substitute x3 in Equation D.14. This gives

y3′ = λx3 + x1^2

To find y3, the y coordinate of 2P, the point y3′ is reflected about the x axis. From the point inverse equation,

y3 = λx3 + x1^2 + x3

x3 = λ^2 + λ + a
y3 = x1^2 + λx3 + x3    (D.16)
λ = x1 + y1/x1
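The affine formulas derived above can be exercised directly. The Python sketch below (ours, not part of the thesis) implements Equations D.12 and D.16 over GF(2^233) and checks that the results stay on the curve, using the NIST basepoint and constants listed in Table A.1.

M = 233
POLY = (1 << 233) | (1 << 74) | 1                       # x^233 + x^74 + 1
A_COEF = 1
B_COEF = 0x066647EDE6C332C7F8C0923BB58213B333B20E9CE4281FE115F7D8F90AD
PX = 0x0FAC9DFCBAC8313BB2139F1BB755FEF65BC391F8B36F8F8EB7371FD558B
PY = 0x1006A08A41903350678E58528BEBF8A0BEFF867A7CA36716F7E01F81052

def mul(a, b):
    r = 0
    while b:
        if b & 1:
            r ^= a
        b >>= 1
        a <<= 1
        if a >> M:
            a ^= POLY
    return r

def inv(a):                                             # Fermat: a^(2^m - 2)
    r, sq = 1, a
    for _ in range(1, M):                               # exponent bits 1 .. m-1 are all 1
        sq = mul(sq, sq)
        r = mul(r, sq)
    return r

def on_curve(x, y):                                     # y^2 + xy = x^3 + ax^2 + b
    return mul(y, y) ^ mul(x, y) == mul(mul(x, x), x) ^ mul(A_COEF, mul(x, x)) ^ B_COEF

def double(x1, y1):                                     # Equation D.16
    lam = x1 ^ mul(y1, inv(x1))
    x3 = mul(lam, lam) ^ lam ^ A_COEF
    y3 = mul(x1, x1) ^ mul(lam, x3) ^ x3
    return x3, y3

def add(x1, y1, x2, y2):                                # Equation D.12
    lam = mul(y2 ^ y1, inv(x2 ^ x1))
    x3 = mul(lam, lam) ^ lam ^ x1 ^ x2 ^ A_COEF
    y3 = mul(lam, x3 ^ x1) ^ y1 ^ x3
    return x3, y3

assert on_curve(PX, PY)
X2, Y2 = double(PX, PY)
assert on_curve(X2, Y2) and on_curve(*add(PX, PY, X2, Y2))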
106
D.2 Equations for Arithmetic in LD Projective Coordi-
nates
Inverting a point P = (x1, y1) on the elliptic curve results in the point −P = (x3, y3) =
(x1, x1 + y1). In LD projective coordinates, x_1 = X_1/Z_1, y_1 = Y_1/Z_1^2, x_3 = X_3/Z_3 and
y_3 = Y_3/Z_3^2. Then

\frac{X_3}{Z_3} = \frac{X_1}{Z_1}

therefore X_3 = X_1 and Z_3 = Z_1. Also,

\frac{Y_3}{Z_3^2} = \frac{X_1}{Z_1} + \frac{Y_1}{Z_1^2} = \frac{X_1 Z_1 + Y_1}{Z_1^2}

so that Y_3 = X_1 Z_1 + Y_1.

For point addition, let P = (X_1, Y_1, Z_1) be in LD projective coordinates and Q = (x_2, y_2)
in affine coordinates. The slope of Equation D.12 becomes

\lambda = \frac{y_2 + (Y_1/Z_1^2)}{x_2 + (X_1/Z_1)} = \frac{y_2 Z_1^2 + Y_1}{Z_1 (x_2 Z_1 + X_1)} = \frac{A}{Z_1 \cdot B}

where A = y_2 Z_1^2 + Y_1 and B = x_2 Z_1 + X_1.
Consider the equation for x_3 in Equation D.12:

x_3 = \frac{X_3}{Z_3} = \left(\frac{A}{B Z_1}\right)^2 + \frac{A}{B Z_1} + \frac{X_1}{Z_1} + x_2 + a
    = \frac{A^2 + A B Z_1 + B^2 X_1 Z_1 + B^2 x_2 Z_1^2 + a B^2 Z_1^2}{(B Z_1)^2}

Therefore, writing C = B Z_1,

Z_3 = (B Z_1)^2 = C^2 \qquad (D.17)

and, since X_1 Z_1 + x_2 Z_1^2 = Z_1 B,

X_3 = A^2 + A C + B^2 X_1 Z_1 + B^2 x_2 Z_1^2 + a B^2 Z_1^2
    = A^2 + A C + B^2 (Z_1 B + a Z_1^2)
X_3 = A^2 + E + D \qquad (D.18)

where E = A C = A B Z_1 and D = B^2 (Z_1 B + a Z_1^2).
For the y coordinate, Y_3 = Z_3^2\, y_3 with y_3 = \lambda(x_3 + x_1) + y_1 + x_3, which expands to
Y_3 = X_1 A B^3 Z_1^2 + E X_3 + X_3 Z_3 + B^4 Y_1 Z_1^2. Substituting X_1 = B + x_2 Z_1 and
E = A B Z_1 we get

Y_3 = (B + x_2 Z_1) A B^3 Z_1^2 + E X_3 + X_3 Z_3 + B^4 Y_1 Z_1^2
    = y_2 Z_3^2 + E x_2 Z_3 + E X_3 + X_3 Z_3
    = (G + x_2 Z_3^2) + E x_2 Z_3 + E X_3 + X_3 Z_3
Y_3 = G + F (E + Z_3) \qquad (D.19)

where F = X_3 + x_2 Z_3 and G = (x_2 + y_2) Z_3^2.
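Equations D.17–D.19 can be validated in software by comparing the mixed-coordinate result with
the affine formulas of Equation D.12. The sketch below is again illustrative only: the toy
field, the curve coefficients and the value of Z1 are arbitrary choices rather than parameters
used in the thesis.

M, IRRED = 4, 0b10011            # toy field GF(2^4), x^4 + x + 1 (illustrative)
A_COEF, B_COEF = 0b0001, 0b1000  # hypothetical curve coefficients a and b

def gf_mul(x, y):
    """Polynomial-basis multiplication modulo IRRED."""
    r = 0
    while y:
        if y & 1:
            r ^= x
        y >>= 1
        x <<= 1
        if x >> M:
            x ^= IRRED
    return r

def gf_inv(x):
    """Fermat inversion: x^(2^M - 2)."""
    r, e = 1, (1 << M) - 2
    while e:
        if e & 1:
            r = gf_mul(r, x)
        x = gf_mul(x, x)
        e >>= 1
    return r

def on_curve(x, y):
    return gf_mul(y, y) ^ gf_mul(x, y) == \
           gf_mul(x, gf_mul(x, x)) ^ gf_mul(A_COEF, gf_mul(x, x)) ^ B_COEF

def affine_add(x1, y1, x2, y2):
    """Equation D.12 (assumes x1 != x2)."""
    lam = gf_mul(y1 ^ y2, gf_inv(x1 ^ x2))
    x3 = gf_mul(lam, lam) ^ lam ^ x1 ^ x2 ^ A_COEF
    y3 = gf_mul(lam, x1 ^ x3) ^ x3 ^ y1
    return x3, y3

def ld_mixed_add(X1, Y1, Z1, x2, y2):
    """P in LD projective coordinates plus affine Q, following Equations D.17-D.19."""
    A = gf_mul(y2, gf_mul(Z1, Z1)) ^ Y1                            # A = y2*Z1^2 + Y1
    B = gf_mul(x2, Z1) ^ X1                                        # B = x2*Z1 + X1
    C = gf_mul(B, Z1)                                              # C = B*Z1
    D = gf_mul(gf_mul(B, B), C ^ gf_mul(A_COEF, gf_mul(Z1, Z1)))   # D = B^2*(C + a*Z1^2)
    Z3 = gf_mul(C, C)                                              # Equation D.17
    E = gf_mul(A, C)                                               # E = A*B*Z1
    X3 = gf_mul(A, A) ^ E ^ D                                      # Equation D.18
    F = X3 ^ gf_mul(x2, Z3)
    G = gf_mul(x2 ^ y2, gf_mul(Z3, Z3))
    Y3 = G ^ gf_mul(F, E ^ Z3)                                     # Equation D.19
    return X3, Y3, Z3

# Compare both routes for two curve points with distinct x coordinates.
pts = [(x, y) for x in range(1 << M) for y in range(1 << M) if on_curve(x, y)]
(x1, y1), (x2, y2) = pts[0], pts[1]
Z1 = 0b0110                                   # any non-zero Z1 represents the same point P
X1, Y1 = gf_mul(x1, Z1), gf_mul(y1, gf_mul(Z1, Z1))
X3, Y3, Z3 = ld_mixed_add(X1, Y1, Z1, x2, y2)
x3 = gf_mul(X3, gf_inv(Z3))
y3 = gf_mul(Y3, gf_inv(gf_mul(Z3, Z3)))
assert (x3, y3) == affine_add(x1, y1, x2, y2)
print("LD mixed addition matches the affine result")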
For point doubling, substituting \lambda = x_1 + y_1/x_1 into x_3 = \lambda^2 + \lambda + a
(Equation D.16) gives

x_3 = x_1^2 + \frac{y_1^2}{x_1^2} + x_1 + \frac{y_1}{x_1} + a
    = \frac{x_1^4 + y_1^2 + x_1^3 + x_1 y_1 + a x_1^2}{x_1^2} \qquad (D.20)

Using the curve equation (y_1^2 + x_1 y_1 = x_1^3 + a x_1^2 + b), the numerator reduces to
x_1^4 + b, so that x_3 = x_1^2 + b/x_1^2. Converting to LD projective coordinates with
x_1 = X_1/Z_1 and x_3 = X_3/Z_3,

\frac{X_3}{Z_3} = \frac{X_1^2}{Z_1^2} + \frac{b Z_1^2}{X_1^2} = \frac{X_1^4 + b Z_1^4}{X_1^2 Z_1^2}
Therefore,

X_3 = X_1^4 + b Z_1^4
Z_3 = X_1^2 Z_1^2

For the y coordinate, from Equation D.16,

y_3 = x_1^2 + \left(x_1 + \frac{y_1}{x_1}\right) x_3 + x_3
    = (x_1^2 + x_3) + \frac{x_1^3 + x_1 y_1}{x_1^2}\, x_3

Since x_3 = x_1^2 + b/x_1^2, the first term equals b/x_1^2, and by the curve equation
x_1^3 + x_1 y_1 = y_1^2 + a x_1^2 + b. Hence

y_3 = \frac{b}{x_1^2} + \frac{y_1^2 + a x_1^2 + b}{x_1^2}\, x_3

Therefore, in LD projective coordinates,

Y_3 = b Z_1^4 Z_3 + (Y_1^2 + a X_1^2 Z_1^2 + b Z_1^4) X_3
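The doubling formulas can be exercised in the same way. The sketch below (again with an
illustrative toy field and hypothetical curve coefficients) evaluates X3 = X1^4 + b·Z1^4,
Z3 = X1^2·Z1^2 and the Y3 expression above, converts the result back to affine coordinates,
and checks it against the affine doubling of Equation D.16.

M, IRRED = 4, 0b10011            # toy field GF(2^4), x^4 + x + 1 (illustrative)
A_COEF, B_COEF = 0b0001, 0b1000  # hypothetical curve coefficients a and b

def gf_mul(x, y):
    """Polynomial-basis multiplication modulo IRRED."""
    r = 0
    while y:
        if y & 1:
            r ^= x
        y >>= 1
        x <<= 1
        if x >> M:
            x ^= IRRED
    return r

def gf_inv(x):
    """Fermat inversion: x^(2^M - 2)."""
    r, e = 1, (1 << M) - 2
    while e:
        if e & 1:
            r = gf_mul(r, x)
        x = gf_mul(x, x)
        e >>= 1
    return r

def on_curve(x, y):
    return gf_mul(y, y) ^ gf_mul(x, y) == \
           gf_mul(x, gf_mul(x, x)) ^ gf_mul(A_COEF, gf_mul(x, x)) ^ B_COEF

def affine_double(x1, y1):
    """Equation D.16 (assumes x1 != 0)."""
    lam = x1 ^ gf_mul(y1, gf_inv(x1))
    x3 = gf_mul(lam, lam) ^ lam ^ A_COEF
    y3 = gf_mul(x1, x1) ^ gf_mul(lam, x3) ^ x3
    return x3, y3

def ld_double(X1, Y1, Z1):
    """LD projective doubling as derived above."""
    Z1sq = gf_mul(Z1, Z1)
    X1sq = gf_mul(X1, X1)
    bZ14 = gf_mul(B_COEF, gf_mul(Z1sq, Z1sq))     # b*Z1^4
    Z3 = gf_mul(X1sq, Z1sq)                       # Z3 = X1^2 * Z1^2
    X3 = gf_mul(X1sq, X1sq) ^ bZ14                # X3 = X1^4 + b*Z1^4
    Y3 = gf_mul(bZ14, Z3) ^ gf_mul(X3, gf_mul(Y1, Y1) ^ gf_mul(A_COEF, Z3) ^ bZ14)
    return X3, Y3, Z3

# Pick a curve point with x1 != 0 (so the double is a finite point) and compare both routes.
x1, y1 = next((x, y) for x in range(1, 1 << M) for y in range(1 << M) if on_curve(x, y))
Z1 = 0b0101                                       # arbitrary non-zero Z1
X1, Y1 = gf_mul(x1, Z1), gf_mul(y1, gf_mul(Z1, Z1))
X3, Y3, Z3 = ld_double(X1, Y1, Z1)
x3 = gf_mul(X3, gf_inv(Z3))
y3 = gf_mul(Y3, gf_inv(gf_mul(Z3, Z3)))
assert (x3, y3) == affine_double(x1, y1)
print("LD doubling matches the affine result")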
APPENDIX E
This appendix estimates the number of AND and XOR gates required for the simple Karatsuba
multiplier.

E.1 Gate Requirements for the Basic Karatsuba Multiplier

For an m = 2^k bit basic Karatsuba multiplier, the first recursion splits the m bit
multiplicands into m/2 bit halves and requires three m/2 = 2^{k-1} bit multipliers. The second
recursion requires nine m/4 = 2^{k-2} bit multipliers. In general, the i-th recursion has 3^i
multipliers, each m/2^i = 2^{k-i} bits in length, and there are k = \log_2 m such recursions.
The final recursion splits the two bit multiplications into 3^{\log_2 m} one bit
multiplications, each of which is performed with a single AND gate. Therefore,

#AND gates : 3^{\log_2 m} \qquad (E.1)
Let A and B be the two m = 2^k bit multiplicands. In the first recursion, the multiplicands
are split into two halves; let the higher halves be Ah and Bh and the lower halves be Al and
Bl. The three m/2 bit multiplications performed are Mh = Ah Bh, Ml = Al Bl and
Mhl = (Ah + Al)(Bh + Bl). Let n = m/2. Forming the term Ah + Al requires n XOR gates, and
similarly Bh + Bl requires n XOR gates; in all, 2n XOR gates are required. After the three
multiplications are completed, the partial products are added as shown in Table E.1. Each
column of the table corresponds to a range of output bits of the multiplier and lists the
partial products that must be combined to form those bits.

Table E.1: Combining the Partial Products

  4n−2 to 3n−1 | 3n−2 to 2n | 2n−1 | 2n−2 to n | n−1 to 0
  -            | -          | -    | Ml        | Ml
  -            | Ml         | Ml   | Ml        | -
  -            | Mh         | Mh   | Mh        | -
  -            | Mhl        | Mhl  | Mhl       | -
  Mh           | Mh         | Mh   | -         | -

Combining the terms (2n−2) down to n requires 3(n−1) XOR gates; similarly, the terms (3n−2)
down to 2n require 3(n−1) XOR gates, and the term (2n−1) requires 2 XOR gates. Thus, combining
the partial products requires 6n−4 XOR gates, and the total number of XOR gates for one
recursion of the m bit multiplier is 6n−4+2n = 8n−4 = 4m−4.
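The splitting and recombination just described can be modelled behaviourally. The following
Python sketch is not the hardware description used in the thesis; it is a simple software model
of the basic Karatsuba recursion for GF(2) polynomials packed into integers, with the
combination step of Table E.1 expressed as shifts and XORs, and it is cross-checked against a
schoolbook carry-less multiplication.

import random

def clmul(a, b):
    """Schoolbook carry-less (GF(2) polynomial) multiplication, used as a reference."""
    r = 0
    while b:
        if b & 1:
            r ^= a
        a <<= 1
        b >>= 1
    return r

def karatsuba(a, b, m):
    """Basic Karatsuba multiplication of two m-bit GF(2) polynomials, m a power of two."""
    if m == 1:
        return a & b                          # a 1-bit product: a single AND gate in hardware
    n = m // 2
    al, ah = a & ((1 << n) - 1), a >> n       # split the operands into lower and higher halves
    bl, bh = b & ((1 << n) - 1), b >> n
    Ml  = karatsuba(al, bl, n)                # Ml  = Al*Bl
    Mh  = karatsuba(ah, bh, n)                # Mh  = Ah*Bh
    Mhl = karatsuba(al ^ ah, bl ^ bh, n)      # Mhl = (Ah + Al)(Bh + Bl); 2n XORs form the operands
    # Combine the partial products as in Table E.1:
    # product = Ml + (Ml + Mh + Mhl) * x^n + Mh * x^(2n)
    return Ml ^ ((Ml ^ Mh ^ Mhl) << n) ^ (Mh << (2 * n))

m = 16
for _ in range(1000):
    a, b = random.getrandbits(m), random.getrandbits(m)
    assert karatsuba(a, b, m) == clmul(a, b)
print("Karatsuba agrees with schoolbook multiplication for", m, "bit operands")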
Since m/2^r is the operand length in the r-th recursion, each of the 3^r multipliers at that
level contributes 4(m/2^r) − 4 XOR gates. Adding up the XOR gates required over all the
recursions gives the XOR gate estimate of Equation E.2:

#XOR gates : \sum_{r=0}^{\log_2 m} 3^r \left( \frac{4m}{2^r} - 4 \right) \qquad (E.2)
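Equation E.2 is easy to tabulate. The short sketch below (illustrative only) evaluates the
summation for a few power-of-two operand sizes and cross-checks it against the equivalent
per-level recurrence XOR(m) = 3·XOR(m/2) + 4m − 4 with XOR(1) = 0; the corresponding AND gate
count 3^(log2 m) is printed alongside.

def xor_gates_sum(m):
    """Equation E.2: sum over recursion levels r of 3^r * (4*(m/2^r) - 4)."""
    k = m.bit_length() - 1                     # m = 2^k
    return sum(3 ** r * (4 * (m >> r) - 4) for r in range(k + 1))

def xor_gates_rec(m):
    """The same count from the recurrence XOR(m) = 3*XOR(m/2) + 4m - 4, XOR(1) = 0."""
    return 0 if m == 1 else 3 * xor_gates_rec(m // 2) + 4 * m - 4

def and_gates(m):
    """One AND gate per 1-bit multiplication: 3^(log2 m)."""
    return 3 ** (m.bit_length() - 1)

for m in (2, 4, 8, 16, 32, 64, 128, 256):
    assert xor_gates_sum(m) == xor_gates_rec(m)
    print(m, and_gates(m), xor_gates_sum(m))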
E.2 Gate Requirements for the Simple Karatsuba Multiplier

The simple Karatsuba multiplier is the basic Karatsuba multiplier with a small modification to
handle bit lengths of the form m ≠ 2^k. The number of XOR and AND gates for the basic Karatsuba
multiplier therefore forms an upper bound on the number of gates required by the simple
Karatsuba multiplier, and the estimates of Equations E.1 and E.2 bound its gate requirements
as well.
REFERENCES
[1] W. N. Chelton and M. Benaissa, “Fast Elliptic Curve Cryptography on FPGA,”
IEEE Transactions on Very Large Scale Integration (VLSI) Systems, vol. 16, no.
2, pp. 198–205, Feb. 2008.
[4] Paul Kocher, Joshua Jaffe, and Benjamin Jun, “Differential Power Analysis,”
Lecture Notes in Computer Science, vol. 1666, pp. 388–397, 1999.
[5] Mitsuru Matsui and Junko Nakajima, “On the Power of Bitslice Implementation
on Intel Core2 Processor,” in CHES, 2007, pp. 121–134.
[6] Thomas Wollinger, Jan Pelzl, Volker Wittelsberger, Christof Paar, Gökay Sal-
damli, and Çetin K. Koç, “Elliptic and Hyperelliptic Curves on Embedded µP ,”
Trans. on Embedded Computing Sys., vol. 3, no. 3, pp. 509–533, 2004.
[11] Alfred J. Menezes, Paul C. van Oorschot, and Scott A. Vanstone, Handbook of
Applied Cryptography, CRC Press, 2001.
[13] Toshiya Itoh and Shigeo Tsujii, “A Fast Algorithm For Computing Multiplicative
Inverses in GF(2^m) Using Normal Bases,” Inf. Comput., vol. 78, no. 3, pp. 171–
177, 1988.
[16] Douglas R. Stinson, Cryptography: Theory and Practice, Third Edition (Discrete
Mathematics and Its Applications), Chapman & Hall/CRC, 2005.
[17] Whitfield Diffie and Martin E. Hellman, “New Directions in Cryptography,” IEEE
Transactions on Information Theory, vol. IT-22, no. 6, pp. 644–654, 1976.
[18] Darrel Hankerson, Alfred J. Menezes, and Scott Vanstone, Guide to Elliptic Curve
Cryptography, Springer-Verlag New York, Inc., Secaucus, NJ, USA, 2003.
[20] IEEE Computer Society, “IEEE Standard Specifications for Public-key Cryptog-
raphy,” 2000.
[21] American National Standards Institute, “Public Key Cryptography for the Finan-
cial Service Industry : The Elliptic Curve Digital Signature Algorithm (ECDSA),”
1998.
[24] Thomas Wollinger, Jorge Guajardo, and Christof Paar, “Security on FPGAs:
State-of-the-art Implementations and Attacks,” Trans. on Embedded Computing
Sys., vol. 3, no. 3, pp. 534–574, 2004.
[25] Deming Chen, Jason Cong, and Peichen Pan, “FPGA Design Automation: A
Survey,” Found. Trends Electron. Des. Autom., vol. 1, no. 3, pp. 139–169, 2006.
[26] Takashi Horiyama, Masaki Nakanishi, Hirotsugu Kajihara, and Shinji Kimura,
“Folding of Logic Functions and its Application to Look Up Table Compaction,”
ICCAD, vol. 00, pp. 694–697, 2002.
[27] Michael Hutton, Jay Schleicher, David M. Lewis, Bruce Pedersen, Richard
Yuan, Sinan Kaptanoglu, Gregg Baeckler, Boris Ratchev, Ketan Padalia, Mark
Bourgeault, Andy Lee, Henry Kim, and Rahul Saini, “Improving FPGA Perfor-
mance and Area Using an Adaptive Logic Module,” in FPL, 2004, pp. 135–144.
[28] Eli Biham and Adi Shamir, “Differential Fault Analysis of Secret Key Cryptosys-
tems,” in CRYPTO ’97: Proceedings of the 17th Annual International Cryptol-
ogy Conference on Advances in Cryptology, London, UK, 1997, pp. 513–525,
Springer-Verlag.
[29] Gerardo Orlando and Christof Paar, “A High Performance Reconfigurable Elliptic
Curve Processor for GF(2^m),” in CHES ’00: Proceedings of the Second Inter-
national Workshop on Cryptographic Hardware and Embedded Systems, London,
UK, 2000, pp. 41–56, Springer-Verlag.
[30] Julio López and Ricardo Dahab, “Improved Algorithms for Elliptic Curve Arith-
metic in GF(2^n),” in SAC ’98: Proceedings of the Selected Areas in Cryptogra-
phy, London, UK, 1999, pp. 201–212, Springer-Verlag.
[31] Leilei Song and Keshab K. Parhi, “Low-Energy Digit-Serial/Parallel Finite Field
Multipliers,” J. VLSI Signal Process. Syst., vol. 19, no. 2, pp. 149–166, 1998.
[32] Tim Kerins, Emanuel Popovici, William P. Marnane, and Patrick Fitzpatrick,
“Fully Parameterizable Elliptic Curve Cryptography Processor over GF(2^m),” in
FPL ’02: Proceedings of the Reconfigurable Computing Is Going Mainstream,
12th International Conference on Field-Programmable Logic and Applications,
London, UK, 2002, pp. 750–759, Springer-Verlag.
[33] M. Bednara, M. Daldrup, J. von zur Gathen, J. Shokrollahi, and J. Teich, “Re-
configurable Implementation of Elliptic Curve Crypto Algorithms,” in Parallel
and Distributed Processing Symposium., Proceedings International, IPDPS 2002,
Abstracts and CD-ROM, 2002, pp. 157–164.
[34] Nils Gura, Sheueling Chang Shantz, Hans Eberle, Sumit Gupta, Vipul Gupta,
Daniel Finchelstein, Edouard Goupy, and Douglas Stebila, “An End-to-End Sys-
tems Approach to Elliptic Curve Cryptography,” in CHES ’02: Revised Papers
from the 4th International Workshop on Cryptographic Hardware and Embedded
Systems, London, UK, 2003, pp. 349–365, Springer-Verlag.
[35] Jonathan Lutz and Anwarul Hasan, “High Performance FPGA based Elliptic
Curve Cryptographic Co-Processor,” in ITCC ’04: Proceedings of the Interna-
tional Conference on Information Technology: Coding and Computing (ITCC’04)
Volume 2, Washington, DC, USA, 2004, p. 486, IEEE Computer Society.
[36] Jerome A. Solinas, “Efficient Arithmetic on Koblitz Curves,” Des. Codes Cryp-
tography, vol. 19, no. 2-3, pp. 195–249, 2000.
[37] N. A. Saqib, F. Rodríguez-Henríquez, and A. Diaz-Perez, “A Parallel Architecture
for Fast Computation of Elliptic Curve Scalar Multiplication Over GF(2^m),”
in 18th International Parallel and Distributed Processing Symposium, 2004. Pro-
ceedings, Apr. 2004.
[38] Qiong Pu and Jianhua Huang, “A Microcoded Elliptic Curve Processor for
GF(2^m) Using FPGA Technology,” in Communications, Circuits and Systems
Proceedings, 2006 International Conference on, June 2006, vol. 4, pp. 2771–
2775.
[39] Xilinx, “Using Block RAM in Spartan-3 Generation FPGAs,” Application Note,
XAPP-463, 2005.
[40] Bijan Ansari and M. Anwar Hasan, “High Performance Architecture of Elliptic
Curve Scalar Multiplication,” Tech. Rep., Department of Electrical and Computer
Engineering, University of Waterloo, 2006.
[42] William Stallings, Cryptography and Network Security (4th Edition), Prentice-
Hall, Inc., Upper Saddle River, NJ, USA, 2005.
[43] Christof Paar, Efficient VLSI Architectures for Bit-Parallel Computation in Galois
Fields, Ph.D. thesis, Institute for Experimental Mathematics, Universität Essen,
Germany, June 1994.
[45] Gregory C. Ahlquist, Brent E. Nelson, and Michael Rice, “Optimal Finite Field
Multipliers for FPGAs,” in FPL ’99: Proceedings of the 9th International Work-
shop on Field-Programmable Logic and Applications, London, UK, 1999, pp.
51–60, Springer-Verlag.
[46] Ç. K. Koç and B. Sunar, “An Efficient Optimal Normal Basis Type II Multiplier,”
IEEE Trans. Comput., vol. 50, no. 1, pp. 83–87, 2001.
[47] Çetin K. Koç and Tolga Acar, “Montgomery Multiplication in GF(2^k),” Des.
Codes Cryptography, vol. 14, no. 1, pp. 57–69, 1998.
[48] C. Grabbe, M. Bednara, J. Shokrollahi, J. Teich, and J. von zur Gathen, “FPGA
Designs of Parallel High Performance GF(2^233) Multipliers,” in Proc. of the
IEEE International Symposium on Circuits and Systems (ISCAS-03), Bangkok,
Thailand, May 2003, vol. II, pp. 268–271.
[49] Zoya Dyka and Peter Langendoerfer, “Area Efficient Hardware Implementation
of Elliptic Curve Cryptography by Iteratively Applying Karatsuba’s Method,” in
DATE ’05: Proceedings of the conference on Design, Automation and Test in
Europe, Washington, DC, USA, 2005, pp. 70–75, IEEE Computer Society.
[50] Joachim von zur Gathen and Jamshid Shokrollahi, “Efficient FPGA-Based Karat-
suba Multipliers for Polynomials over F2 ,” in Selected Areas in Cryptography,
2005, pp. 359–369.
[51] Steffen Peter and Peter Langendörfer, “An efficient polynomial multiplier in
GF(2^m) and its application to ECC designs,” in DATE ’07: Proceedings of the
conference on Design, automation and test in Europe, San Jose, CA, USA, 2007,
pp. 1253–1258, EDA Consortium.
[52] Christof Paar, “A New Architecture for a Parallel Finite Field Multiplier with Low
Complexity Based on Composite Fields,” IEEE Transactions on Computers, vol.
45, no. 7, pp. 856–861, 1996.
[53] Francisco Rodríguez-Henríquez and Çetin Kaya Koç, “On Fully Parallel Karat-
suba Multipliers for GF(2^m),” in Proc. of the International Conference on Com-
puter Science and Technology (CST), 2003, pp. 405–410.
[56] Burton S. Kaliski, “The Montgomery Inverse and its Applications,” IEEE Trans-
actions on Computers, vol. 44, no. 8, pp. 1064–1065, 1995.
[57] Jorge Guajardo and Christof Paar, “Itoh-Tsujii Inversion in Standard Basis and Its
Application in Cryptography and Codes,” Des. Codes Cryptography, vol. 25, no.
2, pp. 207–216, 2002.
[60] Donald E. Knuth, The Art of Computer Programming Volumes 1-3 Boxed Set,
Addison-Wesley Longman Publishing Co., Inc., Boston, MA, USA, 1998.
[62] Guerric Meurice de Dormale, Philippe Bulens, and Jean-Jacques Quisquater, “An
Improved Montgomery Modular Inversion Targeted for Efficient Implementation
on FPGA,” in International Conference on Field-Programmable Technology -
FPT 2004, O. Diessel and J.A. Williams, Eds., 2004, pp. 441–444.
[65] Nele Mentens, Siddika Berna Ors, and Bart Preneel, “An FPGA Implementation
of an Elliptic Curve Processor GF(2^m),” in GLSVLSI ’04: Proceedings of the 14th
ACM Great Lakes symposium on VLSI, New York, NY, USA, 2004, pp. 454–457,
ACM.
[66] Jean-Sébastien Coron, “Resistance against Differential Power Analysis for El-
liptic Curve Cryptosystems,” in CHES ’99: Proceedings of the First Interna-
tional Workshop on Cryptographic Hardware and Embedded Systems, London,
UK, 1999, pp. 292–302, Springer-Verlag.
PUBLICATIONS AND AWARDS BASED ON THESIS
Publications
Awards
1. Chester Rebeiro and Debdeep Mukhopadhyay won the second prize at the de-
sign contest conducted by the 22nd International Conference on VLSI Design,
New Delhi, January 2009. The entry was titled "High Performance Galois Field
Elliptic Curve Cryptographic Processor for FPGA Platforms".