
ARCHITECTURE EXPLORATIONS FOR ELLIPTIC

CURVE CRYPTOGRAPHY ON FPGAS

A THESIS

submitted by

CHESTER REBEIRO

for the award of the degree

of

MASTER OF SCIENCE
(by Research)

DEPARTMENT OF COMPUTER SCIENCE AND ENGINEERING


INDIAN INSTITUTE OF TECHNOLOGY MADRAS
FEBRUARY 2009
THESIS CERTIFICATE

This is to certify that the thesis titled Architecture Explorations for Elliptic Curve
Cryptography on FPGAs, submitted by Chester Rebeiro, to the Indian
Institute of Technology Madras, for the award of the degree of Master of Science,
is a bonafide record of the research work done by him under my supervision. The
contents of this thesis, in full or in parts, have not been submitted to any other Institute
or University for the award of any degree or diploma.

Dr. Debdeep Mukhopadhyay


Research Guide
Professor
Dept. of CS and Engineering
IIT Madras, 600 036

Date: 14th January 2009


ACKNOWLEDGEMENTS

Foremost, I would like to thank my guide Dr. Debdeep Mukhopadhyay who shared a
lot of his experience and ideas with me. I appreciate his professionalism, planning, and
constant involvement in my research. I cherish the time we spent in discussions and in
the laboratory poring over problems. Working under him has sharpened my research
skills and increased my appetite to work in cryptography.

I am grateful to Dr. Kamakoti and Dr. Shankar Balachandran for their encour-
agement, advice, and help whenever needed. I am indebted to the RISE lab and the
Computer Science Department for offering me a fabulous environment to work and
study.

I would like to take this opportunity to acknowledge several friends and lab mates
who made my stay at IIT Madras exciting and unforgettable. I acknowledge the help
received from Noor on innumerable occasions. I would especially like to thank him
for helping me out with various tool flows; Shoaib, for the discussions that we had on
technical as well as non-technical topics; Rajesh, for being so easy to connect to; and
Venkat, among other things, for letting me know the best Idly joints in Chennai. I thank
Pavan, Shyam, Sadgopan, Parthasarthy, and Lalit for working along with me on several
courses and assignments.

I am grateful to the Centre for Development of Advanced Computing for giving me


this opportunity to further my studies. I would like to acknowledge the help received
from my colleagues Hari Babu, Ramana Rao, and Alok Singh who took care of things
while I was away.

I would like to thank my wife Sharon and my parents for the love and encouragement
I received. Without their support this thesis would not have been possible. I would
like to thank my grandmother for her prayers and for being my role model for hardwork.
I would like to dedicate this thesis to her.

Chester Rebeiro

ABSTRACT

The current era has seen an explosive growth in communications. Applications like on-
line banking, personal digital assistants, mobile communication, smartcards, etc. have
emphasized the need for security in resource constrained environments. Elliptic curve
cryptography (ECC) serves as a perfect cryptographic tool because of its short key sizes
and security comparable to that of other standard public key algorithms. However,
to match the ever increasing requirement for speed in today’s applications, hardware
acceleration of the cryptographic algorithms is a necessity. As a further challenge, the
designs have to be robust against side channel attacks.

This thesis explores efficient hardware architectures for elliptic curve cryptography
over binary Galois fields. The efficiency is largely affected by the underlying arithmetic
primitives. The thesis therefore explores FPGA designs for two of the most important
field primitives, namely multiplication and inversion. FPGAs are reconfigurable hard-
ware platforms offering software-like flexibility and low cost. However,
designing on FPGA platforms is challenging because of the large granularity, limited
resources, and large routing delay. The smallest programmable entity in an FPGA is
the look up table (LUT). The arithmetic algorithms proposed in this thesis maximize the
utilization of LUTs on the FPGA.

A novel finite field multiplier based on the recursive Karatsuba algorithm is pro-
posed. The proposed multiplier combines two variants of Karatsuba, namely the gen-
eral and the simple Karatsuba multipliers. The general Karatsuba multiplier has a
large gate count, but for small multiplications it is compact because it utilizes LUT
resources efficiently. For large multiplications, the simple Karatsuba is efficient as
it requires fewer gates. The proposed hybrid multiplier performs the initial recursions using
the simple algorithm, while the final small multiplications are done using the general
algorithm. The multiplier thus obtained has the best area-time product compared to
reported literature.

The Itoh-Tsujii multiplicative inverse algorithm is based on Fermat’s little theorem


and requires m − 1 squarings and O(log2(m)) multiplications. The proposed inverse
algorithm, called quad-Itoh Tsujii, is based on the fact that on an FPGA, using quad
circuits is more efficient than using squarers due to better LUT utilization. The quad-
Itoh Tsujii requires (m − 1)/2 quad circuits and has the best computation time compared
to any reported inverse algorithm.

The proposed primitives are organized as an elliptic curve crypto processor (ECCP),
which has one of the best timings and area-time products compared to reported works. We
conclude that the performance of an ECCP is significantly enhanced if the underlying
primitives are carefully designed. Further, a side channel attack based on simple timing
and power analysis is demonstrated on the ECCP. The construction of the ECCP is then
modified to mitigate such attacks.

TABLE OF CONTENTS

ACKNOWLEDGEMENTS i

ABSTRACT iii

LIST OF TABLES x

LIST OF FIGURES xii

ABBREVIATIONS xiii

1 Introduction 1
1.1 Motivation . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 3
1.2 Contribution of the Thesis . . . . . . . . . . . . . . . . . . . . . . 4
1.3 Organization of the Thesis . . . . . . . . . . . . . . . . . . . . . . 5

2 A Survey 7
2.1 Elliptic Curve Cryptography . . . . . . . . . . . . . . . . . . . . . 8
2.2 Engineering an Elliptic Curve Crypto Processor . . . . . . . . . . . 10
2.3 Hardware Accelerators for ECCP . . . . . . . . . . . . . . . . . . . 11
2.3.1 FPGA Architecture . . . . . . . . . . . . . . . . . . . . . . 12
2.4 Side Channel Attacks . . . . . . . . . . . . . . . . . . . . . . . . . 14
2.5 Related Work . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 15
2.6 Conclusion . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 17

3 Mathematical Background 18
3.1 Abstract Algebra . . . . . . . . . . . . . . . . . . . . . . . . . . . 18

3.1.1 Groups, Rings and Fields . . . . . . . . . . . . . . . . . . . 18
3.1.2 Binary Finite Fields . . . . . . . . . . . . . . . . . . . . . 20
3.2 Elliptic Curves . . . . . . . . . . . . . . . . . . . . . . . . . . . . 24
3.2.1 Projective Coordinate Representation . . . . . . . . . . . . 27
3.3 Conclusion . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 30

4 Architecting an Efficient Implementation of a Finite Field Multiplier on
FPGA Platforms 31
4.1 Finite Field Multipliers for High Performance Applications . . . . . 32
4.2 Karatsuba Multiplication . . . . . . . . . . . . . . . . . . . . . . . 33
4.3 Karatsuba Multipliers for Elliptic Curves . . . . . . . . . . . . . . . 34
4.4 Designing for the FPGA Architecture . . . . . . . . . . . . . . . . 36
4.5 Analyzing Karatsuba Multipliers on FPGA Platforms . . . . . . . . 37
4.5.1 The Hybrid Karatsuba Multiplier . . . . . . . . . . . . . . . 41
4.6 Performance Evaluation . . . . . . . . . . . . . . . . . . . . . . . . 44
4.7 Conclusion . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 46

5 High Performance Finite Field Inversion for FPGA Platforms 47


5.1 Algorithms for Multiplicative Inverse . . . . . . . . . . . . . . . . 47
5.2 The Itoh-Tsujii Algorithm (ITA) . . . . . . . . . . . . . . . . . . . 48
5.3 Clock Cycles for the ITA . . . . . . . . . . . . . . . . . . . . . . . 50
5.4 Generalizing the Itoh-Tsujii Algorithm . . . . . . . . . . . . . . . . 52
5.4.1 Hardware Architecture . . . . . . . . . . . . . . . . . . . . 57
5.5 Experimental Results . . . . . . . . . . . . . . . . . . . . . . . . . 63
5.6 Conclusion . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 66

6 Constructing the Elliptic Curve Crypto Processor 67


6.1 The Elliptic Curve Cryptoprocessor . . . . . . . . . . . . . . . . . 69
6.1.1 Register Bank . . . . . . . . . . . . . . . . . . . . . . . . . 69

6.1.2 Finite Field Arithmetic Unit . . . . . . . . . . . . . . . . . 71
6.1.3 Control Unit . . . . . . . . . . . . . . . . . . . . . . . . . 72
6.2 Point Arithmetic on the ECCP . . . . . . . . . . . . . . . . . . . . 72
6.2.1 Point Doubling . . . . . . . . . . . . . . . . . . . . . . . . 72
6.2.2 Point Addition . . . . . . . . . . . . . . . . . . . . . . . . 75
6.3 The Finite State Machine (FSM) . . . . . . . . . . . . . . . . . . . 78
6.4 Performance Evaluation . . . . . . . . . . . . . . . . . . . . . . . . 80
6.5 Conclusion . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 82

7 Side Channel Analysis of the ECCP 83


7.1 Simple Power Analysis on the ECCP . . . . . . . . . . . . . . . . . 83
7.2 SPA Resistant ECCP . . . . . . . . . . . . . . . . . . . . . . . . . 86
7.2.1 The SR-ECCP . . . . . . . . . . . . . . . . . . . . . . . . 87
7.2.2 Power Trace of the SR-ECCP . . . . . . . . . . . . . . . . 88
7.2.3 Performance Evaluation . . . . . . . . . . . . . . . . . . . 89
7.3 Conclusion . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 90

8 Conclusions and Future Work 91


8.1 Future Work . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 92

A Verification and Testing of the ECCP 94


A.1 Verification of the ECCP and SR-ECCP . . . . . . . . . . . . . . . 94
A.2 Testing of the ECCP . . . . . . . . . . . . . . . . . . . . . . . . . 95

B Finite Fields used for Performance Evaluation of ITA 97

C Using XPower to Obtain Power Traces of a Device 98


C.1 XPower . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 98

D Elliptic Curve Arithmetic 101

D.1 Equations for Arithmetic in Affine Coordinates . . . . . . . . . . . 102
D.1.1 Point Inversion . . . . . . . . . . . . . . . . . . . . . . . . 102
D.1.2 Point Addition . . . . . . . . . . . . . . . . . . . . . . . . 102
D.1.3 Point Doubling . . . . . . . . . . . . . . . . . . . . . . . . 104
D.2 Equations for Arithmetic in LD Projective Coordinates . . . . . . . 107
D.2.1 Point Inversion . . . . . . . . . . . . . . . . . . . . . . . . 107
D.2.2 Point Addition . . . . . . . . . . . . . . . . . . . . . . . . 107
D.2.3 Point Doubling . . . . . . . . . . . . . . . . . . . . . . . . 109

E Gates Requirements for the Simple Karatsuba Multiplier 111


E.1 Gate Requirements for the Basic Karatsuba Multiplier . . . . . . . . 111
E.1.1 AND Gate Estimate . . . . . . . . . . . . . . . . . . . . . 111
E.1.2 XOR Gate Estimate . . . . . . . . . . . . . . . . . . . . . . 111
E.2 Gate Requirements for the Simple Karatsuba Multiplier . . . . . . . 113
LIST OF TABLES

3.1 Scalar Multiplication using Double and Add to find 22P . . . . . . 26

4.1 Comparison of LUT Utilization in Multipliers . . . . . . . . . . . . 41


4.2 Comparison of the Hybrid Karatsuba Multiplier with Reported FPGA
Implementations . . . . . . . . . . . . . . . . . . . . . . . . . . . 45

5.1 Inverse of a ∈ GF(2^233) using generic ITA . . . . . . . . . . . . . 50


5.2 Comparison of LUTs Required for a Squarer and Quad Circuit for
GF(2^9) . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 54
5.3 Comparison of Squarer and Quad Circuits on Xilinx Virtex 4 FPGA 55
5.4 Inverse of a ∈ GF(2^233) using Quad-ITA . . . . . . . . . . . . . . 57
5.5 Control Word for GF(2^233) Quad-ITA for Table 5.4 . . . . . . . . . 60
5.6 Comparison for Inversion on Xilinx Virtex E . . . . . . . . . . . . . 65

6.1 Utility of Registers in the Register Bank . . . . . . . . . . . . . . . 70


6.2 Parallel LD Point Doubling on the ECCP . . . . . . . . . . . . . . 73
6.3 Inputs and Outputs of the Register File for Point Doubling . . . . . 73
6.4 Parallel LD Point Addition on the ECCP . . . . . . . . . . . . . . . 75
6.5 Inputs and Outputs of the Register Bank for Point Addition . . . . . 76
6.6 Inputs and Outputs of Regbank for Every State . . . . . . . . . . . 77
6.7 Control Words for ECCP . . . . . . . . . . . . . . . . . . . . . . . 78
6.8 Comparison of the Proposed GF(2^m) ECCP with FPGA based Pub-
lished Results . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 81
6.9 Comparing Area×Time Requirements with [1] . . . . . . . . . . . 81

7.1 SPA for the key (B9B9)16 . . . . . . . . . . . . . . . . . . . . . . 86


7.2 Performance Evaluation of the SR-ECCP . . . . . . . . . . . . . . 89

A.1 Basepoint and Curve Constants used for Verification of the ECCP and
the SR-ECCP . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 94
A.2 ECCP System Specifications on the Dini Hardware . . . . . . . . . 96

E.1 Combining the Partial Products . . . . . . . . . . . . . . . . . . . . 112

LIST OF FIGURES

2.1 Public Key Cryptosystem . . . . . . . . . . . . . . . . . . . . . . . 7


2.2 Elliptic Curve Pyramid . . . . . . . . . . . . . . . . . . . . . . . . 10
2.3 FPGA Island Style Architecture . . . . . . . . . . . . . . . . . . . 13
2.4 FPGA Logic Block . . . . . . . . . . . . . . . . . . . . . . . . . . 13

3.1 Squaring Circuit . . . . . . . . . . . . . . . . . . . . . . . . . . . . 22


3.2 Modular Reduction with Trinomial x^233 + x^74 + 1 . . . . . . . . . 23
3.3 Point Addition . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 25
3.4 Point Doubling . . . . . . . . . . . . . . . . . . . . . . . . . . . . 25

4.1 Combining the Partial Products in a Karatsuba Multiplier . . . . . . 37


4.2 233 Bit Hybrid Karatsuba Multiplier . . . . . . . . . . . . . . . . . 43
4.3 m Bit Multiplication vs Area × Time . . . . . . . . . . . . . . . . . 45

5.1 Circuit to Raise the Input to the Power of 2^k . . . . . . . . . . . . 51


5.2 Quad-ITA Architecture for GF(2^233) with the Addition Chain 5.3 . 58
5.3 Quadblock Design: Raises the Input to the Power of 4^k . . . . . . . 58
5.4 Clock Cycles of Computation Time versus Number of Quads in Quad-
block on a Xilinx Virtex 4 FPGA for GF(2^233) . . . . . . . . . . . 62
5.5 Performance of Quad-ITA vs Squarer-ITA Implementation for Different
Fields on a Xilinx Virtex 4 FPGA . . . . . . . . . . . . . . . . . . 63

6.1 Block Diagram of the Elliptic Curve Crypto Processor . . . . . . . 67


6.2 Register File for Elliptic Curve Crypto Processor . . . . . . . . . . 69
6.3 Finite Field Arithmetic Unit . . . . . . . . . . . . . . . . . . . . . 71
6.4 The ECCP Finite State Machine . . . . . . . . . . . . . . . . . . . 76

7.1 Power Trace for a Key with all 1 . . . . . . . . . . . . . . . . . . . 84
7.2 Power Trace for a Key with all 0 . . . . . . . . . . . . . . . . . . . 84
7.3 Power Trace when k = (B9B9)16 . . . . . . . . . . . . . . . . . . 85
7.4 Always Add Method to Prevent SPA . . . . . . . . . . . . . . . . . 87
7.5 FSM for SR-ECCP . . . . . . . . . . . . . . . . . . . . . . . . . . 87
7.6 Register File for SR-ECCP . . . . . . . . . . . . . . . . . . . . . . 88
7.7 Power Trace when k = (B9B9)16 . . . . . . . . . . . . . . . . . . 89

A.1 Test Platform for the ECCP . . . . . . . . . . . . . . . . . . . . . . 95

ABBREVIATIONS

AU Arithmetic Unit
ASIC Application Specific Integrated Circuit
DPA Differential Power Analysis
ECC Elliptic Curve Cryptography
ECCP Elliptic Curve Crypto Processor
ECDLP Elliptic Curve Discrete Logarithm Problem
EEA Extended Euclid’s Algorithm
FPGA Field Programmable Gate Array
FSM Finite State Machine
GF Galois Field
ITA Itoh-Tsujii Algorithm
LD Lopez-Dahab
LUT Look Up Table
RSA Rivest Shamir Adleman
SPA Simple Power Analysis
SR-ECCP SPA Resistant Elliptic Curve Crypto Processor
VCD Value Change Dump

CHAPTER 1

Introduction

This era has seen an astronomical increase in communications over the wired and wire-
less networks. Every day, thousands of transactions take place over the world wide web.
Many of these transactions involve critical data that must be kept confidential, transactions
that must be validated, and users that must be authenticated. These requirements demand
that a rugged security framework be in force.

Cryptology is the science concerned with providing secure communications. The


goal of cryptology is to construct schemes which allow only authorized access to in-
formation; all malicious attempts to access information are prevented. An authorized
access is identified by a cryptographic key. A user having the right key will be able to
access the hidden information, while all other users will not have access to the infor-
mation. Cryptology consists of cryptography and cryptanalysis. The former involves
the study and application of various techniques through which information may be ren-
dered unintelligible to all but the intended receiver. On the other hand cryptanalysis is
the science of breaking crypto systems and recovering the secret information.

There are two types of cryptographic algorithms: symmetric key and asym-
metric key algorithms. Symmetric key cryptographic algorithms have a single key for
both encryption and decryption. These are the most widely used schemes. They are
preferred for their high speed and simplicity. However, they can be used only when
the two communicating parties have agreed on the secret key. This could be a hurdle
when used in practical cases as it is not always easy for users to exchange keys. In
asymmetric key cryptographic algorithms, two keys are involved: a private key and a
public key. The private key is kept secret while the public key is known to everyone.
Encryption is done with the public key, and the encrypted message can only be
decrypted with the corresponding private key. Security of these algorithms depends on
the hardness of deriving the private key from the public key. Although slow and highly
complex, asymmetric key cryptography has immense advantages. The main advantage
is that the underlying primitives used are based on well known problems such as inte-
ger factorization and discrete logarithm problems. These problems have been studied
extensively and their hardness has not been refuted after years of research. This is
unlike symmetric key cryptography where the strength of the algorithm relies on combi-
natorial techniques. The security of such algorithms is not proven and does not rely on
well researched problems in literature. The most widely used asymmetric key crypto algorithm
is RSA [2]. Of late, asymmetric crypto algorithms based on elliptic curves have been
rapidly gaining popularity due to the higher level of security offered at lower key sizes.
Several security standards have emerged which use elliptic curves for the underlying
security algorithm.

There are several methods to cryptanalyze modern cryptographic algorithms. Con-


ventional cryptanalysis techniques exploit algorithm weaknesses. They often cannot be ap-
plied in practice due to the large amount of data required. In addition, most tech-
niques require huge amounts of computation time, making them very expensive. How-
ever, the most serious threat to modern cryptographic algorithms are attacks based on
information gathered from side channels. These attacks [3, 4] target the implementation
rather than the algorithm. Sources of side channel information include power consumption of the
device, timing, acoustics, and radiation characteristics; thus an attacker monitoring one
or more side channels of a device performing an encryption (or decryption) can gather
information about the secret key. Optimized cryptographic implementations are more
susceptible to side channel attacks, therefore high performing cryptographic hardware
must consider this class of attacks during implementation.

1.1 Motivation

Though asymmetric key cryptography is indispensable for communication, there is a


penalty on the application’s performance. Most public key cryptographic algo-
rithms involve several complex mathematical computations, making the penalty significant. It is
therefore important to have efficient implementations of the algorithms.

There are two schemes for developing efficient cryptographic implementations. The
first focuses on implementing and optimizing the cryptographic algorithms in software
platforms. This has the advantage of being low cost as no additional hardware is re-
quired. However, benefits obtained by this method are restricted by the architectural
limitations of the microprocessor. For example, arithmetic on large numbers cannot be
as efficiently done on microprocessors as it can be performed on dedicated hardware.
Such arithmetic is the norm in public key cryptographic algorithms. Besides, software
can very easily be tampered with, thus compromising the security of the application.

Even if software implementations are tailored to exploit the processor’s architecture


[5–8] they are no match for dedicated hardware implementations. The inherent par-
allelism, flexibility, and custom design of hardware significantly speed up execution.
Also, hardware devices can be made more tamper resistant compared to software. This
is beneficial for cryptographic applications. However, hardware is more expensive than
software and the amount of resources available is limited. Design cycles for hardware
are also more involved and complex. Memory is yet another constraint for such designs.
It is therefore vital to have compact, scalable and modular hardware designs which are
fine tuned to the specific application. Field programmable gate arrays (FPGAs) are re-
configurable platforms to build hardware. They offer advantages of hardware platforms
as well as software platforms. While on one hand they offer more programmability and
lower costs like a software platform, they also offer better performance than a software
implementation. However, designing for FPGAs is tricky. What works for an applica-
tion specific integrated circuit (ASIC) library does not always work for an FPGA. The

main differences occur because of the inherent difference in the libraries and the archi-
tectures. FPGAs have fixed resources, a look up table (LUT) based architecture, and
larger interconnect delays. Hence a design on FPGA must be carefully built to utilize
the resources well and satisfy the timing constraints of the FPGA library. In this work
we design and implement a side channel attack (SCA) resistant elliptic curve processor
on an FPGA platform.

1.2 Contribution of the Thesis

In this thesis, architectures for a public key crypto algorithm based on elliptic curves [9–
11] are explored. The architectural explorations are targeted for reconfigurable plat-
forms. The contributions of this thesis are as follows.

• The thesis presents an architecture for efficient implementations of finite field


multiplication. The proposed multiplier is called hybrid Karatsuba multiplier and
is based on the Karatsuba-Ofman multiplication algorithm [12]. Detailed analysis
has been carried out on how existing multiplication algorithms utilize FPGA re-
sources. Based on the observations, the work develops a hybrid technique which
has a better area delay product compared to known algorithms. Results have been
practically demonstrated through a large number of experiments.
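The recursive splitting that both Karatsuba variants share can be sketched in software. The following Python fragment illustrates the textbook Karatsuba-Ofman recursion over GF(2) polynomials packed into integers; it is only an illustration of the algorithmic idea, not the proposed hardware multiplier, and the 8-bit base-case threshold is an arbitrary choice for this sketch:

```python
def clmul(a, b):
    """Schoolbook carry-less (GF(2)[x]) multiplication of bit-packed polynomials."""
    r = 0
    while b:
        if b & 1:
            r ^= a
        a <<= 1
        b >>= 1
    return r

def karatsuba_gf2(a, b, n):
    """Karatsuba for n-bit binary polynomials a, b (n a power of two here)."""
    if n <= 8:                         # base case: schoolbook multiplication
        return clmul(a, b)
    h = n // 2
    mask = (1 << h) - 1
    a_lo, a_hi = a & mask, a >> h
    b_lo, b_hi = b & mask, b >> h
    lo = karatsuba_gf2(a_lo, b_lo, h)
    hi = karatsuba_gf2(a_hi, b_hi, h)
    mid = karatsuba_gf2(a_lo ^ a_hi, b_lo ^ b_hi, h)   # (a_lo + a_hi)(b_lo + b_hi)
    # Over GF(2), addition and subtraction are both XOR, so the middle
    # coefficient is mid - hi - lo = mid ^ hi ^ lo.
    return (hi << (2 * h)) ^ ((mid ^ hi ^ lo) << h) ^ lo

# One n-bit multiplication costs three n/2-bit multiplications plus XORs:
assert karatsuba_gf2(0xDEADBEEF, 0x12345678, 32) == clmul(0xDEADBEEF, 0x12345678)
```

The hybrid multiplier of Chapter 4 chooses which of the two Karatsuba variants to apply at each recursion depth; this sketch uses a single variant throughout.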

• The most complex finite field operation in elliptic curve cryptography (ECC) is
the multiplicative inverse. The thesis proposes a novel inversion algorithm for
FPGA platforms. The proposed algorithm is a generalization of the Itoh-Tsujii
inversion algorithm [13]. Evidence has been furnished and supported with experi-
mental results to show that the proposed inversion algorithm outperforms existing
results. The proposed method is demonstrated to be scalable with respect to field
sizes.
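The principle underlying the Itoh-Tsujii algorithm [13] is Fermat's little theorem for binary fields: for any non-zero a ∈ GF(2^m), a^(2^m − 2) = a^(−1). The sketch below is an illustrative, non-optimized Python rendering over the small field GF(2^8) with the irreducible polynomial x^8 + x^4 + x^3 + x + 1 (the field and polynomial are chosen here only for brevity); the addition-chain and quad-circuit optimizations developed in this thesis are deliberately omitted:

```python
IRRED = 0x11B          # x^8 + x^4 + x^3 + x + 1, irreducible over GF(2)
M = 8                  # field degree m

def gf_mul(a, b):
    """Multiply two GF(2^8) elements (bit-packed), reducing modulo IRRED."""
    r = 0
    while b:
        if b & 1:
            r ^= a
        b >>= 1
        a <<= 1
        if a & 0x100:          # degree reached 8: reduce
            a ^= IRRED
    return r

def fermat_inverse(a):
    """a^(2^m - 2) by plain square-and-multiply.

    The exponent 2^m - 2 is 0b111...10, so after the leading bit the scan
    performs m - 2 (square, multiply) steps and one final squaring:
    m - 1 squarings and m - 2 multiplications in total.
    """
    result = a
    for _ in range(M - 2):
        result = gf_mul(result, result)   # square
        result = gf_mul(result, a)        # multiply
    return gf_mul(result, result)         # final squaring (trailing 0 bit)
```

The Itoh-Tsujii optimization replaces the m − 2 multiplications above with O(log2(m)) multiplications via an addition chain, and the quad variant of Chapter 5 further replaces pairs of squarings with quad circuits.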

• The work presents the design of a high performance elliptic curve crypto pro-
cessor (ECCP) for an elliptic curve over the finite field GF(2^233). The chosen
elliptic curve is one of the selected curves for the digital signature standard [14].
The high performance is obtained by efficient implementations of the underly-
ing finite field arithmetic. The processor is synthesized for Xilinx’s FPGA [15]
platform and is shown to be one of the fastest reported implementations on FPGA.

• The thesis demonstrates that a naive implementation of an elliptic curve crypto


processor is vulnerable to simple power attacks. The attack is demonstrated using
XPower (http://www.xilinx.com/products/design_tools/logic_design/verification/xpower.htm), a power simulation tool from Xilinx. The power traces are shown to
leak information about the key and internal activities of the state machine of the
processor. A side channel resistant processor is also designed and demonstrated
to be resistant to similar attacks.

1.3 Organization of the Thesis

The rest of this thesis is organized as follows.

• Chapter 2 contains a brief introduction to ECC and covers aspects of engineer-
ing an elliptic curve processor. A survey is made of existing elliptic curve crypto
processors reported in literature. The chapter also contains a brief introduction
on FPGA architecture and side channel attacks.

• Chapter 3 contains the mathematical background required to understand ECC.


The first part of the chapter outlines the required concepts in abstract algebra.
It also presents some of the basic arithmetic circuits such as adders, squarers,
and modular operators. The second part of the chapter discusses elliptic curve
cryptography.

• Finite field multiplication is discussed in detail in Chapter 4. The Karatsuba


multiplier is chosen as the multiplier in the elliptic curve crypto processor. A
hybrid Karatsuba multiplier is proposed for FPGA platforms and shown to have
the best area time product compared to existing works.

• Chapter 5 discusses finite field inversion. A generalization of the Itoh-Tsujii in-


version algorithm is proposed. A specific form of the generalized Itoh Tsujii
algorithm known as the quad Itoh Tsujii is shown to be more efficient for FPGA
platforms. A processor based on the quad Itoh Tsujii is constructed and shown to
be the fastest inversion algorithm reported.

• Chapter 6 integrates the various finite field arithmetic primitives into an elliptic
curve crypto processor. The efficient underlying primitives result in one of the
fastest reported elliptic curve crypto processors.

• Chapter 7 uses Xilinx tools to demonstrate that the naive implementation of an


elliptic curve crypto processor is vulnerable to side channel attacks. The chapter
then proposes a modification to the architecture which makes the processor less
prone to side channel attacks.

• Chapter 8 has the conclusion of the thesis and future directions of research in this
area of work.

• Appendix A has details of how the correctness of the ECCP was verified and the
testing of the ECCP on an FPGA hardware platform. Appendix B has a list of the
finite fields that were used to test the scalability of the proposed inverse algorithm.
Appendix C has instructions to use XPower to obtain the power trace of an FPGA.
Appendix D has derivations for the elliptic curve arithmetic equations. Appendix
E has derivations for the gate requirements for the simple Karatsuba multiplier.

CHAPTER 2

A Survey

Definition 2.0.1 A symmetric key cryptosystem can be defined by the tuple (P, C, K, E, D)
[16], where

• P represents the finite set of possible plaintexts.

• C represents the finite set of possible ciphertexts.

• K represents the finite set of possible keys.

• For each k ∈ K there is an encryption rule ek ∈ E and a corresponding decryp-


tion rule dk ∈ D. Each ek : P → C and dk : C → P are functions such that
dk(ek(x)) = x for every plaintext x ∈ P.

The keys for both encryption and decryption are the same and must be kept secret. This
leads to problems related to key distribution and key management. In 1976, Diffie and
Hellman [17] invented asymmetric key cryptography which solved the problem of key
distribution and management. Asymmetric algorithms use a pair of keys for encryption

Plaintext Encryption Decryption Plaintext

Public Key Private Key

Fig. 2.1: Public Key Cryptosystem


and decryption (Figure 2.1). Encryption is done by a public key which is known to
everyone. Decryption can be only done using the corresponding private key. Given the
private key, the corresponding public key can easily be derived. However, the private
key cannot be efficiently derived from the public key. An asymmetric key cryptosystem
is constructed by means of trapdoor one-way functions which are defined as follows
[11].

Definition 2.0.2 A function f(x) from a set X to a set Y is called a one-way function
if f(x) can be efficiently computed but the computation of f⁻¹(x) is computationally
intractable.

Definition 2.0.3 A trapdoor one-way function is a one-way function f(x) for which
there exists some supplementary information (usually the secret key) with which it
becomes feasible to compute f⁻¹(x).
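Modular exponentiation is the textbook candidate for a one-way function: computing g^x mod p takes O(log x) multiplications, while recovering x (the discrete logarithm) is believed intractable for well-chosen large p. The sketch below uses a deliberately tiny prime so that the brute-force inversion actually terminates; the parameters are illustrative only (real parameters are hundreds or thousands of bits):

```python
P_MOD = 1009          # a small prime, far too small for real security
G = 11                # an illustrative base

def f(x: int) -> int:
    """The easy direction: modular exponentiation, O(log x) multiplications."""
    return pow(G, x, P_MOD)

def brute_force_log(y: int) -> int:
    """The hard direction: exhaustive search over exponents, O(p) work."""
    for x in range(P_MOD):
        if pow(G, x, P_MOD) == y:
            return x
    raise ValueError("no discrete log found")

y = f(123)
assert f(brute_force_log(y)) == y   # invertible here only because P_MOD is tiny
```

The gap between the two running times, which widens exponentially in the bit length of p, is what the asymmetric cryptosystems of this chapter rely on.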

Thus, trapdoor one-way functions rely on intractable problems in computer sci-
ence. An example of an intractable problem is the integer factorization problem, which
states that given an integer n, one has to obtain its prime factorization, i.e. find
n = p1^e1 · p2^e2 · p3^e3 · · · pk^ek, where each pi is a prime number and ei ≥ 1. Solving the problem of factor-
ing the product of prime numbers is considered computationally difficult for properly
selected primes of size at least 1024 bits. This forms the basic security assumption of
the famous RSA algorithm [2]. Another intractable problem, the elliptic curve discrete
logarithm problem (ECDLP) has given rise to new asymmetric cryptosystems based on
elliptic curves.

2.1 Elliptic Curve Cryptography

Elliptic curves have been studied for over a hundred years and have been used to solve
a diverse range of problems. For example, elliptic curves were used in proving Fermat’s
last theorem, which states that x^n + y^n = z^n has no non-zero integer solutions for x, y, and
z when n > 2 [18].

The use of elliptic curves in public key cryptography was first proposed indepen-
dently by Koblitz [19] and Miller [10] in the 1980s. Since then, there has been an
abundance of research on the security of ECC. In the 1990’s ECC began to get ac-
cepted by several accredited organizations, and several security protocols based on ECC
[14, 20, 21] were standardized. The main advantage of ECC over conventional asym-
metric crypto systems [2] is the increased security offered with smaller key sizes. For
example, a 256 bit key in ECC provides the same level of security as a 3072 bit RSA key^1. The smaller key sizes lead to compact implementations and increased performance. This makes ECC well suited for low power, resource constrained devices.

An elliptic curve is the set of solutions (x, y) to Equation 2.1 together with the point
at infinity (O). This equation is known as the Weierstraß equation [18].

y^2 + a1 xy + a3 y = x^3 + a2 x^2 + a4 x + a6        (2.1)

For cryptography, the points on the elliptic curve are chosen from a large finite field.
The set of points on the elliptic curve form a group under the addition rule. The point
O is the identity element of the group. The operations on the elliptic curve, i.e. the
group operations are point addition, point doubling and point inverse. Given a point
P = (x, y) on the elliptic curve, and a positive integer n, scalar multiplication is defined
as
nP = P + P + P + · · · + P (n times)        (2.2)

The order of the point P is the smallest positive integer n such that nP = O. The points
{O, P, 2P, 3P, · · · (n − 1)P } form a group generated by P . The group is denoted as
< P >.
^1 NIST sources

The security of ECC is provided by the elliptic curve discrete logarithm problem (ECDLP), which is defined as follows: given a point P on the elliptic curve and another point Q ∈ <P>, determine the integer k (0 ≤ k ≤ n − 1) such that Q = kP. The difficulty of the ECDLP lies in calculating the value of the scalar k given the points P and Q. k is called the discrete logarithm of Q to the base P. P is the generator of the elliptic curve group and is called the basepoint.

The ECDLP forms the base on which asymmetric key algorithms are built. These
algorithms include the elliptic curve Diffie-Hellman key exchange, elliptic curve ElGa-
mal public key encryption, and the elliptic curve digital signature algorithm.

2.2 Engineering an Elliptic Curve Crypto Processor

The implementation of elliptic curve crypto systems constitutes a complex interdisci-


plinary research field involving mathematics, computer science, and electrical engineer-
ing [22]. Elliptic curve crypto systems have a layered hierarchy as shown in Figure 2.2.
The bottom layer constituting the arithmetic on the underlying finite field most promi-
nently influences the area and critical delay of the overall implementation. The group

Fig. 2.2: Elliptic Curve Pyramid (top to bottom: EC primitives, scalar multiplication, elliptic curve group operations, finite field operations)

operations on the elliptic curve and the scalar multiplication influence the number of clock cycles required for encryption.

To be usable in real world applications, implementations of the crypto system must


be efficient, scalable, and reusable. Applications such as smart cards and mobile phones
require implementations where the amount of resources used and the power consumed
is critical. Such implementations should be compact and designed for low power. Computation speed is a secondary criterion. Also, the degree of reconfigurability of the device can be kept to a minimum [23], because such devices have a short lifetime and are generally configured only once. On the other side of the spectrum, high performance systems such as network servers and database systems require high speed implementations of ECC. The crypto algorithm should not be the bottleneck in the application's performance. These implementations must also be highly flexible: operating parameters such as algorithm constants should be reconfigurable. Reconfiguration can easily be done in software; however, software implementations do not always scale to the performance demanded by the application. Such systems require dedicated hardware to speed up computations. When using such hardware accelerators, the clock cycles required, frequency of operation, and area are important design criteria. The clock cycle count should be low and the frequency high so that the overall latency of the hardware is small. The area is important because a smaller area implies more parallelism can be implemented on the same hardware, thus increasing the device's throughput.

2.3 Hardware Accelerators for ECCP

There are two platforms on which hardware accelerators are built: application specific
integrated circuits (ASICs) and field programmable gate arrays (FPGAs). ASICs are
one time programmable and are best suited for high volume production. ASICs can
reach high frequency of operation, and algorithms implemented on these devices have
high performance. Also, ASICs are best when data protection is concerned. Once data

is written into an ASIC it is extremely difficult to read back. However, ASICs suffer
from high development costs and lack flexibility with respect to modifying algorithms
and reconfiguring parameters [24]. Besides, production of an ASIC must be done in fabrication units, which are generally owned by a third party. This is not suited for cryptographic applications, where the number of parties involved must be kept to a minimum.

FPGAs are reconfigurable devices offering parallelism and flexibility on one hand
while being low cost and easy to use on the other. Moreover they have much shorter
design cycle times compared to ASICs. FPGAs were initially used as prototyping de-
vices and in high performance scientific applications but the short time-to-market and
on-site reconfigurability features have expanded their application space. These devices
can now be found in various consumer electronic devices, high performance networking
applications, medical electronics and space applications. The reconfigurability aspect
of FPGAs also makes them suited for cryptographic applications. Reconfigurability re-
sults in flexible implementations allowing operating modes, encryption algorithms, and
curve constants to be configured. FPGAs do not require sophisticated equipment for production; they can be programmed in house. This is beneficial for cryptography as no untrusted party is involved in the production cycle.

2.3.1 FPGA Architecture

There are two main parts of the FPGA chip [25] : the input/output (I/O) blocks and
the core. The I/O blocks are located around the periphery of the chip and are used to
provide programmable connectivity to the chip. The core of the chip consists of pro-
grammable logic blocks and programmable routing architectures. A popular architec-
ture for the core, called the island style architecture, is shown in Figure 2.3. Logic blocks, also called configurable logic blocks (CLBs), consist of circuitry for implementing logic. Each CLB is surrounded by routing channels connected through switch blocks

Fig. 2.3: FPGA Island Style Architecture (programmable logic blocks surrounded by programmable routing switches and connection switches)

Fig. 2.4: FPGA Logic Block (a LUT with inputs F1–F4 feeding carry logic, control logic, and a storage element with clock, clock enable, and set/reset signals)

and connection blocks. A switch block connects wires in adjacent channels through
programmable switches. A connection block connects the wire segments around a logic

block to its inputs and outputs. Each logic block further contains a group of basic logic
elements (BLE). Each BLE has a look up table (LUT), a storage element, and combina-
tional logic as shown in Figure 2.4. The storage element can be configured as an edge
triggered D flip-flop or as a level sensitive latch. The combinational logic generally
contains logic for carry and control signal generation.

A LUT can be configured to implement arbitrary logic: if the LUT has m inputs, then any m variable boolean function can be implemented. The LUT mainly contains memory to store the truth table of the boolean function and multiplexers to select values from the memory. There have been several studies on the best configuration for the LUT. A larger LUT allows more logic to be fitted into a single LUT and hence gives a lower critical delay. However, a larger LUT also requires a larger memory and bigger multiplexers, hence more area. Most studies show that a 4 input LUT provides the best area-time product, though there have been a few applications where a 3 input LUT [26] or a 6 input LUT [27] has been found beneficial. Most FPGA manufacturers, including Xilinx^2 and Altera^3, use 4 input LUTs.
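Since any m variable boolean function is just a 2^m entry truth table, a 4 input LUT can be modeled in software as a 16 bit constant indexed by the four inputs. The sketch below is purely illustrative (the function name and bit ordering are assumptions, not a description of any vendor's silicon):

```python
def lut4(truth_table: int, f1: int, f2: int, f3: int, f4: int) -> int:
    """Model a 4-input LUT: the inputs select one bit of a 16-bit truth table."""
    index = (f4 << 3) | (f3 << 2) | (f2 << 1) | f1
    return (truth_table >> index) & 1

# 0x6996 is the truth table of the 4-input XOR (parity) function
print(lut4(0x6996, 1, 0, 0, 0))   # -> 1
print(lut4(0x6996, 1, 1, 0, 0))   # -> 0
```

Reprogramming the LUT amounts to loading a different 16 bit constant, which is exactly what makes the fabric reconfigurable.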

2.4 Side Channel Attacks

Since the mid 90's, a new research area that has gained focus is side channel cryptanalysis. This has become one of the biggest threats to modern day cryptosystems, with many algorithms successfully attacked. These attacks analyze unintended information leakage from naive implementations of a crypto algorithm.

Side channel attacks are broadly classified into passive and active attacks. In a passive attack, the functioning of the cryptographic device is not tampered with. The secret key is revealed by observing physical properties of the device, such as timing characteristics, power consumption traces, etc. In an active attack, the inputs and environment are
^2 http://www.xilinx.com
^3 http://www.altera.com

manipulated to force the device to behave abnormally. The secret key is then revealed
by exploiting the abnormal behavior of the device [28].

The two most extensively exploited side channels are power consumption and timing. An attack based on timing analysis [3] first identifies and then monitors certain operations in the device. The time required to complete these operations leaks information about the secret key. Power consumption attacks [4] reveal the secret key by monitoring the power consumed by the device. The power consumption of a device depends on the data being manipulated and the operation being performed. There are essentially two forms of power attacks: simple power analysis and differential power analysis. An attacker using simple power analysis (SPA) requires just a single power trace. Features of the power trace are used to directly interpret the secret key. A stronger form of power attack, called differential power analysis (DPA), was first introduced by Kocher in [4]. This is a statistical technique and requires several power traces to be analyzed before the key is revealed. This class of attacks exploits the dependence of the device's power consumption on the data being processed, which in turn depends on the key.

2.5 Related Work

There have been several reported high performance FPGA processors for elliptic curve
cryptography. Various acceleration techniques have been used ranging from efficient
implementations to parallel and pipelined architectures. In [29] the Montgomery mul-
tiplier [30] is used for scalar multiplication. The finite field multiplication is performed
using a digit-serial multiplier proposed in [31]. The Itoh-Tsujii algorithm is used for
finite field inversion. A point multiplication over the field GF(2^167) is performed in 0.21ms.

In [32] a fully parameterizable ABC processor is introduced, which can be used


with any field and irreducible polynomial without need for reconfiguration. This implementation, although highly flexible, is slow and does not reach the speeds required for high bandwidth applications. A 239 bit point multiplication requires 12.8ms, which is extremely high compared to other reported implementations.

In [33], the ECC processor designed has squarers, adders, and multipliers in the data path. The authors have used a hybrid coordinate representation in affine, Jacobian, and López-Dahab form.

In [34] an end-to-end system for ECC is developed, which has a hardware imple-
mentation for ECC on an FPGA. The high performance is obtained with an optimized
field multiplier. A digit-serial shift-and-add multiplier is used for the purpose. Inversion
is done with a dedicated division circuit.

The processor presented in [35] achieves point multiplication in 0.074ms over the field GF(2^163). However, the implementation is for a specific form of elliptic curves called Koblitz curves. On these curves, several acceleration techniques based on precomputation [36] are possible. However, our work focuses on generic curves where such accelerations do not apply.

In [37] a high speed elliptic curve processor is presented for the field GF(2^191), where point multiplication is done in 0.056ms. A binary Karatsuba multiplier is used for the field multiplication. However, no inversion algorithm is specified in the paper, making the implementation incomplete.

In [38] a microcoded approach is followed for ECC making it easy to modify,


change, and optimize. The microcode is stored in the block RAM [39] and does not
require additional resources.

In [40], the finite field multiplier in the processor is prevented from becoming idle. The finite field multiplier is the bottleneck of the design, therefore preventing it from becoming idle improves the overall performance. Our design of the ECCP is on similar lines, where the operations required for point addition and point doubling are scheduled

so that the finite field multiplier is always utilized.

In [1], a pipelined ECC processor is developed which uses a combined algorithm to


perform point doubling and point addition. This computes the scalar product in 0.019ms for an elliptic curve over GF(2^163), the fastest reported in the literature. However, the seven stage pipeline used has huge area requirements.

In this thesis, high performance is attained by focusing on efficient implementations


of the finite field primitives. The algorithms used for the critical finite field operations
are tuned for the FPGA platform. Our novel finite field multiplier is a combinational
circuit and produces the output in one clock cycle. This has tremendous performance
benefits. The proposed inversion algorithm is the fastest reported in literature. These
efficient underlying primitives result in one of the fastest elliptic curve processors even
though no pipelining is used.

2.6 Conclusion

In this chapter a brief introduction of elliptic curve cryptography was made, and the
hierarchy in an elliptic curve processor was presented. A review of the existing literature
on elliptic curve crypto processors was made. Hardware platforms used for elliptic
curve cryptography were discussed, with special focus on FPGA architectures. The
vulnerability of crypto processors to side channel attacks was also presented.

CHAPTER 3

Mathematical Background

Understanding Elliptic Curve Cryptography (ECC) requires a good grasp of the underlying mathematics. ECC relies heavily on abstract algebra for its construction. This chapter therefore starts with a brief overview of the primitive algebraic structures, namely groups, rings, and fields. The second part of this chapter is dedicated
to the mathematics behind elliptic curves. Specifically, elliptic curves over finite fields of the form GF(2^m) are considered. The operations on this form of elliptic curve are
discussed.

3.1 Abstract Algebra

3.1.1 Groups, Rings and Fields

Definition 3.1.1 A group denoted by {G, ·}, is a set of elements G with a binary oper-
ation ’·’, such that for each ordered pair (a, b) of elements in G, the following axioms
are obeyed [41, 42]:

• Closure : If a, b ∈ G, then a · b ∈ G.

• Associative : a · (b · c) = (a · b) · c for all a, b, c ∈ G.

• Identity element : There is a unique element e ∈ G such that a · e = e · a = a for


all a ∈ G.

• Inverse element : For each a ∈ G, there is an element a′ ∈ G such that a · a′ =


a′ · a = e
If the group also satisfies a·b = b·a for all a, b ∈ G then it is known as a commutative
or an abelian group.

Definition 3.1.2 A ring denoted by {R, +, ×} or simply R is a set of elements with two
binary operations called addition and multiplication, such that for all a, b, c ∈ R the
following are satisfied:

• R is an abelian group under addition.

• The closure property of R is satisfied under multiplication.

• The associativity property of R is satisfied under multiplication.

• Distributive Law : For all a, b, c ∈ R, a·(b+c) = ab+ac and (a+b)·c = ac+bc.

The set of integers, rational numbers, real numbers, and complex numbers are all
rings. A ring is said to be commutative if the commutative property under multiplication
holds. That is, for all a, b ∈ R, a · b = b · a.

Definition 3.1.3 A field denoted by {F, +, ×} or simply F is a commutative ring which


satisfies the following properties

• There exists a multiplicative identity element denoted by 1 such that for every a ∈ F, a · 1 = 1 · a = a.

• Multiplicative inverse : For every element a ∈ F except 0, there exists a unique


element a−1 such that a · (a−1 ) = (a−1 ) · a = 1. a−1 is called the multiplicative
inverse of the element a.

• No zero divisors : If a, b ∈ F and a · b = 0, then either a = 0 or b = 0.

The set of rational numbers, real numbers and complex number are examples of
fields, while the set of integers is not. This is because the multiplicative inverse property
does not hold in the case of integers.

The above examples of fields have infinitely many elements. However, in cryptography finite fields play an important role. A finite field is also known as a Galois field and is denoted by GF(p^m). Here, p is a prime called the characteristic of the field, while m is a positive integer. The order of the finite field, that is, the number of elements in the field, is p^m. When m = 1, the resulting field is called a prime field and contains the residue classes modulo p [41].

In cryptography two of the most studied fields are finite fields of characteristic two and prime fields. Finite fields of characteristic two, denoted by GF(2^m), are also known as binary extension fields or simply binary fields. They have several advantages compared to prime fields. Most important is the fact that modern computer systems are built on the binary number system. With m bits, all possible elements of GF(2^m) can be represented. This is not possible with prime fields (with p ≠ 2). For example, a GF(2^2) field requires 2 bits for representation and uses all four numbers generated by the two bits. A GF(3) field would also require 2 bits for representing the three elements in the field. This leaves one of the four possible numbers generated by two bits unused, leading to an inefficient representation. Another advantage of binary extension fields is the simple hardware required for some of the commonly used arithmetic operations such as addition and squaring. Addition in binary extension fields can be performed by a simple XOR; no carry is generated. Squaring in this field is a linear operation and can also be done using XOR circuits. These circuits are much simpler than the addition and squaring circuits of a GF(p) field.

3.1.2 Binary Finite Fields

A polynomial of the form a(x) = am x^m + am−1 x^(m−1) + · · · + a1 x + a0 is said to be a polynomial over GF(2) if the coefficients am, am−1, · · ·, a1, a0 are in GF(2).
Further, the polynomial is said to be irreducible over GF(2) if a(x) is divisible only by c or by c · a(x), where c ∈ GF(2) [43]. An irreducible polynomial of degree m with coefficients in GF(2) can be used to construct the extension field GF(2^m). All elements of the extension field can be represented by polynomials of degree m − 1 over GF(2).
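For small degrees, irreducibility over GF(2) can be checked by trial division, with each polynomial packed into an integer bit vector (coefficient of x^i in bit i). This sketch is only an illustration; the function names are not from the thesis:

```python
def poly_mod(a: int, b: int) -> int:
    """Remainder of GF(2) polynomial a divided by b (bit-vector representation)."""
    db = b.bit_length() - 1
    while a and a.bit_length() - 1 >= db:
        a ^= b << (a.bit_length() - 1 - db)   # cancel the leading term of a
    return a

def is_irreducible(p: int) -> bool:
    """Trial-divide p by every lower-degree polynomial (fine for small m)."""
    for d in range(2, p):
        if d.bit_length() >= p.bit_length():
            break
        if poly_mod(p, d) == 0:
            return False
    return True

print(is_irreducible(0b10011))   # x^4 + x + 1        -> True
print(is_irreducible(0b10101))   # x^4 + x^2 + 1 = (x^2 + x + 1)^2 -> False
```

Real implementations use far better tests (e.g. Rabin's test), but trial division makes the definition concrete.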

Binary finite fields are generally represented using two types of bases. These are the
polynomial and normal base representations.

Definition 3.1.4 Let p(x) be an irreducible polynomial of degree m over GF(2) and let α be a root of p(x). Then the set

{1, α, α^2, · · ·, α^(m−1)}

is called the polynomial base.

Definition 3.1.5 Let p(x) be an irreducible polynomial of degree m over GF(2), and let α be a root of p(x). Then the set

{α, α^2, α^(2^2), · · ·, α^(2^(m−1))}

is called the normal base if the m elements are linearly independent.

Any element in the field GF(2^m) can be represented in terms of its base as shown below:

a(x) = am−1 α^(m−1) + · · · + a1 α + a0

Alternatively, the element a(x) can be represented as a binary string (am−1, · · ·, a1, a0), making it suited for representation on computer systems. For example, the polynomial x^4 + x^3 + x + 1 in the field GF(2^8) is represented as (00011011)2.
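The bit-string representation maps directly onto machine integers. A small helper (illustrative only, not part of the thesis) packs a set of exponents into the binary string used above:

```python
def poly_to_bits(exponents: set, width: int) -> str:
    """Pack a GF(2) polynomial, given as a set of exponents, into a bit string."""
    value = 0
    for e in exponents:
        value |= 1 << e          # coefficient of x^e goes into bit e
    return format(value, f'0{width}b')

# x^4 + x^3 + x + 1 in GF(2^8), matching the example in the text
print(poly_to_bits({4, 3, 1, 0}, 8))   # -> 00011011
```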

Various arithmetic operations such as addition, subtraction, multiplication, squaring


and inversion are carried out on binary fields. Addition and subtraction operations are
identical and are performed by XOR operations.

Fig. 3.1: Squaring Circuit (the bits of a(x) are spread apart by inserting zeroes, and the result is reduced by a modulo operation to give a(x)^2)

Let a(x), b(x) ∈ GF(2^m) be denoted by

a(x) = Σ_{i=0}^{m−1} ai x^i        b(x) = Σ_{i=0}^{m−1} bi x^i

then the addition (or subtraction) of a(x) and b(x) is given by

a(x) + b(x) = Σ_{i=0}^{m−1} (ai + bi) x^i        (3.1)

where the + between ai and bi denotes a XOR operation.
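With field elements packed as integer bit vectors, Equation 3.1 is a single machine operation. A minimal sketch (operands chosen arbitrarily for illustration):

```python
def gf2m_add(a: int, b: int) -> int:
    """Addition (and subtraction) in GF(2^m): coefficient-wise XOR, no carries."""
    return a ^ b

# (x^3 + x + 1) + (x^2 + x) = x^3 + x^2 + 1
print(bin(gf2m_add(0b1011, 0b0110)))   # -> 0b1101
```

Note that every element is its own additive inverse, which is why addition and subtraction coincide.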

The squaring operation on binary finite fields is as easy as addition. The square of the polynomial a(x) ∈ GF(2^m) is given by

a(x)^2 = Σ_{i=0}^{m−1} ai x^(2i) mod p(x)        (3.2)

The squaring essentially spreads out the input bits by inserting zeroes between every two bits, as shown in Figure 3.1.
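The bit-spreading step of Equation 3.2, before the modulo operation, can be sketched as follows (names are illustrative):

```python
def spread_bits(a: int) -> int:
    """Insert a zero between consecutive bits of a: the unreduced square of a(x)."""
    result, i = 0, 0
    while a:
        result |= (a & 1) << (2 * i)   # bit i of a(x) moves to bit 2i of a(x)^2
        a >>= 1
        i += 1
    return result

# (x^2 + x + 1)^2 = x^4 + x^2 + 1, before reduction
print(bin(spread_bits(0b111)))   # -> 0b10101
```

In hardware this step costs nothing at all: it is pure wiring, which is why squaring is so much cheaper than multiplication.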

Multiplication is not as trivial as addition or squaring. The product of the two polynomials a(x) and b(x) is given by

a(x) · b(x) = (Σ_{i=0}^{m−1} b(x) ai x^i) mod p(x)        (3.3)

Most multiplication algorithms are of order O(m^2).
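Equation 3.3 is the familiar shift-and-add product, with XOR in place of addition. A carry-less sketch over bit vectors, prior to the modulo step (illustrative only):

```python
def poly_mul(a: int, b: int) -> int:
    """Carry-less product of two GF(2) polynomials (no reduction yet)."""
    result = 0
    while a:
        if a & 1:            # coefficient a_i = 1: accumulate b(x) * x^i
            result ^= b
        a >>= 1
        b <<= 1              # next partial product is b(x) * x^(i+1)
    return result

# (x + 1)^2 = x^2 + 1 over GF(2): the cross terms cancel
print(bin(poly_mul(0b11, 0b11)))   # -> 0b101
```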

Inversion is the most complex of all field operations. Even the best technique to
implement inversion is several times more complex than multiplication. Hence, algo-
rithms which use finite field arithmetic generally try to reduce the number of inversions
at the cost of increasing the number of multiplications.

The multiplication and squaring operation require a modular operation to be done.


The modular operation is the remainder produced when divided by the field’s irre-
ducible polynomial. If a certain class of irreducible polynomials is used, the modular
operation can be easily done. Consider the irreducible trinomial x^m + x^n + 1, having a

Fig. 3.2: Modular Reduction with Trinomial x^233 + x^74 + 1 (bits at positions 233 to 464 are folded back onto positions 0 to 232)
root α and 1 < n < m/2. Then α^m + α^n + 1 = 0, and therefore

α^m = 1 + α^n
α^(m+1) = α + α^(n+1)
...        (3.4)
α^(2m−3) = α^(m−3) + α^(m+n−3)
α^(2m−2) = α^(m−2) + α^(m+n−2)

For example, consider the irreducible trinomial x^233 + x^74 + 1. The multiplication or squaring of polynomials in this field results in a polynomial of degree at most 464. This can be reduced as shown in Figure 3.2: the higher order terms, of degree 233 to 464, are reduced using Equation 3.4.
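Equation 3.4 translates into a simple bit-folding loop. The sketch below reduces an unreduced product modulo x^m + x^n + 1; the variable names are assumptions for illustration:

```python
def reduce_trinomial(c: int, m: int, n: int) -> int:
    """Reduce a polynomial of degree <= 2(m-1) modulo x^m + x^n + 1.

    Uses x^k = x^(k-m+n) + x^(k-m) for k >= m (Equation 3.4),
    processing the high-order bits from the top down.
    """
    for k in range(2 * m - 2, m - 1, -1):
        if c & (1 << k):
            c ^= (1 << k) | (1 << (k - m + n)) | (1 << (k - m))
    return c

# GF(2^4) with x^4 + x + 1: x^6 reduces to x^3 + x^2
print(bin(reduce_trinomial(1 << 6, 4, 1)))   # -> 0b1100
```

Because each high bit is replaced by only two lower bits, reduction with a trinomial is a fixed, shallow XOR network in hardware.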

3.2 Elliptic Curves

Definition 3.2.1 An elliptic curve E over the field GF(2^m) is given by the simplified form of the Weierstraß equation mentioned in Equation 2.1. The simplified Weierstraß equation is:

y^2 + xy = x^3 + ax^2 + b        (3.5)

with the coefficients a and b in GF(2^m) and b ≠ 0.

If b ≠ 0, then the curve in Equation 3.5 is non-singular. A point on the curve is said to be singular if the partial derivatives of the curve equation vanish at that point.

The set of points on the elliptic curve along with a special point O, called the point
at infinity, form a group under addition. The identity element of the group is the point
at infinity (O). The arithmetic operations permitted on the group are point inversion,
point addition and point doubling which are described as follows.

Fig. 3.3: Point Addition        Fig. 3.4: Point Doubling

Point Inversion : Let P be a point on the curve with coordinates (x1 , y1 ), then the
inverse of P is the point −P with coordinates (x1 , x1 + y1 ). The point −P is obtained
by drawing a vertical line through P . The point at which the line intersects the curve is
the inverse of P .

Point Addition: Let P and Q be two points on the curve with coordinates (x1, y1) and (x2, y2), and let P ≠ ±Q. Adding the two points results in a third point R = (P + Q). The addition is performed by drawing a line through P and Q as shown in Figure 3.3. The point at which the line intersects the curve is −(P + Q), and the inverse of this is R = (P + Q). Let the coordinates of R be (x3, y3); then

x3 = λ^2 + λ + x1 + x2 + a
y3 = λ(x1 + x3) + x3 + y1        (3.6)

where λ = (y1 + y2)/(x1 + x2). If Q = −P, then P + (−P) = O.

Point Doubling: Let P be a point on the curve with coordinates (x1, y1) and P ≠ −P. The double of P is the point 2·P = (x3, y3), obtained by drawing a tangent to the curve through P. The inverse of the point at which the tangent intersects the curve is
Algorithm 3.1: Double and Add algorithm for scalar multiplication
Input: Basepoint P = (px , py ) and Scalar k = (km−1 , km−2 · · · k0 )2 , where
km−1 = 1
Output: Point on the curve Q = kP
1 Q=P
2 for i = m − 2 to 0 do
3 Q=2·Q
4 if ki = 1 then
5 Q=Q+P
6 end
7 end

Table 3.1: Scalar Multiplication using Double and Add to find 22P

i ki Operation Q
3 0 Double only 2P
2 1 Double and Add 5P
1 1 Double and Add 11P
0 0 Double only 22P

the double of P (Figure 3.4). The equation for computing 2·P is given as

x3 = λ^2 + λ + a = x1^2 + b/x1^2
y3 = x1^2 + λx3 + x3        (3.7)

where λ = x1 + (y1/x1).
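The affine formulas (3.6) and (3.7) can be exercised in a toy field. The sketch below works in GF(2^4) with p(x) = x^4 + x + 1 and the arbitrarily chosen curve y^2 + xy = x^3 + 1 (a = 0, b = 1); all helper names are illustrative, and inversion is done as a^14 = a^(2^4 − 2) to avoid a separate inversion algorithm:

```python
M = 4                         # GF(2^4), p(x) = x^4 + x + 1

def gf_mul(a, b):
    """Multiply in GF(2^4): carry-less product, then fold high bits (x^4 = x + 1)."""
    r = 0
    while a:
        if a & 1:
            r ^= b
        a >>= 1
        b <<= 1
    for k in range(2 * M - 2, M - 1, -1):
        if r & (1 << k):
            r ^= (1 << k) | (1 << (k - M + 1)) | (1 << (k - M))
    return r

def gf_inv(a):
    """a^(2^4 - 2) = a^14 = a^2 * a^4 * a^8, by Fermat's little theorem."""
    a2 = gf_mul(a, a)
    a4 = gf_mul(a2, a2)
    a8 = gf_mul(a4, a4)
    return gf_mul(gf_mul(a2, a4), a8)

A = 0                         # curve coefficient a (b = 1 is implicit below)

def point_add(P, Q):          # Equation 3.6, assuming P != +-Q
    (x1, y1), (x2, y2) = P, Q
    lam = gf_mul(y1 ^ y2, gf_inv(x1 ^ x2))
    x3 = gf_mul(lam, lam) ^ lam ^ x1 ^ x2 ^ A
    y3 = gf_mul(lam, x1 ^ x3) ^ x3 ^ y1
    return (x3, y3)

def point_double(P):          # Equation 3.7, assuming x1 != 0
    x1, y1 = P
    lam = x1 ^ gf_mul(y1, gf_inv(x1))
    x3 = gf_mul(lam, lam) ^ lam ^ A
    y3 = gf_mul(x1, x1) ^ gf_mul(lam, x3) ^ x3
    return (x3, y3)

P = (1, 0)                    # a point on y^2 + xy = x^3 + 1
print(point_double(P))        # -> (0, 1)
print(point_add(P, (0, 1)))   # -> (1, 1), i.e. 3P = -P, so P has order 4
```

Note the single gf_inv per operation: exactly the 1I + 2M cost counted in the running-time estimate that follows.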

The fundamental algorithm for ECC is the scalar multiplication (defined in Section
2.1). The basic double and add algorithm to perform scalar multiplication is shown in
Algorithm 3.1. The input to the algorithm is a basepoint P and an m bit scalar k. The
result is the scalar product kP .

As an example of how Algorithm 3.1 works, consider k = 22. The binary equivalent
of this is (10110)2 . Table 3.1 below shows how 22P is computed.
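The control flow of Algorithm 3.1 is independent of the group used. As a sketch, the version below stands in ordinary integer addition for the elliptic curve group operations, so Q accumulates k·P exactly as in Table 3.1; substituting real point doubling and addition routines leaves the structure unchanged:

```python
def double_and_add(P: int, k: int) -> int:
    """Algorithm 3.1, with integer addition standing in for the group law."""
    bits = bin(k)[2:]          # k_(m-1) ... k_0, with k_(m-1) = 1
    Q = P                      # line 1: Q = P
    for b in bits[1:]:         # i = m-2 down to 0
        Q = Q + Q              # point doubling stand-in
        if b == '1':
            Q = Q + P          # point addition stand-in
    return Q

print(double_and_add(1, 22))   # -> 22, via the double/add steps of Table 3.1
```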

Each iteration of i does a doubling of Q if ki is 0, or a doubling followed by an addition if ki is 1. The underlying operations in the addition and doubling equations use the finite field arithmetic discussed in the previous section. Both point doubling and point addition require 1 inversion (I) and 2 multiplications (M) each (from Equations 3.6 and 3.7). From this, the entire scalar multiplication for the m bit scalar k requires m(1I + 2M) for the doublings and (m/2)(1I + 2M) for the additions (assuming k has approximately m/2 ones on average). The overall expected running time of the scalar multiplication is therefore obtained as

ta ≈ (3M + (3/2)I)m        (3.8)

For this expected running time, finite field addition and squaring operations have been neglected as they are simple operations and can be considered to have no overhead on the run time.

3.2.1 Projective Coordinate Representation

The complexity of a finite field inversion is typically eight times that of a finite field multiplication in the same field [44]. Therefore, there is a huge motivation for an alternate point representation which requires fewer inversions. The two coordinate system (x, y) used in Equations 3.5, 3.6 and 3.7 of the previous section is called the affine representation. It has been shown that each affine point on the elliptic curve has a one to one correspondence with a unique equivalence class in which each point is represented by three coordinates (X, Y, Z). The three coordinate system is called the projective representation [11]. In the projective representation, inversions are replaced by multiplications. The projective form of the Weierstraß equation is obtained by replacing x with X/Z^c and y with Y/Z^d. Several projective coordinate systems have been proposed. The most commonly used are the standard coordinates, where c = 1 and d = 1, the Jacobian coordinates, with c = 2 and d = 3, and the López-Dahab (LD) coordinates [11], which have c = 1 and d = 2. The LD coordinate

Replacing x by X/Z and y by Y/Z^2 in Equation 3.5 results in the LD projective form of the Weierstraß equation:

Y^2 + XYZ = X^3 + aX^2 Z^2 + bZ^4        (3.9)

Let P = (X1, Y1, Z1) be an LD projective point on the elliptic curve; then the inverse of the point P is given by −P = (X1, X1 Z1 + Y1, Z1). Also, P + (−P) = O, where O is the point at infinity. In LD projective coordinates O is represented as (1, 0, 0).

The equation for doubling the point P in LD projective coordinates [30] results in the point 2P = (X3, Y3, Z3), given by the following equations:

Z3 = X1^2 · Z1^2
X3 = X1^4 + b · Z1^4        (3.10)
Y3 = b · Z1^4 · Z3 + X3 · (a · Z3 + Y1^2 + b · Z1^4)

The doubling equations require 5 finite field multiplications and zero inversions.
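Equation 3.10 can be checked in a toy field. The sketch below reuses GF(2^4) with p(x) = x^4 + x + 1 and the curve a = 0, b = 1 (both arbitrary, illustrative choices); doubling the affine point (1, 0), held in LD form as (X, Y, Z) = (1, 0, 1), must give the LD form of its affine double (0, 1):

```python
M = 4                          # GF(2^4), trinomial x^4 + x + 1

def gf_mul(a, b):
    r = 0
    while a:
        if a & 1:
            r ^= b
        a >>= 1
        b <<= 1
    for k in range(2 * M - 2, M - 1, -1):   # fold high bits: x^4 = x + 1
        if r & (1 << k):
            r ^= (1 << k) | (1 << (k - M + 1)) | (1 << (k - M))
    return r

A, B = 0, 1                    # curve coefficients

def ld_double(X1, Y1, Z1):
    """Point doubling in Lopez-Dahab coordinates (Equation 3.10)."""
    sq = lambda v: gf_mul(v, v)
    Z3 = gf_mul(sq(X1), sq(Z1))
    bZ14 = gf_mul(B, sq(sq(Z1)))
    X3 = sq(sq(X1)) ^ bZ14
    Y3 = gf_mul(bZ14, Z3) ^ gf_mul(X3, gf_mul(A, Z3) ^ sq(Y1) ^ bZ14)
    return (X3, Y3, Z3)

print(ld_double(1, 0, 1))     # -> (0, 1, 1): the affine point (0, 1)
print(ld_double(0, 1, 1)[2])  # -> 0: Z3 = 0 signals the point at infinity
```

No inversion appears anywhere: only multiplications and squarings, which is the entire point of the representation.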

The equations in LD coordinates for adding the affine point Q = (x2, y2) to P, where Q ≠ ±P, are shown in Equation 3.11. The resulting point is P + Q = (X3, Y3, Z3).

A = y2 · Z1^2 + Y1
B = x2 · Z1 + X1
C = Z1 · B
D = B^2 · (C + a · Z1^2)
Z3 = C^2        (3.11)
E = A · C
X3 = A^2 + D + E
F = X3 + x2 · Z3
G = (x2 + y2) · Z3^2
Y3 = (E + Z3) · F + G

Point addition in LD coordinates now requires 9 finite field multiplications and zero inversions. For an m bit scalar with approximately half the bits set to one, the expected running time is given by Equation 3.12. One inversion and 2 multiplications are required at the end to convert the result from projective coordinates back into affine.

tld ≈ m(5M + (9/2)M) + 2M + 1I
    = (9.5m + 2)M + 1I        (3.12)

The LD coordinates require several multiplications to be done but have the advantage
of requiring just one inversion. To be beneficial, the extra multiplications should have a
lower complexity than the inversions removed.
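Equation 3.11 can be traced in the same toy field GF(2^4) (p(x) = x^4 + x + 1, a = 0, b = 1, all illustrative): adding the affine point (0, 1) to P = (1, 0, 1) should give (1, 1, 1), the LD form of the affine point (1, 1). Intermediate names Aa and Bb stand for the A and B of Equation 3.11:

```python
M = 4                          # GF(2^4) with x^4 + x + 1

def gf_mul(a, b):
    r = 0
    while a:
        if a & 1:
            r ^= b
        a >>= 1
        b <<= 1
    for k in range(2 * M - 2, M - 1, -1):
        if r & (1 << k):
            r ^= (1 << k) | (1 << (k - M + 1)) | (1 << (k - M))
    return r

A = 0                          # curve coefficient a

def ld_add_mixed(X1, Y1, Z1, x2, y2):
    """Mixed-coordinate point addition (Equation 3.11), assuming Q != +-P."""
    sq = lambda v: gf_mul(v, v)
    Aa = gf_mul(y2, sq(Z1)) ^ Y1
    Bb = gf_mul(x2, Z1) ^ X1
    C = gf_mul(Z1, Bb)
    D = gf_mul(sq(Bb), C ^ gf_mul(A, sq(Z1)))
    Z3 = sq(C)
    E = gf_mul(Aa, C)
    X3 = sq(Aa) ^ D ^ E
    F = X3 ^ gf_mul(x2, Z3)
    G = gf_mul(x2 ^ y2, sq(Z3))
    Y3 = gf_mul(E ^ Z3, F) ^ G
    return (X3, Y3, Z3)

print(ld_add_mixed(1, 0, 1, 0, 1))   # -> (1, 1, 1)
```

Keeping Q in affine form is what saves multiplications relative to a fully projective addition.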

3.3 Conclusion

This chapter presented the necessary mathematical background required for this thesis.
The performance of the entire elliptic curve crypto processor depends on the underlying
finite field primitives; therefore, the primitives should be efficiently implemented. The
next two chapters discuss implementations of two of the most dominant primitives used
in ECC, namely the finite field multiplication and inversion.

CHAPTER 4

Architecting an Efficient Implementation of a Finite


Field Multiplier on FPGA Platforms

The finite field multiplier forms the most important component in the elliptic curve
crypto processor (ECCP). It occupies the most area on the device and also has the
longest latency. The performance of the ECCP is affected most by the multiplier. Finite
field multiplication of two elements in the field GF(2^m) is defined as

C(x) = A(x) · B(x) mod P(x)        (4.1)

where C(x), A(x), and B(x) are in GF(2^m) and P(x) is the irreducible polynomial that generates the field GF(2^m). Implementing the multiplication requires two steps. First, the polynomial product C′(x) = A(x) · B(x) is determined; then the modulo operation is done on C′(x). This chapter deals with polynomial multiplication.

The organization of the chapter is as follows: the next section contains a brief
overview of important finite field multipliers in literature. Section 4.2 discusses the
Karatsuba algorithm in greater detail. Section 4.3 outlines some of the Karatsuba mul-
tiplication variants used for elliptic curves. Section 4.4 presents how a circuit gets
mapped to a four input LUT based FPGA. Section 4.5 analyzes how the existing Karat-
suba algorithms get mapped on to the FPGA. It also presents the proposed hybrid Karat-
suba multiplier which maximizes utilization of FPGA resources. Section 4.6 compares
the performance of the hybrid Karatsuba multiplier with existing implementations of
the Karatsuba algorithm. The final section has the conclusion.
4.1 Finite Field Multipliers for High Performance Applications

The schoolbook method of multiplying two polynomials requires m^2 AND gates to
generate the partial products. The final product is formed by adding the partial products.
Since we deal with binary fields, addition is easily done using XOR gates without any
carries being propagated; thus (m − 1)^2 XOR gates are required for the additions.
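As an illustration (a software sketch, not the hardware description), the schoolbook product can be modelled with polynomials over GF(2) encoded as integers, where bit i holds the coefficient of x^i; the function name is chosen here for exposition.

```python
def schoolbook_mul(a: int, b: int) -> int:
    """Carry-less schoolbook product of two GF(2) polynomials.

    Polynomials are encoded as integers: bit i holds the coefficient
    of x^i.  Partial products are formed with AND (the m^2 AND gates)
    and summed with XOR, since no carries propagate in GF(2).
    """
    c = 0
    i = 0
    while b >> i:
        if (b >> i) & 1:      # coefficient b_i of the multiplier
            c ^= a << i       # add (XOR) the shifted partial product
        i += 1
    return c

# (x^2 + x + 1)(x + 1) = x^3 + 1 over GF(2)
assert schoolbook_mul(0b111, 0b11) == 0b1001
```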

The Massey-Omura multiplier operates on normal basis representations of the field
elements. With this representation, the structure of the multiplication becomes highly
uniform, resulting in an efficient hardware architecture. The architecture takes a parallel
input but the result is produced serially [45].

Another multiplier based on normal basis is the Sunar-Koç multiplier [46]. It
requires less hardware than the Massey-Omura multiplier but has similar timing
requirements.

In [47], the Montgomery multiplier is adapted to binary finite fields. The multiplication
in Equation 4.1 is replaced by the following equation

C(x) = A(x) · B(x) · R(x)^{-1} mod P(x)     (4.2)

where R(x) is of the form x^k, is an element in the field, and gcd(R(x), P(x)) = 1.
The division by R(x) reduces the complexity of the modular operation: since R(x) = x^k,
division by R(x) is a simple shift and is easily accomplished on a computer. This
multiplier is best suited for low resource environments where speed of operation is not
critical [44].

The Karatsuba multiplier [12] uses a divide and conquer approach to multiply A(x)
and B(x). The m term polynomials are recursively split into two. With each split
the size of the multiplication required reduces by half. This leads to a reduction in

the number of AND gates required at the cost of an increase in XOR gates. This
also results in the multiplier having a space complexity of O(m^{log₂ 3}) for polynomial
representations of finite fields. A comparison of the available multipliers shows that only
the Karatsuba multiplier has sub-quadratic complexity; all the others are quadratic.
Besides this, it has been shown in [44] and [48] that the Karatsuba multiplier, if
designed properly, is also the fastest.

For a high performance elliptic curve crypto processor, the finite field multiplier
with the smallest delay and the least number of clock cycles is best suited. The
Karatsuba multiplier, if properly designed, attains these speed requirements and at the
same time has sub-quadratic space complexity. This makes it the best choice for high
performance applications.

4.2 Karatsuba Multiplication

In the Karatsuba multiplier, the m bit multiplicands A(x) and B(x), represented in
polynomial basis, are split as shown in Equation 4.3. For brevity, the equations that
follow represent the polynomials A_h(x), A_l(x), B_h(x), and B_l(x) by A_h, A_l, B_h,
and B_l respectively.

A(x) = A_h x^{m/2} + A_l
B(x) = B_h x^{m/2} + B_l     (4.3)

The multiplication is then done using three m/2 bit multiplications as shown in Equation 4.4.

C′(x) = (A_h x^{m/2} + A_l)(B_h x^{m/2} + B_l)
      = A_h B_h x^m + (A_h B_l + A_l B_h) x^{m/2} + A_l B_l
      = A_h B_h x^m + ((A_h + A_l)(B_h + B_l) + A_h B_h + A_l B_l) x^{m/2} + A_l B_l     (4.4)

The Karatsuba multiplier can be applied recursively to each m/2 bit multiplication in
Equation 4.4. It is best suited when m is a power of 2, as this allows the multiplicands
to be split repeatedly until they reach 2 bits. The final recursion, consisting of 2 bit
multiplications, can be realized with AND gates. Such a multiplier with m a power of 2
is called the basic Karatsuba multiplier.
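The recursion of Equation 4.4 can be sketched in software, with polynomials over GF(2) encoded as integers (bit i holds the coefficient of x^i) and m a power of two; the 2 bit base case corresponds to the AND gates of the final recursion. The function name is illustrative.

```python
def karatsuba(a: int, b: int, m: int) -> int:
    """Basic recursive Karatsuba product over GF(2), m a power of 2.

    Polynomials are encoded as integers (bit i = coefficient of x^i).
    Additions are XORs; the 2-bit base case uses AND gates only.
    """
    if m == 2:
        a0, a1 = a & 1, a >> 1
        b0, b1 = b & 1, b >> 1
        c0 = a0 & b0
        c2 = a1 & b1
        c1 = c0 ^ c2 ^ ((a0 ^ a1) & (b0 ^ b1))   # Equation 4.10
        return (c2 << 2) | (c1 << 1) | c0
    h = m // 2
    al, ah = a & ((1 << h) - 1), a >> h
    bl, bh = b & ((1 << h) - 1), b >> h
    p_ll = karatsuba(al, bl, h)                  # Al * Bl
    p_hh = karatsuba(ah, bh, h)                  # Ah * Bh
    p_mm = karatsuba(al ^ ah, bl ^ bh, h)        # (Ah + Al)(Bh + Bl)
    # combine as in Equation 4.4
    return (p_hh << m) ^ ((p_ll ^ p_hh ^ p_mm) << h) ^ p_ll
```

Three half-size products replace the four of the schoolbook method, which is the source of the O(m^{log₂ 3}) complexity.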

4.3 Karatsuba Multipliers for Elliptic Curves

The basic recursive Karatsuba multiplier cannot be applied directly to ECC because
the binary extension fields used in standards such as [14] have a prime degree. There
have been several published works which implement a modified Karatsuba algorithm
for use in elliptic curves. There are two main design approaches followed. The first
approach is a sequential circuit having less hardware and latency but requiring several
clock cycles to produce the result. Generally at every clock cycle the outputs are fed-
back into the circuit thus reusing the hardware. The advantage of this approach is that
it can be pipelined. Examples of implementations following this approach can be found
in[48–51]. The second approach is a combinational circuit having large area and delay
but is capable of generating the result in one clock cycle. Examples of this approach
can found in [52–55]. Our proposed Karatsuba multiplier follows the second approach,

therefore in the remaining part of this section we analyze the combinational circuits for
Karatsuba multipliers.

The easiest method to modify the Karatsuba algorithm for elliptic curves is by
padding. The padded Karatsuba multiplier extends the m bit multiplicands to 2^{⌈log₂ m⌉}
bits by padding the most significant bits with zeroes. This allows the use of the basic
recursive Karatsuba algorithm. The obvious drawback of this method is the extra
arithmetic introduced due to the padding.

In [53], a binary Karatsuba multiplier was proposed to handle multiplications in
any field of the form GF(2^m), where m = 2^k + d and k is the largest integer such
that 2^k < m. The binary Karatsuba multiplier splits the m bit multiplicands (A(x)
and B(x)) into two terms. The lower terms (A_l and B_l) have 2^k bits while the higher
terms (A_h and B_h) have d bits. Two 2^k bit multipliers are required to obtain the partial
products A_l B_l and (A_h + A_l)(B_h + B_l). For the latter multiplication, the A_h and B_h
terms have to be padded with 2^k − d zero bits. The A_h B_h product is determined using
a d bit binary Karatsuba multiplier.

The simple Karatsuba multiplier [55] is the basic recursive Karatsuba multiplier
with a small modification. If an m bit multiplication needs to be done, m being any
integer, the multiplicands are split into two polynomials as in Equation 4.3. The A_l and
B_l terms have ⌈m/2⌉ bits and the A_h and B_h terms have ⌊m/2⌋ bits. The Karatsuba
multiplication can then be done with two ⌈m/2⌉ bit multiplications and one ⌊m/2⌋ bit
multiplication. The upper bound on the number of AND and XOR gates required by the
simple Karatsuba multiplier is the same as that of a 2^{⌈log₂ m⌉} bit basic recursive
Karatsuba multiplier. The maximum number of gates and the time delay for an m bit
simple Karatsuba multiplier are given below.

#AND gates: 3^{⌈log₂ m⌉}

#XOR gates: Σ_{r=0}^{⌈log₂ m⌉} 3^r (4⌈m/2^r⌉ − 4)     (4.5)

In the general Karatsuba multiplier [55], the multiplicands are split into more than
two terms; an m bit multiplier treats the multiplicands as m individual terms. The
number of gates required is given below.

#AND gates: m(m + 1)/2

#XOR gates: (5/2)m^2 − (7/2)m + 1     (4.6)

4.4 Designing for the FPGA Architecture

Maximizing the performance of a hardware design requires the design to be customized


for the target architecture. The smallest programmable entity on an FPGA is the look
up table (LUT) (Section 2.3.1). A LUT generally has four inputs and can be configured
for any logic function having a maximum of four inputs. The LUT can also be used to
implement logic functions having fewer than four inputs, two for example; in this case
only half the LUT is utilized. Such a LUT having fewer than four inputs is an under
utilized LUT. For example, the logic function y = x1 + x2 under utilizes the LUT as it
has only two inputs. The most compact implementations are obtained when the
utilization of each LUT is maximized. It follows that the minimum number of LUTs
required for a q input combinational function is given by Equation 4.7.




#LUT(q) = 0          if q = 1
        = 1          if 1 < q ≤ 4
        = ⌈q/3⌉      if q > 4 and q mod 3 = 2     (4.7)
        = ⌊q/3⌋      if q > 4 and q mod 3 ≠ 2

The delay of the q input combinational function in terms of LUTs is given by Equation 4.8,
where D_LUT is the delay of one LUT.

DELAY(q) = ⌈log₄ q⌉ · D_LUT     (4.8)
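Equations 4.7 and 4.8 are straightforward to evaluate in software. The sketch below (with D_LUT normalized to 1, and function names chosen for exposition) computes the LUT levels by counting powers of four rather than with floating-point logarithms, which avoids rounding artifacts:

```python
D_LUT = 1.0  # normalized delay of one 4-input LUT

def lut_count(q: int) -> int:
    """Minimum number of 4-input LUTs for a q-input function
    realized as a tree (Equation 4.7)."""
    if q == 1:
        return 0
    if q <= 4:
        return 1
    return -(-q // 3) if q % 3 == 2 else q // 3   # -(-q//3) is ceil(q/3)

def lut_levels(q: int) -> int:
    """Number of LUT levels on the critical path, i.e. ceil(log4 q)."""
    levels, span = 0, 1
    while span < q:       # each extra level multiplies the fan-in by 4
        span *= 4
        levels += 1
    return levels

def delay(q: int) -> float:
    """Equation 4.8: DELAY(q) = ceil(log4 q) * D_LUT."""
    return lut_levels(q) * D_LUT
```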

The percentage of under utilized LUTs in a design is determined using Equation 4.9.
Here, LUT_k signifies that k of the 4 inputs are used by the design block realized by
the LUT. So LUT_2 and LUT_3 are under utilized LUTs, while LUT_4 is fully utilized.

%UnderUtilizedLUTs = (LUT_2 + LUT_3) / (LUT_2 + LUT_3 + LUT_4) × 100     (4.9)

[Figure: alignment of the partial products A_l B_l, A_h B_h, and (A_h + A_l)(B_h + B_l) across output bit positions 0 to 2n − 2]

Fig. 4.1: Combining the Partial Products in a Karatsuba Multiplier

4.5 Analyzing Karatsuba Multipliers on FPGA Platforms

In this section we discuss the mapping of various Karatsuba algorithms on an FPGA.


We estimate the amount of FPGA resources that is required for the implementations.

Recursive Karatsuba Multiplier: In an m (= 2^k) bit recursive Karatsuba multiplier the
basic Karatsuba algorithm of [12] is applied recursively. Each recursion halves the
size of the inputs while tripling the number of multiplications required. At each
recursion except the final one, only XOR operations are involved. Let n = 2^{(log₂ m)−k}
be the size of the inputs (A and B) at the k-th recursion of the m bit multiplier. There are 3^k

such n bit multipliers required. The A and B inputs are split into two: A_h, A_l and B_h, B_l
respectively, with each term having n/2 bits. n/2 two input XORs are required for the
computation of each of A_h + A_l and B_h + B_l (Equation 4.4). Each two input XOR
requires one LUT on the FPGA, so n LUTs are required in total. Combining
the partial products as shown in Figure 4.1 is the last step of the recursion. Determining
the output bits n − 2 down to n/2 and 3n/2 − 2 down to n requires 3(n/2 − 1) two input
XORs each. The output bit n − 1 requires 2 two input XORs. In all, (3n − 4) two input
XORs are required to add the partial products. The number of LUTs required to combine
the partial products is much lower, because each LUT can implement a four input
XOR. Each output bit n/2 to 3n/2 − 2 requires one LUT, therefore (n − 1) LUTs
suffice for the purpose. In total, 2n − 1 LUTs are required for each recursion on the
FPGA. The final recursion has 3^{(log₂ m)−1} two bit Karatsuba multipliers. The equation
for the two bit Karatsuba multiplier is shown in Equation 4.10.
for the two bit Karatsuba multiplier is shown in Equation 4.10.

C_0 = A_0 B_0
C_1 = A_0 B_0 + A_1 B_1 + (A_0 + A_1)(B_0 + B_1)     (4.10)
C_2 = A_1 B_1

This requires three LUTs on the FPGA: one for each of the output bits (C0 , C1 , C2 ).

The total number of LUTs required for the m bit recursive Karatsuba multiplication
is given by Equation 4.11.

#LUTS_R(m) = 3 · 3^{log₂ m − 1} + Σ_{k=0}^{log₂ m − 2} 3^k (2 · 2^{log₂ m − k} − 1)
           = Σ_{k=0}^{log₂ m − 1} 3^k (2^{log₂ m − k + 1} − 1)     (4.11)

The delay of the recursive Karatsuba multiplier in terms of LUTs is given by
Equation 4.12. The first log₂(m) − 1 recursions have a delay of 2 LUTs each; the last
recursion has a delay of 1 LUT.

DELAY_R(m) = (2(log₂ m − 1) + 1) D_LUT
           = (2 log₂ m − 1) D_LUT     (4.12)

When m is not necessarily a power of 2, the number of recursions of an m bit simple
Karatsuba multiplier is equivalent to that of a 2^{⌈log₂ m⌉} bit recursive Karatsuba
multiplier; therefore Equations 4.11 and 4.12 form the upper bounds on the number of
LUTs and the delay of a simple Karatsuba multiplier [55] (Equations 4.13 and 4.14).

#LUTS_S(m) ≤ #LUTS_R(2^{⌈log₂ m⌉})     (4.13)

DELAY_S(m) ≤ DELAY_R(2^{⌈log₂ m⌉})     (4.14)

General Karatsuba Multiplier: The m bit general Karatsuba algorithm [55] is shown
in Algorithm 4.1. Each iteration of i computes two output bits, C_i and C_{2m−2−i}, and
computing the two bits requires the same amount of resources on the FPGA. Lines 6
and 7 of the algorithm are executed once for every even iteration of i and are not
executed for odd iterations. The term M_j + M_{i−j} + M_{(j,i−j)} is computed from the four
inputs A_j, A_{i−j}, B_j and B_{i−j}; therefore, on the FPGA, computing the term requires one
LUT. For an odd i, C_i has ⌈i/2⌉ such LUTs whose outputs have to be added; the
number of LUTs required for this is obtained from Equation 4.7. An even value of
i has two additional inputs corresponding to M_{i/2} that have to be added. The
number of LUTs required for computing C_i (0 ≤ i ≤ m − 1) is given by Equation 4.15.

#LUT_{C_i} = 1                           if i = 0
           = ⌈i/2⌉ + #LUT(⌈i/2⌉)         if i is odd     (4.15)
           = i/2 + #LUT(i/2 + 2)         if i is even

Algorithm 4.1: gkmul (General Karatsuba Multiplier)
Input: A, B are multiplicands of m bits
Output: C of length 2m − 1 bits
/* Define: M_x → A_x B_x */
/* Define: M_{(x,y)} → (A_x + A_y)(B_x + B_y) */
1  begin
2      for i = 0 to m − 2 do
3          C_i = C_{2m−2−i} = 0
4          for j = 0 to ⌊i/2⌋ do
5              if i = 2j then
6                  C_i = C_i + M_j
7                  C_{2m−2−i} = C_{2m−2−i} + M_{m−1−j}
8              else
9                  C_i = C_i + M_j + M_{i−j} + M_{(j,i−j)}
10                 C_{2m−2−i} = C_{2m−2−i} + M_{m−1−j}
11                     + M_{m−1−i+j} + M_{(m−1−j,m−1−i+j)}
12             end
13         end
14     end
15     C_{m−1} = 0
16     for j = 0 to ⌊(m − 1)/2⌋ do
17         if m − 1 = 2j then
18             C_{m−1} = C_{m−1} + M_j
19         else
20             C_{m−1} = C_{m−1} + M_j + M_{m−1−j} + M_{(j,m−1−j)}
21         end
22     end
23 end
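Algorithm 4.1 can be transcribed almost line for line into software for checking; the sketch below represents A and B as lists of coefficient bits and uses XOR for the additions:

```python
def gkmul(A, B, m):
    """General (m-term) Karatsuba product over GF(2) (Algorithm 4.1).

    A and B are lists of m coefficient bits; returns the 2m - 1
    coefficients of the product."""
    Mx = lambda x: A[x] & B[x]                       # M_x = A_x B_x
    Mxy = lambda x, y: (A[x] ^ A[y]) & (B[x] ^ B[y]) # M_(x,y)
    C = [0] * (2 * m - 1)
    for i in range(m - 1):
        for j in range(i // 2 + 1):
            if i == 2 * j:
                C[i] ^= Mx(j)
                C[2 * m - 2 - i] ^= Mx(m - 1 - j)
            else:
                C[i] ^= Mx(j) ^ Mx(i - j) ^ Mxy(j, i - j)
                C[2 * m - 2 - i] ^= (Mx(m - 1 - j) ^ Mx(m - 1 - i + j)
                                     ^ Mxy(m - 1 - j, m - 1 - i + j))
    for j in range((m - 1) // 2 + 1):
        if m - 1 == 2 * j:
            C[m - 1] ^= Mx(j)
        else:
            C[m - 1] ^= Mx(j) ^ Mx(m - 1 - j) ^ Mxy(j, m - 1 - j)
    return C

# (1 + x)(1 + x^2) = 1 + x + x^2 + x^3
assert gkmul([1, 1, 0], [1, 0, 1], 3) == [1, 1, 1, 1, 0]
```

The M_x and M_{(x,y)} terms map naturally onto four input LUTs, which is the source of the algorithm's good LUT utilization for small m.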

The total number of LUTs required for the general Karatsuba multiplier is given by
Equation 4.16.

#LUTS_G(m) = 2 ( Σ_{i=0}^{m−2} #LUT_{C_i} ) + #LUT_{C_{m−1}}     (4.16)

When implemented in hardware, all output bits are computed simultaneously. The
delay of the general Karatsuba multiplier (Equation 4.17) is equal to the delay of the
output bit with the most terms, which is C_{m−1} (lines 15 to 22 in Algorithm 4.1).
Equation 4.17 is obtained from Equation 4.15 with i = m − 1. The

Table 4.1: Comparison of LUT Utilization in Multipliers

            General Karatsuba                    Simple Karatsuba
 m      Gates   LUTs   LUTs Under           Gates   LUTs   LUTs Under
                       Utilized                            Utilized
 2         7      3      66.6%                 7      3      66.6%
 4        37     11      45.5%                33     16      68.7%
 8       169     53      20.7%               127     63      66.6%
16       721    188      17.0%               441    220      65.0%
29      2437    670      10.7%              1339    669      65.4%
32      2977    799      11.3%              1447    723      63.9%

⌈i/2⌉ computations are done with a delay of one LUT (D_LUT). Equation 4.8 is used to
compute the second term of Equation 4.17.

DELAY_G(m) = D_LUT + DELAY(⌈(m − 1)/2⌉)      if m − 1 is odd     (4.17)
           = D_LUT + DELAY((m − 1)/2 + 2)    if m − 1 is even

4.5.1 The Hybrid Karatsuba Multiplier

In this section we present our proposed multiplier, called the hybrid Karatsuba
multiplier, and show how techniques are combined to maximize the utilization of LUTs,
resulting in minimum area.

Table 4.1 compares the general and simple Karatsuba algorithms in terms of gate
counts (two input XOR and AND gates), the LUTs required on a Xilinx Virtex 4 FPGA,
and the percentage of LUTs under utilized (Equation 4.9).

The simple Karatsuba multiplier alone is not efficient for FPGA platforms, as about
65% of its LUTs are under utilized. For an m bit simple Karatsuba multiplier the two
bit multipliers take up approximately a third of the area (for m = 256). In a two bit
multiplier, two of the three LUTs required are under utilized (in Equation 4.10, C_0
and C_2 result in under utilized LUTs). In addition, around half the LUTs used in each
recursion are under utilized. The under utilized LUTs result in a bloated area
requirement on the FPGA.

The m-term general Karatsuba multiplier is more efficient on the FPGA for small
values of m (Table 4.1), even though its gate count is significantly higher. This is
because a large number of operations can be grouped in fours, which fully utilizes the
LUTs. For small values of m (m < 29) the compactness obtained by the fully utilized
LUTs outweighs the large gate count, resulting in a low footprint on the FPGA. For
m ≥ 29, the gate count far exceeds the efficiency gained from the fully utilized LUTs,
resulting in a larger footprint than the simple Karatsuba implementation.

Algorithm 4.2: hmul (Hybrid Karatsuba Multiplier)
Input: The multiplicands A, B and their length m
Output: C of length 2m − 1 bits
1  begin
2      if m < 29 then
3          return gkmul(A, B, m)
4      else
5          l = ⌈m/2⌉
6          A′ = A_{[m−1···l]} + A_{[l−1···0]}
7          B′ = B_{[m−1···l]} + B_{[l−1···0]}
8          Cp1 = hmul(A_{[l−1···0]}, B_{[l−1···0]}, l)
9          Cp2 = hmul(A′, B′, l)
10         Cp3 = hmul(A_{[m−1···l]}, B_{[m−1···l]}, m − l)
11         return (Cp3 << 2l) + ((Cp1 + Cp2 + Cp3) << l) + Cp1
           /* << indicates left shift */
12     end
13 end

In our proposed hybrid Karatsuba multiplier shown in Algorithm 4.2, the m bit
multiplicands are split into two parts when the number of bits is greater than or equal to
the threshold 29. The higher term has ⌊m/2⌋ bits while the lower term has ⌈m/2⌉ bits.
If the number of bits of the multiplicand is less than 29 the general Karatsuba algorithm

[Figure: recursion tree of the 233 bit hybrid multiplier — 233 splits into 116 and 117; these split into 58 and 59 bit multiplications, then 29 and 30 bit ones, ending in 14 and 15 bit general Karatsuba multipliers at the leaves. Simple Karatsuba is used at the upper levels, general Karatsuba at the leaves.]

Fig. 4.2: 233 Bit Hybrid Karatsuba Multiplier

is invoked. The general Karatsuba algorithm ensures maximum utilization of the LUTs
for the smaller bit multiplications, while the simple Karatsuba algorithm ensures least
gate count for the larger bit multiplications. For a 233 bit hybrid Karatsuba multiplier
(Figure 4.2), the multiplicands are split into two terms with Ah and Bh of 116 bits and
Al and Bl of 117 bits. The 116 bit multiplication is implemented using three 58 bit
multipliers, while the 117 bit multiplier is implemented using two 59 bit multipliers
and a 58 bit multiplier. The 58 and 59 bit multiplications are implemented with 29 and
30 bit multipliers, the 29 and 30 bit multiplications are done using 14 and 15 bit general
Karatsuba multipliers.
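The recursion of Algorithm 4.2 can be checked with a small software model. In the sketch below (all names illustrative), a plain carry-less product stands in for the general Karatsuba multiplier at the leaves, since the point of the sketch is the split schedule, not the leaf multiplier:

```python
THRESHOLD = 29  # below this bit-length, switch to the leaf multiplier

def poly_mul(a: int, b: int) -> int:
    """Plain carry-less GF(2) product, standing in for gkmul at the
    leaves (bit i of an int holds the coefficient of x^i)."""
    c = 0
    while b:
        if b & 1:
            c ^= a
        a <<= 1
        b >>= 1
    return c

def hmul(a: int, b: int, m: int) -> int:
    """Hybrid Karatsuba (Algorithm 4.2): simple Karatsuba splits down
    to the 29-bit threshold, a leaf multiplier below it."""
    if m < THRESHOLD:
        return poly_mul(a, b)
    l = (m + 1) // 2                       # lower halves get ceil(m/2) bits
    al, ah = a & ((1 << l) - 1), a >> l
    bl, bh = b & ((1 << l) - 1), b >> l
    cp1 = hmul(al, bl, l)                  # Al * Bl
    cp2 = hmul(al ^ ah, bl ^ bh, l)        # (Ah + Al)(Bh + Bl)
    cp3 = hmul(ah, bh, m - l)              # Ah * Bh
    return (cp3 << 2 * l) ^ ((cp1 ^ cp2 ^ cp3) << l) ^ cp1
```

For m = 233 the recursion produces exactly the 117/116, 59/58, 30/29 and 15/14 bit splits of Figure 4.2, with the 14 and 15 bit multiplications reaching the leaf multiplier.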

The number of recursions in the hybrid Karatsuba multiplier is given by

r = ⌈log₂(m/29)⌉ + 1     (4.18)

The i-th recursion (0 < i < r) of the m bit multiplier has 3^i multiplications. The
multipliers in this recursion have bit lengths ⌈m/2^i⌉ and ⌊m/2^i⌋. For simplicity we
assume the number of gates required for the ⌊m/2^i⌋ bit multiplier is equal to that of the
⌈m/2^i⌉ bit multiplier. The total number of AND gates required is the AND gates for
the multipliers in the final recursion (i.e. the ⌈m/2^{r−1}⌉ bit multipliers) times the
number of ⌈m/2^{r−1}⌉ bit multipliers present. Using Equation 4.6,

#AND = (3^{r−1}/2) ⌈m/2^{r−1}⌉ (⌈m/2^{r−1}⌉ + 1)     (4.19)

The number of XOR gates required per multiplier in the i-th recursion is 4⌈m/2^i⌉ − 4.
The total number of two input XORs is the sum of the XORs required for the last
recursion, #XORg_{r−1}, and the XORs required for the other recursions, #XORs_i. Using
Equations 4.5 and 4.6,

#XOR = 3^{r−1} #XORg_{r−1} + Σ_{i=1}^{r−2} 3^i #XORs_i
     = 3^{r−1} (10⌈m/2^r⌉^2 − 7⌈m/2^r⌉ + 1) + Σ_{i=1}^{r−2} 3^i (4⌈m/2^i⌉ − 4)     (4.20)

The delay of the hybrid Karatsuba multiplier (Equation 4.21) is obtained by
subtracting the delay of a ⌈m/2^{r−1}⌉ bit simple Karatsuba multiplier from the delay of
an m bit simple Karatsuba multiplier and adding the delay of a ⌈m/2^{r−1}⌉ bit general
Karatsuba multiplier.

DELAY_H(m) = DELAY_S(m) − DELAY_S(⌈m/2^{r−1}⌉) + DELAY_G(⌈m/2^{r−1}⌉)     (4.21)

4.6 Performance Evaluation

The graph in Figure 4.3 compares the area time product for the hybrid Karatsuba mul-
tiplier with the simple Karatsuba multiplier and the binary Karatsuba multipliers for
increasing values of m. The simple and binary Karatsuba multipliers were reimple-
mented and scaled for different field sizes. The results were obtained by synthesizing

[Figure: plot of area × delay versus the number of bits m, from 100 to 510, for the simple, binary, and hybrid Karatsuba multipliers; the hybrid Karatsuba curve is the lowest throughout the range]

Fig. 4.3: m Bit Multiplication vs Area × Time

Table 4.2: Comparison of the Hybrid Karatsuba Multiplier with Reported FPGA Implementations

Multiplier     Platform    Field   Slices   Delay   Clock    Computation   Performance
                                            (ns)    Cycles   Time (ns)     AT (µs)
Grabbe [48]    XC2V6000    240     1660     12.12   54       655           1087
Gathen [50]    XC2V6000    240     1480     12.6    30       378           559
This work      XC4V140     233     10434    16      1        16            154
               XC2VP100    233     12107    19.9    1        19.9          241

using Xilinx's ISE for a Virtex 4 FPGA. The area was determined by the number of
LUTs required for the multiplier, and the time in nanoseconds includes the I/O pad
delay. The graph shows that the area time product of the hybrid Karatsuba multiplier
is lower than that of the other multipliers. The power × delay graph for the multipliers
is expected to be similar to the area × delay graph of Figure 4.3.

Table 4.2 compares the hybrid Karatsuba multiplier with reported FPGA
implementations of Karatsuba variants. The implementations of [48] and [50] are
sequential and hence require multiple clock cycles, so they are not suited for high
performance ECC. To alleviate this, we proposed a combinational Karatsuba multiplier.
To ensure that the design operates at a high clock frequency, we perform hardware
replication. For example, in a 233 bit multiplier, the 14 bit and 15 bit general Karatsuba
multipliers are replicated, since the general Karatsuba multipliers utilize LUTs
efficiently. This gain is reflected in Table 4.2.

4.7 Conclusion

In this chapter we discussed the finite field multiplication unit. We proposed a hybrid
technique for implementing the Karatsuba multiplier. Our proposed design results in
the best area × time product on an FPGA compared to existing works. The hybrid
Karatsuba multiplier forms the most important module of our elliptic curve crypto
processor. In the next chapter, we discuss finite field inversion, which also uses the
hybrid Karatsuba multiplier.

CHAPTER 5

High Performance Finite Field Inversion for FPGA Platforms

The inverse of a non zero element a in the field GF(2^m) is the element a^{-1} ∈ GF(2^m)
such that a · a^{-1} = a^{-1} · a = 1. Among all finite field operations, computing the
inverse of an element is the most computationally intensive, yet it forms an integral part
of many public key cryptography algorithms including ECC. It is therefore important to
have an efficient technique to find the multiplicative inverse.

This chapter is organized as follows: the next section briefly discusses various
multiplicative inverse algorithms and explains why the Itoh-Tsujii algorithm is most
suited for elliptic curve cryptography. Section 5.2 describes the Itoh-Tsujii algorithm
and some of the reported literature on its implementation. Section 5.3 derives an
equation for the number of clock cycles required to find the inverse. Section 5.4
proposes a generalized Itoh-Tsujii algorithm and presents a special case of the
generalized version, called the quad-Itoh Tsujii algorithm, which is efficient for FPGA
platforms. This section also builds a controller that implements the quad-Itoh Tsujii
algorithm. Section 5.5 evaluates the performance of the proposed algorithm against the
best existing inverse algorithms. The final section has the conclusion.

5.1 Algorithms for Multiplicative Inverse

The most common algorithms for finding the multiplicative inverse are the extended
Euclidean algorithm (EEA) and the Itoh-Tsujii algorithm (ITA) [13]. Generally, the
EEA and its variants, the binary EEA and Montgomery [56] inverse algorithms, result
in compact hardware implementations, while the ITA is faster. The large area required
by the ITA is mainly due to the multiplication unit. All cryptographic applications
need to perform finite field multiplications, hence their hardware implementations
require a multiplier to be present. This multiplier can be reused by the ITA for inverse
computations, in which case the multiplier need not be counted in the area required by
the ITA. The resulting ITA without the multiplier is as compact as the EEA, making it
an ideal choice for multiplicative inverse hardware [44].

The Itoh-Tsujii algorithm was initially proposed to find the multiplicative inverse
for normal basis representations of elements in the field GF(2^m) [13]. Since then, there
have been several works that improved the original algorithm and adapted it to other
basis representations [57–59]. In [57], inversion in polynomial basis representations of
field elements was presented. In [58], addition chains were used to compute the
multiplicative inverse in 27 clock cycles for an element represented in polynomial basis
in the field GF(2^193). In [59], a parallel implementation of the ITA was proposed that
generates the inverse in 20 clock cycles for the same field and basis representation.

5.2 The Itoh-Tsujii Algorithm (ITA)

The Itoh-Tsujii multiplicative inverse algorithm is based on Fermat's little theorem, by
which the inverse of an element a ∈ GF(2^m) is computed using Equation 5.1.

a^{-1} = a^{2^m − 2}     (5.1)

The naive technique of implementing a^{-1} requires (m − 2) multiplications and (m − 1)
squarings. Itoh and Tsujii in [13] reduced the number of multiplications required by
using addition chains. An addition chain [60] for n ∈ N is a sequence of integers of the
form U = (u_0 u_1 u_2 · · · u_r) satisfying the properties

• u_0 = 1

• u_r = n

• u_i = u_j + u_k, for some k ≤ j < i

Brauer chains are a special class of addition chains in which j = i − 1. An optimal


addition chain for n is the smallest addition chain for n.

To understand how the Itoh-Tsujii algorithm works, Equation 5.1 is rewritten as
shown below.

a^{-1} = (a^{2^{m−1} − 1})^2

We reuse notation from [59]. For k ∈ N, let

β_k(a) = a^{2^k − 1} ∈ GF(2^m)

then,

a^{-1} = [β_{m−1}(a)]^2

In [59] a recursive sequence (Equation 5.2) is used with an addition chain to compute
the multiplicative inverse. β_{k+j}(a) ∈ GF(2^m) can be expressed as shown in
Equation 5.2. For simplicity of notation we shall represent β_k(a) by β_k.

β_{k+j}(a) = (β_j)^{2^k} β_k = (β_k)^{2^j} β_j     (5.2)

As an example, consider finding the inverse of an element a ∈ GF(2^233). This
requires computing β_232(a) = a^{2^232 − 1} and then doing a squaring (i.e.
[β_232(a)]^2 = a^{-1}). A Brauer chain for 232 is shown below.

U_1 = (1 2 3 6 7 14 28 29 58 116 232)     (5.3)

Table 5.1: Inverse of a ∈ GF(2^233) using the generic ITA

Step   β_{u_i}(a)   β_{u_j + u_k}(a)   Exponentiation
1      β_1(a)       —                  a
2      β_2(a)       β_{1+1}(a)         (β_1)^{2^1} β_1 = a^{2^2 − 1}
3      β_3(a)       β_{2+1}(a)         (β_2)^{2^1} β_1 = a^{2^3 − 1}
4      β_6(a)       β_{3+3}(a)         (β_3)^{2^3} β_3 = a^{2^6 − 1}
5      β_7(a)       β_{6+1}(a)         (β_6)^{2^1} β_1 = a^{2^7 − 1}
6      β_14(a)      β_{7+7}(a)         (β_7)^{2^7} β_7 = a^{2^14 − 1}
7      β_28(a)      β_{14+14}(a)       (β_14)^{2^14} β_14 = a^{2^28 − 1}
8      β_29(a)      β_{28+1}(a)        (β_28)^{2^1} β_1 = a^{2^29 − 1}
9      β_58(a)      β_{29+29}(a)       (β_29)^{2^29} β_29 = a^{2^58 − 1}
10     β_116(a)     β_{58+58}(a)       (β_58)^{2^58} β_58 = a^{2^116 − 1}
11     β_232(a)     β_{116+116}(a)     (β_116)^{2^116} β_116 = a^{2^232 − 1}

Computing β_232(a) is done in 10 steps with 231 squarings and 10 multiplications, as
shown in Table 5.1.

In general, if l is the length of the addition chain, finding the inverse of an element
in GF(2^m) requires l − 1 multiplications and m − 1 squarings. The length of the
addition chain grows only logarithmically with m [60], therefore the number of
multiplications required by the ITA is much smaller than that of the naive method.
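The whole computation of Table 5.1 can be modelled in software. The sketch below (all names illustrative) works in GF(2^233) with the irreducible polynomial x^233 + x^74 + 1 used later in this chapter, with field elements encoded as integers and a bit-serial multiplication standing in for the hardware multiplier:

```python
M = 233
POLY = (1 << 233) | (1 << 74) | 1   # x^233 + x^74 + 1

def gf_mul(a: int, b: int) -> int:
    """Bit-serial multiplication in GF(2^233) with interleaved reduction."""
    c = 0
    while b:
        if b & 1:
            c ^= a
        b >>= 1
        a <<= 1
        if (a >> M) & 1:
            a ^= POLY              # reduce as soon as degree reaches 233
    return c

def ita_inverse(a: int) -> int:
    """ITA via the Brauer chain of Equation 5.3: build beta_232(a),
    then square once, since a^-1 = [beta_{m-1}(a)]^2."""
    chain = [1, 2, 3, 6, 7, 14, 28, 29, 58, 116, 232]
    beta = {1: a}                  # beta_1(a) = a^(2^1 - 1) = a
    for prev, cur in zip(chain, chain[1:]):
        k = cur - prev             # each difference is itself in the chain
        t = beta[prev]
        for _ in range(k):         # (beta_prev)^(2^k): k squarings
            t = gf_mul(t, t)
        beta[cur] = gf_mul(t, beta[k])   # Equation 5.2
    return gf_mul(beta[232], beta[232])  # final squaring
```

Counting the operations in the loop reproduces the figures above: 231 squarings and 10 multiplications for β_232(a), plus one final squaring.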

5.3 Clock Cycles for the ITA

In the ITA for the field GF(2^m), the number of squarings required is as high as m.
Further, from Table 5.1 it may be noted that most of the squarings occur towards the
end of the addition chain; the number of squarings at a particular step can be as high as
u_i/2. Although the circuit for a squarer is relatively simple, the large number of
squarings required hampers the performance of the ITA. A straightforward
implementation of the squarings would require u_i/2 clock cycles at each step. The
technique used in [58] and [59] cascades u_s squarers (where u_s is an element in the
addition chain)
[Figure: a chain of u_s cascaded squarers, Squarer-1 through Squarer-u_s, with a multiplexer selecting one of the interim outputs under control logic]

Fig. 5.1: Circuit to Raise the Input to the Power of 2^k

(Figure 5.1) so that the output of one squarer is fed to the input of the next. If the
number of squarings required is less than u_s, a multiplexer is used to tap out interim
outputs; in this case the output is obtained in one clock cycle. If the number of
squarings required is greater than u_s, the output of the squarer block is fed back to
obtain squarings in multiples of u_s. For example, if u_i (u_i > u_s) squarings are needed,
the output of the squarer block would be fed back ⌈u_i/u_s⌉ times, requiring ⌈u_i/u_s⌉
clock cycles.

In addition to the squarings, each step in the ITA has exactly one multiplication
requiring one clock cycle. The total number of clock cycles required for this design,
assuming a Brauer chain, is given by Equation 5.4. The summation in the equation is
the clock cycles for the squarings at each step of the algorithm. The (l − 1) term is due
to the (l − 1) multiplications. The extra clock cycle is for the final squaring.

#ClockCycles = 1 + (l − 1) + Σ_{i=2}^{l} ⌈(u_i − u_{i−1})/u_s⌉
             = l + Σ_{i=2}^{l} ⌈(u_i − u_{i−1})/u_s⌉     (5.4)
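Equation 5.4 is easy to evaluate for a given chain and squarer-cascade depth u_s; a small sketch (function name illustrative):

```python
from math import ceil

def ita_clock_cycles(chain, u_s):
    """Equation 5.4: one final squaring + (l - 1) multiplications +
    ceil((u_i - u_{i-1}) / u_s) squaring cycles per step of the chain."""
    l = len(chain)
    squaring_cycles = sum(ceil((ui - uprev) / u_s)
                          for uprev, ui in zip(chain, chain[1:]))
    return l + squaring_cycles

# Brauer chain of Equation 5.3 for GF(2^233)
U1 = [1, 2, 3, 6, 7, 14, 28, 29, 58, 116, 232]
```

For U1, a single squarer (u_s = 1) gives 242 cycles (231 squarings, 10 multiplications, and the final squaring), while a cascade of u_s = 29 squarers reduces this to 25 cycles.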

In order to reduce the clock cycles, a parallel architecture was proposed in [59]. The
reduction in clock cycles is achieved at the cost of increased hardware. In the remaining
part of this section we propose a novel ITA designed for the FPGA architecture. The
proposed design, though sequential, requires the same number of clock cycles as the
parallel architecture of [59] but has a better area × time product.

5.4 Generalizing the Itoh-Tsujii Algorithm

The equation for the square of an element a ∈ GF(2^m) is given by Equation 5.5, where
p(x) is the irreducible polynomial.

a(x)^2 = Σ_{i=0}^{m−1} a_i x^{2i} mod p(x)     (5.5)

This is a linear map and hence can be represented in the form of a matrix (T) as
shown in the equation below.

a^2 = T · a

The matrix depends on the finite field GF(2^m) and the irreducible polynomial of the
field. Exponentiation in the ITA is done with squarer circuits. We extend the ITA so
that the exponentiation can be done with any 2^n circuit and not just squarers. Raising a
to the power of 2^n is also linear and can be represented in the form of a matrix as shown
below.

a^{2^n} = T^n(a) = T′ a

For any a ∈ GF(2^m) and k ∈ N, define

α_k(a) = a^{2^{nk} − 1}     (5.6)

Theorem 5.4.1 If a ∈ GF(2^m), α_{k_1}(a) = a^{2^{nk_1} − 1} and α_{k_2}(a) = a^{2^{nk_2} − 1}, then

α_{k_1 + k_2}(a) = (α_{k_1}(a))^{2^{nk_2}} α_{k_2}(a)

where k_1, k_2, and n ∈ N

Proof

RHS = (α_{k_1}(a))^{2^{nk_2}} α_{k_2}(a)
    = (a^{2^{nk_1} − 1})^{2^{nk_2}} (a^{2^{nk_2} − 1})
    = a^{2^{n(k_1 + k_2)} − 2^{nk_2} + 2^{nk_2} − 1}
    = a^{2^{n(k_1 + k_2)} − 1}
    = α_{k_1 + k_2}(a)
    = LHS

Theorem 5.4.2 The inverse of an element a ∈ GF(2^m) is given by

a^{-1} = [α_{(m−1)/n}(a)]^2              when n | (m − 1)
a^{-1} = [(α_q(a))^{2^r} β_r(a)]^2       when n ∤ (m − 1)

where nq + r = m − 1 and n, q, and r ∈ N

Proof When n | (m − 1),

[α_{(m−1)/n}(a)]^2 = (a^{2^{n((m−1)/n)} − 1})^2
                   = (a^{2^{m−1} − 1})^2
                   = a^{2^m − 2}
                   = a^{-1}

When n ∤ (m − 1),

[(α_q(a))^{2^r} β_r(a)]^2 = ((a^{2^{nq} − 1})^{2^r} (a^{2^r − 1}))^2
                          = (a^{2^{nq+r} − 1})^2
                          = (a^{2^{m−1} − 1})^2
                          = a^{-1}
Table 5.2: Comparison of LUTs Required for a Squarer and Quad Circuit for GF(2^9)

Output bit   Squarer circuit b(x)^2   #LUTs   Quad circuit b(x)^4       #LUTs
0            b_0                      0       b_0                       0
1            b_5                      0       b_7                       0
2            b_1 + b_5                1       b_5 + b_7                 1
3            b_6                      0       b_3 + b_7                 1
4            b_2 + b_6                1       b_1 + b_3 + b_5 + b_7     1
5            b_7                      0       b_8                       0
6            b_3 + b_7                1       b_6 + b_8                 1
7            b_8                      0       b_4 + b_8                 1
8            b_4 + b_8                1       b_2 + b_4 + b_6 + b_8     1
Total LUTs                            4                                 6

We note that elliptic curves over the field GF(2^m) used for cryptographic purposes
[14] have an odd m, therefore we discuss such values of m, although the results are
valid for all m. In particular, we consider the case n = 2, so that

α_k(a) = a^{4^k − 1}

To implement this we require quad circuits. To show the benefit of using a quad
circuit on an FPGA instead of the conventional squarer, consider the equations for a
squarer and a quad for an element b(x) ∈ GF(2^9) (Table 5.2). The irreducible
polynomial for the field is x^9 + x + 1. In the table, b_0 · · · b_8 are the coefficients of
b(x). The #LUTs columns show the number of LUTs required for obtaining the
particular output bit.

We would expect the LUTs required by the quad circuit to be twice that of the squarer.
However, this is not the case: the quad circuit's LUT requirement is only 1.5 times that
of the squarer, because the quad circuit has a lower percentage of under utilized LUTs
(Equation 4.9). For example, from Table 5.2 we note that output bit 4 requires
Table 5.3: Comparison of Squarer and Quad Circuits on Xilinx Virtex 4 FPGA

Field        Squarer circuit           Quad circuit              Size ratio
             #LUT_s    Delay (ns)      #LUT_q    Delay (ns)      #LUT_q / (2 · #LUT_s)
GF(2^193)    96        1.48            145       1.48            0.75
GF(2^233)    153       1.48            230       1.48            0.75

three XOR gates in the quad circuit and only one in the squarer. However, both circuits
require only 1 LUT. This is also the case with output bit 8. This shows that the quad
circuit is better at utilizing FPGA resources compared to the squarer. Moreover, both
circuits have the same delay of one LUT. If we generate the fourth power by cascading
two squarer circuits (i.e (b(x)2 )2 ), the resulting circuit would have twice the delay and
require 25% more hardware resources than a single quad circuit.

These observations are scalable to larger fields as shown in Table 5.3. The circuits
for the finite fields GF (2233 ) and GF (2193 ) use the irreducible polynomials x233 +
x74 + 1 and x193 + x15 + 1 respectively. They were synthesized for a Xilinx Virtex 4
FPGA. The table shows that the area saved even for large fields is about 25%. While
the combinational delay of a single squarer is equal to that of the quad.
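The per-bit equations and LUT counts of Table 5.2 can be re-derived in software. The sketch below (an illustration, not part of the hardware flow) builds the output-bit equations of b(x)^2 and b(x)^4 over GF(2^9) and counts LUTs, assuming 4-input LUTs where a t-input XOR costs ⌈(t−1)/3⌉ LUTs and a plain wire costs none:

```python
import math

M, POLY_TAPS = 9, (1, 0)   # x^9 + x + 1: x^9 reduces to x^1 + x^0

def reduce_term(exp):
    """Reduce x^exp modulo x^9 + x + 1 to a set of exponents < 9 (XOR semantics)."""
    if exp < M:
        return {exp}
    out = set()
    for t in POLY_TAPS:
        out ^= reduce_term(exp - M + t)
    return out

def power_map(k):
    """out[j] = set of input-bit indices XORed to form bit j of b(x)^(2^k)."""
    out = [set() for _ in range(M)]
    for i in range(M):
        for j in reduce_term(i << k):      # b_i contributes the monomial x^(i * 2^k)
            out[j] ^= {i}
    return out

def luts(nbits):
    """4-input-LUT count for an nbits-wide XOR (0 for a plain wire)."""
    return 0 if nbits <= 1 else math.ceil((nbits - 1) / 3)

sq, quad = power_map(1), power_map(2)
print(sum(luts(len(s)) for s in sq), sum(luts(len(s)) for s in quad))   # prints "4 6"
```

The totals 4 and 6 reproduce the 1.5x ratio discussed above, and the per-bit sets reproduce the equations of Table 5.2.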

Based on this observation we propose a quad-ITA that uses quad exponentiation circuits
instead of squarers. The procedure for obtaining the inverse for an odd m using the
quad-ITA is shown in Algorithm 5.1. The algorithm assumes a Brauer addition chain.

The overhead of the quad-ITA is the need to precompute a^3. Since there is no
squarer, this has to be done by the multiplication block, which is present in the architecture.
Using the multiplication unit, cubing is accomplished in two clock cycles without
any additional hardware. Similarly, the final squaring can be done in one
clock cycle by the multiplier with no additional hardware required.

Consider the example of finding the multiplicative inverse of an element a ∈ GF(2^233)
using the quad-ITA. From Theorem 5.4.2, setting n = 2 and m = 233, a^{-1} = [α_{232/2}(a)]^2 =
[α_116(a)]^2. This requires the computation of α_116(a) = a^{2^{2·116} − 1} = a^{4^{116} − 1} followed by a
squaring, a^{-1} = (α_116(a))^2. We use the same Brauer chain (Equation 5.3) as in
the previous example. Excluding the precomputation step, computing α_116(a) requires
9 steps. The total number of quad operations to compute α_116(a) is 115 and the number
of multiplications is 9. The precomputation step requires 2 clock cycles and the final
squaring takes one clock cycle. In all, 12 multiplications are required for the inverse
operation. In general, for an addition chain for m − 1 of length l, the quad-ITA requires
two additional multiplications compared to the ITA implementation of [59].

Algorithm 5.1: qitmia (Quad-ITA)
Input: The element a ∈ GF(2^m) and the Brauer chain U = {1, 2, · · · , (m−1)/2, m − 1}
Output: The multiplicative inverse a^{-1}
1   begin
2       l = length(U)
3       a^2 = hmul(a, a)    /* hmul: hybrid Karatsuba multiplier proposed in Algorithm 4.2 */
4       α_{u_1} = a^3 = a^2 · a
5       foreach u_i ∈ U (2 ≤ i ≤ l − 1) do
6           p = u_{i−1}
7           q = u_i − u_{i−1}
8           α_{u_i} = hmul((α_p)^{4^q}, α_q)
9       end
10      a^{-1} = hmul(α_{u_{l−1}}, α_{u_{l−1}})
11  end

#Multiplications : l + 1    (5.7)

The number of quad operations required is given by

#QuadPowers : (m − 1)/2 − 1    (5.8)

The number of clock cycles required is given by Equation 5.9. The summation
in the equation is the clock cycles required for the quadblock, while l + 1 is the clock
cycles of the multiplier.

#ClockCycles = (l + 1) + Σ_{i=2}^{l−1} ⌈(u_i − u_{i−1})/u_s⌉    (5.9)

The difference in the clock cycles between the ITA of [59] (Equation 5.4) and the
quad-ITA (Equation 5.9) is

⌈(u_l − u_{l−1})/u_s − 1⌉    (5.10)

In general, for addition chains used in ECC, the value of u_l − u_{l−1} is as large as (m−1)/2
and much greater than u_s, therefore the clock cycles saved are significant.
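Algorithm 5.1 can be exercised end-to-end in software on the toy field GF(2^9) (so m − 1 = 8, with Brauer chain {1, 2, 4, 8}). The generic shift-and-reduce multiplier below merely stands in for the hybrid Karatsuba hardware multiplier hmul, and each quad operation is done as two squarings; the sketch assumes, as in the chains used here, that every difference u_i − u_{i−1} is itself a chain element.

```python
# Software sketch of Algorithm 5.1 (quad-ITA) over GF(2^9), x^9 + x + 1.

M, POLY = 9, (1 << 9) | 0b11

def gf_mul(a, b):
    """Generic shift-and-reduce field multiplication (stand-in for hmul)."""
    r = 0
    while b:
        if b & 1:
            r ^= a
        a <<= 1
        if a >> M:
            a ^= POLY
        b >>= 1
    return r

def quad(a, k=1):
    """a^(4^k): k quad operations, each done here as two squarings."""
    for _ in range(2 * k):
        a = gf_mul(a, a)
    return a

def quad_ita(a, chain=(1, 2, 4, 8)):
    """alpha_1 = a^3, then alpha_{u_i} = (alpha_p)^(4^q) * alpha_q with
    p = u_{i-1}, q = u_i - u_{i-1}; finally a^-1 = (alpha_{(m-1)/2})^2."""
    alpha = {1: gf_mul(gf_mul(a, a), a)}        # precompute a^3 (two multiplications)
    for prev, u in zip(chain, chain[1:-1]):
        q = u - prev
        alpha[u] = gf_mul(quad(alpha[prev], q), alpha[q])
    half = chain[-2]                            # (m - 1) / 2
    return gf_mul(alpha[half], alpha[half])     # final squaring by the multiplier

# every non-zero element inverts correctly
assert all(gf_mul(a, quad_ita(a)) == 1 for a in range(1, 1 << M))
```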

5.4.1 Hardware Architecture

To compare the proposed quad-ITA with other reported inverse implementations, we
develop a dedicated processor (Figure 5.2) that generates the inverse of the input a ∈
GF(2^233). Generating the inverse requires the computation of the steps in Table 5.4
followed by a squaring. The main components of the architecture are a finite field
multiplier and a quadblock. The multiplier is an implementation of the hybrid Karatsuba
algorithm (Section 4.5.1).

Table 5.4: Inverse of a ∈ GF(2^233) using Quad-ITA

      α_{u_i}(a)    α_{u_j+u_k}(a)    Exponentiation
1     α_1(a)        —                 a^3 = a^{4^1 − 1}
2     α_2(a)        α_{1+1}(a)        (α_1)^{4^1} α_1 = a^{4^2 − 1}
3     α_3(a)        α_{2+1}(a)        (α_2)^{4^1} α_1 = a^{4^3 − 1}
4     α_6(a)        α_{3+3}(a)        (α_3)^{4^3} α_3 = a^{4^6 − 1}
5     α_7(a)        α_{6+1}(a)        (α_6)^{4^1} α_1 = a^{4^7 − 1}
6     α_14(a)       α_{7+7}(a)        (α_7)^{4^7} α_7 = a^{4^{14} − 1}
7     α_28(a)       α_{14+14}(a)      (α_14)^{4^{14}} α_14 = a^{4^{28} − 1}
8     α_29(a)       α_{28+1}(a)       (α_28)^{4^1} α_1 = a^{4^{29} − 1}
9     α_58(a)       α_{29+29}(a)      (α_29)^{4^{29}} α_29 = a^{4^{58} − 1}
10    α_116(a)      α_{58+58}(a)      (α_58)^{4^{58}} α_58 = a^{4^{116} − 1}

[Figure: control unit (Clk, Reset) generating sel1, sel2, sel3, rcntl, qsel and en; multiplexers feeding the hybrid Karatsuba multiplier and the quadblock from the input a and the register bank; buffers MOUT and QOUT holding the multiplier and quadblock outputs, with a^{-1} produced at MOUT]
Fig. 5.2: Quad-ITA Architecture for GF(2^233) with the Addition Chain 5.3

The quadblock (Figure 5.3) consists of 14 cascaded quad circuits, each generating
the fourth power of its input.

[Figure: u_s cascaded quad circuits (quad circuit 1 to quad circuit u_s) whose outputs feed a multiplexer selected by qsel]
Fig. 5.3: Quadblock Design: Raises the Input to the Power of 4^k

If qin is the input to the quadblock, the powers of qin generated are qin^4, qin^{4^2}, qin^{4^3}, · · ·, qin^{4^{14}}. A multiplexer in
the quadblock, controlled by the select lines qsel, determines which of the 14 powers
gets passed on to the output. The output of the quadblock can be represented as qin^{4^{qsel}}.

Two buffers, MOUT and QOUT, store the output of the multiplier and the quadblock
respectively. At every clock cycle, either the multiplier or the quadblock (but not
both) is active: when the en signal is 1 the MOUT buffer is enabled, otherwise the QOUT
buffer is. A register bank is used to store the results of the steps (α_{u_i}) of Algorithm
5.1; a result is stored only if it is required for later computations.

The controller is a state machine designed based on the addition chain and the number
of cascaded quad circuits in the quadblock. At every clock cycle, control signals are
generated for the multiplexer selection lines, the buffer enables, and the access signals
to the register bank. As an example, consider the computations of Table 5.4. The
corresponding control signals generated by the controller are shown in Table 5.5. The
first step in the computation of a^{-1} is the determination of a^3. This takes two clock
cycles. In the first clock, a is fed to both inputs of the multiplier by setting the
appropriate select lines of the multiplexers. The result, a^2, is used in the
following clock along with a to produce a^3, which is stored in the register bank. The
second step is the computation of α_2(a). This too requires two clock cycles. The first
clock uses a^3 as the input to the quadblock to compute (α_1)^{4^1}. In the next clock, this
is multiplied with a^3 to produce the required output. In general, computing any step
α_{u_i}(a) = α_{u_j+u_k}(a) takes 1 + ⌈u_j/14⌉ clock cycles. Of these, ⌈u_j/14⌉ clock cycles are used by
the quadblock, while the multiplier requires a single clock cycle. At the end of a step,
the result is present in MOUT.

Table 5.5: Control Word for GF(2^233) Quad-ITA for Table 5.4

Step          Clock  sel1  sel2  sel3  qsel  en
α_1(a)        1      0     0     ×     ×     1
              2      0     2     ×     ×     1
α_2(a)        3      ×     ×     0     1     0
              4      1     1     ×     ×     1
α_3(a)        5      ×     ×     0     1     0
              6      1     1     ×     ×     1
α_6(a)        7      ×     ×     0     3     0
              8      2     1     ×     ×     1
α_7(a)        9      ×     ×     0     1     0
              10     1     1     ×     ×     1
α_14(a)       11     ×     ×     0     7     0
              12     2     1     ×     ×     1
α_28(a)       13     ×     ×     0     14    0
              14     2     1     ×     ×     1
α_29(a)       15     ×     ×     0     1     0
              16     1     1     ×     ×     1
α_58(a)       17     ×     ×     0     14    0
              18     ×     ×     1     14    0
              19     ×     ×     1     1     0
              20     2     1     ×     ×     1
α_116(a)      21     ×     ×     0     14    0
              22     ×     ×     1     14    0
              23     ×     ×     1     14    0
              24     ×     ×     1     14    0
              25     ×     ×     1     2     0
              26     2     1     ×     ×     1
Final Square  27     2     2     ×     ×     1

Addition Chain Selection Criteria

The length of the addition chain influences the number of clock cycles required to compute
the inverse (Equations 5.4 and 5.9), hence proper selection of the addition chain is
critical to the design. For a given m, there may be several optimal addition chains, and
one of them must be selected. The memory required by the addition chain can be used
as a secondary selection criterion. The memory utilized by an addition chain is the
registers required to store the results of intermediate steps. The result of step α_i(a) is
stored only if it is required in some later step α_j(a) with j > i + 1. Consider the addition
chain in Equation 5.11.

U2 = ( 1 2 3 5 6 12 17 29 58 116 232 ) (5.11)

Computing α_5(a) = α_{2+3}(a) requires α_2(a), therefore α_2(a) needs to be stored. Similarly,
α_1(a), α_5(a) and α_12(a) need to be stored to compute α_3(a), α_17(a) and α_29(a)
respectively. In all, four registers are required. Minimizing the number of registers is
important because for cryptographic applications m is generally large, so each
register's size is significant.

Using Brauer chains has the advantage that for every step (except the first) at least
one input is the output of the previous step. The output of the previous
step is held in MOUT, so it need not be read from any register and no storage
is required for it. The second input to the step would ideally be a doubling. For example,
computing α_116(a) requires only α_58(a). Since α_58(a) is the result of the previous
step, it is available in MOUT. Therefore computing α_116(a) does not require any stored
values.
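This storage criterion can be applied mechanically. The helper below (an illustration, not part of the processor) counts the intermediate results a chain forces into registers, assuming every difference u_j − u_{j−1} is itself a chain element, as in the chains used here:

```python
def storage_registers(chain):
    """Return the set of alpha_{u_i} values that must be kept in registers:
    alpha_{u_i} is stored iff a later, non-adjacent step u_j (j > i + 1)
    uses it as its second operand (u_j = u_{j-1} + u_i)."""
    stored = set()
    for j in range(1, len(chain)):
        need = chain[j] - chain[j - 1]       # second operand of step j
        if j > chain.index(need) + 1:        # needed beyond the MOUT buffer
            stored.add(need)
    return stored

U2 = (1, 2, 3, 5, 6, 12, 17, 29, 58, 116, 232)   # Equation 5.11: four registers
U1 = (1, 2, 3, 6, 7, 14, 28, 29, 58, 116, 232)   # chain of Table 5.4: one register
print(storage_registers(U2), storage_registers(U1))
```

For U2 this reproduces the four stored values α_1, α_2, α_5 and α_12 discussed above, while the chain used in Table 5.4 needs storage only for α_1 = a^3.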

Design of the Quadblock

The number of cascaded quad circuits (u_s) influences the clock cycles, frequency,
and area requirements of the quad-ITA. Increasing the number of cascaded
circuits reduces the number of clock cycles required (Equation 5.9) at the cost of
an increase in area and delay.

[Figure: computational time of the cascaded quad block (in ns) plotted against the number of cascaded quads, from 2 to 20]
Fig. 5.4: Computation Time versus Number of Quads in Quadblock on a Xilinx Virtex 4 FPGA for GF(2^233)

Let a single quad circuit require lp LUTs and have a combinational delay of tp . For
this analysis we assume that tp includes the gate delay as well as the path delay. We also
assume that the path delay is constant. The values of lp and tp depend on the finite field
GF (2m ) and the irreducible polynomial. A cascade of us quad circuits would require
us · lp LUTs and have a delay of us · tp .

In order that the quadblock does not lower the operating frequency, u_s should be
selected such that u_s · t_p is less than the maximum combinational delay of the entire
design. In the quad-ITA hardware, the maximum delay is through the Karatsuba multiplier,
therefore we select u_s such that the delay of the quadblock is less than the delay
of the multiplier:

u_s · t_p ≤ Delay of multiplier

However, reducing u_s increases the clock cycles required. Therefore we select u_s
so that the quadblock delay is close to the multiplier delay.
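This trade-off can be explored numerically with Equation 5.9. The helper below is illustrative: the delay figures in the example call are made-up placeholders, not synthesis results.

```python
import math

def quad_ita_cycles(chain, us):
    """Clock cycles from Equation 5.9: (l + 1) multiplier cycles plus
    ceil((u_i - u_{i-1}) / us) quadblock cycles for i = 2 .. l-1."""
    l = len(chain)
    return (l + 1) + sum(math.ceil((chain[i] - chain[i - 1]) / us)
                         for i in range(1, l - 1))

def best_us(chain, t_mult, t_quad, us_max=20):
    """Cascade depth minimising cycles x clock period, where the clock
    period is set by the slower of the multiplier and the quad cascade."""
    return min(range(1, us_max + 1),
               key=lambda us: quad_ita_cycles(chain, us) * max(t_mult, us * t_quad))

chain_5_3 = (1, 2, 3, 6, 7, 14, 28, 29, 58, 116, 232)
print(quad_ita_cycles(chain_5_3, 14))   # 27, matching the 27 clocks of Table 5.5
print(best_us(chain_5_3, 20.0, 1.5))    # with hypothetical delays t_mult, t_quad
```

With 14 cascades the cycle count evaluates to 27, agreeing with the control-word sequence of Table 5.5; with 11 cascades it is 29, consistent with the "around 30" figure discussed below.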

The graph in Figure 5.4 plots the computation delay (clock period in nanoseconds
× the clock cycles) required versus the number of quads in the quad-ITA for the field

GF(2^233). For small values of u_s, the delay is mainly decided by the multiplier, while
the number of clock cycles required is large. For a large number of cascades, the delay of the
quadblock exceeds that of the multiplier, so the delay of the circuit is decided by
the quadblock. The lowest computation time is obtained with around 11 cascaded quads.
For this, the delay of the quadblock is slightly lower than that of the multiplier; the
critical path is therefore through the multiplier, while the clock cycles required are
around 30. Thus, for the quad-ITA in the field GF(2^233), 11 cascaded quads result in the
least computation time. However, in order to make the clock cycles required to compute
the finite field inverse in GF(2^233) equal to the parallel implementation of [59], 14
cascaded quads are used, even though this causes a marginal increase in the computation
time (which, at 0.55 µs, is still considerably less than that of the parallel implementation).

[Figure: performance metric 1/(LUTs × Delay × Clock Cycles) of the Quad-ITA and Squarer-ITA, plotted for finite fields GF(2^100) through GF(2^300)]
Fig. 5.5: Performance of Quad-ITA vs Squarer-ITA Implementation for Different Fields on a Xilinx Virtex 4 FPGA

5.5 Experimental Results

In this section we compare our work with reported finite field inverse results. We also
test our design for scalability over several fields.

The graph in Figure 5.5 shows the scalability of the quad-ITA and compares it with
a squarer-ITA. The design of the squarer-ITA is similar to that of the quad-ITA (Figure
5.2) except for the quadblock: the quad circuits in the quadblock are replaced by squarer
circuits, and both blocks have the same number of cascaded circuits. The platform used
for generating the graph is a Xilinx Virtex 4 FPGA. The X axis has increasing field
sizes (see the Appendix for the list of finite fields), and the Y axis has the performance
metric shown below.

Performance = frequency / (Slices × ClockCycles)    (5.12)

Slices is the number of slices required on the FPGA as reported by Xilinx's ISE
synthesis tool. The graph shows that the quad-ITA has better performance than
the squarer-ITA for most fields.

Table 5.6 compares the quad-ITA with the best reported ITA and Montgomery inverse
implementations. The FPGA used in all designs is the Xilinx Virtex E. The
quad-ITA has the best computation time and performance among the implementations
compared. It may be noted that the quad-ITA's larger area compared to [58] and [59]
is because it uses distributed RAM [61] for registers, while [58] and [59] use
block RAM [39]. The distributed RAM requires additional CLB resources while block
RAM does not.

Table 5.6: Comparison for Inversion on Xilinx Virtex E

Implementation   Algorithm      Platform   Field   Slices   Frequency    Clock        Computation   Performance
                                                            (MHz) (f)    Cycles (c)   Time (c/f)    (Equation 5.12)
Dormale [62]     Montgomery     XCV2000E   160     890      50           -            9.71 µs       115.7
                                XCV2000E   256     1390     41           -            18.7 µs       38.4
Crowe [63]       Montgomery     XCV2000E   160     1094     51           -            6.28 µs       145.5
                                XCV2000E   256     1722     39           -            13.17 µs      44.1
Henriquez [58]   ITA            XCV3200E   193     10065    21.2         27           1.33 µs       78
Henriquez [59]   Parallel ITA   XCV3200E   193     11081    21.2         20           0.94 µs       95.7
This work        quad-ITA       XCV3200E   193     11911    36.2         20           0.55 µs       152.1
5.6 Conclusion

This chapter discussed the finite field inverter required for the elliptic curve crypto processor.
The Itoh-Tsujii algorithm was used for the inversion. A generalized version
of the ITA was proposed that improves the utilization of FPGA resources. With this
method, we show that raising an element to a power of 4 (the quad operation) on an FPGA
is more compact and faster than using squarers. The quad operation thus forms the core
of an improved ITA called the quad-ITA. The quad-ITA requires the fewest clock cycles,
has lower computation time, and achieves better performance than the best reported
inversion implementations. The quad-ITA is used for the final inversion
required in the elliptic curve crypto processor, which is discussed in the next chapter.

CHAPTER 6

Constructing the Elliptic Curve Crypto Processor

This chapter presents the construction of an elliptic curve crypto processor (ECCP)
for the NIST specified curve [14] given in Equation 6.1 over the binary finite field
GF (2233 ).
y^2 + xy = x^3 + ax^2 + b    (6.1)

The processor implements the double-and-add scalar multiplication algorithm described
in Algorithm 3.1. The processor (Figure 6.1) is capable of performing the elliptic curve
operations of point addition and point doubling. Point doubling is done at every iteration
of the loop in Algorithm 3.1, while point addition is done for every bit set to one in the
binary expansion of the scalar input k. The output produced as a result of the scalar

[Figure: ROM (curve constant and basepoint) and the scalar k feeding a control unit; a register bank (regbank) drives the arithmetic unit over buses A0, A1, A2, A3 and Qin, with results returned on C0, C1 and Qout and the product kP as output; control signals c[0:9], c[10:25], c[26:29] route the data]
Fig. 6.1: Block Diagram of the Elliptic Curve Crypto Processor


multiplication is the product kP. Here, P is the basepoint of the curve and is stored in
the ROM in its affine form. At every clock cycle, the register bank (regbank), containing
dual ported registers, feeds the arithmetic unit (AU) through five buses (A0, A1, A2, A3
and Qin). At the end of the clock cycle, results of the computation are stored in registers
through buses C0, C1 and Qout. At most two results are produced every
clock. Control signals (c[0] · · · c[32]), generated every clock cycle depending on the
elliptic curve operation, control the data flow and the computation performed. Details about
the processor, the flow of data on the buses, and the computations performed are elaborated
in the following sections.

The scalar multiplication implemented in the processor of Figure 6.1 is done using
the López-Dahab (LD) projective coordinate system. The LD coordinate form of the
elliptic curve over binary finite fields is

Y^2 + XYZ = X^3 + aX^2Z^2 + bZ^4    (6.2)

In the ECCP, a is taken as 1, while b is stored in the ROM along with the basepoint
P. Equations for point doubling and point addition in LD coordinates are shown in
Equations 3.10 and 3.11 respectively.

During the initialization phase the curve constant b and the basepoint P are loaded
from the ROM into the registers after which there are two computational phases. The
first phase multiplies the scalar k to the basepoint P . The result produced by this phase
is in projective coordinates. The second phase of the computation converts the projec-
tive point result of the first phase into the affine point kP . The second phase mainly
involves an inverse computation. The inverse is computed using the quad Itoh-Tsujii
inverse algorithm proposed in Algorithm 5.1.

The next section describes the ECCP in detail. Section 6.2 describes the implementation
of the elliptic curve operations in the processor. Section 6.3 presents the finite
state machine that implements Algorithm 3.1. Section 6.4 has the performance results,
while the final section has the conclusion.

[Figure: three register banks RA (RA1, RA2), RB (RB1–RB4) and RC (RC1, RC2) with input multiplexers MUXIN1–MUXIN3 fed from C0, C1 and Qout, output multiplexers MUXOUT1–MUXOUT4 driving buses A0, A1, A2, A3 and Qin, and control signals c[10]–c[25], c[30]–c[32] selecting registers and enabling writes]
Fig. 6.2: Register File for Elliptic Curve Crypto Processor

6.1 The Elliptic Curve Cryptoprocessor

This section describes in detail the register file, arithmetic unit, and control unit of
the elliptic curve crypto processor.

6.1.1 Register Bank

At the heart of the register file (Figure 6.2) are eight registers, each 233 bits wide. The
registers store the results of the computations done at every clock cycle.
The registers are dual ported and arranged in three banks: RA, RB, and RC. The dual
ported RAM allows asynchronous reads on the lines out1 and out2 corresponding to the

address on the address lines ad1 and ad2 respectively. A synchronous write of the data
on din is done to the location addressed by ad1; the we signal enables the write. On the
FPGA, the registers are implemented as distributed RAM [61]. At every clock cycle, the
register file is capable of delivering five operands (on buses A0, A1, A2, A3 and Qin)
to the arithmetic unit and of storing three results (from buses C0, C1, and Qout).
The inputs to the register file are either the arithmetic unit outputs, the curve constant (b
of Equation 6.2), or the basepoint P = (Px, Py).

Multiplexers MUXIN1, MUXIN2, and MUXIN3 determine which of the three
inputs gets stored into the register banks. Further, bits in the control word select a
register and enable or disable a write operation to a particular register bank. Multiplexers
MUXOUT1, MUXOUT2, MUXOUT3, and MUXOUT4 determine which output
of a register bank is driven on the output buses. Table 6.1 shows how each register
in the bank is utilized.

Table 6.1: Utility of Registers in the Register Bank

Register  Description
RA1       1. During initialization it is loaded with Px.
          2. Stores the x coordinate of the result.
          3. Also used for temporary storage.
RA2       Stores Px.
RB1       1. During initialization it is loaded with Py.
          2. Stores the y coordinate of the result.
          3. Also used for temporary storage.
RB2       Stores Py.
RB3       Used for temporary storage.
RB4       Stores the curve constant b.
RC1       1. During initialization it is set to 1.
          2. Stores the z coordinate of the projective result.
          3. Also used for temporary storage.
RC2       Used for temporary storage.

[Figure: arithmetic unit with input multiplexers A and B (controlled by c[5:3] and c[2:0]) feeding the hybrid Karatsuba multiplier, a quadblock producing Qout from Qin under qsel = c[29:26], squarer and adder circuits forming terms such as A0^2, A1^4, A0^4 + A1 and A2^2 + A1 + A3, and output multiplexers C and D (c[7:6], c[9:8]) driving C0 and C1]
Fig. 6.3: Finite Field Arithmetic Unit

6.1.2 Finite Field Arithmetic Unit

The arithmetic unit (Figure 6.3) is built using finite field arithmetic circuits and organized
for efficient implementation of point addition (Equation 3.11) and point doubling
(Equation 3.10) in LD coordinates. The AU has 5 inputs (A0 to A3 and Qin) and 3
outputs (C0, C1, and Qout). The main components of the AU are a quadblock and a
multiplier. The multiplier is based on the hybrid Karatsuba algorithm (Section 4.5.1)
and is used in both phases of the computation (the scalar multiplication phase and the
conversion to affine coordinates). The quadblock is designed according to Figure 5.3;
it consists of 14 cascaded quad circuits and is capable of generating the output
Qout = Qin^{4^{c[29]···c[26]}}. The quadblock is used only for the inversion,
which is done during the final phase of the computation. The AU also has several adder and
squarer circuits. These circuits are small compared to the multiplier and the quadblock
and therefore contribute marginally to the overall area and latency of the processor.

6.1.3 Control Unit

At every clock cycle the control unit produces a control word. Control words are produced
in a sequence depending on the type of elliptic curve operation being performed. The
control word signals control the flow of data and decide the operations performed
on the data. There are 33 control signals (c[0] to c[32]) generated by the control
unit. The signals c[0] to c[9] control the inputs to the finite field multiplier and the outputs
C0 and C1 of the AU. The control lines c[26] to c[29] are used as the select lines
of the multiplexer in the quadblock (Figure 5.3). The remaining control bits are used in
the register file to read and write data to the registers. Section 6.3 has the detailed list
of all control words generated.

6.2 Point Arithmetic on the ECCP

This section presents the implementation of LD point addition and doubling equations
on the ECCP.

6.2.1 Point Doubling

The equation for doubling the point P in LD projective coordinates was shown in Equation
3.10 and is repeated here as Equation 6.3 [30]. The input required for doubling is
the point P = (X1, Y1, Z1) and the output is its double 2P = (X3, Y3, Z3). The equation
shows that four multiplications are required (assuming a = 1). The ECCP has just
one multiplier, which is capable of doing one multiplication per clock cycle. Hence, the
ECCP would require at least four clock cycles for computing the double.

Z3 = X1^2 · Z1^2

X3 = X1^4 + b · Z1^4    (6.3)

Y3 = b · Z1^4 · Z3 + X3 · (a · Z3 + Y1^2 + b · Z1^4)

This doubling operation is mapped to the elliptic curve hardware using Algorithm
6.1.

Algorithm 6.1: Hardware Implementation of Doubling on ECCP

Input: LD Point P = (X1, Y1, Z1) present in registers (RA1, RB1, RC1)
respectively. The curve constant b is present in register RB4.
Output: LD Point 2P = (X3, Y3, Z3) present in registers (RA1, RB1, RC1)
respectively.
1  RB3 = RB4 · RC1^4
2  RC1 = RA1^2 · RC1^2
3  RA1 = RA1^4 + RB3
4  RB1 = RB3 · RC1 + RA1 · (RC1 + RB1^2 + RB3)

Table 6.2: Parallel LD Point Doubling on the ECCP

Clock  Operation 1 (C0)                                Operation 2 (C1)
1      RC1 = RA1^2 · RC1^2                             RB3 = RC1^4
2      RB3 = RB3 · RB4
3      RC2 = (RA1^4 + RB3) · (RC1 + RB1^2 + RB3)       RA1 = RA1^4 + RB3
4      RB1 = RB3 · RC1 + RC2

Table 6.3: Inputs and Outputs of the Register File for Point Doubling

Clock A0 A1 A2 A3 C0 C1
1 RA1 RC1 - - RC1 RB3
2 - RB4 RB3 - RB3
3 RA1 RB3 RB1 RC1 RC2 RA1
4 RB3 RC1 - RC2 RB1 -

On the ECCP, the LD doubling algorithm can be parallelized to complete in four
clock cycles as shown in Table 6.2 [64]. The parallelization is based on the fact that the
multiplier is several times more complex than the squarer and adder circuits, so in
every clock cycle the multiplier is used and produces one of the outputs of the AU.
The other AU output is produced by additions or squaring operations alone.

Table 6.3 shows the data held on the buses at every clock cycle, and where
the results are stored. For example, in clock cycle 1, the contents of registers RA1
and RC1 are placed on buses A0 and A1 respectively. Control lines in MUXA and
MUXB of the AU are set such that A0^2 and A1 are fed to the multiplier. The output
multiplexers MUXC and MUXD are set such that M and A1^4 are sent on buses
C0 and C1; these are stored in registers RC1 and RB3 respectively. Effectively, the
computations done by the AU are RC1 = RA1^2 · RC1^2 and RB3 = RC1^4. The
subsequent operations required for doubling, as listed in Table 6.2, are performed similarly.
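The LD doubling formulas of Equation 6.3 can be cross-checked in software over a toy field. The sketch below works in GF(2^9) rather than GF(2^233); the curve constant b = 5 and the standard affine doubling formulas for y^2 + xy = x^3 + ax^2 + b used for comparison are assumptions for the check, not values taken from the processor.

```python
# Cross-check of LD point doubling (Equation 6.3) against affine doubling.

M, POLY, A, B = 9, (1 << 9) | 0b11, 1, 5    # a = 1 as in the ECCP; b = 5 arbitrary

def gf_mul(u, v):
    """Shift-and-reduce multiplication in GF(2^9)."""
    r = 0
    while v:
        if v & 1:
            r ^= u
        u <<= 1
        if u >> M:
            u ^= POLY
        v >>= 1
    return r

def gf_inv(u):                    # brute force is fine in a 512-element field
    return next(w for w in range(1, 1 << M) if gf_mul(u, w) == 1)

def ld_double(X1, Y1, Z1):
    """Equation 6.3: four field multiplications when a = 1."""
    Z1sq = gf_mul(Z1, Z1)
    X1sq = gf_mul(X1, X1)
    bZ4 = gf_mul(B, gf_mul(Z1sq, Z1sq))
    Z3 = gf_mul(X1sq, Z1sq)
    X3 = gf_mul(X1sq, X1sq) ^ bZ4
    Y3 = gf_mul(bZ4, Z3) ^ gf_mul(X3, gf_mul(A, Z3) ^ gf_mul(Y1, Y1) ^ bZ4)
    return X3, Y3, Z3

def affine_double(x, y):
    """Standard affine doubling on y^2 + xy = x^3 + ax^2 + b (assumption)."""
    lam = x ^ gf_mul(y, gf_inv(x))
    x3 = gf_mul(lam, lam) ^ lam ^ A
    return x3, gf_mul(x, x) ^ gf_mul(lam ^ 1, x3)

# any point on the curve with x != 0
x, y = next((x, y) for x in range(1, 1 << M) for y in range(1 << M)
            if gf_mul(y, y) ^ gf_mul(x, y) ==
               gf_mul(gf_mul(x, x), x) ^ gf_mul(A, gf_mul(x, x)) ^ B)

X3, Y3, Z3 = ld_double(x, y, 1)             # P in LD form is (x, y, 1)
Zi = gf_inv(Z3)
assert (gf_mul(X3, Zi), gf_mul(Y3, gf_mul(Zi, Zi))) == affine_double(x, y)
```

The final assertion converts the LD result back with x = X·Z^{-1}, y = Y·(Z^{-1})^2, the same conversion the ECCP performs in its final phase.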

6.2.2 Point Addition

The equation for adding an affine point to a point in LD projective coordinates was
shown in Equation 3.11 and is repeated here as Equation 6.4. The equation adds two
points P = (X1, Y1, Z1) and Q = (x2, y2), where Q ≠ ±P. The resulting point is
P + Q = (X3, Y3, Z3).

A = y2 · Z1^2 + Y1

B = x2 · Z1 + X1

C = Z1 · B

D = B^2 · (C + a · Z1^2)

Z3 = C^2    (6.4)

E = A · C

X3 = A^2 + D + E

F = X3 + x2 · Z3

G = (x2 + y2) · Z3^2

Y3 = (E + Z3) · F + G

Algorithm 6.2: Hardware Implementation of Addition on ECCP

Input: LD Point P = (X1, Y1, Z1) present in registers (RA1, RB1, RC1)
respectively and affine Point Q = (x2, y2) present in registers (RA2, RB2)
respectively
Output: LD Point P+Q = (X3, Y3, Z3) present in registers (RA1, RB1, RC1)
respectively
1   RB1 = RB2 · RC1^2 + RB1        /* A */
2   RA1 = RA2 · RC1 + RA1          /* B */
3   RB3 = RC1 · RA1                /* C */
4   RA1 = RA1^2 · (RB3 + RC1^2)    /* D */
5   RC1 = RB3^2                    /* Z3 */
6   RC2 = RB1 · RB3                /* E */
7   RA1 = RB1^2 + RA1 + RC2        /* X3 */
8   RB3 = RA1 + RA2 · RC1          /* F */
9   RB1 = (RA2 + RB2) · RC1^2      /* G */
10  RB1 = (RC2 + RC1) · RB3 + RB1  /* Y3 */

Table 6.4: Parallel LD Point Addition on the ECCP

Clock  Operation 1 (C0)                     Operation 2 (C1)
1      RB1 = RB2 · RC1^2 + RB1              -
2      RA1 = RA2 · RC1 + RA1                -
3      RB3 = RC1 · RA1                      -
4      RA1 = RA1^2 · (RB3 + RC1^2)          -
5      RC2 = RB1 · RB3                      RA1 = RB1^2 + RA1 + RB1 · RB3
6      RC1 = RB3^2                          RB3 = RA1 + RA2 · RB3^2
7      RB1 = (RA2 + RB2) · RC1^2            -
8      RB1 = (RC2 + RC1) · RB3 + RB1        -

The addition operation is mapped to the elliptic curve hardware using Algorithm
6.2 (a is taken as 1). On the ECCP the operations in Algorithm 6.2 are scheduled
efficiently to complete in eight clock cycles [64]. The scheduled operations for point
addition are shown in Table 6.4, and the inputs and outputs of the registers at each clock
cycle are shown in Table 6.5.

Table 6.5: Inputs and Outputs of the Register Bank for Point Addition

Clock  A0   A1   A2   A3   C0   C1
1      RB2  RC1  RB1  -    RB1  -
2      RA1  RC1  RA2  -    RA1  -
3      RA1  -    -    RC1  RB3  -
4      RA1  RC1  RB3  -    RA1  -
5      RA1  RB3  RB1  -    RC2  RA1
6      RA1  RB3  RA2  -    RC1  RB3
7      RB2  RC1  RA2  -    RB1  -
8      RB3  RC1  RB1  RC2  RB1  -

[Figure: state machine with initialization states Init1–Init3 (detect leading 1), doubling states D1–D4, addition states A1–A8 entered from D4 when ki = 1 (skipped when ki = 0), inversion states I1–I24, and a complete signal asserted when the key bits are exhausted]
Fig. 6.4: The ECCP Finite State Machine

Table 6.6: Inputs and Outputs of Regbank for Every State

State  Regbank Outputs (A0, A1, A2, A3, Qin)   Regbank Inputs
Init1  -    -    -    -    -                   C0: RA1 = Px;  C1: RB1 = Py;  RC1 = 1
Init2  -    -    -    -    -                   C0: RA2 = Px;  C1: RB2 = Py
Init3  -    -    -    -    -                   C1: RB4 = b
D1     RA1  RC1  -    -    -                   C0: RC1 = RA1^2 · RC1^2;  C1: RB3 = RC1^4
D2     -    RB4  RB3  -    -                   C0: RB3 = RB3 · RB4
D3     RA1  RB3  RB1  RC1  -                   C0: RC2 = (RA1^4 + RB3) · (RC1 + RB1^2 + RB3);  C1: RA1 = RA1^4 + RB3
D4     RB3  RC1  -    RC2  -                   C0: RB1 = RB3 · RC1 + RC2
A1     RB2  RC1  RB1  -    -                   C0: RB1 = RB2 · RC1^2 + RB1
A2     RA1  RC1  RA2  -    -                   C0: RA1 = RA2 · RC1 + RA1
A3     RA1  -    -    RC1  -                   C0: RB3 = RC1 · RA1
A4     RA1  RC1  RB3  -    -                   C0: RA1 = RA1^2 · (RB3 + RC1^2)
A5     RA1  RB3  RB1  -    -                   C0: RC2 = RB1 · RB3;  C1: RA1 = RB1^2 + RA1 + RB1 · RB3
A6     RA1  RB3  RA2  -    -                   C0: RC1 = RB3^2;  C1: RB3 = RA1 + RA2 · RB3^2
A7     RB2  RC1  RA2  -    -                   C0: RB1 = (RA2 + RB2) · RC1^2
A8     RB3  RC1  RB1  RC2  -                   C0: RB1 = (RC2 + RC1) · RB3 + RB1
I1     -    RC1  -    -    -                   C0: RC1 = RC1^2 · RC1
I2     -    RC1  -    -    -                   C0: RB3 = RC1^4 · RC1
I3     -    RC1  RB3  -    -                   C0: RB3 = RB3^4 · RC1
I4     -    -    -    -    RB3                 Qout: RC2 = RB3^{4^3}
I5     -    RC2  RB3  -    -                   C0: RB3 = RC2 · RB3
I6     -    RC1  RB3  -    -                   C0: RB3 = RB3^4 · RC1
I7     -    -    -    -    RB3                 Qout: RC2 = RB3^{4^7}
I8     -    RC2  RB3  -    -                   C0: RB3 = RC2 · RB3
I9     -    -    -    -    RB3                 Qout: RC2 = RB3^{4^14}
I10    -    RC2  RB3  -    -                   C0: RB3 = RC2 · RB3
I11    -    RC1  RB3  -    -                   C0: RB3 = RB3^4 · RC1
I12    -    -    -    -    RB3                 Qout: RC2 = RB3^{4^14}
I13    -    -    -    -    RC2                 Qout: RC2 = RC2^{4^14}
I14    -    RC2  RB3  -    -                   C0: RB3 = RC2^4 · RB3
I15    -    -    -    -    RB3                 Qout: RC2 = RB3^{4^14}
I16    -    -    -    -    RC2                 Qout: RC2 = RC2^{4^14}
I17    -    -    -    -    RC2                 Qout: RC2 = RC2^{4^14}
I18    -    -    -    -    RC2                 Qout: RC2 = RC2^{4^14}
I19    -    -    -    -    RC2                 Qout: RC2 = RC2^{4^2}
I20    -    RC2  RB3  -    -                   C0: RB3 = RC2 · RB3
I21    -    RB3  -    -    -                   C0: RC1 = RB3^2
I22    RA1  RC1  -    -    -                   C0: RA1 = RA1 · RC1
I23    RB1  RC1  -    -    -                   C0: RB1 = RB1 · RC1^2
Table 6.7: Control Words for ECCP

State  Quadblock    Regfile MUXIN      Regfile MUXOUT     Regbank signals   AU Mux C and D   AU Mux A and B
       c29···c26    c32 c30 c25 c24    c31 c23 c22 c21    c20···c10         c9···c6          c5···c0
Init1 xxxx 1010 00xx 1x01xx001x0 0000 000000
Init2 xxxx 1010 00xx 0xx1xx011x1 xxxx xxxxxx
Init3 xxxx 1xxx xxxx 0xx1xx110xx xxxx xxxxxx

D1 xxxx 001x 00x0 1x01xx100x0 1000 001001


D2 xxxx 000x x10x 0xx111100xx xx00 000010
D3 xxxx 00x1 0100 101010001x0 1100 100100
D4 xxxx 000x 00x1 010110000xx xx11 000000

A1 xxxx 000x 0001 0x0101000xx xx01 001000


A2 xxxx 00x1 0010 0x00xx00110 00xx 000010
A3 xxxx 00xx 00x0 00x1xx100x0 xx00 101000
A4 xxxx 00x0 0000 0100xx101x0 xx00 010001
A5 xxxx 00x1 0100 1x1010001x0 0100 000010
A6 xxxx 001x 0110 1x011010010 0010 001010
A7 xxxx 000x 0011 0x01010001x xx00 001011
A8 xxxx 000x 0001 010110000xx xx01 011000

I1 xxxx 00xx 00xx 1x0xxxxx0xx xx00 001101


I2 xxxx 000x 000x 0x01xx100xx xx00 000110
I3 xxxx 000x 000x xx01xx100xx xx00 110101
I4 0011 01xx 000x 1x10xx100xx xxxx xxxxxx
I5 xxxx 000x 000x 0x11xx100xx xx00 000010
I6 xxxx 000x 000x 0x01xx100xx xx00 110101
I7 0111 01xx 000x 1x10xx100xx xxxx xxxxxx
I8 xxxx 000x 00xx 0x11xx100xx xx00 000010
I9 1110 01xx 000x 1x10xx100xx xxxx xxxxxx
I10 xxxx 000x 00xx 0x11xx100xx xx00 000010
I11 xxxx 000x 000x 0x01xx100xx xx00 110101
I12 1110 01xx 000x 1x10xx100xx xxxx xxxxxx
I13 1110 01xx 100x 1x10xxxx0xx xxxx xxxxxx
I14 xxxx 000x 000x 0x11xx100xx xx00 111010
I15 1110 01xx 000x 1x10xx100xx xxxx xxxxxx
I16 1110 01xx 100x 1x10xxxx0xx xxxx xxxxxx
I17 1110 01xx 100x 1x10xxxx0xx xxxx xxxxxx
I18 1110 01xx 100x 1x10xxxx0xx xxxx xxxxxx
I19 0010 01xx 100x 1x10xxxx0xx xxxx xxxxxx
I20 xxxx 000x 000x 0x11xx100xx xx00 000010
I21 xxxx 000x 01xx 1x0010xx0xx xx10 xxxxxx
I22 xxxx 00x0 00x0 0x00xxxx1x0 xx00 000000
I23 xxxx 000x 00x1 0x0100xx0xx xx00 001000
I24 xxxx 000x 0000 0xx0xx000x0 xxxx xxxxxx

6.3 The Finite State Machine (FSM)

The three phases of computation done by the ECCP, namely the initialization, scalar
multiplication and projective-to-affine conversion phases, are implemented using the FSM
shown in Figure 6.4. The first three states of the FSM perform the initialization. In these
states the curve constant and basepoint coordinates are loaded from ROM into the
registers (Table 6.6). These states also detect the position of the leading one in the scalar key k. After
initialization, the scalar multiplication is done. This consists of 4 states for doubling
and 8 for the point addition. The states that do the doubling are D1 · · · D4. In state
D4, a decision is made depending on the key bit ki (i is a loop counter initially set to
the position of the leading one in the key, and ki is the ith bit of the key k). If ki = 1
then a point addition is done and state A1 is entered. If ki = 0, the addition is not done
and the next key bit (corresponding to i − 1) is considered. If ki = 0 and there are no
more key bits to be considered then the complete signal is issued and it marks the end
of the scalar multiplication phase. The states that do the addition are A1 · · · A8. At the
end of the addition (state A8) state D1 is entered and the key bit ki−1 is considered. If
there are no more key bits remaining the complete signal is asserted. Table 6.7 shows
the control words generated at every state.
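The control flow above is the classical MSB-first double-and-add loop. A minimal Python sketch follows; the `double` and `add` callbacks stand in for the D1...D4 and A1...A8 state sequences, and the integer test group used at the end is only for illustration:

```python
def scalar_multiply(k, P, double, add):
    """MSB-first double-and-add, mirroring the FSM: the leading one of k is
    consumed during initialization; each remaining bit costs one doubling
    (states D1..D4) plus, when the bit is 1, one addition (states A1..A8)."""
    bits = bin(k)[2:]          # leading-one detection of the Init states
    Q = P
    for b in bits[1:]:         # the MSB itself only loads Q = P
        Q = double(Q)
        if b == '1':
            Q = add(Q, P)
    return Q

# Illustration on the additive group of integers: k * 1 == k.
assert scalar_multiply(0xB9B9, 1, lambda q: 2 * q, lambda a, b: a + b) == 0xB9B9
```

The same loop computes kP when `double` and `add` are the elliptic curve operations of Chapter 6.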

At the end of the scalar multiplication phase, the result obtained is in projective
coordinates and the X, Y , and Z coordinates are stored in the registers RA1 , RB1 , and
RC1 respectively. To convert the projective point to affine, the following equation is
used.

x = X · Z^−1
y = Y · (Z^−1)^2        (6.5)

The inverse of Z is obtained using the quad-ITA discussed in Algorithm 5.1. The
addition chain used is the Brauer chain in Equation 5.3. The processor implements the
steps given in Table 5.4. Each step in Table 5.4 maps onto one or more states from
I1 to I21. The number of clock cycles required to find the inverse is 21, which is fewer
than the clock cycles estimated by Equation 5.9. This is because the inverse can be
implemented more efficiently in the ECCP by utilizing the squarers present in the AU.
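The addition-chain structure of the Itoh-Tsujii inverse can be sketched as follows. This is a minimal Python illustration, not the ECCP datapath: it uses the toy field GF(2^4) with x^4 + x + 1 and its chain (1 2 3) as stand-ins for GF(2^233) and the chain of Equation 5.3, and it performs plain repeated squarings where the processor uses cascaded quad circuits.

```python
M, POLY = 4, 0b10011                       # toy field GF(2^4), x^4 + x + 1

def gmul(a, b):
    """Polynomial-basis multiplication in GF(2^M)."""
    r = 0
    while b:
        if b & 1:
            r ^= a
        a <<= 1
        if a >> M:
            a ^= POLY
        b >>= 1
    return r

def ita_inverse(a, chain):
    """Itoh-Tsujii: beta[u] = a^(2^u - 1), built along a Brauer chain via
    beta[u+v] = (beta[u])^(2^v) * beta[v]; a final squaring gives a^(2^M - 2)."""
    beta = {1: a}
    for prev, cur in zip(chain, chain[1:]):
        v = cur - prev                     # for a Brauer chain, v is in the chain
        t = beta[prev]
        for _ in range(v):                 # (beta_prev)^(2^v): v squarings
            t = gmul(t, t)
        beta[cur] = gmul(t, beta[v])
    return gmul(beta[chain[-1]], beta[chain[-1]])

for a in range(1, 2**M):
    assert gmul(a, ita_inverse(a, (1, 2, 3))) == 1   # a * a^-1 = 1
```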

At the end of state I21, the inverse of Z is present in the register RC1 . The states
I22 and I23 compute the affine coordinates x and y respectively.

The number of clock cycles required for the ECCP to produce the output is com-
puted as follows. Let the scalar k have length l and Hamming weight h; then the clock
cycles required to produce the output is given by the following equation.

#ClockCycles = 3 + 12(h − 1) + 4(l − h) + 24
             = 15 + 8h + 4l        (6.6)

Three clock cycles are added for the initial states, and 24 clock cycles are required for the
final projective to affine conversion. 12(h − 1) cycles are required to handle the 1’s in
k. Note that the MSB of k does not need to be considered. 4(l − h) cycles are required
for the 0’s in k.
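Equation 6.6 can be checked directly; for the average case used later in Table 7.2 (l = 233, h = 117, i.e. 116 zeroes with the MSB set) it gives the quoted figure of 1883 cycles:

```python
def eccp_cycles(l, h):
    """Clock cycles for kP: 3 init cycles + 12 per 1-bit after the MSB
    (4 doubling + 8 addition states) + 4 per 0-bit + 24 for the final
    projective-to-affine conversion."""
    assert 1 <= h <= l
    cycles = 3 + 12 * (h - 1) + 4 * (l - h) + 24
    assert cycles == 15 + 8 * h + 4 * l    # simplified form of Equation 6.6
    return cycles

assert eccp_cycles(233, 117) == 1883       # average case reported in Table 7.2
```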

6.4 Performance Evaluation

In this section we compare our work with reported GF (2^m ) elliptic curve crypto
processors implemented on FPGA platforms (Table 6.8). Our ECCP was synthesized using
Xilinx’s ISE for Virtex 4 and Virtex E platforms. Since the reported works are done on
different field sizes, we use the measure latency/bit for evaluation. Here latency is
the time required to compute kP . Latency is computed by assuming that half the bits
of the scalar k are 1. The only faster implementations are [37] and [1]. However,
[37] does not perform the final inverse computation required for converting from LD
to affine coordinates. Also, as shown in Table 6.9, our implementation has a better area-
time product compared to [1], while the latency is almost equal. To compare the two
designs we scaled the area of [1] by a factor of (233/m)^2 , since the area of an elliptic curve
processor is mostly influenced by the multiplier, which has an area of O(m^2 ). The time
is scaled by a factor of (233/m), since the time required is linear in m.

Table 6.8: Comparison of the Proposed GF (2^m ) ECCP with FPGA based Published Results

Work          Platform   Field m  Slices  LUTs   Gate Count  Freq (MHz)  Latency (ms)  Latency/bit (ns)
Orlando [29]  XCV400E    163      -       3002   -           76.7        0.21          1288
Bednara [33]  XCV1000    191      -       48300  -           36          0.27          1413
Kerins [32]   XCV2000    239      -       -      74103       30          12.8          53556
Gura [34]     XCV2000E   163      -       19508  -           66.5        0.14          858
Mentens [65]  XCV800     160      -       -      150678      47          3.810         23812
Lutz [35]     XCV2000E   163      -       10017  -           66          0.075         460
Saqib [37]    XCV3200    191      18314   -      -           10          0.056         293
Pu [38]       XC2V1000   193      -       3601   -           115         0.167         865
Ansari [40]   XC2V2000   163      -       8300   -           100         0.042         257
Chelton [1]   XCV2600E   163      15368   26390  238145      91          0.033         202
Chelton [1]   XC4V200    163      16209   26364  264197      153.9       0.019         116
This Work     XCV3200E   233      20325   40686  333063      25.31       0.074         317
This Work     XC4V140    233      20917   39303  334709      64.46       0.029         124

Table 6.9: Comparing Area×Time Requirements with [1]

Work         Field (m)  Platform  Slices (S)  Scaled Slices       Latency (ms) (T)  Scaled Latency (ms)  Area×Time
                                              SS = S(233/m)^2                       TS = T(233/m)        (SS × TS)
Chelton [1]  163        XC4V200   16209       33120               0.019             0.027                894
This Work    233        XC4V140   20917       20917               0.029             0.029               606
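The scaling used in Table 6.9 can be reproduced numerically (rounding the scaled slices to the nearest integer and the scaled latency to three decimals, as the table does):

```python
def scaled_area_time(slices, latency_ms, m, m_ref=233):
    """Scale slices by (m_ref/m)^2 (multiplier area is O(m^2)) and latency
    by (m_ref/m) (time is linear in m); return the area-time product."""
    ss = round(slices * (m_ref / m) ** 2)
    ts = round(latency_ms * (m_ref / m), 3)
    return ss, ts, int(ss * ts)

assert scaled_area_time(16209, 0.019, 163) == (33120, 0.027, 894)   # Chelton [1]
assert scaled_area_time(20917, 0.029, 233) == (20917, 0.029, 606)   # this work
```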
6.5 Conclusion

This chapter integrates the previously developed finite field arithmetic blocks to form
an arithmetic unit. The AU is used in an elliptic curve crypto processor to compute
the scalar product kP for a NIST specified curve. Our ECCP has better timing per
bit than most of the reported works; of all the works compared, only two are faster.
We showed that our design has more efficient FPGA utilization compared to these
works.

CHAPTER 7

Side Channel Analysis of the ECCP

The previous chapter presented the construction of an elliptic curve crypto processor.
This chapter discusses issues regarding side channel analysis of the processor. First a
side channel attack based on simple power analysis (SPA) of the ECCP is demonstrated.
Then the architecture of the ECCP is modified to reduce the threat of SPA. We call this
new architecture the SPA-resistant elliptic curve crypto processor (SR-ECCP).

This chapter is organized as follows: the next section demonstrates a simple power
analysis of the ECCP. Section 7.2 presents the SR-ECCP and shows that the power
traces no longer reveal the key. The final section has the conclusion.

7.1 Simple Power Analysis on the ECCP

The state machine for the scalar multiplication in the ECCP has 12 states (Figure 6.4),
4 states (D1 · · · D4) for doubling and 8 states (A1 · · · A8) for addition. Each iteration
in the scalar multiplication handles a bit in the key starting from the most significant
one to the least significant bit. If the key bit is zero, only a doubling is done. If the
key bit is one, the doubling is followed by an addition. The dissimilarity in the way a 1
and a 0 in the key are handled makes the ECCP vulnerable to side channel attacks, as
enumerated below.

• The duration of an iteration depends on the key bit. A key bit of 0 leads to a shorter
iteration than a key bit of 1. Thus measuring the duration of an iteration gives
an attacker knowledge of the key bit.
Fig. 7.1: Power Trace for a Key with all 1s
Fig. 7.2: Power Trace for a Key with all 0s

• Each state in the FSM has a unique power consumption trace. Monitoring the
power consumption trace would reveal if an addition is done thus revealing the
key bit.

To demonstrate the attack we used Xilinx’s XPower tool1 . Given a value change
dump (VCD) file generated from a flattened post-map or post-route netlist, XPower is
capable of generating a power trace for a given testbench (details on generating the
power trace are given in Appendix C).

Figures 7.1 and 7.2 are partial power traces generated for the keys (F F F F F F F F )16
and (80000000)16 respectively. The graphs plot power on the Y axis against time
on the X axis for a Xilinx Virtex 4 FPGA. The difference between the graphs is easily
noticeable. The spikes in Figure 7.1 occur in state A6. This state is entered only when
a point addition is done, which in turn happens only when the key bit is 1. The spikes
are not present in Figure 7.2 as state A6 is never entered. Therefore the spikes in
the trace can be used to identify ones in the key.

The duration between two spikes in Figure 7.1 is the time taken to do a point dou-
bling and a point addition. This is 12 clock cycles. If there are two spikes with a
distance greater than 12 clock cycles, it indicates that one or more zeroes are present in
the key. The number of zeroes (n) present can be determined by Equation 7.1. In the
1 http://www.xilinx.com/products/design_tools/logic_design/verification/xpower.htm

Fig. 7.3: Power Trace when k = (B9B9)16

equation t is the duration between the two spikes and T is the time period of the clock.

n = t/(4T) − 3        (7.1)

The number of zeroes between the leading one in k and the one due to the first spike
can be inferred by the amount of shift in the first spike.

As an example consider the power trace (Figure 7.3) for the ECCP obtained when
the key was set to (B9B9)16 . There are 9 spikes indicating 9 ones in the key (excluding
the leading one). Table 7.1 infers the key from the time duration between spikes. The
clock has a period T = 200ns.

The first spike t1 is obtained at 3506 ns. If there were no zeros before t1 the spike
would have been present at 2706 ns (this is obtained from the first spike of Figure
7.1). The shift of 800 ns is equal to four clock cycles. Therefore a 0 is present before the
t1 spike.

Table 7.1: SPA for the key (B9B9)16

i   ti − ti−1   n   Key Inferred
1   -           -   01
2   2400 ns     0   1
3   2400 ns     0   1
4   4000 ns     2   001
5   2400 ns     0   1
6   3200 ns     1   01
7   2400 ns     0   1
8   2400 ns     0   1
9   4000 ns     2   001

The key obtained from the attack is (1011100110111001)2 , and it matches the actual
key.
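The inference of Table 7.1 can be automated. The Python sketch below reads the key back from the shift of the first spike and the gaps between consecutive spikes using Equation 7.1; the timing values are the ones worked out above for Figure 7.3 (shift of 800 ns, clock period T = 200 ns):

```python
def infer_key(first_shift_ns, gaps_ns, T_ns=200):
    """Recover k from spike timings: the MSB of k is implicit, zeros before the
    first spike come from its shift, and zeros between spikes follow
    n = t/(4T) - 3 (Equation 7.1)."""
    bits = "1" + "0" * (first_shift_ns // (4 * T_ns)) + "1"
    for t in gaps_ns:
        n = t // (4 * T_ns) - 3
        bits += "0" * n + "1"
    return int(bits, 2)

# Shift of the first spike and the eight spike gaps of Table 7.1:
gaps = [2400, 2400, 4000, 2400, 3200, 2400, 2400, 4000]
assert infer_key(800, gaps) == 0xB9B9
```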

7.2 SPA Resistant ECCP

To harden the ECCP against SPA, the sequence of computations involved when the key
bit is 1 and when the key bit is 0 must be indistinguishable. There are several ways
to achieve this. The most common technique is to insert a dummy addition when
the key bit is 0 [66]. This is shown in Figure 7.4. With this method, a doubling and
an addition are always done. The value of the key bit decides whether the result of the
addition should be kept. This makes the sequence for a key bit of 1 indistinguishable
from a 0. The time for an iteration is constant, which mitigates timing attacks. Similar
power traces are seen at every iteration, reducing the threat of power attacks. The
following section modifies the ECCP architecture using the dummy addition to make it
robust against SPA.
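The always-add countermeasure can be sketched as follows; both results are computed in every iteration and the key bit only selects which one is kept. In hardware this selection is the multiplexer of Figure 7.4, and the discarded addition is the dummy operation. The integer group at the end is only an illustration:

```python
def spa_resistant_multiply(k, P, double, add):
    """Double-and-always-add: a uniform operation sequence per key bit."""
    bits = bin(k)[2:]
    Q = P
    for b in bits[1:]:
        D = double(Q)
        S = add(D, P)              # performed even when the bit is 0 (dummy)
        Q = S if b == '1' else D   # key bit drives the multiplexer
    return Q

double, add = (lambda q: 2 * q), (lambda a, b: a + b)
assert spa_resistant_multiply(0xB9B9, 1, double, add) == 0xB9B9
```

Every iteration now executes the same double-then-add sequence, so iterations are indistinguishable in both duration and operation mix.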

86
Fig. 7.4: Always Add Method to Prevent SPA

7.2.1 The SR-ECCP

Modifying the ECCP to incorporate 'always add' requires a change in the FSM and
the register file. The new FSM is as shown in Figure 7.5. Irrespective of the key bit
all states D1 · · · D4 and A1 · · · A8 are entered in every iteration. If the key bit is 1 the
result of state A8 is considered as the output of the iteration. If the key bit is 0 the result
of D4 is taken as the output. After all key bits are processed the complete signal is
asserted.
Fig. 7.5: FSM for SR-ECCP

The SR-ECCP also requires a modification in the register file, as shown in Figure
7.6. An additional register bank RD containing three registers is introduced. The three
registers in the bank, RD1 , RD2 and RD3 , store the coordinates of the computed double.
The outputs of the register bank are used in state A8 only when the key bit is 0. RD
requires an additional input multiplexer MUXIN4 to store the doubled result. The sizes
of the output multiplexers MUXOUT1, MUXOUT2 and MUXOUT3 are increased
to incorporate RD's outputs.

Fig. 7.6: Register File for SR-ECCP

7.2.2 Power Trace of the SR-ECCP

Figure 7.7 has the power trace for the SR-ECCP for the key (B9B9)16 . This is the same
key used in the power trace of Figure 7.3. However, unlike Figure 7.3, Figure 7.7 has
no periodic spikes. Thus using a simple power analysis the key cannot be inferred from
Figure 7.7.

Fig. 7.7: Power Trace when k = (B9B9)16

Table 7.2: Performance Evaluation of the SR-ECCP


Processor Device Slices Frequency Clock Cycles
ECCP Xilinx Virtex 4 (XC4VFX140) 21852 64.46MHz 1883
SR-ECCP Xilinx Virtex 4 (XC4VFX140) 23511 56.46MHz 2811

7.2.3 Performance Evaluation

The modification of the ECCP to improve its security comes at the cost of increased area,
lower frequency and increased computation time. Table 7.2 shows the overhead of the
SR-ECCP compared to the ECCP. The clock cycle count is the number of clocks required
to compute kP , assuming k has 116 zeroes out of 233 bits and the MSB of k is 1. The
number of clock cycles required by the SR-ECCP is constant, irrespective of the number
of zeroes in k.

7.3 Conclusion

This chapter demonstrated the vulnerability of the ECCP to simple power analysis.
Simulations show that power traces of the processor leak the secret key. The vulnera-
bilities of the ECCP were fixed in the SR-ECCP, which performs homogeneous operations
irrespective of the key bit. The penalty for the SR-ECCP is a larger area requirement
and a lower frequency compared to the ECCP.

CHAPTER 8

Conclusions and Future Work

The thesis explores various architectures for the construction of an elliptic curve crypto
processor for high performance applications. The most important factors contributing to
the performance are the finite field multiplication and finite field inversion. A combina-
tional multiplier is able to obtain the product in one clock cycle at the cost of increased
area and delay. To ensure that the primitives have a good area-delay product,
the thesis suggests techniques to reduce the area-time product by effectively utilizing
the available FPGA resources.

A hybrid Karatsuba multiplier is proposed for finite field multiplication, which has
been shown to possess the best area time product compared to reported Karatsuba im-
plementations. The hybrid Karatsuba multiplier is a recursive algorithm which does the
initial recursions using the simple Karatsuba multiplier [55], while the final recursion is
done using the general Karatsuba multiplier [55]. The general Karatsuba multiplier has
a large gate count; however, it is more compact for small-sized multiplications due to
better LUT utilization. The simple Karatsuba multiplier is more efficient for large-sized
multiplications. After a thorough search, a threshold of 29 was found. Multiplications
smaller than 29 bits are done using the general Karatsuba multiplier, while larger
multiplications are done with the simple Karatsuba multiplier.

The quad-Itoh Tsujii inversion algorithm proposed to find the multiplicative inverse
has the best computation time and area-time product compared to works reported in
the literature. This work first generalizes the Itoh-Tsujii algorithm and then shows that a
specific instance of the generalization, which uses quad circuits instead of squarers, is
more efficient on FPGAs.
An elliptic curve crypto processor is built using the proposed finite field primitives.
Except for [1], the constructed processor has better timing than all reported works.
However, the constructed processor has much better area requirements and area-time
product compared to [1]. These were achieved in spite of the fact that the scalar mul-
tiplication implemented was straightforward and no parallelism or pipelining was used
in the architecture.

8.1 Future Work

• The focus of this work was on the implementation of efficient elliptic curve prim-
itives for ECC and its impact on the overall performance of the ECCP. Thus a
possible future work could be to combine architectural techniques like pipelining
and parallelism in the higher level scalar multiplier with techniques proposed in
this thesis.

• The toplevel is a simple implementation of the Montgomery multiplication using


López-Dahab (LD) projective coordinates. The combination of more sophisti-
cated methods like the add-and-halve method, the LD method, non-adjacent form
methods, mixed coordinates, etc. with the proposed primitives may be experimented with.

• A simple power attack was analyzed and prevented in the side channel resistant
version of the elliptic curve crypto processor. A very interesting field of research
would be to study the effect of the more powerful differential power analysis
(DPA) on the proposed architecture.

• To make the work proposed in this thesis usable in practice, the developed el-
liptic curve crypto processor may be incorporated in security toolkits such as
OpenSSL1 . This involves the development of a communication interface for com-
munication with the host processor, operating system device drivers and library
modifications.

1 http://www.openssl.org

APPENDIX A

Verification and Testing of the ECCP

A.1 Verification of the ECCP and SR-ECCP

The elliptic curve crypto processor (ECCP) and the side channel resistant version of
the ECCP, the SR-ECCP, have to be verified for their correctness. The verification was
done for the curve given in Equation A.1.

y 2 + xy = x3 + ax2 + b (A.1)

The basepoint and the values of the curve constants used are given in Table A.1. These
constants were taken from NIST’s digital signature specification [14] for elliptic curves
over GF (2233 ).

For a key (k), the scalar product kP is determined by simulation of the ECCP (or
the SR-ECCP) with Modelsim or iVerilog. Here, P is the basepoint with coordinates
(Px , Py ). The result thus obtained is verified against the result obtained by running the

Table A.1: Basepoint and Curve Constants used for Verification of the ECCP and the
SR-ECCP

Basepoint X (Px )    233'h0FAC9DFCBAC8313BB2139F1BB755FEF65BC391F8B36F8F8EB7371FD558B
Basepoint Y (Py )    233'h1006A08A41903350678E58528BEBF8A0BEFF867A7CA36716F7E01F81052
Curve constant (b)   233'h066647EDE6C332C7F8C0923BB58213B333B20E9CE4281FE115F7D8F90AD
Curve constant (a)   1
Fig. A.1: Test Platform for the ECCP

elliptic curve software with the same key k. The elliptic curve software was obtained
from the book Implementing Elliptic Curve Cryptography by Michael Rosing [67].

A Python1 script was developed which automatically generates a random key
k. This key is used by Rosing’s software to determine Q1 = kP . The key is also used
in the test vector of the ECCP (or SR-ECCP) to determine Q2 = kP . The Python script
then verifies that Q1 = Q2 . A large number of scalar multiplications were tested
using the above mentioned procedure.

A.2 Testing of the ECCP

The testing of the ECCP was done using the Virtex 4 FPGA board from Dinigroup2 .
The simplified block diagram of the test platform is shown in Figure A.1. USB com-
munication software supplied by the manufacturer was used to communicate between
the PC and the hardware. Onboard devices convert the USB protocol into a proprietary
main bus protocol3 . This channel is used to configure the FPGAs as well as commu-
nicate with the elliptic curve processor. Our implementations resides in the Virtex 4
FPGA (FPGA_A). The main bus slave has eight 32 bit input registers (Rin0 · · · Rin7)
and sixteen output registers (Rout0 · · · Rout15). It also has a control register contain-
1 www.python.org
2 http://www.dinigroup.com/DN8000k10pcie.php
3 http://www.dinigroup.com/product/common/mainbus_spec.pdf

Table A.2: ECCP System Specifications on the Dini Hardware

Frequency               24 MHz
Slices occupied         22526
Size on device          25%
Clock cycles required   1883 (average case)
Critical path           From the register bank, through the quadblock and
                        MUX C, and back to the register bank.

ing status bits such as start (to start the scalar multiplication) and done (to indicate
completion). To initialize, the 233 bit scalar k (in Algorithm 3.1) is loaded into the
registers Rin0 to Rin7. On completion the result Qx and Qy can be read from Rout0
to Rout15. The results from testing on the hardware were as expected. Table A.2 shows
the specifications of the system when used with the Dini card.

APPENDIX B

Finite Fields used for Performance Evaluation of ITA

The graph in Figure 5.5 was plotted after synthesizing the quad-ITA and the squarer-
ITA for several finite fields. The following table contains the addition chains, irreducible
polynomials and the number of cascaded quad circuits in the quadblock for each
implementation of the (quad-)ITA.

Finite Field   Addition Chain                         Irreducible Polynomial   us

GF (2^103 )    (1 2 3 6 12 24 25 50 51 102)           x^103 + x^9 + 1 = 0      12
GF (2^111 )    (1 2 3 6 12 13 26 27 54 55 110)        x^111 + x^10 + 1 = 0     13
GF (2^121 )    (1 2 3 6 7 14 15 30 60 120)            x^121 + x^18 + 1 = 0     14
GF (2^129 )    (1 2 4 8 16 32 64 128)                 x^129 + x^5 + 1 = 0      16
GF (2^147 )    (1 2 4 8 9 18 36 72 73 146)            x^147 + x^14 + 1 = 0     18
GF (2^161 )    (1 2 4 5 10 20 40 80 160)              x^161 + x^18 + 1 = 0     10
GF (2^169 )    (1 2 4 5 10 20 21 42 84 168)           x^169 + x^34 + 1 = 0     10
GF (2^177 )    (1 2 4 5 10 11 22 44 88 176)           x^177 + x^8 + 1 = 0      11
GF (2^193 )    (1 2 3 6 12 24 48 96 192)              x^193 + x^15 + 1 = 0     12
GF (2^201 )    (1 2 3 6 12 24 25 50 100 200)          x^201 + x^14 + 1 = 0     12
GF (2^209 )    (1 2 3 6 12 13 26 52 104 208)          x^209 + x^6 + 1 = 0      13
GF (2^225 )    (1 2 3 6 7 14 28 56 112 224)           x^225 + x^32 + 1 = 0     14
GF (2^233 )    (1 2 3 6 7 14 28 29 58 116 232)        x^233 + x^74 + 1 = 0     14
GF (2^241 )    (1 2 3 6 7 14 15 30 60 120 240)        x^241 + x^70 + 1 = 0     15
GF (2^253 )    (1 2 3 6 7 14 15 30 31 62 63 126 252)  x^253 + x^46 + 1 = 0     15
GF (2^273 )    (1 2 4 8 16 17 34 68 136 272)          x^273 + x^23 + 1 = 0     17
GF (2^281 )    (1 2 4 8 16 17 34 35 70 140 280)       x^281 + x^93 + 1 = 0     17
GF (2^289 )    (1 2 4 8 9 18 36 72 144 288)           x^289 + x^21 + 1 = 0     18
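Each of the sequences above is a valid addition chain: every element after the leading 1 is the sum of two (not necessarily distinct) earlier elements, which is the property the βu+v = (βu)^(2^v) · βv step of the ITA relies on. This can be checked mechanically:

```python
def is_addition_chain(chain):
    """True when every element after the leading 1 is a sum of two earlier ones."""
    if chain[0] != 1:
        return False
    return all(
        any(chain[i] == a + b for a in chain[:i] for b in chain[:i])
        for i in range(1, len(chain))
    )

# e.g. the chains listed above for GF(2^233) and GF(2^103):
assert is_addition_chain((1, 2, 3, 6, 7, 14, 28, 29, 58, 116, 232))
assert is_addition_chain((1, 2, 3, 6, 12, 24, 25, 50, 51, 102))
```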
APPENDIX C

Using XPower to Obtain Power Traces of a Device

There are two forms of power dissipation in a device: static and dynamic power. Static
power is the amount of power dissipated by the device when no clock is running. During
this phase no signals toggle, hence the power consumed is the minimum power required
to maintain the state of the logic cells. Dynamic power is the amount of power dissipated
by the device when the clock is running. The dynamic power is considerably higher than
the static power consumed by the device, and it is generally caused when one or more
of the inputs toggle. Analysis of the instantaneous dynamic power of the device is used
in side channel attacks.

Obtaining power traces of a device requires equipment such as storage oscilloscopes
and power analyzers. However, such equipment is expensive and therefore not easy
to procure. Most importantly, through this flow we can cross-check the side channel
vulnerability using simulation without being hampered by noise picked up during an
actual measurement. We therefore use Xilinx’s XPower tool to analyze the power con-
sumption of a design after it has been placed and routed.

C.1 XPower

The XPower tool estimates the power consumption for a variety of Xilinx FPGA archi-
tectures. The estimation is based on the device and the number of transitions (activity
rate) of the device.

The following procedure is used to estimate the power consumed by a device using
Xilinx’s ISE and XPower.
• The developed Verilog code is synthesized using the Xilinx ISE tool. The result
of synthesis is a .ngd file. This file is a netlist of primitive gates which could be
implemented on several of the Xilinx FPGAs.

• The next step is to map the primitives onto the resources available on the specific
FPGA platform. This is done by the Xilinx map tool. The output of the tool is an
.ncd file.

• The .ncd file is then passed to the place and route tool, where specific locations
on the FPGA are assigned. This tool tries to incorporate all the timing constraints
specified in the constraints file. The output of the place and route tool is an
updated .ncd file.

• In ISE, a flattened Verilog netlist can be generated after the mapping or the place
and route. The post-map netlist is created by clicking generate post-map
simulation model. This creates a Verilog netlist called topmodule_map.v, along
with a .sdf file containing timing information of the device.

• Now the flattened Verilog file and the .sdf file, along with a testbench, can be simulated
in Modelsim. A value change dump file containing all the signal transitions can
be generated from the simulation. This requires the following lines to be present
in the testbench.

initial begin
$dumpfile ("dump.vcd"); /* File to place signal activity report */
$dumpvars; /* Dump all signals in the design */
$dumpon; /* Turn on dump */
#100000 $dumpoff; /* Turn off dump */
end

These lines will result in a file called dump.vcd to be generated during simulation.
The VCD file contains the activity on each signal in the design.

• The constraints file (.pcf ), the .vcd file and the .ncd file are used as inputs to
XPower. XPower can be run from command line as shown below.
xpwr topmodule_map.ncd topmodule.pcf -s dump.vcd
The result produced by xpwr is present in a text file called topmodule.txt. The
topmodule.txt file contains the instantaneous power consumption for the given
test vector.

• This text file is plotted on a graph to obtain the power trace.

If the .sdf file generated by ISE is used in XPower, then the power measurement
includes the power consumed due to glitches. If the post place and route Verilog
netlist is used instead of the mapped netlist, then more accurate power measurements
are possible.

APPENDIX D

Elliptic Curve Arithmetic

This appendix derives the elliptic curve equations for points in affine coordinates and
López-Dahab projective coordinates.

Consider the elliptic curve E over the field GF (2m ). This is given by

y^2 + xy = x^3 + ax^2 + b        (D.1)

where a, b ∈ GF (2m ).

Equation D.1 can be rewritten as

F (x, y) : y^2 + x^3 + xy + ax^2 + b = 0        (D.2)

The partial derivatives of this equation are

dF/dy = x
dF/dx = x^2 + y        (D.3)

If we consider the curve given in Equation D.1 with b = 0, then the point (0, 0)
lies on the curve. At this point dF/dy = dF/dx = 0. This forms a singular point
and cannot be included in the elliptic curve group; therefore an additional condition of
b ≠ 0 is required on the elliptic curve of Equation D.1. This condition ensures that the
curve is non-singular.
D.1 Equations for Arithmetic in Affine Coordinates

D.1.1 Point Inversion

Let P = (x1 , y1 ) be a point on the elliptic curve of Equation D.1. To find the inverse of
point P , a vertical line is drawn passing through P . The equation of this line is x = x1 .
The point at which this line intersects the curve is the inverse −P . The coordinates of
−P is (x1 , y1′ ). To find y1′ , the point of intersection between the line and the curve must
be found. Viewed as a quadratic in y (at x = x1 ), Equation D.2 is represented in terms
of its roots p and q as shown below.

(y − p)(y − q) = y^2 − (p + q)y + pq        (D.4)

The coefficient of y is the sum of the roots. Equating the coefficients of y in Equations
D.2 and D.4,

p + q = x1

One of the roots is q = y1 , therefore the other root p is given by

p = x1 + y1

This is the y coordinate of the inverse. The inverse of the point P is therefore given by
(x1 , x1 + y1 ).

D.1.2 Point Addition

Let P = (x1 , y1 ) and Q = (x2 , y2 ) be two points on the elliptic curve. To add the two
points, a line (l) is drawn through P and Q. If P ≠ ±Q, the line intersects the curve of
Equation D.1 at the point −R = (x3 , y3′ ). The inverse of the point −R is R = (P + Q),
having coordinates (x3 , y3 ).

The slope of the line l passing through P and Q is given by

λ = (y2 − y1 )/(x2 − x1 )

The equation of the line l is

y − y1 = λ(x − x1 )
y = λ(x − x1 ) + y1        (D.5)

Substituting y from D.5 in the elliptic curve equation D.1 we get,

(λ(x − x1 ) + y1 )^2 + x(λ(x − x1 ) + y1 ) = x^3 + ax^2 + b

This can be rewritten as

x^3 + (λ^2 + λ + a)x^2 + · · · = 0        (D.6)

Equation D.6 is a cubic equation having three roots. Let the roots be p, q and r. These
roots represent the x coordinates of the points on the line that intersect the curve (the
point P , Q and −R). Equation D.6 can be also represented in terms of its roots as

(x − p)(x − q)(x − r) = 0
x^3 − (p + q + r)x^2 + · · · = 0        (D.7)

Equating the x^2 coefficients of Equations D.7 and D.6 we get,

p + q + r = λ^2 + λ + a        (D.8)

Since P = (x1 , y1 ) and Q = (x2 , y2 ) lie on the line l, therefore two roots of Equation
D.6 are x1 and x2 . Substituting p = x1 and q = x2 in Equation D.8 we get the third
root; this is the x coordinate of the third point on the line which intersects the curve (i.e.
−R). This point is denoted by x3 , and it also represents the x coordinate of R.

x3 = λ^2 + λ + x1 + x2 + a        (D.9)

The y coordinate of −R can be obtained by substituting x = x3 in Equation D.5. This
point is denoted as y3′ .

y3′ = λ(x3 + x1 ) + y1        (D.10)

Reflecting this point about the x axis is done by substituting y3′ = x3 + y3 . This gives
the y coordinate of R, denoted by y3 .

y3 = λ(x3 + x1 ) + y1 + x3 (D.11)

Since we are working with binary finite fields, subtraction is the same as addition.
Therefore,

x3 = λ^2 + λ + x1 + x2 + a
y3 = λ(x3 + x1 ) + y1 + x3        (D.12)
λ = (y2 + y1 )/(x2 + x1 )

D.1.3 Point Doubling

Let P = (x1 , y1 ) be a point on the elliptic curve. The double of P , i.e. 2P , is found by
drawing a tangent t through P . This tangent intersects the curve at the point −2P =
(x3 , y3′ ). Taking the reflection of the point −2P about the X axis gives 2P = (x3 , y3 ).

First, let us look at the tangent t through P . The slope of the tangent t is obtained
by implicit differentiation of Equation D.1.

2y(dy/dx) + x(dy/dx) + y = 3x^2 + 2ax

104
Since we are using modulo 2 arithmetic,

x(dy/dx) + y = x^2

The slope dy/dx of the line t passing through the point P is given by

λ = (x1^2 + y1 )/x1        (D.13)

The equation of the line t can be represented by the following.

y + y1 = λ(x + x1 ) (D.14)

This gives,

y = λ(x + x1 ) + y1

y = λx + c for some constant c

To find x3 (the x coordinate of −2P ), substitute for y in Equation D.1.

(λx + c)^2 + x(λx + c) = x^3 + ax^2 + b

This equation can be rewritten as

0 = x^3 + (λ^2 + λ + a)x^2 + · · ·        (D.15)

This equation is cubic and has three roots. Of these three roots, two roots must be
equal since the line intersects the curve at exactly two points. The two equal roots are
represented by p. The sum of the three roots is (λ2 + λ + a), similar to Equation D.7.

Therefore,

p + p + r = λ^2 + λ + a
r = λ^2 + λ + a

The dissimilar root is r. This root corresponds to the x coordinate of −2P , i.e. x3 .
Therefore,

x3 = λ^2 + λ + a

To find the y coordinate of −2P , i.e. y3′ , substitute x3 in Equation D.14. This gives,

y3′ = λx3 + λx1 + y1
y3′ = λx3 + x1^2

To find y3 , the y coordinate of 2P , the point y3′ is reflected about the x axis. From the
point inverse equation,

y3 = λx3 + x1^2 + x3

To summarize, the coordinates of the double are given by Equation D.16.

x3 = λ^2 + λ + a
y3 = x1^2 + λx3 + x3        (D.16)
λ = x1 + y1 /x1
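The affine formulas D.12 and D.16 can be exercised on a toy field. The sketch below brute-forces points of the illustrative curve y^2 + xy = x^3 + x^2 + 1 over GF(2^4) (a small stand-in for the NIST curve of the thesis) and checks that the sum and double produced by the formulas land back on the curve:

```python
M, POLY, A, B = 4, 0b10011, 1, 1            # GF(2^4), x^4 + x + 1; a = b = 1

def gmul(a, b):                              # polynomial-basis multiplication
    r = 0
    while b:
        if b & 1:
            r ^= a
        a <<= 1
        if a >> M:
            a ^= POLY
        b >>= 1
    return r

def ginv(a):                                 # a^(2^M - 2) = a^-1 (Fermat)
    r = 1
    for _ in range(2**M - 2):
        r = gmul(r, a)
    return r

def on_curve(x, y):                          # y^2 + xy = x^3 + ax^2 + b
    lhs = gmul(y, y) ^ gmul(x, y)
    rhs = gmul(gmul(x, x), x) ^ gmul(A, gmul(x, x)) ^ B
    return lhs == rhs

def padd(P, Q):
    (x1, y1), (x2, y2) = P, Q
    if P == Q:                               # doubling, Equation D.16
        lam = x1 ^ gmul(y1, ginv(x1))
        x3 = gmul(lam, lam) ^ lam ^ A
        y3 = gmul(x1, x1) ^ gmul(lam, x3) ^ x3
    else:                                    # addition, Equation D.12
        lam = gmul(y1 ^ y2, ginv(x1 ^ x2))
        x3 = gmul(lam, lam) ^ lam ^ x1 ^ x2 ^ A
        y3 = gmul(lam, x3 ^ x1) ^ y1 ^ x3
    return (x3, y3)

pts = [(x, y) for x in range(16) for y in range(16) if on_curve(x, y)]
P = next(p for p in pts if p[0] != 0)        # avoid the order-2 point x = 0
Q = next(q for q in pts if q[0] not in (0, P[0]))
assert on_curve(*padd(P, P)) and on_curve(*padd(P, Q))
assert padd(P, Q) == padd(Q, P)              # addition is commutative
```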

D.2 Equations for Arithmetic in LD Projective Coordi-
nates

D.2.1 Point Inversion

Inverting a point P = (x1 , y1 ) on the elliptic curve results in the point −P = (x3 , y3 ) =
(x1 , x1 + y1 ). Converting x1 to X1 /Z1 , x3 to X3 /Z3 , y1 to Y1 /Z1^2 and y3 to Y3 /Z3^2 ,
we have X3 /Z3 = X1 /Z1 , therefore X3 = X1 and Z3 = Z1 . Also,

Y3 /Z3^2 = X1 /Z1 + Y1 /Z1^2
         = (X1 Z1 + Y1 )/Z1^2

Therefore, −P = (X3 , Y3 , Z3 ) in projective coordinates is (X1 , X1 Z1 + Y1 , Z1 ).

D.2.2 Point Addition

In Equation D.12, change x1 to X1 /Z1 , x3 to X3 /Z3 , y1 to Y1 /Z1^2 and y3 to Y3 /Z3^2 .
Then the slope λ becomes

λ = (y2 + Y1 /Z1^2 )/(x2 + X1 /Z1 )
  = (y2 Z1^2 + Y1 )/(Z1 (x2 Z1 + X1 ))

Let A = y2 Z1^2 + Y1 , B = x2 Z1 + X1 and C = Z1 B. Then,

λ = A/(Z1 · B)

Consider the equation for x3 in Equation D.12.

x3 = X3 /Z3 = (A/(BZ1 ))^2 + A/(BZ1 ) + X1 /Z1 + x2 + a
            = (A^2 + ABZ1 + B^2 X1 Z1 + B^2 x2 Z1^2 + aB^2 Z1^2 )/(BZ1 )^2

Therefore,

Z3 = (BZ1 )^2 = C^2        (D.17)

and,

X3 = A2 + AC + B 2 X1 Z1 + B 2 x2 Z1 2 + aB 2 Z1 2

= A2 + AC + B 2 (Z1 (X1 + x2 Z1 ) + aZ1 2 )

= A2 + AC + B 2 (Z1 B + aZ1 2 )

Let, E = AC and D = B 2 (Z1 B + aZ1 2 ), then

X3 = A2 + E + D (D.18)

Consider the equation for y3 in Equation D.12:

y3 = Y3/Z3² = (A/(Z1B))(X1/Z1 + X3/Z3) + X3/Z3 + Y1/Z1²
            = (AB³X1Z1² + ABX3Z1 + X3Z3 + B⁴Y1Z1²) / Z3²

Y3 = AB³X1Z1² + ABX3Z1 + X3Z3 + B⁴Y1Z1²

Substituting X1 = B + x2Z1 and E = ABZ1, we get

Y3 = (B + x2Z1)AB³Z1² + EX3 + X3Z3 + B⁴Y1Z1²
   = (AB⁴Z1² + Ex2Z3) + EX3 + X3Z3 + B⁴Y1Z1²
   = (y2Z1² + Y1)B⁴Z1² + Ex2Z3 + EX3 + X3Z3 + B⁴Y1Z1²
   = y2Z3² + Ex2Z3 + EX3 + X3Z3

Let F = X3 + x2Z3 and G = (x2 + y2)Z3². Then

Y3 = (G + x2Z3²) + Ex2Z3 + EX3 + X3Z3
Y3 = G + F(E + Z3)                    (D.19)
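Since the projective formulas were derived from Equation D.12 by pure substitution, they must agree with the affine addition for any inputs with x1 ≠ x2 (no curve membership is needed). The toy sketch below, in GF(2⁴) with assumed parameters, checks exactly that; the function and temporary names follow the A, B, C, D, E, F, G of the derivation.

```python
# Compare LD mixed addition (derived A..G formulas) against affine
# addition from Equation D.12.  Toy field GF(2^4), x^4 + x + 1,
# assumed curve coefficient a = 1.
M, POLY = 4, 0b10011
A_COEF = 1

def gf_mul(a, b):
    r = 0
    while b:
        if b & 1:
            r ^= a
        b >>= 1
        a <<= 1
        if a & (1 << M):
            a ^= POLY
    return r

def gf_inv(a):
    r, e = 1, (1 << M) - 2          # Fermat inverse
    while e:
        if e & 1:
            r = gf_mul(r, a)
        a = gf_mul(a, a)
        e >>= 1
    return r

def affine_add(x1, y1, x2, y2):
    lam = gf_mul(y1 ^ y2, gf_inv(x1 ^ x2))            # (y1+y2)/(x1+x2)
    x3 = gf_mul(lam, lam) ^ lam ^ x1 ^ x2 ^ A_COEF
    y3 = gf_mul(lam, x1 ^ x3) ^ x3 ^ y1
    return x3, y3

def ld_mixed_add(X1, Y1, Z1, x2, y2):
    Z1s = gf_mul(Z1, Z1)
    A = gf_mul(y2, Z1s) ^ Y1                          # A = y2*Z1^2 + Y1
    B = gf_mul(x2, Z1) ^ X1                           # B = x2*Z1 + X1
    C = gf_mul(Z1, B)                                 # C = Z1*B
    D = gf_mul(gf_mul(B, B), C ^ gf_mul(A_COEF, Z1s)) # D = B^2(Z1*B + a*Z1^2)
    Z3 = gf_mul(C, C)                                 # Equation D.17
    E = gf_mul(A, C)
    X3 = gf_mul(A, A) ^ E ^ D                         # Equation D.18
    F = X3 ^ gf_mul(x2, Z3)
    G = gf_mul(x2 ^ y2, gf_mul(Z3, Z3))
    Y3 = G ^ gf_mul(F, E ^ Z3)                        # Equation D.19
    return X3, Y3, Z3

def to_affine(X, Y, Z):
    zi = gf_inv(Z)
    return gf_mul(X, zi), gf_mul(Y, gf_mul(zi, zi))

x1, y1, x2, y2, Z1 = 2, 9, 5, 11, 3                   # arbitrary, x1 != x2
X1, Y1 = gf_mul(x1, Z1), gf_mul(y1, gf_mul(Z1, Z1))
print(to_affine(*ld_mixed_add(X1, Y1, Z1, x2, y2)) == affine_add(x1, y1, x2, y2))
```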

D.2.3 Point Doubling

The x3 equation in D.16 can be rewritten as follows:

x3 = (x1 + y1/x1)² + (x1 + y1/x1) + a
   = (x1⁴ + y1² + x1³ + x1y1 + ax1²) / x1²        (D.20)

From Equation D.1,

b = x1³ + y1² + x1y1 + ax1²

Substituting in Equation D.20,

x3 = x1² + b/x1²                      (D.21)

Convert x1 to X1/Z1 and x3 to X3/Z3:

X3/Z3 = X1²/Z1² + bZ1²/X1²
      = (X1⁴ + bZ1⁴) / (X1²Z1²)
Therefore,

X3 = X1⁴ + bZ1⁴
Z3 = X1²Z1²

The y3 equation in D.16 can be written as follows:

y3 = x1² + (x1 + y1/x1)x3 + x3
   = (x1² + x3) + ((x1³ + x1y1)/x1²)x3

From Equations D.21 and D.1,

y3 = b/x1² + ((y1² + ax1² + b)/x1²)x3

Converting this equation to projective coordinates by changing y3 to Y3/Z3², y1 to Y1/Z1², x3 to X3/Z3 and x1 to X1/Z1:

Y3/Z3² = bZ1²/X1² + ((Y1² + aX1²Z1² + bZ1⁴)/(X1²Z1²))(X3/Z3)
       = (bZ1⁴Z3 + (Y1² + aX1²Z1² + bZ1⁴)X3) / Z3²

Therefore,

Y3 = bZ1⁴Z3 + (Y1² + aX1²Z1² + bZ1⁴)X3
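Unlike the addition formulas, the doubling derivation uses the curve equation (through b), so a numerical check needs a genuine curve point. The toy sketch below, in GF(2⁴) with assumed coefficients a = b = 1, compares the derived projective doubling against the affine doubling of Equation D.16.

```python
# Compare LD projective doubling (X3 = X1^4 + b*Z1^4, Z3 = X1^2*Z1^2,
# Y3 = b*Z1^4*Z3 + (Y1^2 + a*X1^2*Z1^2 + b*Z1^4)*X3) against affine
# doubling.  Toy field GF(2^4), x^4 + x + 1, assumed a = b = 1.
M, POLY = 4, 0b10011
A_COEF, B_COEF = 1, 1

def gf_mul(a, b):
    r = 0
    while b:
        if b & 1:
            r ^= a
        b >>= 1
        a <<= 1
        if a & (1 << M):
            a ^= POLY
    return r

def gf_inv(a):
    r, e = 1, (1 << M) - 2          # Fermat inverse
    while e:
        if e & 1:
            r = gf_mul(r, a)
        a = gf_mul(a, a)
        e >>= 1
    return r

def on_curve(x, y):
    return gf_mul(y, y) ^ gf_mul(x, y) == \
           gf_mul(gf_mul(x, x), x) ^ gf_mul(A_COEF, gf_mul(x, x)) ^ B_COEF

def affine_double(x1, y1):
    lam = x1 ^ gf_mul(y1, gf_inv(x1))                 # Equation D.16
    x3 = gf_mul(lam, lam) ^ lam ^ A_COEF
    y3 = gf_mul(x1, x1) ^ gf_mul(lam, x3) ^ x3
    return x3, y3

def ld_double(X1, Y1, Z1):
    X1s, Z1s = gf_mul(X1, X1), gf_mul(Z1, Z1)
    Z3 = gf_mul(X1s, Z1s)                             # Z3 = X1^2 * Z1^2
    bZ14 = gf_mul(B_COEF, gf_mul(Z1s, Z1s))           # b * Z1^4
    X3 = gf_mul(X1s, X1s) ^ bZ14                      # X3 = X1^4 + b*Z1^4
    T = gf_mul(Y1, Y1) ^ gf_mul(A_COEF, gf_mul(X1s, Z1s)) ^ bZ14
    Y3 = gf_mul(bZ14, Z3) ^ gf_mul(T, X3)
    return X3, Y3, Z3

def to_affine(X, Y, Z):
    zi = gf_inv(Z)
    return gf_mul(X, zi), gf_mul(Y, gf_mul(zi, zi))

# a real curve point with x1 != 0, lifted with an arbitrary Z1
x1, y1 = next((x, y) for x in range(1, 1 << M) for y in range(1 << M) if on_curve(x, y))
Z1 = 6
X1, Y1 = gf_mul(x1, Z1), gf_mul(y1, gf_mul(Z1, Z1))
print(to_affine(*ld_double(X1, Y1, Z1)) == affine_double(x1, y1))
```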

APPENDIX E

Gate Requirements for the Simple Karatsuba Multiplier

This appendix estimates the number of AND and XOR gates required by the simple Karatsuba multiplier.

E.1 Gate Requirements for the Basic Karatsuba Multiplier

E.1.1 AND Gate Estimate

For an m = 2^k bit basic Karatsuba multiplier, the first recursion splits the m bit multiplicands into m/2 bit halves, so three m/2 = 2^(k−1) bit multipliers are required. The second recursion has nine m/4 = 2^(k−2) bit multipliers. In general, the i-th recursion has 3^i multipliers, each m/2^i = 2^(k−i) bits in length. There are k = log₂ m such recursions. The final recursion consists of 3^(log₂ m) single-bit multiplications, and each such product of two bits is computed with a single AND gate. Therefore,

#AND gates : 3^(log₂ m)               (E.1)
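The recursion argument can be checked with a few lines of code: unrolling "one m-bit product costs three m/2-bit products" down to single bits must reproduce the 3^(log₂ m) closed form. This is a simple illustrative sketch, not part of the thesis design.

```python
def and_gates(m):
    # A basic Karatsuba multiplier on m = 2^k bits recurses into three
    # half-width multipliers; a 1-bit product is a single AND gate.
    return 1 if m == 1 else 3 * and_gates(m // 2)

# the recursive count matches the closed form 3^(log2 m)
for m in (2, 4, 8, 16, 256):
    assert and_gates(m) == 3 ** (m.bit_length() - 1)
print(and_gates(256))  # 3^8 = 6561
```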

E.1.2 XOR Gate Estimate

Let A and B be the two m = 2^k bit multiplicands. In the first recursion, the multiplicands are split into two halves: the higher bits Ah and Bh and the lower bits Al and Bl. The three m/2 bit multiplications performed are Mh = Ah·Bh, Ml = Al·Bl and Mhl = (Ah + Al)(Bh + Bl). Let n = m/2. Forming the term Ah + Al requires n XOR gates; similarly, the term Bh + Bl requires n XOR gates. In all, 2n XORs are required. After the three multiplications are completed, the partial products are added as shown in Table E.1. Each column of the table corresponds to a range of output bits of the multiplier and lists the partial products that are combined to form those bits.

Table E.1: Combining the Partial Products

 4n−2 to 3n−1 | 3n−2 to 2n | 2n−1 | 2n−2 to n | n−1 to 0
--------------+------------+------+-----------+----------
      -       |     -      |  -   |    Ml     |    Ml
      -       |     Ml     |  Ml  |    Ml     |    -
      -       |     Mh     |  Mh  |    Mh     |    -
      -       |     Mhl    |  Mhl |    Mhl    |    -
      Mh      |     Mh     |  Mh  |     -     |    -

Combining the terms from bit (2n−2) down to bit n requires 3(n−1) XOR gates. Similarly, the terms from bit (3n−2) down to bit 2n require 3(n−1) XOR gates, and bit (2n−1) requires 2 XOR gates. Thus, the total number of XOR gates required for combining the partial products is 6n−4, and the number of XOR gates required for one recursion is 6n−4+2n = 8n−4 = 4m−4. Since m/2^r is the length of the multipliers in the r-th recursion, each multiplier in the r-th recursion requires 4(m/2^r) − 4 XOR gates, and there are 3^r such multipliers. Adding up the XOR gates required over all recursions gives the XOR gate estimate (Equation E.2):

#XOR gates : Σ_{r=0}^{log₂ m} 3^r (4(m/2^r) − 4)      (E.2)
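Equation E.2 is just the unrolled form of the per-recursion cost "4m − 4 XORs plus three half-width multipliers"; the two views can be checked against each other with a short script (illustrative only, not from the thesis).

```python
def xor_rec(m):
    # One level of an m-bit basic Karatsuba multiplier costs 4m - 4 XORs
    # (2n to form Ah+Al and Bh+Bl, 6n-4 to combine, with n = m/2),
    # plus three half-width multipliers.
    return 0 if m == 1 else 3 * xor_rec(m // 2) + 4 * m - 4

def xor_closed(m):
    # Equation E.2: sum over all recursions, m a power of two
    k = m.bit_length() - 1                   # log2 m
    return sum(3**r * (4 * (m >> r) - 4) for r in range(k + 1))

for m in (2, 4, 8, 64, 1024):
    assert xor_rec(m) == xor_closed(m)
print(xor_closed(8))  # 100
```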

E.2 Gate Requirements for the Simple Karatsuba Multiplier

The simple Karatsuba multiplier is essentially the basic Karatsuba multiplier with a small modification to handle bit lengths m ≠ 2^k. The gate counts of the basic Karatsuba multiplier therefore form an upper bound on the number of gates required by the simple Karatsuba multiplier:

#AND gates : 3^⌈log₂ m⌉
                                                      (E.3)
#XOR gates : Σ_{r=0}^{⌈log₂ m⌉} 3^r (4⌈m/2^r⌉ − 4)
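The bounds in Equation E.3 can be evaluated directly; the sketch below does so with integer arithmetic (using (m−1).bit_length() for ⌈log₂ m⌉), and prints the AND bound for GF(2²³³), a field size considered in this thesis. The function name is mine.

```python
def ceil_div(a, b):
    return -(-a // b)

def simple_karatsuba_gates(m):
    # Equation E.3: upper bounds on gate counts for an arbitrary m >= 2
    k = (m - 1).bit_length()                 # ceil(log2 m)
    ands = 3 ** k
    xors = sum(3**r * (4 * ceil_div(m, 2**r) - 4) for r in range(k + 1))
    return ands, xors

# For m = 233: ceil(log2 233) = 8, so the AND bound is 3^8
print(simple_karatsuba_gates(233)[0])  # 6561
```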

PUBLICATIONS AND AWARDS BASED ON THESIS

Publications

1. Chester Rebeiro and Debdeep Mukhopadhyay, "Hybrid Masked Karatsuba Multiplier for GF(2²³³)," in Proceedings of the 11th IEEE VLSI Design and Test Symposium, Kolkata, August 2007, pp. 379–387, VLSI Society of India.

2. Chester Rebeiro and Debdeep Mukhopadhyay, "Power Attack Resistant Efficient FPGA Architecture for Karatsuba Multiplier," in Proceedings of the 21st International Conference on VLSI Design, Hyderabad, January 2008, pp. 706–711, IEEE Computer Society.

3. Chester Rebeiro and Debdeep Mukhopadhyay, "High Performance Elliptic Curve Crypto Processor for FPGA Platforms," in Proceedings of the 12th IEEE VLSI Design and Test Symposium, Bangalore, July 2008, pp. 107–117, VLSI Society of India.

4. Chester Rebeiro and Debdeep Mukhopadhyay, "High Speed Compact Elliptic Curve Cryptoprocessor for FPGA Platforms," in INDOCRYPT 2008: 9th International Conference on Cryptology in India, Kharagpur, December 2008, pp. 376–388, Springer-Verlag.

Awards

1. Chester Rebeiro and Debdeep Mukhopadhyay won the second prize at the design contest conducted by the 22nd International Conference on VLSI Design, New Delhi, January 2009. The entry was titled "High Performance Galois Field Elliptic Curve Cryptographic Processor for FPGA Platforms".

