Cryptographic Algorithm with Multi-Board FPGA Architecture: Design by Ananth & Karthikeyan, Study notes of Indian Literature

A first-prize-winning design from the Nios II Embedded Processor Design Contest 2005, in which G. Ananth and U.S. Karthikeyan of IIT Chennai implemented cryptographic algorithms on a multi-board FPGA architecture. The project generated random numbers with a DES random bit generator and implemented the RSA algorithm on Altera's APEX 20KE device family. The design used interrupts for inter-board communication and custom peripherals for random number generation and for the RSA algorithm.

Typology: Study notes

2010/2011

Uploaded on 12/21/2011

Nios II Embedded Processor Design Contest: Outstanding Designs 2005

First Prize

Cryptographic Algorithm Using a Multi-Board FPGA Architecture

Institution: Indian Institute of Technology, Chennai
Participants: G. Ananth and U.S. Karthikeyan
Instructor: Dr. V. Kamakoti

Design Introduction

Information security has assumed significant importance in today's world, especially because minor breaches can lead to major risks in national security and in e-commerce applications and transactions. This necessitates implementing cryptographic algorithms in hardware to achieve better security and faster response than a software implementation can offer. A promising solution that combines high flexibility with the speed and physical security of traditional hardware is the FPGA.

Implementing cryptographic algorithms requires the generation of random numbers, which an algorithm can then use to derive the keys for a secure transmission. With this in mind, a design was created that implements a multi-board architecture using two Altera® boards. One board constantly generates random numbers using a data encryption standard (DES) random bit generator and, at the same time, keeps polling its input port for requests from another program designed to receive random numbers. The second board contains a design that implements the RSA algorithm and receives random numbers on the fly by means of hardware interrupts. On receiving a random number, the second board sends an acknowledgement back to the first board to continue the process. The designs (implemented as peripherals) on each board use a Nios® embedded processor for communicating and exchanging data between the driver program and the peripheral.
The FPGA device family chosen for implementing the RSA algorithm is Altera's APEX™ 20KE device family. APEX devices are high-density FPGAs that allow complex designs to be implemented on a single device. The target device was an EP20K200EFC484-2X and the design files were written in Verilog HDL, while compilation, synthesis, fitting, placement, and routing were carried out using the Quartus® II software. The Nios development board provided a hardware platform to immediately start developing embedded systems based on Altera APEX devices. The Nios development board was preloaded with a 32-bit Nios embedded processor system reference design.

The highlight of this project is the effective use of interrupts for inter-board communication and the use of numerous custom peripherals for random number generation, for implementing the RSA algorithm, and for hardware acceleration.

Functional Description

The functional description of this project is depicted through the flow diagram below. It comprises two flows. One flow is the generation of the random number using the DES-based random bit generator. See Figure 1.

Figure 1. DES-Based Random Bit Generator (blocks: Incrementer, Plain Text, Key, DES Algorithm, Cipher Text, Random Number)

The flow diagram for the RSA implementation is as follows:

1. A request is sent from the RSA module to fetch a random byte.
2. On receipt of a request, a random byte is sent by the DES random bit generator, which continuously polls a designated port for requests for random bytes.
3. It also signals READY after sending the random byte, indicating readiness to accept the next request from the device (the FPGA running RSA).

Lines of Verilog HDL Code

Number  Design                                                       Lines of Verilog HDL Code
1       PLL-Based TRBG                                               275
2       Ring Oscillator-Based TRBG                                   275
3       Modified LILI-II PRBG                                        177
4       Nonlinear Combiner Model-Based PRBG                          189
5       Nonlinear Combiner Model (Enhanced with Memory)-Based PRBG   322
6       Nonlinear State Filter Model-Based PRBG                      324
7       DES-Based PRBG                                               1,143
8       DES-ALFG-Based PRBG                                          1,191
9       BBS-Based PRBG                                               601

The RSA algorithm was implemented as separate peripherals performing the following operations:

■ Random number receiver
■ Multiplicative inverse calculator
■ Modular exponentiation calculator

After these peripherals were implemented, they were combined to form an RSA integrated design working through a C driver program, which passed inputs and outputs between the various peripherals, in order. Because of the limited number of LEs on the FPGA, only a 32-bit RSA integrated algorithm was implemented. Space on the FPGA (number of LEs) permitting, this design can easily be scaled up.

Random Number Receiver

The random number receiver was implemented to receive one random byte at a time through the external pins on the board. The peripheral consumed the following resources.

Family              APEX 20KE
Device              EP20K200EFC484-2X
Total LEs           2,783/8,320 (33%)
Total Pins          121/376 (32%)
Total Memory Bits   26,496/106,496 (24%)
Total PLLs          0/2 (0%)

The total time taken for compilation, synthesis, fitting, placement, and routing of this peripheral was 4 minutes and 42 seconds.

Multiplicative Inverse

This peripheral was implemented to compute the secret key using an extended Euclidean algorithm. Since the algorithm requires division operations to compute the remainder and quotient at every step, it consumed a lot of resources. In simulation, this algorithm was tried and tested up to 128 bits, but in hardware, it could be implemented only up to 48 bits. The total time taken for compilation, synthesis, fitting, placement, and routing was 12 minutes, 31 seconds.
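As a rough software illustration of what this peripheral computes, the following Python sketch derives the inverse of e modulo phi with the extended Euclidean algorithm, using one quotient/remainder division per step as described above. It is not the authors' Verilog; the function name `mod_inverse` and the variable names are assumptions made for illustration.

```python
def mod_inverse(e, phi):
    """Return d such that (e * d) % phi == 1, via the extended Euclidean algorithm."""
    r_prev, r = phi, e % phi      # remainder sequence
    t_prev, t = 0, 1              # coefficients of e in each remainder
    while r != 0:
        quotient, remainder = divmod(r_prev, r)   # the division done at every step
        r_prev, r = r, remainder
        t_prev, t = t, t_prev - quotient * t
    if r_prev != 1:               # last non-zero remainder must be 1
        raise ValueError("e has no inverse modulo phi")
    return t_prev % phi           # fold a negative coefficient into [0, phi)


# Small textbook values: p = 61, q = 53, so phi = 60 * 52 = 3120.
print(mod_inverse(17, 3120))
```

The check that the last non-zero remainder equals one mirrors the condition under which an inverse exists; the final `% phi` only normalizes the sign of the result.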
The compilation report for this peripheral was:

Family              APEX 20KE
Device              EP20K200EFC484-2X
Total LEs           6,524/8,320 (78%)
Total Pins          111/376 (29%)
Total Memory Bits   26,496/106,496 (24%)
Total PLLs          0/2 (0%)

As can be seen from the compilation report, a 48-bit implementation alone consumes 6,524 LEs. The exponentiator and the random number receiver would therefore not fit on the FPGA alongside it, so only a 32-bit implementation was used in the RSA integrated implementation.

Exponentiator

This peripheral was implemented to carry out the following tasks:

■ Primality check using Fermat's Theorem
■ Encryption
■ Decryption

The algorithm implemented was the Montgomery exponentiation algorithm, which in turn uses the Montgomery multiplication algorithm for the intermediate steps. The modular multiplication was implemented using a systolic array architecture, which is quite resource efficient. In simulation, a 512-bit exponentiation was implemented; in hardware, however, only a 128-bit exponentiation was possible. The total time taken for compilation, synthesis, fitting, placement, and routing was 11 minutes, 36 seconds. The compilation report for this peripheral was:

Family              APEX 20KE
Device              EP20K200EFC484-2X
Total LEs           6,971/8,320 (83%)
Total Pins          111/376 (29%)
Total Memory Bits   26,496/106,496 (24%)
Total PLLs          0/2 (0%)

The peripheral consumed 6,971 LEs, so a wider implementation such as a 256- or 512-bit exponentiation was not possible, despite the resource-efficient architecture. The 256-bit exponentiator alone required 10,277 LEs, while a 512-bit exponentiator required 17,459 LEs. In the RSA integrated implementation, only a 32-bit exponentiator was included, since two other peripherals, the random number receiver and the multiplicative inverse, also had to fit on the same chip.
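To make the exponentiator's job concrete, here is a minimal Python sketch of radix-2 Montgomery multiplication with a left-to-right exponentiation built on top of it. It is a textbook software version, not the authors' systolic-array hardware: the names `mont_mul` and `mont_exp`, the choice R = 2^k with k equal to the bit length of the modulus, and the final conditional subtraction are all assumptions made for illustration.

```python
def mont_mul(a, b, m, k):
    """Return a * b * 2**(-k) mod m, for odd m < 2**k and 0 <= a, b < m."""
    s = 0
    for i in range(k):
        s += ((a >> i) & 1) * b   # add b when bit i of a is set
        if s & 1:                 # make the sum even by adding the odd modulus
            s += m
        s >>= 1                   # exact division by 2
    return s - m if s >= m else s  # s < 2m here, so one subtraction suffices


def mont_exp(x, e, m):
    """Return x**e mod m for an odd modulus m, using Montgomery multiplication."""
    k = m.bit_length()
    r2 = pow(2, 2 * k, m)                 # precomputed R^2 mod m
    x_bar = mont_mul(x % m, r2, m, k)     # x in the Montgomery domain (x * R mod m)
    acc = mont_mul(1, r2, m, k)           # 1 in the Montgomery domain (R mod m)
    for i in reversed(range(e.bit_length())):
        acc = mont_mul(acc, acc, m, k)    # square for every exponent bit
        if (e >> i) & 1:
            acc = mont_mul(acc, x_bar, m, k)  # multiply when the bit is set
    return mont_mul(acc, 1, m, k)         # multiply by 1 to strip the factor R


print(mont_exp(65, 17, 3233) == pow(65, 17, 3233))
```

Every intermediate value carries an extra factor of R, which is why the routine converts the message into the Montgomery domain first and multiplies by 1 at the end to remove the factor again.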
RSA Integrated

The RSA integrated peripheral implements the complete RSA algorithm primitive, which includes the following operations:

■ Receiving random numbers
■ Primality checking
■ Computation of multiplicative inverse
■ Computation of modular exponentiation

All of the above operations were implemented as separate peripherals and fitted on the same chip. A C driver program then interacts with all the peripherals and passes appropriate values between them. This requires that all the peripherals are instantiated correctly in the C program. The total time taken for compilation, synthesis, fitting, placement, and routing was 13 minutes, 8 seconds. The compilation report for this integrated design was:

Family              APEX 20KE
Device              EP20K200EFC484-2X
Total LEs           6,984/8,320 (84%)
Total Pins          121/376 (32%)
Total Memory Bits   26,496/106,496 (24%)
Total PLLs          0/2 (0%)

The 48-bit RSA integrated implementation, excluding the random number receiver and the primality checker, consumed 8,239 LEs, which is almost 99% of the total available LEs on the board. Hence the final implementation was scaled down to 32 bits to accommodate the random number and primality check peripherals.

Execution Time & Throughput

The RSA algorithm has been implemented with a modulus of 32 bits, with a multi-board architecture included to receive the random numbers on the fly. However, the interrupt-driven mechanism makes the execution time difficult to measure. By simulation, the execution time and the throughput for the encryption/decryption alone can be approximated for a clock speed of 33 MHz. In RSA, encryption and decryption are carried out by modular exponentiation; for a modulus of 32 bits, this took 1,555 clock cycles, which gave a throughput of 0.68 Mbps.
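The quoted throughput follows directly from the cycle count and the clock frequency. The short Python check below assumes that one 32-bit block is produced per 1,555-cycle exponentiation; the function name `throughput_mbps` is illustrative only.

```python
def throughput_mbps(block_bits, cycles_per_block, clock_hz):
    """Bits processed per second, in Mbps, for one block per exponentiation."""
    seconds_per_block = cycles_per_block / clock_hz
    return block_bits / seconds_per_block / 1e6


# 32-bit modulus, 1,555 clock cycles per exponentiation, 33 MHz clock.
print(f"{throughput_mbps(32, 1555, 33e6):.2f} Mbps")
```

At 33 MHz, 1,555 cycles take about 47 microseconds, and 32 bits in that time is roughly 0.68 Mbps, in line with the figure above.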
Design Architecture

The system architecture entails two parts, namely:

■ Generation of random numbers using the DES random bit generator

[...]

Architecture of Random Number Receiver

The project has been implemented on Altera's APEX EP20K200EFC484-2X board, which is limited in the number of LEs available. The board has also been manufactured in such a way that it does not permit a daisy-chaining architecture to overcome this limitation. Hence, the only method available is to use the external pins on the board, connect those to another board, and exchange data between the two. This, however, has certain limitations, such as the number of bits that can be exchanged, timing issues between the two independent programs, and the need to exchange signals between the boards to coordinate communication.

A multi-board architecture was realized to exchange data between two boards connected through external pins. Due to the limitations mentioned above, a peripheral module for handling random numbers of 16 bits each was implemented. This design is completely scalable and, hardware permitting, can receive any number of bits from another board. This peripheral module has the following components:

■ Random number receiver module
■ Driver program, which receives the random numbers from the random number receiver module
■ Primality check module, based on Fermat's Theorem and utilizing the exponentiator peripheral

Random Number Receiver Module

This is a module written in Verilog HDL, and it resides on the hardware (FPGA). To receive the random numbers and to communicate with another board, 10 external pins have been mapped to this module. On eight of these external pins, the module receives the random numbers, one byte at a time.
Of the other two pins, one is used to send a "start" signal to the other board and the other to receive the "done" signal from it. A common ground is necessary for this type of data exchange. On receiving the "done" signal from the second board, this module transfers the byte received on the external pins, first to an internal register and thereafter to the driver program. After sending that byte to the driver program, it is ready to receive the next byte. The number of bytes to be received can be set at the beginning of the data exchange. On completion, it hands over control to the driver program for further processing of the random numbers received.

Random Number Receiver Block Diagram

The random number receiver has been implemented as a peripheral and is shown in the figure given below. The block diagram also shows the random number generator peripheral implemented on a different board. Both these peripherals exchange data and signals through the external pins of the Altera board. As explained earlier, these pins have been mapped to the inputs and outputs of the peripherals in the FPGA.

Driver Program

This program has been written in C and it interacts with the random number receiver module through the Nios processor. With each hardware interrupt, it activates its hardware handler subroutine and captures the byte sent into an array. It then combines two bytes at random and sends the result to the primality check module. If the primality check is positive, the driver program stores that 16-bit random prime number for later use in the RSA algorithm; otherwise it discards the number. The same process is repeated until it has at least two prime numbers of 16 bits each. These two prime numbers become p and q for the RSA algorithm. After obtaining p and q, it also computes n = pq, which is the modulus, and phi = (p-1)(q-1), which is phi(n).
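The driver flow described above can be sketched in Python as follows. This is an illustrative model, not the authors' C driver: the function names, the pairing of consecutive bytes (the text says bytes are combined "at random"), the fixed set of Fermat bases, and the requirement that p and q differ are all assumptions. A Fermat test of this kind can accept some composites (pseudoprimes), so the result is only "probably prime".

```python
def fermat_probably_prime(n, bases=(2, 3, 5, 7)):
    """Fermat test: n passes if a**(n-1) mod n == 1 for each base a."""
    if n < 2:
        return False
    for a in bases:
        if n % a == 0:
            return n == a             # divisible by a base: prime only if equal to it
        if pow(a, n - 1, n) != 1:
            return False              # a is a witness that n is composite
    return True                       # probably prime; pseudoprimes can slip through


def derive_key_material(random_bytes):
    """Build 16-bit candidates from byte pairs until two distinct primes are found."""
    primes = []
    stream = iter(random_bytes)
    for high, low in zip(stream, stream):   # consume the bytes two at a time
        candidate = (high << 8) | low
        if candidate not in primes and fermat_probably_prime(candidate):
            primes.append(candidate)
            if len(primes) == 2:
                break
    if len(primes) < 2:
        raise ValueError("not enough random bytes to find two primes")
    p, q = primes
    return p, q, p * q, (p - 1) * (q - 1)   # p, q, modulus n, phi(n)


received = bytes([0x12, 0x34, 0xFF, 0xF1, 0xFF, 0xEF])  # made-up sample bytes
print(derive_key_material(received))
```

With two 16-bit primes, the modulus n fits in 32 bits, which matches the width of the integrated RSA design.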
Primality Check Module

This module is based on Fermat's Theorem, which states that for any prime number n and any integer a with $0 \le a < n$,

$a^{n} \bmod n = a$

If $a^{n} \bmod n \neq a$, then n is not prime. By testing a sufficient number of values of a, most composite values of n can be excluded (rare pseudoprimes such as Carmichael numbers are the exception) and all primes are retained.

Another variation of Fermat's Theorem that can also be utilized to carry out a primality check is Euler's Theorem. It states that, if a is any integer and p is prime, such that gcd(p, a) = 1, then

$a^{p-1} \bmod p = 1$

With rare exceptions (pseudoprimes), this holds only if p is prime. The existing modular exponentiation architecture can be used to carry out the exponentiation required by Fermat's theorem or Euler's theorem to determine whether the number is prime. If the number is prime, the driver program retains it and hands it over to the main RSA driver routine.

Multiplicative Inverse

The multiplicative inverse of a number, over a modulus, is computed using the extended Euclidean algorithm. The algorithm performs integer division twice, for which it uses the calModulus module. This is by far the most time-consuming, as well as resource-consuming, operation in RSA. The Altera APEX EP20K200EFC484-2X board can accommodate the algorithm for computing the multiplicative inverse only up to 48 bits. The design incorporates two modules:

■ Extended Euclidean module
■ Modulus

Extended Euclidean Module

This is the top-level module, which takes as input the value of the exponent e and the value of phi. Based on the value of e, it goes through the various steps of the extended Euclidean algorithm. For each step, it sends the dividend and divisor values to the Modulus module, which performs the integer division. The Modulus module returns the remainder and quotient after the division operation.
Finally, the inverse value is returned after ascertaining that the last non-zero remainder is one, and the algorithm is executed for two steps beyond the Euclidean algorithm.

Modulus

This module is based on the non-restoring division method of calculating the modulo. It takes two inputs, the dividend and the divisor. After division, it returns the remainder and quotient to the Extended Euclidean module.

The multiplicative inverse computed by this peripheral is based on the value of phi generated, as well as the value of the exponent e chosen. The value of e chosen is the public key and the multiplicative inverse computed is the secret key, d. This value of d is then used during the decryption phase to compute the original plaintext.

The module for computing the multiplicative inverse has been implemented as a peripheral on the FPGA. The driver program sends the exponent value and the phi value to this peripheral through the Nios processor. The peripheral computes the secret key, that is, the inverse of the exponent with respect to phi, and returns it via the Nios processor to the driver program.

Modular Exponentiation

An architecture for modular exponentiation proposed by Thomas Blum and Christof Paar was chosen for implementation. It is based on Montgomery exponentiation and Montgomery modular multiplication for radix 2. It is a resource-efficient architecture suitable for implementation in FPGAs. Its design is based on an exponentiator, which handles the exponentiation and feeds values to a systolic array that computes the modular multiplication. The architecture essentially consists of two basic units, the exponentiator and the systolic array.

Exponentiator

This is the top-level module and is based on the Montgomery exponentiation algorithm.
It takes as input the following parameters:

■ Modulus m
■ Message x
■ Exponent e
■ Number of bits in exponent
■ Precomputation factor $R^{2} \bmod m$

The precomputation factor and A are fed as inputs so that all values in the intermediate stages of exponentiation are in the Montgomery domain, carrying a factor of $2^{n+2}$, where n is the number of bits in the modulus. This module first feeds the values of x and $R^{2} \bmod m$ to the systolic array for computation of $\widetilde{x}$. Thereafter, it checks the exponent bit and then feeds the appropriate values to the systolic array for multiplication. At the end it feeds the result and the value 1 to the systolic array again to obtain the final result, thereby getting rid of the additional factor of $2^{n+2}$. The final result so obtained is either the ciphertext or the plaintext, depending on whether the operation is encryption or decryption. For encryption, the exponent used is 65537, while for decryption it is the secret key d computed earlier as the multiplicative inverse.

Systolic Array

The systolic array computes the modular multiplication based on the Montgomery modular multiplication algorithm. A systolic system comprises a set of interconnected cells, each capable of performing a specified operation. The cells and the operations they perform are usually identical, and each cell takes the same time to process. Individual cells are connected only to their nearest neighbors. The flow of data between the cells is rhythmic and regular. Except for those at the boundary of the array, the cells do not communicate with the outside world. Systolic architectures are especially suited to implementing computationally bound operations. The following arithmetic operation is required to be implemented.
$$S_{i+1} = (S_i + q_i M)/2 + a_i B, \qquad q_i, a_i \in \{0,1\}$$

The above equation can be modified into

$$S_{i+1} = (S_i + q_i M + 2 a_i B)/2, \qquad q_i, a_i \in \{0,1\}$$

Instead of using two adders for computing the addition required in the above step, the sum $2B + M$ is precomputed and stored in a register. A single adder is sufficient to add $0$, $2B$, $M$, or $2B + M$ to $S_i$, depending on the values of $a_i$ and $q_i$. The same adder can also be used to precompute $2B + M$. The systolic array has the following inputs and outputs:

[...] correction factor, message, and the number of bits in the exponent from the driver program via the Nios processor in 32 bits each. The peripheral then computes the value of the exponentiation and returns it to the driver program.

Design Methodology

RSA Implementation

Altera's APEX 20KE FPGA family was chosen for implementing the RSA algorithm. APEX devices are high-density FPGAs that allow complex designs to be implemented on a single device. The target device was an EP20K200EFC484-2X. The design files were written in Verilog HDL, while compilation, synthesis, fitting, placement, and routing were carried out using Quartus II software.

Design Flow

The complete implementation of RSA in the FPGA was performed in the following stages:

1. Design entry
2. Compilation and synthesis
3. Fitting, placement, and routing
4. Interaction with the C driver program

Design Entry

The designs for the project were specified in Verilog HDL. The Verilog HDL files are the source files, giving the structural description of each of the sub-units.

Random Receiver

This contains the design file random_receiver.v, which receives, on the external pins, the random numbers generated on the other board. A total of 10 external pins were used: eight to collect the random numbers one byte at a time, and the remaining two for synchronization.
This Verilog file contains the mechanism for raising hardware interrupts and passing the received byte to the driver program for further processing.

Multiplicative Inverse

This contains the following design files:

■ calModulus.v: This module performs the division operation, given the dividend and divisor, and returns the remainder and quotient. The sizes of the inputs and outputs of this module are parameterized to facilitate easy scalability.
■ topInverse.v: This module implements the extended Euclidean algorithm for calculating the multiplicative inverse. It instantiates the calModulus.v module for performing the division operation. The inputs and outputs of this module are also parameterized.

Modular Exponentiation

This contains the following design files:

■ processing element.v: This gives the structural description of the processing element of the systolic array. The word size of the processing element is parameterized and can be altered. Each processing element computes the sum as per the algorithm.
■ systolic_array.v: This module instantiates a series of processing elements and specifies the interconnections between them in terms of inputs and outputs. It returns the result of a multiplication to the exponentiator module, based on the Montgomery modular multiplication algorithm.
■ monty_expo.v: This is the top-level module that implements the Montgomery exponentiation algorithm as a series of modular multiplications with the help of the underlying systolic array module.

Compilation & Synthesis

The design files form the input to the compilation and synthesis tool (i.e., Quartus II development software). The design files are first included in the project standard_32 directory within Quartus II software. Thereafter, a new peripheral is created for each top-level module with the help of the SOPC builder.
The system is then generated in the SOPC builder to build the user-defined peripherals along with the design files of the standard_32 directory. The operating frequency and the target device are selected when a new project is opened. Finally, the whole project is compiled and synthesized.

Fitting, Placement & Routing

Quartus II development software is also used for this purpose. The netlist file generated during compilation and synthesis forms its input. The fitter in Quartus II software assigns each logic function to the best logic cell location for routing and timing. It also selects appropriate interconnection paths and pin assignments. The final output is the standard.sof file, which contains the complete routed application.

Interaction with C Driver Program

The design files implemented in the hardware are peripherals to the Nios processor and work through the Avalon® bus signals. To write data to or read data from a peripheral, a C driver program is used. This C program is loaded in the \cpu_sdk\src project subdirectory within standard_32. The nios_build and nios_run utilities are then used to compile the C program and run it on the design files already downloaded to the FPGA. The C program includes nios.h, which in turn includes all the header files required for compilation. The peripheral created in the SOPC builder is also instantiated in the C program along with its IRQ number. When the peripheral raises an interrupt, the handler function in the C program performs the operations defined inside it. Data is written to the peripheral through the writedata Avalon signal, while data is read from the peripheral through readdata. Both writedata and readdata work on specific addresses that need to be given in the C program.

Implementation Issues

This section describes the implementation issues for this project.
Use of External Pins

For peripherals that use external pins, the additional pins are marked as export before generation in the SOPC Builder. After generation, each pin is physically assigned using the assignment editor within Quartus II software. The external pins to be assigned are selected by consulting the Nios development manual. The remaining operations are similar to those described in earlier sections. This configuration and implementation was carried out for the random_receiver module and peripheral.

16-Bit Implementation

Handling large numbers (1024 bit) makes debugging and functional verification very difficult. Also, the time taken by the software tools, especially the placement and routing (fitter) and the simulator, is extremely high. Therefore, a 16-bit exponentiator was built and tested thoroughly as a first step. The exponentiator was then scaled up from 16 bits to 512 bits.

Modular Design

The design of the exponentiator is modular, with the processing element and the systolic array implemented and tested independently. The modules were then integrated and tested together. The same is true of the other peripherals, such as the multiplicative inverse and the random number receiver. The peripherals have been designed so that all inputs and outputs are parameterized and can be changed easily without affecting any other part of the module.

Design Scalability

The exponentiator and the multiplicative inverse peripherals scale linearly and therefore require little effort to scale.

Testing & Verification

Test cases were supplied to the Verilog HDL design through the C driver program, and the results were read back in the driver program. Initial simulation and testing was carried out using iverilog, which is faster.
Testing and verification in hardware takes time, owing to the time taken for compilation, synthesis, fitting, placement, and routing by the Quartus II software.

Processing Time

An important issue associated with the implementation is the processing time of the Quartus II software. For the exponentiator, the multiplicative inverse, and the random number receiver, compilation, synthesis, and fitting take about 12 to 15 minutes.

Software Implementation

Software versions of the modular exponentiation algorithm, multiplicative inverse, modular multiplication, generation of random numbers, and multiplication of large integers were implemented in C and Java to verify the correctness of the results obtained. The Montgomery multiplication algorithm was also implemented to verify the correctness of the intermediate results during exponentiation. This was necessary since the intermediate results carry the additional factor of $2^{n+2}$ at each stage.

Design Features

The highlights of our design were:

■ Interboard communication between two Nios processors using interrupts. This entailed interrupt handling.
■ Use of peripherals around the Nios core. This facilitated quick prototyping at the design and trial stage.