Floating Point Multiplier - Hardware and Software Codesign - Final Project | ECE 587

Material Type: Exam; Class: Hardware/Software Codesign; Subject: Electrical and Computer Engr; University: Illinois Institute of Technology; Term: Unknown 1989;


Floating Point Multiplier
Final Project - ECE 587
Chris Babb, Jeff Blank, Ivan Castellanos, John Moskal

2 PROJECT CONTRIBUTION

Design
Christopher R. Babb:
- Overall structure of Floating Point Unit
- Detailed Design of Exponent Path
Jeffrey Blank:
- Overall structure of Floating Point Unit
- Dadda 53x53 Multiplier
- Debugging and modifying design as testing occurred to fix defects
Ivan Dario Castellanos:
- Overall structure of Floating Point Unit
- LEON/AMBA/FPU Interface
- C code FPU operation definition
- Register file interface and specification
John Moskal:
- Overall structure of Floating Point Unit
- Detailed Design of Input, Mantissa Path (except Dadda Multiplier), and Output

Implementation
Christopher R. Babb:
- Exponent Biased Addition Unit
- Exponent Ulp Addition Unit
- Exponent Trap Adjustment Unit
- Upper Level FPU Shell Unit
Jeffrey Blank:
- Creation of Dadda Multiplier and modifying it for use in the mantissa datapath
Ivan Dario Castellanos:
- Top of hierarchy FPU module, custom logic wrapper for AMBA bus interface
- Register File (CPU/FPU access, multi-port) and FSM control
- LEON interface and operation triggering mechanism
John Moskal:
- Single-To-Double Unit
- Special Cases Unit
- Sticky and Inexact Bit Generation Unit
- Postnormalization Unit
- Round Bit Generation Unit
- Rounding Unit (Ulp Adder)
- Special Output Unit
- Double-To-Single Unit
- Muxes

Testing
Christopher R. Babb:
- Testbenches for Implemented Units
- Debugging and Fixing Problems in Final Testing of Floating Point Unit
Jeffrey Blank:
- Test vectors for Dadda multiplier
- Selection of cases for general test and running them in Modelsim
- Setup and execution of Paranoia comparison of the golden file to actual multiplier results
Ivan Dario Castellanos:
- Final test vectors to be included in the report from the LEON
- Paranoia floating point vectors cross-checked with FPU results
- C code program correct functionality with the FPU
John Moskal:
- Testbenches for Implemented Units
- Excel Model for Multiplication and Normalization

Report
Christopher R. Babb - description and diagrams of:
- Abstract
- Introduction
- Overall FPU Design
- Exponent Path
- Summary
Jeffrey Blank:
- Abstract
- Multiplier Info
- Testing Section
Ivan Dario Castellanos:
- AMBA/AHB Bus interface
- Top hierarchy module, Register File and interface with the FPU
- Introduction
- Summary
John Moskal - description and diagrams of:
- Overall FPU Design
- FPU Input
- Mantissa Path (except Multiplier)
- FPU Output
- Final formatting

Others
Christopher R. Babb:
- Keeping things sane for Jeff during testing phase
Jeffrey Blank:
- Driving around with Ivan late at night trying to find somewhere open to eat and having no success
Ivan Dario Castellanos:
- Making the LEON, AMBA bus and register file interface transparent to the Floating Point computation Unit
- Problem with the LEON access to the custom logic (SOCKS kit and C code compilation bugs)
- Being a nice guy, etc.
John Moskal:
- Doing online support for mantissa path
- Generated HTML code

Figure 3 – Structure of Floating Point Multiplication Unit (block diagram not reproduced; it shows the 64-bit input registers and muxes, the Special Cases unit, the multiplier, sticky and inexact bit generation, postnormalization, exponent adjustment for overflow and underflow, round bit generation, rounding, special value detection, special output, and double-to-single rearrangement).

3.1 Input

3.1.1 Single to Double Conversion

The project specification requires the Floating Point Unit to operate on both single and double precision numbers. Internally the FPU uses double precision logic, so every single precision operand has to be converted to double precision before computation. This is performed by the Single-To-Double Conversion Unit. The conversion consists of zero-extending the significand of the number and adding the difference between the double and single precision biases (1023 - 127 = 896 decimal, or 1110000000 binary) to the exponent. The latter operation can be performed by the combinational circuit shown in Figure 3.1.1.
Figure 3.1.1 – Exponent conversion logic (adder network not reproduced; inputs SE(7)..SE(0), outputs DE(10)..DE(0)).

If the exponent is zero or at its maximal value, it is instead converted directly to zero or to the maximal double precision value, respectively.

3.1.2 Special Cases

There are several cases in which the result of the operation can be determined just by looking at the input operands. In such cases the mantissa and exponent paths can be bypassed and the result generated by the Special Output Unit.

A \ B    | Zero      | Denorm    | OK        | SNaN      | QNaN      | XNaN      | Infinity
Zero     | Zero      | Zero      | Zero      | QNaN (IO) | QNaN      | QNaN      | QNaN (IO)
Denorm   | Zero      | Zero      | Zero      | QNaN (IO) | QNaN      | QNaN      | QNaN (IO)
OK       | Zero      | Zero      | A*B       | QNaN (IO) | QNaN      | QNaN      | Infinity
SNaN     | QNaN (IO) | QNaN (IO) | QNaN (IO) | QNaN (IO) | QNaN (IO) | QNaN (IO) | QNaN (IO)
QNaN     | QNaN      | QNaN      | QNaN      | QNaN (IO) | QNaN      | QNaN      | QNaN
XNaN     | QNaN      | QNaN      | QNaN      | QNaN (IO) | QNaN      | QNaN      | QNaN
Infinity | QNaN (IO) | QNaN (IO) | Infinity  | QNaN (IO) | QNaN      | QNaN      | Infinity

Table 3.1 – Special Input Cases ("IO" marks entries that also raise the Invalid Operation flag).

3.2 Mantissa Path

3.2.1 Dadda Multiplier

The multiplier we chose was a parallel 53x53 tree-based Dadda multiplier, selected for its speed compared to array and serial multipliers. The lower 52 bits of each 53-bit operand are the fraction bits of the input operand, and the most significant bit is tied to '1' (the implicit leading one), which places each operand in the range [1, 2). This type of multiplier has three distinct stages. First, the partial products are formed into a partial product array (PPA). Next, using stages of carry-save adders, the partial product array is reduced to a height of two, as seen in the following figure for a simplified 4x4 case.
Figure 3.2.1 – Dadda multiplier reduction (4x4 dot diagram not reproduced).

Once the partial product array is reduced to a height of two, a carry-propagate adder produces the final result from the two remaining rows. All 106 bits of the result are passed on to the postnormalization unit for further processing.

3.2.2 Postnormalization Unit

The IEEE significand is defined to be in the range [1, 2), which is equivalent to the most significant bit of the significand being '1'. The output of the Dadda multiplier is the product of two 53-bit binary numbers, so it is a 106-bit binary number in the range [1, 4); the first two bits of the result represent the integer part. When the result of the multiplication is greater than or equal to 2 (most significant bit equal to '1'), the number must be normalized by shifting the result one position to the right and incrementing the exponent by one. The Postnormalization Unit is realized as a simple multiplexer. It is 54 bits wide: 52 bits for the IEEE double fraction, one extra bit in case of normalization, and another extra bit for the guard bit (needed later for rounding). The select input of the multiplexer is driven by the most significant bit of the unit's input: if that bit is '1' the result is shifted, and if it is '0' no shift is performed. The structure of the unit is shown in Figure 3.2.2.

3.3 Exponent Path

3.3.1 Biased Exponent Adder

This module takes in two biased exponents and produces a biased output. Since each true value is the exponent value minus the bias, the output can be formed by simply adding the two exponents and subtracting a single bias.
Using two's complement concepts, the bias subtraction can be folded into the addition:

      X X X X X X X X X X X
    + Y Y Y Y Y Y Y Y Y Y Y
    - 0 1 1 1 1 1 1 1 1 1 1
    = Z Z Z Z Z Z Z Z Z Z Z

which simplifies to

      X' X X X X X X X X X X      (X' = most significant bit of X, inverted)
    + Y  Y Y Y Y Y Y Y Y Y Y
    + 0  0 0 0 0 0 0 0 0 0 1
    = Z  Z Z Z Z Z Z Z Z Z Z

This function can therefore be executed by a simple 13-bit adder with the carry-in set high and the MSB of one input inverted.

This stage also performs an initial overflow check. In double precision, if the MSB of an input is high, the number it represents is larger than the bias; an overflow can only occur if two numbers larger than the bias produce a number smaller than the bias. For single precision, the result may not reach 1151, since the single precision bias added to the double precision bias is 1150 (127 + 1023) and anything above that cannot be represented in single precision. The upper four bits are therefore checked to detect an overflow.

3.3.2 Ulp Addition

There are two occasions on which an ulp may be added to the exponent: when the mantissa is normalized directly after multiplication, and when rounding results in a required normalization. The two bits are simply added to the exponent result.

This section finalizes the overflow detection. If an overflow did not occur in the previous stage, it is checked here. In double precision, an overflow occurs if the output of the biased exponent adder is above the bias but after ulp addition becomes less than the bias, or if the number reaches the infinity representation. In single precision, an overflow occurs if after ulp addition the number is greater than or equal to 1151. Underflow is detected entirely in this section.
Underflow is detected in double precision if both inputs to the biased exponent adder are below the bias but the output is zero or greater than the bias. In single precision there is an underflow if the resulting number is less than or equal to 896: this value is the double precision bias minus the single precision bias (1023 - 127), so such results cannot be represented in single precision.

3.3.3 Exponent Adjustment

If traps are not enabled, or neither overflow nor underflow is detected, this stage does nothing. Otherwise the exponent is adjusted according to the IEEE floating point standard: on overflow, 192 is subtracted for single precision numbers and 1536 for double precision numbers; on underflow, 192 is added for single precision numbers and 1536 for double precision numbers. These values are chosen by a mux and added to the final exponent.

3.4 Sign Path

3.5 Output

3.5.1 Special Output

The Special Output Unit takes care of situations where the output is the result of a special case:
- Zero
- Quiet NaN
- Overflow (with trap disabled, depending on rounding mode)
- Underflow (with trap disabled)
- Infinity
Overflow with the trap enabled is handled in the preceding units.

3.5.2 Double to Single Bit Rearrangement

This module's task is to rearrange bits from double precision format to single precision format. When the output format is single precision, the result leaves the FPU in double precision format, but arranged such that removing the three most significant exponent bits and truncating 29 bits from the mantissa yields a valid single precision number. For double precision values this unit is not part of the output path.

3.6 AMBA / LEON Interface

The AMBA bus, created by ARM, is a standard that specifies a high performance bus for microprocessor systems [4]. It supports a modular system design style and the reuse of Intellectual Property modules.
It also allows modules to be designed independently of the target system. The Floating Point Unit is interfaced with the LEON through the AMBA AHB (Advanced High-Performance Bus) interface, which is intended to provide high-speed communication with the processor.

Figure 3.6 – AMBA bus, AHB interface (diagram not reproduced: masters including the LEON, arbiter, address/control mux, decoder, read-data mux, and slaves including the Floating Point Unit mapped at address 0xA016_0000).

Figure 3.7a – AHB/FPU interface: top FPU module and Register File (diagram not reproduced).

The two-state FSM control unit receives as inputs the we, match_fpscr and reset signals, all coming from the CPU (AHB interface); the implemented FSM is shown in Figure 3.7b. The signal match_fpscr is asserted whenever the incoming address from the AHB interface matches the address of the FPSCR register. This indicates either a write, in which case a floating point operation is triggered, or a read, when the CPU fetches the FPSCR contents to obtain the flag status of a completed operation. The single output of the FSM is called fpu_cpu. It specifies whether a register in the Register File is to be written with data coming from the CPU (fpu_cpu = 0) or from the FP computation unit (fpu_cpu = 1), when the RF is to store the result of an FP operation.
Figure 3.7a (detail) – Register File, 32 x 32-bit, with CPU address/data read and write ports, two 64-bit operand read ports feeding the FPU, the FPSCR register (flags, single/double precision bit, traps and rounding mode fields), and write muxes selected by fpu_cpu (diagram not reproduced).

The FSM condition (we & match_fpscr & reset) indicates that a write to the FPSCR took place and therefore an FP operation is to start. The control asserts the fpu_cpu signal, and hence the FPU is allowed to store its value in the register file. After the result is written, the control disables further FPU access and allows write permissions only to the CPU.

Figure 3.7b – Two-state FSM controlling FPU or CPU access to the Register File.

3.8 Testing

Our testing approach had three major parts. First, selected cases were run through a testbench on just the floating point unit, minus the register file. The selected cases are presented in the table below.
Traps | Operand A          | Operand B          | RM | Result             | S/D | Inx | Unf | Ovf | DvZ | Inv
0     | CD1DFFFE00000000   | 3F80047E00000000   | 00 | CD1E058A00000000   | 0   | 1   | 0   | 0   | 0   | 0
0     | DEF77FFE00000000   | EC00037F00000000   | 10 | 7F800000000000000  | 0   | 1   | 0   | 1   | 0   | 0
0     | C0100010003FFFFF   | C1EFEFFFFFBFFFFF   | 00 | 420FF01FF03FBFBD   | 1   | 1   | 0   | 0   | 0   | 0
0     | BFFED3F255B447EA   | BE20007F7FFFFFFE   | 01 | 3E2ED4E7FEA762DA   | 1   | 1   | 0   | 0   | 0   | 0
0     | 757FFFFC0000007F   | 6AC01000400000000  | 11 | 7FEFFFFFFFFFFFFFF  | 1   | 1   | 0   | 1   | 0   | 0
0     | 228FFFC00FFFFFFF   | 802FFFFE0000000001 | 10 | 800000000000000000 | 1   | 1   | 1   | 0   | 0   | 0
0     | 7FD0041FFFFFFFFF   | 001000000000000000 | 00 | 3FF0041FFFFFFFFFF  | 1   | 1   | 0   | 0   | 0   | 0
0     | 78050000000000000  | 00F600000000000000 | 00 | 390CE0000000000000 | 1   | 0   | 0   | 0   | 0   | 0
0     | 0FA80000000000000  | 7E2800000000000000 | 00 | 4E5C80000000000000 | 0   | 0   | 0   | 0   | 0   | 0
1     | 7F280000000000000  | 7F2800000000000000 | 00 | 5EDC80000000000000 | 0   | 1   | 0   | 1   | 0   | 0

Table 3.8 – Selected Testing Results (Inx = Inexact, Unf = Underflow, Ovf = Overflow, DvZ = DivideZero, Inv = Invalid).

These selected cases were then run in C on the LEON, and the results matched what we found with the floating point VHDL testbench. The next logical step was to use Paranoia to create a larger set of test vectors and provide a "golden file" against which to check our results more quickly. We configured Modelsim to output the results in a text format, and then used an Emacs version of diff to check that the files were identical. We ran the testbench only for the single precision vectors, because the set of double precision vectors was too large to complete. In comparing the vectors to the "golden file" we found only two differences. First, there was a discrepancy between our results and the "golden file" when an underflow occurred with traps disabled: by default we set the result to zero, while the "golden file" followed another convention. The second, very rare case (about 3 cases every ~2000 vectors), which we were not able to explain, is the following: Operand A = BAFF800000000000, Operand B = BF20010000000000, Rounding Mode = 00 (round to nearest even), Result = 3A9FB10000000000, for a single precision case.
The problem is that the inexact bit is not being set when it should be. This happens for a number of apparently random cases, but never when an underflow or overflow occurs. As of now we have not determined the cause.

4 Summary

Because the unit is built outside the LEON processor, it could not take any real benefit from a pipelined structure; the memory-mapped approach used here would be difficult to pipeline. In future implementations the design could easily be pipelined thanks to its modular structure, and bringing the FPU into the processor would also reduce the instructions needed to execute a floating-point operation. As seen in the C code for the LEON, the process for multiplying two double precision numbers included 4 writes to the registers, 1 write to the control register, and 2 reads from the registers.

At certain points in the exponent path, simplification is possible. When adding either ulp to the exponent value, an entire 11-bit adder was used; to reduce area, this adder could have been stripped down to simpler logic. Likewise, in the exponent adjustment, specialized hardware could have added and subtracted the constants instead of a full adder fed by a preselected value, reducing both delay and area.

Using the Dadda multiplier allowed higher speed at a trade-off in area. While we cannot show this without simulation, our speed during this stage of execution is much faster than that of any other multiplier that does not use tables. This module, however, has the largest delay of any module in the design and thus the most impact on overall delay. The exponent modules and the mantissa ulp addition have much smaller delays due to their exclusive use of carry lookahead adders; on the other hand, these modules become very large.
In future implementations it may be possible to exchange the high speed adders in the exponent path for smaller ones to reduce area. This would reduce speed, but speed is not critical in this path: the limiting path is the mantissa path.

The AMBA proved to be a very effective way to introduce new modules into an existing system like the LEON. The AHB module included in the SOCKS kit, which interfaces the FPU with the processor, also significantly reduced the data transfer protocol work between the FPU and the processor. Nevertheless, part of the difficulty in designing the register file and the top_fpu module was that no direct control signals exist between the processor and the FPU, creating the need for the FPSCR, which indicates an FP operation and provides the flags to the processor when read. Furthermore, testing the complete module was easier using C code on the LEON than using VHDL testbenches.

One error was found using Paranoia. This error would have been very difficult to find otherwise, as it occurred only on very rare occasions, and it revealed a previously undiscovered issue with our inexact flag. The results in this case are still correct; the only error is in the inexact flag itself. Nevertheless, the Floating Point Multiplier responded correctly to all other test vectors applied, reflecting good IEEE 754 compliance.