Floating point arithmetic

Floating Point Number Representation
• ± mantissa (or significand) × Base^(± exponent)
• Example: 0.00000000005 = 0.5 × 10^-10
• 50000000000 = 5 × 10^10
• Rules:
  1. The integer part should be zero.
  2. If the number is 0.d1 d2 … dn × Base^(± exponent), then d1 > 0 and d2, …, dn ≥ 0.
• Two representation techniques:
  • Single precision (32 bit)
  • Double precision (64 bit)

Problems in Floating Point Arithmetic
Mantissa overflow : The addition of two mantissas of the same sign may result in a
carryout of the most significant bit. If so, the mantissa is shifted right and the exponent
is incremented.
Mantissa underflow : In the process of aligning mantissas, digits may flow off the
right end of the mantissa. In such cases, truncation methods such as chopping and
rounding are used.
Exponent overflow : Exponent overflow occurs when a positive exponent exceeds the
maximum possible exponent value. In some systems this may be designated as +∞ or
−∞.
Exponent underflow : Exponent underflow occurs when a negative exponent falls below
the minimum possible exponent value. In such cases, the number is designated as zero.
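The interplay of alignment, truncation, and mantissa overflow can be sketched in a few lines of Python. This is a hypothetical helper (the name `fp_add` and the 4-bit mantissa width are assumptions, not part of any textbook circuit); same-sign mantissas are held as unsigned integers with value = mantissa × 2^exponent.

```python
# Simplified sketch of same-sign binary floating-point addition.
# fp_add is a hypothetical helper; mantissas are unsigned integers,
# value = mantissa * 2**exponent.

def fp_add(m1, e1, m2, e2, precision=4):
    # Align mantissas: shift the one with the smaller exponent right.
    # Bits shifted off the right end are lost (mantissa underflow,
    # handled here by simple chopping).
    if e1 < e2:
        m1 >>= (e2 - e1)
        e = e2
    else:
        m2 >>= (e1 - e2)
        e = e1
    m = m1 + m2
    # Mantissa overflow: a carry out of the most significant bit means
    # the sum needs precision+1 bits, so shift right once and
    # increment the exponent.
    if m >= (1 << precision):
        m >>= 1
        e += 1
    return m, e

# 1.000(2) * 2^0 + 1.000(2) * 2^0 overflows the 4-bit mantissa and
# renormalizes to 1.000(2) * 2^1:
print(fp_add(0b1000, 0, 0b1000, 0))  # (8, 1)
```

Note that an exponent overflow or underflow check would sit after the final increment; it is omitted here to keep the sketch focused on the mantissa cases.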
Floating-point addition-subtraction unit
[Figure: block diagram of the floating-point addition-subtraction unit. The 32-bit operands A = (S_A, E_A, M_A) and B = (S_B, E_B, M_B) feed an exponent subtractor; the mantissa of the number with the smaller exponent is routed through a shifter for alignment, the mantissa of the number with the larger exponent passes unshifted, and the sign bits drive the add/subtract logic.]
Floating point arithmetic: Multiplication
• Multiplication of a pair of floating point numbers X = m_x × 2^a and Y = m_y × 2^b is represented as XY = (m_x × m_y) × 2^(a+b)
• Algorithm:
  1. Compute the exponent of the product by adding the exponents together.
  2. Multiply the two mantissas.
  3. Normalize and round the final product.
• Example: Multiply X = 1.000 × 2^-2 and Y = -1.010 × 2^-1
  • Add exponents: (-2) + (-1) = -3
  • Multiply mantissas: 1.000 × -1.010 = -1.010000
  • After normalizing, the product is -1.0100 × 2^-3

Chopping
This is the simplest method of truncation. Here, the guard bits are
removed without making any changes in the retained bits.
For example, if we want to truncate a fraction from six bits to three bits by this
method, we have to remove the 3 least significant (rightmost) bits.
Original number:  0.b_-1 b_-2 b_-3 b_-4 b_-5 b_-6
Truncated number: 0.b_-1 b_-2 b_-3
Chopping method of truncation
In our example, the error in the 3-bit result ranges from 0 to .000111. The value
.000111 is almost equal to .001. In general terms, therefore, the error in chopping
ranges from 0 to almost 1 in the least significant position of the retained bits. In
our example, the least significant position of the retained bits is b_-3. In the
chopping method, the error range is not symmetrical about 0, and hence the result of
chopping is a biased approximation.
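The 6-bit-to-3-bit chopping above amounts to a single right shift when the fraction is held as an integer. A short Python sketch (the helper `chop` is hypothetical; fractions are stored in units of 2^-6):

```python
# Chopping: truncate a 6-bit fraction to 3 bits by dropping the
# 3 low-order (guard) bits. Fractions are integers in units of 2**-6.

def chop(frac6):
    """Drop the 3 guard bits; the result is in units of 2**-3."""
    return frac6 >> 3

x = 0b101111                 # .101111 in binary = 47/64
print(chop(x))               # 0b101 = 5, i.e. .101 = 5/8
# The discarded part .000111 is the error; it is always >= 0 and
# < 1 in the last retained position, hence the biased approximation.
print(x / 64 - chop(x) / 8)  # 0.109375, i.e. .000111 in binary
```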
Von Neumann Rounding
This is the next simplest method of truncation. Here,
• If the bits to be removed are all 0s, they are simply removed, with no changes to
the retained bits.
• However, if any of the bits to be removed is 1, the least significant bit of the
retained bits is set to 1.
In our 6-bit to 3-bit truncation we get results as shown in the Fig. 2.10.6.
• With b_-4 b_-5 b_-6 = 000: original number 0.b_-1 b_-2 b_-3 b_-4 b_-5 b_-6, truncated number 0.b_-1 b_-2 b_-3
• With b_-4 b_-5 b_-6 ≠ 000: original number 0.b_-1 b_-2 b_-3 b_-4 b_-5 b_-6, truncated number 0.b_-1 b_-2 1
The result is an unbiased approximation, since the error range is approximately symmetrical about 0.
Rounding
Rounding : It is the best method of truncation; however, it is the most difficult to
implement. In this method, a 1 is added to the LSB position of the bits to be retained
if there is a 1 in the MSB position of the bits being removed. In our 6-bit to 3-bit
example we get results as shown in Fig. 2.10.8.
• With b_-4 = 1: 0.b_-1 b_-2 b_-3 + 0.001
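This rounding rule can be sketched in Python as well (the helper `round_nearest` is hypothetical; fractions are integers in units of 2^-6):

```python
# Round-to-nearest for 6-bit -> 3-bit truncation: add 1 to the LSB of
# the retained bits when the MSB of the discarded bits (b_-4) is 1.

def round_nearest(frac6):
    kept = frac6 >> 3          # drop the 3 guard bits
    if frac6 & 0b100:          # MSB of the discarded bits (b_-4) is 1
        kept += 1              # may carry into higher retained bits
    return kept

print(round_nearest(0b101100))  # .101100 -> .110 (rounds up)
print(round_nearest(0b101011))  # .101011 -> .101 (rounds down)
```

The `kept += 1` step is what makes rounding the hardest method to implement in hardware: the addition can propagate a carry through the whole retained mantissa, whereas chopping and Von Neumann rounding touch at most the LSB.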