Docsity

Floating point Arithmetic, Summaries of Computer Architecture and Organization

IEEE 754 format - Floating Point

Typology: Summaries

2019/2020

Uploaded on 07/07/2023

santi-santosh 🇮🇳

Partial preview of the text

Floating point arithmetic

Floating Point Number Representation

• ± mantissa (or significand) × Base^{±Exponent}
• (Ex.) 0.00000000005 = 0.5 × 10^{-10}
• 50000000000 = 0.5 × 10^{11}
• Rules:
  1. The integer part should be zero.
  2. In 0.d1d2…dn × Base^{±Exponent}, d1 > 0 and all of d2…dn ≥ 0.
• Two representation techniques:
  • Single precision (32 bit)
  • Double precision (64 bit)

Problems in Floating Point Arithmetic

Mantissa overflow: The addition of two mantissas of the same sign may result in a carry out of the most significant bit. If so, the mantissa is shifted right and the exponent is incremented.

Mantissa underflow: In the process of aligning mantissas, digits may flow off the right end of the mantissa. In such cases, truncation methods such as chopping or rounding are used.

Exponent overflow: Exponent overflow occurs when a positive exponent exceeds the maximum possible exponent value. In some systems this may be designated as +∞ or −∞.

Exponent underflow: Exponent underflow occurs when a negative exponent falls below the minimum possible exponent value, that is, when its magnitude exceeds the largest representable magnitude. In such cases, the number is designated as zero.

Floating-point addition-subtraction unit

[Figure: block diagram of the floating-point addition-subtraction unit. Labels recoverable from the preview: 32-bit operands A and B with sign, exponent, and mantissa fields; a subtractor; SWAP; SHIFTER; sign.]

Floating point arithmetic: Multiplication

• Multiplication of a pair of floating point numbers X = m_x × 2^a and Y = m_y × 2^b is represented as XY = (m_x × m_y) × 2^{a+b}
• Algorithm:
  1. Compute the exponent of the product by adding the exponents together.
  2. Multiply the two mantissas.
  3. Normalize and round the final product.
• Example: Multiply X = 1.000 × 2^{-2} and Y = −1.010 × 2^{-1}
  • Add exponents: (−2) + (−1) = −3
  • Multiply mantissas: 1.000 × (−1.010) = −1.010000
  • After normalizing, the product is −1.0100 × 2^{-3}

Chopping

This is the simplest method of truncation.
Here, the guard bits are removed without making any changes in the retained bits. For example, to truncate a fraction from six bits to three by this method, we remove the 3 least significant (rightmost) bits.

Original number:  0.b_{-1} b_{-2} b_{-3} b_{-4} b_{-5} b_{-6}
Truncated number: 0.b_{-1} b_{-2} b_{-3}
Chopping method of truncation

In our example, the error in the 3-bit result ranges from 0 to 0.000111. The value 0.000111 is almost equal to 0.001, so in general terms the error in chopping ranges from 0 to almost 1 in the least significant position of the retained bits. In our example, the least significant retained bit is b_{-3}. In the chopping method, the error range is not symmetrical about 0, and hence the result of chopping is a biased approximation.

Von Neumann Rounding

This is the next simplest method of truncation. Here,
• If the bits to be removed are all 0s, they are simply removed, with no changes to the retained bits.
• However, if any of the bits to be removed is 1, the least significant bit of the retained bits is set to 1.

In our 6-bit to 3-bit truncation we get the results shown in Fig. 2.10.6.

• With b_{-4} b_{-5} b_{-6} = 000: the original number 0.b_{-1} b_{-2} b_{-3} 000 is truncated to 0.b_{-1} b_{-2} b_{-3}
• With b_{-4} b_{-5} b_{-6} ≠ 000: the original number 0.b_{-1} b_{-2} b_{-3} b_{-4} b_{-5} b_{-6} is truncated to 0.b_{-1} b_{-2} 1

The result is an unbiased approximation: the error ranges from almost −1 to almost +1 in the least significant position of the retained bits, which is symmetrical about 0.

Rounding

It is the best method of truncation; however, it is the most difficult to implement. In this method, a 1 is added to the LSB position of the bits to be retained if there is a 1 in the MSB position of the bits being removed. In our 6-bit to 3-bit example we get the results shown in Fig. 2.108.

With b_{-4} = 1: 0.b_{-1} b_{-2} b_{-3} + 0.001
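
The three truncation methods described above (chopping, Von Neumann rounding, and rounding) can be compared on the same 6-bit to 3-bit example with a short sketch. This is only an illustration under stated assumptions: the fraction 0.b_{-1}…b_{-6} is stored as a plain 6-bit integer, and the names `chop`, `von_neumann`, `round_nearest`, and `fmt` are hypothetical helpers, not part of the original notes.

```python
"""Sketch of three truncation methods for a 6-bit binary fraction.

Assumption: the fraction 0.b-1 b-2 ... b-6 is stored as the integer
b-1 b-2 ... b-6, so 0b100101 stands for 0.100101. All names are illustrative.
"""

TOTAL_BITS = 6                    # bits in the original fraction
KEPT_BITS = 3                     # bits retained after truncation
DROPPED = TOTAL_BITS - KEPT_BITS  # number of bits removed


def chop(frac: int) -> int:
    """Chopping: discard the removed bits, leave the retained bits unchanged."""
    return frac >> DROPPED


def von_neumann(frac: int) -> int:
    """Von Neumann rounding: set the retained LSB to 1 if any removed bit is 1."""
    kept = frac >> DROPPED
    removed = frac & ((1 << DROPPED) - 1)
    return (kept | 1) if removed else kept


def round_nearest(frac: int) -> int:
    """Rounding: add 1 at the retained LSB if the MSB of the removed bits is 1.

    The sum can carry out of the fraction (0.111100 becomes 1.000), so the
    result may need one bit more than KEPT_BITS.
    """
    kept = frac >> DROPPED
    msb_removed = (frac >> (DROPPED - 1)) & 1
    return kept + msb_removed


def fmt(value: int, width: int = KEPT_BITS) -> str:
    """Format an integer as a binary fraction with `width` fraction bits."""
    return f"{value >> width}.{value & ((1 << width) - 1):0{width}b}"


if __name__ == "__main__":
    for frac in (0b100000, 0b100011, 0b100101):
        print(
            fmt(frac, TOTAL_BITS),
            "chop:", fmt(chop(frac)),
            "von Neumann:", fmt(von_neumann(frac)),
            "round:", fmt(round_nearest(frac)),
        )
```

For 0.100011, chopping gives 0.100, Von Neumann rounding gives 0.101 (a removed bit is 1), and rounding gives 0.100 (b_{-4} is 0), which shows how the three methods can disagree on the same input.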

Copyright © 2024 Ladybird Srl - Via Leonardo da Vinci 16, 10126, Torino, Italy - VAT 10816460017 - All rights reserved