Floating point Arithmetic, Summaries of Computer Architecture and Organization

Vellore Institute of Technology Computer Architecture and Organization

IEEE 754 format - Floating Point

Typology: Summaries

2019/2020

Uploaded on 07/07/2023

santi-santosh 🇮🇳


Floating point arithmetic

Floating Point Number Representation
• General form: ± mantissa (or significand) × Base^(±Exponent)
• Example: 0.00000000005 = 0.5 × 10^-10
• Example: 50000000000 = 5 × 10^10 (that is, 0.5 × 10^11 once the integer part is made zero, as the rules require)
• Rules:
1. The integer part should be zero.
2. In 0.d1d2…dn × Base^(±Exponent), d1 > 0 and d2 … dn ≥ 0.
• Two representation techniques:
• Single precision (32 bit)
• Double precision (64 bit)

Problems in Floating Point Arithmetic
• Mantissa overflow: The addition of two mantissas of the same sign may result in a carry out of the most significant bit. If so, the mantissa is shifted right and the exponent is incremented.
• Mantissa underflow: In the process of aligning mantissas, digits may flow off the right end of the mantissa. In such cases, truncation methods such as chopping or rounding are used.
• Exponent overflow: Exponent overflow occurs when a positive exponent exceeds the maximum possible exponent value. In some systems this may be designated as +∞ or −∞.
• Exponent underflow: Exponent underflow occurs when a negative exponent falls below the minimum possible exponent value, so the number is too small to be represented. In such cases, the number is designated as zero.

Floating-point addition-subtraction unit
[Figure: block diagram of the floating-point addition-subtraction unit for 32-bit operands A and B; legible labels include an exponent subtractor, SWAP, sign and SHIFTER.]

Floating point arithmetic: Multiplication
• The product of a pair of floating point numbers X = m_x × 2^a and Y = m_y × 2^b is XY = (m_x × m_y) × 2^(a+b)
• Algorithm:
1. Compute the exponent of the product by adding the exponents together.
2. Multiply the two mantissas.
3. Normalize and round the final product.
• Example: Multiply X = 1.000 × 2^-2 and Y = -1.010 × 2^-1
• Add exponents: (-2) + (-1) = -3
• Multiply mantissas: 1.000 × -1.010 = -1.010000
• After normalizing, the product is: -1.0100 × 2^-3

Chopping
This is the simplest method of truncation. Here, the guard bits are removed without making any changes in the retained bits. For example, to truncate a fraction from six to three bits by this method, we remove the 3 least significant (rightmost) bits.

Original number: 0.b₋₁ b₋₂ b₋₃ b₋₄ b₋₅ b₋₆
Truncated number: 0.b₋₁ b₋₂ b₋₃
(Chopping method of truncation)

In our example, the error in the 3-bit result ranges from 0 to 0.000111. The value 0.000111 is almost equal to 0.001, so in general the error in chopping ranges from 0 to almost 1 in the least significant position of the retained bits. In our example, the least significant position of the retained bits is b₋₃. In the chopping method, the error range is not symmetrical about 0, and hence the result of chopping is a biased approximation.

Von Neumann Rounding
This is the next simplest method of truncation. Here,
• If the bits to be removed are all 0s, they are simply removed, with no changes to the retained bits.
• However, if any of the bits to be removed is 1, the least significant bit of the retained bits is set to 1.
In our 6-bit to 3-bit truncation we get the results shown in Fig. 2.10.6.

Original number: 0.b₋₁ b₋₂ b₋₃ b₋₄ b₋₅ b₋₆
Truncated number with b₋₄ b₋₅ b₋₆ = 000: 0.b₋₁ b₋₂ b₋₃
Truncated number with b₋₄ b₋₅ b₋₆ ≠ 000: 0.b₋₁ b₋₂ 1

The error here lies between -1 and +1 in the least significant position of the retained bits. Because this range is symmetrical about 0, the result is an unbiased approximation.

Rounding
Rounding is the best method of truncation; however, it is the most difficult to implement. In this method, a 1 is added to the LSB position of the bits to be retained if there is a 1 in the MSB position of the bits being removed. In our 6-bit to 3-bit example we get the results shown in Fig. 2.108.

With b₋₄ = 1: 0.b₋₁ b₋₂ b₋₃ + 0.001
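The three truncation methods described above (chopping, Von Neumann rounding and rounding) can be compared side by side with a small sketch. This is only an illustration under stated assumptions: the fraction is passed as a string of bits after the binary point, and the function names `to_fraction`, `chop`, `von_neumann` and `round_lsb` are hypothetical names chosen for this example, not part of the original notes.

```python
from fractions import Fraction


def to_fraction(bits: str) -> Fraction:
    """Interpret a bit string such as '101101' as the binary fraction 0.101101."""
    return Fraction(int(bits, 2), 2 ** len(bits))


def chop(bits: str, keep: int = 3) -> Fraction:
    """Chopping: drop the guard bits, leave the retained bits unchanged."""
    return to_fraction(bits[:keep])


def von_neumann(bits: str, keep: int = 3) -> Fraction:
    """Von Neumann rounding: if any removed bit is 1, set the retained LSB to 1."""
    kept, removed = bits[:keep], bits[keep:]
    if "1" in removed:
        kept = kept[:-1] + "1"
    return to_fraction(kept)


def round_lsb(bits: str, keep: int = 3) -> Fraction:
    """Rounding: add 1 at the retained LSB if the MSB of the removed bits is 1."""
    kept, removed = bits[:keep], bits[keep:]
    result = to_fraction(kept)
    if removed.startswith("1"):
        result += Fraction(1, 2 ** keep)  # may carry up to 1.000
    return result


if __name__ == "__main__":
    # Print the truncation error (exact minus truncated) for a few 6-bit inputs.
    for bits in ("101000", "101101", "100100", "111111"):
        exact = to_fraction(bits)
        print(bits,
              exact - chop(bits),
              exact - von_neumann(bits),
              exact - round_lsb(bits))
```

Exact fractions are used instead of Python floats so that the printed errors are not themselves affected by floating point rounding. For the input 111111, the chopping error is 7/64, which is the 0.000111 upper bound quoted in the text.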
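The three-step multiplication algorithm from the multiplication section (add exponents, multiply mantissas, normalize and round) can be sketched as follows. The assumptions here are not from the notes: mantissas are signed exact fractions with magnitude in [1, 2), exponents are plain unbiased integers, the rounding step uses the "add 1 at the LSB" rule, and `fp_multiply` with its `frac_bits` parameter is a hypothetical helper invented for this example.

```python
import math
from fractions import Fraction


def fp_multiply(mx: Fraction, a: int, my: Fraction, b: int, frac_bits: int = 3):
    """Multiply (mx * 2**a) by (my * 2**b); return (mantissa, exponent).

    Assumes 1 <= |mx| < 2 and 1 <= |my| < 2 (or a zero mantissa).
    """
    # Step 1: add the exponents.
    exponent = a + b
    # Step 2: multiply the mantissas (magnitude now lies in [1, 4)).
    mantissa = mx * my
    # Step 3a: normalize so the magnitude is back in [1, 2).
    if abs(mantissa) >= 2:
        mantissa /= 2
        exponent += 1
    # Step 3b: round the magnitude to frac_bits fraction bits.
    sign = -1 if mantissa < 0 else 1
    scaled = abs(mantissa) * 2 ** frac_bits
    rounded = math.floor(scaled + Fraction(1, 2))  # +1 at LSB if first removed bit is 1
    mantissa = sign * Fraction(rounded, 2 ** frac_bits)
    # Rounding can carry out to 2.0, which needs one more normalization.
    if abs(mantissa) >= 2:
        mantissa /= 2
        exponent += 1
    return mantissa, exponent


if __name__ == "__main__":
    # X = 1.000 x 2^-2 and Y = -1.010 x 2^-1 (binary 1.010 is 5/4).
    print(fp_multiply(Fraction(1), -2, Fraction(-5, 4), -1))
```

With the inputs from the worked example, the sketch returns a mantissa of -5/4 (binary -1.010) and an exponent of -3, in agreement with the result in the notes. A real IEEE 754 unit would additionally handle biased exponents, overflow and underflow, which this sketch leaves out.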
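The exponent overflow and exponent underflow cases listed under "Problems in Floating Point Arithmetic" can be observed directly in a language whose floats follow IEEE 754. The short sketch below assumes a Python build where `float` is an IEEE 754 double-precision number, which is the usual case but is an assumption about the platform; the helper names `overflow_demo` and `underflow_demo` are hypothetical.

```python
import math
import sys


def overflow_demo():
    """Multiply the largest finite double by 2: the exponent no longer fits."""
    largest = sys.float_info.max
    return largest * 2, -largest * 2


def underflow_demo():
    """Divide the smallest positive subnormal double by 4: too small to represent."""
    smallest = 5e-324
    return smallest / 4


if __name__ == "__main__":
    print(overflow_demo())
    print(underflow_demo())
```

Under that assumption, the overflowing products are reported as positive and negative infinity, and the underflowing quotient is reported as zero, matching the designations given in the notes.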
