As per new syllabus of University of Calcutta (CCF-2022) CHEM-H-SEC3-3-TH
Chapter 1: Numbers and Precision
Significant figures
Determining the number of significant figures in measured quantities is
essential when reporting the precision of measured values and the
precision that can be reported when measured values are used in
calculations. The rules for determining the number of significant figures
are as follows:
1. All nonzero digits are significant.
o For example, the value 211.8 has four significant figures.
2. All zeros that are found between nonzero digits are significant.
o Thus, the number 20,007, with three 0s between the 2 and 7, has a
total of five significant figures.
3. Leading zeros (to the left of the first nonzero digit) are not significant.
o A value such as 0.0085, for example, has two significant figures
because the 0s before the 8 are placeholders and are not significant.
4. Trailing zeros for a whole number that ends with a decimal point are
significant.
o For example, a value written as 320. shows the decimal point,
which indicates that the 0 to the right of the 2 was measured;
therefore, the value has a total of three significant figures. If the
decimal point were not written, then 320 would have only two
significant figures. In general, any confusion this may cause can be
avoided by writing values such as these in scientific notation.
5. Trailing zeros to the right of the decimal place are significant.
o This means a value such as 12.000 has a total of five significant
figures, since the 0s after the decimal place have been measured to
be zeros, indicating they are as significant as any other nonzero
digit.
6. Exact numbers, and irrationally defined numbers like Euler’s
number (e) and pi (π), have an infinite number of significant figures.
o In a defining expression like 1 meter = 100 centimeters, these
values are considered exact and thus have an infinite number of
significant figures. While π is usually written as 3.14 for ease of
calculation, the π button on the calculator would be used in any
calculations, and thus it is considered to be a value with infinite
significant figures.
For any value written in scientific notation as $A \times 10^{x}$, the number of
significant figures is determined by applying the above rules only to the
value of A; the x is considered an exact number and thus has an infinite
number of significant figures.
o For example, the value 4,500 can be written in scientific notation to
reflect two, three, or four significant figures:
o $4.5 \times 10^{3}$ has two significant figures
o $4.50 \times 10^{3}$ has three significant figures
o $4.500 \times 10^{3}$ has four significant figures
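The counting rules above can be sketched in Python as follows. This is a minimal illustration, not a definitive implementation: the function name count_sig_figs is hypothetical, the value is assumed to be given as text (so that a trailing decimal point such as "320." is preserved), and exact numbers with infinitely many significant figures are not handled.

```python
def count_sig_figs(text: str) -> int:
    """Count significant figures in a numeral written as text."""
    s = text.strip().lower().replace(",", "").lstrip("+-")
    # In scientific notation only the coefficient A is examined.
    s = s.split("e")[0]
    has_point = "." in s
    digits = s.replace(".", "")
    # Leading zeros are placeholders and never significant.
    digits = digits.lstrip("0")
    # Trailing zeros count only when a decimal point is written.
    if not has_point:
        digits = digits.rstrip("0")
    return len(digits)


for value in ["211.8", "20,007", "0.0085", "320.", "320", "12.000", "4.50e3"]:
    print(value, count_sig_figs(value))
```

Passing the value as a string is a deliberate choice here: once a numeral is converted to a float, trailing zeros such as those in 12.000 are no longer visible.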
Calculations with significant figures
For calculations involving measured quantities, the first step in
determining the precision of the answer is to determine the number of
significant figures in each of the measured quantities. Once done, the
number of significant figures in a calculated value involving
measurements is determined based on the mathematical operation being
performed.
When two or more measured quantities are added or subtracted, the
resulting value will have the same number of decimal places as the value
with the fewest number of decimal places (the limiting value). So if the
measured values of 22.35 and 47.773 are added, the limiting value of
22.35 has two decimal places, which means that the result of the
addition will have only two decimal places: 22.35 + 47.773 = 70.123,
which is reported as 70.12.
When two or more measured quantities are being multiplied or divided,
the answer will have the same number of total significant figures as the
value with the fewest number of significant figures. So if the measured
values of 2.445 and 31.7 are being multiplied, the resulting value will
have three significant figures, since 2.445 has four significant figures,
but 31.7 has only three significant figures: 2.445 × 31.7 = 77.5065,
which is reported as 77.5.
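The two rules for calculations can be sketched with Python's decimal module. The helper names add_measured and multiply_measured are hypothetical, and the sketch assumes every digit written in the input string is significant (so a value such as 320 would be treated as having three significant figures).

```python
from decimal import Context, Decimal, ROUND_HALF_EVEN


def add_measured(a: str, b: str) -> Decimal:
    """Add two measured values, keeping the fewest decimal places."""
    x, y = Decimal(a), Decimal(b)
    # A Decimal exponent of -2 means two digits after the decimal point.
    places = max(0, min(-x.as_tuple().exponent, -y.as_tuple().exponent))
    quantum = Decimal(1).scaleb(-places)
    return (x + y).quantize(quantum, rounding=ROUND_HALF_EVEN)


def multiply_measured(a: str, b: str) -> Decimal:
    """Multiply two measured values, keeping the fewest significant figures."""
    x, y = Decimal(a), Decimal(b)
    sig = min(len(x.as_tuple().digits), len(y.as_tuple().digits))
    # The context rounds the product once, to `sig` significant digits.
    return Context(prec=sig, rounding=ROUND_HALF_EVEN).multiply(x, y)


print(add_measured("22.35", "47.773"))      # 70.12
print(multiply_measured("2.445", "31.7"))   # 77.5
```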
When a value is to be rounded off, the rules for rounding are:
1. When the digit to the right of the one being rounded to is less than 5,
the digit being rounded to stays the same (the value rounds down).
o For example, 33.742 is to be rounded to one decimal place. Here,
the 7 in the first decimal place is followed by a 4, which is less than
5, which means that 33.742 rounded to one decimal place is 33.7.
Note that only the 4 that is to the right of the 7 is looked at here;
the 2 in the third decimal place is insignificant when rounding to
one decimal place.
2. When the digit to the right of the one being rounded to is greater than 5,
the value rounds up.
o For example, 2.8763 is to be rounded to two decimal places. In this
case, the 6 in the third decimal place is greater than 5, so the 7 in
the second decimal place is rounded up to 8. This means that when
rounded to two decimal places, 2.8763 rounds to 2.88. Again, the 3
in the fourth decimal place is insignificant when rounding to two
decimal places.
3. When the digit to the right of the one being rounded to is exactly a 5
(which means no nonzero digit follows it), the value is rounded so that
the final digit is an even number. This rule is designed to avoid always
rounding up or always rounding down; it creates more balance when
rounding.
o Thus, 21.45 rounds to one decimal place to 21.4, while 36.75 would
round to 36.8.
o However, if a value such as 38.25003 is to be rounded to one
decimal place, it rounds to 38.3. This is the only type of rounding
where a digit farther than immediately to the right of the one being
rounded to is ever considered. In this example, the digit looked at
when rounding off to one decimal place is a 5. However, farther
along the decimal portion of the value there is a nonzero digit. The
number being rounded is therefore rounded up, as the 0.00003
indicates that the value of 0.05003 is larger than just the 0.05. For
this reason, the value rounded to one decimal place is 38.3, not
38.2.
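The three rounding rules correspond to what is usually called round-half-to-even. A minimal sketch using Python's decimal module is given below; the function name round_half_even is hypothetical, and values are passed as strings because a binary float such as 21.45 is not stored exactly.

```python
from decimal import Decimal, ROUND_HALF_EVEN


def round_half_even(value: str, places: int) -> Decimal:
    """Round a decimal numeral to `places` decimal places (ties go to even)."""
    quantum = Decimal(1).scaleb(-places)   # e.g. places=1 gives 0.1
    return Decimal(value).quantize(quantum, rounding=ROUND_HALF_EVEN)


print(round_half_even("33.742", 1))     # 33.7  (rule 1)
print(round_half_even("2.8763", 2))     # 2.88  (rule 2)
print(round_half_even("21.45", 1))      # 21.4  (rule 3, tie to even)
print(round_half_even("36.75", 1))      # 36.8  (rule 3, tie to even)
print(round_half_even("38.25003", 1))   # 38.3  (not an exact tie)
```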
Errors in Numerical Analysis
Numerical analysis involves developing algorithms to solve
mathematical problems approximately rather than exactly. When
dealing with real-world problems, exact solutions are often impossible
due to the complexity of the equations involved, the limitations of
computational resources, or inherent approximations in the model.
These approximations lead to errors.
Absolute Error
Absolute Error is used to measure the accuracy of a measurement by
comparing it to the true or exact value. It shows how far off a
measurement is from the actual value, without considering whether the
measured value is greater or less than the true value. It is always
non-negative. The absolute error has the same units as the measured and
true values. Absolute error does not tell us how significant
the error is relative to the true value.
Definition: Absolute error is the absolute difference between the
measured value and the true value.
The formula to calculate absolute error is:
Formula: $E_a = |X_{true} - X_{approx}|$
where $X_{true}$ is the true or exact value and
$X_{approx}$ is the approximate or measured value.
The vertical bars “| |” denote the absolute value, ensuring the error is
always non-negative.
Calculation of Absolute Error
1. Identify the True Value: Determine the exact value of the quantity.
This might be a known constant, a value from a theoretical model, or
the most accurate measurement available.
2. Identify the Approximate Value: Determine the approximate or
measured value. This could be a value obtained through
experimentation, estimation, or numerical approximation.
3. Subtract the Approximate Value from the True Value: Find the
difference between the true value and the approximate value.
4. Take the Absolute Value: Ensure the error is expressed as a non-
negative quantity by taking the absolute value of the difference.
Relative Error
Relative error is a measure of the accuracy of an approximation in
relation to the true value. It expresses the absolute error as a fraction of
the true value, providing the error’s significance compared to the
magnitude of the quantity being measured. Relative error is particularly
useful when comparing errors across different units because it is a
dimensionless quantity.
Definition: Relative Error is the ratio of the Absolute Error to the true or
exact value.
Formula: $E_r = |X_{true} - X_{approx}| / |X_{true}|$
Calculation of Relative Error
1. Determine the Absolute Error: Calculate the absolute error using
the formula.
2. Divide by the True Value: Divide the absolute error by the true
value to obtain the relative error.
3. Express as a Fraction or Percentage: The result can be left as a
fraction or multiplied by 100 to express it as a percentage.
Percentage Error
Percentage error quantifies the accuracy of a measured or estimated
value by expressing the error as a percentage of the true or exact value.
It provides a way to compare the error relative to the magnitude of the
true value, making it easier to understand the significance of the error
in context. A small percentage error means the measurement is close to
the true value while a large percentage error indicates that the
measurement is far from the true value.
Definition: Percentage Error is the ratio of the Absolute Error to the true
value, multiplied by 100. Equivalently, it is the Relative Error
multiplied by 100.
Formula: $E_p = E_r \times 100\% = (|X_{true} - X_{approx}| / |X_{true}|) \times 100\%$
Calculation of Percentage Error
1. Determine the Absolute Error: Calculate the absolute error using
the formula.
2. Divide by the True Value: Divide the absolute error by the true
value to obtain the relative error.
3. Multiply by 100: Convert the relative error to a percentage by
multiplying the result by 100.
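The three error measures can be computed as in the following sketch. The function names and the sample values (a true value of 50.0 and a measured value of 48.0) are illustrative assumptions, not data from the text.

```python
def absolute_error(x_true: float, x_approx: float) -> float:
    """Absolute difference between the true and the approximate value."""
    return abs(x_true - x_approx)


def relative_error(x_true: float, x_approx: float) -> float:
    """Absolute error as a fraction of the true value."""
    if x_true == 0:
        raise ValueError("relative error is undefined for a true value of 0")
    return absolute_error(x_true, x_approx) / abs(x_true)


def percentage_error(x_true: float, x_approx: float) -> float:
    """Relative error multiplied by 100."""
    return 100.0 * relative_error(x_true, x_approx)


x_true, x_approx = 50.0, 48.0
print(absolute_error(x_true, x_approx))
print(relative_error(x_true, x_approx))
print(percentage_error(x_true, x_approx))
```

For these sample values the absolute error is 2 units, the relative error is 0.04, and the percentage error is 4%.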
Binary Number Representation
Binary is a base-2 number system that uses two mutually exclusive
states to represent information. A binary number is made up of elements
called bits where each bit can be in one of the two possible states.
Generally, we represent them with the numerals 1 and 0. We also talk
about them being true and false. Electrically, the two states might be
represented by high and low voltages or some form of switch turned on or
off.
We build binary numbers the same way we build numbers in our
traditional base 10 system. However, instead of a one's column, a ten's
column, a hundred's column (and so on), we have a one's column, a two's
column, a four's column, an eight's column, and so on, as illustrated
below.
Power of 2   ...   $2^6$   $2^5$   $2^4$   $2^3$   $2^2$   $2^1$   $2^0$
Value        ...   64      32      16      8       4       2       1
For example, to represent the number 203 in base 10, we know we place
a 3 in the 1's column, a 0 in the 10's column and a 2 in
the 100's column. This is expressed with exponents in the table below.
203 in base 10
$10^2$   $10^1$   $10^0$
2        0        3
Or, in other words, $2 \times 10^2 + 0 \times 10^1 + 3 \times 10^0 = 200 + 0 + 3 = 203$. To represent the
same thing in binary, we would have the following table.
203 in base 2
$2^7$   $2^6$   $2^5$   $2^4$   $2^3$   $2^2$   $2^1$   $2^0$
1       1       0       0       1       0       1       1
That equates to $2^7 + 2^6 + 2^3 + 2^1 + 2^0 = 128 + 64 + 8 + 2 + 1 = 203$.
Base 2 and base 10 factors related to bytes
Name        Base 2 factor  Bytes                      Close base 10 factor  Bytes
1 Kilobyte  $2^{10}$       1,024                      $10^{3}$              1,000
1 Megabyte  $2^{20}$       1,048,576                  $10^{6}$              1,000,000
1 Gigabyte  $2^{30}$       1,073,741,824              $10^{9}$              1,000,000,000
1 Terabyte  $2^{40}$       1,099,511,627,776          $10^{12}$             1,000,000,000,000
1 Petabyte  $2^{50}$       1,125,899,906,842,624      $10^{15}$             1,000,000,000,000,000
1 Exabyte   $2^{60}$       1,152,921,504,606,846,976  $10^{18}$             1,000,000,000,000,000,000
Conversion
The easiest way to convert between bases is to use a computer; after all,
that's what they're good at! However, it is often useful to know how to do
conversions by hand.
The easiest method to convert between bases is repeated division. To
convert, repeatedly divide the quotient by the base, until the quotient is
zero, making note of the remainders at each step. Then, write the
remainders in reverse, starting at the bottom and appending to the right
each time. An example should illustrate; since we are converting to
binary we use a base of 2.
Convert 203 to binary
Division   Quotient   Remainder
203 ÷ 2    101        1
101 ÷ 2    50         1
50 ÷ 2     25         0
25 ÷ 2     12         1
12 ÷ 2     6          0
6 ÷ 2      3          0
3 ÷ 2      1          1
1 ÷ 2      0          1
Reading from the bottom and appending to the right each time gives 11001011
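The repeated-division method can be sketched in Python as follows. The function name to_binary is hypothetical; Python's built-in int(text, 2) is used only to check the result by converting back.

```python
def to_binary(n: int) -> str:
    """Convert a non-negative integer to binary by repeated division by 2."""
    if n < 0:
        raise ValueError("only non-negative integers are handled")
    if n == 0:
        return "0"
    remainders = []
    while n > 0:
        n, remainder = divmod(n, 2)      # quotient and remainder in one step
        remainders.append(str(remainder))
    # The remainders are read from the last one back to the first.
    return "".join(reversed(remainders))


print(to_binary(203))            # 11001011
print(int(to_binary(203), 2))    # 203
```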
Convert 193.379 to binary
First, split the number into its integer part (193) and fractional part (.379), then convert
each to binary separately, and finally combine the two results.
Quotient Remainder
193/2 96 1
96/2 48 0
48/2 24 0
24/2 12 0
12/2 6 0
6/2 3 0
3/2 1 1
1/2 0 1
So, in binary form 193 becomes 11000001
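The fractional part is converted by repeated multiplication by 2, noting
the integer digit of each product and reading the digits from top to bottom.
Multiplication   Product   Integer digit
0.379 × 2        0.758     0
0.758 × 2        1.516     1
0.516 × 2        1.032     1
0.032 × 2        0.064     0
0.064 × 2        0.128     0
0.128 × 2        0.256     0
0.256 × 2        0.512     0
0.512 × 2        1.024     1
So, in binary form, 0.379 becomes .01100001 (truncated to eight fractional bits).
Thus, the decimal number $(193.379)_{10}$ can be written in binary form
as approximately $(11000001.01100001)_2$.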
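The repeated-multiplication method for the fractional part can be sketched as follows. The function name fraction_to_binary is hypothetical; the Fraction type is used so that each product is computed exactly rather than with binary floating-point rounding.

```python
from fractions import Fraction


def fraction_to_binary(frac: str, bits: int) -> str:
    """Convert a decimal fraction in [0, 1) to `bits` binary digits."""
    f = Fraction(frac)
    if not 0 <= f < 1:
        raise ValueError("expected a fraction in the range [0, 1)")
    digits = []
    for _ in range(bits):
        f *= 2
        digit = int(f)        # integer digit of the product, 0 or 1
        digits.append(str(digit))
        f -= digit            # keep only the fractional part
    return "".join(digits)


print(fraction_to_binary("0.379", 8))   # 01100001
print(fraction_to_binary("0.625", 3))   # 101
```

Since 0.379 has no finite binary expansion, the number of bits requested determines where the result is truncated.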
Binary Addition
Binary addition is similar to the addition of decimal numbers, except
that a carry is generated when a column sum reaches 2 rather than 10.
For example, when we compute 7 + 9 by hand, the answer is 16. The
result has to be written with two digits, 1 and 6, because the sum of
7 + 9 is greater than the largest single decimal digit, 9, and so it
cannot be denoted by a single digit.
Similarly, whenever we add two binary digits, we
have a carry if the sum is bigger than 1 because, in binary
numbers, 1 is the highest digit. The binary addition rules are given in
the following truth table of addition.
A   B   A+B   Carry
0   0   0     0
0   1   1     0
1   0   1     0
1   1   0     1
In the above table, the first three rows are the same as in decimal
addition; only 1 + 1 differs, giving 0 with a carry of 1. The addition of
binary numbers is explained step by step below, taking 11011 and
10101 as an example.
1 1 1 1 (Carry)
1 1 0 1 1 (27)
(+) 1 0 1 0 1 (21)
____________
1 1 0 0 0 0 (48)
The binary addition rules are applied step by step below, column by column
from right to left (from the second row on, the first term is the carry):
1 + 1 => 10, i.e. 0 with a carry of 1
1 + 1 + 0 => 10, i.e. 0 with a carry of 1
1 + 0 + 1 => 10, i.e. 0 with a carry of 1
1 + 1 + 0 => 10, i.e. 0 with a carry of 1
1 + 1 + 1 => 10 + 1 => 11
Note carefully that 10 + 1 => 11, which is equal to 2 + 1 = 3. Therefore
the required result is 110000.
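The column-by-column procedure with carries can be sketched as follows. The function name add_binary is hypothetical, and the operands are assumed to be strings of 0s and 1s without a radix point.

```python
def add_binary(a: str, b: str) -> str:
    """Add two unsigned binary numerals column by column with a carry."""
    width = max(len(a), len(b))
    a, b = a.zfill(width), b.zfill(width)    # pad with leading zeros
    carry = 0
    result = []
    # Work from the rightmost (least significant) column to the left.
    for bit_a, bit_b in zip(reversed(a), reversed(b)):
        total = int(bit_a) + int(bit_b) + carry
        result.append(str(total % 2))        # digit written in this column
        carry = total // 2                   # carry into the next column
    if carry:
        result.append("1")
    return "".join(reversed(result))


print(add_binary("11011", "10101"))   # 110000 (27 + 21 = 48)
```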
Examples
The binary addition examples are shown in the following figure.
Binary Subtraction
Direct subtraction is the primary technique. In this method, the
smaller number must be subtracted from the larger one, or else
the technique will not work properly.
If the minuend is smaller than the subtrahend, switch their positions
and remember that the result will be a
negative number. The binary subtraction rules are given in the following
table of subtraction.
A B A-B Borrow
0 0 0 0
1 0 1 0
1 1 0 0
0 1 1 1
For example, in binary subtraction we subtract the subtrahend from the
minuend. Take the subtrahend 11011 and the minuend
1101101. For subtraction, arrange the two numbers so that the subtrahend
is below the minuend, as shown below.
1101101
– 11011
To get the same number of digits in the subtrahend, pad it with leading
zeros where required.
1101101
– 0011011
________
1010010
In the above binary subtraction example, the subtraction was carried out
from the right side to the left side with the help of the table
shown above. The step by step application of the binary subtraction rules is
explained below, where 1(0) denotes a 1 that has been reduced to 0 by a borrow.
Starting from the right side, we see:
1 – 1 = 0 => 0 – 1 = 1 (borrow 1) => 1(0) – 0 = 0 => 1 – 1 = 0 => 0 – 1 = 1
(borrow 1) => 1(0) – 0 = 0 => 1 – 0 = 1
So the final result will be 1010010
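The borrow-based procedure can be sketched as follows. The function name subtract_binary is hypothetical, and, as in the text, the sketch assumes the minuend is not smaller than the subtrahend.

```python
def subtract_binary(minuend: str, subtrahend: str) -> str:
    """Subtract unsigned binary numerals column by column with a borrow."""
    width = max(len(minuend), len(subtrahend))
    m, s = minuend.zfill(width), subtrahend.zfill(width)
    # For equal-length binary strings, text order matches numeric order.
    if m < s:
        raise ValueError("minuend must not be smaller than subtrahend")
    borrow = 0
    result = []
    # Work from the rightmost column to the left.
    for bit_m, bit_s in zip(reversed(m), reversed(s)):
        diff = int(bit_m) - int(bit_s) - borrow
        if diff < 0:
            diff += 2      # borrow 1 (worth 2) from the next column
            borrow = 1
        else:
            borrow = 0
        result.append(str(diff))
    return "".join(reversed(result))


print(subtract_binary("1101101", "11011"))   # 1010010 (109 - 27 = 82)
```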
Binary Multiplication
Let us consider 2 binary numbers: 101101 and 101; to multiply them we
must write as follows:
    101101
  ×    101
  ________
    101101
   000000×
+ 101101××
  ________
  11100001
Check once you are done!!
101101 = 45
101 = 5
11100001 = 225
45 × 5 = 225
Let us consider binary numbers with a radix point: 101011.01 and
0101.10; to multiply them we must write as follows:
  1 0 1 0 1 1. 0 1
×     0 1 0 1. 1 0
For multiplying, stop considering the radix point in the numbers, remove
unwanted 0s, re-write them accordingly and conduct the multiplication.
       10101101
     ×     1011
     __________
       10101101
      10101101×
     00000000××
+   10101101×××
   ____________
   11101101.111
Adjust the radix point in the answer after adding the numbers of bits
after the radix point, i.e. 2 bits + 1 bit = 3 bits.
Check once you are done!!
101011.01 = 43.25
101.1 = 5.5
11101101.111 = 237.875
And 43.25 × 5.5 = 237.875
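The shift-and-add procedure, including the radix-point adjustment, can be sketched as follows. The function name multiply_binary is hypothetical; the operands are assumed to be unsigned binary numerals given as strings, and Python integers are used to hold the partial products.

```python
def multiply_binary(a: str, b: str) -> str:
    """Multiply two unsigned binary numerals by shift-and-add."""
    # Total number of fractional bits in the two operands.
    frac_bits = sum(len(x.split(".")[1]) for x in (a, b) if "." in x)
    # Ignore the radix points while multiplying.
    multiplicand = int(a.replace(".", ""), 2)
    multiplier = int(b.replace(".", ""), 2)
    product = 0
    shift = 0
    while multiplier:
        if multiplier & 1:                     # multiplier bit is 1:
            product += multiplicand << shift   # add the shifted partial product
        multiplier >>= 1
        shift += 1
    digits = format(product, "b").zfill(frac_bits + 1)
    if frac_bits == 0:
        return digits
    # Put the radix point back, frac_bits places from the right.
    return digits[:-frac_bits] + "." + digits[-frac_bits:]


print(multiply_binary("101101", "101"))        # 11100001
print(multiply_binary("101011.01", "101.1"))   # 11101101.111
```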
Binary Division
Classification of binary representation
In general, the binary number can be represented in two ways.
1. Unsigned Binary Numbers
2. Signed Binary Numbers
Unsigned Binary Numbers
Using unsigned binary number representation, only non-negative binary
numbers can be represented. For n-bit unsigned binary numbers, all n
bits are used to represent the magnitude of the number.
For example, if we represent decimal 12 in 5-bit unsigned number form
then $(12)_{10} = (01100)_2$. Here all 5 bits are used to represent the magnitude
of the number.
In unsigned binary number representation, using n bits, we can
represent the numbers from 0 to $2^n - 1$. For example, using 4 bits we
can represent the numbers from 0 to 15 in unsigned binary number
representation.
Signed Binary Numbers
Using signed binary number representation both positive and negative
numbers can be represented.
In signed binary number representation the most significant bit (MSB) of
the number is a sign bit. For positive numbers, the sign bit is 0 and for
negative number, the sign bit is 1.
There are three different ways the signed binary numbers can be
represented.
1. Signed Magnitude Form: range from $-(2^{k-1} - 1)$ to $(2^{k-1} - 1)$, for k bits.
2. 1’s complement representation: range from $-(2^{k-1} - 1)$ to $(2^{k-1} - 1)$, for
k bits.
3. 2’s complement representation: range from $-2^{k-1}$ to $(2^{k-1} - 1)$,
for k bits.
Sign Magnitude Representation
In sign-magnitude representation, the most significant bit of the number
is a sign bit and the remaining bits represent the magnitude of the
number in true binary form. For example, if a signed number is
represented in the 8-bit sign-magnitude form then the MSB is a sign bit and
the remaining 7 bits represent the magnitude of the number in true
binary form.
Here is the representation of +34 and -34 in 8-bit sign-magnitude
form.
+34 = 0 0100010
-34 = 1 0100010
Since the magnitude of both numbers is the same, the last 7 bits in the
representation are the same for both numbers. For +34, the MSB is 0,
and for -34, the MSB or sign bit is 1.
In sign magnitude representations, there are two different
representations for 0.
Using n bits, the range of numbers that can be represented in Sign
Magnitude Representation is from $-(2^{n-1} - 1)$ to $(2^{n-1} - 1)$.
1’s Complement Representation
In 1’s complement representation, the representation of a positive
number is the same as in sign-magnitude form. But the representation of a
negative number is different.
For example, if we want to represent -34 in 8-bit 1’s complement form,
then first write the positive number (+34 = 00100010). Then invert every
bit, replacing 1s by 0s and 0s by 1s. The resulting
number (11011101) represents -34 in 1’s complement form. It is also called
the 1’s complement of the number +34.
Here is another example which shows how to represent -60 in 8-bit 1’s
complement form.
+60 = 00111100
-60 = 11000011
Using n bits, the range of numbers that can be represented in 1’s
complement form is from $-(2^{n-1} - 1)$ to $(2^{n-1} - 1)$. For example, using 4
bits, it is possible to represent integers from -7 to +7 in
1’s complement form.
Similar to sign-magnitude form, there are two different representations of
0 in 1’s complement form representation.
2’s Complement Representation
In 2’s complement representation also, the representation of a positive
number is the same as in 1’s complement and sign-magnitude form.
But the representation of a negative number is different. For example,
if we want to represent -34 in 8-bit 2’s complement form then
1. Write the number corresponding to +34 (00100010).
2. Starting from the Least Significant Bit (LSB), copy all the bits up to
and including the first 1 encountered in the number (here, the bits 10).
3. After the first ‘1’ is encountered, invert all the remaining bits, replacing 1s
with 0s and 0s with 1s (including the sign bit), so 001000 becomes 110111.
4. The resultant number (11011110) is the 2’s complement representation of the
number -34.
The second way of representing -34 in 2’s complement form is
1. Write the number corresponding to +34.
2. Find 1’s complement of +34
3. Add ‘1’ to the 1’s complement number
4. The resultant is 2’s complement representation of -34
For an n-bit number N, its 2’s complement is $(2^n - N)$. For example, the 2’s
complement of +34 in 8-bit form is $(2^8 - 34)$. In binary, it is 100000000 –
00100010 = 11011110. That is a third way of finding the 2’s
complement.
Here is the representation of -60 in 8-bit sign-magnitude form, 1’s
complement, and 2’s complement form.
Sign-magnitude: 10111100
1’s complement: 11000011
2’s complement: 11000100
Using n bits, the range of numbers which can be represented in 2’s
complement form is from $-2^{n-1}$ to $2^{n-1} - 1$. For example, using 4 bits, it
is possible to represent numbers from -8 to +7. Unlike 1’s complement
and sign-magnitude form, there is a unique way of representing 0 in
2’s complement form.
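The three signed representations can be compared with the following sketch. The function names are hypothetical, and the sketch assumes the value fits in the stated number of bits (no range checking is done).

```python
def sign_magnitude(value: int, bits: int = 8) -> str:
    """Sign bit followed by the magnitude in true binary form."""
    magnitude = format(abs(value), "b").zfill(bits - 1)
    return ("1" if value < 0 else "0") + magnitude


def ones_complement(value: int, bits: int = 8) -> str:
    """Invert every bit of the positive number to get the negative one."""
    positive = format(abs(value), "b").zfill(bits)
    if value >= 0:
        return positive
    return "".join("1" if bit == "0" else "0" for bit in positive)


def twos_complement(value: int, bits: int = 8) -> str:
    """For a negative value -N this is the bit pattern of 2**bits - N."""
    return format(value % (1 << bits), "b").zfill(bits)


for value in (34, -34, -60):
    print(value, sign_magnitude(value), ones_complement(value),
          twos_complement(value))
```

The modulo operation in twos_complement is the "third way" described above: for a negative value -N it evaluates to $2^n - N$.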
Fixed and Floating point Representation
Digital computers use the binary number system to represent all types of
information inside the computer. Alphanumeric characters are
represented using binary bits (i.e., 0 and 1). Digital representations are
easier to design, storage is easy, and accuracy and precision are greater.
There are various number systems for digital
number representation, for example the binary, octal, decimal, and
hexadecimal number systems. But the binary number system is the most
relevant and popular for
representing numbers in a digital computer system.
Storing Real Numbers
[Figure: structures used for storing real numbers]
There are two major approaches to store real numbers (i.e., numbers
with fractional component) in modern computing. These are (i) Fixed
Point Notation and (ii) Floating Point Notation. In fixed point notation,
there are a fixed number of digits after the decimal point, whereas
floating point number allows for a varying number of digits after the
decimal point.
Fixed-Point Representation −
This representation has a fixed number of bits for the integer part and for
the fractional part. For example, if the given fixed-point format is
IIII.FFFF (in decimal digits), then the minimum positive value you can store
is 0000.0001 and the maximum value is 9999.9999. There are three parts of a
fixed-point number representation: the sign field, integer field, and
fractional field.
We can represent these numbers using:
Signed magnitude representation: range from $-(2^{k-1} - 1)$ to $(2^{k-1} - 1)$, for k bits.
1’s complement representation: range from $-(2^{k-1} - 1)$ to $(2^{k-1} - 1)$, for
k bits.
2’s complement representation: range from $-2^{k-1}$ to $(2^{k-1} - 1)$,
for k bits.
2’s complement representation is preferred in computer systems
because it is unambiguous (zero has a single representation) and makes
arithmetic operations easier.
Example − Assume a number uses a 32-bit format which reserves 1 bit for
the sign, 15 bits for the integer part and 16 bits for the fractional part.
Then, -43.625 is represented as follows:
1 | 000000000101011 | 1010000000000000
Here 0 is used to represent + and 1 is used to represent -;
000000000101011 is the 15-bit binary value for decimal 43 and
1010000000000000 is the 16-bit binary value for the fraction 0.625.
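The 1 + 15 + 16 bit layout of this example can be reproduced with the following sketch. The function name encode_fixed is hypothetical, and the sketch assumes a sign bit followed by the magnitude (as in the example), not a 2’s complement encoding.

```python
def encode_fixed(value: float, int_bits: int = 15, frac_bits: int = 16) -> str:
    """Encode a value as sign | integer field | fractional field."""
    sign = "1" if value < 0 else "0"
    # Express the magnitude as a whole number of steps of size 2**-frac_bits.
    scaled = round(abs(value) * (1 << frac_bits))
    if scaled >= 1 << (int_bits + frac_bits):
        raise OverflowError("value does not fit in the integer field")
    body = format(scaled, "b").zfill(int_bits + frac_bits)
    return f"{sign} {body[:int_bits]} {body[int_bits:]}"


print(encode_fixed(-43.625))
# 1 000000000101011 1010000000000000
```

Values that are not a multiple of $2^{-16}$ are rounded to the nearest step by round, which is one way the inexact storage mentioned in the text shows up.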
The advantage of using a fixed-point representation is performance; the
disadvantage is the relatively limited range of values that it can represent.
So, it is usually inadequate for numerical analysis, as it does not allow
enough range and accuracy. A number whose representation exceeds
32 bits would have to be stored inexactly.
The smallest and largest positive numbers
which can be stored in the 32-bit format given above are as follows.
The smallest positive number is $2^{-16} \approx 0.000015$
and the largest positive number is $(2^{15} - 1) + (1 - 2^{-16}) = 2^{15} - 2^{-16} \approx 32768$;
the gap between adjacent numbers is $2^{-16}$.
We can move the radix point either left or right with the help of only
integer field is 1.
Floating-Point Representation −
This representation does not reserve a specific number of bits for the
integer part or the fractional part. Instead it reserves a certain number of
bits for the number (called the mantissa or significand) and a certain
number of bits to say where within that number the radix point sits
(called the exponent).
The floating-point representation of a number has two parts: the first
part represents a signed fixed-point number called the mantissa. The second
part designates the position of the decimal (or binary) point and is
called the exponent. The fixed-point mantissa may be a fraction or an
integer. A floating-point number is always interpreted to represent a number
in the following form: $m \times r^{e}$.
Only the mantissa m and the exponent e are physically represented in
the register (including their signs). A floating-point binary number is
represented in a similar manner except that it uses base 2 for the
exponent. A floating-point number is said to be normalized if the most
significant digit of the mantissa is 1.
So, the actual number is $(-1)^{s} (1 + m) \times 2^{(e - Bias)}$, where s is the sign bit, m is the
mantissa, e is the exponent value, and Bias is the bias number.
Note that the signed exponent is not stored in sign-magnitude form,
one’s complement form, or two’s complement
form; it is stored in biased form. This is done
because in the sign-magnitude, 1’s and 2’s complement forms the sign bit
plays a vital role and makes the values harder to handle; moreover, there is a
discontinuity in these systems, while in the bias system we find no
discontinuity. Hence the bias system is preferred.
The floating-point representation is more flexible. Any non-zero number x
can be represented in the normalized form $\pm (1.b_1 b_2 b_3 \ldots)_2 \times 2^{n}$.
Example − Suppose a number uses a 32-bit format: 1 sign bit, 8
bits for the signed exponent, and 23 bits for the fractional part. The leading
bit 1 is not stored (as it is always 1 for a normalized number) and is
referred to as a “hidden bit”.
Then −53.5 is normalized as $-53.5 = (-110101.1)_2 = (-1.101011)_2 \times 2^{5}$, which
is represented as follows:
1 | 00000101 | 10101100000000000000000
where 00000101 is the 8-bit binary value of the exponent +5, the mantissa
is only 101011 with the other 17 bits filled with 0s, and we omit the
integer part of the binary number.
Note that the 8-bit exponent field is used to store integer exponents
$-126 \le n \le 127$ (bias system).
The smallest normalized positive number that fits into 32 bits is
$(1.00000000000000000000000)_2 \times 2^{-126} = 2^{-126} \approx 1.18 \times 10^{-38}$, and the largest
normalized positive number that fits into 32 bits is
$(1.11111111111111111111111)_2 \times 2^{127} = (2^{24} - 1) \times 2^{104} \approx 3.40 \times 10^{38}$.
The precision of a floating-point format is the number of positions
reserved for binary digits plus one (for the hidden bit). In the examples
considered here the precision is 23 + 1 = 24.
The gap between 1 and the next normalized floating-point number is
known as machine epsilon. The gap is $(1 + 2^{-23}) - 1 = 2^{-23}$ for the above example,
but this is not the same as the smallest positive floating-point number because
of non-uniform spacing, unlike in the fixed-point scenario.
Note that non-terminating binary numbers cannot be represented exactly in
floating-point representation, e.g., $1/3 = (0.010101 \ldots)_2$ cannot be a floating-point
number as its binary representation is non-terminating.
IEEE Floating point Number Representation −
IEEE (Institute of Electrical and Electronics Engineers) has standardized
floating-point representation as three fields: a sign bit, an exponent
field, and a mantissa field.
So, the actual number is $(-1)^{s} (1 + m) \times 2^{(e - Bias)}$, where s is the sign bit, m is the
mantissa, e is the exponent value, and Bias is the bias number. The sign
bit is 0 for a positive number and 1 for a negative number. Exponents are
represented in biased form.
According to IEEE 754 standard, the floating-point number is
represented in following ways:
Half Precision (16 bit): 1 sign bit, 5 bit exponent, and 10 bit
mantissa
Single Precision (32 bit): 1 sign bit, 8 bit exponent, and 23 bit
mantissa
Double Precision (64 bit): 1 sign bit, 11 bit exponent, and 52 bit
mantissa
Quadruple Precision (128 bit): 1 sign bit, 15 bit exponent, and 112
bit mantissa
Special Value Representation −
There are some special values, depending on the values of the
exponent and mantissa, in the IEEE 754 standard.
All exponent bits 0 with all mantissa bits 0 represents 0. If the sign
bit is 0, then +0, else -0.
All exponent bits 1 with all mantissa bits 0 represents infinity. If the
sign bit is 0, then +∞, else -∞.
All exponent bits 0 with non-zero mantissa bits represents a
denormalized number.
All exponent bits 1 with non-zero mantissa bits represents an error
value (NaN, Not a Number).
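The single-precision fields of a value can be inspected with Python's standard struct module, as in the sketch below. The helper names float32_fields and to_float32 are hypothetical. Note that IEEE 754 stores the exponent with a bias of 127, so for −53.5 the exponent field holds 5 + 127 = 132 = 10000100 rather than the unbiased 00000101 used in the earlier worked example.

```python
import struct


def float32_fields(value: float):
    """Return (sign, exponent bits, mantissa bits) of the float32 encoding."""
    (bits,) = struct.unpack(">I", struct.pack(">f", value))
    sign = bits >> 31
    exponent = (bits >> 23) & 0xFF     # 8-bit biased exponent
    fraction = bits & 0x7FFFFF         # 23 stored mantissa bits
    return sign, format(exponent, "08b"), format(fraction, "023b")


def to_float32(value: float) -> float:
    """Round a Python float (double precision) to single precision."""
    return struct.unpack(">f", struct.pack(">f", value))[0]


print(float32_fields(-53.5))
# (1, '10000100', '10101100000000000000000')
print(float32_fields(float("inf")))      # exponent all 1s, mantissa all 0s
print(to_float32(1.0 + 2.0 ** -23) > 1.0)   # True: 2**-23 is the gap after 1
print(to_float32(1 / 3) == 1 / 3)           # False: 1/3 is stored inexactly
```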
Floating Point Arithmetic
Let two decimal numbers x and y be chosen for arithmetic operations, and let z
be the result of the arithmetic operation. Let $f_x$, $f_y$ and $f_z$ be the respective
fractional parts and $E_x$, $E_y$ and $E_z$ the respective exponents of
the decimal numbers x, y and z, such that the normalized form is $x = f_x \times 10^{E_x}$.
1. For Addition and Subtraction :
a. Set $E_z$ to the higher of $E_x$ and $E_y$ (if $E_x \ge E_y$, $E_z = E_x$).
b. Adjust the decimal point of $f_x$ or $f_y$ to match $E_z$.
c. Perform $f_x \pm f_y$ to get $f_z$.
d. The normalized value of $f_z$ must be less than 1; if it is not, shift the
decimal point to bring 0 before the decimal point and increase $E_z$ by 1.
E.g.: 1.987 => 0.1987
For Example:
i) Add 0.7642E4 and 0.4253E6.
Solution: The exponent of the number with the smaller exponent
is increased by 2, so that 0.7642E4 becomes 0.0076E6 (keeping four
digits). Then
0.7642E4 + 0.4253E6 = 0.0076E6 + 0.4253E6 = 0.4329E6
ii) Subtract 0.8542E-5 from 0.4673E-4.
Solution: The smaller exponent is E-5, so we increase the
exponent of 0.8542E-5 by 1 and it becomes 0.0854E-4; therefore
0.4673E-4 – 0.0854E-4 = 0.3819E-4.
2. For Multiplication :
a. Multiply the fractional parts: $f_z = f_x \cdot f_y$
b. Add the exponents: $E_z = E_x + E_y$
c. Then $z = f_z \times 10^{E_z}$
d. Normalize $f_z$ so that it is less than 1 and its leading digit is
non-zero, shifting the decimal point and adjusting $E_z$ accordingly.
E.g.: 1.987 => 0.1987 with the exponent increased by 1.
For Example:
Multiply 0.5634E11 × 0.1532E-14.
Solution: 0.5634 × 0.1532 = 0.08631288 and the exponents add as 11 + (-14)
= -3. Therefore, 0.5634E11 × 0.1532E-14 = 0.08631288E-3. Now the
leading digit of the mantissa should be non-zero; therefore 0.08631288E-3
becomes 0.8631288E-4 = 0.8631E-4.
3. For Division :
a. Divide the fractional parts: $f_z = f_x / f_y$
b. Subtract the exponents: $E_z = E_x - E_y$
c. Then $z = f_z \times 10^{E_z}$
d. Normalize $f_z$ so that it is less than 1 and its leading digit is
non-zero, shifting the decimal point and adjusting $E_z$ accordingly.
E.g.: 1.987 => 0.1987 with the exponent increased by 1.
For Example:
Divide 0.2000E5 by 0.8883E3.
Solution: 0.2000/0.8883 = 0.2251 and the exponents subtract as 5 – 3 = 2. Therefore,
0.2000E5 / 0.8883E3 = 0.2251E2.
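The normalized decimal arithmetic above can be sketched with Python's decimal module. All names here (DIGITS, chop, normalize, fl_add, fl_sub, fl_mul, fl_div) are hypothetical, a number is held as a pair (fraction, exponent), and the sketch assumes a four-digit mantissa in which extra digits are dropped (chopped), which is consistent with the worked examples.

```python
from decimal import Decimal, ROUND_DOWN

DIGITS = 4                               # assumed mantissa length
QUANTUM = Decimal(1).scaleb(-DIGITS)     # 0.0001


def chop(f: Decimal) -> Decimal:
    """Keep DIGITS digits after the decimal point, dropping the rest."""
    return f.quantize(QUANTUM, rounding=ROUND_DOWN)


def normalize(f: Decimal, e: int):
    """Shift the decimal point so that 0.1 <= |f| < 1, adjusting e."""
    if f == 0:
        return chop(f), 0
    while abs(f) >= 1:
        f, e = f / 10, e + 1
    while abs(f) < Decimal("0.1"):
        f, e = f * 10, e - 1
    return chop(f), e


def fl_add(x, y):
    (fx, ex), (fy, ey) = x, y
    ez = max(ex, ey)
    # Align both fractions to the larger exponent before adding.
    fx, fy = chop(fx.scaleb(ex - ez)), chop(fy.scaleb(ey - ez))
    return normalize(fx + fy, ez)


def fl_sub(x, y):
    return fl_add(x, (-y[0], y[1]))


def fl_mul(x, y):
    return normalize(x[0] * y[0], x[1] + y[1])


def fl_div(x, y):
    return normalize(x[0] / y[0], x[1] - y[1])


D = Decimal
print(fl_add((D("0.7642"), 4), (D("0.4253"), 6)))     # 0.4329, 6
print(fl_sub((D("0.4673"), -4), (D("0.8542"), -5)))   # 0.3819, -4
print(fl_mul((D("0.5634"), 11), (D("0.1532"), -14)))  # 0.8631, -4
print(fl_div((D("0.2000"), 5), (D("0.8883"), 3)))     # 0.2251, 2
```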