
Floating Point Numbers

What's really floating here?

It's called floating point because the point in the number really does float to different spots. It is the same idea as scientific notation, where changing the exponent lets you put the point anywhere: 585.22, $5.8522 \times 10^{2}$ and $58522 \times 10^{-2}$ are all the same number.
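As a quick check of the "floating" idea, here is a minimal sketch in Python. It writes one value with the point in three different places and lets the exponent compensate. The variable names are only for illustration.

```python
# The same value written with the point in three different places.
# The exponent after "e" says how far the point has been moved.
plain = 585.22
one_digit_before_point = 5.8522e2   # 5.8522 x 10^2
no_fraction = 58522e-2              # 58522 x 10^-2

# All three literals describe the same number, so they compare equal.
print(plain == one_digit_before_point == no_fraction)
```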

Decimals vs Floating Points

Decimal is how we write numbers in base 10. A decimal number can be whole or fractional.

Floating point is just the computer's version of scientific notation.

I always assumed floating point numbers were just two integers split by a dot. That's wrong. A floating point value is stored as one value in a floating point register, with its own binary layout defined by a standard.

Converting floating numbers to binary

The integer and fractional parts are handled differently. They aren't just two integers split by a decimal point.

floating-to-binary
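To make the difference concrete, here is a minimal sketch of the usual pen-and-paper method, assuming the whole part is converted by repeated division by 2 and the fractional part by repeated multiplication by 2. The helper names `int_to_binary` and `frac_to_binary` are made up for this example, and the fraction is passed as two integers so the sketch itself does not suffer from rounding.

```python
def int_to_binary(n):
    """Whole part: divide by 2 repeatedly and collect the remainders in reverse."""
    if n == 0:
        return "0"
    digits = []
    while n > 0:
        digits.append(str(n % 2))
        n //= 2
    return "".join(reversed(digits))


def frac_to_binary(numerator, denominator, max_bits=12):
    """Fractional part (numerator/denominator < 1): multiply by 2 repeatedly
    and collect the digit that carries over to the left of the point."""
    bits = []
    for _ in range(max_bits):
        if numerator == 0:
            break                      # the fraction terminated exactly
        numerator *= 2
        bits.append(str(numerator // denominator))
        numerator %= denominator
    return "".join(bits)


print(int_to_binary(585))         # 1001001001
print(frac_to_binary(5, 8))       # 101  (0.625 terminates)
print(frac_to_binary(22, 100))    # 001110000101  (0.22 is cut off after 12 bits)
```

Note how 0.625 finishes after three bits while 0.22 still has a remainder after twelve. It never terminates in binary.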

Floating points aren't accurate

floating-point-not-accurate
Why not accurate?

The diagram above is only a mental model for why it isn't accurate.
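The inaccuracy is easy to observe. The sketch below uses Python, whose `float` is a 64 bit floating point value on typical platforms. `Decimal(x)` shows the exact value that was actually stored.

```python
from decimal import Decimal

# The classic surprise: neither 0.1, 0.2 nor 0.3 is stored exactly.
print(0.1 + 0.2 == 0.3)          # False

# Decimal(float) reveals the exact value sitting in memory.
stored_tenth = Decimal(0.1)
print(stored_tenth)              # slightly more than 0.1

# The same happens to 585.22: what is stored is only the nearest
# value the binary format can represent.
print(Decimal(585.22) == Decimal("585.22"))   # False
```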

In reality, to convert a decimal to floating point form, the computer does roughly this:

  1. If the number is 585.22, it is first rewritten as a ratio of two integers: $\frac{58522}{10^2}$ or $\frac{58522}{100}$.
  2. Then both the numerator and the denominator are converted to binary.
  3. The binary numbers are divided until the quotient has 53 bits.
  4. The result is then converted to the format described below.
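The steps above can be sketched with plain integer arithmetic. This is an illustration of the idea, not how a real conversion routine is written. The helper name `to_53_bit_quotient` is hypothetical, and it assumes a positive value that needs scaling up rather than down, which holds for 585.22.

```python
import math
from fractions import Fraction


def to_53_bit_quotient(numerator, denominator):
    """Return (q, k) so that numerator/denominator is about q / 2**k
    and q has 53 bits. Assumes a positive value with k >= 0."""
    k = 0
    # Scale the numerator up until the integer quotient reaches 53 bits.
    while (numerator << k) // denominator < 2**52:
        k += 1
    q, r = divmod(numerator << k, denominator)
    # Round to nearest, ties to even, by looking at the remainder.
    if 2 * r > denominator or (2 * r == denominator and q % 2 == 1):
        q += 1
    return q, k


ratio = Fraction(58522, 100)                  # step 1: exact ratio of integers
q, k = to_53_bit_quotient(ratio.numerator, ratio.denominator)

print(q.bit_length())                         # 53
print(k)                                      # 43
# q / 2**k is the value the float literal 585.22 actually holds.
print(math.ldexp(q, -k) == 585.22)            # True
```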

Standards for binary representation of floating numbers

Nearly all modern CPU architectures follow one standard, IEEE 754, for floating point numbers. It uses scientific notation and normalization: for normalized values, the integer part of the binary number is always exactly 1. The exponent is base 2, since the value is binary.

Ensuring 1 in integer part.

When we convert a nonzero decimal to binary, there will be a 1 at some location for sure. Normalization keeps moving the binary point, left or right, until it sits just after the first 1.

Finally, what's stored is only the sign bit + exponent + mantissa (the binary digits after the point). The main assumptions here are:

  • The integer part is understood to always be 1.
  • The sizes of the sign, exponent and mantissa fields are fixed.
  • The bias added to the exponent is known from the format size (32 or 64 bits).
  • The exponent is for base 2, since the value is binary.
Exponent can be positive or negative

The exponent itself can be positive or negative, depending on which way the point is moved to get just 1 before it: moving it left gives a positive exponent, moving it right gives a negative one.
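Normalization can be observed with Python's `math.frexp`, which splits a float into a fraction in [0.5, 1) and a power of 2. The small wrapper below, a hypothetical helper named `normalize`, shifts that into the "1 before the point" form. It assumes a positive, nonzero input.

```python
import math


def normalize(x):
    """Return (significand, exponent) with 1 <= significand < 2 and
    x == significand * 2**exponent. Assumes x > 0."""
    fraction, exponent = math.frexp(x)   # fraction is in [0.5, 1)
    return fraction * 2, exponent - 1    # move the point one more place


big = normalize(585.22)      # point moved left: positive exponent
small = normalize(0.15625)   # point moved right: negative exponent

print(big[1])     # 9
print(small)      # (1.25, -3)
```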

Adding Bias to Exponent

This standard uses scientific notation, so the exponent can be positive or negative. Still, the goal is to keep the stored exponents non-negative. That makes comparison easier: for two positive numbers, the one with the larger stored exponent is the larger number.
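One way to see the comparison benefit is to sort positive floats by their raw bits, read as unsigned integers. The sketch below uses the 32 bit format through Python's `struct` module. `bits32` is a hypothetical helper, and the claim is only checked for positive values.

```python
import struct


def bits32(x):
    """Raw 32 bit pattern of x stored as a single precision float."""
    return struct.unpack(">I", struct.pack(">f", x))[0]


values = [0.001, 0.5, 1.0, 3.75, 1024.0]      # already in increasing order
patterns = [bits32(v) for v in values]

# Because the stored exponent is never negative and sits above the
# mantissa bits, the integer patterns increase along with the values.
print(patterns == sorted(patterns))           # True
```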

Meaning of bias in this context

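In English, bias means holding an opinion that is shifted away from the truth. That's exactly what's done in IEEE 754: the actual value of the exponent is shifted (biased) by a fixed amount before it is stored.

For example, a 32 bit float has a bias of 127. The 32 bit format reserves 8 bits for the exponent, which can hold 0 to 255. Subtracting the bias maps that to actual exponents of $-127$ to $128$, that is, $2^{-127}$ to $2^{128}$. The two end values are reserved (stored 0 for zero and subnormal numbers, stored 255 for infinity and NaN), so normal numbers use $-126$ to $127$. We add 127 to the actual exponent, so only non-negative numbers are stored.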
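The fields and the bias can be inspected directly. This is a minimal sketch using Python's `struct` module to get the 32 bit pattern. `float32_fields` is a hypothetical helper, and the field widths (1, 8 and 23 bits) are those of the 32 bit format.

```python
import struct

BIAS = 127  # bias of the 32 bit format


def float32_fields(x):
    """Split the 32 bit pattern of x into (sign, stored exponent, mantissa)."""
    (bits,) = struct.unpack(">I", struct.pack(">f", x))
    sign = bits >> 31                     # 1 bit
    stored_exponent = (bits >> 23) & 0xFF  # 8 bits, bias already added
    mantissa = bits & 0x7FFFFF            # 23 bits after the point
    return sign, stored_exponent, mantissa


# 0.15625 is 1.25 * 2**-3, so the stored exponent is -3 + 127 = 124.
sign, stored, mantissa = float32_fields(0.15625)
print(sign, stored, stored - BIAS, hex(mantissa))   # 0 124 -3 0x200000

# -6.5 is -1.625 * 2**2, so the stored exponent is 2 + 127 = 129.
print(float32_fields(-6.5))                          # (1, 129, 5242880)
```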

Floating point in programming languages

When you create floating point numbers in Java, they are stored in IEEE 754 format. Most programming languages do the same, because that is the format the hardware works with.

Programming language specific implementation

In JavaScript, all values of the Number type are represented in IEEE 754 double precision format. This means that even for whole numbers, only 53 bits of precision are available.
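The 53 bit limit can be seen in any language that uses the same 64 bit format. The sketch below uses Python's `float` as a stand-in for a JavaScript number, on the assumption that both are IEEE 754 doubles. Python's `int` is arbitrary precision, so the integers themselves are exact and only the conversion to `float` loses information.

```python
limit = 2**53

# Up to 2**53 every whole number has its own float.
print(float(limit - 1) == float(limit))   # False: still distinguishable

# Past 2**53 there are not enough bits, so neighbours collapse.
print(float(limit) == float(limit + 1))   # True: 2**53 + 1 rounds to 2**53
```

This is why JavaScript's `Number.MAX_SAFE_INTEGER` is $2^{53} - 1$.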

FPU in CPU

The FPU is the CPU component that implements IEEE 754 arithmetic. The ALU is only for integers. The FPU does all floating point math.