What Is Floating Point Representation in Computing?

Floating point representation is the standard way computers store decimal numbers, fractions, and very large or very small values. It works like scientific notation: instead of writing 300,000,000, you write 3 × 10⁸. Computers do the same thing, but in binary, splitting every number into three pieces that together can represent an enormous range of values using a fixed amount of memory.

The Three Components of a Floating Point Number

Every floating point number is stored as three fields packed into a string of bits:

  • Sign bit: A single bit that marks the number as positive (0) or negative (1).
  • Exponent: A set of bits that determines how far the decimal point shifts, controlling the scale of the number.
  • Fraction (mantissa): The remaining bits that store the actual digits of the number, determining its precision.

The final value is calculated by combining these three pieces. The sign bit sets positive or negative, the fraction provides the significant digits, and the exponent scales the result up or down by a power of 2. This is why the format is called “floating point”: the point can float to different positions depending on the exponent, rather than being fixed in one place.
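This decomposition can be seen directly in code. Here is a minimal Python sketch that unpacks a single-precision float into its three fields (the `float_bits` helper name is mine, not a standard function):

```python
import struct

def float_bits(x):
    """Unpack a 32-bit float into its sign, exponent, and fraction fields."""
    bits = struct.unpack(">I", struct.pack(">f", x))[0]  # the raw 32 bits as an integer
    sign = bits >> 31                  # 1 sign bit
    exponent = (bits >> 23) & 0xFF     # 8 exponent bits (biased)
    fraction = bits & 0x7FFFFF         # 23 fraction bits
    return sign, exponent, fraction

# -6.5 is -1.101 (binary) × 2², so: sign = 1, stored exponent = 2 + 127 = 129
print(float_bits(-6.5))  # (1, 129, 5242880)
```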

Single Precision vs. Double Precision

The IEEE 754 standard, which virtually all modern hardware follows, defines two common formats. Single precision uses 32 bits total: 1 for the sign, 8 for the exponent, and 23 for the fraction. Double precision uses 64 bits: 1 for the sign, 11 for the exponent, and 52 for the fraction.

More fraction bits mean more significant digits: double precision carries about 15–16 decimal digits of accuracy, roughly twice the 7 or so that single precision offers. More exponent bits mean a wider range of magnitudes. Double precision can handle numbers as tiny as 10⁻³⁰⁸ or as large as 10³⁰⁸, while single precision tops out around 10³⁸.

Most general-purpose programming defaults to double precision (64-bit) for everyday math. Single precision (32-bit) is common in graphics, gaming, and applications where speed and memory savings matter more than extreme accuracy.
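One way to see the precision difference, using only the standard library, is to round-trip a value through the 32-bit format. Python floats are 64-bit doubles, so packing one as "f" forces the rounding to single precision:

```python
import struct

def to_float32(x):
    """Round a 64-bit Python float to the nearest 32-bit float and back."""
    return struct.unpack(">f", struct.pack(">f", x))[0]

pi64 = 3.141592653589793   # pi at double precision (~16 digits)
pi32 = to_float32(pi64)
print(pi32)                # only the first ~7 digits still match pi
print(pi32 == pi64)        # False: precision was lost in the round trip
```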

How the Exponent Bias Works

The exponent field needs to represent both positive exponents (for large numbers) and negative exponents (for tiny fractions). Rather than reserving a separate sign bit for the exponent, the system uses a trick called biasing. A fixed number, the bias, is subtracted from the stored exponent to get the true exponent.

For single precision, the bias is 127. If the exponent field stores the value 130, the true exponent is 130 − 127 = 3, meaning the number is scaled by 2³. If it stores 120, the true exponent is 120 − 127 = −7, scaling the number by 2⁻⁷ (making it very small). For double precision, the bias is 1023. This approach splits the exponent range roughly evenly between positive and negative exponents, covering magnitudes far above and far below 1, without needing a separate sign for the exponent.
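The worked numbers above can be checked against a real float. This sketch reads out the stored exponent field of 8.0 (which is 1.0 × 2³) and un-biases it:

```python
import struct

BIAS = 127  # single-precision exponent bias

def stored_exponent(x):
    """Read the raw 8-bit exponent field of a 32-bit float."""
    bits = struct.unpack(">I", struct.pack(">f", x))[0]
    return (bits >> 23) & 0xFF

e = stored_exponent(8.0)   # 8.0 = 1.0 × 2³
print(e, e - BIAS)         # 130 3: the field stores 130, the true exponent is 3
```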

Why 0.1 + 0.2 Doesn’t Equal 0.3

One of the most common surprises with floating point is that simple-looking numbers often can’t be stored exactly. If you type 0.1 + 0.2 into most programming languages, you get something like 0.30000000000000004 instead of 0.3.

The reason is that floating point works in binary, and binary can only represent fractions whose denominator is a power of 2. Numbers like 0.5 (1/2) or 0.25 (1/4) translate perfectly. But 0.1 is 1/10, and 10 is not a power of 2. In binary, 0.1 becomes a repeating fraction that goes on forever, much like 1/3 becomes 0.333… in decimal. Since the computer only has a finite number of fraction bits, it rounds to the closest value it can store. That tiny rounding error is baked in from the moment you write the number, before any calculation even happens.
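This is easy to observe in Python, where the standard-library Decimal type can display the exact binary value a float actually stores:

```python
from decimal import Decimal

print(0.1 + 0.2)         # 0.30000000000000004
print(0.1 + 0.2 == 0.3)  # False

# The error predates any arithmetic: 0.1 itself is already stored inexactly.
print(Decimal(0.1))      # the exact value the literal 0.1 was rounded to
```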

For most purposes this error is negligible. But it accumulates over many operations, which is why financial software and scientific simulations need special strategies to manage precision. If you’re comparing floating point numbers in code, checking for exact equality (like if result == 0.3) is unreliable. Instead, you check whether the difference is smaller than some tiny threshold.
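In Python, for instance, the standard library provides math.isclose for exactly this kind of tolerance-based comparison:

```python
import math

result = 0.1 + 0.2
print(result == 0.3)              # False: exact equality fails
print(math.isclose(result, 0.3))  # True: equal within a relative tolerance
print(abs(result - 0.3) < 1e-9)   # the same idea, done by hand with a fixed threshold
```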

Special Values: Infinity and NaN

IEEE 754 reserves certain bit patterns for values that aren’t ordinary numbers. When a calculation overflows (produces a result too large to store), the system returns positive or negative infinity rather than crashing. These are represented by setting all exponent bits to their maximum value and all fraction bits to zero.

When a calculation has no meaningful result, like dividing zero by zero or subtracting infinity from infinity, the system returns NaN, short for “Not a Number.” NaN is stored with all exponent bits at maximum and at least one fraction bit set to 1. NaN has a unique property: it is not equal to anything, including itself. This makes it useful as a signal that something has gone wrong in a computation.
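Both behaviors are easy to trigger. A short Python sketch:

```python
import math

big = 1e308 * 10                   # overflows the double-precision range
print(big)                         # inf

nan = float("inf") - float("inf")  # no meaningful result
print(nan)                         # nan
print(nan == nan)                  # False: NaN is not equal to itself
print(math.isnan(nan))             # True: the reliable way to detect NaN
```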

At the other extreme, the standard also defines subnormal numbers (sometimes called denormalized numbers). These are values extremely close to zero that sacrifice some precision to represent magnitudes smaller than the normal range would allow. They prevent a gap between zero and the smallest normal number, giving calculations a more gradual “underflow” to zero.
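The gradual underflow is visible from Python, whose floats are 64-bit doubles:

```python
import sys

smallest_normal = sys.float_info.min  # smallest normal double, about 2.2e-308
print(smallest_normal / 2)            # still nonzero: a subnormal number
print(5e-324)                         # the smallest positive subnormal
print(5e-324 / 2)                     # 0.0: here the value finally underflows to zero
```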

Rounding in Floating Point

Since most real numbers can’t be represented exactly, the system has to round constantly. IEEE 754 defines several rounding options: round toward zero, round toward positive infinity, round toward negative infinity, and round to the nearest representable value. The default on most systems is round-to-nearest, which picks whichever stored value is closest to the true result. When the true result falls exactly halfway between two representable values, the standard breaks the tie by choosing the one whose last bit is zero (an even number), which prevents rounding errors from accumulating in one direction over many operations.
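Python's built-in round() applies the same round-half-to-even tie-breaking rule (here to decimal digits rather than binary bits, but the principle is identical):

```python
# Ties go to the even neighbor, so errors don't drift in one direction.
print(round(0.5), round(1.5), round(2.5), round(3.5))  # 0 2 2 4
```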

Newer Formats for AI and Machine Learning

The 32-bit and 64-bit formats dominate general computing, but machine learning has driven demand for smaller, faster formats. Training and running neural networks involves billions of multiplications where a little lost precision is acceptable if it means the work finishes faster and uses less memory.

One recent development is FP8, an 8-bit floating point format designed specifically for deep learning. It comes in two variants: E4M3, with 4 exponent bits and 3 fraction bits, and E5M2, with 5 exponent bits and 2 fraction bits. E5M2 follows the standard IEEE 754 conventions for special values. E4M3 trades away the ability to represent infinity entirely, using those bit patterns to extend its usable range instead. With only 8 bits total, these formats are far less precise than standard floats, but for tasks like training large language models, the speed and efficiency gains outweigh the accuracy tradeoff.
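As a rough illustration of how little an 8-bit format holds, here is a hypothetical decoder for the E5M2 layout (bias 15, following the usual IEEE conventions for specials and subnormals; real FP8 support lives in ML frameworks and accelerator hardware, not in this sketch):

```python
def decode_e5m2(byte):
    """Decode one E5M2 value: 1 sign bit, 5 exponent bits (bias 15), 2 fraction bits."""
    sign = -1.0 if byte & 0x80 else 1.0
    exp = (byte >> 2) & 0x1F
    frac = byte & 0x03
    if exp == 0b11111:                              # IEEE-style special values
        return sign * float("inf") if frac == 0 else float("nan")
    if exp == 0:                                    # subnormal: no implicit leading 1
        return sign * (frac / 4) * 2.0 ** (1 - 15)
    return sign * (1 + frac / 4) * 2.0 ** (exp - 15)

print(decode_e5m2(0b0_01111_00))  # 1.0 (exponent field 15, true exponent 0)
print(decode_e5m2(0b0_11110_11))  # 57344.0, the largest finite E5M2 value
```

With only four fraction values per exponent, neighboring representable numbers are far apart, which is exactly the precision these formats trade away for speed.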

Between these extremes sit formats like 16-bit half precision (1 sign bit, 5 exponent bits, 10 fraction bits) and Google’s bfloat16 (1 sign bit, 8 exponent bits, 7 fraction bits), both widely used as middle-ground options in GPU-heavy workloads. The choice of format always comes down to the same tradeoff: more bits give you more precision and range, fewer bits give you more speed and lower memory cost.