What Is Information Theory? A Simple Explanation

Information theory is the mathematical study of how information is quantified, stored, and transmitted. Developed by Claude Shannon in his landmark 1948 paper “A Mathematical Theory of Communication,” it provides a universal framework for measuring the amount of information in a message and determining the limits of how efficiently that message can be sent through any communication channel. Shannon’s work is so foundational that he’s widely considered the father of the digital age.

The Core Idea: Measuring Surprise

Before Shannon, there was no rigorous way to measure information. His key insight was that information is fundamentally about uncertainty. The more uncertain you are about what comes next, the more information you gain when you find out. A coin flip carries more information than being told the sun will rise tomorrow, because the coin flip could go either way.

Shannon formalized this with a quantity called entropy, borrowing the term from physics. Entropy measures the average amount of surprise in a message source. If you’re watching a weather report in a desert where it’s sunny 99% of the time, hearing “sunny” tells you almost nothing. Hearing “rain” would be genuinely surprising and therefore carries much more information. Shannon’s entropy calculation accounts for these differences by weighting each possible outcome by how likely it is.
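The weighting described above can be written directly as Shannon's entropy formula, H = −Σ p·log2(p). A minimal sketch, using the desert weather report as the example (the 99%/1% split is the illustrative figure from the text):

```python
import math

def entropy(probs):
    """Shannon entropy in bits: H = -sum(p * log2(p)) over outcomes with p > 0."""
    return -sum(p * math.log2(p) for p in probs if p > 0)

# A fair coin flip: two equally likely outcomes, maximum uncertainty.
print(entropy([0.5, 0.5]))    # 1.0 bit

# The desert weather report: sunny 99% of the time, rain 1%.
print(entropy([0.99, 0.01]))  # ~0.08 bits -- very little surprise on average
```

The rare "rain" outcome individually carries a lot of surprise, but because it almost never happens, the average information per report stays close to zero.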

The unit of information is the bit, short for “binary digit,” a term suggested by mathematician J. W. Tukey. One bit is the amount of information you get from a single yes-or-no answer when both options are equally likely. Shannon showed that any message source has a fundamental entropy rate, a number that captures how much genuine information each symbol carries on average.

English Has Built-In Redundancy

Shannon himself applied these ideas to the English language. If every letter of the alphabet appeared with equal frequency and no patterns existed between letters, each character would carry about 4.7 bits of information (the base-2 logarithm of 26). But English is full of patterns. The letter “q” is almost always followed by “u.” After “th,” an “e” is far more likely than a “z.” These patterns make English partially predictable.

When Shannon accounted for statistical patterns extending across sequences of about eight letters, the entropy dropped to roughly 2.3 bits per letter. When he considered long-range patterns spanning up to 100 letters (the kind of predictability that lets you guess the end of a sentence), the entropy fell to about one bit per letter. That means English is roughly 75% redundant. Three-quarters of what we write is, in a mathematical sense, predictable from context. This redundancy is why you can read a text message full of typos and still understand it perfectly.
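A toy version of this estimate can be run on any text: count single-letter frequencies and compute the entropy they imply. This only captures first-order statistics (Shannon's deeper estimates used multi-letter context), and the sample string here is just an illustration, so the exact number will vary with the input:

```python
import math
from collections import Counter

def unigram_entropy(text):
    """Entropy in bits per letter, estimated from single-letter frequencies only."""
    letters = [c for c in text.lower() if c.isalpha()]
    counts = Counter(letters)
    n = len(letters)
    return -sum((k / n) * math.log2(k / n) for k in counts.values())

sample = ("information theory is the mathematical study of how information "
          "is quantified stored and transmitted")
print(round(unigram_entropy(sample), 2))  # already below log2(26) = 4.70
```

Even this crude first-order estimate lands below the 4.7-bit ceiling; modeling pairs, triples, and longer context drives the number down further, toward Shannon's figure of roughly one bit per letter.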

The Communication Model

Shannon didn’t just measure information in the abstract. He built a model for how communication actually works. His system breaks any communication process into a chain of components: a sender who originates a message, a channel (the medium carrying it, whether a copper wire, radio wave, or optical fiber), noise that interferes with the signal along the way, and a receiver who decodes what arrives.

This model applies whether you’re talking about a phone call, a satellite link, or a conversation across a noisy room. The critical question Shannon asked was: given a channel with a certain amount of noise, what is the maximum rate at which you can send information and still have it arrive correctly?

Channel Capacity: The Speed Limit

Shannon proved that every communication channel has a maximum information rate, called its channel capacity. You can think of it as a speed limit. No matter how clever your encoding scheme, you cannot reliably send data faster than this limit. But here’s the remarkable part: Shannon also proved you can get arbitrarily close to that limit with virtually zero errors, as long as you use the right coding strategy.

The formula for channel capacity depends on two things: the bandwidth of the channel (how many signals per second it can carry) and the signal-to-noise ratio (how strong your signal is compared to the background interference). A wider channel with less noise can carry more information. This relationship, known as the Shannon-Hartley theorem, sets the theoretical ceiling for every communication system ever built, from dial-up modems to 5G networks. Engineers designing these systems are essentially trying to get as close to Shannon’s limit as physics and math will allow.
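The Shannon-Hartley theorem can be stated compactly as C = B·log2(1 + S/N). A quick sketch, using an illustrative 3 kHz channel at 30 dB signal-to-noise ratio (numbers chosen for the example, roughly in the range of an analog telephone line):

```python
import math

def channel_capacity(bandwidth_hz, snr_linear):
    """Shannon-Hartley theorem: C = B * log2(1 + S/N), in bits per second."""
    return bandwidth_hz * math.log2(1 + snr_linear)

snr = 10 ** (30 / 10)  # convert 30 dB to a linear power ratio (1000)
print(channel_capacity(3000, snr))  # ~29,900 bits per second
```

Note that capacity grows linearly with bandwidth but only logarithmically with signal power, which is why engineers generally prize extra bandwidth over simply transmitting louder.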

Data Compression

One of the most practical consequences of information theory is data compression. Shannon proved that the entropy of a source is the absolute minimum number of bits per symbol needed to represent it without losing any information. You simply cannot compress data below its entropy rate and still recover the original perfectly.

This result gave rise to lossless compression techniques like Huffman coding, which assigns shorter codes to more common symbols and longer codes to rare ones. When you zip a file on your computer, algorithms descended from these ideas are at work. They exploit the redundancy in your data (repeated patterns, predictable sequences) to represent the same information in fewer bits. The entropy rate is the floor: the theoretical best any lossless compression can achieve.
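The core of Huffman coding, repeatedly merging the two least frequent symbols, can be sketched in a few lines. This is a simplified illustration, not a production encoder:

```python
import heapq
from collections import Counter

def huffman_codes(text):
    """Assign each symbol a binary codeword; frequent symbols get shorter codes."""
    counts = Counter(text)
    # Heap entries: [frequency, tiebreaker, [(symbol, code-so-far), ...]]
    heap = [[freq, i, [(sym, "")]] for i, (sym, freq) in enumerate(counts.items())]
    heapq.heapify(heap)
    tiebreak = len(heap)
    while len(heap) > 1:
        lo = heapq.heappop(heap)   # least frequent subtree
        hi = heapq.heappop(heap)   # next least frequent
        # Prefix one subtree's codes with 0, the other's with 1, then merge.
        merged = [(s, "0" + c) for s, c in lo[2]] + [(s, "1" + c) for s, c in hi[2]]
        heapq.heappush(heap, [lo[0] + hi[0], tiebreak, merged])
        tiebreak += 1
    return dict(heap[0][2])

codes = huffman_codes("abracadabra")
# 'a' appears 5 times and gets a shorter code than the one-off 'c' and 'd'.
print(codes)
```

The resulting code lengths track each symbol's probability, which is exactly how the average output length approaches the source's entropy.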

Lossy compression, used in formats like JPEG and MP3, goes further by deliberately discarding information your senses are unlikely to notice. This crosses below the entropy threshold, which is why you can never perfectly reconstruct the original from a lossy file. The tradeoff between file size and perceptible quality is, at its root, an information-theoretic problem.

Error Correction

The flip side of compression is error correction. While compression removes redundancy to save space, error correction adds redundancy back in to protect against noise. If a bit gets flipped during transmission, the extra redundancy lets the receiver detect and fix the mistake.

The simplest example: send every bit three times. If the receiver gets “1, 1, 0,” it takes a majority vote and concludes the original bit was 1. This works because the chance of two or more independent errors is much smaller than the chance of one. Real-world error correction codes are far more sophisticated, but they all rely on the same principle. By carefully structuring the redundancy, engineers can detect multiple errors and correct them without asking the sender to retransmit. The strength of a code depends on the Hamming distance between its valid codewords: the minimum number of bit positions in which any two codewords differ. A code with a minimum Hamming distance of 3 can correct any single-bit error.
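The triple-repetition scheme described above fits in a few lines (note that it is a distance-3 code: the only two codewords per bit, 000 and 111, differ in all three positions):

```python
def encode_repetition(bits, n=3):
    """Send each bit n times -- the simplest error-correcting code."""
    return [b for b in bits for _ in range(n)]

def decode_repetition(received, n=3):
    """Majority vote over each group of n copies."""
    return [int(sum(received[i:i + n]) > n // 2)
            for i in range(0, len(received), n)]

sent = encode_repetition([1, 0, 1])  # [1, 1, 1, 0, 0, 0, 1, 1, 1]
corrupted = sent[:]
corrupted[1] = 0                     # noise flips one copy of the first bit
print(decode_repetition(corrupted))  # [1, 0, 1] -- the single error is corrected
```

The cost is steep: the channel carries three bits for every one bit of message. Practical codes like those in Wi-Fi and deep-space links achieve the same protection with far less overhead.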

Shannon’s channel coding theorem guarantees that codes exist which can achieve near-perfect reliability at any transmission rate below the channel capacity. Finding practical codes that approach this limit took decades, but modern codes used in cell phones, Wi-Fi, and deep-space communication now come remarkably close.

Applications Beyond Engineering

Information theory started as a communications engineering framework, but its reach has expanded dramatically. In neuroscience, researchers use Shannon’s tools to measure how much information nerve cells transmit and to map the direction of information flow between brain regions. Early work in the 1950s proposed theoretical limits on how much data a single nerve cell could carry. More recent studies have used information-theoretic measures to decode neural firing patterns in the auditory systems of songbirds and to trace how different cortical regions coordinate during motor tasks, with functional brain imaging data analyzed using the same entropy and mutual information concepts Shannon developed for telephone lines.

In genetics, information theory helps quantify the structure and redundancy in DNA sequences. In machine learning, entropy-based measures guide decision trees, and a concept called cross-entropy serves as the loss function that many modern AI systems optimize during training. The compression perspective also connects to machine learning: a model that compresses its training data well has, in an information-theoretic sense, learned the underlying patterns rather than memorizing noise.
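The cross-entropy mentioned above measures how many bits it costs, on average, to encode outcomes drawn from the true distribution p using a code optimized for a model's predicted distribution q; it bottoms out at the entropy of p when the model is exactly right. A minimal sketch with made-up distributions:

```python
import math

def cross_entropy(true_dist, predicted_dist):
    """H(p, q) = -sum(p * log2(q)): average bits to encode draws from p
    using a code built for q. Equals the entropy of p only when q == p."""
    return -sum(p * math.log2(q)
                for p, q in zip(true_dist, predicted_dist) if p > 0)

p = [0.7, 0.2, 0.1]
print(cross_entropy(p, p))                  # the floor: the entropy of p
print(cross_entropy(p, [0.1, 0.2, 0.7]))    # higher: the model's guesses are off
```

Training a model to minimize cross-entropy is therefore, in Shannon's terms, training it to compress the data as efficiently as possible.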

Even fields like ecology, linguistics, and economics have adopted Shannon’s entropy as a standard measure of diversity or uncertainty. Wherever there is data with statistical structure, information theory offers a principled way to quantify what’s there, what’s redundant, and what’s been lost.