Speech intelligibility is measured either by having real listeners identify spoken words and scoring their accuracy, or by using mathematical models that predict how much speech a person could understand in a given environment. The listener-based approach gives you a direct percentage of words or sentences correctly identified. The model-based approach produces an index or score derived from acoustic properties like background noise, reverberation, and signal strength. Both approaches are widely used, but in different contexts: clinical audiologists typically rely on listener tests, while engineers designing classrooms, airports, or phone systems use predictive models.
Subjective vs. Objective Measurement
The core distinction in speech intelligibility measurement is between subjective methods (a human listener tries to understand speech) and objective methods (an algorithm calculates a predicted score from acoustic data). For people with normal hearing, these two approaches tend to produce similar results. The gap widens for people with hearing loss, who often score better on subjective tests than objective models predict, likely because they use context clues and lip-reading cues that mathematical models don’t account for.
This matters because it determines which method is appropriate for a given situation. If you need to evaluate how well a specific person understands speech, a listener-based test is more accurate. If you need to evaluate whether a room, a speaker system, or a phone codec supports clear communication for a general population, a predictive model is faster and more practical.
Listener-Based Word and Sentence Tests
The most straightforward way to measure intelligibility is to play recorded words or sentences and ask listeners to repeat or select what they heard. The score is simply the percentage of items identified correctly.
The Modified Rhyme Test (MRT) is one of the most commonly used formats, particularly in telecommunications and aviation. It consists of lists of 50 items, each containing six rhyming one-syllable words that differ only in their opening or closing sound. For example, a listener hears one word from a set like “hold, cold, told, fold, sold, gold” and must pick the correct one from six choices displayed on a screen. Because the words are so similar, the test is highly sensitive to small differences in audio quality. The Federal Aviation Administration has used the MRT to evaluate whether voice compression technologies are clear enough for air traffic control.
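To make the scoring concrete, here is a minimal sketch in Python of how a forced-choice test like the MRT is tallied. The second word set, the presented items, and the responses are illustrative placeholders, not an official MRT list.

```python
# Minimal sketch of scoring a forced-choice rhyme test.
# Word sets and responses here are illustrative, not an official MRT list.

mrt_ensembles = [
    ("hold", "cold", "told", "fold", "sold", "gold"),
    ("pat", "pad", "pan", "path", "pack", "pass"),   # hypothetical second set
]

def score_percent_correct(presented, responses):
    """Percent of forced-choice items identified correctly."""
    correct = sum(1 for p, r in zip(presented, responses) if p == r)
    return 100.0 * correct / len(presented)

presented = ["cold", "pan"]
responses = ["cold", "pad"]   # listener picked the wrong word on item 2
print(f"Score: {score_percent_correct(presented, responses):.0f}%")  # 50%
```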
Sentence-level tests add a layer of complexity by introducing context. Some tests use “high-predictability” sentences where the meaning helps you guess the final word, and “low-predictability” sentences where it doesn’t. Comparing performance on these two types reveals how much a listener depends on context to fill in gaps, which is especially informative for people with hearing difficulties.
Testing Speech in Background Noise
Measuring intelligibility in quiet conditions only tells part of the story. Most real conversations happen against background noise, so a category of tests specifically measures how well someone understands speech when competing sounds are present.
These tests use three basic protocols. A fixed protocol plays speech at a constant signal-to-noise ratio (SNR), say with speech 5 decibels louder than the noise, and records the percent correct. An adaptive protocol adjusts the noise level based on performance, zeroing in on the exact SNR where the listener gets about 50% of items right. A progressive protocol changes the noise in one direction only, typically getting harder, regardless of how the listener performs.
The result of an adaptive test is usually an SNR50: the signal-to-noise ratio at which you understand half the material. Some tests go further and convert this into an “SNR loss” value by comparing your result to the average for people with normal hearing. SNR loss is useful because it’s more comparable across different tests than raw SNR50 scores, which vary depending on the specific words or sentences used.
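Here is a minimal sketch of how an adaptive track like this could work, ending with the SNR-loss conversion. The one-up/one-down rule, step size, starting SNR, and normative SNR50 are illustrative placeholders; real speech-in-noise batteries define their own rules and norms.

```python
# Sketch of a one-up/one-down adaptive SNR track. After a correct response
# the task gets harder (lower SNR); after an error it gets easier, so the
# track hovers near the 50%-correct point. Parameters are illustrative.

def run_adaptive_track(trial_correct, start_snr_db=5.0, step_db=2.0, n_trials=20):
    """trial_correct(snr_db) -> bool: did the listener get this item right?"""
    snr = start_snr_db
    reversal_snrs, last_direction = [], None
    for _ in range(n_trials):
        direction = -1 if trial_correct(snr) else +1
        if last_direction is not None and direction != last_direction:
            reversal_snrs.append(snr)          # track changed direction here
        snr += direction * step_db
        last_direction = direction
    # Estimate SNR50 as the mean SNR at the reversal points.
    return sum(reversal_snrs) / len(reversal_snrs)

NORMATIVE_SNR50_DB = -2.0   # hypothetical norm for listeners with typical hearing

snr50 = run_adaptive_track(lambda snr: snr > 1.0)  # toy listener: correct above +1 dB
snr_loss = snr50 - NORMATIVE_SNR50_DB
print(f"SNR50 = {snr50:+.1f} dB, SNR loss = {snr_loss:+.1f} dB")
```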
The Speech Intelligibility Index
The Speech Intelligibility Index (SII), defined by the ANSI S3.5 standard, is the primary mathematical model for predicting how much speech a person can understand. It produces a value between 0 and 1, where 0 means no speech information is audible and 1 means all of it is.
Calculating the SII requires three inputs across a range of frequency bands: the listener’s hearing thresholds, the levels of both the speech signal and any background noise, and a weighting function that reflects how important each frequency band is for understanding a particular type of speech. The calculation works by first determining how audible the speech signal is in each band (how far above the listener’s hearing threshold or the noise floor), then multiplying that audibility by the band’s importance weight. Bands where speech is both clearly audible and important to understanding contribute the most to the final score.
A key feature of the SII is that different frequency bands contribute unequally. Mid-frequency bands, where consonant sounds carry most of their information, are weighted more heavily than very low or very high bands. The standard allows calculations using different bandwidth divisions (one-third octave, one octave, or 20 equally contributing bands) depending on the application.
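To see the shape of the calculation, here is a stripped-down sketch of the audibility-times-importance idea. The band levels, thresholds, and weights below are made-up placeholders, not values from ANSI S3.5, and the full standard adds corrections (such as upward spread of masking and level distortion) that are omitted here.

```python
# Simplified SII-style calculation: audibility per band times band importance.
# All numeric values are illustrative placeholders, not ANSI S3.5 data.

def band_audibility(speech_db, noise_db, threshold_db, dynamic_range_db=30.0):
    """Fraction of the speech dynamic range audible in one band (0..1)."""
    floor = max(noise_db, threshold_db)      # whichever masks more
    audible_db = speech_db - floor
    return min(max(audible_db / dynamic_range_db, 0.0), 1.0)

def sii(speech, noise, thresholds, importance):
    return sum(
        w * band_audibility(s, n, t)
        for s, n, t, w in zip(speech, noise, thresholds, importance)
    )

# Four toy bands (low -> high frequency); weights sum to 1, with the
# mid bands weighted most heavily, as described in the text.
speech     = [55.0, 60.0, 58.0, 45.0]   # speech level per band, dB SPL
noise      = [40.0, 35.0, 30.0, 30.0]   # noise level per band, dB SPL
thresholds = [20.0, 15.0, 25.0, 40.0]   # listener hearing threshold per band
importance = [0.15, 0.35, 0.35, 0.15]

print(f"SII ~ {sii(speech, noise, thresholds, importance):.2f}")  # 0..1
```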
The Speech Transmission Index
The Speech Transmission Index (STI) is the go-to metric for evaluating rooms and public address systems. Like the SII, it runs from 0 to 1, but it's calculated from the acoustic properties of a space rather than from a listener's hearing profile. The STI captures how much a room's acoustics smear the slow intensity modulations that carry speech information, factoring in both background noise and reverberation.
The scale breaks down into practical categories: scores above 0.75 are considered excellent, 0.60 to 0.75 is good, 0.45 to 0.60 is fair, and anything below about 0.45 is poor. An STI of 0.41, for instance, indicates that a space is barely functional for spoken communication. Architects and sound engineers use STI measurements to decide whether a lecture hall, courtroom, or transit station needs acoustic treatment.
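The core of the calculation converts modulation transfer values, which describe how well those speech-like modulations survive the room, into an index. Here is a compressed sketch; the m values are hypothetical, and a full IEC 60268-16 measurement spans seven octave bands and fourteen modulation frequencies with masking and weighting corrections omitted here.

```python
import math

# Compressed sketch of the STI core: convert modulation transfer values m
# into transmission indices and average them. The m values are made up,
# and real STI adds band weighting and masking corrections.

def transmission_index(m):
    snr_apparent = 10.0 * math.log10(m / (1.0 - m))    # effective SNR in dB
    snr_clipped = max(-15.0, min(15.0, snr_apparent))  # clip to +/-15 dB
    return (snr_clipped + 15.0) / 30.0                 # map to 0..1

def sti(m_values):
    return sum(transmission_index(m) for m in m_values) / len(m_values)

def rating(value):
    # Qualification bands as described in the text (IEC 60268-16).
    for limit, label in [(0.75, "excellent"), (0.60, "good"),
                         (0.45, "fair"), (0.30, "poor")]:
        if value > limit:
            return label
    return "bad"

m = [0.9, 0.7, 0.5, 0.45, 0.4]   # hypothetical modulation transfer values
value = sti(m)
print(f"STI = {value:.2f} ({rating(value)})")
```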
How Room Acoustics Affect Scores
Two physical properties dominate how intelligible speech is in any room: reverberation time and background noise level.
Reverberation time (often measured as T20 or RT60) describes how long sound lingers after the source stops. A small clinical room with a reverberation time of 0.18 seconds provides outstanding clarity. A mid-size room at 0.22 seconds is still very good. But a reverberant room with a reverberation time of 1.38 seconds distorts and masks speech cues so severely that extended conversations become difficult. Current U.S. standards for classrooms recommend reverberation times no longer than 0.6 seconds in smaller rooms and 0.7 seconds in larger ones.
A related metric called C50 (the “clarity factor for speech”) measures the ratio, in decibels, of sound energy arriving within the first 50 milliseconds to everything arriving after. Reflections that reach your ears within that 50-millisecond window actually reinforce the original sound and improve clarity. Later reflections blur it. Values of +3 dB or greater are desirable. In testing, a small clinical room scored 19.5 dB on C50, while a reverberant room scored -1.0 dB, confirming what anyone who’s tried to have a conversation in a gymnasium already knows.
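Because C50 is defined directly as an early-to-late energy ratio, it is straightforward to compute from a measured room impulse response. The sketch below assumes the impulse response is available as raw samples starting at the direct sound's arrival, and feeds it a toy exponential decay for illustration.

```python
import math

# C50 straight from its definition: 10*log10 of the energy arriving in the
# first 50 ms of a room impulse response over the energy arriving after.

def c50_db(ir, sample_rate_hz):
    split = int(0.050 * sample_rate_hz)        # 50 ms boundary in samples
    early = sum(x * x for x in ir[:split])     # energy that reinforces speech
    late  = sum(x * x for x in ir[split:])     # energy that blurs it
    return 10.0 * math.log10(early / late)

# Toy exponentially decaying impulse response at 48 kHz.
fs = 48_000
ir = [math.exp(-t / (0.3 * fs)) for t in range(fs)]   # slow, reverberant decay
print(f"C50 = {c50_db(ir, fs):+.1f} dB")              # negative: poor clarity
```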
Background noise is the other critical factor. Classroom standards call for noise levels no higher than 35 dBA when the room is unoccupied. This threshold was chosen because typical speech levels in a classroom will exceed it, keeping the signal-to-noise ratio positive and giving listeners a reasonable chance of catching every word.
Word Error Rate for Digital Systems
When the “listener” is a machine rather than a person, intelligibility is measured using Word Error Rate (WER). This metric evaluates automatic speech recognition systems by comparing their transcript to a known-correct reference. WER counts three types of mistakes: words the system missed entirely (deletions), words it got wrong (substitutions), and extra words it inserted that weren’t spoken (insertions). The total number of errors is divided by the total number of words in the reference transcript.
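The computation itself is a word-level edit-distance alignment. Here is a minimal self-contained implementation; the reference and hypothesis strings are illustrative.

```python
# Word Error Rate via standard edit-distance alignment over words:
# the minimum number of substitutions + deletions + insertions needed to
# turn the hypothesis into the reference, divided by reference length.

def word_error_rate(reference: str, hypothesis: str) -> float:
    ref, hyp = reference.split(), hypothesis.split()
    # dp[i][j] = edit distance between ref[:i] and hyp[:j]
    dp = [[0] * (len(hyp) + 1) for _ in range(len(ref) + 1)]
    for i in range(len(ref) + 1):
        dp[i][0] = i                      # i deletions
    for j in range(len(hyp) + 1):
        dp[0][j] = j                      # j insertions
    for i in range(1, len(ref) + 1):
        for j in range(1, len(hyp) + 1):
            sub_cost = 0 if ref[i - 1] == hyp[j - 1] else 1
            dp[i][j] = min(dp[i - 1][j - 1] + sub_cost,  # substitution/match
                           dp[i - 1][j] + 1,             # deletion
                           dp[i][j - 1] + 1)             # insertion
    return dp[len(ref)][len(hyp)] / len(ref)

ref = "the cold wind blew through the hall"
hyp = "the gold wind blew through hall"   # one substitution, one deletion
print(f"WER = {word_error_rate(ref, hyp):.1%}")   # 2 errors / 7 words = 28.6%
```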
A WER of 0% means perfect transcription. In practice, modern systems achieve single-digit WER on clean studio recordings but can climb to 20% or higher in noisy, real-world conditions. WER has become the standard benchmark for comparing voice assistants, transcription services, and any technology that converts speech to text.
Choosing the Right Method
The best measurement approach depends entirely on what you’re trying to evaluate. For an individual’s hearing ability, listener-based tests with standardized word lists in noise give the most clinically meaningful results. For a room or building, STI measurements or SII calculations tell you whether the space supports clear communication before anyone walks in. For audio technology like codecs, hearing aids, or voice-over-IP systems, the Modified Rhyme Test or WER testing reveals whether the hardware preserves enough speech detail for reliable understanding.
In many professional settings, combining methods gives the fullest picture. An audiologist might use both a speech-in-noise test and an SII calculation to understand not just how a patient performs, but why. A classroom designer might measure STI to verify acoustic treatment, then run listener tests to confirm real-world performance matches the prediction.