What Is MOS Score? Mean Opinion Score Explained

A MOS score, or Mean Opinion Score, is a numerical rating from 1 to 5 that captures how real people perceive the quality of audio, video, or a phone call. It’s the standard way industries like telecommunications, VoIP, and streaming evaluate quality of experience. A score of 5 means excellent, and 1 means completely unusable.

The 1 to 5 Scale

The MOS scale uses five simple categories:

  • 5 (Excellent): No noticeable problems. The audio or video feels natural and clear.
  • 4 (Good): Perceptible but not annoying imperfections. Most users are satisfied.
  • 3 (Fair): Noticeable issues that are slightly annoying. Quality is acceptable but not great.
  • 2 (Poor): Annoying degradation. Conversations become difficult or video is hard to watch.
  • 1 (Bad): Severe distortion or near-total breakdown. Essentially unusable.

The scale was originally developed for rating telephone call quality and is standardized by the International Telecommunication Union (ITU) under Recommendation P.800. That standard lays out formal methods for running listening tests in controlled lab environments. Over time, the same 1-to-5 framework has been adopted for video streaming, video conferencing, and any situation where perceived quality matters.

How MOS Is Calculated

The math behind MOS is straightforward. A group of people listens to audio clips or watches video samples, and each person rates the quality on the 1-to-5 scale. The final MOS for a given sample is simply the average of all those individual ratings. If 20 listeners rate a phone call and their scores add up to 78, the MOS is 3.9.
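That calculation can be sketched in a few lines of Python. The ratings below are made up to match the example in the text (20 listeners, scores summing to 78):

```python
# Hypothetical ratings from 20 listeners for one audio sample
# (illustrative numbers, not real test data).
ratings = [4] * 14 + [3] * 4 + [5] * 2

# MOS is simply the arithmetic mean of the individual ratings.
mos = sum(ratings) / len(ratings)
print(f"MOS: {mos:.1f}")  # MOS: 3.9
```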

Alongside the average, analysts also track the number of votes and the standard deviation, which reveals how much listeners agreed with each other. A MOS of 3.5 where everyone rated it a 3 or 4 tells a different story than a MOS of 3.5 where half the group said 5 and half said 2. The average alone doesn’t capture that spread, so the standard deviation matters when interpreting results.
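The two scenarios described above can be reproduced with the standard library’s statistics module. Both hypothetical panels below average to 3.5, but their standard deviations tell very different stories:

```python
import statistics

# Two hypothetical panels that both average out to MOS 3.5,
# but with very different levels of agreement.
consensus = [3, 4, 3, 4, 3, 4, 3, 4]  # everyone rated 3 or 4
split     = [5, 2, 5, 2, 5, 2, 5, 2]  # half said 5, half said 2

for name, votes in [("consensus", consensus), ("split", split)]:
    mos = statistics.mean(votes)
    sd = statistics.stdev(votes)
    print(f"{name}: MOS={mos:.1f}, stdev={sd:.2f}")
```

The split panel’s standard deviation is roughly three times larger, which is the disagreement the average alone hides.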

Subjective vs. Objective MOS

There are two fundamentally different ways to arrive at a MOS value: asking humans or asking an algorithm.

Subjective MOS comes from real people in a controlled test. For audio, this is labeled MOS-LQS (listening quality, subjective), and it follows ITU-T P.800 procedures where listeners rate clips on the five-point scale. For video, the equivalent is MOS-VQS, and for combined audio and video, MOS-AVQS. These human-derived scores are the gold standard because they directly capture what the score was designed to measure: the experience of a real person.

Objective MOS is generated by software algorithms that attempt to predict what humans would rate. These are labeled with an “O” instead of an “S,” so you’ll see MOS-LQO for listening quality or MOS-VQO for video quality. The algorithms analyze technical characteristics of the signal, like distortion, delay, and packet loss, then output a predicted score on the same 1-to-5 scale. Objective models are far cheaper and faster than gathering a room full of human testers, which is why they’re used heavily in real-time network monitoring.

Common Algorithmic Models

Two algorithms dominate objective speech quality measurement. PESQ (Perceptual Evaluation of Speech Quality), defined in ITU-T standard P.862, was the workhorse for years. It works by comparing a clean reference signal to the version that came through the network, then calculating how much degradation occurred. PESQ handles a range of real-world problems including different speech volumes, compression artifacts, delays, packet loss, and background noise. It was originally designed for narrowband audio, the frequency range of traditional phone calls (roughly 300 to 3,400 Hz).

POLQA (Perceptual Objective Listening Quality Assessment) is the newer successor, standardized as ITU-T P.863. It addresses several limitations of PESQ and works in two modes: narrowband and super-wideband, covering frequencies from 50 to 14,000 Hz. That wider range matters because HD voice calls and modern VoIP services transmit richer audio than old telephone lines did. POLQA also shows better overall correlation with subjective MOS scores, meaning its predictions more closely match what real listeners would say.

Both algorithms are “full reference” models, meaning they need access to the original clean signal to compare against. This makes them ideal for lab testing and quality assurance but less practical for live monitoring where you don’t have the original signal handy. Other approaches exist for “no reference” estimation, where the algorithm works only with the received signal.
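A minimal sketch of the full-reference idea, using a plain signal-to-noise ratio as a crude stand-in: compare the clean reference against the degraded copy and quantify the difference. Real models like PESQ and POLQA use perceptual models rather than raw SNR, and the 1 kHz test tone and noise level here are arbitrary assumptions:

```python
import math
import random

def snr_db(reference, degraded):
    """SNR in dB between a clean reference and a degraded copy.
    A crude stand-in for the full-reference concept, not a MOS model."""
    signal_power = sum(r * r for r in reference)
    noise_power = sum((r - d) ** 2 for r, d in zip(reference, degraded))
    if noise_power == 0:
        return float("inf")  # identical signals: no degradation at all
    return 10 * math.log10(signal_power / noise_power)

# A made-up 1 kHz tone sampled at 8 kHz, plus a copy with small added noise.
random.seed(0)
ref = [math.sin(2 * math.pi * 1000 * n / 8000) for n in range(8000)]
deg = [s + random.gauss(0, 0.05) for s in ref]

print(f"SNR: {snr_db(ref, deg):.1f} dB")
```

Note the defining constraint: the function needs both signals. A no-reference estimator would have to work from the degraded signal alone.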

What Counts as a Good Score

In VoIP and telecommunications, a MOS of 4.0 or higher is generally considered good quality. Most people won’t notice issues at that level. Scores between 3.5 and 4.0 are acceptable for business calls, though some users will perceive minor artifacts. Below 3.5, complaints start rising noticeably. Below 3.0, the experience degrades to the point where it interferes with communication.
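Those bands translate directly into a threshold check, the kind a monitoring system might apply. The cutoffs below follow the common VoIP rules of thumb described above, not a formal standard, and the function name is a hypothetical one:

```python
def rate_call_quality(mos):
    """Map a MOS value to rough quality bands (common VoIP practice,
    not a formal standard)."""
    if mos >= 4.0:
        return "good"
    if mos >= 3.5:
        return "acceptable"
    if mos >= 3.0:
        return "complaints likely"
    return "interferes with communication"

print(rate_call_quality(4.2))  # good
print(rate_call_quality(3.2))  # complaints likely
```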

Perfect 5.0 scores are essentially impossible in practice because no transmission system is completely transparent. Even high-quality codecs introduce tiny compression artifacts, so each codec has a maximum achievable MOS that acts as the practical ceiling for any call using it. For context, traditional landline phone calls typically scored around 4.0 to 4.2 on MOS, a benchmark sometimes called “toll quality.” Modern HD voice calls can score higher because they carry a wider frequency range that makes speech sound more natural.

For video, the same 1-to-5 scale applies, but the factors influencing the score shift. Resolution, frame rate, buffering interruptions, and compression artifacts all play a role. A 1080p stream running smoothly will score higher than the same content at 480p with buffering pauses, just as a viewer would expect.

Why Human Perception Varies

One reason MOS relies on averaging across multiple people is that quality perception isn’t identical from person to person. Differences in attention, hearing ability, and expectations all influence how someone rates a clip. A listener focused intently on speech clarity may penalize background hiss more harshly than someone who barely noticed it. These individual differences are well documented in auditory perception research, where selective attention plays a significant role in how people process sound.

This is also why formal MOS testing follows strict protocols. The ITU standards specify details like how many listeners to include, what kind of room to test in, and how to present the samples. Without those controls, you’d get wildly inconsistent results that reflect the testing conditions more than the actual audio or video quality.

Where MOS Is Used Today

MOS shows up anywhere that audio or video quality affects the user experience. VoIP providers use it to monitor call quality across their networks in real time, flagging routes or connections that drop below acceptable thresholds. Video conferencing platforms rely on objective MOS models to adjust streaming parameters on the fly, lowering resolution or bitrate before quality drops to a point users would notice. Streaming services use similar metrics during encoding to balance file size against visual quality.

Codec developers use MOS extensively during development, running subjective listening tests to validate that a new compression method sounds as good as or better than the previous generation. Mobile carriers measure MOS across their networks to identify coverage areas where call quality suffers. Even in fields outside traditional telecom, like medical imaging and security camera systems, MOS principles have been adapted to rate image quality using the same 1-to-5 framework with expert observers.