MFCC stands for Mel-Frequency Cepstral Coefficients, a set of features extracted from audio that represent how sound energy is distributed across frequencies. They’re the most widely used audio features in speech recognition, speaker identification, and music classification. MFCCs work by transforming raw audio into a compact numerical summary that reflects how the human ear actually perceives sound, making them far more useful than raw waveform data for machine learning and signal processing tasks.
Why MFCCs Exist
Raw audio is a stream of amplitude values over time. That’s a lot of data, and most of it isn’t useful for distinguishing one sound from another. What matters is the shape of the sound’s frequency content at any given moment: which frequencies are loud, which are quiet, and how that pattern changes over time. MFCCs capture exactly this information in a small, efficient package, typically just 13 to 25 numbers per audio frame.
The key insight behind MFCCs is that human hearing doesn’t treat all frequencies equally. You can easily tell the difference between a 200 Hz tone and a 400 Hz tone, but a 10,000 Hz tone and a 10,200 Hz tone sound nearly identical. MFCCs bake this perceptual quirk directly into the math by using something called the Mel scale, which spaces frequencies the way your ears do: fine resolution at low frequencies, coarser resolution at high ones.
How MFCCs Are Calculated
The calculation follows a specific pipeline, and understanding each step helps you see what MFCCs actually represent.
Framing and Windowing
Audio is continuous, but MFCCs describe short snapshots. The signal gets chopped into overlapping frames, usually 20 to 30 milliseconds long. Each frame is multiplied by a window function (typically a Hamming window) to smooth out the edges and prevent artifacts when the math treats each frame as if it were a standalone signal.
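The framing-and-windowing step can be sketched in a few lines of NumPy. The sample rate, frame length, and hop size below are illustrative assumptions (common defaults, not values mandated by the article), and random noise stands in for a real recording:

```python
import numpy as np

# Illustrative parameters: 16 kHz audio, 25 ms frames, 10 ms hop.
sr = 16000
frame_len = int(0.025 * sr)   # 400 samples per frame
hop = int(0.010 * sr)         # 160 samples between frame starts

# One second of fake audio standing in for a real signal.
signal = np.random.default_rng(0).standard_normal(sr)

# Chop the signal into overlapping frames.
n_frames = 1 + (len(signal) - frame_len) // hop
frames = np.stack([signal[i * hop : i * hop + frame_len]
                   for i in range(n_frames)])

# Taper each frame with a Hamming window to soften its edges.
window = np.hamming(frame_len)
frames = frames * window

print(frames.shape)
```

With a 10 ms hop and 25 ms frames, consecutive frames overlap by 60%, so no part of the signal falls only at a window's tapered edge.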
Fourier Transform
Each frame is converted from the time domain into the frequency domain using a Fast Fourier Transform. This tells you how much energy sits at each frequency within that short window of audio. The result is a frequency spectrum for every frame.
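Continuing the sketch, one windowed frame can be moved into the frequency domain with a real-input FFT. The 512-point FFT size and the 1 kHz test tone are assumptions for illustration:

```python
import numpy as np

# A single 25 ms frame at 16 kHz (400 samples): a 1 kHz test tone,
# Hamming-windowed as in the previous step.
sr, frame_len = 16000, 400
frame = np.hamming(frame_len) * np.sin(
    2 * np.pi * 1000 * np.arange(frame_len) / sr)

# The FFT is usually taken at the next power of two (here 512 points).
n_fft = 512
spectrum = np.fft.rfft(frame, n=n_fft)   # 257 complex bins
power = np.abs(spectrum) ** 2            # energy per frequency bin

# Bin k covers frequency k * sr / n_fft, so the peak should sit
# near the 1 kHz tone.
peak_hz = np.argmax(power) * sr / n_fft
print(peak_hz)
```

Running this over every frame yields one power spectrum per frame, which is exactly what the next step (the Mel filter bank) consumes.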
Mel Filter Bank
Here’s where human hearing enters the picture. A set of triangular, overlapping filters is applied to the frequency spectrum. These filters are spaced equally along the Mel scale, not the linear frequency scale. At low frequencies, the filters are narrow and closely packed, capturing fine distinctions. At high frequencies, the filters are wide and spread apart, lumping similar frequencies together. The Mel scale converts hertz to “mels” using a logarithmic formula: frequencies below about 1,000 Hz map roughly one-to-one, while higher frequencies get increasingly compressed.
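The Hz-to-mel conversion described above is usually written as m = 2595 · log10(1 + f/700). A small sketch shows how spacing filter centers equally in mels produces narrow filters at low frequencies and wide ones at high frequencies (the 26-filter, 8 kHz setup is an assumed but common configuration):

```python
import numpy as np

# Standard Hz <-> mel conversion (the 2595 * log10 variant).
def hz_to_mel(f):
    return 2595.0 * np.log10(1.0 + f / 700.0)

def mel_to_hz(m):
    return 700.0 * (10.0 ** (m / 2595.0) - 1.0)

# Edge frequencies for 26 triangular filters between 0 and 8000 Hz:
# equally spaced in mels, hence crowded at low Hz, sparse at high Hz.
n_filters = 26
mel_points = np.linspace(hz_to_mel(0), hz_to_mel(8000), n_filters + 2)
hz_points = mel_to_hz(mel_points)

print(np.diff(hz_points)[:3])    # narrow gaps at the bottom
print(np.diff(hz_points)[-3:])   # wide gaps at the top
```

A sanity check on the formula: 1,000 Hz maps to almost exactly 1,000 mels, which is the "roughly one-to-one below 1,000 Hz" behavior mentioned above.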
This filterbank approach mimics how the cochlea in your inner ear resolves frequencies. Empirical evidence shows that designing audio processing this way consistently improves recognition performance compared to treating all frequencies equally.
Logarithm
The energy output from each filter is log-transformed. This mirrors another property of human hearing: perceived loudness scales logarithmically with actual sound intensity. Doubling the acoustic power doesn’t sound twice as loud. The log step also makes the features more robust to variations in recording volume.
Discrete Cosine Transform
The final step applies a Discrete Cosine Transform (DCT) to the log filter bank energies. This is what puts the “cepstral” in the name (“cepstrum” was coined by reversing the first four letters of “spectrum”). The DCT serves two purposes. First, it decorrelates the filter bank outputs, since neighboring filters overlap and produce correlated values. Decorrelation matters because many classical machine learning algorithms, such as Gaussian mixture models with diagonal covariances, assume input features are roughly independent. For smoothly correlated data like these, the DCT closely approximates principal component analysis, the statistically optimal decorrelating transform. Second, it compresses the information: you can keep just the first 13 or so coefficients and discard the rest, because the lower-order coefficients capture the broad spectral shape while higher-order ones represent fine detail that is often noise.
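The log-then-DCT steps fit in a few lines. The filter-bank energies below are random stand-ins for real filter outputs, and the type-II, orthonormal DCT is the variant conventionally used here:

```python
import numpy as np
from scipy.fft import dct

# Fake filter-bank energies for one frame: 26 positive values standing
# in for the outputs of the mel filters (assumed, not real audio).
rng = np.random.default_rng(1)
filter_energies = rng.uniform(0.1, 10.0, size=26)

# Log compression, then a type-II DCT over the filter axis.
log_energies = np.log(filter_energies)
cepstrum = dct(log_energies, type=2, norm="ortho")

# Keep only the first 13 coefficients: the broad spectral shape.
mfcc = cepstrum[:13]
print(mfcc.shape)
```

Note that with this orthonormal DCT, the very first coefficient is proportional to the sum of the log energies, i.e. the frame's overall log energy, which is why it is often treated separately.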
The result is a vector of numbers, one per coefficient, for each audio frame. The first coefficient reflects the frame’s overall energy. The second captures the broad tilt between low- and high-frequency energy. Higher coefficients represent increasingly fine spectral details.
Delta and Delta-Delta Features
Static MFCCs describe the spectral shape at a single moment, but speech and music are inherently dynamic. To capture how sound changes over time, most systems also compute delta coefficients (the rate of change between consecutive frames) and delta-delta coefficients (the acceleration, or rate of change of the rate of change).
Delta features are calculated by taking the difference between MFCC values a few frames apart, typically two or three frames in each direction. Delta-delta features apply the same differencing operation to the delta values. Adding delta features to a standard 13-dimensional MFCC set strongly improves speech recognition accuracy, and delta-delta features provide a further, smaller improvement on top of that. Together, a system using 13 base MFCCs plus their deltas and delta-deltas ends up with 39 features per frame.
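The differencing described above is usually done with a small regression over neighboring frames rather than a single difference. A sketch of that standard formula, applied to fake MFCCs (the window width n=2 and the edge-padding strategy are common choices, assumed here):

```python
import numpy as np

def deltas(feats, n=2):
    """Delta features via the standard regression formula:
    d[t] = sum_k k*(c[t+k] - c[t-k]) / (2 * sum_k k^2), k = 1..n.
    `feats` has shape (n_frames, n_coeffs); edges repeat the end frames."""
    denom = 2 * sum(k * k for k in range(1, n + 1))
    padded = np.pad(feats, ((n, n), (0, 0)), mode="edge")
    d = np.zeros_like(feats, dtype=float)
    for k in range(1, n + 1):
        d += k * (padded[n + k : n + k + len(feats)]
                  - padded[n - k : n - k + len(feats)])
    return d / denom

# Fake 13-dimensional MFCCs over 50 frames (assumed values).
mfcc = np.random.default_rng(2).standard_normal((50, 13))
d1 = deltas(mfcc)      # rate of change
d2 = deltas(d1)        # acceleration
features = np.hstack([mfcc, d1, d2])
print(features.shape)
```

Stacking the three blocks side by side gives the 39-dimensional per-frame feature vector described above.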
How Many Coefficients to Use
The classic choice is 13 MFCCs, and that number appears in the vast majority of published speech recognition systems. But it’s not a law of physics. A study on Bengali phoneme recognition found that all performance metrics peaked at 24 or 25 MFCCs, suggesting the optimal number depends on the language and task. For most general-purpose applications, 13 base coefficients remain the default starting point, with tuning from there based on your specific use case.
Where MFCCs Are Used
Speech recognition was the original application, and MFCCs remain central to it. Any time a system needs to convert spoken words to text, identify who is speaking, or detect emotion in voice, MFCCs are a standard feature set. They’re also used in music information retrieval for tasks like genre classification and instrument identification. One well-known study achieved 80% accuracy classifying fifteen orchestral instruments using MFCCs alone, and 94% accuracy when grouping instruments by family. Environmental sound classification, language identification, and even medical speech biomarker analysis all rely on MFCCs as well.
MFCCs vs. Mel Spectrograms in Deep Learning
With the rise of deep neural networks, a natural question is whether MFCCs are still the best choice. The answer: it depends on your model. MFCCs apply the DCT step, which compresses and decorrelates the data. That’s helpful for traditional machine learning algorithms that struggle with correlated inputs, but deep learning models can learn to handle correlation on their own. Skipping the DCT and feeding the model log-scaled Mel filter bank energies (a Mel spectrogram) preserves more of the original time-frequency structure.
Comparative studies have found that raw spectrograms and Mel spectrograms can outperform MFCCs when paired with powerful sequence models like bidirectional LSTMs. MFCCs sometimes show lower accuracy in these setups because the DCT discards contextual information that neural networks could have exploited. Still, MFCCs remain a strong baseline and are preferred when computational resources are limited, training data is small, or the downstream model expects decorrelated input features.
In practice, many modern systems use Mel spectrograms as input to convolutional neural networks, treating audio essentially like an image. MFCCs haven’t disappeared, though. They’re lighter, faster to compute, and still competitive for many tasks, especially outside the deep learning paradigm.

