What Is the Source-Filter Theory of Speech?

The source-filter theory is a model of how humans produce speech sounds. It breaks voice production into two independent components: a sound source (the vibrating vocal folds in your larynx) and a filter (the shape of your vocal tract). Swedish acoustician Gunnar Fant formally proposed the theory in 1960, and it remains the foundational framework for understanding speech acoustics.

The core idea is simple. Your vocal folds create a raw buzzing sound, and the air-filled tube above them (your throat, mouth, and nasal cavity) shapes that buzz into recognizable vowels and consonants. These two stages work largely independently of each other, which is why you can say the same vowel at different pitches or sing the same pitch on different vowels.

How the Source Works

The “source” in the model is the vibration of the vocal folds. When you speak, pressure from your lungs pushes air through the vocal folds, causing them to open and close rapidly. This produces a series of air pulses, a buzzing signal rich in harmonics. The rate at which your vocal folds vibrate determines the fundamental frequency of your voice, which you perceive as pitch. A typical adult male voice vibrates around 100 to 150 times per second (100 to 150 Hz); a typical adult female voice, around 180 to 220 Hz.
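
To make the numbers concrete, here is a minimal sketch of the source stage in Python. The sawtooth wave is a common textbook stand-in for the glottal pulse train; the 16 kHz sample rate and 120 Hz pitch are illustrative choices, not part of the theory itself.

```python
# A minimal sketch of the source stage. The sawtooth stands in for the
# glottal pulse train; SR and F0 are illustrative, not part of the model.
import numpy as np
from scipy.signal import sawtooth

SR = 16000               # sample rate in Hz (assumed)
F0 = 120                 # fundamental frequency: a typical male pitch
t = np.arange(SR) / SR   # one second of time samples

# Like the real glottal buzz, a sawtooth has energy at every integer
# multiple of F0: 120 Hz, 240 Hz, 360 Hz, and so on.
source = sawtooth(2 * np.pi * F0 * t)

spectrum = np.abs(np.fft.rfft(source))
freqs = np.fft.rfftfreq(len(source), 1.0 / SR)
print(freqs[spectrum.argmax()])  # 120.0: the fundamental dominates
```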

This raw signal contains energy at many frequencies, but it isn’t yet speech. It sounds roughly the same regardless of what word you’re trying to say. Think of it like the vibration of a guitar string before the body of the guitar shapes the tone. The source provides the raw material; everything that makes it sound like language happens next.

How the Filter Shapes Sound

The “filter” is your vocal tract, the entire airway from the top of the vocal folds to your lips (and, when relevant, your nasal cavity). This tube of air has natural resonance frequencies determined almost entirely by its shape. When the raw buzzing from your vocal folds passes through, certain frequencies get amplified and others get dampened, depending on how the tract is configured at that moment.

These resonance peaks are called formants, and they’re what distinguish one vowel from another. When you shift your tongue forward and raise it to say “ee,” the shape of the tube changes, boosting a different set of frequencies than when you open wide and drop your tongue for “ah.” Research from Macquarie University’s phonetics program confirms that the cross-sectional area at each point along the vocal tract is the main predictor of these resonances. Interestingly, the actual geometric shape of the cross-section (whether it’s oval, round, or irregular) has almost no effect. What matters is how wide or narrow the tube is at each point.

The model captures this filtering effect as a “transfer function,” a mathematical description of which frequencies the tract amplifies and which it suppresses for a given tongue, jaw, and lip position.
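
As a rough illustration of what a transfer function looks like in practice, the sketch below builds the filter as a cascade of two-pole digital resonators, one per formant. This is the classic formant-synthesis approach rather than Fant’s exact formulation, and the formant frequencies and bandwidths are illustrative textbook values for an “ah”-like vowel.

```python
# A sketch of the filter as a transfer function: a cascade of two-pole
# resonators, one per formant. Formant frequencies and bandwidths are
# illustrative values for an "ah"-like vowel, not measured data.
import numpy as np
from scipy.signal import freqz

SR = 16000  # sample rate in Hz (assumed)

def resonator(freq_hz, bw_hz, sr=SR):
    """Filter coefficients for a two-pole resonance at freq_hz."""
    r = np.exp(-np.pi * bw_hz / sr)           # pole radius from bandwidth
    theta = 2 * np.pi * freq_hz / sr          # pole angle from frequency
    a = [1.0, -2 * r * np.cos(theta), r * r]  # denominator: the poles
    b = [1 - r]                               # rough gain normalization
    return b, a

# Approximate first three formants (frequency, bandwidth) of "ah".
formants = [(730, 80), (1090, 90), (2440, 120)]

# The tract's overall transfer function is the product of the
# individual resonator responses.
w = np.linspace(0, np.pi, 2048)
H = np.ones_like(w, dtype=complex)
for f, bw in formants:
    b, a = resonator(f, bw)
    _, h = freqz(b, a, worN=w)
    H *= h

peak_hz = w[np.abs(H).argmax()] * SR / (2 * np.pi)
print(round(peak_hz))  # close to 730, the first formant peak
```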

The Role of Lip Radiation

Fant’s full model actually has three components, not just two. After the source and filter, there’s a radiation characteristic that accounts for how sound exits your lips and radiates into open air. The lips act as a small acoustic opening, and the transition from a confined tube to the open environment boosts higher frequencies relative to lower ones. Research published in Acta Acustica found that the lips are the single biggest factor in this radiation effect, more important than the overall shape of the head or the presence of the torso. The complete equation is a product of three terms: the source spectrum, the filter’s transfer function, and this radiation characteristic. Together they predict the final sound that reaches a listener’s ear.
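
In symbols, the output spectrum is roughly the product P(f) = S(f) · T(f) · R(f), where S is the source spectrum, T the vocal tract transfer function, and R the radiation characteristic. The sketch below strings the three stages together; the first-difference filter used for radiation is a standard digital approximation of the high-frequency boost, and every numeric value here is an illustrative assumption.

```python
# The three stages chained: source, then filter, then lip radiation.
# The first difference approximates the radiation's high-frequency
# boost; every numeric value here is an illustrative assumption.
import numpy as np
from scipy.signal import sawtooth, lfilter

SR, F0 = 16000, 120
t = np.arange(SR) / SR

# 1. Source: harmonically rich buzz at the fundamental frequency.
signal = sawtooth(2 * np.pi * F0 * t)

# 2. Filter: cascade of two-pole resonators at "ah"-like formants.
for f, bw in [(730, 80), (1090, 90), (2440, 120)]:
    r = np.exp(-np.pi * bw / SR)
    a = [1.0, -2 * r * np.cos(2 * np.pi * f / SR), r * r]
    signal = lfilter([1 - r], a, signal)

# 3. Radiation: differencing boosts the highs as sound leaves the lips.
signal = np.diff(signal)

# `signal` is now a crude but recognizably vowel-like waveform.
```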

Why Independence Matters

The theory’s most powerful feature is its assumption that the source and filter operate independently. This means a change in pitch (controlled by vocal fold tension) doesn’t alter the resonances of your vocal tract, and a change in vowel (controlled by tongue and jaw position) doesn’t alter the vibration rate of your vocal folds. In practical terms, this is why you can whisper, speak, or shout the word “boot” and it still sounds like “boot.” The pitch and loudness change, but the vowel identity stays the same because the filter hasn’t changed.
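
This independence is easy to check numerically. In the hypothetical sketch below, pulse-train sources at two different pitches pass through the same fixed 730 Hz resonator, and the output spectrum peaks at the harmonic nearest the resonance in both cases.

```python
# A numerical check of source-filter independence: two pitches, one
# fixed filter. The pulse train mimics the glottal pulses described
# earlier; the single 730 Hz resonator is an illustrative filter.
import numpy as np
from scipy.signal import lfilter

SR = 16000
R = np.exp(-np.pi * 80 / SR)  # pole radius for an 80 Hz bandwidth

def vowel_peak(f0):
    """Dominant output frequency for a pulse-train source at pitch f0."""
    src = np.zeros(SR)
    src[::SR // f0] = 1.0     # roughly f0 glottal pulses per second
    a = [1.0, -2 * R * np.cos(2 * np.pi * 730 / SR), R * R]
    out = lfilter([1 - R], a, src)
    spec = np.abs(np.fft.rfft(out))
    return np.fft.rfftfreq(len(out), 1.0 / SR)[spec.argmax()]

# Both outputs peak at the harmonic nearest the 730 Hz resonance, even
# though one source is pitched at 120 Hz and the other at 200 Hz.
print(vowel_peak(120), vowel_peak(200))  # ~722.0 and 800.0
```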

This independence also explains why the model is language-independent. Every human language uses the same basic apparatus: a vibrating source and a resonant filter. The differences between languages come from how speakers configure the filter, not from any fundamental change in the mechanism. The theory applies equally well to disordered speech, making it useful for clinicians who need to figure out whether a voice problem originates at the vocal folds or somewhere in the vocal tract above them.

Where the Linear Model Breaks Down

Fant’s original model treats the source and filter as completely separate, a “linear” simplification. In reality, the vocal tract can influence the vocal folds, and vice versa. Research by Ingo Titze published in The Journal of the Acoustical Society of America demonstrated that this source-filter coupling is nonlinear, meaning the vocal tract can actually create new frequencies that weren’t present in the original vocal fold vibration.

This coupling happens at different levels of intensity. At the mildest level, air pressure reflected back from the vocal tract slightly reshapes the airflow pulse through the vocal folds, skewing it in a way that generates additional harmonics. At a more intense level, the vocal tract’s acoustic load can actually change how the vocal folds vibrate, producing sudden pitch jumps, subharmonic frequencies (pitches below the intended note), or even chaotic vibration. Singers encounter this coupling more than speakers do, because singing often places the fundamental frequency or its harmonics close to a vocal tract resonance, which strengthens the interaction.

For ordinary speech, the fundamental frequency is usually far enough from any vocal tract resonance that the coupling effect is minimal. This is precisely why Fant’s linear model works so well in most contexts: the conditions of normal speech keep the source and filter functionally independent, even though the physics allows them to interact.

How the Theory Is Used Today

The source-filter theory underpins nearly every modern application involving the human voice. Speech synthesis systems, including text-to-speech engines, use source-filter models to generate artificial speech by independently controlling a simulated glottal source and a set of formant frequencies. Voice recognition software relies on the same decomposition to extract the formant patterns that identify phonemes, separating them from pitch information that varies by speaker.
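
One concrete form of this decomposition is linear predictive coding (LPC), which fits an all-pole model of the filter and recovers a source-like signal by inverse filtering. The sketch below uses librosa’s real lpc function; the file path and model order are illustrative assumptions.

```python
# LPC-based source-filter separation, the decomposition behind many
# recognizers and vocoders. librosa.lpc and librosa.load are real
# functions; the file path and model order are illustrative.
import librosa
from scipy.signal import lfilter

# Load a short recording of a sustained vowel (path is hypothetical).
y, sr = librosa.load("vowel.wav", sr=16000)

# Fit an all-pole model of the vocal tract filter. A common rule of
# thumb is about 2 + sr/1000 coefficients; order 18 suits 16 kHz audio.
a = librosa.lpc(y, order=18)

# Running the signal through the inverse (prediction-error) filter
# strips the estimated filter away, leaving a source-like residual.
residual = lfilter(a, [1.0], y)

# The LPC coefficients encode the formants (filter); the residual keeps
# the pitch information (source). Recognition systems lean on the
# first, pitch trackers on the second.
```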

In clinical settings, the framework helps speech-language pathologists pinpoint voice problems. If a patient sounds breathy or strained, the issue likely involves the source: the vocal folds aren’t closing properly or are vibrating irregularly. If the patient’s pitch and voice quality sound normal but specific speech sounds are distorted, the problem is more likely in the filter, perhaps a structural issue in the oral or nasal cavity, or a motor control problem affecting tongue and jaw positioning. Being able to separate these two domains guides both diagnosis and treatment.

Voice training for singers and transgender individuals also draws on source-filter principles. Adjusting the resonance characteristics of the vocal tract (the filter) can make a voice sound brighter, darker, more masculine, or more feminine without changing the fundamental pitch. This is a direct application of the theory’s central insight: source and filter are independent controls over the final sound.
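
As a toy version of this kind of resonance adjustment, the sketch below shapes one fixed-pitch source with a baseline set of formants, then with the same formants scaled up by 20 percent, roughly mimicking a shorter vocal tract. The specific values are illustrative, not a clinical recipe.

```python
# A toy filter-only voice modification: same source, two filter settings.
# Scaling the formant frequencies up by 20% mimics a shorter vocal
# tract and a "brighter" quality; all values are illustrative.
import numpy as np
from scipy.signal import sawtooth, lfilter

SR, F0 = 16000, 120
t = np.arange(SR) / SR
source = sawtooth(2 * np.pi * F0 * t)  # fixed pitch: the source never changes

def apply_formants(src, formants, scale=1.0):
    """Cascade two-pole resonators at (optionally scaled) formants."""
    out = src
    for f, bw in formants:
        r = np.exp(-np.pi * bw / SR)
        a = [1.0, -2 * r * np.cos(2 * np.pi * f * scale / SR), r * r]
        out = lfilter([1 - r], a, out)
    return out

FORMANTS = [(730, 80), (1090, 90), (2440, 120)]
baseline = apply_formants(source, FORMANTS)        # original resonances
brighter = apply_formants(source, FORMANTS, 1.2)   # resonances raised 20%
# Both signals share the same 120 Hz pitch; only the filter differs.
```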