How Does Your Voice Work? From Folds to Speech

Your voice starts as a puff of air from your lungs, gets turned into a buzzing vibration by two small folds of tissue in your throat, and then gets shaped into recognizable speech by your throat, mouth, and tongue. The whole process happens so fast and so automatically that most people never think about the dozens of moving parts involved. But the system is remarkably intricate, and understanding it helps explain everything from why your voice cracked during puberty to why you sound different on a recording.

The Vocal Folds: Your Sound Source

The sound of your voice originates in the larynx, a small structure in your throat sometimes called the voice box. Inside it sit two bands of layered tissue called the vocal folds (often called vocal cords, though they’re more like shelves than strings). In adult men, these folds are only about 1.75 to 2.5 centimeters long, and in adult women they’re shorter still, about 1.25 to 1.75 centimeters. At birth, they’re even smaller, roughly 6 to 8 millimeters.

Each vocal fold is built from five distinct layers. The deepest layer is a muscle called the thyroarytenoid. Over that sit three layers of connective tissue (the lamina propria), ranging from stiff at the bottom to gel-like at the surface. The outermost layer is a thin lining of skin-like tissue. This layered design is essential: the softer outer layers can ripple and wave independently of the stiffer inner layers, which is what lets the folds vibrate so efficiently.

The folds are anchored to a framework of cartilages. The thyroid cartilage is the one you can feel at the front of your throat (the “Adam’s apple”). Behind it, two smaller pyramid-shaped cartilages called the arytenoids act like pivots, pulling the vocal folds open for breathing and pressing them together for speaking. A ring of cartilage called the cricoid sits below, forming the only complete cartilage ring in the larynx and serving as a stable base for the whole structure.

How Vibration Creates Sound

When you decide to speak, muscles pull the vocal folds together until they nearly touch. You then push air up from your lungs. With the folds pressed close, air pressure builds beneath them. Once that pressure is high enough, it forces the lower edges of the folds apart, and a burst of air escapes upward.

Here’s where the physics gets interesting. As air rushes through the narrow gap between the folds, it speeds up, and fast-moving air creates a drop in pressure (a principle from fluid dynamics sometimes called the Bernoulli effect). That pressure drop, combined with the natural elastic recoil of the tissue, snaps the folds back together. Pressure builds again beneath them, they’re forced apart again, and the cycle repeats. This open-close-open-close cycle happens extraordinarily fast: around 110 times per second in a typical male voice, 180 to 220 times per second in a typical female voice, and about 300 times per second in children.
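To get a feel for how fast that is, you can turn each vibration rate into the time a single open-close cycle takes. The Python sketch below is only illustrative arithmetic. The function name cycle_period_ms is made up for this example, and 200 is an assumed midpoint for the 180 to 220 range quoted above.

```python
def cycle_period_ms(cycles_per_second: float) -> float:
    """Return the duration of one open-close cycle in milliseconds."""
    return 1000.0 / cycles_per_second


# Rates taken from the text; 200 is an assumed midpoint of 180-220.
rates = {
    "typical male voice": 110,
    "typical female voice": 200,
    "children": 300,
}

for label, rate in rates.items():
    print(f"{label}: {rate} cycles/s, about {cycle_period_ms(rate):.1f} ms per cycle")
```

Even at the slowest of these rates, one full cycle is over in under a hundredth of a second.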

Each cycle releases a tiny puff of air, and those rapid pulses create a sound wave. The sound at this stage is a raw buzz, not yet anything you’d recognize as a voice. It contains energy at many different frequencies, with the loudest component at the rate of vibration itself (the fundamental frequency) and progressively quieter overtones above it.
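For a steadily repeating vibration, those overtones sit at whole-number multiples of the fundamental frequency. The sketch below lists the first few for a 110 Hz voice. The 12 dB per octave drop is an assumption: it is a common textbook approximation for the raw buzz, not a measurement, and real voices vary. The function name harmonic_series is hypothetical.

```python
import math


def harmonic_series(f0_hz, count, drop_db_per_octave=12.0):
    """List (frequency, dB below the fundamental) for the first harmonics.

    drop_db_per_octave is an assumed, idealized slope for the raw buzz.
    """
    series = []
    for k in range(1, count + 1):
        frequency_hz = k * f0_hz  # harmonics are integer multiples of f0
        drop_db = drop_db_per_octave * math.log2(k)  # octaves above f0 times slope
        series.append((frequency_hz, drop_db))
    return series


for frequency_hz, drop_db in harmonic_series(110, 5):
    print(f"{frequency_hz:>4} Hz  about {drop_db:4.1f} dB below the fundamental")
```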

What Controls Pitch

Pitch is primarily determined by how stiff and stretched the vocal folds are. Two sets of muscles do most of the work. One set, the cricothyroid muscles, tilts the thyroid cartilage forward, which stretches the vocal folds longer and makes them stiffer. Stiffer folds vibrate faster, producing a higher pitch, much like tightening a guitar string raises its note. The opposing set, the thyroarytenoid muscles (which form the body of the folds themselves), can shorten and thicken the folds, lowering the pitch.
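The guitar-string comparison can be put into numbers. For an ideal string, frequency rises with the square root of tension, so the sketch below converts a change in tension into a change in pitch. This is only an analogy: real vocal folds are layered tissue, not ideal strings, and stretching them also changes their length and thickness, which the sketch ignores. The function name pitch_change_semitones is made up for this example.

```python
import math


def pitch_change_semitones(tension_ratio: float) -> float:
    """Pitch shift in semitones when an ideal string's tension is multiplied
    by tension_ratio, with length and mass per unit length held fixed.

    Ideal string: f = (1 / (2 * L)) * sqrt(T / mu), so f scales with sqrt(T).
    """
    frequency_ratio = math.sqrt(tension_ratio)
    return 12.0 * math.log2(frequency_ratio)  # 12 semitones per doubling of f


for ratio in (1.5, 2.0, 4.0):
    print(f"tension x{ratio}: {pitch_change_semitones(ratio):+.1f} semitones")
```

In this idealized model, doubling the tension raises the note by six semitones, and it takes four times the tension to go up a full octave.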

These two muscle groups interact in complex ways. When the vocal folds are already stretched by strong cricothyroid activation, engaging the thyroarytenoid reduces that stretch and lowers the pitch. But when the folds are relaxed, contracting the thyroarytenoid can actually stiffen and shorten them enough to raise the pitch slightly. The cricothyroid is the more powerful pitch regulator overall, which is why it drives most of the pitch range you use in everyday speech.

Men’s voices typically range from about 78 to 182 Hz in fundamental frequency, while women’s range from about 126 to 307 Hz. That difference comes down to vocal fold size: testosterone during puberty thickens and lengthens the folds, which is why boys’ voices drop. The voice cracking that happens during this transition is partly the result of the brain learning to coordinate muscles around rapidly changing anatomy.

What Controls Volume

Volume is a separate dial from pitch, though the two interact. To get louder, you increase the air pressure below the vocal folds by pushing harder with your breathing muscles. During normal conversation, the pressure beneath the folds runs between about 200 and 800 pascals. Classical singers routinely hit 1,500 to 2,000 pascals. Higher pressure forces the folds apart more forcefully and keeps them open slightly longer during each cycle, which means each puff of air carries more energy, and the resulting sound wave has a larger amplitude.
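Voice clinics often report this pressure in centimeters of water rather than pascals, so it can help to see the same figures in both units. The sketch below uses the conventional conversion of 98.0665 pascals per centimeter of water. The function name pa_to_cm_h2o and the labels are made up for this example.

```python
PA_PER_CM_H2O = 98.0665  # pascals in one centimeter of water (conventional value)


def pa_to_cm_h2o(pressure_pa: float) -> float:
    """Convert a pressure from pascals to centimeters of water."""
    return pressure_pa / PA_PER_CM_H2O


# Pressure figures quoted in the text above.
examples = [
    ("conversation, low end", 200),
    ("conversation, high end", 800),
    ("classical singing, low end", 1500),
    ("classical singing, high end", 2000),
]

for label, pressure_pa in examples:
    print(f"{label}: {pressure_pa} Pa is about {pa_to_cm_h2o(pressure_pa):.1f} cm of water")
```

For scale, all of these are small next to atmospheric pressure, which is roughly 100,000 pascals.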

You also get louder by pressing the vocal folds together more firmly before and during phonation, so they resist the airflow longer before popping open. This combination of greater air pressure and firmer closure is why shouting can tire your voice: you’re slamming the folds together harder and faster than they’re designed to sustain for long periods.

How Your Throat and Mouth Shape the Sound

The buzzing sound generated by the vocal folds isn’t what other people hear. It gets dramatically filtered and reshaped as it travels through the vocal tract: the open space of your throat (pharynx), mouth, and nasal passages. This is the same principle as blowing across the top of a bottle. The air column inside the bottle amplifies certain frequencies and dampens others, depending on its shape. Your vocal tract does the same thing, but with a shape you can change in real time.

The specific frequencies that get amplified are called formants, and their pattern is what distinguishes one vowel sound from another. When you say “ee,” your tongue is high and forward, creating a narrow passage at the front of your mouth and leaving a large space behind it in your throat. When you say “ah,” your tongue drops low and back, reversing those proportions. Each configuration amplifies a different set of frequencies from the original buzz, and your brain interprets those frequency patterns as different vowel sounds.
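A rough way to see where formants come from is to treat the vocal tract as a plain tube, closed at the vocal folds and open at the lips. Such a tube resonates at odd multiples of a quarter wavelength. The sketch below is a simplification resting on assumptions: a tube of uniform width, a length of 17.5 centimeters (a typical figure for an adult male), and sound traveling at about 350 meters per second in warm, moist air. The function name tube_resonances_hz is hypothetical.

```python
def tube_resonances_hz(length_m, count=3, speed_of_sound_m_s=350.0):
    """Resonances of a uniform tube closed at one end and open at the other.

    Quarter-wavelength rule: F_n = (2n - 1) * c / (4 * L), for n = 1, 2, 3, ...
    """
    return [
        (2 * n - 1) * speed_of_sound_m_s / (4.0 * length_m)
        for n in range(1, count + 1)
    ]


# Assumed tube lengths in meters; a longer tube resonates at lower frequencies.
for length_m in (0.175, 0.20):
    rounded = [round(f) for f in tube_resonances_hz(length_m)]
    print(f"tube of {length_m * 100:.1f} cm: resonances near {rounded} Hz")
```

With the assumed 17.5 centimeter tube, the first three resonances land near 500, 1,500, and 2,500 Hz, which is close to a neutral “uh” sound. Moving your tongue and lips breaks the uniform-tube assumption, and that is what shifts the formants from one vowel to another.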

Trained singers and actors learn to manipulate their vocal tracts to create extra resonance. Lowering the larynx lengthens the throat and can cluster several formants together, creating a strong peak of acoustic energy around 3,000 Hz. This is sometimes called the “singer’s formant,” and it’s what allows an operatic voice to cut through an orchestra without a microphone. Actors develop a similar effect by narrowing the mouth at the front while widening it at the rear, which helps their voice carry in a theater.

Turning Sound Into Speech

Vowels are only half of speech. Consonants require you to interrupt or redirect the airflow in specific ways using what linguists call articulators: your tongue, lips, teeth, the roof of your mouth, and the soft palate at the back.

Your tongue is the most versatile articulator. Its tip can touch the ridge behind your upper teeth to make a “t” or “d” or curl back to make an “r,” and its back can rise against the rear of the roof of your mouth to make a “k” or “g.” Research on speech movement shows that tongue position contributes more to distinguishing between sounds than lip position does, which makes sense given how many distinct positions the tongue can reach inside the mouth.

Your lips handle sounds like “p,” “b,” and “m,” where the upper and lower lips press together to briefly stop airflow. The vertical distance between your lips also shapes vowels like “oo,” where you round and narrow the opening. The soft palate, a flap of tissue at the back of the roof of your mouth, controls whether air flows out through your nose. For most sounds, it seals off the nasal passage. For “m,” “n,” and “ng,” it drops open, letting sound resonate through your nasal cavity and giving those sounds their distinctive humming quality.

All of these movements happen in tight coordination, with your brain sequencing dozens of muscles across your chest, throat, and face within milliseconds. Speaking a single syllable involves adjusting airflow, vocal fold tension, tongue position, and lip shape almost simultaneously. It’s one of the most neurologically complex motor tasks humans perform, and you do it without thinking, for roughly 16,000 words a day.