A speech sound is the smallest unit of sound that humans produce when speaking. It’s the actual acoustic event that comes out of your mouth, as opposed to a letter on a page. English has roughly 44 distinct speech sounds but only 26 letters, which is why spelling and pronunciation so often disagree. Linguists use the term “phoneme” for a speech sound that can distinguish one word from another, like the “b” in “bat” versus the “p” in “pat.”
How Your Body Produces Speech Sounds
Every speech sound starts with air from your lungs. That air travels up through your windpipe to the larynx (your voice box), where two small folds of tissue called the vocal folds can either vibrate or stay open. When they vibrate, you get a voiced sound like “z” or “v.” When they stay open and let air pass freely, you get a voiceless sound like “s” or “f.” You can feel the difference by placing your fingers on your throat and alternating between a long “zzzzz” and a long “sssss.”
After passing through the larynx, the airflow gets shaped by your tongue, lips, teeth, and the roof of your mouth. These are called articulators, and their precise positions determine which sound comes out. The whole process breaks down into three stages: respiration (air from the lungs), phonation (vibration at the larynx), and articulation (shaping the sound in the mouth and nose).
Consonants vs. Vowels
Speech sounds fall into two broad categories. Consonants are produced by partially or fully blocking the airflow somewhere in the mouth. Vowels are produced with an open vocal tract, allowing air to flow relatively freely. The distinction matters because consonants and vowels are described using completely different sets of characteristics.
How Consonants Are Classified
Linguists describe any consonant using three features. First is voicing: whether the vocal folds vibrate (voiced) or not (voiceless). Second is place of articulation, meaning where in the mouth the blockage happens. A “p” blocks air at the lips, a “t” blocks it behind the upper teeth, and a “k” blocks it at the back of the mouth near the soft palate. Third is manner of articulation, which describes how the airflow is blocked or modified.
English consonants include several manner categories. Stops (like “p,” “b,” “t,” “d,” “k,” “g”) completely block the airflow for a moment and then release it. Fricatives (like “f,” “v,” “s,” “z,” “sh”) force air through a narrow gap, creating a hissing or buzzing quality. Affricates (the “ch” in “chip” and the “j” in “jump”) begin like a stop and release into a fricative. Nasals (like “m,” “n,” and the “ng” at the end of “sing”) redirect air through the nose. Liquids (“l” and “r”) partially obstruct the airflow but keep it moving smoothly. Glides (“w” and “y”) are brief, vowel-like sounds that transition quickly into the next sound. American English has about 24 consonant phonemes in total.
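To make the three-feature scheme concrete, here is a small sketch in Python that stores a handful of consonants as (voicing, place, manner) entries and prints a plain-English description of each. The feature labels are standard phonetic terms, but the table itself is deliberately incomplete and purely illustrative; nothing here is part of any particular linguistic toolkit.

```python
# Illustrative (incomplete) feature table for some English consonants.
CONSONANTS = {
    # sound: (voicing, place of articulation, manner of articulation)
    "p": ("voiceless", "bilabial", "stop"),
    "b": ("voiced",    "bilabial", "stop"),
    "t": ("voiceless", "alveolar", "stop"),
    "d": ("voiced",    "alveolar", "stop"),
    "k": ("voiceless", "velar",    "stop"),
    "g": ("voiced",    "velar",    "stop"),
    "s": ("voiceless", "alveolar", "fricative"),
    "z": ("voiced",    "alveolar", "fricative"),
    "m": ("voiced",    "bilabial", "nasal"),
    "n": ("voiced",    "alveolar", "nasal"),
}

def describe(sound: str) -> str:
    """Return a plain-English description like 'b: voiced bilabial stop'."""
    voicing, place, manner = CONSONANTS[sound]
    return f"{sound}: {voicing} {place} {manner}"

if __name__ == "__main__":
    for c in ("p", "b", "z", "n"):
        print(describe(c))  # e.g. "b: voiced bilabial stop"
```

Note how “p” and “b” differ only in voicing, exactly the kind of single-feature contrast that separates “pat” from “bat.”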
How Vowels Are Classified
Vowels don’t involve any real blockage, so they’re classified differently. Linguists look at four things: how high or low the tongue sits in the mouth, how far forward or back the tongue is positioned, whether the lips are rounded or spread, and whether the tongue muscles are tense or relaxed.
The vowel in “see” is a high, front vowel with spread lips. The vowel in “father” is a low, back vowel. The vowel in “boot” is a high, back vowel with rounded lips. These physical positions directly affect the sound’s acoustic properties. Specifically, tongue height controls the frequency of the first resonance in the sound wave, while tongue position (front to back) controls the second resonance. That’s why “see” and “Sue” sound so different even though they’re both high vowels: one is front, the other is back.
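To tie the articulatory description to the acoustics, here is a rough sketch pairing those three example vowels with ballpark values for the first two resonances (often called the first and second formants, F1 and F2). The Hz figures are approximate averages assumed for illustration only; real values vary considerably by speaker, accent, and context.

```python
# Rough, assumed formant values for three English vowels; actual numbers
# differ from speaker to speaker and are shown here only to illustrate the
# height/backness pattern described in the text.
VOWELS = {
    # word (vowel): (height, backness, lip rounding, ~F1 Hz, ~F2 Hz)
    "see (/i/)":    ("high", "front", "spread",  280, 2250),
    "father (/ɑ/)": ("low",  "back",  "unrounded", 730, 1100),
    "boot (/u/)":   ("high", "back",  "rounded",  300,  870),
}

for word, (height, backness, rounding, f1, f2) in VOWELS.items():
    # Lower F1 goes with a higher tongue; lower F2 goes with a backer tongue.
    print(f"{word:14s} {height}/{backness}/{rounding}  F1≈{f1} Hz, F2≈{f2} Hz")
```

The pattern in the output mirrors the prose: “see” and “boot” share a low F1 because both are high vowels, but their F2 values sit far apart because one is front and the other is back.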
Speech Sounds Are Not the Same as Letters
One of the most important things to understand is that speech sounds and letters are separate systems. Letters (graphemes) are visual symbols. Speech sounds (phonemes) are acoustic events. In a perfectly designed writing system, each letter would represent exactly one sound. English is far from that ideal.
The letter “c” can represent a “k” sound (as in “cat”) or an “s” sound (as in “city”). The combination “sh” uses two letters for a single sound. The combination “ough” represents different sounds in “though,” “through,” “rough,” and “cough.” Going the other direction, the single sound “ee” can be spelled as “ea” (team), “ee” (feet), “ie” (field), or “ei” (receive). Vowels get even messier when followed by “r,” which warps them into distinct sounds: the “ar” in “barn,” the “or” in “corn,” and the “er” in “fern” are all vowels altered by the trailing “r.”
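A small sketch makes the many-to-many mapping easy to see: the same four letters “ough” map to four different pronunciations, while the single “ee” vowel shows up under four different spellings. The transcriptions are standard dictionary-style IPA (roughly General American); the word lists are just illustrative samples.

```python
# One spelling, several sounds: the letters "ough"
# (rough General American transcriptions).
OUGH = {
    "though":  "ðoʊ",
    "through": "θru",
    "rough":   "rʌf",
    "cough":   "kɔf",
}

# One sound, several spellings: the "ee" vowel /i/.
LONG_E = {
    "team":    "ea",
    "feet":    "ee",
    "field":   "ie",
    "receive": "ei",
}

for word, ipa in OUGH.items():
    print(f"{word:8s} is pronounced /{ipa}/")
for word, spelling in LONG_E.items():
    print(f"{word:8s} spells the \"ee\" sound with '{spelling}'")
```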
This mismatch between letters and sounds is a major reason English spelling is so difficult to learn. It’s also why linguists developed a standardized system for writing down speech sounds.
The International Phonetic Alphabet
The International Phonetic Alphabet (IPA) is a universal notation system where each symbol represents exactly one speech sound, and each speech sound maps to exactly one symbol. Unlike English spelling, there’s no ambiguity. The IPA works across all human languages, so a linguist in Tokyo and a linguist in São Paulo can look at the same transcription and know precisely which sounds are being described.
You’ve probably encountered IPA symbols in dictionary pronunciation guides. The word “ship,” for example, is transcribed as /ʃɪp/, where /ʃ/ represents the “sh” sound, /ɪ/ is the short “i” vowel, and /p/ is the final consonant. Each symbol captures what the mouth is doing, not what the spelling looks like.
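As a sketch of how a transcription decomposes symbol by symbol, the snippet below walks through /ʃɪp/ and prints a description of each sound, reusing the voicing/place/manner and height/frontness vocabulary from earlier. The three-entry table is illustrative, not a full IPA chart.

```python
# Tiny illustrative lookup: one description per IPA symbol in /ʃɪp/.
IPA_SYMBOLS = {
    "ʃ": "voiceless postalveolar fricative (the 'sh' sound)",
    "ɪ": "near-high, front, lax vowel (the short 'i')",
    "p": "voiceless bilabial stop",
}

word, transcription = "ship", "ʃɪp"
print(f"'{word}' is transcribed /{transcription}/:")
for symbol in transcription:
    print(f"  /{symbol}/ -> {IPA_SYMBOLS[symbol]}")
```

Four letters of spelling, three symbols of transcription: the IPA records what the mouth does, not how the word is written.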
How Sounds Blend Together in Real Speech
When you talk at a normal speed, you don’t produce one clean sound at a time. Your mouth is constantly preparing for the next sound while still finishing the current one. This overlap is called coarticulation, and it means that every speech sound is slightly different depending on what comes before and after it.
A classic example: the vowel in “mish” is physically different from the same vowel in “miss,” because your lips start rounding to prepare for the “sh” sound in “mish” while you’re still producing the vowel. Your brain compensates for this automatically when listening. In fact, these subtle overlapping cues help you predict upcoming sounds, making it easier to understand speech in noisy environments or at fast speeds.
This is part of why speech recognition is so complicated for computers. A speech sound isn’t a fixed, unchanging thing. It’s a flexible target that shifts depending on context, speaker, accent, and speaking rate. Your brain handles all of this effortlessly, parsing a continuous stream of overlapping acoustic information into distinct, meaningful units dozens of times per second.
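As a deliberately simplified illustration of that context-dependence, the sketch below picks a surface form for the American English “t” sound based on its neighbors: aspirated at the start of “top,” unaspirated after “s” in “stop,” and often a quick flap between vowels in “butter.” The rules and contexts are toy assumptions, not a real model of running speech.

```python
# Toy sketch: the "same" /t/ surfaces differently depending on context.
# The rules below are drastic simplifications used only for illustration.
def realize_t(before: str, after: str) -> str:
    """Pick a rough surface form of /t/ given its neighboring letters."""
    vowels = set("aeiou")
    if before == "s":
        return "[t]  (unaspirated, as in 'stop')"
    if before in vowels and after in vowels:
        return "[ɾ]  (flap, as in 'butter')"
    return "[tʰ] (aspirated, as in 'top')"

print(realize_t(before="",  after="o"))  # word-initial   -> aspirated
print(realize_t(before="s", after="o"))  # after "s"      -> unaspirated
print(realize_t(before="u", after="e"))  # between vowels -> flap
```

A listener hears all three variants as “the same sound,” which is exactly the kind of many-to-one mapping that a speech recognizer has to learn.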
Why Speech Sounds Matter
Understanding speech sounds has practical applications well beyond linguistics classrooms. Speech-language pathologists use this knowledge to diagnose and treat speech disorders in children and adults. Literacy educators rely on the relationship between sounds and letters (phonics) to teach reading. Accent coaches break down the specific sounds that differ between a speaker’s native language and their target language. And engineers building voice assistants need detailed models of how speech sounds behave acoustically.
For anyone learning to read, learning a second language, or trying to understand why English spelling seems so illogical, the core insight is the same: spoken language is built from a finite set of distinct sounds, and those sounds follow physical rules based on what your tongue, lips, and vocal folds are doing. Letters are just one imperfect attempt to write those sounds down.