Speech perception is one of the most impressive feats of human cognition. We effortlessly decode a continuous stream of sound into discrete words and meanings, despite enormous variability in the acoustic signal due to differences in speakers, speaking rates, accents, and background noise. The problem is so computationally difficult that automatic speech recognition systems, despite dramatic improvements, still struggle in conditions that human listeners handle with ease.
Key Structures
- Superior temporal gyrus — The upper temporal lobe gyrus containing auditory cortex and regions critical for speech perception and social cognition.
- Wernicke's area — The left posterior superior temporal region involved in speech comprehension and the mapping of sound to meaning.
- Broca's area — The left inferior frontal region critical for speech production, syntactic processing, and verbal working memory.
- Motor cortex — The precentral cortical region that plans, initiates, and executes voluntary movements through corticospinal projections.
- Mirror Neurons — Neurons that fire both when performing an action and when observing the same action performed by another, potentially supporting action understanding and imitation.
- Recognition — A form of memory retrieval in which a previously encountered item is identified as familiar when presented again, typically easier than recall because the target item itself serves as a retrieval cue.
- Perceptual Constancy — The ability of the perceptual system to perceive stable properties of objects (size, shape, color, brightness) despite continuous variation in the sensory input they produce.
- McGurk Effect — An illusion demonstrating that speech perception is fundamentally multimodal: when auditory and visual speech information conflict, perceivers often hear a sound that neither source actually produced.
- Categorical Perception — The phenomenon whereby a continuous range of physical stimuli is perceived as falling into discrete categories, with better discrimination across category boundaries than within categories.
- Phoneme — The smallest unit of sound in a language that can distinguish one word from another — an abstract mental category rather than a specific physical sound.
Key Functions
Decode continuous acoustic signals into discrete phonemes, syllables, and words for language comprehension.
The Acoustic Signal
Speech sounds are produced by coordinated movements of the lungs, vocal folds, tongue, lips, and jaw. These articulatory gestures create complex acoustic patterns characterized by formant frequencies (resonances of the vocal tract), voice onset time (the delay between release of a consonant closure and the onset of voicing), and spectral transitions. The speech signal is continuous — there are no reliable acoustic boundaries between words, and often not between phonemes.
Categorical Perception
One of the earliest and most influential findings in speech perception research was categorical perception. When a physical continuum (such as voice onset time, which distinguishes /b/ from /p/) is varied in equal steps, listeners do not hear a gradual change but a sharp boundary between categories. Discrimination is far better across the category boundary than within a category, even for equally spaced physical differences. This suggests that the speech perception system imposes discrete categories on continuous acoustic input.
The perceptual boundary between voiced and voiceless stops is remarkably sharp.
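The pattern described above can be illustrated with a small numerical sketch. It assumes a logistic identification function along a voice onset time continuum and, as a simplification, that listeners discriminate two stimuli only by comparing the labels they assign to them. The function names, the 25 ms boundary, and the slope are hypothetical values chosen for illustration, not measurements.

```python
import math


def p_voiceless(vot_ms, boundary_ms=25.0, slope=0.5):
    """Probability of labeling a stimulus as voiceless (/p/) given its VOT.

    boundary_ms and slope are illustrative assumptions, not measured values.
    """
    return 1.0 / (1.0 + math.exp(-slope * (vot_ms - boundary_ms)))


def predicted_discrimination(vot_a, vot_b):
    """Predicted proportion correct in an ABX task under a label-only
    assumption: chance (0.5) plus half the squared difference in
    labeling probabilities."""
    pa, pb = p_voiceless(vot_a), p_voiceless(vot_b)
    return 0.5 + 0.5 * (pa - pb) ** 2


# Two pairs with the same 10 ms physical difference.
within = predicted_discrimination(0.0, 10.0)    # both stimuli labeled /b/
across = predicted_discrimination(20.0, 30.0)   # pair straddles the boundary

print(f"within-category pair: {within:.2f}")
print(f"cross-boundary pair:  {across:.2f}")
```

Under these assumptions the within-category pair stays near chance while the cross-boundary pair is discriminated well, even though both pairs are separated by the same 10 ms.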
The Lack of Invariance Problem
The same phoneme can have radically different acoustic realizations depending on the surrounding phonemes (coarticulation), the speaker's vocal tract, speaking rate, and emotional state. The /d/ sounds in "deep" and "doom" differ substantially in their acoustic properties because the tongue and lips are already moving toward the following vowel. Yet listeners perceive both as /d/. How the brain achieves this perceptual constancy despite acoustic variability (the lack of invariance problem) has been called the central challenge of speech perception.
Motor Theory and Direct Realism
Alvin Liberman's motor theory of speech perception proposed that listeners perceive not acoustic patterns but the intended articulatory gestures of the speaker. This elegantly solves the invariance problem: different acoustic signals map to the same perceived phoneme because they correspond to the same intended gesture. The discovery of mirror neurons — neurons that fire both when performing and observing an action — has provided some neurological plausibility for this view, and the McGurk effect (where visual information about lip movements alters what listeners hear) demonstrates audiovisual integration in speech perception.
When an audio recording of "ba" is paired with video of a face saying "ga," most listeners perceive "da" — a fusion of the auditory and visual information. This illusion, discovered by Harry McGurk and John MacDonald (1976), powerfully demonstrates that speech perception is fundamentally multimodal: the visual speech signal (lip movements, facial gestures) is automatically integrated with the auditory signal.
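One simple way to picture this kind of fusion is to let each modality assign a degree of support to every candidate syllable, multiply the two, and normalize. The sketch below does this. The multiplicative rule is only one possible modeling choice, and the support values are hypothetical numbers picked to illustrate the idea, not experimental data.

```python
def fuse(auditory, visual):
    """Combine per-syllable support from two modalities by multiplying
    and normalizing so the results sum to 1."""
    combined = {syl: auditory[syl] * visual[syl] for syl in auditory}
    total = sum(combined.values())
    return {syl: value / total for syl, value in combined.items()}


# Hypothetical support values (assumptions, not data).
# The audio is "ba", which is acoustically closer to "da" than to "ga".
auditory = {"ba": 0.60, "da": 0.35, "ga": 0.05}
# The video shows "ga": the lips never close, so "ba" is visually unlikely,
# while "da" and "ga" look similar on the face.
visual = {"ba": 0.02, "da": 0.45, "ga": 0.53}

percept = fuse(auditory, visual)
best = max(percept, key=percept.get)
print(best, round(percept[best], 2))
```

With these numbers neither modality on its own favors "da", yet the combined estimate does, because "da" is the only candidate that both sources find reasonably plausible.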
Lexical and Contextual Effects
Top-down knowledge strongly influences speech perception. The phoneme restoration effect (Richard Warren, 1970) shows that when a phoneme is replaced by noise, listeners "hear" the missing phoneme based on lexical and sentence context. The Ganong effect demonstrates that an ambiguous sound between two phonemes tends to be perceived as whichever interpretation forms a real word. These findings show that speech perception is an active inferential process combining acoustic evidence with linguistic knowledge.
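The idea of combining acoustic evidence with lexical knowledge can be sketched by adding the two sources in log-odds, which is equivalent to multiplying odds as in Bayes' rule. The example uses a /g/ to /k/ voice onset time continuum. The function name, boundary, slope, and size of the lexical bias are all hypothetical values chosen for illustration.

```python
import math


def p_k(vot_ms, lexical_log_odds=0.0, boundary_ms=30.0, slope=0.3):
    """Probability of reporting /k/ rather than /g/.

    Acoustic evidence (distance from the category boundary) and lexical
    knowledge are added in log-odds. All parameters are illustrative.
    """
    acoustic_log_odds = slope * (vot_ms - boundary_ms)
    return 1.0 / (1.0 + math.exp(-(acoustic_log_odds + lexical_log_odds)))


WORD_BIAS = 1.5  # hypothetical log-odds in favor of the real-word reading

ambiguous_vot = 30.0  # acoustically halfway between /g/ and /k/
in_kiss = p_k(ambiguous_vot, +WORD_BIAS)  # "kiss" is a word, "giss" is not
in_gift = p_k(ambiguous_vot, -WORD_BIAS)  # "gift" is a word, "kift" is not
clear_k_in_gift = p_k(60.0, -WORD_BIAS)   # unambiguous /k/ despite the bias

print(f"ambiguous sound before -iss: P(/k/) = {in_kiss:.2f}")
print(f"ambiguous sound before -ift: P(/k/) = {in_gift:.2f}")
print(f"clear /k/ before -ift:       P(/k/) = {clear_k_in_gift:.2f}")
```

In this sketch the same ambiguous sound is pulled toward /k/ in one context and toward /g/ in the other, while a clear /k/ is still reported as /k/ even when it yields a nonword, so the lexical bias matters mainly where the acoustic evidence is weak.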
Disorders
- Wernicke's aphasia — Fluent but meaningless speech with severely impaired comprehension; paraphasias; neologisms; poor self-monitoring.
- Word deafness — A form of auditory agnosia characterized by inability to comprehend spoken words despite intact hearing acuity.
- Auditory verbal agnosia — Inability to recognize spoken words despite intact hearing, typically resulting from bilateral or, less often, left-hemisphere temporal lobe damage.