Materials Part I

M1: Prevailing Theory

Vowels - Speech production: source and filter - Formants - Vowel-specific formants - Age- and gender-specific formants

Vowels

“Vowel […]. 1. (also vocoid) In phonetics, a segment whose articulation involves no significant obstruction of the airstream, such as [a], [ i ] or [u]. Strictly speaking, a glide such as [ j ] or [w] may also be regarded as a (brief) vowel in this sense. 2. In phonology, a segment which forms the nucleus of a syllable. 3. Any letter of the alphabet which, generally or in a particular case, represents a vowel in sense 2.” (Trask, 1996, p. 382)

“Vocoid […]. 1. A synonym for vowel in the phonetic sense of that term (sense 1), introduced in an effort to remove the ambiguity between the phonetic and the phonological sense of ‘vowel’. While possibly useful, the term has never become established. Pike (1943). 2. More narrowly, a vocoid in sense 1 which is also syllabic: a true vowel, as opposed to a glide or approximant. Sense 2: Laver (1994).” (Trask, 1996, p. 378)

“Vowels and Consonants. Phonetics has traditionally classified the segments of speech into two basic varieties which are called vowels and consonants. Once again, there has never been a straightforward definition of these terms. Early linguists in India also grappled with the concepts of vowel, consonant, and syllable around 800 BC, and they recognized that the three notions are hopelessly intertwined […]. The definitions used here will be similar to those of the ancient Sanskrit scholars, and in fact, the development of modern phonetics in the West owes much to the transmission of knowledge in translation from the Sanskrit sources.
A vowel is defined as a ‘vowel-like segment’ (what Pike […] termed a vocoid) that occupies the nucleus of a syllable. A segment is considered to be a vocoid when its articulation permits the relatively free passage of air through the center of the mouth. This definition is also rather loose, but in roughly familiar terms, most segments that are at least as open as an English w or y-sound (the latter is transcribed [ j ] in IPA) are vocoids, all others being non-vocoids. A consonant is then defined simply as a non-vocoid, no matter what syllable position it occupies. This imperfect dichotomy leaves room for a middle category, that of the semivowel, which is defined as a vocoid located outside the nucleus of a syllable. Semivowels, in spite of being vocoids, are usually regarded as a special sort of consonant (often called a ‘glide’) in the interests of preserving the consonant-vowel dichotomy. The interplay […] slightly different (more acoustic) view by Orlikoff and Kahane: ‘Consonants differ from vowels primarily by the amount of vocal tract constriction employed in their production […] Speech can be considered to be an overlay of consonants on the vocal signal. The dispersion of consonants results in an amplitude modulation of the acoustic energy that, for the most part, gives rise to our perception of syllables.’” (Fulop, 2011, pp. 8–9)

Speech production: source and filter

“The speech wave is the response of the vocal tract filter systems to one or more sound sources. This simple rule, expressed in the terminology of acoustic and electrical engineering, implies that the speech wave may be uniquely specified in terms of source and filter characteristics. In spite of the technical phrasing it is apparent that this statement also covers essentials of the phonetician’s concept of speech production.” (Fant, 1960, p. 15) See also Chapter M4.

Formants

“The spectral peaks of the sound spectrum | P( f ) | are called formants. Referring to Fig. 1.1-2, it may be seen that one such resonance has its counterpart in a frequency region of relatively effective transmission through the vocal tract. This selective property of | T( f ) | is independent of the source. The frequency location of a maximum in | T( f ) |, i.e. the resonance frequency, is very close to the corresponding maximum in spectrum P( f ) of the complete sound. Conceptually these should be held apart, but in most instances resonance frequency and formant frequency may be used synonymously. Thus, for technical applications dealing with voiced sounds it is profitable to define formant frequency as a property of T( f ).
The basic principle of the theory of voiced sounds is that, to a first order of approximation, the filter function is independent of the source. The formant peak will thus only accidentally coincide with the frequency of a harmonic. The formant frequencies can change only as a result of an articulatory change affecting the dimensions of the various parts of the vocal tract cavity system and thus the filter function. Conversely, but with the limitations implied by the concept of compensatory forms of articulation, the formant frequencies provide information about the position of the speaker’s articulatory organs. If these formant frequencies are held constant and the fundamental frequency is raised one octave, the result is ideally that twice as many pulses per second are emitted from the voice organs. The distance between adjacent harmonics in the spectrum will be doubled, and the number of harmonics up to a certain fixed frequency limit will thus be halved. If a specific formant, for instance the first, comes close to the 6th harmonic at the lower pitch, it will be the 3rd harmonic that comes closest to the same formant in the case of the higher pitch. The concepts of formant frequency and harmonic number should not be confused.” (Fant, 1960, p. 20)
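
Fant's octave example can be checked with a few lines of arithmetic. The following Python sketch is illustrative only and is not part of the quoted source: the lower fundamental of 100 Hz, the first formant of 600 Hz, and the frequency limit of 4000 Hz are assumed values, chosen so that the formant falls on the 6th harmonic at the lower pitch, and the function names are hypothetical.

```python
def nearest_harmonic(formant_hz, f0_hz):
    """Return the number of the harmonic of f0_hz that lies closest to formant_hz."""
    return max(1, round(formant_hz / f0_hz))


def harmonics_below(limit_hz, f0_hz):
    """Count the harmonics of f0_hz at or below limit_hz."""
    return int(limit_hz // f0_hz)


# Assumed illustrative values (not taken from Fant).
FIRST_FORMANT_HZ = 600.0
LOW_F0_HZ = 100.0
HIGH_F0_HZ = 2 * LOW_F0_HZ  # fundamental raised by one octave
LIMIT_HZ = 4000.0

for f0 in (LOW_F0_HZ, HIGH_F0_HZ):
    print(
        f"f0 = {f0:.0f} Hz: "
        f"harmonic nearest F1 = {nearest_harmonic(FIRST_FORMANT_HZ, f0)}, "
        f"harmonics up to {LIMIT_HZ:.0f} Hz = {harmonics_below(LIMIT_HZ, f0)}"
    )
```

With these assumed values the formant frequency stays at 600 Hz, while the harmonic nearest to it changes from the 6th to the 3rd and the number of harmonics below the limit drops from 40 to 20. This is the distinction between formant frequency and harmonic number that the passage describes.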

Vowel-specific formants

“Usually vowels can be quite well characterized in terms of the frequencies of just the first and second formants, but the third formant should also be measured for high front vowels and for r-colored vowels.” (Ladefoged, 2003, p. 105)

Age- and gender-specific formants

“The length of the pharyngeal-oral tract depends on the physical size of the speaker. The length affects the frequency locations of all of the vowel formants; this fact helps us to predict where the formant peaks in the spectrum will appear for men, women, and children. A very simple rule relates the frequencies of the formants to the overall length of the tract from glottis through lips. The rule for this relation is:
Length Rule. The average frequencies of the vowel formants are inversely proportional to the length of the pharyngeal-oral tract. In other words, the longer the tract, the lower are its average formant frequencies.
The neutral vowel formants for the average man, with an oral tract 17.5 cm in length, are at 500, 1500, 2500 Hz, and so on, with the lowest formant at 500 Hz and frequency spacing of 1000 Hz between all formants.
An easy way to remember the neutral formant frequencies is to think of the odd numbers 1, 3, 5, 7, 9, and so on, because the formant frequencies of a uniform tube that is closed at one end and open at the other, like the pharyngeal-oral tract, are always odd multiples of the frequency of the lowest formant. For example, begin with the basic formant frequency, 500 Hz, as the unit or 1; then the formant frequencies above that are 500 × 3 = 1500 Hz, 500 × 5 = 2500 Hz, and so on. This method, calculating the formants above F1 as multiples of F1, applies only as a model of a neutral tract shape.
The pharyngeal-oral tract length of an infant is approximately half the length of that of a man. Therefore, following our Length Rule about formant frequency locations, the formants of a neutral-shaped infant tract in relation to a man’s would be at frequency locations that are a factor of the reciprocal of ½, or twice those of the man. On this basis the infant formant locations for a neutral vowel would be as follows: F1 is 500 × 2 = 1000 Hz, F2 is 1500 × 2 = 3000 Hz, F3 is 2500 × 2 = 5000 Hz, and so on.
Following the same procedure, a woman’s vocal tract, on the average, is about 15% shorter than that of a man. The ratio corresponding to this amount of shortening is approximately 5/6. The reciprocal of 5/6 is 6/5, which is equal to a factor of 1.20, which, when multiplied by the man’s neutral formant frequencies, gives the woman’s values of 20% higher: F1 is 500 × 1.2 = 600 Hz, F2 is 1500 × 1.2 = 1800 Hz, F3 is 2500 × 1.2 = 3000 Hz, and so on. […]
The Length Rule tells us approximately where we may find the formants for the very young as well as for older, larger persons. However, the neutral locations of F1 and F2 for an individual are also affected by the length proportions of the vocal tract between the oral and pharyngeal cavities (Fant, 1973, Chapter 4). In general, the location and spacing of formants F3 and above are more closely correlated with length of vocal tract than for F1 and F2. The average locations of F1 and F2 for an individual are also affected somewhat by language environment and training.” (Pickett, 1999, pp. 38–40)
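
The arithmetic in Pickett's Length Rule can be reproduced with a short Python sketch. It is a simplified illustration under stated assumptions, not Pickett's own procedure. It uses the standard quarter-wavelength formula for a uniform tube closed at one end, F_n = (2n − 1) · c / (4 · L), which the quoted passage implies but does not state. The speed of sound c = 35 000 cm/s is an assumed value, chosen because it yields 500 Hz for a 17.5 cm tract, and the function names are hypothetical.

```python
# Assumed speed of sound in cm/s; chosen so that a 17.5 cm tube gives F1 = 500 Hz.
SPEED_OF_SOUND_CM_PER_S = 35000.0


def neutral_formants(tract_length_cm, count=3, c=SPEED_OF_SOUND_CM_PER_S):
    """Resonances of a uniform tube closed at one end: odd multiples of c / (4 L)."""
    return [(2 * n - 1) * c / (4.0 * tract_length_cm) for n in range(1, count + 1)]


def scale_by_length(reference_formants, length_ratio):
    """Length Rule: formants are inversely proportional to tract length.

    length_ratio is the new tract length divided by the reference tract length.
    """
    return [f / length_ratio for f in reference_formants]


man = neutral_formants(17.5)              # 500, 1500, 2500 Hz
infant = scale_by_length(man, 0.5)        # half the length: formants doubled
woman = scale_by_length(man, 5.0 / 6.0)   # Pickett's rounded ratio: about 20% higher
woman_exact = scale_by_length(man, 0.85)  # literal 15% shortening: about 17.6% higher

for label, formants in [("man", man), ("infant", infant),
                        ("woman (5/6)", woman), ("woman (0.85)", woman_exact)]:
    print(label, [round(f) for f in formants])
```

The last two rows show the effect of the rounding in the passage: a tract that is exactly 15% shorter corresponds to a length ratio of 0.85 and an F1 of about 588 Hz, whereas the approximation 5/6 gives the quoted 600 Hz. As the passage itself notes, this uniform-tube model applies only to a neutral tract shape.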