Materials Part II

M6: Terms of Reference, Methods of Formant Estimation

Terms of reference

“Formant […]. A concentration of acoustic energy within a particular frequency band, especially in speech. Any given configuration of the vocal tract produces resonance, and hence formants, in certain frequency ranges. During the articulation of a vowel, these formants show up prominently in a sound spectrogram as thick dark bars; the three lowest of these, known as first, second and third formants (F1, F2 and F3) are highly diagnostic, and vowels are distinguished acoustically by the positions of these formants.” (Trask, 1996, p. 148)

“Some refer to a formant as a peak in the acoustic spectrum. In this usage, a formant is an acoustic feature that may or may not be evidence of a vocal tract resonance. Others use the term formant to designate a resonance, whether or not actual empirical evidence is found for it.” (Kent & Read, 2002, p. 24)

“Resonances, formants and spectral peaks: Unfortunately, the meaning of the word ‘formant’ has expanded to describe two or three different things. Fant (1960) gives this definition: ‘The spectral peaks of the sound spectrum | P( f ) | are called formants.’ Resonance frequencies are then defined in terms of the gain function T( f ) of the tract by ‘The frequency location of a maximum in | T( f ) |, i.e. the resonance frequency, is very close to the corresponding maximum in spectrum | P( f ) | of the complete sound.’ Fant then writes: ‘Conceptually these should be held apart but in most instances resonance frequency and formant frequency may be used synonymously.’ Benade (1976) uses a similar definition of formant: ‘The peaks that are observed in the spectrum envelope are called formants.’ More recently, the acoustical properties of the vocal tract are often modelled using an all-pole autoregressive filter (Atal and Hanauer, 1971). For many voice researchers, formants now refer to the poles of this filter model. To others, formant means the resonance frequency of the tract. Finally, many researchers, particularly in the broader field of acoustics, retain the original meaning: a broad peak in the spectral envelope of a sound (of a voice, musical instrument, room etc.). The original meaning of formant is also retained, almost universally, when discussing the singers formant and actors formant: these terms refer to a peak in the spectral envelope around 3 kHz (discussed below). As Fant observes, while these uses are often closely related, they are conceptually quite distinct. Further, the resonant frequency, the pole of the fitted filter function and the peak spectral maximum need not coincide. Moreover, it is now possible to measure resonances of the vocal tract quite independently of the voice. 
Consequently, it is sometimes essential to make a clear distinction among a resonance frequency (a physical property of the tract), a filter pole (a value derived from data processing) and a spectral peak (a property of the sound).” (Wolfe, Garnier, & Smith, 2009)

“Formant is used by James Jeans (1938) to mean the collection of harmonics of a note that are augmented by a resonance.
Formant was defined by Gunnar Fant (1960): ‘The spectral peaks of the sound spectrum | P( f ) | are called formants’.
Benade (1976) writes: ‘The peaks that are observed in the spectrum envelope are called formants’.
In its standards for acoustical terminology, the Acoustical Society of America (1994) defines formant thus: ‘Of a complex sound, a range of frequencies in which there is an absolute or relative maximum in the sound spectrum. Unit, hertz (Hz). NOTE-The frequency at the maximum is the formant frequency.’” (Wolfe, n.d.)

“Does it matter? For the voice, a resonance at a frequency R( i ) gives rise to a spectral maximum at frequency F( i ) which may produce in a filter model a pole at frequency P( i ). Usually, the three frequencies have similar values. However, as Fant observed, they are conceptually distinct. Let’s take some examples:

In our laboratory, the distinction is important. We routinely measure the resonances independently of the voice (Epps et al, 1997; Dowd et al, 1997; Joliveau et al, 2004a, b). We are often interested in comparing formants and resonances.
What to do? Our preference would be to retain the original meaning for the word formant. We prefer to say ‘A resonance at frequency Ri gives rise to a formant at frequency Fi. This may be modelled by a filter with a pole at frequency Pi’. While acousticians will broadly agree with this use, some members of the speech research and modelling community may not. We therefore suggest that, when discussing the voice, the word formant should be defined, to make it clear which meaning is intended. In principle, one could consider abandoning the word. However ‘broad peak in the spectral envelope’ is a long phrase, so it is useful to retain formant for that reason. […]
Whatever your choice of definition, you should make it clear. And, in literature and in discussions, prepare for some confusion. For instance, some researchers who use formant to mean resonance will also talk about ‘formant level’. When such people then talk of ‘formant level’, or say that the second formant is 10 dB lower than the first, I suspect that they refer to the amplitude of a peak in the sound spectrum. In a scientific talk, I have heard the sentence: ‘Trained sopranos tune the first formant near the note sung, but they usually don’t have a strong singer’s formant’. When that speaker said ‘first formant’ he presumably meant ‘first resonance’ and when he said ‘singer’s formant’ he meant a spectral peak probably due to two or more resonances. So we have the same person using the word in two of its three different meanings in the one sentence.” (Wolfe, n.d.)

“With regard to airway resonances, historical precedence and current usage of terminology are also slightly at odds. Joe Wolfe and colleagues suggest that the symbol R be used to stand separate from the symbol F for formant (Wolfe, 2014). The distinction is being made because a formant was originally defined as a peak in the output spectrum envelope radiated from the mouth (Hermann, 1894, 1895; Russell, 1929; Fant, 1960, p. 20). A similar definition appears in the current ASA standard of acoustic terminology (Acoustical Society of America, 2004), namely, that a formant is ‘a range of frequencies in which there is absolute or relative maximum in the sound spectrum. The frequency at the maximum is the formant frequency.’ As such, a formant involves both the source and the filter. However, as speech analysis and synthesis have progressed in a half century, the definition has not been universally maintained. Fant (1960, pp. 20, 53) defined formants as the poles of the transfer function of the supraglottal vocal tract, and labeled the pole frequencies F1, …, Fn and their bandwidths B1, …, Bn. He was followed in this path by many authors, such as Titze (1994, p. 156) or Stevens (1998, p. 131). It is noteworthy that Flanagan (1965, p. 57) was aware of the dual definition (and possible evolution) by using the term ‘formant resonance.’ While Benade (1976) maintained the definition of ‘peaks in the spectral envelope of the radiated sound,’ Badin and Fant (1984) computed formant frequencies and bandwidths on the basis of x-ray area function resonances of the supraglottal vocal tract, not peaks in the output spectrum envelope. Story et al. (1996) did similar calculations based on magnetic resonance imaging (MRI).
Differentiation between the formant frequencies and resonance frequencies of the vocal tract can be found in some papers comparing measurements from phonation (formants) to those derived from vocal tract impedance measurements or from calculations based on MRI or computer tomography (CT) data (resonance frequencies) (e.g., Stoffers et al., 2006; Vampola et al., 2013).
What is relevant here for nomenclature and symbolic notation is that the letter R is easily distinguishable from the letter F or f, both in speaking and writing. Hence, it is useful as a subscript to separate source and filter symbols. Discussion can continue on whether or not a formant is a meaningful representation of any particular resonance. Some authors describe resonances pertaining to the supraglottal airway only (assuming no coupling to the glottal or subglottal system), while others describe the net effect of complex interactions of multiple resonators above, below, and within the larynx. […] Unfortunately, the common definition between a formant and a resonance is yet to be established.” (Titze et al., 2015)

Note that Titze et al. (2015) propose a new and consistent terminology for the frequencies, magnitudes and bandwidths of harmonics, resonances and formants.

“Spectrum Envelope: The term spectrum envelope refers to an imaginary smooth line drawn to enclose an amplitude spectrum. Figure 3-17 shows several examples. This is a rather simple concept that will play a very important role in understanding certain aspects of auditory perception. For example, we will see that our perception of a perceptual attribute called timbre (also called sound quality) is controlled primarily by the shape of the spectrum envelope, and not by the fine details of the amplitude spectrum. The examples in Figure 3-17 show how differences in spectrum envelope play a role in signaling differences in one specific example of timbre called vowel quality (i.e., whether a vowel sounds like /i/ vs. /a/ vs. /u/, etc.). For example, panels a and b in Figure 3-17 show the vowel /ɑ/ produced at two different fundamental frequencies. (We know that the fundamental frequencies are different because one spectrum shows wide harmonic spacing and the other shows narrow harmonic spacing.) The fact that the two vowels are heard as /a/ despite the difference in fundamental frequency can be attributed to the fact that these two signals have similar spectrum envelopes. Panels c and d in Figure 3-17 show the spectra of two signals with different spectrum envelopes but the same fundamental frequency (i.e., with the same harmonic spacing). As we will see in the chapter on auditory perception, differences in fundamental frequency are perceived as differences in pitch. So, for signals (a) and (b) in Figure 3-17, the listener will hear the same vowel produced at two different pitches. Conversely, for signals (c) and (d) in Figure 3-17, the listener will hear two different vowels produced at the same pitch.” (Hillenbrand, n.d., pp. 16–17)

Methods of formant estimation I: general aspects

“The difficulties involved in measuring formant frequencies have been well known since the early days of the spectrograph, and involve errors related to (i) the ambiguous definition of the object to be measured, (ii) spectral features of the speech wave, (iii) intermodulation distortion, (iv) the spectrographic record, and (v) the measuring procedure:

Lindblom’s advice is thus still valid today. It is still necessary to apply one’s knowledge and experience of speech production and expected envelope shapes to the problem of how to select samples to measure and where to look for spectral peaks.” (Wood, 1989, referring to Lindblom, 1962)

“[…] At this point we should remember that an LPC filter lumps together several aspects of speech production […]. An LPC spectrum represents not only the formant frequencies due to the resonances of the vocal tract but also the effects of the lip radiation and the spectrum of the pulse from the vocal folds. Nevertheless, the peaks in the LPC spectrum are usually good indicators of the formant frequencies. Problems may arise when two formants are close together, in which case the spectrum may appear to have only a single peak corresponding to both of them, or when one formant has a lower amplitude, so that it appears as only a kink in the curve representing another formant. These problems lead us to another way of considering LPC analysis.
It is also possible to analyze an LPC expression so as to determine the exact frequencies corresponding to the poles (which, however, may not be exactly those of the formants in the vocal tract transfer function). For every pair of LPC terms we get a pair of numbers corresponding to the frequency and the bandwidth of a pole in the filter. We know […] that there will be a formant at 500 Hz, 1,500 Hz, 2,500 Hz, and so on in a neutral vowel for a speaker with a vocal tract of 17.5 cm. In general, for such a speaker there will be one formant for every 1,000 Hz interval. So with a 10,000 Hz sample rate and an upper frequency limit of 5,000 Hz, we can expect to find five formants. This will require ten LPC terms. If we want to allow two further terms to account for higher formants that may be influencing the spectrum or a pole due to the glottal pulse shape, then we should make a twelve-point LPC analysis. If the speaker might have a shorter vocal tract so that we could only expect four formants below 10,000 Hz, then we could use a ten point LPC.
Choosing the right number of coefficients for an LPC analysis is somewhat of an art. If one chooses too many, the analysis will produce poles corresponding to spurious formants; if one chooses too few, formants may be lumped together because the higher formants or the glottal pulse may require more complex specification. The problem is compounded by the fact that an LPC analysis is equivalent to trying to model the spectrum using only poles, and there may be zeros (antiresonances) in the vocal tract transfer function. There certainly will be antiresonances in any vocal tract shape that contains the equivalent of a side tube, such as the oral cavity in the case of a nasal sound. LPC analysis is not reliable for nasalized vowels. A general rule of thumb for the number of coefficients is the sample rate in kHz plus 2, e.g. 10,000 Hz = 10 kHz plus 2 equals 12. But a better rule is to use several different analyses with different numbers of coefficients and see which gives the most interpretable results.” (Ladefoged, 1996, pp. 210–212)
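
The two steps described in the passage above (choosing the number of coefficients, then converting each pole pair to a frequency and a bandwidth) can be sketched in Python with NumPy. This is a minimal illustration under stated assumptions, not the procedure of any particular analysis program: the function names are hypothetical, and the autocorrelation method and the Hamming window are choices made here for brevity. The pole frequencies it returns are estimates and remain subject to all the caveats quoted above.

```python
import numpy as np

def lpc_order_rule_of_thumb(sample_rate_hz):
    """Rule of thumb from the quoted passage: sample rate in kHz plus 2."""
    return int(round(sample_rate_hz / 1000.0)) + 2

def lpc_coefficients(frame, order, apply_window=True):
    """Autocorrelation-method LPC (an assumed choice of method).

    Returns [1, a1, ..., ap], the denominator A(z) of the all-pole model 1/A(z).
    """
    x = np.asarray(frame, dtype=float)
    if apply_window:
        x = x * np.hamming(len(x))  # assumed window; other windows are possible
    # Autocorrelation at lags 0..order
    r = np.array([np.dot(x[: len(x) - k], x[k:]) for k in range(order + 1)])
    # Normal equations: sum_j a_j * r(|i - j|) = -r(i), for i = 1..order
    R = np.array([[r[abs(i - j)] for j in range(order)] for i in range(order)])
    a = np.linalg.solve(R, -r[1 : order + 1])
    return np.concatenate(([1.0], a))

def poles_to_formants(a, sample_rate_hz):
    """Convert each complex-conjugate pole pair to (frequency, bandwidth) in Hz."""
    poles = np.roots(a)
    poles = poles[np.imag(poles) > 0]  # keep one pole of each conjugate pair
    freqs = np.angle(poles) * sample_rate_hz / (2.0 * np.pi)
    bandwidths = -np.log(np.abs(poles)) * sample_rate_hz / np.pi
    idx = np.argsort(freqs)
    return freqs[idx], bandwidths[idx]
```

For a recording sampled at 10,000 Hz, `lpc_order_rule_of_thumb(10000)` gives 12. In line with the quoted advice, it is worth repeating the analysis with neighbouring orders and comparing which result is most interpretable.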

“Good spectrograms are a great help in determining where the formants are. This is often not as easy as one might imagine. You have to know where to look for formants before you can find them. The best practical technique is to look for one formant for every 1,000 Hz. The vowel ə, for example, has formants at about 500, 1,500 and 2,500 Hz for a male speaker (all slightly higher for a female speaker). Other vowels will have formants up or down from this mid range. But there are exceptions to this general rule of one formant per 1,000 Hz. It would be more true to say that there is, on average, one formant for every 1,000 Hz. Low back vowels may have two formants below 1,000 Hz, but nothing between 1,000 and 2,000 Hz, and then the third formant somewhere between 2,000 and 3,000 Hz.” (Ladefoged, 2003, pp. 113–114)
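
The 500, 1,500, 2,500 Hz pattern for a neutral vowel follows from the standard idealization of the vocal tract as a uniform tube closed at the glottis and open at the lips, whose resonances lie at odd multiples of c/4L. The sketch below is only that idealization: the function name is hypothetical, and the tract length of 17.5 cm and the speed of sound of 350 m/s are assumed values, not measurements.

```python
def neutral_tube_formants(length_m=0.175, speed_of_sound_m_s=350.0, max_hz=5000.0):
    """Resonances of a uniform tube closed at one end and open at the other.

    F_n = (2n - 1) * c / (4 * L), returned for all n with F_n below max_hz.
    Default length and speed of sound are assumptions for illustration.
    """
    formants = []
    n = 1
    while True:
        f = (2 * n - 1) * speed_of_sound_m_s / (4.0 * length_m)
        if f >= max_hz:
            return formants
        formants.append(f)
        n += 1
```

With the default values this yields five resonances below 5,000 Hz, spaced 1,000 Hz apart, starting near 500 Hz. A shorter assumed tract raises all of them, which is the sense in which the one-formant-per-1,000-Hz guideline is only an average for one class of speakers.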

Methods of formant estimation II: methodological limits related to F0

“[…] in the case of female speech, formant analysis is extremely difficult. The fundamental frequency is so high that formants are often poorly defined. […] We had difficulties in determining the position of a formant in about 40% of the 300 vowel segments, if no a priori knowledge was used.” (Van Nierop et al., 1973)

“[…] because formant frequencies are hard to determine when fundamental frequency is higher than about half of the frequency of the first formant.” (Sundberg, 1987, pp. 124–125)

“Accurate measurement of formant frequencies is important in many studies of speech perception and production. Errors in formant frequency estimation by eye, using a spectrogram, or automatically, using linear prediction, have been reported to be as high as 60 Hz at F0 < 300 Hz. This exceeds the typical auditory difference limens (DLs) for formant frequencies and is also greater than some of the variation that one would like to study, e.g. the acoustic effects of varying vocal effort. The problem becomes substantially worse when F0 is as high as 500 to 600 Hz, which is not uncommon in the speech of women and children at high vocal efforts.” (Traunmüller & Eriksson, 1997)

“Measurements of the frequency position of the formants, considered as the resonances of the vocal tract, are affected by substantial errors when F0 is as high as it is when people communicate over large distances. This holds for LPC-based methods as well as when using visual inspection of spectrograms.” (Traunmüller & Eriksson, 2000)

“The problem is that it is difficult to determine reliably the resonance frequencies of the tract from the sound alone, using either spectral analysis or linear prediction, once F0 exceeds 350 Hz (Monson and Engebretson, 1983), and essentially impossible once F0 exceeds 500 Hz.” (Joliveau et al., 2004)

“[…] it is difficult to determine unambiguously the frequencies of the resonances with a resolution much finer than f0/2.” (Swerdlin, Smith, & Wolfe, 2010)
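
One way to see why these limits scale with F0: a voiced sound carries energy only at the harmonics of F0, so the spectral envelope is sampled only at multiples of F0. The Python sketch below is a deliberate idealization, assuming a single symmetric envelope peak so that the strongest harmonic is the one nearest the resonance. The function names are hypothetical, and real estimation errors depend on the method used.

```python
def nearest_harmonic(resonance_hz, f0_hz):
    """Frequency of the harmonic of f0 closest to the resonance (at least the 1st)."""
    k = max(1, round(resonance_hz / f0_hz))
    return k * f0_hz

def peak_picking_error(resonance_hz, f0_hz):
    """Distance in Hz between the resonance and its nearest harmonic.

    Under the idealization above, this is the error made by taking the
    strongest harmonic as the resonance frequency. It cannot exceed f0 / 2
    as long as the resonance itself lies at or above f0 / 2.
    """
    return abs(nearest_harmonic(resonance_hz, f0_hz) - resonance_hz)
```

For an assumed resonance at 700 Hz, `peak_picking_error(700, 100)` is 0, `peak_picking_error(700, 300)` is 100, and `peak_picking_error(700, 500)` is 200: the higher the F0, the more coarsely the envelope is sampled.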

Methods of formant estimation III: “One wonders, for example, if the source-filter theory of speech production would have taken the same course of development if female voices had been the primary model early on.”

“To a large extent, the early work in acoustic phonetics focused on the adult male speaker. There were a number of reasons for this focus, including social and technical factors. Only rather recently has the study of acoustic phonetics been broadened to encompass significant research on populations other than men. This is not to say that children and women were neglected altogether in the early history of acoustic speech research. Peterson and Barney’s (1952) classic study included acoustic data on vowels for men, women and children, making it clear that acoustic values vary markedly with age and gender characteristics of speakers […].
The problem is that the research effort given to the speech of women and children has been on a smaller scale than that given to the speech of men. Consequently, there is a continuing need to gather acoustic data for diverse populations. The concentration on male speakers had several consequences, not all of which facilitated research on the speech of women and children. One consequence was the choice of an analyzing bandwidth (300 Hz for the ‘wide-band’ analysis) on early spectrographs that worked well enough for most adult male voices but was deficient for many women and children. The unsuitability of the analyzing bandwidth probably discouraged acoustic analyses of women’s and children’s speech.
The implications of the male emphasis may have reached even to theory; Titze (1989, p. 1699) commented, ‘One wonders, for example, if the source-filter theory of speech production would have taken the same course of development if female voices had been the primary model early on.’ Klatt and Klatt (1990, p. 820) remarked on the same point: ‘informal observations hint at the possibility that vowel spectra obtained from women’s voices do not conform as well to an all-pole [i.e. all formant] model, due perhaps to tracheal coupling and source/tract interactions.’ The acoustic theory for vowels […] assumed that the vocal tract transfer function is satisfactorily represented by formants (poles) and that antiformants (zeros) are required only for modifications such as nasalization. It is advisable to bear in mind that this theory is predicated largely on the characteristics of adult male speech and that it may have to be altered to account for the characteristics of both children and women.” (Kent & Read, 2002, pp. 189–190)