ANAE normalization == log-mean normalization

normalization

urbanlegends

Published

March 13, 2024

Introduction

When researchers seek to compare formant patterns across speakers, they often normalize the formant values in some way to facilitate the comparison. To do this they use vowel normalization algorithms, mathematical procedures designed to remove some of the variation in formant patterns.

Log mean normalization was proposed by Terry Nearey in his 1978 PhD thesis Phonetic Feature Systems. It is just a way of centering formant values around the average formant frequency produced by each speaker. Log-mean normalization is carried out by calculating the average log-transformed formant frequency produced by each speaker, and then subtracting this from the log-transformed formant frequencies produced by that speaker.

There is a ‘different’ normalization procedure proposed in the Atlas of North American English (ANAE) by Labov, Ash, and Boberg (2005). This method will be referred to as ANAE normalization. The only difference between ANAE and ‘traditional’ log-mean normalization is that ANAE normalization multiplies normalized formants by an arbitrary constant related to the average production of some set of speakers.

I often hear or read people say variants of the following statement, directly or indirectly:

False

The vowel normalization method proposed in the Atlas of North American English (ANAE) is ‘new’ and qualitatively different from log-mean normalization. You should use ANAE not log-mean normalization since it is better.

I don’t know how or why this idea has become so widespread, but a cursory review of the description of these algorithms (presented below) shows that it is plainly false. The purpose of this post is to show:

True

ANAE normalization is mathematically identical to log-mean normalization, save for the addition of an arbitrary proportional constant. Actually, they are slightly different implementations of uniform scaling normalization, something I will discuss in a future post.

If you need to cite something to make this argument, it follows obviously from the explanation of log-mean normalization in Barreda and Nearey (2018).

Log-mean normalization

Log-mean normaliation requires the calculation of the speaker log-mean formant frequency, called \(S\) here. This value is the logarithm of the geometric mean formant frequency produced by this speaker, a measure of their average formant. In (1) we see that the \(S\) parameter is the sum of the log-transformed formant frequencies (\(F\)) produced by speaker \(s\) for formant number \(k\) for token \(t\), divided by the product of the number of formants and tokens (i.e., the total number of observations).

\[ \begin{equation} \label{eq:a} \, \, S_s= \frac{1}{(K)} \sum_{m=1}^K \sum_{t=1}^T \log⁡(F_{kts}) \end{equation} \tag{1}\]
To provide a concrete example, we can load the Hillenbrand et al. (1995) data from the phonTools R package.

library (phonTools)
data(h95)

We can select speaker 50, an adult male. So, in this case \(s = \mathrm{m50}\).

m50 = h95[589:600,c("f1","f2","f3")]
m50
##      f1   f2   f3
## 589 627 1910 2488
## 590 832 1222 2624
## 591 694  970 2746
## 592 698 1580 2492
## 593 493 2102 2839
## 594 460 1473 1764
## 595 441 2045 2694
## 596 353 2385 3400
## 597 441  862 2455
## 598 522 1123 2538
## 599 650 1036 2697
## 600 422 1299 2347

To carry out (1) we can log transform our formant frequencies and divide by total number of observations (36). This is the number of tokens categories (M=12) times the number of formants (K=3).

log_formants = log (unlist (m50))

S_s = sum (log_formants) / 36
S_s
## [1] 7.131983

Log-mean normalization consists of subtracting the speaker parameter (\(S_s\)) from the logarithm of the observed formant pattern (\([F_{ts} \: F_{ts} \: F_{ts}]\)) for token t produced by speaker s. The result of this is a vector or log-transformed formant frequencies, \(\log([N_{ts} \: N_{ts} \: N_{ts}])\).

\[ \begin{equation} \, \, \log([N_{ts} \: N_{ts} \: N_{ts}]) = [F_{ts} \: F_{ts} \: F_{ts}] - S_s \end{equation} \tag{2}\]

We can do this in R as follows:

log_normalized_formants = log(m50) - S_s
log_normalized_formants
##             f1          f2        f3
## 589 -0.6910361  0.42287587 0.6872518
## 590 -0.4081502 -0.02373851 0.7404725
## 591 -0.5895107 -0.25468658 0.7859179
## 592 -0.5837635  0.23319747 0.6888582
## 593 -0.9314735  0.51866190 0.8192245
## 594 -1.0007562  0.16307376 0.3433566
## 595 -1.0429378  0.49117042 0.7667997
## 596 -1.2655146  0.64497175 0.9995481
## 597 -1.0429378 -0.37272738 0.6738994
## 598 -0.8743151 -0.10822370 0.7071490
## 599 -0.6550103 -0.18886023 0.7679127
## 600 -1.0869773  0.03736736 0.6289105

Mathematically, this is equivalent to dividing the observed formant pattern by the exponent of the speaker parameter, with the only difference being whether the normalized pattern is log-transformed or not.

\[ \begin{equation} \, \, [N_{ts} \: N_{ts} \: N_{ts}] = [F_{ts} \: F_{ts} \: F_{ts}] \times \frac{1}{\exp(S_s)} \end{equation} \tag{3}\]

We can see this below where we first normalize the formants by dividing by the exponent of the speaker parameter, and then compare the logarithm of this to the log_normalized_formants calculated above.

normalized_formants = m50 / exp(S_s)
max ( log_normalized_formants - log(normalized_formants) )
## [1] 4.440892e-16

ANAE normalization

The following summary of ANAE normalization is taken from the pages included at the bottom of this paper (Passage from ANAE). ANAE normalization requires the calculation of an \(S\) and a \(G\) parameter. The \(S\) parameter is the log-mean formant frequency produced by each speaker. This is calculated using equation (1) in the same way as it is for log-mean normalization. The \(G\) parameter is the average of all the S parameters for the n speakers composing the data, or it would equal this in the balanced case. Alternitavely, we could just use the value of G provided in the ANAE, 6.897.

\[ \begin{equation} (4) \, \, G= \frac{1}{N} \sum_{n=1}^N S_s \end{equation} \tag{4}\]
Using these two parameters we calculate \(F\) for speaker \(s\) as below:

\[ \begin{equation} (5) \, \, F = \exp(G - S_s) \end{equation} \tag{5}\]

We can do this for our data below using the value of \(G\) provided in the ANAE:

G = 6.897
F = exp(G - S_s)
F
## [1] 0.7905846

ANAE normalization consists of multiplying the observed formant pattern (\([F_{ts} F_{ts} F_{ts}]\)) for token \(t\) produced by speaker \(s\) times the \(F\) parameter for that speaker (\(F_s\)). I will refer to these formant values as [A_{ts} : A_{ts} : A_{ts}] to indicate that these are the ANAE normalized values.

\[ \begin{equation} \, \, [A_{ts} \: A_{ts} \: A_{ts}] = [F_{ts} \: F_{ts} \: F_{ts}] \times F_s \end{equation} \tag{6}\]

The result of this process is a normalized formant pattern for the token \([F_{ts} F_{ts} F_{ts}]\). Notice that these values seem like ‘normal’ formant values because they have not been log-transformed. In addition, rather than centering around zero (more on this later), ANAE normalization centers formant values around an arbitrary value, e.g. the average formant frequency produced by the speakers in the data.

ANAE_normalized = m50 * F
ANAE_normalized
##           f1        f2       f3
## 589 495.6965 1510.0165 1966.974
## 590 657.7664  966.0943 2074.494
## 591 548.6657  766.8670 2170.945
## 592 551.8280 1249.1236 1970.137
## 593 389.7582 1661.8088 2244.470
## 594 363.6689 1164.5311 1394.591
## 595 348.6478 1616.7454 2129.835
## 596 279.0764 1885.5442 2687.988
## 597 348.6478  681.4839 1940.885
## 598 412.6851  887.8265 2006.504
## 599 513.8800  819.0456 2132.207
## 600 333.6267 1026.9693 1855.502

ANAE normalization == log-mean normalization

Equation (6) defines the ANAE normalization method. Let’s take the logarithm of both sides in (6) below in (7):

\[ \begin{equation} \, \, \log([A_{ts} \: A_{ts} \: A_{ts}]) = \log([F_{ts} \: F_{ts} \: F_{ts}]) + \log(F_s) \end{equation} \tag{7}\]

Recall that in equation (5), we established that \(F_s = \exp(G - S_s)\). Let’s replace the parameter \(F_s\) with it’s equivalent from equation (5):

\[ \begin{equation} \, \, \log([A_{ts} \: A_{ts} \: A_{ts}]) = \log([F_{ts} \: F_{ts} \: F_{ts}]) + \log(\exp(G-S_s)) \end{equation} \tag{8}\]

The log and exp functions on the right hand side of (8) cancel out as in (9). At this point we can see that the log transformed ANAE normalized formants are equal to the the log transformed formants minus the speaker parameter \(S_s\), plus the arbitrary constant \(G\).

\[ \begin{split} \, \, \log([A_{ts} \: A_{ts} \: A_{ts}]) = \log([F_{ts} \: F_{ts} \: F_{ts}]) + G - S_s \end{split} \tag{9}\]
We can re-arrange the terms \(G\) and \(S\) to make the relationship to equation (1) more explicit.

\[ \begin{align} \, \, \log([A_{ts} \: A_{ts} \: A_{ts}]) = \log([F_{ts} \: F_{ts} \: F_{ts}]) - S_s + G \end{align} \tag{10}\]
Or in less ‘mathy’ terms, if:

\[ \log(ANAE) = \log([F_{ts} \: F_{ts} \: F_{ts}]) - S_s + G \tag{11}\]
and:

\[ \mathrm{Log \, Mean} = \log([F_{ts} \: F_{ts} \: F_{ts}]) - S_s \tag{12}\]
then:

\[ \begin{split} \mathrm{Log \, Mean} + G = \log(ANAE) \\ \mathrm{Log \, Mean} + G - \log(ANAE) = 0 \end{split} \tag{13}\]
and:

\[ \begin{split} \exp (\mathrm{Log \, Mean} + G) = ANAE \\ \exp (\mathrm{Log \, Mean} + G) - ANAE = 0 \end{split} \tag{14}\]
Ok it’s easy enough to move symbols around, so let’s see that it’s true. Below we see that equations X and X are legitimate, and that the difference between the two approaches amounts to the addition of the arbitrary proportional constant \(G\) (and numerical error).

eq.13 = (log_normalized_formants + G) - log(ANAE_normalized)
max (abs (eq.13))
## [1] 8.881784e-16

eq.14 = (exp(log_normalized_formants) * exp(G)) - ANAE_normalized
max (abs (eq.14))
## [1] 9.094947e-13

Conclusion

Let me conclude with an analogy. Imagine you ‘normalize’ day and night temperature variation around the mean for a city. You do this to compare cities. If you leave the normalized temperature values centered around 0, that is log-mean normalization. Suppose that the average of all the cities you compare is 65 degrees Fahrenheit. Imagine that you don’t like the to be centered around zero, so instead you add 65 back to all values so that they are instead centered around 65. This is ANAE normalization. Do these methods seem substantially different to you? The only function of the \(G\) parameter in ANAE normalization is to scale normalized values up so that they are no longer centered around zero. This makes values more similar to the productions in Hertz but yields no other benefit.

I don’t understand how this could be seen as ‘new’ or ‘different’ with respect to log-mean normalization, and why I am so often asked to use ANAE rather than log-mean since it is ‘new’ and ‘better’. Given my difficulties changing the minds of other researchers, I have decided to go with the flow and propose a new normalization method I call Santiago normalization. It is actually exactly just like ANAE normalization but you increase all values by a constant \(C\) equal to 1.1 (1%). I think this is better because values are a bit larger, which I like. I hope those researchers who consider that ANAE is substantially different from log-mean normalization will give serious consideration to my exciting new method.

Passage from ANAE

Below is the relevant sections from the Atlas of North American English, pages 39-40.

References

Barreda, S., & Nearey, T. M. (2018). A regression approach to vowel normalization for missing and unbalanced data. The Journal of the Acoustical Society of America, 144(1). https://doi.org/10.1121/1.5047742

Hillenbrand, J., Getty, L. A., Clark, M. J., & Wheeler, K. (1995). Acoustic characteristics of American English vowels. The Journal of the Acoustical Society of America, 97(5), 3099-3111.

Labov, W., Ash, S., & Boberg, C. (2005). The atlas of North American English: Phonetics, phonology and sound change. Walter de Gruyter.

Nearey, T. M. (1978). Phonetic feature systems for vowels. Phonetic feature systems for vowels.