Durational differences of homophonous suffixes emerge from the lexicon: Evidence from pseudowords’ semantic vectors
This script documents the analysis of ENGS production data and MALD corpus data by means of measures derived from Linear Discriminative Learning (LDL; e.g. Baayen et al., 2018). This is a work in progress. Please see the accompanying script for an analysis of the results of the current script, i.e. PCAs and GAMM modelling.
The idea of this LDL implementation is the following: networks trained on real word data and their transformation matrices are used to create semantic matrices for pseudowords, following the implementation by Chuang et al. (2020). The underlying assumption is that speakers who encounter pseudowords process them with their knowledge of previously encountered words, i.e. their lexicon. Thus, pseudowords are not semantically empty shells but carry some sort of meaning.
Combining real and nonce word data, creating cue and semantic matrices for both, and extracting LDL measures for comprehension and production is somewhat more complicated than more straightforward implementations limited to either real or nonce word data. Thus, the following figure illustrates in more detail what is done in the remainder of this document.
1 Data
1.1 MALD data
Data on real words is taken from the Massive Auditory Lexical Decision (MALD) Database (Tucker et al., 2018). It is a subset of the complete MALD database, containing 8328 words, 2093 of which contain one of 36 different affixes.
[1] 8328 4
1.2 ENGS data
Pseudoword data is taken from the ENGS production experiment (Schmitz et al., 2020). In total, 48 pseudowords are contained within this data set, with half of them representing monomorphemic words (e.g. singular bloufs) and half of them representing plural wordforms (e.g. plural bloufs).
[1] 48 4
1.3 COMB data
The combined data set contains both data on real and on nonce words as introduced in 1.1 and 1.2. In total, 8376 words and 36 different affixes are part of this data set.
## create a combined data set
comb.data <- rbind(mald.data, engs.data)
## sort combined data set alphabetically
comb.data <- comb.data[order(comb.data$Word),]
dim(comb.data)
[1] 8376 4
2 A matrix
We can create an A matrix for real words using the semantic vectors constructed from the TASA corpus. The A matrix contains the semantic vectors of all words: rows correspond to words in orthographic form, columns to the dimensions of the semantic vector space.
X | meat | mechanic | mechanical | mechanism | medal |
---|---|---|---|---|---|
eat | 0.0118417 | -0.0001872 | 0.0002885 | -0.0002331 | 1.98e-05 |
echo | 0.0005983 | -0.0000384 | -0.0000922 | -0.0000899 | -2.05e-05 |
eclipse | 0.0000350 | -0.0003962 | -0.0000451 | -0.0000278 | -9.70e-06 |
## load TASA vectors
load("data_simon/wL2Ld.rda")
## find all unique MALD (= real word) lexomes
mald.subset.allLexomes = unique(c(mald.data$Word, mald.data$Affix))
mald.subset.allLexomes = mald.subset.allLexomes[!is.na(mald.subset.allLexomes)]
## find vectors for all unique lexomes > this then is the A matrix
mald.Amat = wL2Ld[mald.subset.allLexomes, ]
3 C matrices
C matrices contain all triphones of a data set (columns), binary-coded for all words in phonological form (rows).
X | X..b | X.b. | b.n | X.nd | nd. | d.n | X.nm | nm. | m.n | X.nt | nt. | b.S | X.S. | X..b.1 | X.bI |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
@b{nd@nm@nt | 1 | 1 | 1 | 1 | 1 | 1 | 1 | 1 | 1 | 1 | 1 | 0 | 0 | 0 | 0 |
@b{S | 1 | 1 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 1 | 1 | 0 | 0 |
{bI | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 1 | 1 |
{b@t | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 1 | 0 |
{bdIk1t | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 1 | 0 |
3.1 real words
The C matrix for real words is computed based on the MALD data set and \(Word+Affix\), i.e. word forms and affixes are considered.
mald.cues = make_cue_matrix(data = mald.data, formula = ~ Word + Affix, grams = 3, wordform = "DISC")
dim(mald.cues$matrices$C)
# 8328 7587
4 S matrices
S matrices contain the semantic vectors of all words of a data set: rows correspond to words in phonological form, columns to the dimensions of the semantic vector space.
X | meat | mechanic | mechanical | mechanism | medal |
---|---|---|---|---|---|
it | 0.0118417 | -0.0001872 | 0.0002885 | -0.0002331 | 1.98e-05 |
Ek5 | 0.0005983 | -0.0000384 | -0.0000922 | -0.0000899 | -2.05e-05 |
IklIps | 0.0000350 | -0.0003962 | -0.0000451 | -0.0000278 | -9.70e-06 |
4.1 real words
The S matrix for real words is computed based on the MALD data set, the MALD A matrix, and \(Word+Affix\), i.e. word forms and affixes are considered.
mald.Smat = make_S_matrix(data = mald.data, formula = ~ Word + Affix, inputA = mald.Amat, grams = 3, wordform = "DISC")
dim(mald.Smat)
# 8328 5487
4.2 pseudowords
We can create a semantic matrix for pseudowords based on the semantics of real words by solving the following equation:
\[engs.Smat = engs.C \cdot Hp\]
with engs.Smat being the semantic matrix for pseudowords, engs.C being the C matrix contained within the cue object of pseudowords, and Hp being the transformation matrix for mapping real word cues onto real word semantics.
To obtain this real word transformation matrix Hp (Chuang et al., 2020), we first need to learn comprehension of real words. Then, we can extract F (the name used for Hp in this package) from mald.comp. We will use Hp as the variable name.
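A minimal sketch of these two steps, assuming the same learn_comprehension() call that is used later in this script (the extraction of F follows the prose above):
## learn comprehension of real words
mald.comp = learn_comprehension(cue_obj = mald.cues, S = mald.Smat)
## extract the transformation matrix F and store it under the name Hp
Hp = mald.comp$F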
For matrix multiplication, the number of columns in the first matrix must be equal to the number of rows in the second matrix (e.g. Nykamp, 2020).
Let’s check the dimensions of our matrices:
- engs.C: 48 x 63
- Hp: 7587 x 5487
Thus, engs.C must be expanded so that its number of columns \(63\) matches the number of rows in Hp \(7587\).
To achieve this, we first create an empty matrix, i.e. all cells contain zeros, with \(7539\) rows and \(63\) columns.
This matrix is then attached to the original pseudoword C matrix.
Next, we create an identity matrix, i.e. all cells are zero except the diagonal, which contains ones. Adding an identity matrix is a common way to extend a matrix’s number of columns. The number of rows and columns of an identity matrix is identical; we take \(7587\) rows/columns as this is the number of columns we want to obtain.
Then, we create a new matrix based on the identity matrix containing all columns but the first \(63\) (as this is the number of columns contained in the pseudoword C matrix).
This matrix is then combined with C.eng.big1 (the cue matrix with additional rows).
This expanded pseudoword C matrix must then be reduced to the necessary size: in its current form, it contains 7500+ rows without useful data. We only keep the rows which contain information on the ENGS pseudowords, i.e. \(48\) rows.
In a last step, we make sure that R considers the newly created C matrix a matrix (this sometimes is not the case and will lead to errors in later steps).
Now, we can solve the aforementioned equation \(engs.Smat = engs.C * Hp\).
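A minimal sketch of the expansion and the solution; engs.cues is assumed to be the pseudoword cue object (built analogously to mald.cues), and all names except C.eng.big1 and engs.Smat are illustrative:
## original pseudoword C matrix (48 x 63)
C.eng = engs.cues$matrices$C
## append 7539 all-zero rows so that the matrix has 7587 rows
C.eng.big1 = rbind(C.eng, matrix(0, nrow = 7539, ncol = ncol(C.eng)))
## identity matrix with 7587 rows/columns; drop its first 63 columns and attach the rest
id = diag(7587)
C.eng.big2 = cbind(C.eng.big1, id[, -(1:63)])
## keep only the 48 pseudoword rows and make sure R treats the result as a matrix
engs.C = as.matrix(C.eng.big2[1:nrow(C.eng), ])
## solve the equation: map pseudoword cues onto real word semantics
engs.Smat = engs.C %*% Hp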
In this S matrix, 48 pseudowords and their semantic vectors are contained. However, all of our pseudowords are homophones and thus show identical semantic vectors, even though half of them contain a suffix; e.g. the semantic vectors for bloufs (non-morphemic S) and bloufs (plural S) are identical.
X | fierce | fifteen | fifteenth | fifth | fifty |
---|---|---|---|---|---|
bl6fs | -0.0001453 | 0.0002496 | 0.0002782 | 0.0006957 | -0.0001303 |
bl6fs | -0.0001453 | 0.0002496 | 0.0002782 | 0.0006957 | -0.0001303 |
bl6ks | 0.0000021 | -0.0006015 | 0.0000090 | -0.0012144 | 0.0003275 |
bl6ks | 0.0000021 | -0.0006015 | 0.0000090 | -0.0012144 | 0.0003275 |
We “fix” this by adding the “PL” semantic vector values to the semantic vectors of plural pseudowords (idea by Yu-Ying).
Thus, we first need to extract the “PL” semantic vector contained in the general A matrix.
Taking a closer look at the pseudoword S matrix, we find that every second row corresponds to a plural pseudoword, i.e. this is where we add the PL vector.
Taking this plural subset, we add the “PL” semantic vector to all rows.
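A minimal sketch of these steps; pl.vec, engs.pl, and engs.pl.added are illustrative names, and engs.Smat.old keeps the original row order for the reordering below:
## extract the "PL" semantic vector from the A matrix
pl.vec = mald.Amat["PL", ]
## keep a copy of the original S matrix (and thus its row order)
engs.Smat.old = engs.Smat
## every second row is a plural pseudoword; add the "PL" vector to each of these rows
engs.pl = engs.Smat[seq(2, nrow(engs.Smat), 2), ]
engs.pl.added = sweep(engs.pl, 2, pl.vec, "+")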
We then recombine the original monomorphemic S matrix and the modified plural S matrix, and recreate the original row order.
engs.nm = engs.Smat[seq(1, nrow(engs.Smat), 2), ]
engs.Smat <- rbind(engs.nm, engs.pl.added)
sort.rows <- row.names(engs.Smat.old)
engs.Smat <- engs.Smat[order(match(rownames(engs.Smat), sort.rows)), , drop = FALSE]
Now, homophonous pseudowords have differing semantic vectors.
X | fierce | fifteen | fifteenth | fifth | fifty |
---|---|---|---|---|---|
bl6fs | -0.0001453 | 0.0002496 | 0.0002782 | 0.0006957 | -0.0001303 |
bl6fs | -0.0000796 | 0.0001372 | 0.0003686 | 0.0003187 | -0.0002868 |
bl6ks | 0.0000021 | -0.0006015 | 0.0000090 | -0.0012144 | 0.0003275 |
bl6ks | 0.0000678 | -0.0007139 | 0.0000994 | -0.0015913 | 0.0001709 |
4.3 combined
Finally, we can create one combined S matrix for real and nonce words.
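A minimal sketch; the make_cue_matrix() call mirrors the one used for the MALD data, and stacking the two S matrices with rbind() is an assumption:
## combined cue matrix for real and nonce words
comb.cues = make_cue_matrix(data = comb.data, formula = ~ Word + Affix, grams = 3, wordform = "DISC")
## combined S matrix: real word semantics on top, pseudoword semantics below
comb.Smat = rbind(mald.Smat, engs.Smat)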
However, LDL commands in R are sensitive to row order. Thus, we need to recreate the original row order as found in the combined C matrix.
all(rownames(comb.cues$matrices$C) == rownames(comb.Smat))
# FALSE
comb.Smat2 <- comb.Smat[rownames(comb.cues$matrices$C),,drop=FALSE]
all(rownames(comb.cues$matrices$C) == rownames(comb.Smat2))
# TRUE
5 Comprehension
Using the combined C and S matrices, we can now learn and evaluate comprehension of real and nonce words in one go, i.e. as if they are encountered by a person with their pertinent lexicon.
First, comprehension is learned.
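A sketch of this step, following the learn_comprehension() call used for the indirect route later in this script:
comb.comp = learn_comprehension(cue_obj = comb.cues, S = comb.Smat2)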
Second, the accuracy of this comprehension is evaluated, i.e. how well form (C matrix) can be mapped onto meaning (S matrix).
comb.comp_acc = accuracy_comprehension(m = comb.comp, data = comb.data, wordform = "DISC", show_rank = TRUE, neighborhood_density = TRUE)
comb.comp_acc$acc # 0.6703677
We find that comprehension accuracy is about \(67\%\).
6 Production
Using the combined C and S matrices as well as the comprehension results, we now are able to learn production and evaluate its accuracy.
First, production is learned.
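A sketch of this step; learn_production() with the comprehension object as input is assumed here:
comb.prod = learn_production(cue_obj = comb.cues, S = comb.Smat2, comp = comb.comp)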
Second, the accuracy of this production is evaluated, i.e. how well meaning (S matrix) can be mapped onto form (C matrix).
comb.prod_acc = accuracy_production(m = comb.prod, data = comb.data, wordform = "DISC", full_results = TRUE, return_triphone_supports = TRUE)
comb.prod_acc$acc # 0.9641834
We find that production accuracy is about \(96\%\).
7 Measures
7.1 Comprehension
Extracting and saving comprehension measures is straightforward.
comb.comp_measures = comprehension_measures(comp_acc = comb.comp_acc, Shat = comb.comp$Shat, S = comb.Smat2, A = mald.Amat, affixFunctions = comb.data$Affix)
- recognition: whether the correct form was recognized
- cor_max: the correlation between the pertinent vector in Shat and the semantic vector of its predicted form in S
- cor_target: the correlation between the pertinent vector in Shat and the semantic vector of its targeted form in S
- l1norm: city block distance / Manhattan distance of Shat
- l2norm: Euclidean (direct) distance / length of Shat
- rank: the rank of correlation between s-hat and the semantic vector of its targeted form
- density: the mean correlation of s-hat with the semantic vectors of its top 8 neighbors
- cor_affix: the correlation of s-hat and the semantic vector of its affix function; more or less a measure of transparency
7.2 Production
Extracting and saving production measures runs into problems. It appears that there are 23 words for which comb.prod_acc does not contain all necessary information. Why? I have no clue.
I worked around this problem by removing the 23 problematic words from full, i.e. the list within comb.prod_acc in which all necessary info is contained.
Using the summary() function, all cases with missing info are identified. I call the newly created data frame nineortwelve, as complete entries consist of 12 entries per word, while incomplete entries consist of only 9 entries per word.
Checking this data frame, the following entries can be found with only 9 individual entries instead of 12. Thus, they are removed from full2.
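A minimal sketch of how these incomplete entries can be identified, here via element counts rather than summary(); nineortwelve is the data frame name used above:
full = comb.prod_acc$full
nineortwelve = data.frame(index = seq_along(full), n_entries = sapply(full, length))
## entries with only 9 instead of 12 elements per word
nineortwelve[nineortwelve$n_entries == 9, ]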
full2 <- full
full2[[8149]] <- NULL
full2[[7741]] <- NULL
full2[[7233]] <- NULL
full2[[7143]] <- NULL
full2[[7133]] <- NULL
full2[[7027]] <- NULL
full2[[6920]] <- NULL
full2[[6314]] <- NULL
full2[[5274]] <- NULL
full2[[4878]] <- NULL
full2[[4385]] <- NULL
full2[[3873]] <- NULL
full2[[3399]] <- NULL
full2[[3111]] <- NULL
full2[[2842]] <- NULL
full2[[2206]] <- NULL
full2[[2067]] <- NULL
full2[[1821]] <- NULL
full2[[1772]] <- NULL
full2[[1771]] <- NULL
full2[[1354]] <- NULL
full2[[971]] <- NULL
full2[[330]] <- NULL
This reduced full2 then replaces the original full in a copy of comb.prod_acc. Only then can production measures be obtained.
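A minimal sketch of this replacement; comb.prod_acc2 is the object used in the call below:
comb.prod_acc2 = comb.prod_acc
comb.prod_acc2$full = full2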
comb.prod_measures = production_measures(prod_acc = comb.prod_acc2, S = comb.Smat2, prod = comb.prod)
- correlations: the correlation of the predicted path with the targeted semantic vector
- path_counts: the number of paths detected
- path_sum: the summed support for the predicted path, could be read as the uncertainty about how to pronounce the word due to competing activation of articulatory motor plans
- path_entropies: Shannon entropy calculated over the path supports
- lwlr: Simon: “The length-weakest link ratios of the predicted path (also sometimes equal to word length). When the length of a word’s path, as measured in number of nodes, becomes higher, while the weakest link between nodes stays the same, uncertainty becomes greater. If triphone support is high, i.e. when all the strengths of the links are almost equal to one, the value will be almost equal to the number of segments in the word. This may not be the case in other models, where the triphone support may be lower.”
8 Analysis I
8.1 Data
Load comprehension and production measures as well as ENGS production data on non-morphemic and plural /s/.
# comprehension measures
comb.comp_measures <- read.csv("data/comb.comp_measures_ALLREALSEMANTICS_2.csv")[2:9]
# production measures
comb.prod_measures <- read.csv("data/comb.prod_measures_ALLREALSEMANTICS.csv")
# ENGS production data
data <- read.csv("data/engs_durations.csv")
# fix a naming issue
names(data)[names(data) == "ï..number"] <- "number"
Add information on words, transcriptions, bases, and affixes to one of the measure data sets (so we actually know which words we are dealing with).
comb.comp_measures$Word <- comb.data$Word
comb.comp_measures$DISC <- comb.data$DISC
comb.comp_measures$Affix <- comb.data$Affix
comb.comp_measures$Base <- comb.data$Base
Next, we have to exclude the 23 words we excluded previously when calculating production measures from the comprehension measures data set.
comb.comp_measures2 <- subset(comb.comp_measures, DISC!="{p@r1t@s" & DISC!="bUl@k" & DISC!= "kl2@nt" & DISC!= "krimI" &
DISC!= "kri1t" & DISC!= "krVmbP" & DISC!= "divj@nt" & DISC!= "dIsEmIn1SH" & DISC!= "fIS@R" &
DISC!= "g & DISC!=t@R" & DISC!= "h{mb3g@R" & DISC!= "InstrUm@nt" & DISC!= "m@=Qr@tI" &
DISC!= "n5bIl@tI" & DISC!= "pEndjUl@m" & DISC!= "rVpJ@R" & DISC!= "s6@R" & DISC!= "skwQnd@R" &
DISC!= "stQk" & DISC!= "st5nkVt@R" & DISC!= "s@bmISH" & DISC!= "tr{nsEnd" & DISC!= "wQJfUl")
Finally, create the final data set by combining both measure data sets and the ENGS production data.
# combine comprehension and production measures
ldl.measures <- cbind(comb.comp_measures2, comb.prod_measures)
# combine LDL measures and ENGS production data
data <- merge(data, ldl.measures, by="Base")
# exclude some duplicated columns
data1 <- data[1:40]
data2 <- data[44:49]
data <- cbind(data1, data2)
# fix some names
names(data)[names(data) == "Word.x"] <- "Word"
names(data)[names(data) == "Affix.x"] <- "Affix"
names(data)[names(data) == "DISC.x"] <- "DISC"
# fix 'Affix' levels
data$Affix[is.na(data$Affix)] <- "NM"
data$Affix <- as.factor(data$Affix)
8.2 Variables
Let’s have a closer look at all variables we are interested in:
- dependent variables: sDur, baseDur, wordDur
- traditional variables: speakingRate, googleFreq, pauseBin, biphoneProbSumBin, folType, monoMultilingual, speaker
- LDL variables:
  - comprehension: cor_max, cor_target, l1norm, l2norm, rank, density, cor_affix
  - production: correlations, path_counts, path_sum, path_entropies, lwlr
8.2.1 dependent variables
sDur
sDur is the measured duration of non-morphemic and plural /s/.
Shapiro-Wilk normality test
data: data$sDur
W = 0.94176, p-value = 9.988e-16
sDur is not normally distributed. Thus, sDur is log-transformed:
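A minimal sketch of the transformation and test applied here (and, analogously, to the variables below):
data$sDurLog = log(data$sDur)
shapiro.test(data$sDurLog)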
Shapiro-Wilk normality test
data: data$sDurLog
W = 0.99268, p-value = 0.001955
baseDur
baseDur is the measured duration of strings preceding the word-final /s/.
Shapiro-Wilk normality test
data: data$baseDur
W = 0.93546, p-value < 2.2e-16
baseDur is not normally distributed. Thus, baseDur is log-transformed:
Shapiro-Wilk normality test
data: data$baseDurLog
W = 0.99236, p-value = 0.001384
wordDur
wordDur is the sum of baseDur and sDur.
Shapiro-Wilk normality test
data: data$wordDur
W = 0.96604, p-value = 1.717e-11
wordDur is not normally distributed. Thus, wordDur is log-transformed:
Shapiro-Wilk normality test
data: data$wordDurLog
W = 0.99469, p-value = 0.01763
8.2.2 traditional variables
speakingRate
speakingRate is the number of syllables in an utterance divided by the duration of the utterance.
Shapiro-Wilk normality test
data: data$speakingRate
W = 0.99142, p-value = 0.0005315
speakingRate is not normally distributed. Thus, speakingRate is log-transformed:
Shapiro-Wilk normality test
data: data$speakingRateLog
W = 0.98808, p-value = 2.291e-05
However, log-transforming does not improve normality. Thus, we keep speakingRate as the variable.
googleFreq
googleFreq is the frequency of a given word form as obtained via Google searches.
Shapiro-Wilk normality test
data: data$googleFreq
W = 0.36678, p-value < 2.2e-16
googleFreq is not normally distributed. Thus, googleFreq is log-transformed:
Shapiro-Wilk normality test
data: data$googleFreqLog
W = 0.9794, p-value = 3.172e-08
categorical variables
pauseBin:
no_pause pause
435 249
biphoneProbSumBin:
high low
168 516
folType:
APP F N P V
197 12 114 171 190
monoMultilingual:
bilingual monolingual
129 555
speaker:
1 2 3 4 5 6 7 9 10 11 12 13 14 15 16 17 19 20 21 23 24 25 26 27 28 29
19 26 14 18 14 16 18 14 14 20 21 20 13 21 14 7 18 17 14 5 12 20 22 17 21 19
30 31 32 33 35 36 41 42 43 45 46 47 48 49
21 17 20 17 16 11 20 21 18 15 18 17 20 19
8.2.3 LDL variables
8.2.3.1 comprehension measures
cor_max
cor_max: the correlation between the pertinent vector in Shat and the semantic vector of its predicted form in S.
Shapiro-Wilk normality test
data: data$cor_max
W = 0.74745, p-value < 2.2e-16
cor_max is not normally distributed. Thus, cor_max is log-transformed:
Shapiro-Wilk normality test
data: data$cor_maxLog
W = 0.71499, p-value < 2.2e-16
However, log-transforming does not improve normality. Thus, we test another way to normalize the data using the bestNormalize package.
The bestNormalize package contains a suite of transformation-estimating functions that can be used to normalize data. The function of the same name attempts to find and execute the best of all of these potential normalizing transformations. In this package, we define “normalize” as in “to render data Gaussian”, rather than transform it to the 0-1 scale.
Best Normalizing transformation with 684 Observations
Estimated Normality Statistics (Pearson P / df, lower => more normal):
- arcsinh(x): 17.0242
- Box-Cox: 17.0954
- Exp(x): 16.9817
- Log_b(x+a): 16.9687
- No transform: 16.887
- orderNorm (ORQ): 13.5756
- sqrt(x + a): 16.9604
- Yeo-Johnson: 16.6287
Estimation method: Out-of-sample via CV with 10 folds and 5 repeats
Based off these, bestNormalize chose:
orderNorm Transformation with 684 nonmissing obs and ties
- 15 unique values
- Original quantiles:
0% 25% 50% 75% 100%
0.607 0.888 0.938 1.000 1.000
cor_max_best <- bestNormalize(data$cor_max)
orderNorm_obj <- orderNorm(data$cor_max)
data$cor_maxNorm <- predict(orderNorm_obj, newdata = data$cor_max)
Shapiro-Wilk normality test
data: data$cor_maxNorm
W = 0.85932, p-value < 2.2e-16
Still, cor_max is not even nearly normally distributed. We will keep using cor_max as the variable for now.
cor_target
cor_target: the correlation between the pertinent vector in Shat and the semantic vector of its targeted form in S.
Shapiro-Wilk normality test
data: data$cor_target
W = 0.89454, p-value < 2.2e-16
cor_target is not normally distributed. Thus, cor_target is log-transformed:
Shapiro-Wilk normality test
data: data$cor_targetLog
W = 0.8771, p-value < 2.2e-16
However, log-transforming does not improve normality. Thus, we test another way to normalize the data using the bestNormalize package.
Best Normalizing transformation with 684 Observations
Estimated Normality Statistics (Pearson P / df, lower => more normal):
- arcsinh(x): 6.687
- Box-Cox: 6.3041
- Exp(x): 6.4214
- Log_b(x+a): 6.2881
- No transform: 6.3034
- orderNorm (ORQ): 1.1091
- sqrt(x + a): 5.9398
- Yeo-Johnson: 6.3725
Estimation method: Out-of-sample via CV with 10 folds and 5 repeats
Based off these, bestNormalize chose:
orderNorm Transformation with 684 nonmissing obs and ties
- 48 unique values
- Original quantiles:
0% 25% 50% 75% 100%
0.068 0.279 0.637 0.888 0.998
cor_target_best <- bestNormalize(data$cor_target)
orderNorm_obj <- orderNorm(data$cor_target)
data$cor_targetNorm <- predict(orderNorm_obj, newdata = data$cor_target)
Shapiro-Wilk normality test
data: data$cor_targetNorm
W = 0.99374, p-value = 0.006111
l1norm
l1norm: city block distance / Manhattan distance of Shat.
Shapiro-Wilk normality test
data: data$l1norm
W = 0.93901, p-value = 4.012e-16
l1norm is not normally distributed. Thus, l1norm is log-transformed:
Shapiro-Wilk normality test
data: data$l1normLog
W = 0.87911, p-value < 2.2e-16
However, log-transforming does not improve normality. Thus, we test another way to normalize the data using the bestNormalize package.
Best Normalizing transformation with 684 Observations
Estimated Normality Statistics (Pearson P / df, lower => more normal):
- arcsinh(x): 6.1997
- Box-Cox: 6.1849
- Exp(x): 7.6258
- Log_b(x+a): 6.254
- No transform: 6.1678
- orderNorm (ORQ): 1.3636
- sqrt(x + a): 6.1385
- Yeo-Johnson: 6.1727
Estimation method: Out-of-sample via CV with 10 folds and 5 repeats
Based off these, bestNormalize chose:
orderNorm Transformation with 684 nonmissing obs and ties
- 24 unique values
- Original quantiles:
0% 25% 50% 75% 100%
0.102 0.572 0.971 1.439 1.937
l1norm_best <- bestNormalize(data$l1norm)
orderNorm_obj <- orderNorm(data$l1norm)
data$l1normNorm <- predict(orderNorm_obj, newdata = data$l1norm)
Shapiro-Wilk normality test
data: data$l1normNorm
W = 0.98538, p-value = 2.408e-06
l2norm
l2norm: Euclidean (direct) distance / length of Shat.
Shapiro-Wilk normality test
data: data$l2norm
W = 0.84956, p-value < 2.2e-16
l2norm is not normally distributed. Thus, l2norm is log-transformed:
Shapiro-Wilk normality test
data: data$l2normLog
W = 0.94739, p-value = 7.141e-15
However, log-transforming does not improve normality. Thus, we test another way to normalize the data using the bestNormalize package.
Best Normalizing transformation with 684 Observations
Estimated Normality Statistics (Pearson P / df, lower => more normal):
- arcsinh(x): 10.3382
- Box-Cox: 5.7599
- Exp(x): 10.7656
- Log_b(x+a): 5.5838
- No transform: 10.3382
- orderNorm (ORQ): 1.3053
- sqrt(x + a): 7.0144
- Yeo-Johnson: 8.738
Estimation method: Out-of-sample via CV with 10 folds and 5 repeats
Based off these, bestNormalize chose:
orderNorm Transformation with 684 nonmissing obs and ties
- 24 unique values
- Original quantiles:
0% 25% 50% 75% 100%
0.004 0.014 0.033 0.042 0.096
l2norm_best <- bestNormalize(data$l2norm)
orderNorm_obj <- orderNorm(data$l2norm)
data$l2normNorm <- predict(orderNorm_obj, newdata = data$l2norm)
Shapiro-Wilk normality test
data: data$l2normNorm
W = 0.98576, p-value = 3.256e-06
density
density: the mean correlation of s-hat with the semantic vectors of its top 8 neighbors.
Shapiro-Wilk normality test
data: data$density
W = 0.93529, p-value < 2.2e-16
density is not normally distributed. Thus, density is log-transformed:
Shapiro-Wilk normality test
data: data$densityLog
W = 0.95078, p-value = 2.511e-14
However, log-transforming does not improve normality. Thus, we test another way to normalize the data using the bestNormalize package.
Best Normalizing transformation with 684 Observations
Estimated Normality Statistics (Pearson P / df, lower => more normal):
- arcsinh(x): 3.9804
- Box-Cox: 3.1975
- Exp(x): 4.401
- Log_b(x+a): 3.6193
- No transform: 4.1459
- orderNorm (ORQ): 1.2895
- sqrt(x + a): 3.9806
- Yeo-Johnson: 3.1907
Estimation method: Out-of-sample via CV with 10 folds and 5 repeats
Based off these, bestNormalize chose:
orderNorm Transformation with 684 nonmissing obs and ties
- 24 unique values
- Original quantiles:
0% 25% 50% 75% 100%
0.518 0.601 0.694 0.785 0.929
density_best <- bestNormalize(data$density)
orderNorm_obj <- orderNorm(data$density)
data$densityNorm <- predict(orderNorm_obj, newdata = data$density)
Shapiro-Wilk normality test
data: data$densityNorm
W = 0.98627, p-value = 4.935e-06
cor_affix
cor_affix: the correlation of s-hat and the semantic vector of its affix function; more or less a measure of transparency.
Shapiro-Wilk normality test
data: data$cor_affix
W = 0.75756, p-value < 2.2e-16
cor_affix is not normally distributed. Thus, cor_affix is log-transformed:
Shapiro-Wilk normality test
data: data$cor_affixLog
W = NaN, p-value = NA
However, log-transforming fails here, as cor_affix contains negative values. Thus, we test another way to normalize the data using the bestNormalize package.
Best Normalizing transformation with 684 Observations
Estimated Normality Statistics (Pearson P / df, lower => more normal):
- arcsinh(x): 19.7154
- Exp(x): 19.9425
- Log_b(x+a): 19.7236
- No transform: 19.7235
- orderNorm (ORQ): 15.5695
- sqrt(x + a): 18.8939
- Yeo-Johnson: 19.2443
Estimation method: Out-of-sample via CV with 10 folds and 5 repeats
Based off these, bestNormalize chose:
orderNorm Transformation with 684 nonmissing obs and ties
- 25 unique values
- Original quantiles:
0% 25% 50% 75% 100%
-0.050 -0.004 0.000 0.002 0.099
cor_affix_best <- bestNormalize(data$cor_affix)
orderNorm_obj <- orderNorm(data$cor_affix)
data$cor_affixNorm <- predict(orderNorm_obj, newdata = data$cor_affix)
Shapiro-Wilk normality test
data: data$cor_affixNorm
W = 0.89007, p-value < 2.2e-16
8.2.3.2 production measures
correlations
correlations: the correlation of the predicted path with the targeted semantic vector.
Shapiro-Wilk normality test
data: data$correlations
W = 0.87493, p-value < 2.2e-16
correlations is not normally distributed. Thus, correlations is log-transformed:
Shapiro-Wilk normality test
data: data$correlationsLog
W = 0.86514, p-value < 2.2e-16
However, log-transforming does not improve normality. Thus, we test another way to normalize the data using the bestNormalize package.
Best Normalizing transformation with 684 Observations
Estimated Normality Statistics (Pearson P / df, lower => more normal):
- arcsinh(x): 8.274
- Box-Cox: 8.1077
- Exp(x): 8.7276
- Log_b(x+a): 8.0811
- No transform: 8.225
- orderNorm (ORQ): 1.2942
- sqrt(x + a): 7.849
- Yeo-Johnson: 8.222
Estimation method: Out-of-sample via CV with 10 folds and 5 repeats
Based off these, bestNormalize chose:
orderNorm Transformation with 684 nonmissing obs and ties
- 46 unique values
- Original quantiles:
0% 25% 50% 75% 100%
0.068 0.279 0.651 0.958 1.000
correlations_best <- bestNormalize(data$correlations)
orderNorm_obj <- orderNorm(data$correlations)
data$correlationsNorm <- predict(orderNorm_obj, newdata = data$correlations)
Shapiro-Wilk normality test
data: data$correlationsNorm
W = 0.99319, p-value = 0.003364
path_counts
path_counts: the number of paths detected.
Shapiro-Wilk normality test
data: data$path_counts
W = 0.80891, p-value < 2.2e-16
path_counts is not normally distributed. Thus, path_counts is log-transformed:
Shapiro-Wilk normality test
data: data$path_countsLog
W = 0.85867, p-value < 2.2e-16
However, log-transforming does not improve normality. Thus, we test another way to normalize the data using the bestNormalize package.
Best Normalizing transformation with 684 Observations
Estimated Normality Statistics (Pearson P / df, lower => more normal):
- arcsinh(x): 23.2514
- Box-Cox: 23.2177
- Exp(x): 45.0524
- Log_b(x+a): 23.2177
- No transform: 23.2708
- orderNorm (ORQ): 23.3514
- sqrt(x + a): 23.2924
- Yeo-Johnson: 23.2177
Estimation method: Out-of-sample via CV with 10 folds and 5 repeats
Based off these, bestNormalize chose:
Standardized Box Cox Transformation with 684 nonmissing obs.:
Estimated statistics:
- lambda = -0.0112817
- mean (before standardization) = 0.6606645
- sd (before standardization) = 0.4586133
path_counts_best <- bestNormalize(data$path_counts)
bc_obj <- bestNormalize::boxcox(data$path_counts)
data$path_countsNorm <- predict(bc_obj, newdata = data$path_counts)
Shapiro-Wilk normality test
data: data$path_countsNorm
W = 0.85827, p-value < 2.2e-16
path_sum
path_sum: the summed support for the predicted path.
Shapiro-Wilk normality test
data: data$path_sum
W = 0.91361, p-value < 2.2e-16
path_sum is not normally distributed. Thus, path_sum is log-transformed:
Shapiro-Wilk normality test
data: data$path_sumLog
W = 0.88998, p-value < 2.2e-16
However, log-transforming does not improve normality. Thus, we test another way to normalize the data using the bestNormalize package.
Best Normalizing transformation with 684 Observations
Estimated Normality Statistics (Pearson P / df, lower => more normal):
- arcsinh(x): 6.6561
- Box-Cox: 5.521
- Exp(x): 7.9058
- Log_b(x+a): 6.5748
- No transform: 5.538
- orderNorm (ORQ): 1.2506
- sqrt(x + a): 6.4564
- Yeo-Johnson: 5.3907
Estimation method: Out-of-sample via CV with 10 folds and 5 repeats
Based off these, bestNormalize chose:
orderNorm Transformation with 684 nonmissing obs and ties
- 48 unique values
- Original quantiles:
0% 25% 50% 75% 100%
1.010 2.003 3.328 4.206 4.709
bestNormalize(data$path_sum)
path_sum_best <- bestNormalize(data$path_sum)
orderNorm_obj <- orderNorm(data$path_sum)
data$path_sumNorm <- predict(orderNorm_obj, newdata = data$path_sum)
Shapiro-Wilk normality test
data: data$path_sumNorm
W = 0.99349, p-value = 0.004673
path_entropies
path_entropies: Shannon entropy calculated over the path supports.
Shapiro-Wilk normality test
data: data$path_entropies
W = 0.86413, p-value < 2.2e-16
path_entropies is not normally distributed. Thus, path_entropies is log-transformed:
Shapiro-Wilk normality test
data: data$path_entropiesLog
W = NaN, p-value = NA
However, log-transforming fails here, as path_entropies contains zeros. Thus, we test another way to normalize the data using the bestNormalize package.
Best Normalizing transformation with 684 Observations
Estimated Normality Statistics (Pearson P / df, lower => more normal):
- arcsinh(x): 18.9133
- Exp(x): 17.007
- Log_b(x+a): 30.5447
- No transform: 16.8756
- orderNorm (ORQ): 4.795
- sqrt(x + a): 19.103
- Yeo-Johnson: 16.909
Estimation method: Out-of-sample via CV with 10 folds and 5 repeats
Based off these, bestNormalize chose:
orderNorm Transformation with 684 nonmissing obs and ties
- 36 unique values
- Original quantiles:
0% 25% 50% 75% 100%
0.000 0.000 0.952 1.204 2.346
bestNormalize(data$path_entropies)
path_entropies_best <- bestNormalize(data$path_entropies)
orderNorm_obj <- orderNorm(data$path_entropies)
data$path_entropiesNorm <- predict(orderNorm_obj, newdata = data$path_entropies)
Shapiro-Wilk normality test
data: data$path_entropiesNorm
W = 0.93314, p-value < 2.2e-16
lwlr
lwlr: the length-weakest link ratio of the predicted path.
Shapiro-Wilk normality test
data: data$lwlr
W = 0.84565, p-value < 2.2e-16
lwlr is not normally distributed. Thus, lwlr is log-transformed:
Shapiro-Wilk normality test
data: data$lwlrLog
W = 0.87731, p-value < 2.2e-16
However, log-transforming does not improve normality. Thus, we test another way to normalize the data using the bestNormalize package.
Best Normalizing transformation with 684 Observations
Estimated Normality Statistics (Pearson P / df, lower => more normal):
- arcsinh(x): 7.4059
- Box-Cox: 7.1704
- Exp(x): 61.1705
- Log_b(x+a): 7.3937
- No transform: 8.8156
- orderNorm (ORQ): 1.6566
- sqrt(x + a): 7.8238
- Yeo-Johnson: 7.4122
Estimation method: Out-of-sample via CV with 10 folds and 5 repeats
Based off these, bestNormalize chose:
orderNorm Transformation with 684 nonmissing obs and ties
- 34 unique values
- Original quantiles:
0% 25% 50% 75% 100%
5.102 5.882 8.475 16.667 23.810
bestNormalize(data$lwlr)
lwlr_best <- bestNormalize(data$lwlr)
orderNorm_obj <- orderNorm(data$lwlr)
data$lwlrNorm <- predict(orderNorm_obj, newdata = data$lwlr)
Shapiro-Wilk normality test
data: data$lwlrNorm
W = 0.9905, p-value = 0.0002134
8.3 Correlations 1
Correlations of sDurLog, Affix, and all (normalized) LDL measures.
pairscor.fnc(data[,c("sDurLog", "Affix", "speakingRateLog", "baseDurLog", "pauseBin", "biphoneProbSumBin", "folType", "monoMultilingual",
"cor_max", "cor_targetNorm", "l1normNorm", "l2normNorm", "rank", "densityNorm", "cor_affixNorm",
"correlationsNorm", "path_countsNorm", "path_sumNorm", "path_entropiesNorm", "lwlrNorm" )])
However, this is a rather crowded picture. We will return to more specific correlations in the next section.
8.4 Random Forests
We take a look at two different random forests: one forest containing all variables, and one forest containing only LDL variables.
8.4.1 Forest 1: all variables
preds = c("Affix", "speakingRateLog", "pauseBin", "biphoneProbSumBin", "folType", "monoMultilingual",
"cor_maxNorm", "cor_targetNorm", "l1normNorm", "l2normNorm", "rank", "densityNorm", "cor_affixNorm",
"correlationsNorm", "path_counts", "path_sumNorm", "path_entropiesNorm", "lwlrNorm")
varnames = c("sDurLog", "Affix", "speakingRateLog", "pauseBin", "biphoneProbSumBin", "folType", "monoMultilingual",
"cor_maxNorm", "cor_targetNorm", "l1normNorm", "l2normNorm", "rank", "densityNorm", "cor_affixNorm",
"correlationsNorm", "path_counts", "path_sumNorm", "path_entropiesNorm", "lwlrNorm")
pseudo.cforest = cforest(sDurLog ~ ., data = data[, varnames])
pseudo.varimp = varimp(pseudo.cforest)
dotplot(sort(pseudo.varimp), xlab = "Relative variable importance for predicting /s/ duration", col = "#a2b98d", main = expression("Relative variable importance for /s/ duration"))
8.4.2 Forest 2: LDL variables
preds.ldl = c(
"cor_maxNorm", "cor_targetNorm", "l1normNorm", "l2normNorm", "rank", "densityNorm", "cor_affixNorm",
"correlationsNorm", "path_counts", "path_sumNorm", "path_entropiesNorm", "lwlrNorm")
varnames.ldl = c("sDurLog",
"cor_maxNorm", "cor_targetNorm", "l1normNorm", "l2normNorm", "rank", "densityNorm", "cor_affixNorm",
"correlationsNorm", "path_counts", "path_sumNorm", "path_entropiesNorm", "lwlrNorm")
pseudo.cforest.ldl = cforest(sDurLog ~ ., data = data[, varnames.ldl])
pseudo.varimp.ldl = varimp(pseudo.cforest.ldl)
dotplot(sort(pseudo.varimp.ldl), xlab = "Relative variable importance for predicting /s/ duration", col = "#a2b98d", main = expression("Relative variable importance for /s/ duration"))
8.5 Correlations 2
Taking a closer look at the LDL variables, we find that some of them are highly correlated:
pairscor.fnc(data[,c("sDurLog", "Affix",
"cor_targetNorm", "lwlrNorm", "path_sumNorm", "correlationsNorm", "cor_maxNorm", "densityNorm", "cor_affixNorm",
"l1normNorm", "rank", "path_counts", "path_entropiesNorm", "l2normNorm" )])
We find the following highly correlated variable combinations:
- correlationsNorm vs. cor_targetNorm, \(rho=0.92\)
- path_sumNorm vs. lwlrNorm, \(rho=-0.87\)
- l1normNorm vs. l2normNorm, \(rho=0.93\)
- path_counts vs. path_entropiesNorm, \(rho=0.91\)
Thus, we need to exclude some of them so we do not run into problems of collinearity later on.
correlationsNorm vs. cor_targetNorm
Create two simple models containing either correlationsNorm or cor_targetNorm. Then, compare them using anova().
mdl.test.correlationsNorm <- lmer(sDurLog ~ correlationsNorm + (1 | speaker), data, REML=F)
mdl.test.cor_targetNorm <- lmer(sDurLog ~ cor_targetNorm + (1 | speaker), data, REML=F)
anova(mdl.test.correlationsNorm, mdl.test.cor_targetNorm)
Data: data
Models:
mdl.test.correlationsNorm: sDurLog ~ correlationsNorm + (1 | speaker)
mdl.test.cor_targetNorm: sDurLog ~ cor_targetNorm + (1 | speaker)
npar AIC BIC logLik deviance Chisq Df
mdl.test.correlationsNorm 4 401.79 419.90 -196.89 393.79
mdl.test.cor_targetNorm 4 402.12 420.24 -197.06 394.12 0 0
Pr(>Chisq)
mdl.test.correlationsNorm
mdl.test.cor_targetNorm 1
No difference; keep cor_targetNorm because it is less correlated to other variables.
path_sumNorm vs. lwlrNorm
Create two simple models containing either path_sumNorm or lwlrNorm. Then, compare them using anova().
mdl.test.path_sumNorm <- lmer(sDurLog ~ path_sumNorm + (1 | speaker), data, REML=F)
mdl.test.lwlrNorm <- lmer(sDurLog ~ lwlrNorm + (1 | speaker), data, REML=F)
anova(mdl.test.path_sumNorm, mdl.test.lwlrNorm)
Data: data
Models:
mdl.test.path_sumNorm: sDurLog ~ path_sumNorm + (1 | speaker)
mdl.test.lwlrNorm: sDurLog ~ lwlrNorm + (1 | speaker)
npar AIC BIC logLik deviance Chisq Df Pr(>Chisq)
mdl.test.path_sumNorm 4 402.64 420.75 -197.32 394.64
mdl.test.lwlrNorm 4 402.91 421.02 -197.46 394.91 0 0 1
No difference; keep lwlrNorm because it is less correlated to other variables.
l1normNorm vs. l2normNorm
Create two simple models containing either l1normNorm or l2normNorm. Then, compare them using anova().
mdl.test.l1normNorm <- lmer(sDurLog ~ l1normNorm + (1 | speaker), data, REML=F)
mdl.test.l2normNorm <- lmer(sDurLog ~ l2normNorm + (1 | speaker), data, REML=F)
anova(mdl.test.l1normNorm, mdl.test.l2normNorm)
Data: data
Models:
mdl.test.l1normNorm: sDurLog ~ l1normNorm + (1 | speaker)
mdl.test.l2normNorm: sDurLog ~ l2normNorm + (1 | speaker)
npar AIC BIC logLik deviance Chisq Df Pr(>Chisq)
mdl.test.l1normNorm 4 402.98 421.09 -197.49 394.98
mdl.test.l2normNorm 4 403.18 421.29 -197.59 395.18 0 0 1
No difference; keep l1normNorm because it is less correlated to other variables.
path_counts vs. path_entropiesNorm
Create two simple models containing either path_counts or path_entropiesNorm. Then, compare them using anova().
mdl.test.path_counts <- lmer(sDurLog ~ path_counts + (1 | speaker), data, REML=F)
mdl.test.path_entropiesNorm <- lmer(sDurLog ~ path_entropiesNorm + (1 | speaker), data, REML=F)
anova(mdl.test.path_counts, mdl.test.path_entropiesNorm)
Data: data
Models:
mdl.test.path_counts: sDurLog ~ path_counts + (1 | speaker)
mdl.test.path_entropiesNorm: sDurLog ~ path_entropiesNorm + (1 | speaker)
npar AIC BIC logLik deviance Chisq Df
mdl.test.path_counts 4 401.57 419.69 -196.79 393.57
mdl.test.path_entropiesNorm 4 402.66 420.77 -197.33 394.66 0 0
Pr(>Chisq)
mdl.test.path_counts
mdl.test.path_entropiesNorm 1
No difference; keep path_entropiesNorm because it is less correlated to other variables.
8.6 Mixed Effects Models
Let’s create a full model containing all traditional control variables as well as all non-correlated LDL variables.
mdl.before.step <- lmer(sDurLog ~
#Affix +
speakingRate +
pauseBin +
folType +
# correlationsNorm +
# path_sumNorm +
# l2normNorm +
# path_countsNorm +
lwlrNorm +
cor_targetNorm +
cor_affixNorm +
cor_maxNorm +
densityNorm +
l1normNorm +
rank +
path_entropiesNorm +
(1 | speaker)+
(1 | transcription),
data=data,
REML=FALSE)
Checking this full model for significance of fixed effects results in a rather sad picture:
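The table below comes from a call along these lines, assuming lmerTest is loaded so that anova() returns Satterthwaite F tests:
anova(mdl.before.step)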
Type III Analysis of Variance Table with Satterthwaite's method
Sum Sq Mean Sq NumDF DenDF F value Pr(>F)
speakingRate 1.3005 1.3005 1 662.48 19.4765 1.189e-05 ***
pauseBin 10.0826 10.0826 1 660.77 150.9964 < 2.2e-16 ***
folType 3.4696 0.8674 4 631.25 12.9902 3.591e-10 ***
lwlrNorm 0.0223 0.0223 1 87.00 0.3339 0.5648
cor_targetNorm 0.0016 0.0016 1 394.81 0.0240 0.8770
cor_affixNorm 0.0115 0.0115 1 163.03 0.1717 0.6792
cor_maxNorm 0.1510 0.1510 1 24.07 2.2608 0.1457
densityNorm 0.1489 0.1489 1 22.81 2.2293 0.1491
l1normNorm 0.0054 0.0054 1 31.48 0.0814 0.7773
rank 0.0821 0.0821 1 25.23 1.2295 0.2780
[ reached getOption("max.print") -- omitted 1 row ]
---
Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
Let’s see whether reducing the number of fixed effects results in any significant LDL measures:
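A sketch of this step, assuming backward elimination via lmerTest's step() method:
step(mdl.before.step)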
Backward reduced random-effect table:
Eliminated npar logLik AIC LRT Df Pr(>Chisq)
<none> 18 -97.931 231.86
(1 | transcription) 1 17 -98.468 230.94 1.073 1 0.3003
(1 | speaker) 0 16 -193.416 418.83 189.897 1 <2e-16 ***
---
Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
Backward reduced fixed-effect table:
Degrees of freedom method: Satterthwaite
Eliminated Sum Sq Mean Sq NumDF DenDF F value Pr(>F)
cor_targetNorm 1 0.0058 0.0058 1 645.07 0.0850 0.7708
l1normNorm 2 0.0128 0.0128 1 645.04 0.1892 0.6637
lwlrNorm 3 0.0115 0.0115 1 648.19 0.1701 0.6802
path_entropiesNorm 4 0.0473 0.0473 1 644.40 0.6968 0.4042
cor_affixNorm 5 0.0511 0.0511 1 647.07 0.7513 0.3864
rank 6 0.0735 0.0735 1 645.57 1.0804 0.2990
cor_maxNorm 7 0.1825 0.1825 1 645.08 2.6760 0.1024
densityNorm 8 0.0980 0.0980 1 646.03 1.4306 0.2321
speakingRate 0 1.3916 1.3916 1 683.99 20.2607 7.942e-06
cor_targetNorm
l1normNorm
lwlrNorm
path_entropiesNorm
cor_affixNorm
rank
cor_maxNorm
densityNorm
speakingRate ***
[ reached getOption("max.print") -- omitted 2 rows ]
---
Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
Model found:
sDurLog ~ speakingRate + pauseBin + folType + (1 | speaker)
No - none of the LDL measures appears to be significantly predictive for /s/ duration.
Taking a closer look at the distribution of all LDL measures, we find why they do not show any effects:
[Distribution plots: cor_targetNorm, cor_affixNorm, cor_maxNorm, densityNorm, l1normNorm, rank, path_entropiesNorm]
9 More Measures
It appears that the measures obtained by learning comprehension and production of real words and pseudowords are not predictive of /s/ duration in pseudowords.
However, there is one ray of hope left: Chuang et al. (2020) use a number of measures which are not contained in our comprehension and production measures, namely:
- ALDC - Average Levenshtein Distance of Candidates
- SCPP - Semantic Correlation of Predicted Production
- EDNN - Euclidean Distance from Nearest Neighbour
- NNC - Nearest Neighbour Correlation
- ALC - Average Lexical Correlation
- DRC - Dual Route Consistency
Let’s see whether these measures are predictive of /s/ duration in our pseudowords.
9.1 ALDC
The Average Levenshtein Distance of all Candidate productions from the true pronunciation of a given pseudoword.
Higher ALDC values indicate that the candidate forms are very different from the intended pronunciation, indicating a sparse form neighborhood.
# load stringdist package to calculate Levensthein Distances
library(stringdist)
# creating a function to obtain the ALDC
ALDC <- function (prod_acc, data) {
if (is.null(prod_acc)) {
stop(call=F, geterrmessage = "prod_acc not found\n")
}
if (is.null(data)) {
stop(call=F, geterrmessage = "data not found\n")
}
result <- vector("list",length(prod_acc[["full"]]))
for (i in 1:length(prod_acc[["full"]]))
{
result[[i]] <- stringdist::stringdist(
prod_acc[["full"]][[i]][["word"]],
prod_acc[["full"]][[i]][["candidates"]],
method = c("lv"),
useBytes = FALSE,
weight = c(d = 1, i = 1, s = 1, t = 1),
q = 1,
p = 0,
bt = 0,
nthread = getOption("sd_num_thread"))
}
Word = data$Word
Base = data$Base
ALDC <- data.frame(matrix(vector(), 0, 1, dimnames=list(c(), c("ALDC"))), stringsAsFactors=F)
for (j in 1:length(prod_acc[["full"]]))
{
ALDC[j,1] <- mean(result[[j]])
}
ALDCframe <- as.data.frame(cbind(Word, Base, ALDC))
}
# using the function
ALDCframe <- ALDC(prod_acc = comb.prod_acc, data = comb.data)
9.2 SCPP
The maximum correlation between a word’s predicted semantic vector and any of the semantic vectors of its candidate forms, i.e. the Semantic Correlation of Predicted Production.
Higher SCPP values indicate that the semantics of the estimated form better approximate the estimated meaning.
# creating a function to obtain the SCPP
SCPP <- function (comp_measures, data) {
if (is.null(comp_measures)) {
stop(call=F, geterrmessage = "comp_measures not found\n")
}
if (is.null(data)) {
stop(call=F, geterrmessage = "data not found\n")
}
SCPP = comp_measures$cor_max
Word = data$Word
Base = data$Base
SCPPframe <- as.data.frame(cbind(Word, Base, SCPP))
}
# using the function
SCPPframe <- SCPP(comp_measures = comb.comp_measures, data = comb.data)
9.3 EDNN
The Euclidean Distance between a word’s estimated semantic vector and its Nearest Neighbour.
Higher EDNN values indicate that a word is more distant from the nearest word neighbour.
# load FNN package for get.knnx()
library(FNN)
# creating a function to obtain the EDNN
EDNN <- function (comprehension, data) {
if (is.null(comprehension)) {
stop(call=F, geterrmessage = "comprehension not found\n")
}
if (is.null(data)) {
stop(call=F, geterrmessage = "data not found\n")
}
euclidian <- get.knnx(comprehension$S, comprehension$Shat, k=1)
EDNN <- as.data.frame(euclidian[["nn.dist"]])
colnames(EDNN) <- c("EDNN")
Word = data$Word
Base = data$Base
EDNNframe <- as.data.frame(cbind(Word, Base, EDNN))
}
# using the function
EDNNframe <- EDNN(comprehension = comb.comp, data = comb.data)
9.4 NNC
The maximum of the correlations between a pseudoword’s estimated semantic vector and words’ semantic vectors, i.e. a pseudoword’s Nearest Neighbour Correlation.
Higher NNC values indicate that a pseudoword’s meaning is similar to that of a real word.
# creating a function to obtain the NNC
NNC <- function (pseudo_S_matrix, real_S_matrix, pseudo_word_data, real_word_data)
{
if (is.null(pseudo_S_matrix)) {
stop(call=F, geterrmessage = "pseudo_S_matrix not found\n")
}
if (is.null(real_S_matrix)) {
stop(call=F, geterrmessage = "real_S_matrix not found\n")
}
if (is.null(pseudo_word_data)) {
stop(call=F, geterrmessage = "pseudo_word_data not found\n")
}
if (is.null(real_word_data)) {
stop(call=F, geterrmessage = "real_word_data not found\n")
}
results <- data.frame(matrix(ncol = 1, nrow = 0))
x <- colnames(results[1])
names(results)[names(results) == x] <- "V1"
for (i in 1:nrow(pseudo_S_matrix))
{
for (j in 1:nrow(real_S_matrix))
{
results[i,j] <- cor(pseudo_S_matrix[i,], real_S_matrix[j,], method = "pearson")
}
}
NNC <- apply(results, 1, max)
vectors_wV <- colnames(results)[apply(results,1,which.max)]
vectors <- as.numeric(gsub('V', '', vectors_wV))
Base <- pseudo_word_data$Base
Pseudoword <- pseudo_word_data$Word
Word <- real_word_data[vectors,1]
NNCframe <- as.data.frame(cbind(NNC, vectors, Base, Pseudoword, Word))
}
# using the function
NNCframe <- NNC(pseudo_S_matrix = engs.Smat, real_S_matrix = comb.Smat, pseudo_word_data = engs.data, real_word_data = comb.data)
9.5 ALC
The mean correlation of a pseudoword’s estimated semantic vector with each of the word’s semantic vectors, i.e. its Average Lexical Correlation.
Higher ALC values indicate that a pseudoword vector is part of a denser semantic neighbourhood.
# creating a function to obtain the ALC
ALC <- function (pseudo_S_matrix, real_S_matrix, pseudo_word_data)
{
if (is.null(pseudo_S_matrix)) {
stop(call=F, geterrmessage = "pseudo_S_matrix not found\n")
}
if (is.null(real_S_matrix)) {
stop(call=F, geterrmessage = "real_S_matrix not found\n")
}
if (is.null(pseudo_word_data)) {
stop(call=F, geterrmessage = "pseudo_word_data not found\n")
}
results <- data.frame(matrix(ncol = 1, nrow = 0))
for (i in 1:length(row.names(pseudo_S_matrix)))
{
for (j in 1:length(row.names(real_S_matrix)))
{
results[i,j] <- cor(pseudo_S_matrix[i,], real_S_matrix[j,], method = "pearson")
}
}
ALC <- rowMeans(results)
Pseudoword <- pseudo_word_data$Word
Base <- pseudo_word_data$Base
ALCframe <- as.data.frame(cbind(ALC, Pseudoword, Base))
}
# using the function
ALCframe <- ALC(pseudo_S_matrix = engs.Smat, real_S_matrix = mald.Smat, pseudo_word_data = engs.data)
9.6 DRC
The correlation between the semantic vector estimated from the direct route and that from the indirect route, i.e. the Dual Route Consistency of a word’s semantic vectors.
Higher DRC values indicate that the semantic vectors produced by the two routes are more similar to each other.
As we need estimated semantic vectors for both the direct and the indirect route, we first have to compute the estimated semantic vectors for the indirect route. The estimated semantic vectors for the direct route were computed earlier (see 4.1).
## get cues for INDIRECT route
comb.cues.indirect = make_cue_matrix(data = comb.data, formula = ~ Word + Affix, grams = 3, cueForm = "DISC", indirectRoute = TRUE)
## get mapping for comprehension with INDIRECT route cues
comb.comp.indirect = learn_comprehension(cue_obj = comb.cues.indirect, S = comb.Smat3)
beep()
## solve equations to obtain the S matrix based on INDIRECT cues
Tmat = comb.cues.indirect$matrices$T
X = MASS::ginv(t(comb.cues.indirect$matrices$C) %*% comb.cues.indirect$matrices$C) %*% t(comb.cues.indirect$matrices$C)
K = X %*% comb.cues.indirect$matrices$T
H = MASS::ginv(t(Tmat) %*% Tmat) %*% t(Tmat) %*% comb.Smat2
That = comb.cues.indirect$matrices$C %*% K
Shat = That %*% H
Now, Shat is the matrix containing the estimated semantic vectors of words and pseudowords based on the indirect route. Using this Shat and the Shat matrix contained in the object obtained originally by learn_comprehension, we are able to compute DRC:
# creating a function to obtain the DRC
DRC <- function (comprehension, S_matrix_indirect, data)
{
if (is.null(comprehension)) {
stop(call=F, geterrmessage = "comprehension not found\n")
}
if (is.null(S_matrix_indirect)) {
stop(call=F, geterrmessage = "S_matrix_indirect not found\n")
}
if (is.null(data)) {
stop(call=F, geterrmessage = "data not found\n")
}
results <- data.frame(matrix(ncol = 1, nrow = 0))
for (i in 1:length(row.names(S_matrix_indirect)))
{
results[i,1] <- cor(S_matrix_indirect[i,], comprehension$Shat[i,], method = "pearson")
}
DRC <- rowMeans(results)
Word <- data$Word
Base <- data$Base
DRCframe <- as.data.frame(cbind(DRC, Word, Base))
}
# using the function
DRCframe <- DRC(comprehension = comb.comp, S_matrix_indirect = Shat, data = comb.data)
9.7 Package
The functions to compute and extract ALDC, SCPP, EDNN, NNC, ALC, and DRC are now available as the LDLConvFunctions package. It can be found here: https://github.com/dosc91/LDLConvFunctions
Install this package through devtools:
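A minimal sketch of the installation via devtools:
devtools::install_github("dosc91/LDLConvFunctions")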
10 Analysis II
10.1 Data
Combine previously used data (see 8.1) with newly computed and extracted measures (see 9).
# ALDC
ALDCframe <- read.csv("chuang_measures/ALDCframe_new.csv")[2:4]
# SCPP
SCPPframe <- read.csv("chuang_measures/SCPPframe_new.csv")[2:4]
# EDNN
EDNNframe <- read.csv("chuang_measures/EDNNframe_new.csv")[2:4]
# NNC
NNCframe <- read.csv("chuang_measures/NNCframe_new.csv")[2:6]
# ALC
ALCframe <- read.csv("chuang_measures/ALCframe_new.csv")[2:4]
# DRC
DRCframe <- read.csv("chuang_measures/DRCframe_new.csv")[2:4]
## merge frames which contain info on all words
comb1 <- merge(ALDCframe, SCPPframe, by=c("Base", "Word"))
comb2 <- merge(comb1, EDNNframe, by=c("Base", "Word"))
ALDC_SCPP_EDNN_DCC_frame <- merge(comb2, DRCframe, by=c("Base", "Word"))
## filter for pseudowords
ALDC_SCPP_EDNN_DCC_frame <- subset(ALDC_SCPP_EDNN_DCC_frame, Word=="bloufs" | Word=="blouks" | Word=="bloups" | Word=="blouts" |
Word=="cloofs" | Word=="clooks" | Word=="cloops" | Word=="cloots" |
Word=="glaifs" | Word=="glaiks" | Word=="glaips" | Word=="glaits" |
Word=="glifs" | Word=="gliks" | Word=="glips" | Word=="glits" |
Word=="pleefs" | Word=="pleeks" | Word=="pleeps" | Word=="pleets" |
Word=="prufs" | Word=="pruks" | Word=="prups" | Word=="pruts")
# merge NNC and ALC
NNC_ALC_frame <- merge(NNCframe, ALCframe, by=c("Base", "Pseudoword"))
# rename columns
names(NNC_ALC_frame)[names(NNC_ALC_frame) == "Word"] <- "RealWord"
names(NNC_ALC_frame)[names(NNC_ALC_frame) == "Pseudoword"] <- "Word"
# merge all frames
all_frames <- merge(ALDC_SCPP_EDNN_DCC_frame, NNC_ALC_frame, by=c("Base", "Word"))
# merge rest of data with new measures
data <- merge(data, all_frames, by=c("Base", "Word"))
10.2 Correlations I
Correlations of sDurLog, Affix, and all new measures.
pairscor.fnc(data[,c("sDurLog", "Affix", "NeighbourhoodDensity", "ALDC", "DRC", "NNC", "ALC", "SCPP", "EDNN")])
We find the following highly correlated variable combinations:
- DRC vs. EDNN, \(rho=0.43\)
- NNC vs. ALC, \(rho=0.76\)
- ALC vs. SCPP, \(rho=0.35\)
- SCPP vs. EDNN, \(rho=-0.5\)
Thus, we need to address this before we run into problems of collinearity later on.
10.3 PCA
Let’s attempt a principal component analysis. First, we extract the correlated variables and merge them to form a new data frame.
pca.data1 <- data[colnames(data)=="DRC"]
pca.data2 <- data[colnames(data)=="EDNN"]
pca.data3 <- data[colnames(data)=="NNC"]
pca.data4 <- data[colnames(data)=="SCPP"]
pca.data5 <- data[colnames(data)=="ALC"]
pca.data <- cbind(pca.data1,
pca.data2,
pca.data3,
pca.data4,
pca.data5)
Compute a principal component analysis using the princomp function:
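A minimal sketch consistent with the output below; the use of the correlation matrix (cor = TRUE) is an assumption:
pca <- princomp(pca.data, cor = TRUE)
summary(pca)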
Importance of components:
Comp.1 Comp.2 Comp.3 Comp.4 Comp.5
Standard deviation 1.4542451 1.2254515 0.8792584 0.64182608 0.44542541
Proportion of Variance 0.4229658 0.3003463 0.1546191 0.08238814 0.03968076
Cumulative Proportion 0.4229658 0.7233120 0.8779311 0.96031924 1.00000000
Following one of several possible criteria, we take components 1, 2, and 3 for further analysis.
The loadings of our PCA components are the following:
Loadings:
Comp.1 Comp.2 Comp.3 Comp.4 Comp.5
DRC 0.318 0.443 0.762 0.293 0.191
EDNN 0.381 0.555 -0.172 -0.702 -0.158
NNC -0.482 0.491 -0.269 0.673
SCPP -0.492 -0.241 0.561 -0.615
ALC -0.528 0.444 0.205 -0.692
Comp.1 Comp.2 Comp.3 Comp.4 Comp.5
SS loadings 1.0 1.0 1.0 1.0 1.0
Proportion Var 0.2 0.2 0.2 0.2 0.2
Cumulative Var 0.2 0.4 0.6 0.8 1.0
What do these loadings conceptually mean?
Variables | Comp.1 | Comp.1.interpretation | Comp.2 | Comp.2.interpretation | Comp.3 | Comp.3.interpretation |
---|---|---|---|---|---|---|
ALC | -0.528 | sparse semantic neighbourhood | 0.444 | dense semantic neighbourhood | 0.000 | 0 |
DRC | 0.318 | direct + indirect route are consistent | 0.443 | direct + indirect route are consistent | 0.762 | direct + indirect route are consistent |
EDNN | 0.381 | semantically distant from nearest neighbour | 0.555 | semantically distant from nearest neighbour | -0.172 | semantically close to nearest neighbour |
NNC | -0.482 | pseudoword meaning not similar to real word | 0.491 | pseudoword meaning similar to real word | -0.269 | pseudoword meaning not similar to real word |
SCPP | -0.492 | estimated form and meaning do not match well | -0.241 | estimated form and meaning do not match well | 0.561 | estimated form and meaning match well |
Taking the results of the PCA, we combine our data set and the newly calculated component scores:
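A minimal sketch, assuming the princomp object above is stored as pca:
data <- cbind(data, pca$scores[, c("Comp.1", "Comp.2", "Comp.3")])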
10.4 Correlations II
Checking the newly calculated PCA components, Affix, sDurLog and ALDC for correlations, we find none:
pairscor.fnc(data[,c("sDurLog", "Affix", "NeighbourhoodDensity", "ALDC", "Comp.1", "Comp.2", "Comp.3")])
10.5 Models
10.5.1 LMER: Simple Model
mdl.simple <- lmer(sDurLog ~
ALDC+
Comp.1 +
Comp.2 +
Comp.3 +
(1|speaker),
data=data, REML=F)
anova(mdl.simple)
Type III Analysis of Variance Table with Satterthwaite's method
Sum Sq Mean Sq NumDF DenDF F value Pr(>F)
ALDC 0.065581 0.065581 1 644.18 0.7358 0.3913
Comp.1 0.007051 0.007051 1 645.22 0.0791 0.7786
Comp.2 0.000277 0.000277 1 646.22 0.0031 0.9556
Comp.3 0.025287 0.025287 1 644.81 0.2837 0.5945
It appears none of the variables is significant for /s/ duration prediction.
10.5.2 LMER: Control Variables
mdl.control <- lmer(sDurLog ~
Affix +
ALDC +
Comp.1 +
Comp.2 +
Comp.3 +
SylPerMin +
pauseBin +
folType +
(1|speaker),
data=data, REML=F)
anova(mdl.control)
Type III Analysis of Variance Table with Satterthwaite's method
Sum Sq Mean Sq NumDF DenDF F value Pr(>F)
Affix 1.0285 1.0285 1 670.25 32.9533 1.431e-08 ***
ALDC 0.0010 0.0010 1 648.61 0.0330 0.8559558
Comp.1 0.0037 0.0037 1 652.44 0.1192 0.7299856
Comp.2 0.0168 0.0168 1 652.29 0.5390 0.4630936
Comp.3 0.1507 0.1507 1 649.49 4.8270 0.0283696 *
SylPerMin 27.3285 27.3285 1 581.36 875.5698 < 2.2e-16 ***
pauseBin 1.5566 1.5566 1 676.71 49.8717 4.083e-12 ***
folType 0.6112 0.1528 4 652.61 4.8956 0.0006788 ***
---
Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
It appears that Comp.3 is significant for /s/ duration prediction.
10.5.3 GAM: Control Variables
gam.control <- gam(sDurLog ~
Affix +
s(ALDC, k=9)+
s(Comp.1) +
s(Comp.2) +
s(Comp.3) +
SylPerMin+
pauseBin+
folType+
speaker,
data=data)
anova(gam.control)
Family: gaussian
Link function: identity
Formula:
sDurLog ~ Affix + s(ALDC, k = 9) + s(Comp.1) + s(Comp.2) + s(Comp.3) +
SylPerMin + pauseBin + folType + speaker
Parametric Terms:
df F p-value
Affix 1 28.438 1.36e-07
SylPerMin 1 764.534 < 2e-16
pauseBin 1 36.096 3.20e-09
folType 4 4.622 0.0011
speaker 39 6.342 < 2e-16
Approximate significance of smooth terms:
edf Ref.df F p-value
s(ALDC) 4.207 4.914 2.142 0.054426
s(Comp.1) 8.209 8.808 3.613 0.000416
s(Comp.2) 1.040 1.073 1.504 0.211859
s(Comp.3) 1.000 1.000 5.224 0.022601
It appears that Comp.1 and Comp.3 are significant for /s/ duration prediction.
11 Next Steps
Where do we go from here? Do we stick with the PCA components - do we keep 3, only 2, or even 4? What type of modelling do we wish to use? LMER or GAMs or both - and why? Affix is still significant, thus, LDL measures cannot completely capture its effect. Do we have an explanation for this?
My next steps:
- Find out how to use CELEX.
- Redo the entire LDL modelling (i.e. comprehension and production) with more real word data (see 1.).
- Combine “old” and “new” LDL measures (hint: correlations are all over the place, resulting in 7+ PCA components…).
- Go completely mad.
References
Baayen, R. H., Chuang, Y. Y., & Blevins, J. P. (2018). Inflectional morphology with linear mappings. The Mental Lexicon, 13(2), 232-270. doi.org/10.1075/ml.18010.baa
Baayen, R. H., Chuang, Y. Y., Shafaei-Bajestan, E., & Blevins, J. P. (2019). The discriminative lexicon: A unified computational model for the lexicon and lexical processing in comprehension and production grounded not in (de)composition but in linear discriminative learning. Complexity, 2019, 1-39. doi.org/10.1155/2019/4895891
Chuang, Y-Y., Vollmer, M-l., Shafaei-Bajestan, E., Gahl, S., Hendrix, P., & Baayen, R. H. (2020). The processing of pseudoword form and meaning in production and comprehension: A computational modeling approach using Linear Discriminative Learning. Behavior Research Methods, 1-51. doi.org/10.3758/s13428-020-01356-w
Nykamp, D. (2020, November 10). Multiplying matrices and vectors. Math Insight. mathinsight.org/matrix_vector_multiplication
Schmitz, D., Baer-Henney, D., & Plag, I. (submitted). The duration of word-final /s/ differs across morphological categories in English: Evidence from pseudowords.
Tucker, B. V., Brenner, D., Danielson, D. K., Kelley, M. C., Nenadic, F., & Sims, M. (2018). The massive auditory lexical decision (MALD) database. Behavior Research Methods, 1-18. doi.org/10.3758/s13428-018-1056-1