Durational differences of homophonous suffixes emerge from the lexicon: PCAs and GAMMs
This script documents the analysis of ENGS production data and MALD corpus data with measures of Linear Discriminative Learning (LDL; e.g. Baayen et al., 2018). This is a work in progress.
The idea of this LDL implementation is the following: networks trained on real-word data and their transformation matrices are used to create semantic matrices for pseudowords, following the implementation by Chuang et al. (2020). The underlying assumption is that speakers who encounter pseudowords process them with their knowledge of previously encountered words, i.e. their lexicon. Thus, pseudowords are not semantically empty shells but carry some sort of meaning.
The following script includes multiple Principal Component Analyses (PCAs) and the creation of a number of Generalized Additive Mixed Models (GAMMs). Please check this script for the LDL modelling itself as well as measure extractions.
1 Data
The data set used in this script contains the following parts:
- ENGS production data (Schmitz et al., 2020)
- LDL measures as extracted by the WpmWithLdl package
- LDL measures as extracted by the LDLConvFunctions package
For details on data set creation, please check this link.
The data set contains 55 columns and 683 rows, i.e. 683 observations of 55 variables:
2 WpmWithLdl measures
WpmWithLdl includes the following measures via the comprehension_measures and production_measures functions:
Comprehension:
- recognition: whether the correct form was recognized
- cor_max: the correlation between the pertinent vector in \(\hat{S}\) and the semantic vector of its predicted form in \(S\)
- cor_target: the correlation between the pertinent vector in \(\hat{S}\) and the semantic vector of its targeted form in \(S\)
- l1norm: the city-block / Manhattan distance of \(\hat{S}\); a greater l1norm implies stronger links to many other lexomes
- l2norm: the Euclidean (direct) distance / length of \(\hat{S}\); a greater l2norm implies stronger links to many other lexomes
- rank: the rank of the correlation between \(\hat{s}\) and the semantic vector of its targeted form
- density: the mean correlation of \(\hat{s}\) with the semantic vectors of its top 8 neighbors
- cor_affix: the correlation of \(\hat{s}\) and the semantic vector of its affix function; more or less a measure of transparency
Production:
- correlations: the correlation of the predicted path with the targeted semantic vector
- path_counts: the number of paths detected
- path_sum: the summed support for the predicted path; low summed support can be read as uncertainty about how to pronounce the word due to competing activation of articulatory motor plans
- path_entropies: Shannon entropy calculated over the path supports
- lwlr: the length-weakest-link ratio of the predicted path (sometimes equal to word length). When the length of a word’s path, measured in number of nodes, becomes higher while the weakest link between nodes stays the same, uncertainty becomes greater. If triphone support is high, i.e. when all link strengths are almost equal to one, the value will be almost equal to the number of segments in the word; in other models, where triphone support may be lower, this need not be the case.
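To make path_entropies concrete, here is a minimal sketch of Shannon entropy over path supports; the support values are invented, and normalising them to probabilities is an assumption about the actual WpmWithLdl implementation:

```r
# Hypothetical sketch of path_entropies: Shannon entropy over path supports.
# The support values are invented; normalising them to probabilities is an
# assumption, not necessarily what WpmWithLdl does internally.
path_entropy <- function(supports) {
  p <- supports / sum(supports)  # normalise supports to probabilities
  -sum(p * log2(p))              # Shannon entropy in bits
}

path_entropy(c(0.8, 0.1, 0.1))  # one dominant path: ~0.92 bits
path_entropy(c(1/3, 1/3, 1/3))  # three equally supported paths: ~1.58 bits
```

A single clearly supported path thus yields a low entropy, while several competing paths yield a high one.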
Without discussing the underlying implications of the individual measures, let’s see whether some of them are correlated. If so, we need to take precautions before including them in any models.
2.1 Correlations
Correlations of sDurLog, Affix, and all WpmWithLdl measures:
pairscor.fnc(data[,c("sDurLog", "Affix",
"cor_max", "l1norm", "l2norm", "rank", "density", "cor_affix",
"correlations", "path_counts", "path_sum", "path_entropies", "lwlr")])
We find the following highly correlated (\(\rho\ge0.5\)) variable combinations:
- cor_max vs. l2norm, \(\rho=-0.5\)
- cor_max vs. rank, \(\rho=-0.52\)
- l1norm vs. l2norm, \(\rho=0.86\)
- path_counts vs. path_entropies, \(\rho=0.92\)
- path_sum vs. lwlr, \(\rho=-0.9\)
Thus, we need to take precautions before we run into problems of collinearity later on.
2.2 PCA
To avoid collinearity issues when modelling, we combine all highly correlated variables in a principal component analysis.
PCA Computation
First, we extract the highly correlated variables and merge them to form a new data frame.
pca.data <- data[, c("cor_max", "l2norm", "rank", "l1norm",
                     "path_counts", "path_entropies", "path_sum", "lwlr")]
Then, we compute a principal component analysis using the princomp function:
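The call itself is not shown above; a minimal sketch, assuming the pca.data frame just created and correlation-matrix scaling (cor = TRUE, so that variables on different scales contribute equally — an assumption on our part), would be:

```r
# Sketch: PCA via princomp on the correlation matrix (cor = TRUE is an
# assumption; by default princomp uses the covariance matrix).
pc <- princomp(pca.data, cor = TRUE)
summary(pc)  # standard deviations and proportions of variance per component
```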
In order to decide which components to keep, we follow three criteria:
- The Eigenvalue-criterion: Any component that displays an eigenvalue greater than 1.00 is accounting for a greater amount of variance than had been contributed by one variable. Such a component is therefore accounting for a meaningful amount of variance, and is worthy of being retained.
- The scree-criterion: Plot the eigenvalues associated with each component and look for a “break” between the components with relatively large eigenvalues and those with small eigenvalues. The components that appear before the break are assumed to be meaningful and are retained (Cattell, 1966).
- The proportion-of-variance-criterion: Retain enough components so that the cumulative percent of variance accounted for reaches some minimal value, e.g. 85 percent.
Thus, let’s first check the PCA components’ Eigenvalues (criterion 1):
# load package to extract Eigenvalues
library("factoextra")
# extract Eigenvalues
get_eigenvalue(pc)
eigenvalue variance.percent cumulative.variance.percent
Dim.1 2.67973298 33.4966622 33.49666
Dim.2 2.55993527 31.9991909 65.49585
Dim.3 1.52302408 19.0378010 84.53365
Dim.4 0.60312766 7.5390957 92.07275
Dim.5 0.36117187 4.5146484 96.58740
Dim.6 0.13384219 1.6730274 98.26043
Dim.7 0.09054334 1.1317918 99.39222
Dim.8 0.04862261 0.6077826 100.00000
Components 1, 2, and 3 (called Dim.x here) show an Eigenvalue of \(v\ge1\), i.e. keep components 1, 2, and 3.
Next, take another look at the Eigenvalues and check for a “break” between the components (criterion 2) with relatively large eigenvalues and those with small eigenvalues:
eigenvalue variance.percent cumulative.variance.percent
Dim.1 2.67973298 33.4966622 33.49666
Dim.2 2.55993527 31.9991909 65.49585
Dim.3 1.52302408 19.0378010 84.53365
Dim.4 0.60312766 7.5390957 92.07275
Dim.5 0.36117187 4.5146484 96.58740
Dim.6 0.13384219 1.6730274 98.26043
Dim.7 0.09054334 1.1317918 99.39222
Dim.8 0.04862261 0.6077826 100.00000
A break in Eigenvalues can be found between component 3 with \(v=1.52\) and component 4 with \(v=0.60\). Thus, keep components 1, 2, and 3.
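The break can also be inspected graphically with a scree plot; this sketch assumes the pc object fitted with princomp above (factoextra::fviz_eig(pc) would give an equivalent plot):

```r
# Criterion 2: scree plot of the eigenvalues; look for the "break" between
# components with large and small eigenvalues.
screeplot(pc, type = "lines", main = "Scree plot")
```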
Lastly, check the cumulative percent of variance accounted for by all components (criterion 3). Our target level of explained variance is \(\sigma^2=85\%\):
eigenvalue variance.percent cumulative.variance.percent
Dim.1 2.67973298 33.4966622 33.49666
Dim.2 2.55993527 31.9991909 65.49585
Dim.3 1.52302408 19.0378010 84.53365
Dim.4 0.60312766 7.5390957 92.07275
Dim.5 0.36117187 4.5146484 96.58740
Dim.6 0.13384219 1.6730274 98.26043
Dim.7 0.09054334 1.1317918 99.39222
Dim.8 0.04862261 0.6077826 100.00000
Again, components 1, 2, and 3 are retained: their cumulative percent of variance accounted for is \(\sigma^2=84.53\%\), which almost reaches the \(85\%\) target.
Thus, we add components 1, 2, and 3 to our data frame.
pcframe <- as.data.frame(pc$scores)
pcframe <- pcframe[1:3]
library(dplyr)
pcframe <- pcframe %>% rename(
WWL_Comp.1 = Comp.1,
WWL_Comp.2 = Comp.2,
WWL_Comp.3 = Comp.3
)
data <- cbind(data, pcframe)
PCA Loadings
Now we can take a closer look at the individual components and their loadings.
Loadings:
Comp.1 Comp.2 Comp.3 Comp.4 Comp.5 Comp.6 Comp.7 Comp.8
cor_max 0.475 0.174 0.597 0.606 0.114
l2norm -0.476 -0.307 0.365 -0.135 0.660 0.226 -0.202
rank -0.346 -0.166 -0.494 -0.303 0.709 0.119
l1norm -0.437 -0.325 0.114 0.485 -0.616 -0.148 0.229
path_counts -0.534 0.334 -0.243 0.147 -0.377 -0.606
path_entropies 0.129 -0.485 0.412 -0.325 0.406 0.547
path_sum -0.342 0.366 0.422 -0.121 0.236 -0.248 0.555 -0.361
lwlr 0.307 -0.343 -0.503 -0.183 -0.312 0.551 -0.309
Comp.1 Comp.2 Comp.3 Comp.4 Comp.5 Comp.6 Comp.7 Comp.8
SS loadings 1.000 1.000 1.000 1.000 1.000 1.000 1.000 1.000
Proportion Var 0.125 0.125 0.125 0.125 0.125 0.125 0.125 0.125
Cumulative Var 0.125 0.250 0.375 0.500 0.625 0.750 0.875 1.000
But what do these loadings conceptually mean?
Component 1
Component 1 is rather highly correlated with cor_max, l1norm, and l2norm. For l1norm and l2norm, higher values indicate stronger links to many other lexomes, i.e. a higher activation diversity. cor_max describes the correlation between the pertinent vector in \(\hat{S}\) and the semantic vector of its predicted form in \(S\), i.e. higher values indicate that the semantics of the estimated form better approximate the estimated meaning.
Component 1 thus indicates a word’s activation diversity (i.e. l1norm and l2norm) and semantic certainty (i.e. cor_max), with higher values representing a lower activation diversity and a higher certainty.
Component 2
Component 2 is rather highly correlated with path_counts and path_entropies. Both measures describe the certainty of a predicted production: path_counts reflects the number of paths detected for a pertinent production, while path_entropies measures the Shannon entropy calculated over the path supports.
Component 2 thus reflects the support of a predicted production, i.e. its certainty.
Component 3
Component 3 is rather highly correlated with lwlr and rank. For lwlr, higher values indicate lower support for a predicted production, i.e. the length-weakest-link ratio of a predicted path is higher than the number of its triphones. For rank, higher values indicate a mismatch between estimated and predicted semantic vectors. Thus, both measures indicate phonological uncertainty. Additionally, Component 3 is also correlated with path_counts, path_sum, and path_entropies; however, these correlations point in the opposite direction, i.e. these measures indicate phonological certainty rather than uncertainty.
Component 3 thus reflects the support of a predicted production, i.e. its (un)certainty.
2.3 GAMMs
Let’s see whether the remaining LDL measures and PCA components are significant predictors for word-final /s/ duration in singular and plural pseudowords. We will use Generalized Additive Mixed Models (GAMMs) to model /s/ durations.
Modelling
First, create a model with all relevant variables, including LDL and PCA variables as smooths and speaker as random effect:
gam.WpmWithLdl.1 <- gam(sDurLog ~
Affix +
s(WWL_Comp.1) +
s(WWL_Comp.2) +
s(WWL_Comp.3) +
s(cor_target)+
s(correlations)+
s(density)+
s(cor_affix)+
SylPerMin+
pauseBin+
folType+
s(speaker, bs="re"),
data=data)
summary(gam.WpmWithLdl.1)
Family: gaussian
Link function: identity
Formula:
sDurLog ~ Affix + s(WWL_Comp.1) + s(WWL_Comp.2) + s(WWL_Comp.3) +
s(cor_target) + s(correlations) + s(density) + s(cor_affix) +
SylPerMin + pauseBin + folType + s(speaker, bs = "re")
Parametric coefficients:
Estimate Std. Error t value Pr(>|t|)
(Intercept) -0.841762 0.113426 -7.421 9.79e-13 ***
AffixPL -0.102729 0.098167 -1.046 0.2961
SylPerMin -0.513179 0.023101 -22.214 < 2e-16 ***
pauseBinpause 0.116584 0.023538 4.953 1.17e-06 ***
folTypeF 0.060360 0.086724 0.696 0.4869
folTypeN 0.006494 0.030648 0.212 0.8323
folTypeP 0.042363 0.025878 1.637 0.1026
folTypeV -0.057121 0.025054 -2.280 0.0232 *
---
Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
Approximate significance of smooth terms:
edf Ref.df F p-value
s(WWL_Comp.1) 2.252 2.789 3.932 0.0312 *
s(WWL_Comp.2) 1.000 1.000 0.446 0.5046
s(WWL_Comp.3) 1.000 1.000 4.736 0.0302 *
s(cor_target) 1.000 1.000 0.073 0.7875
s(correlations) 1.272 1.466 0.258 0.5581
s(density) 1.069 1.125 0.228 0.6181
s(cor_affix) 2.978 3.616 2.591 0.0485 *
s(speaker) 27.954 38.000 3.539 <2e-16 ***
---
Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
R-sq.(adj) = 0.797 Deviance explained = 82.1%
GCV = 0.035681 Scale est. = 0.0313 n = 379
Checking the anova() function’s output, we now exclude non-significant variables. The result of this procedure is the following model:
gam.WpmWithLdl.2 <- gam(sDurLog ~
#Affix +
s(WWL_Comp.1) +
#s(WWL_Comp.2) +
s(WWL_Comp.3) +
#s(cor_target)+
#s(correlations)+
#s(density)+
s(cor_affix)+
SylPerMin+
pauseBin+
folType+
s(speaker, bs="re"),
data=data)
anova(gam.WpmWithLdl.2)
Family: gaussian
Link function: identity
Formula:
sDurLog ~ s(WWL_Comp.1) + s(WWL_Comp.3) + s(cor_affix) + SylPerMin +
pauseBin + folType + s(speaker, bs = "re")
Parametric Terms:
df F p-value
SylPerMin 1 493.430 < 2e-16
pauseBin 1 23.395 2.03e-06
folType 4 4.154 0.00269
Approximate significance of smooth terms:
edf Ref.df F p-value
s(WWL_Comp.1) 3.193 3.873 3.931 0.00747
s(WWL_Comp.3) 8.146 8.648 2.506 0.00997
s(cor_affix) 2.534 2.993 3.887 0.00862
s(speaker) 28.593 38.000 3.634 < 2e-16
Analysis
This is the final model we found:
gam.WpmWithLdl.3 <- gam(sDurLog ~
s(WWL_Comp.1) + # PCA component
s(WWL_Comp.3) + # PCA component
s(cor_affix)+ # LDL measure
SylPerMin+
pauseBin+
folType+
s(speaker, bs="re"),
data=data)
summary(gam.WpmWithLdl.3)
Family: gaussian
Link function: identity
Formula:
sDurLog ~ s(WWL_Comp.1) + s(WWL_Comp.3) + s(cor_affix) + SylPerMin +
pauseBin + folType + s(speaker, bs = "re")
Parametric coefficients:
Estimate Std. Error t value Pr(>|t|)
(Intercept) -0.933545 0.063483 -14.705 < 2e-16 ***
SylPerMin -0.515971 0.023228 -22.213 < 2e-16 ***
pauseBinpause 0.113619 0.023490 4.837 2.03e-06 ***
folTypeF 0.093015 0.084380 1.102 0.2711
folTypeN 0.001699 0.030740 0.055 0.9560
folTypeP 0.039247 0.025646 1.530 0.1269
folTypeV -0.058390 0.024995 -2.336 0.0201 *
---
Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
Approximate significance of smooth terms:
edf Ref.df F p-value
s(WWL_Comp.1) 3.193 3.873 3.931 0.00747 **
s(WWL_Comp.3) 8.146 8.648 2.506 0.00997 **
s(cor_affix) 2.534 2.993 3.887 0.00862 **
s(speaker) 28.593 38.000 3.634 < 2e-16 ***
---
Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
R-sq.(adj) = 0.8 Deviance explained = 82.6%
GCV = 0.035443 Scale est. = 0.030817 n = 379
In total, 1 LDL measure and 2 PCA components are included as significant predictors of /s/ duration in pseudowords.
WWL_Comp.1
For WWL_Comp.1, lower values indicate a higher activation diversity and lower certainty, while higher values indicate a higher semantic certainty and lower activation diversity. That is, a higher activation diversity and lower certainty lead to shorter /s/ durations, while a higher semantic certainty and lower activation diversity lead to longer /s/ durations.
Summary:
* SylPerMin : numeric predictor; set to the value(s): 2.335248237.
* pauseBin : factor; set to the value(s): no_pause.
* folType : factor; set to the value(s): APP.
* WWL_Comp.1 : numeric predictor; with 30 values ranging from -5.142392 to 2.903543.
* WWL_Comp.3 : numeric predictor; set to the value(s): 0.171377825002154.
* cor_affix : numeric predictor; set to the value(s): -0.001114876.
* speaker : factor; set to the value(s): 2. (Might be canceled as random effect, check below.)
* NOTE : The following random effects columns are canceled: s(speaker)
WWL_Comp.3
For WWL_Comp.3, higher values indicate a higher certainty in production. A higher certainty appears to lead to longer /s/ durations.
Summary:
* SylPerMin : numeric predictor; set to the value(s): 2.335248237.
* pauseBin : factor; set to the value(s): no_pause.
* folType : factor; set to the value(s): APP.
* WWL_Comp.1 : numeric predictor; set to the value(s): 0.0671835516485308.
* WWL_Comp.3 : numeric predictor; with 30 values ranging from -3.548324 to 2.239769.
* cor_affix : numeric predictor; set to the value(s): -0.001114876.
* speaker : factor; set to the value(s): 2. (Might be canceled as random effect, check below.)
* NOTE : The following random effects columns are canceled: s(speaker)
cor_affix
cor_affix describes the correlation of an estimated semantic vector \(\hat{s}\) and the semantic vector \(s\) of its affix, i.e. it may be interpreted as a measure of transparency. Higher values indicate a higher correlation of pseudoword and affix, leading to longer /s/ durations. Note: the effect of cor_affix applies only to plural pseudowords, as no cor_affix is calculated for singular pseudowords (they do not contain an affix).
Summary:
* SylPerMin : numeric predictor; set to the value(s): 2.335248237.
* pauseBin : factor; set to the value(s): no_pause.
* folType : factor; set to the value(s): APP.
* WWL_Comp.1 : numeric predictor; set to the value(s): 0.0671835516485308.
* WWL_Comp.3 : numeric predictor; set to the value(s): 0.171377825002154.
* cor_affix : numeric predictor; with 30 values ranging from -0.049959 to 0.098958.
* speaker : factor; set to the value(s): 2. (Might be canceled as random effect, check below.)
* NOTE : The following random effects columns are canceled: s(speaker)
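The measure behind this effect can be sketched with toy vectors (both vectors are invented; the real semantic vectors are rows of \(S\) and \(\hat{S}\)):

```r
# Toy sketch of cor_affix: correlation between an estimated semantic vector
# s_hat and the affix's semantic vector (both vectors are invented).
s_hat <- c(0.2, -0.1, 0.4, 0.3)
affix <- c(0.1,  0.0, 0.5, 0.2)
cor(s_hat, affix)  # ~0.86: a comparatively "transparent" pseudoword
```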
3 LDLConvFunctions measures
LDLConvFunctions includes the following measures as implemented in Chuang et al. (2020):
- ALDC - Average Levenshtein Distance of Candidates: Higher ALDC values indicate that the candidate forms are very different from the intended pronunciation, indicating a sparse form neighborhood.
- SCPP - Semantic Correlation of Predicted Production: Higher SCPP values indicate that the semantics of the estimated form better approximate the estimated meaning (see WpmWithLdl::cor_max).
- EDNN - Euclidean Distance from Nearest Neighbour: Higher EDNN values indicate that a word is more distant from the nearest word neighbour.
- NNC - Nearest Neighbour Correlation: Higher NNC values indicate that a pseudoword’s meaning is similar to that of a real word.
- ALC - Average Lexical Correlation: Higher ALC values indicate that a pseudoword vector is part of a denser semantic neighbourhood.
- DRC - Dual Route Consistency: Higher DRC values indicate that the semantic vectors produced by the two routes are more similar to each other.
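As an illustration of the idea behind ALDC, mean Levenshtein distances can be computed with base R’s adist(); the pseudoword and its candidate forms below are invented:

```r
# Hypothetical example: ALDC as the mean Levenshtein distance between a
# targeted pseudoword form and its candidate forms (all forms are invented).
target     <- "bloufs"
candidates <- c("bloufs", "blaufs", "blouf")
mean(adist(target, candidates))  # (0 + 1 + 1) / 3 ≈ 0.67
```

A low value thus signals a dense form neighbourhood of near-identical candidates; a high value signals a sparse one.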
Without discussing the underlying implications of the individual measures, let’s see whether some of them are correlated. If so, we need to take precautions before including them in any models.
3.1 Correlations
Correlations of sDurLog, Affix, and all LDLConvFunctions measures:
We find the following highly correlated (\(\rho\ge0.5\)) variable combinations:
- NNC vs. ALC, \(\rho=0.76\)
Thus, we need to take precautions before we run into problems of collinearity later on.
3.2 PCA
To avoid collinearity issues when modelling, we combine the two highly correlated variables in a principal component analysis.
PCA Computation
First, we extract the highly correlated variables and merge them to form a new data frame.
pca.data <- data[, c("NNC", "ALC")]
Then, we compute a principal component analysis using the princomp function:
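As before, the computation itself is a one-liner over the pca.data frame just created; cor = TRUE is again our assumption about scaling:

```r
# Sketch: PCA on the NNC/ALC pair via the correlation matrix
# (cor = TRUE is an assumption; the default uses the covariance matrix).
pc <- princomp(pca.data, cor = TRUE)
summary(pc)
```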
In order to decide which components to keep, we follow three criteria:
- The Eigenvalue-criterion: Any component that displays an eigenvalue greater than 1.00 is accounting for a greater amount of variance than had been contributed by one variable. Such a component is therefore accounting for a meaningful amount of variance, and is worthy of being retained.
- The scree-criterion: Plot the eigenvalues associated with each component and look for a “break” between the components with relatively large eigenvalues and those with small eigenvalues. The components that appear before the break are assumed to be meaningful and are retained (Cattell, 1966).
- The proportion-of-variance-criterion: Retain enough components so that the cumulative percent of variance accounted for reaches some minimal value, e.g. 85 percent.
Thus, let’s first check the PCA components’ Eigenvalues (criterion 1):
# load package to extract Eigenvalues
library("factoextra")
# extract Eigenvalues
get_eigenvalue(pc)
eigenvalue variance.percent cumulative.variance.percent
Dim.1 1.7646569 88.23285 88.23285
Dim.2 0.2353431 11.76715 100.00000
Component 1 (called Dim.1 here) shows an Eigenvalue of \(v\ge1\), i.e. keep component 1.
Next, take another look at the Eigenvalues and check for a “break” between the components (criterion 2) with relatively large eigenvalues and those with small eigenvalues:
eigenvalue variance.percent cumulative.variance.percent
Dim.1 1.7646569 88.23285 88.23285
Dim.2 0.2353431 11.76715 100.00000
There is a huge break in Eigenvalues between component 1 with \(v=1.76\) and component 2 with \(v=0.24\). Thus, we might keep component 1.
Lastly, check the cumulative percent of variance accounted for by all components (criterion 3). Our target level of explained variance is \(\sigma^2=85\%\):
eigenvalue variance.percent cumulative.variance.percent
Dim.1 1.7646569 88.23285 88.23285
Dim.2 0.2353431 11.76715 100.00000
Component 1 shows a cumulative percent of variance accounted for of \(\sigma^2=88.2\%\). Thus, we retain component 1.
Thus, we add component 1 to our data frame.
pcframe <- as.data.frame(pc$scores)
colnames(pcframe) <- c("LCF_Comp.1", "LCF_Comp.2")
data <- cbind(data, pcframe[1])
Biplot
Taking a look at the PCA’s biplot, we find that
the two variables point into the same direction: ALC
and NNC
. ALC
runs almost orthogonal to NNC
.
PCA Loadings
Now we can take a closer look at the individual components and their loadings.
Loadings:
Comp.1 Comp.2
NNC 0.707 0.707
ALC 0.707 -0.707
Comp.1 Comp.2
SS loadings 1.0 1.0
Proportion Var 0.5 0.5
Cumulative Var 0.5 1.0
But what do these loadings conceptually mean?
Component 1 is highly correlated with ALC and NNC. NNC indicates how similar a pseudoword’s meaning is to that of a real word, while ALC describes the density of a pseudoword’s semantic neighbourhood. Taking a closer look at NNC and ALC, we find they are highly correlated (\(\rho=0.76\)), with higher values indicating close-to-real-word semantics and a dense semantic neighbourhood.
We may conclude that Component 1 reflects close-to-real-word semantics and a dense semantic neighbourhood for pseudowords.
3.3 GAMMs
Let’s see whether the remaining LDL measures and PCA components are significant predictors for word-final /s/ duration in singular and plural pseudowords. We will use Generalized Additive Mixed Models (GAMMs) to model /s/ durations.
Modelling
First, create a model with all relevant variables, including LCF and PCA variables as smooths and speaker as random effect:
gam.LDLConvFun.1 <- gam(sDurLog ~
Affix+
s(LCF_Comp.1)+
s(ALDC, k=8)+
s(SCPP)+
s(EDNN)+
s(DRC, k=9)+
SylPerMin+
pauseBin+
folType+
s(speaker, bs="re"),
data=data)
summary(gam.LDLConvFun.1)
Family: gaussian
Link function: identity
Formula:
sDurLog ~ Affix + s(LCF_Comp.1) + s(ALDC, k = 8) + s(SCPP) +
s(EDNN) + s(DRC, k = 9) + SylPerMin + pauseBin + folType +
s(speaker, bs = "re")
Parametric coefficients:
Estimate Std. Error t value Pr(>|t|)
(Intercept) -0.844352 0.047238 -17.874 < 2e-16 ***
AffixPL -0.082864 0.015870 -5.221 2.42e-07 ***
SylPerMin -0.516381 0.017247 -29.941 < 2e-16 ***
pauseBinpause 0.113543 0.017572 6.462 2.09e-10 ***
folTypeF 0.003673 0.053609 0.069 0.94539
folTypeN 0.008256 0.020736 0.398 0.69066
folTypeP 0.015385 0.018392 0.836 0.40320
folTypeV -0.059368 0.018572 -3.197 0.00146 **
---
Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
Approximate significance of smooth terms:
edf Ref.df F p-value
s(LCF_Comp.1) 1.741 2.148 1.041 0.34223
s(ALDC) 1.986 2.486 0.549 0.55331
s(SCPP) 5.779 6.260 2.217 0.03833 *
s(EDNN) 7.000 7.000 3.228 0.00226 **
s(DRC) 2.884 3.217 2.117 0.10002
s(speaker) 32.049 39.000 5.494 < 2e-16 ***
---
Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
Rank: 88/90
R-sq.(adj) = 0.809 Deviance explained = 82.5%
GCV = 0.032417 Scale est. = 0.029596 n = 683
Using the anova() function, we check whether there are any non-significant variables to exclude. The result of this procedure is the following model:
gam.LDLConvFun.2 <- gam(sDurLog ~
Affix+
#s(LCF_Comp.1)+
#s(ALDC, k=8)+
s(SCPP)+
s(EDNN)+
#s(DRC, k=9)+
SylPerMin+
pauseBin+
folType+
s(speaker, bs="re"),
data=data)
anova(gam.LDLConvFun.2)
Family: gaussian
Link function: identity
Formula:
sDurLog ~ Affix + s(SCPP) + s(EDNN) + SylPerMin + pauseBin +
folType + s(speaker, bs = "re")
Parametric Terms:
df F p-value
Affix 1 33.677 1.03e-08
SylPerMin 1 910.741 < 2e-16
pauseBin 1 42.721 1.30e-10
folType 4 4.856 0.000731
Approximate significance of smooth terms:
edf Ref.df F p-value
s(SCPP) 6.136 6.679 2.699 0.026491
s(EDNN) 6.464 6.704 3.795 0.000549
s(speaker) 32.365 39.000 5.520 < 2e-16
Analysis
This is the final model we found:
gam.LDLConvFun.2 <- gam(sDurLog ~
Affix+
s(SCPP)+
s(EDNN)+
SylPerMin+
pauseBin+
folType+
s(speaker, bs="re"),
data=data)
summary(gam.LDLConvFun.2)
Family: gaussian
Link function: identity
Formula:
sDurLog ~ Affix + s(SCPP) + s(EDNN) + SylPerMin + pauseBin +
folType + s(speaker, bs = "re")
Parametric coefficients:
Estimate Std. Error t value Pr(>|t|)
(Intercept) -0.842491 0.047005 -17.923 < 2e-16 ***
AffixPL -0.084095 0.014491 -5.803 1.03e-08 ***
SylPerMin -0.516839 0.017126 -30.178 < 2e-16 ***
pauseBinpause 0.114480 0.017515 6.536 1.30e-10 ***
folTypeF 0.003104 0.053407 0.058 0.95367
folTypeN 0.006503 0.020760 0.313 0.75422
folTypeP 0.015336 0.018420 0.833 0.40542
folTypeV -0.059794 0.018615 -3.212 0.00138 **
---
Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
Approximate significance of smooth terms:
edf Ref.df F p-value
s(SCPP) 6.136 6.679 2.699 0.026491 *
s(EDNN) 6.464 6.704 3.795 0.000549 ***
s(speaker) 32.365 39.000 5.520 < 2e-16 ***
---
Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
Rank: 64/66
R-sq.(adj) = 0.807 Deviance explained = 82.1%
GCV = 0.032441 Scale est. = 0.029925 n = 683
In total, 2 LDLConvFunctions measures are included as significant predictors of /s/ duration in pseudowords.
SCPP
The Semantic Correlation of Predicted Production describes the maximum correlation between a word’s predicted semantic vector and any of the semantic vectors of its candidate forms. Higher SCPP values indicate that the semantics of the estimated form better approximate the estimated meaning.
But what about its effect on /s/ durations?
Summary:
* Affix : factor; set to the value(s): PL.
* SylPerMin : numeric predictor; set to the value(s): 2.335248237.
* pauseBin : factor; set to the value(s): no_pause.
* folType : factor; set to the value(s): APP.
* SCPP : numeric predictor; with 30 values ranging from 0.607063 to 1.000000.
* EDNN : numeric predictor; set to the value(s): 1.71e-15.
* speaker : factor; set to the value(s): 2. (Might be canceled as random effect, check below.)
* NOTE : The following random effects columns are canceled: s(speaker)
EDNN
The Euclidean Distance from Nearest Neighbour describes the Euclidean distance from the semantic vector \(\hat{s}\) produced by the direct route to its closest semantic word neighbour. Higher EDNN values indicate that a word is more distant from the nearest word neighbour.
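A toy sketch of the distance involved (both vectors are invented; the real semantic vectors live in the rows of \(\hat{S}\) and \(S\)):

```r
# Toy sketch of EDNN: Euclidean distance between an estimated semantic
# vector s_hat and its nearest neighbour's semantic vector (both invented).
s_hat     <- c(0.20, -0.10, 0.40)
neighbour <- c(0.10,  0.00, 0.50)
sqrt(sum((s_hat - neighbour)^2))  # ≈ 0.173
```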
But what about its effect on /s/ durations?
Summary:
* Affix : factor; set to the value(s): PL.
* SylPerMin : numeric predictor; set to the value(s): 2.335248237.
* pauseBin : factor; set to the value(s): no_pause.
* folType : factor; set to the value(s): APP.
* SCPP : numeric predictor; set to the value(s): 0.937532423.
* EDNN : numeric predictor; with 30 values ranging from 0.000000 to 0.032502.
* speaker : factor; set to the value(s): 2. (Might be canceled as random effect, check below.)
* NOTE : The following random effects columns are canceled: s(speaker)
References
Baayen, R. H., Chuang, Y. Y., & Blevins, J. P. (2018). Inflectional morphology with linear mappings. The Mental Lexicon, 13(2), 232-270. doi.org/10.1075/ml.18010.baa
Baayen, R. H., Chuang, Y. Y., Shafaei-Bajestan, E., & Blevins, J. P. (2019). The discriminative lexicon: A unified computational model for the lexicon and lexical processing in comprehension and production grounded not in (de)composition but in linear discriminative learning. Complexity, 2019, 1-39. doi.org/10.1155/2019/4895891
Cattell, R. B. (1966). The scree test for the number of factors. Multivariate Behavioral Research, 1, 245-276.
Chuang, Y-Y., Vollmer, M-l., Shafaei-Bajestan, E., Gahl, S., Hendrix, P., & Baayen, R. H. (2020). The processing of pseudoword form and meaning in production and comprehension: A computational modeling approach using Linear Discriminative Learning. Behavior Research Methods, 1-51. doi.org/10.3758/s13428-020-01356-w
Nykamp, D. (2020, November 10). Multiplying matrices and vectors. Math Insight. mathinsight.org/matrix_vector_multiplication
Schmitz, D., Baer-Henney, D., & Plag, I. (submitted). The duration of word-final /s/ differs across morphological categories in English: Evidence from pseudowords.
Tucker, B. V., Brenner, D., Danielson, D. K., Kelley, M. C., Nenadic, F., & Sims, M. (2018). The massive auditory lexical decision (MALD) database. Behavior Research Methods, 1-18. doi.org/10.3758/s13428-018-1056-1