This script documents the analysis of ENGS production data and MALD corpus data with measures of Linear Discriminative Learning (LDL; e.g. Baayen et al., 2018). This is a work in progress.

The idea of this LDL implementation is the following:
Networks trained on real word data and their transformation matrices are used to create semantic matrices for pseudowords, following the implementation by Chuang et al. (2020). The underlying assumption is that speakers who encounter pseudowords approach them with their knowledge of previously encountered words, i.e. their lexicon. Thus, pseudowords are not semantically empty shells but carry some sort of meaning.
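The linear mappings at the heart of this approach can be sketched in a few lines of R. The matrices below are tiny toy examples (not the actual ENGS/MALD data): a transformation matrix is estimated from the cue and semantic matrices of "known" words and then applied to a pseudoword's cue vector.

```r
# Toy sketch of an LDL comprehension mapping: estimate the
# transformation matrix from known words' cue matrix C and semantic
# matrix S, then apply it to a pseudoword's cue vector.
# All values are made up for illustration.
library(MASS)  # ginv(): Moore-Penrose pseudoinverse

C <- matrix(c(1, 0, 1,
              0, 1, 1,
              1, 1, 0), nrow = 3, byrow = TRUE)  # form (cue) vectors
S <- matrix(rnorm(12), nrow = 3)                 # semantic vectors

Fmat <- ginv(C) %*% S    # transformation matrix F, such that S ≈ C F

c_new <- c(1, 0, 0)      # cue vector of a pseudoword
s_hat <- c_new %*% Fmat  # its estimated semantic vector
```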

The following script includes multiple Principal Component Analyses (PCAs) and the creation of a number of Generalized Additive Mixed Models (GAMMs). Please check this script for the LDL modelling itself as well as the extraction of measures.

1 Data

The data set used in this script contains the following parts:

  • ENGS production data (Schmitz et al., 2020)
  • LDL measures as extracted by the WpmWithLdl package
  • LDL measures as extracted by the LDLConvFunctions package

For details on data set creation, please check this link.

The data set contains 683 rows and 55 columns, i.e. 683 observations of 55 variables:

2 WpmWithLdl measures

WpmWithLdl includes the following measures via the comprehension_measures and production_measures functions:

Comprehension:

  • recognition:
    whether the correct form was recognized
  • cor_max:
    the correlation between the pertinent vector in Shat and the semantic vector of its predicted form in S
  • cor_target:
    the correlation between the pertinent vector in Shat and the semantic vector of its targeted form in S
  • l1norm:
    city-block distance / Manhattan distance of Shat; a greater l1norm implies stronger links to many other lexomes
  • l2norm:
    Euclidean (direct) distance / length of Shat; a greater l2norm implies stronger links to many other lexomes
  • rank:
    the rank of correlation between s-hat and the semantic vector of its targeted form
  • density:
    the mean correlation of s-hat with the semantic vectors of its top 8 neighbors
  • cor_affix:
    the correlation of s-hat and the semantic vector of its affix function; more or less a measure of transparency

Production:

  • correlations:
    the correlation of the predicted path with the targeted semantic vector
  • path_counts:
    the number of paths detected
  • path_sum:
    the summed support for the predicted path, could be read as the uncertainty about how to pronounce the word due to competing activation of articulatory motor plans
  • path_entropies:
    Shannon entropy calculated over the path supports
  • lwlr:
    Simon: “The length-weakest link ratios of the predicted path (also sometimes equal to word length). When the length of a word’s path, as measured in number of nodes, becomes higher, while the weakest link between nodes stays the same, uncertainty becomes greater. If triphone support is high, i.e. when all the strengths of the links are almost equal to one, the value will be almost equal to the number of segments in the word. This may not be the case in other models, where the triphone support may be lower.”
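Several of the comprehension measures can be sketched in a few lines of base R. This is an illustration only, not the package's actual implementation (cf. WpmWithLdl::comprehension_measures); it assumes that `Shat` (estimated semantic vectors) and `S` (targeted semantic vectors) are row-aligned matrices, one word per row.

```r
# Illustrative only; assumes Shat and S are row-aligned matrices of
# estimated and targeted semantic vectors, one word per row.
i    <- 1                    # example word
shat <- Shat[i, ]

l1norm <- sum(abs(shat))     # city-block (Manhattan) length of s-hat
l2norm <- sqrt(sum(shat^2))  # Euclidean length of s-hat

cors       <- apply(S, 1, function(s) cor(shat, s))
cor_target <- cors[i]                  # correlation with the targeted form
cor_max    <- max(cors)                # correlation with the best-matching form
rank       <- sum(cors >= cor_target)  # rank of the targeted form
density    <- mean(sort(cors, decreasing = TRUE)[2:9])  # top 8 neighbours
```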

Without discussing the underlying implications of the individual measures, let’s see whether some of them are correlated. If so, we need to take some measures before including them in any models.

2.1 Correlations

Correlations of sDurLog, Affix, and all WpmWithLdl measures:

# pairscor.fnc() is provided by the languageR package
library(languageR)

pairscor.fnc(data[,c("sDurLog", "Affix",
                     "cor_max", "l1norm", "l2norm", "rank", "density", "cor_affix",
                     "correlations", "path_counts", "path_sum", "path_entropies", "lwlr")])

We find the following highly correlated (\(\rho\ge0.5\)) variable combinations:

  • cor_max vs. l2norm, \(\rho=-0.5\)
  • cor_max vs. rank, \(\rho=-0.52\)
  • l1norm vs. l2norm, \(\rho=0.86\)
  • path_counts vs. path_entropies, \(\rho=0.92\)
  • path_sum vs. lwlr, \(\rho=-0.9\)

Thus, we need to take some measures before we run into problems of collinearity later on.

2.2 PCA

To avoid collinearity issues when modelling, we combine all highly correlated variables in a principal component analysis.

PCA Computation

First, we extract the highly correlated variables and merge them to form a new data frame.

# select all highly correlated variables in one step
pca.data <- data[, c("cor_max", "l2norm", "rank", "l1norm",
                     "path_counts", "path_entropies", "path_sum", "lwlr")]

Then, we compute a principal component analysis using the princomp function:

pc <- princomp(pca.data, cor=TRUE, score=TRUE)

In order to decide which components to keep, we follow three criteria:

  1. The Eigenvalue-criterion: Any component that displays an eigenvalue greater than 1.00 is accounting for a greater amount of variance than had been contributed by one variable. Such a component is therefore accounting for a meaningful amount of variance, and is worthy of being retained.
  2. The scree-criterion: Plot the eigenvalues associated with each component and look for a “break” between the components with relatively large eigenvalues and those with small eigenvalues. The components that appear before the break are assumed to be meaningful and are retained (Cattell, 1966).
  3. The proportion-of-variance-criterion: Retain enough components so that the cumulative percent of variance accounted for is equal to some minimal value, i.e. 85 percent.

Thus, let’s first check the PCA components’ Eigenvalues (criterion 1):

# load package to extract Eigenvalues
library("factoextra")

# extract Eigenvalues
get_eigenvalue(pc)
      eigenvalue variance.percent cumulative.variance.percent
Dim.1 2.67973298       33.4966622                    33.49666
Dim.2 2.55993527       31.9991909                    65.49585
Dim.3 1.52302408       19.0378010                    84.53365
Dim.4 0.60312766        7.5390957                    92.07275
Dim.5 0.36117187        4.5146484                    96.58740
Dim.6 0.13384219        1.6730274                    98.26043
Dim.7 0.09054334        1.1317918                    99.39222
Dim.8 0.04862261        0.6077826                   100.00000

Components 1, 2, and 3 (called Dim.x here) show an Eigenvalue of \(v\ge1\), i.e. keep components 1, 2, and 3.

Next, take another look at the Eigenvalues and check for a “break” between the components (criterion 2) with relatively large eigenvalues and those with small eigenvalues:
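Besides re-reading the eigenvalue table, the break can be judged visually from a scree plot; a sketch using factoextra (also used here for get_eigenvalue()) or base R:

```r
# Scree plot of the eigenvalues (criterion 2)
library(factoextra)
fviz_eig(pc, addlabels = TRUE)  # components vs. % of explained variance

# base R alternative
screeplot(pc, type = "lines", main = "Scree plot")
```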

# extract Eigenvalues
get_eigenvalue(pc)
      eigenvalue variance.percent cumulative.variance.percent
Dim.1 2.67973298       33.4966622                    33.49666
Dim.2 2.55993527       31.9991909                    65.49585
Dim.3 1.52302408       19.0378010                    84.53365
Dim.4 0.60312766        7.5390957                    92.07275
Dim.5 0.36117187        4.5146484                    96.58740
Dim.6 0.13384219        1.6730274                    98.26043
Dim.7 0.09054334        1.1317918                    99.39222
Dim.8 0.04862261        0.6077826                   100.00000

A break in Eigenvalues can be found between component 3 with \(v=1.52\) and component 4 with \(v=0.60\). Thus, keep components 1, 2, and 3.

Lastly, check the cumulative percent of variance accounted for by all components (criterion 3). Our target level of explained variance is \(\sigma^2=85\%\):

# extract Eigenvalues
get_eigenvalue(pc)
      eigenvalue variance.percent cumulative.variance.percent
Dim.1 2.67973298       33.4966622                    33.49666
Dim.2 2.55993527       31.9991909                    65.49585
Dim.3 1.52302408       19.0378010                    84.53365
Dim.4 0.60312766        7.5390957                    92.07275
Dim.5 0.36117187        4.5146484                    96.58740
Dim.6 0.13384219        1.6730274                    98.26043
Dim.7 0.09054334        1.1317918                    99.39222
Dim.8 0.04862261        0.6077826                   100.00000

Again, components 1, 2, and 3 are retained; their cumulative percent of variance accounted for is \(\sigma^2=84.53\%\), i.e. almost \(\sigma^2\ge85\%\).

Thus, we add components 1, 2, and 3 to our data frame.

pcframe <- as.data.frame(pc$scores)
pcframe <- pcframe[1:3]  # keep components 1, 2, and 3

library(dplyr)
pcframe <- pcframe %>% rename(
  WWL_Comp.1 = Comp.1,
  WWL_Comp.2 = Comp.2,
  WWL_Comp.3 = Comp.3
)

data <- cbind(data, pcframe)

PCA Loadings

Now we can take a closer look at the individual components and their loadings.

pc$loadings

Loadings:
               Comp.1 Comp.2 Comp.3 Comp.4 Comp.5 Comp.6 Comp.7 Comp.8
cor_max         0.475         0.174  0.597  0.606  0.114              
l2norm         -0.476 -0.307         0.365 -0.135  0.660  0.226 -0.202
rank           -0.346 -0.166 -0.494 -0.303  0.709                0.119
l1norm         -0.437 -0.325  0.114  0.485        -0.616 -0.148  0.229
path_counts           -0.534  0.334 -0.243  0.147        -0.377 -0.606
path_entropies  0.129 -0.485  0.412 -0.325                0.406  0.547
path_sum       -0.342  0.366  0.422 -0.121  0.236 -0.248  0.555 -0.361
lwlr            0.307 -0.343 -0.503        -0.183 -0.312  0.551 -0.309

               Comp.1 Comp.2 Comp.3 Comp.4 Comp.5 Comp.6 Comp.7 Comp.8
SS loadings     1.000  1.000  1.000  1.000  1.000  1.000  1.000  1.000
Proportion Var  0.125  0.125  0.125  0.125  0.125  0.125  0.125  0.125
Cumulative Var  0.125  0.250  0.375  0.500  0.625  0.750  0.875  1.000

But what do these loadings conceptually mean?

Component 1

Component 1 is rather highly correlated with cor_max, l1norm, and l2norm. For l1norm and l2norm, higher values indicate stronger links to many other lexomes, i.e. a higher activation diversity. cor_max describes the correlation between the pertinent vector in \(\hat{S}\) and the semantic vector of its predicted form in \(S\), i.e. higher values indicate that the semantics of the estimated form better approximate the estimated meaning.

Component 1 indicates a word’s activation diversity (i.e. l1norm and l2norm) and semantic certainty (i.e. cor_max), with higher values representing a lower activation diversity and higher certainty.

Component 2

Component 2 is rather highly correlated with path_counts and path_entropies. Both measures describe the certainty of a predicted production. That is, path_counts reflects the number of paths detected for a pertinent production, while path_entropies measures the Shannon entropy calculated over the path supports.

Component 2 reflects the support of a predicted production, i.e. its certainty.

Component 3

Component 3 is rather highly correlated with lwlr and rank. For lwlr, higher values indicate a lower support of a predicted production, i.e. the length-weakest link ratio of a predicted path is higher than the number of its triphones. For rank, higher values indicate a mismatch between estimated and predicted semantic vectors. Thus, both measures indicate a phonological uncertainty.
Additionally, Component 3 is correlated with path_counts, path_sum, and path_entropies; however, these correlations point in the opposite direction. That is, these measures indicate a phonological certainty rather than uncertainty.

Component 3 reflects the support of a predicted production, i.e. its (un-)certainty.

2.3 GAMMs

Let’s see whether the remaining LDL measures and PCA components are significant predictors for word-final /s/ duration in singular and plural pseudowords. We will use Generalized Additive Mixed Models (GAMMs) to model /s/ durations.

Modelling

First, create a model with all relevant variables, including LDL and PCA variables as smooths and speaker as random effect:

# gam() and the bs="re" random-effect smooth come from the mgcv package
library(mgcv)

gam.WpmWithLdl.1 <- gam(sDurLog ~
                          Affix +
                          s(WWL_Comp.1) +
                          s(WWL_Comp.2) +
                          s(WWL_Comp.3) +
                          s(cor_target)+
                          s(correlations)+
                          s(density)+
                          s(cor_affix)+
                          SylPerMin+
                          pauseBin+
                          folType+
                          s(speaker, bs="re"),
                        data=data)

summary(gam.WpmWithLdl.1)

Family: gaussian 
Link function: identity 

Formula:
sDurLog ~ Affix + s(WWL_Comp.1) + s(WWL_Comp.2) + s(WWL_Comp.3) + 
    s(cor_target) + s(correlations) + s(density) + s(cor_affix) + 
    SylPerMin + pauseBin + folType + s(speaker, bs = "re")

Parametric coefficients:
               Estimate Std. Error t value Pr(>|t|)    
(Intercept)   -0.841762   0.113426  -7.421 9.79e-13 ***
AffixPL       -0.102729   0.098167  -1.046   0.2961    
SylPerMin     -0.513179   0.023101 -22.214  < 2e-16 ***
pauseBinpause  0.116584   0.023538   4.953 1.17e-06 ***
folTypeF       0.060360   0.086724   0.696   0.4869    
folTypeN       0.006494   0.030648   0.212   0.8323    
folTypeP       0.042363   0.025878   1.637   0.1026    
folTypeV      -0.057121   0.025054  -2.280   0.0232 *  
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

Approximate significance of smooth terms:
                   edf Ref.df     F p-value    
s(WWL_Comp.1)    2.252  2.789 3.932  0.0312 *  
s(WWL_Comp.2)    1.000  1.000 0.446  0.5046    
s(WWL_Comp.3)    1.000  1.000 4.736  0.0302 *  
s(cor_target)    1.000  1.000 0.073  0.7875    
s(correlations)  1.272  1.466 0.258  0.5581    
s(density)       1.069  1.125 0.228  0.6181    
s(cor_affix)     2.978  3.616 2.591  0.0485 *  
s(speaker)      27.954 38.000 3.539  <2e-16 ***
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

R-sq.(adj) =  0.797   Deviance explained = 82.1%
GCV = 0.035681  Scale est. = 0.0313    n = 379

Checking the anova() function’s output, we now exclude non-significant variables. The result of this procedure is the following model:

gam.WpmWithLdl.2 <- gam(sDurLog ~
                          #Affix +
                          s(WWL_Comp.1) +
                          #s(WWL_Comp.2) +
                          s(WWL_Comp.3) +
                          #s(cor_target)+
                          #s(correlations)+
                          #s(density)+
                          s(cor_affix)+
                          SylPerMin+
                          pauseBin+
                          folType+
                          s(speaker, bs="re"),
                        data=data)
anova(gam.WpmWithLdl.2)

Family: gaussian 
Link function: identity 

Formula:
sDurLog ~ s(WWL_Comp.1) + s(WWL_Comp.3) + s(cor_affix) + SylPerMin + 
    pauseBin + folType + s(speaker, bs = "re")

Parametric Terms:
          df       F  p-value
SylPerMin  1 493.430  < 2e-16
pauseBin   1  23.395 2.03e-06
folType    4   4.154  0.00269

Approximate significance of smooth terms:
                 edf Ref.df     F p-value
s(WWL_Comp.1)  3.193  3.873 3.931 0.00747
s(WWL_Comp.3)  8.146  8.648 2.506 0.00997
s(cor_affix)   2.534  2.993 3.887 0.00862
s(speaker)    28.593 38.000 3.634 < 2e-16

Analysis

This is the final model we found:

gam.WpmWithLdl.3 <- gam(sDurLog ~
                          s(WWL_Comp.1) + # PCA component
                          s(WWL_Comp.3) + # PCA component
                          s(cor_affix)+   # LDL measure
                          SylPerMin+      
                          pauseBin+
                          folType+
                          s(speaker, bs="re"),
                        data=data)

summary(gam.WpmWithLdl.3)

Family: gaussian 
Link function: identity 

Formula:
sDurLog ~ s(WWL_Comp.1) + s(WWL_Comp.3) + s(cor_affix) + SylPerMin + 
    pauseBin + folType + s(speaker, bs = "re")

Parametric coefficients:
               Estimate Std. Error t value Pr(>|t|)    
(Intercept)   -0.933545   0.063483 -14.705  < 2e-16 ***
SylPerMin     -0.515971   0.023228 -22.213  < 2e-16 ***
pauseBinpause  0.113619   0.023490   4.837 2.03e-06 ***
folTypeF       0.093015   0.084380   1.102   0.2711    
folTypeN       0.001699   0.030740   0.055   0.9560    
folTypeP       0.039247   0.025646   1.530   0.1269    
folTypeV      -0.058390   0.024995  -2.336   0.0201 *  
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

Approximate significance of smooth terms:
                 edf Ref.df     F p-value    
s(WWL_Comp.1)  3.193  3.873 3.931 0.00747 ** 
s(WWL_Comp.3)  8.146  8.648 2.506 0.00997 ** 
s(cor_affix)   2.534  2.993 3.887 0.00862 ** 
s(speaker)    28.593 38.000 3.634 < 2e-16 ***
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

R-sq.(adj) =    0.8   Deviance explained = 82.6%
GCV = 0.035443  Scale est. = 0.030817  n = 379

In total, 1 LDL measure and 2 PCA components are included as significant predictors of /s/ duration in pseudowords.

WWL_Comp.1

For WWL_Comp.1, lower values indicate a higher activation diversity and lower certainty, while higher values indicate a higher semantic certainty and lower activation diversity. That is, higher activation diversity and lower certainty lead to shorter /s/ durations, while higher semantic certainty and lower activation diversity lead to longer /s/ durations.
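The partial-effect summaries below look like the console output of itsadug's plot_smooth(); assuming that package, the corresponding call would be along these lines:

```r
# Partial effect of WWL_Comp.1 on sDurLog; other predictors are held
# constant and the by-speaker random effect is removed (rm.ranef).
library(itsadug)
plot_smooth(gam.WpmWithLdl.3, view = "WWL_Comp.1", rm.ranef = TRUE)
```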

Summary:
    * SylPerMin : numeric predictor; set to the value(s): 2.335248237. 
    * pauseBin : factor; set to the value(s): no_pause. 
    * folType : factor; set to the value(s): APP. 
    * WWL_Comp.1 : numeric predictor; with 30 values ranging from -5.142392 to 2.903543. 
    * WWL_Comp.3 : numeric predictor; set to the value(s): 0.171377825002154. 
    * cor_affix : numeric predictor; set to the value(s): -0.001114876. 
    * speaker : factor; set to the value(s): 2. (Might be canceled as random effect, check below.) 
    * NOTE : The following random effects columns are canceled: s(speaker)
 

WWL_Comp.3

For WWL_Comp.3, higher values indicate a higher certainty in production. A higher certainty appears to lead to longer /s/ durations.

Summary:
    * SylPerMin : numeric predictor; set to the value(s): 2.335248237. 
    * pauseBin : factor; set to the value(s): no_pause. 
    * folType : factor; set to the value(s): APP. 
    * WWL_Comp.1 : numeric predictor; set to the value(s): 0.0671835516485308. 
    * WWL_Comp.3 : numeric predictor; with 30 values ranging from -3.548324 to 2.239769. 
    * cor_affix : numeric predictor; set to the value(s): -0.001114876. 
    * speaker : factor; set to the value(s): 2. (Might be canceled as random effect, check below.) 
    * NOTE : The following random effects columns are canceled: s(speaker)
 

cor_affix

cor_affix describes the correlation of an estimated semantic vector \(\hat{s}\) and the semantic vector \(s\) of its affix, i.e. it may be interpreted as a measure of transparency. Thus, higher values indicate a higher correlation of pseudoword and affix, leading to longer /s/ durations. Note: the effect of cor_affix is only applicable to plural pseudowords, as there is no cor_affix calculated for singular pseudowords (as they do not contain any affixes).

Summary:
    * SylPerMin : numeric predictor; set to the value(s): 2.335248237. 
    * pauseBin : factor; set to the value(s): no_pause. 
    * folType : factor; set to the value(s): APP. 
    * WWL_Comp.1 : numeric predictor; set to the value(s): 0.0671835516485308. 
    * WWL_Comp.3 : numeric predictor; set to the value(s): 0.171377825002154. 
    * cor_affix : numeric predictor; with 30 values ranging from -0.049959 to 0.098958. 
    * speaker : factor; set to the value(s): 2. (Might be canceled as random effect, check below.) 
    * NOTE : The following random effects columns are canceled: s(speaker)
 

3 LDLConvFunctions measures

LDLConvFunctions includes the following measures as implemented in Chuang et al. (2020):

  • ALDC - Average Levenshtein Distance of Candidates:
    Higher ALDC values indicate that the candidate forms are very different from the intended pronunciation, indicating a sparse form neighbourhood.
  • SCPP - Semantic Correlation of Predicted Production:
    Higher SCPP values indicate that the semantics of the estimated form better approximate the estimated meaning (see WpmWithLdl::cor_max).
  • EDNN - Euclidean Distance from Nearest Neighbour:
    Higher EDNN values indicate that a word is more distant from its nearest word neighbour.
  • NNC - Nearest Neighbour Correlation:
    Higher NNC values indicate that a pseudoword’s meaning is similar to that of a real word.
  • ALC - Average Lexical Correlation:
    Higher ALC values indicate that a pseudoword vector is part of a denser semantic neighbourhood.
  • DRC - Dual Route Consistency:
    Higher DRC values indicate that the semantic vectors produced by the two routes are more similar to each other.

Without discussing the underlying implications of the individual measures, let’s see whether some of them are correlated. If so, we need to take some measures before including them in any models.

3.1 Correlations

Correlations of sDurLog, Affix, and all LDLConvFunctions measures:

pairscor.fnc(data[,c("sDurLog", "Affix",
                     "ALDC", "SCPP", "EDNN", "NNC", "ALC", "DRC")])

We find the following highly correlated (\(\rho\ge0.5\)) variable combinations:

  • NNC vs. ALC, \(\rho=0.76\)

Thus, we need to take some measures before we run into problems of collinearity later on.

3.2 PCA

To avoid collinearity issues when modelling, we combine the two highly correlated variables in a principal component analysis.

PCA Computation

First, we extract the highly correlated variables and merge them to form a new data frame.

# select the two highly correlated variables in one step
pca.data <- data[, c("NNC", "ALC")]

Then, we compute a principal component analysis using the princomp function:

pc <- princomp(pca.data, cor=TRUE, score=TRUE)

In order to decide which components to keep, we follow three criteria:

  1. The Eigenvalue-criterion: Any component that displays an eigenvalue greater than 1.00 is accounting for a greater amount of variance than had been contributed by one variable. Such a component is therefore accounting for a meaningful amount of variance, and is worthy of being retained.
  2. The scree-criterion: Plot the eigenvalues associated with each component and look for a “break” between the components with relatively large eigenvalues and those with small eigenvalues. The components that appear before the break are assumed to be meaningful and are retained (Cattell, 1966).
  3. The proportion-of-variance-criterion: Retain enough components so that the cumulative percent of variance accounted for is equal to some minimal value, i.e. 85 percent.

Thus, let’s first check the PCA components’ Eigenvalues (criterion 1):

# load package to extract Eigenvalues
library("factoextra")

# extract Eigenvalues
get_eigenvalue(pc)
      eigenvalue variance.percent cumulative.variance.percent
Dim.1  1.7646569         88.23285                    88.23285
Dim.2  0.2353431         11.76715                   100.00000

Component 1 (called Dim.1 here) shows an Eigenvalue of \(v\ge1\), i.e. keep component 1.

Next, take another look at the Eigenvalues and check for a “break” between the components (criterion 2) with relatively large eigenvalues and those with small eigenvalues:

# extract Eigenvalues
get_eigenvalue(pc)
      eigenvalue variance.percent cumulative.variance.percent
Dim.1  1.7646569         88.23285                    88.23285
Dim.2  0.2353431         11.76715                   100.00000

There is a huge break in Eigenvalues between component 1 with \(v=1.76\) and component 2 with \(v=0.24\). Thus, we might keep component 1.

Lastly, check the cumulative percent of variance accounted for by all components (criterion 3). Our target level of explained variance is \(\sigma^2=85\%\):

# extract Eigenvalues
get_eigenvalue(pc)
      eigenvalue variance.percent cumulative.variance.percent
Dim.1  1.7646569         88.23285                    88.23285
Dim.2  0.2353431         11.76715                   100.00000

Component 1 shows a “cumulative” percent of variance accounted for of \(\sigma^2=88.2\%\). Thus, we retain component 1.

Thus, we add component 1 to our data frame.

pcframe <- as.data.frame(pc$scores)
colnames(pcframe) <- c("LCF_Comp.1", "LCF_Comp.2")

data <- cbind(data, pcframe[1])

Biplot

Taking a look at the PCA’s biplot, we find that the two variables, ALC and NNC, point in a similar direction, reflecting their high positive correlation.
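The biplot itself can be reproduced with base R or, assuming factoextra is loaded, with fviz_pca_biplot():

```r
# Biplot of the two-variable PCA: scores plus the NNC and ALC loadings
biplot(pc)           # base R
fviz_pca_biplot(pc)  # factoextra version
```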

Next, we will take a closer look at the PCA component and its loadings.

PCA Loadings

Now we can take a closer look at the individual components and their loadings.

pc$loadings

Loadings:
    Comp.1 Comp.2
NNC  0.707  0.707
ALC  0.707 -0.707

               Comp.1 Comp.2
SS loadings       1.0    1.0
Proportion Var    0.5    0.5
Cumulative Var    0.5    1.0

But what do these loadings conceptually mean?

Component 1 is highly correlated with ALC and NNC. NNC indicates that a pseudoword’s meaning is similar to that of a real word, while ALC describes the density of a pseudoword’s semantic neighbourhood. Taking a closer look at NNC and ALC, we find they are highly correlated with \(\rho=0.76\), with higher values indicating semantics close to those of real words and a dense semantic neighbourhood.

cor(data$NNC, data$ALC)
[1] 0.7646569

We may conclude that Component 1 reflects close-to-real-word semantics and a dense semantic neighbourhood for pseudowords.

3.3 GAMMs

Let’s see whether the remaining LDL measures and PCA components are significant predictors for word-final /s/ duration in singular and plural pseudowords. We will use Generalized Additive Mixed Models (GAMMs) to model /s/ durations.

Modelling

First, create a model with all relevant variables, including LCF and PCA variables as smooths and speaker as random effect:

gam.LDLConvFun.1 <- gam(sDurLog ~
                          Affix+
                          s(LCF_Comp.1)+
                          s(ALDC, k=8)+
                          s(SCPP)+
                          s(EDNN)+
                          s(DRC, k=9)+
                          SylPerMin+
                          pauseBin+
                          folType+
                          s(speaker, bs="re"),
                        data=data)

summary(gam.LDLConvFun.1)

Family: gaussian 
Link function: identity 

Formula:
sDurLog ~ Affix + s(LCF_Comp.1) + s(ALDC, k = 8) + s(SCPP) + 
    s(EDNN) + s(DRC, k = 9) + SylPerMin + pauseBin + folType + 
    s(speaker, bs = "re")

Parametric coefficients:
               Estimate Std. Error t value Pr(>|t|)    
(Intercept)   -0.844352   0.047238 -17.874  < 2e-16 ***
AffixPL       -0.082864   0.015870  -5.221 2.42e-07 ***
SylPerMin     -0.516381   0.017247 -29.941  < 2e-16 ***
pauseBinpause  0.113543   0.017572   6.462 2.09e-10 ***
folTypeF       0.003673   0.053609   0.069  0.94539    
folTypeN       0.008256   0.020736   0.398  0.69066    
folTypeP       0.015385   0.018392   0.836  0.40320    
folTypeV      -0.059368   0.018572  -3.197  0.00146 ** 
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

Approximate significance of smooth terms:
                 edf Ref.df     F p-value    
s(LCF_Comp.1)  1.741  2.148 1.041 0.34223    
s(ALDC)        1.986  2.486 0.549 0.55331    
s(SCPP)        5.779  6.260 2.217 0.03833 *  
s(EDNN)        7.000  7.000 3.228 0.00226 ** 
s(DRC)         2.884  3.217 2.117 0.10002    
s(speaker)    32.049 39.000 5.494 < 2e-16 ***
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

Rank: 88/90
R-sq.(adj) =  0.809   Deviance explained = 82.5%
GCV = 0.032417  Scale est. = 0.029596  n = 683

Using the anova() function, we check whether there are any non-significant variables to exclude. The result of this procedure is the following model:

gam.LDLConvFun.2 <- gam(sDurLog ~
                          Affix+
                          #s(LCF_Comp.1)+
                          #s(ALDC, k=8)+
                          s(SCPP)+
                          s(EDNN)+
                          #s(DRC, k=9)+
                          SylPerMin+
                          pauseBin+
                          folType+
                          s(speaker, bs="re"),
                        data=data)

anova(gam.LDLConvFun.2)

Family: gaussian 
Link function: identity 

Formula:
sDurLog ~ Affix + s(SCPP) + s(EDNN) + SylPerMin + pauseBin + 
    folType + s(speaker, bs = "re")

Parametric Terms:
          df       F  p-value
Affix      1  33.677 1.03e-08
SylPerMin  1 910.741  < 2e-16
pauseBin   1  42.721 1.30e-10
folType    4   4.856 0.000731

Approximate significance of smooth terms:
              edf Ref.df     F  p-value
s(SCPP)     6.136  6.679 2.699 0.026491
s(EDNN)     6.464  6.704 3.795 0.000549
s(speaker) 32.365 39.000 5.520  < 2e-16

Analysis

This is the final model we found:

gam.LDLConvFun.2 <- gam(sDurLog ~
                          Affix+
                          s(SCPP)+
                          s(EDNN)+
                          SylPerMin+
                          pauseBin+
                          folType+
                          s(speaker, bs="re"),
                        data=data)

summary(gam.LDLConvFun.2)

Family: gaussian 
Link function: identity 

Formula:
sDurLog ~ Affix + s(SCPP) + s(EDNN) + SylPerMin + pauseBin + 
    folType + s(speaker, bs = "re")

Parametric coefficients:
               Estimate Std. Error t value Pr(>|t|)    
(Intercept)   -0.842491   0.047005 -17.923  < 2e-16 ***
AffixPL       -0.084095   0.014491  -5.803 1.03e-08 ***
SylPerMin     -0.516839   0.017126 -30.178  < 2e-16 ***
pauseBinpause  0.114480   0.017515   6.536 1.30e-10 ***
folTypeF       0.003104   0.053407   0.058  0.95367    
folTypeN       0.006503   0.020760   0.313  0.75422    
folTypeP       0.015336   0.018420   0.833  0.40542    
folTypeV      -0.059794   0.018615  -3.212  0.00138 ** 
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

Approximate significance of smooth terms:
              edf Ref.df     F  p-value    
s(SCPP)     6.136  6.679 2.699 0.026491 *  
s(EDNN)     6.464  6.704 3.795 0.000549 ***
s(speaker) 32.365 39.000 5.520  < 2e-16 ***
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

Rank: 64/66
R-sq.(adj) =  0.807   Deviance explained = 82.1%
GCV = 0.032441  Scale est. = 0.029925  n = 683

In total, 2 LDLConvFunctions measures are included as significant predictors of /s/ duration in pseudowords.

SCPP

The Semantic Correlation of Predicted Production describes the maximum correlation between a word’s predicted semantic vector and any of the semantic vectors of its candidate forms. Higher SCPP values indicate that the semantics of the estimated form better approximate the estimated meaning.

But what about its effect on /s/ durations?
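As in the previous section, the summary below appears to be plot_smooth() console output; assuming itsadug, a call like this would visualise the effect:

```r
# Partial effect of SCPP on sDurLog, random effects removed
library(itsadug)
plot_smooth(gam.LDLConvFun.2, view = "SCPP", rm.ranef = TRUE)
```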

Summary:
    * Affix : factor; set to the value(s): PL. 
    * SylPerMin : numeric predictor; set to the value(s): 2.335248237. 
    * pauseBin : factor; set to the value(s): no_pause. 
    * folType : factor; set to the value(s): APP. 
    * SCPP : numeric predictor; with 30 values ranging from 0.607063 to 1.000000. 
    * EDNN : numeric predictor; set to the value(s): 1.71e-15. 
    * speaker : factor; set to the value(s): 2. (Might be canceled as random effect, check below.) 
    * NOTE : The following random effects columns are canceled: s(speaker)
 

EDNN

The Euclidean Distance from Nearest Neighbour describes the Euclidean distance from the semantic vector \(\hat{s}\) produced by the direct route to its closest semantic word neighbour. Higher EDNN values indicate that a word is more distant from its nearest word neighbour.

But what about its effect on /s/ durations?

Summary:
    * Affix : factor; set to the value(s): PL. 
    * SylPerMin : numeric predictor; set to the value(s): 2.335248237. 
    * pauseBin : factor; set to the value(s): no_pause. 
    * folType : factor; set to the value(s): APP. 
    * SCPP : numeric predictor; set to the value(s): 0.937532423. 
    * EDNN : numeric predictor; with 30 values ranging from 0.000000 to 0.032502. 
    * speaker : factor; set to the value(s): 2. (Might be canceled as random effect, check below.) 
    * NOTE : The following random effects columns are canceled: s(speaker)
 

References

Baayen, R. H., Chuang, Y. Y., and Blevins, J. P. (2018). Inflectional morphology with linear mappings. The Mental Lexicon, 13 (2), 232-270. doi.org/10.1075/ml.18010.baa

Baayen, R. H., Chuang, Y. Y., Shafaei-Bajestan, E., & Blevins, J. P. (2019). The discriminative lexicon: A unified computational model for the lexicon and lexical processing in comprehension and production grounded not in (de)composition but in linear discriminative learning. Complexity, 2019, 1-39. doi.org/10.1155/2019/4895891

Cattell, R. B. (1966). The scree test for the number of factors. Multivariate Behavioral Research, 1, 245-276.

Chuang, Y-Y., Vollmer, M-l., Shafaei-Bajestan, E., Gahl, S., Hendrix, P., & Baayen, R. H. (2020). The processing of pseudoword form and meaning in production and comprehension: A computational modeling approach using Linear Discriminative Learning. Behavior Research Methods, 1-51. doi.org/10.3758/s13428-020-01356-w

Nykamp, D. (2020, November 10). Multiplying matrices and vectors. Math Insight. mathinsight.org/matrix_vector_multiplication

Schmitz, D., Baer-Henney, D., & Plag, I. (submitted). The duration of word-final /s/ differs across morphological categories in English: Evidence from pseudowords.

Tucker, B. V., Brenner, D., Danielson, D. K., Kelley, M. C., Nenadic, F., & Sims, M. (2018). The massive auditory lexical decision (MALD) database. Behavior Research Methods, 1-18. doi.org/10.3758/s13428-018-1056-1