This script documents the analysis of ENGS production data and MALD corpus data by means of Linear Discriminative Learning (LDL; e.g. Baayen et al., 2018) and the measures derived from it. This is a work in progress. Please check this script for an analysis of the results of the present script, i.e. PCA and GAMM modelling.

The idea of this LDL implementation is the following:
Networks trained on real word data and their transformation matrices are used to create semantic matrices for pseudowords, following the implementation by Chuang et al. (2020). The underlying assumption is that speakers who face pseudowords do so with their knowledge of previously encountered words, i.e. their lexicon. Thus, pseudowords are not semantically empty shells but carry some sort of meaning.

Combining real and nonce word data, creating cue and semantic matrices for both, and extracting LDL measures for comprehension and production proves to be a little more complicated than more straightforward implementations that limit themselves to either real or nonce word data. Thus, the following figure illustrates in more detail what is done in the remainder of this document.

1 Data

1.1 MALD data

Data on real words is taken from the Massive Auditory Lexical Decision (MALD) Database (Tucker et al., 2018). The present subset of the complete MALD database contains 8328 words, 2093 of which contain one of 36 different affixes.

## load real words (MALD)
mald.data <- read.csv("data/mald.subset.dfr.csv")[2:5]

dim(mald.data)
[1] 8328    4

1.2 ENGS data

Pseudoword data is taken from the ENGS production experiment (Schmitz et al., 2020). In total, 48 pseudowords are contained within this data set, with half of them representing monomorphemic words (e.g. singular bloufs) and half of them representing plural wordforms (e.g. plural bloufs).

## load pseudowords (ENGS)
engs.data <- read.csv("data/final_S_data_LDL_types.csv")

dim(engs.data)
[1] 48  4

1.3 COMB data

The combined data set contains data on both real and nonce words as introduced in 1.1 and 1.2. In total, 8376 words and 36 different affixes are part of this data set.

## create a combined data set
comb.data <- rbind(mald.data, engs.data)

## sort combined data set alphabetically
comb.data <- comb.data[order(comb.data$Word),]

dim(comb.data)
[1] 8376    4

2 A matrix

We can create an A matrix for real words using the semantic vectors constructed from the TASA corpus. The A matrix consists of semantic vectors (columns) for all words in orthographic form (rows).

          meat        mechanic    mechanical  mechanism   medal
eat        0.0118417  -0.0001872   0.0002885  -0.0002331   1.98e-05
echo       0.0005983  -0.0000384  -0.0000922  -0.0000899  -2.05e-05
eclipse    0.0000350  -0.0003962  -0.0000451  -0.0000278  -9.70e-06

## load TASA vectors
load("data_simon/wL2Ld.rda")

## find all unique MALD (= real word) lexomes
mald.subset.allLexomes = unique(c(mald.data$Word, mald.data$Affix))
mald.subset.allLexomes = mald.subset.allLexomes[!is.na(mald.subset.allLexomes)]

## find vectors for all unique lexomes > this then is the A matrix
mald.Amat = wL2Ld[mald.subset.allLexomes, ]
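
As a quick sanity check, we can look at the dimensions of the resulting A matrix (a sketch; the number of rows corresponds to the unique MALD lexomes, the number of columns to the dimensionality of the TASA vectors):

## rows = unique MALD lexomes (word forms and affixes), columns = TASA semantic dimensions
dim(mald.Amat)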

3 C matrices

C matrices contain all triphones of a data set (columns), coded as binary values (presence/absence) for all words in phonological form (rows).

          X..b  X.b.  b.n  X.nd  nd.  d.n  X.nm  nm.  m.n  X.nt  nt.  b.S  X.S.  X..b.1  X.bI
@b{          1     1    1     1    1    1     1    1    1     1    1    0     0       0     0
@b{S         1     1    0     0    0    0     0    0    0     0    0    1     1       0     0
{bI          0     0    0     0    0    0     0    0    0     0    0    0     0       1     1
{            0     0    0     0    0    0     0    0    0     0    0    0     0       1     0
{bdIk1t      0     0    0     0    0    0     0    0    0     0    0    0     0       1     0

3.1 real words

The C matrix for real words is computed based on the MALD data set and \(Word+Affix\), i.e. word forms and affixes are considered.

mald.cues = make_cue_matrix(data = mald.data, formula = ~ Word + Affix, grams = 3, wordform = "DISC")

dim(mald.cues$matrices$C)
# 8328 7587

3.2 pseudowords

The C matrix for nonce words is computed based on the ENGS data set and \(Word+Affix\), i.e. word forms and affixes are considered.

engs.cues = make_cue_matrix(data = engs.data, formula = ~ Word + Affix, grams = 3, wordform = "DISC")

dim(engs.cues$matrices$C)
# 48 63

3.3 combined

The C matrix for real and nonce words is computed based on the combined data set and \(Word+Affix\), i.e. word forms and affixes are considered.

comb.cues = make_cue_matrix(data = comb.data, formula = ~ Word + Affix, grams = 3, wordform = "DISC")

dim(comb.cues$matrices$C)
# 8376 7609

4 S matrices

S matrices contain the semantic dimensions of a data set (columns) and their pertinent values for all words in phonological form (rows), i.e. each row is a word's semantic vector.

          meat        mechanic    mechanical  mechanism   medal
it         0.0118417  -0.0001872   0.0002885  -0.0002331   1.98e-05
Ek5        0.0005983  -0.0000384  -0.0000922  -0.0000899  -2.05e-05
IklIps     0.0000350  -0.0003962  -0.0000451  -0.0000278  -9.70e-06

4.1 real words

The S matrix for real words is computed based on the MALD data set, the MALD A matrix, and \(Word+Affix\), i.e. word forms and affixes are considered.

mald.Smat = make_S_matrix(data = mald.data, formula = ~ Word + Affix, inputA = mald.Amat, grams = 3, wordform = "DISC")

dim(mald.Smat)
# 8328 5487

4.2 pseudowords

We can create a semantic matrix for pseudowords based on the semantics of real words by solving the following equation:

\[engs.Smat = engs.C \cdot Hp\]

with engs.Smat being the semantic matrix for pseudowords, engs.C being the C matrix contained within the cue object of pseudowords, and Hp being the transformation matrix that maps real word cues onto real word semantics.

To obtain this real word transformation matrix Hp (Chuang et al., 2020) we first need to learn comprehension of real words.

mald.comp = learn_comprehension(cue_obj = mald.cues, S = mald.Smat)

Then, we can extract F (the name used for Hp in this package) from mald.comp. We will use Hp as the variable name.

Hp = mald.comp$F

dim(Hp) 
# 7587 5487

For matrix multiplication, the number of columns in the first matrix must be equal to the number of rows in the second matrix (e.g. Nykamp, 2020).

Let’s check the dimensions of our matrices:
• engs.C: 48 x 63
• Hp: 7587 x 5487

Thus, engs.C must be expanded so that its number of columns \(63\) matches the number of rows of Hp \(7587\).

To achieve this, we first create an empty matrix, i.e. a matrix of zeros, with \(7539\) rows and \(63\) columns (together with the \(48\) pseudoword rows this yields the \(7587\) rows needed for the next step).

add.rows <- matrix(0, 7539, 63)

This matrix is then attached to the original pseudoword C matrix.

C.eng.big1 <- rbind(engs.cues$matrices$C, add.rows)

dim(C.eng.big1)
# 7587   63

Next, we create an identity matrix, i.e. a matrix in which all cells are zero except for the diagonal, which contains ones. Appending columns of an identity matrix is a simple way to extend a matrix's number of columns. The number of rows and columns of an identity matrix is identical; we take \(7587\) rows/columns as this is the number of columns we want to obtain.

diag <- diag(nrow = 7587)

Then, we create a new matrix based on the identity matrix containing all columns but the first \(63\) (as this is the number of columns contained in the pseudoword C matrix).

add.columns <- as.data.frame(diag[,64:7587])

dim(add.columns)
# 7587 7524

This matrix is then combined with C.eng.big1 (the cue matrix with additional rows).

C.eng.big2 <- cbind(C.eng.big1, add.columns)

dim(C.eng.big2)
# 7587 7587

This expanded pseudoword C matrix must now be reduced to the necessary size: in its current form it contains 7500+ rows without useful data. We only keep the rows which contain info on the ENGS pseudowords, i.e. the first \(48\) rows.

C.eng.big3 <- C.eng.big2[1:48,]

In a last step, we make sure that R treats the newly created C matrix as a matrix (after the cbind with a data frame above it is a data frame, which would lead to errors in further steps).

C.eng.big4 <- as.matrix(C.eng.big3)

Now, we can solve the aforementioned equation \(engs.Smat = engs.C * Hp\).

engs.Smat = C.eng.big4 %*% Hp

dim(engs.Smat)
# 48 5487
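
As an aside: for the 48 pseudoword rows, the appended identity block contributes only zeros, so the whole expansion amounts to zero-padding engs.C on the right. A minimal optional check (a sketch, not part of the original pipeline):

## zero-pad the pseudoword C matrix to the width of Hp and compare the resulting S matrix
C.eng.pad <- cbind(as.matrix(engs.cues$matrices$C), matrix(0, 48, 7587 - 63))
all.equal(unname(C.eng.pad %*% Hp), unname(engs.Smat))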

This S matrix contains the 48 pseudowords and their semantic vectors. However, all of our pseudowords are homophones and thus show identical semantic vectors even though half of them contain a suffix, e.g. the semantic vectors of bloufs (non-morphemic S) and bloufs (plural S) are identical.

        fierce      fifteen     fifteenth   fifth       fifty
bl6fs   -0.0001453   0.0002496   0.0002782   0.0006957  -0.0001303
bl6fs   -0.0001453   0.0002496   0.0002782   0.0006957  -0.0001303
bl6ks    0.0000021  -0.0006015   0.0000090  -0.0012144   0.0003275
bl6ks    0.0000021  -0.0006015   0.0000090  -0.0012144   0.0003275

We “fix” this by adding the “PL” semantic vector values to the semantic vectors of plural pseudowords (idea by Yu-Ying).

Thus, we first need to extract the “PL” semantic vector contained in the general A matrix.

PL.vec <- mald.Amat["PL", ]

length(PL.vec)
# 5487

Taking a closer look at the pseudoword S matrix, we find that every second row corresponds to a plural pseudoword, i.e. this is where we add the PL vector.

engs.pl = engs.Smat[seq(2, nrow(engs.Smat), 2), ]

Taking this plural subset, we add the “PL” semantic vector to all rows.

engs.pl.added <- engs.pl + rep(PL.vec, each = nrow(engs.pl))
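
Since R stores matrices column-wise, rep(PL.vec, each = nrow(engs.pl)) adds PL.vec[j] to the j-th column in every row, i.e. the complete PL vector is added to each row. The same can be achieved with sweep(); a quick equivalence check (sketch):

## add the PL vector to every row via sweep(); should equal the rep() approach above
all.equal(engs.pl.added, sweep(engs.pl, 2, PL.vec, "+"))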

We then recombine the original monomorphemic S matrix and the modified plural S matrix, and recreate the original row order.

engs.nm = engs.Smat[seq(1, nrow(engs.Smat), 2), ]

## keep the original row order before engs.Smat is overwritten
sort.rows <- row.names(engs.Smat)

engs.Smat <- rbind(engs.nm, engs.pl.added)

engs.Smat <- engs.Smat[order(match(rownames(engs.Smat), sort.rows)), , drop = FALSE]

Now, homophonous pseudowords have differing semantic vectors.

        fierce      fifteen     fifteenth   fifth       fifty
bl6fs   -0.0001453   0.0002496   0.0002782   0.0006957  -0.0001303
bl6fs   -0.0000796   0.0001372   0.0003686   0.0003187  -0.0002868
bl6ks    0.0000021  -0.0006015   0.0000090  -0.0012144   0.0003275
bl6ks    0.0000678  -0.0007139   0.0000994  -0.0015913   0.0001709

4.3 combined

Finally, we can create one combined S matrix for real and nonce words.

comb.Smat <- rbind(mald.Smat, engs.Smat)

However, LDL commands in R are sensitive to row order. Thus, we need to recreate the original row order as found in the combined C matrix.

all(rownames(comb.cues$matrices$C) == rownames(comb.Smat))
# FALSE

comb.Smat2 <- comb.Smat[rownames(comb.cues$matrices$C),,drop=FALSE]

all(rownames(comb.cues$matrices$C) == rownames(comb.Smat2))
# TRUE

5 Comprehension

Using the combined C and S matrices, we can now learn and evaluate comprehension of real and nonce words in one go, i.e. as if they were encountered by a speaker with their pertinent lexicon.

First, comprehension is learned.

comb.comp = learn_comprehension(cue_obj = comb.cues, S = comb.Smat2)

Second, the accuracy of this comprehension is evaluated, i.e. how well form (C matrix) can be mapped onto meaning (S matrix).

comb.comp_acc = accuracy_comprehension(m = comb.comp, data = comb.data, wordform = "DISC", show_rank = TRUE, neighborhood_density = TRUE)

comb.comp_acc$acc # 0.6703677

We find that comprehension accuracy is about \(67\%\).

6 Production

Using the combined C and S matrices as well as the comprehension results, we now are able to learn production and evaluate its accuracy.

First, production is learned.

comb.prod = learn_production(cue_obj = comb.cues, S = comb.Smat2, comp = comb.comp)

Second, the accuracy of this production is evaluated, i.e. how well meaning (S matrix) can be mapped onto form (C matrix).

comb.prod_acc = accuracy_production(m = comb.prod, data = comb.data, wordform = "DISC", full_results = TRUE, return_triphone_supports = TRUE)

comb.prod_acc$acc # 0.9641834

We find that production accuracy is about \(96\%\).

7 Measures

7.1 Comprehension

Extracting and saving comprehension measures is straightforward.

comb.comp_measures = comprehension_measures(comp_acc = comb.comp_acc, Shat = comb.comp$Shat, S = comb.Smat2, A = mald.Amat, affixFunctions = comb.data$Affix)
write.csv(comb.comp_measures, "comb.comp_measures_ALLREALSEMANTICS.csv")
  • recognition:
    whether the correct form was recognized
  • cor_max:
    the correlation between the pertinent vector in Shat and the semantic vector of its predicted form in S
  • cor_target:
    the correlation between the pertinent vector in Shat and the semantic vector of its targeted form in S
  • l1norm:
    city block distance / Manhattan distance of Shat
  • l2norm:
    Euclidean (direct) distance / length of Shat
  • rank:
    the rank of correlation between s-hat and the semantic vector of its targeted form
  • density:
    the mean correlation of s-hat with the semantic vectors of its top 8 neighbors
  • cor_affix:
    the correlation of s-hat and the semantic vector of its affix function; more or less a measure of transparency
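
For orientation, one can glance at the first few rows of these measures (a sketch; the column names correspond to the list above):

head(comb.comp_measures[, c("recognition", "cor_max", "cor_target", "l1norm",
                            "l2norm", "rank", "density", "cor_affix")])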

7.2 Production

Extracting and saving production measures runs into problems. It appears that there are 23 words for which comb.prod_acc does not contain all necessary information; why this is the case is currently unclear.

I worked around this problem by removing the 23 problematic words from full, i.e. the list within comb.prod_acc that contains all the necessary info.

full <- comb.prod_acc[["full"]]

Using the summary() function, all cases with missing info are identified. I call the newly created data frame nineortwelve, as complete entries consist of 12 elements per word, while incomplete entries consist of only 9 elements per word.

nineortwelve <- as.data.frame(summary(full))

Checking this data frame, the following entries turn out to contain only 9 individual elements instead of 12. Thus, they are removed from full2.
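
Alternatively, the incomplete entries can be identified programmatically (a sketch, not part of the original workflow; it relies on complete entries having 12 elements and incomplete ones only 9):

## indices of entries with fewer than 12 elements; should match the 23 indices removed below
incomplete.idx <- which(sapply(full, length) < 12)
incomplete.idx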

full2 <- full

## remove the 23 incomplete entries in one go
incomplete <- c(8149, 7741, 7233, 7143, 7133, 7027, 6920, 6314, 5274, 4878, 4385, 3873,
                3399, 3111, 2842, 2206, 2067, 1821, 1772, 1771, 1354, 971, 330)
full2[incomplete] <- NULL

This reduced full2 then replaces the original full in a copy of comb.prod_acc.

comb.prod_acc2 <- comb.prod_acc

comb.prod_acc2$full <- full2

Only then can production measures be obtained.

comb.prod_measures = production_measures(prod_acc = comb.prod_acc2, S = comb.Smat2, prod = comb.prod)
write.csv(comb.prod_measures, "comb.prod_measures_ALLREALSEMANTICS.csv")
  • correlations:
    the correlation of the predicted path with the targeted semantic vector
  • path_counts:
    the number of paths detected
  • path_sum:
    the summed support for the predicted path, could be read as the uncertainty about how to pronounce the word due to competing activation of articulatory motor plans
  • path_entropies:
    Shannon entropy calculated over the path supports
  • lwlr:
    Simon: “The length-weakest link ratios of the predicted path (also sometimes equal to word length). When the length of a word’s path, as measured in number of nodes, becomes higher, while the weakest link between nodes stays the same, uncertainty becomes greater. If triphone support is high, i.e. when all the strengths of the links are almost equal to one, the value will be almost equal to the number of segments in the word. This may not be the case in other models, where the triphone support may be lower.”
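
Analogously to the comprehension measures, a quick look at the first rows may help (a sketch; column names as listed above):

head(comb.prod_measures[, c("correlations", "path_counts", "path_sum",
                            "path_entropies", "lwlr")])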

8 Analysis I

Attention:
The following section (i.e. chapter 8) is no longer up to date. Please check this link for a more recent analysis.

8.1 Data

Load comprehension and production measures as well as ENGS production data on non-morphemic and plural /s/.

# comprehension measures
comb.comp_measures <- read.csv("data/comb.comp_measures_ALLREALSEMANTICS_2.csv")[2:9]

# production measures
comb.prod_measures <- read.csv("data/comb.prod_measures_ALLREALSEMANTICS.csv")

# ENGS production data
data <- read.csv("data/engs_durations.csv")

# fix a naming issue
names(data)[names(data) == "ï..number"] <- "number"

Add information on words, transcriptions, bases, and affixes to one of the measure data sets (so we actually know which words we are dealing with).

comb.comp_measures$Word  <- comb.data$Word
comb.comp_measures$DISC  <- comb.data$DISC
comb.comp_measures$Affix <- comb.data$Affix
comb.comp_measures$Base  <- comb.data$Base

Next, we have to exclude from the comprehension measures data set the 23 words we previously excluded when calculating the production measures.

comb.comp_measures2 <- subset(comb.comp_measures, DISC!="{p@r1t@s" & DISC!="bUl@k" & DISC!= "kl2@nt" & DISC!= "krimI" & 
                               DISC!= "kri1t" & DISC!= "krVmbP" & DISC!= "divj@nt" & DISC!= "dIsEmIn1SH" & DISC!= "fIS@R" & 
                               DISC!= "g & DISC!=t@R" & DISC!= "h{mb3g@R" & DISC!= "InstrUm@nt" & DISC!= "m@=Qr@tI" & 
                               DISC!= "n5bIl@tI" & DISC!= "pEndjUl@m" & DISC!= "rVpJ@R" & DISC!= "s6@R" & DISC!= "skwQnd@R" & 
                               DISC!= "stQk" & DISC!= "st5nkVt@R" & DISC!= "s@bmISH" & DISC!= "tr{nsEnd" & DISC!= "wQJfUl")

Finally, create the final data set by combining both measure data sets and the ENGS production data.

# combine comprehension and production measures
ldl.measures <- cbind(comb.comp_measures2, comb.prod_measures)

# combine LDL measures and ENGS production data
data <- merge(data, ldl.measures, by="Base")

# exclude some duplicated columns
data1 <- data[1:40]
data2 <- data[44:49]
data <- cbind(data1, data2)

# fix some names
names(data)[names(data) == "Word.x"] <- "Word"
names(data)[names(data) == "Affix.x"] <- "Affix"
names(data)[names(data) == "DISC.x"] <- "DISC"

# fix 'Affix' levels
data$Affix[is.na(data$Affix)] <- "NM"
data$Affix <- as.factor(data$Affix)

8.2 Variables

Let’s have a closer look at all variables we are interested in:

  • dependent variables:
    sDur, baseDur, wordDur
  • traditional variables:
    speakingRate, googleFreq, pauseBin, biphoneProbSumBin, folType, monoMultilingual, speaker
  • LDL variables:
    comprehension: cor_max, cor_target, l1norm, l2norm, rank, density, cor_affix
    production: correlations, path_counts, path_sum, path_entropies, lwlr

8.2.1 dependent variables

sDur

sDur is the measured duration of non-morphemic and plural /s/.

shapiro.test(data$sDur)

    Shapiro-Wilk normality test

data:  data$sDur
W = 0.94176, p-value = 9.988e-16

sDur is not normally distributed. Thus, sDur is log-transformed:

data$sDurLog <- log(data$sDur)

shapiro.test(data$sDurLog)

    Shapiro-Wilk normality test

data:  data$sDurLog
W = 0.99268, p-value = 0.001955

baseDur

baseDur is the measured duration of strings preceding the word-final /s/.

shapiro.test(data$baseDur)

    Shapiro-Wilk normality test

data:  data$baseDur
W = 0.93546, p-value < 2.2e-16

baseDur is not normally distributed. Thus, baseDur is log-transformed:

data$baseDurLog <- log(data$baseDur)

shapiro.test(data$baseDurLog)

    Shapiro-Wilk normality test

data:  data$baseDurLog
W = 0.99236, p-value = 0.001384

wordDur

wordDur is the sum of baseDur and sDur.

shapiro.test(data$wordDur)

    Shapiro-Wilk normality test

data:  data$wordDur
W = 0.96604, p-value = 1.717e-11

wordDur is not normally distributed. Thus, wordDur is log-transformed:

data$wordDurLog <- log(data$wordDur)

shapiro.test(data$wordDurLog)

    Shapiro-Wilk normality test

data:  data$wordDurLog
W = 0.99469, p-value = 0.01763

8.2.2 traditional variables

speakingRate

speakingRate is the number of syllables in an utterance divided by the duration of the utterance.

shapiro.test(data$speakingRate)

    Shapiro-Wilk normality test

data:  data$speakingRate
W = 0.99142, p-value = 0.0005315

speakingRate is not normally distributed. Thus, speakingRate is log-transformed:

data$speakingRateLog <- log(data$speakingRate)

shapiro.test(data$speakingRateLog)

    Shapiro-Wilk normality test

data:  data$speakingRateLog
W = 0.98808, p-value = 2.291e-05

However, log-transforming does not improve normality. Thus, we keep speakingRate as variable.

googleFreq

googleFreq is a word's frequency as obtained from Google.

shapiro.test(data$googleFreq)

    Shapiro-Wilk normality test

data:  data$googleFreq
W = 0.36678, p-value < 2.2e-16

googleFreq is not normally distributed. Thus, googleFreq is log-transformed:

data$googleFreqLog <- log(data$googleFreq)

shapiro.test(data$googleFreqLog)

    Shapiro-Wilk normality test

data:  data$googleFreqLog
W = 0.9794, p-value = 3.172e-08

categorical variables

summary(data$pauseBin)
no_pause    pause 
     435      249 
summary(data$biphoneProbSumBin)
high  low 
 168  516 
summary(data$folType)
APP   F   N   P   V 
197  12 114 171 190 
summary(data$monoMultilingual)
  bilingual monolingual 
        129         555 
summary(data$speaker)
 1  2  3  4  5  6  7  9 10 11 12 13 14 15 16 17 19 20 21 23 24 25 26 27 28 29 
19 26 14 18 14 16 18 14 14 20 21 20 13 21 14  7 18 17 14  5 12 20 22 17 21 19 
30 31 32 33 35 36 41 42 43 45 46 47 48 49 
21 17 20 17 16 11 20 21 18 15 18 17 20 19 

8.2.3 LDL variables

8.2.3.1 comprehension measures

cor_max

cor_max: the correlation between the pertinent vector in Shat and the semantic vector of its predicted form in S.

shapiro.test(data$cor_max)

    Shapiro-Wilk normality test

data:  data$cor_max
W = 0.74745, p-value < 2.2e-16

cor_max is not normally distributed. Thus, cor_max is log-transformed:

data$cor_maxLog <- log(data$cor_max)

shapiro.test(data$cor_maxLog)

    Shapiro-Wilk normality test

data:  data$cor_maxLog
W = 0.71499, p-value < 2.2e-16

However, log-transforming does not improve normality. Thus, we test another way to normalize data using the bestNormalize package.

The bestNormalize package contains a suite of transformation-estimating functions that can be used to normalize data. The function of the same name attempts to find and execute the best of all of these potential normalizing transformations. In this package, we define “normalize” as in “to render data Gaussian”, rather than transform it to the 0-1 scale.

bestNormalize(data$cor_max)
Best Normalizing transformation with 684 Observations
 Estimated Normality Statistics (Pearson P / df, lower => more normal):
 - arcsinh(x): 17.0242
 - Box-Cox: 17.0954
 - Exp(x): 16.9817
 - Log_b(x+a): 16.9687
 - No transform: 16.887
 - orderNorm (ORQ): 13.5756
 - sqrt(x + a): 16.9604
 - Yeo-Johnson: 16.6287
Estimation method: Out-of-sample via CV with 10 folds and 5 repeats
 
Based off these, bestNormalize chose:
orderNorm Transformation with 684 nonmissing obs and ties
 - 15 unique values 
 - Original quantiles:
   0%   25%   50%   75%  100% 
0.607 0.888 0.938 1.000 1.000 
cor_max_best <- bestNormalize(data$cor_max)
orderNorm_obj <- orderNorm(data$cor_max)
data$cor_maxNorm <- predict(orderNorm_obj, newdata = data$cor_max)

shapiro.test(data$cor_maxNorm)

    Shapiro-Wilk normality test

data:  data$cor_maxNorm
W = 0.85932, p-value < 2.2e-16

Still, cor_max is not even close to normally distributed. We will keep using cor_max as a variable for now.

cor_target

cor_target: the correlation between the pertinent vector in Shat and the semantic vector of its targeted form in S.

shapiro.test(data$cor_target)

    Shapiro-Wilk normality test

data:  data$cor_target
W = 0.89454, p-value < 2.2e-16

cor_target is not normally distributed. Thus, cor_target is log-transformed:

data$cor_targetLog <- log(data$cor_target)

shapiro.test(data$cor_targetLog)

    Shapiro-Wilk normality test

data:  data$cor_targetLog
W = 0.8771, p-value < 2.2e-16

However, log-transforming does not improve normality. Thus, we test another way to normalize data using the bestNormalize package.

bestNormalize(data$cor_target)
Best Normalizing transformation with 684 Observations
 Estimated Normality Statistics (Pearson P / df, lower => more normal):
 - arcsinh(x): 6.687
 - Box-Cox: 6.3041
 - Exp(x): 6.4214
 - Log_b(x+a): 6.2881
 - No transform: 6.3034
 - orderNorm (ORQ): 1.1091
 - sqrt(x + a): 5.9398
 - Yeo-Johnson: 6.3725
Estimation method: Out-of-sample via CV with 10 folds and 5 repeats
 
Based off these, bestNormalize chose:
orderNorm Transformation with 684 nonmissing obs and ties
 - 48 unique values 
 - Original quantiles:
   0%   25%   50%   75%  100% 
0.068 0.279 0.637 0.888 0.998 
cor_target_best <- bestNormalize(data$cor_target)
orderNorm_obj <- orderNorm(data$cor_target)
data$cor_targetNorm <- predict(orderNorm_obj, newdata = data$cor_target)

shapiro.test(data$cor_targetNorm)

    Shapiro-Wilk normality test

data:  data$cor_targetNorm
W = 0.99374, p-value = 0.006111

l1norm

l1norm: city block distance / Manhattan distance of Shat.

shapiro.test(data$l1norm)

    Shapiro-Wilk normality test

data:  data$l1norm
W = 0.93901, p-value = 4.012e-16

l1norm is not normally distributed. Thus, l1norm is log-transformed:

data$l1normLog <- log(data$l1norm)

shapiro.test(data$l1normLog)

    Shapiro-Wilk normality test

data:  data$l1normLog
W = 0.87911, p-value < 2.2e-16

However, log-transforming does not improve normality. Thus, we test another way to normalize data using the bestNormalize package.

bestNormalize(data$l1norm)
Best Normalizing transformation with 684 Observations
 Estimated Normality Statistics (Pearson P / df, lower => more normal):
 - arcsinh(x): 6.1997
 - Box-Cox: 6.1849
 - Exp(x): 7.6258
 - Log_b(x+a): 6.254
 - No transform: 6.1678
 - orderNorm (ORQ): 1.3636
 - sqrt(x + a): 6.1385
 - Yeo-Johnson: 6.1727
Estimation method: Out-of-sample via CV with 10 folds and 5 repeats
 
Based off these, bestNormalize chose:
orderNorm Transformation with 684 nonmissing obs and ties
 - 24 unique values 
 - Original quantiles:
   0%   25%   50%   75%  100% 
0.102 0.572 0.971 1.439 1.937 
l1norm_best <- bestNormalize(data$l1norm)
orderNorm_obj <- orderNorm(data$l1norm)
data$l1normNorm <- predict(orderNorm_obj, newdata = data$l1norm)

shapiro.test(data$l1normNorm)

    Shapiro-Wilk normality test

data:  data$l1normNorm
W = 0.98538, p-value = 2.408e-06

l2norm

l2norm: Euclidean (direct) distance / length of Shat.

shapiro.test(data$l2norm)

    Shapiro-Wilk normality test

data:  data$l2norm
W = 0.84956, p-value < 2.2e-16

l2norm is not normally distributed. Thus, l2norm is log-transformed:

data$l2normLog <- log(data$l2norm)

shapiro.test(data$l2normLog)

    Shapiro-Wilk normality test

data:  data$l2normLog
W = 0.94739, p-value = 7.141e-15

However, log-transforming does not improve normality. Thus, we test another way to normalize data using the bestNormalize package.

bestNormalize(data$l2norm)
Best Normalizing transformation with 684 Observations
 Estimated Normality Statistics (Pearson P / df, lower => more normal):
 - arcsinh(x): 10.3382
 - Box-Cox: 5.7599
 - Exp(x): 10.7656
 - Log_b(x+a): 5.5838
 - No transform: 10.3382
 - orderNorm (ORQ): 1.3053
 - sqrt(x + a): 7.0144
 - Yeo-Johnson: 8.738
Estimation method: Out-of-sample via CV with 10 folds and 5 repeats
 
Based off these, bestNormalize chose:
orderNorm Transformation with 684 nonmissing obs and ties
 - 24 unique values 
 - Original quantiles:
   0%   25%   50%   75%  100% 
0.004 0.014 0.033 0.042 0.096 
l2norm_best <- bestNormalize(data$l2norm)
orderNorm_obj <- orderNorm(data$l2norm)
data$l2normNorm <- predict(orderNorm_obj, newdata = data$l2norm)

shapiro.test(data$l2normNorm)

    Shapiro-Wilk normality test

data:  data$l2normNorm
W = 0.98576, p-value = 3.256e-06

density

density: the mean correlation of s-hat with the semantic vectors of its top 8 neighbors.

shapiro.test(data$density)

    Shapiro-Wilk normality test

data:  data$density
W = 0.93529, p-value < 2.2e-16

density is not normally distributed. Thus, density is log-transformed:

data$densityLog <- log(data$density)

shapiro.test(data$densityLog)

    Shapiro-Wilk normality test

data:  data$densityLog
W = 0.95078, p-value = 2.511e-14

However, log-transforming does not improve normality. Thus, we test another way to normalize data using the bestNormalize package.

bestNormalize(data$density)
Best Normalizing transformation with 684 Observations
 Estimated Normality Statistics (Pearson P / df, lower => more normal):
 - arcsinh(x): 3.9804
 - Box-Cox: 3.1975
 - Exp(x): 4.401
 - Log_b(x+a): 3.6193
 - No transform: 4.1459
 - orderNorm (ORQ): 1.2895
 - sqrt(x + a): 3.9806
 - Yeo-Johnson: 3.1907
Estimation method: Out-of-sample via CV with 10 folds and 5 repeats
 
Based off these, bestNormalize chose:
orderNorm Transformation with 684 nonmissing obs and ties
 - 24 unique values 
 - Original quantiles:
   0%   25%   50%   75%  100% 
0.518 0.601 0.694 0.785 0.929 
density_best <- bestNormalize(data$density)
orderNorm_obj <- orderNorm(data$density)
data$densityNorm <- predict(orderNorm_obj, newdata = data$density)

shapiro.test(data$densityNorm)

    Shapiro-Wilk normality test

data:  data$densityNorm
W = 0.98627, p-value = 4.935e-06

cor_affix

cor_affix: the correlation of s-hat and the semantic vector of its affix function; more or less a measure of transparency.

shapiro.test(data$cor_affix)

    Shapiro-Wilk normality test

data:  data$cor_affix
W = 0.75756, p-value < 2.2e-16

cor_affix is not normally distributed. Thus, cor_affix is log-transformed:

data$cor_affixLog <- log(data$cor_affix)

shapiro.test(data$cor_affixLog)

    Shapiro-Wilk normality test

data:  data$cor_affixLog
W = NaN, p-value = NA

Log-transformation does not help here: cor_affix contains negative values, for which the logarithm is undefined (hence the NaN in the test above). Thus, we test another way to normalize data using the bestNormalize package.

bestNormalize(data$cor_affix)
Best Normalizing transformation with 684 Observations
 Estimated Normality Statistics (Pearson P / df, lower => more normal):
 - arcsinh(x): 19.7154
 - Exp(x): 19.9425
 - Log_b(x+a): 19.7236
 - No transform: 19.7235
 - orderNorm (ORQ): 15.5695
 - sqrt(x + a): 18.8939
 - Yeo-Johnson: 19.2443
Estimation method: Out-of-sample via CV with 10 folds and 5 repeats
 
Based off these, bestNormalize chose:
orderNorm Transformation with 684 nonmissing obs and ties
 - 25 unique values 
 - Original quantiles:
    0%    25%    50%    75%   100% 
-0.050 -0.004  0.000  0.002  0.099 
cor_affix_best <- bestNormalize(data$cor_affix)
orderNorm_obj <- orderNorm(data$cor_affix)
data$cor_affixNorm <- predict(orderNorm_obj, newdata = data$cor_affix)

shapiro.test(data$cor_affixNorm)

    Shapiro-Wilk normality test

data:  data$cor_affixNorm
W = 0.89007, p-value < 2.2e-16

8.2.3.2 production measures

correlations

correlations: the correlation of the predicted path with the targeted semantic vector.

shapiro.test(data$correlations)

    Shapiro-Wilk normality test

data:  data$correlations
W = 0.87493, p-value < 2.2e-16

correlations is not normally distributed. Thus, correlations is log-transformed:

data$correlationsLog <- log(data$correlations)

shapiro.test(data$correlationsLog)

    Shapiro-Wilk normality test

data:  data$correlationsLog
W = 0.86514, p-value < 2.2e-16

However, log-transforming does not improve normality. Thus, we test another way to normalize data using the bestNormalize package.

bestNormalize(data$correlations)
Best Normalizing transformation with 684 Observations
 Estimated Normality Statistics (Pearson P / df, lower => more normal):
 - arcsinh(x): 8.274
 - Box-Cox: 8.1077
 - Exp(x): 8.7276
 - Log_b(x+a): 8.0811
 - No transform: 8.225
 - orderNorm (ORQ): 1.2942
 - sqrt(x + a): 7.849
 - Yeo-Johnson: 8.222
Estimation method: Out-of-sample via CV with 10 folds and 5 repeats
 
Based off these, bestNormalize chose:
orderNorm Transformation with 684 nonmissing obs and ties
 - 46 unique values 
 - Original quantiles:
   0%   25%   50%   75%  100% 
0.068 0.279 0.651 0.958 1.000 
correlations_best <- bestNormalize(data$correlations)
orderNorm_obj <- orderNorm(data$correlations)
data$correlationsNorm <- predict(orderNorm_obj, newdata = data$correlations)

shapiro.test(data$correlationsNorm)

    Shapiro-Wilk normality test

data:  data$correlationsNorm
W = 0.99319, p-value = 0.003364

path_counts

path_counts: the number of paths detected.

shapiro.test(data$path_counts)

    Shapiro-Wilk normality test

data:  data$path_counts
W = 0.80891, p-value < 2.2e-16

path_counts is not normally distributed. Thus, path_counts is log-transformed:

data$path_countsLog <- log(data$path_counts)

shapiro.test(data$path_countsLog)

    Shapiro-Wilk normality test

data:  data$path_countsLog
W = 0.85867, p-value < 2.2e-16

However, log-transforming does not improve normality. Thus, we test another way to normalize data using the bestNormalize package.

bestNormalize(data$path_counts)
Best Normalizing transformation with 684 Observations
 Estimated Normality Statistics (Pearson P / df, lower => more normal):
 - arcsinh(x): 23.2514
 - Box-Cox: 23.2177
 - Exp(x): 45.0524
 - Log_b(x+a): 23.2177
 - No transform: 23.2708
 - orderNorm (ORQ): 23.3514
 - sqrt(x + a): 23.2924
 - Yeo-Johnson: 23.2177
Estimation method: Out-of-sample via CV with 10 folds and 5 repeats
 
Based off these, bestNormalize chose:
Standardized Box Cox Transformation with 684 nonmissing obs.:
 Estimated statistics:
 - lambda = -0.0112817 
 - mean (before standardization) = 0.6606645 
 - sd (before standardization) = 0.4586133 
path_counts_best <- bestNormalize(data$path_counts)
bc_obj <- bestNormalize::boxcox(data$path_counts)
data$path_countsNorm <- predict(bc_obj, newdata = data$path_counts)

shapiro.test(data$path_countsNorm)

    Shapiro-Wilk normality test

data:  data$path_countsNorm
W = 0.85827, p-value < 2.2e-16

path_sum

path_sum: the summed support for the predicted path.

shapiro.test(data$path_sum)

    Shapiro-Wilk normality test

data:  data$path_sum
W = 0.91361, p-value < 2.2e-16

path_sum is not normally distributed. Thus, path_sum is log-transformed:

data$path_sumLog <- log(data$path_sum)

shapiro.test(data$path_sumLog)

    Shapiro-Wilk normality test

data:  data$path_sumLog
W = 0.88998, p-value < 2.2e-16

However, log-transforming does not improve normality. Thus, we test another way to normalize data using the bestNormalize package.

bestNormalize(data$path_sum)
Best Normalizing transformation with 684 Observations
 Estimated Normality Statistics (Pearson P / df, lower => more normal):
 - arcsinh(x): 6.6561
 - Box-Cox: 5.521
 - Exp(x): 7.9058
 - Log_b(x+a): 6.5748
 - No transform: 5.538
 - orderNorm (ORQ): 1.2506
 - sqrt(x + a): 6.4564
 - Yeo-Johnson: 5.3907
Estimation method: Out-of-sample via CV with 10 folds and 5 repeats
 
Based off these, bestNormalize chose:
orderNorm Transformation with 684 nonmissing obs and ties
 - 48 unique values 
 - Original quantiles:
   0%   25%   50%   75%  100% 
1.010 2.003 3.328 4.206 4.709 
path_sum_best <- bestNormalize(data$path_sum)
orderNorm_obj <- orderNorm(data$path_sum)
data$path_sumNorm <- predict(orderNorm_obj, newdata = data$path_sum)

shapiro.test(data$path_sumNorm)

    Shapiro-Wilk normality test

data:  data$path_sumNorm
W = 0.99349, p-value = 0.004673

path_entropies

path_entropies: Shannon entropy calculated over the path supports.

shapiro.test(data$path_entropies)

    Shapiro-Wilk normality test

data:  data$path_entropies
W = 0.86413, p-value < 2.2e-16

path_entropies is not normally distributed. Thus, path_entropies is log-transformed:

data$path_entropiesLog <- log(data$path_entropies)

shapiro.test(data$path_entropiesLog)

    Shapiro-Wilk normality test

data:  data$path_entropiesLog
W = NaN, p-value = NA

Log-transformation does not help here: path_entropies contains zeros, for which the logarithm is not finite (hence the NaN in the test above). Thus, we test another way to normalize data using the bestNormalize package.

bestNormalize(data$path_entropies)
Best Normalizing transformation with 684 Observations
 Estimated Normality Statistics (Pearson P / df, lower => more normal):
 - arcsinh(x): 18.9133
 - Exp(x): 17.007
 - Log_b(x+a): 30.5447
 - No transform: 16.8756
 - orderNorm (ORQ): 4.795
 - sqrt(x + a): 19.103
 - Yeo-Johnson: 16.909
Estimation method: Out-of-sample via CV with 10 folds and 5 repeats
 
Based off these, bestNormalize chose:
orderNorm Transformation with 684 nonmissing obs and ties
 - 36 unique values 
 - Original quantiles:
   0%   25%   50%   75%  100% 
0.000 0.000 0.952 1.204 2.346 
path_entropies_best <- bestNormalize(data$path_entropies)
orderNorm_obj <- orderNorm(data$path_entropies)
data$path_entropiesNorm <- predict(orderNorm_obj, newdata = data$path_entropies)

shapiro.test(data$path_entropiesNorm)

    Shapiro-Wilk normality test

data:  data$path_entropiesNorm
W = 0.93314, p-value < 2.2e-16

lwlr

lwlr: the length-weakest link ratio of the predicted path.

shapiro.test(data$lwlr)

    Shapiro-Wilk normality test

data:  data$lwlr
W = 0.84565, p-value < 2.2e-16

lwlr is not normally distributed. Thus, lwlr is log-transformed:

data$lwlrLog <- log(data$lwlr)

shapiro.test(data$lwlrLog)

    Shapiro-Wilk normality test

data:  data$lwlrLog
W = 0.87731, p-value < 2.2e-16

However, log-transforming does not improve normality. Thus, we test another way to normalize data using the bestNormalize package.

bestNormalize(data$lwlr)
Best Normalizing transformation with 684 Observations
 Estimated Normality Statistics (Pearson P / df, lower => more normal):
 - arcsinh(x): 7.4059
 - Box-Cox: 7.1704
 - Exp(x): 61.1705
 - Log_b(x+a): 7.3937
 - No transform: 8.8156
 - orderNorm (ORQ): 1.6566
 - sqrt(x + a): 7.8238
 - Yeo-Johnson: 7.4122
Estimation method: Out-of-sample via CV with 10 folds and 5 repeats
 
Based off these, bestNormalize chose:
orderNorm Transformation with 684 nonmissing obs and ties
 - 34 unique values 
 - Original quantiles:
    0%    25%    50%    75%   100% 
 5.102  5.882  8.475 16.667 23.810 
lwlr_best <- bestNormalize(data$lwlr)
orderNorm_obj <- orderNorm(data$lwlr)
data$lwlrNorm <- predict(orderNorm_obj, newdata = data$lwlr)

shapiro.test(data$lwlrNorm)

    Shapiro-Wilk normality test

data:  data$lwlrNorm
W = 0.9905, p-value = 0.0002134

8.3 Correlations 1

Correlations of sDurLog, Affix, and all (normalized) LDL measures.

pairscor.fnc(data[,c("sDurLog", "Affix", "speakingRateLog", "baseDurLog", "pauseBin", "biphoneProbSumBin", "folType", "monoMultilingual",
                     "cor_max", "cor_targetNorm", "l1normNorm", "l2normNorm", "rank", "densityNorm", "cor_affixNorm",
                     "correlationsNorm", "path_countsNorm", "path_sumNorm", "path_entropiesNorm", "lwlrNorm" )])

However, this is a rather crowded picture. We will return to more specific correlations in the next section.

8.4 Random Forests

We take a look at two different random forests: one forest containing all variables, and one forest containing only LDL variables.

8.4.1 Forest 1: all variables

preds = c("Affix", "speakingRateLog", "pauseBin", "biphoneProbSumBin", "folType", "monoMultilingual",
          "cor_maxNorm", "cor_targetNorm", "l1normNorm", "l2normNorm", "rank", "densityNorm", "cor_affixNorm", 
          "correlationsNorm", "path_counts", "path_sumNorm", "path_entropiesNorm", "lwlrNorm")
varnames = c("sDurLog", "Affix", "speakingRateLog", "pauseBin", "biphoneProbSumBin", "folType", "monoMultilingual",
             "cor_maxNorm", "cor_targetNorm", "l1normNorm", "l2normNorm", "rank", "densityNorm", "cor_affixNorm",
             "correlationsNorm", "path_counts", "path_sumNorm", "path_entropiesNorm", "lwlrNorm")

pseudo.cforest = cforest(sDurLog ~ ., data = data[, varnames])
pseudo.varimp = varimp(pseudo.cforest)
dotplot(sort(pseudo.varimp), xlab = "Relative variable importance for predicting /s/ duration", col = "#a2b98d", main = expression("Relative variable importance for /s/ duration"))

8.4.2 Forest 2: LDL variables

preds.ldl = c(
          "cor_maxNorm", "cor_targetNorm", "l1normNorm", "l2normNorm", "rank", "densityNorm", "cor_affixNorm", 
          "correlationsNorm", "path_counts", "path_sumNorm", "path_entropiesNorm", "lwlrNorm")
varnames.ldl = c("sDurLog",  
             "cor_maxNorm", "cor_targetNorm", "l1normNorm", "l2normNorm", "rank", "densityNorm", "cor_affixNorm", 
             "correlationsNorm", "path_counts", "path_sumNorm", "path_entropiesNorm", "lwlrNorm")

pseudo.cforest.ldl = cforest(sDurLog ~ ., data = data[, varnames.ldl])
pseudo.varimp.ldl = varimp(pseudo.cforest.ldl)
dotplot(sort(pseudo.varimp.ldl), xlab = "Relative variable importance for predicting /s/ duration", col = "#a2b98d", main = expression("Relative variable importance for /s/ duration"))

8.5 Correlations 2

Taking a closer look at the LDL variables, we find that some of them are highly correlated:

pairscor.fnc(data[,c("sDurLog", "Affix",
                     "cor_targetNorm", "lwlrNorm", "path_sumNorm", "correlationsNorm", "cor_maxNorm", "densityNorm", "cor_affixNorm",  
                     "l1normNorm", "rank", "path_counts", "path_entropiesNorm", "l2normNorm" )])

We find the following highly correlated variable combinations:

  • correlationsNorm vs. cor_targetNorm, \(rho=0.92\)
  • path_sumNorm vs. lwlrNorm, \(rho=-0.87\)
  • l1normNorm vs. l2normNorm, \(rho=0.93\)
  • path_counts vs. path_entropiesNorm, \(rho=0.91\)
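
These values can be double-checked directly (a sketch, assuming the reported rho values are Spearman correlations as reported by pairscor.fnc):

cor(data$correlationsNorm, data$cor_targetNorm, method = "spearman")
cor(data$path_sumNorm, data$lwlrNorm, method = "spearman")
cor(data$l1normNorm, data$l2normNorm, method = "spearman")
cor(data$path_counts, data$path_entropiesNorm, method = "spearman")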

Thus, we need to exclude some of them so we do not run into problems of collinearity later on.

correlationsNorm vs. cor_targetNorm

Create two simple models containing either correlationsNorm or cor_targetNorm. Then, compare them using anova().

mdl.test.correlationsNorm <- lmer(sDurLog ~ correlationsNorm + (1 | speaker), data, REML=F)
mdl.test.cor_targetNorm <- lmer(sDurLog ~ cor_targetNorm + (1 | speaker), data, REML=F)

anova(mdl.test.correlationsNorm, mdl.test.cor_targetNorm)
Data: data
Models:
mdl.test.correlationsNorm: sDurLog ~ correlationsNorm + (1 | speaker)
mdl.test.cor_targetNorm: sDurLog ~ cor_targetNorm + (1 | speaker)
                          npar    AIC    BIC  logLik deviance Chisq Df
mdl.test.correlationsNorm    4 401.79 419.90 -196.89   393.79         
mdl.test.cor_targetNorm      4 402.12 420.24 -197.06   394.12     0  0
                          Pr(>Chisq)
mdl.test.correlationsNorm           
mdl.test.cor_targetNorm            1

No difference; keep cor_targetNorm because it is less correlated to other variables.

path_sumNorm vs. lwlrNorm

Create two simple models containing either path_sumNorm or lwlrNorm. Then, compare them using anova().

mdl.test.path_sumNorm <- lmer(sDurLog ~ path_sumNorm + (1 | speaker), data, REML=F)
mdl.test.lwlrNorm <- lmer(sDurLog ~ lwlrNorm + (1 | speaker), data, REML=F)

anova(mdl.test.path_sumNorm, mdl.test.lwlrNorm)
Data: data
Models:
mdl.test.path_sumNorm: sDurLog ~ path_sumNorm + (1 | speaker)
mdl.test.lwlrNorm: sDurLog ~ lwlrNorm + (1 | speaker)
                      npar    AIC    BIC  logLik deviance Chisq Df Pr(>Chisq)
mdl.test.path_sumNorm    4 402.64 420.75 -197.32   394.64                    
mdl.test.lwlrNorm        4 402.91 421.02 -197.46   394.91     0  0          1

No difference; keep lwlrNorm because it is less correlated to other variables.

l1normNorm vs. l2normNorm

Create two simple models containing either l1normNorm or l2normNorm. Then, compare them using anova().

mdl.test.l1normNorm <- lmer(sDurLog ~ l1normNorm + (1 | speaker), data, REML=F)
mdl.test.l2normNorm <- lmer(sDurLog ~ l2normNorm + (1 | speaker), data, REML=F)

anova(mdl.test.l1normNorm, mdl.test.l2normNorm)
Data: data
Models:
mdl.test.l1normNorm: sDurLog ~ l1normNorm + (1 | speaker)
mdl.test.l2normNorm: sDurLog ~ l2normNorm + (1 | speaker)
                    npar    AIC    BIC  logLik deviance Chisq Df Pr(>Chisq)
mdl.test.l1normNorm    4 402.98 421.09 -197.49   394.98                    
mdl.test.l2normNorm    4 403.18 421.29 -197.59   395.18     0  0          1

No difference; keep l1normNorm because it is less correlated to other variables.

path_counts vs. path_entropiesNorm

Create two simple models containing either path_counts or path_entropiesNorm. Then, compare them using anova().

mdl.test.path_counts <- lmer(sDurLog ~ path_counts + (1 | speaker), data, REML=F)
mdl.test.path_entropiesNorm <- lmer(sDurLog ~ path_entropiesNorm + (1 | speaker), data, REML=F)

anova(mdl.test.path_counts, mdl.test.path_entropiesNorm)
Data: data
Models:
mdl.test.path_counts: sDurLog ~ path_counts + (1 | speaker)
mdl.test.path_entropiesNorm: sDurLog ~ path_entropiesNorm + (1 | speaker)
                            npar    AIC    BIC  logLik deviance Chisq Df
mdl.test.path_counts           4 401.57 419.69 -196.79   393.57         
mdl.test.path_entropiesNorm    4 402.66 420.77 -197.33   394.66     0  0
                            Pr(>Chisq)
mdl.test.path_counts                  
mdl.test.path_entropiesNorm          1

No difference; keep path_entropiesNorm because it is less correlated to other variables.

8.6 Mixed Effects Models

Let’s create a full model containing all traditional control variables as well as all non-correlated LDL variables.

mdl.before.step <- lmer(sDurLog ~ 
                          #Affix +
                          speakingRate +
                          pauseBin +
                          folType +
                          # correlationsNorm +
                          # path_sumNorm +
                          # l2normNorm +
                          # path_countsNorm +
                          lwlrNorm +
                          cor_targetNorm +
                          cor_affixNorm +
                          cor_maxNorm +
                          densityNorm +
                          l1normNorm +
                          rank +
                          path_entropiesNorm +
                          (1 | speaker)+
                          (1 | transcription),
                        data=data,
                        REML=FALSE)

Checking this full model for significance of fixed effects proves to result in a rather sad picture:

anova(mdl.before.step)
Type III Analysis of Variance Table with Satterthwaite's method
                    Sum Sq Mean Sq NumDF  DenDF  F value    Pr(>F)    
speakingRate        1.3005  1.3005     1 662.48  19.4765 1.189e-05 ***
pauseBin           10.0826 10.0826     1 660.77 150.9964 < 2.2e-16 ***
folType             3.4696  0.8674     4 631.25  12.9902 3.591e-10 ***
lwlrNorm            0.0223  0.0223     1  87.00   0.3339    0.5648    
cor_targetNorm      0.0016  0.0016     1 394.81   0.0240    0.8770    
cor_affixNorm       0.0115  0.0115     1 163.03   0.1717    0.6792    
cor_maxNorm         0.1510  0.1510     1  24.07   2.2608    0.1457    
densityNorm         0.1489  0.1489     1  22.81   2.2293    0.1491    
l1normNorm          0.0054  0.0054     1  31.48   0.0814    0.7773    
rank                0.0821  0.0821     1  25.23   1.2295    0.2780    
 [ reached getOption("max.print") -- omitted 1 row ]
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

Let’s see whether reducing the number of fixed effects results in any significant LDL measures:

step(mdl.before.step)
Backward reduced random-effect table:

                    Eliminated npar   logLik    AIC     LRT Df Pr(>Chisq)    
<none>                           18  -97.931 231.86                          
(1 | transcription)          1   17  -98.468 230.94   1.073  1     0.3003    
(1 | speaker)                0   16 -193.416 418.83 189.897  1     <2e-16 ***
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

Backward reduced fixed-effect table:
Degrees of freedom method: Satterthwaite 

                   Eliminated  Sum Sq Mean Sq NumDF  DenDF  F value    Pr(>F)
cor_targetNorm              1  0.0058  0.0058     1 645.07   0.0850    0.7708
l1normNorm                  2  0.0128  0.0128     1 645.04   0.1892    0.6637
lwlrNorm                    3  0.0115  0.0115     1 648.19   0.1701    0.6802
path_entropiesNorm          4  0.0473  0.0473     1 644.40   0.6968    0.4042
cor_affixNorm               5  0.0511  0.0511     1 647.07   0.7513    0.3864
rank                        6  0.0735  0.0735     1 645.57   1.0804    0.2990
cor_maxNorm                 7  0.1825  0.1825     1 645.08   2.6760    0.1024
densityNorm                 8  0.0980  0.0980     1 646.03   1.4306    0.2321
speakingRate                0  1.3916  1.3916     1 683.99  20.2607 7.942e-06
                      
cor_targetNorm        
l1normNorm            
lwlrNorm              
path_entropiesNorm    
cor_affixNorm         
rank                  
cor_maxNorm           
densityNorm           
speakingRate       ***
 [ reached getOption("max.print") -- omitted 2 rows ]
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

Model found:
sDurLog ~ speakingRate + pauseBin + folType + (1 | speaker)

No: none of the LDL measures appears to be significantly predictive of /s/ duration.

Taking a closer look at the distributions of all LDL measures, we find a likely reason why they do not show any effects: each measure takes on only a small number of unique values (cf. the bestNormalize summaries above), so their distributions are far from continuous.

[Distribution plots of cor_targetNorm, cor_affixNorm, cor_maxNorm, densityNorm, l1normNorm, rank, and path_entropiesNorm omitted here.]
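
A sketch for recreating such distribution plots with base graphics (the original figures may have been produced differently):

## histograms of the LDL measures entering the model
par(mfrow = c(2, 4))
for (v in c("cor_targetNorm", "cor_affixNorm", "cor_maxNorm", "densityNorm",
            "l1normNorm", "rank", "path_entropiesNorm")) {
  hist(data[[v]], main = v, xlab = v)
}
par(mfrow = c(1, 1))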

9 More Measures

It appears that the measures obtained by learning comprehension and production of real words and pseudowords are not predictive of /s/ duration in pseudowords.

However, there is one ray of hope left: Chuang et al. (2020) use a number of measures which are not contained in our comprehension and production measures, namely:

  • ALDC - Average Levenshtein Distance of Candidates
  • SCPP - Semantic Correlation of Predicted Production
  • EDNN - Euclidean Distance from Nearest Neighbour
  • NNC - Nearest Neighbour Correlation
  • ALC - Average Lexical Correlation
  • DRC - Dual Route Consistency

Let’s see whether these measures are predictive of /s/ duration in our pseudowords.

9.1 ALDC

The Average Levenshtein Distance of all Candidate productions from the true pronunciation of a given pseudoword.

Higher ALDC values indicate that the candidate forms are very different from the intended pronunciation, pointing to a sparse form neighborhood.

# load stringdist package to calculate Levenshtein distances
library(stringdist)

# creating a function to obtain the ALDC
ALDC <- function (prod_acc, data) {
  if (is.null(prod_acc)) {
    stop(call=F, geterrmessage = "prod_acc not found\n")
  }
  if (is.null(data)) {
    stop(call=F, geterrmessage = "data not found\n")
  }
  result <- vector("list",length(prod_acc[["full"]]))
  for (i in 1:length(prod_acc[["full"]]))
  {
    result[[i]] <- stringdist::stringdist(
      prod_acc[["full"]][[i]][["word"]],
      prod_acc[["full"]][[i]][["candidates"]],
      method = c("lv"),
      useBytes = FALSE,
      weight = c(d = 1, i = 1, s = 1, t = 1),
      q = 1,
      p = 0,
      bt = 0,
      nthread = getOption("sd_num_thread"))
  }
  Word = data$Word
  Base = data$Base
  ALDC <- data.frame(matrix(vector(), 0, 1, dimnames=list(c(), c("ALDC"))), stringsAsFactors=F)
  for (j in 1:length(prod_acc[["full"]]))
  {
    ALDC[j,1] <- mean(result[[j]])
  }
  ALDCframe <- as.data.frame(cbind(Word, Base, ALDC))
}

# using the function
ALDCframe <- ALDC(prod_acc = comb.prod_acc, data = comb.data)

9.2 SCPP

The maximum correlation between a word’s predicted semantic vector and any of the semantic vectors of its candidate forms, i.e. the Semantic Correlation of Predicted Production.

Higher SCPP values indicate that the semantics of the estimated form better approximate the estimated meaning.

# creating a function to obtain the SCPP
SCPP <- function (comp_measures, data) {
  if (is.null(comp_measures)) {
    stop(call=F, geterrmessage = "comp_measures not found\n")
  }
  if (is.null(data)) {
    stop(call=F, geterrmessage = "data not found\n")
  }
  SCPP = comp_measures$cor_max
  Word = data$Word
  Base = data$Base
  SCPPframe <- as.data.frame(cbind(Word, Base, SCPP))
}

# using the function
SCPPframe <- SCPP(comp_measures = comb.comp_measures, data = comb.data)

9.3 EDNN

The Euclidean Distance between a word's estimated semantic vector and its Nearest Neighbour.

Higher EDNN values indicate that a word's estimated semantic vector is more distant from its nearest neighbour.

# load FNN package for the nearest neighbour search (get.knnx)
library(FNN)

# creating a function to obtain the EDNN
EDNN <- function (comprehension, data) {
  if (is.null(comprehension)) {
    stop(call=F, geterrmessage = "comprehension not found\n")
  }
  if (is.null(data)) {
    stop(call=F, geterrmessage = "data not found\n")
  }
  euclidian <- get.knnx(comprehension$S, comprehension$Shat, k=1)
  EDNN <- as.data.frame(euclidian[["nn.dist"]])
  colnames(EDNN) <- c("EDNN")
  Word = data$Word
  Base = data$Base
  EDNNframe <- as.data.frame(cbind(Word, Base, EDNN))
}

# using the function
EDNNframe <- EDNN(comprehension = comb.comp, data = comb.data)

9.4 NNC

The maximum of the correlations between a pseudoword’s estimated semantic vector and words’ semantic vectors, i.e. a pseudoword’s Nearest Neighbour Correlation.

Higher NNC values indicate that a pseudoword’s meaning is similar to that of a real word.

# creating a function to obtain the NNC
NNC <- function (pseudo_S_matrix, real_S_matrix, pseudo_word_data, real_word_data) 
{
  if (is.null(pseudo_S_matrix)) {
    stop(call=F, geterrmessage = "pseudo_S_matrix not found\n")
  }
  if (is.null(real_S_matrix)) {
    stop(call=F, geterrmessage = "real_S_matrix not found\n")
  }
  if (is.null(pseudo_word_data)) {
    stop(call=F, geterrmessage = "pseudo_word_data not found\n")
  }
  if (is.null(real_word_data)) {
    stop(call=F, geterrmessage = "real_word_data not found\n")
  }
  results <- data.frame(matrix(ncol = 1, nrow = 0))
  x <- colnames(results[1])
  names(results)[names(results) == x] <- "V1"
  for (i in 1:nrow(pseudo_S_matrix))
  {
    for (j in 1:nrow(real_S_matrix))
    {
      results[i,j] <- cor(pseudo_S_matrix[i,], real_S_matrix[j,], method = "pearson")
    }
  }
  NNC <- apply(results, 1, max)
  vectors_wV <- colnames(results)[apply(results,1,which.max)]
  vectors <- as.numeric(gsub('V', '', vectors_wV))
  Base <- pseudo_word_data$Base
  Pseudoword <- pseudo_word_data$Word
  Word <- real_word_data[vectors,1]
  NNCframe <- as.data.frame(cbind(NNC, vectors, Base, Pseudoword, Word))
}

# using the function
NNCframe <- NNC(pseudo_S_matrix = engs.Smat, real_S_matrix = comb.Smat, pseudo_word_data = engs.data, real_word_data = comb.data)

9.5 ALC

The mean correlation of a pseudoword's estimated semantic vector with the semantic vectors of all real words, i.e. its Average Lexical Correlation.

Higher ALC values indicate that a pseudoword vector is part of a denser semantic neighbourhood.

# creating a function to obtain the ALC
ALC <- function (pseudo_S_matrix, real_S_matrix, pseudo_word_data) 
{
  if (is.null(pseudo_S_matrix)) {
    stop(call=F, geterrmessage = "pseudo_S_matrix not found\n")
  }
  if (is.null(real_S_matrix)) {
    stop(call=F, geterrmessage = "real_S_matrix not found\n")
  }
  if (is.null(pseudo_word_data)) {
    stop(call=F, geterrmessage = "pseudo_word_data not found\n")
  }
  results <- data.frame(matrix(ncol = 1, nrow = 0))
  for (i in 1:length(row.names(pseudo_S_matrix)))
  {
    for (j in 1:length(row.names(real_S_matrix)))
    {
      results[i,j] <- cor(pseudo_S_matrix[i,], real_S_matrix[j,], method = "pearson")
    }
  }
  ALC <- rowMeans(results)
  Pseudoword <- pseudo_word_data$Word
  Base <- pseudo_word_data$Base
  ALCframe <- as.data.frame(cbind(ALC, Pseudoword, Base))
}

# using the function
ALCframe <- ALC(pseudo_S_matrix = engs.Smat, real_S_matrix = mald.Smat, pseudo_word_data = engs.data)

9.6 DRC

The correlation between the semantic vector estimated from the direct route and that from the indirect route, i.e. the Dual Route Consistency of a word’s semantic vectors.

Higher DRC values indicate that the semantic vectors produced by the two routes are more similar to each other.

As we need estimated semantic vectors for both the direct and the indirect route, we first have to compute the estimated semantic vectors for the indirect route. The estimated semantic vectors for the direct route were computed earlier (see 4.1).

## get cues for INDIRECT route
comb.cues.indirect = make_cue_matrix(data = comb.data, formula = ~ Word + Affix, grams = 3, cueForm = "DISC", indirectRoute = TRUE)

## get mapping for comprehension with INDIRECT route cues
comb.comp.indirect = learn_comprehension(cue_obj = comb.cues.indirect, S = comb.Smat2)

## solve equations to obtain the S matrix based on INDIRECT cues
Tmat = comb.cues.indirect$matrices$T

## least-squares mapping from the indirect-route cue matrix C to T: K = (C'C)^- C'T
X = MASS::ginv(t(comb.cues.indirect$matrices$C) %*% comb.cues.indirect$matrices$C) %*% t(comb.cues.indirect$matrices$C)
K = X %*% comb.cues.indirect$matrices$T

## least-squares mapping from T to the semantic matrix: H = (T'T)^- T'S
H = MASS::ginv(t(Tmat) %*% Tmat) %*% t(Tmat) %*% comb.Smat2

## estimated T and, via H, the indirect-route estimate of the semantic matrix
That = comb.cues.indirect$matrices$C %*% K
Shat = That %*% H

Now, Shat is the matrix containing the estimated semantic vectors of words and pseudowords based on the indirect route. Correlating its rows with the corresponding rows of the Shat matrix contained in the comprehension object obtained earlier via learn_comprehension (i.e. the direct-route estimates), we can compute DRC:

# creating a function to obtain the DRC
DRC <- function (comprehension, S_matrix_indirect, data) 
{
  if (is.null(comprehension)) {
    stop("comprehension not found", call. = FALSE)
  }
  if (is.null(S_matrix_indirect)) {
    stop("S_matrix_indirect not found", call. = FALSE)
  }
  if (is.null(data)) {
    stop("data not found", call. = FALSE)
  }
  results <- data.frame(matrix(ncol = 1, nrow = 0))
  for (i in 1:nrow(S_matrix_indirect))
  {
    results[i,1] <- cor(S_matrix_indirect[i,], comprehension$Shat[i,], method = "pearson")
  }
  DRC <- rowMeans(results)
  Word <- data$Word
  Base <- data$Base
  DRCframe <- data.frame(DRC, Word, Base)
  return(DRCframe)
}

# using the function
DRCframe <- DRC(comprehension = comb.comp, S_matrix_indirect = Shat, data = comb.data)
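
Here, too, the loop can be avoided: DRC is the correlation of matching rows of the two S matrices. A small row-wise sketch, assuming (as the function above does) that the rows of Shat and comb.comp$Shat are in the same order; DRC.vec is an illustrative name:

## row-wise sketch: correlate indirect-route and direct-route semantic vectors per word
DRC.vec <- sapply(seq_len(nrow(Shat)),
                  function(i) cor(Shat[i, ], comb.comp$Shat[i, ], method = "pearson"))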

9.7 Package

The functions to compute and extract ALDC, SCPP, EDNN, NNC, ALC, and DRC are now available in the LDLConvFunctions package, which can be found here: https://github.com/dosc91/LDLConvFunctions

Install this package through devtools:

devtools::install_github("dosc91/LDLConvFunctions", upgrade_dependencies = FALSE)
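
Once installed, the measures computed manually above should be obtainable directly from the package. The call below is only a sketch and assumes that the packaged NNC() keeps the argument names used in this script:

library(LDLConvFunctions)

## presumably equivalent to the manual NNC computation above
NNCframe <- NNC(pseudo_S_matrix = engs.Smat, real_S_matrix = comb.Smat,
                pseudo_word_data = engs.data, real_word_data = comb.data)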

10 Analysis II

Attention:
The following section (i.e. chapter 10) is no longer up to date. Please check this link for a more recent analysis.

10.1 Data

Combine previously used data (see 8.1) with newly computed and extracted measures (see 9).

# ALDC
ALDCframe <- read.csv("chuang_measures/ALDCframe_new.csv")[2:4]

# SCPP
SCPPframe <- read.csv("chuang_measures/SCPPframe_new.csv")[2:4]

# EDNN
EDNNframe <- read.csv("chuang_measures/EDNNframe_new.csv")[2:4]

# NNC
NNCframe <- read.csv("chuang_measures/NNCframe_new.csv")[2:6]

# ALC
ALCframe <- read.csv("chuang_measures/ALCframe_new.csv")[2:4]

# DRC
DRCframe <- read.csv("chuang_measures/DRCframe_new.csv")[2:4]


## merge frames which contain info on all words
comb1 <- merge(ALDCframe, SCPPframe, by=c("Base", "Word"))
comb2 <- merge(comb1, EDNNframe, by=c("Base", "Word"))
ALDC_SCPP_EDNN_DRC_frame <- merge(comb2, DRCframe, by=c("Base", "Word"))

## filter for pseudowords
engs.pseudowords <- c("bloufs", "blouks", "bloups", "blouts",
                      "cloofs", "clooks", "cloops", "cloots",
                      "glaifs", "glaiks", "glaips", "glaits",
                      "glifs",  "gliks",  "glips",  "glits",
                      "pleefs", "pleeks", "pleeps", "pleets",
                      "prufs",  "pruks",  "prups",  "pruts")

ALDC_SCPP_EDNN_DRC_frame <- subset(ALDC_SCPP_EDNN_DRC_frame, Word %in% engs.pseudowords)


# merge NNC and ALC
NNC_ALC_frame <- merge(NNCframe, ALCframe, by=c("Base", "Pseudoword"))

# rename columns
names(NNC_ALC_frame)[names(NNC_ALC_frame) == "Word"] <- "RealWord"
names(NNC_ALC_frame)[names(NNC_ALC_frame) == "Pseudoword"] <- "Word"


# merge all frames
all_frames <- merge(ALDC_SCPP_EDNN_DRC_frame, NNC_ALC_frame, by=c("Base", "Word"))


# merge rest of data with new measures
data <- merge(data, all_frames, by=c("Base", "Word"))

10.2 Correlations I

Correlations of sDurLog, Affix, NeighbourhoodDensity, and all new measures.

pairscor.fnc(data[,c("sDurLog", "Affix", "NeighbourhoodDensity", "ALDC", "DRC", "NNC", "ALC", "SCPP", "EDNN")])

We find the following notable correlations:

  • DRC vs. EDNN, \(\rho=0.43\)
  • NNC vs. ALC, \(\rho=0.76\)
  • ALC vs. SCPP, \(\rho=0.35\)
  • SCPP vs. EDNN, \(\rho=-0.5\)

Thus, we need to address these correlations before we run into collinearity problems later on.
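
One way to quantify this risk, in addition to eyeballing the pairwise correlations, would be variance inflation factors. The sketch below fits a simple linear model of sDurLog on the numeric measures and checks their VIFs; it assumes the car package is available and is only a supplementary diagnostic, as the PCA below is the remedy actually adopted here.

## supplementary collinearity check via variance inflation factors (assumes the car package)
library(car)
vif(lm(sDurLog ~ ALDC + DRC + NNC + ALC + SCPP + EDNN, data = data))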

10.3 PCA

Let’s attempt a principal component analysis. First, we extract the correlated variables and merge them to form a new data frame.

pca.data <- data[, c("DRC", "EDNN", "NNC", "SCPP", "ALC")]

Compute a principal component analysis using the princomp function; with cor=TRUE, the PCA is based on the correlation matrix, i.e. the variables are standardised first:

pc <- princomp(pca.data, cor=TRUE, score=TRUE)

summary(pc)
Importance of components:
                          Comp.1    Comp.2    Comp.3     Comp.4     Comp.5
Standard deviation     1.4542451 1.2254515 0.8792584 0.64182608 0.44542541
Proportion of Variance 0.4229658 0.3003463 0.1546191 0.08238814 0.03968076
Cumulative Proportion  0.4229658 0.7233120 0.8779311 0.96031924 1.00000000

There are several possible selection criteria; here, we take components 1, 2, and 3 for further analysis, as together they capture about 88% of the variance.
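
The cumulative proportions reported by summary(pc) can be recomputed directly from the component standard deviations, which makes this cut-off explicit:

## cumulative proportion of variance explained by the components
cumsum(pc$sdev^2) / sum(pc$sdev^2)
#    Comp.1    Comp.2    Comp.3    Comp.4    Comp.5 
# 0.4229658 0.7233120 0.8779311 0.9603192 1.0000000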

plot(pc)

plot(pc, type="l")

biplot(pc)

The loadings of our PCA components are the following:

pc$loadings

Loadings:
     Comp.1 Comp.2 Comp.3 Comp.4 Comp.5
DRC   0.318  0.443  0.762  0.293  0.191
EDNN  0.381  0.555 -0.172 -0.702 -0.158
NNC  -0.482  0.491 -0.269         0.673
SCPP -0.492 -0.241  0.561 -0.615       
ALC  -0.528  0.444         0.205 -0.692

               Comp.1 Comp.2 Comp.3 Comp.4 Comp.5
SS loadings       1.0    1.0    1.0    1.0    1.0
Proportion Var    0.2    0.2    0.2    0.2    0.2
Cumulative Var    0.2    0.4    0.6    0.8    1.0

What do these loadings conceptually mean?

  • ALC: Comp.1 = -0.528 (sparse semantic neighbourhood); Comp.2 = 0.444 (dense semantic neighbourhood); Comp.3 ≈ 0 (no clear contribution)
  • DRC: Comp.1 = 0.318 (direct and indirect route are consistent); Comp.2 = 0.443 (direct and indirect route are consistent); Comp.3 = 0.762 (direct and indirect route are consistent)
  • EDNN: Comp.1 = 0.381 (semantically distant from nearest neighbour); Comp.2 = 0.555 (semantically distant from nearest neighbour); Comp.3 = -0.172 (semantically close to nearest neighbour)
  • NNC: Comp.1 = -0.482 (pseudoword meaning not similar to a real word); Comp.2 = 0.491 (pseudoword meaning similar to a real word); Comp.3 = -0.269 (pseudoword meaning not similar to a real word)
  • SCPP: Comp.1 = -0.492 (estimated form and meaning do not match well); Comp.2 = -0.241 (estimated form and meaning do not match well); Comp.3 = 0.561 (estimated form and meaning match well)

Taking the results of the PCA, we combine our data set and the newly calculated component scores:

data <- cbind(data, pc$scores)

10.4 Correlations II

Checking sDurLog, Affix, NeighbourhoodDensity, ALDC, and the newly calculated PCA components for correlations, we find no problematic ones:

pairscor.fnc(data[,c("sDurLog", "Affix", "NeighbourhoodDensity", "ALDC", "Comp.1", "Comp.2", "Comp.3")])

10.5 Models

10.5.1 LMER: Simple Model

mdl.simple <- lmer(sDurLog ~ 
                   ALDC+
                   Comp.1 +
                   Comp.2 +
                   Comp.3 +
                   (1|speaker),
                 data=data, REML=F)

anova(mdl.simple)
Type III Analysis of Variance Table with Satterthwaite's method
         Sum Sq  Mean Sq NumDF  DenDF F value Pr(>F)
ALDC   0.065581 0.065581     1 644.18  0.7358 0.3913
Comp.1 0.007051 0.007051     1 645.22  0.0791 0.7786
Comp.2 0.000277 0.000277     1 646.22  0.0031 0.9556
Comp.3 0.025287 0.025287     1 644.81  0.2837 0.5945

It appears none of the variables is significant for /s/ duration prediction.

10.5.2 LMER: Control Variables

mdl.control <- lmer(sDurLog ~
                   Affix + 
                   ALDC +
                   Comp.1 +
                   Comp.2 +
                   Comp.3 +
                   SylPerMin +
                   pauseBin +
                   folType +
                   (1|speaker),
                 data=data, REML=F)

anova(mdl.control)
Type III Analysis of Variance Table with Satterthwaite's method
           Sum Sq Mean Sq NumDF  DenDF  F value    Pr(>F)    
Affix      1.0285  1.0285     1 670.25  32.9533 1.431e-08 ***
ALDC       0.0010  0.0010     1 648.61   0.0330 0.8559558    
Comp.1     0.0037  0.0037     1 652.44   0.1192 0.7299856    
Comp.2     0.0168  0.0168     1 652.29   0.5390 0.4630936    
Comp.3     0.1507  0.1507     1 649.49   4.8270 0.0283696 *  
SylPerMin 27.3285 27.3285     1 581.36 875.5698 < 2.2e-16 ***
pauseBin   1.5566  1.5566     1 676.71  49.8717 4.083e-12 ***
folType    0.6112  0.1528     4 652.61   4.8956 0.0006788 ***
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

It appears that Comp.3 is significant for /s/ duration prediction.

10.5.3 GAM: Control Variables

gam.control <- gam(sDurLog ~ 
                      Affix +
                      s(ALDC, k=9)+
                      s(Comp.1) +
                      s(Comp.2) +
                      s(Comp.3) +
                      SylPerMin+
                      pauseBin+
                      folType+
                      speaker,
                    data=data)

anova(gam.control)

Family: gaussian 
Link function: identity 

Formula:
sDurLog ~ Affix + s(ALDC, k = 9) + s(Comp.1) + s(Comp.2) + s(Comp.3) + 
    SylPerMin + pauseBin + folType + speaker

Parametric Terms:
          df       F  p-value
Affix      1  28.438 1.36e-07
SylPerMin  1 764.534  < 2e-16
pauseBin   1  36.096 3.20e-09
folType    4   4.622   0.0011
speaker   39   6.342  < 2e-16

Approximate significance of smooth terms:
            edf Ref.df     F  p-value
s(ALDC)   4.207  4.914 2.142 0.054426
s(Comp.1) 8.209  8.808 3.613 0.000416
s(Comp.2) 1.040  1.073 1.504 0.211859
s(Comp.3) 1.000  1.000 5.224 0.022601

It appears that Comp.1 and Comp.3 are significant for /s/ duration prediction.

11 Next Steps

Where do we go from here? Do we stick with the PCA components, and if so, do we keep 3, only 2, or even 4? What type of modelling do we wish to use: LMER, GAMs, or both, and why? Affix is still significant; thus, the LDL measures cannot completely capture its effect. Do we have an explanation for this?

My next steps:

  1. Find out how to use CELEX.
  2. Redo the entire LDL modelling (i.e. comprehension and production) with more real word data (see 1.).
  3. Combine “old” and “new” LDL measures (hint: correlations are all over the place, resulting in 7+ PCA components…).
  4. Go completely mad.

References

Baayen, R. H., Chuang, Y. Y., & Blevins, J. P. (2018). Inflectional morphology with linear mappings. The Mental Lexicon, 13(2), 232-270. doi.org/10.1075/ml.18010.baa

Baayen, R. H., Chuang, Y. Y., Shafaei-Bajestan, E., & Blevins, J. P. (2019). The discriminative lexicon: A unified computational model for the lexicon and lexical processing in comprehension and production grounded not in (de)composition but in linear discriminative learning. Complexity, 2019, 1-39. doi.org/10.1155/2019/4895891

Chuang, Y.-Y., Vollmer, M.-L., Shafaei-Bajestan, E., Gahl, S., Hendrix, P., & Baayen, R. H. (2020). The processing of pseudoword form and meaning in production and comprehension: A computational modeling approach using Linear Discriminative Learning. Behavior Research Methods, 1-51. doi.org/10.3758/s13428-020-01356-w

Nykamp, D. (2020, November 10). Multiplying matrices and vectors. Math Insight. mathinsight.org/matrix_vector_multiplication

Schmitz, D., Baer-Henney, D., & Plag, I. (submitted). The duration of word-final /s/ differs across morphological categories in English: Evidence from pseudowords.

Tucker, B. V., Brenner, D., Danielson, D. K., Kelley, M. C., Nenadic, F., & Sims, M. (2018). The massive auditory lexical decision (MALD) database. Behavior Research Methods, 1-18. doi.org/10.3758/s13428-018-1056-1