This script documents the analysis of the ENGS production data and MALD corpus data with the help of Linear Discriminative Learning. This is a work in progress.

The secret hero of this quest:
beepr: This package contains one function, beep(), with one purpose: to make it easy to play notification sounds on whatever platform you are on. It is intended to be useful, for example, if you are running a long analysis in the background and want to know when it is ready.

Analysis                    Mean   SD
learn_comprehension()         78   11
accuracy_comprehension()     253   23
learn_production()            74    9
accuracy_production()        526   31

Processing times for comprehension and production mappings and accuracy.
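
Timings like the ones in the table can be collected with base R's system.time(). The following is only a minimal sketch: the helper name time_step(), the number of repetitions, and the toy workload are assumptions for illustration and are not taken from the original script.

```r
# Hypothetical helper: call `f` n times and return the elapsed
# wall-clock time of each call in seconds.
time_step <- function(f, n = 5) {
  vapply(seq_len(n), function(i) system.time(f())[["elapsed"]], numeric(1))
}

# Toy workload standing in for a long-running call such as learn_comprehension()
toy_step <- function() sum(sqrt(seq_len(1e5)))

timings <- time_step(toy_step, n = 5)
c(Mean = mean(timings), SD = sd(timings))
```

In the real analysis, f would wrap one of the LDL calls, and beep() could be called once the timings are done.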

1 Data

We will use a combined data set: a subset of the real word data of the MALD corpus as well as the pseudoword data of the ENGS production experiment. Both data files contain types, not tokens (see Simon’s LDL work).

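library(WpmWithLdl) # LDL functions used below, e.g. make_cue_matrix(), learn_comprehension()
library(beepr)      # beep()

mald.data <- read.csv("mald.subset.dfr.csv", header = TRUE, strip.white = TRUE)

data.engs.types <- read.csv("engs/final_S_data_LDL_types.csv", header = TRUE, strip.white = TRUE)

# rbind() requires both data frames to have the same columns
data.comb <- rbind(mald.data, data.engs.types)
data.comb <- data.comb[order(data.comb$Word), ]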
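
Combining the two files with rbind() only works if both data frames have the same column names. A minimal sketch with toy data; the column names and values are illustrative stand-ins, not the real MALD or ENGS files:

```r
# Toy stand-ins for mald.data and data.engs.types
real   <- data.frame(Word = c("cats", "box"), Affix = c("PL", NA),
                     stringsAsFactors = FALSE)
pseudo <- data.frame(Word = c("glips", "glip"), Affix = c("PL", NA),
                     stringsAsFactors = FALSE)

# rbind() on data frames matches columns by name and fails if the names differ
stopifnot(setequal(names(real), names(pseudo)))

combined <- rbind(real, pseudo)
combined <- combined[order(combined$Word), ]   # alphabetical, as for data.comb
combined$Word
```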

We include not only word-final /s/ words and suffixes, but also other affixes for more realistic (?) mappings:


       ABLE       AGAIN       AGENT COMPARATIVE  CONTINUOUS         DIS 
         46          48         232          11           6          37 
         EE        ENCE         FUL           I          IC  INSTRUMENT 
          6          23          47           1          32          59 
        ION         ISH         ISM         IST         ITY         IVE 
        305           9          26          18          88          72 
        IZE        LESS          LY        MENT         MIS        NESS 
         30          39         325          73           9          63 
        NOT     ORDINAL         OUS         OUT        OVER        PAST 
         31           8          43           7          14         120 
         PL         SUB SUPERLATIVE       UNDER        UNDO           Y 
         49           6           2           8           3         222 

2 A Matrix

The A matrix contains the semantic vectors of all lexomes in rows. But what about pseudowords? As we do not have any semantic information on the ENGS pseudowords, we assume, for now, that their semantic vectors are (almost) equal to zero.
A possible next step is to create semantic vectors for pseudowords as in Chuang et al. (2020); however, it is still unclear how exactly to do this.

load("data_simon/wL2Ld.rda") # Load the trimmed semantic matrix from TASA

First, create an A matrix for MALD real words:

mald.subset.allLexomes = unique(c(mald.data$Word, mald.data$Affix))
mald.subset.allLexomes = mald.subset.allLexomes[!is.na(mald.subset.allLexomes)]

mald.Amat = wL2Ld[mald.subset.allLexomes, ]

Then, create a matrix containing values close to zero for the pseudoword types:

MatrixA <- matrix(data = 0.0000000001, nrow = 48, ncol = 5487)

new_names <- c("bloufs",    "blouf",    "blouks",   "blouk",    "bloups",   "bloup",    "blouts",   "blout",    
               "cloofs",    "cloof",    "clooks",   "clook",    "cloops", "cloop",  "cloots",   "cloot",    
               "glaifs",    "glaif",    "glaiks",   "glaik",    "glaips",   "glaip",    "glaits",   "glait",    
               "glifs",   "glif"    , "gliks",  "glik",   "glips"   , "glip",     "glits",  "glit",
               "pleefs" , "pleef",  "pleeks",   "pleek",    "pleeps",   "pleep",    "pleets",   "pleet",
               "prufs",   "pruf",     "pruks"   , "pruk",   "prups" , "prup",     "pruts"   , "prut")

rownames(MatrixA) <- new_names

This creates a new matrix with \(48 \times 5487\) dimensions: 48 rows as there are 48 pseudowords; 5487 columns as this is the number of columns in the MALD A matrix.
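
The dimensions can also be derived from the data instead of being hard-coded. A minimal sketch with a toy matrix; A_real, A_pseudo and pseudo_names are hypothetical names used only for this illustration:

```r
# Toy A matrix standing in for mald.Amat (3 lexomes, 4 semantic dimensions)
set.seed(1)
A_real <- matrix(rnorm(12), nrow = 3,
                 dimnames = list(c("cat", "dog", "PL"), NULL))

pseudo_names <- c("glip", "glips")

# One near-zero row per pseudoword, with as many columns as the real A matrix
A_pseudo <- matrix(1e-10,
                   nrow = length(pseudo_names),
                   ncol = ncol(A_real),
                   dimnames = list(pseudo_names, colnames(A_real)))

dim(A_pseudo)
```

Written this way, the pseudoword matrix stays compatible with the A matrix if the word list or the semantic space changes.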

Next, combine the MALD A matrix with the fictional pseudoword matrix:

# split the MALD matrix into two parts
mald.Amat.upper <- mald.Amat[1:8328,]    # contains word lexome vectors
mald.Amat.lower <- mald.Amat[8329:8364,] # contains affix lexome vectors

# combine MALD word data with ENGS pseudoword data
mald.Amat.B.upper <- rbind(mald.Amat.upper, MatrixA)

# order alphabetically
mald.Amat.B.upper <- mald.Amat.B.upper[ order(row.names(mald.Amat.B.upper)), ]

# add affix vectors
mald.Amat.B <- rbind(mald.Amat.B.upper, mald.Amat.lower)

dim(mald.Amat.B)
# [1] 8412 5487
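
The split into word and affix rows relies on fixed row indices. As a sketch of an alternative, the rows could be selected by name, assuming the affix labels are known (for example from the Affix column). All object names and values below are toy assumptions:

```r
# Toy matrix: word lexomes first, affix lexomes last, as in mald.Amat
A <- matrix(seq_len(10), nrow = 5,
            dimnames = list(c("cat", "walk", "zebra", "PL", "PAST"), NULL))
affixes <- c("PL", "PAST")   # assumed to be known in advance

is_affix  <- rownames(A) %in% affixes
A_words   <- A[!is_affix, , drop = FALSE]
A_affixes <- A[is_affix, , drop = FALSE]

# One toy pseudoword with a near-zero vector
A_pseudo <- matrix(1e-10, nrow = 1, ncol = ncol(A),
                   dimnames = list("glip", NULL))

# Insert the pseudowords among the words, sort, then append the affix rows
A_upper <- rbind(A_words, A_pseudo)
A_upper <- A_upper[order(rownames(A_upper)), , drop = FALSE]
A_comb  <- rbind(A_upper, A_affixes)
rownames(A_comb)
```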

3 C Matrix

Next, we create the C matrix for our combined data set.

comb.cues = make_cue_matrix(data = data.comb, formula = ~ Word + Affix, grams = 3, wordform = "DISC")

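The C matrix contains the cues coded in binary: each row corresponds to a word form and each column to a triphone cue (grams = 3), with 1 marking the cues that occur in the word form and 0 all others.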
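
To make the binary coding concrete, here is a minimal base-R sketch that builds such a matrix for two toy forms. The helper triphones(), the "#" boundary marker and the toy transcriptions are assumptions for illustration; the package's own implementation may differ in detail.

```r
# Split a form into overlapping triphones, with "#" marking word boundaries
triphones <- function(form) {
  padded <- paste0("#", form, "#")
  n <- nchar(padded)
  substring(padded, 1:(n - 2), 3:n)
}

forms <- c("kats", "kat")   # toy transcriptions, not real DISC strings
cues  <- sort(unique(unlist(lapply(forms, triphones))))

# One row per word form, one column per triphone; 1 = triphone occurs in the form
C <- t(vapply(forms,
              function(f) as.integer(cues %in% triphones(f)),
              integer(length(cues))))
colnames(C) <- cues
C
```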

4 S Matrix

For the S matrix, semantic vectors will be constructed for the word forms from the lexomes in inputA as specified by the model formula.

comb.Smat = make_S_matrix(data = data.comb, formula = ~ Word + Affix, inputA = mald.Amat.B, grams = 3, wordform = "DISC")
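
As a rough illustration of what the formula ~ Word + Affix implies, the sketch below assumes that a word form's semantic vector is the sum of the vectors of its lexomes. The toy data and object names are hypothetical, and the sketch is not the package's implementation.

```r
set.seed(1)
# Toy A matrix: near-zero pseudoword vectors plus a random vector for PL
A <- rbind(glip  = rep(1e-10, 4),
           glips = rep(1e-10, 4),
           PL    = rnorm(4))

dat <- data.frame(Word  = c("glip", "glips"),
                  Affix = c(NA, "PL"),
                  stringsAsFactors = FALSE)

# For each word form, sum the vectors of its word lexome and (if present) its affix
S <- t(vapply(seq_len(nrow(dat)), function(i) {
  lexomes <- c(dat$Word[i], dat$Affix[i])
  lexomes <- lexomes[!is.na(lexomes)]
  colSums(A[lexomes, , drop = FALSE])
}, numeric(ncol(A))))
rownames(S) <- dat$Word
S
```

Under this additive assumption, the vector of a plural pseudoword is almost identical to the PL vector, because the pseudoword's own vector is close to zero.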

5 Comprehension

comb.comp = learn_comprehension(cue_obj = comb.cues, S = comb.Smat)
beep()

comb.comp_acc = accuracy_comprehension(m = comb.comp, data = data.comb, wordform = "DISC", show_rank = TRUE, neighborhood_density = TRUE)
beep()

comb.comp_acc$acc
# 0.9837631
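
Conceptually, comprehension in LDL is a linear mapping from cues to semantics, and accuracy is the proportion of words whose predicted vector is closest to their own target vector. The following base-R sketch illustrates the idea on toy matrices; pinv() and Fmat are hypothetical names, and the package functions may compute these steps differently.

```r
# Moore-Penrose pseudoinverse via SVD (base R only)
pinv <- function(M, tol = 1e-10) {
  s <- svd(M)
  keep <- s$d > tol
  s$v[, keep, drop = FALSE] %*% (t(s$u[, keep, drop = FALSE]) / s$d[keep])
}

set.seed(1)
C <- rbind(w1 = c(1, 1, 0, 0),   # toy binary cue matrix (3 words x 4 cues)
           w2 = c(0, 1, 1, 0),
           w3 = c(0, 0, 1, 1))
S <- matrix(rnorm(3 * 5), nrow = 3,
            dimnames = list(rownames(C), NULL))   # toy semantic vectors

Fmat <- pinv(C) %*% S   # comprehension mapping: cues -> semantics
Shat <- C %*% Fmat      # predicted semantic vectors

# A word counts as recognized if its predicted vector correlates most
# strongly with its own target vector
r <- cor(t(Shat), t(S))
predicted <- rownames(S)[apply(r, 1, which.max)]
mean(predicted == rownames(S))
```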

comb.comp_measures = comprehension_measures(comp_acc = comb.comp_acc, Shat = comb.comp$Shat, S = comb.Smat, A = mald.Amat.B, affixFunctions = data.comb$Affix)
beep()

6 Production

comb.prod = learn_production(cue_obj = comb.cues, S = comb.Smat, comp = comb.comp)
beep()

comb.prod_acc = accuracy_production(m = comb.prod, data = data.comb, wordform = "DISC", full_results = TRUE, return_triphone_supports = TRUE)
beep()

comb.prod_acc$acc
# 0.9711079

comb.prod_measures = production_measures(prod_acc = comb.prod_acc, S = comb.Smat, prod = comb.prod)
beep()

# Error in cor(comb.comp$Shat[positions[j], ], mald.Amat.B[af, ]) : incompatible dimensions

I am still not sure what causes this error. However, I was able to work around it by calculating the production measures manually:

h = function(v) {
  v = v[v > 0]
  p = v/sum(v)
  return(-sum(p * log2(p)))
}


values = sapply(comb.prod_acc$full,
                FUN = function(li) li$values[which(li$values == max(li$values))[1]])

correlations = unname(values)

lwlr = sapply(comb.prod_acc$full,
              FUN = function(li) li$length_weakest_link_ratio[which(li$values == max(li$values))[1]])

path_supports = lapply(comb.prod_acc$full, FUN = function(li) li$li)

path_counts = sapply(path_supports, FUN = function(li) length(li))

max_paths = unlist(sapply(comb.prod_acc$full,
                          FUN = function(li) which(li$values == max(li$values))[1]))

for (i in 1:length(path_supports)) {
  for (j in 1:length(path_supports[[i]])) {
    triphone = names(path_supports[[i]][[j]][1])
    path_supports[[i]][[j]][1] = comb.prod$production_matrices$Chat[i, 
                                                               triphone]
  }
}

path_summed_values = lapply(path_supports, FUN = function(li) sapply(li, 
                                                                     sum))

path_entropies = sapply(path_summed_values, h)

path_sum = path_entropies

for (i in 1:length(path_sum)) {
  path_sum[i] = path_summed_values[[i]][max_paths[i]]
}

res = data.frame(correlations = correlations, path_counts = path_counts, 
                 path_sum = path_sum, path_entropies = path_entropies, 
                 lwlr = lwlr)

comb.prod_measures <- res
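
The helper h() used above computes the Shannon entropy (in bits) over the positive path supports. A small worked example with toy support values:

```r
h <- function(v) {
  v <- v[v > 0]        # only positive supports contribute
  p <- v / sum(v)      # normalize to a probability distribution
  -sum(p * log2(p))    # Shannon entropy in bits
}

h(c(0.9, 0.9))   # two equally supported paths: 1 bit
h(c(1.2))        # a single path: 0 bits
h(c(2, 1, 1))    # one dominant path among three: 1.5 bits
```

Higher values thus indicate that the support is spread more evenly over several candidate paths.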

7 Saving measures and data set

write.csv(data.comb,"data.comb.csv", row.names = TRUE)
write.csv(comb.comp_measures,"comb.comp_measures.csv", row.names = TRUE)
write.csv(comb.prod_measures,"comb.prod_measures.csv", row.names = TRUE)

8 Analysis

Load the derived LDL measures as well as MALD real word and ENGS pseudoword data including /s/ durations.

# LDL measures & data
data.comb <- read.csv("data.comb.csv", header = T)
comp_measures <- read.csv("comb.comp_measures.csv", header = T)
prod_measures <- read.csv("comb.prod_measures.csv", header = T)

data.LDL <- cbind(data.comb, comp_measures, prod_measures)

# MALD & ENGS data
data.MALD.dur <- read.csv("mald/mald_durations.csv", header = T)
data.ENGS.dur <- read.csv("mald/engs_durations.csv", header = T)

data.comb.dur <- rbind(data.MALD.dur, data.ENGS.dur)

# combine LDL & MALD+ENGS data
data <- merge(data.comb.dur, data.LDL, by="Word")
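
By default, merge() performs an inner join, so rows without a match in the other data frame are dropped silently. A minimal sketch with toy data (all values and object names are illustrative) showing how unmatched words can be checked:

```r
dur <- data.frame(Word = c("cats", "cats", "glips", "dogs"),
                  sDur = c(0.08, 0.09, 0.11, 0.07),
                  stringsAsFactors = FALSE)
ldl <- data.frame(Word   = c("cats", "glips", "box"),
                  l1norm = c(1.2, 0.4, 0.9),
                  stringsAsFactors = FALSE)

merged <- merge(dur, ldl, by = "Word")   # inner join on Word

setdiff(dur$Word, ldl$Word)   # words with durations but without LDL measures
setdiff(ldl$Word, dur$Word)   # words with LDL measures but without durations
nrow(merged)
```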

8.1. Variables

LDL-Unrelated Variables

Variable    Description
word        the word(-form)
sDur        the duration of word-final /s/
wordDur     the duration of the word
baseDur     the duration of the base
SylPerMin   syllables per minute (~speakingRate)
typeOfS     type of word-final /s/
NPhon       number of phonemes
NSyll       number of syllables
NMorph      number of morphemes
preC        consonant/vowel preceding the word-final /s/
preCType    type of preC
BNCfreq     BNC frequency of real words
wordType    binary: real or pseudo
DISC        phonological transcription (CELEX)
Affix       affix (if applicable)
Base        base

Comprehension measures

Variable      Description
recognition   whether a word is recognized
cor_max       the correlation between s-hat and the semantic vector of its predicted form
cor_target    the correlation between s-hat and the semantic vector of its targeted form
l1norm        the L1-norm distance of s-hat
l2norm        the L2-norm (Euclidean distance) of s-hat
rank          the rank of the correlation between s-hat and the semantic vector of its targeted form
density       the mean correlation of s-hat with the semantic vectors of its top 8 neighbors
cor_affix     the correlation between s-hat and the semantic vector of its affix function

Production measures

Variable         Description
correlations     the correlation of the predicted path with the targeted semantic vector
path_counts      the number of paths detected
path_sum         the summed support for the predicted path; could be read as the uncertainty about how to pronounce the word
path_entropies   Shannon entropy calculated over the path supports
lwlr             the length-weakest link ratios of the predicted path

8.2. Creating Subsets

The following variables must be log-transformed to create a (more) normal distribution. We do this before creating the subsets so that all data sets contain the transformed variables:

data$sDurLog <- log(data$sDur)
data$wordDurLog <- log(data$wordDur)
data$baseDurLog <- log(data$baseDur)
data$SylPerMinLog <- log(data$SylPerMin)

data$BNCfreq[data$BNCfreq==0] <- 1
data$BNCfreqLog <- log(data$BNCfreq)

data$l1normLog <- log(data$l1norm)
data$l2normLog <- log(data$l2norm)
data$densityLog <- log(data$density)

We then create the following subsets for further analyses:

# subset containing all PLURAL items
data.pl <- subset(data, typeOfS == "pl")

# subset containing all MONOMORPHEMIC items
data.nm <- subset(data, typeOfS == "nm")

# subsets containing only REAL word data
data.real <- subset(data, wordType == "real")
data.real.nm <- subset(data.real, typeOfS == "nm")
data.real.pl <- subset(data.real, typeOfS == "pl") # this subset contains only 5 observations, so we cannot really make any use of this data set

# subsets containing only PSEUDO word data
data.pseudo <- subset(data, wordType == "pseudo")
data.pseudo.nm <- subset(data.pseudo, typeOfS == "nm")
data.pseudo.pl <- subset(data.pseudo, typeOfS == "pl")
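
Because subset() copies only the columns that exist when it is called, derived columns are available in a subset only if they were added to the parent data frame first. A minimal sketch with toy data; the values are illustrative only:

```r
toy <- data.frame(typeOfS = c("pl", "nm", "pl"),
                  sDur    = c(0.08, 0.10, 0.12),
                  BNCfreq = c(0, 15, 230),
                  stringsAsFactors = FALSE)

toy$sDurLog <- log(toy$sDur)
toy$BNCfreq[toy$BNCfreq == 0] <- 1   # avoid log(0) = -Inf
toy$BNCfreqLog <- log(toy$BNCfreq)

toy.pl <- subset(toy, typeOfS == "pl")   # created after the transformation
names(toy.pl)
```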

8.3. Correlations

Plural /s/ Items

pairscor.fnc(data.pl[,c("wordType", "BNCfreqLog", "l1normLog", "l2normLog", "densityLog", "path_entropies", "path_counts", "path_sum", "lwlr", "correlations", "cor_max", "cor_target", "cor_affix", "NPhon", "NSyll", "preCType")])


Non-morphemic /s/ Items

pairscor.fnc(data.nm[,c("wordType", "BNCfreqLog", "l1normLog", "l2normLog", "densityLog", "path_entropies", "path_counts", "path_sum", "lwlr", "correlations", "cor_max", "cor_target", "cor_affix", "NPhon", "NSyll", "preCType")])


All /s/ Items

pairscor.fnc(data[,c("wordType", "BNCfreqLog", "l1normLog", "l2normLog", "densityLog", "path_entropies", "path_counts", "path_sum", "lwlr", "correlations", "cor_max", "cor_target", "cor_affix", "NPhon", "NSyll", "preCType")])
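
In addition to the pairwise plots, strongly correlated predictors can be listed numerically, which helps to spot potential collinearity (for example between the two norms). A minimal base-R sketch with toy data; the threshold of 0.7 and the simulated values are assumptions for illustration:

```r
set.seed(1)
n <- 50
toy <- data.frame(l1normLog = rnorm(n))
toy$l2normLog  <- toy$l1normLog + rnorm(n, sd = 0.1)   # nearly collinear by construction
toy$densityLog <- rnorm(n)

r <- cor(toy, method = "spearman")

# List predictor pairs whose absolute correlation exceeds the threshold
high <- which(abs(r) > 0.7 & upper.tri(r), arr.ind = TRUE)
data.frame(var1 = rownames(r)[high[, "row"]],
           var2 = colnames(r)[high[, "col"]],
           rho  = r[high])
```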


8.4. Random Forests

Plural /s/ Items

# without 'SylPerMinLog' to make differences more visible
preds = c("wordType", "BNCfreqLog", "l1normLog", "l2normLog", "densityLog", "path_entropies", "path_counts", "path_sum", "lwlr", "correlations", "cor_max", "cor_target", "cor_affix", "NPhon", "NSyll", "preCType")
varnames = c("wordType", "BNCfreqLog", "l1normLog", "l2normLog", "densityLog", "path_entropies", "path_counts", "path_sum", "lwlr", "correlations", "cor_max", "cor_target", "cor_affix", "NPhon", "NSyll", "preCType", "sDurLog")

pl.cforest = cforest(sDurLog ~ ., data = data.pl[, varnames])
pl.varimp = varimp(pl.cforest)
dotplot(sort(pl.varimp), xlab = "Relative variable importance for predicting /s/ duration", col = "#a2b98d", main = expression("Relative variable importance for /s/ duration in plurals"))


Non-Morphemic /s/ Items

# without 'SylPerMinLog' to make differences more visible
preds = c("wordType", "BNCfreqLog", "l1normLog", "l2normLog", "densityLog", "path_entropies", "path_counts", "path_sum", "lwlr", "correlations", "cor_max", "cor_target", "cor_affix", "NPhon", "NSyll", "preCType")
varnames = c("wordType", "BNCfreqLog", "l1normLog", "l2normLog", "densityLog", "path_entropies", "path_counts", "path_sum", "lwlr", "correlations", "cor_max", "cor_target", "cor_affix", "NPhon", "NSyll", "preCType", "sDurLog")

nm.cforest = cforest(sDurLog ~ ., data = data.nm[, varnames])
nm.varimp = varimp(nm.cforest)
dotplot(sort(nm.varimp), xlab = "Relative variable importance for predicting /s/ duration", col = "#a2b98d", main = expression("Relative variable importance for /s/ duration in monomorphemic words"))


All /s/ Items

# without 'SylPerMinLog' to make differences more visible
preds = c("wordType", "BNCfreqLog", "l1normLog", "l2normLog", "densityLog", "path_entropies", "path_counts", "path_sum", "lwlr", "correlations", "cor_max", "cor_target", "cor_affix", "NPhon", "NSyll", "preCType")
varnames = c("wordType", "BNCfreqLog", "l1normLog", "l2normLog", "densityLog", "path_entropies", "path_counts", "path_sum", "lwlr", "correlations", "cor_max", "cor_target", "cor_affix", "NPhon", "NSyll", "preCType", "sDurLog")

all.cforest = cforest(sDurLog ~ ., data = data[, varnames])
all.varimp = varimp(all.cforest)
dotplot(sort(all.varimp), xlab = "Relative variable importance for predicting /s/ duration", col = "#a2b98d", main = expression("Relative variable importance for /s/ duration in plural+monomorphemic words"))


Overview

dot.pl <- dotplot(sort(pl.varimp), xlab = "Relative variable importance for predicting /s/ duration", col = "#a2b98d", main = expression("Relative variable importance for /s/ duration in plurals"))
dot.nm <- dotplot(sort(nm.varimp), xlab = "Relative variable importance for predicting /s/ duration", col = "#a2b98d", main = expression("Relative variable importance for /s/ duration in monomorphemic words"))
dot.all <- dotplot(sort(all.varimp), xlab = "Relative variable importance for predicting /s/ duration", col = "#a2b98d", main = expression("Relative variable importance for /s/ duration in plural+monomorphemic words"))

dotplots <- ggarrange(dot.pl, dot.nm, dot.all,
                    labels = c("A", "B", "C"),
                    ncol = 2, nrow = 2,
                    font.label = list(size=12))

dotplots


8.5. Linear Models

L1-norm

MALD+ENGS: Plural /s/

lm.pl.l1 <- lm(sDurLog ~ l1normLog +
                 path_sum +
                 SylPerMinLog,
               data.pl)

summary(lm.pl.l1)

Call:
lm(formula = sDurLog ~ l1normLog + path_sum + SylPerMinLog, data = data.pl)

Residuals:
     Min       1Q   Median       3Q      Max 
-0.80136 -0.13307  0.00431  0.15411  0.88570 

Coefficients:
             Estimate Std. Error t value Pr(>|t|)    
(Intercept)  -0.78737    0.06084 -12.942  < 2e-16 ***
l1normLog     0.39417    0.08167   4.826 1.68e-06 ***
path_sum     -0.02003    0.01677  -1.194    0.233    
SylPerMinLog -1.34201    0.03342 -40.158  < 2e-16 ***
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

Residual standard error: 0.2214 on 753 degrees of freedom
Multiple R-squared:  0.6831,    Adjusted R-squared:  0.6818 
F-statistic: 541.1 on 3 and 753 DF,  p-value: < 2.2e-16

MALD+ENGS: Non-Morphemic /s/

lm.nm.l1 <- lm(sDurLog ~ l1normLog +
                 path_sum +
                 SylPerMinLog,
               data.nm)

summary(lm.nm.l1)

Call:
lm(formula = sDurLog ~ l1normLog + path_sum + SylPerMinLog, data = data.nm)

Residuals:
     Min       1Q   Median       3Q      Max 
-0.84809 -0.25665 -0.04661  0.24091  1.02539 

Coefficients:
             Estimate Std. Error t value Pr(>|t|)    
(Intercept)  -1.63130    0.05441 -29.980  < 2e-16 ***
l1normLog     0.06324    0.03321   1.904   0.0573 .  
path_sum      0.09284    0.01487   6.245 6.85e-10 ***
SylPerMinLog -0.61222    0.04240 -14.441  < 2e-16 ***
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

Residual standard error: 0.3427 on 808 degrees of freedom
Multiple R-squared:  0.2228,    Adjusted R-squared:  0.2199 
F-statistic: 77.22 on 3 and 808 DF,  p-value: < 2.2e-16

MALD: Non-Morphemic /s/

lm.real.nm.l1 <- lm(sDurLog ~ l1normLog +
                 path_sum +
                 SylPerMinLog,
               data.real.nm)

summary(lm.real.nm.l1)

Call:
lm(formula = sDurLog ~ l1normLog + path_sum + SylPerMinLog, data = data.real.nm)

Residuals:
     Min       1Q   Median       3Q      Max 
-0.47022 -0.10369  0.01557  0.11966  0.39324 

Coefficients:
             Estimate Std. Error t value Pr(>|t|)    
(Intercept)  -1.46810    0.04789 -30.656  < 2e-16 ***
l1normLog     0.04951    0.01849   2.678  0.00804 ** 
path_sum     -0.03278    0.01019  -3.216  0.00153 ** 
SylPerMinLog -0.02141    0.04269  -0.502  0.61653    
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

Residual standard error: 0.176 on 192 degrees of freedom
Multiple R-squared:  0.07818,   Adjusted R-squared:  0.06378 
F-statistic: 5.428 on 3 and 192 DF,  p-value: 0.001325

ENGS: Non-Morphemic /s/

lm.pseudo.nm.l1 <- lm(sDurLog ~ l1normLog +
                 path_sum +
                 SylPerMinLog,
               data.pseudo.nm)

summary(lm.pseudo.nm.l1)

Call:
lm(formula = sDurLog ~ l1normLog + path_sum + SylPerMinLog, data = data.pseudo.nm)

Residuals:
     Min       1Q   Median       3Q      Max 
-0.54348 -0.12541  0.00118  0.12254  0.55553 

Coefficients:
             Estimate Std. Error t value Pr(>|t|)    
(Intercept)  -0.83482    0.06059 -13.778   <2e-16 ***
l1normLog     0.12438    0.09508   1.308   0.1913    
path_sum     -0.03941    0.01535  -2.568   0.0105 *  
SylPerMinLog -1.27382    0.03029 -42.049   <2e-16 ***
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

Residual standard error: 0.193 on 612 degrees of freedom
Multiple R-squared:  0.7431,    Adjusted R-squared:  0.7418 
F-statistic:   590 on 3 and 612 DF,  p-value: < 2.2e-16

L2-norm

MALD+ENGS: Plural /s/

lm.pl.l2 <- lm(sDurLog ~ l2normLog +
                 path_sum +
                 SylPerMinLog,
               data.pl)

summary(lm.pl.l2)

Call:
lm(formula = sDurLog ~ l2normLog + path_sum + SylPerMinLog, data = data.pl)

Residuals:
     Min       1Q   Median       3Q      Max 
-0.80299 -0.13128 -0.00147  0.15884  0.92370 

Coefficients:
             Estimate Std. Error t value Pr(>|t|)    
(Intercept)   0.70716    0.42311   1.671 0.095071 .  
l2normLog     0.40192    0.10358   3.880 0.000113 ***
path_sum     -0.02327    0.01683  -1.383 0.167154    
SylPerMinLog -1.33543    0.03357 -39.782  < 2e-16 ***
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

Residual standard error: 0.2225 on 753 degrees of freedom
Multiple R-squared:  0.6797,    Adjusted R-squared:  0.6784 
F-statistic: 532.7 on 3 and 753 DF,  p-value: < 2.2e-16

MALD+ENGS: Non-Morphemic /s/

lm.nm.l2 <- lm(sDurLog ~ l2normLog +
                 path_sum +
                 SylPerMinLog,
               data.nm)

summary(lm.nm.l2)

Call:
lm(formula = sDurLog ~ l2normLog + path_sum + SylPerMinLog, data = data.nm)

Residuals:
     Min       1Q   Median       3Q      Max 
-0.77046 -0.24131 -0.04572  0.20230  1.30394 

Coefficients:
             Estimate Std. Error t value Pr(>|t|)    
(Intercept)  -0.07346    0.18162  -0.404  0.68596    
l2normLog     0.35060    0.03836   9.139  < 2e-16 ***
path_sum      0.04390    0.01473   2.980  0.00297 ** 
SylPerMinLog -0.65320    0.04059 -16.094  < 2e-16 ***
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

Residual standard error: 0.327 on 808 degrees of freedom
Multiple R-squared:  0.2925,    Adjusted R-squared:  0.2898 
F-statistic: 111.3 on 3 and 808 DF,  p-value: < 2.2e-16

MALD: Non-Morphemic /s/

lm.real.nm.l2 <- lm(sDurLog ~ l2normLog +
                 path_sum +
                 SylPerMinLog,
               data.real.nm)

summary(lm.real.nm.l2)

Call:
lm(formula = sDurLog ~ l2normLog + path_sum + SylPerMinLog, data = data.real.nm)

Residuals:
     Min       1Q   Median       3Q      Max 
-0.46252 -0.09855  0.01379  0.13129  0.39918 

Coefficients:
             Estimate Std. Error t value Pr(>|t|)    
(Intercept)  -1.25122    0.10996 -11.379  < 2e-16 ***
l2normLog     0.05834    0.02430   2.401  0.01729 *  
path_sum     -0.03202    0.01031  -3.106  0.00218 ** 
SylPerMinLog -0.03464    0.04201  -0.825  0.41062    
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

Residual standard error: 0.1766 on 192 degrees of freedom
Multiple R-squared:  0.07162,   Adjusted R-squared:  0.05712 
F-statistic: 4.938 on 3 and 192 DF,  p-value: 0.002517

ENGS: Non-Morphemic /s/

lm.pseudo.nm.l2 <- lm(sDurLog ~ l2normLog +
                 path_sum +
                 SylPerMinLog,
               data.pseudo.nm)

summary(lm.pseudo.nm.l2)

Call:
lm(formula = sDurLog ~ l2normLog + path_sum + SylPerMinLog, data = data.pseudo.nm)

Residuals:
     Min       1Q   Median       3Q      Max 
-0.54507 -0.12273  0.00069  0.12343  0.55305 

Coefficients:
             Estimate Std. Error t value Pr(>|t|)    
(Intercept)  -0.44437    0.53927  -0.824  0.41025    
l2normLog     0.10686    0.13106   0.815  0.41519    
path_sum     -0.04040    0.01533  -2.635  0.00863 ** 
SylPerMinLog -1.27290    0.03031 -41.995  < 2e-16 ***
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

Residual standard error: 0.1931 on 612 degrees of freedom
Multiple R-squared:  0.7426,    Adjusted R-squared:  0.7414 
F-statistic: 588.7 on 3 and 612 DF,  p-value: < 2.2e-16

Density

MALD+ENGS: Plural /s/

lm.pl.de <- lm(sDurLog ~ densityLog +
                 path_sum +
                 SylPerMinLog,
               data.pl)

summary(lm.pl.de)

Call:
lm(formula = sDurLog ~ densityLog + path_sum + SylPerMinLog, 
    data = data.pl)

Residuals:
     Min       1Q   Median       3Q      Max 
-0.81719 -0.13697  0.00009  0.16061  1.12027 

Coefficients:
             Estimate Std. Error t value Pr(>|t|)    
(Intercept)  -0.35458    0.20131  -1.761  0.07858 .  
densityLog    0.78185    0.26741   2.924  0.00356 ** 
path_sum     -0.02439    0.01695  -1.439  0.15055    
SylPerMinLog -1.33721    0.03372 -39.659  < 2e-16 ***
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

Residual standard error: 0.2235 on 753 degrees of freedom
Multiple R-squared:  0.677, Adjusted R-squared:  0.6757 
F-statistic:   526 on 3 and 753 DF,  p-value: < 2.2e-16

MALD+ENGS: Non-Morphemic /s/

lm.nm.de <- lm(sDurLog ~ densityLog +
                 path_sum +
                 SylPerMinLog,
               data.nm)

summary(lm.nm.de)

Call:
lm(formula = sDurLog ~ densityLog + path_sum + SylPerMinLog, 
    data = data.nm)

Residuals:
     Min       1Q   Median       3Q      Max 
-0.73761 -0.17549 -0.02799  0.16514  0.97770 

Coefficients:
             Estimate Std. Error t value Pr(>|t|)    
(Intercept)  -1.94840    0.03945 -49.394  < 2e-16 ***
densityLog   -0.67013    0.02916 -22.982  < 2e-16 ***
path_sum      0.04234    0.01113   3.805 0.000152 ***
SylPerMinLog -0.82702    0.03424 -24.155  < 2e-16 ***
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

Residual standard error: 0.2671 on 808 degrees of freedom
Multiple R-squared:  0.5279,    Adjusted R-squared:  0.5262 
F-statistic: 301.2 on 3 and 808 DF,  p-value: < 2.2e-16

MALD: Non-Morphemic /s/

lm.real.nm.de <- lm(sDurLog ~ densityLog +
                 path_sum +
                 SylPerMinLog,
               data.real.nm)

summary(lm.real.nm.de)

Call:
lm(formula = sDurLog ~ densityLog + path_sum + SylPerMinLog, 
    data = data.real.nm)

Residuals:
     Min       1Q   Median       3Q      Max 
-0.47066 -0.09787  0.00885  0.12146  0.40396 

Coefficients:
              Estimate Std. Error t value Pr(>|t|)    
(Intercept)  -1.388267   0.093778 -14.804   <2e-16 ***
densityLog    0.062932   0.050126   1.255   0.2108    
path_sum     -0.022371   0.009383  -2.384   0.0181 *  
SylPerMinLog -0.050072   0.042053  -1.191   0.2352    
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

Residual standard error: 0.1785 on 192 degrees of freedom
Multiple R-squared:  0.05153,   Adjusted R-squared:  0.03671 
F-statistic: 3.477 on 3 and 192 DF,  p-value: 0.01706

ENGS: Non-Morphemic /s/

lm.pseudo.nm.de <- lm(sDurLog ~ densityLog +
                 path_sum +
                 SylPerMinLog,
               data.pseudo.nm)

summary(lm.pseudo.nm.de)

Call:
lm(formula = sDurLog ~ densityLog + path_sum + SylPerMinLog, 
    data = data.pseudo.nm)

Residuals:
     Min       1Q   Median       3Q      Max 
-0.54629 -0.12168  0.00061  0.12515  0.55477 

Coefficients:
             Estimate Std. Error t value Pr(>|t|)    
(Intercept)  -0.71688    0.20604  -3.479 0.000539 ***
densityLog    0.22474    0.27207   0.826 0.409102    
path_sum     -0.04003    0.01538  -2.603 0.009468 ** 
SylPerMinLog -1.27260    0.03030 -42.007  < 2e-16 ***
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

Residual standard error: 0.1931 on 612 degrees of freedom
Multiple R-squared:  0.7427,    Adjusted R-squared:  0.7414 
F-statistic: 588.7 on 3 and 612 DF,  p-value: < 2.2e-16

L2-norm + Density

MALD+ENGS: Plural /s/

lm.pl.de.l2 <- lm(sDurLog ~ l2normLog + densityLog +
                 path_sum +
                 SylPerMinLog,
               data.pl)

summary(lm.pl.de.l2)

Call:
lm(formula = sDurLog ~ l2normLog + densityLog + path_sum + SylPerMinLog, 
    data = data.pl)

Residuals:
     Min       1Q   Median       3Q      Max 
-0.79216 -0.12910 -0.00261  0.15636  0.80038 

Coefficients:
             Estimate Std. Error t value Pr(>|t|)    
(Intercept)   1.23116    0.59514   2.069  0.03892 *  
l2normLog     0.66688    0.23567   2.830  0.00478 ** 
densityLog   -0.75826    0.60586  -1.252  0.21113    
path_sum     -0.02492    0.01687  -1.477  0.14006    
SylPerMinLog -1.33369    0.03359 -39.711  < 2e-16 ***
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

Residual standard error: 0.2225 on 752 degrees of freedom
Multiple R-squared:  0.6804,    Adjusted R-squared:  0.6787 
F-statistic: 400.2 on 4 and 752 DF,  p-value: < 2.2e-16


MALD+ENGS: Non-Morphemic /s/

lm.nm.de.l2 <- lm(sDurLog ~ l2normLog + densityLog +
                 path_sum +
                 SylPerMinLog,
               data.nm)

summary(lm.nm.de.l2)

Call:
lm(formula = sDurLog ~ l2normLog + densityLog + path_sum + SylPerMinLog, 
    data = data.nm)

Residuals:
     Min       1Q   Median       3Q      Max 
-0.66626 -0.17567 -0.02108  0.16449  0.80862 

Coefficients:
              Estimate Std. Error t value Pr(>|t|)    
(Intercept)  -0.895289   0.148510  -6.028 2.52e-09 ***
l2normLog     0.226593   0.030878   7.338 5.28e-13 ***
densityLog   -0.631605   0.028733 -21.982  < 2e-16 ***
path_sum      0.007605   0.011773   0.646    0.518    
SylPerMinLog -0.838532   0.033207 -25.252  < 2e-16 ***
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

Residual standard error: 0.2588 on 807 degrees of freedom
Multiple R-squared:  0.5575,    Adjusted R-squared:  0.5553 
F-statistic: 254.1 on 4 and 807 DF,  p-value: < 2.2e-16


MALD: Non-Morphemic /s/

lm.real.nm.de.l2 <- lm(sDurLog ~ l2normLog + densityLog +
                    path_sum +
                    SylPerMinLog,
                  data.real.nm)

summary(lm.real.nm.de.l2)

Call:
lm(formula = sDurLog ~ l2normLog + densityLog + path_sum + SylPerMinLog, 
    data = data.real.nm)

Residuals:
     Min       1Q   Median       3Q      Max 
-0.46222 -0.09821  0.01386  0.13207  0.39910 

Coefficients:
              Estimate Std. Error t value Pr(>|t|)    
(Intercept)  -1.252749   0.114424 -10.948  < 2e-16 ***
l2normLog     0.059133   0.029075   2.034  0.04335 *  
densityLog   -0.002954   0.059344  -0.050  0.96035    
path_sum     -0.032106   0.010466  -3.068  0.00247 ** 
SylPerMinLog -0.034392   0.042420  -0.811  0.41852    
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

Residual standard error: 0.1771 on 191 degrees of freedom
Multiple R-squared:  0.07164,   Adjusted R-squared:  0.05219 
F-statistic: 3.685 on 4 and 191 DF,  p-value: 0.006479


ENGS: Non-Morphemic /s/

lm.pseudo.nm.de.l2 <- lm(sDurLog ~ l2normLog + densityLog +
                    path_sum +
                    SylPerMinLog,
                  data.pseudo.nm)

summary(lm.pseudo.nm.de.l2)

Call:
lm(formula = sDurLog ~ l2normLog + densityLog + path_sum + SylPerMinLog, 
    data = data.pseudo.nm)

Residuals:
     Min       1Q   Median       3Q      Max 
-0.54597 -0.12180  0.00031  0.12472  0.55435 

Coefficients:
             Estimate Std. Error t value Pr(>|t|)    
(Intercept)  -0.64263    1.50293  -0.428  0.66910    
l2normLog     0.02843    0.57014   0.050  0.96024    
densityLog    0.16729    1.18358   0.141  0.88764    
path_sum     -0.04011    0.01549  -2.590  0.00982 ** 
SylPerMinLog -1.27269    0.03037 -41.902  < 2e-16 ***
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

Residual standard error: 0.1933 on 611 degrees of freedom
Multiple R-squared:  0.7427,    Adjusted R-squared:  0.741 
F-statistic: 440.8 on 4 and 611 DF,  p-value: < 2.2e-16


MALD+ENGS: All /s/

lm.pl.nm.de.l2.r <- lm(sDurLog ~ l2normLog +
                               densityLog +
                               path_sum +
                               typeOfS +
                               SylPerMinLog,
                             data)

# Improving residual distribution:

lm.pl.nm.de.l2 <- lm(sDurLog ~ l2normLog +
                             densityLog +
                             path_sum +
                             typeOfS +
                             SylPerMinLog,
                           data, subset = abs(scale(resid(lm.pl.nm.de.l2.r))) < 2.5)

summary(lm.pl.nm.de.l2)

Call:
lm(formula = sDurLog ~ l2normLog + densityLog + path_sum + typeOfS + 
    SylPerMinLog, data = data, subset = abs(scale(resid(lm.pl.nm.de.l2.r))) < 
    2.5)

Residuals:
    Min      1Q  Median      3Q     Max 
-0.6663 -0.1522 -0.0058  0.1488  0.6828 

Coefficients:
               Estimate Std. Error t value Pr(>|t|)    
(Intercept)  -0.5223809  0.1283406  -4.070 4.93e-05 ***
l2normLog     0.2629294  0.0272543   9.647  < 2e-16 ***
densityLog   -0.6688049  0.0262051 -25.522  < 2e-16 ***
path_sum     -0.0002086  0.0091886  -0.023    0.982    
typeOfSpl    -0.0879141  0.0127194  -6.912 6.98e-12 ***
SylPerMinLog -1.1226760  0.0235227 -47.727  < 2e-16 ***
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

Residual standard error: 0.2341 on 1540 degrees of freedom
Multiple R-squared:  0.6679,    Adjusted R-squared:  0.6668 
F-statistic: 619.3 on 5 and 1540 DF,  p-value: < 2.2e-16
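
The two-step procedure used for this model (fit, then refit without observations whose standardized residuals exceed 2.5 in absolute value) can be illustrated on toy data. The simulated data and object names below are assumptions for illustration only:

```r
set.seed(1)
toy <- data.frame(x = rnorm(200))
toy$y <- 1 + 0.5 * toy$x + rnorm(200, sd = 0.3)
toy$y[1:3] <- toy$y[1:3] + 3   # plant three outliers

m.raw <- lm(y ~ x, data = toy)

# Keep observations whose standardized residual lies within 2.5 SD
keep   <- as.vector(abs(scale(resid(m.raw))) < 2.5)
m.trim <- lm(y ~ x, data = toy, subset = keep)

c(removed = sum(!keep), n.raw = nobs(m.raw), n.trim = nobs(m.trim))
```

Reporting how many observations were removed alongside the trimmed model makes the effect of the trimming transparent.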

8.6. GAMs

MALD+ENGS: Plural /s/

gam.pl.de.l2 <- gam(sDurLog ~ s(l2normLog) +
                      densityLog +
                      path_sum +
                      SylPerMinLog,
                    data = data.pl, method = "ML")

summary(gam.pl.de.l2)

Family: gaussian 
Link function: identity 

Formula:
sDurLog ~ s(l2normLog) + densityLog + path_sum + SylPerMinLog

Parametric coefficients:
             Estimate Std. Error t value Pr(>|t|)    
(Intercept)  -3.11622    0.68054  -4.579 5.47e-06 ***
densityLog   -2.92972    0.91427  -3.204  0.00141 ** 
path_sum     -0.02938    0.01684  -1.745  0.08138 .  
SylPerMinLog -1.34114    0.03341 -40.143  < 2e-16 ***
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

Approximate significance of smooth terms:
               edf Ref.df     F  p-value    
s(l2normLog) 2.632  3.301 5.577 0.000631 ***
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

R-sq.(adj) =  0.684   Deviance explained = 68.6%
-ML = -68.064  Scale est. = 0.048733  n = 757
[1] 330 331


MALD+ENGS: Non-Morphemic /s/

gam.nm.de.l2 <- gam(sDurLog ~ s(l2normLog) +
                      densityLog +
                      path_sum +
                      SylPerMinLog,
                    data = data.nm, method = "ML")

summary(gam.nm.de.l2)

Family: gaussian 
Link function: identity 

Formula:
sDurLog ~ s(l2normLog) + densityLog + path_sum + SylPerMinLog

Parametric coefficients:
             Estimate Std. Error t value Pr(>|t|)    
(Intercept)  -1.63850    0.05439 -30.124   <2e-16 ***
densityLog   -0.45502    0.04584  -9.926   <2e-16 ***
path_sum      0.01368    0.01176   1.163    0.245    
SylPerMinLog -0.85996    0.03296 -26.095   <2e-16 ***
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

Approximate significance of smooth terms:
               edf Ref.df     F p-value    
s(l2normLog) 6.453  7.622 10.77   8e-14 ***
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

R-sq.(adj) =  0.571   Deviance explained = 57.6%
-ML = 48.814  Scale est. = 0.064643  n = 812
[1] 222 454


MALD+ENGS: All /s/

gam.all.de.l2 <- gam(sDurLog ~ s(l2normLog) +
                      densityLog +
                      path_sum +
                      SylPerMinLog+
                      typeOfS,
                    data = data, method = "ML")

summary(gam.all.de.l2)

Family: gaussian 
Link function: identity 

Formula:
sDurLog ~ s(l2normLog) + densityLog + path_sum + SylPerMinLog + 
    typeOfS

Parametric coefficients:
              Estimate Std. Error t value Pr(>|t|)    
(Intercept)  -1.495591   0.045077 -33.178  < 2e-16 ***
densityLog   -0.461633   0.042409 -10.885  < 2e-16 ***
path_sum      0.011971   0.009675   1.237    0.216    
SylPerMinLog -1.059654   0.024182 -43.820  < 2e-16 ***
typeOfSpl    -0.095665   0.013371  -7.155 1.29e-12 ***
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

Approximate significance of smooth terms:
               edf Ref.df     F p-value    
s(l2normLog) 7.231  8.259 18.44  <2e-16 ***
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

R-sq.(adj) =  0.636   Deviance explained = 63.9%
-ML = 40.091  Scale est. = 0.060814  n = 1569

[1] 976 507
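In the GAM summaries above, the estimated degrees of freedom (edf) of `s(l2normLog)` range from about 2.6 to 7.2; an edf close to 1 would indicate an effectively linear effect. The following sketch, on simulated data with assumed variable names (`toy`, `x`, `z`, `y`), shows one way to check whether a smooth is warranted by comparing it against a linear term.

```r
library(mgcv)  # a recommended package that ships with standard R installations

# Simulated data with a clearly nonlinear effect of x (an assumption of this sketch).
set.seed(2)
n <- 300
toy <- data.frame(x = runif(n, -2, 2), z = rnorm(n))
toy$y <- sin(2 * toy$x) - 0.5 * toy$z + rnorm(n, sd = 0.3)

# Same model twice: x as a smooth term versus x as a linear term.
gam.smooth <- gam(y ~ s(x) + z, data = toy, method = "ML")
gam.linear <- gam(y ~ x + z, data = toy, method = "ML")

# edf of the smooth term, then an AIC comparison of both fits.
summary(gam.smooth)$edf
AIC(gam.linear, gam.smooth)
```

Both models are fitted with `method = "ML"`, as in the analysis above, so that fits with different fixed-effect structures remain comparable.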

9 Simulation: ENGS

This attempt uses only pseudoword types and tokens. The A and S matrices are simulated.

9.1. Matrices, Comprehension & Production

## load ENGS types data

data.engs.types <- read.csv("../engs/final_S_data_LDL_types.csv")

Matrices:

## pseudowords: C matrix

engs.cues = make_cue_matrix(data = data.engs.types, formula = ~ Word + Affix, grams = 3, wordform = "DISC")


## pseudowords: A matrix

# this simulates a semantic space for our pseudowords based on the dimensions of the C matrix
engs.Amat <- semantic_space(~ Word + Affix, data = data.engs.types, nrows=nrow(engs.cues$matrices$C), ncols=ncol(engs.cues$matrices$C), with_semantic_vectors = TRUE, with_mvrnorm = FALSE, wordform = "Word")


## pseudowords: S matrix

engs.Smat = make_S_matrix(data = data.engs.types, formula = ~ Word + Affix, grams = 3, wordform = "DISC")

Comprehension:

# getting the mapping for comprehension
engs.comp = learn_comprehension(cue_obj = engs.cues, S = engs.Smat)
beep()


# evaluating comprehension accuracy
engs.comp_acc = accuracy_comprehension(m = engs.comp, data = data.engs.types, wordform = "DISC", show_rank = TRUE, neighborhood_density = TRUE)
beep()

engs.comp_acc$acc #1


# comprehension measures
engs.comp_measures = comprehension_measures(comp_acc = engs.comp_acc, Shat = engs.comp$Shat, S = engs.Smat)
beep()
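Conceptually, the comprehension mapping in LDL is a linear transformation from the cue matrix C to the semantic matrix S. The toy sketch below estimates such a mapping with a generalized inverse and scores accuracy by correlating predicted and target semantic vectors. It is not the internal code of `learn_comprehension()` or `accuracy_comprehension()`; the matrices, their dimensions, and the names `Fmat`, `Shat`, and `acc` are assumptions made for illustration.

```r
library(MASS)  # for ginv(), the Moore-Penrose generalized inverse

set.seed(3)
# C: 4 toy word forms x 6 binary form cues (made-up values).
C <- matrix(c(1, 1, 0, 0, 0, 0,
              0, 1, 1, 0, 0, 0,
              0, 0, 1, 1, 1, 0,
              0, 0, 0, 0, 1, 1),
            nrow = 4, byrow = TRUE,
            dimnames = list(paste0("w", 1:4), paste0("cue", 1:6)))
# S: 4 toy word forms x 5 simulated semantic dimensions.
S <- matrix(rnorm(4 * 5), nrow = 4,
            dimnames = list(rownames(C), paste0("dim", 1:5)))

# Comprehension mapping: choose Fmat so that C %*% Fmat approximates S.
Fmat <- ginv(C) %*% S
Shat <- C %*% Fmat

# Correlate every predicted vector (rows) with every target vector (columns).
cors <- cor(t(Shat), t(S))
predicted <- colnames(cors)[apply(cors, 1, which.max)]

# A word counts as correct if its own target vector is the best match.
acc <- mean(predicted == rownames(S))
acc
```

With so few word forms the mapping reproduces S exactly, which mirrors the accuracy of 1 reported for the small ENGS-only data set.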

Production:

# getting the mapping for production
engs.prod = learn_production(cue_obj = engs.cues, S = engs.Smat, comp = engs.comp)
beep()


# evaluating production accuracy
engs.prod_acc = accuracy_production(m = engs.prod, data = data.engs.types, wordform = "DISC", full_results = TRUE, return_triphone_supports = TRUE)
beep()

engs.prod_acc$acc #1


# production measures
engs.prod_measures = production_measures(prod_acc = engs.prod_acc, S = engs.Smat, prod = engs.prod)
beep()
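The production mapping runs in the opposite direction, from semantics to form cues. A minimal sketch under the same toy assumptions as before (made-up C, simulated S, illustrative names `G` and `Chat`) is given below. It covers only the linear mapping and leaves out the later step in which supported cues are ordered into a path, so it does not reproduce `learn_production()` or the path-based measures.

```r
library(MASS)  # for ginv()

set.seed(4)
# Toy cue matrix (4 word forms x 6 cues) and simulated semantic matrix.
C <- matrix(c(1, 1, 0, 0, 0, 0,
              0, 1, 1, 0, 0, 0,
              0, 0, 1, 1, 1, 0,
              0, 0, 0, 0, 1, 1),
            nrow = 4, byrow = TRUE,
            dimnames = list(paste0("w", 1:4), paste0("cue", 1:6)))
S <- matrix(rnorm(4 * 5), nrow = 4,
            dimnames = list(rownames(C), paste0("dim", 1:5)))

# Production mapping: choose G so that S %*% G approximates C.
G <- ginv(S) %*% C
Chat <- S %*% G

# Each row of Chat holds the semantic support for every cue of that word.
round(Chat, 2)
```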

Export measures:

write.csv(engs.comp_measures,"engs.comp_measures.csv", row.names = TRUE)

write.csv(engs.prod_measures,"engs.prod_measures.csv", row.names = TRUE)

9.2. ENGS data with LDL measures

data.engs.types <- read.csv("../engs/final_S_data_LDL_types.csv")

data.ENGS.dur <- read.csv("../mald/engs_durations.csv", header = T)

names(data.ENGS.dur)[names(data.ENGS.dur) == "ï..number"] <- "number"

engs.comp_measures <- read.csv("../engs.comp_measures.csv", header = T)
engs.prod_measures <- read.csv("../engs.prod_measures.csv", header = T)

data.LDL <- cbind(data.engs.types, engs.comp_measures, engs.prod_measures)

data <- merge(data.ENGS.dur, data.LDL, by="Base")
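Because `cbind()` pastes columns side by side without checking row identity, it is worth confirming that the types table and the measures tables have the same rows in the same order before combining them. The sketch below uses tiny stand-in tables. The pseudowords, the column `l1norm`, and the assumption that the measures file carries the word forms in its row-name column `X` are all hypothetical.

```r
# Hypothetical miniature stand-ins for the types data and one measures table.
types <- data.frame(Word = c("bloufs", "glips", "pleets"),
                    Affix = c("PL", NA, "PL"))
measures <- data.frame(X = c("bloufs", "glips", "pleets"),
                       l1norm = c(1.2, 0.8, 1.5))

# Verify size and order before binding columns.
stopifnot(nrow(types) == nrow(measures))
stopifnot(identical(as.character(types$Word), as.character(measures$X)))
combined <- cbind(types, measures)

# Alternative that does not depend on row order: join on the key.
joined <- merge(types, measures, by.x = "Word", by.y = "X")
joined
```

Note that `merge()` appends `.x` and `.y` to non-key columns that occur in both inputs, which is where a name such as `Affix.x` comes from.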

Check and log-transform variables:

data$sDurLog <- log(data$sDur)
data$wordDurLog <- log(data$wordDur)
data$baseDurLog <- log(data$baseDur)
data$speakingRateLog <- log(data$speakingRate)

data$googleFreqLog <- log(data$googleFreq)

data$l1normLog <- log(data$l1norm)
data$l2normLog <- log(data$l2norm)
data$densityLog <- log(data$density)


data$pauseBin <- as.factor(data$pauseBin)
data$preC <- as.factor(data$preC)
data$folType <- as.factor(data$folType)

data$Affix.x[is.na(data$Affix.x)] <- "nm"
data$Affix.x <- as.factor(data$Affix.x)
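Since `log()` returns `-Inf` for zero and `NaN` for negative input, a quick count of problematic values before transforming can prevent surprises in later models. A minimal sketch on made-up values (the data frame `toy` and its contents are assumptions):

```r
# Made-up values including a zero and a missing value.
toy <- data.frame(sDur = c(0.08, 0.12, 0, NA),
                  googleFreq = c(10, 0, 250, 31))
vars <- c("sDur", "googleFreq")

# Count values per variable that are missing or not strictly positive.
bad <- sapply(toy[vars], function(v) sum(is.na(v) | v <= 0))
bad

# Transform only strictly positive values; everything else becomes NA.
toy$sDurLog <- ifelse(toy$sDur > 0, log(toy$sDur), NA)
toy$googleFreqLog <- ifelse(toy$googleFreq > 0, log(toy$googleFreq), NA)
```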

9.3. Random Forests

# without 'speakingRateLog' to make differences more visible
preds = c("googleFreqLog", "l1normLog", "l2normLog", "densityLog", "path_entropies", "path_counts", "path_sum", "lwlr", "correlations", "cor_max", "cor_target", "preC", "folType", "pauseBin")
varnames = c("googleFreqLog", "l1normLog", "l2normLog", "densityLog", "path_entropies", "path_counts", "path_sum", "lwlr", "correlations", "cor_max", "cor_target", "preC", "folType", "pauseBin", "sDurLog")

pseudo.cforest = cforest(sDurLog ~ ., data = data[, varnames])
pseudo.varimp = varimp(pseudo.cforest)
dotplot(sort(pseudo.varimp), xlab = "Relative variable importance for predicting /s/ duration", col = "#a2b98d", main = expression("Relative variable importance for /s/ duration"))

The following variables appear to be predictive for /s/ duration:

  1. pauseBin
  2. folType
  3. cor_target
  4. correlations
  5. l2normLog
  6. l1normLog
  7. cor_max
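The ranking above rests on permutation-based variable importance: a predictor matters to the extent that shuffling its values degrades prediction. The sketch below illustrates that idea in base R with a linear model standing in for the forest and in-sample error standing in for out-of-bag error. It is a simplification for illustration, not the algorithm behind `varimp()`; the data and all names are assumptions.

```r
# Simulated data: 'a' matters a lot, 'b' a little, 'noise' not at all.
set.seed(7)
n <- 400
toy <- data.frame(a = rnorm(n), b = rnorm(n), noise = rnorm(n))
toy$y <- 1.0 * toy$a + 0.3 * toy$b + rnorm(n, sd = 0.5)

# Stand-in model (a linear model instead of a conditional inference forest).
fit <- lm(y ~ a + b + noise, data = toy)
mse <- function(model, d) mean((d$y - predict(model, newdata = d))^2)
base.mse <- mse(fit, toy)

# Importance = mean increase in error after shuffling one predictor.
perm.importance <- sapply(c("a", "b", "noise"), function(v) {
  mean(replicate(20, {
    shuffled <- toy
    shuffled[[v]] <- sample(shuffled[[v]])
    mse(fit, shuffled) - base.mse
  }))
})
sort(perm.importance)
```

Values near zero, as for `noise` here, indicate predictors that the model could do without.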

9.4. Correlations

pairscor.fnc(data[,c("sDurLog", "pauseBin", "folType", "cor_target", "correlations", "l2normLog", "l1normLog", "cor_max", "Affix.x")])

#  cor_target + correlations; r=1
#  cor_target/correlations + l2normLog; r=0.38
#  cor_target/correlations + l1normLog; r=0.38
#  cor_target/correlations + cor_max; r=0.78
#  cor_target/correlations + Affix.x; r=0.78
#  l1normLog + l2normLog; r=0.93
#  l2normLog + cor_max; r=0.72
#  l1normLog + cor_max; r=0.78

For each pair of correlated LDL variables, test which one is the better predictor.

# 1 cor_target OR correlations

mdl.cor_target <- lmer(sDurLog ~ cor_target + (1 | speaker), data, REML=F)
mdl.correlations <- lmer(sDurLog ~ correlations + (1 | speaker), data, REML=F)

anova(mdl.cor_target, mdl.correlations) # let's take 'correlations' for now
Data: data
Models:
mdl.cor_target: sDurLog ~ cor_target + (1 | speaker)
mdl.correlations: sDurLog ~ correlations + (1 | speaker)
                 npar    AIC    BIC  logLik deviance Chisq Df Pr(>Chisq)    
mdl.cor_target      4 393.86 411.97 -192.93   385.86                        
mdl.correlations    4 393.86 411.97 -192.93   385.86     0  0  < 2.2e-16 ***
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

# 2 l2normLog OR l1normLog

mdl.l2normLog <- lmer(sDurLog ~ l2normLog + (1 | speaker), data, REML=F)
mdl.l1normLog <- lmer(sDurLog ~ l1normLog + (1 | speaker), data, REML=F)

anova(mdl.l2normLog, mdl.l1normLog) # let's take 'l2normLog' > lower AIC
Data: data
Models:
mdl.l2normLog: sDurLog ~ l2normLog + (1 | speaker)
mdl.l1normLog: sDurLog ~ l1normLog + (1 | speaker)
              npar    AIC    BIC  logLik deviance Chisq Df Pr(>Chisq)
mdl.l2normLog    4 398.87 416.98 -195.43   390.87                    
mdl.l1normLog    4 399.97 418.08 -195.99   391.97     0  0          1

# 3 correlations OR l2normLog

mdl.correlations <- lmer(sDurLog ~ correlations + (1 | speaker), data, REML=F)
mdl.l2normLog <- lmer(sDurLog ~ l2normLog + (1 | speaker), data, REML=F)

anova(mdl.correlations, mdl.l2normLog) # let's take 'correlations' > lower AIC
Data: data
Models:
mdl.correlations: sDurLog ~ correlations + (1 | speaker)
mdl.l2normLog: sDurLog ~ l2normLog + (1 | speaker)
                 npar    AIC    BIC  logLik deviance Chisq Df Pr(>Chisq)
mdl.correlations    4 393.86 411.97 -192.93   385.86                    
mdl.l2normLog       4 398.87 416.98 -195.43   390.87     0  0          1

# 4 correlations OR cor_max

mdl.correlations <- lmer(sDurLog ~ correlations + (1 | speaker), data, REML=F)
mdl.cor_max <- lmer(sDurLog ~ cor_max + (1 | speaker), data, REML=F)

anova(mdl.correlations, mdl.cor_max) # let's take 'correlations' > lower AIC
Data: data
Models:
mdl.correlations: sDurLog ~ correlations + (1 | speaker)
mdl.cor_max: sDurLog ~ cor_max + (1 | speaker)
                 npar    AIC    BIC  logLik deviance Chisq Df Pr(>Chisq)
mdl.correlations    4 393.86 411.97 -192.93   385.86                    
mdl.cor_max         4 402.87 420.98 -197.44   394.87     0  0          1

Keep ‘correlations’;
throw out ‘cor_target’, ‘l2normLog’, ‘l1normLog’, ‘cor_max’.
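The competing models in this section are not nested and have the same number of parameters, so the likelihood-ratio rows printed by `anova()` have `Df = 0` and their p-values are not interpretable; the decisions rest on AIC alone. A minimal sketch of a direct AIC comparison follows. It uses `lm()` on simulated data instead of `lmer()`, and the names `toy`, `p1`, `p2`, and `delta` are assumptions.

```r
# Two strongly correlated candidate predictors; only p1 drives y here.
set.seed(8)
n <- 300
toy <- data.frame(p1 = rnorm(n))
toy$p2 <- 0.9 * toy$p1 + rnorm(n, sd = 0.4)
toy$y  <- 0.5 * toy$p1 + rnorm(n)

# One single-predictor model per candidate.
m.p1 <- lm(y ~ p1, data = toy)
m.p2 <- lm(y ~ p2, data = toy)

# AIC table with each model's distance from the best model.
aic <- AIC(m.p1, m.p2)
aic$delta <- aic$AIC - min(aic$AIC)
aic
```

As a common rule of thumb, AIC differences below about 2 are treated as weak evidence, which applies to some of the comparisons in this document.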


9.5. Mixed Models

Model without Affix.x as variable:

lm.all.pseudo.m1 <- lmer(sDurLog ~ pauseBin +
                           folType +
                           correlations +
                           speakingRateLog +
                           (1 | speaker),
                         data=data, REML=F)

anova(lm.all.pseudo.m1)
Type III Analysis of Variance Table with Satterthwaite's method
                 Sum Sq Mean Sq NumDF  DenDF  F value    Pr(>F)    
pauseBin        10.5203 10.5203     1 674.31 155.7491 < 2.2e-16 ***
folType          3.3545  0.8386     4 646.71  12.4157 9.800e-10 ***
correlations     0.6004  0.6004     1 653.70   8.8891  0.002975 ** 
speakingRateLog  1.4112  1.4112     1 683.94  20.8927 5.764e-06 ***
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

Model including Affix.x as variable:

lm.all.pseudo.m2 <- lmer(sDurLog ~ pauseBin +
                           folType +
                           correlations +
                           speakingRateLog +
                           Affix.x +
                           (1 | speaker),
                         data=data, REML=F)

anova(lm.all.pseudo.m2) 
Type III Analysis of Variance Table with Satterthwaite's method
                 Sum Sq Mean Sq NumDF  DenDF  F value    Pr(>F)    
pauseBin        11.1020 11.1020     1 674.85 168.1359 < 2.2e-16 ***
folType          3.2920  0.8230     4 646.75  12.4640 8.989e-10 ***
correlations     0.0907  0.0907     1 647.75   1.3729    0.2417    
speakingRateLog  1.2189  1.2189     1 683.92  18.4604 1.986e-05 ***
Affix.x          1.0943  1.0943     1 653.02  16.5734 5.254e-05 ***
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

When Affix.x is added as a variable, the LDL measure correlations is no longer predictive of /s/ duration. However, Affix.x and correlations are highly correlated (\(\rho = 0.78\)), so this may well be a masking effect.

We may therefore assume that correlations is just as good a predictor of /s/ duration as Affix.x.
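How a correlated covariate can mask a predictor is easiest to see in a simulation. In the sketch below the outcome is driven by a continuous measure (`ldl`) that correlates at roughly 0.8 with a binary factor (`affix`). Adding the factor inflates the standard error of the measure, which can push it below significance. The data-generating process and all names are assumptions of this sketch; it illustrates the mechanism and does not show that masking occurred in the ENGS data.

```r
set.seed(9)
n <- 500
affix <- rbinom(n, 1, 0.5)                     # hypothetical binary factor
ldl   <- 0.8 * affix + rnorm(n, sd = 0.3)      # measure that tracks the factor
y     <- -0.15 * ldl + rnorm(n, sd = 0.5)      # outcome driven by the measure

# Correlation between the predictors and the resulting variance inflation.
r   <- cor(ldl, affix)
vif <- 1 / (1 - r^2)

# The measure alone versus the measure together with the factor.
m.alone <- lm(y ~ ldl)
m.both  <- lm(y ~ ldl + affix)

se.alone <- coef(summary(m.alone))["ldl", "Std. Error"]
se.both  <- coef(summary(m.both))["ldl", "Std. Error"]
c(r = r, vif = vif, se.alone = se.alone, se.both = se.both)
```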

10 Simulation: ENGS+MALD

This attempt uses pseudoword + real word types, and pseudoword tokens. The A and S matrices are simulated.

10.1. Matrices, Comprehension & Production

## load ENGS+MALD types data

data.comb <- read.csv("../data.comb.csv")

Matrices:

## all words (pseudowords + real words): C matrix

comb.cues = make_cue_matrix(data = data.comb, formula = ~ Word + Affix, grams = 3, wordform = "DISC")


## all words (pseudowords + real words): A matrix

# this simulates a semantic space for ALL words based on the dimensions of the C matrix
comb.Amat <- semantic_space(~ Word + Affix, data = data.comb, nrows=nrow(comb.cues$matrices$C), ncols=ncol(comb.cues$matrices$C)-2, with_semantic_vectors = TRUE, with_mvrnorm = FALSE, wordform = "Word")


## all words (pseudowords + real words): S matrix

comb.Smat = make_S_matrix(data = data.comb, formula = ~ Word + Affix, grams = 3, wordform = "DISC")

Comprehension:

# getting the mapping for comprehension
comb.comp = learn_comprehension(cue_obj = comb.cues, S = comb.Smat)
beep()


# evaluating comprehension accuracy
comb.comp_acc = accuracy_comprehension(m = comb.comp, data = data.comb, wordform = "DISC", show_rank = TRUE, neighborhood_density = TRUE)
beep()

comb.comp_acc$acc #0.9837631


# comprehension measures - works only without affixFunctions
comb.comp_measures = comprehension_measures(comp_acc = comb.comp_acc, Shat = comb.comp$Shat, S = comb.Smat)
beep()

Production:

# getting the mapping for production
comb.prod = learn_production(cue_obj = comb.cues, S = comb.Smat, comp = comb.comp)
beep()


# evaluating production accuracy
comb.prod_acc = accuracy_production(m = comb.prod, data = data.comb, wordform = "DISC", full_results = TRUE, return_triphone_supports = TRUE)
beep()

comb.prod_acc$acc #0.9990449


# production measures
comb.prod_measures = production_measures(prod_acc = comb.prod_acc, S = comb.Smat, prod = comb.prod)
beep()

Export measures:

write.csv(comb.comp_measures, "comb.comp_measures_simulated.csv")

write.csv(comb.prod_measures, "comb.prod_measures_simulated.csv")

10.2. ENGS data with LDL measures (including MALD semantics)

data.comb <- read.csv("../data.comb.csv")

data.ENGS.dur <- read.csv("../mald/engs_durations.csv", header = T)

names(data.ENGS.dur)[names(data.ENGS.dur) == "ï..number"] <- "number"

comb.comp_measures_simulated <- read.csv("../comb.comp_measures_simulated.csv", header = T)
comb.prod_measures_simulated <- read.csv("../comb.prod_measures_simulated.csv", header = T)

data.LDL <- cbind(data.comb, comb.comp_measures_simulated, comb.prod_measures_simulated)

data2 <- merge(data.ENGS.dur, data.LDL, by="Base")
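The column name `ï..number` that is renamed above is the typical trace of a UTF-8 byte-order mark (BOM) at the start of a CSV file. As a hedged alternative to renaming, the file can be read with `fileEncoding = "UTF-8-BOM"`, which strips the mark. The sketch below demonstrates this on a temporary file whose name and contents are made up for the example.

```r
# Write a small CSV that starts with a UTF-8 byte-order mark.
tmp <- tempfile(fileext = ".csv")
con <- file(tmp, open = "wb")
writeBin(as.raw(c(0xEF, 0xBB, 0xBF)), con)           # the BOM bytes
writeBin(charToRaw("number,Base\n1,abc\n"), con)     # made-up contents
close(con)

# Reading with this encoding removes the BOM from the first column name.
with.bom <- read.csv(tmp, fileEncoding = "UTF-8-BOM")
names(with.bom)
```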

Check and log-transform variables:

data2$sDurLog <- log(data2$sDur)
data2$wordDurLog <- log(data2$wordDur)
data2$baseDurLog <- log(data2$baseDur)
data2$speakingRateLog <- log(data2$speakingRate)

data2$googleFreqLog <- log(data2$googleFreq)

data2$l1normLog <- log(data2$l1norm)
data2$l2normLog <- log(data2$l2norm)
data2$densityLog <- log(data2$density)


data2$pauseBin <- as.factor(data2$pauseBin)
data2$preC <- as.factor(data2$preC)
data2$folType <- as.factor(data2$folType)

data2$Affix.x[is.na(data2$Affix.x)] <- "nm"
data2$Affix.x <- as.factor(data2$Affix.x)

10.3. Random Forests

preds = c("googleFreqLog", "l1normLog", "l2normLog", "densityLog", "path_entropies", "path_counts", "path_sum", "lwlr", "correlations", "cor_max", "cor_target", "preC", "folType", "pauseBin")
varnames = c("googleFreqLog", "l1normLog", "l2normLog", "densityLog", "path_entropies", "path_counts", "path_sum", "lwlr", "correlations", "cor_max", "cor_target", "preC", "folType", "pauseBin", "sDurLog")

pseudo.cforest = cforest(sDurLog ~ ., data = data2[, varnames])
pseudo.varimp = varimp(pseudo.cforest)
dotplot(sort(pseudo.varimp), xlab = "Relative variable importance for predicting /s/ duration", col = "#a2b98d", main = expression("Relative variable importance for /s/ duration"))

The following variables appear to be predictive for /s/ duration:

  1. pauseBin
  2. folType
  3. correlations
  4. cor_target
  5. path_sum
  6. path_counts
  7. lwlr
  8. l2normLog
  9. cor_max

10.4. Correlations

pairscor.fnc(data2[,c("sDurLog", "pauseBin", "folType", "correlations", "cor_target", "path_sum", "path_counts", "lwlr", "l2normLog", "cor_max", "Affix.x")])

#  cor_target + correlations; r=1
#  cor_target/correlations + path_sum; r=0.38
#  cor_target/correlations + lwlr; r=-0.59
#  cor_target/correlations + l2normLog; r=0.47
#  cor_target/correlations + cor_max; r=0.52

#  path_sum + lwlr; r=-0.76

#  l2normLog + cor_max; r=0.84

For each pair of correlated LDL variables, test which one is the better predictor.

# 1 cor_target OR correlations

mdl.cor_target <- lmer(sDurLog ~ cor_target + (1 | speaker), data2, REML=F)
mdl.correlations <- lmer(sDurLog ~ correlations + (1 | speaker), data2, REML=F)

anova(mdl.cor_target, mdl.correlations) # let's take 'correlations' for now
Data: data2
Models:
mdl.cor_target: sDurLog ~ cor_target + (1 | speaker)
mdl.correlations: sDurLog ~ correlations + (1 | speaker)
                 npar    AIC    BIC  logLik deviance Chisq Df Pr(>Chisq)
mdl.cor_target      4 392.45 410.56 -192.22   384.45                    
mdl.correlations    4 392.45 410.56 -192.22   384.45     0  0          1

# 2 correlations OR path_sum

mdl.correlations <- lmer(sDurLog ~ correlations + (1 | speaker), data2, REML=F)
mdl.path_sum <- lmer(sDurLog ~ path_sum + (1 | speaker), data2, REML=F)

anova(mdl.correlations, mdl.path_sum) # let's take 'path_sum' > lower AIC
Data: data2
Models:
mdl.correlations: sDurLog ~ correlations + (1 | speaker)
mdl.path_sum: sDurLog ~ path_sum + (1 | speaker)
                 npar    AIC    BIC  logLik deviance  Chisq Df Pr(>Chisq)    
mdl.correlations    4 392.45 410.56 -192.22   384.45                         
mdl.path_sum        4 390.63 408.74 -191.31   382.63 1.8236  0  < 2.2e-16 ***
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

# 3 path_sum OR lwlr

mdl.path_sum <- lmer(sDurLog ~ path_sum + (1 | speaker), data2, REML=F)
mdl.lwlr <- lmer(sDurLog ~ lwlr + (1 | speaker), data2, REML=F)

anova(mdl.path_sum, mdl.lwlr) # let's take 'path_sum' > lower AIC
Data: data2
Models:
mdl.path_sum: sDurLog ~ path_sum + (1 | speaker)
mdl.lwlr: sDurLog ~ lwlr + (1 | speaker)
             npar    AIC    BIC  logLik deviance Chisq Df Pr(>Chisq)
mdl.path_sum    4 390.63 408.74 -191.31   382.63                    
mdl.lwlr        4 398.78 416.89 -195.39   390.78     0  0          1

# 4 l2normLog OR cor_max

mdl.l2normLog <- lmer(sDurLog ~ l2normLog + (1 | speaker), data2, REML=F)
mdl.cor_max <- lmer(sDurLog ~ cor_max + (1 | speaker), data2, REML=F)

anova(mdl.l2normLog, mdl.cor_max) # let's take 'cor_max' > lower AIC
Data: data2
Models:
mdl.l2normLog: sDurLog ~ l2normLog + (1 | speaker)
mdl.cor_max: sDurLog ~ cor_max + (1 | speaker)
              npar    AIC    BIC  logLik deviance  Chisq Df Pr(>Chisq)    
mdl.l2normLog    4 403.01 421.13 -197.51   395.01                         
mdl.cor_max      4 402.87 420.98 -197.43   394.87 0.1483  0  < 2.2e-16 ***
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

Keep ‘path_sum’, ‘path_counts’, ‘cor_max’;
throw out ‘cor_target’, ‘correlations’, ‘lwlr’, ‘l2normLog’


10.5. Mixed Models

Model without Affix.x as variable:

lm.all.pseudo.m1 <- lmer(sDurLog ~ pauseBin +
                           folType +
                           path_sum +
                           path_counts +
                           cor_max +
                           speakingRateLog +
                           (1 | speaker),
                         data=data2, REML=F)


step(lm.all.pseudo.m1)
Backward reduced random-effect table:

              Eliminated npar  logLik    AIC    LRT Df Pr(>Chisq)    
<none>                     12  -92.43 208.86                         
(1 | speaker)          0   11 -186.73 395.45 188.59  1  < 2.2e-16 ***
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

Backward reduced fixed-effect table:
Degrees of freedom method: Satterthwaite 

                Eliminated  Sum Sq Mean Sq NumDF  DenDF  F value    Pr(>F)    
path_counts              1  0.0003  0.0003     1 645.28   0.0044    0.9474    
cor_max                  2  0.0006  0.0006     1 647.71   0.0090    0.9244    
pauseBin                 0 10.3493 10.3493     1 674.09 155.0057 < 2.2e-16 ***
folType                  0  3.6130  0.9032     4 646.64  13.5282 1.348e-10 ***
path_sum                 0  1.0863  1.0863     1 651.30  16.2701 6.143e-05 ***
speakingRateLog          0  1.5168  1.5168     1 683.94  22.7176 2.293e-06 ***
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

Model found:
sDurLog ~ pauseBin + folType + path_sum + speakingRateLog + (1 | 
    speaker)

lm.all.pseudo.m2 <- lmer(sDurLog ~ pauseBin + 
                           path_sum + 
                           folType + 
                           speakingRateLog + 
                           (1 | speaker),
                         data=data2, REML=F)


anova(lm.all.pseudo.m2)
Type III Analysis of Variance Table with Satterthwaite's method
                 Sum Sq Mean Sq NumDF  DenDF F value    Pr(>F)    
pauseBin        10.3493 10.3493     1 674.09 155.006 < 2.2e-16 ***
path_sum         1.0863  1.0863     1 651.30  16.270 6.143e-05 ***
folType          3.6130  0.9032     4 646.64  13.528 1.348e-10 ***
speakingRateLog  1.5168  1.5168     1 683.94  22.718 2.293e-06 ***
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

Model including Affix.x as variable:

lm.all.pseudo.m3 <- lmer(sDurLog ~ pauseBin + 
                           path_sum + 
                           folType + 
                           speakingRateLog + 
                           Affix.x +
                           (1 | speaker),
                         data=data2, REML=F)


anova(lm.all.pseudo.m3)
Type III Analysis of Variance Table with Satterthwaite's method
                 Sum Sq Mean Sq NumDF  DenDF  F value    Pr(>F)    
pauseBin        10.8569 10.8569     1 674.41 165.2709 < 2.2e-16 ***
path_sum         0.2773  0.2773     1 649.28   4.2209 0.0403289 *  
folType          3.3384  0.8346     4 646.79  12.7048 5.849e-10 ***
speakingRateLog  1.2455  1.2455     1 683.96  18.9595 1.540e-05 ***
Affix.x          0.7929  0.7929     1 656.63  12.0700 0.0005461 ***
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

When Affix.x is added as a variable, the LDL measure path_sum becomes less predictive of /s/ duration, although it remains significant. However, Affix.x and path_sum are correlated (\(\rho = 0.47\)), so this may well be a masking effect.

We may therefore assume that path_sum is just as good a predictor of /s/ duration as Affix.x.