This markdown document presents the combined LDL implementation of real words and pseudowords. It provides a step-by-step explanation, with accompanying R code, of all steps from the word sets (real and nonce words) to the cue and semantic matrices and the computation of the comprehension and production mappings.

This documentation mainly makes use of two packages:

library(WpmWithLdl)
library(LDLConvFunctions)

See the References section for further information and resources.

1 Data Sets: Real words and pseudowords

1.1 Real words

The set of real words is taken from the Massive Auditory Lexical Decision (MALD) Database (Tucker et al., 2018). It is a subset of the complete MALD database, containing 8285 words, 2120 of which contain one of 36 different affixes. This dataframe is called realwords.


1.2 Pseudowords

Pseudowords are taken from the production study by Schmitz et al. (2020). In total, 48 pseudowords are contained within this data set, half of them representing monomorphemic words (e.g. singular bloufs) and half of them representing plural word forms (e.g. plural bloufs). As a number of pseudowords show more than one consistent pronunciation, such pseudowords are contained more than twice (e.g. prups). This dataframe is called pseudowords.


1.3 All Words

The combined word set contains the data on both real and nonce words as introduced in 1.1 and 1.2. In total, 8333 words and 36 different affixes are part of this data set. This dataframe is called allwords.

2 A matrix

We can create an \(A\) matrix for real words using the semantic vectors constructed from the TASA corpus by Baayen et al. (2019). The \(A\) matrix contains the semantic vectors of all lexomes in orthographic form (rows), with the semantic dimensions as columns.

## load TASA vectors
load("data/TASA.rda")

## find all real word lexomes
realwords.allLexomes = unique(c(realwords$Word, realwords$Affix))
realwords.allLexomes = realwords.allLexomes[!is.na(realwords.allLexomes)]

## find vectors for all unique lexomes; this then is the A matrix
A.matrix = TASA[realwords.allLexomes, ]
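
A quick sanity check (a sketch; it assumes that all lexomes in realwords are indeed covered by the TASA vectors) confirms that the \(A\) matrix contains one row per lexome:

## sanity check: one row per lexome, in the order of realwords.allLexomes
all(rownames(A.matrix) == realwords.allLexomes)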

3 C matrices

Cue matrices contain all triphones of a data set (columns), coded in binary fashion for all words in phonological form (rows).
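
To illustrate this coding, the following minimal sketch extracts the triphones of a single word form; get_triphones() is a hypothetical helper for illustration, not part of WpmWithLdl or LDLConvFunctions.

## minimal sketch of triphone extraction; word boundaries are marked by '#'
get_triphones <- function(word) {
  padded <- paste0("#", word, "#")
  sapply(1:(nchar(padded) - 2), function(i) substr(padded, i, i + 2))
}

get_triphones("k{t")  # DISC form of 'cat': "#k{" "k{t" "{t#"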


3.1 Real Words

The cue matrix for real words \(real.C\) is computed based on the set of real words and \(Word+Affix\), i.e. words and affixes are considered.

real.C = make_cue_matrix(data = realwords, formula = ~ Word + Affix, grams = 3, wordform = "DISC")

3.2 Pseudowords

The cue matrix for pseudowords \(pseudo.C\) is computed based on the set of pseudowords and \(Word+Affix\), i.e. words and affixes are considered.

pseudo.C = make_cue_matrix(data = pseudowords, formula = ~ Word + Affix, grams = 3, wordform = "DISC")

3.3 All Words

The cue matrix for real and nonce words \(all.C\) is computed based on the combined set of all real and nonce words and \(Word+Affix\), i.e. words and affixes are considered.

all.C = make_cue_matrix(data = allwords, formula = ~ Word + Affix, grams = 3, wordform = "DISC")

4 S matrices

Semantic matrices contain all word forms of a data set in phonological form (rows) and their pertinent values on all semantic dimensions (columns).


4.1 Real Words

The semantic matrix for real words \(real.S\) is computed based on the set of real words, the real word \(A\) matrix, and \(Word+Affix\), i.e. words and affixes are considered.

real.S = make_S_matrix(data = realwords, formula = ~ Word + Affix, inputA = A.matrix, grams = 3, wordform = "DISC")

dim(real.S)
# 8362 5487

4.2 Pseudowords

We can create an estimated semantic matrix for pseudowords \(pseudo.S\) (Chuang et al., 2020) based on the semantics of real words by solving the following equation

\[pseudo.S = pseudo.C * F\]

with \(pseudo.S\) being the estimated semantic matrix for pseudowords, \(pseudo.C\) being the cue matrix of pseudowords, and \(F\) being the transformation matrix for mapping real word cues onto real word semantics.

To obtain this real word transformation matrix \(F\) we first need to learn comprehension of real words.

real.comprehension = learn_comprehension(cue_obj = real.C, S = real.S)

Then, we can extract \(F\) from real.comprehension.

F = real.comprehension$F

dim(F) 
# 7610 5487

That is, \(F\) contains one row per triphone cue in \(real.C\) and one column per semantic dimension in \(real.S\).

The following toy example illustrates the use of the \(F\) transformation matrix in mapping \(C\) onto \(S\), i.e. solving

\[real.C * F = predicted.real.S\]

Consider a toy lexicon comprising the words cat, bus, and eel. Its cue matrix \(C\) codes in binary fashion which triphones (columns) occur in which word (rows). For the same toy lexicon, suppose that the semantic vectors of the three words are the row vectors of a semantic matrix \(S\). Then, we are interested in a transformation matrix \(F\), such that

\[C*F=S\]

The transformation matrix \(F\) is straightforward to obtain. Let \(C'\) denote the Moore-Penrose generalized inverse of \(C\). Then,

\[F=C'S\]

with \(C*F\) being exactly equal to \(S\) in this simple toy example.
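
This computation can be reproduced in a few lines of R. Note that the binary triphone coding and the semantic values below are made up for illustration; the MASS package provides ginv() for the Moore-Penrose generalized inverse.

## toy lexicon: made-up cue matrix C (binary triphone coding, word
## boundaries marked by '#') and made-up semantic matrix S
library(MASS)

C = matrix(c(1, 1, 1, 0, 0, 0, 0, 0, 0,
             0, 0, 0, 1, 1, 1, 0, 0, 0,
             0, 0, 0, 0, 0, 0, 1, 1, 1),
           nrow = 3, byrow = TRUE,
           dimnames = list(c("cat", "bus", "eel"),
                           c("#ca", "cat", "at#",
                             "#bu", "bus", "us#",
                             "#ee", "eel", "el#")))

S = matrix(c( 1.0,  0.2, -0.3,
             -0.4,  1.0,  0.1,
              0.2, -0.1,  1.0),
           nrow = 3, byrow = TRUE,
           dimnames = list(c("cat", "bus", "eel"),
                           c("animal", "vehicle", "water")))

## F = C'S, with C' the generalized inverse of C
F = ginv(C) %*% S

## C * F recovers S exactly in this toy example
round(C %*% F, 10)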

Additionally, for matrix multiplication, the number of columns in the first matrix must be equal to the number of rows in the second matrix (e.g. Nykamp, 2020). While this is easily achieved, one must also consider the order of columns. This is easily illustrated by the following example:

\[ \left(\begin{array}{cc} A & B\\ C & D \end{array}\right) \left(\begin{array}{cc} \alpha & \beta \\ \gamma & \delta \end{array}\right) = \left(\begin{array}{cc} A\alpha+B\gamma & A\beta+B\delta \\ C\alpha+D\gamma & C\beta+D\delta \end{array}\right) \] Let the first matrix be a simplified version of \(pseudo.C\) and the second matrix a simplified version of \(F\). The resulting matrix then is a simplified version of \(pseudo.S\). However, if the order of columns (i.e. triphones) in \(real.C\), from which \(F\) is derived, is different to the order of columns (i.e. triphones) in \(pseudo.C\), multiplication is possible, yet different (wrong) columns and rows are multiplied, as shown in the following example:

\[ \left(\begin{array}{cc} B & A \\ D & C \end{array}\right) \left(\begin{array}{cc} \alpha & \beta \\ \gamma & \delta \end{array}\right) = \left(\begin{array}{cc} B\alpha+A\gamma & B\beta+A\delta \\ D\alpha+C\gamma & D\beta+C\delta \end{array}\right) \]

Thus, we must make sure to use a pseudoword cue matrix \(pseudo.C\) with the same order of columns as the real word cue matrix \(real.C\). This, in turn, also expands the pseudoword cue matrix: to contain all column names of the real word cue matrix \(real.C\), it must also have an identical number of columns.

Using two functions of the LDLConvFunctions package, this is easily done. First, find which column in the real word cue matrix \(real.C\) corresponds to which column in the pseudoword cue matrix \(pseudo.C\):

found_triphones <- find_triphones(pseudo_C_matrix = pseudo.C, real_C_matrix = real.C)

Then, use this information to create a new pseudoword cue matrix \(new.pseudo.C\) which a) has the same number of columns as the real word cue matrix, and b) the same order of columns (i.e. triphones) as the real word cue matrix:

new.pseudo.C <- reorder_pseudo_C_matrix(pseudo_C_matrix = pseudo.C, real_C_matrix = real.C, found_triphones = found_triphones)
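
A brief sanity check (a sketch, assuming both cue matrices are in memory) verifies that number and order of columns now match:

## check: new.pseudo.C has the same triphone columns as real.C
all(colnames(new.pseudo.C) == colnames(real.C))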

Finally, we can solve the aforementioned equation \(pseudo.S = pseudo.C * F\).

pseudo.S = new.pseudo.C %*% F

# change rownames to pseudoword rownames (instead of numbers)
rownames(pseudo.S) <- pseudowords$DISC

In this semantic matrix, 48 pseudowords (i.e. 78 pronunciations) and their semantic vectors are contained. However, all of our pseudoword pairs are homophones, thus, they show identical semantic vectors even though half of them contain a suffix, e.g. the semantic vectors for bloufs (non-morphemic S) and bloufs (plural S) are identical.

          fierce    fifteen  fifteenth      fifth      fifty
bl6fs -0.0001453  0.0002496  0.0002782  0.0006957 -0.0001303
bl6fs -0.0001453  0.0002496  0.0002782  0.0006957 -0.0001303
bl6ks  0.0000021 -0.0006015  0.0000090 -0.0012144  0.0003275
bl6ks  0.0000021 -0.0006015  0.0000090 -0.0012144  0.0003275

We “fix” this issue by adding the “PL” semantic vector values to the semantic vectors of plural pseudowords.

Thus, we first need to extract the “PL” semantic vector contained in the general \(A\) matrix.

PL.vec <- A.matrix[rownames(A.matrix)=="PL", ]

Taking a closer look at the pseudoword semantic matrix \(pseudo.S\), we find that every second row corresponds to a plural pseudoword, i.e. this is where we add the “PL” vector.

pseudo.pl = pseudo.S[seq(2, nrow(pseudo.S), 2), ]

Taking this plural subset, we add the “PL” semantic vector to all rows.

# rep(..., each = nrow(...)) repeats each 'PL' value once per row, so that
# column-major recycling adds the i-th 'PL' value to the i-th column
pseudo.pl.added <- pseudo.pl + rep(PL.vec, each = nrow(pseudo.pl))

We then recombine the original monomorphemic semantic matrix and the modified plural semantic matrix, and recreate the original row order.

pseudo.nm = pseudo.S[seq(1, nrow(pseudo.S), 2), ]

# keep the original row order of pseudo.S before recombining
sort.rows <- rownames(pseudo.S)

pseudo.S <- rbind(pseudo.nm, pseudo.pl.added)
pseudo.S <- pseudo.S[order(match(rownames(pseudo.S), sort.rows)), , drop = FALSE]

Now, homophonous pseudowords have differing semantic vectors.

          fierce    fifteen  fifteenth      fifth      fifty
bl6fs -0.0001453  0.0002496  0.0002782  0.0006957 -0.0001303
bl6fs -0.0000796  0.0001372  0.0003686  0.0003187 -0.0002868
bl6ks  0.0000021 -0.0006015  0.0000090 -0.0012144  0.0003275
bl6ks  0.0000678 -0.0007139  0.0000994 -0.0015913  0.0001709

Similarly, we now add the semantic vectors of “alien” and “creature” to all pseudoword semantic vectors to accommodate the original experimental setup.

# extract the 'alien' semantic vector
alien.vec <- A.matrix[rownames(A.matrix)=="alien", ]

# extract the 'creature' semantic vector
creature.vec <- A.matrix[rownames(A.matrix)=="creature", ]

## add the 'alien' + 'creature' semantic vectors to all rows of the pseudoword S matrix
pseudo.S <- pseudo.S + rep(alien.vec, each = nrow(pseudo.S))
pseudo.S <- pseudo.S + rep(creature.vec, each = nrow(pseudo.S))

4.3 All Words

Finally, we can create one combined semantic matrix \(all.S\) for real and nonce words.

all.S <- rbind(real.S, pseudo.S)
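
As rbind() requires matching columns, a quick check (a sketch) confirms that both semantic matrices share the same semantic dimensions:

## sanity check: identical semantic dimensions in both S matrices
all(colnames(real.S) == colnames(pseudo.S))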

5 Comprehension

Using \(all.C\) and \(all.S\), we can now learn and evaluate comprehension of real and nonce words in one go, i.e. as if they were encountered by a person with their individual lexicon.

First, comprehension is learned.

all.comprehension = learn_comprehension(cue_obj = all.C, S = all.S)

Second, the accuracy of this comprehension is evaluated, that is, how well form (the cue matrix) can be mapped onto meaning (the semantic matrix). The result of this mapping, i.e. a predicted semantic matrix, is obtained as \(predicted.S\).

all.comprehension.accuracy = accuracy_comprehension(m = all.comprehension, data = allwords, wordform = "DISC", show_rank = TRUE, neighborhood_density = TRUE)

all.comprehension.accuracy$acc # 0.7440758

We find that comprehension accuracy is about \(74\%\).

6 Production

Using \(all.C\) and \(all.S\) as well as the comprehension results, we now are able to learn production and evaluate its accuracy.

First, production is learned.

all.production = learn_production(cue_obj = all.C, S = all.S, comp = all.comprehension)

Second, the accuracy of this production is evaluated, that is, how well meaning (the semantic matrix) can be mapped onto form (the cue matrix). The result of this mapping, i.e. a predicted cue matrix, is obtained as \(predicted.C\).
Note: \(C\) and \(predicted.C\) are sometimes referred to as \(T\) and \(predicted.T\) when talking about production (i.e. T as in target matrix).

all.production.accuracy = accuracy_production(m = all.production, data = allwords, wordform = "DISC", full_results = TRUE, return_triphone_supports = TRUE)

all.production.accuracy$acc # 0.9727488

We find that production accuracy is about \(97\%\).

7 Measures

7.1 WpmWithLdl measures

Extracting and saving comprehension and production measures as introduced by the WpmWithLdl package (Baayen et al., 2018) is straightforward.

all.comprehension.measures = comprehension_measures(comp_acc = all.comprehension.accuracy, Shat = all.comprehension$Shat, S = all.S, A = A.matrix, affixFunctions = allwords$Affix)
all.production.measures = production_measures(prod_acc = all.production.accuracy, S = all.S, prod = all.production)
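
The resulting measure objects can then be saved for later analyses; a minimal sketch, assuming both objects are data frames and using placeholder file names:

write.csv(all.comprehension.measures, "comprehension_measures.csv", row.names = FALSE)
write.csv(all.production.measures, "production_measures.csv", row.names = FALSE)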

7.2 Further Measures

Another set of variables consists of the measures introduced by Chuang et al. (2020). To extract these measures, one can use the LDLConvFunctions package.

# load package
library(LDLConvFunctions)

# compute measures
ALCframe <- ALC(pseudo_S_matrix = pseudo.S, real_S_matrix = real.S, pseudo_word_data = pseudowords)

ALDCframe <- ALDC(prod_acc = all.production.accuracy, data = allwords)

EDNNframe <- EDNN(comprehension = all.comprehension, data = allwords)

NNCframe <- NNC(pseudo_S_matrix = pseudo.S, real_S_matrix = real.S, pseudo_word_data = pseudowords, real_word_data = realwords)

7.3 Cluster Analysis

As suggested by an anonymous reviewer, we include a cluster analysis of our variables of interest. We find that the clustering corresponds quite well to the components retained from the Principal Component Analyses given in the accompanying R file.

# load data
data <- read.csv("data/final_S_LDL_data_210609.csv", stringsAsFactors = T)

# define variables, i.e. LDL variables
variables <- data[, c(36:37, 39, 41:43, 45:46, 49:50, 52)]

# compute clustering; corclust() is provided by the klaR package
# (an assumption, as the original document does not name the package)
library(klaR)

clustering <- corclust(variables)
## console output: No factor variables!
# plot clustering
plot(clustering)


7.4 Exploratory Analysis

The following subsections present some exploratory analyses of variables as suggested by an anonymous reviewer. The analyses follow the groups identified by an earlier version of the cluster analysis. Please note that the following variables are no longer contained in the main analysis: lwlr, correlation, cor_target, cor_max.


ALDC, path_counts, path_entropies

  • ALDC: The Average Levenshtein Distance of all Candidate productions, ALDC, is the mean of all Levenshtein distances of a word and its candidate forms. That is, for a word with only one candidate form, the Levenshtein distance between that word and its candidate form is its ALDC. For words with multiple candidates, the mean of the individual Levenshtein distances between candidates and targeted form constitutes the ALDC.
  • path_counts: path_counts describes the number of paths, i.e. possible sequences of triphones, detected for the production of a word by the production model.
  • path_entropies: path_entropies contains the Shannon entropy values which are calculated over the path supports of the predicted form in \(\hat{T}\); a sketch of the entropy computation follows after this list.
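
The following minimal sketch illustrates such an entropy computation over a set of path supports; the support values are made up, and both the normalization to probabilities and the use of log2 are our assumptions for illustration, not necessarily the package's exact implementation.

## Shannon entropy over made-up path supports
supports <- c(0.6, 0.25, 0.15)
p <- supports / sum(supports)  # normalize supports to probabilities
-sum(p * log2(p))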

As production accuracy for pseudowords in our LDL implementation is 100%, the maximum value of path_counts is 5. In turn, more paths lead to higher values of path_entropies, as more forms dissimilar to the target form were found by the network:

plot(data$path_counts, data$path_entropies)

Thus, the values of ALDC show a rather small range as well, i.e. non-target candidate forms cannot be very different from the target form as the number of paths is restricted to 5.

table(data$ALDC, data$path_counts)
     
        1   2   3   4   5
  0    74   0   0   0   0
  0.5   0 277   0   0   0
  1     0  52  77 115   0
  1.4   0   0   0   0  58

path_sum, lwlr

  • path_sum: path_sum describes the summed support of paths for a predicted form
  • lwlr: The length-weakest link ratio, lwlr, is computed by taking the number of path nodes divided by the value of the weakest link of that path.

For path_sum, higher values indicate stronger support for a predicted form. For lwlr, higher values indicate weaker support for a predicted form. Thus, they are negatively correlated.

plot(data$path_sum, data$lwlr)
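
This negative relation can also be quantified directly; a sketch (the choice of Spearman's rank correlation is ours):

cor(data$path_sum, data$lwlr, method = "spearman")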

EDNN, correlations, cor_max, cor_target

  • EDNN: This variable describes the Euclidean Distance between a pseudoword’s estimated semantic vector \(s\) and its Nearest semantic real word or pseudoword Neighbour.
  • correlations: Given all candidate forms of a word, i.e. all word forms found as candidate production by the production model, and their predicted semantic vectors, correlation values between these predicted semantic vectors and a word’s semantic vector \(s\) are computed. The highest of these correlation values is taken as value of correlations.
  • cor_max: To obtain this measure, correlations between a word’s predicted semantic vector \(\hat{s}\) and all other semantic vectors of \(all.S\) are computed. Then, the highest of the correlation values is taken as cor_max.
  • cor_target: cor_target describes the correlation between a word’s predicted semantic vector \(\hat{s}\) in \(all.\hat{S}\) and its targeted semantic vector \(s\) in the S matrix.

For the pseudowords of the current implementation, correlations, cor_max, and cor_target show identical values as accuracy is 100%.

sum(data$correlations) == sum(data$cor_max)
[1] TRUE
sum(data$cor_max) == sum(data$cor_target)
[1] TRUE

Additionally, their distributions are highly skewed:

par(mfrow=c(1,3), mar=c(2,2,1,1))

plot(density(data$correlations), main="correlations")
plot(density(data$cor_max), main="cor_max")
plot(density(data$cor_target), main="cor_target")

While EDNN is also highly correlated with the aforementioned three variables, it shows less skew:

plot(density(data$EDNN), main="EDNN")

density, ALC

  • density: For density, the correlation values of a word’s predicted semantic vector \(\hat{s}\) and its eight nearest neighbours’ semantic vectors are taken into consideration. The mean of these eight correlation values describes density.
  • ALC: The Average Lexical Correlation, ALC, is the mean value of all correlation values of a pseudoword’s estimated semantic vector as contained in \(pseudo.S\) with each of the real word semantic vectors as contained in \(real.S\).

par(mfrow=c(1,2), mar=c(3,3,1,1))

plot(density(data$density), main="density")
plot(density(data$ALC), main="ALC")

Interestingly, monomorphemic and plural pseudowords have significantly different ALC values:

t.test(ALC ~ Affix, data = data)

    Welch Two Sample t-test

data:  ALC by Affix
t = -7.7426, df = 578.45, p-value = 4.376e-14
alternative hypothesis: true difference in means is not equal to 0
95 percent confidence interval:
 -0.003725199 -0.002217666
sample estimates:
mean in group NM mean in group PL 
     0.003746660      0.006718093 
t.test(density ~ Affix, data = data)

    Welch Two Sample t-test

data:  density by Affix
t = -0.72469, df = 627.99, p-value = 0.4689
alternative hypothesis: true difference in means is not equal to 0
95 percent confidence interval:
 -0.019687981  0.009073879
sample estimates:
mean in group NM mean in group PL 
       0.7903356        0.7956427 

While density is computed across the semantics of all words (real words and pseudowords), ALC takes pseudoword semantics and correlates them with real word semantics. This may explain why ALC shows a significant difference between monomorphemic and plural pseudowords while density does not.

NNC, l1norm, l2norm

  • NNC: The Nearest Neighbour Correlation is computed by taking a pseudoword’s estimated semantic vector as given in \(pseudo.S\) and checking it for the highest correlation value against all real word semantic vectors as given in \(real.S\).
  • l1norm: The l1norm is the sum of the absolute values of the elements of a given word’s predicted semantic vector \(\hat{s}\).
  • l2norm: The l2norm is the square root of the sum of the squared values of a given word’s predicted semantic vector \(\hat{s}\), i.e. its Euclidean length (see the sketch after this list).
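
Both norms are straightforward to compute by hand. A minimal sketch for a single word, using the first row of the predicted semantic matrix \(\hat{S}\) as contained in the comprehension object:

## l1norm and l2norm for one predicted semantic vector
s.hat <- all.comprehension$Shat[1, ]

sum(abs(s.hat))     # l1norm: sum of absolute values
sqrt(sum(s.hat^2))  # l2norm: Euclidean length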

Monomorphemic and plural pseudowords have significantly different NNC values:

t.test(NNC ~ Affix, data = data)

    Welch Two Sample t-test

data:  NNC by Affix
t = 42.944, df = 412.36, p-value < 2.2e-16
alternative hypothesis: true difference in means is not equal to 0
95 percent confidence interval:
 0.05105690 0.05595531
sample estimates:
mean in group NM mean in group PL 
       0.9993000        0.9457939 

Monomorphemic pseudowords show higher NNC values as compared to plural pseudowords.

References

Baayen, R. H., Chuang, Y. Y., and Blevins, J. P. (2018). Inflectional morphology with linear mappings. The Mental Lexicon, 13 (2), 232-270. doi.org/10.1075/ml.18010.baa

Baayen, R. H., Chuang, Y. Y., and Heitmeier, M. (2019). WpmWithLdl: Implementation of Word and Paradigm Morphology with Linear Discriminative Learning. R package version 1.3.17.1.

Baayen, R. H., Chuang, Y. Y., Shafaei-Bajestan, E., & Blevins, J. P. (2019). The discriminative lexicon: A unified computational model for the lexicon and lexical processing in comprehension and production grounded not in (de)composition but in linear discriminative learning. Complexity, 2019, 1-39. doi.org/10.1155/2019/4895891

Chuang, Y. Y., Vollmer, M. L., Shafaei-Bajestan, E., Gahl, S., Hendrix, P., & Baayen, R. H. (2020). The processing of pseudoword form and meaning in production and comprehension: A computational modeling approach using Linear Discriminative Learning. Behavior Research Methods, 1-51. doi.org/10.3758/s13428-020-01356-w

Nykamp, D. (2020, November 10). Multiplying matrices and vectors. Math Insight. mathinsight.org/matrix_vector_multiplication

Schmitz, D. (2021). LDLConvFunctions: LDLConvFunctions for measure computation, extraction, and other handy stuff. R package version 1.2.

Schmitz, D., Baer-Henney, D., & Plag, I. (2020). The duration of word-final /s/ differs across morphological categories in English: Evidence from pseudowords [Manuscript submitted for publication].

Tucker, B. V., Brenner, D., Danielson, D. K., Kelley, M. C., Nenadic, F., & Sims, M. (2018). The massive auditory lexical decision (MALD) database. Behavior Research Methods, 1-18. doi.org/10.3758/s13428-018-1056-1