Modelling interaction in the lexicon

  • The big picture: human languages evolve on a cultural timescale:
    • individual utterance selection > language change > language evolution
    • one language -> other language(s)
  • Massive, centuries-spanning corpora compiled in recent years open up an unprecedented avenue for investigating language dynamics.
  • (cf. Bochkarev et al., 2014; Cuskley et al., 2014; Feltgen et al., 2017; Frermann and Lapata, 2016; Gulordava and Baroni, 2011; Hamilton et al., 2016; Newberry et al., 2017; Petersen et al., 2012; Sagi et al., 2011; Schlechtweg et al., 2017; Wijaya and Yeniterzi, 2011)
  • These make it possible to study not only variant usage frequencies but also meaning (and meaning change), using distributional semantics methods

  • What I’m interested in: as new words - e.g. neologisms & borrowings - are selected for, what happens to their older synonyms?
  • Identified two confounds that need to be controlled for
  • Automatic distribution-based similarity measures are useful for quantifying both meaning and meaning change
    • but apparent semantics tend to change when frequency changes (1)
  • Analyses based on simply counting words can lead to spurious results
    • a big change may well be driven by a change in topic composition (2)

1. Frequency change bias in semantic change measures

  • Distributional semantics: based on contextual co-occurrence; semantic change ~ semantic similarity of a word between temporal subcorpora (a minimal sketch follows this list).
  • Observation: frequency change appears to affect a word’s semantics.
  • If that is the case, this would be a problem for any diachronic approach utilizing automatic semantic change measures (cf. Dubossarsky et al. 2017 for more critique of automated semantic change measures)
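To make the measure concrete, here is a minimal sketch of the similarity computation, assuming each temporal subcorpus yields a co-occurrence context vector for the word; the function names are illustrative, not from the original code:

```python
import numpy as np

def cosine(u, v):
    """Cosine similarity between two context vectors."""
    return float(np.dot(u, v) / (np.linalg.norm(u) * np.linalg.norm(v)))

# Semantic change of w between two periods can then be quantified as, e.g.,
# 1 - cosine(vector_of_w_in_subcorpus_1, vector_of_w_in_subcorpus_2)
```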

Testing approach

  • Simulate frequency change of a word between subcorpora and measure the resulting semantic change
  • But instead of actual different subcorpora, use data from a single corpus (2000-2009 in COHA, the Corpus of Historical American English), and generate different versions of it (corpus\('\)) where the occurrences of some target word \(w\) have been downsampled by relabelling a fixed portion of them as \(w'\)
  • Measure the similarity of \(w\) in the original corpus to \(w'\) in corpus\('\).
  • Null hypothesis: no semantic change should occur (actually the same word)

  • 100 random words (nouns) from equally spaced log frequency bands, 25 downsample sizes \(s \in [0.1, 7]\)
  • For each \(w\) with frequency \(f\), and each \(s\), relabel a portion \(e^{\ln(f) - s} = f/e^s\) of its occurrences (excluding downsamples with \(n<10\)); see the sketch after this list
  • E.g., if \(f=1000\) and \(s=0.7\), then \(1000/e^{0.7} \approx 496\), or a -50.3% reduction.
  • For each downsampled \(w'\), measure its semantic similarity to the original word, using 5 different distributional approaches (with 10x replications for each combination):
    • full count vectors (no dimension reduction), cosine similarity
    • full vectors, but PPMI weighted, cosine similarity
    • APSyn rank-based similarity, using top 100 PPMI-weighted terms (Santus et al. 2016)
    • Latent Semantic Analysis (SVD) embeddings of count vectors, cosine similarity
    • GloVe embeddings of count vectors, cosine similarity (Pennington et al. 2014)
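A minimal sketch of the downsampling step described above, assuming the corpus is a flat list of tokens; `downsample_word` and the `_prime` suffix are illustrative names, not the original implementation:

```python
import math
import random

def downsample_word(tokens, w, s, min_count=10, seed=42):
    """Relabel e^(ln f - s) = f/e^s randomly chosen occurrences of w as w'."""
    rng = random.Random(seed)
    positions = [i for i, t in enumerate(tokens) if t == w]
    f = len(positions)
    n_new = round(f / math.exp(s))      # frequency of the downsampled w'
    if n_new < min_count:               # exclude downsamples with <10 occurrences
        return None
    new_tokens = list(tokens)
    for i in rng.sample(positions, n_new):
        new_tokens[i] = w + "_prime"    # the remaining occurrences stay as w
    return new_tokens
```

Each of the five models would then be retrained on the modified corpus, and the similarity of \(w\) (original corpus) to \(w'\) (corpus\('\)) measured; under the null hypothesis the two should remain maximally similar.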




Interim conclusions (part 1)

  • All 5 semantic similarity methods exhibit some bias; in some, the bias is predictable from frequency band
  • Vector space density matters: a large change value does not necessarily correspond to a categorical change in semantics in a sparse space; similarity rank between \(w\) and \(w'\) is comparable between methods
  • Good news: change to the extent of becoming a “different word” (\(w\) not the closest synonym for \(w'\)) occurs mostly at low frequencies (<100), which should be considered unreliable anyway
  • Some methods (APSyn, GloVe) are more susceptible, while the simple PPMI-weighted vectors approach suffers the least
  • But wait! Not all methods are born equal… the PPMI vectors largely fail at capturing synonymy (similarity scores seem more reflective of association: 0.15 correlation for association norms)
  • Possible future options: different (bigger) corpora, parameter tuning, more training time for GloVe, more methods (e.g. fastText, recent Bayesian and “deep neural” approaches, cf. Frermann and Lapata, 2016; Rosenfeld et al., 2018)
  • The downsampling approach is extendable to actual diachronic corpora, to compare observed semantic change against the change expected from frequency difference alone.

2. Fluctuations in topic frequencies

  • The two confounds that need to be controlled for…
    • Interplay of frequency change and semantic change measures ✔
    • Topical fluctuations
  • Observation: the ebb and flow of discourse topics in a diachronic corpus reflects real-world events (wars -> war-related news -> frequency of military words increases)
  • Token frequency ~ probability of usage ~ fitness ~ being selected for
  • However: corpus frequencies may be misleading (Chesley & Baayen, 2010; Lijffijt et al., 2012; Calude et al., 2017; Szmrecsanyi, 2016)
  • Observation: sometimes similar words both increase in frequency instead of competing; the emergence of new words often coincides with a frequency increase in similar words, not a decrease.
  • Frequency change might not necessarily imply selection.


The topical-cultural advection model

  • Control for diachronic topical fluctuations by quantifying the frequency change of a word’s topic.
  • advection: ‘the transport of substance, particularly fluids, by bulk motion’
  • Formalized as the weighted mean of the log frequency changes of the relevant topic (context) words of the target word

How does this work?

  • Generate a “topic” for each target word, consisting of m context words, based on co-occurrence (topic modelling ~ distributional semantics); see the sketch after this list

  • Topical advection: a measure of how much topic/context words like cafe, cappuccino have changed on average (weighted by an association score) between two periods.
  • latte: calculate its log frequency change (e.g. +1.19 between 1990s->2000s)
  • calculate its topical advection: +0.07 (weighted mean log frequency change in context words) (see Appendix for math)
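A sketch of the topic-construction step, assuming a dense co-occurrence count matrix over a vocabulary in which every word has a nonzero count; the names are illustrative:

```python
import numpy as np

def topic_words(cooc, vocab, target, m=100):
    """Return the top-m PPMI-weighted context words of `target`, with weights."""
    total = cooc.sum()
    p_joint = cooc / total                          # P(word, context)
    p_w = cooc.sum(axis=1, keepdims=True) / total   # P(word)
    p_c = cooc.sum(axis=0, keepdims=True) / total   # P(context)
    with np.errstate(divide="ignore"):
        pmi = np.log(p_joint / (p_w * p_c))
    ppmi = np.maximum(pmi, 0)                       # positive PMI
    i = vocab.index(target)
    top = np.argsort(ppmi[i])[::-1][:m]             # m strongest context words
    return [(vocab[j], float(ppmi[i, j])) for j in top if ppmi[i, j] > 0]
```

The PPMI weights can then double as the association scores used when averaging the context words' frequency changes.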

How well does it work?

  • Correlate the log frequency changes of all (sufficiently frequent) nouns between two time periods to their respective topical advection values
  • What should we expect?


Conclusions (part 2)

Future work

  • The two confounds that need to be controlled for…
    • Interplay of frequency change and semantic change measures ✔
    • Topical fluctuations ✔
  • As new words are selected for, what happens to their older synonyms?
  • Observation: competition may manifest in at least two ways: frequency change or meaning change (e.g. radio <-> wireless, beef <-> cow) - but at times near-synonyms both successfully remain in use.
  • Hypothesis: high semantic similarity (introduced by emergent novel words or semantic change) leads to competition between similar variants (1) - unless there is sufficient communicative need (2) in the lexical subspace to sustain near-synonymy.
    • (1) apparent by diverging frequency or diverging semantics (while controlling for bias)
    • (2) as measured by the advection model









Appendix


Math and parameters (the frequency change - semantic change simulation)

  • Used the same data (Corpus of Historical American English) in both sections; here limited to the last decade (2000-2009), which is ~10m content word tokens after cleaning
  • Context window for co-occurrence is +/-5, linearly weighted (see the sketch below); excluded low frequency words (<100) from the models
  • The methods:
    • cosine similarity over plain co-occurrence count matrix (no dimension reduction), i.e. vector length is the entire lexicon (~10k types after removing low frequency types)
    • cosine over PPMI-weighted full vectors (same, full lexicon)
    • APSyn (N=100 top PPMI-weighted terms are used for the rank comparison)
    • cosine over LSA-reduced co-occurrence matrix (300dim)
    • cosine over GloVe-reduced co-occurrence matrix (100dim, 20 iterations with early stopping allowed, learning rate = 0.15, alpha = 0.75, lambda = 0; parameter tuning and longer training might improve results)
  • The downsampling:
  • Sampled 100 nouns from equally spaced log frequency bands, with frequencies in \([510, 50863]\)
  • Defined a sequence of 25 downsample sizes \(s \in [0.1, 7]\); the results also include a “sanity check” of \(s=0\), where no reduction is applied, only random reshuffling

For each \(w\) with an original frequency \(f\), and each \(s\), downsampled by randomly relabeling a fixed portion of its occurrences as \(w'\) in the corpus, where the portion is defined as \(e^{\ln(f) - s} = f/e^s\) (excluded downsamples with \(<10\) occurrences).
E.g., if \(f=1000\) and \(s=0.7\), then \(e^{\ln(1000) - 0.7} \approx 496\), or a -50.3% reduction.
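For reference, a sketch of co-occurrence counting with a linearly weighted +/-5 window (here weight 1 at distance 1 decaying to 1/5 at distance 5, one common linear scheme); `tokens` is assumed to be a cleaned list of content-word lemmas, and the names are illustrative:

```python
from collections import defaultdict

def cooccurrence_counts(tokens, window=5):
    """Symmetric co-occurrence counts with linear distance weighting."""
    counts = defaultdict(float)
    for i in range(len(tokens)):
        for d in range(1, window + 1):
            if i + d < len(tokens):
                weight = (window - d + 1) / window   # 5/5, 4/5, ... 1/5
                counts[(tokens[i], tokens[i + d])] += weight
                counts[(tokens[i + d], tokens[i])] += weight
    return counts
```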

On the heatmaps, \(w\) being the closest synonym to \(w'\) is marked as False if \(w\) fails to be the closest word in any of the 10 replicates. The downsampling procedure included a 0-downsample as a sanity check, which entailed no reduction in frequency, i.e. 100% of the occurrences were sampled; these are also displayed on the plots as the first column/value.

Semantic vector space performance measure

  • We used SimLex-999 (Hill et al., 2015) to measure how well the various methods employed here reflect actual semantic similarity (synonymy), as compared to the human judgements recorded in the SimLex dataset. The evaluation entails correlating the model similarity scores with the test set scores for the given word pairs (Spearman). We chose this set since synonymy is more relevant to future research than e.g. association; test sets are also available for other semantic relationships.
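A minimal sketch of this evaluation, assuming `similarity(w1, w2)` returns a model's score (or None for out-of-vocabulary words) and `simlex_rows` holds (word1, word2, gold score) tuples parsed from the SimLex-999 file; the names are illustrative:

```python
from scipy.stats import spearmanr

def evaluate(similarity, simlex_rows):
    """Spearman correlation between model similarities and SimLex judgements."""
    model, gold = [], []
    for w1, w2, score in simlex_rows:
        sim = similarity(w1, w2)
        if sim is not None:          # skip pairs missing from the model
            model.append(sim)
            gold.append(score)
    return spearmanr(model, gold).correlation
```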

Math (the topical-cultural advection model)

The advection value of a word in time period \(t\) is defined as the weighted mean of the changes in the frequencies (compared to the previous period) of its associated topic words. More precisely, the topical advection value for a word \(\omega\) at time period \(t\) is

\[\begin{equation} {\rm advection}(\omega;t) := {\rm weightedMean}\big( \{ {\rm logChange}(N_i;t) \mid i=1,\dots,m \}, \, W \big) \end{equation}\]

where \(N\) is the set of \(m\) words associated with the target at time \(t\) and \(W\) is the set of weights (to be defined below) corresponding to those words. The weighted mean is simply

\[\begin{equation} {\rm weightedMean}(X, W) := \frac{\sum_i x_i w_i }{\sum_i w_i} \end{equation}\]

where \(x_i\) and \(w_i\) are the \(i^{\rm th}\) elements of the sets \(X\) and \(W\) respectively. The log change for period \(t\) for each of the associated words \(\omega'\) is given by the change in the logarithm of its frequencies from the previous to the current period. That is,

\[\begin{equation} {\rm logChange}(\omega';t) := \log[f(\omega';t)+1] - \log[f(\omega';t-1)+1] \end{equation}\]

where \(f(\omega';t)\) is the number of occurrences of word \(\omega'\) in the time period \(t\). Note we add \(1\) to these frequency counts, to avoid \(\log(0)\) appearing in the expression.
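The equations above transcribe directly into code; a sketch, assuming `freq(word, t)` returns the raw count of a word in period \(t\) and `topic` is a list of (context word, weight) pairs for the target (the names are illustrative):

```python
import math

def log_change(freq, word, t):
    # logChange(w'; t) = log[f(w'; t) + 1] - log[f(w'; t - 1) + 1]
    return math.log(freq(word, t) + 1) - math.log(freq(word, t - 1) + 1)

def advection(freq, topic, t):
    # weightedMean of the topic words' log frequency changes
    changes = [log_change(freq, word, t) for word, _ in topic]
    weights = [weight for _, weight in topic]
    return sum(x * w for x, w in zip(changes, weights)) / sum(weights)
```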

Parameters

  • Used the COHA corpus, divided into decade subcorpora
  • Preprocessing: lemmatization, stopword removal, and misc cleaning; used only content words, and excluded proper nouns; the advection model was applied to common nouns only.
  • Excluded words with less than 100 occurrences
  • Used the top 100 PPMI-weighted context words (from a window of +/-5) for the simpler approach; an LDA model yielded comparable results (see full paper for more details)

More detailed similarity ranks heatmap

This is a more detailed rendering of the similarity ranks plot shown above: instead of true/false, the mean ranks are shown (i.e., among all words, which one is closest to \(w'\); if the value is 1, then the original \(w\) is the closest; higher values indicate how low \(w\) ranks, e.g., “10” means it is the 10th most similar word to \(w'\)).
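A sketch of the underlying rank computation, assuming `sims(word)` returns a dict mapping every vocabulary word to its similarity with that word; the names are illustrative:

```python
def rank_of_original(sims, w, w_prime):
    """Rank of the original w among the nearest neighbours of w'."""
    neighbours = dict(sims(w_prime))         # copy: word -> similarity with w'
    neighbours.pop(w_prime, None)            # exclude w' itself
    ranked = sorted(neighbours, key=neighbours.get, reverse=True)
    return ranked.index(w) + 1               # 1 = w is the closest word to w'
```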




*This research was supported by the scholarship program Kristjan Jaak, funded and managed by the Archimedes Foundation in collaboration with the Ministry of Education and Research of Estonia.