- The big picture: human languages evolve on a cultural timescale:
- individual utterance selection -> language change -> language evolution
- a language -> another language(s)
- Massive centuries-spanning corpora compiled in recent years open up an unprecedented avenue of possible investigations into language dynamics.
- (cf. Bochkarev et al., 2014; Cuskley et al., 2014; Feltgen et al., 2017; Frermann and Lapata, 2016; Gulordava and Baroni, 2011; Hamilton et al., 2016; Newberry et al., 2017; Petersen et al., 2012; Sagi et al., 2011; Schlechtweg et al., 2017; Wijaya and Yeniterzi, 2011)
- Not only variant usage frequencies, but also meaning (and meaning change) can be tracked, using distributional semantics methods
- What I’m interested in: as new words - e.g. neologisms & borrowings - are selected for, what happens to their older synonyms?
- Identified two confounds that need to be controlled for
- Automatic distribution-based similarity measures are useful for quantifying both meaning and meaning change
- but apparent semantics tend to change when frequency changes (1)
- Analyses based on simply counting words can yield spurious results
- a big change may well be driven by a change in topic composition (2)
- Distributional semantics: based on contextual co-occurrence; semantic change ~ semantic similarity of a word between temporal subcorpora.
- Observation: frequency change appears to affect a word’s semantics.
- If that is the case, this would be a problem for any diachronic approach utilizing automatic semantic change measures (cf. Dubossarsky et al. 2017 for more critique of automated semantic change measures)
- Simulate frequency change of a word between subcorpora and measure its semantic change
- But instead of actual different subcorpora, use data from one single corpus (2000-2009 in COHA), and generate different versions of it (corpus\('\)) where the occurrences of some target word \(w\) have been downsampled by relabelling a fixed portion of them as \(w'\)
- Measure the similarity of \(w\) in the original corpus -> to \(w'\) in corpus\('\).
- Null hypothesis: no semantic change should occur (actually the same word)
- 100 random words (nouns) from equally spaced log frequency bands, 25 downsample sizes \(s \in [0.1, 7]\)
- For each \(w\) with frequency \(f\), and each \(s\), relabel a portion \(e^{\ln(f) - s} = f/e^s\) (excl. downsamples \(n<10\))
- E.g., if \(f=1000\), \(s=0.7\), then \(1000/e^{0.7} \approx 496\), or a -50.3% reduction.
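The downsampling-by-relabelling step can be sketched as follows (a minimal illustration, not the original implementation; the `_PRIME` suffix and the rounding of \(f/e^s\) are assumptions):

```python
import math
import random

def downsample_relabel(tokens, w, s, min_n=10):
    """Relabel a random portion of the occurrences of `w` as a new
    pseudo-word w', so that w' ends up with f/e^s = e^(ln f - s)
    occurrences. Returns None if the downsample would have fewer
    than `min_n` occurrences (excluded in the study)."""
    idx = [i for i, t in enumerate(tokens) if t == w]
    f = len(idx)
    n_keep = round(f / math.exp(s))  # e^(ln f - s) = f / e^s
    if n_keep < min_n:
        return None
    out = list(tokens)
    for i in random.sample(idx, n_keep):
        out[i] = w + "_PRIME"  # hypothetical label for w'
    return out
```

The similarity of \(w\) (in the original corpus) to `w_PRIME` (in the relabelled corpus) can then be measured; under the null hypothesis it should be maximal, since both labels refer to the same word.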
- For each downsampled \(w'\), measure its semantic similarity to the original word, using 5 different distributional approaches (with 10x replications for each combination):
- full count vectors (no dimension reduction), cosine similarity
- full vectors, but PPMI weighted, cosine similarity
- APSyn rank-based similarity, using top 100 PPMI-weighted terms (Santus et al. 2016)
- Latent Semantic Analysis (SVD) embeddings of count vectors, cosine similarity
- GloVe embeddings of count vectors, cosine similarity (Pennington et al. 2014)
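The simplest of these pipelines - raw co-occurrence counts, PPMI weighting, cosine similarity - can be sketched roughly as follows (a toy illustration; the window size and the lack of any smoothing are assumptions, not the settings used in the study):

```python
import math
from collections import Counter

def count_vectors(tokens, window=2):
    """Symmetric-window co-occurrence counts: vec[w][c] = number of
    times c occurs within `window` tokens of an occurrence of w."""
    vec = {}
    for i, w in enumerate(tokens):
        ctx = vec.setdefault(w, Counter())
        for j in range(max(0, i - window), min(len(tokens), i + window + 1)):
            if j != i:
                ctx[tokens[j]] += 1
    return vec

def ppmi(vec):
    """Positive pointwise mutual information weighting of the counts."""
    total = sum(sum(c.values()) for c in vec.values())
    w_tot = {w: sum(c.values()) for w, c in vec.items()}
    c_tot = Counter()
    for c in vec.values():
        c_tot.update(c)
    return {w: {x: max(0.0, math.log((n * total) / (w_tot[w] * c_tot[x])))
                for x, n in c.items()}
            for w, c in vec.items()}

def cosine(u, v):
    """Cosine similarity between two sparse vectors (dicts)."""
    num = sum(u[k] * v[k] for k in set(u) & set(v))
    den = (math.sqrt(sum(x * x for x in u.values()))
           * math.sqrt(sum(x * x for x in v.values())))
    return num / den if den else 0.0
```

The other methods differ mainly in what happens after counting: dimension reduction (SVD/LSA, GloVe) or a rank-based comparison of the top weighted context terms (APSyn).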
- All 5 semantic similarity methods exhibit some bias; predictable by frequency band in some
- Vector space density matters: a large change value does not necessarily correspond to a categorical change in semantics in a sparse space; similarity rank between \(w\) and \(w'\) is comparable between methods
- Good news: change to the extent of becoming a “different word” (\(w\) not the closest synonym for \(w'\)) occurs mostly at low frequencies (<100), which should be considered unreliable anyway
- Some methods (APSyn, GloVe) are more susceptible, while the simple PPMI-weighted vectors approach suffers the least
- But wait! Not all methods are born equal… the PPMI vectors largely fail at capturing synonymy (similarity scores seem more reflective of association: 0.15 correlation with association norms)
- Possible future options: different (bigger) corpora, parameter tuning, more training time for GloVe, more methods (e.g. fastText, recent Bayesian and “deep neural” approaches, cf. Frermann et al., 2016; Rosenfeld et al., 2018)
- The downsampling approach is extendable to actual diachronic corpora, to compare observed semantic change against the expected change stemming from frequency difference.
- The two confounds that need to be controlled for…
- Interplay of frequency change and semantic change measures ✔
- Topical fluctuations
- Observation: the ebb and flow of discourse topics in a diachronic corpus reflects real-world events (wars -> war-related news -> frequency of military words increases)
- Token frequency ~ probability of usage ~ fitness ~ being selected for
- However: corpus frequencies may be misleading (Chesley & Baayen, 2010; Lijffijt et al., 2012; Calude et al., 2017; Szmrecsanyi, 2016)
- Observation: sometimes similar words both increase in frequency instead of competing; and the emergence of new words often coincides with a frequency increase of similar words, not a decrease.
- Frequency change might not necessarily imply selection.
- Topical advection: a measure of how much topic/context words like cafe, cappuccino have changed on average (weighted by some association score) between two periods.
- latte: calculate its log frequency change (e.g. +1.19 between 1990s->2000s)
- calculate its topical advection: +0.07 (weighted mean log frequency change in context words) (see Appendix for math)
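The advection computation for a single word can be sketched like this (a minimal illustration of the weighted-mean-of-log-changes idea; the +1 smoothing follows the Appendix definition, everything else is simplified):

```python
import math

def log_change(f_now, f_prev):
    """Log frequency change between periods, +1 smoothing to avoid log(0)."""
    return math.log(f_now + 1) - math.log(f_prev + 1)

def advection(context_words, weights, freq_prev, freq_now):
    """Weighted mean log frequency change of a target word's context
    words; `weights` are association scores (e.g. PPMI) of each
    context word with the target."""
    num = sum(w * log_change(freq_now.get(c, 0), freq_prev.get(c, 0))
              for c, w in zip(context_words, weights))
    return num / sum(weights)
```

For a word like latte, this value summarizes how much its topic (cafe, cappuccino, …) has risen or fallen overall, providing a baseline against which the word's own frequency change can be judged.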
Works similarly in other diachronic databases of cumulative culture (e.g. movies, boardgames, cookbooks)
A useful baseline to include in any model of diachronic frequency change of linguistic (or other cultural) elements.
- The two confounds that need to be controlled for…
- Interplay of frequency change and semantic change measures ✔
- Topical fluctuations ✔
- As new words are selected for, what happens to their older synonyms?
- Observation: competition may manifest in at least two ways: frequency change or meaning change (e.g. radio <-> wireless, beef <-> cow) - but at times near-synonyms both successfully remain in use.
- Hypothesis: high semantic similarity (introduced by emergent novel words or semantic change) leads to competition between similar variants¹ - unless there is sufficient communicative need² in the lexical subspace to sustain near-synonymy.
- ¹ apparent in diverging frequency or diverging semantics (while controlling for bias)
- ² as measured by the advection model
- All the code, the slides with interactive plots, the paper: https://andreskarjus.github.io
For each \(w\) with an original frequency \(f\), and each \(s\), \(w\) was downsampled by randomly relabelling a fixed portion of its occurrences as \(w'\) in the corpus, where the portion is defined as \(e^{\ln(f) - s} = f/e^s\) (downsamples with \(<10\) occurrences were excluded)
E.g., if \(f=1000\), \(s=0.7\), then \(e^{\ln(1000) - 0.7} \approx 496\), or a -50.3% reduction.
On the heatmaps, \(w\) being closest synonym to \(w'\) is marked as False if this occurs in any of the 10 replicates. The downsampling procedure included a 0-downsample as a sanity check, which entailed no reduction in frequency, i.e. 100% of the occurrences were sampled; these are also displayed on the plots as the first column/value.
The advection value of a word in time period \(t\) is defined as the weighted mean of the changes in frequencies (compared to the previous period) of those associated words. More precisely, the topical advection value for a word \(\omega\) at time period \(t\) is
\[\begin{equation} {\rm advection}(\omega;t) := {\rm weightedMean}\big( \{ {\rm logChange}(N_i;t) \mid i=1,\dots,m \}, \, W \big) \end{equation}\]where \(N\) is the set of \(m\) words associated with the target at time \(t\) and \(W\) is the set of weights (to be defined below) corresponding to those words. The weighted mean is simply
\[\begin{equation} {\rm weightedMean}(X, W) := \frac{\sum x_i w_i }{\sum w_i} \end{equation}\]where \(x_i\) and \(w_i\) are the \(i^{\rm th}\) elements of the sets \(X\) and \(W\) respectively. The log change for period \(t\) for each of the associated words \(\omega'\) is given by the change in the logarithm of its frequencies from the previous to the current period. That is,
\[\begin{equation} {\rm logChange}(\omega';t) := \log[f(\omega';t)+1] - \log[f(\omega';t-1)+1] \end{equation}\]where \(f(\omega';t)\) is the number of occurrences of word \(\omega'\) in the time period \(t\). Note we add \(1\) to these frequency counts, to avoid \(\log(0)\) appearing in the expression.
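A quick numeric sanity check of these definitions (all counts and weights here are invented for illustration):

```python
import math

# Hypothetical counts for m = 3 context words of some target word:
f_prev = [100, 50, 10]    # f(w'_i; t-1)
f_now  = [200, 50, 0]     # f(w'_i; t)
W      = [3.0, 2.0, 1.0]  # association weights

# logChange with the +1 smoothing from the definition above
log_changes = [math.log(b + 1) - math.log(a + 1)
               for a, b in zip(f_prev, f_now)]

# weightedMean: the advection value for this target at time t
adv = sum(x * w for x, w in zip(log_changes, W)) / sum(W)
```

Here the second context word is stable (log change 0), the first roughly doubles, and the third vanishes; the weighted mean balances these against their association weights.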
This is a more detailed rendering of the similarity ranks plot shown above - instead of true/false, the mean ranks are shown (i.e., among all words ranked by similarity to \(w'\): if the value is 1, then the original \(w\) is the closest; a higher value indicates how low \(w\) ranks, e.g., “10” means \(w\) is only the 10th most similar word to \(w'\)).
*This research was supported by the scholarship program Kristjan Jaak, funded and managed by the Archimedes Foundation in collaboration with the Ministry of Education and Research of Estonia.