class: center, middle, inverse, title-slide

# Catching competitors: lexical evolution in diachronic corpora
### Andres Karjus, Richard A. Blythe, Simon Kirby, Kenny Smith<br>Centre for Language Evolution, University of Edinburgh
### CL2019, Cardiff, July 2019
---
class: inverse

# All languages change

<style>
.remark-slide-content { padding-top: 7px; padding-left: 25px; padding-right: 20px; padding-bottom: 30px; }
body { line-height: 3em; }
.mjx-chtml{ font-size: 100% !important; }
.small { font-size: 50%; margin-top:0em; margin-bottom:0em; line-height:0px;}
.mono {font-family: monospace, monospace; font-size: 80%;}
p {margin-bottom:0em}
</style>

---

# All languages change

- The big picture: human languages evolve on a cultural timescale:
  - individual utterance selection > language change > language evolution
  - a language -> another language(s)

--

- Massive centuries-spanning corpora compiled in recent years open up an unprecedented avenue for investigating language dynamics.
- Variant usage frequencies, but also meaning (and meaning change), using distributional semantics methods

---

# All languages change

- What I'm interested in: as new words (e.g. neologisms and borrowings) are selected for, what happens to their older synonyms? Does direct competition always follow local frequency changes?

--

- Hypothesis:
  - a frequency increase in a word will lead to direct competition with (and possibly replacement of) its near-synonym(s)
  - unless the lexical subspace experiences high communicative need.

--

- Communicative need and communicative utility .small[(Givon 1982, McMahon 1994:194, Tomasello 1999:37, Regier et al. 2016, Gibson et al. 2017, Smith et al. 2017)]
- How much does communicative need drive language change?

---

# Some technical challenges

- what data to test this on
- how to model changes in communicative need
- how to capture competition dynamics

---

# Data

- Sample unique words from a corpus that show a frequency increase of `\(\ln\geq 2\)` between any two 10-year spans, occur in `\(\geq 2\)` years, and occur `\(\geq 100\)` times; e.g.:
---
class:inverse

# Quantifying competition (the thing I want to predict)

---

# Quantifying competition

- Distributional semantics: meaning from data

<img src="img/tcm_anim.mov.gif" width="80%" />

- Embed targets into the vector space (LSA) of the preceding decade, compute semantic neighbors

---

# Quantifying competition

- Important: word occurrence probabilities sum to 1, so an increase in x means a decrease in y.

--

- The measure: *where* the probability mass gets equalized, i.e., `\(\sum\)` (neighbors' decreases) `\(\geq\)` target increase. Either cosine distance, or n increasing neighbors.
- Indicates whether the increasing target replaced semantically close word(s) (direct competition: the obvious, likely source of the probability mass).

---

# An example semantic space
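---

# Quantifying competition: a sketch

One way to read the measure: walk through the target's neighbors from nearest to farthest, add up their decreases, and stop where the sum covers the target's increase. The following is a minimal Python sketch under that reading, not the implementation behind the talk. The function name, the tuple layout, and returning `None` when the mass is never covered are assumptions made for illustration. The numbers are those of the _relativism_ example.

```python
def equalization_point(target_increase, neighbors):
    """Find where neighbors' summed decreases cover the target's increase.

    neighbors: (word, freq_change, norm_dist) tuples, sorted nearest first.
    Returns a dict describing the neighbor at which the cumulative decrease
    first reaches target_increase, or None if it never does (an assumption).
    """
    cum_decrease = 0.0
    for rank, (word, change, dist) in enumerate(neighbors, start=1):
        if change < 0:  # only decreasing neighbors free up probability mass
            cum_decrease += -change
        if cum_decrease >= target_increase:
            return {"word": word, "rank": rank, "distance": dist,
                    "cum_decrease": round(cum_decrease, 2)}
    return None


# Values from the relativism example (changes in per-million-word frequency).
relativism_neighbors = [
    ("marxism", -5.68, 0.00),
    ("thesis", 9.00, 0.01),        # increasing neighbor: contributes nothing
    ("jacksonian", -11.64, 0.03),
]
result = equalization_point(13.2, relativism_neighbors)
print(result)
```

Here the mass is equalized at _jacksonian_, the third neighbor, at a normalized distance of 0.03. A small distance suggests that the target's gain can be accounted for by close neighbors; a large one suggests the mass came from elsewhere.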
---

- Example: _relativism_, increasing by +13.2 pmw <br>between 1965-1974 and 1975-1984:

--

- word | freq.change | cumsum(decr) | normd. dist
- _**relativism**_ | <font color='darkred'>+13.2</font> | . | .

--

- *marxism* | -5.68 | 5.68 | 0

--

- *thesis* | <font color='gray'>+9.00</font> | <font color='gray'>5.68</font> | 0.01

--

- *jacksonian* | -11.64 | 17.32 > <font color='darkred'>13.2</font> | 0.03
- *validity*
- *interpretation*

---
class: inverse

# Communicative need (the predictor)

---

# Communicative need

- Model diachronic topical fluctuations by quantifying the frequency change of a word's topic.
- The topical-cultural advection model: a proxy for communicative need
- _advection_: 'the transport of substance, particularly fluids, by bulk motion'
- Formalized as the _weighted mean of the log frequency changes of the relevant topic (context) words of the target word_

---

# How does this work?

- Generate a "topic" for each target word, consisting of _m_ associated context words

<img src="img/tcm_anim.mov.gif" width="80%" height="50%"/>

- weighted mean frequency change of topic/context words, e.g. _cafe_, _cappuccino_ .small[(these are removed from the list of neighbours)]

---

# How well does it work?

- Correlate the log frequency changes of all (sufficiently frequent) nouns between two time periods with their respective topical advection values
- describes ~20-40% of the variance in word frequency changes between the 20 decades in COHA
- Comparable results from LDA
- cf. Karjus et al., *Quantifying the dynamics of topical fluctuations in language* (to appear in Language Dynamics and Change) <br> .small[preprint https://arxiv.org/abs/1806.00699 ]

---
class:inverse

# Results

---

# Results (COHA)
---

- R<sup>2</sup>=0.2. Clearer competition signal if:
  - lower communicative need (advection; `\(\beta = 0.09\)`, `\(p<0.001\)`)
  - bursty series
  - smaller changes
  - a clear loser present
- Also controlled for, but all `\(p>0.05\)`: std of yearly frequencies • semantic subspace instability • uniqueness of the form • smallest edit distance among closest sem neighbors • polysemy • leftover prob. mass • age of word in corpus • target decade.

---

# Results (Estonian, German)

<img src="cl2019talk_files/figure-html/unnamed-chunk-5-1.png" width="90%" style="display: block; margin: auto;" />

---
class: inverse

# Discussion

---

# Discussion

- Controlling for a range of factors, communicative need (operationalized by advection) describes a small amount of variance in competitive interactions between words
- low advection words are more likely to replace a word with a similar meaning

--

- high advection words: less likely (prob. mass from elsewhere)

--

- tested against a random baseline

--

- Real effect is bigger?
- Just words; messy population aggregates; messy ML
- But also, this approach relies on a *lot* of parameters

--

- How does this relate to individual utterance selection processes?

---

# Work in progress: Scottish Twitter

- ~50k daily users; 72 days, ~1.5m words/day, >140k unique #s

<img src="cl2019talk_files/figure-html/unnamed-chunk-6-1.png" width="60%" style="display: block; margin: auto;" />

---

# Results (Scottish Twitter hashtags)

.small[ R<sup>2</sup>=0.12, advection p=0.34 ]
---
class:inverse
background-image: url(img/dany.png)
background-size: cover

---

# Discussion (Scottish Twitter hashtags)

- Small set of targets (63)? Also tried with 101 (any) words.
- What do #s compete with anyway: only other #s, or all words?
- Timespans need more thought.
- Probably behave differently from "normal" words:
  - *I watched Game of Thrones last night.*

--

  - ?? *I watched Game of Thrones, GoT, the GoT finale last night.*

--

  - *I watched #gameofthrones last night.<br> #got #gotfinale #gameofthronesfinale*

---
class: inverse

# Conclusions

- Communicative need describes a small amount of variance in competitive interactions between words in diachronic corpora (but not Twitter).
- Presumably high communicative need facilitates the co-existence of similar words.

--

- Future directions: explore parameters; test with phrases; more corpora; implement with non-discrete spans and modern semantics models .small[ (e.g. temporal referencing, Dubossarsky et al. 2019) ]; test this experimentally.
--

- Slides: andreskarjus.github.io/cl2019talk <br>Twitter: @AndresKarjus

---
class: inverse

---
class: inverse

# Appendix

---

# All the parameters

<div style="font-size:16pt; line-height:25px">
- preprocessing choices (lemmatized; removed stopwords, numbers; homogenized compounds, spelling) <br>
- timespan (10y), min change (log()>2), min frequency in t2 (100), min occurrence years (2); twitter: 5d spans, at least once in span >=0.5% daily users <br>
- LSA k (100 dim), min freq (100), context window size (5), weighted <br>
- cosine distance: normalized by 1st neighbour <br>
- density: LSA-based, as mean of cos.sim of 2nd..10th neighbours <br>
- form similarity (restricted Damerau-Levenshtein, length-normalized) <br>
- filter out: leftover>100%, polysemy residuals>2 <br>
- advection topic model k (75 words), min freq (100), context window size (10), weighted <br>
- polysemy model, context window size (2), weighted
</div>

---

# The competition measure (COHA)

<img src="cl2019talk_files/figure-html/unnamed-chunk-8-1.png" width="100%" style="display: block; margin: auto;" />

---

# Notes on the competition measure

<div style="font-size:14pt; line-height:25px">
- We made sure to avoid auto-correlation between the advection measure and the dependent variable by filtering the neighbour lists of each target so that no topic word of the target (i.e. those with a PPMI>0 with the target, which are used to calculate the advection value) would be counted as a neighbour. This also makes sense from a semantic point of view: if two words, even very similar ones, occur near each other (e.g., "salt and pepper"), then it is less plausible that they would be competing against one another. Exceptions are certainly possible, such as meta-linguistic descriptions (e.g., "vapor, previously spelled as vapour"), but we assume these would be rare.<br>
- We also filtered out a small subset of target words with considerably higher-than-expected lexical dissemination (a proxy to polysemy, cf. Stewart & Eisenstein 2018), and those with a leftover probability mass >100% of their frequency. <br>
- We did not make use of the entire Corpus of Historical American English, as most of the 19th century decades are less balanced and smaller in size; the imbalance extends to the occurrence of non-standard dialects or registers in occasional year subcorpora. Similarly, we only used years after 1800 in the German corpus. The Estonian corpus only spans two decades, the 1990s and 2000s, so all comparisons were done between these two, without accounting for the exact starting year of each word's increase.<br>
- This approach certainly has limitations stemming from the imperfect nature of corpus tagging, composition balance, and vector semantics (LSA). We also disregard issues such as homonymy (although we control for polysemy in the targets) and multi-word units.<br>
- We ran randomized baselines to make sure the observed correlation with advection is not some (unknown) artefact of the machine learning models used here. This was done by randomizing similarity matrices, i.e. each target was assigned a random list of neighbours, with random similarity values (drawn from the concatenation of all similarity vectors). After hundreds of iterations, the advection variable would come out with a p-value below 0.05 in only about 5% of the runs (i.e., as expected with an `\(\alpha=0.05\)`).<br>
- Some outliers are removed from the bottom-left plot of the distributions of distances and neighbours until the probability mass is equalized.
</div>

---

# The polysemy measure
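---

# Notes on the randomized baseline: a sketch

The randomization step described in the notes on the competition measure (each target gets a random list of neighbours, with similarity values drawn from the concatenation of all similarity vectors) could look roughly like the following. This is a hedged illustration only: the nested-dict layout, the function name, the exclusion of the target from its own list, and the toy words and similarity values are all assumptions, not the code or data behind the reported results.

```python
import random


def randomize_neighbors(neighbor_sims, rng):
    """Give each target a random neighbour list with random similarities.

    neighbor_sims: {target: {neighbor: similarity}} (hypothetical layout).
    Neighbour words are sampled from the pooled vocabulary; similarity
    values are drawn from the concatenation of all similarity vectors.
    """
    pool = [s for sims in neighbor_sims.values() for s in sims.values()]
    vocab = sorted({w for sims in neighbor_sims.values() for w in sims})
    randomized = {}
    for target, sims in neighbor_sims.items():
        candidates = [w for w in vocab if w != target]
        k = min(len(sims), len(candidates))
        words = rng.sample(candidates, k=k)
        values = sorted((rng.choice(pool) for _ in range(k)), reverse=True)
        randomized[target] = list(zip(words, values))  # most similar first
    return randomized


# Toy input; the similarity values are invented for illustration.
toy = {
    "relativism": {"marxism": 0.9, "thesis": 0.8, "jacksonian": 0.7},
    "cafe": {"restaurant": 0.95, "bar": 0.6},
}
shuffled = randomize_neighbors(toy, random.Random(1))
for target, neighbours in shuffled.items():
    print(target, neighbours)
```

Each randomized set would then be run through the same competition measure and regression as the real data, and the share of runs in which advection reaches p<0.05 compared against the chosen `\(\alpha\)`.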
---

# References

<div style="font-size:10pt; line-height:20px">
Gibson, E., Futrell, R., Jara-Ettinger, J., Mahowald, K., Bergen, L., Ratnasingam, S., Gibson, M., Piantadosi, S.T., Conway, B.R., 2017. Color naming across languages reflects color use. Proceedings of the National Academy of Sciences.<br>
Hamilton, W.L., Leskovec, J., Jurafsky, D., 2016. Diachronic Word Embeddings Reveal Statistical Laws of Semantic Change, in: Proceedings of the 54th Annual Meeting of the Association for Computational Linguistics, ACL 2016, August 7-12, 2016, Berlin, Germany, Volume 1: Long Papers.<br>
Karjus, A., Blythe, R.A., Kirby, S., Smith, K., [to appear in Language Dynamics and Change]. Quantifying the dynamics of topical fluctuations in language.<br>
Petersen, A.M., Tenenbaum, J., Havlin, S., Stanley, H.E., 2012. Statistical Laws Governing Fluctuations in Word Use from Word Birth to Word Death. Scientific Reports 2.<br>
Regier, T., Carstensen, A., Kemp, C., 2016. Languages Support Efficient Communication about the Environment: Words for Snow Revisited. PLOS ONE 11, 1–17.<br>
Schlechtweg, D., Eckmann, S., Santus, E., Schulte im Walde, S., Hole, D., 2017. German in Flux: Detecting Metaphoric Change via Word Entropy. arXiv preprint.<br>
Stewart, I., Eisenstein, J., 2018. Making “fetch” happen: The influence of social and linguistic context on nonstandard word growth and decline, in: Proceedings of the 2018 Conference on Empirical Methods in Natural Language Processing. Association for Computational Linguistics, Brussels, Belgium, pp. 4360–4370.<br>
Turney, P.D., Mohammad, S.M., 2019. The natural selection of words: Finding the features of fitness. PLOS ONE 14(1): e0211512.<br>
Xu, Y., Kemp, C., 2015. A Computational Evaluation of Two Laws of Semantic Change, in: CogSci.
</div>
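---

# The advection measure: a sketch

Advection was described earlier as the weighted mean of the log frequency changes of a target's topic (context) words. A minimal Python sketch of that definition follows, under stated assumptions: the function name and dict layout are hypothetical, the weights stand in for association scores such as PPMI (the notes mention PPMI>0 as defining topic words; using those scores as the weights is an assumption here), frequencies are assumed positive so no smoothing is applied, and the toy numbers are invented.

```python
import math


def advection(topic_weights, freq_t1, freq_t2):
    """Weighted mean of log frequency changes of a target's topic words.

    topic_weights: {context_word: weight}, e.g. association scores (assumed).
    freq_t1, freq_t2: {word: frequency} in the earlier and later period;
    all frequencies are assumed to be positive.
    """
    total_weight = sum(topic_weights.values())
    weighted_sum = sum(
        weight * math.log(freq_t2[word] / freq_t1[word])
        for word, weight in topic_weights.items()
    )
    return weighted_sum / total_weight


# Toy topic for a hypothetical target; all numbers are invented.
topic = {"coffee": 2.0, "espresso": 1.0}
earlier = {"coffee": 100.0, "espresso": 10.0}
later = {"coffee": 200.0, "espresso": 10.0}
value = advection(topic, earlier, later)
print(value)
```

A positive value means the target's topic as a whole became more frequent between the two periods, which the talk treats as a proxy for increased communicative need; a negative value means the topic receded.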