Language and culture dynamics
- Diachronic corpora
- Topics in a corpus of language
The topical-cultural advection model
- How does this work?
- How well does it work?
Cultural evolution
Some more culinary explorations
- Ingredient network: potential interactions
- Some basic network analysis
Conclusions
Extras

Language and culture dynamics

Why and how does language (~culture in general) change over time?

sociolinguistic reasons

language contact

top-down language planning

drift

…

Diachronic corpora

Utterances by numerous presumably uniformly sampled speakers -> samples of data across time -> provide insight into change dynamics

Selection, fitness, and drift

Token frequency ~ fitness (in much of corpus-based language dynamics)

However: corpus frequencies may be misleading (cf. Chesley & Baayen 2010; Lijffijt et al. 2012; Calude et al. 2017; Szmrecsanyi 2016)

Topics in a corpus of language

What is talked about, reported on and written about is reflected in corpus composition (and corpora may be imbalanced on top of that, cf. Pechenick et al. 2015)

For example:

war-related news during times of war

computer usage spreads -> computer-related topics proliferate -> related vocabulary increases in frequency; new, more specific words may be introduced.

Some other topics lose relevance in competition? -> vocabulary decreases

This points at a need for a baseline that controls for frequency changes in variants that are driven by changes in topic frequencies.

The topical-cultural advection model

Control for diachronic topical fluctuations by quantifying the frequency change of a word’s topic.
advection: ‘the transport of substance, particularly fluids, by bulk motion’
Formalized as the weighted mean of the log frequency changes of the relevant topic (words) of the target word
High advection is expected to correlate with a heightened communicative need in the semantic (or cultural) subspace.
The advection effect also predicts lexical innovation (cf. Karjus et al. 2018)

How does this work?

Generate a “topic” for each target word, consisting of m topic words, based on co-occurrence

..........the question. Does an iced latte  count as a dairy product ?
..social correctness , cappuccino ,  latte  , microbrewed beers . Live
...............be spitting in Ross ' latte  when he 's not looking
.....Seattle are sipping decaf mocha latte  nectar in a local cafe 
                                  ...
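The topic-generation step can be sketched roughly as follows. This is a minimal illustration, not the script from the talk: the function names, the toy sentences and the use of base-2 PMI are assumptions made here; only the idea of ranking co-occurring words by an association score (PPMI) and keeping the top m comes from the slides.

```python
import math
from collections import Counter

def cooccurrence_counts(sentences, window=5):
    """Count symmetric co-occurrences within +-window tokens (window size is a parameter)."""
    pair_counts = Counter()   # (target, context) -> count
    word_counts = Counter()   # target -> total number of co-occurrence events
    for tokens in sentences:
        for i, target in enumerate(tokens):
            lo = max(0, i - window)
            hi = min(len(tokens), i + window + 1)
            for j in range(lo, hi):
                if j != i:
                    pair_counts[(target, tokens[j])] += 1
                    word_counts[target] += 1
    return pair_counts, word_counts

def topic_words(target, pair_counts, word_counts, m=100):
    """Return the m context words with the highest positive PMI to the target."""
    total = sum(word_counts.values())
    scores = {}
    for (word, context), n in pair_counts.items():
        if word != target:
            continue
        pmi = math.log2(n * total / (word_counts[word] * word_counts[context]))
        if pmi > 0:  # PPMI: keep only positive associations
            scores[context] = pmi
    return sorted(scores.items(), key=lambda kv: kv[1], reverse=True)[:m]

# Toy, made-up sentences (hypothetical data, for illustration only)
sentences = [
    ["iced", "latte", "dairy", "product"],
    ["cappuccino", "latte", "beer"],
    ["decaf", "mocha", "latte", "nectar", "cafe"],
    ["court", "libel", "case"],
    ["court", "case", "beer"],
]
pair_counts, word_counts = cooccurrence_counts(sentences)
latte_topic = topic_words("latte", pair_counts, word_counts, m=3)
```

In a real corpus the counts would be collected per time period after stopword and proper-noun removal, as listed under Parameters.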

Advection: a measure of how much these topic words have changed on average (weighted by some association score) between two periods.

word      cappuccino libel  espresso resolutely vibe   nectar iced  scald ...
log change   +0.17   -1.40   +0.45     +0.12    +0.67  -0.41  -0.12 +0.07 ...
PPMI         11.51    10.3    10.3      9.25     9.05    8.9   8.89  8.72 ...

+1.19 log frequency change 1990s->2000s (1.91pmw->2.18pmw)
+0.07 advection (weighted mean log frequency change in topic words)
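As a rough check of the arithmetic, the weighted mean can be computed over just the eight topic words printed in the table above. The full topic vector is longer (the table is truncated), so this partial calculation is not expected to reproduce the +0.07 value; the variable names are chosen here for illustration.

```python
# (word, log frequency change, PPMI weight), copied from the truncated table above
topic = [
    ("cappuccino", 0.17, 11.51),
    ("libel", -1.40, 10.3),
    ("espresso", 0.45, 10.3),
    ("resolutely", 0.12, 9.25),
    ("vibe", 0.67, 9.05),
    ("nectar", -0.41, 8.9),
    ("iced", -0.12, 8.89),
    ("scald", 0.07, 8.72),
]

def weighted_mean(values, weights):
    """Sum of value*weight divided by the sum of weights."""
    return sum(v * w for v, w in zip(values, weights)) / sum(weights)

changes = [change for _, change, _ in topic]
weights = [ppmi for _, _, ppmi in topic]
partial_advection = weighted_mean(changes, weights)
```

On these eight words alone the value comes out at roughly -0.06, dominated by the large drop of 'libel'; the remaining topic words, not shown on the slide, account for the difference from +0.07.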

How well does it work?

Correlate the log frequency changes of all (sufficiently frequent) nouns between two time periods with their respective advection values
What should we expect?
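The correlation step could be sketched as below. The slides do not say which correlation coefficient is used; Pearson's r (whose square is R^2 for a one-predictor linear fit) is assumed here, and the numbers are invented toy values rather than corpus results.

```python
import math

def pearson_r(xs, ys):
    """Pearson correlation coefficient of two equal-length sequences."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    sd_x = math.sqrt(sum((x - mean_x) ** 2 for x in xs))
    sd_y = math.sqrt(sum((y - mean_y) ** 2 for y in ys))
    return cov / (sd_x * sd_y)

# Hypothetical per-word values: advection and observed log frequency change
advection_values = [-0.30, -0.10, 0.00, 0.05, 0.20, 0.40]
log_freq_changes = [-0.80, 0.10, -0.20, 0.30, 0.25, 0.90]

r = pearson_r(advection_values, log_freq_changes)
r_squared = r ** 2  # share of variance in frequency change described by advection
```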

Cultural evolution

The same effect can be quantified in a (diachronic) database of cultural trends
Analogous to “topics”: clusters of elements that increase/decrease together -> advection
Tested this in 3 distinct domains of culture using massive freely available databases

The data

IMDb: movie tags; context = co-occurrence in a movie

limited to: USA + drama + movies + >100 votes; filtered out TV shows, porn, animation etc.

compared 1960-1979 vs 2010-2018; 3151 movies, 1387 keywords (out of 41269)

BoardGameGeek: game mechanics; context = co-occurrence in a boardgame

compared 1950-1980 vs 2010-2013; 17551 games, 51 mechanics

Feeding America: The Historic American Cookbook Dataset: ingredients; context = co-occurrence in a recipe

compared 1803-1860 vs 1890-1913; 744 ingredients in 26378 recipes from 44 cookbooks

But ignoring potentially important metadata (movie popularity, boardgame sales and families, etc.)

Top co-occurring context tags for ‘cult-movie’

Residuals

- Top positive residuals (~selection): celery root, sauerkraut, baking powder, granulated sugar, pork fat, corn starch

- Top negative: powder sugar, powdered loaf sugar, pearl ash (potassium carbonate), sauce, lemon juice, tomata
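One way to obtain such residuals is sketched below, assuming they come from a simple linear regression of log frequency change on advection (the slides do not spell out the model). Item names and numbers are placeholders invented for illustration, not values from the cookbook data.

```python
def fit_line(xs, ys):
    """Ordinary least squares fit of y = intercept + slope * x."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxx = sum((x - mean_x) ** 2 for x in xs)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    slope = sxy / sxx
    intercept = mean_y - slope * mean_x
    return slope, intercept

# Hypothetical elements with made-up advection and log frequency change values
items = ["a", "b", "c", "d"]
advection_values = [0.0, 1.0, 2.0, 3.0]
log_freq_changes = [0.1, 1.5, 1.9, 3.0]

slope, intercept = fit_line(advection_values, log_freq_changes)

# Residual = observed change minus the change predicted from advection alone
residuals = {
    item: y - (intercept + slope * x)
    for item, x, y in zip(items, advection_values, log_freq_changes)
}

# Largest positive residuals first: change beyond what the topic baseline predicts
ranked = sorted(residuals.items(), key=lambda kv: kv[1], reverse=True)
```

Under this reading, a large positive residual marks an element that rose more than its topic did, which is what the slide glosses as "~selection".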

Some more culinary explorations

Is it possible to determine which new ingredients likely compete with which old ingredients?

Calculate the cosine similarity (on a PPMI-weighted recipe-based co-occurrence matrix) between ingredients

Candidate competitors for each new ingredient: those which are among the top 10 most similar old ingredients, and which have decreased in frequency between the 2 time periods
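The competitor selection above can be sketched as follows. The vectors stand in for rows of a PPMI-weighted co-occurrence matrix; all names and values are hypothetical, and the helper functions are defined here only for illustration.

```python
import math

def cosine(u, v):
    """Cosine similarity of two equal-length vectors (0.0 if either is all zeros)."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v) if norm_u and norm_v else 0.0

def candidate_competitors(new_item, vectors, old_items, log_change, top_n=10):
    """Old items among the top_n most similar to new_item that decreased in frequency."""
    sims = [(old, cosine(vectors[new_item], vectors[old])) for old in old_items]
    sims.sort(key=lambda pair: pair[1], reverse=True)
    return [(old, sim) for old, sim in sims[:top_n] if log_change[old] < 0]

# Toy PPMI-like vectors and log frequency changes (made-up values)
vectors = {
    "new_x": [1.0, 2.0, 0.0],
    "old_a": [1.0, 2.0, 0.1],
    "old_b": [0.0, 0.0, 1.0],
    "old_c": [2.0, 1.0, 0.0],
}
log_change = {"old_a": -0.5, "old_b": -0.2, "old_c": 0.3}

competitors = candidate_competitors(
    "new_x", vectors, ["old_a", "old_b", "old_c"], log_change, top_n=2
)
```

Here old_c is similar to new_x but increased in frequency, and old_b decreased but falls outside the top 2, so only old_a remains as a candidate.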

Ingredient network: potential interactions

Red: new ingredients (size = log frequency increase). Gray: old ingredients. Links: width indicates cosine similarity.

Some basic network analysis

This approach allows for the operationalization of the following quantities besides advection:
Node degree of new ingredients (how many links to old ingredients)
For each new ingredient: the mean of the degrees of all its linked old ingredients
- if low: new one competing alone against a number of old ones
- if high: subspace contested by a number of new ones (or augmented if high advection)
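Both quantities can be read off an edge list of the ingredient network. The sketch below assumes edges are stored as (new ingredient, old ingredient) pairs; node names are placeholders.

```python
from collections import defaultdict

def network_degrees(edges):
    """Return degree of each new node, degree of each old node, and for each
    new node the mean degree of the old nodes it is linked to."""
    new_neighbors = defaultdict(set)
    old_neighbors = defaultdict(set)
    for new, old in edges:
        new_neighbors[new].add(old)
        old_neighbors[old].add(new)
    new_degree = {n: len(olds) for n, olds in new_neighbors.items()}
    old_degree = {o: len(news) for o, news in old_neighbors.items()}
    mean_old_degree = {
        n: sum(old_degree[o] for o in olds) / len(olds)
        for n, olds in new_neighbors.items()
    }
    return new_degree, old_degree, mean_old_degree

# Hypothetical network: n1 links to o1 and o2, n2 links to o2 only
edges = [("n1", "o1"), ("n1", "o2"), ("n2", "o2")]
new_degree, old_degree, mean_old_degree = network_degrees(edges)
```

In this toy network n2 has a single old neighbor (o2) that is also contested by n1, so its mean neighbor degree (2.0) is higher than that of n1 (1.5).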

Prediction: high advection -> little competition/low degree (analogy: communicative need in language)

degree(new ingredients) ~ advection(new ingr); R^2=0.31

degree(new ingredients) ~ advection(new ingr) * mean(degrees(old ingr neighbors)); R^2=0.4

Conclusions

Discourse topics play a role in the frequency changes of words

The topical-cultural advection model quantifies that role

The same effect can be demonstrated to work in various domains of cultural evolution

As such, it would be reasonable to include it as a control in any quantitative model that predicts frequency changes of linguistic and/or cultural elements

But: biases likely have differing importance in different domains (e.g. exclusivity bias)

We are working on integrating the advection measure and a model of within-language lexical competition (as seen here) to investigate the effects of the introduction of new variants into a language system.

Full paper now out on arXiv: Karjus, Blythe, Kirby, Smith 2018: Quantifying the dynamics of topical fluctuations in language, https://arxiv.org/abs/1806.00699

Play with the interactive plots: https://andreskarjus.github.io/cultevol_tartu_slides

Code on Github: https://github.com/andreskarjus/cultural_advection_TartuCE

Extras

Parameters

COHA nouns:
- frequency threshold: 100, topic vector length: 100
- eras used here: 1930s vs 1940s (see the paper on arXiv for more)
- removed stopwords, proper nouns, etc.; co-occurrence window size: ±5
Boardgames:
- frequency threshold: 5, topic vector length: 23 (max)
- eras: 1950-1980 vs 2010-2013
IMDb:
- frequency threshold: 25, topic vector length: 100
- include = {Drama}
- exclude = {Documentary, Reality-TV, Short, Talk-Show, Musical, News, Music, Game-Show, Animation, Adult, Reality-tv, Lifestyle}
- eras: 1960-1979 vs 2010-2018
Feeding America:
- frequency threshold: 25, topic vector length: 100
- classes: {fruitvegbeans, meatfishgame, eggscheesedairy, breadsweets, soups, accompaniments, beverages}
- eras: 1803-1860 vs 1890-1913
- carried out some basic tag normalization/stemming and cleaning
Refer to the script on Github for some useful links and to see how these are implemented.

Math

The advection value of a word in time period \(t\) is defined as the weighted mean of the changes in frequencies (compared to the previous period) of the words associated with it (its topic words). More precisely, the topical advection value for a word \(\omega\) at time period \(t\) is

\[\begin{equation} {\rm advection}(\omega;t) := {\rm weightedMean}\big( \{ {\rm logChange}(N_i;t) \mid i=1,\dots,m \}, \, W \big) \end{equation}\]

where \(N\) is the set of \(m\) words associated with the target at time \(t\) and \(W\) is the set of weights corresponding to those words (association scores such as PPMI; see the full paper for the definition). The weighted mean is simply

\[\begin{equation} {\rm weightedMean}(X, W) := \frac{\sum x_i w_i }{\sum w_i} \end{equation}\]

where \(x_i\) and \(w_i\) are the \(i^{\rm th}\) elements of the sets \(X\) and \(W\) respectively. The log change for period \(t\) for each of the associated words \(\omega'\) is given by the change in the logarithm of its frequencies from the previous to the current period. That is,

\[\begin{equation} {\rm logChange}(\omega';t) := \log[f(\omega';t)+1] - \log[f(\omega';t-1)+1] \end{equation}\]

where \(f(\omega';t)\) is the number of occurrences of word \(\omega'\) in the time period \(t\). Note we add \(1\) to these frequency counts, to avoid \(\log(0)\) appearing in the expression.
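The definitions above translate fairly directly into code. The sketch below assumes natural logarithms (the base is not stated here) and represents a topic as a list of (word, weight) pairs; the function names and toy counts are illustrative only.

```python
import math

def log_change(count_now, count_prev):
    """log[f(t) + 1] - log[f(t-1) + 1]; the +1 avoids log(0) for unseen words."""
    return math.log(count_now + 1) - math.log(count_prev + 1)

def advection(topic, counts_now, counts_prev):
    """Weighted mean of the log changes of the topic words.

    topic: list of (word, weight) pairs, e.g. weights = PPMI scores.
    counts_now / counts_prev: dicts mapping word -> occurrence count per period.
    """
    weighted_sum = sum(
        weight * log_change(counts_now.get(word, 0), counts_prev.get(word, 0))
        for word, weight in topic
    )
    return weighted_sum / sum(weight for _, weight in topic)

# Toy counts (made up): one topic word rises, one is unchanged
toy_topic = [("espresso", 2.0), ("nectar", 1.0)]
toy_prev = {"espresso": 9, "nectar": 4}
toy_now = {"espresso": 99, "nectar": 4}
toy_advection = advection(toy_topic, toy_now, toy_prev)
```

A word absent from both periods contributes a log change of exactly zero, which is the practical effect of adding 1 to the counts.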

See the full paper for more.

Ingredient cosine similarities

Based on ingredient co-occurrence in the cookbooks. Displaying only the ingredients with at least one neighbor with similarity >0.6 (similarity corresponds to edge width); nodes with only weaker similarity links are excluded. Color corresponds to log frequency change (red = increase).

*This research was supported by the scholarship program Kristjan Jaak, funded and managed by the Archimedes Foundation in collaboration with the Ministry of Education and Research of Estonia.

Selection above the baseline:
quantifying the advection effect in four domains of cumulative culture

Andres Karjus

Centre for Language Evolution, University of Edinburgh
& University of Tartu

a.karjus@sms.ed.ac.uk | andreskarjus.github.io | @AndresKarjus

Applications in Cultural Evolution: Arts, Languages, Technologies; Tartu 2018 | https://cultevol.ut.ee/
