Social Science Word Embedding Model
doi:10.1017/S0003055422001228 © The Author(s), 2023. Published by Cambridge University Press on behalf of the American Political Science Association. This is an Open Access article, distributed under the terms of the Creative Commons Attribution licence (http://creativecommons.org/licenses/by/4.0), which permits unrestricted re-use, distribution and reproduction, provided the original article is properly cited.
Social scientists commonly seek to make statements about how word use varies over circumstances—
including time, partisan identity, or some other document-level covariate. For example, researchers
might wish to know how Republicans and Democrats diverge in their understanding of the term
“immigration.” Building on the success of pretrained language models, we introduce the à la carte on text
(conText) embedding regression model for this purpose. This fast and simple method produces valid
vector representations of how words are used—and thus what words “mean”—in different contexts. We
show that it outperforms slower, more complicated alternatives and works well even with very few
documents. The model also allows for hypothesis testing and statements about statistical significance. We
demonstrate that it can be used for a broad range of important tasks, including understanding US
polarization, historical legislative development, and sentiment detection. We provide open-source software
for fitting the model.
All human communication requires common understandings of meaning. This is nowhere more true than political and social life, where the success of an appeal—rhetorical or otherwise—relies on an audience perceiving a message in the particular way that the speaker seeks to deliver it. Scholars have therefore spent much effort exploring the meanings of terms, how those meanings are manipulated, and how they change over time and space. Historically, this work has been qualitative (e.g., Austin 1962; Geertz 1973; Skinner 1969). But in recent times, quantitative analysts have turned to modeling and measuring "context" directly from natural language (e.g., Aslett et al. 2022; Hopkins 2018; Park, Greene, and Colaresi 2020).

A promising avenue for such investigations has been the use of "word embeddings"—a family of techniques that conceive of meaning as emerging from the distribution of words that surround a term in text (e.g., Mikolov et al. 2013). By representing each word as a vector of real numbers and examining the relationships [between those vectors, scholars have learned a great deal about language] and the people that produce it (e.g., Caliskan, Bryson, and Narayanan 2017). This is also true in the study of politics, society, and culture (Garg et al. 2018; Kozlowski, Taddy, and Evans 2019; Rheault and Cochrane 2020; Rodman 2020; Wu et al. 2019).

Although borrowing existing techniques has certainly produced insights, for social scientists two problems remain. First, traditional approaches generally require a lot of data to produce high-quality representations—that is, to produce embeddings that make sense and connote meaning of terms correctly. The issue is less that our typical corpora are small—though they are compared with those on the web-scale collections often used in computer science—and more that terms for which we would like to estimate contexts are subject-specific and thus typically quite rare. As an example, there are fewer than twenty parliamentary mentions of the "special relationship" between the United States and the United Kingdom in some years of the 1980s—despite this arguably being the high watermark of elite closeness between the two countries. The second problem is one of inference. Although representations themselves are helpful, social scientists want to make statements about the statistical properties and relationships between embeddings. That is, they want to speak meaningfully of whether language is used differently across subcorpora and whether those apparent differences are larger than we would expect by chance. Neither of these problems are well addressed by current techniques. Although there have been efforts to address inference in embeddings (see, e.g., Kulkarni et al. 2015; Lauretig 2019), they are typically data intensive and computationally intensive.

Pedro L. Rodriguez, Visiting Scholar, Center for Data Science, New York University, United States; and International Faculty, Instituto de Estudios Superiores de Administración, Venezuela, [email protected].
Arthur Spirling, Professor of Politics and Data Science, Department of Politics, New York University, United States, [email protected].
Brandon M. Stewart, Associate Professor, Sociology and Office of Population Research, Princeton University, United States, [email protected].
Received: June 26, 2021; revised: February 22, 2022; accepted: October 24, 2022.
We tackle these two problems together in what follows. We provide both a statistical framework for making statements about covariate effects on embeddings and one that performs particularly well in cases of rare words or small corpora. Specifically, we innovate on Khodak et al. (2018), which introduced à la carte embeddings (ALC). In a nutshell, the method takes embeddings that have been pretrained on large corpora (e.g., word2vec or GloVe embeddings readily available online), combines them with a small sample of example uses of a focal word, and induces a new context-specific embedding for the focal word. This requires only a simple linear transformation of the averaged embeddings for words within the context of the focal word.

We place ALC in a regression setting that allows for fast solutions to queries like "do authors with these covariate values use these terms in a different way than authors with different covariate values? If yes, how do they differ?" We provide three proofs of concept. First, we demonstrate the strength of our approach by comparing its performance to the "industry standard" as laid out by Rodman (2020) in a study of a New York Times API corpus, where slow changes over long periods are the norm. Second, we show that our approach can estimate an approximate embedding even with only a single context. In particular, we demonstrate that we can separate individual instances of Trump and trump. Third, we show that our method can also identify drastic switches in meaning over short periods—specifically in our case, for the term Trump before and after the 2016 election.

We study three substantive cases to show how the technique may be put to work. First, we explore partisan differences in Congressional speech—a topic of long-standing interest in political science (see, e.g., Monroe, Colaresi, and Quinn 2008). We show that immigration is, perhaps unsurprisingly, one of the most differently expressed terms for contemporary Democrats and Republicans. Our second substantive case is historical: we compare across polities (and corpora) to show how elites in the UK and US expressed empire in the postwar period, how that usage diverged, and when. Our third case shows how our approach can be used to measure sentiment. We build on earlier work (e.g., Osnabrügge, Hobolt, and Rodon 2021; Slapin et al. 2018) for the UK House of Commons, yielding novel insights about the relationship between the UK Prime Minister and his backbenchers on the European Union. We also provide advice to practitioners on how to use the technique based on extensive experiments reported in the Supplementary Materials (SM).

These innovations allow social scientists to go beyond general meanings of words to capture situation-specific usage. This is possible without substantial computation and, in contrast to other approaches, requires only the text immediately around the word of interest.

We proceed as follows: in the next section, we provide some context for what social scientists mean by "context" and link this to the distribution of words around a focal term. We then introduce the ALC algorithm and provide three proofs of concept. Subsequently, we extend ALC to a regression framework and then present results from three substantive use cases. We give practical guidance on use and limitations before concluding.

CONTEXT IN CONTEXT

… they are casting their problems on society and who is society? There is no such thing!
—Margaret Thatcher, interview with Woman's Own (1987).

Paraphrased as "there is no such thing as society," Thatcher's quote has produced lively debate in the study and practice of UK politics. Critics—especially from the left—argued that this was primarily an endorsement of individual selfishness and greed. But more sympathetic accounts have argued that the quote must be seen in its full context to be understood. The implication is that reading the line in its original surroundings changes the meaning: rather than embracing egotism, it emphasizes the importance of citizens' obligations to each other above and beyond what the state requires.

Beyond this specific example, the measurement and modeling of context is obviously a general problem. In a basic sense, context is vital: we literally cannot understand what is meant by a speaker or author without it. This is partly due to polysemy—the word "society" might mean many different things. But the issue is broader than this and is at the core of human communication. Unsurprisingly then, the study of context has been a long-standing endeavor in social science. Its centrality has been emphasized in the history of ideas (Skinner 1969) through the lens of "speech acts" (Austin 1962), describing cultural practices via "thick description" (Geertz 1973), understanding "political culture" (Verba and Almond 1963), and the psychology of decision making (Tversky and Kahneman 1981).

Approaches to Studying Context

For the goal of describing context in observational data, social science has turned to text approaches—with topic models being popular (see Grimmer 2010; Quinn et al. 2010; Roberts, Stewart, and Airoldi 2016). Topic models provide a way to understand the allocation of attention across groupings of words.

Although such models have a built-in notion of polysemy (a single word can be allocated to different topics), they are rarely used as a mechanism for studying how individual words are used to convey different ideas (Grimmer and Stewart 2013). And though topic approaches do exist that allow for systematic variation in the use of a word across topics by different pieces of observed metadata (Roberts, Stewart, and Airoldi 2016), they are computationally intensive (especially relative to the approaches we present below). The common unit of analysis for topic models is the
document. This has implications for the way that these models capture the logic of the "distributional hypothesis"—the idea that, in the sense of Firth (1957, 11), "You shall know a word by the company it keeps"—in other words, that one can understand a particular version of the "meaning" of a term from the way it co-occurs with other terms. Specifically, in the case of topic models, the entire document is the context. From this we learn the relationships (the themes) between words and the documents in which they appear.

But in the questions we discuss here, the interest is in the contextual use of a specific word. To study this, social scientists have turned to word embeddings (e.g., Rheault and Cochrane 2020; Rodman 2020). For example, Caliskan, Bryson, and Narayanan (2017) and Garg et al. (2018) have explored relationships between words captured by embeddings to describe problematic gender and ethnic stereotypes in society at large. Word embeddings predict a focal word as a function of the other words that appear within a small window of that focal word in the corpus (or the reverse, predict the neighboring words from the focal word). In so doing, they capture the insight of the distributional hypothesis in a very literal way: the context of a term is the tokens that appear near it in text, on average. In practice, this is all operationalized via a matrix of co-occurrences of words that respect the relevant window size. In the limit, where we imagine the relevant window is the entire document, one can produce a topic model from the co-occurrence matrix directly. Thus as the context window in the embedding model approaches the length of the document, the embeddings will increasingly look like the word representations in a topic model.

Whether, and in what way, embedding models based on the distributional hypothesis capture "meaning" is more controversial. Here we take a narrow, "structuralist" (in the sense of Harris 1954) view. For this paper, meaning is in terms of description and is empirical. That is, it arises from word co-occurrences in the data, alone: we will neither construct nor assume a given theoretical model of language or cognition. And, in contrast to other scholars (e.g., Miller and Charles 1991), we will make no claims that the distributions per se have causal effects on human understandings of terms. Thus, when we speak of the meaning of a focal word being different across groups, we are talking in a thin sense about the distribution of other words within a fixed window size of that focal word being different. Though we will offer guidance, substantive interpretation of these differences for a given purpose is ultimately up to the researcher. That is, as always with such text measurement strategies, subject-expert validation is important.

For a variety of use cases, social scientists want to make systematic inferences about embeddings—which requires statements about uncertainty. Suppose we wish to compare the context of "society" as conveyed by British Prime Ministers with that of US Presidents. Do they differ in a statistically significant way? To judge this, we need some notion of a null hypothesis, some understanding of the variance of our estimates, and a test statistic. Although there have been efforts to compare embeddings across groups (Rudolph et al. 2017) and to give frameworks for such conditional relationships (Han et al. 2018), these are nontrivial to implement. Perhaps more problematically for most social science cases is that they rely on underlying embedding models that struggle to produce "good" representations—that make sense and correctly capture how that word is actually used—when we have few instances of a term of interest. This matters because we are typically far short of the word numbers that standard models require for optimal performance and terms (like "society") may be used in ways that are idiosyncratic to a particular document or author.

In the next section, we will explain how we build on earlier insights from ALC embeddings (Khodak et al. 2018) to solve these problems in a fast, simple, and sample-efficient "regression" framework. Before doing so, we note three substantive use cases that both motivate the methodological work we do and show its power as a tool for social scientists. The exercise in all cases is linguistic discovery insofar as our priors are not especially sharp and the primary value is in stimulating more productive engagement with the text. Nonetheless, in using the specific approach we outline in this paper, we will be able to make inferences with attendant statements about uncertainty. In that sense, our examples are intended to be illuminating for other scholars comparing corpora or comparing authors within a corpus.

Use Case I: Partisan Differences in Word Usage

A common problem in Americanist political science is to estimate partisan differences in the usage of a given term. Put literally, do Republicans and Democrats mean something different when they use otherwise identical words like immigration and marriage? Although there have been efforts to understand differential rates of word use within topics pertaining to these terms (e.g., Monroe, Colaresi, and Quinn 2008), there has been relatively little work on whether the same words appear in different contexts. Below, we use the Congressional Record (Sessions 111–114) as our corpus for this study (Gentzkow, Shapiro, and Taddy 2018). This requires that we compare embeddings as a function of party (and other covariates).

Use Case II: Changing UK–US Understandings of "Empire"

The United Kingdom's relative decline as a Great Power during the postwar period has been well documented (e.g., Hennessy 1992). One way that we might investigate the timing of US dominance (over the UK, at least) is to study the changing understanding of the term Empire in both places. That is, beyond any attitudinal shift, did American and British policy makers alter the way they used empire as the century wore on? If they did, when did this occur? And did the elites of these countries converge or diverge in terms of their associations of the term? To answer these
questions, we will statistically compare the embedding for the term Empire for the UK House of Commons (via Hansard) versus the US Congress (via the Congressional Record) from 1935–2010.

Use Case III: Brexit Sentiment from the Backbenches

The UK's decision to leave the European Union (EU) following the 2016 referendum was momentous (Ford and Goodwin 2017). Although the vote itself was up to citizens, the build-up to the plebiscite was a matter for elites; specifically, it was a consequence of the internal machinations of the parliamentary Conservative Party that forced the hand of their leader, Prime Minister David Cameron (Hobolt 2016). A natural question concerns the attitudes of that party in the House of Commons toward the EU, both over time and relative to other issue areas (such as education and health policy). To assess that, we will use an embedding approach to sentiment estimation for single instances of terms that builds on recent work on emotion in parliament (Osnabrügge, Hobolt, and Rodon 2021). This will also allow us to contribute to the literature on Member of Parliament (MP) position taking via speech (see, e.g., Slapin et al. 2018).

USING ALC EMBEDDINGS TO MEASURE MEANING

Our methodological goal is a regression framework for embeddings. By "regression" we mean two related ideas. Narrowly, we mean that we want to be able to approximate a conditional expectation function, typically written E[Y|X] where, as usual, Y is our outcome, X is a particular covariate, and E is the expectations operator. We want to make statements about how embeddings (our Y) differ as covariates (our X) change. More broadly, we use "regression" to mean machinery for testing hypotheses about whether the groups actually differ in a systematic way. And by extension, we want that machinery to provide tools for making downstream comments about how those embeddings differ. In all [of this, we need embeddings that can be estimated reliably from only a few instances of a word—and ALC] embeddings (Khodak et al. 2018) promise exactly this. We now give some background and intuition on that technique. We then replicate Rodman (2020)—a recent study introducing time-dependent word embeddings for political science—to demonstrate ALC's efficiency and quality.

Word Embeddings Measure Meaning through Word Co-Occurrence

Word embeddings techniques give every word a distributed representation—that is, a vector. The length or dimension (D) of this vector is—by convention—between 100 and 500. When the inner product between two different words (two different vectors) is high, we infer that they are likely to co-occur in similar contexts. The distributional hypothesis then allows us to infer that those two words are similar in meaning. Although such techniques are not new conceptually (e.g., Hinton 1986), methodological advances in the last decade (Mikolov et al. 2013; Pennington, Socher, and Manning 2014) allow them to be estimated much more quickly. More substantively, word embeddings have been shown to be useful, both as inputs to supervised learning problems and for understanding language directly. For example, embedding representations can be used to solve analogy reasoning tasks, implying the vectors do indeed capture relational meaning between words (e.g., Arora et al. 2018; Mikolov et al. 2013).

Understanding exactly why word embeddings work is nontrivial. In any case, there is now a large literature proposing variants of the original techniques (e.g., Faruqui et al. 2015; Lauretig 2019). A few of these are geared specifically to social science applications where the general interest is in measuring changes in meanings, especially via "nearest neighbors" of specific words.

Although the learned embeddings provide a rough sense of what a word means, it is difficult to use them to answer questions of the sort we posed above. Consider our interest in how Republicans and Democrats use the same word (e.g., immigration) differently. If we train a set of word embeddings on the entire Congressional
[…] for the embedding of a word and the embeddings of the words that appear in the contexts around it.

To fix ideas, consider the following toy example. Our corpus is the memoirs of a politician, and we observe two entries, both mentioning the word "bill":

1. The debate lasted hours, but finally we [voted on the bill and it passed] with a large majority.
2. At the restaurant we ran up [a huge wine bill to be paid] by our host.

As one can gather from the context—here, the three words either side of the instance of "bill" in square brackets—the politician is using the term in two different (but grammatically correct) ways.

The main result from Arora et al. (2018) shows the following: if the random walk model holds, the researcher can obtain an embedding for word w (e.g., "bill") by taking the average of the embeddings of the words around w (u_w) and multiplying them by a particular square matrix A. That A serves to downweight the contributions of very common (but uninformative) words when averaging. Put otherwise, if we can take averages of some vectors of words that surround w (based on some preexisting set of embeddings v_w) and if we can find a way to obtain A (which we will see is also straightforward), we can provide new embeddings for even very rare words. And we can do this almost instantaneously.

Returning to our toy example, consider the first, legislative, use of "bill" and the words around it. Suppose we have embedding vectors for those words from some other larger corpus, like Wikipedia. To keep things compact, we will suppose those embeddings are all of three dimensions (such that D = 3), and take the following values:

voted = (−1.22, 1.33, 0.53), on = (1.83, 0.56, −0.81), the = (−0.06, −0.73, 0.82), bill, and = (1.81, 1.86, 1.57), it = (−1.50, −1.65, 0.48), passed = (−0.12, 1.63, −0.17).

Obtaining u_w for "bill" simply requires averaging these [six vectors …]

[… one might hope that these raw averages alone would be enough to distinguish the two] senses. Unfortunately, they will not—this is shown empirically in Khodak et al. (2018) and in our Trump/trump example below. As implied above, the intuition is that simply averaging embeddings overexaggerates common components associated with frequent (e.g., "stop") words. So we will need the A matrix too: it down-weights these directions so they don't overwhelm the induced embedding.

Khodak et al. (2018) show how to put this logic into practice. The idea is that a large corpus (generally the corpus the embeddings were originally trained on, such as Wikipedia) can be used to estimate the transformation matrix A. This is a one-time cost after which each new word embedding can be computed à la carte (thus the name), rather than needing to retrain an entire corpus just to get the embedding for a single word. As a practical matter, the estimator for A can be learned efficiently with a lightly modified linear regression model that reweights the words by a nondecreasing function α(·) of the total instances of each word (n_w) in the corpus. This reweighting addresses the fact that words that appear more frequently have embeddings that are measured with greater certainty. Thus we learn the transformation matrix as

Â = argmin_A ∑_{w=1}^{W} α(n_w) ‖v_w − A u_w‖₂².   (1)

The natural log is a simple choice for α(·), and works well. Given Â, we can introduce new embeddings for any word by averaging the existing embeddings for all words in its context to create u_w and then applying the transformation such that v̂_w = Â u_w. The transformation matrix is not particularly hard to learn (it is a linear regression problem), and each subsequent induced word embedding is a single matrix multiply.
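To make these two steps concrete, the following is a minimal sketch in Python (not the authors' conText implementation): it estimates the transformation matrix Â by weighted least squares, as in Equation 1, and then induces an à la carte embedding by averaging the pretrained vectors of a focal word's context words and applying Â. The inputs V (pretrained embeddings), U (corpus-average context embeddings for the same words), and counts (the word frequencies n_w) are assumed to be available.

```python
import numpy as np

def estimate_A(V, U, counts):
    """Estimate the ALC transformation matrix (Equation 1).

    V      : (W, D) pretrained embeddings v_w for the W corpus words.
    U      : (W, D) average-of-context embeddings u_w for the same words.
    counts : (W,) total number of instances n_w of each word.
    Solves argmin_A sum_w alpha(n_w) * ||v_w - A u_w||^2 with alpha = log.
    """
    weights = np.sqrt(np.log(counts))[:, None]       # row weights sqrt(alpha(n_w))
    # Weighted multivariate least squares: (weights*U) A^T ~= (weights*V)
    A_T, *_ = np.linalg.lstsq(weights * U, weights * V, rcond=None)
    return A_T.T                                      # (D, D) transformation matrix

def alc_embedding(context_vectors, A_hat):
    """Induce an embedding for a focal word from its context word vectors.

    context_vectors : (k, D) pretrained embeddings of the words appearing in
                      the context window(s) around the focal word.
    Returns v_hat = A_hat @ u, where u is the simple average of the contexts.
    """
    u = context_vectors.mean(axis=0)
    return A_hat @ u
```

In the toy example above, context_vectors would hold the six three-dimensional vectors for voted, on, the, and, it, and passed, and A_hat would be a 3 × 3 matrix.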
Returning to our toy example, suppose that we estimate Â from a large corpus like Hansard or the Congressional Record or wherever we obtained the embeddings for the words that surround "bill." Suppose that we estimate Â to be […]. The nearest neighbors of the induced embedding v̂_bill1 are then legislation = (3.11, 2.52, 3.38) and amendment = (2.15, 2.47, 3.42), whereas the nearest neighbors of v̂_bill2 are dollars = (−1.92, −1.54, −0.60) and cost = (−1.95, −1.61, −0.63). This makes sense, given how we would typically read the politician's lines above. The key here is that the ALC method allowed us to infer the meaning of words that occurred rarely in a small corpus (the memoirs) without having to build embeddings for those rare words in that small corpus: we could "borrow" and transform the embeddings from another source. Well beyond this toy example, Khodak et al. (2018) finds empirically that the learned Â in a large corpus recovers the original word vectors with high accuracy (greater than 0.90 cosine similarity). They also demonstrate that this strategy achieves state-of-the-art and near state-of-the-art performance on a wide variety of natural language processing tasks (e.g., learning the embedding of a word using only its definition, learning meaningful n-grams, classification tasks, etc.) at a fraction of the computational cost of the alternatives.

The ALC framework has three major advantages for our setting: transparency, computational ease, and efficiency. First, compared with many other embedding strategies for calculating conditional embeddings (e.g., words over time) the information used in ALC is transparent. The embeddings are derived directly from the additive information of the words in the context window around the focal word; there is no additional smoothing or complex interactions across different words. Furthermore, the embedding space itself does not change; it remains fixed to the space defined by the pretrained embeddings. Second, this same transparency leads to computational ease. The transformation matrix A only has to be estimated once, and then each subsequent induction of a new word is a single matrix multiply and thus effectively instantaneous. Later we will be able to exploit this speed to allow bootstrapping and permutation procedures that would be unthinkable if there was an expensive model fitting procedure for each word. Finally, ALC is efficient in the use of information. Once the transformation matrix is estimated, it is only necessary that u_w converges—in other words, we only need to estimate a D-dimensional mean from a set of samples. In the case of a six-word symmetric context window, there are 12 words total within the context window; thus, for each instance of the focal word we have a sample of size 12 from which to estimate the mean.

Although Khodak et al. (2018) focused on using the ALC framework to induce embeddings for rare words and phrases, we will apply this technique to embed words used in different partitions of a single corpus or to compare across corpora. This allows us to capture differences in embeddings over time or by speaker, even when we have only a few instances within each sample. Importantly, unlike other methods, we don't need an entirely new corpus to learn embeddings for select focal words; we can select particular words and calculate (only) their embeddings using only the contexts around those particular words.1 We now demonstrate this power of ALC by replicating Rodman (2020).2

Proof of Concept for ALC in Small Political Science Corpora: Reanalyzing Rodman (2020)

The task in Rodman (2020) is to understand changes in the meaning of equality over the period 1855–2016 in a corpus consisting of the headlines and other summaries of news articles.3 As a gold standard, a subset of the articles is hand-coded into 15 topic word categories—of which five are ultimately used in the analysis—and the remaining articles are coded using a supervised topic model with the hand-coded data as input. Four embeddings techniques are used to approximate trends in coverage of those categories, via the (cosine) distance between the embedding for the word equality and the embeddings for the category labels. This is challenging because the corpus is small—the first 25-year slice of data has only 71 documents—and in almost 30% of the word-slice combinations there are fewer than 10 observations.4

Rodman (2020) tests four different methods by comparing results to the gold standard; ultimately, the chronologically trained model (Kim et al. 2014) is the best performer. In each era (of 25 years), the model is fit several times on a bootstrap resampled collection of documents and then averaged over the resulting solutions (Antoniak and Mimno 2018). Importantly, the model in period t is initialized with period t−1 embeddings, whereas the first period is initialized with vectors trained on the full corpus. Even for a relatively small corpus, this process is computationally expensive, and our replication took about five hours of compute time on an eight-core machine.

The ALC approach to the problem is simple. For each period we use ALC to induce a period-specific embedding for equality as well as each of the five category words: gender, treaty, german, race, and african_american. We use GloVe pretrained embeddings and the corresponding transformation matrix estimated by Khodak et al. (2018)—in other words, we make use of no corpus-specific information in the initial embeddings and require as inputs only the context window around each category word. Following Rodman, we compute the cosine similarity between equality and each of the five category words, for each period. We then standardize (make into z-scores) those similarities. The entire process is transparent and takes only a few milliseconds (the embeddings themselves involve six matrix multiplies).

1 For context, there are many approaches in computer science including anchoring words (Yin, Sachidananda, and Prabhakar 2018) and vector space alignment (Hamilton, Leskovec, and Jurafsky 2016).
2 Many papers in computer science have studied semantic change (see Kutuzov et al. 2018 for a survey).
3 For replication code and data see Rodriguez, Spirling, and Stewart (2022).
4 We provide more information on the sample constraints in Supplementary Materials, Part A.
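As a sketch of this workflow (schematic Python, not the replication code), the function below induces a period-specific ALC embedding for "equality" and for each category word from their context windows, takes cosine similarities, and z-scores each category's series across periods. It assumes a user-supplied context_vectors(word, period) returning the stacked pretrained vectors of the tokens in the windows around that word in that period, plus a transformation matrix A_hat.

```python
import numpy as np

CATEGORIES = ["gender", "treaty", "german", "race", "african_american"]
PERIODS = ["1855-1879", "1880-1904", "1905-1929", "1930-1954",
           "1955-1979", "1980-2004", "2005-2016"]

def cosine(a, b):
    return a @ b / (np.linalg.norm(a) * np.linalg.norm(b))

def equality_similarities(context_vectors, A_hat):
    """Cosine similarity between the period-specific ALC embedding of
    'equality' and each category word, z-scored within category."""
    sims = np.empty((len(PERIODS), len(CATEGORIES)))
    for i, period in enumerate(PERIODS):
        eq = A_hat @ context_vectors("equality", period).mean(axis=0)
        for j, cat in enumerate(CATEGORIES):
            cv = A_hat @ context_vectors(cat, period).mean(axis=0)
            sims[i, j] = cosine(eq, cv)
    # standardize each category's similarity series across periods
    return (sims - sims.mean(axis=0)) / sims.std(axis=0)
```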
FIGURE 1. [Line plots of z-score normalized model output by 25-year period (1855–2016), comparing the ALC and chronological (CHR) models with the gold standard (GS); one panel per category (e.g., Gender).]
Note: ALC = ALC model, CHR = chronological model, and GS = gold standard.
How does ALC do? Figure 1 is the equivalent of Figure 3 in Rodman (2020). It displays the normalized cosine similarities for the chronological model (CHR, taken from Rodman 2020) and ALC, along with the […]
[TABLE: Top nearest neighbors of "equality," by period, for the chronological (CHR) and ALC models.]

[…] that will allow us to answer the types of questions we introduced above.

TESTING HYPOTHESES ABOUT EMBEDDINGS

[…]
[FIGURE: Instances of the target word projected onto two principal components (PC2 shown), with clusters identified by dominant word sense, and average cluster homogeneity compared across LSA, embedding averaging, BERT, and ALC.]

[…] projected to two dimensions with Principal Components Analysis (PCA) and identifying the two clusters by their dominant word sense. We explicitly mark misclassifications with an x.

To provide a quantitative measure of performance we compute the average cluster homogeneity: the degree to which each cluster contains only members of a given class. This value ranges between 0—both clusters have equal numbers of both context types—and 1—each cluster consists entirely of a single context type.
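One standard, off-the-shelf implementation of such a homogeneity measure is available in scikit-learn; whether this is the exact statistic computed in the paper is not stated here, so treat the snippet below as an illustrative stand-in. It returns 1 when each cluster contains members of only one sense and 0 when the clusters are uninformative about the senses.

```python
from sklearn.metrics import homogeneity_score

# true word senses of each instance (e.g., 0 = the surname, 1 = the common noun/verb)
senses = [0, 0, 0, 0, 1, 1, 1, 1]
# cluster assignment of each instance, e.g., from k-means on the induced vectors
clusters = [0, 0, 0, 1, 1, 1, 1, 1]

print(homogeneity_score(senses, clusters))  # equals 1.0 only if every cluster is sense-pure
```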
By way of comparison, we do the same exercise using other popular methods of computing word vectors for each target realization including latent semantic analysis (LSA), simple averaging of the corresponding […]

[…] someone. This serves to highlight the importance of the linear transformation A in the ALC method.

5 We used the New York Times developer API to build our corpus. See https://developer.nytimes.com/docs/articlesearch-product/1/overview.
6 For LSA we use two dimensions and tf-idf weighting. We found these settings produced the best results.
7 RoBERTa is a substantially more complicated embedding method that produces contextually specific embeddings and uses word order information.
8 Note here that we are treating the A matrix as fixed, and thus we are not incorporating uncertainty in those estimates. In experiments (see Supplementary Materials, Part F) we found this uncertainty to be minor and a second-order concern for our applications.
9 This may be a result of RoBERTa's optimization for sentence embeddings as opposed to embeddings for an individual word. Nonetheless, it is surprising given that transformer-based models lead almost every natural language process benchmark task. Even at comparable performance though, there would be reason not to use RoBERTa models simply based on computational cost and comparative complexity.
TABLE 3. Top 10 Nearest Neighbors Using Simple Averaging of Embeddings and ALC
[Columns: Trump, trump. …]
Although this example is a relatively straightforward case of polysemy, we also know that the meaning of Trump, the surname, underwent a significant transformation once Donald J. Trump was elected president of the United States in November 2016. This is a substantially harder case because the person being referred to is still the same, even though the contexts it is employed in—and thus in the sense of the distributional hypothesis, the meaning—has shifted. But as we show in Supplementary Materials B, ALC has no problem with this case either, returning excellent cluster homogeneity and nearest neighbors.

The good news for the Trump examples is that ALC can produce reasonable embeddings even from single instances. Next we demonstrate that each of these instances can be treated as an observation in a hypothesis-testing framework. Before doing so, although readers may be satisfied about the performance of ALC in small samples, they may wonder about its performance in large samples. That is, whether it converges to the inferences one would make from a "full" corpus model as the number of instances increases; the answer is "yes," and we provide more details in Supplementary Materials C.

[…] produce an outcome variable Y, which is of dimensions n (the number of instances of a given word) by D. The usual multivariate matrix equation is then

Y = X β + E,   (2)

with dimensions n × D, n × (p + 1), (p + 1) × D, and n × D, respectively, where X is a matrix of p covariates and includes a constant term, whereas β is a set of p coefficients and an intercept (all of dimension D). Then E is an error term.

To keep matters simple, suppose that there is a constant and then one binary covariate indicating group membership (in the group, or not). Then, the coefficient β0 (the first row of the matrix β) is equivalent to averaging over all instances of the target word belonging to those not in the group. Meanwhile, β0 + β1 (the second row of β) is equivalent to averaging over all instances of the target word that belong to the group (i.e., for which the covariate takes the value 1, as opposed to zero). In the more general case of continuous covariates, this provides a model-based estimate of the embedding among all instances at a given level of the covariate space.
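The claim about the binary-covariate case can be checked numerically; the snippet below is purely illustrative (random data) and confirms that the ordinary least squares solution to Equation 2 reproduces the two group-wise average embeddings.

```python
import numpy as np

rng = np.random.default_rng(0)
Y = rng.normal(size=(10, 4))                    # 10 instance embeddings, D = 4
group = np.array([0, 0, 0, 0, 1, 1, 1, 1, 1, 1])
X = np.column_stack([np.ones(10), group])       # constant plus binary covariate

B, *_ = np.linalg.lstsq(X, Y, rcond=None)       # B stacks beta_0 and beta_1 (2 x 4)
assert np.allclose(B[0], Y[group == 0].mean(axis=0))          # beta_0: out-group mean
assert np.allclose(B[0] + B[1], Y[group == 1].mean(axis=0))   # beta_0 + beta_1: in-group mean
```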
The main outputs from this à la carte on text (conText) embedding "regression" model are […]
We then compute the proportion of those values that are larger than the observed norms (i.e., with the true group assignments). This is the empirical p-value.

Note that, if desired, one can obtain the estimated sampling distribution (and thus standard errors) of the (normed) coefficients via nonparametric bootstrap. This allows for comments on the relative size of differences in embeddings across and within groups as defined by their covariates. We now show how the conText model may be used in a real estimation problem.

Our Framework in Action: Pre–Post Election Hypothesis Testing

We can compare the change in the usage of the word Trump to the change in the usage of the word Clinton after the 2016 election. Given Trump won the election and subsequently became President—a major break with respect to his real-estate/celebrity past—we expect a statistically significant change for Trump relative to any changes in the usage of Clinton.

We proceed as follows: for each target word-period combination—Clinton and Trump, preelection (2011–2014) and postelection (2017–2020)—we embed each individual instance of the focal word from our New York Times corpus of leading article paragraphs, and estimate the following regression:

Y = β0 + β1 Trump + β2 Post_Election + β3 Trump × Post_Election + E,   (3)

where Trump is an indicator variable equal 1 for Trump instances, 0 otherwise. Likewise Post_Election is a dummy variable equal 1 for 2017–2020 instances of Trump or Clinton. As before, this is simply a regression-based estimator for the individual subgroups. We will use permutation for hypothesis testing.

Figure 4 plots the norm of the β̂s along with their bootstrapped 95% CIs. To reiterate, norming means the coefficient vectors become scalars. The significant positive value on the Trump × Post_Election coefficient indicates the expected additional shift in the usage of Trump postelection over and above the shift in the usage of Clinton.

FIGURE 4. [Norms of the estimated coefficients with bootstrapped 95% CIs; Trump × Post_Election, Post_Election, and Trump are each significant (***).]

Although this news is encouraging, readers may wonder how the conText regression model performs relative to a "natural" alternative—specifically, a full embeddings model fit to each use of the term by covariate value(s). This would require the entire corpus (rather than just the instances of Trump and Clinton) and would be computationally slow, but perhaps it would yield more accurate inferences. As we demonstrate in Supplementary Materials D, inferences are similar and our approach is more stable by virtue of holding constant embeddings for all nonfocal words.
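Putting the pieces together, here is a schematic sketch (not the authors' conText package, which implements this in R) of the inference loop described in this section: OLS on instance-level ALC embeddings (Equation 2), the norm of each coefficient vector as the quantity of interest, permutation of the covariates for empirical p-values, and a nonparametric bootstrap over instances for confidence intervals. Y is the n × D matrix of instance embeddings and X the n × (p + 1) design matrix; both are assumed to be already built.

```python
import numpy as np

def coef_norms(Y, X):
    """OLS fit of Y = X B + E; returns the Euclidean norm of each row of B."""
    B, *_ = np.linalg.lstsq(X, Y, rcond=None)       # B is (p + 1) x D
    return np.linalg.norm(B, axis=1)

def permutation_pvalues(Y, X, n_perm=1000, seed=1):
    """Share of permuted-covariate norms at least as large as the observed ones."""
    rng = np.random.default_rng(seed)
    observed = coef_norms(Y, X)
    exceed = np.zeros_like(observed)
    for _ in range(n_perm):
        Xp = X.copy()
        Xp[:, 1:] = X[rng.permutation(len(X)), 1:]  # shuffle covariate rows, keep intercept
        exceed += coef_norms(Y, Xp) >= observed
    return exceed / n_perm

def bootstrap_norm_ci(Y, X, n_boot=1000, seed=1, level=0.95):
    """Percentile bootstrap CIs for the coefficient norms, resampling instances."""
    rng = np.random.default_rng(seed)
    draws = []
    for _ in range(n_boot):
        idx = rng.integers(0, len(Y), len(Y))
        draws.append(coef_norms(Y[idx], X[idx]))
    lo, hi = (1 - level) / 2, 1 - (1 - level) / 2
    return np.quantile(np.array(draws), [lo, hi], axis=0)
```

For the pre/post-election example, the columns of X would be an intercept, the Trump indicator, the Post_Election indicator, and their product, matching Equation 3.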
RESULTS

We now turn to substantive use cases, beginning with partisan differences in the United States.

Partisanship, Ideology and Gender Differences

We want to evaluate partisan and gender differences in the usage of a given term in Congress Sessions 111–114
(Obama years). Our focus is a set of target words known to be politically charged: abortion, immigration, and marriage. We also include three nonpartisan stopwords—and, the, and but—in our target set as comparison.

We estimate the following multivariate multiple regression model for each of our words:

Y = β0 + β1 Republican + β2 Male + E.   (4)

The dependent variable is an ALC embedding of each individual realization in the corpus. For the right-hand side, we use indicator variables (Republican or otherwise; Male or otherwise). We use permutation to approximate the null and bootstrapping to quantify the sampling variance.

[FIGURE: Norms of the estimated β̂s.] Note: Generally, different genders in the same party have more similar understanding of a term than the same gender across parties. See SM Section J for full regression output. All coefficients are statistically significant at 0.01 level.

Note again that magnitudes have no natural absolute […]

[…] each party. One reading of these nearest neighbors is that Democrats were pushing for reform of existing laws, whereas Republicans were mainly arguing for enforcement. We can corroborate this via the top nearest contexts—that is, the individual contexts of immigration—embedded using ALC—that are closest to each party's ALC embedding of the term (see Table 5). This suggests some validity of our general approach.

Our approach is not limited to binary covariates. To illustrate, we regress the target word immigration on the first dimension of the NOMINATE score10—understood to capture the liberal–conservative spectrum on economic matters (Poole 2005). This approximates a whole sequence of separate embeddings for each speaker, approximated using a line in […]
TABLE 5. [Nearest contexts of immigration closest to each party's ALC embedding]
Democrats
this congress to take on comprehensive immigration reform and fix our broken immigration
should get to work on comprehensive immigration reform the immigration system we have
Republicans
administration wants to ignore our nation's immigration laws and immigration process the problem
broken is the enforcement of our immigration laws and we have seen that

The Meaning of "Empire"

Recall that our plan for the second case study was to compare the embedding of Empire in the UK and US context for the period 1935–2010. In the estimation we use the top (most frequent) 5,000 tokens of the combined corpora and we estimate a 300-dimensional GloVe model and corresponding A matrix specific to the corpus. The multivariate regression analogy is

Y = β0 + β1 CongressionalRecord + E,   (6)

estimated for every year of the period. Interest focuses on the (normed) value of β1: when this rises, the use of Empire is becoming less similar across the corpora (Congress is becoming more distinctive). The time series of the β1s is given in Figure 7. The basic summary is that, sometime around 1949–50, there was a once-and-for-all increase in the distance between US and UK understandings of Empire. We confirmed this with a structural break test (Bai and Perron 1998).

To understand the substance of the change, consider Figure 8. We report the "most American" and "most British" (with reference to the parliaments) terms from the period on either side of the split in the series. Specifically, we calculate the cosine similarity between the ALC embedding for Empire and each nearest neighbor in the UK and US corpus. The x-axis is the ratio of these similarities: when it is large (farther to the right), the word is relatively closer to the US understanding of Empire than to the UK one. An asterisk by the term implies that ratio's deviation from 1 is […]

Brexit Sentiment from the Backbenches

Our goal is to estimate the sentiment of the Conservative party toward the EU in the House of Commons. First, the underlying debate text and metadata is from Osnabrügge, Hobolt, and Rodon (2021), covering the period 2001–2019. We are interested in both major parties of government, Labour and Conservatives. We divide those parties' MPs by role: Cabinet (or Shadow Cabinet in opposition) members of the government party are "cabinet," and all others are "backbenchers," by definition. We compare policy sentiment in three areas: education (where our term of interest is "education"), health ("nhs"), and the EU ("eu").

In what follows, each observation for us is a representation of the sentiment of a party-rank-month triple toward a given term. For instance, (the average) Conservative-backbencher-July 2015 sentiment toward "health." We describe our approach in SM E; in essence we measure the inner product between the term of interest and the aggregate embeddings of the (positive and negative) words from a sentiment dictionary (Warriner, Kuperman, and Brysbaert 2013). We then rescale within party and policy area, obtaining Figure 9. There, each column is a policy area: education, health, and then the EU. The rows represent the Conservatives at the top and Labour at the bottom, with the correlation between Tory backbenchers and cabinet in the middle.
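For concreteness, here is a rough sketch of such a sentiment measure under stated assumptions (the exact aggregation used in the paper is in SM E, not reproduced here): the ALC embedding of the focal term for one party-rank-month is compared, by inner product, with the mean pretrained embeddings of the positive and negative dictionary words, and the resulting series is then z-scored within a party and policy area.

```python
import numpy as np

def valence(term_context_vectors, pos_vectors, neg_vectors, A_hat):
    """Inner-product valence of one party-rank-month embedding of a term
    against aggregate positive and negative dictionary embeddings."""
    v = A_hat @ term_context_vectors.mean(axis=0)        # ALC embedding of the term
    return v @ pos_vectors.mean(axis=0) - v @ neg_vectors.mean(axis=0)

def rescale(scores):
    """Z-score a series of valences, e.g., within one party and policy area."""
    scores = np.asarray(scores, dtype=float)
    return (scores - scores.mean()) / scores.std()
```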
We see an obvious "government versus opposition" Westminster dynamic: when Labour is in power (so, from the start of the data to 2010), Labour leaders and backbenchers are generally enthusiastic about government policy. That is, their valence is mostly positive, which makes sense given almost total government agenda control (i.e., the policy being discussed is government policy). The Conservatives are the converse: both elites and backbenchers have negative valence for government policy when in opposition but are much more enthusiastic when in government. This is true for education, and health to a lesser extent. So far, so expected.

But the subject of the EU (the "eu" column) is different (top right cell).
FIGURE 6. Cosine Similarity (LOESS Smoothed) between Various Words and "Immigration" at Each Percentile of NOMINATE Scores
[Plot: cosine similarity between the predicted ALC embedding and each feature (bipartisan, reform, amend, illegals, enforce), by percentile of DW-NOMINATE (higher values, more Conservative), with the median Democrat and median Republican marked.]
Note: We mark the median Democrat and median Republican to help calibrate the scale. See SM Section J for full regression output.
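Figure 6 rests on the model-based predictions described earlier for continuous covariates: the fitted embedding at a given covariate value is the intercept row of the coefficient matrix plus the scaled slope row. A minimal illustration (hypothetical inputs, not the replication code):

```python
import numpy as np

def predicted_similarity(B, covariate_value, feature_vec):
    """Cosine similarity between the model-implied embedding at a covariate
    value (e.g., a NOMINATE percentile) and a feature word's embedding.
    B is the 2 x D coefficient matrix from Y = beta0 + beta1 * covariate + E."""
    v = B[0] + covariate_value * B[1]
    return v @ feature_vec / (np.linalg.norm(v) * np.linalg.norm(feature_vec))
```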
We see that even after the Conservatives come to power (marked by the dotted […]), sentiment is generally positive or close to zero for education and health but negative for the EU—that is, moving in opposite directions. Our finding here is that Cameron never convinced the average Conservative backbencher that his EU policy was something about which they should feel positive.

A more traditional approach would be to count the number of occurrences of terms in the sentiment dictionary and assign each speech a net valence score. Figure 10 displays that result. Patterns are harder to read. More importantly, only 56% of the terms in the dictionary occur in the speeches and a full 69% of speeches had no overlap with the set of dictionary terms—and thus receive a score of 0. This contrasts with the 99% of terms in the dictionary appearing in the pretrained embeddings, allowing for all speeches to be scored. This is due to the continuity of the embedding space.

[…] summarize experiments we did on real texts. Below, we use "pretrained" to refer to embeddings that have been fit to some large (typically on-line) data collection like Wikipedia. We use "locally fit" to mean embeddings produced from—that is, vectors learned from—the texts one is studying (e.g., Congressional debates). We note that Rodriguez and Spirling (2022) provide extensive results on this comparison for current models; thus, here we are mostly extending those enquiries to our specific approach. Our full write up can be seen in Supplementary Materials F–H. The following are the most important results.

First, we conducted a series of supervised tasks, where the goal is to separate the uses of trump versus Trump per our example above. We found that removing stopwords and using bigger context windows results in marginally better performance.
FIGURE 7. Norm of the British and American Difference in Understanding of "Empire," 1935–2010
[Plot: norm of β̂ by two-year period from 1935–1936 to 2009–2010, with the estimated structural break marked.]
Note: Larger values imply the uses are more different. The dashed lines show the bootstrapped 95% CIs. See SM Section J for full regression output.
That is, if the [researcher's goal is to separate two uses] of a term (or something related, such as classifying documents), more data—that is, larger contexts, less noise—make sense. To be candid though, we do not think such a task—where the goal is a version of accuracy—is a particularly common one in political science.

We contend a more common task is seeking high-quality embeddings per se. That is, vector representations of terms that correctly capture the "true" embedding (low bias) and are simultaneously consistent across similar specifications (low variance, in terms of model choices). We give many more details in the SM, but the basic idea here is to fit locally trained embeddings—with context window sizes 2, 6, and 12—to the Congressional Record corpus (Sessions 107–114). We then treat those embeddings as targets to be recovered from various ALC-based models that follow, with closer approximations being deemed better. As an additional "ground truth" model, we use Stanford GloVe pretrained embeddings (window size 6, 300 dimensions). We narrow our comparisons to a set of "political" terms as given by Rodriguez and Spirling (2022). We have five lessons from our experiments:

1. Pretraining and windows: Given a large corpus, local training of a full embeddings model and corresponding A matrix makes sense. Our suggested approach can then be used to cheaply and flexibly study differences across groups. Barring that, using pretrained embeddings trained on large online corpora (e.g., Stanford GloVe) provides a very reasonable approximation that can be further improved by estimating an A matrix specific to the local corpus. But again, if data are scarce, using an A matrix trained on the original online corpus (e.g., Khodak et al.'s 2018 A in the case of GloVe) leads to very reasonable results. In terms of context window size, avoid small windows (of size < 5). Windows of size 6 and 12 perform very similarly to each other and acceptably well in an absolute sense.
2. Preprocessing: Removing stopwords from contexts used in estimating ALC embeddings makes very little difference to any type of performance. In general, apply the same preprocessing to the ALC contexts as was applied at the stage of estimating the embeddings and A matrix—for example, if stopwords were not removed, then do not remove stopwords. Stemming/lemmatization does not change results much in practice.
3. Similarity metrics: The conventional cosine similarity provides interpretable neighbors, but the inner product often delivers very similar results.
FIGURE 8. [Nearest neighbors of Empire plotted by the ratio of their cosine similarity to the US versus the UK embedding, pre and post the estimated breakpoint; terms include commonwealth, territories, colonies, colonial, india, britain, rhodesia, germany, soviet, and communism, among others.]
Note: Most US and UK nearest neighbors pre and post estimated breakpoint. * = statistically significant at 0.01 level.
4. Uncertainty: Uncertainty in the calculation of the A matrix is minimal and unlikely to be consequential for topline results.
5. Changing contexts over time: Potential changes to contexts of targets is a second-order concern, at least for texts from the past 100 years or so.

Before concluding, we note that as with almost all descriptive techniques, the ultimate substantive interpretation of the findings is left with the researcher to validate. It is hard to give general advice on how this might be done, so we refer readers to two approaches. First, one can try to triangulate using various types of validity: semantic, convergent construct, predictive, and so on (see Quinn et al. 2010 for discussion). Second, crowdsourced validation methods may be appropriate (see Rodriguez and Spirling 2022; Ying, Montgomery, and Stewart 2021).

Finally, we alert readers to the fact that all of our analyses can be implemented using the conText software package in R (see Supplementary Materials I and https://github.com/prodriguezsosa/conText).

CONCLUSION

"Contextomy"—the art of quoting out of context to ensure that a speaker is misrepresented—has a long and troubling history in politics (McGlone 2005).
FIGURE 9. Conservative Backbenchers Were Unsatisfied with Their Own Government's EU Policy Prior to the Referendum
[Plot: valence over time (June 2001–June 2019) by policy area and party, with the election of the Conservative government and the Referendum bill marked, and the correlation between Tory backbenchers and cabinet shown.]
Note: Each column of the plot is a policy area (with the seed word used to calculate sentiment). Those areas are education (education), health (nhs), and the EU (eu). Note the middle-right plot: rank-and-file Conservative MP sentiment on EU policy is negatively correlated with the leadership's sentiment.
CONCLUSION

"Contextomy"—the art of quoting out of context to ensure that a speaker is misrepresented—has a long and troubling history in politics (McGlone 2005). It works because judicious removal of surrounding text can so quickly alter how audiences perceive a central message. Understanding how context affects meaning is thus of profound interest in our polarized times. But it is difficult to measure and model. This is especially true in politics, where our corpora may be small and our term counts low. Yet we simultaneously want statistical machinery that allows us to speak of statistically significant effects of covariates. This paper begins to address these problems.

Specifically, we proposed a flexible approach to study differences in semantics between groups and over time using high-quality pretrained embeddings: the conText embedding regression model. It has advantages over previous efforts, and it can reveal new things about politics. We explained how controversial terms divide parties not simply in the way they are attached to topics of debate but in their very meaning. Similarly, we showed that understandings of terms like "empire" are not fixed, even in the relatively short term, and instead develop in line with interests in international relations. We showed that our approach can be used to measure sentiment toward policy. It is not hard to imagine other applications. For example, there is evidence that voters prefer broad-based appeals (Hersh and Schaffner 2013), but these are only possible in cases where meanings are sufficiently similar within groups. Our technique could be used to explore this tension. Similarly, what is deemed the "correct" interpretation of treaties (e.g., Simmons 2010) or constitutions matters. Our methods could help structure studies of these changes.

We built our framework on the ALC embedding strategy. But our general approach is not inextricably connected to this particular method for estimating contextually specific meanings. We used it because it is transparent, efficient, and computationally simple. We introduced a regression framework for understanding word meanings using individual instance embeddings as observations. This may be easily extended to more complex functional forms.
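As a schematic illustration of that idea, and not the conText implementation itself, the short R sketch below treats each occurrence's embedding as one row of a multivariate outcome and regresses it on a single binary covariate; the data and variable names here are simulated placeholders.

# Y: one row per occurrence of the focal word (its context embedding);
# the covariate is taken from the document each occurrence comes from.
set.seed(42)
n <- 200   # number of occurrences
D <- 50    # embedding dimension
Y <- matrix(rnorm(n * D), nrow = n, ncol = D)   # stand-in for instance embeddings
republican <- rbinom(n, 1, 0.5)                 # stand-in group indicator

# One least-squares fit per embedding dimension (multivariate OLS)
fit <- lm(Y ~ republican)
beta <- coef(fit)["republican", ]   # D-dimensional coefficient vector

# The norm of the coefficient vector summarizes how much group membership
# shifts the word's usage; significance can be assessed by permuting the
# covariate and recomputing this norm.
sqrt(sum(beta^2))

Richer specifications follow by changing the right-hand side of the formula.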
[Figure: valence series for Backbenchers and Cabinet by party (Conservative, Labour) and policy area (education, nhs, eu), scaled within party and target word, June 2001 to June 2019, with reference lines marking elections, the referendum bill, and Valence = 0.]
Note: The sentiment patterns are less obvious.
There are many potential directions for the framework; we highlight two. First, ALC assumes that the meaning of nonfocal words is essentially constant. This first-order approximation could be extended with second-order information—which words co-occur with words that co-occur with the focal words—but it is unclear how much meaning would have to change across groups for this to matter. Second, we are estimating means in high dimensions using only a few data points. This is always difficult (see Gentzkow, Shapiro, and Taddy 2019), and our estimates of the norms have a finite-sample bias for rare words. Thus care is needed when comparing words or groups with substantially different amounts of available data. Future work could consider the role of term frequency in these measures of meaning. As social scientists develop further methods to study these problems, this will sharpen questions which will in

American Political Science Review Dataverse: https://doi.org/10.7910/DVN/NKETXF.

ACKNOWLEDGMENTS

We thank audience members at the Midwest Political Science Association Annual Meeting (2021), the Polit-
Garg, Nikhil, Londa Schiebinger, Dan Jurafsky, and James Zou. 2018. "Word Embeddings Quantify 100 Years of Gender and Ethnic Stereotypes." Proceedings of the National Academy of Sciences 115 (16): E3635–44.
Geertz, Clifford. 1973. "Thick Description: Toward an Interpretive Theory of Culture." Turning Points in Qualitative Research: Tying Knots in a Handkerchief 3: 143–68.
Gentzkow, Matthew, Jesse M. Shapiro, and Matt Taddy. 2019. "Measuring Group Differences in High-Dimensional Choices: Method and Application to Congressional Speech." Econometrica 87 (4): 1307–40.
Gentzkow, Matthew, J. M. Shapiro, and Matt Taddy. 2018. Congressional Record for the 43rd–114th Congresses: Parsed Speeches and Phrase Counts [computer file]. Palo Alto, CA: Stanford Libraries [distributor], 2018-01-16. https://data.stanford.edu/congress_text.
Grimmer, Justin. 2010. "A Bayesian Hierarchical Topic Model for Political Texts: Measuring Expressed Agendas in Senate Press Releases." Political Analysis 18 (1): 1–35.
Grimmer, Justin, and Brandon M. Stewart. 2013. "Text as Data: The Promise and Pitfalls of Automatic Content Analysis Methods for Political Texts." Political Analysis 21 (3): 267–97.
Kutuzov, Andrey, Lilja Øvrelid, Terrence Szymanski, and Erik Velldal. 2018. "Diachronic Word Embeddings and Semantic Shifts: A Survey." In Proceedings of the 27th International Conference on Computational Linguistics, eds. Emily M. Bender, Leon Derczynski, and Pierre Isabelle, 1384–97. Cedarville, OH: Association for Computational Linguistics.
Lauretig, Adam. 2019. "Identification, Interpretability, and Bayesian Word Embeddings." In Proceedings of the Third Workshop on Natural Language Processing and Computational Social Science, eds. Svitlana Volkov, David Jurgens, Dirk Hovy, David Bamman, and Oren Tsur, 7–17. Cedarville, OH: Association for Computational Linguistics.
Liu, Yinhan, Myle Ott, Naman Goyal, Jingfei Du, Mandar Joshi, Danqi Chen, Omer Levy, Mike Lewis, Luke Zettlemoyer, and Veselin Stoyanov. 2019. "RoBERTa: A Robustly Optimized BERT Pretraining Approach." Preprint, submitted on July 26, 2019. https://arxiv.org/abs/1907.11692.
McGlone, Matthew S. 2005. "Contextomy: The Art of Quoting out of Context." Media, Culture & Society 27 (4): 511–22.
Mikolov, Tomás, Ilya Sutskever, Kai Chen, Greg S. Corrado, and Jeff Dean. 2013. "Distributed Representations of Words and Phrases and Their Compositionality." In Advances in Neural Information Processing Systems 26, eds. C. J. C. Burges, L. Bottou, M. Welling, Z. Ghahramani, and K. Q. Weinberger, 3111–19. Red Hook, NY: Curran Associates, Inc.
Miller, George A., and Walter G. Charles. 1991. "Contextual Correlates of Semantic Similarity." Language and Cognitive Processes 6 (1): 1–28.
Monroe, Burt L., Michael P. Colaresi, and Kevin M. Quinn. 2008. "Fightin' Words: Lexical Feature Selection and Evaluation for Identifying the Content of Political Conflict." Political Analysis 16 (4): 372–403.
Osnabrügge, Moritz, Sara B. Hobolt, and Toni Rodon. 2021. "Playing to the Gallery: Emotive Rhetoric in Parliaments." American Political Science Review 115 (3): 885–99.
Park, Baekkwan, Kevin Greene, and Michael Colaresi. 2020. "Human Rights Are (Increasingly) Plural: Learning the Changing Taxonomy of Human Rights from Large-Scale Text Reveals Information Effects." American Political Science Review 114 (3): 888–910.
Pennington, Jeffrey, Richard Socher, and Christopher D. Manning. 2014. "GloVe: Global Vectors for Word Representation." In Proceedings of the 2014 Conference on Empirical Methods in Natural Language Processing, eds. Lucia Specia and Xavier Carreras, 1532–43. Cedarville, OH: Association for Computational Linguistics.
Poole, Keith T. 2005. Spatial Models of Parliamentary Voting. Cambridge: Cambridge University Press.
Quinn, Kevin M., Burt L. Monroe, Michael Colaresi, Michael H. Crespin, and Dragomir R. Radev. 2010. "How to Analyze Political Attention with Minimal Assumptions and Costs." American Journal of Political Science 54 (1): 209–28.
Rheault, Ludovic, and Christopher Cochrane. 2020. "Word Embeddings for the Analysis of Ideological Placement in Parliamentary Corpora." Political Analysis 28 (1): 112–33.
Roberts, Margaret E., Brandon M. Stewart, and Edoardo M. Airoldi. 2016. "A Model of Text for Experimentation in the Social Sciences." Journal of the American Statistical Association 111 (515): 988–1003.
Rodman, Emma. 2020. "A Timely Intervention: Tracking the Changing Meanings of Political Concepts with Word Vectors." Political Analysis 28 (1): 87–111.
Rodriguez, Pedro L., Arthur Spirling, and Brandon M. Stewart. 2022. "Replication Data for: Embedding Regression: Models for Context-Specific Description and Inference." Harvard Dataverse. Dataset. https://doi.org/10.7910/DVN/NKETXF.
Rodriguez, Pedro L., and Arthur Spirling. 2022. "Word Embeddings: What Works, What Doesn't, and How to Tell the Difference for Applied Research." The Journal of Politics 84 (1): 101–15.
Rudolph, Maja, Francisco Ruiz, Susan Athey, and David Blei. 2017. "Structured Embedding Models for Grouped Data." In Advances in Neural Information Processing Systems, Vol. 30, eds. I. Guyon, U. Von Luxburg, S. Bengio, H. Wallach, R. Fergus, S. Vishwanathan, and R. Garnett. Red Hook, NY: Curran Associates, Inc.
Simmons, Beth. 2010. "Treaty Compliance and Violation." Annual Review of Political Science 13: 273–96.
Skinner, Quentin. 1969. "Meaning and Understanding in the History of Ideas." History and Theory 8 (1): 3–53.
Slapin, Jonathan B., Justin H. Kirkland, Joseph A. Lazzaro, Patrick A. Leslie, and Tom O'Grady. 2018. "Ideology, Grandstanding, and Strategic Party Disloyalty in the British Parliament." American Political Science Review 112 (1): 15–30.
Tversky, Amos, and Daniel Kahneman. 1981. "The Framing of Decisions and the Psychology of Choice." Science 211 (4481): 453–58.
Verba, Sidney, and Gabriel Almond. 1963. The Civic Culture: Political Attitudes and Democracy in Five Nations. Princeton, NJ: Princeton University Press.
Warriner, Amy Beth, Victor Kuperman, and Marc Brysbaert. 2013. "Norms of Valence, Arousal, and Dominance for 13,915 English Lemmas." Behavior Research Methods 45 (4): 1191–207.
Wu, Patrick Y., Walter R. Mebane, Jr., Logan Woods, Joseph Klaver, and Preston Due. 2019. "Partisan Associations of Twitter Users Based on Their Self-Descriptions and Word Embeddings." Mimeo. New York: New York University.
Yin, Zi, Vin Sachidananda, and Balaji Prabhakar. 2018. "The Global Anchor Method for Quantifying Linguistic Shifts and Domain Adaptation." In Advances in Neural Information Processing Systems, Vol. 31, eds. S. Bengio, H. Wallach, H. Larochelle, K. Grauman, N. Cesa-Bianchi, and R. Garnett, 9434–45. Red Hook, NY: Curran Associates, Inc.
Ying, Luwei, Jacob M. Montgomery, and Brandon M. Stewart. 2021. "Topics, Concepts, and Measurement: A Crowdsourced Procedure for Validating Topics as Measures." Political Analysis 30 (4): 570–89.