
American Political Science Review (2023) 1–20

doi:10.1017/S0003055422001228 © The Author(s), 2023. Published by Cambridge University Press on behalf of the American Political Science Association. This is an Open Access article, distributed under the terms of the Creative Commons Attribution licence (http://creativecommons.org/licenses/by/4.0), which permits unrestricted re-use, distribution and reproduction, provided the original article is properly cited.

Embedding Regression: Models for Context-Specific Description and Inference
PEDRO L. RODRIGUEZ New York University, United States
ARTHUR SPIRLING New York University, United States
BRANDON M. STEWART Princeton University, United States

Social scientists commonly seek to make statements about how word use varies over circumstances—
including time, partisan identity, or some other document-level covariate. For example, researchers
might wish to know how Republicans and Democrats diverge in their understanding of the term
“immigration.” Building on the success of pretrained language models, we introduce the à la carte on text
(conText) embedding regression model for this purpose. This fast and simple method produces valid
vector representations of how words are used—and thus what words “mean”—in different contexts. We
show that it outperforms slower, more complicated alternatives and works well even with very few
documents. The model also allows for hypothesis testing and statements about statistical significance. We
demonstrate that it can be used for a broad range of important tasks, including understanding US
polarization, historical legislative development, and sentiment detection. We provide open-source software
for fitting the model.

INTRODUCTION

All human communication requires common understandings of meaning. This is nowhere more true than political and social life, where the success of an appeal—rhetorical or otherwise—relies on an audience perceiving a message in the particular way that the speaker seeks to deliver it. Scholars have therefore spent much effort exploring the meanings of terms, how those meanings are manipulated, and how they change over time and space. Historically, this work has been qualitative (e.g., Austin 1962; Geertz 1973; Skinner 1969). But in recent times, quantitative analysts have turned to modeling and measuring "context" directly from natural language (e.g., Aslett et al. 2022; Hopkins 2018; Park, Greene, and Colaresi 2020).

A promising avenue for such investigations has been the use of "word embeddings"—a family of techniques that conceive of meaning as emerging from the distribution of words that surround a term in text (e.g., Mikolov et al. 2013). By representing each word as a vector of real numbers and examining the relationships between vectors for the vocabulary of a corpus, scholars have uncovered new facts about language and the people that produce it (e.g., Caliskan, Bryson, and Narayanan 2017). This is also true in the study of politics, society, and culture (Garg et al. 2018; Kozlowski, Taddy, and Evans 2019; Rheault and Cochrane 2020; Rodman 2020; Wu et al. 2019).

Although borrowing existing techniques has certainly produced insights, for social scientists two problems remain. First, traditional approaches generally require a lot of data to produce high-quality representations—that is, to produce embeddings that make sense and connote meaning of terms correctly. The issue is less that our typical corpora are small—though they are compared with those on the web-scale collections often used in computer science—and more that terms for which we would like to estimate contexts are subject-specific and thus typically quite rare. As an example, there are fewer than twenty parliamentary mentions of the "special relationship" between the United States and the United Kingdom in some years of the 1980s—despite this arguably being the high watermark of elite closeness between the two countries. The second problem is one of inference. Although representations themselves are helpful, social scientists want to make statements about the statistical properties and relationships between embeddings. That is, they want to speak meaningfully of whether language is used differently across subcorpora and whether those apparent differences are larger than we would expect by chance. Neither of these problems are well addressed by current techniques. Although there have been efforts to address inference in embeddings (see, e.g., Kulkarni et al. 2015; Lauretig 2019), they are typically data intensive and computationally intensive.

Pedro L. Rodriguez, Visiting Scholar, Center for Data Science, New York University, United States; and International Faculty, Instituto de Estudios Superiores de Administración, Venezuela, [email protected].
Arthur Spirling, Professor of Politics and Data Science, Department of Politics, New York University, United States, [email protected].
Brandon M. Stewart, Associate Professor, Sociology and Office of Population Research, Princeton University, United States, [email protected].
Received: June 26, 2021; revised: February 22, 2022; accepted: October 24, 2022.


We tackle these two problems together in what follows. We provide both a statistical framework for making statements about covariate effects on embeddings and one that performs particularly well in cases of rare words or small corpora. Specifically, we innovate on Khodak et al. (2018), which introduced à la carte embeddings (ALC). In a nutshell, the method takes embeddings that have been pretrained on large corpora (e.g., word2vec or GloVe embeddings readily available online), combined with a small sample of example uses for a focal word, and then induces a new context-specific embedding for the focal word. This requires only a simple linear transformation of the averaged embeddings for words within the context of the focal word.

We place ALC in a regression setting that allows for fast solutions to queries like "do authors with these covariate values use these terms in a different way than authors with different covariate values? If yes, how do they differ?" We provide three proofs of concept. First, we demonstrate the strength of our approach by comparing its performance to the "industry standard" as laid out by Rodman (2020) in a study of a New York Times API corpus, where slow changes over long periods are the norm. Second, we show that our approach can estimate an approximate embedding even with only a single context. In particular, we demonstrate that we can separate individual instances of Trump and trump. Third, we show that our method can also identify drastic switches in meaning over short periods—specifically in our case, for the term Trump before and after the 2016 election.

We study three substantive cases to show how the technique may be put to work. First, we explore partisan differences in Congressional speech—a topic of long-standing interest in political science (see, e.g., Monroe, Colaresi, and Quinn 2008). We show that immigration is, perhaps unsurprisingly, one of the most differently expressed terms for contemporary Democrats and Republicans. Our second substantive case is historical: we compare across polities (and corpora) to show how elites in the UK and US expressed empire in the postwar period, how that usage diverged, and when. Our third case shows how our approach can be used to measure sentiment. We build on earlier work (e.g., Osnabrügge, Hobolt, and Rodon 2021; Slapin et al. 2018) for the UK House of Commons, yielding novel insights about the relationship between the UK Prime Minister and his backbenchers on the European Union. We also provide advice to practitioners on how to use the technique based on extensive experiments reported in the Supplementary Materials (SM).

These innovations allow for social scientists to go beyond general meanings of words to capture situation-specific usage. This is possible without substantial computation and, in contrast to other approaches, requires only the text immediately around the word of interest.

We proceed as follows: in the next section, we provide some context for what social scientists mean by "context" and link this to the distribution of words around a focal term. We then introduce the ALC algorithm and provide three proofs of concept. Subsequently, we extend ALC to a regression framework and then present results from three substantive use cases. We give practical guidance on use and limitations before concluding.

CONTEXT IN CONTEXT

… they are casting their problems on society and who is society? There is no such thing!
—Margaret Thatcher, interview with Woman's Own (1987).

Paraphrased as "there is no such thing as society," Thatcher's quote has produced lively debate in the study and practice of UK politics. Critics—especially from the left—argued that this was primarily an endorsement of individual selfishness and greed. But more sympathetic accounts have argued that the quote must be seen in its full context to be understood. The implication is that reading the line in its original surroundings changes the meaning: rather than embracing egotism, it emphasizes the importance of citizens' obligations to each other above and beyond what the state requires.

Beyond this specific example, the measurement and modeling of context is obviously a general problem. In a basic sense, context is vital: we literally cannot understand what is meant by a speaker or author without it. This is partly due to polysemy—the word "society" might mean many different things. But the issue is broader than this and is at the core of human communication. Unsurprisingly then, the study of context has been a long-standing endeavor in social science. Its centrality has been emphasized in the history of ideas (Skinner 1969) through the lens of "speech acts" (Austin 1962), describing cultural practices via "thick description" (Geertz 1973), understanding "political culture" (Verba and Almond 1963), and the psychology of decision making (Tversky and Kahneman 1981).

Approaches to Studying Context

For the goal of describing context in observational data, social science has turned to text approaches—with topic models being popular (see Grimmer 2010; Quinn et al. 2010; Roberts, Stewart, and Airoldi 2016). Topic models provide a way to understand the allocation of attention across groupings of words.

Although such models have a built-in notion of polysemy (a single word can be allocated to different topics), they are rarely used as a mechanism for studying how individual words are used to convey different ideas (Grimmer and Stewart 2013). And though topic approaches do exist that allow for systematic variation in the use of a word across topics by different pieces of observed metadata (Roberts, Stewart, and Airoldi 2016), they are computationally intensive (especially relative to the approaches we present below). The common unit of analysis for topic models is the

document. This has implications for the way that these models capture the logic of the "distributional hypothesis"—the idea that, in the sense of Firth (1957, 11), "You shall know a word by the company it keeps"—in other words, that one can understand a particular version of the "meaning" of a term from the way it co-occurs with other terms. Specifically, in the case of topic models, the entire document is the context. From this we learn the relationships (the themes) between words and the documents in which they appear.

But in the questions we discuss here, the interest is in the contextual use of a specific word. To study this, social scientists have turned to word embeddings (e.g., Rheault and Cochrane 2020; Rodman 2020). For example, Caliskan, Bryson, and Narayanan (2017) and Garg et al. (2018) have explored relationships between words captured by embeddings to describe problematic gender and ethnic stereotypes in society at large. Word embeddings predict a focal word as a function of the other words that appear within a small window of that focal word in the corpus (or the reverse, predict the neighboring words from the focal word). In so doing, they capture the insight of the distributional hypothesis in a very literal way: the context of a term are the tokens that appear near it in text, on average. In practice, this is all operationalized via a matrix of co-occurrences of words that respect the relevant window size. In the limit, where we imagine the relevant window is the entire document, one can produce a topic model from the co-occurrence matrix directly. Thus as the context window in the embedding model approaches the length of the document, the embeddings will increasingly look like the word representations in a topic model.

Whether, and in what way, embedding models based on the distributional hypothesis capture "meaning" is more controversial. Here we take a narrow, "structuralist" (in the sense of Harris 1954) view. For this paper, meaning is in terms of description and is empirical. That is, it arises from word co-occurrences in the data, alone: we will neither construct nor assume a given theoretical model of language or cognition. And, in contrast to other scholars (e.g., Miller and Charles 1991), we will make no claims that the distributions per se have causal effects on human understandings of terms. Thus, when we speak of the meaning of a focal word being different across groups, we are talking in a thin sense about the distribution of other words within a fixed window size of that focal word being different. Though we will offer guidance, substantive interpretation of these differences for a given purpose is ultimately up to the researcher. That is, as always with such text measurement strategies, subject-expert validation is important.

For a variety of use cases, social scientists want to make systematic inferences about embeddings—which requires statements about uncertainty. Suppose we wish to compare the context of "society" as conveyed by British Prime Ministers with that of US Presidents. Do they differ in a statistically significant way? To judge this, we need some notion of a null hypothesis, some understanding of the variance of our estimates, and a test statistic. Although there have been efforts to compare embeddings across groups (Rudolph et al. 2017) and to give frameworks for such conditional relationships (Han et al. 2018), these are nontrivial to implement. Perhaps more problematically for most social science cases is that they rely on underlying embedding models that struggle to produce "good" representations—that make sense and correctly capture how that word is actually used—when we have few instances of a term of interest. This matters because we are typically far short of the word numbers that standard models require for optimal performance and terms (like "society") may be used in ways that are idiosyncratic to a particular document or author.

In the next section, we will explain how we build on earlier insights from ALC embeddings (Khodak et al. 2018) to solve these problems in a fast, simple, and sample-efficient "regression" framework. Before doing so, we note three substantive use cases that both motivate the methodological work we do and show its power as a tool for social scientists. The exercise in all cases is linguistic discovery insofar as our priors are not especially sharp and the primary value is in stimulating more productive engagement with the text. Nonetheless, in using the specific approach we outline in this paper, we will be able to make inferences with attendant statements about uncertainty. In that sense, our examples are intended to be illuminating for other scholars comparing corpora or comparing authors within a corpus.

Use Case I: Partisan Differences in Word Usage

A common problem in Americanist political science is to estimate partisan differences in the usage of a given term. Put literally, do Republicans and Democrats mean something different when they use otherwise identical words like immigration and marriage? Although there have been efforts to understand differential word rate of use within topics pertaining to these terms (e.g., Monroe, Colaresi, and Quinn 2008), there has been relatively little work on whether the same words appear in different contexts. Below, we use the Congressional Record (Sessions 111–114) as our corpus for this study (Gentzkow, Shapiro, and Taddy 2018). This requires that we compare embeddings as a function of party (and other covariates).

Use Case II: Changing UK–US Understandings of "Empire"

The United Kingdom's relative decline as a Great Power during the postwar period has been well documented (e.g., Hennessy 1992). One way that we might investigate the timing of US dominance (over the UK, at least) is to study the changing understanding of the term Empire in both places. That is, beyond any attitudinal shift, did American and British policy makers alter the way they used empire as the century wore on? If they did, when did this occur? And did the elites of these countries converge or diverge in terms of their associations of the term? To answer these

questions, we will statistically compare the embedding for the term Empire for the UK House of Commons (via Hansard) versus the US Congress (via the Congressional Record) from 1935–2010.

Use Case III: Brexit Sentiment from the Backbenches

The UK's decision to leave the European Union (EU) following the 2016 referendum was momentous (Ford and Goodwin 2017). Although the vote itself was up to citizens, the build-up to the plebiscite was a matter for elites; specifically, it was a consequence of the internal machinations of the parliamentary Conservative Party that forced the hand of their leader, Prime Minister David Cameron (Hobolt 2016). A natural question concerns the attitudes of that party in the House of Commons toward the EU, both over time and relative to other issue areas (such as education and health policy). To assess that, we will use an embedding approach to sentiment estimation for single instances of terms that builds on recent work on emotion in parliament (Osnabrügge, Hobolt, and Rodon 2021). This will also allow us to contribute to the literature on Member of Parliament (MP) position taking via speech (see, e.g., Slapin et al. 2018).

USING ALC EMBEDDINGS TO MEASURE MEANING

Our methodological goal is a regression framework for embeddings. By "regression" we mean two related ideas. Narrowly, we mean that we want to be able to approximate a conditional expectation function, typically written E[Y|X] where, as usual, Y is our outcome, X is a particular covariate, and E is the expectations operator. We want to make statements about how embeddings (our Y) differ as covariates (our X) change. More broadly, we use "regression" to mean machinery for testing hypotheses about whether the groups actually differ in a systematic way. And by extension, we want that machinery to provide tools for making downstream comments about how those embeddings differ. In all cases, this will require three related operations:

1. an efficient and transparent way to embed words such that we can produce high-quality representations even when a given word is rare;
2. given (1), a demonstration that in real problems, a single instance of a word's use is enough to produce a good embedding. This allows us to set up the hypothesis-testing problem as a multivariate regression and is the subject of the next section;
3. given (1) and (2), a method for making claims about the statistical significance of differences in embeddings, based on covariate profiles. We tackle that below.

Ideally, our framework will deliver good representations of meaning even in cases where we have very few incidences of the words in question. À la carte embeddings (Khodak et al. 2018) promise exactly this. We now give some background and intuition on that technique. We then replicate Rodman (2020)—a recent study introducing time-dependent word embeddings for political science—to demonstrate ALC's efficiency and quality.

Word Embeddings Measure Meaning through Word Co-Occurrence

Word embeddings techniques give every word a distributed representation—that is, a vector. The length or dimension (D) of this vector is—by convention—between 100 and 500. When the inner product between two different words (two different vectors) is high, we infer that they are likely to co-occur in similar contexts. The distributional hypothesis then allows us to infer that those two words are similar in meaning. Although such techniques are not new conceptually (e.g., Hinton 1986), methodological advances in the last decade (Mikolov et al. 2013; Pennington, Socher, and Manning 2014) allow them to be estimated much more quickly. More substantively, word embeddings have been shown to be useful, both as inputs to supervised learning problems and for understanding language directly. For example, embedding representations can be used to solve analogy reasoning tasks, implying the vectors do indeed capture relational meaning between words (e.g., Arora et al. 2018; Mikolov et al. 2013).

Understanding exactly why word embeddings work is nontrivial. In any case, there is now a large literature proposing variants of the original techniques (e.g., Faruqui et al. 2015; Lauretig 2019). A few of these are geared specifically to social science applications where the general interest is in measuring changes in meanings, especially via "nearest neighbors" of specific words.

Although the learned embeddings provide a rough sense of what a word means, it is difficult to use them to answer questions of the sort we posed above. Consider our interest in how Republicans and Democrats use the same word (e.g., immigration) differently. If we train a set of word embeddings on the entire Congressional Record we only have a single meaning of the word. We could instead train a separate set of embeddings—one for Republicans and one for Democrats—and then realign them. This is an extra computational step and may not be feasible in other use cases where the vocabularies do not have much overlap. We now discuss a way to proceed that is considerably easier.


A Random Walk Theoretical Framework and ALC Embeddings

The core of our approach is ALC embeddings. The theory behind that approach is given by Arora et al. (2016) and Arora et al. (2018). Those papers conceive of documents being a "random walk" in a discourse space, where words are more likely to follow other words if they are closer to them in an embedding space. Crucially for ALC, Arora et al. (2018) also proves that under this model, a particular relationship will follow for the embedding of a word and the embeddings of the words that appear in the contexts around it.

To fix ideas, consider the following toy example. Our corpus is the memoirs of a politician, and we observe two entries, both mentioning the word "bill":

1. The debate lasted hours, but finally we [voted on the bill and it passed] with a large majority.
2. At the restaurant we ran up [a huge wine bill to be paid] by our host.

As one can gather from the context—here, the three words either side of the instance of "bill" in square brackets—the politician is using the term in two different (but grammatically correct) ways.

The main result from Arora et al. (2018) shows the following: if the random walk model holds, the researcher can obtain an embedding for word w (e.g., "bill") by taking the average of the embeddings of the words around w (u_w) and multiplying them by a particular square matrix A. That A serves to downweight the contributions of very common (but uninformative) words when averaging. Put otherwise, if we can take averages of some vectors of words that surround w (based on some preexisting set of embeddings v_w) and if we can find a way to obtain A (which we will see is also straightforward), we can provide new embeddings for even very rare words. And we can do this almost instantaneously.

Returning to our toy example, consider the first, legislative, use of "bill" and the words around it. Suppose we have embedding vectors for those words from some other larger corpus, like Wikipedia. To keep things compact, we will suppose those embeddings are all of three dimensions (such that D = 3), and take the following values:

$$\underset{\text{voted}}{\begin{bmatrix}-1.22\\ 1.33\\ 0.53\end{bmatrix}}\;\underset{\text{on}}{\begin{bmatrix}1.83\\ 0.56\\ -0.81\end{bmatrix}}\;\underset{\text{the}}{\begin{bmatrix}-0.06\\ -0.73\\ 0.82\end{bmatrix}}\;\text{bill}\;\underset{\text{and}}{\begin{bmatrix}1.81\\ 1.86\\ 1.57\end{bmatrix}}\;\underset{\text{it}}{\begin{bmatrix}-1.50\\ -1.65\\ 0.48\end{bmatrix}}\;\underset{\text{passed}}{\begin{bmatrix}-0.12\\ 1.63\\ -0.17\end{bmatrix}}.$$

Obtaining u_w for "bill" simply requires averaging these vectors and thus

$$u_{\text{bill}_1} = \begin{bmatrix}0.12\\ 0.50\\ 0.40\end{bmatrix},$$

with the subscript denoting the first use case. We can do the same for the second case—the restaurant sense of "bill"—from the vectors of a, huge, wine, to, be, and paid. We obtain

$$u_{\text{bill}_2} = \begin{bmatrix}0.35\\ -0.38\\ -0.24\end{bmatrix},$$

which differs from the average for the first meaning. A reasonable instinct is that these two vectors should be enough to give us an embedding for "bill" in the two senses. Unfortunately, they will not—this is shown empirically in Khodak et al. (2018) and in our Trump/trump example below. As implied above, the intuition is that simply averaging embeddings overexaggerates common components associated with frequent (e.g., "stop") words. So we will need the A matrix too: it down-weights these directions so they don't overwhelm the induced embedding.

Khodak et al. (2018) show how to put this logic into practice. The idea is that a large corpus (generally the corpus the embeddings were originally trained on, such as Wikipedia) can be used to estimate the transformation matrix A. This is a one-time cost after which each new word embedding can be computed à la carte (thus the name), rather than needing to retrain an entire corpus just to get the embedding for a single word. As a practical matter, the estimator for A can be learned efficiently with a lightly modified linear regression model that reweights the words by a nondecreasing function α(·) of the total instances of each word (n_w) in the corpus. This reweighting addresses the fact that words that appear more frequently have embeddings that are measured with greater certainty. Thus we learn the transformation matrix as

$$\hat{A} = \operatorname*{argmin}_{A} \sum_{w=1}^{W} \alpha(n_w)\,\lVert v_w - A u_w \rVert_2^2. \qquad (1)$$

The natural log is a simple choice for α(·), and works well. Given Â, we can introduce new embeddings for any word by averaging the existing embeddings for all words in its context to create u_w and then applying the transformation such that v̂_w = Â u_w. The transformation matrix is not particularly hard to learn (it is a linear regression problem), and each subsequent induced word embedding is a single matrix multiply.

Returning to our toy example, suppose that we estimate Â from a large corpus like Hansard or the Congressional Record or wherever we obtained the embeddings for the words that surround "bill." Suppose that we estimate

$$\hat{A} = \begin{bmatrix}0.81 & 3.96 & 2.86\\ 2.02 & 4.81 & 1.93\\ 3.14 & 3.81 & 1.13\end{bmatrix}.$$

Taking inner products, we have

$$\hat{v}_{\text{bill}_1} = \hat{A}\,u_{\text{bill}_1} = \begin{bmatrix}3.22\\ 3.42\\ 2.73\end{bmatrix} \quad\text{and}\quad \hat{v}_{\text{bill}_2} = \hat{A}\,u_{\text{bill}_2} = \begin{bmatrix}-1.91\\ -1.58\\ -0.62\end{bmatrix}.$$

These two transformed embedding vectors are more different than they were—a result of down-weighting the commonly appearing words around them—but that is not the point per se. Rather, we expect them to be informative about the word sense by, for example, comparing them to other (preestimated) embeddings in terms of distance.
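To make the arithmetic concrete, here is a minimal Python sketch (NumPy only) of the computation just described: average the context-word vectors to get u_bill for each use and premultiply by Â. All numbers are the illustrative values from the toy example above, not estimates from any real corpus.

```python
import numpy as np

# Illustrative 3-dimensional embeddings for the context words of the first
# ("legislative") use of "bill": voted on the [bill] and it passed.
context_1 = np.array([
    [-1.22,  1.33,  0.53],   # voted
    [ 1.83,  0.56, -0.81],   # on
    [-0.06, -0.73,  0.82],   # the
    [ 1.81,  1.86,  1.57],   # and
    [-1.50, -1.65,  0.48],   # it
    [-0.12,  1.63, -0.17],   # passed
])
print(np.round(context_1.mean(axis=0), 2))   # [0.12 0.5  0.4 ], i.e. u_bill1

# Rounded averages as reported in the text (u_bill2 comes from a, huge,
# wine, to, be, paid, whose individual vectors are not listed above).
u_bill1 = np.array([0.12,  0.50,  0.40])
u_bill2 = np.array([0.35, -0.38, -0.24])

A_hat = np.array([[0.81, 3.96, 2.86],        # illustrative transformation matrix
                  [2.02, 4.81, 1.93],
                  [3.14, 3.81, 1.13]])

print(np.round(A_hat @ u_bill1, 2))   # [ 3.22  3.42  2.73]
print(np.round(A_hat @ u_bill2, 2))   # [-1.91 -1.58 -0.62]
```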


Thus we might find that the nearest neighbors of v̂_bill1 are

$$\text{legislation} = \begin{bmatrix}3.11\\ 2.52\\ 3.38\end{bmatrix} \quad\text{and}\quad \text{amendment} = \begin{bmatrix}2.15\\ 2.47\\ 3.42\end{bmatrix},$$

whereas the nearest neighbors of v̂_bill2 are

$$\text{dollars} = \begin{bmatrix}-1.92\\ -1.54\\ -0.60\end{bmatrix} \quad\text{and}\quad \text{cost} = \begin{bmatrix}-1.95\\ -1.61\\ -0.63\end{bmatrix}.$$

This makes sense, given how we would typically read the politician's lines above. The key here is that the ALC method allowed us to infer the meaning of words that occurred rarely in a small corpus (the memoirs) without having to build embeddings for those rare words in that small corpus: we could "borrow" and transform the embeddings from another source. Well beyond this toy example, Khodak et al. (2018) finds empirically that the learned Â in a large corpus recovers the original word vectors with high accuracy (greater than 0.90 cosine similarity). They also demonstrate that this strategy achieves state-of-the-art and near state-of-the-art performance on a wide variety of natural language processing tasks (e.g., learning the embedding of a word using only its definition, learning meaningful n-grams, classification tasks, etc.) at a fraction of the computational cost of the alternatives.

The ALC framework has three major advantages for our setting: transparency, computational ease, and efficiency. First, compared with many other embedding strategies for calculating conditional embeddings (e.g., words over time) the information used in ALC is transparent. The embeddings are derived directly from the additive information of the words in the context window around the focal word; there is no additional smoothing or complex interactions across different words. Furthermore, the embedding space itself does not change, it remains fixed to the space defined by the pretrained embeddings. Second, this same transparency leads to computational ease. The transformation matrix A only has to be estimated once, and then each subsequent induction of a new word is a single matrix multiply and thus effectively instantaneous. Later we will be able to exploit this speed to allow bootstrapping and permutation procedures that would be unthinkable if there was an expensive model fitting procedure for each word. Finally, ALC is efficient in the use of information. Once the transformation matrix is estimated, it is only necessary that u_w converges—in other words, we only need to estimate a D-dimensional mean from a set of samples. In the case of a six-word symmetric context window, there are 12 words total within the context window; thus, for each instance of the focal word we have a sample of size 12 from which to estimate the mean.
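The one-time estimation step in Equation 1 amounts to a weighted least-squares problem. The sketch below is a simplified stand-in for the procedure of Khodak et al. (2018), not the authors' implementation; it assumes you already have a matrix of pretrained vectors, the matching averaged context vectors, and word counts.

```python
import numpy as np

def estimate_transform(V: np.ndarray, U: np.ndarray, counts: np.ndarray) -> np.ndarray:
    """Estimate an ALC-style transformation matrix A (cf. Equation 1).

    V      : (W, D) pretrained embeddings, one row per word w.
    U      : (W, D) averaged context embeddings for the same words.
    counts : (W,)   total number of instances n_w of each word.

    Solves argmin_A sum_w log(n_w) * ||v_w - A u_w||^2, a weighted
    multivariate least-squares problem.
    """
    alpha = np.log(counts)                 # nondecreasing weight function
    Uw = U * alpha[:, None]                # rows of U scaled by alpha
    # Normal equations for A^T:  (U' diag(alpha) U) A^T = U' diag(alpha) V
    AT = np.linalg.solve(U.T @ Uw, Uw.T @ V)
    return AT.T

# Tiny synthetic demonstration (random data, for shape-checking only).
rng = np.random.default_rng(0)
W, D = 1000, 50
V = rng.normal(size=(W, D))
U = rng.normal(size=(W, D))
counts = rng.integers(2, 5000, size=W)
print(estimate_transform(V, U, counts).shape)   # (50, 50)
```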

Although Khodak et al. (2018) focused on using the ALC framework to induce embeddings for rare words and phrases, we will apply this technique to embed words used in different partitions of a single corpus or to compare across corpora. This allows us to capture differences in embeddings over time or by speaker, even when we have only a few instances within each sample. Importantly, unlike other methods, we don't need an entirely new corpus to learn embeddings for select focal words; we can select particular words and calculate (only) their embeddings using only the contexts around those particular words.1 We now demonstrate this power of ALC by replicating Rodman (2020).2

1 For context, there are many approaches in computer science including anchoring words (Yin, Sachidananda, and Prabhakar 2018) and vector space alignment (Hamilton, Leskovec, and Jurafsky 2016).
2 Many papers in computer science have studied semantic change (see Kutuzov et al. 2018 for a survey).

Proof of Concept for ALC in Small Political Science Corpora: Reanalyzing Rodman (2020)

The task in Rodman (2020) is to understand changes in the meaning of equality over the period 1855–2016 in a corpus consisting of the headlines and other summaries of news articles.3 As a gold standard, a subset of the articles is hand-coded into 15 topic word categories—of which five are ultimately used in the analysis—and the remaining articles are coded using a supervised topic model with the hand-coded data as input. Four embeddings techniques are used to approximate trends in coverage of those categories, via the (cosine) distance between the embedding for the word equality and the embeddings for the category labels. This is challenging because the corpus is small—the first 25-year slice of data has only 71 documents—and in almost 30% of the word-slice combinations there are fewer than 10 observations.4

Rodman (2020) tests four different methods by comparing results to the gold standard; ultimately, the chronologically trained model (Kim et al. 2014) is the best performer. In each era (of 25 years), the model is fit several times on a bootstrap resampled collection of documents and then averaged over the resulting solutions (Antoniak and Mimno 2018). Importantly, the model in period t is initialized with period t–1 embeddings, whereas the first period is initialized with vectors trained on the full corpus. Even for a relatively small corpus, this process is computationally expensive, and our replication took about five hours of compute time on an eight-core machine.

The ALC approach to the problem is simple. For each period we use ALC to induce a period-specific embedding for equality as well as each of the five category words: gender, treaty, german, race, and african_american. We use GloVe pretrained embeddings and the corresponding transformation matrix estimated by Khodak et al. (2018)—in other words, we make use of no corpus-specific information in the initial embeddings and require as inputs only the context window around each category word. Following Rodman, we compute the cosine similarity between equality and each of the five category words, for each period.

3 For replication code and data see Rodriguez, Spirling, and Stewart (2022).
4 We provide more information on the sample constraints in Supplementary Materials, Part A.


FIGURE 1. Replication of Figure 3 in Rodman (2020) Adding ALC Results

[Figure: five panels (Race, African American, International Relations, Germany, Gender) plotting z-score normalized model output for the ALC, Chrono, and GS series across the 25-year periods 1855–1879 through 2005–2016.]

Note: ALC = ALC model, CHR = chronological model, and GS = gold standard.
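A minimal sketch of the period-by-period comparison plotted in Figure 1, assuming the period-specific ALC embeddings have already been computed (the function and argument names here are hypothetical, not from the authors' software):

```python
import numpy as np

def cosine(a: np.ndarray, b: np.ndarray) -> float:
    return float(a @ b / (np.linalg.norm(a) * np.linalg.norm(b)))

def trend_for_category(equality_by_period: dict, category_by_period: dict) -> np.ndarray:
    """Cosine similarity between the period-specific embeddings of 'equality'
    and one category word, z-scored across periods (the normalization used
    in Figure 1). Both arguments map period label -> embedding vector."""
    periods = sorted(equality_by_period)
    sims = np.array([cosine(equality_by_period[p], category_by_period[p])
                     for p in periods])
    return (sims - sims.mean()) / sims.std()

# Example with made-up 300-dimensional vectors for two periods.
rng = np.random.default_rng(1)
eq = {"1855-1879": rng.normal(size=300), "1880-1904": rng.normal(size=300)}
cat = {"1855-1879": rng.normal(size=300), "1880-1904": rng.normal(size=300)}
print(trend_for_category(eq, cat))
```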

We then standardize (make into z-scores) those similarities. The entire process is transparent and takes only a few milliseconds (the embeddings themselves involve six matrix multiplies).

How does ALC do? Figure 1 is the equivalent of Figure 3 in Rodman (2020). It displays the normalized cosine similarities for the chronological model (CHR, taken from Rodman 2020) and ALC, along with the gold standard (GS). We observe that ALC tracks approximately as well as does Rodman's chronological model on its own terms. Where ALC clearly does better is on each model's nearest neighbors (Tables 1 and 2): it produces more semantically interpretable and conceptually precise nearest neighbors than the chronological model.

We emphasize that in the 1855 corpus, four of the five category words (all except african_american) are estimated using five or fewer instances. Whereas the chronological model is sharing information across periods, ALC is treating each slice separately, meaning that our analysis could be conducted effectively with even fewer periods.

Collectively, these results suggest that ALC is competitive with the current state of the art within the kind of small corpora that arise in social science settings. We now turn to providing a hypothesis-testing framework that will allow us to answer the types of questions we introduced above.

TESTING HYPOTHESES ABOUT EMBEDDINGS

Ultimately we want to speak of the way that embeddings differ systematically across levels of covariates. To do this, we will set up a regression-like framework, where each "observation" is the embedding of a single word. À la carte embeddings will assist us, but first we show that it can learn meaningful embeddings from one example use.



TABLE 1. Nearest Neighbors for the 1855 Corpus

Category word       CHR                                                       ALC
african_american    equality, the, and, of, whites                            equality, suffrage, fairness, emancipation, guaranteeing
gender              will, performing, give, blackwell, american               legislatures, suffrage, constitutions, missouri, equality
treaty              britain, extradition, interpolation, minister, rouher     equality, extradition, toleration, guaranteeing, speech
german              reich, berlin, arms, hitler, von                          visit, france, eugenia, bilateral, relations
race                enfranchisement, marriage, newmarket, louise, need        enfranchisement, equality, abrogation, discriminations, coquetry
equality            of, the, and, in, to                                      enacment, abolition, slavery, amendment, abrogation

TABLE 2. Nearest Neighbors for the 2005 Corpus

Category word       CHR                                                       ALC
african_american    crandall's, costs, unraveling, treats, congresswoman      fidel, equality, cubans, nonwhites, lesbians
gender              equality, the, for, of, and                               equality, inequalities, inequity, inequality, lesbians
treaty              narrow, designed, missed, assure, trade                   equality, affirms, reaffirms, affirming, upholds
german              maintains, hinge, holstein's, equality's, kiel            universities, colleges, campuses, striving, schools
race                universe, 1950s, warriors, posits, purdy's                equality, segregation, inequalities, discrimination, inequities
equality            the, for, of, and, to                                     gender, gays, lesbians, transgender, lgbt

FIGURE 2. Identification of Distinct Clusters

[Figure: individual realizations of Trump and trump plotted on the first two principal components (PC1, PC2), with misclassified instances marked by an x.]

Note: Each observation represents a single realization of a target word, either of trump or Trump. Misclassified instances refer to instances of either target word that were assigned the majority cluster of the opposite target word.

FIGURE 3. Cluster Homogeneity

[Figure: cluster homogeneity (scale roughly 0.0–0.6) for lsa, embeddings, bert, and alc, with uncertainty intervals.]

Note: Cluster homogeneity (in terms of Trump vs. trump) of k-means with two clusters of individual term instances embedded using different methods.
À la Carte on Text Can Distinguish Word Meanings from One Example Use

Above we explained that ALC averaged pretrained embeddings and then applied a linear transformation. This new embedding vector has, say, 300 dimensions, and we might reasonably be concerned that it is too noisy to be useful. To evaluate this, we need a ground truth. So we study a recent New York Times corpus; based on lead paragraphs, we show that we can reliably distinguish Trump the person (2017–2020) from other senses of trump as a verb or noun (1990–2020).5

For each sense of the word (based on capitalization), we take a random sample with replacement of 400 realizations—the number of instances of trump in our corpus—from our New York Times corpus and embed them using ALC. We apply k-means clustering with two clusters to the set of embedded instances and evaluate whether the clusters partition the two senses. If ALC works, we should obtain two separate clouds of points that are internally consistent (in terms of the word sense). This is approximately what we see. Figure 2 provides a visualization of the 300-dimensional space projected to two dimensions with Principal Components Analysis (PCA) and identifying the two clusters by their dominant word sense. We explicitly mark misclassifications with an x.

To provide a quantitative measure of performance we compute the average cluster homogeneity: the degree to which each cluster contains only members of a given class. This value ranges between 0—both clusters have equal numbers of both context types—and 1—each cluster consists entirely of a single context type. By way of comparison, we do the same exercise using other popular methods of computing word vectors for each target realization including latent semantic analysis (LSA), simple averaging of the corresponding pretrained embeddings (ALC without transformation by A), and RoBERTa contextual embeddings (Liu et al. 2019).6,7 To quantify uncertainty in our metric, we use block bootstrapping—resampling individual instances of the focal word.8 Figure 3 summarizes our results.

Latent semantic analysis does not fare well in this task; ALC, on the other hand, performs close to on par with transformer-based RoBERTa embeddings.9 Simple averaging of embeddings also performs surprisingly well, coming out on top in this comparison. Does this mean the linear transformation that distinguishes ALC from simple averaging is redundant? To evaluate this, we look at nearest neighbors using both methods. Table 3 displays these results. We observe that simple averaging of embeddings produces mainly stopwords as nearest neighbors. The ALC method, on the other hand, outputs nearest neighbors aligned with the meaning of each term: Trump is associated with president Trump, whereas trump is largely associated with its two related other meanings: a suit in trick-taking games and defeating someone. This serves to highlight the importance of the linear transformation A in the ALC method.

5 We used the New York Times' developer API to build our corpus. See https://developer.nytimes.com/docs/articlesearch-product/1/overview.
6 For LSA we use two dimensions and tf-idf weighting. We found these settings produced the best results.
7 RoBERTa is a substantially more complicated embedding method that produces contextually specific embeddings and uses word order information.
8 Note here that we are treating the A matrix as fixed, and thus we are not incorporating uncertainty in those estimates. In experiments (see Supplementary Materials, Part F) we found this uncertainty to be minor and a second-order concern for our applications.
9 This may be a result of RoBERTa's optimization for sentence embeddings as opposed to embeddings for an individual word. Nonetheless, it is surprising given that transformer-based models lead almost every natural language process benchmark task. Even at comparable performance though, there would be reason not to use RoBERTa models simply based on computational cost and comparative complexity.
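The clustering exercise summarized in Figures 2 and 3 can be sketched as follows. This assumes a matrix of per-instance embeddings (from ALC or any competing method) and the true Trump/trump labels; scikit-learn's k-means and homogeneity score stand in for whatever implementation the authors used.

```python
import numpy as np
from sklearn.cluster import KMeans
from sklearn.metrics import homogeneity_score

def cluster_homogeneity(X: np.ndarray, labels: np.ndarray, seed: int = 0) -> float:
    """k-means with two clusters on per-instance embeddings X, scored by how
    purely the clusters separate the true word senses in `labels`."""
    pred = KMeans(n_clusters=2, n_init=10, random_state=seed).fit_predict(X)
    return homogeneity_score(labels, pred)

def bootstrap_homogeneity(X: np.ndarray, labels: np.ndarray, n_boot: int = 100, seed: int = 0):
    """Uncertainty via resampling individual instances of the focal word."""
    rng = np.random.default_rng(seed)
    stats = []
    for _ in range(n_boot):
        idx = rng.integers(0, len(X), size=len(X))
        stats.append(cluster_homogeneity(X[idx], labels[idx]))
    return np.percentile(stats, [2.5, 97.5])
```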

TABLE 3. Top 10 Nearest Neighbors Using Simple Averaging of Embeddings and ALC

Trump (Embeddings)   Trump (ALC)        trump (Embeddings)   trump (ALC)
but                  president          but                  declarer
that                 impeaching         this                 trumps
the                  assailing          that                 colloquies
even                 president-elect    even                 four-point
this                 impeach            only                 upend
because              assailed           because              suffice
would                re-elect           so                   indomitability
not                  alluded            same                 topicality
which                clinton            it                   misstep
same                 appointee          the                  reprove
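The nearest-neighbor columns in Table 3 (and similar lists elsewhere in the paper) reduce to ranking a pretrained vocabulary by cosine similarity to an induced vector. A minimal sketch, where `vocab_vectors` is assumed to map words to their pretrained embeddings:

```python
import numpy as np

def nearest_neighbors(v: np.ndarray, vocab_vectors: dict, k: int = 10) -> list:
    """Return the k vocabulary words whose pretrained embeddings have the
    highest cosine similarity to the induced embedding v."""
    def cos(a, b):
        return a @ b / (np.linalg.norm(a) * np.linalg.norm(b))
    scored = [(word, cos(v, vec)) for word, vec in vocab_vectors.items()]
    return sorted(scored, key=lambda t: t[1], reverse=True)[:k]
```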

Although this example is a relatively straightforward case of polysemy, we also know that the meaning of Trump, the surname, underwent a significant transformation once Donald J. Trump was elected president of the United States in November 2016. This is a substantially harder case because the person being referred to is still the same, even though the contexts it is employed in—and thus in the sense of the distributional hypothesis, the meaning—has shifted. But as we show in Supplementary Materials B, ALC has no problem with this case either, returning excellent cluster homogeneity and nearest neighbors.

The good news for the Trump examples is that ALC can produce reasonable embeddings even from single instances. Next we demonstrate that each of these instances can be treated as an observation in a hypothesis-testing framework. Before doing so, although readers may be satisfied about the performance of ALC in small samples, they may wonder about its performance in large samples. That is, whether it converges to the inferences one would make from a "full" corpus model as the number of instances increases; the answer is "yes," and we provide more details in Supplementary Materials C.

À la Carte on Text Embedding Regression Model: conText

Recall the original statement of the relationship between the embedding of a focal word and the embeddings of the words within its context: v_w = A u_w = A E_c[u_wc], where the expectation is taken over the contexts, c. Here we note that because the matrix A is constant we can easily swap it into the expectation and then calculate the resulting expectation conditional on some covariate X: E[A u_w | X]. In particular, this can be done implicitly through a multivariate regression procedure. In the case of word meanings in discrete subgroups, this is exactly the same as the use of ALC applied above.

To illustrate our set up, suppose that each v_wi is the embedding of a particular instance of a given word in some particular context, like Trump. Each is of some dimension, D, and thus each observation in this setting is a 1 × D embedding vector. We can stack these to produce an outcome variable Y, which is of dimensions n (the number of instances of a given word) by D. The usual multivariate matrix equation is then

$$\underset{n \times D}{Y} = \underset{n \times (p+1)}{X}\;\underset{(p+1) \times D}{\beta} + \underset{n \times D}{E}, \qquad (2)$$

where X is a matrix of p covariates and includes a constant term, whereas β is a set of p coefficients and an intercept (all of dimension D). Then E is an error term.

To keep matters simple, suppose that there is a constant and then one binary covariate indicating group membership (in the group, or not). Then, the coefficient β0 (the first row of the matrix β) is equivalent to averaging over all instances of the target word belonging to those not in the group. Meanwhile, β0 + β1 (the second row of β) is equivalent to averaging over all instances of the target word that belong to the group (i.e., for which the covariate takes the value 1, as opposed to zero). In the more general case of continuous covariates, this provides a model-based estimate of the embedding among all instances at a given level of the covariate space.

The main outputs from this à la carte on text (conText) embedding "regression" model are

• the coefficients themselves, β0 and β1. These can be used to calculate the estimated embeddings for the word in question. We can take the cosine distance between these implied embeddings and the (pretrained) embeddings of other words to obtain the nearest neighbors for the two groups.
• the (Euclidean) norms of the coefficients. These will now be scalars (distances) rather than the vectors of the original coefficients. In the categorical covariate case, these tell us how different one group is to another in a relative sense. Although the magnitude of this difference is not directly interpretable, we can nonetheless comment on whether it is statistically significantly different from zero. To do this, we use a variant of covariate assignment shuffling suggested by Gentzkow, Shapiro, and Taddy (2019). In particular, we randomly shuffle the entries of the Y column and run the regression many (here 100) times. Each time, we record the norms of the coefficients.


We then compute the proportion of those values that are larger than the observed norms (i.e., with the true group assignments). This is the empirical p-value.

Note that, if desired, one can obtain the estimated sampling distribution (and thus standard errors) of the (normed) coefficients via nonparametric bootstrap. This allows for comments on the relative size of differences in embeddings across and within groups as defined by their covariates. We now show how the conText model may be used in a real estimation problem.
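A compact sketch of the estimator just described: each row of Y is the ALC embedding of one instance of the focal word, X contains an intercept plus covariates, and the Euclidean norm of each fitted coefficient row summarizes how much embeddings shift with that covariate. This is an illustrative NumPy reimplementation, not the authors' conText R package.

```python
import numpy as np

def context_regression(Y: np.ndarray, X: np.ndarray):
    """Multivariate OLS as in Equation 2: Y (n x D) on X (n x p+1, first
    column an intercept). Returns beta ((p+1) x D) and the Euclidean norm
    of each coefficient row."""
    beta, *_ = np.linalg.lstsq(X, Y, rcond=None)
    return beta, np.linalg.norm(beta, axis=1)

# With a single binary covariate (in-group vs. not), beta[0] is the
# out-group average embedding and beta[0] + beta[1] the in-group average.
```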
Our Framework in Action: Pre–Post Election Hypothesis Testing

We can compare the change in the usage of the word Trump to the change in the usage of the word Clinton after the 2016 election. Given Trump won the election and subsequently became President—a major break with respect to his real-estate/celebrity past—we expect a statistically significant change for Trump relative to any changes in the usage of Clinton.

We proceed as follows: for each target word-period combination—Clinton and Trump, preelection (2011–2014) and postelection (2017–2020)—we embed each individual instance of the focal word from our New York Times corpus of leading article paragraphs, and estimate the following regression:

$$Y = \beta_0 + \beta_1\,\mathrm{Trump} + \beta_2\,\mathrm{Post\_Election} + \beta_3\,\mathrm{Trump} \times \mathrm{Post\_Election} + E, \qquad (3)$$

where Trump is an indicator variable equal 1 for Trump instances, 0 otherwise. Likewise Post_Election is a dummy variable equal 1 for 2017–2020 instances of Trump or Clinton. As before, this is simply a regression-based estimator for the individual subgroups. We will use permutation for hypothesis testing. Figure 4 plots the norm of the β̂s along with their bootstrapped 95% CIs. To reiterate, norming means the coefficient vectors become scalars. The significant positive value on the Trump × Post_Election coefficient indicates the expected additional shift in the usage of Trump postelection over and above the shift in the usage of Clinton.

Although this news is encouraging, readers may wonder how the conText regression model performs relative to a "natural" alternative—specifically, a full embeddings model fit to each use of the term by covariate value(s). This would require the entire corpus (rather than just the instances of Trump and Clinton) and would be computationally slow, but perhaps it would yield more accurate inferences. As we demonstrate in Supplementary Materials D, inferences are similar and our approach is more stable by virtue of holding constant embeddings for all nonfocal words.

RESULTS

We now turn to substantive use cases, beginning with partisan differences in the United States.

FIGURE 4. Relative Semantic Shift from "Trump"

[Figure: norm of the estimated coefficients for Trump, Post_Election, and Trump x Post_Election (each marked ***), with bootstrap confidence intervals, on a scale of roughly 0.0 to 0.6.]

Note: Values are the norm of β̂ and bootstrap confidence intervals. See SM Section J for full regression output. *** = statistically significant at 0.01 level.
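The inference reported in Figure 4 combines the permutation test described earlier with a nonparametric bootstrap. A sketch, reusing the regression helper from the previous example (redefined here so the block is self-contained):

```python
import numpy as np

def context_regression(Y, X):                      # as in the earlier sketch
    beta, *_ = np.linalg.lstsq(X, Y, rcond=None)
    return beta, np.linalg.norm(beta, axis=1)

def permutation_pvalue(Y, X, coef_index, n_perm=100, seed=0):
    """Empirical p-value for the norm of one coefficient: shuffle the rows of
    Y relative to X, refit, and count how often the permuted norm is at least
    as large as the observed one (the covariate-shuffling test described above)."""
    rng = np.random.default_rng(seed)
    observed = context_regression(Y, X)[1][coef_index]
    exceed = 0
    for _ in range(n_perm):
        exceed += context_regression(Y[rng.permutation(len(Y))], X)[1][coef_index] >= observed
    return exceed / n_perm

def bootstrap_norm_ci(Y, X, coef_index, n_boot=100, seed=0):
    """Nonparametric bootstrap CI for the same norm (as plotted in Figure 4)."""
    rng = np.random.default_rng(seed)
    stats = []
    for _ in range(n_boot):
        idx = rng.integers(0, len(Y), size=len(Y))
        stats.append(context_regression(Y[idx], X[idx])[1][coef_index])
    return np.percentile(stats, [2.5, 97.5])
```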


FIGURE 5. Differences in Word Meaning by Gender and Party

[Figure: norm of β̂ for the "Republican" and "Male" coefficients for the target words also, but, and, abortion, marriage, and immigration (values roughly 0.05–0.15).]

Note: Generally, different genders in the same party have more similar understanding of a term than the same gender across parties. See SM Section J for full regression output. All coefficients are statistically significant at 0.01 level.

Partisanship, Ideology and Gender Differences

We want to evaluate partisan and gender differences in the usage of a given term in Congress Sessions 111–114 (Obama years). Our focus is a set of target words known to be politically charged: abortion, immigration, and marriage. We also include three nonpartisan stopwords—and, the, and but—in our target set as comparison.

We estimate the following multivariate multiple regression model for each of our words:

$$Y = \beta_0 + \beta_1\,\mathrm{Republican} + \beta_2\,\mathrm{Male} + E. \qquad (4)$$

The dependent variable is an ALC embedding of each individual realization in the corpus. For the right-hand side, we use indicator variables (Republican or otherwise; Male or otherwise). We use permutation to approximate the null and bootstrapping to quantify the sampling variance.

Note again that magnitudes have no natural absolute interpretation, but they can be compared relatively: that is, a larger coefficient on X_i relative to X_j implies the difference in embeddings for the groups defined by i is larger than the difference in the groups as defined by j. Our actual results are displayed in Figure 5. The "Male" coefficient is the average difference across the gender classes, controlling for party. The "Republican" coefficient is the average difference across the parties, controlling for gender.

As expected, the differences across parties and across genders are much larger for the more political terms—relative to function words. But, in addition, embeddings differ more by party than they do by gender. That is, on average, men and women within a party have more similar understandings of the terms than men and women across parties.

The "most partisan" target in our set is immigration. Table 4 shows the top 10 nearest neighbors for each party. One reading of these nearest neighbors is that Democrats were pushing for reform of existing laws, whereas Republicans were mainly arguing for enforcement. We can corroborate this via the top nearest contexts—that is, the individual contexts of immigration—embedded using ALC—that are closest to each party's ALC embedding of the term (see Table 5). This suggests some validity of our general approach.

Our approach is not limited to binary covariates. To illustrate, we regress the target word immigration on the first dimension of the NOMINATE score10—understood to capture the liberal–conservative spectrum on economic matters (Poole 2005). This approximates a whole sequence of separate embeddings for each speaker, approximated using a line in the NOMINATE space. We estimate the following regression:

$$Y = \beta_0 + \beta_1\,\mathrm{NOMINATE} + E. \qquad (5)$$

We next predict an ALC embedding for immigration at each percentile of the NOMINATE score and compute its cosine similarity with a small set of hand-picked features. Figure 6 plots these results. Consistent with our results above, we observe how the predicted ALC embedding for immigration is closer to enforce and illegals at higher values of the NOMINATE score. It is closer to reform and bipartisan at lower values. The feature amend, on the other hand, shows similar values across the full range.

10 Downloaded from https://voteview.com/data.
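The profile plotted in Figure 6 evaluates the predicted embedding β0 + x·β1 at a grid of covariate values and compares it with a handful of handpicked feature vectors. A minimal sketch, with the fitted coefficients and the (pretrained) feature embeddings as assumed inputs:

```python
import numpy as np

def similarity_profile(beta: np.ndarray, covariate_values, feature_vectors: dict):
    """For a fitted model with coefficient rows [intercept, NOMINATE], predict
    the embedding at each covariate value (beta[0] + x * beta[1]) and report
    its cosine similarity to a few pre-chosen feature embeddings.
    `feature_vectors` maps word -> pretrained vector."""
    def cos(a, b):
        return a @ b / (np.linalg.norm(a) * np.linalg.norm(b))
    profile = {}
    for word, vec in feature_vectors.items():
        profile[word] = [cos(beta[0] + x * beta[1], vec) for x in covariate_values]
    return profile
```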


TABLE 4. Top 10 Nearest Neighbors for the Target Term “Immigration”

Democrats: enact, overhauling, reform, legislation, enacting, overhaul, reforming, revamp, entitlement, bipartisan
Republicans: enforce, laws, enact, enacting, legislate, legislations, enforcing, regularize, immigration, legislation

TABLE 5. Subset of Top Nearest Contexts for the Target Term “Immigration”

Democrats:
“this congress to take on comprehensive immigration reform and fix our broken immigration”
“should get to work on comprehensive immigration reform the immigration system we have”
Republicans:
“administration wants to ignore our nation’s immigration laws and immigration process the problem”
“broken is the enforcement of our immigration laws and we have seen that”

The Meaning of “Empire”

Recall that our plan for the second case study was to compare the embedding of Empire in the UK and US context for the period 1935–2010. In the estimation we use the top (most frequent) 5,000 tokens of the combined corpora and we estimate a 300-dimensional GloVe model and corresponding A matrix specific to the corpus. The multivariate regression analogy is

Y = β0 + β1 CongressionalRecord + E,   (6)

estimated for every year of the period. Interest focuses on the (normed) value of β1: when this rises, the use of Empire is becoming less similar across the corpora (Congress is becoming more distinctive). The time series of the β1s is given in Figure 7. The basic summary is that, sometime around 1949–50, there was a once-and-for-all increase in the distance between US and UK understandings of Empire. We confirmed this with a structural break test (Bai and Perron 1998).

To understand the substance of the change, consider Figure 8. We report the “most American” and “most British” (with reference to the parliaments) terms from the period on either side of the split in the series. Specifically, we calculate the cosine similarity between the ALC embedding for Empire and each nearest neighbor in the UK and US corpus. The x-axis is the ratio of these similarities: when it is large (farther to the right), the word is relatively closer to the US understanding of Empire than to the UK one. An asterisk by the term implies that ratio’s deviation from 1 is statistically significantly larger than its permuted value, p < 0.01.

The main observation is that in the preperiod, British and American legislators talk about Empire primarily in connection with the old European powers—for example, Britain and France. In contrast, the vocabularies are radically different in the postbreak period. Whereas the UK parliament continues to talk of the “British” empire (and its travails in “India” and “Rhodesia”), the US focus has switched. For the Americans, understandings of empire are specifically with respect to Soviet imperial ambitions, and we see this in the most distinct nearest neighbors “invasion,” “Soviet,” and “communists,” with explicit references to eastern European nations like “Lithuania.”

Brexit Sentiment from the Backbenches

Our goal is to estimate the sentiment of the Conservative party toward the EU in the House of Commons. First, the underlying debate text and metadata are from Osnabrügge, Hobolt, and Rodon (2021), covering the period 2001–2019. We are interested in both major parties of government, Labour and the Conservatives. We divide those parties’ MPs by role: Cabinet (or Shadow Cabinet in opposition) members of the government party are “cabinet,” and all others are “backbenchers,” by definition. We compare policy sentiment in three areas: education (where our term of interest is “education”), health (“nhs”), and the EU (“eu”).

In what follows, each observation for us is a representation of the sentiment of a party-rank-month triple toward a given term: for instance, (the average) Conservative-backbencher-July 2015 sentiment toward “health.” We describe our approach in SM E; in essence, we measure the inner product between the embedding of the term of interest and the aggregate embeddings of the (positive and negative) words from a sentiment dictionary (Warriner, Kuperman, and Brysbaert 2013). We then rescale within party and policy area, obtaining Figure 9. There, each column is a policy area: education, health, and then the EU. The rows represent the Conservatives at the top and Labour at the bottom, with the correlation between Tory backbenchers and cabinet in the middle.

We see an obvious “government versus opposition” Westminster dynamic: when Labour is in power (so, from the start of the data to 2010), Labour leaders and backbenchers are generally enthusiastic about government policy. That is, their valence is mostly positive, which makes sense given almost total government agenda control (i.e., the policy being discussed is government policy). The Conservatives are the converse: both elites and backbenchers have negative valence for government policy when in opposition but are much more enthusiastic when in government. This is true for education, and health to a lesser extent. So far, so expected.
FIGURE 6. Cosine Similarity (LOESS Smoothed) between Various Words and “Immigration” at Each Percentile of NOMINATE Scores

[Figure: cosine similarity between the predicted ALC embedding of “immigration” and the features enforce, illegals, amend, reform, and bipartisan, plotted against percentiles of DW-NOMINATE (higher values, more conservative), with the median Democrat and median Republican marked.]

Note: We mark the median Democrat and median Republican to help calibrate the scale. See SM Section J for full regression output.
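A minimal sketch of the computation behind Figure 6, under the same assumptions as the earlier sketch (the instance embeddings Y, a numeric nominate score per instance, and the pretrained matrix V), is below. Taking the feature vectors directly from V is a simplification for illustration.

cos_sim <- function(x, y) sum(x * y) / (sqrt(sum(x^2)) * sqrt(sum(y^2)))

fit <- lm(Y ~ nominate)                             # Equation 5, one regression per dimension
pct <- quantile(nominate, probs = seq(0.05, 0.95, by = 0.05))
pred <- cbind(1, pct) %*% coef(fit)                 # predicted ALC embedding at each percentile

feats <- c("enforce", "illegals", "amend", "reform", "bipartisan")
sims <- sapply(feats, function(w) apply(pred, 1, cos_sim, y = V[w, ]))
# `sims` is a (percentile x feature) matrix of cosine similarities

Plotting each column of sims against the percentiles (with LOESS smoothing) gives curves analogous to those in the figure.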

But the subject of the EU (the “eu” column) is different (top right cell). We see that even after the Conservatives come to power (marked by the dotted black line in 2010), backbench opinion on government policy toward Europe is negative. In contrast, the Tory leadership are positive about their own policy on this subject. Only after the Conservatives introduce referendum legislation (the broken vertical line in 2015) upon winning the General Election do the backbenchers begin to trend positive toward government policy. The middle row makes this more explicit: the correlation between Tory leadership and backbench sentiment is generally positive or close to zero for education and health but negative for the EU—that is, moving in opposite directions. Our finding here is that Cameron never convinced the average Conservative backbencher that his EU policy was something about which they should feel positive.

A more traditional approach would be to count the number of occurrences of terms in the sentiment dictionary and assign each speech a net valence score. Figure 10 displays that result. Patterns are harder to read. More importantly, only 56% of the terms in the dictionary occur in the speeches and a full 69% of speeches had no overlap with the set of dictionary terms—and thus receive a score of 0. This contrasts with the 99% of terms in the dictionary appearing in the pretrained embeddings, allowing for all speeches to be scored. This is due to the continuity of the embedding space.
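The contrast between the two measures can be sketched as follows. This is not the replication code (the details of the valence measure are in SM E): it assumes character vectors pos_words and neg_words derived from the Warriner, Kuperman, and Brysbaert (2013) dictionary, a character vector speeches, the pretrained matrix V, the matrix A, and the alc_embed() helper from the first sketch, and it treats valence simply as a positive-minus-negative difference.

# (i) traditional counting: net valence is positive hits minus negative hits;
# speeches with no dictionary overlap are stuck at zero
toks <- strsplit(tolower(speeches), "\\s+")
count_valence <- sapply(toks, function(x) sum(x %in% pos_words) - sum(x %in% neg_words))
mean(count_valence == 0)                        # share of speeches the dictionary cannot score

# (ii) embedding-based: inner product of an ALC context embedding with the
# aggregate embeddings of the positive and negative dictionary words
pos_vec <- colMeans(V[intersect(pos_words, rownames(V)), ])
neg_vec <- colMeans(V[intersect(neg_words, rownames(V)), ])
embed_valence <- function(context_text) {
  y <- alc_embed(context_text, V, A)
  sum(y * pos_vec) - sum(y * neg_vec)
}
mean(c(pos_words, neg_words) %in% rownames(V))  # dictionary coverage in the embedding vocabulary

Because every context receives an embedding, the second measure assigns a score to every speech, which is the coverage advantage just described.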
FIGURE 7. Norm of the British and American Difference in Understanding of “Empire,” 1935–2010

[Figure: the norm of β̂1 from Equation 6 for each two-year period from 1935–1936 to 2009–2010, with the estimated structural break around 1949–1950 marked.]

Note: Larger values imply the uses are more different. The dashed lines show the bootstrapped 95% CIs. See SM Section J for full regression output.

ADVICE TO PRACTITIONERS: EXPERIMENTS, LIMITATIONS, CHALLENGES

Our approach requires no active tuning of parameters, but that does not mean that there are no choices to make. For example, the end user can opt for different context window sizes (literally, the number of words on either side of the target word), as well as different preprocessing regimes. To guide practice, we now summarize experiments we did on real texts. Below, we use “pretrained” to refer to embeddings that have been fit to some large (typically online) data collection like Wikipedia. We use “locally fit” to mean embeddings produced from—that is, vectors learned from—the texts one is studying (e.g., Congressional debates). We note that Rodriguez and Spirling (2022) provide extensive results on this comparison for current models; thus, here we are mostly extending those enquiries to our specific approach. Our full write-up can be seen in Supplementary Materials F–H. The following are the most important results.

First, we conducted a series of supervised tasks, where the goal is to separate the uses of trump versus Trump per our example above. We found that removing stopwords and using bigger context windows results in marginally better performance. That is, if the researcher’s goal is to differentiate two separate uses of a term (or something related, such as classifying documents), more data—that is, larger contexts, less noise—make sense. To be candid though, we do not think such a task—where the goal is a version of accuracy—is a particularly common one in political science.

We contend a more common task is seeking high-quality embeddings per se. That is, vector representations of terms that correctly capture the “true” embedding (low bias) and are simultaneously consistent across similar specifications (low variance, in terms of model choices). We give many more details in the SM, but the basic idea here is to fit locally trained embeddings—with context window sizes 2, 6, and 12—to the Congressional Record corpus (Sessions 107–114). We then treat those embeddings as targets to be recovered from various ALC-based models that follow, with closer approximations being deemed better. As an additional “ground truth” model, we use Stanford GloVe pretrained embeddings (window size 6, 300 dimensions). We narrow our comparisons to a set of “political” terms as given by Rodriguez and Spirling (2022). We have five lessons from our experiments:

1. Pretraining and windows: Given a large corpus, local training of a full embeddings model and corresponding A matrix makes sense. Our suggested approach can then be used to cheaply and flexibly study differences across groups. Barring that, using pretrained embeddings trained on large online corpora (e.g., Stanford GloVe) provides a very reasonable approximation that can be further improved by estimating an A matrix specific to the local corpus. But again, if data are scarce, using an A matrix trained on the original online corpus (e.g., Khodak et al.’s 2018 A in the case of GloVe) leads to very reasonable results. In terms of context window size, avoid small windows (of size < 5). Windows of size 6 and 12 perform very similarly to each other and acceptably well in an absolute sense.

2. Preprocessing: Removing stopwords from contexts used in estimating ALC embeddings makes very little difference to any type of performance. In general, apply the same preprocessing to the ALC contexts as was applied at the stage of estimating the embeddings and A matrix—for example, if stopwords were not removed, then do not remove stopwords. Stemming/lemmatization does not change results much in practice.

3. Similarity metrics: The conventional cosine similarity provides interpretable neighbors, but the inner product often delivers very similar results (see the short sketch at the end of this section).
FIGURE 8. UK and US Discussions of “Empire” Diverged after 1950

[Figure: nearest neighbors of “Empire,” grouped as British, shared, or American, arrayed by the cosine similarity ratio (American/British); panel (a) Pre-1950, panel (b) Post-1950.]

Note: Most US and UK nearest neighbors pre and post estimated breakpoint. * = statistically significant at 0.01 level.

4. Uncertainty: Uncertainty in the calculation of the A matrix is minimal and unlikely to be consequential for topline results.

5. Changing contexts over time: Potential changes to the contexts of targets are a second-order concern, at least for texts from the past 100 years or so.

Before concluding, we note that as with almost all descriptive techniques, the ultimate substantive interpretation of the findings is left with the researcher to validate. It is hard to give general advice on how this might be done, so we refer readers to two approaches. First, one can try to triangulate using various types of validity: semantic, convergent construct, predictive, and so on (see Quinn et al. 2010 for discussion). Second, crowdsourced validation methods may be appropriate (see Rodriguez and Spirling 2022; Ying, Montgomery, and Stewart 2021).

Finally, we alert readers to the fact that all of our analyses can be implemented using the conText software package in R (see Supplementary Materials I and https://github.com/prodriguezsosa/conText).
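One concrete way to carry out that kind of inspection, and to see lesson 3 in action, is to rank candidate words against an ALC embedding under both metrics. The base-R sketch below assumes only a target embedding y and the pretrained matrix V (with words as row names); the conText package provides its own functions for this.

nearest_neighbors <- function(y, V, n = 10, metric = c("cosine", "inner")) {
  metric <- match.arg(metric)
  scores <- as.numeric(V %*% y)                   # inner product of y with every row of V
  if (metric == "cosine") {
    scores <- scores / (sqrt(rowSums(V^2)) * sqrt(sum(y^2)))
  }
  head(sort(setNames(scores, rownames(V)), decreasing = TRUE), n)
}

nearest_neighbors(y, V, metric = "cosine")  # top 10 neighbors by cosine similarity
nearest_neighbors(y, V, metric = "inner")   # usually a very similar set

Comparing the two lists for a handful of target words is a quick check of whether the choice of similarity metric matters for a given application.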
FIGURE 9. Conservative Backbenchers Were Unsatisfied with Their Own Government’s EU Policy Prior to the Referendum

[Figure: monthly valence series, June 2001–June 2019, by policy area (education, nhs, eu); Conservative backbencher and cabinet valence in the top row, the Conservative backbencher–cabinet correlation in the middle row, and Labour valence in the bottom row, with the 2010 election and the 2015 referendum bill marked.]

Note: Each column of the plot is a policy area (with the seed word used to calculate sentiment). Those areas are education (education), health (nhs), and the EU (eu). Note the middle-right plot: rank-and-file Conservative MP sentiment on EU policy is negatively correlated with the leadership’s sentiment.

CONCLUSION

“Contextomy”—the art of quoting out of context to ensure that a speaker is misrepresented—has a long and troubling history in politics (McGlone 2005). It works because judicious removal of surrounding text can so quickly alter how audiences perceive a central message. Understanding how context affects meaning is thus of profound interest in our polarized times. But it is difficult—to measure and model. This is especially true in politics, where our corpora may be small and our term counts low. Yet we simultaneously want statistical machinery that allows us to speak of statistically significant effects of covariates. This paper begins to address these problems.

Specifically, we proposed a flexible approach to study differences in semantics between groups and over time using high-quality pretrained embeddings: the conText embedding regression model. It has advantages over previous efforts, and it can reveal new things about politics. We explained how controversial terms divide parties not simply in the way they are attached to topics of debate but in their very meaning. Similarly, we showed that understandings of terms like “empire” are not fixed, even in the relatively short term, and instead develop in line with interests in international relations. We showed that our approach can be used to measure sentiment toward policy. It is not hard to imagine other applications. For example, there is evidence that voters prefer broad-based appeals (Hersh and Schaffner 2013), but these are only possible in cases where meanings are sufficiently similar within groups. Our technique could be used to explore this tension. Similarly, what is deemed the “correct” interpretation of treaties (e.g., Simmons 2010) or constitutions matters. Our methods could help structure studies of these changes.

We built our framework on the ALC embedding strategy. But our general approach is not inextricably connected to this particular method for estimating contextually specific meanings. We used it because it is transparent, efficient, and computationally simple. We introduced a regression framework for understanding word meanings using individual instance embeddings as observations. This may be easily extended to more complex functional forms.

There are many potential directions for the framework; we highlight two. First, ALC assumes that the meaning of nonfocal words is essentially constant. This first-order approximation could be extended with second-order information—which words co-occur with words that co-occur with the focal words—but it is unclear how much meaning would have to change across groups for this to matter. Second, we are estimating means in high dimensions using only a few data points.
FIGURE 10. Replication of Figure 9 Using a Dictionary Approach

[Figure: the same monthly backbencher and cabinet valence series as Figure 9, by policy area (education, nhs, eu) and party, scaled within party and target word, computed with the dictionary-count measure; the election and referendum bill are marked.]

Note: The sentiment patterns are less obvious.

This is always difficult (see Gentzkow, Shapiro, and Taddy 2019), and our estimates of the norms have a finite-sample bias for rare words. Thus care is needed when comparing words or groups with substantially different amounts of available data. Future work could consider the role of term frequency in these measures of meaning.

As social scientists develop further methods to study these problems, this will sharpen questions which will in turn spur better methods. But to reiterate, technical machinery cannot, of itself, answer substantive questions. That is, claims about meaning must be validated, and the way that differences in measured quantities are interpreted will always be subject to debate. We hope that the conText model that we have laid out here can provide a useful foundation for future work.

SUPPLEMENTARY MATERIALS

To view supplementary material for this article, please visit http://doi.org/10.1017/S0003055422001228.

DATA AVAILABILITY STATEMENT

Research documentation and data that support the findings of this study are openly available at the American Political Science Review Dataverse: https://doi.org/10.7910/DVN/NKETXF.

ACKNOWLEDGMENTS

We thank audience members at the Midwest Political Science Association Annual Meeting (2021), the Political Methodology society meeting, the American Political Science Association Annual Meeting (2021), Vanderbilt University’s Data Science Institute, Princeton University, and the University of Wisconsin (Madison). We are grateful to Clark Bernier, Saloni Bhogale, Max Goplerud, Justin Grimmer, Alex Kindel, Hauke Licht, John Londregan, Walter Mebane, and Molly Roberts for comments. We also thank the editor and four excellent anonymous reviewers for their careful engagement with our work.

FUNDING STATEMENT

Work done at Princeton University is in part supported by The Eunice Kennedy Shriver National Institute of Child Health & Human Development of the National Institutes of Health under Award Number P2CHD047879.
CONFLICT OF INTEREST

The authors declare no ethical issues or conflicts of interest in this research.

ETHICAL STANDARDS

The authors affirm this research did not involve human subjects.

REFERENCES

Antoniak, Maria, and David Mimno. 2018. “Evaluating the Stability of Embedding-Based Word Similarities.” Transactions of the Association for Computational Linguistics 6:107–19.
Arora, Sanjeev, Yuanzhi Li, Yingyu Liang, Tengyu Ma, and Andrej Risteski. 2016. “A Latent Variable Model Approach to PMI-Based Word Embeddings.” Transactions of the Association for Computational Linguistics 4:385–99.
Arora, Sanjeev, Yuanzhi Li, Yingyu Liang, Tengyu Ma, and Andrej Risteski. 2018. “Linear Algebraic Structure of Word Senses, with Applications to Polysemy.” Transactions of the Association for Computational Linguistics 6:483–95.
Aslett, Kevin, Nora Webb Williams, Andreu Casas, Wesley Zuidema, and John Wilkerson. 2022. “What Was the Problem in Parkland? Using Social Media to Measure the Effectiveness of Issue Frames.” Policy Studies Journal 50 (1): 266–89.
Austin, John Langshaw. 1962. How to Do Things with Words. Oxford: Oxford University Press.
Bai, Jushan, and Pierre Perron. 1998. “Estimating and Testing Linear Models with Multiple Structural Changes.” Econometrica 66 (1): 47–78.
Caliskan, Aylin, Joanna J. Bryson, and Arvind Narayanan. 2017. “Semantics Derived Automatically from Language Corpora Contain Human-Like Biases.” Science 356 (6334): 183–86.
Faruqui, Manaal, Jesse Dodge, Sujay Kumar Jauhar, Chris Dyer, Eduard Hovy, and Noah A. Smith. 2015. “Retrofitting Word Vectors to Semantic Lexicons.” In Proceedings of the 2015 Conference of the North American Chapter of The Association for Computational Linguistics: Human Language Technologies, eds. Rada Mihalcea, Joyce Chai, and Anoop Sarkar, 1606–15. Cedarville, OH: Association for Computational Linguistics.
Firth, John Rupert. 1957. Studies in Linguistic Analysis. Hoboken, NJ: Wiley-Blackwell.
Ford, Robert, and Matthew Goodwin. 2017. “Britain after Brexit: A Nation Divided.” Journal of Democracy 28 (1): 17–30.
Garg, Nikhil, Londa Schiebinger, Dan Jurafsky, and James Zou. 2018. “Word Embeddings Quantify 100 Years of Gender and Ethnic Stereotypes.” Proceedings of the National Academy of Sciences 115 (16): E3635–44.
Geertz, Clifford. 1973. “Thick Description: Toward an Interpretive Theory of Culture.” Turning Points in Qualitative Research: Tying Knots in a Handkerchief 3:143–68.
Gentzkow, Matthew, Jesse M. Shapiro, and Matt Taddy. 2019. “Measuring Group Differences in High-Dimensional Choices: Method and Application to Congressional Speech.” Econometrica 87 (4): 1307–40.
Gentzkow, Matthew, J. M. Shapiro, and Matt Taddy. 2018. Congressional Record for the 43rd–114th Congresses: Parsed Speeches and Phrase Counts [computer file]. Palo Alto, CA: Stanford Libraries [distributor], 2018-01-16. https://data.stanford.edu/congress_text.
Grimmer, Justin. 2010. “A Bayesian Hierarchical Topic Model for Political Texts: Measuring Expressed Agendas in Senate Press Releases.” Political Analysis 18 (1): 1–35.
Grimmer, Justin, and Brandon M. Stewart. 2013. “Text as Data: The Promise and Pitfalls of Automatic Content Analysis Methods for Political Texts.” Political Analysis 21 (3): 267–97.
Hamilton, William L., Jure Leskovec, and Dan Jurafsky. 2016. “Diachronic Word Embeddings Reveal Statistical Laws of Semantic Change.” In Proceedings of the 54th Annual Meeting of The Association for Computational Linguistics (Volume 1: Long Papers), eds. Katrin Erk and Noah A. Smith, 1489–501. Cedarville, OH: Association for Computational Linguistics.
Han, Rujun, Michael Gill, Arthur Spirling, and Kyunghyun Cho. 2018. “Conditional Word Embedding and Hypothesis Testing via Bayes-by-Backprop.” In Proceedings of the 2018 Conference on Empirical Methods in Natural Language Processing, eds. Ellen Riloff, David Chiang, Julia Hockenmaier, and Jun’ichi Tsujii, 4890–95. Cedarville, OH: Association for Computational Linguistics.
Harris, Zellig S. 1954. “Distributional Structure.” Word 10 (2–3): 146–62.
Hennessy, Peter. 1992. Never Again: Britain 1945-1951. London: Penguin UK.
Hersh, Eitan D., and Brian F. Schaffner. 2013. “Targeted Campaign Appeals and the Value of Ambiguity.” The Journal of Politics 75 (2): 520–34.
Hinton, Geoffrey E. 1986. “Learning Distributed Representations of Concepts.” In Proceedings of The Eighth Annual Conference of the Cognitive Science Society, Vol. 1, 1–12. Amherst, MA: Cognitive Science Society.
Hobolt, Sara B. 2016. “The Brexit Vote: A Divided Nation, a Divided Continent.” Journal of European Public Policy 23 (9): 1259–77.
Hopkins, Daniel J. 2018. “The Exaggerated Life of Death Panels? The Limited but Real Influence of Elite Rhetoric in the 2009–2010 Health Care Debate.” Political Behavior 40 (3): 681–709.
Khodak, Mikhail, Nikunj Saunshi, Yingyu Liang, Tengyu Ma, Brandon M. Stewart, and Sanjeev Arora. 2018. “A La Carte Embedding: Cheap but Effective Induction of Semantic Feature Vectors.” In Proceedings of the 56th Annual Meeting of the Association for Computational Linguistics, eds. Iryna Gurevych and Yusuke Miya, 12–22. Cedarville, OH: Association for Computational Linguistics.
Kim, Yoon, Yi-I Chiu, Kentaro Hanaki, Darshan Hegde, and Slav Petrov. 2014. “Temporal Analysis of Language through Neural Language Models.” In Proceedings of the ACL 2014 Workshop on Language Technologies and Computational Social Science, eds. Cristian Danescu-Niculescu-Mizil, Jacob Eisenstein, Kathleen McKeown, and Noah A. Smith, 61–5. Cedarville, OH: Association for Computational Linguistics.
Kozlowski, Austin C., Matt Taddy, and James A. Evans. 2019. “The Geometry of Culture: Analyzing the Meanings of Class through Word Embeddings.” American Sociological Review 84 (5): 905–49.
Kulkarni, Vivek, Rami Al-Rfou, Bryan Perozzi, and Steven Skiena. 2015. “Statistically Significant Detection of Linguistic Change.” In Proceedings of the 24th International Conference on World Wide Web, gen. chairs. Aldo Gangemi, Stefano Leonardi, and Alessandro Panconesi, 625–35. Geneva: International World Wide Web Conferences Steering Committee and Republic and Canton of Geneva, Switzerland.
Kutuzov, Andrey, Lilja Øvrelid, Terrence Szymanski, and Erik Velldal. 2018. “Diachronic Word Embeddings and Semantic Shifts: A Survey.” In Proceedings of the 27th International Conference On Computational Linguistics, eds. Emily M. Bender, Leon Derczynski, and Pierre Isabelle, 1384–97. Cedarville, OH: Association for Computational Linguistics.
Lauretig, Adam. 2019. “Identification, Interpretability, and Bayesian Word Embeddings.” In Proceedings of the Third Workshop on Natural Language Processing and Computational Social Science, eds. Svitlana Volkov, David Jurgens, Dirk Hovy, David Bamman, and Oren Tsur, 7–17. Cedarville, OH: Association for Computational Linguistics.
Liu, Yinhan, Myle Ott, Naman Goyal, Jingfei Du, Mandar Joshi, Danqi Chen, Omer Levy, Mike Lewis, Luke Zettlemoyer, and Veselin Stoyanov. 2019. “Roberta: A Robustly Optimized BERT Pretraining Approach.” Preprint, submitted on July 26, 2019. https://arxiv.org/abs/1907.11692.
McGlone, Matthew S. 2005. “Contextomy: The Art of Quoting out of Context.” Media, Culture & Society 27 (4): 511–22.
Mikolov, Tomás, Ilya Sutskever, Kai Chen, Greg S. Corrado, and Jeff Dean. 2013. “Distributed Representations of Words and Phrases and Their Compositionality.” In Advances in Neural Information Processing Systems 26, eds. C. J. C. Burges, L. Bottou, M. Welling, Z. Ghahramani, and K. Q. Weinberger, 3111–19. Red Hook, NY: Curran Associates, Inc.
Miller, George A., and Walter G. Charles. 1991. “Contextual Correlates of Semantic Similarity.” Language and Cognitive Processes 6 (1): 1–28.
Monroe, Burt L., Michael P. Colaresi, and Kevin M. Quinn. 2008. “Fightin’ Words: Lexical Feature Selection and Evaluation for Identifying the Content of Political Conflict.” Political Analysis 16 (4): 372–403.
Osnabrügge, Moritz, Sara B. Hobolt, and Toni Rodon. 2021. “Playing to the Gallery: Emotive Rhetoric in Parliaments.” American Political Science Review 115 (3): 885–99.
Park, Baekkwan, Kevin Greene, and Michael Colaresi. 2020. “Human Rights Are (Increasingly) Plural: Learning the Changing Taxonomy of Human Rights from Large-Scale Text Reveals Information Effects.” American Political Science Review 114 (3): 888–910.
Pennington, Jeffrey, Richard Socher, and Christopher D. Manning. 2014. “Glove: Global Vectors for Word Representation.” In Proceedings of the 2014 Conference on Empirical Methods in Natural Language Processing, eds. Lucia Specia and Xavier Carreras, 1532–43. Cedarville, OH: Association for Computational Linguistics.
Poole, Keith T. 2005. Spatial Models of Parliamentary Voting. Cambridge: Cambridge University Press.
Quinn, Kevin M., Burt L. Monroe, Michael Colaresi, Michael H. Crespin, and Dragomir R. Radev. 2010. “How to Analyze Political Attention with Minimal Assumptions and Costs.” American Journal of Political Science 54 (1): 209–28.
Rheault, Ludovic, and Christopher Cochrane. 2020. “Word Embeddings for the Analysis of Ideological Placement in Parliamentary Corpora.” Political Analysis 28 (1): 112–33.
Roberts, Margaret E., Brandon M. Stewart, and Edoardo M. Airoldi. 2016. “A Model of Text for Experimentation in the Social Sciences.” Journal of the American Statistical Association 111 (515): 988–1003.
Rodman, Emma. 2020. “A Timely Intervention: Tracking the Changing Meanings of Political Concepts with Word Vectors.” Political Analysis 28 (1): 87–111.
Rodriguez, Pedro L., Arthur Spirling, and Brandon M. Stewart. 2022. “Replication Data for: Embedding Regression: Models for Context-Specific Description and Inference.” Harvard Dataverse. Dataset. https://doi.org/10.7910/DVN/NKETXF.
Rodriguez, Pedro L., and Arthur Spirling. 2022. “Word Embeddings: What Works, What Doesn’t, and How to Tell the Difference for Applied Research.” The Journal of Politics 84 (1): 101–15.
Rudolph, Maja, Francisco Ruiz, Susan Athey, and David Blei. 2017. “Structured Embedding Models for Grouped Data.” In Advances in Neural Information Processing Systems, Vol. 30, eds. I. Guyon, U. Von Luxburg, S. Bengio, H. Wallach, R. Fergus, S. Vishwanathan, and R. Garnett. Red Hook, NY: Curran Associates, Inc.
Simmons, Beth. 2010. “Treaty Compliance and Violation.” Annual Review of Political Science 13:273–96.
Skinner, Quentin. 1969. “Meaning and Understanding in the History of Ideas.” History and Theory 8 (1): 3–53.
Slapin, Jonathan B., Justin H. Kirkland, Joseph A. Lazzaro, Patrick A. Leslie, and Tom O’Grady. 2018. “Ideology, Grandstanding, and Strategic Party Disloyalty in the British Parliament.” American Political Science Review 112 (1): 15–30.
Tversky, Amos, and Daniel Kahneman. 1981. “The Framing of Decisions and the Psychology of Choice.” Science 211 (4481): 453–58.
Verba, Sidney, and Gabriel Almond. 1963. The Civic Culture: Political Attitudes and Democracy in Five Nations. Princeton, NJ: Princeton University Press.
Warriner, Amy Beth, Victor Kuperman, and Marc Brysbaert. 2013. “Norms of Valence, Arousal, and Dominance for 13,915 English Lemmas.” Behavior Research Methods 45 (4): 1191–207.
Wu, Patrick Y., Walter R. Mebane, Jr., Logan Woods, Joseph Klaver, and Preston Due. 2019. “Partisan Associations of Twitter Users Based on Their Self-Descriptions and Word Embeddings.” (Mimeo) New York: New York University.
Yin, Zi, Vin Sachidananda, and Balaji Prabhakar. 2018. “The Global Anchor Method for Quantifying Linguistic Shifts and Domain Adaptation.” In Advances in Neural Information Processing Systems, Vol. 31, eds. S. Bengio, H. Wallach, H. Larochelle, K. Grauman, N. Cesa-Bianchi, and R. Garnett, 9434–45. Red Hook, NY: Curran Associates, Inc.
Ying, Luwei, Jacob M. Montgomery, and Brandon M. Stewart. 2021. “Topics, Concepts, and Measurement: A Crowdsourced Procedure for Validating Topics as Measures.” Political Analysis 30 (4): 570–89.