Global MMLU: Understanding and Addressing Cultural and Linguistic Biases in Multilingual Evaluation
Abstract
Cultural biases in multilingual datasets pose significant challenges for their effectiveness as global
benchmarks. These biases stem not only from differences in language but also from the cultural
knowledge required to interpret questions, reducing the practical utility of translated datasets
like MMLU. Furthermore, translation often introduces artifacts that can distort the meaning
or clarity of questions in the target language. A common practice in multilingual evaluation is
to rely on machine-translated evaluation sets, but simply translating a dataset is insufficient to
address these challenges. In this work, we trace the impact of both of these issues on multilingual
evaluations and ensuing model performances. Our large-scale evaluation of state-of-the-art open
and proprietary models illustrates that progress on MMLU depends heavily on learning Western-
centric concepts, with 28% of all questions requiring culturally sensitive knowledge. Moreover,
for questions requiring geographic knowledge, an astounding 84.9% focus on either North Amer-
ican or European regions. Rankings of model evaluations change depending on whether they
are evaluated on the full portion or the subset of questions annotated as culturally sensitive,
showing the distortion to model rankings when blindly relying on translated MMLU. We release
Global-MMLU , an improved MMLU with evaluation coverage across 42 languages. We improve
overall quality by engaging compensated professional and community annotators to verify
translation quality while also rigorously evaluating cultural biases present in the original
dataset. This comprehensive Global-MMLU set also includes designated subsets labeled as
culturally sensitive and culturally agnostic to allow for more holistic, complete eval-
uation.
Global-MMLU : https://hf.co/datasets/CohereForAI/Global-MMLU
Global-MMLU Lite : https://huggingface.co/datasets/CohereForAI/Global-MMLU-Lite
α First author. β Principal senior advisors.
Corresponding authors: {shivalika, beyza, sarahooker}@cohere.com
1 Introduction

Language cannot be reduced to a merely utilitarian tool; otherwise, there would be no reason to
have so many diverse ways of saying the same thing or referring to similar concepts. Indeed,
language is also a marker of belonging and a repository of cultural knowledge (Labov, 1963; 1986;
Karlık, 2023). Today, state-of-the-art generative AI is used around the world and yet evaluation
of these systems is primarily conducted using English benchmarks (Zellers et al., 2019; Hendrycks
et al., 2020; Suzgun et al., 2022; Zhang et al., 2023b). Where multilingual evaluations are relied
upon, these are often simply machine translations of widely adopted English benchmarks (Lai
et al., 2023; Üstün et al., 2024).
A pressing question arises: how can we develop large language models (LLMs) that perform effec-
tively and fairly across the full spectrum of languages and cultures? The lack of comprehensive
evaluation benchmarks for many languages poses a significant obstacle for researchers and prac-
titioners striving to create truly multilingual systems. Often, a common practice is to simply
translate English benchmarks into other languages. In this work, we consider the implications of
this given one of the most ubiquitous examples – the Massive Multitask Language Understand-
ing (MMLU) dataset (Hendrycks et al., 2020). Originally compiled using sources in the English
language across 57 diverse subject areas such as elementary mathematics, computer science, and
law, the dataset is often machine-translated into resources for multilingual assessment, which we
collectively term transMMLU (Lai et al., 2023; Üstün et al., 2024; OpenAI, 2024; Dubey et al.,
2024; Bendale et al., 2024). However, the growing adoption of automatically translated “as-is”
transMMLU as a barometer of global AI progress deserves closer inspection and reflection.
While widely adopted for multilingual evaluations, the multilinguality achieved through the
translation of English datasets does not guarantee multiculturality. Evaluating on blindly-
translated datasets risks overemphasizing Western-centric concepts and knowledge. For example,
the original English MMLU dataset contains several subsets which are US-specific, such as
examinations in US History, US Accounting, and US Law. Such cultural bias reduces the dataset’s
practical effectiveness (and conceptual relevance) as a global benchmark when translated.
Furthermore, as these translated datasets become adopted for multilingual evaluation and
developers optimize models for performance on transMMLU datasets, we risk overfitting to the
datasets’ cultural biases and inadvertently aligning multilingual evaluation standards with
particular cultural paradigms. Second, while machine translation expands language coverage,
it also introduces practical evaluation challenges. Translation artifacts
known as translationese (Bizzoni et al., 2020; Vanmassenhove et al., 2021; Koppel & Ordan, 2011)
can be introduced, which causes a breakdown in evaluation quality. Automatic data curation is
also known to often exacerbate common data quality issues (Luccioni & Viviano, 2021; Kreutzer
et al., 2022; Ferrara, 2023; Caswell et al., 2020).
Our effort to address the above is twofold. We conduct an extensive evaluation to quantify the
impact of cultural biases in MMLU on model evaluations to date, and we contribute improvements
to overall translation quality to address linguistic issues. We hire professional annotators to
verify translation quality and include improvements from rigorous per-question post-edits as well
as human translations. We release the comprehensive improved dataset Global-MMLU for
42 languages:
Amharic, Arabic, Bengali, Chinese, Czech, Dutch, English, Filipino, French, German, Greek,
Hausa, Hebrew, Hindi, Igbo, Indonesian, Italian, Japanese, Korean, Kyrgyz, Lithuanian, Mala-
gasy, Malay, Nepali, Nyanja, Persian, Polish, Portuguese, Romanian, Russian, Serbian, Sinhala,
Somali, Shona, Spanish, Swahili, Swedish, Telugu, Turkish, Ukrainian, Vietnamese, and Yoruba.
To address regional and cultural biases, we systematically annotate a subset of the original
English MMLU to identify questions where correctly answering requires cultural, geographical,
or dialect-specific knowledge. We refer to such questions as being Culturally-Sensitive (CS ), in
contrast to questions which do not require this prior knowledge, referred to as being Culturally-
Agnostic (CA ). We evaluate 14 state-of-the-art open-weight and proprietary models from 9
model families, focusing on those known for their high multilingual performance. This enables
rigorous evaluation of how such models serve diverse language users and isolates how ranking
may be subverted by questions which require primarily Western-centric knowledge. Through
extensive evaluations, we consistently find that cultural sensitivity has a significant impact on
model rankings. Our core contributions can be enumerated as follows:
• Analysis of MMLU for cultural biases: We observe that progress on MMLU depends
heavily on learning Western-centric concepts. Out of the annotated sample, we found that
28% of questions require specific knowledge of Western cultures. Moreover, for questions
requiring geographic knowledge, an astounding 84.9% focus on either North American or
European regions.
• Role of data quality improvements: Our analysis highlights notable performance dif-
ferences between human-translated and machine-translated datasets for both high-resource
and low-resource languages. Human-translated datasets are essential for accurately assess-
ing model performance, especially on low-resource languages, as relying solely on machine-
translated data may obscure the true capabilities of models in these contexts. Without
access to high-quality human-translated or in-language datasets, the evaluation of low-
resource language performance remains uncertain.
Stemming from our comprehensive results, we also make a set of recommendations for the
multilingual evaluation of generative models.
Figure 2: Examples of questions from the MMLU dataset labelled as requiring cultural, regional,
or dialectal knowledge.
representative random sample from each of the 57 exam subjects that compose MMLU (50 per
subject), totaling 2,850 samples. This annotated set is referred to as MMLU Annotated (MA)
throughout the paper. Annotators were asked to identify questions where correctly answering
depended upon 1) cultural knowledge, 2) geographic knowledge, or 3) dialect knowledge.
We provide more context about each of these categories below.
Figure 20 in Appendix H illustrates the annotation interface used during this process. Annotators
were presented with questions one at a time from each of the 57 MMLU subjects and had to
analyze and label them for the presence of cultural, geographic, or dialect knowledge. Each data
point was reviewed by at least three annotators, and some data points had a maximum of 10
annotators. 96.4% of all data points were reviewed by more than 3 human annotators. We
classify each question as presenting cultural, geographic, or dialect sensitivity according to
majority vote among annotators who reviewed each data point (Feldman, 1980). If half or more
of the annotators apply the same tag to a question, it is categorized under that tag. Detailed
information about the annotators and the annotation process is available in Appendix H.

Figure 3: Proportion of samples containing cultural, regional, or dialect-specific references per
subject in the MMLU dataset. Notably, all samples in the World Religions and Moral Scenarios
subjects include at least one such reference. Note that 12 subjects did not contain any Culturally-
Sensitive CS samples and have been excluded from the figure.
We also asked annotators to annotate for temporal knowledge to determine if answers for ques-
tions change with time. We find that only 2.4% of annotated samples depend on temporal
knowledge. We provide more details about the temporal analysis in Appendix D.
To understand the prevalence of these attributes at an aggregate level, we also assign a label of
Culturally-Sensitive (CS ) if either Dialect Knowledge , Cultural Knowledge or
Geographic Knowledge are positively attributed to an example. If none of these properties
are present, we deem an example to be Culturally-Agnostic (CA ). This enables us to track
at an aggregate level the fraction of the entire MMLU that requires CS knowledge.
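To make this aggregation concrete, the following minimal Python sketch applies the half-or-more majority rule per tag and then derives the aggregate CS /CA label; the field names and example annotations are hypothetical placeholders, not the released schema.

```python
from collections import Counter

# Hypothetical per-question annotations: one dict of boolean tags per annotator.
annotations = [
    {"cultural": True,  "geographic": False, "dialect": False},
    {"cultural": True,  "geographic": True,  "dialect": False},
    {"cultural": False, "geographic": False, "dialect": False},
]

def majority_tags(annotations):
    """A tag is kept if at least half of the annotators applied it."""
    n = len(annotations)
    counts = Counter(tag for ann in annotations for tag, value in ann.items() if value)
    return {tag for tag, count in counts.items() if count >= n / 2}

def aggregate_label(annotations):
    """CS if any of the three knowledge tags survives the majority vote, else CA."""
    tags = majority_tags(annotations)
    return "CS" if tags & {"cultural", "geographic", "dialect"} else "CA"

print(majority_tags(annotations))    # {'cultural'}
print(aggregate_label(annotations))  # CS
```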
represented by 1.3%, 1.1% and 0.7% of the tags, respectively. This shows that performing well
on MMLU heavily depends on mastering Western-centric cultural knowledge.
A similar trend is observed for geographic knowledge: 64.5% of CS samples were tagged as
needing regional knowledge of North America, followed by 20.4% tagged as requiring regional
knowledge of Europe. This concentration indicates that progress on MMLU predominantly re-
flects knowledge of Western concepts and regions.
Figure 4: Distribution of region (left) and culture (right) categories found in CS dataset. The
majority of Region tags (64.5%) correspond to North America, while the majority of Culture
tags (86.5%) are classified as Western. We have excluded samples that do not contain any region
or culture tags or contain multiple region or culture tags from this figure.
Cultural sensitivity varies considerably across subjects. The MMLU dataset, introduced
by Hendrycks et al. (2020), includes 57 subjects spanning four categories: STEM, Humanities,
Social Sciences, and Other. From the Other category, we selected relevant subjects and further
categorized them into Medical (Chen et al., 2023) and Business. Additional details about this
categorization are provided in Appendix B.
Figure 6 illustrates the data distribution for the CA subset, revealing significant variation
in cultural and regional references between different MMLU subjects and subject categories.
Questions from categories in Humanities and Social Sciences frequently required cultural or
regional knowledge, while those from the STEM and Medical categories generally did not. Overall
for Humanities, 68% of all questions were tagged as CS . However, this bias was even more
pronounced for certain subjects within Humanities. Notably, more than 80% of samples for
subjects like Philosophy, Moral Scenarios, High School US History, and High School Government
and Politics were deemed CS . Within the STEM category, only 30 out of 950 samples (3.15%)
were identified as CS , and for subjects such as Clinical Knowledge, Computer Security, and
Econometrics all question examples were classified as CA . These findings, detailed in Figure 6,
unsurprisingly reveal that certain subjects inherently exhibit more cultural or regional biases.
We provide examples of MMLU questions annotated as CS (Culturally Sensitive) and CA
(Culturally Agnostic) in Appendix J.

Figure 5: Distribution of cultural and regional tags across countries in the CS dataset. The
percentages indicate the representation of each country within the dataset. We have excluded
samples that do not contain any country tags or contain multiple country tags from this figure.
Inter-annotator agreement. Each data point was reviewed by at least three annotators, and
some datapoints had a maximum of 10 annotators. 96.4% of all data points were reviewed
by more than 3 human annotators. Given this rich set of feedback on each data point, we
analyze the agreement between ratings from different annotators using Krippendorff’s Alpha
scores (Krippendorff, 2004). We observed high inter-annotator agreement across most subjects,
with unanimous agreement on cultural sensitivity in the Anatomy subject. Six subjects showed
disagreement, including High School US History, while Moral Scenarios showed the most disagree-
ment. Detailed results are presented in Figures 23 and 24 in Appendix H.2.
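For reference, agreement scores of this form can be computed with the open-source krippendorff Python package; the snippet below is a small sketch with invented ratings rather than the exact analysis pipeline.

```python
import numpy as np
import krippendorff  # pip install krippendorff

# Rows = annotators, columns = questions; 1 = culturally sensitive, 0 = not,
# np.nan = annotator did not review that question. The values here are invented.
ratings = np.array([
    [1, 0, 1, np.nan, 0],
    [1, 0, 1, 1,      0],
    [1, 1, 1, 1,      np.nan],
])

alpha = krippendorff.alpha(reliability_data=ratings,
                           level_of_measurement="nominal")
print(f"Krippendorff's alpha: {alpha:.3f}")
```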
Figure 6: Proportion of samples retained per subject, after excluding those requiring cultural,
geographic, or dialectal knowledge (selected based on majority agreement).
Table 1: Statistics for MA , CS , and CA datasets. The left column displays the number
of subjects included in each dataset, the middle column shows the total number of samples per
category, and the right column illustrates changes in subject category distributions relative to
MA , with arrows indicating increases or decreases in representation.
We partition the annotated set into CS , comprising questions that require cultural, geographic,
or dialect knowledge to answer correctly, and CA , comprising questions that do not require
knowledge from these categories.
Table 1 provides a detailed breakdown of the number of subjects and samples in the CS and
CA subsets.
We observe notable differences in subject distribution between the CA and CS subsets, lead-
ing to shifts in category representation. For instance, while questions from the Social Sciences
category make up 21.1% of the MMLU Annotated , a uniformly balanced subsample of the
original MMLU, they are over-represented in CS , accounting for 26.3% of all questions requir-
ing CS knowledge. Conversely, questions from the STEM category, which contribute 33.3% of
the MMLU Annotated , are under-represented in CS , making up only 2.9% of all questions
identified as requiring CS knowledge. These shifts reflect how the nature of the CS subset
emphasizes cultural and contextual knowledge over technical or scientific content.
Overall, the proportions of STEM, Medical, and Business categories increase in the CA subset
due to their globally relevant content. Conversely, Humanities and Social Sciences are over-
represented in the CS subset compared to the original MMLU, as these fields frequently include
cultural or regional references. These findings are critical to the model evaluations in Section 4,
illustrating how cultural references in MMLU influence dataset composition and, ultimately,
model performance.
3 Introducing Global-MMLU
To date, many multilingual evaluations have relied on translated MMLU, with the most widely
adopted existing multilingual MMLU translation dataset having been translated into 26 languages
using ChatGPT backed by GPT-3.5 (Lai et al., 2023). We release an improved Global-MMLU
benchmark which is of higher quality and supports analysis on the CS and CA
subsets.
Here, we improve quality by incorporating professional edits and translations from native speakers
for a subset of languages and expanding coverage to 42 languages. We achieve this through a com-
bination of paid professional translations, community contributions, and higher-quality machine
translation. This effort involved professionally compensated annotators for four gold-standard
languages and a broader pool of community annotators who contributed to translations in 11
additional languages. Where available, we also included the professional human translations from
the MMMLU dataset for 14 languages. We rely as much as possible on human-verified transla-
tions to ensure that the translations are reliable and minimize the biases introduced, specifically
translationese which might be more pronounced in Machine Translation (Bizzoni et al., 2020;
Vanmassenhove et al., 2021; Koppel & Ordan, 2011). Alongside these quality improvements
through human verification, we include the metadata for the CS and CA annotations de-
veloped in the previous sections to allow for analysis on all subsets of data. Below, we provide
further details about our efforts to improve the quality of MMLU and engage compensated hu-
man annotators in translating and verifying quality as well as identifying the CS and CA
subsets.
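As an illustration of how such a mixed-source corpus might be assembled, the sketch below prefers human-verified translations over machine translation when selecting the text for each sample; the source names and priority order are assumptions made for illustration, not the released pipeline.

```python
# Hypothetical source labels, ordered from most to least preferred; the real
# dataset construction may differ.
SOURCE_PRIORITY = ["professional", "community", "mmmlu_human", "machine_translation"]

def pick_translation(candidates: dict) -> tuple[str, str]:
    """candidates maps source name -> translated question; return the best available one."""
    for source in SOURCE_PRIORITY:
        if source in candidates:
            return source, candidates[source]
    raise ValueError("no translation available for this sample")

source, text = pick_translation({
    "machine_translation": "...",  # raw Google Translate output
    "community": "...",            # post-edited by a community annotator
})
print(source)  # community
```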
We first translated the English MMLU dataset into 41 languages using the Google Translate
API.4 Despite its cost, we chose to use Google Translate because comprehensive evaluations
spanning 102 languages (Zhu et al., 2024) demonstrate that Google Translate significantly out-
performs alternatives such as NLLB (NLLB-Team et al., 2022), GPT-4, and ChatGPT, on low-
resource languages (Robinson et al., 2023). Recent work (Kocmi et al., 2024) has shown that
LLMs have begun to surpass popular online translation tools like Google Translate for machine
translation on specific high-resource languages. However, given that there is a known tendency
for models to favor their own generations (Panickssery et al., 2024; Shimabucoro et al., 2024),
we decided to use Google Translate for every language in order to avoid introducing bias into
model evaluations. To empirically validate this choice, we compared Google Translate’s outputs
with translations performed by GPT-3.5-turbo, which had been previously used to translate the
MMLU dataset (Lai et al., 2023). As shown in Figure 7, we find that Google Translate achieved
higher ChrF++ scores (Popović, 2017) across all subjects and lower deviation in performance
across languages, consistent with the findings of previous research (Popović, 2017) about its
superiority in translation quality. Following the translation process, native speakers reviewed
and edited the translations to ensure accuracy and fluency, thereby enhancing global representa-
tion. These edits were performed by two types of annotators: professional annotators and native
community annotators.
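The comparison between translation engines can be reproduced in outline with the Google Cloud Translation API and the chrF++ implementation in sacrebleu; the snippet below is a sketch with placeholder strings, not the exact evaluation script used for Figure 7.

```python
from google.cloud import translate_v2 as translate  # pip install google-cloud-translate
from sacrebleu.metrics import CHRF                   # pip install sacrebleu

client = translate.Client()  # requires GOOGLE_APPLICATION_CREDENTIALS to be set

question = "Which planet in our solar system has the most moons?"
result = client.translate(question, source_language="en", target_language="fr")
hypothesis = result["translatedText"]

# chrF++ corresponds to CHRF with word n-grams enabled (word_order=2).
chrf_pp = CHRF(word_order=2)
reference = "Quelle planète de notre système solaire possède le plus de lunes ?"
score = chrf_pp.corpus_score([hypothesis], [[reference]])
print(score)  # e.g. "chrF2++ = ..."
```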
The participation of native speakers from diverse regions introduced logistical challenges in both
data selection and quality control. To overcome these, we adopted Argilla5 as our primary
annotation platform. In line with our community-based approach, Argilla’s collaborative features
and customizable workflows enabled us to efficiently manage contributions from various regions
while maintaining consistency in translation quality. Annotators were presented with both the
original and machine-translated questions and answers, and were asked to edit any translations
that did not accurately capture the intent of the original text. The translation interface is shown
4 https://cloud.google.com/translate
5 https://github.com/argilla-io/argilla
in Figure 21 in Appendix I.
Figure 8: Percentage of Human-Translated Samples in MMLU Annotated .
Figure 8 highlights the number of samples edited by professional annotators and community
contributors. A total of 7,565 edits were made, accounting for 36.9% of the samples reviewed.
On average, professional annotators edited 789 samples per language (38.5% of the total) in the
Gold Set, while community contributors edited 362 samples per language (17.7% of the total). It
is important to note that the differences in edit rates likely reflect variations in time and resources
available to professional versus community annotators, and cannot be interpreted as differences
in translation quality across languages. Additional analyses of question and answer lengths, as
well as edit distances across subject categories, are presented in Appendix I.
MMLU Annotated . This subset consists of 2,850 question-answer pairs sampled uniformly
from the MMLU dataset (50 questions per subject), representing 20% of the original data and
serving as a representative random sample. These samples are annotated in English to determine
whether answering requires cultural, geographic, dialectal, or temporal knowledge. The anno-
tations are then applied to corresponding samples in 41 other languages, resulting in a total of
119,700 samples.
6 https://openai.com/index/openai-o1-system-card/
7 https://hf.co/datasets/openai/MMMLU
Culturally-Sensitive (CS) . This subset contains samples identified as requiring dialect
knowledge , cultural knowledge or geographic knowledge to answer correctly. It includes
792 annotated samples in English based on majority voting by annotators. These annotations
are extended to 41 additional languages, creating a dataset with 33,264 entries. This subset is
particularly useful for evaluating model performance on culturally contextual tasks.
Culturally-Agnostic (CA) . This subset includes samples that do not contain cultural,
regional, or dialectal references. It serves as a baseline for evaluating models on tasks that do
not require specific contextual knowledge. The subset consists of 2,058 annotated samples in
English, which are extended to 41 languages for a total of 86,436 entries.
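The released subsets can be loaded from the Hugging Face Hub with the datasets library; the snippet below is a usage sketch in which the language config code and the cultural_sensitivity_label column name are assumptions that may differ from the published schema.

```python
from datasets import load_dataset  # pip install datasets

# Load the Swahili portion of Global-MMLU (language configs assumed to use ISO codes).
global_mmlu = load_dataset("CohereForAI/Global-MMLU", "sw", split="test")

# Hypothetical column carrying the CS/CA annotation described above.
cs_subset = global_mmlu.filter(lambda row: row["cultural_sensitivity_label"] == "CS")
ca_subset = global_mmlu.filter(lambda row: row["cultural_sensitivity_label"] == "CA")

print(len(cs_subset), len(ca_subset))
```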
4 Model Evaluations
One of the key findings from Section 2.2 is that MMLU presents severe biases towards CS
knowledge. In this section, we seek to understand how these biases may have impacted evaluation
of open-weights and closed models. To do so, we measure changes to model rankings on 3 subsets
of data: Global-MMLU Annotated , Global-MMLU Culturally-Agnostic (CA ) and
Global-MMLU Culturally-Sensitive (CS ). By comparing model performance across these
three subsets, we aim to address the following questions: (1) How do models perform on the
MMLU test set when it includes culturally-sensitive samples? and (2) How do models perform on
samples that do not require specific contextual knowledge, ensuring consistent and fair evaluations
across different languages and regions?
Evaluation Setup. We use lm-evaluation-harness (Gao et al., 2024) to evaluate the open
multilingual models in a 5-shot setting. For closed models (i.e., GPT-4o and Claude 3.5 Sonnet),
we also do 5-shot evaluation. However, since log probabilities are not accessible via API
for closed models, we send the 5-shot prompt via API and get the corresponding generation
from the model. We use a system preamble to make the model respond with only the correct
answer option and extract the answer from the output generation. For prompting, we follow the
same approach as specified in (Hendrycks et al., 2020) and use prompt instructions in the same
language as the sample.
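A minimal sketch of the prompting and answer-extraction step for the API-based models is shown below; the preamble wording, helper names, and regular expression are illustrative assumptions rather than the exact prompts used.

```python
import re

SYSTEM_PREAMBLE = (
    "Answer the multiple-choice question by replying with only the letter "
    "of the correct option (A, B, C, or D)."
)

def build_prompt(few_shot_examples: list[str], question_block: str) -> str:
    """Concatenate five in-language exemplars with the target question."""
    return "\n\n".join(few_shot_examples + [question_block])

def extract_answer(generation: str) -> str | None:
    """Pull the first standalone option letter out of the model's generation."""
    match = re.search(r"\b([ABCD])\b", generation.strip())
    return match.group(1) if match else None

print(extract_answer("The correct answer is C."))  # C
```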
Languages. We categorize the languages into two main groups for reporting the results. The
first group consists of human-translated data only, which covers 10 languages from OpenAI’s
human-translated MMLU test set and 4 additional languages from our professionally translated
set. The second group contains all our data (combining professional, community and machine
translations), organized by language resource availability: high-resource, mid-resource, and
low-resource languages, as defined by Joshi et al. (2019) and categorized by Singh et al. (2024).
We report results for each of these categories. The high-resource languages are Arabic, Chinese,
Czech, Dutch, English, French, German, Hindi, Italian, Japanese, Persian, Polish, Portuguese,
Russian, Spanish, Swedish, Turkish, and Vietnamese; the mid-resource languages are Bengali,
Filipino, Greek, Hebrew, Indonesian, Korean, Lithuanian, Malay, Romanian, Serbian, and
Ukrainian; and the low-resource languages are Amharic, Hausa, Igbo, Kyrgyz, Malagasy, Nepali,
Nyanja, Shona, Sinhala, Somali, Swahili, Telugu, and Yoruba.
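Reporting by resource tier then reduces to averaging per-language scores within each group, as in the sketch below; the ISO-style language codes are shorthand for the lists above, and the accuracy values passed in would come from the evaluation harness.

```python
from statistics import mean, stdev

RESOURCE_TIERS = {
    "high": ["ar", "zh", "cs", "nl", "en", "fr", "de", "hi", "it", "ja",
             "fa", "pl", "pt", "ru", "es", "sv", "tr", "vi"],
    "mid":  ["bn", "fil", "el", "he", "id", "ko", "lt", "ms", "ro", "sr", "uk"],
    "low":  ["am", "ha", "ig", "ky", "mg", "ne", "ny", "sn", "si", "so", "sw", "te", "yo"],
}

def report_by_tier(accuracy_by_language: dict[str, float]) -> dict[str, tuple[float, float]]:
    """Return (mean, standard deviation) of accuracy for each resource tier."""
    summary = {}
    for tier, languages in RESOURCE_TIERS.items():
        scores = [accuracy_by_language[lang] for lang in languages if lang in accuracy_by_language]
        summary[tier] = (mean(scores), stdev(scores) if len(scores) > 1 else 0.0)
    return summary
```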
4.2 Results
Evaluations on Human-Translated Data. To assess the performance of models on high-
quality, human-translated data, we conducted evaluations using the subset of 14 languages with
human-translated data. The analysis focuses on both the CA and CS subsets to explore
how models handle tasks with and without cultural context.
[Figure 9: Accuracy (%) of each evaluated model on the Culturally Agnostic (CA) and Culturally
Sensitive (CS) subsets, aggregated across the 14 human-translated languages.]
We evaluated 14 models from 9 different model families, including 2 closed-source models. Fig-
ure 9 presents the results aggregated across 14 languages. We note that the focus of this eval-
uation is not to compare model performances directly but to analyze their behaviors on CA
and CS datasets. Direct comparisons between proprietary models and open-weight models are
not feasible due to significant differences in model sizes (although we note that the parameter
sizes of proprietary models have not been officially disclosed) and different evaluation methods.
Nonetheless, the results show that closed-source proprietary models, such as GPT-4o and Claude
3.5 Sonnet, consistently outperform smaller open-source models. Interestingly, the performance
gap between these models is narrower on CS datasets than on CA datasets.
Global-MMLU Lite balances the number of CS and CA samples per language. Unlike the
full Global-MMLU , this balance enables clearer comparisons. Figure 10 shows that overall,
models perform better on the CA portion.
Figure 10: Model evaluations on CA and CS samples in Global-MMLU Lite . Error
bars indicate standard deviation across languages.
Performance on CS is higher but presents more variance. Another key observation
is that the average accuracy across all models is higher on CS datasets compared to CA
datasets. This trend can be attributed to the nature of the CS samples, which are predom-
inantly drawn from Social Sciences and Humanities domains where models generally perform
better. In contrast, CA datasets include more challenging categories, such as Medical and
STEM, as illustrated in Figure 15.
However, the standard deviation in performance across languages is higher for CS data than
for CA data for all models. This can be attributed to several factors: culturally sensitive
tasks are inherently more challenging and require deeper contextual understanding, making them
more susceptible to variations in translation quality. Nuanced cultural, regional, or dialectal
references in CS tasks often amplify this sensitivity, as differences in how these references are
translated can affect model performance. Furthermore, many large language models are trained
predominantly on data from high-resource or Western cultures, leading to biases that favor these
contexts and cause inconsistencies when applied to less-represented cultures.
On Global-MMLU Lite , the pattern shifts: CS tasks have lower average accuracies and
greater variance than CA tasks. This highlights how cultural specificity increases performance
instability, when the CS and CA samples are balanced.
Figure 11: Model evaluations on (top) high-resource , (middle) mid-resource , and (bottom)
low-resource data samples for the CA and CS subsets.
Performance variability grows as language resources decrease, with the standard deviation rising
for mid-resource languages and even more so for low-resource languages, particularly on CS
datasets.
The average standard deviation for high-resource languages is 3.21 on CA datasets and
3.86 on CS datasets. For mid-resource languages, these values increase to 3.42 and 4.6,
respectively. Low-resource languages exhibit significantly higher standard deviations, with
averages rising to 6.37 on CA datasets and 6.78 on CS datasets. These represent increases
of 98% and 75% compared to high-resource languages, highlighting the greater variability and
sensitivity in low-resource settings. This increased variability in model performances highlights
the challenges of culturally sensitive tasks, which demand a nuanced understanding of regional
or dialectal references. Across all levels of resourcefulness, performance on CS shows higher
variability than CA .
Model Rank Changes. This section explores how model performance rankings differ between
CA and CS datasets, calculated relative to their ranks on MA , across multiple languages.
Table 2 highlights rank changes for human-translated languages, organized by resource level:
high-resource, mid-resource, and low-resource. These rankings offer valuable insights into
how dataset type, resource availability and model size impact model performances. Comprehen-
sive rankings for all languages are available in Table 6 and Table 7 in Appendix F.1.
1) Models perform differently across CA and CS datasets, with the latter show-
ing greater variation. Rankings on CA datasets exhibit minimal changes. For instance,
Italian, Japanese, and Portuguese show no rank changes, while Arabic and French each experi-
ence only two shifts, each by one position.
On the other hand, model performance varies significantly on CS datasets. Chinese and Hindi
emerge as the most sensitive languages to culture-specific knowledge, with models showing both
increases and decreases in rankings. Similar variations are evident in French, German, Italian,
Japanese, and Portuguese. Notably, models from the Aya Expanse and CommandR families tend
to show positive trends on CS datasets, particularly for these languages. On average, across all
languages, CA datasets see 3.4 rank changes and 3.7 position changes, whereas CS datasets
experience markedly higher volatility, with 5.7 rank changes and 7.3 position changes.
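Rank changes of this kind can be computed by ranking models on each subset and differencing against their MA ranks; the sketch below illustrates the computation with invented accuracies for three hypothetical models.

```python
import pandas as pd

# Invented per-model accuracies on one language, for illustration only.
accuracy = pd.DataFrame(
    {"MA": [71.2, 69.8, 65.4], "CA": [70.1, 70.3, 64.9], "CS": [74.0, 68.2, 66.5]},
    index=["model_a", "model_b", "model_c"],
)

# Rank 1 = best accuracy on each subset (ranked within each column).
ranks = accuracy.rank(ascending=False, axis=0).astype(int)

# Positive values mean the model moved up relative to its MA rank.
rank_change = ranks[["CA", "CS"]].rsub(ranks["MA"], axis=0)
print(rank_change)
```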
[Table 2: Model rank changes on the CA and CS subsets relative to MA for Arabic, Chinese,
English, French, German, Hindi, Italian, Japanese, Portuguese, Spanish, Bengali, Indonesian,
Korean, Sinhala, Swahili, and Yoruba. Columns cover the 14 evaluated models (GPT4o, Claude
Sonnet, Llama-3.1 70B/8B, Qwen2.5 32B/7B, Gemma2 27B/9B, Aya Expanse 32B/8B,
CommandR+/CommandR, Mistral Nemo, SEA-LION-v3); arrows indicate positions gained or lost.]
Large models demonstrate higher consistency across datasets and resource levels. The average
rank changes for large models are minimal, at 0.21 for CA and 0.67 for CS . The maximum
position shift for models in this group is 3, while it can be 5 for small models. This consistency
reflects their robustness and higher capacity to generalize across diverse datasets.
Mid-size models, on the other hand, show much greater variability. Their average rank changes
are 0.33 for CA and 1.97 for CS , indicating they are more sensitive to dataset characteristics,
particularly on CS datasets, which require cultural knowledge.
Small models exhibit the smallest difference in rank change between CA and CS (0.35 and
0.45, respectively). However, this apparent stability stems from their weaker overall performance
across both datasets. For instance, the average accuracy for small models is 51.3% on CA and
54.8% on CS , while mid-size models achieve 59.1% and 61.7%, and large models perform at
61.6% and 66.8% on CA and CS , respectively.
[Rank-change table for Greek, Ukrainian, Malagasy, and Shona on the CA and CS subsets
relative to MA , across the same 14 models; arrows indicate positions gained or lost.]
Overall, we can conclude that dataset characteristics significantly impact model performance
across all model sizes, though the magnitude of variability differs. Across all groups, models
demonstrate sensitivity to the diverse cultural and linguistic nuances present in CS datasets,
with performance variations reflecting their capacity to adapt to dataset-specific characteristics.
A similar trend appears in Global-MMLU Lite , where despite being smaller and balanced,
performance volatility is still higher on CS datasets, particularly for low-resource languages as
shown in Table 4.
The key finding is that models generally perform better on human-translated data for high-
resource languages. This is likely because these languages benefit from extensive in-language
training data. However, this trend shifts for mid-resource languages. The figure reveals that the
performance gap between HT and MT narrows for models such as Claude Sonnet and Qwen2.5
32B. Conversely, models like CommandR+ and Aya Expanse 32B continue to perform better
on HT data. Notably, these two models have strong Korean language support, which can be
attributed to a substantial amount of in-language training data.
[Rank-change table on the CA and CS subsets for Arabic, Chinese, English, French, German,
Hindi, Italian, Japanese, Portuguese, Spanish, Bengali, Indonesian, Korean, Swahili, and Yoruba
across seven models (Llama-3.1 70B, Qwen2.5 32B, Gemma2 27B, Aya Exp. 32B, CommandR+,
Mistral Nemo, SEA-LION-v3); arrows indicate positions gained or lost.]

Figure 12: Comparison of model performance on human-translated and machine-translated CS
in French, Korean, and Yoruba.
For low-resource languages, a distinct pattern emerges. As shown in the figure, models such
as Claude Sonnet and GPT-4o perform significantly better on MT data than on HT data. Sim-
ilarly, CommandR+ and Qwen2.5 32B also show improved performance on MT data, albeit
with less pronounced differences. This behavior is likely because these models primarily rely on
machine-translated data for low-resource languages during training, and the distribution of the
machine-translated test set aligns more closely with their training data. Notably, the only model
demonstrating consistent performance across both HT and MT datasets is Aya Expanse 32B,
which can be attributed to its broad coverage and strong support for low-resource languages.
These results underscore the importance of in-language or human-translated datasets for eval-
uating low-resource languages. The Global-MMLU dataset provides a valuable tool for
assessing the in-language performance of large language models (LLMs) on low-resource lan-
guages, offering insights into their capabilities and limitations in such contexts.
5 Related Work
5.1 Multilingual Knowledge Evaluation
As the MMLU benchmark has become a standard for evaluating LLMs (Beeching et al., 2023;
OpenAI, 2024; Dubey et al., 2024; Üstün et al., 2024; Aryabumi et al., 2024), addressing its lim-
itations and introducing enhancements are essential to maintaining high evaluation standards.
For English, Gema et al. (2024) manually re-annotated 3K questions across 30 MMLU subjects
to identify quality issues and problematic questions, releasing the result as MMLU-Redux. Wang et al.
(2024) introduced an extended version of this dataset, MMLU-Pro, which adds more challenging,
reasoning-focused questions and expands the answer choice set from four to ten options. MMLU-
Pro+ extends the previous work by incorporating questions with multiple correct answers across
diverse domains and evaluating higher-order reasoning in LLMs (Taghanaki et al., 2024). While
these efforts enhance the difficulty and diversity of tasks, they remain restricted to English alone.
There have been multiple efforts to design and construct evaluation datasets that cater to mul-
tilingual settings. AGIEval is a compilation of human-centric standardized exams to assess lan-
guage model performance in English and Chinese (Zhong et al., 2023). BEnQA is similar but
covers English and Bengali (Shafayat et al., 2024). EXAMS is a multilingual high-school
examination collection covering 16 languages (Hardalov et al., 2020). M3EXAMS is a multimodal
multilingual benchmark supporting 9 languages across three educational levels (Zhang et al., 2023a).
Both of the latter compile exams on various topics from different countries and build per-language
benchmarks. These initiatives strive to evaluate the performance of language models across var-
ious languages; however, they often support a small number of languages and lack a consistent,
standardized framework for direct comparison between languages. We note the recent INCLUDE
benchmark as an exception to this: it is one of the most extensive evaluation benchmarks, compiled
from local exams across various countries and languages, covering 44 languages (Romanou et al.,
2024).
To enable evaluation across a wider range of languages, efforts have also been made to translate
the MMLU dataset into multiple languages. Lai et al. (2023) use ChatGPT to translate the
English MMLU dataset into 26 languages. However, the quality of translations produced by
ChatGPT can vary significantly across different languages and is not always reliable (Robinson
et al., 2023). More recently OpenAI released MMMLU by translating MMLU into 14 lan-
guages using professional human translators, and we incorporate this high-quality dataset into
our benchmark.
In addition, several studies have explored evaluating multilingual visual language models (VLMs).
PangeaBench is a holistic evaluation suite encompassing 14 pre-existing datasets covering 47
languages (Yue et al., 2024). Romero et al. (2024) presents CVQA, a culturally diverse multilin-
gual Visual Question Answering benchmark that includes culturally-driven images and questions
across 30 countries and 31 languages. Vayani et al. (2024) introduces a multimodal benchmark
including culturally diverse images paired with text across 100 languages.
Numerous studies have also explored the role of pre-training in shaping the cultural biases present
in LLMs. For example, Chen et al. (2024) examines the impact of native versus translated data on
LLM instruction tuning and evaluation. Their findings reveal that models fine-tuned with native
instructions typically outperform those trained using translated data. Similarly, Choenni et al.
(2024) investigates the reliability of machine translation as a substitute for human translation in
8 An acronym for SouthEast Asian Holistic Evaluation of Language Models.
9 https://leaderboard.sea-lion.ai
large-scale multilingual evaluations, highlighting its effectiveness across a diverse set of languages.
Üstün et al. (2024) released the Aya-101 model, focusing on in-language prompting and on a
comprehensive dataset of human-written data for instruction tuning large language models
across 114 languages to reflect local culture and preferences (Singh et al., 2024). Additionally,
significant efforts have been made to incorporate knowledge from various cultures into LLMs
to achieve broader cultural alignment. For instance, Li et al. (2024b) proposes a cost-effective
fine-tuning strategy to embed cultural differences into LLMs, facilitating better representation
and understanding of global cultural nuances. Meanwhile, AlKhamissi et al. (2024) introduces
“Anthropological Prompting”, a novel method that employs anthropological reasoning to enhance
the cultural alignment of LLMs.
linguistic diversity and inclusivity to create one of the largest multilingual datasets for advancing
state-of-the-art language models (Singh et al., 2024; Üstün et al., 2024).
6 Conclusion
We evaluate the cultural biases present in MMLU and find that 28% of all questions require
culturally-sensitive knowledge. In particular, progress on MMLU depends heavily on learning
Western-centric concepts. For questions requiring geographic knowledge, the vast majority focus
on North America and Europe. This cultural bias remains in translated variants of MMLU
that are widely used for multilingual LLM evaluation, which reduces the dataset’s practical
effectiveness as a global benchmark and risks over-indexing evaluations on Western-centric idioms
and knowledge.
We examine the impact of translation artifacts and cultural bias on multilingual model rank-
ings. We introduce Global-MMLU and Global-MMLU Lite , multilingual multi-domain
datasets that distinguish between culturally-sensitive (CS ) and culturally-agnostic (CA )
knowledge. By incorporating professional and crowd-sourced annotations, these subsets enable
rigorous multilingual model evaluation.
7 Limitations
Uneven distribution of contributions. Beyond the gold standard languages where we engaged
with compensated annotators, participation from community annotators was heavily skewed
across languages. Despite a large volume of community annotators, there was a ‘long tail’ of
annotators only contributing one or two annotations. Similarly, there is a huge gap between
languages with the highest number of contributions and ones with the lowest number of contri-
butions. Consequently, this suggests potential unevenness in dataset distributions across different
languages and a lack of annotator diversity within some languages dominated by one or two fre-
quent contributors.
and to take into account how technology serves different dialects (a topic we do not address here).
Geo-cultural variation within a language often gives rise to new dialects or creoles over time
(Zampieri et al., 2020; Wolfram, 1997) and, as such, dialects can serve an important function in
establishing and maintaining cultural identity (Falck et al., 2012). Many different dialects that
are generally recognized as belonging to a single parent language are not represented in this
evaluation dataset.
Toxic or offensive speech. Our annotation interface does not contain specific flags for toxic,
harmful, or offensive speech, so it is possible that Global-MMLU contains some data that
could be considered harmful. We believe this is of relatively low risk because of the nature of the
original MMLU and the focus on examination material. However, we did not monitor or track
this explicitly during our cultural sensitivity annotations or translation post-edits.
8 Acknowledgments
We would like to thank members of the Cohere For AI community who championed this initia-
tive and helped with annotating samples for cultural sensitivity as well as improving translation
quality across many languages. In particular, we recognize Ashay Srivastava, Aurélien-Morgan
Claudon, Bevnm SaiAsrit, Danylo Boiko, Hanna Yukhymenko, Sai Vineetha Baddepudi Venkata
Naga Sri, Sangyeon Kim, Tadesse Destaw Belay, Alperen Ünlü, Mohammed Hamdy, Muham-
mad Rafi Sudrajat, Olusanya Joy Naomi, Vu Trong Ki, Yiyang Nan, Abdelmoneim Shahd, Arwa
ALaya, Bimasena Putra, Emad Alghamdi, Fabian Farestam, Mridul Sharma, Sayuru Bopitiya,
Surya Abhinai who contributed a significant amount to each of their languages. A special thank
you to Claire Cheng and Trisha Starostina for helping to coordinate the Cohere professional an-
notators who contributed to this project. We thank all these compensated experts who provided
their language knowledge to comprehensively improve quality over our gold languages.
15 https://www.pewresearch.org/global/2013/06/04/regional-categorization/
16 https://ourworldindata.org/world-region-map-definitions
References
Aakanksha, Arash Ahmadian, Beyza Ermis, Seraphina Goldfarb-Tarrant, Julia Kreutzer,
Marzieh Fadaee, and Sara Hooker. The multilingual alignment prism: Aligning global and
local preferences to reduce harm. In Yaser Al-Onaizan, Mohit Bansal, and Yun-Nung Chen
(eds.), Proceedings of the 2024 Conference on Empirical Methods in Natural Language Process-
ing, pp. 12027–12049, Miami, Florida, USA, November 2024. Association for Computational
Linguistics. URL https://aclanthology.org/2024.emnlp-main.671.
Gilles Adda, Sebastian Stüker, Martine Adda-Decker, Odette Ambouroue, Laurent Besacier,
David Blachon, Hélène Bonneau-Maynard, Pierre Godard, Fatima Hamlaoui, Dmitry Idiatov,
Guy-Noël Kouarata, Lori Lamel, Emmanuel-Moselly Makasso, Annie Rialland, Mark Van de
Velde, François Yvon, and Sabine Zerbian. Breaking the unwritten language barrier: The bulb
project. Procedia Computer Science, 81:8–14, 2016. ISSN 1877-0509. doi: https://doi.org/10.1016/j.procs.2016.04.023.
URL https://www.sciencedirect.com/science/article/pii/S1877050916300370. SLTU-2016 5th Workshop
on Spoken Language Technologies for Under-resourced Languages, 09-12 May 2016, Yogyakarta, Indonesia.
David Ifeoluwa Adelani, Jade Abbott, Graham Neubig, Daniel D’souza, Julia Kreutzer, Con-
stantine Lignos, Chester Palen-Michel, Happy Buzaaba, Shruti Rijhwani, Sebastian Ruder,
Stephen Mayhew, Israel Abebe Azime, Shamsuddeen H. Muhammad, Chris Chinenye Emezue,
Joyce Nakatumba-Nabende, Perez Ogayo, Aremu Anuoluwapo, Catherine Gitau, Derguene
Mbaye, Jesujoba Alabi, Seid Muhie Yimam, Tajuddeen Rabiu Gwadabe, Ignatius Ezeani,
Rubungo Andre Niyongabo, Jonathan Mukiibi, Verrah Otiende, Iroro Orife, Davis David,
Samba Ngom, Tosin Adewumi, Paul Rayson, Mofetoluwa Adeyemi, Gerald Muriuki, Em-
manuel Anebi, Chiamaka Chukwuneke, Nkiruka Odu, Eric Peter Wairagala, Samuel Oyerinde,
Clemencia Siro, Tobius Saul Bateesa, Temilola Oloyede, Yvonne Wambui, Victor Akinode,
Deborah Nabagereka, Maurice Katusiime, Ayodele Awokoya, Mouhamadane MBOUP, Dib-
ora Gebreyohannes, Henok Tilaye, Kelechi Nwaike, Degaga Wolde, Abdoulaye Faye, Blessing
Sibanda, Orevaoghene Ahia, Bonaventure F. P. Dossou, Kelechi Ogueji, Thierno Ibrahima
DIOP, Abdoulaye Diallo, Adewale Akinfaderin, Tendai Marengereke, and Salomey Osei.
MasakhaNER: Named entity recognition for African languages. Transactions of the Asso-
ciation for Computational Linguistics, 9:1116–1131, 2021. doi: 10.1162/tacl_a_00416. URL
https://aclanthology.org/2021.tacl-1.66.
David Ifeoluwa Adelani, Marek Masiak, Israel Abebe Azime, Jesujoba Alabi, Atnafu Lambebo
Tonja, Christine Mwase, Odunayo Ogundepo, Bonaventure F. P. Dossou, Akintunde Oladipo,
Doreen Nixdorf, Chris Chinenye Emezue, Sana Al-azzawi, Blessing Sibanda, Davis David,
Lolwethu Ndolela, Jonathan Mukiibi, Tunde Ajayi, Tatiana Moteu, Brian Odhiambo, Abra-
ham Owodunni, Nnaemeka Obiefuna, Muhidin Mohamed, Shamsuddeen Hassan Muhammad,
Teshome Mulugeta Ababu, Saheed Abdullahi Salahudeen, Mesay Gemeda Yigezu, Tajuddeen
Gwadabe, Idris Abdulmumin, Mahlet Taye, Oluwabusayo Awoyomi, Iyanuoluwa Shode, Tolu-
lope Adelani, Habiba Abdulganiyu, Abdul-Hakeem Omotayo, Adetola Adeeko, Abeeb Afolabi,
Anuoluwapo Aremu, Olanrewaju Samuel, Clemencia Siro, Wangari Kimotho, Onyekachi Ogbu,
Chinedu Mbonu, Chiamaka Chukwuneke, Samuel Fanijo, Jessica Ojo, Oyinkansola Awosan,
Tadesse Kebede, Toadoum Sari Sakayo, Pamela Nyatsine, Freedmore Sidume, Oreen Yousuf,
Mardiyyah Oduwole, Kanda Tshinu, Ussen Kimanuka, Thina Diko, Siyanda Nxakama, Sinodos
Nigusse, Abdulmejid Johar, Shafie Mohamed, Fuad Mire Hassan, Moges Ahmed Mehamed,
Evrard Ngabire, Jules Jules, Ivan Ssenkungu, and Pontus Stenetorp. MasakhaNEWS: News
topic classification for African languages. In Jong C. Park, Yuki Arase, Baotian Hu, Wei Lu,
Derry Wijaya, Ayu Purwarianti, and Adila Alfa Krisnadhi (eds.), Proceedings of the 13th Inter-
national Joint Conference on Natural Language Processing and the 3rd Conference of the Asia-
Pacific Chapter of the Association for Computational Linguistics (Volume 1: Long Papers),
pp. 144–159, Nusa Dua, Bali, November 2023. Association for Computational Linguistics. doi:
10.18653/v1/2023.ijcnlp-main.10. URL https://aclanthology.org/2023.ijcnlp-main.10.
David Ifeoluwa Adelani, Jessica Ojo, Israel Abebe Azime, Zhuang Yun Jian, Jesujoba Oluwadara
Alabi, Xuanli He, Millicent Ochieng, Sara Hooker, Andiswa Bukula, En-Shiun Annie Lee, Chia-
maka Chukwuneke, Happy Buzaaba, Blessing K. Sibanda, Godson Kalipe, Jonathan Mukiibi,
Salomon Kabongo KABENAMUALU, Foutse Yuehgoh, Mmasibidi Setaka, Lolwethu Ndolela,
Nkiruka Bridget Odu, Rooweither Mabuya, Shamsuddeen Hassan Muhammad, Salomey Osei,
Sokhar Samb, Tadesse Kebede Guge, and Pontus Stenetorp. Irokobench: A new benchmark
for african languages in the age of large language models. ArXiv, abs/2406.03368, 2024. URL
https://api.semanticscholar.org/CorpusID:270258352.
Badr AlKhamissi, Muhammad ElNokrashy, Mai AlKhamissi, and Mona Diab. Investigating
cultural alignment of large language models. arXiv preprint arXiv:2402.13231, 2024.
Arnav Arora, Lucie-Aimée Kaffee, and Isabelle Augenstein. Probing pre-trained language models
for cross-cultural differences in values. arXiv preprint arXiv:2203.13722, 2022.
Viraat Aryabumi, John Dang, Dwarak Talupuru, Saurabh Dash, David Cairuz, Hangyu Lin,
Bharat Venkitesh, Madeline Smith, Jon Ander Campos, Yi Chern Tan, Kelly Marchisio, Max
Bartolo, Sebastian Ruder, Acyr Locatelli, Julia Kreutzer, Nick Frosst, Aidan Gomez, Phil
Blunsom, Marzieh Fadaee, Ahmet Üstün, and Sara Hooker. Aya 23: Open weight releases to
further multilingual progress, 2024. URL https://arxiv.org/abs/2405.15032.
Edward Beeching, Clémentine Fourrier, Nathan Habib, Sheon Han, Nathan Lambert, Nazneen
Rajani, Omar Sanseviero, Lewis Tunstall, and Thomas Wolf. Open llm leaderboard.
https://huggingface.co/spaces/open-llm-leaderboard-old/open_llm_leaderboard, 2023.
Abhijit Bendale, Michael Sapienza, Steven Ripplinger, Simon Gibbs, Jaewon Lee, and Pranav
Mistry. Sutra: Scalable multilingual language model architecture, 2024. URL https://arxi
v.org/abs/2405.06694.
Steven Bird. Local languages, third spaces, and other high-resource scenarios. In Smaranda
Muresan, Preslav Nakov, and Aline Villavicencio (eds.), Proceedings of the 60th Annual Meeting
of the Association for Computational Linguistics (Volume 1: Long Papers), pp. 7817–7829,
Dublin, Ireland, May 2022. Association for Computational Linguistics. doi: 10.18653/v1/20
22.acl-long.539. URL https://aclanthology.org/2022.acl-long.539.
Abeba Birhane, William Isaac, Vinodkumar Prabhakaran, Mark Diaz, Madeleine Clare Elish,
Iason Gabriel, and Shakir Mohamed. Power to the people? opportunities and challenges for
participatory ai. In Equity and Access in Algorithms, Mechanisms, and Optimization, EAAMO
’22. ACM, October 2022. doi: 10.1145/3551624.3555290. URL http://dx.doi.org/10.1145
/3551624.3555290.
Yuri Bizzoni, Tom S Juzek, Cristina España-Bonet, Koel Dutta Chowdhury, Josef van Genabith,
and Elke Teich. How human is machine translationese? comparing human and machine
translations of text and speech. In Marcello Federico, Alex Waibel, Kevin Knight, Satoshi
Nakamura, Hermann Ney, Jan Niehues, Sebastian Stüker, Dekai Wu, Joseph Mariani, and
Francois Yvon (eds.), Proceedings of the 17th International Conference on Spoken Language
Translation, pp. 280–290, Online, July 2020. Association for Computational Linguistics. doi:
10.18653/v1/2020.iwslt-1.34. URL https://aclanthology.org/2020.iwslt-1.34.
Samuel Cahyawijaya, Holy Lovenia, Alham Fikri Aji, Genta Winata, Bryan Wilie, Fajri Koto,
Rahmad Mahendra, Christian Wibisono, Ade Romadhony, Karissa Vincentio, Jennifer San-
toso, David Moeljadi, Cahya Wirawan, Frederikus Hudi, Muhammad Satrio Wicaksono, Ivan
Parmonangan, Ika Alfina, Ilham Firdausi Putra, Samsul Rahmadani, Yulianti Oenang, Ali
Septiandri, James Jaya, Kaustubh Dhole, Arie Suryani, Rifki Afina Putri, Dan Su, Keith
Stevens, Made Nindyatama Nityasya, Muhammad Adilazuarda, Ryan Hadiwijaya, Ryandito
Diandaru, Tiezheng Yu, Vito Ghifari, Wenliang Dai, Yan Xu, Dyah Damapuspita, Haryo
Wibowo, Cuk Tho, Ichwanul Karo Karo, Tirana Fatyanosa, Ziwei Ji, Graham Neubig, Timo-
thy Baldwin, Sebastian Ruder, Pascale Fung, Herry Sujaini, Sakriani Sakti, and Ayu Pur-
warianti. NusaCrowd: Open source initiative for Indonesian NLP resources. In Anna
Rogers, Jordan Boyd-Graber, and Naoaki Okazaki (eds.), Findings of the Association for
Computational Linguistics: ACL 2023, pp. 13745–13818, Toronto, Canada, July 2023. As-
sociation for Computational Linguistics. doi: 10.18653/v1/2023.findings-acl.868. URL
https://aclanthology.org/2023.findings-acl.868.
Yong Cao, Li Zhou, Seolhwa Lee, Laura Cabello, Min Chen, and Daniel Hershcovich. Assessing
cross-cultural alignment between chatgpt and human societies: An empirical study. arXiv
preprint arXiv:2303.17466, 2023.
Isaac Caswell, Theresa Breiner, Daan van Esch, and Ankur Bapna. Language ID in the wild:
Unexpected challenges on the path to a thousand-language web text corpus. In Donia Scott,
Nuria Bel, and Chengqing Zong (eds.), Proceedings of the 28th International Conference on
Computational Linguistics, pp. 6588–6608, Barcelona, Spain (Online), December 2020. Inter-
national Committee on Computational Linguistics. doi: 10.18653/v1/2020.coling-main.579.
URL https://aclanthology.org/2020.coling-main.579.
José Cañete, Gabriel Chaperon, Rodrigo Fuentes, Jou-Hui Ho, Hojin Kang, and Jorge Pérez.
Spanish pre-trained bert model and evaluation data. In PML4DC at ICLR 2020, 2020.
Pinzhen Chen, Simon Yu, Zhicheng Guo, and Barry Haddow. Is it good data for multilingual
instruction tuning or just bad multilingual evaluation for large language models?, 2024. URL
https://arxiv.org/abs/2406.12822.
Zeming Chen, Alejandro Hernández Cano, Angelika Romanou, Antoine Bonnet, Kyle Matoba,
Francesco Salvi, Matteo Pagliardini, Simin Fan, Andreas Köpf, Amirkeivan Mohtashami,
Alexandre Sallinen, Alireza Sakhaeirad, Vinitra Swamy, Igor Krawczuk, Deniz Bayazit,
Axel Marmet, Syrielle Montariol, Mary-Anne Hartley, Martin Jaggi, and Antoine Bosse-
lut. Meditron-70b: Scaling medical pretraining for large language models, 2023. URL
https://arxiv.org/abs/2311.16079.
Rochelle Choenni, Sara Rajaee, Christof Monz, and Ekaterina Shutova. On the evaluation prac-
tices in multilingual nlp: Can machine translation offer an alternative to human translations?,
2024. URL https://arxiv.org/abs/2406.14267.
Tarin Clanuwat, Mikel Bober-Irizar, Asanobu Kitamoto, Alex Lamb, Kazuaki Yamamoto, and
David Ha. Deep learning for classical japanese literature, 2018.
Eric Corbett, Emily Denton, and Sheena Erete. Power and public participation in ai. In Pro-
ceedings of the 3rd ACM Conference on Equity and Access in Algorithms, Mechanisms, and
Optimization, EAAMO ’23, New York, NY, USA, 2023. Association for Computing Machinery.
ISBN 9798400703812. doi: 10.1145/3617694.3623228. URL https://doi.org/10.1145/3617
694.3623228.
Xuan-Quy Dao, Ngoc-Bich Le, The-Duy Vo, Xuan-Dung Phan, Bac-Bien Ngo, Van-Tien Nguyen,
Thi-My-Thanh Nguyen, and Hong-Phuoc Nguyen. Vnhsge: Vietnamese high school graduation
examination dataset for large language models, 2023. URL https://arxiv.org/abs/2305.1
2199.
Fernando Delgado, Stephen Yang, Michael Madaio, and Qian Yang. The participatory turn in ai
design: Theoretical foundations and the current state of practice. Proceedings of the 3rd ACM
Conference on Equity and Access in Algorithms, Mechanisms, and Optimization, 2023. URL
https://api.semanticscholar.org/CorpusID:263605822.
Abhimanyu Dubey, Abhinav Jauhri, Abhinav Pandey, Abhishek Kadian, Ahmad Al-Dahle,
Aiesha Letman, Akhil Mathur, Alan Schelten, Amy Yang, Angela Fan, et al. The llama 3
herd of models. arXiv preprint arXiv:2407.21783, 2024.
Oliver Falck, Stephan Heblich, Alfred Lameli, and Jens Südekum. Dialects, cultural identity,
and economic exchange. Journal of urban economics, 72(2-3):225–239, 2012.
Allan M. Feldman. Majority Voting, pp. 161–177. Springer US, Boston, MA, 1980. ISBN 978-
1-4615-8141-3. doi: 10.1007/978-1-4615-8141-3_10. URL https://doi.org/10.1007/978-1
-4615-8141-3_10.
Emilio Ferrara. Should chatgpt be biased? challenges and risks of bias in large language models.
First Monday, November 2023. ISSN 1396-0466. doi: 10.5210/fm.v28i11.13346. URL
http://dx.doi.org/10.5210/fm.v28i11.13346.
∀, Wilhelmina Nekoto, Vukosi Marivate, Tshinondiwa Matsila, Timi Fasubaa, Taiwo Fagbo-
hungbe, Solomon Oluwole Akinola, Shamsuddeen Muhammad, Salomon Kabongo Kabena-
mualu, Salomey Osei, Freshia Sackey, Rubungo Andre Niyongabo, Ricky Macharm, Perez
Ogayo, Orevaoghene Ahia, Musie Meressa Berhe, Mofetoluwa Adeyemi, Masabata Mokgesi-
Selinga, Lawrence Okegbemi, Laura Martinus, Kolawole Tajudeen, Kevin Degila, Kelechi
Ogueji, Kathleen Siminyu, Julia Kreutzer, Jason Webster, Jamiil Toure Ali, Jade Abbott, Iroro
Orife, Ignatius Ezeani, Idris Abdulkadir Dangana, Herman Kamper, Hady Elsahar, Goodness
Duru, Ghollah Kioko, Murhabazi Espoir, Elan van Biljon, Daniel Whitenack, Christopher
Onyefuluchi, Chris Chinenye Emezue, Bonaventure F. P. Dossou, Blessing Sibanda, Blessing
Bassey, Ayodele Olabiyi, Arshath Ramkilowan, Alp Öktem, Adewale Akinfaderin, and Ab-
dallah Bashir. Participatory research for low-resourced machine translation: A case study
in African languages. In Trevor Cohn, Yulan He, and Yang Liu (eds.), Findings of the
Association for Computational Linguistics: EMNLP 2020, Online, November 2020. Associ-
ation for Computational Linguistics. doi: 10.18653/v1/2020.findings-emnlp.195. URL
https://aclanthology.org/2020.findings-emnlp.195.
Leo Gao, Jonathan Tow, Baber Abbasi, Stella Biderman, Sid Black, Anthony DiPofi, Charles
Foster, Laurence Golding, Jeffrey Hsu, Alain Le Noac’h, Haonan Li, Kyle McDonell, Niklas
Muennighoff, Chris Ociepa, Jason Phang, Laria Reynolds, Hailey Schoelkopf, Aviya Skowron,
Lintang Sutawika, Eric Tang, Anish Thite, Ben Wang, Kevin Wang, and Andy Zou. A frame-
work for few-shot language model evaluation, 07 2024. URL https://zenodo.org/records
/12608602.
Aryo Pradipta Gema, Joshua Ong Jun Leang, Giwon Hong, Alessio Devoto, Alberto Carlo Maria
Mancino, Rohit Saxena, Xuanli He, Yu Zhao, Xiaotang Du, Mohammad Reza Ghasemi
Madani, Claire Barale, Robert McHardy, Joshua Harris, Jean Kaddour, Emile van Krieken,
and Pasquale Minervini. Are we done with mmlu?, 2024. URL https://arxiv.org/abs/24
06.04127.
Gemma Team, Thomas Mesnard, Cassidy Hardin, Robert Dadashi, Surya Bhupatiraju, Shreya
Pathak, Laurent Sifre, Morgane Rivière, Mihir Sanjay Kale, Juliette Love, Pouya Tafti,
Léonard Hussenot, Pier Giuseppe Sessa, Aakanksha Chowdhery, Adam Roberts, Aditya
Barua, Alex Botev, Alex Castro-Ros, Ambrose Slone, Amélie Héliou, Andrea Tacchetti,
Anna Bulanova, Antonia Paterson, Beth Tsai, Bobak Shahriari, Charline Le Lan, Christo-
pher A. Choquette-Choo, Clément Crepy, Daniel Cer, Daphne Ippolito, David Reid, Elena
Buchatskaya, Eric Ni, Eric Noland, Geng Yan, George Tucker, George-Christian Muraru,
Grigory Rozhdestvenskiy, Henryk Michalewski, Ian Tenney, Ivan Grishchenko, Jacob Austin,
James Keeling, Jane Labanowski, Jean-Baptiste Lespiau, Jeff Stanway, Jenny Brennan, Jeremy
Chen, Johan Ferret, Justin Chiu, Justin Mao-Jones, Katherine Lee, Kathy Yu, Katie Millican,
Lars Lowe Sjoesund, Lisa Lee, Lucas Dixon, Machel Reid, Maciej Mikuła, Mateo Wirth,
Michael Sharman, Nikolai Chinaev, Nithum Thain, Olivier Bachem, Oscar Chang, Oscar
Wahltinez, Paige Bailey, Paul Michel, Petko Yotov, Rahma Chaabouni, Ramona Comanescu,
Reena Jana, Rohan Anil, Ross McIlroy, Ruibo Liu, Ryan Mullins, Samuel L Smith, Sebastian
Borgeaud, Sertan Girgin, Sholto Douglas, Shree Pandya, Siamak Shakeri, Soham De, Ted
Klimenko, Tom Hennigan, Vlad Feinberg, Wojciech Stokowiec, Yu hui Chen, Zafarali Ahmed,
Zhitao Gong, Tris Warkentin, Ludovic Peran, Minh Giang, Clément Farabet, Oriol Vinyals,
Jeff Dean, Koray Kavukcuoglu, Demis Hassabis, Zoubin Ghahramani, Douglas Eck, Joelle Bar-
ral, Fernando Pereira, Eli Collins, Armand Joulin, Noah Fiedel, Evan Senter, Alek Andreev,
and Kathleen Kenealy. Gemma: Open models based on gemini research and technology, 2024.
URL https://arxiv.org/abs/2403.08295.
Omid Ghahroodi, Marzia Nouri, Mohammad Vali Sanian, Alireza Sahebi, Doratossadat Dast-
gheib, Ehsaneddin Asgari, Mahdieh Soleymani Baghshah, and Mohammad Hossein Rohban.
Khayyam challenge (persianmmlu): Is your llm truly wise to the persian language?, 2024.
URL https://arxiv.org/abs/2404.06644.
Adriana Guevara-Rukoz, Isin Demirsahin, Fei He, Shan-Hui Cathy Chu, Supheakmungkol Sarin,
Knot Pipatsrisawat, Alexander Gutkin, Alena Butryna, and Oddur Kjartansson. Crowdsourc-
ing Latin American Spanish for low-resource text-to-speech. In Nicoletta Calzolari, Frédéric
Béchet, Philippe Blache, Khalid Choukri, Christopher Cieri, Thierry Declerck, Sara Goggi, Hi-
toshi Isahara, Bente Maegaard, Joseph Mariani, Hélène Mazo, Asuncion Moreno, Jan Odijk,
and Stelios Piperidis (eds.), Proceedings of the Twelfth Language Resources and Evaluation
Conference, pp. 6504–6513, Marseille, France, May 2020. European Language Resources As-
sociation. ISBN 979-10-95546-34-4. URL https://aclanthology.org/2020.lrec-1.801.
Momchil Hardalov, Todor Mihaylov, Dimitrina Zlatkova, Yoan Dinkov, Ivan Koychev, and
Preslav Nakov. EXAMS: A multi-subject high school examinations dataset for cross-lingual
and multilingual question answering. In Bonnie Webber, Trevor Cohn, Yulan He, and Yang
Liu (eds.), Proceedings of the 2020 Conference on Empirical Methods in Natural Language
Processing (EMNLP), pp. 5427–5444, Online, November 2020. Association for Computational
Linguistics. doi: 10.18653/v1/2020.emnlp-main.438. URL https://aclanthology.org/202
0.emnlp-main.438.
Dan Hendrycks, Collin Burns, Steven Basart, Andy Zou, Mantas Mazeika, Dawn Song, and
Jacob Steinhardt. Measuring massive multitask language understanding. arXiv preprint
arXiv:2009.03300, 2020.
Carlos Daniel Hernandez Mena and Ivan Vladimir Meza Ruiz. Creating Mexican Spanish lan-
guage resources through the social service program. In Chris Callison-Burch, Christopher
Cieri, James Fiumara, and Mark Liberman (eds.), Proceedings of the 2nd Workshop on Novel
Incentives in Data Collection from People: models, implementations, challenges and results
within LREC 2022, pp. 20–24, Marseille, France, June 2022. European Language Resources
Association. URL https://aclanthology.org/2022.nidcp-1.4.
Aaron Hurst, Adam Lerer, Adam P Goucher, Adam Perelman, Aditya Ramesh, Aidan Clark,
AJ Ostrow, Akila Welihinda, Alan Hayes, Alec Radford, et al. Gpt-4o system card. arXiv
preprint arXiv:2410.21276, 2024.
Mika Hämäläinen. Endangered Languages are not Low-Resourced!, pp. 1–11. University of
Helsinki, March 2021. doi: 10.31885/9789515150257.1. URL http://dx.doi.org/10.31
885/9789515150257.1.
Pratik Joshi, Christain Barnes, Sebastin Santy, Simran Khanuja, Sanket Shah, Anirudh Srini-
vasan, Satwik Bhattamishra, Sunayana Sitaram, Monojit Choudhury, and Kalika Bali. Unsung
challenges of building and deploying language technologies for low resource language commu-
nities. In Dipti Misra Sharma and Pushpak Bhattacharya (eds.), Proceedings of the 16th
International Conference on Natural Language Processing, pp. 211–219, International Insti-
tute of Information Technology, Hyderabad, India, December 2019. NLP Association of India.
URL https://aclanthology.org/2019.icon-1.25.
Pratik Joshi, Sebastin Santy, Amar Budhiraja, Kalika Bali, and Monojit Choudhury. The state
and fate of linguistic diversity and inclusion in the nlp world. arXiv preprint arXiv:2004.09095,
2020.
Meryem Karlık. Exploring the impact of culture on language learning: How understanding cul-
tural context and values can deepen language acquisition. International Journal of Language,
Linguistics, Literature and Culture, 2:5–11, 09 2023. doi: 10.59009/ijlllc.2023.0035.
Naomi Kipuri. Chapter ii: culture. In UN, Department of Economic and Social Affairs, Divi-
sion for Social Policy and Development, Secretariat of the Permanent Forum on Indigenous
Issues (ed.), State of the world’s indigenous peoples: ST/ESA/328, New York: United Nations
publication, pp. 51–81, 2009.
Tom Kocmi, Eleftherios Avramidis, Rachel Bawden, Ondřej Bojar, Anton Dvorkovich, Chris-
tian Federmann, Mark Fishel, Markus Freitag, Thamme Gowda, Roman Grundkiewicz, et al.
Findings of the wmt24 general machine translation shared task: the llm era is here but mt
is not solved yet. In Proceedings of the Ninth Conference on Machine Translation, pp. 1–46,
2024.
Moshe Koppel and Noam Ordan. Translationese and its dialects. In Proceedings of the 49th
annual meeting of the association for computational linguistics: Human language technologies,
pp. 1318–1326, 2011.
Fajri Koto, Nurul Aisyah, Haonan Li, and Timothy Baldwin. Large language models only
pass primary school exams in indonesia: A comprehensive test on indommlu, 2023. URL
https://arxiv.org/abs/2310.04928.
Fajri Koto, Haonan Li, Sara Shatnawi, Jad Doughman, Abdelrahman Boda Sadallah, Aisha
Alraeesi, Khalid Almubarak, Zaid Alyafeai, Neha Sengupta, Shady Shehata, Nizar Habash,
Preslav Nakov, and Timothy Baldwin. Arabicmmlu: Assessing massive multitask language
understanding in arabic, 2024. URL https://arxiv.org/abs/2402.12840.
Julia Kreutzer, Isaac Caswell, Lisa Wang, Ahsan Wahab, Daan van Esch, Nasanbayar Ulzii-
Orshikh, Allahsera Tapo, Nishant Subramani, Artem Sokolov, Claytone Sikasote, Monang
Setyawan, et al. Quality at a glance: An audit of web-crawled multilingual datasets. Transac-
tions of the Association for Computational Linguistics, 10:50–72, 2022. doi: 10.1162/tacl_a
_00447. URL https://aclanthology.org/2022.tacl-1.4.
Klaus Krippendorff. Reliability in content analysis: Some common misconceptions and recom-
mendations. Human Communication Research, 30(3):411–433, 2004.
William Labov. The social motivation of a sound change. Word, 19(3):273–309, 1963.
William Labov. The social stratification of (r) in new york city department stores. In Dialect
and language variation, pp. 304–329. Elsevier, 1986.
Viet Lai, Chien Nguyen, Nghia Ngo, Thuat Nguyen, Franck Dernoncourt, Ryan Rossi, and Thien
Nguyen. Okapi: Instruction-tuned large language models in multiple languages with reinforce-
ment learning from human feedback. In Yansong Feng and Els Lefever (eds.), Proceedings of the
2023 Conference on Empirical Methods in Natural Language Processing: System Demonstra-
tions, pp. 318–327, Singapore, December 2023. Association for Computational Linguistics. doi:
10.18653/v1/2023.emnlp-demo.28. URL https://aclanthology.org/2023.emnlp-demo.28.
Wei Qi Leong, Jian Gang Ngui, Yosephine Susanto, Hamsawardhini Rengarajan, Kengatharaiyer
Sarveswaran, and William Chandra Tjhi. Bhasa: A holistic southeast asian linguistic and
cultural evaluation suite for large language models. arXiv preprint arXiv:2309.06085, 2023.
Vladimir I. Levenshtein. Binary codes capable of correcting deletions, insertions, and reversals.
Soviet Physics Doklady, 10(8):707–710, 1966.
Haonan Li, Yixuan Zhang, Fajri Koto, Yifei Yang, Hai Zhao, Yeyun Gong, Nan Duan, and
Timothy Baldwin. Cmmlu: Measuring massive multitask language understanding in chinese,
2024a. URL https://arxiv.org/abs/2306.09212.
Huihan Li, Liwei Jiang, Jena D. Hwang, Hyunwoo Kim, Sebastin Santy, Taylor Sorensen,
Bill Yuchen Lin, Nouha Dziri, Xiang Ren, and Yejin Choi. Culture-gen: Revealing global
cultural perception in language models through natural language prompting, 2024b. URL
https://arxiv.org/abs/2404.10199.
Chen Cecilia Liu, Iryna Gurevych, and Anna Korhonen. Culturally aware and adapted nlp: A
taxonomy and a survey of the state of the art. arXiv preprint arXiv:2406.03930, 2024.
Fangyu Liu, Emanuele Bugliarello, Edoardo Maria Ponti, Siva Reddy, Nigel Collier, and Desmond
Elliott. Visually grounded reasoning across languages and cultures. In Proceedings of the
2021 Conference on Empirical Methods in Natural Language Processing, pp. 10467–10485,
Online and Punta Cana, Dominican Republic, November 2021. Association for Computational
Linguistics. URL https://aclanthology.org/2021.emnlp-main.818.
Holy Lovenia, Rahmad Mahendra, Salsabil Maulana Akbar, Lester James Validad Miranda, Jen-
nifer Santoso, Elyanah Aco, Akhdan Fadhilah, Jonibek Mansurov, Joseph Marvin Imperial,
Onno P. Kampman, Joel Ruben Antony Moniz, Muhammad Ravi Shulthan Habibi, Fred-
erikus Hudi, Jann Railey Montalan, Ryan Ignatius Hadiwijaya, Joanito Agili Lopo, William
Nixon, Börje F. Karlsson, James Jaya, Ryandito Diandaru, Yuze Gao, Patrick Amadeus
Irawan, Bin Wang, Jan Christian Blaise Cruz, Chenxi Whitehouse, Ivan Halim Parmonan-
gan, Maria Khelli, Wenyu Zhang, Lucky Susanto, Reynard Adha Ryanda, Sonny Lazuardi
Hermawan, Dan John Velasco, Muhammad Dehan Al Kautsar, Willy Fitra Hendria, Yasmin
Moslem, Noah Flynn, Muhammad Farid Adilazuarda, Haochen Li, Johanes Lee, R. Daman-
huri, Shuo Sun, Muhammad Reza Qorib, Amirbek Djanibekov, Wei Qi Leong, Quyet V.
Do, Niklas Muennighoff, Tanrada Pansuwan, Ilham Firdausi Putra, Yan Xu, Tai Ngee
Chia, Ayu Purwarianti, Sebastian Ruder, William Chandra Tjhi, Peerat Limkonchotiwat, Al-
ham Fikri Aji, Sedrick Keh, Genta Indra Winata, Ruochen Zhang, Fajri Koto, Zheng Xin
Yong, and Samuel Cahyawijaya. SEACrowd: A multilingual multimodal data hub and
benchmark suite for Southeast Asian languages. In Yaser Al-Onaizan, Mohit Bansal, and
Yun-Nung Chen (eds.), Proceedings of the 2024 Conference on Empirical Methods in Nat-
ural Language Processing, pp. 5155–5203, Miami, Florida, USA, November 2024. Associ-
ation for Computational Linguistics. doi: 10.18653/v1/2024.emnlp-main.296. URL
https://aclanthology.org/2024.emnlp-main.296.
Alexandra Luccioni and Joseph Viviano. What’s in the box? an analysis of undesirable content in
the Common Crawl corpus. In Chengqing Zong, Fei Xia, Wenjie Li, and Roberto Navigli (eds.),
Proceedings of the 59th Annual Meeting of the Association for Computational Linguistics and
the 11th International Joint Conference on Natural Language Processing (Volume 2: Short
Papers), pp. 182–189, Online, August 2021. Association for Computational Linguistics. doi:
10.18653/v1/2021.acl-short.24. URL https://aclanthology.org/2021.acl-short.24.
Jabez Magomere, Shu Ishida, Tejumade Afonja, Aya Salama, Daniel Kochin, Foutse Yuehgoh,
Imane Hamzaoui, Raesetje Sefala, Aisha Alaagib, Elizaveta Semenova, et al. You are what you
eat? feeding foundation models a regionally diverse food dataset of world wide dishes. arXiv
preprint arXiv:2406.09496, 2024.
Reem I. Masoud, Ziquan Liu, Martin Ferianc, Philip Treleaven, and Miguel Rodrigues. Cultural
alignment in large language models: An explanatory analysis based on hofstede’s cultural
dimensions, 2024. URL https://arxiv.org/abs/2309.12342.
Jann Railey Montalan, Jian Gang Ngui, Wei Qi Leong, Yosephine Susanto, Hamsawardhini
Rengarajan, William Chandra Tjhi, and Alham Fikri Aji. Kalahi: A handcrafted, grassroots
cultural llm evaluation suite for filipino, 2024. URL https://arxiv.org/abs/2409.15380.
Sagnik Mukherjee, Muhammad Farid Adilazuarda, Sunayana Sitaram, Kalika Bali, Alham Fikri
Aji, and Monojit Choudhury. Cultural conditioning or placebo? on the effectiveness of socio-
demographic prompting. arXiv preprint arXiv:2406.11661, 2024.
Junho Myung, Nayeon Lee, Yi Zhou, Jiho Jin, Rifki Afina Putri, Dimosthenis Antypas, Hsu-
vas Borkakoty, Eunsu Kim, Carla Perez-Almendros, Abinew Ali Ayele, et al. Blend: A
benchmark for llms on everyday knowledge in diverse cultures and languages. arXiv preprint
arXiv:2406.09948, 2024.
Tarek Naous, Michael J. Ryan, Alan Ritter, and Wei Xu. Having beer after prayer? measuring
cultural bias in large language models, 2024. URL https://arxiv.org/abs/2305.14456.
Wilhelmina Nekoto, Vukosi Marivate, Tshinondiwa Matsila, Timi Fasubaa, Tajudeen Kolawole,
Taiwo Fagbohungbe, Solomon Oluwole Akinola, Shamsuddeen Hassan Muhammad, Salomon
Kabongo, Salomey Osei, et al. Participatory research for low-resourced machine translation:
A case study in african languages. arXiv preprint arXiv:2010.02353, 2020.
NLLB-Team, Marta R. Costa-jussà, James Cross, Onur Çelebi, Maha Elbayad, Kenneth Heafield,
Kevin Heffernan, Elahe Kalbassi, Janice Lam, Daniel Licht, Jean Maillard, Anna Sun, Skyler
Wang, Guillaume Wenzek, Al Youngblood, Bapi Akula, Loic Barrault, Gabriel Mejia Gon-
zalez, Prangthip Hansanti, John Hoffman, Semarley Jarrett, Kaushik Ram Sadagopan, Dirk
Rowe, Shannon Spruit, Chau Tran, Pierre Andrews, Necip Fazil Ayan, Shruti Bhosale, Sergey
Edunov, Angela Fan, Cynthia Gao, Vedanuj Goswami, Francisco Guzmán, Philipp Koehn,
Alexandre Mourachko, Christophe Ropers, Safiyyah Saleem, Holger Schwenk, and Jeff Wang.
No language left behind: Scaling human-centered machine translation, 2022.
Arjun Panickssery, Samuel R. Bowman, and Shi Feng. LLM Evaluators Recognize and Favor
Their Own Generations, 2024. URL https://arxiv.org/abs/2404.13076.
Maja Popović. chrf++: words helping character n-grams. In Proceedings of the Second Con-
ference on Machine Translation, pp. 612–618, Copenhagen, Denmark, September 2017. As-
sociation for Computational Linguistics. doi: 10.18653/v1/W17-4770. URL https:
//aclanthology.org/W17-4770.
Abhinav Rao, Akhila Yerukola, Vishwa Shah, Katharina Reinecke, and Maarten Sap. Normad:
A framework for measuring the cultural adaptability of large language models, 2024. URL
https://arxiv.org/abs/2404.12464.
Nathaniel R. Robinson, Perez Ogayo, David R. Mortensen, and Graham Neubig. Chatgpt mt:
Competitive for high- (but not low-) resource languages, 2023. URL https://arxiv.org/ab
s/2309.07423.
Angelika Romanou, Negar Foroutan, Anna Sotnikova, Zeming Chen, Sree Harsha Nelaturu, Shiv-
alika Singh, Rishabh Maheshwary, Micol Altomare, Mohamed A. Haggag, Snegha A, Alfonso
Amayuelas, Azril Hafizi Amirudin, Viraat Aryabumi, Danylo Boiko, Michael Chang, Jenny
Chim, Gal Cohen, Aditya Kumar Dalmia, Abraham Diress, Sharad Duwal, Daniil Dzenhaliou,
Daniel Fernando Erazo Florez, Fabian Farestam, Joseph Marvin Imperial, Shayekh Bin Is-
lam, Perttu Isotalo, Maral Jabbarishiviari, Börje F. Karlsson, Eldar Khalilov, Christopher
Klamm, Fajri Koto, Dominik Krzemiński, Gabriel Adriano de Melo, Syrielle Montariol, Yiyang
Nan, Joel Niklaus, Jekaterina Novikova, Johan Samir Obando Ceron, Debjit Paul, Esther
Ploeger, Jebish Purbey, Swati Rajwal, Selvan Sunitha Ravi, Sara Rydell, Roshan Santhosh,
Drishti Sharma, Marjana Prifti Skenduli, Arshia Soltani Moakhar, Bardia Soltani Moakhar,
Ran Tamir, Ayush Kumar Tarun, Azmine Toushik Wasi, Thenuka Ovin Weerasinghe, Serhan
Yilmaz, Mike Zhang, Imanol Schlag, Marzieh Fadaee, Sara Hooker, and Antoine Bosselut.
Include: Evaluating multilingual language understanding with regional knowledge, 2024. URL
https://arxiv.org/abs/2411.19799.
David Romero, Chenyang Lyu, Haryo Akbarianto Wibowo, Teresa Lynn, Injy Hamed,
Aditya Nanda Kishore, Aishik Mandal, Alina Dragonetti, Artem Abzaliev, Atnafu Lambebo
Tonja, et al. Cvqa: Culturally-diverse multilingual visual question answering benchmark. 38th
Conference on Neural Information Processing Systems (NeurIPS 2024) Track on Datasets and
Benchmarks, 2024.
Sheikh Shafayat, H Hasan, Minhajur Mahim, Rifki Putri, James Thorne, and Alice Oh. Benqa:
A question answering benchmark for bengali and english. In ACL Findings, 2024.
Luísa Shimabucoro, Sebastian Ruder, Julia Kreutzer, Marzieh Fadaee, and Sara Hooker. LLM
see, LLM do: Leveraging active inheritance to target non-differentiable objectives. In Yaser
Al-Onaizan, Mohit Bansal, and Yun-Nung Chen (eds.), Proceedings of the 2024 Conference
on Empirical Methods in Natural Language Processing, pp. 9243–9267, Miami, Florida, USA,
November 2024. Association for Computational Linguistics. doi: 10.18653/v1/2024.emnlp-m
ain.521. URL https://aclanthology.org/2024.emnlp-main.521.
Shivalika Singh, Freddie Vargus, Daniel D’souza, Börje Karlsson, Abinaya Mahendiran, Wei-
Yin Ko, Herumb Shandilya, Jay Patel, Deividas Mataciunas, Laura O’Mahony, Mike
Zhang, Ramith Hettiarachchi, Joseph Wilson, Marina Machado, Luisa Moura, Dominik
Krzemiński, Hakimeh Fadaei, Irem Ergun, Ifeoma Okoh, Aisha Alaagib, Oshan Mudannayake,
Zaid Alyafeai, Vu Chien, Sebastian Ruder, Surya Guthikonda, Emad Alghamdi, Sebastian
Gehrmann, Niklas Muennighoff, Max Bartolo, Julia Kreutzer, Ahmet Üstün, Marzieh Fadaee,
and Sara Hooker. Aya dataset: An open-access collection for multilingual instruction tun-
ing. In Lun-Wei Ku, Andre Martins, and Vivek Srikumar (eds.), Proceedings of the 62nd
Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers),
pp. 11521–11567, Bangkok, Thailand, August 2024. Association for Computational Linguistics.
doi: 10.18653/v1/2024.acl-long.620. URL https://aclanthology.org/2024.acl-long.620.
Guijin Son, Hanwool Lee, Sungdong Kim, Seungone Kim, Niklas Muennighoff, Taekyoon Choi,
Cheonbok Park, Kang Min Yoo, and Stella Biderman. Kmmlu: Measuring massive multitask
language understanding in korean, 2024a. URL https://arxiv.org/abs/2402.11548.
Guijin Son, Hanwool Lee, Suwan Kim, Huiseo Kim, Jaecheol Lee, Je Won Yeom, Jihyu Jung,
Jung Woo Kim, and Songseong Kim. Hae-rae bench: Evaluation of korean knowledge in
language models, 2024b. URL https://arxiv.org/abs/2309.02706.
Mirac Suzgun, Nathan Scales, Nathanael Schärli, Sebastian Gehrmann, Yi Tay, Hyung Won
Chung, Aakanksha Chowdhery, Quoc V Le, Ed H Chi, Denny Zhou, et al. Challenging big-
bench tasks and whether chain-of-thought can solve them. arXiv preprint arXiv:2210.09261,
2022.
Saeid Asgari Taghanaki, Aliasgahr Khani, and Amir Khasahmadi. Mmlu-pro+: Evaluating
higher-order reasoning and shortcut learning in llms, 2024. URL https://arxiv.org/abs/24
09.02257.
Eva Vanmassenhove, Dimitar Shterionov, and Matthew Gwilliam. Machine translationese: Ef-
fects of algorithmic bias on linguistic complexity in machine translation. In Paola Merlo, Jorg
Tiedemann, and Reut Tsarfaty (eds.), Proceedings of the 16th Conference of the European
Chapter of the Association for Computational Linguistics: Main Volume, pp. 2203–2213, On-
line, April 2021. Association for Computational Linguistics. doi: 10.18653/v1/2021.eacl-mai
n.188. URL https://aclanthology.org/2021.eacl-main.188.
Ashmal Vayani, Dinura Dissanayake, Hasindri Watawana, Noor Ahsan, Nevasini Sasikumar,
Omkar Thawakar, Henok Biadglign Ademtew, Yahya Hmaiti, Amandeep Kumar, Kartik Kuck-
reja, Mykola Maslych, Wafa Al Ghallabi, Mihail Mihaylov, Chao Qin, Abdelrahman M Shaker,
Mike Zhang, Mahardika Krisna Ihsani, Amiel Esplana, Monil Gokani, Shachar Mirkin, Harsh
Singh, Ashay Srivastava, Endre Hamerlik, Fathinah Asma Izzati, Fadillah Adamsyah Maani,
Sebastian Cavada, Jenny Chim, Rohit Gupta, Sanjay Manjunath, Kamila Zhumakhanova,
Feno Heriniaina Rabevohitra, Azril Amirudin, Muhammad Ridzuan, Daniya Kareem, Ke-
tan More, Kunyang Li, Pramesh Shakya, Muhammad Saad, Amirpouya Ghasemaghaei,
Amirbek Djanibekov, Dilshod Azizov, Branislava Jankovic, Naman Bhatia, Alvaro Cabr-
era, Johan Obando-Ceron, Olympiah Otieno, Fabian Farestam, Muztoba Rabbani, Sanoo-
jan Baliah, Santosh Sanjeev, Abduragim Shtanchaev, Maheen Fatima, Thao Nguyen, Am-
rin Kareem, Toluwani Aremu, Nathan Xavier, Amit Bhatkal, Hawau Toyin, Aman Chadha,
Hisham Cholakkal, Rao Muhammad Anwer, Michael Felsberg, Jorma Laaksonen, Thamar
Solorio, Monojit Choudhury, Ivan Laptev, Mubarak Shah, Salman Khan, and Fahad Khan.
All languages matter: Evaluating lmms on culturally diverse 100 languages, 2024. URL
https://arxiv.org/abs/2411.16508.
Mor Ventura, Eyal Ben-David, Anna Korhonen, and Roi Reichart. Navigating cultural chasms:
Exploring and unlocking the cultural pov of text-to-image models, 2024. URL https://arxi
v.org/abs/2310.01929.
Wenxuan Wang, Wenxiang Jiao, Jingyuan Huang, Ruyi Dai, Jen-tse Huang, Zhaopeng Tu, and
Michael R Lyu. Not all countries celebrate thanksgiving: On the cultural dominance in large
language models. arXiv preprint arXiv:2310.12481, 2023.
Yubo Wang, Xueguang Ma, Ge Zhang, Yuansheng Ni, Abhranil Chandra, Shiguang Guo, Weim-
ing Ren, Aaran Arulraj, Xuan He, Ziyan Jiang, Tianle Li, Max Ku, Kai Wang, Alex Zhuang,
Rongqi Fan, Xiang Yue, and Wenhu Chen. Mmlu-pro: A more robust and challenging multi-
task language understanding benchmark, 2024. URL https://arxiv.org/abs/2406.01574.
Xiang Yue, Yueqi Song, Akari Asai, Seungone Kim, Jean de Dieu Nyandwi, Simran Khanuja,
Anjali Kantharuban, Lintang Sutawika, Sathyanarayanan Ramamoorthy, and Graham Neu-
big. Pangea: A fully open multilingual multimodal llm for 39 languages. arXiv preprint
arXiv:2410.16153, 2024. URL https://arxiv.org/abs/2410.16153.
Arda Yüksel, Abdullatif Köksal, Lütfi Kerem Şenel, Anna Korhonen, and Hinrich Schütze.
Turkishmmlu: Measuring massive multitask language understanding in turkish, 2024. URL
https://arxiv.org/abs/2407.12402.
Marcos Zampieri, Preslav Nakov, and Yves Scherrer. Natural language processing for similar
languages, varieties, and dialects: A survey. Natural Language Engineering, 26(6):595–612,
2020.
Rowan Zellers, Ari Holtzman, Yonatan Bisk, Ali Farhadi, and Yejin Choi. Hellaswag: Can a
machine really finish your sentence? arXiv preprint arXiv:1905.07830, 2019.
Wenxuan Zhang, Sharifah Mahani Aljunied, Chang Gao, Yew Ken Chia, and Lidong Bing.
M3exam: A multilingual, multimodal, multilevel benchmark for examining large language
models, 2023a. URL https://arxiv.org/abs/2306.05179.
Xiaotian Zhang, Chunyang Li, Yi Zong, Zhengyu Ying, Liang He, and Xipeng Qiu. Evaluating the
performance of large language models on gaokao benchmark. arXiv preprint arXiv:2305.12474,
2023b.
Wanjun Zhong, Ruixiang Cui, Yiduo Guo, Yaobo Liang, Shuai Lu, Yanlin Wang, Amin Saied,
Weizhu Chen, and Nan Duan. Agieval: A human-centric benchmark for evaluating foundation
models, 2023. URL https://arxiv.org/abs/2304.06364.
Wenhao Zhu, Hongyi Liu, Qingxiu Dong, Jingjing Xu, Shujian Huang, Lingpeng Kong, Jiajun
Chen, and Lei Li. Multilingual machine translation with large language models: Empirical
results and analysis. In Kevin Duh, Helena Gomez, and Steven Bethard (eds.), Findings of
the Association for Computational Linguistics: NAACL 2024, pp. 2765–2781, Mexico City,
Mexico, June 2024. Association for Computational Linguistics. doi: 10.18653/v1/2024.findi
ngs-naacl.176. URL https://aclanthology.org/2024.findings-naacl.176.
Ahmet Üstün, Viraat Aryabumi, Zheng-Xin Yong, Wei-Yin Ko, Daniel D’souza, Gbemileke
Onilude, Neel Bhandari, Shivalika Singh, Hui-Lee Ooi, Amr Kayid, Freddie Vargus, Phil Blun-
som, Shayne Longpre, Niklas Muennighoff, Marzieh Fadaee, Julia Kreutzer, and Sara Hooker.
Aya model: An instruction finetuned open-access multilingual language model, 2024. URL
https://arxiv.org/abs/2402.07827.
A Global-MMLU Languages
In this work we will refer to groups of languages to be “lower-”, “mid-” or “higher”-resourced ac-
cording to their recorded, written, and catalogued NLP resources (Joshi et al., 2020). We group
these 5 distinct clusters following the groupings in (Singh et al., 2024) into a rough taxonomy
of lower-resourced (LR), mid-resourced (MR) and higher-resourced (HR). We note
that this grouping is inevitably imperfect; languages and their varieties cannot absolutely nor
universally be classified based on this single dimension (Hämäläinen, 2021; Bird, 2022). The cat-
egorization in our case serves the purpose of aggregation in our analysis of the data distribution.
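For concreteness, the mapping from the Joshi et al. (2020) language classes to the three buckets used here can be written as a small helper. This is a minimal sketch assuming the split given in the caption of Table 5 (low: classes 0–2, mid: class 3, high: classes 4–5); the function name is illustrative.

    # Minimal sketch: map Joshi et al. (2020) resource classes (0-5) to the
    # low/mid/high taxonomy used in this appendix (low: 0-2, mid: 3, high: 4-5).
    def resource_bucket(joshi_class: int) -> str:
        if joshi_class in (0, 1, 2):
            return "low"
        if joshi_class == 3:
            return "mid"
        if joshi_class in (4, 5):
            return "high"
        raise ValueError(f"unknown Joshi class: {joshi_class}")

    # Illustrative usage:
    assert resource_bucket(3) == "mid"
    assert resource_bucket(5) == "high"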
ISO   Language     Script       Resources  Translation
ar    Arabic       Arabic       High       human translated
bn    Bengali      Bengali      Mid        human translated
cs    Czech        Latin        High       machine + human translated
de    German       Latin        High       human translated
el    Greek        Greek        Mid        machine translated
en    English      Latin        High       machine + human translated
fil   Filipino     Latin        Mid        machine translated
fr    French       Latin        High       human translated
ha    Hausa        Latin        Low        machine translated
he    Hebrew       Hebrew       Mid        machine translated
hi    Hindi        Devanagari   High       human translated
ig    Igbo         Latin        Low        machine translated
id    Indonesian   Latin        Mid        human translated
it    Italian      Latin        High       human translated
ja    Japanese     Japanese     High       human translated
ky    Kyrgyz       Cyrillic     Low        machine translated
ko    Korean       Hangul       Mid        human translated
lt    Lithuanian   Latin        Mid        machine translated
mg    Malagasy     Latin        Low        machine translated
ms    Malay        Latin        Mid        machine + human translated
ne    Nepali       Devanagari   Low        machine translated
nl    Dutch        Latin        High       machine translated
ny    Nyanja       Latin        Low        machine translated
fa    Persian      Arabic       High       machine + human translated
pl    Polish       Latin        High       machine translated
pt    Portuguese   Latin        High       human translated
ro    Romanian     Latin        Mid        machine + human translated
ru    Russian      Cyrillic     High       machine + human translated
sin   Sinhala      Sinhala      Low        machine + human translated
sn    Shona        Latin        Low        machine translated
som   Somali       Latin        Low        machine translated
es    Spanish      Latin        High       human translated
sr    Serbian      Cyrillic     High       machine translated
sw    Swahili      Latin        Low        human translated
sv    Swedish      Latin        High       machine translated
te    Telugu       Telugu       Low        machine + human translated
tr    Turkish      Latin        High       machine + human translated
uk    Ukrainian    Cyrillic     Mid        machine + human translated
vi    Vietnamese   Latin        High       machine + human translated
yo    Yorùbá       Latin        Low        human translated
zh    Chinese      Hans         High       human translated

Table 5: 42 languages in Global-MMLU, along with each language's script and resource category.
We followed Singh et al. (2024) and categorized languages as low, mid, and high resource based
on the language classes proposed by Joshi et al. (2020) (low: [0, 1, 2], mid: [3], high: [4, 5]). For
each language, the Global-MMLU data is either fully machine translated, fully human translated,
or contains both machine and human translated data, as indicated in the last column.
Meanwhile, business ethics, management, marketing, and professional accounting fall under the
Business category. The ‘Other’ category in Global-MMLU, sometimes referred to as ‘General
Knowledge’, includes the remaining two subjects from the original MMLU ‘Other’ category:
global facts and miscellaneous.
C Global-MMLU Lite
[Figure: distribution of Global-MMLU Lite samples across subject categories — STEM, Humanities, Business, Medical, Social Sciences, and General Knowledge (Other); the plotted shares are 25.00%, 25.00%, 14.50%, 14.00%, 11.50%, and 9.50%.]
However, we aimed for a balanced distribution across subject categories. Social Science subjects
such as High School Geography and Sociology had a higher proportion of CS samples, whereas
STEM subjects such as Abstract Algebra had a higher number of CA samples. To maintain
balance, we sampled five CS and five CA samples per subject where available. A few subjects,
such as Anatomy or High School Mathematics, had only one CS sample available; for such
subjects, only one CS and one CA sample was taken. Samples from a few subjects in the Business
and Medical categories were slightly upsampled to ensure adequate representation.
The General Knowledge category, comprising only Miscellaneous and Global Facts, was also
upsampled, with 22 samples from Miscellaneous and 8 from Global Facts per category. This
adjustment ensures sufficient coverage for evaluating general knowledge capabilities. The overall
goal with Global-MMLU Lite is to have a balanced dataset for efficient multilingual evaluation
across multiple languages.
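As a rough illustration of the balancing procedure described above, the sketch below draws up to five culturally sensitive (CS) and five culturally agnostic (CA) questions per subject, falling back to a smaller matched number when a subject has fewer CS samples. It assumes a pandas DataFrame with illustrative column names ('subject', 'is_cs'); it is not the authors' exact construction script.

    import pandas as pd

    def sample_lite(df: pd.DataFrame, per_subject: int = 5, seed: int = 0) -> pd.DataFrame:
        """Draw up to `per_subject` CS and `per_subject` CA questions per subject.
        Column names ('subject', 'is_cs') are assumptions for illustration."""
        parts = []
        for _, group in df.groupby("subject"):
            cs = group[group["is_cs"]]
            ca = group[~group["is_cs"]]
            # Keep the CS/CA split balanced even when few CS samples exist.
            n = min(per_subject, len(cs), len(ca))
            if n == 0:
                continue
            parts.append(cs.sample(n=n, random_state=seed))
            parts.append(ca.sample(n=n, random_state=seed))
        return pd.concat(parts, ignore_index=True)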
D Temporal Knowledge
As part of the annotation process, annotators were also asked to label samples for temporal or
time-sensitive knowledge. This applies to questions where the correct answer may change over
time due to factors such as current political leaders or economic statistics. Figure 14 shows the
distribution of time-sensitive samples in MMLU Annotated. Overall, only 2.4% of the dataset
is tagged as time-sensitive, and the majority of these samples fall under the Social Sciences,
Humanities, Medical, and Other categories. STEM is the only category with no time-sensitive
samples at all.
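The overall share and the per-category breakdown behind Figure 14 can be computed in a few lines; the sketch below assumes a DataFrame with illustrative 'category' and 'is_time_sensitive' columns rather than the released schema.

    import pandas as pd

    def time_sensitivity_stats(df: pd.DataFrame):
        """Return the overall percentage of time-sensitive samples and the
        percentage of time-sensitive samples within each subject category."""
        overall = df["is_time_sensitive"].mean() * 100
        per_category = df.groupby("category")["is_time_sensitive"].mean() * 100
        return overall, per_category.sort_values(ascending=False)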
Figure 14: Distribution of time-sensitive samples across subject categories. Note that STEM
subjects do not include any temporal knowledge.
E Models Covered
Details of each model are described below:
• Aya Expanse [17] is a family of models that includes 8B [18] and 32B [19] parameter variants.
Aya Expanse models support 23 languages including Arabic, Chinese (simplified & traditional),
Czech, Dutch, English, French, German, Greek, Hebrew, Hindi, Indonesian, Italian, Japanese,
Korean, Persian, Polish, Portuguese, Romanian, Russian, Spanish, Turkish, Ukrainian, and
Vietnamese. Aya Expanse builds on the Aya initiative, which includes multilingual-first releases
such as Aya 101 (Üstün et al., 2024) and Aya 23 (Aryabumi et al., 2024) as well as extensive
multilingual datasets such as the Aya collection (Singh et al., 2024).
• Command R and R+ are open-weight models of size 34B [20] and 104B [21], respectively,
which both support 10 languages: English, French, Spanish, Italian, German, Brazilian
Portuguese, Japanese, Korean, Arabic, and Simplified Chinese. We use Command-R 08-2024
and Command-R+ 08-2024 for evaluation.
[17] https://hf.co/blog/aya-expanse
[18] https://hf.co/CohereForAI/aya-expanse-8b
[19] https://hf.co/CohereForAI/aya-expanse-32b
[20] https://hf.co/CohereForAI/c4ai-command-r-08-2024
[21] https://hf.co/CohereForAI/c4ai-command-r-plus-08-2024
• Gemma2 (Gemma Team et al., 2024) is part of the Gemma model family. The languages
targeted are not explicitly reported. We evaluate the instruct-tuned 9B (gemma-2-9b-it)
and 27B (gemma-2-27b-it) variants.
• Llama 3.1 (Dubey et al., 2024) is a series of open LLMs that comes in three sizes: 8B, 70B,
and 405B parameters. All variants support 8 languages, including English, German, French,
Italian, Portuguese, Hindi, Spanish, and Thai. We use Llama-3.1-8B-Instruct and
Llama-3.1-70B-Instruct for evaluation.
• Mistral Nemo [25] is a 12B model which supports 11 languages, including English, French,
German, Spanish, Italian, Portuguese, Chinese, Japanese, Korean, Arabic, and Hindi.
• Qwen 2.5 [26] models support up to 29 languages, including Chinese, English, French, Spanish,
and Portuguese. We evaluate the Qwen2.5-7B-Instruct and Qwen2.5-32B-Instruct variants of
Qwen 2.5.
• GPT-4o (Hurst et al., 2024) is a multilingual, multimodal closed model and is part of the
GPT-4 family. The languages targeted are not explicitly reported.
• Claude Sonnet 3.5 is also a multilingual, multimodal closed model from the Claude 3.5
family. The languages supported by this model are likewise not publicly reported.
F Additional Results
F.1 Model Rank Changes
Table 6 presents the rank changes and corresponding position shifts (indicated next to the ar-
rows) for high-resource languages, while Table 7 provides similar data for mid- and low-resource
languages. The rightmost columns in each table summarize the total number of models that
changed ranks (Total Rank Change) and the total number of position shifts in the rankings
(Total Position Change). A detailed analysis of these results is provided in Section 4.2.
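As a sketch of how the rank changes and position shifts reported in Tables 6 and 7 can be derived, the snippet below compares per-model accuracies on the MA reference set against a comparison set (CA or CS). It is an illustrative reconstruction, not the exact evaluation script.

    def rank_models(scores: dict[str, float]) -> dict[str, int]:
        """Rank models by accuracy (1 = best)."""
        ordered = sorted(scores, key=scores.get, reverse=True)
        return {model: i + 1 for i, model in enumerate(ordered)}

    def rank_changes(reference: dict[str, float], comparison: dict[str, float]):
        """Per-model rank shifts between the reference (MA) and comparison
        (CA or CS) sets, plus the two totals shown in Tables 6 and 7."""
        ref_ranks, cmp_ranks = rank_models(reference), rank_models(comparison)
        shifts = {m: ref_ranks[m] - cmp_ranks[m] for m in reference}  # > 0: moved up
        total_rank_change = sum(1 for s in shifts.values() if s != 0)
        total_position_change = sum(abs(s) for s in shifts.values())
        return shifts, total_rank_change, total_position_change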
[22] https://hf.co/aisingapore/gemma2-9b-cpt-sea-lionv3-instruct
[23] An acronym for Southeast Asian Languages in One Network.
[24] https://github.com/aisingapore/sealion
[25] https://hf.co/mistralai/Mistral-Nemo-Instruct-2407
[26] https://huggingface.co/collections/Qwen/qwen25-66e81a666513e518adb90d9e
Columns of Table 6: Language, Dataset, one column of rank-change arrows for each of the 14
evaluated models (GPT4o, CommandR, Qwen2.5 7B, Gemma2 9B, Aya Expanse 8B, Qwen2.5 32B,
Llama-3.1 8B, Gemma2 27B, Mistral Nemo, CommandR+, SEA-LION-v3, Aya Expanse 32B,
Claude Sonnet, Llama-3.1 70B); the final two values in each row give the total rank change and
the total position change.
- - - - - - - - - ↑1 - ↓1 - - 2 2
Arabic
- ↑1 - - - ↓1 - ↑1 - - ↓1 - - - 4 4
- - ↓1 - ↑1 - - - - - ↑1 - ↓1 - 4 4
Chinese
↑1 ↑1 ↑1 ↑2 ↑1 - ↓1 ↑1 - ↓3 ↓1 ↓2 ↑1 ↓1 12 16
- - - - - - - ↓1 - - ↑1 - - - 2 2
Czech
↑2 ↓1 - ↑3 - ↓1 ↓2 - - - ↓1 - - - 6 10
- - - - - - - - - - - - - - 0 0
Dutch
- - - ↑1 ↑2 ↓1 - ↑1 - ↓2 ↓1 - - - 6 8
- - - - - ↓1 - - - ↑1 ↑1 - ↓1 - 4 4
English
- ↑1 - - - - - ↑1 - ↓1 ↓1 - - - 4 4
- ↑1 - - - - - - - ↓1 - - - - 2 2
French
- ↑2 ↑2 ↑1 - ↓2 - ↑1 - ↓3 ↓1 ↑1 - - 8 13
- ↓1 - ↓1 - ↑1 - - - ↑1 - - - - 4 4
German
- - ↓1 - ↑2 - - ↑1 - ↓3 ↓1 ↑2 - - 6 10
- ↑1 ↓2 ↓1 ↑1 - - - - - - ↑1 - - 5 6
Hindi
↑1 ↓1 ↑1 ↑2 - ↓1 ↑1 - ↑1 ↓3 ↓1 - ↑1 ↓1 11 14
- - - - - - - - - - - - - - 0 0
Italian
- - ↑1 ↑1 - ↓1 - ↑1 - ↓2 ↓1 ↑1 - - 7 8
- - - - - - - - - - - - - - 0 0
Japanese
- ↑1 ↑1 ↑1 ↑1 ↓2 - ↑1 - ↓1 ↓1 ↓1 - - 9 10
↑1 ↑1 - ↓1 - - ↑1 - ↓2 - - - - - 5 6
Persian
- - - ↑2 ↑1 ↓2 - - ↑1 ↑1 ↓1 ↑1 - - 7 9
↑2 ↑1 ↑2 ↓1 ↓1 - ↓1 - ↓1 ↑2 - ↓1 - - 9 12
Polish
- - ↑2 ↑2 - ↓1 - ↑1 ↑1 ↓1 ↓1 - - - 7 9
- - - - - - - - - - - - - - 0 0
Portuguese
- ↑1 ↑1 ↑1 ↑1 ↓1 - ↑1 - ↓2 ↓1 ↓1 - - 9 10
- ↓1 ↓1 ↓1 ↑1 - - - - ↑2 - - - - 5 6
Russian
↑1 - - ↑2 ↓1 ↓1 ↑1 - - ↓2 ↓1 ↑3 - - 8 12
- ↓1 - ↑1 - - - ↓1 - ↓1 ↑1 - - - 5 5
Serbian
- ↑2 ↑1 ↓1 ↑1 - - - - - - - - - 4 5
- ↓1 - ↓1 - ↑1 - - - ↑1 - - - - 4 4
Spanish
- - ↑1 - ↑2 - - ↑1 - ↓3 ↓1 - - - 5 8
- ↓1 - ↓1 - ↑1 - - - ↓1 - - - - 4 4
Swedish
- - ↑1 - ↑2 - - ↑1 - ↓3 ↓1 - - - 5 8
- - - ↓1 ↑1 - ↓1 - - ↑1 - - - 4 4
Turkish
- ↑2 ↓1 ↑1 - ↓1 - - - - ↓2 - - - 5 7
- - - ↓1 - ↑1 ↓1 - - ↑1 - - - - - 4 4
Vietnamese
- ↓1 ↑3 - - ↓1 ↓1 - ↑1 ↓1 - - - - 6 8
Table 6: Model rankings with the MA (MMLU Annotated) rank as the reference for high-resource
languages. For each language, the first row indicates changes in CA ranks and the second row
shows changes in CS ranks relative to MA. Arrows mark increases (↑) and decreases (↓).
Columns of Table 7: Language, Dataset, one column of rank-change arrows for each of the 14
evaluated models (GPT4o, CommandR, Qwen2.5 7B, Gemma2 9B, Aya Expanse 8B, Qwen2.5 32B,
Llama-3.1 8B, Gemma2 27B, Mistral Nemo, CommandR+, SEA-LION-v3, Aya Expanse 32B,
Claude Sonnet, Llama-3.1 70B); the final two values in each row give the total rank change and
the total position change.
- ↑1 - - - - - ↓1 ↓1 - - - - - 3 3
Bengali
- - - - - - - - ↑1 ↓1 - - - - 2 2
- - - - - - - - - - - - - - 0 0
Filipino
- - - - - ↑1 ↑1 - ↓1 - ↓1 - ↓1 ↑1 6 6
↓1 ↓1 - - - ↑1 - - ↓1 ↑2 - - - - 5 6
Greek
- - ↑2 ↑3 - ↓1 ↑1 - - ↓1 ↓4 - - - 6 12
↓1 ↑1 - ↓1 - - - - ↑1 - - - - - 4 4
Hebrew
- ↑2 - ↑2 - ↓2 - - - - ↓2 - - - 4 8
- - ↓1 ↓1 ↓1 ↑1 - - - ↑2 - - - - 5 6
Indonesian
- - ↑1 - - - ↓1 ↑1 ↑1 - ↓1 ↓1 - - 6 6
↓1 ↓1 ↓1 - - ↑1 ↑1 - - ↑1 - - - - 6 6
Korean
- ↑1 ↑1 ↓1 - ↓1 - ↑1 - - ↓1 - - - 6 6
- - - - ↓1 - - ↓1 - ↑1 ↑1 - - - 4 4
Malay
- ↑1 ↑1 ↓1 - - - - - ↓1 - - - - 4 4
- - - - - - - - - - - - - - 0 0
Lithuanian
- - - ↑2 - - - - - - - ↓2 - - 2 4
- ↑1 - ↓1 - - ↑1 ↓1 ↓1 - ↑1 - - - 6 6
Romanian
- - - ↑2 - ↓1 - - - - - - - - 2 3
- ↑1 - ↓1 ↓1 - - - - ↑1 - - - - 4 4
Ukrainian
- ↑1 - ↑1 - ↓2 - ↑1 ↑1 ↓1 ↓1 - ↑1 ↓1 9 10
- - ↓1 ↑1 ↓1 - - - - - ↑1 - - - 4 4
Amharic
↓1 ↑2 ↑2 ↓1 - - - - ↑1 ↓3 - - - - 6 10
- - - - - - - - - - - - - - 0 0
Hausa
↑1 ↓1 ↑3 ↓1 - - ↓1 - ↓1 ↓1 ↑1 - - - 8 10
- - ↓1 - - - ↓1 - - ↑1 ↑1 - - - 4 4
Igbo
- ↑1 - - ↑1 - - - ↑2 ↓3 - ↓1 - - 5 8
- - - - - ↓1 - - - - ↑1 - - - 2 2
Kyrgyz
- ↓1 ↑1 ↑1 - - ↑1 - - ↓2 - - - - 5 6
- ↓1 - - - - - - - ↑1 - - - - 2 2
Malagasy
- ↑1 ↑4 ↑1 - - ↓1 - ↑1 ↓1 ↓5 - - - 7 14
- - - - - - - - ↓1 ↑1 - - - - 2 2
Nepali
- - - - - ↑1 ↓1 - ↑1 - ↓1 - ↑1 ↓1 6 6
- - - ↓1 ↓1 - - - - - ↑2 - ↑1 ↓1 5 6
Nyanja
- ↓1 ↑1 - - - - - - - - - - - 2 2
- - - - ↓1 - - - - - ↑1 - ↓1 ↑1 4 4
Shona
↑2 - ↑1 ↑1 - - ↑1 - - ↓4 ↓1 - - - 6 10
- ↑1 - - - - ↓3 - - ↑2 - - - - 3 6
Sinhala
- ↓1 ↑1 ↑1 - - - - - ↓1 - - - - 4 4
- ↓2 - ↑1 - - - - - ↑1 - - - - 3 4
Somali
- ↑1 ↑2 ↓2 - - ↑2 - - ↓2 ↓1 - ↓1 ↑1 8 12
- ↓1 - - - - ↑1 - - - - - - - 2 2
Swahili
- - ↑1 - - - ↓1 - - - - - ↓1 ↑1 4 4
- ↓1 - - - - - - - ↑1 ↑1 ↓1 - - 4 4
Telugu
- ↓1 ↑2 ↑1 ↑1 - ↑1 - ↓1 ↓2 ↓1 - - - 8 10
- ↑1 ↓2 - ↓1 - - - - ↑2 ↑1 ↓1 - - 6 8
Yoruba
- ↓1 ↑1 ↑1 ↑1 - - - - - ↓2 - - - 5 6
Table 7: Model rankings with the MA (MMLU Annotated) rank as the reference for mid- and
low-resource languages. For each language, the first row indicates changes in CA ranks and the
second row shows changes in CS ranks relative to MA. Arrows mark increases (↑) and decreases (↓).
F.2 Subject-level Performance
Figure 15 illustrates the performance of the Aya Expanse 32B model across various subjects, with
an average accuracy of 66.4%. Notably, most STEM subjects fall below this average, whereas
the majority of Social Sciences and Humanities subjects exceed it.
Figure 16: Relationship between Western and Asia cultures and region tags.
Figure 17: Relationship between culture and country tags, focusing on Latin American and
Indigenous cultures.
H Annotation Process
Communication. For both annotation tasks, annotators were briefed by one of the authors
in a virtual introduction session and were able to ask questions and raise issues throughout
the annotation task in a Discord channel. For both tasks, they were also encouraged to share
frequent error patterns or artifacts that they observed throughout the tasks with the authors and
capture difficult decisions and their rationales in comments for individual ratings. Similarly, they
discussed ambiguous cases and questions. This helped calibrate annotations across annotators
and languages.

Figure 18: Relationship between region and country tags, focusing on North America, Europe
and Africa regions.

Figure 19: Relationship between region and country tags, focusing on Asia, South America and
Australia.
Schedule. Each of the annotation tasks was conducted as a 2–3 week sprint in collaboration
with contributors from the community. There was no fixed time schedule for the annotations,
and annotators contributed varying hours depending on their availability and speed.
For the cultural sensitivity evaluation task, 100% of the selected samples were labeled, whereas
for the translation quality evaluation task, 37% of the provided samples were fully reviewed and
12.3% of the samples were edited in total.
Interface. The annotation interface for both tasks was built using Argilla [27]. Argilla is an
open-source tool that can be used for data labeling. Using Argilla's Python SDK, it was quick
and easy to set up an annotation interface that could be deployed on Hugging Face Spaces. We
also set up SSO so annotators could log in and easily access the UI using their Hugging Face
accounts.
[27] https://argilla.io/
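As an illustration, an interface of this kind can be defined programmatically; the sketch below assumes the Argilla 1.x FeedbackDataset API, and the dataset, field, and question names are illustrative rather than the exact configuration used for Global-MMLU.

    import argilla as rg

    # Connect to a deployed Argilla instance (e.g. a Hugging Face Space).
    rg.init(api_url="https://<your-space>.hf.space", api_key="<api-key>")

    dataset = rg.FeedbackDataset(
        fields=[
            rg.TextField(name="english_question"),
            rg.TextField(name="translated_question"),
        ],
        questions=[
            rg.LabelQuestion(name="translation_quality",
                             labels=["acceptable", "needs_edit"]),
            rg.TextQuestion(name="edited_translation", required=False),
        ],
    )

    dataset.add_records([
        rg.FeedbackRecord(fields={
            "english_question": "What is the capital of France?",
            "translated_question": "Quelle est la capitale de la France ?",
        })
    ])
    dataset.push_to_argilla(name="translation-review", workspace="annotators")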
For cultural sensitivity evaluation, annotators were shown questions one by one from each of the
57 MMLU subjects and were asked to analyze and label the questions for the presence of cultural,
geographic, dialect, or regional knowledge, as explained in Section 2.1 and shown in Figure 20.
As shown in Figure 21, for translation quality evaluation, annotators were shown the translated
question and corresponding options in their chosen language on the UI. Annotators were also
shown the original question and answer options in English for reference. If the translation was
good in quality and correctly represented the original English text, annotators could mark it as
acceptable and proceed to the next question; otherwise, they could edit the provided translation
to improve its quality.
Figure 21: Translation evaluation annotation interface.
[Figure 22 plots annotator demographics by gender (Male, Female, Prefer not to say), age group (Under 18, 18-24, 25-34, 35-44, 45-54, 55-64), and continent (Asia, Europe, North America, Africa, Oceania, South America); the panel shown covers the cultural sensitivity evaluation task.]
Figure 22: Demographics of annotators who registered using our annotation interface for cultural
sensitivity as well as translation quality evaluation.
We report inter-annotator agreement using Krippendorff's Alpha in Figures 23 and 24.
Krippendorff's Alpha values range between -1 and 1, where 1 denotes that all annotators agree
unanimously and -1 denotes that the annotators make exactly opposite ratings. We observe a
reasonable amount of disagreement on Moral Scenarios samples for both the cultural sensitivity
and the time-sensitivity annotations. Twelve subjects have complete unanimous agreement
between annotators for the time-sensitivity annotations.
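A per-subject agreement score can be computed as sketched below, assuming the third-party krippendorff package; the ratings matrix has one row per annotator and one column per sample, with NaN marking samples an annotator did not label.

    import numpy as np
    import krippendorff

    def subject_alpha(ratings: np.ndarray) -> float:
        """Krippendorff's Alpha for nominal labels; `ratings` has shape
        (n_annotators, n_samples), with np.nan marking missing ratings."""
        return krippendorff.alpha(reliability_data=ratings,
                                  level_of_measurement="nominal")

    # Illustrative toy example: 3 annotators, 4 samples, binary labels.
    toy = np.array([[1, 0, 1, np.nan],
                    [1, 0, 1, 0],
                    [1, 1, 1, 0]], dtype=float)
    print(subject_alpha(toy))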
Figure 23: Krippendorff's Alpha scores measuring annotator agreement on the presence of cultural
or regional knowledge in samples.
Figure 24: Krippendorff's Alpha scores measuring annotator agreement on the time-sensitive
nature of samples.
I Translation Analysis
I.1 Translation Quality
Figure 7 shows the translation quality comparison between Google Translate, which is used to
translate Global-MMLU, and GPT-3.5-turbo, which was used to translate the multilingual MMLU
released by Lai et al. (2023). We see that Google Translate is significantly better across different
MMLU subject categories. For this analysis, we considered samples from the MMMLU dataset [28]
as the human reference and only considered languages that overlap between the two machine-
translated sets and the human-translated MMMLU.
The results reveal that the Humanities category exhibits the largest edit distances, with higher
values observed for questions compared to answers.
Given that longer text may inherently require more edits, we hypothesized that the observed
large edit distances could be influenced by the length of the questions and answers. To account
for this, we analyzed the length of each question-answer pair and computed the Normalized Edit
Distance (NED), where the edit distance is divided by the text length, shown in Figure 26.
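A minimal sketch of this computation is shown below: a character-level Levenshtein distance (Levenshtein, 1966) divided by the text length, here taken as the length of the human-edited reference (an assumption; the exact normalization length is not specified above).

    def levenshtein(a: str, b: str) -> int:
        """Character-level edit distance between two strings."""
        prev = list(range(len(b) + 1))
        for i, ca in enumerate(a, start=1):
            curr = [i]
            for j, cb in enumerate(b, start=1):
                curr.append(min(prev[j] + 1,                # deletion
                                curr[j - 1] + 1,            # insertion
                                prev[j - 1] + (ca != cb)))  # substitution
            prev = curr
        return prev[-1]

    def normalized_edit_distance(machine_text: str, reference_text: str) -> float:
        """Edit distance divided by the reference length."""
        return levenshtein(machine_text, reference_text) / max(len(reference_text), 1)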
The analysis reveals that questions in the Humanities category have the greatest average length,
whereas answers in the STEM category exhibit the highest NED. These findings suggest that
while raw edit distances are influenced by text length, normalized measures provide additional
insights into the complexity of edits across categories.
[28] https://openai.com/index/openai-o1-system-card/
Figure 25: Average edit distance across different subject categories in MMLU. Each sample
comprises a question-and-answer pair, with the left column showing edit distances for questions
and the right column for answers.
Figure 26: (Top) Average normalized edit distance and (Bottom) average question and answer
lengths across different subject categories. The left column represents questions, while the right
column represents answers.
J MMLU Annotated Examples
Subject: US Hist. (HS)
Question: This question refers to the following information:
“Some men look at constitutions with sanctimonious reverence, and deem them like the ark of the
covenant, too sacred to be touched. They ascribe to the men of the preceding age a wisdom more
than human, and suppose what they did to be beyond amendment . . . . But I know also, that laws
and institutions must go hand in hand with the progress of the human mind. As that becomes more
developed, more enlightened, as new discoveries are made, new truths disclosed, and manners and
opinions change with the change of circumstances, institutions must advance also, and keep pace
with the times.”
—Thomas Jefferson, 1816
Which of the following best describes a contributing factor in the crafting of the United States
Constitution?
Choices:
(A) Individual state constitutions written at the time of the Revolution tended to cede too much
power to the federal government, leading to a call for reform on the part of Anti-Federalists.
(B) The weaknesses of the Articles of Confederation led James Madison to question their efficacy
and prompted a formation of the Constitutional Congress in 1787.
(C) Difficulties over trade and foreign relations led to a repeal of overly restrictive tariffs required
by the Articles of Confederation.
(D) Washington’s embarrassing failure at the Whiskey Rebellion led to Federalist demands for a
new framework for federal power.

Subject: Accounting (Pro)
Question: Under the Sales Article of the UCC, which of the following circumstances best describes
how the implied warranty of fitness for a particular purpose arises in a sale of goods transaction?
Choices:
(A) The buyer is purchasing the goods for a particular purpose and is relying on the seller’s skill
or judgment to select suitable goods.
Subject: Prehistory
Question: What is the name of the lithic technology seen in the Arctic and consisting of
wedge-shaped cores, micro-blades, bifacial knives, and burins?
Choices:
(A) Clovis Complex
(B) Denali Complex

Subject: Geography (HS)
Question: Which of the following is MOST likely to experience population pressure?
Choices:
(A) An industrial society with abundant natural resources and large imports of food
(C) A non-ecumene
Examples requiring cultural knowledge:
(B) Awareness of traditional arts: For instance, the unique styles and techniques of Indigenous
Australian art, often featuring dot painting and storytelling.
(C) References to liberal/conservative attitudes: We can’t assume the notion of liberal is specific
to a certain culture or region, but it inevitably involves social values and culture.
Examples not requiring cultural knowledge:
(B) Principles from the social sciences: The principle of social exchange, which posits that social
behavior is the result of an exchange process, is used worldwide.
(C) Standardized international sports: The rules and practices of soccer (football) are consistent
worldwide.
(D) Math questions which do not rely on local references: For example, the formula for the radius
of a circle.
Geographical
Examples requiring regional knowledge:
(A) Natural Landmark Identification: Recognizing and knowing the significance of regional natural
wonders like the Grand Canyon in the Southwestern United States or the Great Barrier Reef in
Australia.
(B) Environmental Awareness: Understanding the impact and importance of regional weather
patterns, such as the monsoons in South Asian regions or the hurricanes in the Caribbean.
(C) Historical Event Memory: Knowledge of region-specific historical occurrences, such as the
Gold Rush in California during the 1850s, which transformed the region’s economy and
demographics.
Examples not requiring regional knowledge:
(A) Global Climate Patterns: Understanding El Niño and La Niña weather phenomena, which
occur worldwide and are not specific to any single region.
(B) Universal Celestial Bodies: The Sun and the Moon are visible worldwide and do not possess
regional specificity.
(C) Standardized Geography Terms: Understanding the definition of a peninsula or archipelago
is applicable to geographic features globally, not tied to regional knowledge.
Dialect
Examples requiring dialect knowledge:
(A) Regional slang: Using the word “wicked” to mean “very good” in parts of New England, USA.
Using the phrase “boot of the car” to mean “trunk” in the UK.
(B) Unique idiomatic expressions: The phrase “Bob’s your uncle” in British English, meaning
“there you have it” or “that’s all there is to it.”
(C) Knowledge of social greetings: The customary handshake and verbal greeting of “Konnichiwa”
when meeting someone in Japanese culture.
Examples not requiring dialect knowledge:
(A) Standardized technical jargon: Medical or legal terminology used internationally within
professional fields.
(B) Formal literary language: The writings of Shakespeare or Dickens utilize sophisticated
language but are not tied to specific dialects.
(C) Global brand names: Companies like Nike or Adidas use consistent branding worldwide,
avoiding regional vocabulary.
logical_fallacies Fallacies
machine_learning ML
management Management
marketing Marketing
medical_genetics Genetics
miscellaneous Misc.
moral_disputes Disputes
moral_scenarios Moral Scenarios
nutrition Nutrition
philosophy Philosophy
prehistory Prehistory
professional_accounting Accounting (Pro)
professional_law Law (Pro)
professional_medicine Medicine (Pro)
professional_psychology Psychology (Pro)
public_relations Public Rel.
security_studies Security
sociology Sociology
us_foreign_policy US Foreign Policy
virology Virology
world_religions World Religions
Table 10: Short names used in Figures 3, 6, 23, and 24 for the MMLU subjects proposed by
Hendrycks et al. (2020).