Thanks to visit codestin.com
Credit goes to www.scribd.com

0% found this document useful (0 votes)
9 views29 pages

Hello 5

The document introduces L AZY R EVIEW, a dataset designed to identify lazy thinking in NLP peer reviews, highlighting the issue of reviewers using superficial heuristics due to cognitive overload. The dataset consists of 500 expert-annotated and 1276 silver-annotated review segments, demonstrating that instruction-based fine-tuning on this data significantly enhances the performance of Large Language Models in detecting lazy thinking. The findings emphasize the importance of high-quality training data and improved guidelines to foster better peer review practices in the NLP community.

Uploaded by

dakshmanchanda30
Copyright
© © All Rights Reserved
We take content rights seriously. If you suspect this is your content, claim it here.
Available Formats
Download as PDF, TXT or read online on Scribd
0% found this document useful (0 votes)
9 views29 pages

Hello 5

The document introduces L AZY R EVIEW, a dataset designed to identify lazy thinking in NLP peer reviews, highlighting the issue of reviewers using superficial heuristics due to cognitive overload. The dataset consists of 500 expert-annotated and 1276 silver-annotated review segments, demonstrating that instruction-based fine-tuning on this data significantly enhances the performance of Large Language Models in detecting lazy thinking. The findings emphasize the importance of high-quality training data and improved guidelines to foster better peer review practices in the NLP community.

Uploaded by

dakshmanchanda30
Copyright
© © All Rights Reserved
We take content rights seriously. If you suspect this is your content, claim it here.
Available Formats
Download as PDF, TXT or read online on Scribd
You are on page 1/ 29

L AZY R EVIEW

A Dataset for Uncovering Lazy Thinking in NLP Peer Reviews


Sukannya Purkayastha1 , Zhuang Li2 , Anne Lauscher3 , Lizhen Qu4 , Iryna Gurevych1
1
Ubiquitous Knowledge Processing Lab,
Department of Computer Science and Hessian Center for AI (hessian.AI),
Technical University of Darmstadt
2
School of Computing Technologies, Royal Melbourne Institute of Technology, Australia
3
Data Science Group, University of Hamburg
4
Department of Data Science & AI, Monash University, Australia
www.ukp.tu-darmstadt.de
Abstract [Although the proposed approach does bring WSD
improvements, it is rather incremental. This is probably
not a ""weakness"" per se, just that the paper is not an
Peer review is a cornerstone of quality control eye-opener]. [The evaluation is done on German data
in scientific publishing. With the increasing
arXiv:2504.11042v1 [cs.CL] 15 Apr 2025

only, which leaves some doubts about other languages.]


workload, the unintended use of ‘quick’ heuris- Figure 1: Illustration of lazy thinking in ARR-22 re-
tics, referred to as lazy thinking, has emerged views sourced from NLP EER (Dycke et al., 2023). The
as a recurring issue compromising review qual- first review segment belongs to the class ‘The results are
ity. Automated methods to detect such heuris- not novel.’ The last segment pertains to, ‘The approach
tics can help improve the peer-reviewing pro- is tested only on [not English], so unclear if it will gen-
cess. However, there is limited NLP research eralize to other languages.’ as per ARR-22 guidelines.
on this issue, and no real-world dataset exists
to support the development of detection tools. With the growing load of paper submissions,
This work introduces L AZY R EVIEW, a dataset reviewers often face an overwhelming work-
of peer-review sentences annotated with fine-
load (Landhuis, 2016; Künzli et al., 2022) to assess
grained lazy thinking categories. Our analysis
reveals that Large Language Models (LLMs) multiple manuscripts quickly. When presented with
struggle to detect these instances in a zero- a cognitively hard task (e.g., reviewing) coupled
shot setting. However, instruction-based fine- with information overload and limited time, hu-
tuning on our dataset significantly boosts per- mans often end up using simple decision rules, also
formance by 10-20 performance points, high- known as heuristics (Tversky et al., 1974). Though
lighting the importance of high-quality train- simple and efficient, these heuristics can often lead
ing data. Furthermore, a controlled experi- to errors and unfair evaluations (Raue and Scholl,
ment demonstrates that reviews revised with
2018). The usage of such heuristics to dismiss
lazy thinking feedback are more comprehen-
sive and actionable than those written without research papers in the Natural Language Process-
such feedback. We will release our dataset and ing (NLP) Community has been termed as lazy
the enhanced guidelines that can be used to thinking (Rogers and Augenstein, 2021). One such
train junior reviewers in the community.1 example is shown in Fig 1. Here, the reviewer dis-
misses the paper in the first review segment for not
1 Introduction being an “eye-opener”. However, they do not pro-
vide any references regarding similar prior work or
Peer Reviewing is widely regarded as one of the feedback to improve the paper. This corresponds to
most effective ways to assess the quality of sci- the lazy thinking class ‘The results are not novel’.
entific papers (Ware and Mabe, 2009). It is a
In 2021, when ACL Rolling Review (ARR) was
distributed procedure where the experts (review-
adopted as one of the main reviewing platforms
ers) independently evaluate whether a submitted
for major NLP conferences, these heuristics were
manuscript adheres to the standards of the field.
added to the guidelines (Rogers and Augenstein,
With the Mathew effect (Merton, 1968) in science
2021), aiming to discourage reviewers from rely-
(“rich get richer”), where the researchers receive
ing on such approaches.2 However, in their ACL
benefits throughout their career for having papers
2023 report, Rogers et al. (2023) identified the us-
at prestigious venues, it is of utmost importance to
age of these heuristics as one of the leading factors
ensure sound practices in the reviewing process.
(24.3%) of author-reported issues in peer reviews.
1
Code available here: https://github.com/UKPLab/
2
arxiv2025-lazy-review https://aclrollingreview.org/
Therefore, in this work, we focus on this pertinent ing on our dataset enhances model performance.
issue, lazy thinking in NLP peer reviewing and heed (5) Find that annotated lazy thinking classes im-
Kuznetsov et al. (2024)’s call for automated meth- prove review quality.
ods to signal such occurrences in order to improve
the overall reviewing quality. 2 L AZY R EVIEW: A Dataset for detecting
To have a finer look at the problem of lazy think- Lazy Thinking in Peer Reviews
ing in NLP peer reviews, we introduce L AZY R E - In line with Rogers and Augenstein (2021), we first
VIEW , a dataset with 500 expert-annotated review define lazy thinking as follows.
segments and 1276 silver annotated review seg-
Definition. ‘Lazy thinking, in the context of NLP
ments from the best-performing model tagged with
research paper reviews, refers to the practice of
lazy thinking classes with a review segment con-
dismissing or criticizing research papers based
sisting of 1 or more review sentences. We develop
on superficial heuristics or preconceived notions
this dataset over three rounds of annotation and in-
rather than thorough analysis. It is characterized
crementally sharpen the guidelines guided by inter-
by reviewers raising concerns that lack substantial
annotator agreements. We further provide positive
supporting evidence and are often influenced by
examples, i.e., annotated examples for each class,
prevailing trends within the NLP community.’
to enhance the understanding of annotators for this
task and reach annotation consensus sooner. An- Rogers and Augenstein (2021) enlist a total of 14
notating review segments by a new batch of an- types of lazy thinking heuristics in the ARR 2021
notators who were not involved in developing the guidelines adapted from the study in Rogers and
guidelines resulted in Cohen’s κ (Landis and Koch, Augenstein (2020). We show some of the classes
1977) values of 0.32, 0.36, and 0.48, respectively. as described in the guidelines in Table 1.3 Re-
The steady increase in agreement across rounds jus- iterating the example in Fig 1, we note that such
tifies the effectiveness of the developed guidelines. claims about novelty need to be backed up with
With this dataset, we further test the zero-shot supporting evidence and hence constitute a classic
capabilities of LLMs in identifying lazy thinking, example of lazy thinking as per the guidelines.
which can be leveraged downstream to identify In this section, we describe the creation of
rule-following behaviour (Sun et al., 2024b) in peer- our dataset, L AZY R EVIEW guided by the ARR
review scenarios. Despite likely exposure to peer 2022 (Rogers and Augenstein, 2021) and the
reviews via platforms like OpenReview during pre- EMNLP 2020 guidelines (Liu et al., 2020). We
training, LLMs struggle to accurately identify the describe the dataset curation process followed by
type of lazy thinking when presented with the cur- an analysis of the dataset. This is the first dataset of
rent guidelines. To enhance their comprehension annotated instances with fine-grained lazy thinking
of lazy thinking, we employ instruction-tuning on classes for NLP peer reviews.
the LLMs using our dataset, leading to significant 2.1 Review Collection and Sampling
performance gains—around 10-20 accuracy points
compared to their zero-shot and few-shot perfor- We use the ARR 2022 reviews from the NLP EER
mances. Finally, we perform a controlled experi- Dataset (Dycke et al., 2023) in which the reviews
ment where human reviewers rewrite peer reviews are collected using the 3Y-pipeline (Dycke et al.,
with(out) using annotations of lazy thinking from 2022) and as such has clear licenses attached.
our dataset. Human preference-based evaluations The dataset comprises reviews from five venues:
reveal that reviews written with the lazy thinking CONLL 2016, ACL 2017, COLING 2020, ARR
guidance are more comprehensive and actionable 2022, and F1000RD (an open-source science jour-
than their vanilla counterparts. nal), with 11K review reports sourced from 5K
Contributions. We make the following contribu- papers. Focusing on the lazy thinking definition in
tions: (1) Introduce L AZY R EVIEW, a dataset an- the ARR 2022 guidelines (Rogers and Augenstein,
notated with lazy thinking classes for a new task 2021), we consider only the ARR-22 reviews, using
in model development and evaluation. (2) En- 684 reviews from 476 papers with 11,245 sentences
hance annotation guidelines to improve both au- in total. Each review is divided into sections like
tomated and human annotations. (3) Demonstrate ‘paper summary’, ‘summary of strengths’, ‘sum-
that positive examples boost annotation quality and mary of weaknesses’, and ‘comments, typos, and
3
in-context learning. (4) Show that instruction tun- The full table is provided in Table 7 in Appendix §A.1
Heuristics Description Example review segments
The results are not surprising Many findings seem obvious in ret- Transfer learning does not look
rospect, but this does not mean that to bring significant improvements.
the community is already aware of Looking at the variance, the results
them and can use them as building with and without transfer learning
blocks for future work. overlap. This is not surprising.
The results are not novel Such broad claims need to be The approach the authors propose
backed up with references. is still useful but not very novel.
The paper has language errors As long as the writing is clear The paper would be easy to fol-
enough, better scientific content low with English proofreading even
should be more valuable than better though the overall idea is under-
journalistic skills. standable.
Table 1: Descriptions for some of the lazy thinking classes sourced from ARR 2022 guidelines (Rogers and
Augenstein, 2021). We present some examples corresponding to these classes from our dataset, L AZY R EVIEW.
suggestions’, with reviewers completing the rele- confidence levels: low, medium, or high. To aid in
vant sections based on their evaluation. However, understanding, we added two additional classes:
there is no standardized format for expressing con- ‘None’ for no lazy thinking, and ‘Not Enough
cerns, so we use automated methods to extract rele- Information’ for instances lacking specificity or
vant review segments, as detailed later. needing the paper for proper classification.
Two Ph.D. students, both fluent in English and
2.2 Review Formatting experienced in NLP peer reviewing, are tasked
We utilize GPT-4 (OpenAI et al., 2024) to prepare with annotating review segments iteratively, given
the instances (review segments) for subsequent an- that peer reviewing requires specialized expertise.
notations. We instruct GPT-4 to extract the review A senior PostDoc with an NLP background acts
segments from the ‘Summary of Weaknesses’ sec- as a third annotator to resolve any disagreements
tion of the review that can be classified into one of between the initial annotations. The guidelines
the lazy thinking classes, as outlined in the ARR evolve over multiple rounds. Once the guidelines
guidelines4 . Since lazy thinking often contributes are refined and finalized, we recruit a new batch
to reasons for rejecting a paper, we believe it will of annotators to re-validate them by asking them
appear only in the ‘Summary of Weaknesses’ sec- to annotate the same review segments. These new
tion. We obtain 1,776 review segments after this annotators follow the same set of guidelines used in
step, with each segment having varied lengths from the earlier rounds to ensure consistency. After this
1 to 5 review sentences, as described later in Sec 2.5. validation process, the original Ph.D. annotators
To validate the quality of the extracted segments, are retained to annotate additional instances.
we sample a set of 100 segments from this pool
and task three annotators to independently annotate 2.4 Evolving Guidelines
each segment within the context of the entire review We incrementally improve the guidelines guided
to decide on their candidacy towards lazy thinking. by the inter-annotator agreements. Given the high
The final label is chosen based on a majority vote. subjectivity of this domain, we consider the guide-
We obtain Cohen’s κ of 0.82 and obtain precision, lines to be precise once we achieve a substantial
recall and F1 scores as 0.74, 1.00 and 0.85, respec- agreement on annotating the instances further.
tively. Intuitively, this means that GPT-4 samples Round 1: ARR 2022 Guidelines. In this round,
more candidates than required, giving us a larger the annotators are provided with the existing
pool of candidates for annotation. We therefore ARR (Rogers and Augenstein, 2021) guidelines
introduce another class to our labelling schema, and asked to annotate the lazy thinking classes. We
‘None’ to annotate non-lazy thinking candidates. sample a set of 50 instances from the review seg-
2.3 Annotation Protocol ments extracted by GPT-4 as described in Sec §2.1.
After the first round of annotation, we obtain a
Annotators are given the full review and the target
Cohen’s kappa, κ of 0.31. This is a substantially
review segment (highlighted) to classify according
low agreement, revealing an ambiguity in the re-
to incrementally developed lazy thinking guide-
viewing guidelines. The confusion matrix for the
lines based on ARR 2022. They also indicate their
first round of annotation is shown in Fig 16a (cf.
4
The prompt for GPT-4 is provided in Appendix §A.4 §A.9). We find that there is a high agreement for
the ‘None’ class, which implies that the annotators Resource Paper
Not SOTA
can easily detect a review segment that is not prob- Simple Method
lematic. However, they struggle to determine the No Precedence
Results Negative
fine-grained label of the problematic review seg- Tested on Language X
ments. Further analysis of the confidence level of Method Not Preferred
Contradict Expectation
the annotators reveals that for most of the cases, Should do X
the annotators have low confidence, as shown in Extra Experiment
Not surprising
Fig 17a (cf. §A.9), which points towards ambiguity Missing Comparison
in the guidelines. Not Novel
Language Errors
Round 2: ARR 2022 and EMNLP guidelines. Niche Topics
We further explored the initial reviewing guidelines 0 10 20 30 40 50 60 70 80

released during EMNLP 2020 (Liu et al., 2020). Figure 2: Distribution of lazy thinking labels in our
We identify some additional classes that are dataset, L AZY R EVIEW.
now missing from the ARR guidelines, namely ing a Cohen’s κ of 0.86. For round 3, we sampled
‘Non-mainstream Approaches’ (rejecting papers 50 new review segments and included annotated
for not using current trending approaches) and examples from the previous round, resulting in a
‘Resource Paper’ (favoring resource papers lesser Cohen’s κ of 0.52, indicating substantial agreement
for ACL conferences). Additionally, we extend despite the task’s subjectivity, as shown in Fig 16c
descriptions of some of the classes such as, ‘This (cf. §A.9). We also obtain higher annotator confi-
has no precedent in existing literature’, ‘The dence levels as shown in Fig 17c (cf. §A.9).5
method is too simple’ using the guidelines in Liu Final Validation. To validate the guidelines de-
et al. (2020). Moreover, we extend the name of veloped over three rounds, we hired a new group
some of the class labels based on the EMNLP of English-speaking PhD students with NLP peer
2020 guidelines, such as ‘The paper has language review expertise to annotate the same set of in-
errors’ with ‘Writing Style’; ‘The topic is too stances. These annotators were not involved in
niche’ with ‘Narrow topics’, which have similar the development of the guidelines. Re-annotation
meanings. We show the extended descriptions for across rounds 1, 2, and 3 resulted in inter-annotator
those classes in Table 8 (cf. Appendix §A.1). We agreement (κ) values of 0.32, 0.36, and 0.48, re-
annotate the same set of instances as in Round 1 spectively. This steady increase in κ aligns with
and eventually calculate agreements. We obtain previous results of 0.31, 0.38, and 0.52, validating
Cohen’s κ of 0.38, which is significantly higher the iterative improvement of the guidelines based
than the previous round (0.31). We observe on stronger inter-annotator agreement. Agreement
higher agreements for the classes having extended scores of 0.4 or higher are considered substantial
names such as, ‘The paper has language errors’ due to the inherent subjectivity of the task and are
and ‘Niche Topics’, as illustrated in Fig 16b (cf. consistent with agreement levels reported in previ-
§A.9). The confidence level of the annotators also ous studies within the peer-reviewing domain (Ken-
substantially increased from low to medium in this nard et al., 2022; Purkayastha et al., 2023). We
round, as can be seen in Fig 17b (cf. §A.9). retain the annotators from the initial rounds who
Round 3: Round 2 guidelines with positive exam- developed the guidelines to annotate a total of 500
ples. To promote quick learning through “worked review segments as a part of our dataset, L AZY R E -
examples" (Atkinson et al., 2000), we refine our VIEW . The overall annotation time amounted to 20
annotation round by fixing the guidelines and incor- hours. This corresponds to 25 minutes per exam-
porating positive examples for each lazy thinking ple, ranging from less than a minute for annotating
class. We evaluated several techniques to select shorter segments up to half an hour for longer ones.
representative examples, measuring effectiveness
using Cohen’s κ. The methods include (a) Random:
selecting the shortest or longest review segments, 2.5 Dataset Analysis
and (b) Diversity: encoding segments with SciB- Our dataset comprises 500 expert-annotated review
ERT (Beltagy et al., 2019), clustering them using segments tagged with 18 classes. Out of all the
K-Means, and choosing the cluster center with the labels, 16 classes in our dataset correspond to ex-
lowest cosine distance. After pairwise comparisons,
5
the random shortest method was preferred, achiev- The examples used in this round are in Table 9 in §A.1
plicit lazy thinking. The distribution of our dataset stance of lazy thinking or not. For fine-grained
is shown in Fig 2 corresponding to these 16 classes. one, the model selects the best-fitting lazy thinking
The most frequent lazy thinking is ‘Extra Experi- class from the provided options. We test two input
ments’, where the reviewers ask the authors to con- types: the target segment (T) alone, or the combi-
duct additional experiments without proper justifi- nation of the review and target segment (RT).7
cation. This trend reflects the current expectations Metrics. To evaluate LLM outputs, we use both
in the NLP field, which has rapidly shifted towards strict and relaxed measures because LLMs some-
machine learning models and methods that often times do not produce exact label phrases. The
emphasize extensive empirical evaluations (Guru- strict measure, as defined by Helwe et al. (2024),
raja et al., 2023). The next most frequent classes uses regular expressions to check for matches be-
are ‘Not Enough Novelty’ and ‘Language Errors’. tween the gold label and predictions, reporting ac-
The ACL review report does not state the individual curacy and macro-F1 (string matching). The re-
distribution of these classes but constitutes around laxed measure employs GPT-3.5 to judge whether
24.3% of all reported issues. We further analyze the provided answer is semantically equivalent to
the sentence length of the review segments within the ground-truth class, outputting a “yes” or “no”
these classes. This is illustrated in Fig 18 of §A.13. decision. This approach follows previous work on
We observe that most of these review segments evaluating free-form LLM outputs (Wadden et al.,
have a length of 1 sentence, underscoring the use 2024; Holtermann et al., 2024), and reports accu-
of shorter arguments to dismiss papers. The lazy racy based on the number of “yes” responses. Both
thinking class ‘Extra Experiment’ is the most com- metrics determine whether predictions are correct
mon with variable segment lengths. or incorrect. We also performed an alignment study
on 50 responses from every model to validate the
3 Experiments reliability of the evaluators. We find that the string-
matching-based evaluator underestimates the cor-
We use the L AZY R EVIEW dataset to assess the
rect predictions, whereas the GPT-based evaluator
performance of various open-source LLMs in de-
overestimates them, rendering a correct balance of
tecting lazy thinking in NLP peer reviews.
lower and upper bounds for the model predictions.8
Experimental Setup RQ1: How effective are the improved guidelines
Tasks We propose two formulations for detecting in enhancing zero-shot performance of LLMs?
lazy thinking in peer reviews: (i) Coarse-grained Since the first two rounds of our annotation study
classification: a binary task to determine if an input are dedicated to fixing the guidelines for annota-
segment x is lazy thinking, and (ii) Fine-grained tions, we evaluate the understanding of the LLMs
classification: a multi-class task to classify x into on the same 50 instances on both annotation rounds,
one of the specific lazy thinking classes, ci ∈ C. i.e., rounds 1 and 2, respectively. This validates
Models. Since the guidelines for NLP conferences whether the improved guidelines actually influence
are mainly instructions, we explore various open- the zero-shot performance of LLMs.
source instruction-tuned LLMs for this task. We Modelling Approach. We prompt the LLMs for
experiment with the chat versions of LLaMa-2 round 1, with the definition of lazy thinking classes,
7B and 13B (Touvron et al., 2023) (abbr. LLaMa, as shown in Table 1, representing the existing ARR
LLaMaL ), Mistral 7B instruct (Jiang et al., 2023) guidelines. For round 2, we prompt the LLMs with
(abbr. Mistral), Qwen-Chat 1.5 7B (Bai et al., 2023) the new guidelines as described in Sec §A.1 where
(abbr. Qwen), Yi-1.5 6B instruct (AI et al., 2024) we added new classes and extended descriptions of
(abbr. Yi-1.5), Gemma-1.1 7B instruct (Team et al., the existing classes in the ARR guidelines using
2024) (abbr. Gemma), and SciTülu 7B (Wadden the EMNLP 2020 Blog (Liu et al., 2020).
et al., 2024) (abbr. SciTülu).6 Results. We present results for fine-grained and
Prompts. Following our annotation protocol, we coarse-grained classification of zero-shot LLMs
prompt the LLMs with guidelines and instructions in Table 2.9 Using only the target segment (T)
for each round. For coarse-grained classification, as input generally outperforms using both the tar-
the model determines whether the input is an in-
7
Full details are in Appendix §A.6.
6 8
The justification for the choice of these models along with Details about the study are in Appendix §A.5.
9
implementation details are in Appendix §A.2 Full results in Tables 10 and 11 of Appendix §A.11.
Fine-gr. Coarse-gr. Fine-gr. Coarse-gr.
Models
Models
R1 R2 R1 R2 S.A G.A S.A G.A
S.A G.A S.A G.A S.A G.A S.A G.A R ANDOM 2.46 - 43.3 -
R ANDOM 7.11 - 4.34 - 40.7 - 40.7 - M AJORITY 5.11 - 52.3 -
M AJORITY 11.1 - 7.34 - 51.4 - 51.4 - Gemma + TE 24.45.5 41.18.9 75.620.0 88.931.7
Gemma + T 22.2 52.2 26.7 58.1 44.3 51.1 46.1 54.4 Gemma + RTE 17.32.9 32.80.2 71.15.5 82.216.6
Gemma + RT 12.2 46.7 11.6 51.1 48.1 47.4 50.4 49.1 LLaMa + TE 15.64.5 38.93.3 84.44.4 89.13.0
LLaMa + T 12.2 15.6 22.2 30.6 57.7 70.0 60.0 75.0 LLaMa + RTE 14.22.0 30.81.9 75.32.0 81.14.4
LLaMa + RT 12.2 25.6 13.2 33.7 53.3 55.1 60.0 67.7 LLaMaL + TE 24.413.3 41.15.5 73.11.9 71.110.0
LLaMaL + T 26.7 44.4 26.7 45.3 60.2 73.1 62.2 75.4 LLaMaL + RTE 18.88.1 34.42.2 70.31.5 61.19.9
LLaMaL + RT 15.6 41.1 17.6 40.4 68.6 69.4 70.2 70.2
Mistral + TE 30.01.8 55.61.2 86.712.3 86.711.5
Mistral + T 27.8 47.8 28.8 51.1 57.8 64.8 58.8 66.3
Mistral + RTE 27.85.6 52.21.1 68.86.6 68.86.6
Mistral + RT 12.2 28.9 16.6 35.9 55.4 53.8 57.4 56.0
Qwen + TE 31.12.2 56.412.0 86.74.0 86.74.0
Qwen + T 21.1 46.7 22.7 50.0 68.9 74.1 70.4 76.1
Qwen + RT 12.2 43.3 13.3 42.6 53.3 53.3 56.5 56.5 Qwen + RTE 27.81.1 44.20.9 62.22.0 62.22.0
Yi-1.5 + T 35.3 56.7 37.6 60.0 64.4 71.1 68.7 73.4 Yi-1.5 + TE 30.03.3 54.91.1 74.53.2 73.81.5
Yi-1.5 + RT 34.4 51.1 32.8 52.2 63.3 65.1 68.3 70.4 Yi-1.5 + RTE 24.43.2 52.71.4 70.12.0 72.42.3
SciTülu + T 14.4 18.1 25.3 29.4 57.8 57.8 58.3 58.3 SciTülu + TE 23.31.1 44.82.6 72.221.1 72.220.0
SciTülu + RT 15.6 17.3 18.3 23.7 55.6 55.6 58.7 58.7 SciTülu + RTE 19.70.8 41.12.2 88.817.7 91.122.3

Table 2: LLM performance across annotation rounds in Table 3: Performance of LLMs for round 3 in terms
terms of string-matching (S.A) and GPT-based (G.A) of the metrics used in Table 2 for fine-grained (Fine-
accuracy for fine-grained (Fine-gr.) and coarse-grained gr.) and coarse-grained (Coarse-gr.) tasks. ‘E’ denotes
(Coarse-gr.) tasks. ‘T’ uses only the target sentence, adding in-context exemplars to input types: ‘T’ (target
‘RT’ combines review and target. R1 and R2 represent sentence) and ‘RT’ (review + target sentence). Red:
‘Round 1’ and ‘Round 2’. Increments from R1 to R2 are Increments with exemplars. Subscripts represent incre-
highlighted in red; decreases or no change in gray. ments as compared to the zero-short versions.
get segment and review (RT) across most models, grained and coarse-grained classification (cf. §A.7),
likely due to spurious correlations from longer in- finding that the static strategy outperforms all other
puts, as noted by Feng et al. (2023). All models setups, especially for Mistral and Qwen. Increasing
improve from round 1 (R1) to round 2 (R2), with exemplars does not enhance performance, support-
Gemma gaining 4.5 points in string accuracy (S.A). ing Srivastava et al. (2024) that random examples
SciTülu, however, benefits from the broader con- yield better results than similarity-based methods.
text of RT, possibly due to its pre-training on long- This also can be attributed to the ability of LLMs to
context tasks. Coarse-grained classification scores learn the format of the task from exemplars rather
are consistently higher than fine-grained, showing than learning to perform the task (Lu et al., 2024).
LLMs can serve as basic lazy thinking detectors. We adopt a static in-context learning method using
1 exemplar for the rest of the paper.
RQ2: How effective are positive examples in We show the results using 1 static example in
improving the performance of LLMs? Table 3. ICL-based methods surpass zero-shot mod-
To test the effect of using positive examples on els, with notable gains in coarse classification (20
the performance of LLMs, we leverage in-context points for Gemma, 21 points for SciTülu string-
learning to test the models with the same 50 in- based accuracy (S.A.)). We also obtain positive
stances as used in round 3 of our annotation study. increments across the board in fine-grained clas-
Modelling Approach. We prompt LLMs using pos- sifications. SciTülu excels in the coarse-grained
itive examples from previous rounds and Round 2 classification, possibly due to the domain specific
guidelines, combining target segments (TE) and the pre-training. Qwen leads in fine-grained classifi-
combination of review and target segments (RTE) cation, likely due to the multi-lingual pre-training
as in-context learning examples. Since LLMs have data and high-quality data filtering performed for
a fixed context window, we employ two setups for the pre-training phase (Bai et al., 2023).
selecting examples: (i) Static selection of fixed ran-
dom examples for all inferences, and (ii) Dynamic RQ3: How effective is instruction-tuning in
selection on per test case basis using following improving the performance of LLMs?
methods: (a) BM25: uses BM25 to select exam-
Building on previous work in aligning LLMs with
ples, (b) Top K: uses embedding space similarity,
new tasks (Ivison et al., 2023; Wadden et al., 2024),
and (c) Vote-K: penalizing already selected exam-
we apply instruction-based fine-tuning for lazy-
ples. We tested with 1, 2, and 3 exemplars.10
thinking detection. Using a bottom-up approach,
Results. We plot results in Fig 3 and 4 for fine- we determine the data requirements using 3-fold
10
We leverage OpenICL (Wu et al., 2023) for retrieving cross-validations for maximum performance on a
exemplars and GPT2-XL as the embedding generator. small validation set. We then identify optimal data
Fine-gr. Coarse-gr.
mixes for the test set. We subsequently use a simi- Models
R1 R2 R3 R1 R2 R3
lar mix to train the models and then compare their
R AND . 7.11 4.34 2.46 40.7 40.7 43.3
performance to their non-instruction tuned coun- M AJ . 11.1 7.34 5.11 51.4 51.4 52.3
Gemma 22.2 26.7 24.4 48.1 50.4 75.6
terparts from the previous annotation rounds and it + T 31.49.2 38.812.1 34.610.2 57.89.7 61.410.0 81.25.6
obtain silver annotations from the best model. it + RT 28.26.0 35.79.0 32.88.4 55.67.5 59.49.0 78.83.2
LLaMa 12.2 22.2 15.6 57.7 60.0 84.4
Modelling Approach. We apply instruction-based it + T 43.831.6 47.825.6 44.729.1 62.75.0 65.45.0 85.41.0
finetuning for the same models as detailed in it + RT 43.231.0 45.323.1 41.826.2 61.23.5 63.13.1 81.33.1
LLaMaL 26.7 26.7 24.4 68.6 70.2 73.1
Sec §3. To optimize our limited computational it + T 45.819.1 47.821.1 50.526.1 74.35.7 74.64.4 75.32.2
resources, we employ LoRa (Hu et al., 2022) it + RT 41.214.5 45.218.5 47.322.9 70.21.6 71.81.8 73.30.2
Mistral 27.8 28.8 30.0 57.8 58.8 86.7
for parameter-efficient finetuning and utilize open- it + T 35.47.6 37.48.6 42.412.4 60.22.4 62.23.4 86.40.3
instruct (Wang et al., 2023).11 We use the same it + RT
Qwen
31.23.4
21.1
35.26.4
22.7
37.87.8
31.1
65.37.5
68.9
68.29.4
70.4
88.21.5
86.7
prompt (cf. §3) as instruction along with the round it + T 45.924.8 48.425.7 59.428.3 75.46.5 76.35.9 88.41.7
it + RT 41.220.1 42.419.7 47.816.7 73.24.3 74.13.7 86.30.4
2 guidelines (combination of guidelines of ARR Yi-1.5 35.3 37.6 30.0 64.4 68.7 74.5
and EMNLP blog) while performing this experi- it + T 45.19.8 47.810.2 47.917.9 69.55.1 74.25.5 78.43.9
it + RT 43.27.9 45.37.7 46.316.3 67.22.8 69.40.7 73.21.3
ment. We use multiple data sources to create data SciTülu 15.6 25.3 23.3 57.8 58.3 88.8
mixes: (i) T ÜLU V2 (Ivison et al., 2023) with it + T 45.730.1 48.623.3 54.331.0 66.38.5 68.410.1 91.22.4
it + RT 41.425.8 42.617.3 51.428.1 62.44.6 65.67.3 87.21.6
326,154 samples, (ii) S CI RIFF (Wadden et al.,
Table 4: Performance of LLMs after instruction tuning
2024) with 154K demonstrations across 54 tasks, (it) for fine-grained classification using target segment
and (iii) L AZY R EVIEW with 1000 samples evenly (T) and the combination of review and target segment
split between coarse and fine-grained classifica- (RT) in terms of string-matching accuracy (St. (Acc)).
tion. L AZY R EVIEW is divided into 70% training The first row of each model states the best results ob-
(700 examples), 10% validation (100 examples), tained previously as detailed in Tables 2 and 3 respec-
and 20% testing (200 samples) in a 3-fold cross- tively. Subscripts represent increment or decrement
validation setup. We train on 700 examples in the compared to the non-instruction tuned versions.
N O M IX setting. We use an equal number of in- multilingual training on 2.4T tokens (Bai et al.,
stances (700) from each data source to balance the 2023) compared to Gemma’s English-only train-
other mixtures. We create S CI RIFF M IX (1400 ing (Team et al., 2024). Mistral and Yi perform
samples) and T ÜLU M IX (1400 samples) by com- best with the full dataset mix, with Yi surpass-
bining L AZY R EVIEW with the other datasets and a ing Mistral (35.7 vs. 26.8 S.A.), possibly due
F ULL M IX (2100 samples) integrating all three. We to its larger vocabulary size. Qwen and Yi lead
test the performance on the same cross-validated in fine-grained classification, while SciTülu ex-
test sets for each mix. We use a proportion of 0.3 cels in coarse classification, consistent with earlier
for the ‘T’ (using only target segment) setup and findings. Instruction tuning’s effectiveness relies
0.7 for the ‘RT’ setup (using a combination of re- on data composition; blending general-purpose in-
view and target segment) from the different mixes struction data from T ÜLU with scientific data from
to optimize performance and data efficiency based S CI R I FF optimizes LLM performance (Shi et al.,
on a hyper-parameter search on different validation 2023) on this task. However, including tasks from
sets.12 all sources (F ULL M IX) can occassionally under-
Results. We compare instruction-tuned models perform, likely due to negative task transfer (Jang
to their zero and few-shot counterparts for fine- et al., 2023; Wadden et al., 2024).
grained and coarse-grained classification using 3- We train models using optimal data mixes from
fold cross-validation, as shown in Tables 12 and our cross-validation experiments to obtain silver
13 (cf. §A.10). Instruction tuning significantly en- annotations for the rest of the 1,276 review seg-
hances model performance. The LLaMa models ments. We test the performance on the annotated in-
and SciTülu excel with the S CI RIFF M IX, ben- stances from previous rounds (details in Appendix
efiting from pre-training on S CI RIFF and T ÜLU. §A.8). We ensure no leakage between the annota-
Gemma and Qwen achieve their best results with tion rounds and the training data. The instruction-
the T ÜLU M IX, with Qwen outperforming Gemma tuned models’ results for fine-grained classification
(45.5 vs. 31.6 S.A. accuracy), likely due to Qwen’s (Table 4), demonstrate significant improvements
11 over zero-shot and few-shot baselines (best results)
https://github.com/allenai/open-instruct
12
Details on hyper-parameters are in §A.8, with tuning from previous rounds (full results in Tables 10 and
methods described in §A.8.3. 11). Qwen achieves the best performance, with
Constr. Justi. Adh.
gains up to 31 pp. for SciTülu. Coarse-grained clas- Type
W/T/L W/T/L W/T/L
sification also shows positive increments, though Orig. vs lazy 85/5/10 85/10/5 90/5/5
Orig w. gdl vs lazy 70/5/25 70/5/25 75/5/20
improvements are smaller (Table 4). This high-
lights the effectiveness of instruction-tuning in Table 5: Pair-wise comparison of rewrites based on Win
(W), Tie (T), and Loss (L) rates across metrics. The first
guiding models towards accurate outputs (Wad-
row compares lazy thinking rewrites (lazy) with original
den et al., 2024; Ivison et al., 2023). We use the reviews (Orig), while the second compares lazy rewrites
instruction-tuned Qwen model to label the rest of with guideline-based rewrites (Orig w. gdl).
the 1,276 review segments and release these silver
approach resulted in more constructive and justified
annotations as a part of our dataset. Detailed analy-
feedback. We also train a Bradley-Terry preference
ses on this portion are provided in Appendix §A.14.
ranking model (Bradley and Terry, 1952) on adher-
ence preference data, which reveals strengths of 1.6
RQ4: How effective are lazy thinking guidelines
for lazy thinking rewrites, -1.5 for original reviews,
in improving review quality?
and 0.4 for guideline-based rewrites. This trans-
Setup. To improve review quality by addressing lates to a 95.6% win rate for lazy thinking rewrites
lazy thinking, we focus on assessing peer reviews over original texts and 76.8% over guideline-based
created with and without lazy thinking annotations. rewrites, further confirming the effectiveness of
In a controlled experiment, we form two treatment these annotations. We obtain Krippendorff’s α
groups, each with two PhD students experienced scores of 0.62, 0.68, and 0.72 for constructiveness,
in NLP peer reviewing. We sample 50 review re- justified, and adherence respectively.
ports from NLP EER for which we have explicit
lazy thinking annotations. One group rewrites 4 Related Work
these reviews based on the current ARR guide-
lines (Rogers and Augenstein, 2021), while the Rule Following. Rules are essential for everyday
other group rewrites the same reviews using the reasoning (Geva et al., 2021; Wang et al., 2024),
same guidelines with our lazy thinking annotations with much research focusing on evaluating rule-
included. Following previous studies (Yuan et al., following behavior in generated outputs. Prior
2022; Sun et al., 2024a), we evaluate the reviews studies have examined inferential logic in question-
based on Constructiveness, ensuring actionable answering (Wang et al., 2022a,b; Weir et al., 2024;
feedback is provided, and Justified, requiring clear Sun et al., 2024b) and detecting forbidden tokens in
reasoning for arguments. Additionally, we intro- red-teaming (Mu et al., 2024). However, identify-
duce Adherence to assess how well the reviews ing rule adherence in peer-review reports requires
align with the given guidelines. A senior PostDoc more abstract reasoning than in typical rule-based
and a PhD compare the reviews from both groups scenarios which makes our work more challenging.
and the original reviews pairwise. We split the re- Review Quality Analysis. As research outputs con-
views into equal portions (25 reviews each) while tinue to rise, interest in assessing review quality
keeping 10 reviews for calculating agreements.13 has grown significantly within the academic com-
Results. The ‘win-tie-loss’ results in Table 5 show munity (Kuznetsov et al., 2024). Previous studies
that reviews using lazy thinking signals outperform have examined various aspects of review reports,
original reviews across all measures, with 90% win including the importance of politeness and profes-
rates for adherence and 85% for constructiveness sional etiquette (Bharti et al., 2023, 2024), thor-
and justification. This suggests that lazy thinking oughness (Severin et al.; Guo et al., 2023), and
annotations help to provide actionable, evidence- comprehensiveness (Yuan et al., 2022). There have
backed feedback. When compared to guideline- been efforts to automate the peer-reviewing pro-
based rewrites (75% for adherence and 70% for cess using large language models (Du et al., 2024;
constructiveness and justified), the win rates are Zhou et al., 2024). However, there has been no fo-
lower. This is likely because reviewers often exer- cus on analyzing the usage of heuristics within the
cised greater caution when rewriting reviews based reviews. This work focuses on lazy thinking, a com-
on guidelines by shifting concerns from the weak- mon issue affecting the quality of peer reviews in
ness section to the comments section to soften crit- Natural Language Processing (NLP). While Rogers
icism rather than fully rewriting the reviews. This and Augenstein (2020) have qualitatively analyzed
heuristics in NLP conferences, we are the first to
13
Annotation instructions are provided in Appendix §A.12 present a dedicated dataset for identifying such
practices and to propose automated methods for de- the lazy thinking framework presents a compelling
tecting low-quality reviews, thereby enhancing the avenue for future research.
effectiveness of the peer-review process in NLP.
Ethics Statement
5 Conclusion
Our dataset, L AZY R EVIEW will be released under
Aiming to address the pressing issue of lazy think- the CC-BY-NC 4.0 licenses. The underlying ARR
ing in review reports, we have presented L AZY R E - reviews have been collected from NLP EER (Dycke
VIEW , a novel resource for detecting lazy thinking et al., 2023), which is also licensed under the CC-
in peer reviews. We conduct extensive experiments BY-NC-SA 4.0 license. The analysis and automatic
and establish strong baselines for this task. Addi- annotation do not require processing any personal
tionally, our controlled experiments demonstrate or sensitive information. We prioritize the mental
that providing explicit lazy thinking signals to re- health of the annotators and ensure that each an-
viewers can significantly improve review quality. notator takes a break every hour or whenever they
We hope our work will inspire further efforts to feel unwell.
improve the overall quality of peer reviews.
Acknowledgements
Limitations
This work has been funded by the German Re-
In this work, we introduce L AZY R EVIEW, a novel search Foundation (DFG) as part of the Research
resource for detecting lazy thinking in NLP peer Training Group KRITIS No. GRK 2222, along
reviews. While the resource is aimed at promoting with the German Federal Ministry of Education
sound reviewing practices and enhancing overall and Research and the Hessian Ministry of Higher
review quality, it has some limitations. First, we Education, Research, Science and the Arts, within
adopt the definition and categories of lazy think- their joint support of the National Research Center
ing from the ARR guidelines (Rogers and Au- for Applied Cybersecurity ATHENE. We gratefully
genstein, 2021) and the EMNLP Blog (Liu et al., acknowledge the support of Microsoft with a grant
2020), which are specific to NLP conference re- for access to OpenAI GPT models via the Azure
views. Therefore, the resource does not encom- cloud (Accelerate Foundation Model Academic Re-
pass peer-reviewing venues beyond ARR, and it search). The work of Anne Lauscher is funded un-
would need adaptation to suit other domains with der the Excellence Strategy of the German Federal
differing definitions of lazy thinking. Second, from Government and the Federal States.
a task and modeling perspective, we focus on the We want to thank Viet Pham, Yu Liu, Hao Yang,
under-explored area of detecting lazy thinking in and Haolan Zhan for their help with the annotation
peer reviews. Although we conduct thorough ex- efforts. We are grateful to Qian Ruan and Gholam-
periments and analyses, the findings may not be reza (Reza) Haffari for their initial feedback on a
fully generalizable to other classification tasks in draft of this paper.
the scientific domain, so results should be inter-
preted with caution. Additionally, ACL ARR guide-
lines currently characterize lazy thinking specif- References
ically within the weakness section of a review;
01. AI, :, Alex Young, Bei Chen, Chao Li, Chen-
future research could explore how such patterns
gen Huang, Ge Zhang, Guanwei Zhang, Heng Li,
manifest in other parts of the review as well. In Jiangcheng Zhu, Jianqun Chen, Jing Chang, Kaidong
addition, lazy thinking patterns may extend into Yu, Peng Liu, Qiang Liu, Shawn Yue, Senbin Yang,
subsequent author–reviewer discussions. How- Shiming Yang, Tao Yu, Wen Xie, Wenhao Huang,
ever, as our starting dataset, NLP EER, does not Xiaohui Hu, Xiaoyi Ren, Xinyao Niu, Pengcheng
Nie, Yuchi Xu, Yudong Liu, Yue Wang, Yuxuan Cai,
include these interactions, we refrain from explor- Zhenyu Gu, Zhiyuan Liu, and Zonghong Dai. 2024.
ing this aspect, leaving it as a potential direction Yi: Open foundation models by 01.ai. ArXiv preprint.
for future investigation. Furthermore, this study fo-
cuses exclusively on reviews authored prior to the Robert K. Atkinson, Sharon J. Derry, Alexander Renkl,
and Donald Wortham. 2000. Learning from exam-
widespread adoption of large language models (i.e., ples: Instructional principles from the worked ex-
before 2023). As AI-generated reviews become in- amples research. Review of Educational Research,
creasingly prevalent, exploring their impact within 70(2):181–214.
Jinze Bai, Shuai Bai, Yunfei Chu, Zeyu Cui, Kai Dang, rolling review and beyond. In Findings of the Associ-
Xiaodong Deng, Yang Fan, Wenbin Ge, Yu Han, Fei ation for Computational Linguistics: EMNLP 2022,
Huang, Binyuan Hui, Luo Ji, Mei Li, Junyang Lin, pages 300–318, Abu Dhabi, United Arab Emirates.
Runji Lin, Dayiheng Liu, Gao Liu, Chengqiang Lu, Association for Computational Linguistics.
Keming Lu, Jianxin Ma, Rui Men, Xingzhang Ren,
Xuancheng Ren, Chuanqi Tan, Sinan Tan, Jianhong Nils Dycke, Ilia Kuznetsov, and Iryna Gurevych. 2023.
Tu, Peng Wang, Shijie Wang, Wei Wang, Sheng- NLPeer: A unified resource for the computational
guang Wu, Benfeng Xu, Jin Xu, An Yang, Hao Yang, study of peer review. In Proceedings of the 61st An-
Jian Yang, Shusheng Yang, Yang Yao, Bowen Yu, nual Meeting of the Association for Computational
Hongyi Yuan, Zheng Yuan, Jianwei Zhang, Xingx- Linguistics (Volume 1: Long Papers), pages 5049–
uan Zhang, Yichang Zhang, Zhenru Zhang, Chang 5073, Toronto, Canada. Association for Computa-
Zhou, Jingren Zhou, Xiaohuan Zhou, and Tianhang tional Linguistics.
Zhu. 2023. Qwen technical report. ArXiv preprint.
Tao Feng, Lizhen Qu, and Gholamreza Haffari. 2023.
Iz Beltagy, Kyle Lo, and Arman Cohan. 2019. SciB- Less is more: Mitigate spurious correlations for
ERT: A pretrained language model for scientific text. open-domain dialogue response generation models
In Proceedings of the 2019 Conference on Empirical by causal discovery. Transactions of the Association
Methods in Natural Language Processing and the for Computational Linguistics, 11:511–530.
9th International Joint Conference on Natural Lan-
guage Processing (EMNLP-IJCNLP), pages 3615– Mor Geva, Daniel Khashabi, Elad Segal, Tushar Khot,
3620, Hong Kong, China. Association for Computa- Dan Roth, and Jonathan Berant. 2021. Did aristotle
tional Linguistics. use a laptop? a question answering benchmark with
implicit reasoning strategies. Transactions of the
Prabhat Kumar Bharti, Mayank Agarwal, and Asif Ek- Association for Computational Linguistics, 9:346–
bal. 2024. Please be polite to your peers: a multi- 361.
task model for assessing the tone and objectivity of
critiques of peer review comments. Scientometrics, Yanzhu Guo, Guokan Shang, Virgile Rennard, Michalis
129(3):1377–1413. Vazirgiannis, and Chloé Clavel. 2023. Automatic
analysis of substantiation in scientific peer reviews.
Prabhat Kumar Bharti, Meith Navlakha, Mayank Agar- In Findings of the Association for Computational
wal, and Asif Ekbal. 2023. Politepeer: does peer Linguistics: EMNLP 2023, pages 10198–10216, Sin-
review hurt? a dataset to gauge politeness intensity gapore. Association for Computational Linguistics.
in the peer reviews. Language Resources and Evalu-
ation, 58(4):1291–1313. Sireesh Gururaja, Amanda Bertsch, Clara Na, David
Widder, and Emma Strubell. 2023. To build our
Ralph Allan Bradley and Milton E. Terry. 1952. Rank future, we must know our past: Contextualizing
analysis of incomplete block designs: I. the method paradigm shifts in natural language processing. In
of paired comparisons. Biometrika, 39(3/4):324– Proceedings of the 2023 Conference on Empirical
345. Methods in Natural Language Processing, pages
13310–13325, Singapore. Association for Compu-
Wei-Lin Chiang, Lianmin Zheng, Ying Sheng, Anasta- tational Linguistics.
sios Nikolas Angelopoulos, Tianle Li, Dacheng Li,
Hao Zhang, Banghua Zhu, Michael Jordan, Joseph E. Chadi Helwe, Tom Calamai, Pierre-Henri Paris, Chloé
Gonzalez, and Ion Stoica. 2024. Chatbot arena: An Clavel, and Fabian Suchanek. 2024. MAFALDA: A
open platform for evaluating llms by human prefer- benchmark and comprehensive study of fallacy de-
ence. ArXiv preprint. tection and classification. In Proceedings of the 2024
Conference of the North American Chapter of the
Jiangshu Du, Yibo Wang, Wenting Zhao, Zhongfen Association for Computational Linguistics: Human
Deng, Shuaiqi Liu, Renze Lou, Henry Peng Zou, Language Technologies (Volume 1: Long Papers),
Pranav Narayanan Venkit, Nan Zhang, Mukund Sri- pages 4810–4845, Mexico City, Mexico. Association
nath, Haoran Ranran Zhang, Vipul Gupta, Yinghui for Computational Linguistics.
Li, Tao Li, Fei Wang, Qin Liu, Tianlin Liu, Pengzhi
Gao, Congying Xia, Chen Xing, Cheng Jiayang, Carolin Holtermann, Paul Röttger, Timm Dill, and Anne
Zhaowei Wang, Ying Su, Raj Sanjay Shah, Ruohao Lauscher. 2024. Evaluating the elementary multi-
Guo, Jing Gu, Haoran Li, Kangda Wei, Zihao Wang, lingual capabilities of large language models with
Lu Cheng, Surangika Ranathunga, Meng Fang, Jie MultiQ. In Findings of the Association for Compu-
Fu, Fei Liu, Ruihong Huang, Eduardo Blanco, Yixin tational Linguistics: ACL 2024, pages 4476–4494,
Cao, Rui Zhang, Philip S. Yu, and Wenpeng Yin. Bangkok, Thailand. Association for Computational
2024. LLMs assist NLP researchers: Critique paper Linguistics.
(meta-)reviewing. In Proceedings of the 2024 Con-
ference on Empirical Methods in Natural Language Edward J. Hu, Yelong Shen, Phillip Wallis, Zeyuan
Processing, pages 5081–5099, Miami, Florida, USA. Allen-Zhu, Yuanzhi Li, Shean Wang, Lu Wang, and
Association for Computational Linguistics. Weizhu Chen. 2022. Lora: Low-rank adaptation of
large language models. In The Tenth International
Nils Dycke, Ilia Kuznetsov, and Iryna Gurevych. 2022. Conference on Learning Representations, ICLR 2022,
Yes-yes-yes: Proactive data collection for ACL Virtual Event, April 25-29, 2022. OpenReview.net.
Hamish Ivison, Yizhong Wang, Valentina Pyatkin, Yang Liu, Trevor Cohn, Bonnie Webber, and Yulan He.
Nathan Lambert, Matthew Peters, Pradeep Dasigi, 2020. Advice on Reviewing for EMNLP. EMNLP
Joel Jang, David Wadden, Noah A. Smith, Iz Belt- 2020 blog.
agy, and Hannaneh Hajishirzi. 2023. Camels in a
changing climate: Enhancing lm adaptation with tulu Sheng Lu, Irina Bigoulaeva, Rachneet Sachdeva, Harish
2. ArXiv preprint. Tayyar Madabushi, and Iryna Gurevych. 2024. Are
emergent abilities in large language models just in-
Joel Jang, Seungone Kim, Seonghyeon Ye, Doyoung context learning? In Proceedings of the 62nd Annual
Kim, Lajanugen Logeswaran, Moontae Lee, Kyung- Meeting of the Association for Computational Lin-
jae Lee, and Minjoon Seo. 2023. Exploring the bene- guistics (Volume 1: Long Papers), pages 5098–5139,
fits of training expert language models over instruc- Bangkok, Thailand. Association for Computational
tion tuning. In Proceedings of the 40th International Linguistics.
Conference on Machine Learning, ICML 2023, Hon-
Robert K Merton. 1968. The matthew effect in science:
olulu, Hawaii, USA, July 23-29 2023. PMLR.
The reward and communication systems of science
are considered. Science, 159(3810):56–63.
Albert Q. Jiang, Alexandre Sablayrolles, Arthur Men-
sch, Chris Bamford, Devendra Singh Chaplot, Diego Norman Mu, Sarah Chen, Zifan Wang, Sizhe
de las Casas, Florian Bressand, Gianna Lengyel, Guil- Chen, David Karamardian, Lulwa Aljeraisy, Dan
laume Lample, Lucile Saulnier, Lélio Renard Lavaud, Hendrycks, and David Wagner. 2024. Can llms fol-
Marie-Anne Lachaux, Pierre Stock, Teven Le Scao, low simple rules? ArXiv preprint.
Thibaut Lavril, Thomas Wang, Timothée Lacroix,
and William El Sayed. 2023. Mistral 7b. ArXiv OpenAI, Josh Achiam, Steven Adler, Sandhini Agarwal,
preprint. Lama Ahmad, Ilge Akkaya, Florencia Leoni Ale-
man, Diogo Almeida, Janko Altenschmidt, Sam Alt-
Neha Nayak Kennard, Tim O’Gorman, Rajarshi Das, man, Shyamal Anadkat, Red Avila, Igor Babuschkin,
Akshay Sharma, Chhandak Bagchi, Matthew Clin- Suchir Balaji, Valerie Balcom, Paul Baltescu, Haim-
ton, Pranay Kumar Yelugam, Hamed Zamani, and Andrew McCallum. 2022. DISAPERE: A dataset for discourse structure in peer review discussions. In Proceedings of the 2022 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, pages 1234–1249, Seattle, United States. Association for Computational Linguistics.

Nishant Kumar, Zulfiqar Ali, and Rudrashish Haldar. 2023. Novelty in research: A common reason for manuscript rejection! Indian Journal of Anaesthesia, 67(3):245–246.

Ilia Kuznetsov, Osama Mohammed Afzal, Koen Dercksen, Nils Dycke, Alexander Goldberg, Tom Hope, Dirk Hovy, Jonathan K. Kummerfeld, Anne Lauscher, Kevin Leyton-Brown, Sheng Lu, Mausam, Margot Mieskes, Aurélie Névéol, Danish Pruthi, Lizhen Qu, Roy Schwartz, Noah A. Smith, Thamar Solorio, Jingyan Wang, Xiaodan Zhu, Anna Rogers, Nihar B. Shah, and Iryna Gurevych. 2024. What can natural language processing do for peer review? ArXiv preprint.

Nino Künzli, Anke Berger, Katarzyna Czabanowska, Raquel Lucas, Andrea Madarasova Geckova, Sarah Mantwill, and Olaf von dem Knesebeck. 2022. «I do not have time» - is this the end of peer review in public health sciences? Public Health Reviews, 43(11):1605407.

Esther Landhuis. 2016. Scientific literature: Information overload. Nature, 535(7612):457–458.

J. Richard Landis and Gary G. Koch. 1977. The measurement of observer agreement for categorical data. Biometrics, 33(1):159–174.

OpenAI. 2024. GPT-4 technical report. ArXiv preprint.

Pallets. 2024. Jinja. https://github.com/pallets/jinja/. GitHub repository.

Sukannya Purkayastha, Anne Lauscher, and Iryna Gurevych. 2023. Exploring jiu-jitsu argumentation for writing peer review rebuttals. In Proceedings of the 2023 Conference on Empirical Methods in Natural Language Processing, pages 14479–14495, Singapore. Association for Computational Linguistics.

Martina Raue and Sabine G. Scholl. 2018. The Use of Heuristics in Decision Making Under Risk and Uncertainty, pages 153–179. Springer International Publishing.

Anna Rogers and Isabelle Augenstein. 2020. What can we do to improve peer review in NLP? In Findings of the Association for Computational Linguistics: EMNLP 2020, pages 1256–1262, Online. Association for Computational Linguistics.

Anna Rogers, Marzena Karpinska, Jordan Boyd-Graber, and Naoaki Okazaki. 2023. Program chairs' report on peer review at ACL 2023. In Proceedings of the 61st Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers), pages xl–lxxv, Toronto, Canada. Association for Computational Linguistics.

Anna Rogers and Isabelle Augenstein. 2021. How to review for ACL Rolling Review. ACL Rolling Review.

Anna Severin, Michaela Strinzel, Matthias Egger, Tiago Barros, Alexander Sokolov, Julia Vilstrup Mouatt, and Stefan Müller. Journal impact factor and peer review thoroughness and helpfulness: A supervised machine learning study. ArXiv preprint.

Jaime Sevilla, Lennart Heim, Anson Ho, Tamay Besiroglu, Marius Hobbhahn, and Pablo Villalobos. 2022. Compute trends across three eras of machine learning. In Proceedings of the International Joint Conference on Neural Networks, IJCNN 2022, July 18-23, Padua, Italy, pages 1–8. IEEE.

Chufan Shi, Yixuan Su, Cheng Yang, Yujiu Yang, and Deng Cai. 2023. Specialist or generalist? Instruction tuning for specific NLP tasks. In Proceedings of the 2023 Conference on Empirical Methods in Natural Language Processing, pages 15336–15348, Singapore. Association for Computational Linguistics.

Pragya Srivastava, Satvik Golechha, Amit Deshpande, and Amit Sharma. 2024. NICE: To optimize in-context examples or not? In Proceedings of the 62nd Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers), pages 5494–5510, Bangkok, Thailand. Association for Computational Linguistics.

Lu Sun, Aaron Chan, Yun Seo Chang, and Steven P. Dow. 2024a. ReviewFlow: Intelligent scaffolding to support academic peer reviewing. In Proceedings of the 29th International Conference on Intelligent User Interfaces, IUI 2024, March 18-21, 2024, Greenville, South Carolina, USA, pages 120–137. Association for Computing Machinery.

Wangtao Sun, Chenxiang Zhang, Xueyou Zhang, Ziyang Huang, Haotian Xu, Pei Chen, Shizhu He, Jun Zhao, and Kang Liu. 2024b. Beyond instruction following: Evaluating rule following of large language models. ArXiv preprint.

Gemma Team, Thomas Mesnard, Cassidy Hardin, Robert Dadashi, Surya Bhupatiraju, Shreya Pathak, Laurent Sifre, Morgane Rivière, Mihir Sanjay Kale, Juliette Love, Pouya Tafti, Léonard Hussenot, et al. 2024. Gemma: Open models based on Gemini research and technology. ArXiv preprint.

Hugo Touvron, Louis Martin, Kevin Stone, Peter Albert, Amjad Almahairi, Yasmine Babaei, Nikolay Bashlykov, Soumya Batra, Prajjwal Bhargava, Shruti Bhosale, et al. 2023. Llama 2: Open foundation and fine-tuned chat models. ArXiv preprint.

Amos Tversky, Daniel Kahneman, and Paul Slovic. 1974. Judgment under uncertainty: Heuristics and biases. Science, 185(4157):1124–1131.

David Wadden, Kejian Shi, Jacob Morrison, Aakanksha Naik, Shruti Singh, Nitzan Barzilay, Kyle Lo, Tom Hope, Luca Soldaini, Shannon Zejiang Shen, Doug Downey, Hannaneh Hajishirzi, and Arman Cohan. 2024. SciRIFF: A resource to enhance language model instruction-following over scientific literature. ArXiv preprint.

Siyuan Wang, Zhongkun Liu, Wanjun Zhong, Ming Zhou, Zhongyu Wei, Zhumin Chen, and Nan Duan. 2022a. From LSAT: The progress and challenges of complex reasoning. IEEE/ACM Transactions on Audio, Speech, and Language Processing, 30:2201–2216.

Siyuan Wang, Zhongyu Wei, Yejin Choi, and Xiang Ren. 2024. Can LLMs reason with rules? Logic scaffolding for stress-testing and improving LLMs. ArXiv preprint.

Siyuan Wang, Wanjun Zhong, Duyu Tang, Zhongyu Wei, Zhihao Fan, Daxin Jiang, Ming Zhou, and Nan Duan. 2022b. Logic-driven context extension and data augmentation for logical reasoning of text. In Findings of the Association for Computational Linguistics: ACL 2022, pages 1619–1629, Dublin, Ireland. Association for Computational Linguistics.

Yizhong Wang, Hamish Ivison, Pradeep Dasigi, Jack Hessel, Tushar Khot, Khyathi Raghavi Chandu, David Wadden, Kelsey MacMillan, Noah A. Smith, Iz Beltagy, and Hannaneh Hajishirzi. 2023. How far can camels go? Exploring the state of instruction tuning on open resources. ArXiv preprint.

Mark Ware and Michael Mabe. 2009. An overview of scientific and scholarly journal publishing. The STM Report, 1082:1083.

Nathaniel Weir, Peter Clark, and Benjamin Van Durme. 2024. NELLIE: A neuro-symbolic inference engine for grounded, compositional, and explainable reasoning. In Proceedings of the Thirty-Third International Joint Conference on Artificial Intelligence, IJCAI 2024, August 3-9, 2024, Jeju, Korea, pages 3602–3612. ijcai.org.

Zhenyu Wu, Yaoxiang Wang, Jiacheng Ye, Zhiyong Wu, Jiangtao Feng, Jingjing Xu, and Yu Qiao. 2023. OpenICL: An open-source framework for in-context learning. In Proceedings of the 61st Annual Meeting of the Association for Computational Linguistics (Volume 3: System Demonstrations), pages 489–498, Toronto, Canada. Association for Computational Linguistics.

Weizhe Yuan, Pengfei Liu, and Graham Neubig. 2022. Can we automate scientific reviewing? Journal of Artificial Intelligence Research, 75:171–212.

Ruiyang Zhou, Lu Chen, and Kai Yu. 2024. Is LLM a reliable reviewer? A comprehensive evaluation of LLM on automatic paper reviewing tasks. In Proceedings of the 2024 Joint International Conference on Computational Linguistics, Language Resources and Evaluation, LREC-COLING 2024, May 20-25, 2024, pages 9340–9351. ELRA and ICCL.
Model | Size | Link
LLaMa 2 (Touvron et al., 2023) | 7B | meta-llama/Llama-2-7b-chat-hf
LLaMa 2 (Touvron et al., 2023) | 13B | meta-llama/Llama-2-13b-chat
Gemma 1.1 (Team et al., 2024) | 7B | google/gemma-1.1-7b-it
Mistral v0.1 (Jiang et al., 2023) | 7B | mistralai/Mistral-7B-Instruct-v0.1
Qwen-1.5 (Bai et al., 2023) | 7B | Qwen/Qwen-7B-Chat
Yi-1.5 (AI et al., 2024) | 6B | 01-ai/Yi-6B-Chat
SciTülu (Wadden et al., 2024) | 7B | allenai/scitulu-7b

Table 6: Overview of models used in our work along with their sizes and links.

A Appendix

A.1 Guidelines used in multiple Annotation Rounds
The existing ARR guidelines (Rogers and Augenstein, 2021) are shown in Table 7, which we used in Round 1 of our annotation study. The guidelines extended using the EMNLP blog (Liu et al., 2020) are shown in Table 8. The positive examples that we provided to the annotators for completing Round 3 of our annotation study are shown in Table 9.
A.2 Model and Implementation details
We select the LLMs based on multiple criteria: (a) all the models should be open-sourced in order to deploy them in real-world reviewing systems, (b) their size should be reasonable in order to perform fine-tuning using LoRA (Hu et al., 2022), and (c) they must be state-of-the-art on the existing LLM leaderboards (Chiang et al., 2024). Based on these criteria, we select the chat versions of the various models shown in Table 6. We use vLLM (https://github.com/vllm-project/vllm) for fast inference of all the models. We set the temperature to 0 to have consistent predictions throughout and limit the output tokens to 30.
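For concreteness, the following is a minimal sketch of this inference setup, assuming vLLM's offline generation API; the model identifier and prompt are placeholders rather than the exact experiment configuration.

```python
from vllm import LLM, SamplingParams

# Greedy decoding (temperature 0) with at most 30 output tokens,
# mirroring the settings described above.
sampling = SamplingParams(temperature=0.0, max_tokens=30)

# Any chat model from Table 6 can be loaded via its Hugging Face identifier.
llm = LLM(model="mistralai/Mistral-7B-Instruct-v0.1")

prompts = ["<task instruction and review segment go here>"]
for output in llm.generate(prompts, sampling):
    print(output.outputs[0].text.strip())
```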
A.3 Computational Budget
We ran all the experiments on Nvidia A100 80GB GPUs. None of the experiments consumed more than 36 hours.

A.4 Prompt for GPT-based review segment extraction
We describe the prompt for GPT-based review segment extraction in Fig 5. We provide the classes from the existing ARR guidelines and ask the model to provide us a list of review segments that can likely be related to lazy thinking. We show a subset of the classes in the prompt due to space constraints.
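A minimal sketch of how such an extraction call could look with the OpenAI chat completions client is given below; the model name and the prompt variable are placeholders, not the exact setup used to build the dataset.

```python
from openai import OpenAI

client = OpenAI()  # assumes OPENAI_API_KEY is set in the environment

# `extraction_prompt` stands for the instruction shown in Fig 5,
# followed by the full review text to be segmented.
extraction_prompt = "<prompt from Fig 5, ending with the review text>"

response = client.chat.completions.create(
    model="gpt-4",  # placeholder GPT variant
    temperature=0,
    messages=[{"role": "user", "content": extraction_prompt}],
)

# The model is asked to return a list of review segments that may
# correspond to lazy thinking.
print(response.choices[0].message.content)
```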
A.5 Metric alignment study
To verify the reliability of our metrics, we follow Holtermann et al. (2024) and perform an alignment study. We sample 50 responses from various models and evaluate them with both string matching and GPT-based methods. We report the results in Tables 14 and 15, respectively. We find that the string-based evaluator is precise for correct answers but tends to underestimate them, while the GPT-based evaluator is better at identifying incorrect answers but may overestimate correct ones. As such, the string-based evaluator serves as a lower bound while the GPT-based evaluator serves as an upper bound. Manual analysis shows that the GPT-based evaluator struggles with similar classes, like "not surprising" versus "not novel", while the string-based evaluator misses correct predictions such as "The writing needs to be clear" and "Language errors/Writing style". However, for the coarse-grained classification, both evaluators perform equally well.
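To make the lower-bound behaviour concrete, below is a minimal sketch of the kind of string-based check we have in mind; the normalisation step is an illustrative assumption, not the exact implementation.

```python
def normalise(text: str) -> str:
    # Lowercase and collapse whitespace before matching.
    return " ".join(text.lower().split())

def string_match(prediction: str, gold_label: str) -> bool:
    # Count a prediction as correct only if the normalised gold label
    # appears verbatim in the normalised model output.
    return normalise(gold_label) in normalise(prediction)

# A verbose but correct generation still matches its gold class ...
print(string_match("The class is: The results are not novel.", "The results are not novel"))  # True
# ... while a paraphrase is missed, which is why this evaluator is a lower bound.
print(string_match("Language errors/Writing style", "The writing needs to be clear"))  # False
```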
A.6 Prompts for Coarse-grained and Fine-grained classification
We use a fixed prompt template to describe lazy thinking, as shown in Fig 6. We then add the task-specific prompt for fine-grained and coarse-grained classification, as shown in Fig 7 and Fig 8, respectively. For the first round, we add the classes in the fixed template from the initial ARR guidelines (Table 7). In the second round, the classes are added from the enhanced guidelines in Table 8. We use the full review and the target segment as input for the 'RT' setup and only the target segment for the 'T' setup, as described in Sec §3. The prompt templates for In-Context Learning, as used in Round 3 of our annotation study, are shown in Figures 9 and 10 for fine-grained and coarse-grained classification, respectively.

A.7 In-context Learning Experiment for Round 3
We use different types of methods to generate in-context learning exemplars, namely Static, Random, Top-K, and Vote-K, as described in Sec §3. We use 1, 2, and 3 exemplars for this experiment. The performance of the models for the fine-grained and coarse-grained classification is shown in Fig 3 and Fig 4, respectively. We observe that the performance does not change on increasing the number of exemplars or using different strategies.
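As an illustration, a minimal sketch of two of these selection strategies (Random and Top-K) is shown below; the exemplar pool format and the sentence-embedding model are assumptions made for the example, not the exact components used in the experiments.

```python
import random
from sentence_transformers import SentenceTransformer, util

# `pool` is assumed to be a list of dicts such as
# {"segment": "...review segment...", "label": "Extra Experiment"}.

def random_exemplars(pool, k=3, seed=0):
    # Random strategy: draw k labelled exemplars from the annotated pool.
    return random.Random(seed).sample(pool, k)

def topk_exemplars(pool, query_segment, k=3):
    # Top-K strategy: pick the k pool segments most similar to the query
    # segment under a sentence-embedding model (placeholder choice).
    encoder = SentenceTransformer("sentence-transformers/all-MiniLM-L6-v2")
    query_emb = encoder.encode(query_segment, convert_to_tensor=True)
    pool_embs = encoder.encode([ex["segment"] for ex in pool], convert_to_tensor=True)
    scores = util.cos_sim(query_emb, pool_embs)[0]
    return [pool[i] for i in scores.topk(k).indices.tolist()]
```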
[Figure 3 here: six panels, (a) Gemma 7B, (b) LLaMa 7B, (c) LLaMa 13B, (d) Mistral 7B, (e) SciTülu 7B, (f) Qwen 7B, plotting accuracy (Acc) against the number of ICL exemplars (1, 2, 3).]

Figure 3: Performance of LLMs on using different In-Context Learning (ICL) methods for Round 3 of our annotation study for fine-grained classification. Error bars indicate using only the target segment (T) as the information source. Acc refers to using GPT-based accuracy.

[Figure 4 here: six panels, (a) Gemma 7B, (b) LLaMa 7B, (c) LLaMa 13B, (d) Mistral 7B, (e) SciTülu 7B, (f) Qwen 7B, plotting accuracy (Acc) against the number of ICL exemplars (1, 2, 3).]

Figure 4: Performance of LLMs on using different In-Context Learning (ICL) methods for Round 3 of our annotation study for the coarse classification task. Error bars indicate using only the target segment (T) as the information source. Acc refers to using GPT-based evaluator accuracy.

A.8 Training Details to perform instruction tuning

A.8.1 Instructions
We use the same instructions from Round 2 of our annotation study due to higher inter-annotator agreement (Sec §A.1). During fine-tuning, we follow the SciRIFF template using Jinja (Pallets, 2024), providing task descriptions and data as prompts, with output labels as responses. Two instruction formats are used: one with just the target segment (T) and another with both the target segment and the full review (RT). The output is either a fine-grained lazy thinking class or a coarse label (not lazy thinking or lazy thinking). Demonstration instances are excluded to ensure consistency with SciRIFF and Tülu V2.
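A minimal sketch of this templating step is shown below, assuming Jinja2 and a simplified prompt layout; the field names are illustrative and do not reproduce the SciRIFF template verbatim.

```python
from jinja2 import Template

# Simplified stand-in for a SciRIFF-style instruction template.
PROMPT_TEMPLATE = Template(
    "Task: {{ task_description }}\n"
    "Full Review: {{ review }}\n"
    "Target Segment: {{ segment }}"
)

prompt = PROMPT_TEMPLATE.render(
    task_description="Classify the target segment into one of the lazy thinking classes.",
    review="<full review text>",
    segment="<target segment>",
)
response = "The authors could also do [extra experiment X]"  # gold label used as the response
```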
Prompt for GPT-based segment extraction
Prompt: As the number of scientific publications grows over the years, the need for efficient quality control in
peer-reviewing becomes crucial. Due to the high reviewing load, the reviewers are often prone to different kinds of bias,
which inherently contribute to lower reviewing quality and this overall hampers the scientific progress in the field. The
ACL peer reviewing guidelines characterize some of such reviewing bias. These are termed lazy thinking. The lazy
thinking classes and the reason for them being problematic are provided as a dictionary below:

{’The results are not surprising’: ’Many findings seem obvious in retrospect, but this does not mean that the community
is already aware of them and can use them as building blocks for future work.’,

’The results contradict what I would expect’: ’You may be a victim of confirmation bias, and be unwilling
to accept data contradicting your prior beliefs.’,

’The results are not novel’: ’Such broad claims need to be backed up with references.’,
’This has no precedent in existing literature’: ’Believe it or not: papers that are more novel tend to be harder to publish.
Reviewers may be unnecessarily conservative.’,

’The results do not surpass the latest SOTA’, ’SOTA results are neither necessary nor sufficient for a scien-
tific contribution. An engineering paper could also offer improvements on other dimensions (efficiency, generalizability,
interpretability, fairness, etc.’,

’The results are negative’: ’The bias towards publishing only positive results is a known problem in many
fields, and contributes to hype and overclaiming. If something systematically does not work where it could be expected
to, the community does need to know about it.’,

’This method is too simple’: ’The goal is to solve the problem, not to solve it in a complex way. Simpler
solutions are in fact preferable, as they are less brittle and easier to deploy in real-world settings.’,

’The paper doesn’t use [my preferred methodology], e.g., deep learning’: ’NLP is an interdisciplinary field,
relying on many kinds of contributions: models, resource, survey, data/linguistic/social analysis, position, and theory.’,
[...]

You are given a review, and your task is to identify the segments (1 or more sentences) within the ‘summary of
weaknesses’ section that can correspond to any lazy thinking class as described above. The output should be a list of
these review segments.

Review: paper summary: This work provides a broad and thorough analysis of how 8 different model families and
varying model sizes for a total of 28 models perform on the oLMpics benchmark and the psycholinguistic probing
datasets from Ettinger (2020). It finds that all models struggle to resolve compositional questions zero-shot and that
attributes such as model size, pretraining objective, etc are not predictive of a model’s linguistic capabilities. summary of
strengths:- The work is well-motivated and clear -A vast selection of models are investigated. summary of weaknesses:
While the findings are interesting, there is little to no qualitative analysis to provide insight into why these effects might
occur -I would expect an analysis such as this to have at least 3 runs with varying random seeds per model to give
greater confidence in the model’s abilities. A growing body of work indicates that models’ linguistic abilities can vary
considerably even across initialisations -The exploration and prompt design to adapt the GPT models to the tasks at
hand is quite limited. commnets, suggestions, and typos: NA

Figure 5: Prompt for GPT based review segment extraction

A.8.2 Hyper-parameters for LoRA training
We train all the models for 3 epochs. For LoRA, we use a rank of 64, an alpha of 16, and a dropout of 0.1. The models are trained with a cosine learning rate scheduler and a warmup ratio of 0.03. We use BF16 and TF32 to ensure training precision. The learning rate is set to 1e-4.
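For reference, these hyper-parameters translate into roughly the following configuration with the peft and transformers libraries; the output directory is a placeholder, and the sketch omits model loading and the training loop.

```python
from peft import LoraConfig
from transformers import TrainingArguments

# LoRA adapter settings: rank 64, alpha 16, dropout 0.1.
lora_config = LoraConfig(
    r=64,
    lora_alpha=16,
    lora_dropout=0.1,
    task_type="CAUSAL_LM",
)

# Three epochs, cosine schedule, 3% warmup, BF16/TF32 precision, learning rate 1e-4.
training_args = TrainingArguments(
    output_dir="lazyreview-lora",  # placeholder path
    num_train_epochs=3,
    learning_rate=1e-4,
    lr_scheduler_type="cosine",
    warmup_ratio=0.03,
    bf16=True,
    tf32=True,
)
```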
Prompt Template for Lazy Thinking
Prompt: As the number of scientific publications grows over the years, the need for efficient quality control in
peer-reviewing becomes crucial. Due to the high reviewing load, the reviewers are often prone to different kinds of bias,
which inherently contribute to lower reviewing quality and this overall hampers the scientific progress in the field. The
ACL peer reviewing guidelines characterize some of such reviewing bias. These are termed lazy thinking. The lazy
thinking classes and the reason for them being problematic are provided as a dictionary below:

{’The results are not surprising’: ’Many findings seem obvious in retrospect, but this does not mean that the community
is already aware of them and can use them as building blocks for future work.’,

’The results contradict what I would expect’: ’You may be a victim of confirmation bias, and be unwilling
to accept data contradicting your prior beliefs.’,

’The results are not novel’: ’Such broad claims need to be backed up with references.’,
’This has no precedent in existing literature’: ’Believe it or not: papers that are more novel tend to be harder to publish.
Reviewers may be unnecessarily conservative.’,

’The results do not surpass the latest SOTA’, ’SOTA results are neither necessary nor sufficient for a scien-
tific contribution. An engineering paper could also offer improvements on other dimensions (efficiency, generalizability,
interpretability, fairness, etc.’,

’The results are negative’: ’The bias towards publishing only positive results is a known problem in many
fields, and contributes to hype and overclaiming. If something systematically does not work where it could be expected
to, the community does need to know about it.’,

’This method is too simple’: ’The goal is to solve the problem, not to solve it in a complex way. Simpler
solutions are in fact preferable, as they are less brittle and easier to deploy in real-world settings.’,

[...]

Figure 6: Fixed Prompt for defining lazy thinking

Prompt Template for Fine-Grained Classification


Task: Given a full review and a target segment corresponding to that review, you need to classify the target sentence
into one of the lazy thinking classes eg., ’The topic is too niche’, ’The results are negative’.
Full Review: <review>
Target Segment: <target segment>

Figure 7: Prompt for fine-grained classification

Prompt Template for Coarse-Grained Classification


Task: Given a full review and a target segment corresponding to that review, you need to classify the target sentence
into whether it is ‘lazy thinking’ or not ‘lazy thinking’
Full Review: <review>
Target Segment: <target segment>

Figure 8: Prompt for coarse-grained classification

A.8.3 Hyper-parameter search
We train models using various data proportions: 0.2, 0.4, 0.6, 0.8, and 1.0 from the different mixes. We then evaluate their performance on multiple validation sets. The performance results are plotted in two ways: using only the target segment (T) in Fig 11 and using a combination of review and target segment (RT) in Fig 12, corresponding to the fine-grained classification task. For the T setup, where only the target segment is used, the models achieve optimal performance with 0.2 to 0.4 of the data. Performance tends to stagnate with more data, indicative of reaching the maximum, and this is in line with the findings in Wadden et al. (2024). In the RT setup, where both the target segment and the review are used, the models perform best with 0.6 to 0.8 of the data. This is because the combined prompts provide more context and take longer for the model to fully grasp, allowing it to reach peak performance with a larger dataset. As a result, we select 0.3 as the optimal data proportion for the T setup and 0.7 for the RT setup, balancing performance and avoiding the issues observed with different data sizes. We obtain similar findings for the coarse-grained classification, as shown in Fig 13 ('T') and Fig 14 ('RT'), respectively.
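A minimal sketch of this data-proportion sweep is given below; the mix is represented by a toy list, and the fine-tuning and evaluation steps are left as comments.

```python
import random

def subsample(mix, proportion, seed=42):
    # Draw a fixed fraction of an instruction-tuning mix.
    rng = random.Random(seed)
    return rng.sample(mix, int(len(mix) * proportion))

mix = [{"id": i} for i in range(1000)]  # stand-in for one of the dataset mixes
for proportion in (0.2, 0.4, 0.6, 0.8, 1.0):
    subset = subsample(mix, proportion)
    print(proportion, len(subset))
    # fine-tune on `subset`, then evaluate on the validation sets
```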
Prompt Template for Fine-Grained Classification (ICL Based)
Task: Given a full review and a target segment corresponding to that review, you need to classify the target sentence
into one of the lazy thinking classes eg., ’The topic is too niche’, ’The results are negative’. An example is shown below.
Full Review: paper summary: This work provides a broad and thorough analysis of how 8 different model families
and varying model sizes for a total of 28 models perform on the oLMpics benchmark and the psycholinguistic probing
datasets from Ettinger (2020). It finds that all models struggle to resolve compositional questions zero-shot and that
attributes such as model size, pretraining objective, etc are not predictive of a model’s linguistic capabilities. summary of
strengths:- The work is well-motivated and clear -A vast selection of models are investigated. summary of weaknesses:
While the findings are interesting, there is little to no qualitative analysis to provide insight into why these effects might
occur -I would expect an analysis such as this to have at least 3 runs with varying random seeds per model to give
greater confidence in the model’s abilities. A growing body of work indicates that models’ linguistic abilities can vary
considerably even across initialisations -The exploration and prompt design to adapt the GPT models to the tasks at
hand is quite limited. commnets, suggestions, and typos: NA
Target Segment: The exploration and prompt design to adapt the GPT models to the tasks at hand is quite limited.
Class: The authors should do extra experiment [X]
Full Review: <review>
Target Segment: <target segment>

Figure 9: Prompt for fine-grained classification based on In-Context Learning (ICL) as used in Round 3 of our study

Prompt Template for Coarse-Grained Classification (ICL Based)


Task: Given a full review and a target segment corresponding to that review, you need to classify the target sentence
into whether it is ‘lazy thinking’ or not ‘lazy thinking’
Full Review: paper summary: This work provides a broad and thorough analysis of how 8 different model families
and varying model sizes for a total of 28 models perform on the oLMpics benchmark and the psycholinguistic probing
datasets from Ettinger (2020). It finds that all models struggle to resolve compositional questions zero-shot and that
attributes such as model size, pretraining objective, etc are not predictive of a model’s linguistic capabilities. summary of
strengths:- The work is well-motivated and clear -A vast selection of models are investigated. summary of weaknesses:
While the findings are interesting, there is little to no qualitative analysis to provide insight into why these effects might
occur -I would expect an analysis such as this to have at least 3 runs with varying random seeds per model to give
greater confidence in the model’s abilities. A growing body of work indicates that models’ linguistic abilities can vary
considerably even across initialisations -The exploration and prompt design to adapt the GPT models to the tasks at
hand is quite limited. commnets, suggestions, and typos: NA
Target Segment: The exploration and prompt design to adapt the GPT models to the tasks at hand is quite limited.
Class: Lazy Thinking
Full Review: <review>
Target Segment: <target segment>

Figure 10: Prompt for coarse-grained classification based on In-Context Learning (ICL) as used in Round 3 of our
study

A.8.4 Instruction Tuning Training Setup for Annotation Rounds
We train models with the mixes identified in the experiments with 3-fold cross-validation for optimal performance. We use the same instructions that are used in the annotations and zero-shot prompting for fine-tuning the models. We use a data proportion of 0.3 to train models within the target segment-based prompting setup (T) and 0.8 for the RT (full review with target segment) setup from the best mixes identified for each model. In line with the results on the test sets, we use SciRIFF Mix to train the LLaMa and SciTülu models, Tülu Mix for the Gemma and Qwen models, and Full Mix for the Mistral and Yi models.

A.9 Confusion matrices for Annotation Labels and Confidences across annotation rounds
We show the confusion matrices for annotation labels and confidences in Figures 16 and 17, respectively.
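The matrices themselves can be tabulated in the usual way; a minimal sketch with invented toy labels is shown below, purely to illustrate the computation.

```python
from sklearn.metrics import confusion_matrix

classes = ["Not Novel", "Not surprising", "Extra Experiment"]
annotator_1 = ["Not Novel", "Extra Experiment", "Extra Experiment", "Not surprising"]
annotator_2 = ["Not surprising", "Extra Experiment", "Extra Experiment", "Not surprising"]

# Rows correspond to Annotator 1, columns to Annotator 2.
matrix = confusion_matrix(annotator_1, annotator_2, labels=classes)
print(matrix)
```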
[Figure 11 here: seven panels, (a) Gemma 7B, (b) LLaMa 7B, (c) LLaMa 13B, (d) Mistral 7B, (e) SciTülu 7B, (f) Qwen 7B, (g) Yi 6B, plotting GPT-based accuracy (G Acc) against the data proportion (0.2 to 1.0) for the No Mix, Tülu Mix, SciRIFF Mix, and Full Mix settings.]

Figure 11: Performance of instruction-tuned LLMs for fine-grained classification on the dev set with multiple percentages of dataset mixes using target segment (T) as the source of information in the prompt.

[Figure 12 here: seven panels, (a) Gemma 7B, (b) LLaMa 7B, (c) LLaMa 13B, (d) Mistral 7B, (e) SciTülu 7B, (f) Qwen 7B, (g) Yi 6B, plotting GPT-based accuracy (G Acc) against the data proportion (0.2 to 1.0) for the No Mix, Tülu Mix, SciRIFF Mix, and Full Mix settings.]

Figure 12: Performance of instruction-tuned LLMs for fine-grained classification on the dev set with multiple percentages of dataset mixes using the combination of review and target segment (RT) as the source of information in the prompt.

A.10 Results for the 3-fold cross-validation
The results of the 3-fold cross-validation for fine-grained and coarse-grained classification are shown in Tables 12 and 13, respectively. The random and majority baseline scores for fine-grained classification are 4.1±15.2 and 6.7±5.4, respectively. The random and majority baseline scores for coarse-grained classification are 44.1±13.2 and 47.7±3.2, respectively. Thus, all the models significantly outperform the majority and random baselines for both tasks.

A.11 Results for all the experiments across the annotation rounds
We report the full results for the fine-grained and coarse-grained classification in Tables 10 and 11, respectively.
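For clarity, the random and majority baselines reported in Sec A.10 are computed in the standard way; a minimal sketch (with placeholder label lists) is shown below.

```python
import random
from collections import Counter

def majority_baseline(train_labels, test_labels):
    # Always predict the most frequent label seen in training.
    majority = Counter(train_labels).most_common(1)[0][0]
    return sum(label == majority for label in test_labels) / len(test_labels)

def random_baseline(label_set, test_labels, seed=0):
    # Predict a uniformly random label for every test instance.
    rng = random.Random(seed)
    predictions = [rng.choice(sorted(label_set)) for _ in test_labels]
    return sum(p == g for p, g in zip(predictions, test_labels)) / len(test_labels)

train_labels = ["Extra Experiment", "Not SOTA", "Extra Experiment"]  # placeholder
test_labels = ["Not Novel", "Extra Experiment", "Extra Experiment"]  # placeholder
print(majority_baseline(train_labels, test_labels))
print(random_baseline(set(train_labels + test_labels), test_labels))
```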
[Figure 13 here: seven panels, (a) Gemma 7B, (b) LLaMa 7B, (c) LLaMa 13B, (d) Mistral 7B, (e) SciTülu 7B, (f) Qwen 7B, (g) Yi 6B, plotting GPT-based accuracy (G Acc) against the data proportion (0.2 to 1.0) for the No Mix, Tülu Mix, SciRIFF Mix, and Full Mix settings.]

Figure 13: Performance of instruction-tuned LLMs for coarse-grained classification on the dev set with multiple percentages of dataset mixes using target segment (T) as the source of information in the prompt.

[Figure 14 here: seven panels, (a) Gemma 7B, (b) LLaMa 7B, (c) LLaMa 13B, (d) Mistral 7B, (e) SciTülu 7B, (f) Qwen 7B, (g) Yi 6B, plotting GPT-based accuracy (G Acc) against the data proportion (0.2 to 1.0) for the No Mix, Tülu Mix, SciRIFF Mix, and Full Mix settings.]

Figure 14: Performance of instruction-tuned LLMs for coarse-grained classification on the dev set with multiple percentages of dataset mixes using the combination of review and target segment (RT) as the source of information in the prompt.

A.12 Instructions to Annotators for Rewriting Reviews and Evaluation
As discussed in Sec §3, we form two control groups to rewrite reviews. The first control group is asked to rewrite reviews based on the current ARR guidelines, while the other control group is additionally provided with lazy thinking annotations. We provide both control groups with the paper and the reviews corresponding to that paper. Finally, a senior Ph.D. and a PostDoc evaluate the rewrites pair-wise based on various measures.

Instructions for re-writing based on current ARR guidelines. Given the review and the corresponding manuscript, your task is to re-write the reviews to comply with the current ARR guidelines (https://aclrollingreview.org/reviewertutorial). Please feel free to edit or remove the content from any of the reviewing sections, such as 'summary of weaknesses' or 'comments, suggestions and typos', if applicable.

Instructions for re-writing based on current ARR guidelines and lazy thinking annotations. Given the review and the corresponding manuscript, you have been provided annotated instances of lazy thinking. The definition of lazy thinking is as follows: 'Lazy thinking, in the context of NLP research paper reviews, refers to the practice of dismissing or criticizing research papers based on superficial heuristics or preconceived notions rather than thorough analysis. It is characterized by reviewers raising concerns that lack substantial supporting evidence and are often influenced by prevailing trends within the NLP community.' According to the ARR guidelines, reviewers are explicitly discouraged from using such heuristics in their review reports. In line with the ARR guidelines, your task is to re-write the reviews to comply with the guidelines. Please feel free to edit or remove the content from any of the reviewing sections, such as 'summary of weaknesses' or 'comments, suggestions and typos', if applicable.

[Figure 15 here: horizontal bar chart of silver-label counts per lazy thinking class (Results Negative, Simple Method, Method Not Preferred, Not surprising, Extra Experiment, Not SOTA, No Precedence, Tested on Language X, Contradict Expectation, Not Novel, Niche Topics, Language Errors); x-axis: 0 to 300.]

Figure 15: Distribution of lazy thinking labels in the silver annotated data

Instruction for evaluating the rewrites. You are provided a paper along with a pair of reviews written for the same paper. Your task is to perform
pair-wise comparison of the reviews based on: (1) Constructiveness: reviewers should provide actionable feedback for the authors to improve on the paper, (2) Justified: reviewers should clearly state the reason for their arguments rather than putting out vague statements. Additionally, we introduce a measure, (3) Adherence, which judges how well the reviews are written based on the current ARR guidelines.

A.13 Distribution of review segment lengths in LazyReview
We plot the distribution of review segment lengths in Fig 18. We find that 'Extra Experiments' is the most common class, with variable segment lengths.

A.14 Analysis on the Silver Annotations
As discussed in RQ3 (cf. §3), we use the best-performing model (Qwen) to annotate the remaining 1,276 review segments in our dataset, generating silver annotations. One of the annotators reviews a sample of 100 segments, yielding a Cohen's κ of 0.56 with the silver annotations, which is substantial given the subjectivity of this task and the domain. The class label distribution is presented in Fig 15, where 'Extra Experiment' remains the most frequent lazy thinking class. Additionally, we see increases in 'Not SOTA' and 'Not Novel' instances, likely reflecting the fast-paced advancements in ML that quickly render state-of-the-art methods obsolete (Sevilla et al., 2022). The notion of 'novelty' is also frequently debated, with reviewers often using it to reject papers without sufficient justification (Kumar et al., 2023).
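The agreement number can be reproduced with a standard Cohen's κ implementation; the snippet below is a minimal sketch with toy labels, not the actual annotation data.

```python
from sklearn.metrics import cohen_kappa_score

# Toy stand-ins for the human re-annotations and the model's silver labels.
human = ["Extra Experiment", "Not Novel", "Not SOTA", "Extra Experiment"]
silver = ["Extra Experiment", "Not Novel", "Extra Experiment", "Extra Experiment"]

kappa = cohen_kappa_score(human, silver)
print(f"Cohen's kappa: {kappa:.2f}")
```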
[Figure 16 here: three confusion matrices over the lazy thinking classes, (a) Round 1, (b) Round 2, (c) Round 3, with Annotator 1 on the rows and Annotator 2 on the columns.]

Figure 16: Confusion matrices for the lazy thinking classes from the ARR guidelines and EMNLP Blog used for annotation across multiple rounds. The class names are shortened due to space constraints.

[Figure 17 here: three confusion matrices over the confidence levels (Low, Medium, High), (a) Round 1, (b) Round 2, (c) Round 3, with Annotator 1 on the rows and Annotator 2 on the columns.]

Figure 17: Confusion matrices for the confidence levels among annotators across multiple rounds of annotation.

[Figure 18 here: histogram of review segment lengths (x-axis: 1.0 to 5.0, y-axis: frequency), with one series per lazy thinking class (Niche Topics, Language Errors, Not Novel, Missing Comparison, Not surprising, Extra Experiment, Should do X, Contradict Expectation, Method Not Preferred, Tested on Language X, Results Negative, No Precedence, Simple Method, Not SOTA, Resource Paper).]

Figure 18: Segment lengths


Heuristics Description
The results are not surprising Many findings seem obvious in retrospect, but this does not
mean that the community is already aware of them and can
use them as building blocks for future work.
The results contradict what I would expect You may be a victim of confirmation bias, and be unwilling
to accept data contradicting your prior beliefs.
The results are not novel Such broad claims need to be backed up with references.
This has no precedent in existing literature Believe it or not: papers that are more novel tend to be harder
to publish. Reviewers may be unnecessarily conservative.
The results do not surpass the latest SOTA SOTA results are neither necessary nor sufficient for a sci-
entific contribution. An engineering paper could also offer
improvements on other dimensions (efficiency, generalizabil-
ity, interpretability, fairness, etc.)
The results are negative The bias towards publishing only positive results is a known
problem in many fields, and contributes to hype and over-
claiming. If something systematically does not work where
it could be expected to, the community does need to know
about it.
This method is too simple The goal is to solve the problem, not to solve it in a complex
way. Simpler solutions are in fact preferable, as they are less
brittle and easier to deploy in real-world settings.
The paper doesn’t use [my preferred methodology], e.g., deep NLP is an interdisciplinary field, relying on many kinds of
learning contributions: models, resource, survey, data/linguistic/social
analysis, position, and theory.
The topic is too niche A main track paper may well make a big contribution to a
narrow subfield.
The approach is tested only on [not English], so unclear if it The same is true of NLP research that tests only on English.
will generalize to other languages Monolingual work on any language is important both prac-
tically (methods and resources for that language) and theo-
retically (potentially contributing to deeper understanding of
language in general).
The paper has language errors As long as the writing is clear enough, better scientific content
should be more valuable than better journalistic skills.
The paper is missing the comparison to the [latest X] Per ACL policy, the authors are not obliged to draw com-
parisons with contemporaneous work, i.e., work published
within three months before the submission (or three months
before a re-submission).
The authors could also do [extra experiment X] It is always possible to come up with extra experiments and
follow-up work. This is fair if the experiments that are already
presented are insufficient for the claim that the authors are
making. But any other extra experiments are in the “nice-
to-have” category, and belong in the “suggestions” section
rather than “weaknesses.”
The authors should have done [X] instead A.k.a. “I would have written a different paper.” There are
often several valid approaches to a problem. This criticism
applies only if the authors’ choices prevent them from an-
swering their research question, their framing is misleading,
or the question is not worth asking. If not, then [X] is a
comment or a suggestion, but not a “weakness.”

Table 7: Full ARR 2022 guidelines on lazy thinking sourced from Rogers and Augenstein (2021).
Heuristics Description
This has no precedent in existing literature The paper’s topic is completely new, such that there’s no
prior art or all the prior art has been done in another field. We
are interested in papers that tread new ground. Believe it or
not: papers that are more novel tend to be harder to publish.
Reviewers may be unnecessarily conservative.
This method is too simple The paper’s method is too simple. Our goal is not to design
the most complex method. Again, think what the paper’s
contributions and findings are. Often the papers with the
simplest methods are the most cited. If a simple method
outperforms more complex methods from prior work, then
this is often an important finding. The goal is to solve the
problem, not to solve it in a complex way. Simpler solutions
are in fact preferable, as they are less brittle and easier to
deploy in real-world settings.
The paper doesn’t use [my preferred methodology], e.g., deep The paper does not use a particular method (e.g., deep learn-
learning ing). No one particular method is a requirement for good
work. Please justify why that method is needed. Think about
what the paper’s contributions are, and bear in mind that
having a diversity of methods used is not a bad thing. NLP
is an interdisciplinary field, relying on many kinds of con-
tributions: models, resource, survey, data/linguistic/social
analysis, position, and theory.
The topic is too niche / Narrow Topics The paper’s topic is narrow or outdated. Please be open
minded. We do not want the whole community to chase a
trendy topic. Look at the paper’s contributions and consider
what impact it may have on our community. It is easier to
publish on trendy, ‘scientifically sexy’ topics (Smith, 2010).
In the last two years, there has been little talk of anything
other than large pretrained Transformers, with BERT alone
becoming the target of over 150 studies proposing analysis
and various modifications (Rogers et al., 2020). The ‘hot
trend’ forms the prototype for the kind of paper that should be
recommended for acceptance. Niche topics such as historical
text normalization are downvoted (unless, of course, BERT
could somehow be used for that).
The approach is tested only on [not English], so unclear if it The paper’s work is on a language other than English. We
will generalize to other languages care about NLP for any language. The same is true of NLP
research that tests only on English. Monolingual work on
any language is important both practically (methods and
resources for that language) and theoretically (potentially
contributing to deeper understanding of language in general).
The paper has language errors / Writing Style As long as the writing is clear enough, better scientific content
should be more valuable than better journalistic skills.
Non-mainstream approaches Since a ‘mainstream’ *ACL paper currently uses DL-based
methods, anything else might look like it does not really
belong in the main track - even though ACL stands for ‘Asso-
ciation for Computational Linguistics’. That puts interdisci-
plinary efforts at a disadvantage, and continues the trend for
intellectual segregation of NLP (Reiter, 2007). E.g., theoreti-
cal papers and linguistic resources should not be a priori at
a disadvantage just because they do not contain DL experi-
ments.
Resource paper The paper is a resource paper. In a field that relies on su-
pervised machine learning as much as NLP, development of
datasets is as important as modeling work.

Table 8: Lazy thinking classes with extended names, extended descriptions and new additions used in Round 2 of
our annotations. The rest of the class names and descriptions are the same as in Round 1 of our annotation sourced
from Rogers and Augenstein (2021).
Heuristics Positive examples
The results are not surprising Although the experiments are very comprehensive, this paper
lacks technical novelty. The optimal data selection strategy
also seems to offer limited improvement over random data
selection.
The results contradict what I would expect Although the paper empirically shows that the baseline is not
as effective as the proposed method, - I expect more discus-
sion on why using activation values is not a good idea, this
contradicts my prior assumption.-One limitation of this study
is that the paper only focuses on single-word cloze queries
(as discussed in the paper)
The results are not novel The novelty of the approach is limited. The attempt to train
the document retrieval and outcome prediction jointly is un-
successful.
This has no precedent in existing literature There are a lot of problems that I can imagine for real-world
large scale models such as GPT3. To mention a few: (1)
Recent works have shown it is possible to continue training
language models for either language understanding [1] or
additional applications such as code synthesis [2]. Therefore,
perhaps these simple approaches are sufficient enough for
good generalization.
The results are negative I have two concerns about the experimental results. 1. In the
few-shot learning, when the samples are increased from 24
to 1K, the attribute relevance drops by 2 points for positive
class as shown in Table 1, and the toxicity metric becomes
worse for the detoxification task as shown in Table 2. Please
give some explanations. 2. In human evaluation, the inter-
annotator agreement on the sentiment task and the AGNews
482 task is only 0.39 and 0.30 in Fleiss’ κ, which is low to
guarantee a high quality of the evaluation data.
This method is too simple There is unfortunately not a whole lot of new content in this
paper. The proposed method really just boils down to playing
with the temperature settings for the attention of the teacher
model; something that is usually just considered a hyperpa-
rameter. While I have no problem with what is currently in
the paper, I am just not sure that this is enough to form a long
paper proposing a new method.
The topic is too niche / Narrow Topics 1- It is not clear how this task and the approach will perform
for new information (updates) about an event (even if it is
contradictory to what is known about an event) 2- The ap-
proach operates on clusters. New events may not have clus-
ters.
The authors could also do [extra experiment X] 1. According to Table 3, the performance of BARTword and
BARTspan on SST-2 degrades a lot after incorporating text
smoothing, why? 2. Lack of experimental results on more
datasets: I suggest conducting experiments on more datasets
to make a more comprehensive evaluation of the proposed
method. The experiments on the full dataset instead of that
in the low-resource regime are also encouraged.

Table 9: Positive examples used during Round 3 of our annotations. The review segment that corresponds to lazy
thinking is highlighted. We display only the weakness section of the reviews rather than the full review due to space
constraints.
Fine-grained
S.A G.A
Method Models
Acc F1 Acc
R1 R2 (∆R1 ) R3 R1 R2 (∆R1 ) R3 R1 R2 (∆R1 ) R3
R ANDOM - 7.11 4.34 (-2.77) 2.46 (-1.88) 7.15 2.39 (-4.76) 1.28 - - -
M AJORITY - 11.1 7.34 (+3.00) 5.11 (-2.23) 9.05 7.17 (-1.88) 3.45 - - -
Gemma-1.1 7B (T) 22.2 26.7 (+4.50) 18.9 23.1 24.8 (+1.7) 16.7 52.2 58.1 (+5.90) 32.2
Gemma-1.1 7B (TE) - - 24.4 (+5.5) - - 19.4 (+2.7) - - 41.1 (+8.9)
Gemma-1.1 7B (RT) 12.2 11.6 (-0.60) 14.4 9.51 10.4 (+0.89) 10.8 46.7 51.1 (+4.40) 32.2
Gemma-1.1 7B (RTE) - - 17.3 (+2.9) - - 12.8 (+2.0) - - 32.8 (+0.6)
LLaMa 7B (T) 12.2 22.2 (+10.0) 11.1 9.51 20.3 (+10.79) 9.12 15.6 30.6 (+15.0) 35.6
LLaMa 7B (TE) - - 15.6 (+4.5) - - 12.4 (+3.28) - - 38.9 (+3.3)
LLaMa 7B (RT) 12.2 13.2 (+1.00) 12.2 9.51 9.89 (+0.38) 10.1 25.6 33.7 (+7.90) 28.9
LLaMa 7B (RTE) - - 14.2 (+2.0) - - 11.5 (+1.4) - - 30.8 (+1.9)
LLaMa 13B (T) 26.7 26.7 (+0.00) 11.1 20.4 23.8 (+3.4) 9.11 44.4 45.3 (+0.90) 35.6
LLaMa 13B (TE) - - 24.4 (+13.3) - - 20.8 (+11.7) - - 41.1 (+5.5)
LLaMa 13B (RT) 15.6 17.6 (+3.00) 10.7 11.7 13.8 (+2.1) 8.89 41.1 40.4 (-1.10) 32.2
LLaMa 13B (RTE) - - 18.8 (+8.1) - - 17.5 (+8.61) - - 34.4 (+2.2)
Zero-Shot Mistral 7B (T) 27.8 28.8 (+1.10) 28.2 17.5 19.7 (+2.2) 24.3 47.8 51.1 (+3.30) 54.4
Mistral 7B (TE) - - 30.0 (+1.8) - - 26.1 (+1.8 ) - - 55.6 (+1.2)
Mistral 7B (RT) 12.2 16.6 (+4.40) 22.2 9.51 12.8 (+3.29) 19.4 28.9 35.9 (+6.90) 51.1
Mistral 7B (RTE) - - 27.8 (+5.6) - - 23.2 (+3.8) - - 52.2 (+1.1)
Qwen 7B (T) 21.1 22.7 (+1.60) 28.9 17.8 19.8 (+2.0) 24.5 46.7 50.0 (+3.30) 44.4
Qwen 7B (TE) - - 31.1 (+2.2) - - 26.8 (+2.3) - - 56.4 (+12.0)
Qwen 7B (RT) 12.2 13.3 (+1.10) 26.7 9.51 13.4 (+3.89) 22.8 43.3 42.6 (-1.60) 43.3
Qwen 7B (RTE) - - 27.8 (+1.1) - - 24.7 (+1.9) - - 44.2 (+0.9)
Yi-1.5 6B (T) 35.3 37.6 (+2.3) 26.7 23.5 29.7 (+6.2) 22.1 56.7 60.0 (+3.30) 53.8
Yi-1.5 6B (TE) - - 30.0 (+3.3) - - 24.8 (+2.7) - - 54.9 (+1.1)
Yi-1.5 6B (RT) 34.4 32.8 (-1.60) 21.2 22.7 27.8 (+5.1) 19.4 51.1 52.2 (+1.20) 51.3
Yi-1.5 6B (RTE) - - 24.4 (+2.2) - - 20.6 (+1.2) - - 52.7 (+0.6)
SciTülu 7B (T) 14.4 25.3 (+10.9) 22.2 10.5 22.5 (+12.0) 16.2 18.1 29.4 (+11.3) 42.2
SciTülu 7B (TE) - - 23.3 (+1.1) - - 17.8 (+1.6) - - 44.8 (+2.6)
SciTülu 7B (RT) 15.6 18.3 (+2.70) 18.9 12.8 16.7 (+3.9) 14.9 17.3 23.7 (+6.4) 38.9
SciTülu 7B (RTE) - - 19.7 (+0.8) - - 15.5 (+0.6) - - 41.1 (+2.2)
Gemma-1.1 7B (T) 31.4 38.8 (+7.40) 34.6 29.8 31.6 (+1.8) 28.1 49.6 53.4 (+3.80) 46.9
Gemma-1.1 7B (RT) 28.2 35.7 (+7.50) 32.8 26.8 33.7 (+6.9) 24.3 44.7 51.2 (+6.50) 42.8
LLaMa 7B (T) 43.8 47.8 (+2.00) 44.7 36.8 37.2 (+0.4) 37.2 60.8 65.4 (+4.60) 48.9
LLaMa 7B (RT) 43.2 45.3 (+2.10) 41.8 32.1 33.8 (+0.7) 32.3 58.4 61.3 (+2.90) 45.4
LLaMa 13B (T) 45.8 47.8 (+2.00) 50.5 37.5 38.2 (+0.7) 40.1 50.3 51.2 (+0.90) 54.2
LLaMa 13B (RT) 41.2 45.2 (+4.00) 47.3 32.4 36.7 (+3.3) 42.6 48.2 48.8 (+0.60) 52.2
Mistral 7B (T) 35.4 37.4 (+2.00) 42.4 28.8 30.4 (+1.6) 36.4 49.2 54.3 (+5.10) 58.3
Instruction Tuned
Mistral 7B (RT) 31.2 35.2 (+4.00) 37.8 24.5 27.4 (+2.9) 32.5 44.2 50.4 (+6.20) 56.3
Qwen 7B (T) 45.9 48.4 (+2.50) 59.4 38.6 40.2 (+1.6) 57.2 51.2 55.8 (+4.60) 62.3
Qwen 7B (RT) 41.2 42.4 (+1.20) 47.8 34.3 36.4 (+2.1) 35.3 49.3 50.4 (+1.10) 47.3
Yi-1.5 6B (T) 45.1 47.8 (+2.70) 47.9 36.4 38.2 (+1.8) 38.2 51.2 53.8 (+2.60) 52.2
Yi-1.5 6B (RT) 43.2 45.3 (+2.10) 46.3 31.7 33.5 (+1.8) 34.6 59.4 61.2 (+1.80) 58.4
SciTülu 7B (T) 45.7 48.6 (+2.90) 54.3 38.6 40.2 (+1.6) 48.5 51.3 52.6 (+1.30) 54.8
SciTülu 7B (RT) 41.4 42.6 (+1.20) 51.4 34.3 35.6 (+1.3) 42.3 49.4 50.2 (+0.80) 51.3

Table 10: Performance of LLMs on different rounds of annotation. S.A represents the string-matching evaluator,
and G.A represents the GPT-based evaluator. ‘T’ represents prompting with only the target sentence, RT represents
the combination of the review and the target sentence. Adding demonstrations to the prompt is represented by ‘E’.
R1 represents ‘Round 1’, R2 represents ‘Round 2’, and R3 represents ‘Round 3’ respectively.
Coarse-grained
S.A G.A
Method Models
Acc F1 Acc
R1 R2 (∆R1 ) R3 R1 R2 (∆R1 ) R3 R1 R2 (∆R1 ) R3
R ANDOM - 40.7 40.7 (+0.00) 43.3 35.6 35.6 (+0.0) 38.2 - - -
M AJORITY - 51.4 51.4 (+0.00) 52.3 44.4 48.2 (+3.8) 49.3 - - -
Gemma-1.1 7B (T) 44.3 46.1 (+1.8) 55.6 38.7 40.2 (+1.5) 45.3 51.1 54.4 (+3.30) 57.8
Gemma-1.1 7B (TE) - - 75.6 (+20.0) - - 67.8 (+12.5) - - 88.9 (+31.1)
Gemma-1.1 7B (RT) 48.1 50.4 (+2.30) 65.6 42.2 45.3 (+3.1) 54.2 47.4 49.1 (+1.70) 65.6
Gemma-1.1 7B (RTE) - - 71.1 (+5.5) - - 65.4 (+11.2) - - 82.2 (+16.6)
LLaMa 7B (T) 57.7 60.0 (+2.30) 80.0 52.4 55.3 (+0.9) 67.2 70.0 75.0 (+5.00) 86.1
LLaMa 7B (TE) - - 84.4 (+4.4) - - 70.3 (+3.1) - - 89.1 (+3.0)
LLaMa 7B (RT) 53.3 60.0 (+6.70) 73.3 48.4 51.6 (+3.2) 58.3 55.1 67.7 (+12.6) 76.7
LLaMa 7B (RTE) - - 75.3 (+2.0) - - 60.2 (+1.9) - - 81.1 (+4.4)
LLaMa 13B (T) 60.2 62.2 (+2.00) 71.2 53.4 55.7 64.3 73.1 75.4 (+2.30) 61.1
LLaMa 13B (TE) - - 73.1 (+1.9) - - 67.2 (+2.9) - - 71.1 (+10.0)
LLaMa 13B (RT) 68.6 70.2 (+1.60) 68.8 60.2 65.7 (+5.5) 58.1 69.4 70.2 (+0.80) 51.2
LLaMa 13B (RTE) - - 70.3 (+1.5) - - 62.2 (+4.1) - - 61.1 (+9.9)
Zero-Shot Mistral 7B (T) 57.8 58.8 (+1.00) 74.4 51.2 53.2 (+2.0) 65.2 64.8 66.3 (+1.50) 75.2
Mistral 7B (TE) - - 86.7 (+4.0) - - 70.2 (+5.0) - - 86.7 (+11.5)
Mistral 7B (RT) 55.4 57.4 (+2.00) 62.2 48.4 50.2 (+1.8) 54.3 53.8 56.0 (+2.20) 62.2
Mistral 7B (RTE) - - 68.8 (+6.6) - - 58.4 (+4.1) - - 68.8 (+6.6)
Qwen 7B (T) 68.9 70.4 (+1.40) 82.7 61.2 63.4 (+2.2) 73.2 74.1 76.1 (+1.20) 82.7
Qwen 7B (TE) - - 86.7 (+4.0) - - 74.4 (+1.2) - - 86.7 (+4.0)
Qwen 7B (RT) 53.3 56.5 (+3.20) 60.2 47.1 49.2 (+2.1) 51.2 53.3 56.5 (+3.20) 60.2
Qwen 7B (RTE) - - 62.2 (+2.0) - - 55.4 (+4.2) - - 62.2 (+2.0)
Yi-1.5 6B (T) 64.4 68.7 (+4.3) 71.3 62.2 66.6 (+4.4) 68.3 71.1 73.4 (+2.3) 72.3
Yi-1.5 6B (TE) - - 74.5 (+3.2) - - 72.8 (+4.5) - - 73.8 (+1.5)
Yi-1.5 6B (RT) 63.3 68.3 (+5.0) 68.1 69.2 65.1 70.4 (+5.30) 70.1
Yi-1.5 6B (RTE) - - 70.1 (+2.0) - - 71.6 (+2.4) - - 72.4 (+2.3)
SciTülu 7B (T) 57.8 58.3 (+0.50) 51.1 51.2 52.6(+1.4) 45.3 57.8 58.3 (+0.50) 52.2
SciTülu 7B (TE) - - 72.2 (+21.1) - - 64.3 (+19.0) - - 72.2 (+20.0)
SciTülu 7B (RT) 55.6 58.7 (+3.10) 71.1 65.2 55.6 58.7 (+3.10) 68.8
SciTülu 7B (RTE) - - 88.8 (+17.7) - - 76.4 (+11.2) - - 91.1 (+22.3)
Gemma-1.1 7B (T) 57.8 61.4 (+3.60) 81.2 50.2 54.6 (+4.4) 75.3 55.7 58.8 (+3.10) 91.2
Gemma-1.1 7B (RT) 55.6 59.4 (+3.80) 78.8 53.2 52.5 (-0.7) 71.5 54.9 56.6 (+1.70) 89.2
LLaMa 7B (T) 62.7 65.4 (+2.70) 85.4 52.1 56.3 (+4.2) 68.3 75.4 79.3 (+3.90) 91.1
LLaMa 7B (RT) 61.2 63.1 (+1.90) 81.3 55.4 57.4 (+2.0) 65.2 71.3 77.2 (+5.90) 88.6
LLaMa 13B (T) 74.3 74.6 (+0.30) 75.3 65.3 67.2 (+1.9) 69.1 75.2 77.8 (+2.60) 75.2
LLaMa 13B (RT) 70.2 71.8 (+1.60) 73.3 62.4 63.7 (+1.3) 65.9 71.2 72.3 (+1.10) 74.2
Mistral 7B (T) 60.2 62.2 (+2.00) 86.4 54.6 56.9 (+2.3) 68.4 70.2 74.3 (+4.10) 88.1
Mistral 7B (RT) 65.3 68.2 (+2.90) 88.2 57.2 59.4 (+2.2) 75.3 68.2 70.2 (+2.00) 89.2
Qwen 7B (T) 75.4 76.3 (+0.90) 88.4 68.2 69.0 (+0.8) 76.2 74.8 76.2 (+1.40) 89.4
Qwen 7B (RT) 73.2 74.1 (+0.90) 86.3 65.1 68.2 (+3.1) 74.2 72.8 74.0 (+1.20) 86.2
Yi-1.5 6B (T) 69.5 74.2 (+5.30) 78.4 68.2 72.1 (+3.9) 74.1 72.3 75.8 (+3.50) 79.3
Yi-1.5 6B (RT) 67.2 69.4 (+2.20) 73.2 66.2 70.1 (+3.9) 74.1 69.3 72.2 (+2.90) 75.2
SciTülu 7B (T) 66.3 68.4 (+2.10) 91.2 58.1 60.2 (+2.1) 80.3 76.8 78.8 (+2.00) 92.3
SciTülu 7B (RT) 62.4 65.6 (+3.20) 87.2 56.1 58.1 (+2.0) 76.1 72.3 73.3 (+1.00) 89.9

Table 11: Performance of LLMs on different rounds of annotation for the coarse-grained classification task. S.A represents the string-matching evaluator, and G.A represents the GPT-based evaluator. ‘T’ represents prompting with only the target sentence, and ‘RT’ represents the combination of the review and the target sentence. Adding demonstrations to the prompt is represented by ‘E’. R1, R2, and R3 represent ‘Round 1’, ‘Round 2’, and ‘Round 3’, respectively.
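The S.A and G.A columns come from two ways of scoring free-form model output. Below is a minimal sketch of a string-matching check; the normalisation details are an assumption rather than the exact evaluator code, and the GPT-based evaluator (G.A) instead asks an LLM judge whether the generation matches the gold class.

```python
# Illustrative accuracy computation for the string-matching evaluator (S.A):
# a generation counts as correct if the gold class name appears in it.
# The lowercasing/stripping normalisation is assumed, not taken from the paper.

def string_match_correct(generation: str, gold_label: str) -> bool:
    return gold_label.strip().lower() in generation.strip().lower()

def accuracy(generations, gold_labels):
    hits = sum(string_match_correct(g, y) for g, y in zip(generations, gold_labels))
    return 100.0 * hits / len(gold_labels)

# Example with hypothetical outputs and class names:
print(accuracy(["This segment falls under class A."], ["class A"]))
```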
Method | Mode | Mix | Gemma-1.1 7B (S, G) | LLaMa 7B (S, G) | LLaMa 13B (S, G) | Mistral 7B (S, G) | Qwen 7B (S, G) | Yi-1.5 6B (S, G) | SciTülü 7B (S, G)
Zero-shot
T - 18.13.2 23.04.6 14.24.5 24.55.6 17.82.9 26.53.4 21.04.6 23.03.8 21.11.1 24.51.1 24.21.1 30.51.2 11.12.2 14.02.4
TE - 22.23.4 27.13.4 22.25.4 28.45.8 21.42.2 29.32.4 18.64.7 25.54.9 22.21.1 28.21.5 29.11.8 36.52.0 11.22.4 20.62.6
(+4.1) (+4.1) (+8.0) (+3.9) (+3.6) (+2.8) (+2.4) (+2.5) (+1.1) (+3.7) (+5.9) (+6.0) (+0.1) (+6.6)
Instruction Tuned, T
No Mix 29.53.4 36.53.4 25.44.8 30.54.6 29.73.3 35.53.6 27.14.4 30.54.4 27.91.4 30.51.4 44.80.8 50.51.4 15.22.4 23.02.6
(+11.4) (+13.5) (+11.2) (+6.0) (+11.9) (+11.0) (+6.1) (+7.5) (+6.8) (+6.0) (+20.6) (+20.0) (+4.1) (+9.0)
SciRiff Mix 23.53.4 29.53.3 38.64.1 46.34.6 32.03.4 36.03.2 24.04.2 26.54.2 27.61.4 31.51.6 45.41.2 51.51.7 28.82.4 36.32.4
(+5.4) (+6.5) (+10.6) (+6.0) (+14.2) (+10.5) (+3.0) (+3.5) (+6.5) (+7.0) (+21.2) (+21.0) (+17.7) (+22.3)
Tülu Mix 31.63.0 39.53.1 21.74.2 26.54.4 28.42.9 33.03.0 25.74.4 26.54.4 45.51.1 51.01.3 27.21.0 31.01.0 16.52.2 21.52.4
(+13.5) (+16.5) (+7.5) (+2.0) (+10.6) (+6.5) (+4.7) (+3.5) (+24.4) (+26.5) (+3.0) (+1.5) (+5.4) (+7.5)
Full Mix 25.43.0 30.03.0 28.34.2 35.54.0 31.53.1 37.03.2 28.64.2 32.34.4 31.91.0 35.51.0 45.70.8 50.00.9 20.42.0 26.52.2
(+7.3) (+7.0) (+14.1) (+11.0) (+13.7) (+10.5) (+7.6) (+9.3) (+10.8) (+11.0) (+11.5) (+19.5) (+9.3) (+12.5)
Zero-shot
RT - 17.24.1 21.84.6 10.15.2 16.36.1 14.53.1 28.13.1 17.84.1 22.04.9 18.01.1 22.81.9 21.61.1 29.81.1 08.22.1 14.42.1
RTE - 18.73.4 25.83.6 12.45.4 20.45.1 21.52.1 29.12.9 18.44.1 23.34.6 19.31.1 26.61.1 25.40.8 31.10.6 12.72.1 18.82.2
(+1.5) (+4.0) (+2.3) (+4.1) (+7.0) (+1.0) (+0.6) (+1.3) (+1.3) (+3.8) (+3.8) (+1.3) (+4.5) (+4.4)
Instruction Tuned, RT
No Mix 26.83.1 28.83.2 22.94.4 28.84.4 28.41.4 30.61.6 24.14.4 27.64.4 22.61.4 29.61.1 28.70.4 36.90.4 31.53.4 34.4
(+9.6) (+7.0) (+12.8) (+12.5) (+13.9) (+12.5) (+6.3) (+5.6) (+4.6) (+6.8) (+7.1) (+7.1) (+23.3) (+20.0)
SciRiff Mix 22.23.2 26.03.2 33.64.1 42.34.6 31.53.9 37.33.6 18.64.4 24.44.2 22.50.4 26.40.6 38.51.4 41.31.7 24.82.4 32.32.4
(+5.0) (+4.2) (+28.5) (+30.0) (+17.0) (+11.2) (+0.8) (+2.4) (+4.5) (+3.6) (+16.9) (+11.5) (+16.6) (+17.9)
Tülu Mix 25.73.0 30.63.0 28.64.2 33.14.0 27.23.0 33.13.0 21.03.8 25.03.8 25.11.2 31.31.6 35.80.8 40.60.9 28.72.0 35.62.2
(+8.5) (+8.8) (+18.5) (+16.8) (+12.7) (+5.0) (+3.2) (+3.0) (+7.1) (+8.5) (+14.2) (+10.8) (+20.5) (+21.2)
Full Mix 21.43.2 27.03.1 22.84.0 29.04.0 25.43.0 29.02.9 27.64.2 28.34.4 22.71.1 28.01.2 25.40.8 32.20.9 26.82.2 31.42.4
(+4.2) (+5.2) (+11.7) (+12.7) (+10.9) (+0.9) (+9.8) (+6.3) (+4.7) (+5.2) (+3.8) (+1.4) (+18.6) (+17.0)

Table 12: Performance of LLMs for fine-grained classification with 3-fold cross-validation on LAZYREVIEW test sets. ‘S’ represents the string-matching evaluator, and ‘G’ represents the GPT-based evaluator; both report accuracies. ‘T’ represents prompting with only the target sentence, and ‘RT’ represents the combination of the review and the target sentence. Adding demonstrations to the prompt is represented by ‘E’. The best results for this task are highlighted in cyan. Increments are shown relative to the classic zero-shot setup without exemplars (the first zero-shot row for T or RT, respectively).
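Each cell in Tables 12 and 13 aggregates three cross-validation folds, and the bracketed increments are reported against the plain zero-shot run without exemplars. The sketch below shows that bookkeeping under the assumption that scores are averaged over folds; the fold accuracies are made up for illustration.

```python
# Sketch of the aggregation assumed behind Tables 12 and 13: average a metric
# over the three cross-validation folds and report the gain over the plain
# zero-shot run without exemplars. The fold scores below are hypothetical.
from statistics import mean

def fold_average(fold_scores):
    return round(mean(fold_scores), 1)

def increment(setting_scores, zero_shot_scores):
    return round(fold_average(setting_scores) - fold_average(zero_shot_scores), 1)

zero_shot_T = [17.0, 18.5, 19.0]   # hypothetical per-fold accuracies
tulu_mix_T = [30.5, 32.0, 32.3]
print(fold_average(tulu_mix_T), increment(tulu_mix_T, zero_shot_T))
```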

Task | Mode | Mix | Gemma-1.1 7B (S, G) | LLaMa 7B (S, G) | LLaMa 13B (S, G) | Mistral 7B (S, G) | Qwen 7B (S, G) | Yi-1.5 6B (S, G) | SciTülü 7B (S, G)
Zero-Shot
T - 57.02.9 66.02.8 52.23.9 57.83.9 51.82.6 58.62.8 51.13.8 61.13.6 53.41.0 63.51.0 57.21.2 65.51.4 27.32.6 32.02.8
TE - 63.13.1 65.23.2 58.24.2 60.24.4 59.43.2 61.13.0 59.14.0 63.44.2 61.60.8 65.70.7 62.21.4 67.81.6 29.12.6 34.82.9
(+6.0) (-0.8) (+6.0) (+2.4) (+7.6) (+2.5) (+8.0) (+2.3) (+8.2) (+2.2) (+5.0) (+2.3) (+1.8) (+2.8)
Instruction Tuned, T
No Mix 69.33.2 71.53.2 69.24.3 71.54.1 60.42.9 65.03.0 61.04.2 65.24.0 63.30.8 64.01.0 68.21.0 71.51.2 69.62.8 71.52.9
(+12.3) (+5.5) (+12.0) (+13.8) (+8.6) (+6.4) (+9.9) (+4.1) (+15.9) (+0.5) (+11.0) (+6.0) (+42.0) (+39.5)
SciRiFF Mix 69.43.0 71.53.0 69.83.0 72.53.0 68.42.9 72.03.0 68.43.0 70.53.0 60.43.0 64.53.0 69.43.0 71.53.0 69.83.0 73.53.0
(+12.4) (+5.5) (+17.6) (+14.7) (+16.6) (+13.4) (+17.3) (+9.4) (+7.0) (+1.0) (+12.2) (+6.0) (+42.5) (+41.5)
Tülu Mix 69.63.2 72.43.2 65.43.8 67.03.8 65.32.8 67.32.8 56.23.8 62.53.8 69.30.8 72.01.0 66.31.5 68.01.7 65.22.8 67.02.8
(+12.6) (+6.4) (+13.2) (+9.2) (+13.5) (+8.7) (+5.1) (+1.4) (+15.9) (+8.5) (+9.1) (+2.5) (+37.9) (+35.0)
Full Mix 69.43.3 71.53.4 69.84.1 71.54.4 68.43.4 68.23.6 71.34.0 72.04.2 63.80.8 64.00.6 69.81.8 72.01.2 60.82.8 68.52.7
(+12.4) (+5.5) (+17.6) (+13.7) (+16.6) (+9.6) (+20.2) (+10.9) (+10.4) (+0.5) (+12.6) (+6.5) (+33.5) (+36.5)
Zero-Shot
RT - 58.23.1 61.33.1 51.54.2 55.64.2 54.83.6 57.63.4 58.84.2 60.04.0 51.00.8 55.00.7 50.41.0 52.51.1 27.82.4 31.82.6
RTE - 59.93.0 63.43.0 56.14.0 58.24.1 58.43.8 61.33.6 57.44.0 61.14.1 54.50.7 57.80.8 51.81.2 54.51.4 32.82.8 36.92.9
(+1.7) (+2.1) (+13.3) (+11.3) (+3.6) (+3.7) (-1.4) (+1.1) (+3.5) (+2.8) (+1.4) (+2.0) (+5.0) (+5.1)
Instruction Tuned, RT
No Mix 62.83.2 65.93.1 64.84.0 66.94.2 60.22.7 62.63.0 65.44.0 67.54.0 59.21.0 61.90.8 63.81.1 65.01.2 64.62.4 66.92.4
(+4.6) (+4.6) (+13.3) (+11.3) (+5.4) (+5.0) (+6.6) (+7.5) (+8.2) (+6.9) (+13.4) (+12.5) (+36.8) (+35.1)
SciRiFF Mix 63.43.2 66.33.0 66.53.8 66.93.9 58.23.0 60.62.8 66.43.4 68.13.3 58.51.0 60.60.8 64.31.2 66.21.4 64.32.6 66.92.8
(+5.2) (+5.0) (+1.7) (+11.3) (+3.4) (+3.0) (+7.6) (+8.1) (+7.5) (+5.6) (+13.9) (+13.7) (+36.5) (+35.1)
Tülu Mix 64.33.1 66.33.2 65.44.2 67.54.3 62.82.8 66.92.8 63.92.8 66.92.8 59.22.8 61.02.8 62.42.8 66.32.8 63.82.8 66.92.8
(+6.1) (+5.0) (+13.9) (+11.9) (+8.0) (+9.3) (+5.1) (+6.9) (+8.2) (+6.0) (+12.0) (+13.8) (+36) (+35.1)
Full Mix 64.33.3 66.93.5 65.34.2 66.94.3 65.43.2 66.93.4 64.84.3 66.94.4 56.40.4 58.10.8 62.81.8 65.61.6 64.52.4 66.92.6
(+6.1) (+5.6) (+13.8) (+11.3) (+9.6) (+9.3) (+6.0) (+6.9) (+5.0) (+3.1) (+12.4) (+13.1) (+36.7) (+35.1)

Table 13: Performance of LLMs with 3-fold cross-validation on LAZYREVIEW test sets for coarse-grained classification. ‘S’ represents the string-matching evaluator, and ‘G’ represents the GPT-based evaluator; both report accuracies. ‘T’ represents prompting with only the target sentence, and ‘RT’ represents the combination of the review and the target sentence. Adding demonstrations to the prompt is represented by ‘E’. The best results for this task are highlighted in blue. Increments are shown relative to the classic zero-shot setup without exemplars (the first zero-shot row for T or RT, respectively).
Model                 Class       Fine-grained (P / R / F1)   Coarse-grained (P / R / F1)
gemma-1.1-7b-it       Correct     0.56 / 1.00 / 0.71          0.95 / 0.92 / 0.93
                      Incorrect   1.00 / 0.91 / 0.95          1.00 / 0.92 / 0.95
Llama-7B-chat         Correct     1.00 / 1.00 / 1.00          0.95 / 0.98 / 0.97
                      Incorrect   1.00 / 1.00 / 1.00          1.00 / 1.00 / 1.00
Llama-13b-chat        Correct     0.88 / 1.00 / 0.93          0.95 / 0.96 / 0.96
                      Incorrect   1.00 / 0.97 / 0.99          1.00 / 1.00 / 1.00
Mistral-7B-instruct   Correct     0.83 / 1.00 / 0.91          0.92 / 0.98 / 0.95
                      Incorrect   1.00 / 0.97 / 0.91          1.00 / 1.00 / 1.00
Qwen-7B-chat          Correct     0.86 / 1.00 / 0.92          1.00 / 1.00 / 1.00
                      Incorrect   1.00 / 0.97 / 0.99          1.00 / 1.00 / 1.00
Yi-6B-chat            Correct     1.00 / 0.92 / 0.96          1.00 / 1.00 / 1.00
                      Incorrect   0.97 / 1.00 / 0.98          1.00 / 1.00 / 1.00
SciTülu 7B            Correct     1.00 / 0.95 / 0.98          0.91 / 0.94 / 0.98
                      Incorrect   1.00 / 1.00 / 0.98          1.00 / 1.00 / 1.00

Table 14: Manual evaluation of the string-based evaluator for each model in terms of Precision (P), Recall (R), and F1 scores.
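Tables 14 and 15 treat each automatic evaluator verdict (‘correct’ / ‘incorrect’) as a prediction to be checked against a manual judgment and report per-class precision, recall, and F1. A minimal sketch of that computation follows; only the metric definitions are standard, and the verdict lists are hypothetical.

```python
# Sketch of the per-class precision/recall/F1 used to audit an automatic
# evaluator against manual judgments. The verdict lists below are hypothetical.

def prf(preds, golds, positive):
    tp = sum(p == positive and g == positive for p, g in zip(preds, golds))
    fp = sum(p == positive and g != positive for p, g in zip(preds, golds))
    fn = sum(p != positive and g == positive for p, g in zip(preds, golds))
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    f1 = 2 * precision * recall / (precision + recall) if precision + recall else 0.0
    return round(precision, 2), round(recall, 2), round(f1, 2)

evaluator_verdicts = ["correct", "correct", "incorrect", "correct"]
manual_judgments   = ["correct", "incorrect", "incorrect", "correct"]
for cls in ("correct", "incorrect"):
    print(cls, prf(evaluator_verdicts, manual_judgments, positive=cls))
```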

Model                 Class       Fine-grained (P / R / F1)   Coarse-grained (P / R / F1)
Gemma-1.1-7b-it       Correct     0.69 / 0.96 / 0.81          1.00 / 1.00 / 1.00
                      Incorrect   0.89 / 0.42 / 0.57          1.00 / 1.00 / 1.00
LlaMa-7B-chat         Correct     0.76 / 1.00 / 0.87          1.00 / 1.00 / 1.00
                      Incorrect   0.89 / 0.42 / 0.57          1.00 / 1.00 / 1.00
LlaMa-13b-chat        Correct     0.76 / 1.00 / 0.87          1.00 / 1.00 / 1.00
                      Incorrect   1.00 / 0.44 / 0.61          1.00 / 1.00 / 1.00
Mistral-7B-instruct   Correct     0.82 / 1.00 / 0.90          1.00 / 1.00 / 1.00
                      Incorrect   1.00 / 0.36 / 0.53          1.00 / 1.00 / 1.00
Qwen-7B-chat          Correct     0.89 / 1.00 / 0.94          1.00 / 1.00 / 1.00
                      Incorrect   1.00 / 0.42 / 0.59          1.00 / 1.00 / 1.00
Yi-6B-chat            Correct     0.88 / 1.00 / 0.93          1.00 / 1.00 / 1.00
                      Incorrect   1.00 / 0.76 / 0.87          1.00 / 1.00 / 1.00
SciTülu 7B            Correct     0.83 / 1.00 / 0.94          1.00 / 1.00 / 1.00
                      Incorrect   1.00 / 0.86 / 0.88          1.00 / 1.00 / 1.00

Table 15: Manual evaluation of the GPT-3.5 evaluator for each model in terms of Precision (P), Recall (R), and F1 scores.
