Comparison of existing summarization datasets with the datasets introduced in this work (Avg # Tokens given per Document and per Summary):

Dataset | Language | Domain | #Doc | Avg # Tokens (Doc) | Avg # Tokens (Summ)
CNN/DM (Hermann et al., 2015) | EN | News | 312K | 781 | 56
Gigawords (Napoles et al., 2012) | EN | News | 4.02M | 31 | 8
arXiv (Cohan et al.) | EN | Academic | 216K | 6,914 | 293
PubMed (Cohan et al.) | EN | Academic | 133K | 3,224 | 214
TL;DR, TOS;DR (Manor and Li, 2019) | EN | Contracts | 506 | 106 | 17
BigPatent (Sharma et al.) | EN | Patent | 1.34M | 3,573 | 117
RulingBR (Feijó and Moreira, 2018) | Portuguese | Court Rulings | 10,623 | 1,397 | 100
This work:
IN-Ext (Indian docs, extractive summ) | EN | Court Rulings | 50 | 5,389 | 1,670
IN-Abs (Indian docs, abstractive summ) | EN | Court Rulings | 7,130 | 4,378 | 1,051
UK-Abs (UK docs, abstractive summ) | EN | Court Rulings | 793 | 14,296 | 1,573
Table 8: Evaluation of some summaries from the IN-Abs dataset by three domain experts (two recent LLB graduates and a senior Law faculty member). The evaluation parameters are explained in the text. Scores are given by each expert in the range [0-5], 5 being the best. The Mean and Median (Med.) scores for each summarization algorithm and for each parameter are computed over 15 scores (across 5 documents, each judged by 3 experts).
Results (Table 8): According to the Law experts, important information (Imp. Inf.) is covered best by DSDR, followed by CaseSummarizer and SummaRuNNer. In terms of readability (Read.) as well, DSDR, CaseSummarizer and SummaRuNNer have higher mean scores than the others. Finally, the Overall ratings show that DSDR satisfies the Law practitioners more than the other algorithms do, with CaseSummarizer coming second. These observations reveal a discrepancy with the automatic evaluation in Section 7, where supervised methods obtained better ROUGE scores than unsupervised ones.

Importantly, we again see that none of the summaries achieves a balanced representation of all the rhetorical segments (RPC – ARG). For instance, DSDR (which gets the best overall scores) represents the final judgement (RPC) and statutes (STA) well, but misses important precedents (PRE) and arguments (ARG).

In general, the experts opined that the summaries generated by several algorithms are good in the initial parts, but that their quality degrades gradually from the middle. The experts also found the abstractive summaries less organized, often containing incomplete sentences; in their view, abstractive summaries have potential but need improvement.

Correlation between expert judgments and the automatic metrics: As stated above, there seems to be some discrepancy between the expert judgements and the automatic metrics for summarization. To explore this issue further, we compute the correlation between the expert judgments (the average of the ‘Overall’ scores of the three annotators) and the automatic metrics (ROUGE-1, ROUGE-2 and ROUGE-L F-scores, and BERTScore). The human evaluation was conducted over 5 documents and 7 algorithms. So, for each metric, the correlation was calculated between the 5 human-assigned overall scores and the 5 metric scores, and an average was then taken across all the 7 algorithms (details in Appendix Section A.9). Following this procedure, the correlation of the mean ‘Overall’ score (assigned by experts) with the ROUGE-1 F-score is 0.212, with the ROUGE-2 F-score 0.208, with the ROUGE-L F-score 0.132, and with BERTScore 0.067. These low correlation scores again suggest that automatic summarization metrics may be insufficient for judging the quality of summaries in specialized domains such as Law.

8 Concluding discussion

We develop datasets and benchmark results for legal case judgement summarization. Our study provides several guidelines for long and legal document summarization: (1) For extractive summarization of legal documents, DSDR (unsupervised) and SummaRuNNer (supervised) are promising methods. (2) For abstractive summarization, Legal-Pegasus (pretrained and finetuned) is a good choice. (3) For long documents, fine-tuning models through chunking seems a promising way forward. (4) Document-wide evaluation does not give the complete picture; domain-specific evaluation methods, including domain experts, should also be used.

Acknowledgements

The authors acknowledge the anonymous reviewers for their suggestions. The authors thank the Law domain experts from the Rajiv Gandhi School of Intellectual Property Law, India (Amritha Shaji, Ankita Mohanty, and Prof. Uday Shankar) and from the West Bengal National University of Juridical Sciences, India (Prof. Shouvik Guha) who helped in developing the gold standard summaries (IN-Ext dataset) and in evaluating the summaries. The research is partially supported by the TCG Centres for Research and Education in Science and Technology (CREST) through a project titled “Smart Legal Consultant: AI-based Legal Analytics”.
References

Legal pegasus. https://huggingface.co/nsi319/legal-pegasus. [Online].
Ahsaas Bajaj, Pavitra Dangati, Kalpesh Krishna, Pradhiksha Ashok Kumar, Rheeya Uppaal, Bradford Windsor, Eliot Brenner, Dominic Dotterrer, Rajarshi Das, and Andrew McCallum. 2021. Long document summarization in a low resource setting using pretrained language models. CoRR, abs/2103.00751.
Iz Beltagy, Matthew E. Peters, and Arman Cohan. 2020. Longformer: The long-document transformer. CoRR, abs/2004.05150.
Paheli Bhattacharya, Kaustubh Hiware, Subham Rajgaria, Nilay Pochhi, Kripabandhu Ghosh, and Saptarshi Ghosh. 2019. A comparative study of summarization algorithms applied to legal case judgments. In Proc. European Conference on Information Retrieval.
Paheli Bhattacharya, Shounak Paul, Kripabandhu Ghosh, and Saptarshi Ghosh. 2021. DeepRhole: Deep learning for rhetorical role labeling of sentences in legal case documents. Artificial Intelligence and Law.
Arman Cohan, Franck Dernoncourt, Doo Soon Kim, Trung Bui, Seokhwan Kim, Walter Chang, and Nazli Goharian. A discourse-aware attention model for abstractive summarization of long documents. In Proceedings of the 2018 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, Volume 2 (Short Papers), pages 615–621.
Yue Dong. 2018. A survey on neural network-based summarization methods. CoRR, abs/1804.04589.
Günes Erkan and Dragomir R. Radev. 2004. LexRank: Graph-based lexical centrality as salience in text summarization. J. Artif. Int. Res., 22(1).
Atefeh Farzindar and Guy Lapalme. 2004. LetSum, an automatic legal text summarizing system. In Proc. Legal Knowledge and Information Systems (JURIX).
Diego Feijo and Viviane P. Moreira. 2021. Improving abstractive summarization of legal rulings through textual entailment. Artificial Intelligence and Law.
Diego Feijó and Viviane Moreira. 2018. RulingBR: A summarization dataset for legal texts. In Proc. International Conference on Computational Processing of the Portuguese Language, pages 255–264.
Dephne Gelbart and JC Smith. 1991. Beyond boolean search: Flexicon, a legal text-based intelligent system. In Proceedings of the 3rd International Conference on Artificial Intelligence and Law, pages 225–234. ACM.
Alexios Gidiotis and Grigorios Tsoumakas. 2020. A divide-and-conquer approach to the summarization of academic articles. CoRR, abs/2004.06190.
Yihong Gong and Xin Liu. 2001. Generic text summarization using relevance measure and latent semantic analysis. In Proc. International Conference on Research and Development in Information Retrieval (SIGIR).
Zhanying He, Chun Chen, Jiajun Bu, Can Wang, Lijun Zhang, Deng Cai, and Xiaofei He. 2012. Document summarization based on data reconstruction. In Proc. AAAI Conference on Artificial Intelligence, pages 620–626.
Karl Moritz Hermann, Tomas Kocisky, Edward Grefenstette, Lasse Espeholt, Will Kay, Mustafa Suleyman, and Phil Blunsom. 2015. Teaching machines to read and comprehend. In Advances in Neural Information Processing Systems, volume 28. Curran Associates, Inc.
Dandan Huang, Leyang Cui, Sen Yang, Guangsheng Bao, Kun Wang, Jun Xie, and Yue Zhang. 2020a. What have we achieved on text summarization? In Proceedings of the 2020 Conference on Empirical Methods in Natural Language Processing (EMNLP), pages 446–469, Online. Association for Computational Linguistics.
Dandan Huang, Leyang Cui, Sen Yang, Guangsheng Bao, Kun Wang, Jun Xie, and Yue Zhang. 2020b. What have we achieved on text summarization? In Proceedings of the 2020 Conference on Empirical Methods in Natural Language Processing (EMNLP), pages 446–469.
Hongyan Jing. 2000. Sentence reduction for automatic text summarization. In Proc. Applied Natural Language Processing Conference.
Chris Kedzie, Kathleen Mckeown, and Hal Daumé III. 2018. Content selection in deep learning models of summarization. In Proceedings of the 2018 Conference on Empirical Methods in Natural Language Processing, pages 1818–1828.
Philippe Laban, Tobias Schnabel, Paul N. Bennett, and Marti A. Hearst. 2022. SummaC: Re-visiting NLI-based models for inconsistency detection in summarization. Transactions of the Association for Computational Linguistics, 10:163–177.
Mike Lewis, Yinhan Liu, Naman Goyal, Marjan Ghazvininejad, Abdelrahman Mohamed, Omer Levy, Veselin Stoyanov, and Luke Zettlemoyer. 2019. BART: Denoising sequence-to-sequence pre-training for natural language generation, translation, and comprehension. CoRR, abs/1910.13461.
Mike Lewis, Yinhan Liu, Naman Goyal, Marjan Ghazvininejad, Abdelrahman Mohamed, Omer Levy, Veselin Stoyanov, and Luke Zettlemoyer. 2020. BART: Denoising sequence-to-sequence pre-training for natural language generation, translation, and comprehension. In Proc. ACL, pages 7871–7880.
Chao-Lin Liu and Kuan-Chun Chen. 2019. Extracting the gist of Chinese judgments of the supreme court. In Proc. International Conference on Artificial Intelligence and Law (ICAIL).
Yang Liu and Mirella Lapata. 2019. Text summarization with pretrained encoders. In Proc. EMNLP-IJCNLP.
Laura Manor and Junyi Jessy Li. 2019. Plain English summarization of contracts. In Proceedings of the Natural Legal Language Processing Workshop 2019, pages 1–11.
Ramesh Nallapati, Feifei Zhai, and Bowen Zhou. 2017. SummaRuNNer: A recurrent neural network based sequence model for extractive summarization of documents. In Proc. AAAI Conference on Artificial Intelligence.
Courtney Napoles, Matthew Gormley, and Benjamin Van Durme. 2012. Annotated Gigaword. In Proceedings of the Joint Workshop on Automatic Knowledge Base Construction and Web-scale Knowledge Extraction (AKBC-WEKEX), pages 95–100.
Shashi Narayan, Shay B Cohen, and Mirella Lapata. 2018. Ranking sentences for extractive summarization with reinforcement learning. In Proc. NAACL-HLT, pages 1747–1759.
Artidoro Pagnoni, Vidhisha Balachandran, and Yulia Tsvetkov. 2021. Understanding factuality in abstractive summarization with FRANK: A benchmark for factuality metrics. In Proc. NAACL-HLT, pages 4812–4829.
Seth Polsley, Pooja Jhunjhunwala, and Ruihong Huang. 2016. CaseSummarizer: A system for automated summarization of legal texts. In Proc. International Conference on Computational Linguistics (COLING) System Demonstrations.
Tharindu Ranasinghe, Constantin Orasan, and Ruslan Mitkov. 2019. Enhancing unsupervised sentence similarity methods with deep contextualised word representations. In Proceedings of the International Conference on Recent Advances in Natural Language Processing (RANLP 2019), pages 994–1003, Varna, Bulgaria. INCOMA Ltd.
Nils Reimers and Iryna Gurevych. 2019. Sentence-BERT: Sentence embeddings using siamese BERT-networks. In Proceedings of the 2019 Conference on Empirical Methods in Natural Language Processing. Association for Computational Linguistics.
M Saravanan, B Ravindran, and S Raman. 2006. Improving legal document summarization using graphical models. In Legal Knowledge and Information Systems, JURIX.
Abigail See, Peter J. Liu, and Christopher D. Manning. 2017. Get to the point: Summarization with pointer-generator networks. In Proceedings of the 55th Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers), pages 1073–1083.
Eva Sharma, Chen Li, and Lu Wang. BIGPATENT: A large-scale dataset for abstractive and coherent summarization. In Proc. ACL, pages 2204–2213.
Sajad Sotudeh, Arman Cohan, and Nazli Goharian. 2021. On generating extended summaries of long documents. In The AAAI-21 Workshop on Scientific Document Understanding (SDU 2021).
Jingqing Zhang, Yao Zhao, Mohammad Saleh, and Peter J Liu. 2020. PEGASUS: Pre-training with extracted gap-sentences for abstractive summarization. In Proceedings of the 37th International Conference on Machine Learning, pages 11328–11339.
Tianyi Zhang, Varsha Kishore, Felix Wu, Kilian Q Weinberger, and Yoav Artzi. 2019. BERTScore: Evaluating text generation with BERT. arXiv preprint arXiv:1904.09675.
Hao Zheng and Mirella Lapata. 2019. Sentence centrality revisited for unsupervised summarization. In Proc. ACL, pages 6236–6247.
Linwu Zhong, Ziyi Zhong, Zinian Zhao, Siyuan Wang, Kevin D Ashley, and Matthias Grabmair. 2019. Automatic summarization of legal decisions using iterative masking of predictive sentences. In Proc. International Conference on Artificial Intelligence and Law (ICAIL).
A Appendix

A.1 Rhetorical Role Labels in a Legal Case Document

According to our legal experts, rhetorical role labels/segments define the semantic function of the sentences in a legal case document. A good summary should contain a concise representation of each segment. These rhetorical segments are defined as follows:
(i) Facts (abbreviated as FAC): the chronology of events that led to filing the case;
(ii) Argument (ARG): arguments of the contending parties;
(iii) Statute (STA): established laws referred to by the present court;
(iv) Precedent (PRE): precedents/prior cases that were referred to;
(v) Ratio of the decision (Ratio): the reasoning/rationale behind the final judgement given by the present court;
(vi) Ruling by Present Court (RPC): the final judgement given by the present court.

A.2 Implementation details of Domain-Specific Extractive Summarization Methods

We state here the reproducibility details of the domain-specific summarization methods, which could not be stated in the main paper due to lack of space.

• Legal Dictionary: Some domain-specific summarization methods, like CaseSummarizer and Gist, use a set of legal keywords for identifying the importance of sentences in the input document. We identify these keywords using a glossary from the legal repository https://www.advocatekhoj.com/library. This website provides several legal resources for Indian legal documents, including a comprehensive glossary of legal terms.

• MMR: The original paper experiments on BVA decisions of the US jurisdiction. The MMR method creates a template-based summary by considering various semantic parts of a legal case document and selecting a certain number of sentences from each semantic part. Specifically, the summary is assumed to contain (i) one sentence from the procedural history, (ii) one sentence from the issue, (iii) one sentence from the service history of the veteran, (iv) a variable number of Reasoning & Evidential Support sentences selected using Maximum Margin Relevance, and (v) one sentence from the conclusion. Pattern-based regex extractors are used to identify the sentences for (i)-(iii) and (v).
Reasoning & Evidential Support sentences are identified using a 2-step supervised classification method: in the first step, sentences predictive of a case's outcome are detected using Convolutional Neural Networks; in the second step, a Random Forest classifier is used to extract the "Reasoning & Evidential Support" sentences from among the predictive sentences. In the absence of such annotated training datasets to build a 2-stage classification framework for India and the UK, we adopt only the Maximum Margin Relevance module of their work as a baseline.
This module decides the inclusion of a sentence Si in the summary based on λ × Sim(Si, Case) + (1 − λ) × Sim(Si, Summary), where Case denotes the set of sentences in the original case document and Summary denotes the current set of sentences in the summary. λ acts as the weight that balances relevance and diversity; we use λ = 0.5.
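To make the selection rule above concrete, the following is a minimal sketch of a greedy selector that scores candidates with that expression, assuming sentences are already mapped to vectors and that Sim(s, set) is approximated by the cosine similarity to the set's mean vector. These choices and the function names are illustrative assumptions, not the authors' implementation.

```python
import numpy as np

def cosine(u, v):
    return float(np.dot(u, v) / (np.linalg.norm(u) * np.linalg.norm(v) + 1e-12))

def mmr_select(sent_vecs, budget, lam=0.5):
    """Greedily pick `budget` sentences, scoring each candidate with
    lam * Sim(s, Case) + (1 - lam) * Sim(s, Summary), per the expression above.
    Sim(s, set) is approximated by cosine similarity to the set's mean vector."""
    case_centroid = np.mean(sent_vecs, axis=0)
    selected, remaining = [], list(range(len(sent_vecs)))
    while remaining and len(selected) < budget:
        summary_centroid = (np.mean([sent_vecs[i] for i in selected], axis=0)
                            if selected else np.zeros_like(case_centroid))
        def score(i):
            return (lam * cosine(sent_vecs[i], case_centroid)
                    + (1 - lam) * cosine(sent_vecs[i], summary_centroid))
        best = max(remaining, key=score)
        selected.append(best)
        remaining.remove(best)
    return selected
```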
• Gist: Gist uses the following handcrafted features to represent every sentence in the input legal case document (which is to be summarized):
(i) Quantitative features: the number of words, number of characters, number of unique words, and the position of the sentence.
(ii) Case category information: The original paper produced summaries of Chinese documents, which contain information such as whether a document is recorded as a judgment or as a ruling (a category of judicial judgments), as well as specific words used by the courts to indicate subcategories of the judgments. This information is absent in Indian and UK Supreme Court case documents, so we do not consider this category of features.
(iii) Specific legal terms: We use a legal dictionary for this purpose (from https://www.advocatekhoj.com/library/glossary/a.php, as stated in the main paper).
(iv) Word embeddings: To construct the embedding of a sentence, we take the average of the embeddings of its words. To this end, we train a word2vec model on the training corpus (7,030 documents of the IN-Abs and 693 documents of the UK-Abs dataset). During evaluation, the trained word2vec model is used to derive the embeddings.
(v) One-hot vectors of the first k POS tags in the sequence, where k = 10 as mentioned in the paper.
(vi) Word embeddings of the opening words: we take the average of the embeddings of the first 5 words in the sentence, since the paper did not clearly mention how to obtain this feature.
Based on the above features, Gist uses the following models – MLP, Gradient Boosted Decision Tree, LSTM, and a combination of LSTM and MLP classifiers – to rank sentences by their likelihood of being included in the summary. We observe the best performance using the Gradient Boosted Decision Tree as the ML classifier, which is what we report.

• CaseSummarizer: The original CaseSummarizer method was developed for Australian documents. All sentences in the input document are ranked using the following score: w_new = w_old + σ(0.2d + 0.3e + 1.5s), where w_old is the sum of the TF-IDF values of the sentence's constituent words normalized by the sentence length, d is the number of dates present in the sentence, e is the number of named entity mentions in the sentence, s is a boolean variable indicating whether the sentence is at the start of any section, and σ is the standard deviation among the sentence scores.
The Indian case documents used in our study (IN-Ext and IN-Abs) are less structured than Australian case documents and do not contain section headings. So, in place of that feature, we used a count of the number of legal terms (identified using a legal dictionary) present in the sentence. We could find section numbers of Acts in our gold standard summaries, for example, "section 302 of the Indian Penal Code"; hence, for the parameter d in the formulation, we included both dates and section numbers. The authors did not clearly mention how they identified the "entities" in the texts, so we used the Stanford NER Tagger to identify entities within the sentence. To ensure a fair comparison, we used the same setting on UK-Abs too.
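A small sketch of this scoring scheme with our adaptation (legal-term count in place of the section-heading flag, and section numbers counted together with dates) is given below. The helper names and the simple date/section regex are illustrative assumptions, and entity counting is delegated to whichever NER tagger is available.

```python
import re
import statistics

DATE_OR_SECTION = re.compile(
    r"\b(\d{1,2}[./-]\d{1,2}[./-]\d{2,4}|section\s+\d+)\b", re.IGNORECASE)

def case_summarizer_scores(sentences, tfidf_weight, count_entities, legal_terms):
    """w_new = w_old + sigma * (0.2*d + 0.3*e + 1.5*s), where w_old is the
    length-normalized sum of TF-IDF weights, d counts dates/section numbers,
    e counts named-entity mentions, and s counts legal-dictionary terms."""
    w_old, bonus = [], []
    for sent in sentences:
        tokens = sent.split()
        w_old.append(sum(tfidf_weight(t) for t in tokens) / max(len(tokens), 1))
        d = len(DATE_OR_SECTION.findall(sent))
        e = count_entities(sent)                 # e.g., via an NER tagger
        s = sum(term in sent.lower() for term in legal_terms)
        bonus.append(0.2 * d + 0.3 * e + 1.5 * s)
    sigma = statistics.pstdev(w_old) if len(w_old) > 1 else 0.0
    return [w + sigma * b for w, b in zip(w_old, bonus)]
```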
• LetSum and KMM: Both the LetSum and KMM methods initially assign rhetorical labels to sentences (using certain cue-phrases and Conditional Random Fields, respectively). The sentences are then ranked, for which LetSum uses TF-IDF scores and KMM uses a K-Mixture Model based score. However, the rhetorical role information is not used for generating the summary; rather, the rhetorical labels are used in a post-summarization step that is mainly meant for displaying the summary in a structured way. We therefore implement only the sentence ranking modules of these methods, i.e., TF-IDF based summarization for LetSum and K-Mixture Model based summarization for KMM.

A.3 Implementation Details of Domain-Independent Extractive Summarization Methods

We use the publicly available implementations of the domain-independent extractive methods from the following sources:
• LexRank, LSA, Luhn and Reduction: https://pypi.org/project/sumy/
• PacSum: https://github.com/mswellhao/PacSum
• SummaRuNNer: https://github.com/hpzhao/SummaRuNNer
• BERTSUM: https://github.com/nlpyang/PreSumm. The original BERTSUM model uses a post-processing step called Trigram Blocking that excludes a candidate sentence if it has a significant amount of trigram overlap with the already generated summary (to minimize redundancy in the summary). However, we observed that this step leads to summaries that are too short, as also observed in (Sotudeh et al., 2021). Hence we ignore this step.

A.4 Methods for obtaining Training Data for Extractive Supervised Methods

As stated in Section 5, we tried three methods for generating training data for extractive supervised methods from abstractive reference summaries. The best-performing Avr method (which we finally used in our experiments) was described in Section 5. Here we describe the other two methods that we tried.

(i) Maximal: In this approach, proposed in (Nallapati et al., 2017), the basic premise is to maximize the ROUGE score between the extractive summary and the abstractive gold-standard summary. However, global optimization is computationally expensive; a faster greedy strategy is to keep adding sentences to the extractive summary one by one, each time selecting the sentence that, when added to the already extracted summary, yields the maximum ROUGE score with respect to the abstractive gold-standard summary. This process is repeated till the ROUGE score does not increase any more. Finally, all the sentences in this extractive summary are labelled as 1, and the rest as 0.

(ii) TF-IDF: We calculated the TF-IDF vectors for all the sentences in the source document and those in the summary. For each sentence in the summary, we find the three sentences in the full text that are most similar to it, where similarity is measured as the cosine similarity between the TF-IDF vectors of a sentence in the summary and a sentence in the source document and must be greater than 0.4. We label the sentences in the source document that are similar to some summary sentence as 1, and the rest as 0.

A.5 Implementation details of Abstractive Summarization Methods

We use the publicly available implementations of the abstractive methods from the following sources:
• BART: https://huggingface.co/facebook/BART_large
• Legal-Pegasus (trained on legal documents): https://huggingface.co/nsi319/legal-pegasus
• Legal-LED (trained on legal documents): https://huggingface.co/nsi319/legal-led-base-16384
The hyper-parameters used for finetuning are given in Table 9.

Table 9: Hyper-parameters used in finetuning BART, Legal-Pegasus and Legal-LED.
Model | Fine-tuning parameters
BART | Learning rate 2e-5, Epochs 3, Batch size 1, Max input length 1024, Max output length 512
Legal-Pegasus | Learning rate 5e-5, Epochs 2, Batch size 1, Max input length 512, Max output length 256
Legal-LED | Learning rate 1e-3, Epochs 3, Batch size 4, Max input length 16384, Max output length 1024

A.6 Methods for obtaining finetuning data for abstractive summarization models

As stated in Section 6.2, we experimented with several sentence similarity measures for generating finetuning data for abstractive models. The best performing sentence similarity measure, MCS, was described in Section 6.2. Here we describe the other sentence similarity measures that we tried.

(i) Smooth Inverse Frequency with cosine similarity (SIF) (Ranasinghe et al., 2019): This approach is similar to the MCS approach; only here, instead of the mean, we consider a weighted mean, and we use a pre-trained BERT model. The weight of every token w is given by a / (a + p(w)), where p(w) is the estimated frequency of the word in the whole dataset. In other words, the weight of a word is inversely proportional to the number of its occurrences.

(ii) Cosine similarity with BERT [CLS] token (CLS-CS): Here we consider the cosine similarity of the encodings of the [CLS] tokens of the two sentences (as given by the pre-trained BERT model).

(iii) MCS_RR: Here, we use Rhetorical Roles (RR) for generating finetuning data that incorporates legal domain knowledge. As described earlier in Section 3, a legal case document consists of 7 rhetorical segments, such as Facts, Statutes, etc. We incorporate this knowledge into our abstractive summarization process by combining it with the divide-and-conquer approach presented in (Gidiotis and Tsoumakas, 2020), which was originally designed for summarizing research articles that are already segmented into logical sections.
We first use a state-of-the-art classifier for rhetorical labeling of sentences in a legal document (Bhattacharya et al., 2021) to assign one of the labels – RPC, FAC, STA, RLC, Ratio, PRE, ARG – to each sentence of a document. We collate the sentences of a particular role into one segment; thus, effectively, we partition a document into 7 segments, each corresponding to a rhetorical role. Then we apply the same approach as stated above to generate the summary of each segment; for this, we use the MCS sentence similarity measure (which performs the best, as we shall see later in Section 7). Note that some of these rhetorical segments may themselves be longer than the input token limit of BART and Pegasus; in such cases, we further divide the rhetorical segments into smaller chunks and then generate the summary of each chunk.
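The segment-wise, chunked generation described in (iii) can be sketched as follows. Here `predict_role` (the rhetorical-role classifier) and `summarize_chunk` (the abstractive model call) are placeholders, and the whitespace-based token count is a simplification; this is a schematic outline under those assumptions, not the exact pipeline code.

```python
ROLES = ["RPC", "FAC", "STA", "RLC", "Ratio", "PRE", "ARG"]

def summarize_by_rhetorical_role(sentences, predict_role, summarize_chunk,
                                 max_tokens=1024):
    """Group sentences by predicted rhetorical role, split over-long segments
    into chunks within the model's input limit, summarize each chunk, and
    concatenate the partial summaries."""
    segments = {role: [] for role in ROLES}
    for sent in sentences:
        segments[predict_role(sent)].append(sent)

    parts = []
    for role in ROLES:
        chunk, length = [], 0
        for sent in segments[role]:
            n_tok = len(sent.split())            # rough token count
            if chunk and length + n_tok > max_tokens:
                parts.append(summarize_chunk(" ".join(chunk)))
                chunk, length = [], 0
            chunk.append(sent)
            length += n_tok
        if chunk:
            parts.append(summarize_chunk(" ".join(chunk)))
    return " ".join(parts)
```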
A.7 Detailed Summarization Results

Table 10, Table 11 and Table 12 contain the document-wide ROUGE and BERTScores for the IN-Ext, IN-Abs and UK-Abs datasets respectively. These tables give the results for all summarization methods that we have applied (while the tables in the main text report results of only some of the best-performing methods). Table 13 and Table 14 contain the segment-wise ROUGE scores over the IN-Ext and UK-Abs datasets, for all methods that we have applied.

Table 10: Document-wide ROUGE-L and BERTScores (Fscore) on the IN-Ext dataset. All values averaged over the 50 documents in the dataset. The best value in a particular class of methods is in bold.
Algorithm | R-1 | R-2 | R-L | BERTScore
Extractive Methods – Unsupervised, Domain Independent
LexRank | 0.564 | 0.344 | 0.388 | 0.862
Lsa | 0.553 | 0.348 | 0.397 | 0.875
DSDR | 0.566 | 0.317 | 0.264 | 0.834
Luhn | 0.568 | 0.373 | 0.422 | 0.882
Reduction | 0.561 | 0.358 | 0.405 | 0.869
Pacsum_bert | 0.590 | 0.410 | 0.335 | 0.879
Pacsum_tfidf | 0.566 | 0.357 | 0.301 | 0.839
Extractive Methods – Unsupervised, Legal Domain Specific
MMR | 0.563 | 0.318 | 0.262 | 0.833
KMM | 0.532 | 0.302 | 0.28 | 0.836
LetSum | 0.591 | 0.401 | 0.391 | 0.875
CaseSummarizer | 0.52 | 0.321 | 0.279 | 0.835
Extractive Methods – Supervised, Domain Independent
SummaRunner | 0.532 | 0.334 | 0.269 | 0.829
BERT-Ext | 0.589 | 0.398 | 0.292 | 0.85
Extractive Methods – Supervised, Legal Domain Specific
Gist | 0.555 | 0.335 | 0.391 | 0.864
Abstractive Methods – Pretrained
BART | 0.475 | 0.221 | 0.271 | 0.833
BERT-BART | 0.488 | 0.236 | 0.279 | 0.836
Legal-Pegasus | 0.465 | 0.211 | 0.279 | 0.842
Legal-LED | 0.175 | 0.036 | 0.12 | 0.799
Abstractive Methods – Finetuned
BART_CLS | 0.534 | 0.29 | 0.349 | 0.853
BART_MCS | 0.557 | 0.322 | 0.404 | 0.868
BART_SIF | 0.540 | 0.304 | 0.369 | 0.857
BERT_BART_MCS | 0.553 | 0.316 | 0.403 | 0.869
Legal-Pegasus_MCS | 0.575 | 0.351 | 0.419 | 0.864
Legal-LED | 0.471 | 0.26 | 0.341 | 0.863
BART_MCS_RR | 0.574 | 0.345 | 0.402 | 0.864

Table 11: Document-wide ROUGE-L and BERTScores (Fscore) on the IN-Abs dataset, averaged over the 100 test documents. The best value in a particular class of methods is in bold.
Algorithm | R-1 | R-2 | R-L | BERTScore
Extractive Methods – Unsupervised, Domain Independent
LexRank | 0.436 | 0.195 | 0.284 | 0.843
Lsa | 0.401 | 0.172 | 0.259 | 0.834
DSDR | 0.485 | 0.222 | 0.27 | 0.848
Luhn | 0.405 | 0.181 | 0.268 | 0.837
Reduction | 0.431 | 0.195 | 0.284 | 0.844
Pacsum_bert | 0.401 | 0.175 | 0.242 | 0.839
Pacsum_tfidf | 0.428 | 0.194 | 0.262 | 0.834
Extractive Methods – Unsupervised, Legal Domain Specific
MMR | 0.452 | 0.21 | 0.253 | 0.844
KMM | 0.455 | 0.2 | 0.259 | 0.843
LetSum | 0.395 | 0.167 | 0.251 | 0.833
CaseSummarizer | 0.454 | 0.229 | 0.279 | 0.843
Extractive Methods – Supervised, Domain Independent
SummaRunner | 0.493 | 0.255 | 0.274 | 0.849
BERT-Ext | 0.427 | 0.199 | 0.239 | 0.821
Extractive Methods – Supervised, Legal Domain Specific
Gist | 0.471 | 0.238 | 0.308 | 0.842
Abstractive Methods – Pretrained
BART | 0.39 | 0.156 | 0.246 | 0.829
BERT-BART | 0.337 | 0.112 | 0.212 | 0.809
Legal-Pegasus | 0.441 | 0.19 | 0.278 | 0.845
Legal-LED | 0.223 | 0.053 | 0.159 | 0.813
Abstractive Methods – Finetuned
BART_CLS | 0.484 | 0.231 | 0.311 | 0.85
BART_MCS | 0.495 | 0.249 | 0.33 | 0.851
BART_SIF | 0.49 | 0.246 | 0.326 | 0.851
BERT_BART_MCS | 0.487 | 0.243 | 0.329 | 0.853
Legal-Pegasus_MCS | 0.488 | 0.252 | 0.341 | 0.851
Legal-LED | 0.471 | 0.235 | 0.332 | 0.856
BART_MCS_RR | 0.49 | 0.234 | 0.311 | 0.849

A.8 More Insights from Segment-wise Evaluation

Table 13 shows the segment-wise ROUGE-L Recall scores of all methods on the IN-Ext dataset, considering the 5 rhetorical segments RPC, FAC, STA, ARG, and Ratio+PRE. Similarly, Table 14 shows the segment-wise ROUGE-L Recall scores of all methods on the UK-Abs dataset, considering the 3 segments Background, Reasons, and Final Judgement. In this section, we present some more observations from these segment-wise evaluations, which could not be reported in the main paper due to lack of space.
An interesting observation is that the performance of several methods on a particular segment depends on the size and location of that segment in the documents. The FAC (Facts) segment in the IN-Ext dataset and the Background segment in the UK-Abs dataset are large segments that appear at the beginning of the case documents. On the other hand, the RPC (Ruling by Present Court) segment in IN-Ext and the ‘Final judgement’ segment in UK-Abs are short segments appearing at the end of the documents. Most domain-independent models, like Luhn and BERT-Ext, perform much better for the FAC and Background segments than for the RPC and ‘Final judgement’ segments. Such models may be suffering from the lead-bias problem (Kedzie et al., 2018), whereby a method has a tendency to pick the initial sentences of a document for inclusion in the summary.
However, the RPC and ‘Final judgement’ segments are important from a legal point of view and should be represented well in the summary according to domain experts (Bhattacharya et al., 2019). In fact, the performances of all methods are relatively poor for these segments (see Table 13 and Table 14). Hence, another open challenge in domain-specific long document summarization is to develop algorithms that perform well on short segments that have domain-specific importance.
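As a rough illustration of segment-wise scoring (the exact evaluation protocol used for Tables 13 and 14 may differ from this sketch), one can compare a generated summary against the reference material of each rhetorical segment separately; `rouge_l_recall` is an assumed helper, and the segment annotation of the reference summary is taken as given.

```python
SEGMENTS = ["RPC", "FAC", "STA", "Ratio+PRE", "ARG"]

def segment_wise_scores(generated_summary, reference_by_segment, rouge_l_recall):
    """Report ROUGE-L Recall of the generated summary against the reference
    sentences of each rhetorical segment (None if the segment is empty)."""
    scores = {}
    for seg in SEGMENTS:
        reference = " ".join(reference_by_segment.get(seg, []))
        scores[seg] = rouge_l_recall(generated_summary, reference) if reference else None
    return scores
```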
Table 12: Document-wide ROUGE-L and BERTScores (Fscore) on the UK-Abs dataset, averaged over the 100 test documents. The best value for each category of methods is in bold.
Algorithm | R-1 | R-2 | R-L | BERTScore
Extractive Methods – Unsupervised, Domain Independent
LexRank | 0.481 | 0.187 | 0.265 | 0.848
Lsa | 0.426 | 0.149 | 0.236 | 0.843
DSDR | 0.484 | 0.174 | 0.221 | 0.832
Luhn | 0.444 | 0.171 | 0.25 | 0.844
Reduction | 0.447 | 0.169 | 0.253 | 0.844
Pacsum_bert | 0.448 | 0.175 | 0.228 | 0.843
Pacsum_tfidf | 0.414 | 0.146 | 0.213 | 0.825
Extractive Methods – Unsupervised, Legal Domain Specific
MMR | 0.440 | 0.151 | 0.205 | 0.83
KMM | 0.430 | 0.138 | 0.201 | 0.827
LetSum | 0.437 | 0.158 | 0.233 | 0.842
CaseSummarizer | 0.445 | 0.166 | 0.227 | 0.835
Extractive Methods – Supervised, Domain Independent
SummaRunner | 0.502 | 0.205 | 0.237 | 0.846
BERT-Ext | 0.431 | 0.184 | 0.24 | 0.821
Extractive Methods – Supervised, Legal Domain Specific
Gist | 0.427 | 0.132 | 0.215 | 0.819
Abstractive Methods – Pretrained
Pointer_Generator | 0.420 | 0.133 | 0.193 | 0.812
BERT-Abs | 0.362 | 0.087 | 0.208 | 0.803
BART | 0.436 | 0.142 | 0.236 | 0.837
BERT-BART | 0.369 | 0.099 | 0.198 | 0.805
Legal-Pegasus | 0.452 | 0.155 | 0.248 | 0.843
Legal-LED | 0.197 | 0.038 | 0.138 | 0.814
Abstractive Methods – Finetuned
BART_CLS | 0.481 | 0.172 | 0.255 | 0.844
BART_MCS | 0.496 | 0.188 | 0.271 | 0.848
BART_SIF | 0.485 | 0.18 | 0.262 | 0.845
BERT_BART_MCS | 0.476 | 0.172 | 0.259 | 0.847
Legal-Pegasus_MCS | 0.476 | 0.171 | 0.261 | 0.838
Legal-LED | 0.482 | 0.186 | 0.264 | 0.851
BART_MCS_RR | 0.492 | 0.184 | 0.26 | 0.839

Table 13: Segment-wise ROUGE-L Recall scores of all methods on the IN-Ext dataset. All values averaged over the 50 documents in the dataset. The best value for each segment in a particular class of methods is in bold.
Algorithm | RPC (6.42%) | FAC (34.85%) | STA (13.42%) | Ratio+Pre (28.83%) | ARG (16.45%)
Extractive Methods
LexRank | 0.039 | 0.204 | 0.104 | 0.208 | 0.127
Lsa | 0.037 | 0.241 | 0.091 | 0.188 | 0.114
DSDR | 0.053 | 0.144 | 0.099 | 0.21 | 0.104
Luhn | 0.037 | 0.272 | 0.097 | 0.175 | 0.117
Reduction | 0.038 | 0.236 | 0.101 | 0.196 | 0.119
Pacsum_bert | 0.038 | 0.238 | 0.087 | 0.154 | 0.113
Pacsum_tfidf | 0.039 | 0.189 | 0.111 | 0.18 | 0.111
MMR | 0.049 | 0.143 | 0.092 | 0.198 | 0.096
KMM | 0.049 | 0.143 | 0.1 | 0.198 | 0.103
LetSum | 0.036 | 0.237 | 0.115 | 0.189 | 0.1
CaseSummarizer | 0.044 | 0.148 | 0.084 | 0.212 | 0.104
SummaRunner | 0.059 | 0.158 | 0.08 | 0.209 | 0.096
BERT-Ext | 0.038 | 0.199 | 0.082 | 0.162 | 0.093
Gist | 0.041 | 0.191 | 0.102 | 0.223 | 0.093
Pretrained Abstractive Methods
BART | 0.037 | 0.148 | 0.076 | 0.187 | 0.087
BERT-BART | 0.038 | 0.154 | 0.078 | 0.187 | 0.084
Legal-Pegasus | 0.043 | 0.139 | 0.076 | 0.186 | 0.092
Legal-LED | 0.049 | 0.131 | 0.078 | 0.228 | 0.091
Finetuned Abstractive Methods
BART_MCS | 0.036 | 0.206 | 0.082 | 0.228 | 0.092
BERT_BART_MCS | 0.037 | 0.205 | 0.085 | 0.237 | 0.094
Legal-Pegasus_MCS | 0.037 | 0.192 | 0.09 | 0.257 | 0.101
Legal-LED | 0.053 | 0.245 | 0.086 | 0.187 | 0.124
BART_MCS_RR | 0.061 | 0.192 | 0.082 | 0.237 | 0.086

Table 14: Segment-wise ROUGE-L Recall scores of all methods on the UK-Abs dataset. All values averaged over the 100 documents in the evaluation set. The best value for each segment in a particular class of methods is in bold.
Algorithm | Background (39%) | Final Judgement (5%) | Reasons (56%)
Extractive Methods
LexRank | 0.197 | 0.037 | 0.161
Lsa | 0.175 | 0.036 | 0.141
DSDR | 0.151 | 0.041 | 0.178
Luhn | 0.193 | 0.034 | 0.146
Reduction | 0.188 | 0.035 | 0.158
Pacsum_bert | 0.176 | 0.036 | 0.148
Pacsum_tfidf | 0.154 | 0.035 | 0.157
MMR | 0.152 | 0.04 | 0.17
KMM | 0.133 | 0.037 | 0.157
LetSum | 0.133 | 0.037 | 0.147
CaseSummarizer | 0.153 | 0.036 | 0.17
SummaRunner | 0.172 | 0.044 | 0.165
BERT-Ext | 0.203 | 0.034 | 0.135
Gist | 0.123 | 0.041 | 0.195
Pretrained Abstractive Methods
BART | 0.161 | 0.04 | 0.175
BERT-BART | 0.143 | 0.04 | 0.158
Legal-Pegasus | 0.169 | 0.042 | 0.177
Legal-LED | 0.177 | 0.066 | 0.219
Finetuned Abstractive Methods
BART_MCS | 0.168 | 0.041 | 0.184
BERT_BART_MCS | 0.174 | 0.047 | 0.183
Legal-Pegasus_MCS | 0.166 | 0.039 | 0.202
Legal-LED | 0.187 | 0.058 | 0.172
BART_MCS_RR | 0.165 | 0.042 | 0.18

A.9 Expert Evaluation Details

We mention below some more details of the expert evaluation, which could not be accommodated in the main paper due to lack of space.

Choice of documents for the survey: We selected 5 documents from the IN-Abs test set, specifically the five documents that gave the best average ROUGE-L F-scores over the 7 summarization methods chosen for the human evaluation. Ideally, some summaries that obtained lower ROUGE scores should also have been included in the evaluation by the domain experts, but the number of summaries that we could get evaluated was limited by the availability of the experts.

Framing the questions asked in the survey: We framed the set of questions (described in Section 7.3) based on the parameters stated in (Bhattacharya et al., 2019; Huang et al., 2020b) about how a legal document summary should be evaluated.

Pearson Correlation as IAA: The human annotators were asked to rate the summaries on a scale of 0-5 for different parameters. Here we discuss the IAA for the ‘Overall’ parameter. For a particular summary of a document, consider that Annotator 1 and Annotator 2 have given scores of 2 and 3 respectively. There are two choices for calculating the IAA: (i) in a regression setup, these scores denote a fairly high agreement between the annotators; (ii) in a classification setup, if we consider each score to be a ‘class’, then Annotator 1 has assigned ‘class 2’ and Annotator 2 has assigned ‘class 3’, which implies a total disagreement between the two experts. In our setting, we find the regression setup for calculating IAA more suitable than the classification setup. Therefore we use the Pearson correlation between the expert scores as the inter-annotator agreement (IAA) measure. For each algorithmic summary, we calculate the correlation between the two sets of ‘Overall’ scores, and we then take the average across all the seven ‘Overall’ correlation scores for the seven algorithmic summaries.
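A minimal sketch of the two Pearson-correlation computations used in this section — the inter-annotator agreement just described, and the expert-versus-metric correlation detailed in the next paragraph — is given below, using scipy; the function and variable names are illustrative, not taken from the paper's code.

```python
import numpy as np
from scipy.stats import pearsonr

def iaa_pearson(scores_a, scores_b):
    """Inter-annotator agreement: Pearson correlation between two experts'
    'Overall' scores over the same set of summaries."""
    return pearsonr(scores_a, scores_b)[0]

def expert_metric_correlation(human_overall, metric_scores):
    """For each algorithm, correlate its per-document human 'Overall' scores
    with its per-document metric scores (e.g., ROUGE-1 F-score), then
    average the correlations over the algorithms."""
    corrs = [pearsonr(human_overall[alg], metric_scores[alg])[0]
             for alg in human_overall]
    return float(np.mean(corrs))
```

For example, `human_overall["DSDR"]` would hold the five average ‘Overall’ scores for the DSDR summaries and `metric_scores["DSDR"]` the corresponding five ROUGE-1 F-scores.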
Computing the correlation between human judgements and the automatic metrics: Recall that we have 5 documents for the human evaluation. For a particular algorithm, e.g. DSDR, suppose the average ‘Overall’ scores given by the human annotators to the summaries of the 5 documents generated by DSDR are [h1, h2, h3, h4, h5], where hi denotes the average ‘Overall’ score given by humans for the i-th document's summary (range [0-1]). Suppose the ROUGE-1 F-scores of the DSDR summaries (computed with respect to the reference summaries) are [d1, d2, d3, d4, d5], where di denotes the ROUGE-1 F-score for the i-th document's DSDR-generated summary (range [0-1]). We then compute the Pearson correlation c_DSDR between the list of human scores and the list of ROUGE-1 F-scores for DSDR. We repeat this procedure for all the 7 algorithms for a particular metric (e.g. ROUGE-1 F-score) to get 7 c values (e.g., c_DSDR, c_Gist, etc.) and then take the average of the 7 values. This gives the final correlation between the ROUGE-1 F-score and the overall scores assigned by the human evaluators. Likewise, we compute the correlation between other automatic metrics (e.g., ROUGE-2 F-score, BERTScore) and the human-assigned overall scores.

A.10 Ethics and limitations statement

All the legal documents and summaries used in the paper are publicly available data on the Web, except the reference summaries for the IN-Ext dataset, which were written by the Law experts whom we consulted. The Law experts were informed of the purpose for which the annotations/surveys were being carried out, and they were provided with a mutually agreed honorarium for conducting the annotations/surveys as well as for writing the reference summaries in the IN-Ext dataset.
The study was performed over legal documents from two countries (India and the UK). While the methods presented in the paper should be applicable to legal documents of other countries as well, it is not certain whether the reported trends in the results (e.g., the relative performances of the various summarization algorithms) will generalize to legal documents of other countries.
The evaluation study by experts was conducted over a relatively small number of summaries (35), which was limited by the availability of the experts. Also, different Law practitioners have different preferences about summaries of case judgements. The observations presented here are according to the Law practitioners we consulted, and may vary for other Law practitioners.