Thanks to visit codestin.com
Credit goes to www.scribd.com

0% found this document useful (0 votes)
12 views17 pages

2022 Aacl-Main 77

Uploaded by

Debtanu Datta
Copyright
© © All Rights Reserved
We take content rights seriously. If you suspect this is your content, claim it here.
Available Formats
Download as PDF, TXT or read online on Scribd
0% found this document useful (0 votes)
12 views17 pages

2022 Aacl-Main 77

Uploaded by

Debtanu Datta
Copyright
© © All Rights Reserved
We take content rights seriously. If you suspect this is your content, claim it here.
Available Formats
Download as PDF, TXT or read online on Scribd
You are on page 1/ 17

Legal Case Document Summarization: Extractive and Abstractive Methods

and their Evaluation

Abhay Shukla1 Paheli Bhattacharya1 Soham Poddar1 Rajdeep Mukherjee1


2 1
Kripabandhu Ghosh Pawan Goyal Saptarshi Ghosh1∗
1
Indian Institute of Technology Kharagpur, India
2
Indian Institute of Science Education and Research Kolkata, India

Avg # Tokens
Abstract Dataset Language Domain #Doc
Doc Summ
CNN/DM (Hermann et al., 2015) EN News 312K 781 56
Gigawords (Napoles et al., 2012) EN News 4.02M 31 8
arXiv (Cohan et al.) EN Academic 216K 6,914 293
Summarization of legal case judgement docu- PubMed (Cohan et al.) EN Academic 133K 3,224 214
ments is a challenging problem in Legal NLP. TL;DR, TOS;DR (Manor and Li, 2019)
BigPatent (Sharma et al.)
EN
EN
Contracts
Patent
506
1.34M
106
3,573
17
117
However, not much analyses exist on how dif- RulingBR (Feijó and Moreira, 2018) Portugese Court Rulings 10,623 1,397 100
This work
ferent families of summarization models (e.g., IN-Ext (Indian docs, extractive summ) EN Court Rulings 50 5,389 1,670
IN-Abs (Indian docs, abstractive summ) EN Court Rulings 7,130 4,378 1,051
extractive vs. abstractive) perform when ap- UK-Abs (UK docs, abstractive summ) EN Court Rulings 793 14,296 1,573

plied to legal case documents. This question


is particularly important since many recent Table 1: Comparing some existing summarization
transformer-based abstractive summarization datasets with the three legal summarization datasets
models have restrictions on the number of in- developed in this work. Last two columns give the aver-
put tokens, and legal documents are known to age number of tokens per document and per summary.
be very long. Also, it is an open question on A plethora of solutions exists for text summariza-
how best to evaluate legal case document sum-
tion, for e.g., extractive and abstractive, supervised
marization systems. In this paper, we carry out
extensive experiments with several extractive
and unsupervised, etc. (Huang et al., 2020a). Also,
and abstractive summarization methods (both several legal domain-specific methods have been
supervised and unsupervised) over three legal designed for case document summarization (Zhong
summarization datasets that we have developed. et al., 2019; Liu and Chen, 2019). However, de-
Our analyses, that includes evaluation by law tailed systematic analyses are rare on how the dif-
practitioners, lead to several interesting insights ferent families of summarization models perform
on legal summarization in specific and long on legal case documents. Our prior work (Bhat-
document summarization in general.
tacharya et al., 2019) took an early step in this
direction, but it mostly considered extractive meth-
1 Introduction
ods. The state-of-the-art in document summariza-
In Common Law systems (followed in India, UK, tion has advanced rapidly in the last couple of years,
USA, etc.) law practitioners have to read through and there has not been much exploration on how
hundreds of case judgements/rulings in order to recent transformer-based summarization models
identify relevant cases that they can cite as prece- perform on legal documents (Feijo and Moreira,
dents in an ongoing case. This is a time-consuming 2021; Bajaj et al., 2021).
process as case documents are generally very long To bridge this gap, we (1) develop three le-
and complex. Thus, automatic summarization of gal case judgement summarization datasets from
legal case documents is an important problem (Gel- case documents from the Indian and UK Supreme
bart and Smith, 1991; Bhattacharya et al., 2019; Courts (see Table 1; details in Section 3), and (2) re-
Zhong et al., 2019; Liu and Chen, 2019). It is ad- produce/apply representative methods from several
ditionally challenging due to two primary reasons families of summarization models on these datasets,
as demonstrated in Table 1 – (i) legal documents and analyse their performances. To our knowledge,
as well as their summaries are much longer than this is the first study on how a wide spectrum of
most other types of documents, and (ii) since it is summarization methods perform over legal case
expensive to get Law Experts to write summaries, documents. We list below some interesting insights
the datasets are usually much smaller, making it that come out from our analyses.
difficult to use supervised models.
• Domain-specific vs Domain-agnostic meth-

Corresponding author: [email protected] ods: We apply several domain-independent sum-
1048
Proceedings of the 2nd Conference of the Asia-Pacific Chapter of the Association for Computational Linguistics and the
12th International Joint Conference on Natural Language Processing (Volume 1: Long Papers), pages 1048–1064
November 20–23, 2022. ©2022 Association for Computational Linguistics
marization methods, including unsupervised ex- actual end-users of legal summarization systems).
tractive (e.g., LexRank (Erkan and Radev, 2004), We show that simply computing document-wide
DSDR (He et al., 2012), and PacSum (Zheng metrics gives an incomplete picture of the qual-
and Lapata, 2019)), supervised extractive (e.g., ity of legal document summarization. In par-
SummaRunner (Nallapati et al., 2017), and BERT- ticular, we see some differences between auto-
SUMM (Liu and Lapata, 2019)), and supervised matic evaluation and evaluation by domain experts.
abstractive (e.g., BART (Lewis et al., 2020), For instance, supervised methods like SummaRun-
and Longformer (Beltagy et al., 2020)) on legal ner, and finetuned BART usually achieve higher
case documents. We then reproduce several legal ROUGE scores, but the law practitioners often pre-
domain-specific summarization methods, for e.g., fer the summaries generated by simpler unsuper-
MMR (Zhong et al., 2019), CaseSummarizer (Pol- vised methods such as DSDR and CaseSummarizer.
sley et al., 2016) (unsupervised) and Gist (Liu and Again, the ROUGE scores achieved by the best ex-
Chen, 2019) (supervised). In many cases, we ob- tractive models are at par with those achieved by
serve general (domain-agnostic) methods to per- the best abstractive models. However, the practi-
form better than domain-specific methods. tioners often prefer the extractive summaries over
• Domain-specific training/fine-tuning: Using the abstractive ones.
models pretrained on legal corpora, like Legal- Availability of resources: The three legal sum-
Pegasus (leg), consistently improves performance. marization datasets curated in this work and the
We also explore and compare multiple ways of gen- implementations of various summarization mod-
erating legal data for training supervised models els are publicly available at https://github.
and further fine-tuning pretrained models. com/Law-AI/summarization.
• How to deal with long documents: A key chal-
lenge in using existing abstractive summarizers on 2 Related Work
legal documents is that the input capacity of such
We give an overview of existing summarization
models is often much lower than the length of le-
algorithms (Dong, 2018; Huang et al., 2020a).
gal documents. Accordingly, we experiment with
three different approaches for summarizing long Extractive domain-independent methods: There
legal case documents – (i) applying long document exists a wide range of general/domain-agnostic un-
summarizers such as Longformer (Beltagy et al., supervised summarizers such as Reduction (Jing,
2020) that are designed to handle long documents, 2000), and the graph-based LexRank algo-
(ii) applying short document summarizers such as rithm (Erkan and Radev, 2004). LSA (Gong and
BART (Lewis et al., 2020) and Legal-Pegasus (leg) Liu, 2001) is a matrix-factorization based method
together with approaches for chunking the docu- and DSDR (He et al., 2012) relies on data recon-
ments, and (iii) reducing the size of the input docu- struction. PacSum (Zheng and Lapata, 2019) is a re-
ment by first performing an extractive summariza- cent BERT-based method. Among supervised neu-
tion and then going for abstractive summarization. ral summarizers, SummaRuNNer (Nallapati et al.,
In general, we find the chunking-based approach to 2017) and BERTSum (Liu and Lapata, 2019) treat
perform better for legal documents, especially with document summarization as a binary classification
fine-tuning, although Longformer performs the best problem (in-summary vs. out-of-summary).
on the UK-Abs dataset containing the longest doc- Extractive domain-specific methods: Several
uments, according to some of the metrics. domain-specific approaches have been specifically
• Evaluation of summary quality: As noted designed for summarizing legal case documents.
in (Bhattacharya et al., 2019), Law Experts advise Among unsupervised methods, (1) LetSum (Farzin-
to not only evaluate the full-document summaries, dar and Lapalme, 2004) and (2) KMM (Saravanan
but also check how well a summary is able to rep- et al., 2006) rank sentences based on term distribu-
resent the different logical rhetorical segments in tion models (TF-IDF and k-mixture model respec-
a legal case document (such as Facts, Final Judge- tively); (3) CaseSummarizer (Polsley et al., 2016)
ment, etc. – see Appendix, Section A.1). To this ranks sentences based on their TF-IDF weights cou-
end, we perform (i) document-wide automatic eval- pled with legal-specific features; (4) MMR (Zhong
uations, (ii) segment-wise automatic evaluations, et al., 2019) generates a template-based summary
as well as (iii) evaluations by Law practitioners (the using a 2-stage classifier and a Maximum Margin
1049
Relevance (Zhong et al., 2019) module. (i) Indian-Abstractive dataset (IN-Abs): We
To our knowledge, Gist (Liu and Chen, 2019) is collect Indian Supreme Court judgements from
the only supervised method specifically designed the website of Legal Information Institute of
for summarizing legal case documents. Gist first India (http://www.liiofindia.org/in/
represents a sentence with different handcrafted cases/cen/INSC/) which provides free and
features. It then uses 3 models – MLP, Gradient non-profit access to databases of Indian law. Ab-
Boosted Decision Tree, and LSTM – to rank sen- stractive summaries (also called “headnotes”) are
tences in order of their likelihood to be included available for some of these cases; of which we in-
in the summary. We reproduce all these methods clude 7, 130 case documents, together with their
(implementation details in Appendix, Section A.2). headnotes/summaries as part of the dataset. We re-
serve 100 randomly-selected document-summary
Abstractive methods: Most abstractive summa-
pairs for evaluation and the remaining 7, 030 pairs
rization models have an input token limit which
are used for training the supervised models.
is usually shorter than the length of legal case
documents. Approaches from this family include (ii) Indian-Extractive dataset (IN-Ext): Differ-
Pointer-Generator (See et al., 2017), BERTSum- ent law practitioners may have different prefer-
Abs (Liu and Lapata, 2019), Pegasus (Zhang et al., ences about the summary of a legal case docu-
2020) and BART (Lewis et al., 2019) (input token ment. Per discussion with Law Experts (two re-
limits for these models are at most 1024). Models cent LLB graduates and a Professor from the Rajiv
like Longformer (Beltagy et al., 2020) introduce Gandhi School of Intellectual Property Law, a re-
transformer architectures with more efficient atten- puted Law school in India), we understand that they
tion mechanisms that enables them to summarize are not much satisfied with the summaries in the
long documents (up to 16 × 1024 input tokens). IN-Abs dataset. According to these experts, legal
Bajaj et al. (2021) developed a two-step case documents have various rhetorical segments,
extractive-abstractive approach for long document and the summary should contain a representation
summarization – they use a pre-trained BART from each segment. Based on the above preference,
model over compressed documents generated by the two LLB graduates first rhetorically labelled
identifying salient sentences. In this work, we re- each sentence from 50 case documents from the
produce a simplified version of this method. Indian Supreme Court (total 9,380 sentences), with
Gidiotis and Tsoumakas (2020) presented a one of the following labels – Facts (abbreviated as
divide and conquer approach for long document FAC), Argument (ARG), Statute (STA), Precedent
summarization; they split the documents and sum- (PRE), Ratio of the decision (Ratio), and Ruling by
maries, using sentence similarity, into an ensem- Present Court (RPC). Descriptions of these rhetori-
ble of smaller summarization problems. In this cal labels are given in the Appendix (Section A.1).
work, we apply a method inspired by Gidiotis and Then they wrote extractive summaries for the same
Tsoumakas (2020) to fine-tune abstractive models. 50 documents, each of length approximately one-
To our knowledge, the only method for ab- third of that of the documents. They summarized
stractive legal document summarization is Legal- each rhetorical segment separately; however, they
Summ (Feijo and Moreira, 2021). The method preferred to summarize the segments ‘Ratio’ and
uses the RulingBR dataset (in Portuguese language) ‘Precedent’ together. Each LLB graduate was paid
which has much shorter documents and summaries a (mutually agreed) honorarium of INR 800 for
than the datasets in this work (see Table 1). A labeling and summarizing each document.
limitation of LegalSumm is that it can generate Since 50 document-summary pairs are not suf-
summaries only up to 200 tokens (which is much ficient for training supervised models, when ap-
smaller than our target summaries); hence we do plying these models on IN-Ext, they were trained
not apply this method in this work. over the 7, 030 document-summary pairs in the IN-
Abs train set. We ensure that there is no overlap
3 Datasets for Legal Summarization between this training set and the IN-Ext dataset.
There are very few publicly available datasets for (iii) UK-Abstractive dataset (UK-Abs): The
legal case document summarization, especially in UK Supreme court website (https://www.
English (see Table 1). In this work, we develop the supremecourt.uk/decided-cases/)
following three datasets: provides all cases judgements that were ruled since
1050
Dataset Type of Compression Test Training
Summary Ratio Set Size Set Size
used, and other parameters kept as default),
IN-Ext Ext, segmented 0.31 50
7030
and BertScore (Zhang et al., 2019) (com-
IN-Abs Abs, non-segmented 0.24 100
UK-Abs Abs, segmented 0.11 100 693
puted using https://pypi.org/project/
bert-score/ version 0.3.4) that calculates the
Table 2: The three datasets developed in this work. semantic similarity scores using the pretrained
the year 2009. For most of the cases, along with BERT model. We calculate two kinds of ROUGE
the judgements, they also provide the official press and BERTScore as follows:
summaries of the cases, which we consider as the (a) Overall document-wide scores: For a given doc-
reference summary. The summaries are abstractive ument, we compute the ROUGE and BERTScore
in nature and are divided into three segments – of an algorithmic summary with respect to the refer-
‘Background to the Appeal’, ‘Judgement’, and ence summary. For IN-Ext, we compute the scores
‘Reasons for Judgement’. We gathered a set of individually with each of the two reference sum-
793 case documents (decided during the years maries and take the average. The scores are aver-
2009–2021) and their summaries. We reserve 100 aged over all documents in the evaluation set.
document-summary pairs for evaluation and use
the remaining 693 document-summary pairs for (b) Segment-wise scores: In legal case judgement
training the supervised models. summarization, a segment-wise evaluation is im-
portant to understand how well each rhetorical seg-
Table 2 provides a summary of the datasets, while ment has been summarized (Bhattacharya et al.,
Table 1 compares the length of the documents in 2019). We can perform this evaluation only for the
these datasets with those in other datasets. Note IN-Ext and UK-Abs datasets (and not for IN-Abs),
that the documents in UK-Abs are approximately where the reference summaries are written segment-
double the length of the IN-Abs and IN-Ext doc- wise. For each rhetorical segment (e.g., Fact or
uments, and have a very low compression ratio Background), we extract the portion of the gold
(0.11); hence the UK-Abs dataset is the most chal- standard summary that belongs to that segment.
lenging one for automatic summarization. Then we compute the ROUGE score between the
entire algorithmic summary and segment-specific
4 Experimental Setup and Evaluation part of the reference summary. We compute the
average ROUGE score for a particular segment,
Target length of summaries: During inference,
averaged over all documents in the evaluation set.1
the trained summarization models need to be pro-
In the segment-wise evaluation, we only report
vided with the target length of summaries L (in
ROUGE Recall scores, and not F-scores. This is
number of words). For every document in the
because the summarization algorithms output only
IN-Ext dataset, we have two reference summaries
a coherent set of sentences as summary, and do
(written by two experts). For a particular document,
not specify which part of the summary belongs to
we consider L to be the average of the number of
which segment; computing ROUGE Precision or
words in the two reference summaries for that docu-
F-Score in this case would be misleading.
ment. For IN-Abs and UK-Abs datasets, L is taken
as the number of words in the single abstractive Expert evaluation: We select a few methods (that
reference summary for a given document. achieve the highest ROUGE scores) and get the
Given a document, every model is made to gen- summaries generated by them for a few documents
erate a summary of length at most L words. Some evaluated by three Law experts (Section 7.3).
algorithms (e.g. KMM, Gist) return a ranking of Consistency scores: It is important to measure
sentences according to their summary-worthiness. the consistency of an algorithmic summary with
The final summary is obtained by selecting sen- the original document, given the possibility of hal-
tences in descending order of the ranked list till the lucination by abstractive models (Pagnoni et al.,
limit of L words is reached. 2021). To this end, we experimented with the
Evaluation of summary quality: We re- SummaCCONV summary consistency checker (La-
port ROUGE-1, ROUGE-2, and ROUGE-L F- ban et al., 2022). However, we find that it gives very
scores (computed using https://pypi.org/ 1
In this paper, we report segment-wise ROUGE scores only
project/py-rouge/, with max_n set to 2, since both segment-wise ROUGE scores as well as segment-
parameters limit_length and length_limit not wise BERTScores give similar insights.
1051
low consistency scores to the expert-written refer- Avr: We adopt the technique given by Narayan et al.
ence abstractive summaries – the average scores (2018). For each sentence in the abstractive gold-
for the expert summaries in IN-Abs and UK-Abs standard summary, we select 3 sentences from the
are 0.485 and 0.367 respectively. A probable rea- source document (full text) that have the maximum
son for these counter-intuitive scores could be that average of ROUGE-1, ROUGE-2 and ROUGE-L
the SummaCCONV model could not be fine-tuned scores w.r.t. the sentence in the abstractive sum-
on a legal domain-specific dataset, owing to its mary. Then we take the union of all the sentences
unavailability. Curating such a dataset to check thus selected, and label them 1 (to be included in
for factual consistency of summaries of legal docu- the summary). All other sentences in the source
ments, together with developing a suitable consis- document are assigned a label of 0.
tency measure for summaries in the legal domain
are envisioned as immediate future works. The 6 Abstractive Summarization Methods
present SummaCCONV consistency scores are there-
fore concluded to be unreliable for legal document We apply several abstractive methods for legal doc-
summarization, and hence are not reported. ument summarization, including both pretrained
models and models finetuned for legal document
summarization. A key challenge in applying such
5 Extractive Summarization Methods
methods is that legal documents are usually very
long, and most abstractive summarization models
We consider some representative methods from
have restrictions on the number of input tokens.
four classes of extractive summarizers: (1) Le-
gal domain-specific unsupervised methods: Let-
6.1 Pretrained Abstractive Models
Sum, KMM, CaseSummarizer, and MMR. (2) Le-
gal domain-specific supervised methods: Gist. 6.1.1 Models meant for short documents
(3) Domain-independent unsupervised methods: We consider Legal-Pegasus (leg) which is already
LexRank, LSA, DSDR, Luhn, Reduction and Pac- pretrained on legal documents, and BART (Lewis
Sum. (4) Domain-independent supervised methods: et al., 2020) (max input length of 1024 tokens). We
SummaRuNNer and BERTSum. use their pre-trained versions from the Hugging-
Short descriptions of all the above methods are Face library; details in the Appendix (Section A.5).
given in Section 2. The implementation details for The input token limit in these models (1024) is
the domain-specific methods we implemented, and much smaller than the number of words in a typical
publicly available code repositories are stated in legal case document. Hence, to apply these models
the Appendix (Section A.2 and Section A.3). on legal case documents, we apply a chunking-
Training supervised extractive models: The based approach as described below:
supervised methods (Gist, SummaRuNNer and Chunking-based approach: We first divide a doc-
BERTSUM) require labelled training data, where ument into small chunks, the size of each chunk
every sentence must be labeled as 1 if the sentence being the maximum number of tokens (say, n) that
is suitable for inclusion in the summary, and 0 oth- a model is designed/pre-trained to accept without
erwise. As stated in Section 3, we use parts of truncating (e.g., n = 1024 for BART). Specifically,
the IN-Abs and UK-Abs datasets for training the the first n tokens (without breaking sentences) go
supervised methods. However, since both these to the first chunk, the next n tokens go to the sec-
datasets have abstractive summaries, they cannot ond chunk, and so on. Then we use a model to
be directly used to train the extractive summarizers. summarize every chunk. For a given document, we
We explore three methods – Maximal, Avr, and equally divide the target summary length among
TF-IDF – for converting the abstractive summaries all the chunks. Finally, we append the generated
to their extractive counterparts. Best performances summaries for each chunk in sequence.
for the supervised methods are observed when the
training data is generated through the Avr method; 6.1.2 Models meant for long documents
hence we describe Avr here and report results of Models like Longformer (LED) (Beltagy et al.,
the supervised methods trained on data generated 2020) have been especially designed to handle long
through Avr. Descriptions of Maximal and TF-IDF documents (input capacity = 16,384 tokens), by in-
are stated in the Appendix (Section A.4). cluding an attention mechanism that scales linearly
1052
with sequence length. We use Legal-LED specifi- niques for measuring sentence similarity between
cally finetuned on legal data (details in Appendix, two sentences – (i) Mean Cosine Similarity (MCS),
Section A.5). The model could accommodate most (ii) Smooth Inverse Frequency (SIF), (iii) Cosine
case documents fully. A few documents in UK-Abs similarity between BERT [CLS] token embeddings
are however longer (see Table 2), those documents (CLS), and (iv) MCS_RR which incorporates
were truncated after 16,384 tokens. rhetorical role information. Out of these, we find
MCS to perform the best. Hence we describe MCS
6.1.3 Hybrid extractive-abstractive approach in detail here. Descriptions of the other methods
To focus only on important parts of the document can be found in the Appendix (Section A.6).
in the chunking-based approach, we use a hybrid of In Mean Cosine Similarity (MCS) (Ranasinghe
an extractive approach and an abstractive approach, et al., 2019), we calculate the mean of token-level
similar to Bajaj et al. (2021). First, the document embeddings (obtained using SBERT (Reimers and
length is reduced by selecting salient sentences us- Gurevych, 2019)) to obtain the representation for
ing a BERT-based extractive summarization model. a given sentence. We then compute the cosine
Then a BART model is used to generate the final similarity between two such sentence embeddings.
summary (Bajaj et al., 2021). Since, in our case, We used all the methods stated above to gener-
we often require a summary length greater than ate fine-tuning datasets for IN-Abs and UK-Abs.
1024 (see Table 1), we use a chunking-based BART We finetune three different versions of the BART
(rather than pre-trained BART) in the second step. model, BART_CLS, BART_MCS, and BART_SIF,
We call this model BERT_BART. using the three sentence similarity measures de-
scribed above. Out of these, BART_MCS performs
6.2 Finetuning Abstractive Models
the best (as we will see in Section 7). Therefore,
Fine-Tuning transformer models has shown sig- we use MCS for generating finetuning data for the
nificant improvement in most downstream tasks. other models, to obtain Legal-Pegasus-MCS and
Hence, we finetune BART, Longformer, and Legal- BART_MCS_RR (where the finetuning data is gen-
Pegasus on our proposed datasets. We also use fine- erated based on rhetorical labels). We also use the
tuned BART as part of our BERT_BART model. finetuned BART_MCS model with BERT_BART
Generating finetuning data: Finetuning super- method to get BERT_BART_MCS.
vised models needs a large set of doc-summary The hyper-parameters used to finetune the differ-
pairs. However, our considered models (apart from ent abstractive models are stated in Table 9 in the
Longformer) have a restricted input limit which is Appendix (Section A.5).
lesser than the length of documents in our datasets.
Hence, we use the following method, inspired 7 Results and Analyses
from Gidiotis and Tsoumakas (2020), to generate This section analyzes the performance of differ-
finetuning data for chunking based summarization. ent summarization models. For IN-Ext, In-Abs
Consider (d, s) to be a (training document, refer- and UK-Abs datasets, Table 3, Table 4 and Ta-
ence summary) pair. When d is segmented into n ble 5 report the overall evaluation of a few of the
chunks d1 , d2 , ... dn , it is not logical for the same best-performing methods, respectively. Table 6 and
s to be the reference summary for each chunk di . Table 7 show the segment-wise evaluation of a few
In order to generate a suitable reference summary best-performing methods on the IN-Ext and UK-
si for each chunk di , first we map every sentence Abs datasets respectively. Detailed results are given
in s to the most similar sentence in d. Here, we in Tables 10–14 in the Appendix (Section A.7).
use a variety of sentence-similarity measures, as
detailed below. Then for every chunk di , we com- 7.1 Evaluation of Extractive methods
bine all sentences in s which are mapped to any of Overall Evaluation (Tables 3–5): Among the un-
the sentences in di , and consider those sentences as
supervised general methods, Luhn (on IN-Ext) and
the summary si (of di ). Following this procedure, DSDR (on IN-Abs and UK-Abs) show the best per-
from each document, we get a large number of (di , formances. Among the unsupervised legal-specific
si ) pairs which are then used for finetuning. methods, CaseSummarizer performs the best on
Sentence similarity measures for generating fine- both In-Abs and UK-Abs datasets, while LetSum
tuning data: We experiment with several tech- performs the best on IN-Ext. Among supervised
1053
ROUGE Scores ROUGE Scores
Algorithm BERTScore Algorithm BERTScore
R-1 R-2 R-L R-1 R-2 R-L
Extractive Methods Extractive Methods (U: Unsupervised, S: Supervised)
Unsupervised, Domain Independent DSDR (U) 0.485 0.222 0.270 0.848
Luhn 0.568 0.373 0.422 0.882 CaseSummarizer (U) 0.454 0.229 0.279 0.843
Pacsum_bert 0.59 0.41 0.335 0.879 SummaRunner (S) 0.493 0.255 0.274 0.849
Unsupervised, Legal Domain Specific Gist (S) 0.471 0.238 0.308 0.842
MMR 0.563 0.318 0.262 0.833 Finetuned Abstractive Methods
KMM 0.532 0.302 0.28 0.836 BART_MCS 0.495 0.249 0.330 0.851
LetSum 0.591 0.401 0.391 0.875 BERT_BART_MCS 0.487 0.243 0.329 0.853
Supervised, Domain Independent Legal-Pegasus_MCS 0.488 0.252 0.341 0.851
SummaRunner 0.532 0.334 0.269 0.829 Legal-LED 0.471 0.235 0.332 0.856
BERT-Ext 0.589 0.398 0.292 0.85
Supervised, Legal Domain Specific Table 4: Document-wide ROUGE-L and BERTScores
Gist 0.555 0.335 0.391 0.864 (Fscore) on the IN-Abs dataset, averaged over the 100
Abstractive Methods test documents. Results of some of the top-performing
Pretrained
methods are shown here (all results in Table 11).
BART 0.475 0.221 0.271 0.833
BERT-BART 0.488 0.236 0.279 0.836
Legal-Pegasus 0.465 0.211 0.279 0.842 ROUGE Scores
Legal-LED 0.175 0.036 0.12 0.799 Algorithm BERTScore
R-1 R-2 R-L
Finetuned Extractive Methods (U: Unsupervised, S: Supervised)
BART_MCS 0.557 0.322 0.404 0.868 DSDR (U) 0.484 0.174 0.221 0.832
BART_MCS_RR 0.574 0.345 0.402 0.864 CaseSummarizer (U) 0.445 0.166 0.227 0.835
BERT_BART_MCS 0.553 0.316 0.403 0.869 SummaRunner (S) 0.502 0.205 0.237 0.846
Legal-Pegasus_MCS 0.575 0.351 0.419 0.864 Gist 0.427 0.132 0.215 0.819
Legal-LED 0.471 0.26 0.341 0.863 Finetuned Abstractive Methods
BART_MCS 0.496 0.188 0.271 0.848
Table 3: Document-wide ROUGE-L and BERTScores BERT_BART_MCS 0.476 0.172 0.259 0.847
(FScore) on the IN-Ext dataset. All values averaged Legal-Pegasus_MCS 0.476 0.171 0.261 0.838
over the 50 documents in the dataset. The best value in Legal-LED 0.482 0.186 0.264 0.851
a particular class of methods is highlighted in bold.
Table 5: Document-wide ROUGE-L and BERTScores
(Fscore) on UK-Abs dataset, averaged over the 100
extractive methods, SummaRuNNer performs the test documents. Results of some of the top-performing
best across both domain-independent and domain- methods are shown here (all results in Table 12).
specific categories, on the IN-Abs and UK-Abs
datasets. BERT-Ext is the best performing model summaries (Table 3), followed by BART-based
on the IN-Ext dataset. methods. This is expected, since Legal-Pegasus
Segment-wise Evaluation: Table 6 and Table 7 is pre-trained on legal documents. This short doc-
show the segment-wise ROUGE-L Recall scores of ument summarizer, when used with chunking to
some of the best performing methods on the IN-Ext handle long documents, notably outperforms Legal-
and UK-Abs datasets respectively. Section 4 details LED, which is meant for long documents. For IN-
the process of obtaining these scores. According to Ext dataset, BERT_BART performs the best maybe
overall ROUGE scores, it may seem that a partic- due to extractive nature of the summaries.
ular method performs very well (e.g., LetSum on All models show notable improvement through
In-Ext), but that method may not perform the best fine-tuning. Overall, the best performances are
across all the segments (e.g. among the extractive noted by Legal-Pegasus (IN-Ext and IN-Abs) and
methods, LetSum performs the best in only 1 out of BART_MCS (UK-Abs).
the 5 segments in In-Ext). This observation shows Segment-wise Evaluation (Tables 6, 7): Again,
the importance of segment-wise evaluation. It is annone of the methods performs well across all seg-
open challenge to develop an algorithm that shows ments, and fine-tuning generally improves perfor-
a balanced segment-wise performance. Some more mance. Interestingly, though Legal-LED performs
interesting observations on segment-wise evalua- poorly with respect to document-wide ROUGE
tions are given in the Appendix (Section A.8). scores, it shows better performance in segment-
wise evaluation – it gives the best performance in
7.2 Evaluation of Abstractive methods the FAC and ARG segments of IN-Ext and in 2 out
Overall Evaluation (Tables 3–5): Among the pre- of the 3 segments of UK-Abs. Since the UK-Abs
trained models, Legal-Pegasus generates the best dataset contains the longest documents, possibly
1054
Rouge L Recall
Algorithms
RPC FAC STA Ratio+Pre ARG maries in IN-Ext) from the Rajiv Gandhi School of
(6.42%) (34.85%) (13.42%) (28.83%) (16.45%) Intellectual Property Law (RGSOIPL), India, who
Extractive Methods (U: Unsupervised, S: Supervised)
LexRank (U) 0.039 0.204 0.104 0.208 0.127 were mentored by a Professor of the same Law
Luhn (U) 0.037 0.272 0.097 0.175 0.117
LetSum (U) 0.036 0.237 0.115 0.189 0.1 school (as mentioned in Section 3) while carrying
SummaRunner (S) 0.059 0.158 0.08 0.209 0.096
Gist (S) 0.041 0.191 0.102 0.223 0.093
out the annotations. Additionally, we recruited a
Finetuned Abstractive Methods senior Faculty of Law from the West Bengal Na-
BART_MCS_RR 0.061 0.192 0.082 0.237 0.086
Legal-Pegasus_MCS 0.037 0.192 0.09 0.257 0.101 tional University of Juridical Sciences (WBNUJS),
Legal-LED 0.053 0.245 0.086 0.187 0.124
India. Note that both RGSOIPL and WBNUJS are
Table 6: Segment-wise ROUGE-L Recall scores of the among the most reputed Law schools in India.
best methods in Table 3 on the IN-Ext dataset. All values Each annotator was paid a (mutually agreed)
are averaged over the 50 documents in the dataset. The honorarium of INR 200 for evaluation of each sum-
best scores for each segment in a particular class of mary. The annotators were clearly informed of
methods are in bold. Results of all methods in Table 13. the purpose of the survey. Also we discussed their
Rouge-L Recall experiences after the survey about. Through all
Algorithms
Background Final Judgement Reasons these steps, we tried our best to ensure that the
(39%) (5%) (56%)
Extractive Methods (U: Unsupervised, S: Supervised) annotations were done rigorously.
SummaRunner (S) 0.172 0.044 0.165
BERT-Ext (S) 0.203 0.034 0.135 Survey setup: We select the summaries gener-
Gist (S) 0.123 0.041 0.195 ated by 7 algorithms which give relatively high
Finetuned Abstractive Methods
Legal-Pegasus_MCS 0.166 0.039 0.202
ROUGE-L F-Score on IN-Abs – see Table 8. Then,
Legal-LED 0.187 0.058 0.172 we show the annotators 5 selected documents and
their summaries generated by the 7 algorithms (35
Table 7: Segment-wise ROUGE-L Recall scores of the
summaries evaluated in total). An annotator was
best methods in Table 5 on the UK-Abs dataset. All val-
ues averaged over the 100 documents in the evaluation asked to evaluate a summary on the basis of the fol-
set. Best scores for each segment in a particular class of lowing parameters – (1) how well a summary repre-
methods are in bold. Results of all methods in Table 14. sents each rhetorical segment, i.e., the final judge-
ment (RPC), facts (FAC), relevant statutes/laws
Legal-LED has an advantage over chunking-based cited (STA), relevant precedents cited (PRE), the
methods when evaluated segment-wise. reasoning/rationale behind the judgement (Ratio),
and the arguments presented in the case (ARG).
Overall performance on long legal case docu-
(2) how well important information has been cov-
ments: We experimented with three approaches
ered in the summary (Imp Inf). (3) Readability
for summarizing long documents – (i) models with
and grammatical coherence (Read). (4) An overall
modified attention mechanism such as Legal-LED,
score for the summary (Overall).
(ii) methods based on chunking the documents,
Each summary was rated on a Likert scale of
and (iii) reducing the size of the input by initial
0 − 5 on each parameter, independently by the 3
extractive summarization and then going for ab-
annotators. Thus, a particular method got 15 scores
stractive summarization (BERT_BART). When we
for each parameter – for 5 documents and by 3
see the overall (document-wide) ROUGE scores,
annotators. Table 8 reports (i) the mean/average,
Legal-Pegasus and BART (when used along with
and (ii) the median of all these 15 scores for each
chunking), are seen to perform the best, followed
method and for each parameter.
by BERT_BART. However for segment-wise per-
formances Legal-LED shows greater potential. Inter-Annotator Agreement: We calculate pair-
wise Pearson Correlation between the ‘Overall’
7.3 Expert evaluation scores given by the three annotators over the 35
Finally, we evaluate some of the model-generated summaries, and then take the average correlation
summaries via three domain experts. Since it is value as the IAA. Refer to the Appendix (Sec-
expensive to obtain evaluations from Law experts, tion A.9) for why we chose this IAA measure. The
we chose to conduct this evaluation for a few docu- average IAA is 0.525 which shows moderate agree-
ments/summaries from the IN-Abs dataset. ment between the annotators2 .

Recruiting the 3 experts: We recruited the two re- 2


https://www.andrews.edu/~calkins/
cent LLB graduates (who wrote the reference sum- math/edrm611/edrm05.htm
1055
RPC FAC STA PRE Ratio ARG Imp.Inf. Read. Overall
Algorithms
Mean Med. Mean Med. Mean Med. Mean Med. Mean Med. Mean Med. Mean Med. Mean Med. Mean Med.
DSDR 4.2 5 3.8 4 3.7 4 3.1 3.7 3.7 4 1.9 3.7 3.7 4 4.3 4 3.9 4
CaseSummarizer 2.1 2 3.8 4 3.6 4 3 3.6 3.5 3 2.4 3 3.2 3 4.3 4 3.6 3
SummaRuNNer 2.1 3 4.2 4 2.4 3 3.3 3 2.9 3 2.1 2.9 3.2 3 4.1 4 3.2 4
Gist 3.3 4 1.8 3 2.6 3 3.5 3 3.2 4 2.1 3.2 3 3 3.9 4 3.2 3
Legal-Pegasus 1.4 1 3.9 4 3.2 4 2.4 3.2 2.9 3 2 2.9 3 3 3.5 4 3 3
BART-MCS 0.9 1 2.8 3 2.9 3 3.3 3 2.5 3 1.8 2.5 2.8 3 2.7 3 2.8 3
BART-MCS-RR 0.8 1 2.7 3 3.1 3 2.6 3 2.6 3 1.3 2.6 2.6 3 2.9 3 2.6 3

Table 8: Evaluation of some summaries from the IN-Abs dataset, by three domain experts (two recent LLB graduates
and a Senior faculty of Law). The evaluation parameters are explained in the text. Scores are given by each expert
in the range [0-5], 5 being the best. The Mean and Median (Med.) scores for each summarization algorithm and for
each parameter are computed over 15 scores (across 5 documents; each judged by 3 experts).

Results (Table 8): According to the Law experts, 7 algorithms (details in Appendix Section A.9).
important information (Imp. Inf.) could be covered Following this procedure, the correlation of the
best by DSDR, followed by CaseSummarizer and mean ‘Overall’ score (assigned by experts) with
SummaRuNNer. In terms of readability (Read.) ROUGE-1 F-Score is 0.212, that with ROUGE-2
as well, DSDR, CaseSummarizer and SummaRuN- F-Score is 0.208, that with ROUGE-L F-Score is
Ner have higher mean scores than others. Finally, 0.132 and the correlation with BERTScore is 0.067.
through the Overall ratings, we understand that These low correlation scores again suggest that au-
DSDR is of higher satisfaction to the Law practi- tomatic summarization metrics may be insufficient
tioners than the other algorithms, with CaseSum- to judge the quality of summaries in specialized
marizer coming second. These observations show domains such as Law.
a discrepancy with the automatic evaluation in
Section 7 where supervised methods got better 8 Concluding discussion
ROUGE scores than unsupervised ones.
We develop datasets and benchmark results for le-
Importantly, we again see that none of the sum- gal case judgement summarization. Our study pro-
maries could achieve a balanced representation of vides several guidelines for long and legal doc-
all the rhetorical segments (RPC – Arg). For in- ument summarization: (1) For extractive sum-
stance, DSDR (which gets the best overall scores) marization of legal documents, DSDR (unsuper-
represents the final judgement (RPC) and statutes vised) and SummaRuNNer (supervised) are promis-
(STA) well, but misses important precedents (PRE) ing methods. (2) For abstractive summarization,
and arguments (ARG). Legal-Pegasus (pretrained and finetuned) is a good
In general, the experts opined that the summaries choice. (3) For long documents, fine-tuning mod-
generated by several algorithms are good in the ini- els through chunking seems a promising way.
tial parts, but their quality degrades gradually from (4) Document-wide evaluation does not give the
the middle. Also, the experts felt the abstractive complete picture; domain-specific evaluation meth-
summaries to be less organized, often having in- ods, including domain experts, should also be used.
complete sentences; they felt that the abstractive
summaries have potential but need improvement. Acknowledgements
Correlation between expert judgments and the The authors acknowledge the anonymous review-
automatic metrics: As stated above, there seems ers for their suggestions. The authors thank the
to be some discrepancy between expert judgements Law domain experts from the Rajiv Gandhi School
and the automatic metrics for summarization. To of Intellectual Property Law, India (Amritha Shaji,
explore this issue further, we compute the corre- Ankita Mohanty, and Prof. Uday Shankar) and
lation between the expert judgments (average of from the West Bengal National University of Ju-
the ‘Overall’ scores of the three annotators) and ridical Sciences, India (Prof. Shouvik Guha) who
the automatic metrics (ROUGE-1,2, L Fscores and helped in developing the gold standard summaries
BERT-Scores). The human evaluation was con- (IN-Ext dataset) and evaluating the summaries. The
ducted over 5 documents and 7 algorithms. So, for research is partially supported by the TCG Centres
each metric, correlation was calculated between the for Research and Education in Science and Tech-
5 human-assigned overall scores and the 5 metric nology (CREST) through a project titled “Smart
scores, and then an average was taken across all the Legal Consultant: AI-based Legal Analytics”.
1056
References Yihong Gong and Xin Liu. 2001. Generic text summa-
rization using relevance measure and latent seman-
Legal pegasus. https://huggingface.co/ tic analysis. In Proc. International conference on
nsi319/legal-pegasus. [Online]. Research and development in information retrieval
(SIGIR).
Ahsaas Bajaj, Pavitra Dangati, Kalpesh Krishna, Prad-
hiksha Ashok Kumar, Rheeya Uppaal, Bradford
Zhanying He, Chun Chen, Jiajun Bu, Can Wang, Lijun
Windsor, Eliot Brenner, Dominic Dotterrer, Rajarshi
Zhang, Deng Cai, and Xiaofei He. 2012. Document
Das, and Andrew McCallum. 2021. Long document
summarization based on data reconstruction. In Proc.
summarization in a low resource setting using pre-
AAAI Conference on Artificial Intelligence, pages
trained language models. CoRR, abs/2103.00751.
620–626.
Iz Beltagy, Matthew E. Peters, and Arman Cohan. 2020.
Longformer: The long-document transformer. CoRR, Karl Moritz Hermann, Tomas Kocisky, Edward Grefen-
abs/2004.05150. stette, Lasse Espeholt, Will Kay, Mustafa Suleyman,
and Phil Blunsom. 2015. Teaching machines to read
Paheli Bhattacharya, Kaustubh Hiware, Subham Raj- and comprehend. In Advances in Neural Information
garia, Nilay Pochhi, Kripabandhu Ghosh, and Sap- Processing Systems, volume 28. Curran Associates,
tarshi Ghosh. 2019. A comparative study of summa- Inc.
rization algorithms applied to legal case judgments.
In Proc. European Conference on Information Re- Dandan Huang, Leyang Cui, Sen Yang, Guangsheng
trieval. Bao, Kun Wang, Jun Xie, and Yue Zhang. 2020a.
What have we achieved on text summarization? In
Paheli Bhattacharya, Shounak Paul, Kripabandhu Proceedings of the 2020 Conference on Empirical
Ghosh, and Saptarshi Ghosh. 2021. Deeprhole: deep Methods in Natural Language Processing (EMNLP),
learning for rhetorical role labeling of sentences in le- pages 446–469, Online. Association for Computa-
gal case documents. Artificial Intelligence and Law. tional Linguistics.

Arman Cohan, Franck Dernoncourt, Doo Soon Kim, Dandan Huang, Leyang Cui, Sen Yang, Guangsheng
Trung Bui, Seokhwan Kim, Walter Chang, and Nazli Bao, Kun Wang, Jun Xie, and Yue Zhang. 2020b.
Goharian. A discourse-aware attention model for What have we achieved on text summarization? In
abstractive summarization of long documents. In Proceedings of the 2020 Conference on Empirical
Proceedings of the 2018 Conference of the North Methods in Natural Language Processing (EMNLP),
American Chapter of the Association for Computa- pages 446–469.
tional Linguistics: Human Language Technologies,
Volume 2 (Short Papers), pages 615–621. Hongyan Jing. 2000. Sentence reduction for automatic
text summarization. In Proc. Applied Natural Lan-
Yue Dong. 2018. A survey on neural network-based guage Processing Conference.
summarization methods. CoRR, abs/1804.04589.
Chris Kedzie, Kathleen Mckeown, and Hal Daumé III.
Günes Erkan and Dragomir R. Radev. 2004. Lexrank: 2018. Content selection in deep learning models of
Graph-based lexical centrality as salience in text sum- summarization. In Proceedings of the 2018 Con-
marization. J. Artif. Int. Res., 22(1). ference on Empirical Methods in Natural Language
Processing, pages 1818–1828.
Atefeh Farzindar and Guy Lapalme. 2004. Letsum,
an automatic legal text summarizing system. Proc.
Philippe Laban, Tobias Schnabel, Paul N. Bennett, and
Legal knowledge and information systems (JURIX).
Marti A. Hearst. 2022. SummaC: Re-visiting NLI-
Diego Feijo and Viviane P. Moreira. 2021. Improving based models for inconsistency detection in summa-
abstractive summarization of legal rulings through rization. Transactions of the Association for Compu-
textual entailment. Artificial Intelligence and Law. tational Linguistics, 10:163–177.

Diego Feijó and Viviane Moreira. 2018. Rulingbr: A Mike Lewis, Yinhan Liu, Naman Goyal, Marjan
summarization dataset for legal texts. In Proc. Inter- Ghazvininejad, Abdelrahman Mohamed, Omer Levy,
national Conference on Computational Processing Veselin Stoyanov, and Luke Zettlemoyer. 2019.
of the Portuguese Language, pages 255–264. BART: denoising sequence-to-sequence pre-training
for natural language generation, translation, and com-
Dephne Gelbart and JC Smith. 1991. Beyond boolean prehension. CoRR, abs/1910.13461.
search: Flexicon, a legal tex-based intelligent system.
In Proceedings of the 3rd international conference on Mike Lewis, Yinhan Liu, Naman Goyal, Marjan
Artificial intelligence and law, pages 225–234. ACM. Ghazvininejad, Abdelrahman Mohamed, Omer Levy,
Veselin Stoyanov, and Luke Zettlemoyer. 2020.
Alexios Gidiotis and Grigorios Tsoumakas. 2020. A BART: Denoising sequence-to-sequence pre-training
divide-and-conquer approach to the summarization for natural language generation, translation, and com-
of academic articles. CoRR, abs/2004.06190. prehension. In Proc. ACL, pages 7871–7880.
1057
Chao-Lin Liu and Kuan-Chun Chen. 2019. Extracting Eva Sharma, Chen Li, and Lu Wang. BIGPATENT:
the gist of chinese judgments of the supreme court. A large-scale dataset for abstractive and coherent
In Proc. International Conference on Artificial Intel- summarization. In Proc. ACL, pages 2204–2213.
ligence and Law (ICAIL).
Sajad Sotudeh, Arman Cohan, and Nazli Goharian.
Yang Liu and Mirella Lapata. 2019. Text summarization 2021. On generating extended summaries of long
with pretrained encoders. In Proc. EMNLP-IJCNLP. documents. In The AAAI-21 Workshop on Scientific
Document Understanding (SDU 2021).
Laura Manor and Junyi Jessy Li. 2019. Plain English
summarization of contracts. In Proceedings of the Jingqing Zhang, Yao Zhao, Mohammad Saleh, and Pe-
Natural Legal Language Processing Workshop 2019, ter J Liu. 2020. Pegasus: pre-training with extracted
pages 1–11. gap-sentences for abstractive summarization. In Pro-
ceedings of the 37th International Conference on
Ramesh Nallapati, Feifei Zhai, and Bowen Zhou. 2017.
Machine Learning, pages 11328–11339.
Summarunner: A recurrent neural network based
sequence model for extractive summarization of doc- Tianyi Zhang, Varsha Kishore, Felix Wu, Kilian Q
uments. In Proc. AAAI Conference on Artificial In- Weinberger, and Yoav Artzi. 2019. Bertscore: Eval-
telligence. uating text generation with bert. arXiv preprint
Courtney Napoles, Matthew Gormley, and Benjamin arXiv:1904.09675.
Van Durme. 2012. Annotated Gigaword. In Proceed- Hao Zheng and Mirella Lapata. 2019. Sentence cen-
ings of the Joint Workshop on Automatic Knowledge trality revisited for unsupervised summarization. In
Base Construction and Web-scale Knowledge Extrac- Proc. ACL, pages 6236–6247.
tion (AKBC-WEKEX), pages 95–100.
Linwu Zhong, Ziyi Zhong, Zinian Zhao, Siyuan Wang,
Shashi Narayan, Shay B Cohen, and Mirella Lapata.
Kevin D Ashley, and Matthias Grabmair. 2019. Auto-
2018. Ranking sentences for extractive summariza-
matic summarization of legal decisions using iterative
tion with reinforcement learning. In Proc. NAACL-
masking of predictive sentences. In Proc. Interna-
HLT, pages 1747–1759.
tional Conference on Artificial Intelligence and Law
Artidoro Pagnoni, Vidhisha Balachandran, and Yulia (ICAIL).
Tsvetkov. 2021. Understanding Factuality in Ab-
stractive Summarization with FRANK: A Benchmark
for Factuality Metrics. In Proc. NAACL-HLT, pages
4812–4829.
Seth Polsley, Pooja Jhunjhunwala, and Ruihong Huang.
2016. Casesummarizer: A system for automated
summarization of legal texts. In Proc. Iinternational
conference on Computational Linguistics (COLING)
System Demonstrations.
Tharindu Ranasinghe, Constantin Orasan, and Ruslan
Mitkov. 2019. Enhancing unsupervised sentence
similarity methods with deep contextualised word
representations. In Proceedings of the International
Conference on Recent Advances in Natural Language
Processing (RANLP 2019), pages 994–1003, Varna,
Bulgaria. INCOMA Ltd.
Nils Reimers and Iryna Gurevych. 2019. Sentence-bert:
Sentence embeddings using siamese bert-networks.
In Proceedings of the 2019 Conference on Empirical
Methods in Natural Language Processing. Associa-
tion for Computational Linguistics.
M Saravanan, B Ravindran, and S Raman. 2006. Im-
proving legal document summarization using graph-
ical models. In Legal knowledge and information
systems, JURIX.
Abigail See, Peter J. Liu, and Christopher D. Manning.
2017. Get to the point: Summarization with pointer-
generator networks. In Proceedings of the 55th An-
nual Meeting of the Association for Computational
Linguistics (Volume 1: Long Papers), pages 1073–
1083.
1058
A Appendix variable number of Reasoning & Evidential Sup-
port sentences selected using Maximum Margin
A.1 Rhetorical Role Labels in a Legal Case Relevance, (v) one sentence from the conclusion.
Document Pattern-based regex extractors are used to identify
According to our legal experts, rhetorical role la- the sentences (i)-(iii) and (v).
bels/segments define a semantic function of the Reasoning & Evidential Support sentences are
sentences in a legal case documents. A good sum- identified using a 2-step supervised classification
mary should contain a concise representation of method – in the first step, sentences predictive of
each segment. These rhetorical segments are de- a case’s outcome are detected using Convolutional
fined as follows: Neural Networks. In the second step, a Random
(i) Facts (abbreviated as FAC): refers to the Forest Classifier is used to specifically extract the
chronology of events that led to filing the case; “Reasoning & Evidential Support” sentences from
(ii) Argument (ARG): arguments of the contend- the predictive sentences. In the absence of such
ing parties; annotated training datasets to build a 2-stage clas-
(iii) Statute (STA): Established laws referred to sification framework for India and UK, we adopt
by the present court; only the Maximum Margin Relevance module of
(iv) Precedent (PRE): Precedents/prior cases that their work as a baseline.
were referred to; This method decides the inclusion of a sentence
(v) Ratio of the decision (Ratio): reason- Si to the summary based on λ × Sim(Si , Case) +
ing/rationale for the final judgement given by the (1 − λ) × Sim(Si , Summary), where Case indi-
present court; cates the set of sentences in the original case doc-
(vi) Ruling by Present Court (RPC): the final ument and Summary represents the current set of
judgement given by the present court. sentences in the summary. λ acts as the weight that
balances the relevance and diversity; we consider
A.2 Implementations details of λ = 0.5.
Domain-Specific Extractive
• Gist: Gist uses the following handcrafted
Summarization Methods
features to represent every sentence in the input
We state here the reproducibility details of the legal case document (which is to be summarized) –
domain-specific summarization methods, which (i) Quantitative features: number of words, number
could not be stated in the main paper due to lack of of characters, number of unique words, and
space. position of the sentence
(ii) Case category information: The original paper
• Legal Dictionary: Some domain-specific sum-
produced summaries of Chinese documents which
marization methods like CaseSummarizer and
contain information like whether a document is
Gist use a set of legal keywords for identify-
recorded as a judgment or as a ruling (which is a
ing importance of sentences in the input docu-
category of judicial judgments) and specific words
ment. We identify these keywords using a glos-
that are used by the courts to indicate subcategories
sary from the legal repository https://www.
of the judgments. These information are absent in
advocatekhoj.com/library. This website
Indian and UK Supreme Court Case documents.
provides several legal resources for Indian legal
So we do not consider this category of features.
documents, including a comprehensive glossary of
(iii) Specific Legal Terms: We use a
legal terms.
legal dictionary for the purpose (from
• MMR: The original paper experiments on BVA https://www.advocatekhoj.com/
decision of the US jurisdiction. The MMR method library/glossary/a.php, as stated in the
creates a template-based summary considering vari- main paper).
ous semantic parts of a legal case document, and se- (iv) Word Embeddings: To construct the embed-
lecting a certain number of sentences from each se- ding of a sentence, we take the average of the
mantic part. Specifically, the summary is assumed embeddings of the words in the sentence. To this
to contain (i) one sentence from the procedural end, we train a word2vec model on the training
history, (ii) one sentence from issue, (iii) one sen- corpus (7030 documents of the IN-Abs and 693
tence from the service history of the veteran, (iv) a documents of the UK-Abs dataset). During
1059
evaluation, the trained word2vec model is used to labels are used as a post-summarization step that
derive the embeddings. is mainly used for displaying the summary in a
(v) One-hot vectors of first k POS tags in the structured way. We therefore implement only the
sequence, where k = 10 as mentioned in the paper sentence ranking modules for these methods – i.e,
(vi) Word Embeddings of the opening words: we TF-IDF based summarization for LetSum and K-
take the average of the embeddings of the first mixture model based summarization for KMM.
5 words in the sentence, since the paper did not
clearly mention how to obtain them. A.3 Implementation Details of
Based on the above features, Gist uses 3 models Domain-Independent Extractive
– MLP, Gradient Boosted Decision Tree, LSTM Summarization Methods
and a combination of LSTM and MLP classifiers We use the publicly available implementations of
– to rank sentences in order of their likelihood to the domain-independent extractive methods from
be included in the summary. We observe the best the following sources:
performance by using Gradient Boosted Decision
Tree as the ML classifier, which we report. • LexRank, LSA, Luhn and Reduction:
• CaseSummarizer: The original CaseSummarizer method was developed for Australian documents. All sentences in the input document are ranked using the following score: w_new = w_old + σ (0.2 d + 0.3 e + 1.5 s), where w_old is the sum of the TF-IDF values of the constituent words of the sentence, normalized over the sentence length, d is the number of dates present in the sentence, e is the number of named entity mentions in the sentence, s is a boolean variable indicating whether the sentence is at the start of any section, and σ is the standard deviation among the sentence scores.
The Indian case documents used in our study (IN-Ext and IN-Abs) are less structured than Australian case documents, and they do not contain section headings. So, in place of that feature, we used a count of the number of legal terms (identified by a legal dictionary) present in the sentence. We could find section numbers of Acts in our gold standard summaries, for example, "section 302 of the Indian Penal Code"; hence, for the parameter d in the formulation, we included both dates and section numbers. The authors did not clearly mention how they identified the "entities" in the texts, so we used the Stanford NER Tagger to identify entities within each sentence. To ensure a fair comparison, we used the same setting on UK-Abs too.
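The adapted scoring formula can be written compactly as below. This is a minimal illustrative re-implementation (not the original CaseSummarizer code) that assumes the per-sentence counts (dates/section numbers, entities, legal terms) have already been computed upstream.

```python
import numpy as np

def casesummarizer_scores(tfidf_sums, lengths, d_counts, e_counts, s_counts):
    """w_new = w_old + sigma * (0.2*d + 0.3*e + 1.5*s), where w_old is the
    length-normalised sum of TF-IDF values of the sentence's words."""
    w_old = np.asarray(tfidf_sums, float) / np.asarray(lengths, float)
    sigma = np.std(w_old)  # standard deviation among the sentence scores
    bonus = 0.2 * np.asarray(d_counts) + 0.3 * np.asarray(e_counts) + 1.5 * np.asarray(s_counts)
    return w_old + sigma * bonus

# toy example with three sentences
scores = casesummarizer_scores(
    tfidf_sums=[2.4, 1.1, 3.0],  # sum of TF-IDF weights of the words in each sentence
    lengths=[12, 8, 15],         # sentence lengths (in tokens)
    d_counts=[1, 0, 2],          # dates + Act section numbers found in the sentence
    e_counts=[2, 0, 1],          # named-entity mentions (e.g., from an NER tagger)
    s_counts=[0, 3, 1],          # our adaptation: number of legal-dictionary terms
)
print(np.argsort(-scores))       # sentence indices ranked by score
```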
• LetSum and KMM: Both the LetSum and KMM methods initially assign rhetorical labels to sentences (using certain cue-phrases and Conditional Random Fields, respectively). The sentences are then ranked, for which LetSum uses TF-IDF scores and KMM uses a K-Mixture Model based score. However, the rhetorical role information is not used for generating the summary; rather, the rhetorical labels are used in a post-summarization step, mainly for displaying the summary in a structured way. We therefore implement only the sentence ranking modules for these methods – i.e., TF-IDF based summarization for LetSum and K-Mixture Model based summarization for KMM.

A.3 Implementation Details of Domain-Independent Extractive Summarization Methods

We use the publicly available implementations of the domain-independent extractive methods from the following sources:

• LexRank, LSA, Luhn and Reduction: https://pypi.org/project/sumy/

• PacSum: https://github.com/mswellhao/PacSum

• SummaRuNNer: https://github.com/hpzhao/SummaRuNNer

• BERTSUM: https://github.com/nlpyang/PreSumm. The original BERTSUM model uses a post-processing step called Trigram Blocking, which excludes a candidate sentence if it has a significant amount of trigram overlap with the already generated summary (to minimize redundancy in the summary). However, we observed that this step leads to summaries that are too short, as also observed in (Sotudeh et al., 2021). Hence we ignore this step.
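For reference, the following minimal sketch shows the kind of trigram-overlap test that Trigram Blocking performs (and that we switch off). It is an illustration in plain Python, not the PreSumm code.

```python
def _trigrams(text):
    toks = text.lower().split()
    return {tuple(toks[i:i + 3]) for i in range(len(toks) - 2)}

def blocked_by_trigram_overlap(candidate, selected_sentences):
    """Return True if the candidate shares any trigram with the summary built so far."""
    summary_tris = set()
    for s in selected_sentences:
        summary_tris |= _trigrams(s)
    return bool(_trigrams(candidate) & summary_tris)

summary_so_far = ["the appeal was dismissed by the high court"]
print(blocked_by_trigram_overlap("the appeal was dismissed with costs", summary_so_far))       # True
print(blocked_by_trigram_overlap("the respondent filed a counter affidavit", summary_so_far))  # False
```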
A.4 Methods for obtaining Training Data for Extractive Supervised Methods

As stated in Section 5, we tried three methods for generating training data for extractive supervised methods from abstractive reference summaries. The best-performing Avr method (which we finally used in our experiments) was described in Section 5. Here we describe the other two methods that we tried.

(i) Maximal: In this approach, proposed in (Nallapati et al., 2017), the basic premise is to maximize the ROUGE score between the extractive summary and the abstractive gold-standard summary. However, global optimization is computationally expensive; a faster greedy strategy is to keep adding sentences to the extractive summary one by one, each time selecting the sentence that, when added to the already extracted summary, gives the maximum ROUGE score with respect to the abstractive gold-standard summary. This process is repeated till the ROUGE score does not increase anymore. Finally, all the sentences in this extractive summary are labelled as 1, and the rest as 0.
Legal-Pegasus and Legal-LED. be inversely proportional to the number of word
occurrences.
ROUGE score with respect to the abstractive gold-
(ii) Cosine similarity with BERT [CLS] token
standard summary. This process is repeated till the
(CLS-CS): Here we consider the cosine similarity
ROUGE score does not increase anymore. Finally,
of the encodings of the CLS tokens of the two sen-
all the sentences in this extractive summary are
tences (as given by the pre-trained BERT model).
labelled as 1, the rest as 0.
(iii) MCS_RR: Here, we using Rhetorical Roles
(ii) TF-IDF: We calculated the TF-IDF vectors
(RR) for generating finetuning data that incorpo-
for all the sentences in the source document and
rates legal domain knowledge. As described earlier
those in the summary. For each sentence in the sum-
in Section 3, a legal case document consists of
mary, we find three sentences in the full text that
7 rhetorical segments such as Facts, Statutes, etc.
are most similar to it. The similarity is measured as
We incorporate this knowledge into our abstractive
the cosine-similarity between the TF-IDF vectors
summarization process by combining it with the
of a sentence in the summary and a sentence in the
divide and conquer approach presented in (Gid-
source document, and similarity should be greater
iotis and Tsoumakas, 2020) (which is originally
than 0.4. We label the sentences in the source doc-
designed for summarizing research articles that are
ument that are similar to some summary-sentence
already segmented into logical segments).
as 1, rest as 0.
We first use a state-of-the-art classifier for rhetor-
A.5 Implementation details of Abstractive ical labeling of sentences in a legal document (Bhat-
Summarization Methods tacharya et al., 2021) to assign one of the labels –
We use the publicly available implementations of RPC, FAC, STA, RLC, Ratio, PRE, ARG – to each
the abstractive methods from the following sources: sentence of a document. We collate sentences of
a particular role as one segment. Thus, effectively,
• BART: https://huggingface.co/ we partition a document into 7 segments, each seg-
facebook/BART_large ment corresponding to a rhetorical role. Then we
apply the same approach as stated above to gener-
• Legal-Pegasus (trained on legal documents):
ate the summary of each segment; for this, we use
https://huggingface.co/nsi319/
the MCS sentence similarity measure (which per-
legal-pegasus
forms the best, as we shall see later in Section 7).
• Legal-LED (trained on legal documents): Note that, some of these rhetorical segments them-
https://huggingface.co/nsi319/ selves may be longer than the input token limit of
legal-led-base-16384 BART and Pegasus; in such cases, we further di-
vide the rhetorical segments into smaller chunks,
The hyper-parameters for finetuning are given in and then generate the summary of each chunk.
Table 9.

A.6 Methods for obtaining finetuning data for A.7 Detailed Summarization Results
abstractive summarization models Table 10, Table 11 and Table 12 contain the
As stated in Section 6.2, we experimented with document-wide ROUGE and BERTScores for the
several sentence similarity measures for generating IN-Ext, IN-Abs and UK-Abs datasets respectively.
finetuning data for abstractive models. The best These tables give the results for all summarization
performing sentence similarity measure, MCS, was methods that we have applied (while the tables in
described in Section 6.2. Here we describe the the main text report results of only some of the
other sentence similarity measures that we tried. best-performing methods).
(i) Smooth Inverse frequency with cosine similar- Table 13 and Table 14 contain the segment-
ity (SIF) (Ranasinghe et al., 2019): This approach wise ROUGE scores over the IN-Ext and UK-Abs
1061
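The sketch below illustrates this divide-and-conquer flow under stated assumptions: predict_rhetorical_role stands in for the rhetorical-role classifier of Bhattacharya et al. (2021) and summarize_chunk for a fine-tuned abstractive model. Both are hypothetical callables, and the toy document and token limit are for illustration only.

```python
from collections import defaultdict

ROLES = ["RPC", "FAC", "STA", "RLC", "Ratio", "PRE", "ARG"]

def segment_by_role(sentences, predict_rhetorical_role):
    """Group sentences by their predicted rhetorical role (up to 7 segments)."""
    segments = defaultdict(list)
    for sent in sentences:
        segments[predict_rhetorical_role(sent)].append(sent)
    return {role: segments[role] for role in ROLES if segments[role]}

def chunked(sentences, tokenizer, max_tokens):
    """Split a segment into chunks that respect the model's input-token limit."""
    chunk, length = [], 0
    for sent in sentences:
        n = len(tokenizer(sent))
        if chunk and length + n > max_tokens:
            yield chunk
            chunk, length = [], 0
        chunk.append(sent)
        length += n
    if chunk:
        yield chunk

def summarize_document(sentences, predict_rhetorical_role, summarize_chunk,
                       tokenizer=str.split, max_tokens=1024):
    summary_parts = []
    for role, seg in segment_by_role(sentences, predict_rhetorical_role).items():
        for chunk in chunked(seg, tokenizer, max_tokens):
            summary_parts.append(summarize_chunk(" ".join(chunk)))
    return " ".join(summary_parts)

# toy run with dummy stand-ins for the classifier and the abstractive model
doc = ["The appellant was convicted under section 302.",
       "The facts are as follows.",
       "We dismiss the appeal."]
dummy_role = lambda s: "STA" if "section" in s else ("FAC" if "facts" in s else "RPC")
dummy_summarizer = lambda text: text[:40]
print(summarize_document(doc, dummy_role, dummy_summarizer))
```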
A.7 Detailed Summarization Results

Table 10, Table 11 and Table 12 contain the document-wide ROUGE and BERTScores for the IN-Ext, IN-Abs and UK-Abs datasets respectively. These tables give the results for all summarization methods that we have applied (while the tables in the main text report results of only some of the best-performing methods). Table 13 and Table 14 contain the segment-wise ROUGE scores over the IN-Ext and UK-Abs datasets, for all methods that we have applied.

Table 10: Document-wide ROUGE and BERTScore (F-score) on the IN-Ext dataset. All values are averaged over the 50 documents in the dataset. The best value in a particular class of methods is in bold.
Algorithm            R-1    R-2    R-L    BERTScore
Extractive Methods: Unsupervised, Domain Independent
LexRank              0.564  0.344  0.388  0.862
Lsa                  0.553  0.348  0.397  0.875
DSDR                 0.566  0.317  0.264  0.834
Luhn                 0.568  0.373  0.422  0.882
Reduction            0.561  0.358  0.405  0.869
Pacsum_bert          0.590  0.410  0.335  0.879
Pacsum_tfidf         0.566  0.357  0.301  0.839
Extractive Methods: Unsupervised, Legal Domain Specific
MMR                  0.563  0.318  0.262  0.833
KMM                  0.532  0.302  0.28   0.836
LetSum               0.591  0.401  0.391  0.875
CaseSummarizer       0.52   0.321  0.279  0.835
Extractive Methods: Supervised, Domain Independent
SummaRunner          0.532  0.334  0.269  0.829
BERT-Ext             0.589  0.398  0.292  0.85
Extractive Methods: Supervised, Legal Domain Specific
Gist                 0.555  0.335  0.391  0.864
Abstractive Methods: Pretrained
BART                 0.475  0.221  0.271  0.833
BERT-BART            0.488  0.236  0.279  0.836
Legal-Pegasus        0.465  0.211  0.279  0.842
Legal-LED            0.175  0.036  0.12   0.799
Abstractive Methods: Finetuned
BART_CLS             0.534  0.29   0.349  0.853
BART_MCS             0.557  0.322  0.404  0.868
BART_SIF             0.540  0.304  0.369  0.857
BERT_BART_MCS        0.553  0.316  0.403  0.869
Legal-Pegasus_MCS    0.575  0.351  0.419  0.864
Legal-LED            0.471  0.26   0.341  0.863
BART_MCS_RR          0.574  0.345  0.402  0.864

Table 11: Document-wide ROUGE and BERTScore (F-score) on the IN-Abs dataset, averaged over the 100 test documents. The best value in a particular class of methods is in bold.
Algorithm            R-1    R-2    R-L    BERTScore
Extractive Methods: Unsupervised, Domain Independent
LexRank              0.436  0.195  0.284  0.843
Lsa                  0.401  0.172  0.259  0.834
DSDR                 0.485  0.222  0.27   0.848
Luhn                 0.405  0.181  0.268  0.837
Reduction            0.431  0.195  0.284  0.844
Pacsum_bert          0.401  0.175  0.242  0.839
Pacsum_tfidf         0.428  0.194  0.262  0.834
Extractive Methods: Unsupervised, Legal Domain Specific
MMR                  0.452  0.21   0.253  0.844
KMM                  0.455  0.2    0.259  0.843
LetSum               0.395  0.167  0.251  0.833
CaseSummarizer       0.454  0.229  0.279  0.843
Extractive Methods: Supervised, Domain Independent
SummaRunner          0.493  0.255  0.274  0.849
BERT-Ext             0.427  0.199  0.239  0.821
Extractive Methods: Supervised, Legal Domain Specific
Gist                 0.471  0.238  0.308  0.842
Abstractive Methods: Pretrained
BART                 0.39   0.156  0.246  0.829
BERT-BART            0.337  0.112  0.212  0.809
Legal-Pegasus        0.441  0.19   0.278  0.845
Legal-LED            0.223  0.053  0.159  0.813
Abstractive Methods: Finetuned
BART_CLS             0.484  0.231  0.311  0.85
BART_MCS             0.495  0.249  0.33   0.851
BART_SIF             0.49   0.246  0.326  0.851
BERT_BART_MCS        0.487  0.243  0.329  0.853
Legal-Pegasus_MCS    0.488  0.252  0.341  0.851
Legal-LED            0.471  0.235  0.332  0.856
BART_MCS_RR          0.49   0.234  0.311  0.849

Table 12: Document-wide ROUGE and BERTScore (F-score) on the UK-Abs dataset, averaged over the 100 test documents. The best value for each category of methods is in bold.
Algorithm            R-1    R-2    R-L    BERTScore
Extractive Methods: Unsupervised, Domain Independent
LexRank              0.481  0.187  0.265  0.848
Lsa                  0.426  0.149  0.236  0.843
DSDR                 0.484  0.174  0.221  0.832
Luhn                 0.444  0.171  0.25   0.844
Reduction            0.447  0.169  0.253  0.844
Pacsum_bert          0.448  0.175  0.228  0.843
Pacsum_tfidf         0.414  0.146  0.213  0.825
Extractive Methods: Unsupervised, Legal Domain Specific
MMR                  0.440  0.151  0.205  0.83
KMM                  0.430  0.138  0.201  0.827
LetSum               0.437  0.158  0.233  0.842
CaseSummarizer       0.445  0.166  0.227  0.835
Extractive Methods: Supervised, Domain Independent
SummaRunner          0.502  0.205  0.237  0.846
BERT-Ext             0.431  0.184  0.24   0.821
Extractive Methods: Supervised, Legal Domain Specific
Gist                 0.427  0.132  0.215  0.819
Abstractive Methods: Pretrained
Pointer_Generator    0.420  0.133  0.193  0.812
BERT-Abs             0.362  0.087  0.208  0.803
BART                 0.436  0.142  0.236  0.837
BERT-BART            0.369  0.099  0.198  0.805
Legal-Pegasus        0.452  0.155  0.248  0.843
Legal-LED            0.197  0.038  0.138  0.814
Abstractive Methods: Finetuned
BART_CLS             0.481  0.172  0.255  0.844
BART_MCS             0.496  0.188  0.271  0.848
BART_SIF             0.485  0.18   0.262  0.845
BERT_BART_MCS        0.476  0.172  0.259  0.847
Legal-Pegasus_MCS    0.476  0.171  0.261  0.838
Legal-LED            0.482  0.186  0.264  0.851
BART_MCS_RR          0.492  0.184  0.26   0.839

Table 13: Segment-wise ROUGE-L Recall scores of all methods on the IN-Ext dataset. All values averaged over the 50 documents in the dataset. The best value for each segment in a particular class of methods is in bold.
Algorithm            RPC (6.42%)  FAC (34.85%)  STA (13.42%)  Ratio+Pre (28.83%)  ARG (16.45%)
Extractive Methods
LexRank              0.039        0.204         0.104         0.208               0.127
Lsa                  0.037        0.241         0.091         0.188               0.114
DSDR                 0.053        0.144         0.099         0.21                0.104
Luhn                 0.037        0.272         0.097         0.175               0.117
Reduction            0.038        0.236         0.101         0.196               0.119
Pacsum_bert          0.038        0.238         0.087         0.154               0.113
Pacsum_tfidf         0.039        0.189         0.111         0.18                0.111
MMR                  0.049        0.143         0.092         0.198               0.096
KMM                  0.049        0.143         0.1           0.198               0.103
LetSum               0.036        0.237         0.115         0.189               0.1
CaseSummarizer       0.044        0.148         0.084         0.212               0.104
SummaRunner          0.059        0.158         0.08          0.209               0.096
BERT-Ext             0.038        0.199         0.082         0.162               0.093
Gist                 0.041        0.191         0.102         0.223               0.093
Pretrained Abstractive Methods
BART                 0.037        0.148         0.076         0.187               0.087
BERT-BART            0.038        0.154         0.078         0.187               0.084
Legal-Pegasus        0.043        0.139         0.076         0.186               0.092
Legal-LED            0.049        0.131         0.078         0.228               0.091
Finetuned Abstractive Methods
BART_MCS             0.036        0.206         0.082         0.228               0.092
BERT_BART_MCS        0.037        0.205         0.085         0.237               0.094
Legal-Pegasus_MCS    0.037        0.192         0.09          0.257               0.101
Legal-LED            0.053        0.245         0.086         0.187               0.124
BART_MCS_RR          0.061        0.192         0.082         0.237               0.086

A.8 More Insights from Segment-wise Evaluation

Table 13 shows the segment-wise ROUGE-L Recall scores of all methods on the IN-Ext dataset, considering the 5 rhetorical segments RPC, FAC, STA, ARG, and Ratio+PRE. Similarly, Table 14 shows the segment-wise ROUGE-L Recall scores of all methods on the UK-Abs dataset, considering the 3 segments Background, Reasons, and Final Judgement. In this section, we present some more observations from these segment-wise evaluations, which could not be reported in the main paper due to lack of space.

An interesting observation is that the performance of several methods on a particular segment depends on the size and location of that segment in the documents. The FAC (Facts) segment in the IN-Ext dataset and the Background segment in the UK-Abs dataset are large segments that appear at the beginning of the case documents. On the other hand, the RPC (Ruling by Present Court) segment in IN-Ext and the 'Final judgement' segment in UK-Abs are short segments appearing at the end of the documents. Most domain-independent models, like Luhn and BERT-Ext, perform much better on the FAC and Background segments than on the RPC and 'Final judgement' segments. Such models may be suffering from the lead-bias problem (Kedzie et al., 2018), whereby a method has a tendency to pick the initial sentences of a document for inclusion in the summary.

However, the RPC and 'Final judgement' segments are important from a legal point of view, and should be represented well in the summary according to domain experts (Bhattacharya et al., 2019). In fact, the performance of all methods is relatively poor on these segments (see Table 13 and Table 14). Hence, another open challenge in domain-specific long document summarization is to develop algorithms that perform well on short segments that have domain-specific importance.

A.9 Expert Evaluation Details

We mention below some more details of the expert evaluation, which could not be accommodated in the main paper due to lack of space.

Choice of documents for the survey: We selected 5 documents from the IN-Abs test set, specifically, the five documents that gave the best average ROUGE-L F-scores over the 7 summarization methods chosen for the human evaluation. Ideally, some summaries that obtained lower ROUGE scores should also have been included in the evaluation by the domain experts, but the number of summaries that we could get evaluated was limited by the availability of the experts.

Framing the questions asked in the survey: We framed the set of questions (described in Section 7.3) based on the parameters stated in (Bhattacharya et al., 2019; Huang et al., 2020b) about how a legal document summary should be evaluated.

Pearson Correlation as IAA: The human annotators were asked to rate the summaries on a scale of 0-5, for different parameters. Here we discuss the IAA for the 'Overall' parameter. For a particular summary of a document, consider that Annotator 1 and Annotator 2 have given scores of 2 and 3 respectively. There are two choices for calculating the IAA – (i) in a regression setup, these scores denote a fairly high agreement between the annotators; (ii) in a classification setup, if we consider each score to be a 'class', then Annotator 1 has assigned 'class 2' and Annotator 2 has assigned 'class 3', which implies a total disagreement between the two experts. In our setting, we find the regression setup for calculating the IAA more suitable than the classification setup. Therefore, we use the Pearson correlation between the expert scores as the inter-annotator agreement (IAA) measure. For each algorithmic summary, we calculate the correlation between the two sets of 'Overall' scores, and then take the average across the seven 'Overall' correlation scores for the seven algorithmic summaries.

Computing the correlation between human judgements and the automatic metrics: Recall that we have 5 documents for the human evaluation. For a particular algorithm, e.g., DSDR, suppose the average 'Overall' scores given by the human annotators to the summaries of the 5 documents generated by DSDR are [h1, h2, h3, h4, h5], where hi denotes the average 'Overall' score given by humans for the i-th document's summary (in the range [0, 1]). Suppose the ROUGE-1 F-scores of the DSDR summaries (computed with respect to the reference summaries) are [d1, d2, d3, d4, d5], where di denotes the ROUGE-1 F-score for the i-th document's DSDR-generated summary (in the range [0, 1]). We then compute the Pearson correlation c_DSDR between the list of human scores and the list of ROUGE-1 F-scores for DSDR. We repeat the above procedure for all the 7 algorithms for a particular metric (e.g., ROUGE-1 F-score) to get 7 c values (e.g., c_DSDR, c_Gist, etc.), and then take the average of the 7 values. This gives the final correlation between the ROUGE-1 F-score and the overall scores assigned by the human evaluators. Likewise, we compute the correlation between the other automatic metrics (e.g., ROUGE-2 F-score, BERTScore) and the human-assigned overall scores.
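Both computations reduce to Pearson correlations over short lists of scores. The sketch below shows them with scipy; all numbers are made-up illustrative values, not our actual survey scores.

```python
from scipy.stats import pearsonr

# (a) Inter-annotator agreement: correlation between the 'Overall' scores that the
# two annotators gave to the summaries produced by one algorithm.
annotator1 = [2, 3, 4, 2, 5]
annotator2 = [3, 3, 4, 1, 5]
iaa, _ = pearsonr(annotator1, annotator2)

# (b) Metric-vs-human correlation: for each algorithm, correlate the 5 per-document
# human 'Overall' scores with the 5 per-document ROUGE-1 F-scores, then average the
# per-algorithm correlations.
human_scores = {"DSDR": [0.5, 0.6, 0.4, 0.7, 0.6], "Gist": [0.6, 0.5, 0.5, 0.8, 0.7]}
rouge1_fscores = {"DSDR": [0.45, 0.52, 0.40, 0.66, 0.58], "Gist": [0.50, 0.48, 0.47, 0.71, 0.65]}

per_algo = [pearsonr(human_scores[a], rouge1_fscores[a])[0] for a in human_scores]
final_correlation = sum(per_algo) / len(per_algo)
print(round(iaa, 3), round(final_correlation, 3))
```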
Table 14: Segment-wise ROUGE-L Recall scores of all methods on the UK-Abs dataset. All values averaged over the 100 documents in the evaluation set. The best value for each segment in a particular class of methods is in bold.
Algorithm            Background (39%)  Final Judgement (5%)  Reasons (56%)
Extractive Methods
LexRank              0.197             0.037                 0.161
Lsa                  0.175             0.036                 0.141
DSDR                 0.151             0.041                 0.178
Luhn                 0.193             0.034                 0.146
Reduction            0.188             0.035                 0.158
Pacsum_bert          0.176             0.036                 0.148
Pacsum_tfidf         0.154             0.035                 0.157
MMR                  0.152             0.04                  0.17
KMM                  0.133             0.037                 0.157
LetSum               0.133             0.037                 0.147
CaseSummarizer       0.153             0.036                 0.17
SummaRunner          0.172             0.044                 0.165
BERT-Ext             0.203             0.034                 0.135
Gist                 0.123             0.041                 0.195
Pretrained Abstractive Methods
BART                 0.161             0.04                  0.175
BERT-BART            0.143             0.04                  0.158
Legal-Pegasus        0.169             0.042                 0.177
Legal-LED            0.177             0.066                 0.219
Finetuned Abstractive Methods
BART_MCS             0.168             0.041                 0.184
BERT_BART_MCS        0.174             0.047                 0.183
Legal-Pegasus_MCS    0.166             0.039                 0.202
Legal-LED            0.187             0.058                 0.172
BART_MCS_RR          0.165             0.042                 0.18

A.10 Ethics and limitations statement

All the legal documents and summaries used in the paper are publicly available data on the Web, except the reference summaries for the IN-Ext dataset, which were written by the Law experts whom we consulted. The Law experts were informed of the purpose for which the annotations/surveys were being carried out, and they were provided with a mutually agreed honorarium for conducting the annotations/surveys as well as for writing the reference summaries in the IN-Ext dataset.

The study was performed over legal documents from two countries (India and the UK). While the methods presented in the paper should be applicable to legal documents of other countries as well, it is not certain whether the reported trends in the results (e.g., the relative performance of the various summarization algorithms) will generalize to legal documents of other countries.

The evaluation study by experts was conducted over a relatively small number of summaries (35), which was limited by the availability of the experts. Also, different Law practitioners have different preferences about summaries of case judgements. The observations presented are according to the Law practitioners we consulted, and can vary in case of other Law practitioners.