A
Project Based Learning Report
For
“Develop text summarization tool by using extractive
summarization techniques.”
is submitted in partial fulfillment of the requirement for the award of degree of
Bachelor of Technology
(VII Semester B. Tech. for the course Data Science for NLP (PECAD703T))
in
Artificial Intelligence & Data Science
Submitted by
Aashay Kale Rohan Chouhan
Rajesh Dandwe Tanu Patil
Under the guidance of
Prof. Kalyani Pendke
Assistant Professor
Department of Emerging Technologies
(Artificial Intelligence & Data Science)
S. B. Jain Institute of Technology,
Management and Research, Nagpur
(An Autonomous Institute Affiliated to R. T.M. Nagpur University)
Academic Session: 2024-2025 (ODD)
Problem Statement:
Develop a text summarization tool using extractive summarization techniques.
Objectives:
1. Develop an Extractive Summarization Model: Design and implement an extractive text
summarization model that selects and ranks the most relevant sentences from a given text based on
predefined criteria such as frequency, position, and importance.
2. Improve Summarization Efficiency: Ensure that the model can handle large volumes of text
while maintaining high performance in terms of speed and accuracy. The tool should be capable of
summarizing content from various sources, including news articles, research papers, and reports.
3. Ensure Relevance and Coherence: Create summaries that are coherent and retain the essential
meaning of the original text. The tool must focus on producing summaries that are both concise and
representative of the main points.
4. Evaluate Model Performance: Establish a framework for evaluating the accuracy, precision, and
readability of the summaries generated by the tool. Use metrics such as ROUGE scores, human
evaluation, or domain-specific criteria to assess the quality of the summaries.
5. Provide a User-Friendly Interface: Develop a simple and intuitive user interface (UI) for end
users to input text and receive a summary. The UI should cater to both technical and non-technical
users, ensuring accessibility and ease of use.
6. Support Multi-Domain Text: Ensure that the summarization tool can handle different types of
text across various domains, such as technical documents, legal papers, and general news, by
adjusting extraction strategies to suit the text type.
7. Incorporate Customization: Allow users to adjust the summary length or level of detail,
providing flexibility for generating summaries based on the user’s specific needs.
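As an illustration of the evaluation objective (item 4), a ROUGE-1 style score can be sketched in a few lines of Python. This is only a minimal sketch under stated assumptions, not the evaluation framework of this project: the function names tokenize and rouge_1 are hypothetical, tokenization is a simple regular expression, and no stemming or stop-word handling is applied, so the values may differ from those of an established ROUGE package.

```python
import re
from collections import Counter


def tokenize(text):
    """Lowercase the text and split it into alphanumeric word tokens."""
    return re.findall(r"[a-z0-9]+", text.lower())


def rouge_1(candidate, reference):
    """Return unigram-overlap precision, recall and F1 (ROUGE-1 style)."""
    cand_counts = Counter(tokenize(candidate))
    ref_counts = Counter(tokenize(reference))
    # Clipped overlap: a word counts only as often as it occurs in both texts.
    overlap = sum((cand_counts & ref_counts).values())
    precision = overlap / max(sum(cand_counts.values()), 1)
    recall = overlap / max(sum(ref_counts.values()), 1)
    if overlap == 0:
        f1 = 0.0
    else:
        f1 = 2 * precision * recall / (precision + recall)
    return {"precision": precision, "recall": recall, "f1": f1}


# Hypothetical candidate summary and reference summary, for illustration only.
scores = rouge_1(
    "deep learning uses neural networks",
    "deep learning is based on artificial neural networks",
)
print(scores)
```

Here recall measures how much of the reference summary is covered by the generated summary, while precision measures how much of the generated summary appears in the reference.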
Introduction:
Automatic text summarization refers to a group of methods that use algorithms to condense a
body of text while preserving its key points. Although it may not receive as much attention as
other machine learning successes, the field has seen consistent advancement and improvement.
Systems capable of extracting the key concepts from a text while maintaining its overall meaning
have the potential to benefit a variety of industries, including banking, law, and healthcare.
• Types of Text Summarization
There are typically two basic methods for automatic text summarization:
1. Extractive summarization
2. Abstractive summarization
➢ Extractive Summarization
Extractive summarization algorithms generate a summary by selecting and combining key passages
from the source material. Unlike human summarizers, who usually rephrase, these models select
the most important sentences from the original text rather than generating new ones.
TextRank is a widely used algorithm for extractive summarization and is the one used in this
project; its use is demonstrated in the Code section. Extractive summarization picks the most
relevant sentences from an article and arranges them systematically. The sentences making up the
summary are taken verbatim from the source material. Extractive summarization systems generally
revolve around three fundamental operations:
1) Construction of an intermediate representation of the input text
Topic representation and indicator representation are the two main kinds of intermediate
representation. Topic representation converts the text into a form that captures the subject(s)
discussed in it, while indicator representation describes each sentence by a set of features
(indicators) of its importance.
2) Scoring the sentences based on the representation
Once the intermediate representation has been generated, each sentence is assigned a significance
score. With a method that relies on topic representation, a sentence's score reflects how well it
explains the critical concepts in the text. In indicator representation, the score is computed by
aggregating the evidence from different weighted indicators.
3) Selection of a summary comprising several sentences
To generate a summary, the summarizer software picks the top k sentences. For example, some
methods use greedy algorithms to pick and choose which sentences are most relevant, while others
may transform sentence selection into an optimization problem in which a set of sentences is
selected under the stipulation that it must maximize overall importance and coherence while
minimizing the quantity of redundant information.
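The three operations above can be sketched in plain Python. This is a minimal sketch under stated assumptions, not the method implemented later in this report: it uses word frequency as the intermediate representation, the mean word frequency as the sentence score, and a greedy selection with a simple overlap penalty against redundancy. The names STOPWORDS, split_sentences, content_words, summarize and redundancy_penalty are hypothetical, and the sentence splitter is deliberately naive.

```python
import re
from collections import Counter

# Small illustrative stop-word list (an assumption, not a standard list).
STOPWORDS = {"the", "a", "an", "of", "and", "to", "in", "is", "are",
             "it", "that", "on", "for", "with", "as", "by"}


def split_sentences(text):
    """Naive splitter: break after '.', '!' or '?' followed by whitespace."""
    return [s.strip() for s in re.split(r"(?<=[.!?])\s+", text) if s.strip()]


def content_words(sentence):
    """Lowercased word tokens of a sentence, without stop words."""
    words = re.findall(r"[a-z0-9]+", sentence.lower())
    return [w for w in words if w not in STOPWORDS]


def summarize(text, k=2, redundancy_penalty=0.5):
    sentences = split_sentences(text)

    # 1) Intermediate representation: frequency of each content word.
    tokens = [content_words(s) for s in sentences]
    freq = Counter(w for words in tokens for w in words)

    # 2) Sentence score: mean frequency of the sentence's content words.
    scores = []
    for words in tokens:
        scores.append(sum(freq[w] for w in words) / len(words) if words else 0.0)

    # 3) Greedy selection: repeatedly take the best remaining sentence,
    #    discounting sentences whose words are already covered.
    covered = set()

    def adjusted_score(i):
        words = set(tokens[i])
        overlap = len(words & covered) / len(words) if words else 0.0
        return scores[i] * (1.0 - redundancy_penalty * overlap)

    selected = []
    candidates = set(range(len(sentences)))
    while candidates and len(selected) < k:
        best = max(sorted(candidates), key=adjusted_score)
        selected.append(best)
        covered.update(tokens[best])
        candidates.remove(best)

    # Return the chosen sentences in their original document order.
    return [sentences[i] for i in sorted(selected)]


sample = ("Cats chase mice. Cats chase mice at night. "
          "Dogs bark loudly. The weather is nice.")
print(summarize(sample, k=2, redundancy_penalty=1.0))
print(summarize(sample, k=2, redundancy_penalty=0.0))
```

With the penalty set to 0.0 the two near-duplicate sentences about cats are both selected, while a penalty of 1.0 replaces the second one with a sentence that adds new information. The parameter k plays the role of the user-adjustable summary length.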
➢ Utilizing the TextRank Algorithm for Extractive Text Summarization:
TextRank is available as a spaCy pipeline component. spaCy is a Python library for natural
language processing, and pytextrank is a spaCy extension that implements the TextRank algorithm.
The TextRank algorithm can produce reasonably satisfactory results. Nevertheless, extractive
summarization techniques only return a shortened version of the original text, made up of
sentences retained from the source, instead of generating new text (new data) to summarize the
information contained in the original text.
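To give an intuition for the graph-based ranking behind TextRank, the idea can be sketched at sentence level in plain Python. This is a simplified illustration and not pytextrank's internal implementation: sentences are treated as graph nodes, Jaccard word overlap is assumed as the edge weight, and a PageRank-style iteration with a fixed number of steps produces the ranking. The names split_sentences, word_set, jaccard and textrank_sentences, as well as the damping and iterations defaults, are choices made for this sketch.

```python
import re


def split_sentences(text):
    """Naive splitter: break after '.', '!' or '?' followed by whitespace."""
    return [s.strip() for s in re.split(r"(?<=[.!?])\s+", text) if s.strip()]


def word_set(sentence):
    """Set of lowercased word tokens in a sentence."""
    return set(re.findall(r"[a-z0-9]+", sentence.lower()))


def jaccard(a, b):
    """Word-overlap similarity between two word sets."""
    union = a | b
    return len(a & b) / len(union) if union else 0.0


def textrank_sentences(text, damping=0.85, iterations=50):
    """Rank sentences with a PageRank-style iteration on a similarity graph."""
    sentences = split_sentences(text)
    n = len(sentences)
    if n == 0:
        return []
    words = [word_set(s) for s in sentences]
    # Edge weights: similarity between every pair of distinct sentences.
    weights = [
        [jaccard(words[i], words[j]) if i != j else 0.0 for j in range(n)]
        for i in range(n)
    ]
    out_sum = [sum(row) for row in weights]

    ranks = [1.0 / n] * n
    for _ in range(iterations):
        new_ranks = []
        for i in range(n):
            # Each neighbour j passes on rank in proportion to the edge weight.
            incoming = sum(
                weights[j][i] / out_sum[j] * ranks[j]
                for j in range(n)
                if out_sum[j] > 0
            )
            new_ranks.append((1 - damping) / n + damping * incoming)
        ranks = new_ranks
    # Highest-ranked sentences first.
    return sorted(zip(ranks, sentences), reverse=True)


sample = ("Neural networks learn from data. "
          "Deep neural networks learn layered features from data. "
          "Bananas are yellow.")
ranked = textrank_sentences(sample)
for score, sentence in ranked:
    print(round(score, 3), sentence)
```

Sentences that share vocabulary with many other sentences accumulate rank, while a sentence unrelated to the rest of the text, such as the last one in the sample, stays at the minimum score.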
Code:
spaCy
To install spaCy, run the command below in a notebook cell (in a terminal, omit the leading "!"):
!pip install spacy
To download the English language model:
!python3 -m spacy download en_core_web_lg
TextRank
To install PyTextRank:
!pip install pytextrank
Text Summarization
This code uses spaCy and PyTextRank to automatically summarize a given text. After the packages
and the spaCy language model have been installed as described above, it loads the model, adds the
TextRank component to the pipeline, and processes a lengthy text. The summary is built from the
2 top-ranked phrases and is limited to 2 sentences.
import spacy
import pytextrank
nlp = spacy.load("en_core_web_lg")
nlp.add_pipe("textrank")
example_text = """Deep learning (also known as deep structured learning) is part of a broader family
of machine learning methods based on artificial neural networks with representation learning.
Learning can be supervised, semi-supervised or unsupervised. Deep-learning architectures such as
deep neural networks, deep belief networks, deep reinforcement learning, recurrent neural networks
and convolutional neural networks have been applied to fields including computer vision, speech
recognition, natural language processing, machine translation, bioinformatics, drug design, medical
image analysis, material inspection and board game programs, where they have produced results
comparable to and in some cases surpassing human expert performance. Artificial neural networks
(ANNs) were inspired by information processing and distributed communication nodes in biological
systems. ANNs have various differences from biological brains. Specifically, neural networks tend
to be static and symbolic, while the biological brain of most living organisms is dynamic (plastic)
and analogue. The adjective "deep" in deep learning refers to the use of multiple layers in the
network. Early work showed that a linear perceptron cannot be a universal classifier, but that a
network with a nonpolynomial activation function with one hidden layer of unbounded width
can. Deep learning is a modern variation which is concerned with an unbounded number of layers of
bounded size, which permits practical application and optimized implementation, while retaining
theoretical universality under mild conditions. In deep learning the layers are also permitted to be
heterogeneous and to deviate widely from biologically informed connectionist models, for the sake
of efficiency, trainability and understandability, whence the structured part."""
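# Size of the original document, in characters
print('Original Document Size:', len(example_text))
doc = nlp(example_text)
# Use the 2 top-ranked phrases to score sentences and keep the 2 best sentences
for sent in doc._.textrank.summary(limit_phrases=2, limit_sentences=2):
    print(sent)
    # Length in characters, so that it is comparable with the document size above
    print('Summary Length:', len(sent.text))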
Output:
Conclusion:
The development of an extractive text summarization tool offers a practical way to process large
volumes of text efficiently while retaining key information. By selecting the most relevant
sentences, such a tool can support productivity in industries like healthcare, finance, and law.
Despite challenges in maintaining coherence and relevance, extractive techniques provide a
scalable approach to summarization. Ultimately, the tool can improve information retrieval and
decision-making by enabling users to quickly access essential insights from extensive data.
Evaluation Parameters
Sr. No. | Roll No./USN No. | Name of Student | Faculty Assessment (4M) | Submission (3M) | Viva (3M) | Total (10M) | Signature
1 | AD21061 | Aashay Subhash Kale | | | | |
2 | AD21062 | Rohan Chouhan | | | | |
3 | AD21063 | Rajesh Umesh Dandwe | | | | |
4 | AD22D001 | Tanu Prakash Patil | | | | |
Signature of Course In-Charge