BCSE497J - Project-I
Bachelor of Technology
in
Computer Science and Engineering
by
Aryan Goyal (21BCE2572)
Sumit Kejriwal (21BCE2671)
Pari Gawli (21BKT0023)
DECLARATION
Place : Vellore
Date : 20-11-2024
CERTIFICATE
This is to certify that the project entitled Keyword Extraction submitted by Aryan
Goyal (21BCE2572), Sumit Kejriwal (21BCE2671) and Pari Gawli (21BKT0023), School
of Computer Science and Engineering, VIT, for the award of the degree of Bachelor of
Technology in Computer Science and Engineering, is a record of bonafide work carried out by them under my supervision during Fall Semester 2024-2025, as per the VIT code
of academic and research ethics.
The contents of this report have not been submitted and will not be submitted either
in part or in full, for the award of any other degree or diploma in this institute or any other
institute or university. The project fulfills the requirements and regulations of the University
and in my opinion meets the necessary standards for submission.
Place : Vellore
Date : 20-11-2024
Examiner(s)
ACKNOWLEDGEMENT
Our sincere thanks to Dr. Ramesh Babu K, the Dean of the School of Computer Science and Engineering (SCOPE), for his unwavering support and encouragement. His leadership and vision have greatly inspired us to strive for excellence. The Dean's dedication to academic excellence and innovation has been a constant source of motivation for us. We appreciate his efforts in creating an environment that nurtures creativity and critical thinking.
We express our profound appreciation to Dr. Gopinath M.P, the Head of the Department of Information Security, for his insightful guidance and continuous support. His expertise and advice have been crucial in shaping the direction of our project. The Head of Department's commitment to fostering a collaborative and supportive atmosphere has greatly enhanced our learning experience. His constructive feedback and encouragement have been invaluable in overcoming challenges and achieving our project goals.
ARYAN GOYAL
SUMIT KEJRIWAL
PARI GAWLI
TABLE OF CONTENTS
5. METHODOLOGY AND TESTING 31-37
6. PROJECT DEMONSTRATION 38-45
7. RESULT AND DISCUSSION (COST ANALYSIS as applicable) 46-49
8. CONCLUSION 50-52
9. REFERENCES 53
APPENDIX A – SAMPLE CODE 54-61
ABSTRACT
Keyword extraction is a fundamental task in natural language processing (NLP), essential for
applications such as topic modeling and document classification. While many keyword
extraction methods exist, most of them operate in an unguided manner, extracting general
keywords without considering their relevance to specific categories. This report addresses the challenge of class-specific keyword extraction, where the goal is to identify keywords that pertain to a predefined class or category. We propose an improved method for class-specific
keyword extraction, which builds upon the popular KEYBERT algorithm. Our method
introduces a more refined approach by focusing exclusively on seed keyword embeddings,
using a two-part scoring system based on cosine similarity to rank and expand the initial set
of seed keywords iteratively.
The proposed method is evaluated on a dataset of 10,000 entries from the German business
registry (Handelsregister), where businesses are classified into predefined economic sectors
based on the WZ 2008 classification scheme. The results of the evaluation show that our
method significantly outperforms traditional keyword extraction techniques such as RAKE,
YAKE, and standard KEYBERT in terms of precision for class-specific keyword extraction.
Precision metrics at various thresholds (Precision@10, Precision@25, Precision@50, and
Precision@100) demonstrate that the proposed approach consistently identifies more relevant
and class-specific keywords compared to previous methods.
In conclusion, our method sets a new standard for class-specific keyword extraction,
providing a robust solution for applications that require precise and targeted keyword
identification. Future research will explore the applicability of this method to other languages
and domains, as well as further optimization of the pipeline parameters.
1. INTRODUCTION
1.1 Background
1.2 Motivation
The primary motivation behind this research is to address the challenge of extracting class-
specific keywords—keywords that are relevant only to a predefined class. In various
applications, such as classifying businesses into economic sectors, it is essential to extract
keywords that are not only meaningful but also specific to particular categories. Existing
keyword extraction methods do not sufficiently address this challenge, and the development
of a more targeted approach is necessary.
This research builds on the popular KEYBERT method, modifying its functionality to focus
entirely on class-specific keywords. The improved method is evaluated using data from the
German business registry (Handelsregister), with the goal of extracting keywords relevant to
predefined economic sectors.
2. PROJECT DESCRIPTION AND GOALS
Despite their effectiveness, existing keyword extraction methods are largely unguided, extracting any keywords that seem relevant without considering class specificity. This lack of focus on class-specific keywords has been identified as a gap in the existing literature.
2.3 Objectives
• To evaluate the effectiveness of the method based on precision metrics at different
thresholds.
The existing methods for keyword extraction do not effectively address the challenge of
extracting class-specific keywords. These methods are unguided, and as a result, they extract
keywords without regard for the downstream classification or the specific class to which the
keywords should pertain. This project seeks to develop a method that can extract keywords
relevant to predefined classes, improving classification accuracy in applications such as
business categorization.
Research and Literature Review: Investigate the existing methods for keyword
extraction and identify the gaps in class-specific keyword extraction.
Method Development: Develop a pipeline that builds on the existing KEYBERT
method, incorporating modifications to focus on class-specific keywords.
Data Collection and Preparation: Use a dataset of 10,000 entries from the German
business registry to test the method.
Implementation: Implement the proposed method and compare its performance with
traditional methods like RAKE, YAKE, and standard KEYBERT.
Evaluation: Use precision metrics to evaluate the method's effectiveness at extracting
class-specific keywords, and report the results.
Documentation and Reporting: Compile the findings into a research report and
suggest future improvements.
3. TECHNICAL SPECIFICATION
3.1 Requirements
2. Keyword Extraction Pipeline
The keyword extraction pipeline implements the enhanced methodology for class-specific
keyword extraction using an iterative, scoring-based process.
1. Pipeline Components
o Initial Extraction:
▪ Utilize an enhanced KEYBERT algorithm to extract general keywords
from documents based on word embeddings.
o Seed Keyword Integration:
▪ Refine results using seed keywords representing class-specific contexts
(e.g., terms from the WZ 2008 economic classification scheme).
o Iterative Refinement:
▪ Apply iterative scoring using cosine similarity between extracted
keywords and seed keyword embeddings.
▪ Refine results in successive iterations to improve relevance.
2. Scoring Mechanism
o Cosine Similarity-Based Ranking:
▪ Calculate both average and maximum similarity scores for each
keyword to evaluate its relevance to a given class.
o Threshold-Based Filtering:
▪ Set a similarity threshold to filter out irrelevant keywords while retaining class-specific terms (a minimal scoring sketch follows this list).
3. Output Formats
o Export extracted keywords with associated scores in structured formats (e.g.,
JSON, CSV) for downstream analysis.
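The scoring step above can be illustrated with a short sketch. It is a minimal example rather than the project's exact implementation: the sentence-transformers model name, the threshold value, and the helper name score_candidates are illustrative assumptions.

# Minimal sketch of the two-part scoring (average and maximum cosine similarity
# against seed keyword embeddings). Model name, threshold, and function name
# are illustrative assumptions.
from sentence_transformers import SentenceTransformer
from sklearn.metrics.pairwise import cosine_similarity

model = SentenceTransformer("paraphrase-multilingual-MiniLM-L12-v2")  # assumed model

def score_candidates(candidates, seed_keywords, threshold=0.5):
    cand_emb = model.encode(candidates)           # (n_candidates, dim)
    seed_emb = model.encode(seed_keywords)        # (n_seeds, dim)
    sims = cosine_similarity(cand_emb, seed_emb)  # pairwise cosine similarities

    scored = []
    for keyword, row in zip(candidates, sims):
        avg_sim, max_sim = float(row.mean()), float(row.max())
        # Threshold-based filtering: keep only candidates close to the seed set.
        if max_sim >= threshold:
            scored.append((keyword, avg_sim, max_sim))
    # Rank by average similarity, breaking ties with the maximum similarity.
    return sorted(scored, key=lambda t: (t[1], t[2]), reverse=True)

The ranked tuples can then be written to JSON or CSV for the downstream categorization step described above.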
3. Document Categorization
The system classifies documents into predefined categories based on the extracted keywords.
1. Classification Workflow
o Keyword Matching:
▪ Match extracted keywords with predefined category labels.
o Machine Learning Classifier (Optional):
▪ Enhance accuracy using ML models like logistic regression, SVM, or
transformer-based classifiers trained on labeled data.
2. Category Definitions
o Support customizable category schemas (e.g., WZ 2008 classification for
economic sectors).
3. Output and Reporting
o Generate classification reports, highlighting key metrics such as the proportion of
documents in each category and confidence scores.
4. Precision Evaluation Metrics
The system evaluates its performance using precision-based metrics to measure keyword
extraction and document categorization accuracy.
1. Key Metrics
o Precision@N:
▪ Evaluate precision at various levels (e.g., Precision@10, Precision@25), indicating the proportion of correctly identified class-specific keywords among the top-N extracted keywords (a short sketch of this computation follows this list).
o Fuzzy Matching:
▪ Incorporate fuzzy string matching or semantic similarity to assess near-
matches that retain class relevance.
2. Comparative Analysis
o Compare results against baseline algorithms (e.g., RAKE, YAKE, standard
KEYBERT) to validate improvements.
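A minimal sketch of the Precision@N computation is given below. It uses Python's standard difflib module as a stand-in for the fuzzy matching mentioned above; the function names and the 0.85 fuzzy cutoff are illustrative assumptions.

# Minimal Precision@N computation with a difflib-based stand-in for fuzzy matching.
from difflib import SequenceMatcher

def is_match(keyword, gold_terms, fuzz_cutoff=0.85):
    # Exact or fuzzy match of a keyword against a gold-standard term list.
    kw = keyword.lower()
    return any(SequenceMatcher(None, kw, g.lower()).ratio() >= fuzz_cutoff for g in gold_terms)

def precision_at_n(ranked_keywords, gold_terms, n):
    # Proportion of correctly identified class-specific keywords among the top-N.
    top_n = ranked_keywords[:n]
    hits = sum(1 for kw in top_n if is_match(kw, gold_terms))
    return hits / max(len(top_n), 1)

# Example: precision_at_n(extracted_keywords, wz2008_terms, 10) gives Precision@10.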
1. Efficiency
1. Processing Speed
o Optimize the pipeline for batch processing, enabling the system to handle large
datasets (e.g., 10,000+ documents) within acceptable timeframes.
o Employ parallel processing techniques for computationally intensive tasks like
cosine similarity calculations.
2. Resource Utilization
o Design the system to operate efficiently on mid-range hardware, minimizing
dependency on high-performance computing resources.
2. Scalability
1. Dataset Expansion
o Support larger datasets without degradation in performance, ensuring
compatibility with future data expansions.
2. Cross-Domain and Multilingual Support
o Ensure adaptability to other languages and domains by incorporating language-
agnostic embeddings like M-BERT or XLM-Roberta.
3. Dynamic Seed Keyword Management
o Allow updates to seed keywords and category definitions without requiring
system reconfiguration.
3.2 Feasibility Study
3.2.1 Technical Feasibility
This aspect evaluates whether the project can be effectively developed and implemented using the available technology, tools, and resources.
Existing Technology
The project builds upon established keyword extraction techniques like KEYBERT, which
leverages transformer-based embeddings for high-quality keyword identification. By
incorporating methods for improving class-specific keyword extraction, the project aims to
enhance domain-specific categorization. Given the extensive documentation and accessibility of
libraries and tools in the field of natural language processing (NLP), such as Hugging Face
Transformers, spaCy, and NLTK, this is technically feasible. The robustness of these libraries
ensures that the foundational requirements of the project are met efficiently.
Data Availability
The availability of structured data, such as information from business registries, ensures that the
project has access to sufficient training and testing datasets. These sources often provide well-
organized and labeled data, critical for developing and validating NLP models. Additionally,
publicly accessible datasets, like those available from governmental or organizational sources,
further strengthen data availability. Preprocessing methods can refine these datasets for project-
specific goals.
Skill Set
The project requires expertise in several domains:
• NLP: Understanding text preprocessing, feature extraction, and embeddings.
• Machine Learning (ML): Building and evaluating classification models.
• Tools & Libraries: Experience with KEYBERT, scikit-learn, and potentially deep
learning frameworks like PyTorch or TensorFlow.
If the project team possesses intermediate to advanced skills in these areas, the project can
proceed smoothly. Gaps in expertise can be addressed through online courses, tutorials, or
collaboration with domain experts.
Implementation Tools
To handle computational demands, the project can leverage modern computing resources, such
as:
• Local Systems: Multi-core processors with adequate RAM for medium-sized datasets.
• Cloud Services: Platforms like AWS, Google Cloud, or Azure can handle larger datasets
and allow for distributed processing.
• Specialized Hardware: For deep learning extensions, GPUs may accelerate embedding
computations and iterative keyword scoring.
Given these tools, the project is technically feasible with mid- to high-range resources.
3.2.2 Economic Feasibility
This section evaluates the financial viability of the project, weighing costs against potential
benefits and returns.
Cost of Development
Development costs are expected to be moderate, encompassing:
• Data access or acquisition (if specific registries charge for usage).
• Computational resources, including local systems or cloud credits.
• Software tools, which are often open-source but may require licenses for advanced
features.
For university projects, these costs can be minimized through free tiers of cloud platforms, open-
source software, and institutional support.
Maintenance Costs
Post-development, costs will include:
• Model retraining and updating with new data to ensure continued relevance.
• Refining keyword lists to match evolving business needs.
• Managing and maintaining a backend system for real-time usage in production
environments.
By planning for these costs upfront, the project can maintain long-term economic feasibility.
3.2.3 Social Feasibility
This dimension evaluates the project’s acceptance and potential benefits to users, stakeholders,
and the wider community.
3.3 System Specification
This section outlines the hardware requirements necessary to support the efficient
implementation, testing, and scaling of the project.
Processor
• Minimum Requirements:
o A multi-core CPU such as Intel Core i5 (8th generation or newer) or AMD Ryzen
5 (3000 series or newer). These processors are adequate for handling moderate
NLP tasks and computations.
• Recommended:
o Higher-end processors like Intel Core i7/i9 or AMD Ryzen 7/9 for optimal
performance when working with large datasets or running complex models.
o For deep learning workflows, a GPU-compatible system is recommended for
reduced training time.
Memory (RAM)
• Minimum:
o 16 GB RAM to support NLP operations like tokenization, embeddings, and
model training on small to medium datasets.
• Optimal:
o 32 GB RAM or higher to handle extensive datasets, multiple processes, or
memory-intensive tasks such as parallel computations.
Storage
• Primary Storage:
o A Solid State Drive (SSD) with a minimum of 256 GB for faster read/write
speeds. This enhances data retrieval and reduces latency when managing
intermediate files during model training and evaluation.
• Additional Storage:
o External storage options like an HDD or additional SSDs for long-term storage of
large datasets or backups. A minimum of 1 TB is advisable for projects
processing multiple document classes or high-dimensional embeddings.
Graphics Card (GPU)
o Dedicated GPUs such as NVIDIA GTX 1060 or better (e.g., RTX 20 or 30 series)
are beneficial for tasks requiring GPU-accelerated libraries (e.g., PyTorch,
TensorFlow) or when working with transformer models.
Internet Connection
Software Requirements
This section describes the software environment needed for development and deployment.
1. Operating System
2. Programming Languages
• Primary Language: Python 3.x, favored for its extensive ecosystem of libraries tailored
for NLP and machine learning tasks.
3. NLP and Machine Learning Libraries
• NLTK or spaCy: Essential for preprocessing tasks, including tokenization, named entity recognition, and part-of-speech tagging.
• KEYBERT: Central to the project for keyword extraction, leveraging transformer embeddings (an illustrative usage example follows this list).
• scikit-learn: For implementing classification algorithms, hyperparameter tuning, and
evaluation metrics such as precision, recall, and F1-score.
• Transformers (Hugging Face): Enables the use of pre-trained transformer models like
BERT for generating contextual embeddings or enhancing classification accuracy.
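The following is an illustrative example of how KEYBERT is typically invoked for this kind of task. The model name, the sample registry text, and the seed terms are hypothetical; the seed_keywords argument is used here as commonly documented for the library, not as a transcription of the project code.

# Illustrative KEYBERT invocation; model name, sample text, and seed terms are
# hypothetical example values.
from keybert import KeyBERT

kw_model = KeyBERT(model="paraphrase-multilingual-MiniLM-L12-v2")

doc = "Die Firma handelt mit Softwareentwicklung und IT-Beratung."  # hypothetical registry entry
keywords = kw_model.extract_keywords(
    doc,
    keyphrase_ngram_range=(1, 1),            # unigrams, matching the report's focus
    stop_words=None,
    top_n=10,
    seed_keywords=["software", "beratung"],  # assumed class-specific seed terms
)
print(keywords)  # list of (keyword, score) tuples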
4. Data Processing and Analysis Tools
• Pandas and NumPy: For data manipulation, handling structured datasets, and performing
mathematical operations.
• Matplotlib or Seaborn: Visualization libraries to analyze and present data trends, model
performance, and results effectively.
6. IDE/Code Editor
• Interactive Development:
o Jupyter Notebook or Google Colab, ideal for prototyping, visualizing results, and
collaborative work.
• Robust Development:
o VS Code or PyCharm for comprehensive code editing, debugging, and project
management.
7. Version Control
8. Cloud Services
• Platforms:
o Google Cloud Platform (GCP), AWS, or Microsoft Azure to scale processing
power or storage when working with large datasets or deploying models for real-
time applications.
• Services:
o Access to GPU or TPU instances for accelerated computations.
o Cloud-based databases or storage (e.g., AWS S3 or Google Cloud Storage) for
managing large datasets.
Additional Considerations
4. DESIGN APPROACH AND DETAILS
1. Data Ingestion Layer
This layer is responsible for gathering and preprocessing raw data, ensuring it is clean and ready for analysis.
2. Keyword Extraction Layer
This layer focuses on deriving meaningful keywords from documents to facilitate categorization.
3. Document Categorization Layer
This layer assigns documents to predefined categories based on the extracted keywords and machine learning models.
4. Data Storage Layer
Handles the storage and indexing of raw, processed, and categorized data.
4.1 Database
• NoSQL Databases: MongoDB for flexible storage of unstructured text data.
• SQL Databases: PostgreSQL or MySQL for structured metadata (e.g., categorization
results, logs).
• Hybrid Setup: Combine NoSQL for raw data and SQL for indexing metadata.
4.2 Indexing
• Elasticsearch or Solr to enable fast keyword-based retrieval and search functionalities.
5. Application Layer
6. Monitoring and Feedback Layer
Summary of Layers
DIAGRAM:
The provided system architecture is designed for processing text data, extracting keywords, and
generating output files while maintaining detailed activity logs. The process begins at the
frontend, where users provide raw text input via an interface. This input is then sent to the
backend for further processing.
The backend is divided into three main modules. The Text Preprocessing Module standardizes
the text by converting it to lowercase, removing special characters, and eliminating unimportant
stopwords. It also applies lemmatization to reduce words to their base forms, ensuring the text is
clean and ready for analysis. Following preprocessing, the text is passed to the Keyword
Extraction Module, which identifies important words or phrases using three algorithms:
TextRank (graph-based ranking), TF-IDF (statistical analysis of term uniqueness), and RAKE
(pattern and frequency analysis). Each algorithm offers a unique perspective on what constitutes
a keyword, enhancing the reliability of the results.
Once the keywords are extracted, the File Export Module allows users to save the results in
readable formats like PDFs or Word documents. The system also includes a Logger Module that
records every significant action, such as preprocessing steps, algorithm usage, and file
generation, into a log file for debugging or auditing. All processed files and logs are stored in
local storage for future access. This architecture is highly modular, ensuring that text data is
efficiently cleaned, analyzed, and output while maintaining detailed operational transparency.
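A compact sketch of the preprocessing described above is shown below. It assumes the NLTK stopword and WordNet resources have already been downloaded; the function name is illustrative.

# Compact preprocessing sketch; requires nltk.download('stopwords') and
# nltk.download('wordnet') to have been run once.
import re
from nltk.corpus import stopwords
from nltk.stem import WordNetLemmatizer

STOP_WORDS = set(stopwords.words("english"))
lemmatizer = WordNetLemmatizer()

def preprocess(raw_text):
    text = raw_text.lower()                                    # lowercase
    text = re.sub(r"[^a-z\s]", " ", text)                      # drop special characters and digits
    tokens = [t for t in text.split() if t not in STOP_WORDS]  # remove stopwords
    tokens = [lemmatizer.lemmatize(t) for t in tokens]         # reduce words to base forms
    return " ".join(tokens)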
4.2 DESIGN
The diagram represents the process flow of a text analysis application, showcasing how user
input is processed, analyzed, and transformed into output files, while maintaining logs for system
activities. It is composed of interconnected modules that define the application's workflow.
The process begins with the User Interface (UI), where the user provides text input and chooses
an export option. The input text flows into the Text Preprocessing Module, which cleans and
prepares the data for analysis. This involves converting the text to lowercase, removing special
characters, filtering out stopwords, and applying lemmatization to simplify words to their root
forms. Once the text is cleaned, it is passed to the Keyword Extraction Module, which identifies
significant keywords using techniques like TextRank, TF-IDF, and RAKE. Each algorithm
applies different logic, ensuring diverse and robust keyword extraction.
After keyword extraction, the results and processed text are directed to the File Export Module,
where they can be saved as PDF or Word (DOCX) files. These files are then stored in Output
Storage for later use. The system also integrates a Logger Module, which records each step of
the process—from text preprocessing to file generation—into a log file for transparency and
debugging. This modular design ensures that text data is efficiently transformed into meaningful
insights, with every step systematically monitored and documented.
4.2.2 USE CASE DIAGRAM
The provided use case diagram illustrates the interaction between a user and a text processing
and analysis application, focusing on user actions and corresponding system activities. It
highlights the modular approach of the application, where user inputs trigger backend
processing, followed by options for result viewing and export.
From the user's perspective, the process begins with providing text input. Once submitted, the
user can view the results, which include extracted keywords generated by the system. The user is
then given the option to export these results in various formats, such as PDF or Word (DOCX).
This ensures flexibility in accessing and sharing the output.
On the system side, the input text undergoes preprocessing, where it is cleaned by removing
unwanted elements like special characters and stopwords, converting to lowercase, and
lemmatizing words. This cleaned text is then analyzed to extract meaningful keywords using
algorithms like TextRank, TF-IDF, and RAKE. Throughout this process, the system logs key
activities, such as preprocessing steps and keyword extraction events, to maintain transparency
and enable debugging.
Finally, the system supports file export functionality, where the processed results can be saved in
user-specified formats. The export events are also logged, ensuring that all critical actions are
traceable. This structured approach demonstrates how the application seamlessly integrates user
interactions with backend processing to deliver a robust text analysis solution.
4.2.3 CLASS DIAGRAM
These classes work together to perform text preprocessing, keyword extraction, logging, and
exporting processed data. Below is an explanation of the components:
Core Classes:
TextPreprocessor: This class handles the preprocessing of raw text. It takes raw input text and
processes it (e.g., lowercasing, removing stopwords, lemmatization). It also includes attributes to
store the raw text, processed text, and the language of the text. This ensures flexibility for
handling multi-language text processing.
StopWordsHandler: This class manages stopword removal. It stores a list of stopwords and
generates filtered text after removing them from the input. It works closely with the
TextPreprocessor to streamline the preprocessing pipeline.
Keyword Extraction:
TextRankExtractor: This class implements the TextRank algorithm to identify ranked keywords
from the processed text. It stores attributes for the input text, the ID for tracking purposes, and
the ranked keywords.
RAKEExtractor: This class performs keyword extraction using the RAKE (Rapid Automatic
Keyword Extraction) algorithm. It identifies multi-word keywords and phrases, storing them as
extractedKeywords. It operates independently, allowing for comparison or parallel processing
with TextRank.
TFIDFExtractor: This class uses the TF-IDF method to score words based on their importance in
the text. It stores the input text, the computed TF-IDF scores, and an ID for identification. This
ensures multiple keyword extraction approaches can be utilized for different use cases.
KeywordExtractor: This class acts as a composite, aggregating keywords from various extraction
methods (e.g., TextRank, RAKE, TF-IDF). It stores the final list of keywords and text, providing
a unified output for further use.
Logger: The logger is crucial for tracking system operations. It records log levels (e.g., debug,
info, error), messages, and timestamps. This helps monitor the system's activities and identify
any issues during the workflow.
FileExporter: This class handles the exporting of processed data or reports. It stores details about
the file type (e.g., PDF, Word), file path, and export status, ensuring that the final output is saved
and logged appropriately.
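The relationships described above can be summarized in a skeletal sketch. Attribute and method names are paraphrased from the diagram description rather than copied from the implementation.

# Skeletal view of the classes described above; names are paraphrased.
from dataclasses import dataclass, field
from typing import List

@dataclass
class TextPreprocessor:
    raw_text: str
    language: str = "en"
    processed_text: str = ""

    def preprocess(self) -> str:
        # lowercasing, stopword removal, lemmatization (details omitted here)
        self.processed_text = self.raw_text.lower()
        return self.processed_text

class TextRankExtractor:
    def extract(self, text: str) -> List[str]: ...   # ranked keywords via TextRank

class RAKEExtractor:
    def extract(self, text: str) -> List[str]: ...   # multi-word phrases via RAKE

class TFIDFExtractor:
    def extract(self, text: str) -> List[str]: ...   # TF-IDF-scored terms

@dataclass
class KeywordExtractor:
    # Composite that aggregates keywords from the individual extractors.
    extractors: List[object] = field(default_factory=list)

    def extract_all(self, text: str) -> List[str]:
        keywords: List[str] = []
        for extractor in self.extractors:
            keywords.extend(extractor.extract(text))
        return list(dict.fromkeys(keywords))   # de-duplicate while preserving order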
4.2.4 SEQUENCE DIAGRAM
The system consists of multiple components working in a structured manner to preprocess input
text, extract keywords and key phrases, and generate reports in different formats (PDF or Word).
Below is an explanation of its components:
User Interaction: The user starts the process by inputting raw text. The system initiates the
workflow based on user-provided input and selected report preferences (e.g., PDF or Word).
Logger: The logger is an integral part of the system, tracking each step of the process. It records
activities such as text preprocessing steps (e.g., lowercasing, removing stopwords),
keyword/phrase extraction, and report generation. This ensures transparency and helps debug
any issues during execution.
Preprocessing Function: The raw input text is passed to the preprocessing function. Here, several operations are performed sequentially, including lowercasing, removal of special characters and stopwords, and lemmatization, producing clean text for the extraction stage.
TextRank Function: The cleaned text is passed to the TextRank function, which applies graph-based ranking to identify the top keywords. These keywords are logged and returned for inclusion in the output.
RAKE (Rapid Automatic Keyword Extraction) Function: Parallel to TextRank, the RAKE
function is used for extracting key phrases from the preprocessed text. This algorithm identifies
top phrases, typically focusing on multi-word expressions. These key phrases are logged and
returned to the system for inclusion in the final output.
Report Generation:
Based on the user’s selection, either a PDF Generator or a Word Document Generator is
invoked.
For PDFs, the system creates the report, logs the PDF creation process, and saves the file.
Similarly, for Word documents, the system generates the file, logs the process, and saves the
output. The logging here ensures that the report creation is monitored, and any issues can be
easily identified.
Completion: After generating the requested report, the system logs the completion of the
process, signaling the end of the workflow.
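A minimal sketch of the export step is given below, using the fpdf and python-docx libraries that appear in the appendix code; the function names and layout are illustrative rather than the project's exact report generators.

# Minimal export helpers using fpdf and python-docx; names and layout are illustrative.
from fpdf import FPDF
from docx import Document

def export_pdf(keywords, path="keywords.pdf"):
    pdf = FPDF()
    pdf.add_page()
    pdf.set_font("Arial", size=12)
    for i, kw in enumerate(keywords, start=1):
        pdf.cell(0, 10, txt=f"{i}. {kw}", ln=1)   # one keyword per line
    pdf.output(path)

def export_docx(keywords, path="keywords.docx"):
    doc = Document()
    doc.add_heading("Extracted Keywords", level=1)
    for i, kw in enumerate(keywords, start=1):
        doc.add_paragraph(f"{i}. {kw}")
    doc.save(path)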
5. METHODOLOGY AND TESTING
➢ Module description
1. Data Ingestion Module
Expanded Purpose:
Handles the collection of diverse types of raw data, including structured, semi-structured, and
unstructured documents, from multiple sources. Ensures scalability to handle large data volumes
and different data formats like JSON, XML, CSV, and plain text.
Additional Functions:
• Source Integration Management: Configures and maintains integrations with APIs,
databases, and web scraping frameworks (e.g., Scrapy, BeautifulSoup).
• Format Conversion: Converts diverse input formats into a unified data format.
• Scheduling: Automates periodic data collection through cron jobs or task schedulers.
• Error Handling: Logs and retries failed data collection attempts.
Challenges:
• Handling rate limits for API integrations.
• Dealing with incomplete or inconsistent data from external sources.
• Ensuring compliance with data privacy regulations, such as GDPR.
2. Text Preprocessing Module
Expanded Purpose:
Transforms raw text into structured data ready for analysis, accounting for linguistic nuances,
such as abbreviations, synonyms, and domain-specific terminology.
Additional Functions:
• Language Detection: Identifies and processes multilingual text, translating where
necessary.
• Custom Text Filters: Filters domain-specific noise, e.g., legal disclaimers or boilerplate
text.
• Named Entity Recognition (NER): Identifies and isolates entities like names, dates, and
locations for specialized analysis.
• Custom Dictionary Support: Incorporates custom dictionaries or ontologies to better
handle domain-specific terms.
Challenges:
• Standardizing formats across multiple languages and domains.
• Balancing text reduction (e.g., removing stopwords) with retaining meaningful content.
3. Keyword Extraction Module
Expanded Purpose:
Goes beyond simple keyword detection to derive semantic meaning and prioritize keywords that
capture domain-specific context. Supports unsupervised and semi-supervised extraction
approaches.
Additional Functions:
• Phrase Extraction: Extracts multi-word terms (e.g., "data science" instead of "data" and
"science").
• Contextual Weighting: Uses embeddings (e.g., Word2Vec, BERT) to weigh keywords
based on context.
• Dynamic Updates: Continuously updates seed keywords and filtering rules based on
user feedback.
• Topic Modeling (Optional): Clusters similar keywords into topics for better categorization insights, e.g., using LDA or NMF (a small sketch follows this module description).
Challenges:
• Balancing precision and recall for keyword relevance.
• Handling polysemy (words with multiple meanings) in keywords.
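As referenced above, the optional topic modeling step can be sketched with scikit-learn. The number of topics and terms per topic are arbitrary example values, not tuned settings from this project.

# Illustrative topic clustering with scikit-learn NMF.
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.decomposition import NMF

def cluster_keywords(documents, n_topics=5, top_terms=8):
    vectorizer = TfidfVectorizer(stop_words="english")
    tfidf = vectorizer.fit_transform(documents)
    nmf = NMF(n_components=n_topics, random_state=0).fit(tfidf)
    terms = vectorizer.get_feature_names_out()
    topics = []
    for component in nmf.components_:
        top = component.argsort()[::-1][:top_terms]   # highest-weighted terms per topic
        topics.append([terms[i] for i in top])
    return topics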
4. Document Categorization Module
Expanded Purpose:
Combines rules-based and machine learning-based approaches for accurate classification,
supporting hierarchical and multi-label categorization.
Additional Functions:
• Hybrid Models: Integrates rules (seed keywords) with advanced ML/NLP models
(transformers, embeddings).
• Confidence Scoring: Assigns confidence scores to categorization for reliability assessment (a classification sketch with confidence flags follows this module description).
• Uncertainty Handling: Flags documents with ambiguous classifications for manual
review.
• Cross-Domain Learning: Adapts pre-trained models to specific domains using fine-
tuning techniques.
Challenges:
• Managing overlapping categories or keywords.
• Ensuring model interpretability for end-users.
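The confidence scoring and uncertainty handling mentioned above can be sketched with a simple scikit-learn classifier. The TF-IDF/logistic-regression choice and the 0.6 review threshold are illustrative assumptions, not the project's final model.

# Illustrative classifier with confidence scoring and uncertainty flagging.
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import make_pipeline

def train_classifier(train_texts, train_labels):
    clf = make_pipeline(TfidfVectorizer(), LogisticRegression(max_iter=1000))
    clf.fit(train_texts, train_labels)
    return clf

def classify_with_confidence(clf, texts, review_threshold=0.6):
    probs = clf.predict_proba(texts)
    results = []
    for row in probs:
        best = row.argmax()
        results.append({
            "label": clf.classes_[best],
            "confidence": float(row[best]),
            "needs_review": float(row[best]) < review_threshold,  # flag ambiguous documents
        })
    return results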
5. Data Storage Module
Expanded Purpose:
Supports efficient querying, retrieval, and long-term management of processed and categorized
data while ensuring scalability and security.
Additional Functions:
• Metadata Management: Stores metadata (e.g., ingestion source, processing time) for
document traceability.
• Data Partitioning: Optimizes large-scale storage with partitioning and sharding.
• Data Encryption: Ensures secure storage of sensitive or proprietary information.
• Access Control: Implements role-based access to different data layers.
Challenges:
• Balancing cost and performance for large datasets.
• Maintaining consistency across backups and active storage.
6. User Interface Module
Expanded Purpose:
Simplifies interaction with the system, allowing users to perform custom queries, visualize
results, and download reports.
Additional Functions:
• Search Interface: Enables keyword-based or semantic search across documents.
• Customization Options: Allows users to define custom categories or update seed
keywords directly.
• Interactive Dashboards: Provides insights into keyword extraction and categorization
performance using visual analytics tools (e.g., charts, heatmaps).
Challenges:
• Ensuring usability for non-technical users.
• Supporting multiple languages or user-specific customization.
7. API and Integration Module
Expanded Purpose:
Provides secure, scalable, and version-controlled API endpoints, allowing external systems to
seamlessly integrate with the platform.
Additional Functions:
• Authentication and Authorization: Secures endpoints using OAuth, JWT, or API keys.
• Rate Limiting: Prevents abuse by limiting API calls per user or application.
• API Documentation: Generates comprehensive documentation (e.g., using Swagger).
• Webhook Support: Notifies external systems about categorization results in real-time.
Challenges:
• Ensuring high availability and scalability under load.
• Handling diverse API consumers with varying levels of technical expertise.
8. Monitoring and Feedback Module
Expanded Purpose:
Improves system performance and classification accuracy through proactive monitoring,
logging, and iterative updates based on user and system feedback.
Additional Functions:
• Anomaly Detection: Identifies unexpected behaviors in the system, such as performance
drops.
• Continuous Learning: Uses feedback to retrain machine learning models and update
keyword lists dynamically.
• Custom Alerts: Notifies administrators of issues like data ingestion failures or
performance anomalies.
• A/B Testing: Tests different configurations or models for improved performance.
Challenges:
• Capturing meaningful feedback from users.
• Ensuring feedback loops do not introduce bias into the system.
General Enhancements:
• Scalability: Ensure all modules can handle increasing data and user loads.
• Explainability: Provide clear explanations for keyword extraction and categorization
decisions to enhance user trust.
• Compliance: Incorporate legal and regulatory considerations for data processing and
storage.
➢ Testing
1. Unit Testing
1.1 Purpose
To validate that each function or module operates as intended, ensuring baseline reliability before integrating them into larger workflows. A minimal pytest sketch follows the test list below.
1.2 Tests
1. Data Ingestion
o Test API connections to ensure proper retrieval of data from external sources like business registries or file uploads.
o Validate file format handling (e.g., CSV, JSON, XML, and text files).
o Simulate faulty inputs (e.g., incomplete or corrupted files) to check error handling.
2. Data Preprocessing
o Text Cleaning: Test if unwanted characters, HTML tags, and special symbols are consistently removed.
o Normalization: Verify lowercase conversion and stemming/lemmatization for consistency.
o Tokenization: Confirm text splits correctly into words or phrases while maintaining semantic integrity.
3. Keyword Extraction
o Test individual functions for embedding generation, cosine similarity calculation, and filtering.
o Validate that extracted keywords meet predefined class-specific criteria.
4. Document Categorization
o Simulate inputs with varying complexity to ensure the logic accurately assigns documents to the correct categories.
o Test fallback mechanisms for edge cases (e.g., documents with insufficient relevant keywords).
5. Database Operations
o Validate CRUD (Create, Read, Update, Delete) operations for data storage.
o Ensure proper indexing and retrieval times, even for large datasets.
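The pytest sketch referenced above is shown here. The pipeline module and the preprocess()/score_candidates() helpers are hypothetical names (for example, the helpers sketched earlier in this report); all identifiers are illustrative.

# Minimal pytest sketch; module and helper names are hypothetical.
import pytest
from pipeline import preprocess, score_candidates

def test_preprocess_removes_noise():
    cleaned = preprocess("The <b>Company</b> sells 500 Software-Products!!!")
    assert "<b>" not in cleaned and "500" not in cleaned
    assert "software" in cleaned

def test_scoring_filters_irrelevant_terms():
    scored = score_candidates(
        candidates=["software", "banana"],
        seed_keywords=["software development", "it services"],
        threshold=0.5,
    )
    kept = [kw for kw, _, _ in scored]
    assert "software" in kept and "banana" not in kept

def test_faulty_input_is_rejected():
    with pytest.raises(Exception):
        preprocess(None)   # corrupted/empty input should fail loudly or be caught upstream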
2. Integration Testing
2.1 Purpose
To validate smooth interactions between interdependent modules and identify interface-level
issues.
2.2 Tests
1. Data Pipeline Flow
o Test the complete flow from data ingestion through preprocessing, keyword extraction, and categorization.
o Simulate real-world scenarios, such as batch processing of large datasets.
3. Functional Testing
Functional testing validates that the core features align with the system's requirements.
3.1 Purpose
To ensure the system performs its intended tasks accurately and reliably.
3.2 Tests
1. Keyword Extraction Accuracy
o Validate the relevance and precision of extracted keywords for each predefined class.
o Compare performance against baseline algorithms (RAKE, YAKE, and standard KEYBERT).
2. Document Classification
o Test diverse document types to ensure accurate categorization across multiple predefined categories.
o Measure accuracy by comparing predicted categories with ground truth.
3. Iterative Keyword Expansion
o Verify that iterative refinement improves keyword quality without introducing irrelevant terms.
o Test multiple iteration scenarios to confirm stability and convergence.
4. Precision and Recall Metrics
o Calculate precision, recall, and F1 scores for each category.
o Ensure metrics meet project benchmarks (e.g., >85% cosine similarity for extracted keywords).
4. Performance Testing
Performance testing evaluates the system’s ability to operate efficiently under varying
conditions.
4.1 Purpose
4.2 Tests
1. Execution Speed
o Measure the time taken for each processing stage (e.g., ingestion, preprocessing, extraction, classification).
o Benchmark end-to-end processing for small (10 documents) and large datasets (10,000+ documents).
2. Scalability
o Gradually increase dataset sizes to evaluate performance.
o Identify bottlenecks and test optimizations, such as parallel processing.
3. Memory Usage
o Monitor memory consumption, especially during intensive tasks like iterative keyword refinement.
o Ensure the system stays within acceptable limits for mid-range hardware.
5. Load Testing
Load testing simulates high-demand scenarios to evaluate system stability.
5.1 Purpose
To assess system behavior under peak loads.
5.2 Tests
1. Simultaneous Document Processing
o Test multiple concurrent uploads to ensure queueing and parallel processing mechanisms function effectively.
2. API Load
o Simulate heavy API usage, such as thousands of requests per minute, to evaluate response times and stability.
o Monitor for timeouts and latency issues.
6. User Interface (UI) Testing
6.1 Purpose
To validate that users can interact with the system effortlessly.
6.2 Tests
1. Functionality
o Verify upload, search, and feedback submission features.
o Test edge cases, such as unsupported file formats or invalid inputs.
2. Usability
o Conduct heuristic evaluations to ensure the UI is intuitive.
o Test navigation flows and accessibility features (e.g., screen reader support).
3. Compatibility
o Test across different devices (desktop, mobile, tablet) and browsers (Chrome, Firefox, Safari).
7. Security Testing
Security testing ensures the system is protected against unauthorized access and vulnerabilities.
7.1 Purpose
To safeguard sensitive data and maintain system integrity.
7.2 Tests
1. Data Security
o Test encryption mechanisms for stored and transmitted data.
2. Access Control
o Validate user authentication and role-based access control.
3. Injection Testing
o Simulate SQL injection, XSS, and other attack vectors to ensure the system is secure.
8. Regression Testing
8.1 Purpose
To verify stability after updates or enhancements.
8.2 Tests
1. Re-run Core Tests
o Execute all unit and functional tests post-update to confirm no unintended effects.
2. Backward Compatibility
o Ensure legacy data and workflows remain functional after system updates.
9. User Acceptance Testing (UAT)
UAT validates the system's readiness from an end-user perspective.
9.1 Purpose
To confirm that the system meets user expectations and business requirements.
9.2 Tests
1. Feedback Collection
o Involve users in testing scenarios to gather insights on usability, functionality, and performance.
2. Acceptance Criteria
o Validate that all initial project requirements are met.
o Ensure precision metrics align with user expectations (e.g., >90% classification accuracy for specific cases).
6. PROJECT DEMONSTRATION
Upload the File
Once the Streamlit application interface opens, you'll see a section to Upload File.
Use the Drag and Drop feature or Browse Files button to upload a .doc or .docx file.
Ensure the file size does not exceed the 200MB limit.
Process the Uploaded File
After uploading, the content of the file will be displayed in a text editor format.
The application will automatically preprocess the text using methods like:
Lowercasing the text.
Removing stopwords, special characters, and numbers.
Extracting keywords using TextRank, TF-IDF, and RAKE.
Select File Export Format
Below the text editor, choose the desired export format:
PDF
DOC
Accept the terms and conditions (checkbox) before proceeding.
Export Processed Data
Click Export as PDF (or the corresponding button for DOC) to download the file.
The generated file will contain the extracted keywords along with their rankings.
7. RESULT AND DISCUSSION
The study develops and evaluates an innovative methodology for class-specific keyword
extraction, refining the capabilities of existing algorithms such as KEYBERT to achieve
higher precision and scalability. It demonstrates practical applications in structured
datasets, such as the German business registry, offering a robust, adaptable solution for
document categorization tasks.
1. Results
Key Outcomes
1. Performance Improvements
o The enhanced methodology demonstrates superior precision metrics,
surpassing traditional approaches:
▪ Precision@10: Achieved up to 28.10%, significantly outperforming
guided KEYBERT’s 2.38%.
▪ Cosine Similarity Scores: Maintained an average match rate of
85% or higher, indicating strong relevance of extracted keywords.
o Demonstrated proficiency in reducing irrelevant terms and enhancing the
specificity of class-relevant keywords.
2. Application in Business Domains
o The method excels in classifying documents, particularly within the
economic sectors of the German business registry.
o It aligns extracted keywords with WZ 2008 classification scheme
categories, ensuring relevance and improving usability for downstream
applications.
Experimental Dataset
• Data Source:
o The German Handelsregister (Business Registry) provided a dataset of
10,000 entries containing structured and semi-structured records.
• Evaluation Benchmark:
o Predefined categories were derived from the WZ 2008 classification
scheme, ensuring a well-established taxonomy for validation.
• Annotation Process:
o Manual and automated comparisons were employed to assess keyword
relevance and categorization accuracy, providing a robust ground truth.
Methodology Highlights
1. Iterative Refinement
o Introduced an iterative process that uses seed keyword embeddings to
refine keyword lists progressively.
o Keywords are scored based on average and maximum cosine similarity with seed embeddings, filtering out irrelevant terms and emphasizing class-specific relevance (a schematic sketch of this loop follows these highlights).
2. Focus on Domain-Specific Precision
o The scoring mechanism targets economic sector-specific terms, drastically
reducing noise and improving document classification accuracy.
o Ensures that extracted terms are highly relevant to the target categories,
improving downstream applications' performance.
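A schematic sketch of the iterative refinement loop described above is given below. It reuses the hypothetical score_candidates() helper from the earlier scoring sketch; the iteration count, threshold, and expansion size are illustrative parameters rather than the tuned values used in the evaluation.

# Schematic sketch of the iterative seed expansion; parameter values are illustrative.
def iterative_refinement(candidates, seed_keywords, iterations=3,
                         threshold=0.6, expand_per_round=10):
    seeds = list(seed_keywords)
    for _ in range(iterations):
        scored = score_candidates(candidates, seeds, threshold=threshold)
        # Promote the highest-scoring new keywords into the seed set.
        new_seeds = [kw for kw, avg_sim, max_sim in scored if kw not in seeds][:expand_per_round]
        if not new_seeds:
            break          # converged: no further class-specific terms found
        seeds.extend(new_seeds)
    return seeds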
2. Discussion
The proposed approach sets a benchmark in class-specific keyword extraction, addressing
longstanding challenges in targeting domain-relevant terms for structured datasets. It
balances innovative design with practical implementation, showcasing its adaptability to
real-world scenarios.
Advantages
1. High Precision and Relevance
o The iterative methodology refines seed keywords, ensuring greater
specificity and relevance to predefined classes.
o This precision translates directly to improved document categorization and
downstream task performance.
2. Scalability
o Designed to handle large datasets efficiently, the system scales well for
enterprise applications involving thousands to millions of documents.
o Mid-range computational requirements make the solution feasible for small-
to-medium enterprises (SMEs) as well as large corporations.
3. Generalizability
o Though tested primarily on the German business registry, the approach can
be adapted to other languages and domains with minimal customization.
o The reliance on language-agnostic embeddings like BERT ensures broader
applicability.
Challenges and Limitations
1. Multi-Word Keyphrase Extraction
o The current focus on unigrams limits the methodology’s ability to capture
nuanced multi-word expressions (e.g., “supply chain management” vs.
“supply,” “chain,” and “management”).
2. Language Dependency
o While the method performed well on German datasets, further validation
across diverse languages and datasets is required to confirm
generalizability.
3. Iterative Complexity
o The iterative refinement process, while effective, introduces additional
computational overhead, especially with large datasets and extensive seed
keyword lists.
Future Directions
1. Multi-Word Keyphrase Extraction
o Integrate multi-word extraction techniques such as noun phrase chunking
or collocation detection to broaden the applicability of the method.
2. Cross-Language Adaptation
o Validate the approach on multilingual datasets using embeddings like M-
BERT or XLM-R, ensuring global applicability.
3. Real-World Deployment
o Extend testing to real-world scenarios such as automated document
categorization systems in sectors like healthcare, legal, or e-commerce.
4. Parameter Optimization
o Study the impact of key parameters (e.g., the number of iterations, similarity
thresholds) to balance computational efficiency and performance.
3. Feasibility Analysis
The project’s technical, economic, and social viability is explored, demonstrating its
practicality for real-world implementation.
Technical Feasibility
1. Hardware Requirements
o Mid-range systems: Multi-core processors with 16–32 GB RAM suffice for
most datasets.
o Optional GPU Acceleration: Enhances performance for large-scale
datasets but is not mandatory.
2. Software Requirements
o Open-source libraries like KEYBERT, Scikit-learn, and Hugging Face
Transformers minimize software costs.
o Integration with scalable data processing frameworks (e.g., Apache Spark)
allows for efficient handling of massive datasets.
Economic Feasibility
1. Development Costs
o Moderate investment in personnel (e.g., NLP specialists) and computational
resources.
o Cloud-based solutions (e.g., AWS, Google Cloud) provide scalable
alternatives with predictable costs.
2. Maintenance Costs
o Ongoing efforts include updating models, refining seed keywords, and
occasional retraining.
o Costs are minimized by relying on open-source tools and cloud storage.
3. Return on Investment (ROI)
o Cost Savings: Automation reduces manual efforts in document
categorization, saving significant labor costs.
o Efficiency Gains: Improved accuracy reduces errors, enhances decision-
making, and optimizes business workflows.
Social Feasibility
1. Ethical Considerations
o The use of publicly available data (e.g., German business registry)
minimizes ethical concerns.
o Aligns with privacy regulations, ensuring compliance in sensitive data
handling.
2. Industry Alignment
o Addresses critical industry needs for precise, scalable document
classification.
o Likely to gain acceptance across sectors due to its adaptability and cost-
effectiveness.
Conclusion and Final Remarks
The methodology represents a significant advancement in class-specific keyword
extraction, offering a powerful tool for document categorization and related tasks. Its
innovative scoring mechanism and iterative refinement process establish a new
standard in the field.
By balancing precision, scalability, and feasibility, this approach is well-suited for real-
world deployment, enabling industries to process and categorize unstructured data with
unprecedented accuracy and efficiency. Future enhancements, such as multi-word phrase
extraction and multilingual adaptation, promise to expand its applicability, cementing its
role as a cornerstone of modern NLP-driven solutions.
8. CONCLUSION
1. Key Contributions
2. Evaluation Outcomes
Performance Metrics
• Precision Gains: Outperforms traditional methods (e.g., RAKE, YAKE, standard
KEYBERT) by a significant margin, particularly in class-specific keyword
relevance.
• Recall and F1 Scores: Maintains high recall and F1 scores, demonstrating balanced
performance in capturing relevant keywords without introducing excessive noise.
3. Implementation Feasibility
Technical Feasibility
• Relies on widely available open-source NLP tools (e.g., spaCy, Hugging Face
Transformers, Scikit-learn) and libraries for similarity scoring (e.g., cosine similarity
using embeddings like Word2Vec or BERT).
• Can be executed on mid-range computational resources, making it accessible to
small and medium-sized enterprises without high-performance computing
infrastructure.
Economic Viability
• Offers a cost-efficient alternative to manual keyword extraction and classification,
reducing operational overhead while improving accuracy.
• Potential for high ROI through automation, freeing human resources for higher-
value tasks.
4. Applications and Broader Impact
Primary Applications
• Business Document Management: Streamlines the organization and retrieval of
large volumes of corporate or registry documents.
• Research and Academic Classification: Enhances categorization in bibliographic
databases, aiding literature reviews and citation analysis.
• Legal and Compliance: Assists in sorting regulatory filings, contracts, and
compliance reports.
• Data Organization in Knowledge Management Systems: Automates structuring
and tagging for better access and retrieval.
Broader Impact
• Reduces human intervention in data categorization, saving significant time and
costs.
• Enhances decision-making by ensuring quick access to class-relevant, structured
information.
• Paves the way for improved AI-driven workflows, particularly in industries relying
on large volumes of unstructured data.
5. Limitations and Future Directions
Current Limitations
• Restricted to unigram extraction, limiting the scope of certain domain-specific or
nuanced phrases.
• Primarily validated on German-language datasets, with potential challenges in
applying the method to other languages or domains.
Future Directions
1. Extension to Multi-Word Phrases:
o Incorporate phrase detection techniques (e.g., noun phrase chunking, co-
occurrence analysis) to extract meaningful multi-word keyphrases.
2. Cross-Language Adaptability:
o Validate and fine-tune the method for multilingual datasets by leveraging
multilingual embeddings (e.g., M-BERT, XLM-R).
3. Optimization of Iterative Refinement:
o Investigate the optimal number of iterations and parameters (e.g., similarity
thresholds) to balance precision and computational efficiency.
4. Real-World Deployment:
o Test the methodology in diverse real-world scenarios such as e-commerce
(product categorization) or healthcare (medical record classification).
Final Remark
9. REFERENCES
▪ Meisenbacher, S., Schopf, T., Yan, W., Holl, P., & Matthes, F. An Improved Method for Class-specific Keyword Extraction: A Case Study in the German Business Registry. Technical University of Munich and Fusionbase GmbH, 2024. GitHub repository for code.
▪ Campos, R., Mangaravite, V., Pasquali, A., Jorge, A., Nunes, C., & Jatowt, A. YAKE! Keyword extraction from single documents using multiple local features. Information Sciences, 509, 257–289, 2020.
▪ Grootendorst, M. KeyBERT: Minimal keyword extraction with BERT. GitHub repository: KeyBERT, 2023.
▪ Shi, W., Zheng, W., Yu, J. X., Cheng, H., & Zou, L. Keyphrase extraction using knowledge graphs. Data Science and Engineering, 2, 275–288, 2017.
APPENDIX A – SAMPLE CODE
main.py:-
# Excerpt from the Streamlit front end. Helper functions such as process_data()
# and run_editor(), together with the logger/file setup, are defined elsewhere
# in the project and are omitted from this listing.
from nbformat import read
import streamlit as st
from io import StringIO
import docx2txt
from logger import Logger
from PyPDF2 import PdfFileReader
import os
import time
from streamlit_quill import st_quill
from process import text_process, text_to_pdf, text_doc
from docx import Document


def get_doc(uploaded_file):
    # Read the uploaded document and push its text into the session state.
    if uploaded_file is not None:
        if st.button("proceed"):
            str_data = process_data(uploaded_file)
            if str_data:
                st.subheader('Edit Data')
                st.session_state['str_value'] = str_data
                logger.log(file, "updated data to session from doc string")
                st.session_state['load_editor'] = True
                return str_data
            else:
                st.subheader('File corrupted, please upload another file')
                return str_data


# Fragment of run_editor() (enclosing definition truncated in the original listing):
#     st.session_state['str_value'] = content
#     logger.log(file, "returning editor new content")
#     return content


st.session_state['user_data'] = 0
st.session_state['load_state'] = True
boundary = "\n" * 4 + "=====Keywords======" + "\n" * 4

st.title("Keyword Extractor")
st.caption("Keyword extraction technique will sift through the whole set of "
           "data in minutes and obtain the words and phrases that best describe "
           "each subject. This way, you can easily identify which parts of the "
           "available data cover the subjects you are looking for while saving "
           "your teams many hours of manual processing.")
st.write("\n")
st.subheader("Upload File")

if str_data or st.session_state['load_editor']:
    data = run_editor(str_data)
    st.download_button(label="Export as PDF",
                       data=PDFbyte,
                       file_name="keywords.pdf",
                       mime='application/octet-stream')
else:
    text_doc(df, 'keywords')
    with open(os.path.join("keywords.doc"), "rb") as doc_file:
        docbyte = doc_file.read()
    st.download_button(label="Export as DOC",
                       data=docbyte,
                       file_name="keywords.doc",
                       mime='application/octet-stream')
process.py:-
# Excerpt from the processing back end. The logger/file objects are created
# elsewhere in the project, and the TF-IDF and RAKE helper definitions are
# truncated in the original listing; the reconstructed lines are marked below.
from cmath import log
import spacy
import re
import string
import textwrap
from fpdf import FPDF
from logger import Logger
import os
import base64
import streamlit as st
from docx import Document
from sklearn.feature_extraction.text import TfidfVectorizer
import nltk
from nltk.corpus import stopwords  # stopword lists for filtering
from nltk.stem import WordNetLemmatizer
import en_core_web_sm
from nltk.corpus import wordnet as wn
import pandas as pd
from rake_nltk import Rake
import pytextrank


def preprocessing(text):
    logger.log(file, "starting preprocessing")
    # Lowercase the text and split it into tokens.
    text = text.lower()
    text = text.split()
    lis = []  # custom filter words (initialisation reconstructed)
    # Load the stopword list, downloading it on first use
    # (the try-block body is truncated in the original listing and reconstructed here).
    try:
        stop_words = stopwords.words('english')
        lis = set(lis + stop_words)
    except:
        logger.log(file, "load failed, downloading stopwords from nltk")
        nltk.download('stopwords')
        stop_words = stopwords.words('english')
        lis = set(lis + stop_words)
    finally:
        lis = list(lis) + ['hi', 'im']
    # Load the WordNet lemmatizer, downloading its resources if needed.
    try:
        logger.log(file, "trying to load the WordNet lemmatizer")
        lem = WordNetLemmatizer()
        lem.lemmatize("testing")
    except:
        logger.log(file, "loading failed, downloading wordnet and omw-1.4")
        nltk.download('wordnet')
        nltk.download('omw-1.4')
        lem = WordNetLemmatizer()
    finally:
        logger.log(file, "lemmatizing words, preprocessing done")
        text_filtered = [lem.lemmatize(word) for word in text if word not in lis]
    return " ".join(text_filtered)


def text_process(text):
    # Preprocess the text and extract keywords with TextRank.
    text = preprocessing(text)
    data = textrank(text)
    logger.log(file, "text rank done")
    data = ", \n".join(str(d) for d in data)
    if data == "":
        data = "No keywords found"
    logger.log(file, "data cleaned and returned")
    return data


def tfidf(text):
    # TF-IDF ranking helper (header and vectorizer setup reconstructed; the
    # original listing is truncated here).
    vectorizer = TfidfVectorizer()
    vectors = vectorizer.fit_transform([text])
    feature_names = vectorizer.get_feature_names_out()
    dense = vectors.todense()
    denselist = dense.tolist()
    df = pd.DataFrame(denselist, columns=feature_names)
    df = df.transpose().reset_index()
    df.columns = ['words', 'value']
    df = df.sort_values('value', ascending=False)
    return df


def rake(text):
    # RAKE keyword helper (header and Rake() setup reconstructed; the original
    # listing is truncated here). Returns up to ten two-word phrases.
    r = Rake()
    r.extract_keywords_from_text(text)
    keywordList = []
    rankedList = r.get_ranked_phrases_with_scores()
    for keyword in rankedList:
        keyword_updated = keyword[1].split()
        keyword_updated_string = " ".join(keyword_updated[:2])
        keywordList.append(keyword_updated_string)
        if len(keywordList) > 9:
            break
    logger.log(file, "used rake, now returning")
    return keywordList


def textrank(text):
    logger.log(file, "spacy + textrank function starting")
    nlp = en_core_web_sm.load()
    nlp.add_pipe("textrank")
    doc = nlp(text)
    # Examine the top-ranked phrases in the document (short phrases only).
    return [phrase.text for phrase in doc._.phrases[:40] if len(phrase.text) < 30]