
COMM061

Token Classification Report

Table of Contents
1. Analyse and visualise the dataset
2. Experiments
2.1 Experiment 1
2.2 Experiment 2
2.3 Experiment 3
3. Testing for each experiment
3.1 Test on Modelling techniques (BERT vs RoBERTa)
3.2 Test on different optimizers
3.3 Test for experiment 3
4. Error Analysis
5. Evaluate / Outcomes / Results / Conclusion
6. Deployment Web Service
7. Building Monitoring Capabilities
8. Performance of the Web Service
9. References
1. Analyse and visualise the dataset

The dataset used in this project is PLOD-CW-25, which contains biomedical scientific text
annotated at the token level for abbreviation and long-form detection. Each token is labelled
using the BIO scheme:

• B-AC: beginning of an abbreviation
• B-LF: beginning of a long form
• I-LF: inside a long form
• O: not part of any entity

This dataset enables the training of models for token classification, where the goal is to
correctly label each token based on its role in an entity.

Label Distribution Analysis

To understand the structure of the dataset, we extracted all token labels from the train.conll
file and visualised their frequency.

Label Counts:

• O: 62,471
• I-LF: 9,525
• B-AC: 6,626
• B-LF: 3,923

Figure 1: Distribution of NER Tags in the Training Data
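The counts above were produced by tallying the final column of each line in the CoNLL file. A minimal sketch of this step, assuming the standard CoNLL layout where the NER tag is the last whitespace-separated field (the file path is an assumption):

    from collections import Counter

    # Count BIO tags in a CoNLL-style file where each non-empty line ends
    # with the NER tag (file path and column layout are assumptions).
    label_counts = Counter()
    with open("train.conll", encoding="utf-8") as f:
        for line in f:
            line = line.strip()
            if not line or line.startswith("-DOCSTART-"):
                continue  # skip sentence separators and document markers
            label_counts[line.split()[-1]] += 1

    for label, count in label_counts.most_common():
        print(f"{label}: {count}")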

This distribution reveals a clear label imbalance — the O tag dominates, making up over
70% of all tokens. Among the entity tags, I-LF appears more than B-LF, indicating that many
long forms span multiple tokens. Interestingly, abbreviations (B-AC) appear more frequently
than long-form beginnings (B-LF), suggesting that abbreviations are often reused across the
dataset.

Entity Diversity: Abbreviations and Long Forms

We further analysed the training data to count unique abbreviations and long forms.

• Total unique abbreviations: 1,223
• Total unique long forms: 1,832

This confirms that:

• Many abbreviations map to multiple long forms (context-dependent),
• Long forms tend to span 2–5 tokens (requiring sequence-aware models),
• Some long forms share partial overlaps or nested terminology, especially in biomedical contexts.

Domain-Specific Observations

As the data is sourced from scientific biomedical texts, the vocabulary includes specialised
and technical terms (e.g., “EEG”, “IL-6”, “angiotensin-converting enzyme”). This creates
challenges in:

• Correctly tagging unfamiliar abbreviations,
• Handling rare or ambiguous terms,
• Mapping abbreviations to domain-specific long forms with multiple token spans.

These characteristics are not typically seen in general-purpose NER datasets, which is why
transformer-based models like BERT and RoBERTa (pretrained on large corpora) were
selected — to help the model generalise across unknown medical terms and complex
sequences.

Impact on Experiment Design

This analysis directly influenced how we approached the modelling phase:

• Label imbalance encouraged us to use weighted evaluation metrics (F1-score).
• Multi-token entities justified using transformer models capable of handling long-range dependencies.
• Ambiguity and overlap led us to include data preprocessing like subword tokenization and POS tagging.

Overall, this dataset analysis helped guide both model selection and evaluation strategies, ensuring our experiments were grounded in the data's real-world structure and complexity.
2. Experiments
2.1 Experiment 1: Modelling Techniques
This experiment compares different modelling techniques; we chose the pre-trained Transformer architectures BERT and RoBERTa. The decision to use BERT and RoBERTa stems from their proven effectiveness in token classification tasks, particularly in Named Entity Recognition (NER) and other sequence labelling problems [1].
Comparing BERT and RoBERTa is a practical experiment because RoBERTa refines BERT's training methodology [2]: it is trained on more data for longer and uses dynamic masking, and it generally outperforms BERT on token classification tasks. Even so, performance may vary depending on the domain (medical, scientific, etc.).
We therefore treat BERT as the standard baseline model for the task. By comparing RoBERTa against BERT, we can verify whether RoBERTa adds value and improves performance on this dataset.
• Methodology
1. Data Preparation
The dataset was first formatted for token classification using the BIO scheme. Before training, the string-based NER tags were converted into numerical labels using the label list, ensuring the NER column contained integer labels. The input sequences were then tokenized, including subword tokenization, and aligned with their corresponding labels (a minimal sketch of this pipeline appears at the end of this section).

2. Model Selection and Initialisation


BERT and RoBERTa were selected for comparison in this setup. Both models are pretrained on large corpora. They were loaded from the Hugging Face Transformers library (bert-base-cased and roberta-base) and initialised using the AutoModelForTokenClassification class.

3. Tokenizer
Each model used its respective tokenizer from the Transformers library:
BERT used BertTokenizerFast with the bert-base-cased checkpoint.
RoBERTa used RobertaTokenizerFast with the roberta-base checkpoint.

4. Training
The models were fine-tuned using the Trainer API from Hugging Face with the following configuration:
Loss function: cross-entropy loss, which is suitable for multi-class token classification.
Optimizer: AdamW was used in both setups.
Learning rate: 5e-5, chosen after experimentation.
Batch size: 16.
Epochs: 3.
Evaluation: performed at the end of every epoch using precision, recall and F1-score, computed with the seqeval library.
5. Testing
Both models were evaluated on the test dataset and the metrics were calculated.
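A minimal sketch of the data preparation and fine-tuning described above (the label names follow the BIO scheme used in this report; dataset loading, column names such as "tokens" and "ner_tags", and the compute_metrics function are assumptions and are only indicated):

    from transformers import (AutoTokenizer, AutoModelForTokenClassification,
                              TrainingArguments, Trainer)

    label_list = ["B-AC", "B-LF", "I-LF", "O"]
    label2id = {label: i for i, label in enumerate(label_list)}

    checkpoint = "bert-base-cased"  # swapped for "roberta-base" in the RoBERTa run
    tokenizer = AutoTokenizer.from_pretrained(checkpoint)
    model = AutoModelForTokenClassification.from_pretrained(
        checkpoint, num_labels=len(label_list))

    def tokenize_and_align(example):
        # Tokenize pre-split words; the first subword of each word keeps the
        # word's label, while remaining subwords and special tokens get -100
        # so the cross-entropy loss ignores them.
        enc = tokenizer(example["tokens"], is_split_into_words=True, truncation=True)
        labels, prev_word = [], None
        for word_id in enc.word_ids():
            if word_id is None or word_id == prev_word:
                labels.append(-100)
            else:
                labels.append(label2id[example["ner_tags"][word_id]])
            prev_word = word_id
        enc["labels"] = labels
        return enc

    args = TrainingArguments(
        output_dir="experiment1",
        learning_rate=5e-5,
        per_device_train_batch_size=16,
        num_train_epochs=3,
        evaluation_strategy="epoch",
    )
    # trainer = Trainer(model=model, args=args, train_dataset=...,
    #                   eval_dataset=..., tokenizer=tokenizer,
    #                   compute_metrics=...)  # seqeval-based, see Section 3
    # trainer.train()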

2.2 Experiment 2: Experimenting with different optimizers


To improve our model's performance and training efficiency, we experimented with several advanced optimization techniques beyond the conventional Adam and SGD optimizers. Specifically, we tested GrokAdamW, Schedule-Free AdamW, and Adafactor. Each optimizer brings a distinct approach to gradient-based optimization, and our goal was to evaluate their impact on convergence behavior, generalization, and training stability. A sketch of how an alternative optimizer can be plugged into the training loop follows the descriptions below.

1. GrokAdamW

GrokAdamW is an enhancement of the standard AdamW optimizer designed to encourage better generalization and stability during training. Inspired by findings from grokking phenomena in neural networks, GrokAdamW incorporates adaptive techniques that improve learning dynamics, especially in scenarios where overfitting or delayed generalization occurs.

2. Schedule-Free AdamW

Schedule-Free AdamW modifies the traditional AdamW by removing the need for
manually defined learning rate schedules. It dynamically adjusts learning rates
based on the training progress, reducing the need for hyperparameter tuning and
simplifying the training pipeline.
3. Adafactor
Adafactor is a memory-efficient optimizer designed to reduce the memory
footprint of training large models. It uses a factored approximation of the second-
moment estimator, which allows it to scale better in environments with limited
hardware resources.
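As an illustration of how an alternative optimizer can be supplied to the Hugging Face Trainer, the sketch below uses Adafactor from the Transformers library via the Trainer's optimizers argument (the model and datasets are assumed to be prepared as in Experiment 1; GrokAdamW and Schedule-Free AdamW come from their own packages but plug in the same way):

    from transformers import Trainer, TrainingArguments
    from transformers.optimization import Adafactor

    # Build the optimizer explicitly; passing (optimizer, scheduler) to the
    # Trainer overrides its default AdamW. With relative_step=False a fixed
    # learning rate is used.
    optimizer = Adafactor(model.parameters(), lr=5e-5,
                          scale_parameter=False, relative_step=False,
                          warmup_init=False)

    trainer = Trainer(
        model=model,
        args=TrainingArguments(output_dir="experiment2-adafactor",
                               per_device_train_batch_size=16,
                               num_train_epochs=3),
        train_dataset=train_dataset,   # assumed prepared as in Experiment 1
        eval_dataset=val_dataset,      # assumed prepared as in Experiment 1
        optimizers=(optimizer, None),  # None lets the Trainer build a default scheduler
    )
    # trainer.train()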

2.3 Experiment 3: Using additional train/validation data from the optional dataset
This experiment looks at the impact of increasing dataset size using additional samples from PLODv2-filtered. A larger or more diverse dataset can often improve model generalization, especially in token classification tasks such as NER. Since the PLOD-CW-25 dataset is quite small, with only 2,000 training samples, we hypothesize that providing more labelled data would make its abbreviation and long-form detection more robust.

• Methodology
1. Sample size determination
We sampled 25% and 50% of the training and validation sets from the PLODv2-
filtered dataset, resulting in approximately 28,000 and 56,000 additional examples
respectively.
2. Deduplication
To ensure data quality, duplicates were removed based on identical token, POS, and NER tag combinations. After merging and deduplication, the final training set contained 56,075 unique samples for the 50% setup and 29,621 unique samples for the 25% setup.
3. Label Preprocessing
The NER tags were initially given in BIO (beginning, inside and outside of an entity) format with labels such as B-AC, B-LF, I-LF and O. These were mapped to integer ids, allowing the model to predict the output as it would for a classification problem; this mapping is implemented in the final layer of the model. The BIO tags were preserved so the model could still detect not just single words but phrases.
4. Tokenization and label alignment
The roberta-base tokenizer was used with add_prefix_space=True, which is recommended for RoBERTa-based models. Labels are assigned one per word in the original dataset, but since RoBERTa often splits a word into multiple subword tokens, the word-level labels must be aligned to the resulting tokens. The tokenizer provides this mapping via .word_ids(). To ignore special tokens during the training loss calculation, their label is set to -100, which Hugging Face and PyTorch treat as a "don't care" value.
5. Evaluation
The model achieved a test F1-score of 0.9003 with the 25% additional dataset and 0.9036 with the 50% additional dataset. A sketch of the sampling and deduplication steps is shown below.
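A rough sketch of the sampling and deduplication steps described above (the dataset identifiers, split names, and column names are assumptions and may differ from the ones actually used):

    from datasets import load_dataset, concatenate_datasets

    cw25 = load_dataset("surrey-nlp/PLOD-CW-25")          # assumed identifier
    plodv2 = load_dataset("surrey-nlp/PLODv2-filtered")   # assumed identifier

    # Take a 25% (or 50%) random sample of the larger corpus's training split.
    extra = plodv2["train"].shuffle(seed=42)
    extra = extra.select(range(int(0.25 * len(extra))))

    merged = concatenate_datasets([cw25["train"], extra])

    # Deduplicate on the (tokens, POS tags, NER tags) combination.
    seen = set()
    def is_unique(example):
        key = (tuple(example["tokens"]), tuple(example["pos_tags"]),
               tuple(example["ner_tags"]))
        if key in seen:
            return False
        seen.add(key)
        return True

    deduped = merged.filter(is_unique)
    print(len(deduped), "unique training samples")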
3. Testing for each experiment
For the experiments, we evaluated our models on the test dataset, and their performance was measured using various metrics. We used the seqeval library to evaluate the token classification tasks, with F1-score as our primary evaluation metric; a sketch of the metric computation is shown below.
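A minimal sketch of the seqeval-based metric computation used throughout this section (predictions and labels follow the Trainer convention, with -100 marking positions that are excluded from scoring):

    import numpy as np
    from seqeval.metrics import f1_score, precision_score, recall_score

    label_list = ["B-AC", "B-LF", "I-LF", "O"]

    def compute_metrics(eval_pred):
        logits, labels = eval_pred
        preds = np.argmax(logits, axis=-1)
        true_seqs, pred_seqs = [], []
        for pred_row, label_row in zip(preds, labels):
            # Keep only positions that carry a real label (skip -100).
            true_seqs.append([label_list[l] for l in label_row if l != -100])
            pred_seqs.append([label_list[p] for p, l in zip(pred_row, label_row)
                              if l != -100])
        return {
            "precision": precision_score(true_seqs, pred_seqs),
            "recall": recall_score(true_seqs, pred_seqs),
            "f1": f1_score(true_seqs, pred_seqs),
        }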

3.1 Test on Modelling techniques (BERT vs RoBERTa)


Model     F1-score   Precision   Recall   Accuracy
BERT      0.8219     0.7821      0.8661   0.9124
RoBERTa   0.8304     0.8893      0.8589   0.9247

RoBERTa outperforms BERT. The F1-score for RoBERTa is 0.8304, which is higher than BERT's 0.8219, indicating a better balance between precision and recall. RoBERTa's precision is higher than BERT's, which suggests it is better at minimizing false positives, whereas BERT's slightly higher recall comes at the cost of precision.
Entity   Precision   Recall   F1-score
AC       0.8318      0.901    0.8653
LF       0.8681      0.9082   0.8877

Confusion Matrix (BERT vs RoBERTa)

Token Classification Bar Chart (BERT vs RoBERTa)

The F1-score as the primary evaluation metric highlights RoBERTa’s superior performance,
and the confusion matrices reinforce this by showing better token classification accuracy.
RoBERTa demonstrates its ability to manage the trade-off between precision and recall more
effectively. The results from the confusion matrix confirm that RoBERTa handles token
classification tasks more reliably in distinguishing abbreviations and long forms. RoBERTa's
enhanced architecture and training provide a clear advantage for this task.

3.2 Test on different optimizers


To evaluate the real-world effectiveness of each optimizer, we tested the trained
models on a held-out test set and computed key performance metrics: accuracy,
precision, recall, and F1-score. These metrics provide insight into not only how
accurate the model is overall, but also how well it balances false positives and false
negatives, which is especially important in critical prediction tasks.

Optimizer             Accuracy   Precision   Recall   F1-Score
GrokAdamW             92.65%     82.83%      90.76%   86.61%
Schedule-Free AdamW   92.89%     82.70%      90.34%   86.35%
Adafactor             92.62%     81.50%      92.37%   86.60%

• Schedule-Free AdamW achieved the highest test accuracy (92.89%), showing strong generalization performance without requiring manual tuning of the learning rate schedule. Its self-adaptive learning dynamics simplify training while maintaining robustness across epochs.

• GrokAdamW recorded the highest precision (82.83%), meaning it was the most
conservative in its predictions—less prone to false positives. Combined with a
competitive F1-score, this makes it a strong choice for tasks where precision is
critical.

• Adafactor achieved the highest recall (92.37%), indicating that it was most
effective at identifying all relevant (positive) instances. This makes it valuable in
use cases like medical diagnostics or anomaly detection, where missing a positive
case could be costly.

• While their strengths differ slightly, all three optimizers maintained nearly
identical F1-scores (~86.6%), signifying a well-balanced trade-off between
precision and recall. This consistency confirms the reliability of each optimizer
across different decision-making aspects.

• From a computational perspective, Adafactor offers memory efficiency, Schedule-Free AdamW minimizes hyperparameter overhead, and GrokAdamW encourages smoother convergence with strong generalization.
The bar plot above provides a clear visual summary of the performance metrics across
optimizers. Notably:
• Adafactor’s recall spike is visible, with a slight dip in precision.
• Schedule-Free AdamW leads in overall accuracy but lags slightly on F1.
• GrokAdamW performs most consistently across all metrics.

3.3 Test for experiment 3

This experiment analyses the impact of increasing the training and validation data size
by integrating 25% and 50% of samples from the PLODv2-filtered dataset. The
objective was to evaluate whether more data helps the model to perform better token
classification.
Evaluation was conducted using the seqeval library, and performance was measured using F1-score as the primary metric. We tracked validation loss and overall scores using the .evaluate() method to understand learning behaviour during training. Confusion matrices were used for visual analysis.
Evaluation Summary (seqeval results)
Metric      25% additional data   50% additional data
F1-score    0.8906                0.9032
Precision   0.8752                0.8708
Recall      0.9067                0.9382

Evaluation Summary (entity-level F1)
Entity      25% additional data   50% additional data
AC          0.9003                0.9037
LF          0.8737                0.9025

While both models performed strongly, the model trained with the 50% additional dataset showed a clear improvement in recall, meaning it was better at detecting relevant tokens even in longer entity spans. This suggests that additional training data can improve a model's ability to detect complex, multi-token entities, which is essential in NER tasks.

Setup 1: PLOD-CW-25 + 25% PLODv2-filtered

Setup 2: PLOD-CW-25 + 50% PLODv2-filtered


In the 50% setup, there is a visible drop in the number of misclassifications for the B-LF and I-LF labels compared to the 25% setup. The most notable improvements are the reduction in false predictions of O for I-LF tokens and the increase in correct predictions for B-LF and I-LF. This suggests that the model better recognises multi-token entities with more training data. The 50% model demonstrates stronger generalization and reduced confusion across all entity classes. A sketch of how these confusion matrices can be computed is shown below.
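The token-level confusion matrices referred to above can be produced along these lines, assuming flat lists of true and predicted tag strings built as in the compute_metrics sketch of Section 3 (variable names here are placeholders):

    import matplotlib.pyplot as plt
    from sklearn.metrics import ConfusionMatrixDisplay, confusion_matrix

    labels = ["B-AC", "B-LF", "I-LF", "O"]

    # true_tags and pred_tags are flat lists of tag strings for every scored
    # token (i.e. the per-sentence lists from compute_metrics, flattened).
    cm = confusion_matrix(true_tags, pred_tags, labels=labels)
    ConfusionMatrixDisplay(cm, display_labels=labels).plot(cmap="Blues")
    plt.title("Token-level confusion matrix")
    plt.show()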

Conclusion
Augmenting the training and validation sets with 50% additional data led to a 1.3% gain in F1-score and an improvement in recall on the test set, especially for long-form entities. The model's capacity to generalize improves with access to a larger training set. This increase in performance makes the 50% configuration more suitable in our case.
4. Error Analysis

To understand our models’ limitations beyond just metrics, we manually reviewed a subset of
misclassifications made by the best-performing model — RoBERTa fine-tuned on PLOD-
CW-25 + 50% additional PLODv2-filtered data.

Our goal was to identify patterns in the mistakes and assess why they occurred, so we could
improve future modelling strategies.

Observed Error Patterns

1. Abbreviations labelled as ‘O’

Several rare or highly domain-specific abbreviations (e.g., "IL-10", "PPAR") were misclassified as O. These terms likely didn't appear frequently in training data, so the model didn't learn to recognise them as abbreviations.

2. Incomplete long-form spans

In some cases, long-form phrases like “angiotensin-converting enzyme inhibitor” were only
partially labelled — e.g., the model predicted B-LF I-LF O O instead of B-LF I-LF I-LF I-LF.
This suggests the model struggled with multi-token entities, especially when long forms
include uncommon or compound terms.

3. Confusion between B-AC and B-LF

Some tokens were mislabeled as the wrong entity type — for instance, “SNP” (a known
abbreviation) being predicted as B-LF. These errors may stem from overlapping token
patterns or inconsistent examples in the data.

Below is a real example from the test set that demonstrates how the model handles
biomedical entity tagging. The table shows the model’s predictions compared to the true
labels for each token in the sentence. The highlighted rows represent errors where the model
mislabelled parts of a long-form phrase.

Table 1: Token-level comparison of true vs predicted labels on a test sample

Token            True Label   Predicted Label   Match
intracellular    B-LF         B-LF              Yes
multiplication   I-LF         O                 No
(                O            O                 Yes
Icm              B-LF         B-LF              Yes
)                O            O                 Yes
defective        I-LF         I-LF              Yes
organelle        I-LF         I-LF              Yes
trafficking      I-LF         O                 No
Dot              B-AC         B-AC              Yes

Likely Causes of Errors

• Label imbalance: With O dominating the dataset, the model likely leaned toward
over-predicting this class.
• Tokenization effects: Subword splitting (e.g., "angiotensin" becoming [angi, ##otensin]) can break entity alignment and cause partial predictions (see the short example after this list).
• Low-frequency entities: Many long forms and abbreviations appeared very few
times, limiting the model’s ability to learn their patterns.
• Ambiguity in biomedical terms: Certain tokens can be interpreted as abbreviations
or long forms depending on context (e.g., "TNF" vs "tumour necrosis factor").
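For example, the subword splitting mentioned above can be inspected directly; the exact pieces depend on the tokenizer's vocabulary, so the behaviour described in the comment is only indicative:

    from transformers import AutoTokenizer

    tok = AutoTokenizer.from_pretrained("bert-base-cased")
    print(tok.tokenize("angiotensin-converting enzyme inhibitor"))
    # Each word is typically split into several subword pieces, so word-level
    # labels must be re-aligned to the first piece of each word.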

Suggestions for Improvement

• Use BioBERT or SciBERT, pretrained on biomedical corpora, to better handle scientific terms.
• Include more entity-balanced training data (especially long forms).
• Add post-processing rules to clean up broken long-form predictions.
• Use context windows around target tokens to enhance classification accuracy.

Conclusion

The model performed well overall, but the error patterns reveal a consistent weakness in
handling long, rare, or domain-specific sequences — especially long forms that span multiple
tokens. These findings highlight the need for domain-adapted models and data enrichment in
future work.
5. Evaluate / Outcomes / Results / Conclusion
The models developed in this project successfully fulfil their purpose of detecting abbreviations and long forms using token-level classification. Across all three experimental setups, the models achieved progressively higher accuracy and F1-scores, demonstrating that they can accurately classify entities regardless of whether the span is short or long. The best performing model was PLOD-CW-25 + 50% additional PLODv2-filtered data, indicating that the use of more domain-specific data significantly improved long-form entity detection.
In practical terms, a reliable F1-score for models in token classification tasks is above
0.85 when it comes to long-form extraction. All tested models met or exceeded this
benchmark. Long-form extraction is more complex due to multi-token dependencies,
but the 50% additional dataset shows notable gains in recall which is very critical, as
missing a single entity can result in missing vital information. The baseline BERT-
based model fell short on long-form recall, emphasising the importance of data size
and domain generalisation.
We did not directly evaluate LLMs such as GPT-3 or T5 for token classification. LLMs excel at zero-shot or few-shot tasks that involve generating text, summarization, or text classification when prompted properly, but from a computational perspective they are resource-intensive and require paid APIs or high-end GPUs. Fine-tuned transformers like BERT or RoBERTa, when adapted for token classification via the AutoModelForTokenClassification architecture, are well suited to assigning a label to each token in a sequence. BERT-like models use supervised learning with labelled datasets, which leads to deterministic outputs, and they are computationally more efficient than LLMs for this task.
The most accurate model from our experiments, trained on PLOD-CW-25 + 50% additional PLODv2-filtered data, was also the most effective. It balanced high recall with precision, making it suitable for real-world systems. In domains like healthcare or research, false negatives can be costly; therefore, prioritizing recall without sacrificing F1 makes this model valuable.
6. Deployment Web Service
To deploy the token classification system, a web application was developed using Streamlit. The system leverages our best model to perform token classification for the abbreviation detection task. The deployed service allows users to input any sentence and receive highlighted predictions identifying abbreviations and long forms in the sentence.
The application also allows easy integration of different classification models.
We chose Streamlit as our serving option because of its ability to create interactive web apps with little, uncomplicated code; it is well suited to small-scale models.
The system architecture involves:
1) User Interface
The user interface consists of a text field and a submit button, which allow users to enter a sentence and trigger the model predictions. The predictions are then displayed with each token highlighted by its predicted class using HTML/CSS styling. The interface also includes a hyperlink to the Google Sheet that stores the user inputs and predictions.

2) Model Inference
This part uses our best performing model saved on Hugging Face (link). The input text is split into words and tokenized, and the tokens are passed to the model, which outputs a prediction for each token indicating its predicted label. A minimal sketch of this inference step appears at the end of this section.

3) Logging and Monitoring
Every user interaction is logged to a centralized Google Sheet, which stores the timestamp, execution time, user input, and the predicted tokens and labels. This uses a Google service account with the gspread and oauth2client libraries.
4) Deployment
The application was initially tested locally and later deployed to Streamlit Cloud for public access.
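A minimal sketch of the inference step inside the Streamlit app (the model identifier below is a placeholder, not the report's actual Hugging Face link; the label-per-first-subword strategy mirrors the alignment used during training):

    import streamlit as st
    import torch
    from transformers import AutoModelForTokenClassification, AutoTokenizer

    MODEL_ID = "your-username/roberta-plod-token-classifier"  # placeholder identifier

    @st.cache_resource
    def load_model():
        tokenizer = AutoTokenizer.from_pretrained(MODEL_ID, add_prefix_space=True)
        model = AutoModelForTokenClassification.from_pretrained(MODEL_ID)
        return tokenizer, model

    tokenizer, model = load_model()

    text = st.text_area("Enter a sentence")
    if st.button("Predict") and text:
        words = text.split()
        enc = tokenizer(words, is_split_into_words=True,
                        return_tensors="pt", truncation=True)
        with torch.no_grad():
            logits = model(**enc).logits[0]
        preds = logits.argmax(-1).tolist()
        # Keep the prediction of the first subword of each word.
        labels, seen = [], set()
        for idx, word_id in enumerate(enc.word_ids()):
            if word_id is not None and word_id not in seen:
                seen.add(word_id)
                labels.append(model.config.id2label[preds[idx]])
        st.write(list(zip(words, labels)))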
7. Building Monitoring Capabilities
Our deployed app includes a monitoring and logging system, which uses the Google Sheets API to store user inputs and model predictions. The primary goal of this feature is to log each user interaction: its input, output, and execution time. The data stored in the Google Sheet can be downloaded and analysed programmatically.
Implementation Details
Each time the user enters a sentence and requests a prediction, the prediction is generated by the model and the associated data (user input, predictions, timestamp, and execution time) is appended to the Google Sheet. This system is implemented using gspread and oauth2client, which allow access to Google Sheets via a service account.
The logging logic is encapsulated in a save_into_sheets() function, which is called after every model prediction. This ensures consistent logging without affecting the main application logic. A sketch of what this helper might look like is shown below.
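A sketch of such a save_into_sheets() helper (the sheet name, credential file, and column order are assumptions):

    from datetime import datetime

    import gspread
    from oauth2client.service_account import ServiceAccountCredentials

    SCOPE = ["https://spreadsheets.google.com/feeds",
             "https://www.googleapis.com/auth/drive"]

    def save_into_sheets(user_input, predictions, exec_time):
        # Authenticate with the service account and append one log row.
        creds = ServiceAccountCredentials.from_json_keyfile_name(
            "service_account.json", SCOPE)  # credential file name is an assumption
        client = gspread.authorize(creds)
        sheet = client.open("token-classification-logs").sheet1  # sheet name is an assumption
        sheet.append_row([
            datetime.now().isoformat(),  # timestamp
            round(exec_time, 3),         # execution time in seconds
            user_input,                  # raw user input
            str(predictions),            # predicted (token, label) pairs
        ])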

Programmatic Accessibility

Google Sheets serves as both a logging interface and a lightweight database. The logs are structured, stored in real time, and can be:

• Downloaded for offline analysis,
• Programmatically accessed via the Sheets API,
• Filtered by content or prediction.
8. Performance of the Web Service

An analysis was conducted to evaluate the relationship between the execution time and the
length of the user input. Duplicate entries were removed to ensure that only the first
occurrence of each unique input was considered.

Summary Statistics:

• Average Input Length: 240.75 characters
• Average Execution Time: 0.337 seconds
• Max Input Length: 471 characters
• Max Execution Time: 1.159 seconds
• Correlation Coefficient: 0.27
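These statistics can be reproduced from the downloaded log with a short pandas script (the CSV file name and column names below are assumptions about how the sheet is exported):

    import pandas as pd

    df = pd.read_csv("prediction_logs.csv")                     # assumed export file
    df = df.drop_duplicates(subset="user_input", keep="first")  # keep first occurrence only

    df["input_length"] = df["user_input"].str.len()
    print("Average input length:", df["input_length"].mean())
    print("Average execution time:", df["execution_time"].mean())
    print("Correlation:", df["input_length"].corr(df["execution_time"]))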

Insight:

There is a noticeable trend indicating that longer input queries tend to result in longer
execution times. This is further supported by a positive correlation between input length
and execution time, as visualized in the scatter plot with a fitted trend line. Although the
correlation is not very strong, it suggests that input length is one of the contributing factors to
processing time.
9. References
[1] J. Devlin, M.-W. Chang, K. Lee, and K. Toutanova, "BERT: Pre-training of Deep Bidirectional Transformers for Language Understanding," in Proceedings of the 2019 Conference of the North American Chapter of the Association for Computational Linguistics (NAACL), vol. 1, 2019, doi: https://doi.org/10.18653/v1/n19-1423.
[2] Y. Liu et al., "RoBERTa: A Robustly Optimized BERT Pretraining Approach," arXiv.org, Jul. 26, 2019. https://arxiv.org/abs/1907.11692
