
COMM061

Token Classification Report

Table of Contents
1. Analyse and visualise the dataset
2. Experiments
2.1 Experiment 1
2.2 Experiment 2
2.3 Experiment 3
3. Testing for each experiment
3.1 Test on Modelling techniques (BERT vs RoBERTa)
3.2 Test on different optimizers
3.3 Test for experiment 3
4. Error Analysis
5. Evaluate / Outcomes / Results / Conclusion
6. Deployment Web Service
7. Building Monitoring Capabilities
8. Performance of the Web Service
9. References
1. Analyse and visualise the dataset

The dataset used in this project is PLOD-CW-25, which contains biomedical scientific text
annotated at the token level for abbreviation and long-form detection. Each token is labelled
using the BIO scheme:

• B-AC: beginning of an abbreviation
• B-LF: beginning of a long form
• I-LF: inside a long form
• O: not part of any entity

This dataset enables the training of models for token classification, where the goal is to
correctly label each token based on its role in an entity.

Label Distribution Analysis

To understand the structure of the dataset, we extracted all token labels from the train.conll
file and visualised their frequency.

Label Counts:

• O: 62,471
• I-LF: 9,525
• B-AC: 6,626
• B-LF: 3,923

Figure 1: Distribution of NER Tags in the Training Data
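The counts above were produced by tallying the final column of each line in the CoNLL file. A minimal sketch of this step, assuming the standard CoNLL layout where the NER tag is the last whitespace-separated field (the file path is an assumption):

    from collections import Counter

    # Count BIO tags in a CoNLL-style file where each non-empty line ends
    # with the NER tag (file path and column layout are assumptions).
    label_counts = Counter()
    with open("train.conll", encoding="utf-8") as f:
        for line in f:
            line = line.strip()
            if not line or line.startswith("-DOCSTART-"):
                continue  # skip sentence separators and document markers
            label_counts[line.split()[-1]] += 1

    for label, count in label_counts.most_common():
        print(f"{label}: {count}")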

This distribution reveals a clear label imbalance — the O tag dominates, making up over
70% of all tokens. Among the entity tags, I-LF appears more than B-LF, indicating that many
long forms span multiple tokens. Interestingly, abbreviations (B-AC) appear more frequently
than long-form beginnings (B-LF), suggesting that abbreviations are often reused across the
dataset.

Entity Diversity: Abbreviations and Long Forms

We further analysed the training data to count unique abbreviations and long forms.

• Total unique abbreviations: 1,223
• Total unique long forms: 1,832

This confirms that:

• Many abbreviations map to multiple long forms (context-dependent),
• Long forms tend to span 2–5 tokens (requiring sequence-aware models),
• Some long forms share partial overlaps or nested terminology, especially in biomedical contexts.

Domain-Specific Observations

As the data is sourced from scientific biomedical texts, the vocabulary includes specialised
and technical terms (e.g., “EEG”, “IL-6”, “angiotensin-converting enzyme”). This creates
challenges in:

• Correctly tagging unfamiliar abbreviations,
• Handling rare or ambiguous terms,
• Mapping abbreviations to domain-specific long forms with multiple token spans.

These characteristics are not typically seen in general-purpose NER datasets, which is why
transformer-based models like BERT and RoBERTa (pretrained on large corpora) were
selected — to help the model generalise across unknown medical terms and complex
sequences.

Impact on Experiment Design

This analysis directly influenced how we approached the modelling phase:

• Label imbalance encouraged us to use weighted evaluation metrics (F1-score).
• Multi-token entities justified using transformer models capable of handling long-range dependencies.
• Ambiguity and overlap led us to include data preprocessing like subword tokenization and POS tagging.

Overall, this dataset analysis helped guide both model selection and evaluation strategies, ensuring our experiments were grounded in the data's real-world structure and complexity.
2. Experiments
2.1 Experiment 1: Modelling Techniques
This experiment compares different modelling techniques; we chose the pre-trained Transformer architectures BERT and RoBERTa. The decision to use BERT and RoBERTa stems from their proven effectiveness in token classification tasks, particularly in Named Entity Recognition (NER) and other sequence labelling problems [1].
Comparing BERT and RoBERTa is a practical experiment because RoBERTa refines BERT's training methodology [2]: it is trained on more data for longer and uses dynamic masking, and it generally outperforms BERT on token classification tasks. Even so, performance may vary depending on the domain (medical, scientific, etc.).
We therefore treat BERT as the standard baseline model for the task. By comparing RoBERTa against BERT, we can verify whether RoBERTa adds value and improves performance on this dataset.
• Methodology
1. Data Preparation
The dataset was first formatted for token classification using the BIO scheme. Before training, the string-based NER tags were converted into numerical labels using the label list, ensuring the NER column contained integer labels. The input sequences were then tokenized, including subword tokenization, and aligned with their corresponding labels (a minimal sketch of this pipeline appears at the end of this section).

2. Model Selection and Initialisation


BERT and RoBERTa were selected for comparison in this setup. Both models are pretrained on large corpora. They were loaded from the Hugging Face Transformers library (bert-base-cased and roberta-base) and initialised using the AutoModelForTokenClassification class.

3. Tokenizer
Each model used its respective tokenizer from the Transformers library:
BERT used BertTokenizerFast with the bert-base-cased checkpoint.
RoBERTa used RobertaTokenizerFast with the roberta-base checkpoint.

4. Training
The models were fine-tuned using the Trainer API from Hugging Face with the following configuration:
Loss function: cross-entropy loss, which is suitable for multi-class token classification.
Optimizer: AdamW was used in both setups.
Learning rate: 5e-5, chosen after experimentation.
Batch size: 16.
Epochs: 3.
Evaluation: performed at the end of every epoch using precision, recall and F1-score, computed with the seqeval library.
5. Testing
Both models were evaluated on the test dataset and the metrics were calculated.
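A minimal sketch of the data preparation and fine-tuning described above (the label names follow the BIO scheme used in this report; dataset loading, column names such as "tokens" and "ner_tags", and the compute_metrics function are assumptions and are only indicated):

    from transformers import (AutoTokenizer, AutoModelForTokenClassification,
                              TrainingArguments, Trainer)

    label_list = ["B-AC", "B-LF", "I-LF", "O"]
    label2id = {label: i for i, label in enumerate(label_list)}

    checkpoint = "bert-base-cased"  # swapped for "roberta-base" in the RoBERTa run
    tokenizer = AutoTokenizer.from_pretrained(checkpoint)
    model = AutoModelForTokenClassification.from_pretrained(
        checkpoint, num_labels=len(label_list))

    def tokenize_and_align(example):
        # Tokenize pre-split words; the first subword of each word keeps the
        # word's label, while remaining subwords and special tokens get -100
        # so the cross-entropy loss ignores them.
        enc = tokenizer(example["tokens"], is_split_into_words=True, truncation=True)
        labels, prev_word = [], None
        for word_id in enc.word_ids():
            if word_id is None or word_id == prev_word:
                labels.append(-100)
            else:
                labels.append(label2id[example["ner_tags"][word_id]])
            prev_word = word_id
        enc["labels"] = labels
        return enc

    args = TrainingArguments(
        output_dir="experiment1",
        learning_rate=5e-5,
        per_device_train_batch_size=16,
        num_train_epochs=3,
        evaluation_strategy="epoch",
    )
    # trainer = Trainer(model=model, args=args, train_dataset=...,
    #                   eval_dataset=..., tokenizer=tokenizer,
    #                   compute_metrics=...)  # seqeval-based, see Section 3
    # trainer.train()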

2.2 Experiment 2: Experimenting with different optimizers


To improve our model's performance and training efficiency, we experimented with several advanced optimization techniques beyond the conventional Adam and SGD optimizers. Specifically, we tested GrokAdamW, Schedule-Free AdamW, and Adafactor. Each optimizer brings a distinct approach to gradient-based optimization, and our goal was to evaluate their impact on convergence behavior, generalization, and training stability. A sketch of how an alternative optimizer can be plugged into the training loop follows the descriptions below.

1. GrokAdamW

GrokAdamW is an enhancement of the standard AdamW optimizer designed to encourage better generalization and stability during training. Inspired by findings from grokking phenomena in neural networks, GrokAdamW incorporates adaptive techniques that improve learning dynamics, especially in scenarios where overfitting or delayed generalization occurs.

2. Schedule-Free AdamW

Schedule-Free AdamW modifies the traditional AdamW by removing the need for
manually defined learning rate schedules. It dynamically adjusts learning rates
based on the training progress, reducing the need for hyperparameter tuning and
simplifying the training pipeline.
3. Adafactor
Adafactor is a memory-efficient optimizer designed to reduce the memory
footprint of training large models. It uses a factored approximation of the second-
moment estimator, which allows it to scale better in environments with limited
hardware resources.
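As an illustration of how an alternative optimizer can be supplied to the Hugging Face Trainer, the sketch below uses Adafactor from the Transformers library via the Trainer's optimizers argument (the model and datasets are assumed to be prepared as in Experiment 1; GrokAdamW and Schedule-Free AdamW come from their own packages but plug in the same way):

    from transformers import Trainer, TrainingArguments
    from transformers.optimization import Adafactor

    # Build the optimizer explicitly; passing (optimizer, scheduler) to the
    # Trainer overrides its default AdamW. With relative_step=False a fixed
    # learning rate is used.
    optimizer = Adafactor(model.parameters(), lr=5e-5,
                          scale_parameter=False, relative_step=False,
                          warmup_init=False)

    trainer = Trainer(
        model=model,
        args=TrainingArguments(output_dir="experiment2-adafactor",
                               per_device_train_batch_size=16,
                               num_train_epochs=3),
        train_dataset=train_dataset,   # assumed prepared as in Experiment 1
        eval_dataset=val_dataset,      # assumed prepared as in Experiment 1
        optimizers=(optimizer, None),  # None lets the Trainer build a default scheduler
    )
    # trainer.train()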

2.3 Experiment 3: Using additional train/validation data from the optional dataset
This experiment looks at the impact of increasing dataset size using additional samples from PLODv2-filtered. A larger or more diverse dataset can often improve model generalization, especially in token classification tasks such as NER. Since the PLOD-CW-25 dataset is quite small, with only 2,000 training samples, we hypothesize that providing more labelled data would make its abbreviation and long-form detection more robust.

• Methodology
1. Sample size determination
We sampled 25% and 50% of the training and validation sets from the PLODv2-
filtered dataset, resulting in approximately 28,000 and 56,000 additional examples
respectively.
2. Deduplication
To ensure data quality, duplicates were removed based on identical token, POS, and NER tag combinations. After merging and deduplication, the final training set contained 56,075 unique samples for the 50% setup and 29,621 unique samples for the 25% setup.
3. Label Preprocessing
The NER tags were initially given in BIO (beginning, inside and outside of an entity) format with labels such as B-AC, B-LF, I-LF and O. These were mapped to integer ids, allowing the model to predict the output as it would for a classification problem; this mapping is implemented in the final layer of the model. The BIO tags were preserved so the model could still detect not just single words but phrases.
4. Tokenization and label alignment
The roberta-base tokenizer was used with add_prefix_space=True, which is recommended for RoBERTa-based models. Labels are assigned one per word in the original dataset, but since RoBERTa often splits a word into multiple subword tokens, the word-level labels must be aligned to the resulting tokens. The tokenizer provides this mapping via .word_ids(). To ignore special tokens during the training loss calculation, their label is set to -100, which Hugging Face and PyTorch treat as a "don't care" value.
5. Evaluation
The model achieved a test F1-score of 0.9003 with the 25% additional dataset and 0.9036 with the 50% additional dataset. A sketch of the sampling and deduplication steps is shown below.
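A rough sketch of the sampling and deduplication steps described above (the dataset identifiers, split names, and column names are assumptions and may differ from the ones actually used):

    from datasets import load_dataset, concatenate_datasets

    cw25 = load_dataset("surrey-nlp/PLOD-CW-25")          # assumed identifier
    plodv2 = load_dataset("surrey-nlp/PLODv2-filtered")   # assumed identifier

    # Take a 25% (or 50%) random sample of the larger corpus's training split.
    extra = plodv2["train"].shuffle(seed=42)
    extra = extra.select(range(int(0.25 * len(extra))))

    merged = concatenate_datasets([cw25["train"], extra])

    # Deduplicate on the (tokens, POS tags, NER tags) combination.
    seen = set()
    def is_unique(example):
        key = (tuple(example["tokens"]), tuple(example["pos_tags"]),
               tuple(example["ner_tags"]))
        if key in seen:
            return False
        seen.add(key)
        return True

    deduped = merged.filter(is_unique)
    print(len(deduped), "unique training samples")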
3. Testing for each experiment
For the experiments, we evaluated our models on the test dataset, and their performance was measured using various metrics. We used the seqeval library to evaluate the token classification tasks, with F1-score as our primary evaluation metric; a sketch of the metric computation is shown below.
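A minimal sketch of the seqeval-based metric computation used throughout this section (predictions and labels follow the Trainer convention, with -100 marking positions that are excluded from scoring):

    import numpy as np
    from seqeval.metrics import f1_score, precision_score, recall_score

    label_list = ["B-AC", "B-LF", "I-LF", "O"]

    def compute_metrics(eval_pred):
        logits, labels = eval_pred
        preds = np.argmax(logits, axis=-1)
        true_seqs, pred_seqs = [], []
        for pred_row, label_row in zip(preds, labels):
            # Keep only positions that carry a real label (skip -100).
            true_seqs.append([label_list[l] for l in label_row if l != -100])
            pred_seqs.append([label_list[p] for p, l in zip(pred_row, label_row)
                              if l != -100])
        return {
            "precision": precision_score(true_seqs, pred_seqs),
            "recall": recall_score(true_seqs, pred_seqs),
            "f1": f1_score(true_seqs, pred_seqs),
        }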

3.1 Test on Modelling techniques (BERT vs RoBERTa)


Model     F1-score   Precision   Recall   Accuracy
BERT      0.8219     0.7821      0.8661   0.9124
RoBERTa   0.8304     0.8893      0.8589   0.9247

RoBERTa outperforms BERT. The F1-score for RoBERTa is 0.8304, which is higher than BERT's 0.8219, indicating a better balance between precision and recall. RoBERTa's precision is higher than BERT's, which suggests it is better at minimizing false positives, whereas BERT's slightly higher recall comes at the cost of precision.
Entity   Precision   Recall   F1-score
AC       0.8318      0.901    0.8653
LF       0.8681      0.9082   0.8877

Confusion Matrix (BERT vs RoBERTa)

Token Classification Bar Chart (BERT vs RoBERTa)

The F1-score as the primary evaluation metric highlights RoBERTa’s superior performance,
and the confusion matrices reinforce this by showing better token classification accuracy.
RoBERTa demonstrates its ability to manage the trade-off between precision and recall more
effectively. The results from the confusion matrix confirm that RoBERTa handles token
classification tasks more reliably in distinguishing abbreviations and long forms. RoBERTa's
enhanced architecture and training provide a clear advantage for this task.

3.2 Test on different optimizers


To evaluate the real-world effectiveness of each optimizer, we tested the trained
models on a held-out test set and computed key performance metrics: accuracy,
precision, recall, and F1-score. These metrics provide insight into not only how
accurate the model is overall, but also how well it balances false positives and false
negatives, which is especially important in critical prediction tasks.

Optimizer             Accuracy   Precision   Recall   F1-Score
GrokAdamW             92.65%     82.83%      90.76%   86.61%
Schedule-Free AdamW   92.89%     82.70%      90.34%   86.35%
Adafactor             92.62%     81.50%      92.37%   86.60%

• Schedule-Free AdamW achieved the highest test accuracy (92.89%), showing strong generalization performance without requiring manual tuning of the learning rate schedule. Its self-adaptive learning dynamics simplify training while maintaining robustness across epochs.

• GrokAdamW recorded the highest precision (82.83%), meaning it was the most
conservative in its predictions—less prone to false positives. Combined with a
competitive F1-score, this makes it a strong choice for tasks where precision is
critical.

• Adafactor achieved the highest recall (92.37%), indicating that it was most
effective at identifying all relevant (positive) instances. This makes it valuable in
use cases like medical diagnostics or anomaly detection, where missing a positive
case could be costly.

• While their strengths differ slightly, all three optimizers maintained nearly
identical F1-scores (~86.6%), signifying a well-balanced trade-off between
precision and recall. This consistency confirms the reliability of each optimizer
across different decision-making aspects.

• From a computational perspective, Adafactor offers memory efficiency, Schedule-Free AdamW minimizes hyperparameter overhead, and GrokAdamW encourages smoother convergence with strong generalization.
The bar plot above provides a clear visual summary of the performance metrics across
optimizers. Notably:
• Adafactor’s recall spike is visible, with a slight dip in precision.
• Schedule-Free AdamW leads in overall accuracy but lags slightly on F1.
• GrokAdamW performs most consistently across all metrics.

3.3 Test for experiment 3

This experiment analyses the impact of increasing the training and validation data size
by integrating 25% and 50% of samples from the PLODv2-filtered dataset. The
objective was to evaluate whether more data helps the model to perform better token
classification.
Evaluation was conducted using the seqeval library, and performance was measured using F1-score as the primary metric. We tracked validation loss and overall scores using the .evaluate() method to understand learning behaviour during training. Confusion matrices were used for visual analysis.
Evaluation Summary (seqeval results)
Metric      25% additional data   50% additional data
F1-score    0.8906                0.9032
Precision   0.8752                0.8708
Recall      0.9067                0.9382

Evaluation Summary (entity-level F1)
Entity      25% additional data   50% additional data
AC          0.9003                0.9037
LF          0.8737                0.9025

While both models performed strongly, the model trained with the 50% additional dataset showed a clear improvement in recall, meaning it was better at detecting relevant tokens even in longer entity spans. This suggests that additional training data can improve a model's ability to detect complex, multi-token entities, which is essential in NER tasks.

Setup 1: PLOD-CW-25 + 25% PLODv2-filtered

Setup 2: PLOD-CW-25 + 50% PLODv2-filtered


In the 50% setup, there is a visible drop in the number of misclassifications for the B-LF and I-LF labels compared to the 25% setup. The most notable improvements are the reduction in false predictions of O for I-LF tokens and the increase in correct predictions for B-LF and I-LF. This suggests that the model better recognises multi-token entities with more training data. The 50% model demonstrates stronger generalization and reduced confusion across all entity classes. A sketch of how these confusion matrices can be computed is shown below.
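The token-level confusion matrices referred to above can be produced along these lines, assuming flat lists of true and predicted tag strings built as in the compute_metrics sketch of Section 3 (variable names here are placeholders):

    import matplotlib.pyplot as plt
    from sklearn.metrics import ConfusionMatrixDisplay, confusion_matrix

    labels = ["B-AC", "B-LF", "I-LF", "O"]

    # true_tags and pred_tags are flat lists of tag strings for every scored
    # token (i.e. the per-sentence lists from compute_metrics, flattened).
    cm = confusion_matrix(true_tags, pred_tags, labels=labels)
    ConfusionMatrixDisplay(cm, display_labels=labels).plot(cmap="Blues")
    plt.title("Token-level confusion matrix")
    plt.show()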

Conclusion
Augmenting the training and validation sets with 50% additional data led to a 1.3% gain in F1-score and an improvement in recall on the test set, especially for long-form entities. The model's capacity to generalize improves with access to a larger training set. This increase in performance makes the 50% configuration more suitable in our case.
4. Error Analysis

To understand our models’ limitations beyond just metrics, we manually reviewed a subset of
misclassifications made by the best-performing model — RoBERTa fine-tuned on PLOD-
CW-25 + 50% additional PLODv2-filtered data.

Our goal was to identify patterns in the mistakes and assess why they occurred, so we could
improve future modelling strategies.

Observed Error Patterns

1. Abbreviations labelled as ‘O’

Several rare or highly domain-specific abbreviations (e.g., "IL-10", "PPAR") were misclassified as O. These terms likely didn't appear frequently in training data, so the model didn't learn to recognise them as abbreviations.

2. Incomplete long-form spans

In some cases, long-form phrases like “angiotensin-converting enzyme inhibitor” were only
partially labelled — e.g., the model predicted B-LF I-LF O O instead of B-LF I-LF I-LF I-LF.
This suggests the model struggled with multi-token entities, especially when long forms
include uncommon or compound terms.

3. Confusion between B-AC and B-LF

Some tokens were mislabeled as the wrong entity type — for instance, “SNP” (a known
abbreviation) being predicted as B-LF. These errors may stem from overlapping token
patterns or inconsistent examples in the data.

Below is a real example from the test set that demonstrates how the model handles
biomedical entity tagging. The table shows the model’s predictions compared to the true
labels for each token in the sentence. The highlighted rows represent errors where the model
mislabelled parts of a long-form phrase.

Table 1: Token-level comparison of true vs predicted labels on a test sample

Token            True Label   Predicted Label   Match
intracellular    B-LF         B-LF              Yes
multiplication   I-LF         O                 No
(                O            O                 Yes
Icm              B-LF         B-LF              Yes
)                O            O                 Yes
defective        I-LF         I-LF              Yes
organelle        I-LF         I-LF              Yes
trafficking      I-LF         O                 No
Dot              B-AC         B-AC              Yes

Likely Causes of Errors

• Label imbalance: With O dominating the dataset, the model likely leaned toward
over-predicting this class.
• Tokenization effects: Subword splitting (e.g., "angiotensin" becoming [angi, ##otensin]) can break entity alignment and cause partial predictions (see the short example after this list).
• Low-frequency entities: Many long forms and abbreviations appeared very few
times, limiting the model’s ability to learn their patterns.
• Ambiguity in biomedical terms: Certain tokens can be interpreted as abbreviations
or long forms depending on context (e.g., "TNF" vs "tumour necrosis factor").
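For example, the subword splitting mentioned above can be inspected directly; the exact pieces depend on the tokenizer's vocabulary, so the behaviour described in the comment is only indicative:

    from transformers import AutoTokenizer

    tok = AutoTokenizer.from_pretrained("bert-base-cased")
    print(tok.tokenize("angiotensin-converting enzyme inhibitor"))
    # Each word is typically split into several subword pieces, so word-level
    # labels must be re-aligned to the first piece of each word.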

Suggestions for Improvement

• Use BioBERT or SciBERT, pretrained on biomedical corpora, to better handle scientific terms.
• Include more entity-balanced training data (especially long forms).
• Add post-processing rules to clean up broken long-form predictions.
• Use context windows around target tokens to enhance classification accuracy.

Conclusion

The model performed well overall, but the error patterns reveal a consistent weakness in
handling long, rare, or domain-specific sequences — especially long forms that span multiple
tokens. These findings highlight the need for domain-adapted models and data enrichment in
future work.
5. Evaluate / Outcomes / Results / Conclusion
The models developed in this project successfully fulfil their purpose of detecting abbreviations and long forms using token-level classification. Across all three experimental setups, the models achieved progressively higher accuracy and F1-scores, demonstrating that they can accurately classify entities regardless of whether the span is short or long. The best performing model was PLOD-CW-25 + 50% additional PLODv2-filtered data, indicating that the use of more domain-specific data significantly improved long-form entity detection.
In practical terms, a reliable F1-score for models in token classification tasks is above
0.85 when it comes to long-form extraction. All tested models met or exceeded this
benchmark. Long-form extraction is more complex due to multi-token dependencies,
but the 50% additional dataset shows notable gains in recall which is very critical, as
missing a single entity can result in missing vital information. The baseline BERT-
based model fell short on long-form recall, emphasising the importance of data size
and domain generalisation.
We did not directly evaluate LLMs such as GPT-3 or T5 for token classification. LLMs excel at zero-shot or few-shot tasks that involve generating text, summarization, or text classification when prompted properly, but from a computational perspective they are resource-intensive and require paid APIs or high-end GPUs. Fine-tuned transformers like BERT or RoBERTa, when adapted for token classification via the AutoModelForTokenClassification architecture, are well suited to assigning a label to each token in a sequence. BERT-like models use supervised learning with labelled datasets, which leads to deterministic outputs, and they are computationally more efficient than LLMs for this task.
The most accurate model from our experiments, trained on PLOD-CW-25 + 50% additional PLODv2-filtered data, was also the most effective. It balanced high recall with precision, making it suitable for real-world systems. In domains like healthcare or research, false negatives can be costly; therefore, prioritizing recall without sacrificing F1 makes this model valuable.
6. Deployment Web Service
To deploy the token classification system, a web application was developed using Streamlit. The system leverages our best model to perform token classification for the abbreviation detection task. The deployed service allows users to input any sentence and receive highlighted predictions identifying abbreviations and long forms in the sentence.
The application also allows easy integration of different classification models.
We chose Streamlit as our serving option because of its ability to create interactive web apps with little, uncomplicated code; it is well suited to small-scale models.
The system architecture involves:
1) User Interface
The user interface consists of a text field and a submit button, which allow users to enter a sentence and trigger the model predictions. The predictions are then displayed with each token highlighted by its predicted class using HTML/CSS styling. The interface also includes a hyperlink to the Google Sheet that stores the user inputs and predictions.

2) Model Inference
This part uses our best performing model saved on Hugging Face (link). The input text is split into words and tokenized, and the tokens are passed to the model, which outputs a prediction for each token indicating its predicted label. A minimal sketch of this inference step appears at the end of this section.

3) Logging and Monitoring
Every user interaction is logged to a centralized Google Sheet, which stores the timestamp, execution time, user input, and the predicted tokens and labels. This uses a Google service account with the gspread and oauth2client libraries.
4) Deployment
The application was initially tested locally and later deployed to Streamlit Cloud for public access.
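A minimal sketch of the inference step inside the Streamlit app (the model identifier below is a placeholder, not the report's actual Hugging Face link; the label-per-first-subword strategy mirrors the alignment used during training):

    import streamlit as st
    import torch
    from transformers import AutoModelForTokenClassification, AutoTokenizer

    MODEL_ID = "your-username/roberta-plod-token-classifier"  # placeholder identifier

    @st.cache_resource
    def load_model():
        tokenizer = AutoTokenizer.from_pretrained(MODEL_ID, add_prefix_space=True)
        model = AutoModelForTokenClassification.from_pretrained(MODEL_ID)
        return tokenizer, model

    tokenizer, model = load_model()

    text = st.text_area("Enter a sentence")
    if st.button("Predict") and text:
        words = text.split()
        enc = tokenizer(words, is_split_into_words=True,
                        return_tensors="pt", truncation=True)
        with torch.no_grad():
            logits = model(**enc).logits[0]
        preds = logits.argmax(-1).tolist()
        # Keep the prediction of the first subword of each word.
        labels, seen = [], set()
        for idx, word_id in enumerate(enc.word_ids()):
            if word_id is not None and word_id not in seen:
                seen.add(word_id)
                labels.append(model.config.id2label[preds[idx]])
        st.write(list(zip(words, labels)))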
7. Building Monitoring Capabilities
Our deployed app includes a monitoring and logging system, which uses the Google Sheets API to store user inputs and model predictions. The primary goal of this feature is to log each user interaction: its input, output, and execution time. The data stored in the Google Sheet can be downloaded and analysed programmatically.
Implementation Details
Each time the user enters a sentence and requests a prediction, the prediction is generated by the model and the associated data (user input, predictions, timestamp, and execution time) is appended to the Google Sheet. This system is implemented using gspread and oauth2client, which allow access to Google Sheets via a service account.
The logging logic is encapsulated in a save_into_sheets() function, which is called after every model prediction. This ensures consistent logging without affecting the main application logic. A sketch of what this helper might look like is shown below.
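A sketch of such a save_into_sheets() helper (the sheet name, credential file, and column order are assumptions):

    from datetime import datetime

    import gspread
    from oauth2client.service_account import ServiceAccountCredentials

    SCOPE = ["https://spreadsheets.google.com/feeds",
             "https://www.googleapis.com/auth/drive"]

    def save_into_sheets(user_input, predictions, exec_time):
        # Authenticate with the service account and append one log row.
        creds = ServiceAccountCredentials.from_json_keyfile_name(
            "service_account.json", SCOPE)  # credential file name is an assumption
        client = gspread.authorize(creds)
        sheet = client.open("token-classification-logs").sheet1  # sheet name is an assumption
        sheet.append_row([
            datetime.now().isoformat(),  # timestamp
            round(exec_time, 3),         # execution time in seconds
            user_input,                  # raw user input
            str(predictions),            # predicted (token, label) pairs
        ])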

Programmatic Accessibility

Google Sheets serves as both a logging interface and a lightweight database. The logs are structured, stored in real time, and can be:

• Downloaded for offline analysis,
• Programmatically accessed via the Sheets API,
• Filtered by content or prediction.
8. Performance of the Web Service

An analysis was conducted to evaluate the relationship between the execution time and the
length of the user input. Duplicate entries were removed to ensure that only the first
occurrence of each unique input was considered.

Summary Statistics:

• Average Input Length: 240.75 characters
• Average Execution Time: 0.337 seconds
• Max Input Length: 471 characters
• Max Execution Time: 1.159 seconds
• Correlation Coefficient: 0.27
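These statistics can be reproduced from the downloaded log with a short pandas script (the CSV file name and column names below are assumptions about how the sheet is exported):

    import pandas as pd

    df = pd.read_csv("prediction_logs.csv")                     # assumed export file
    df = df.drop_duplicates(subset="user_input", keep="first")  # keep first occurrence only

    df["input_length"] = df["user_input"].str.len()
    print("Average input length:", df["input_length"].mean())
    print("Average execution time:", df["execution_time"].mean())
    print("Correlation:", df["input_length"].corr(df["execution_time"]))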

Insight:

There is a noticeable trend indicating that longer input queries tend to result in longer
execution times. This is further supported by a positive correlation between input length
and execution time, as visualized in the scatter plot with a fitted trend line. Although the
correlation is not very strong, it suggests that input length is one of the contributing factors to
processing time.
9. References
[1] J. Devlin, M.-W. Chang, K. Lee, and K. Toutanova, "BERT: Pre-training of Deep Bidirectional Transformers for Language Understanding," in Proceedings of the 2019 Conference of the North American Chapter of the Association for Computational Linguistics (NAACL), vol. 1, 2019, doi: https://doi.org/10.18653/v1/n19-1423.
[2] Y. Liu et al., "RoBERTa: A Robustly Optimized BERT Pretraining Approach," arXiv.org, Jul. 26, 2019. https://arxiv.org/abs/1907.11692
