Unsupervised Data Validation Methods for Efficient Model Training

Yurii Paniv

Ukrainian Catholic University



[email protected]

Abstract. This paper investigates the challenges and potential solutions for improving machine learning systems for low-resource languages.
State-of-the-art models in natural language processing (NLP), text-to-
speech (TTS), speech-to-text (STT), and vision-language models (VLM)
rely heavily on large datasets, which are often unavailable for low-resource
languages. This research explores key areas such as defining "quality
data," developing methods for generating appropriate data and enhanc-
ing accessibility to model training. A comprehensive review of current
methodologies, including data augmentation, multilingual transfer learn-
ing, synthetic data generation, and data selection techniques, highlights
both advancements and limitations. Several open research questions are
identified, providing a framework for future studies aimed at optimizing
data utilization, reducing the required data quantity, and maintaining
high-quality model performance. By addressing these challenges, the pa-
per aims to make advanced machine learning models more accessible for
low-resource languages, enhancing their utility and impact across various
sectors.

1 Introduction

State-of-the-art (SOTA) machine learning systems, particularly those utilized in natural language processing (NLP), text-to-speech (TTS) (Casanova et al., 2024), speech-to-text (STT) (Radford et al., 2023), and combined-modality models such as vision-language models (VLM) (Liu, Li, Wu, & Lee, 2024), depend
heavily on large datasets to achieve optimal performance. However, low-resource
languages face significant challenges due to the lack of extensive datasets. The
collection of data for these languages is often impractical, too costly, or even
unfeasible, especially in cases involving nearly extinct languages or those spoken
by a very limited number of individuals.
The prevalence of large language models (LLMs) in various applications un-
derscores the data disparity issue. These models, which have become integral
to NLP, TTS, STT, and VLM applications, require substantial quantities of
training data.
This context prompts several critical questions: What constitutes "quality
data" necessary for effective model training? How can the appropriate type of
2 Yurii Paniv

data be generated? Furthermore, how can model training be made more acces-
sible given these requirements?
Addressing these questions is crucial not only for advancing technical capa-
bilities but also for broader economic and social impacts. Enhanced language
tools can improve economic productivity by increasing the accessibility of technology. The current direction of research (Team, 2024) and industry (OpenAI, 2024) is toward integrating multiple modalities, which has significant applications in fields such as robotics.
This research proposal aims to investigate methods for optimizing data uti-
lization, thereby reducing the quantity required while maintaining high-quality
model performance. By tackling these issues, we seek to make advanced machine
learning models more accessible for low-resource languages, thus enhancing their
utility and impact across various sectors.

2 Related Work

This section reviews existing strategies and methodologies developed to address the challenges associated with data scarcity in low-resource languages, including how data quality is evaluated and the modern methods used to address these challenges.

2.1 Data Scarcity Solutions for Low-Resource Languages

In addressing the challenges posed by low-resource languages, recent advancements in model architectures such as GPT-4o (OpenAI, 2024) and Chameleon (Team, 2024) have demonstrated superior performance compared to text-only language models. Additionally, models like LLaVA (Liu et al., 2024) provide methods for extending large language models (LLMs) into the image modality, enhancing their applicability. However, the implementation of such models often requires massive datasets, as exemplified by the Whisper paper (Radford et al., 2023), in which the authors trained a state-of-the-art speech recognition system on a vast corpus of weakly labeled data, necessitating substantial resources not readily available to all researchers. On the other hand, in (Pratap et al., 2024), researchers explore how to train models for low-resource languages using multilingual transfer, producing a single model capable of performing speech recognition on 1406 languages and addressing very low-resource languages in the process.
Various approaches have been proposed to address the data scarcity challenge in low-resource languages. (Hedderich, Lange, Adel, Strötgen, & Klakow, 2021) discuss different approaches researchers take to tackle the lack of data for low-resource languages, ranging from data augmentation, learning with noisy labels, transfer learning, and domain-specific pretraining to multilingual language models. (McKinzie et al., 2024) describe their experience of building a multimodal language model. They mix image-text data, text-only data, and visual instruction tuning data, showing a transfer from text-only knowledge to vision-language knowledge. Another technique for compute-efficient data selection is described in (Ankner et al., 2024). The authors find that, with a limited compute and data budget, one can filter data based on the perplexity of a smaller language model, leading to a 1.45x reduction in pretraining steps. Recent work such as Llama 3 disproves a claim from the Chinchilla paper about underperformance for small models (Meta, 2024), suggesting even greater improvements for larger models on particular tasks.
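
The perplexity-based pruning described by (Ankner et al., 2024) can be illustrated with a short sketch. This is a minimal illustration rather than the exact procedure from the cited work, assuming a small Hugging Face causal language model as the reference scorer; the model name ("gpt2"), the maximum sequence length, and the keep fraction are illustrative assumptions, not values from the paper.

import math
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

MODEL_NAME = "gpt2"   # illustrative small reference model (an assumption)
KEEP_FRACTION = 0.5   # illustrative keep ratio; tune per compute/data budget

tokenizer = AutoTokenizer.from_pretrained(MODEL_NAME)
model = AutoModelForCausalLM.from_pretrained(MODEL_NAME)
model.eval()

def perplexity(text: str) -> float:
    # Per-token perplexity of `text` under the small reference model.
    enc = tokenizer(text, return_tensors="pt", truncation=True, max_length=512)
    with torch.no_grad():
        out = model(**enc, labels=enc["input_ids"])
    return math.exp(out.loss.item())

def prune_by_perplexity(corpus: list[str]) -> list[str]:
    # Keep the lowest-perplexity fraction of the corpus for further training.
    ranked = sorted(corpus, key=perplexity)
    return ranked[: max(1, int(len(ranked) * KEEP_FRACTION))]

The keep fraction and the choice of reference model are exactly the knobs the cited works study; the sketch only shows where perplexity enters the selection.
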
Current literature contains a plethora of data selection methods. (Northcutt, Jiang, & Chuang,
2021) explore the idea that current approaches to machine learning focus com-
monly on model predictions, not data. They present a method for dealing with
noisy data, which can improve model performance substantially. A method is
to estimate joint distribution between noisy and uncorrupted labels. Similarly,
using a perplexity as filtering criteria, (Paniv, Chaplynskyi, Trynus, & Kyrylov,
2024) discovered that finetuning on selected 20% of data provides the same
performance improvement as training on the full dataset, data even providing
performance improvements by selecting 60% as a threshold for filtering. This in-
dicates that the model itself could assess the quality of the data and be used for
data selection. it isn’t limited to supervised learning: (Vo et al., 2024) use auto-
matic data curation for self-supervised learning. Researchers were able to curate
image and text data using unsupervised clustering approach, creating a balanced
dataset across concepts using automated k-means iterations. Unsupervised data
curation was applied even more deeply in (Kaddour, Key, Nawrot, Minervini, & Kusner,
2023), where authors explored data selection methods during training, such as
selective backdrop, but haven’t found any performance improvements. Another
angle to attack this problem is removing social bias. It could be considered a
data selection technique, especially in vision-language models. Researchers from
Google describe how filtering social bias improved model performance by expos-
ing hidden relations in models (Alabdulmohsin et al., 2024). Data selection could
be even extended to "unlearn" the data (Sepahvand, Dumoulin, Triantafillou, & Dziugaite,
2024). This approach could be helpful for data validation purposes by "forget-
ting" the point which we would like to assess with the model.
Besides Llama 3, (Young et al., 2024) train models for English and Chinese but apply custom filtering rules, employing smaller models for filtering: a Quality Scorer (a classifier trained to recognize and favor pages similar in quality to Wikipedia), a Document Coherence Scorer, and a Safety Scorer designed to remove undesirable content, alongside clustering-based filtering and simpler techniques such as deduplication.
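
As a rough illustration of such rule-based pipelines, and not the actual Yi filters, the sketch below composes exact deduplication with two toy quality rules; the length and alphabetic-ratio thresholds are assumptions chosen only for readability.

import hashlib

def dedup_exact(docs: list[str]) -> list[str]:
    # Drop exact duplicates by hashing whitespace-normalized, lowercased text.
    seen, kept = set(), []
    for doc in docs:
        digest = hashlib.md5(" ".join(doc.split()).lower().encode("utf-8")).hexdigest()
        if digest not in seen:
            seen.add(digest)
            kept.append(doc)
    return kept

def looks_like_prose(doc: str) -> bool:
    # Toy quality rules: long enough and consisting mostly of alphabetic characters.
    if len(doc.split()) < 20:
        return False
    alpha_ratio = sum(ch.isalpha() for ch in doc) / max(len(doc), 1)
    return alpha_ratio > 0.6

def filter_corpus(docs: list[str]) -> list[str]:
    return [doc for doc in dedup_exact(docs) if looks_like_prose(doc)]
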
There are methods to perform filtering, but how can their impact be quantified? (Blakeney, Paul, Larsen, Owen, & Frankle, 2024) describe a technique for evaluating the effect of a particular dataset on model performance by appending it to the end of training and measuring downstream benchmarks, providing a FLOPS-efficient way to estimate the quality of data (given that enough compute is available for pretraining). There are other works, like (Covert, Ji, Hashimoto, & Zou, 2024), in which researchers test the impact of individual data points on model training and provide a method for evaluating that impact, but, unfortunately, it cannot be used for data selection.
Another thing to consider is data contamination. (Blevins & Zettlemoyer,
2022) investigate origins of cross-lingual capabilities of large language models,
4 Yurii Paniv

finding that there are a lot of non-English data present in those datasets, helping
explain the transfer of knowledge from one language to another. This finding
affirms that multilingual training, even with such a small mixture, could transfer
capabilities from one language to another. (Udandarao et al., 2024) complete
this point of view, finding that "zero-shot" capabilities of multi-modal models
have a direct relationship with the appearance of that concept in a training
dataset. This implies that in the context of low-resource languages, a need for
new architecture or adding a mixture of English data to have a transfer of that
performance to low-resource language.
However, with every technique applied, researchers should be cautious. As
shown in (Goyal, Maini, Lipton, Raghunathan, & Kolter, 2024), there are scal-
ing laws for data filtering. Researchers discovered that LAION filtering leads to
worse downstream performance than training a vision transformer on the un-
filtered dataset. In conclusion, no data cleaning method should be applied blindly with the expectation of improved performance.
One of the other promising approaches is tokenizer transfer (Minixhofer, Ponti, & Vulić,
2024). This work indicates a promising research direction to improve model to-
kenization, and, subsequently, compression, and will lead to performance im-
provement and enable transfer learning to a new language.
In (Team, 2024), researchers show that adding a vision modality contributes to performance improvements on downstream tasks, even text-only ones. We can assume that this would help train models for low-resource languages more efficiently, and it serves as a call to train combined-modality models for those languages.

2.2 Synthetic data


Synthetic data is data that has been generated using a purpose-built mathematical model or algorithm, with the aim of solving a (set of) data science task(s) (Jordon et al., 2022). In the scope of this paper, synthetic data is generated using large language models and used for further training. It is used to improve model performance without gathering additional data. For example, (Lee et al., 2024) describe a pipeline for augmenting a small, task-specific dataset by asking a teacher LM to generate synthetic data based on incorrectly answered data points, improving performance on more challenging examples. This could be used for low-resource languages to improve performance with the small amount of data available. Another approach is demonstrated in (Yuan et al., 2024), where the authors present a method to improve model capabilities through rounds of self-improvement using the Llama 2 70B model, finding that up to 3 iterations can improve model performance. This is a promising way for low-resource languages to improve their training data, and it would be interesting to test it in a multi-modal fashion.
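
A rough sketch of this kind of targeted augmentation is given below. It is a simplified illustration of the LLM2LLM idea rather than the authors' implementation: the student_predict and teacher callables, the prompt format, and the exact-match correctness check are all assumptions.

from typing import Callable

def augment_from_errors(
    train_set: list[dict],                  # items like {"question": ..., "answer": ...}
    student_predict: Callable[[str], str],  # the current (student) model
    teacher: Callable[[str], str],          # wrapper around a stronger teacher LLM
    n_new_per_error: int = 3,
) -> list[dict]:
    # Keep the original data and add teacher-written examples only for the
    # items the student currently gets wrong.
    augmented = list(train_set)
    for item in train_set:
        if student_predict(item["question"]).strip() == item["answer"].strip():
            continue  # the student already handles this example
        prompt = (
            f"Write {n_new_per_error} new question/answer pairs similar in topic "
            f"and difficulty to this one, one pair per line, question and answer "
            f"separated by a tab.\nQ: {item['question']}\nA: {item['answer']}"
        )
        for line in teacher(prompt).splitlines():
            if "\t" in line:
                question, answer = line.split("\t", 1)
                augmented.append({"question": question.strip(), "answer": answer.strip()})
    return augmented
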
Additionally, cross-modal models have demonstrated the capability to self-improve (Wang et al., 2024). Using a self-critique approach, the authors were able to improve model performance on vision-language hallucination benchmarks, improving alignment between images and corresponding captions and potentially improving performance in low-resource settings. Self-improvement is quite popular: for example, for training Llama 3, a SOTA large language model, data for text-quality classifiers was generated by Llama 2 (Meta, 2024), pushing a new boundary for the performance of open-source models.

2.3 Model Architecture Improvements and Computational Limits

According to (Godey, de la Clergerie, & Sagot, 2024), there is a need for suf-
ficiently large models to address optimization bottlenecks. The authors argue
that even with the best data selection algorithms, new models and methods are
necessary to enhance downstream performance.
(Huang, Zhang, Shan, & He, 2024) indicate that the downstream performance
of models is almost linearly related to the efficiency of the model’s ability to com-
press information. This suggests a need for improved compression algorithms and
advancements in tokenization techniques.
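
To make the compression framing concrete, here is a minimal sketch, assuming a Hugging Face causal language model ("gpt2" is an illustrative stand-in, not a model from the cited work). It reports bits per byte: the model's total cross-entropy in bits divided by the UTF-8 byte count of the evaluated text, so lower values mean better compression.

import math
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

def bits_per_byte(texts: list[str], model_name: str = "gpt2") -> float:
    # Cross-entropy of the model over the texts, in bits, divided by raw bytes.
    tokenizer = AutoTokenizer.from_pretrained(model_name)
    model = AutoModelForCausalLM.from_pretrained(model_name)
    model.eval()
    total_bits, total_bytes = 0.0, 0
    for text in texts:
        enc = tokenizer(text, return_tensors="pt", truncation=True, max_length=512)
        n_tokens = enc["input_ids"].shape[1]
        if n_tokens < 2:
            continue  # need at least one next-token prediction
        with torch.no_grad():
            loss = model(**enc, labels=enc["input_ids"]).loss.item()  # nats per predicted token
        total_bits += loss * (n_tokens - 1) / math.log(2)
        total_bytes += len(text.encode("utf-8"))
    return total_bits / max(total_bytes, 1)
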

3 Research Gaps

Even though LLMs are a hot area of research, there are a number of questions
that, to the best of my knowledge, don’t currently have an answer in existing
literature.

Question 1. What is a formal definition of a low-resource language? There is no formal definition of what a "low-resource language" is. The most cited work, (Hedderich et al., 2021), defines this term as "availability of datasets per specific task", but there is no overview of which languages are low-resource or of the threshold to cross from low-resource to medium-resource to high-resource. Can we define the threshold (the resources needed) to cross into the medium-resource and high-resource categories? As part of my PhD work, I would like to clarify this issue and give certainty to the definition.

Question 2. How can we assess the validity of a specific datapoint in order to select it for training? As described in the "Related Work" section, there are plenty of ways to assess that impact. As part of my PhD work, it would be crucial to provide an overview of data selection methods and their impact on downstream performance.

Question 3. What methods are available for data selection and data cleaning
and what is their impact on training multimodal language models? Providing an
overview of these methods and their impact on downstream performance would
be beneficial for other researchers to use for their training regimes.

Question 4. How can performance on downstream tasks be validated for low-resource languages? Quite commonly, low-resource languages lack human-created benchmarks, having only a common task such as machine translation, which would, in some cases, be the only measure to track. In some cases there is only unlabeled data, and there are efforts to fix this, such as BibleTTS (Meyer et al., 2022), which creates new labeled datasets for low-resource languages. The lack of benchmarks is addressed in (Pratap et al., 2024), where researchers create benchmarks for low-resource languages themselves using newly extracted data. I argue for unsupervised methods that can preprocess and create such datasets automatically.

Question 5. Extending to additional modalities such as audio, images, or even videos can improve downstream performance for low-resource languages (Team, 2024). How can such a multimodal dataset be constructed in an unsupervised manner so that models perform well for low-resource languages? Even with datasets available, as we show in (Paniv et al., 2024), most of a dataset can be discarded without degrading performance, and the right selection can even improve it. Ukrainian is considered a low-resource language, but taking into account vast quantities of unlabelled multimodal data, such as the Ukrainika library (Ukrainika, 2024), which contains nearly 34 TB of books, there is a case for using that data to improve language understanding. The pros and cons of this approach should be measured on downstream tasks as described in Question 4. Additionally, to track knowledge transfer across modalities, we should set up an experiment to test this explicitly, for example, by presenting a concept that is explained only in the speech modality (like prosody, which can change the meaning of a sentence) or the image modality (like the Visual Word Sense Disambiguation benchmark (Laba et al., 2024)).

4 Conclusion
In this paper, I investigated the significant challenges and potential solutions
for training large language models for low-resource languages. The reliance on
large datasets for training state-of-the-art models in natural language processing
(NLP), text-to-speech (TTS), speech-to-text (STT), and vision-language mod-
els (VLM) is well-established. However, the scarcity of data for low-resource
languages presents a substantial barrier to achieving optimal performance in
these systems.
The research highlighted several critical areas, including defining "quality
data," developing methods for generating appropriate data and making model
training more accessible. These issues are crucial for advancing both the technical
capabilities and the socio-economic impacts of language technologies. Enhanced
language tools can improve economic productivity, increase accessibility, and
facilitate the creation of robust and diverse datasets, which are essential for
integrating multiple modalities in future technologies, particularly in robotics.
The review of related work underscored advancements and limitations in
current methodologies. Various strategies such as data augmentation, multilin-
gual transfer learning, data selection, and synthetic data generation have been
proposed to address data scarcity. Notably, synthetic data and self-improvement mechanisms, as demonstrated in models like Llama and Chameleon, offer promising directions for enhancing model performance with limited data.
Several open research questions were identified to guide future studies. These
include defining what constitutes a low-resource language, validating perfor-
mance on downstream tasks, constructing multimodal datasets, assessing data
validity, and evaluating the impact of data selection and cleaning methods. These
questions aim to provide a comprehensive framework for advancing research in
this area and developing more effective machine learning models for low-resource
languages.
In conclusion, this research proposal aims to optimize data utilization for
model training, reduce the quantity required, and maintain high-quality model
performance. By addressing these challenges, advanced machine learning models
would be more accessible for low-resource languages. Continued exploration and
development of innovative solutions will be crucial in bridging the data disparity
gap and advancing the capabilities of machine learning systems globally. Last but
not least, better performance with limited data should have downstream effects
on high-resource languages such as English, enabling researchers to use less data
for training and, subsequently, iterate faster to create new model architectures.

References

Alabdulmohsin, I., Wang, X., Steiner, A. P., Goyal, P., D’Amour, A., & Zhai, X.
(2024). CLIP the bias: How useful is balancing data in multimodal learn-
ing? In The twelfth international conference on learning representations.
Retrieved from https://openreview.net/forum?id=FIGXAxr9E4
Ankner, Z., Blakeney, C., Sreenivasan, K., Marion, M., Leavitt, M. L., & Paul,
M. (2024). Perplexed by perplexity: Perplexity-based data pruning with
small reference models.
Blakeney, C., Paul, M., Larsen, B. W., Owen, S., & Frankle, J. (2024). Does
your data spark joy? performance gains from domain upsampling at the
end of training. arXiv e-prints, arXiv–2406.
Blevins, T., & Zettlemoyer, L. (2022, December). Language con-
tamination helps explains the cross-lingual capabilities of English
pretrained models. In Y. Goldberg, Z. Kozareva, & Y. Zhang
(Eds.), Proceedings of the 2022 conference on empirical methods in
natural language processing (pp. 3563–3574). Abu Dhabi, United
Arab Emirates: Association for Computational Linguistics. Re-
trieved from https://aclanthology.org/2022.emnlp-main.233 doi:
https://doi.org/10.18653/v1/2022.emnlp-main.233
Casanova, E., Davis, K., Gölge, E., Göknar, G., Gulea, I., Hart, L., . . . Weber,
J. (2024). Xtts: a massively multilingual zero-shot text-to-speech model.
Covert, I. C., Ji, W., Hashimoto, T., & Zou, J. (2024). Scaling laws for
the value of individual data points in machine learning. In Forty-first international conference on machine learning. Retrieved from https://openreview.net/forum?id=scSB9RynSd
Godey, N., de la Clergerie, É., & Sagot, B. (2024). Why do small language
models underperform? studying language model saturation via the softmax
bottleneck. arXiv preprint arXiv:2404.07647 .
Goyal, S., Maini, P., Lipton, Z. C., Raghunathan, A., & Kolter, J. Z. (2024).
Scaling laws for data filtering–data curation cannot be compute agnostic.
arXiv preprint arXiv:2404.07177 .
Hedderich, M. A., Lange, L., Adel, H., Strötgen, J., & Klakow, D. (2021,
June). A survey on recent approaches for natural language process-
ing in low-resource scenarios. In K. Toutanova et al. (Eds.), Proceed-
ings of the 2021 conference of the north american chapter of the asso-
ciation for computational linguistics: Human language technologies (pp.
2545–2568). Online: Association for Computational Linguistics. Re-
trieved from https://aclanthology.org/2021.naacl-main.201 doi:
https://doi.org/10.18653/v1/2021.naacl-main.201
Huang, Y., Zhang, J., Shan, Z., & He, J. (2024). Compression represents intel-
ligence linearly. arXiv preprint arXiv:2404.09937 .
Jordon, J., Szpruch, L., Houssiau, F., Bottarelli, M., Cherubin, G., Maple, C.,
. . . Weller, A. (2022). Synthetic data–what, why and how? arXiv preprint
arXiv:2205.03257 .
Kaddour, J., Key, O., Nawrot, P., Minervini, P., & Kusner, M. J. (2023). No
train no gain: Revisiting efficient training algorithms for transformer-based
language models. In A. Oh, T. Naumann, A. Globerson, K. Saenko,
M. Hardt, & S. Levine (Eds.), Advances in neural information processing
systems (Vol. 36, pp. 25793–25818). Curran Associates, Inc. Retrieved from
https://proceedings.neurips.cc/paper_files/paper/2023/file/51f3d6252706100325ddc435ba0a

Laba, Y., Mohytych, Y., Rohulia, I., Kyryleyza, H., Dydyk-Meush, H., Dobo-
sevych, O., & Hryniv, R. (2024). Ukrainian visual word sense disambigua-
tion benchmark. In Proceedings of the third ukrainian natural language
processing workshop (unlp)@ lrec-coling 2024 (pp. 61–66).
Lee, N., Wattanawong, T., Kim, S., Mangalam, K., Shen, S., Anumanchipali,
G., . . . Gholami, A. (2024). Llm2llm: Boosting llms with novel iterative
data enhancement. arXiv preprint arXiv:2403.15042 .
Liu, H., Li, C., Wu, Q., & Lee, Y. J. (2024). Visual instruction tuning. Advances
in neural information processing systems, 36 .
McKinzie, B., Gan, Z., Fauconnier, J.-P., Dodge, S., Zhang, B., Dufter, P., . . .
Yang, Y. (2024). Mm1: Methods, analysis & insights from multimodal llm
pre-training.
Meta. (2024). Introducing Meta Llama 3: The most capable openly available LLM
to date — ai.meta.com. https://ai.meta.com/blog/meta-llama-3/.
([Accessed 11-06-2024])
Meyer, J., Adelani, D. I., Casanova, E., Öktem, A., Weber, D. W. J., Kabongo,
S., . . . others (2022). Bibletts: a large, high-fidelity, multilingual, and
uniquely african speech corpus. arXiv preprint arXiv:2207.03546 .


Minixhofer, B., Ponti, E. M., & Vulić, I. (2024). Zero-shot tokenizer transfer.
arXiv preprint arXiv:2405.07883 .
Northcutt, C., Jiang, L., & Chuang, I. (2021). Confident learning: Estimating
uncertainty in dataset labels. Journal of Artificial Intelligence Research,
70 , 1373–1411.
OpenAI. (2024). Hello GPT-4o. https://openai.com/index/hello-gpt-4o/.
([Accessed 12-06-2024])
Paniv, Y., Chaplynskyi, D., Trynus, N., & Kyrylov, V. (2024, May). Setting up
the data printer with improved English to Ukrainian machine translation.
In M. Romanyshyn, N. Romanyshyn, A. Hlybovets, & O. Ignatenko (Eds.),
Proceedings of the third ukrainian natural language processing workshop
(unlp) @ lrec-coling 2024 (pp. 41–50). Torino, Italia: ELRA and ICCL.
Retrieved from https://aclanthology.org/2024.unlp-1.6
Pratap, V., Tjandra, A., Shi, B., Tomasello, P., Babu, A., Kundu, S., . . . others
(2024). Scaling speech technology to 1,000+ languages. Journal of Machine
Learning Research, 25 (97), 1–52.
Radford, A., Kim, J. W., Xu, T., Brockman, G., McLeavey, C., & Sutskever,
I. (2023). Robust speech recognition via large-scale weak supervision. In
International conference on machine learning (pp. 28492–28518).
Sepahvand, N. M., Dumoulin, V., Triantafillou, E., & Dziugaite, G. K. (2024).
Data selection for transfer unlearning.
Team, C. (2024). Chameleon: Mixed-modal early-fusion foundation models.
Udandarao, V., Prabhu, A., Ghosh, A., Sharma, Y., Torr, P. H., Bibi, A., . . .
Bethge, M. (2024). No" zero-shot" without exponential data: Pretrain-
ing concept frequency determines multimodal model performance. arXiv
preprint arXiv:2404.04125 .
Ukrainika. (2024). Electronic library "ukrainika" - an integrated national elec-
tronic information resource national library of ukraine named after v. i.
vernadskyi; — nbuv.gov.ua. http://www.nbuv.gov.ua/node/3699. ([Ac-
cessed 12-06-2024])
Vo, H. V., Khalidov, V., Darcet, T., Moutakanni, T., Smetanin, N., Szafraniec,
M., . . . Bojanowski, P. (2024). Automatic data curation for self-supervised
learning: A clustering-based approach.
Wang, X., Chen, J., Wang, Z., Zhou, Y., Zhou, Y., Yao, H., . . . Xiao, C. (2024).
Enhancing visual-language modality alignment in large vision language
models via self-improvement.
Young, A., Chen, B., Li, C., Huang, C., Zhang, G., Zhang, G., . . . others (2024).
Yi: Open foundation models by 01.AI. arXiv preprint arXiv:2403.04652.
Yuan, W., Pang, R. Y., Cho, K., Li, X., Sukhbaatar, S., Xu, J., & Weston, J.
(2024). Self-rewarding language models.