ArabicLLM

Project Overview

The goal of the project was to adapt large language models for the Arabic language and develop a new state-of-the-art Arabic LLM. Due to the scarcity of Arabic data, few LLMs have been trained specifically for Arabic, falling short of meeting the needs of its vast number of speakers. Our final model was trained on a high-quality instruction fine-tuning (IFT) dataset, generated synthetically and then evaluated using the Hugging Face Arabic leaderboard.

Details

The methodology introduced:

Generate synthetic data in English
Translate the data to Arabic using NLLB
Fine-tune the Gemma-2-2b-it model using the new data
Evaluate using the Arabic leaderboard benchmarks.

Synthetic Data

This is the approach we used to construct our dataset:

Scraping Wikipedia articles about Arabic culture, science, Arab countries, religion, etc.
Generating questions and answers from paragraphs, with 75% of the paragraphs shortened, while ensuring the responses are comprehensive
Generating 3-turn conversational data using the remaining 25% of longer paragraphs: the first multi-turn dataset in Arabic (Slim205/multi_turn_nllb).

Training Settings

The 2B version was trained for 2 days on 1 A100 GPU using LoRA with a rank of 128, a learning rate of 1e-4, and a cosine learning rate schedule.

The 9B version was trained on 4 A100 GPU using LoRA with a rank of 128, a learning rate of 1e-4, and a cosine learning rate schedule.

Datasets Used

The following datasets were used for instruction fine-tuning. Each dataset includes the number of instruction and answer tokens:

Dataset	Instruction	Answer
Slim205/boolq_ift	1,290,801	275,243
Slim205/race_ift_v02_filtered	33,798,422	4,466,553
Slim205/copa_ift_v02_filtered	11,663	11,703
Slim205/hellaswag_ift_v02_filtered	7,264,646	1,448,588
Slim205/sciq_ift_v02_filtered	1,559,005	276,289
Slim205/arc_challenge_ift_v02_filtered	64,031	41,144
Slim205/arc_easy_ift_v02_filtered	112,489	80,226
Slim205/piqa_ift_v02_filtered	931,060	503,598
Slim205/gsm8k_ift_v02	435,759	563,772
Slim205/wiki_data_full_filtered	1,934,101	10,900,116
Slim205/multi_turn_v02_filtered	2,740,059	14,347,695
Total	50,142,036	32,914,927

Model: BARKA-9B-it

The table below compares the performance of the base model Gemma-2-9b-it and a fine-tuned version across multiple benchmarks:

Benchmark	Gemma-2-9b-it	Fine-tuned
ACVA	48.98	73.55
AlGhafa	54.17	54.63
MMLU	53.33	52.51
EXAMS	50.65	52.89
ARC Challenge	55.00	58.88
ARC Easy	52.45	59.46
BOOLQ	83.99	86.38
COPA	60.00	62.22
HELLAWSWAG	32.33	38.00
OPENBOOK QA	50.30	56.16
PIQA	70.54	71.96
RACE	45.55	48.73
SCIQ	51.06	50.35
TOXIGEN	82.35	85.45
All	56.48	60.80

Evaluation

Evaluation script: using https://huggingface.co/spaces/OALL/Open-Arabic-LLM-Leaderboard benchmarks

```bash
accelerate launch --multi_gpu --num_processes=2 run_evals_accelerate.py --model_args="pretrained=EleutherAI/pythia-70m,dtype=bfloat16" --custom_tasks community_tasks/arabic_evals.py --tasks examples/tasks/arabic_tasks_0.txt --override_batch_size 32 --save_details --output_dir="./output_arabic_benchmark" --dataset_loading_processes 10
```

This command will generate a .json file in the output_arabic_benchmark directory with the evaluation results for the model EleutherAI/pythia-70m over all 137 tasks listed in arabic_tasks_0.txt.

Data Link

The full dataset for this project can be found at https://huggingface.co/datasets/Slim205/Barka_data_2B.

The 9B model : https://huggingface.co/Slim205/BARKA-9b-it : Better than Llama-3.1-70B-Instruct, Gemma-2-27b-it in the Arabic Leaderboard

The 2B model : https://huggingface.co/Slim205/BARKA-2b-it : The Best 2B Arabic LLM

Name		Name	Last commit message	Last commit date
Latest commit History 128 Commits
DATA_v02		DATA_v02
Wikipedia		Wikipedia
community_tasks		community_tasks
examples		examples
output_arabic_benchmark_axolotl		output_arabic_benchmark_axolotl
output_bench_old_evals		output_bench_old_evals
src		src
tests		tests
IFT_configs		IFT_configs
LICENSE		LICENSE
README.md		README.md
W&B_Chart_eval_loss.png		W&B_Chart_eval_loss.png
all_data_together_v05.py		all_data_together_v05.py
aya_data_process.py		aya_data_process.py
count_token.py		count_token.py
fewshots.py		fewshots.py
mmlu_process.py		mmlu_process.py
multi_together.py		multi_together.py
photo.png		photo.png
prepare_axolotl.py		prepare_axolotl.py
race_process.py		race_process.py
run_evals_accelerate.py		run_evals_accelerate.py
setup.py		setup.py
test.py		test.py
translate_0shots.py		translate_0shots.py
translate_gsm8k.py		translate_gsm8k.py
translation_processing.py		translation_processing.py

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Repository files navigation

ArabicLLM

Project Overview

Details

Synthetic Data

Training Settings

Datasets Used

Model: BARKA-9B-it

Evaluation

Data Link

About

Uh oh!

Releases

Packages

Uh oh!

Contributors

Uh oh!

Languages

Folders and files

Latest commit

History

Repository files navigation

ArabicLLM

Project Overview

Details

Synthetic Data

Training Settings

Datasets Used

Model: BARKA-9B-it

Evaluation

Data Link

About

Resources

License

Uh oh!

Stars

Watchers

Forks

Releases

Packages 0

Uh oh!

Contributors

Uh oh!

Languages

Packages