This repository contains a system that applies LoRA fine-tuning to the CodeLLaMA model using Java and Python code data.
An article on the development process is in preparation.
*Because this project was developed in Japanese, code comments, output, and option labels are in Japanese.
Japanese description → the Japanese version is available here.
- NVIDIA GPU (RTX 4070 Ti Super 16GB or higher recommended)
- Docker environment
- Physical memory 32GB or more
This system is optimized for execution with the following specifications:
- CPU: Intel i7-14700K (20 cores: 8P+12E) equivalent or higher
- Memory: 64GB or more
- GPU: RTX 4070 Ti Super (VRAM 16GB) equivalent or higher
The Docker environment is configured with the following resource limits:
- CPU: Allocate 16 cores to container (80% of host)
- Memory: Up to 48GB available, 44GB reserved (4GB reserved for WSL)
- GPU: RTX 4070 Ti Super VRAM limited to 10GB
- Shared memory: 16GB allocated (for large dataset processing)
To adjust resource allocation, edit the following files:

- `docker-compose.yml`:
  - CPU percentage: Change the value of `deploy.resources.reservations.cpus` (e.g., '16')
  - Memory limit: Change `deploy.resources.limits.memory` and `deploy.resources.reservations.memory`
  - GPU VRAM limit: Adjust the `PYTORCH_CUDA_ALLOC_CONF` environment variable and the `VRAM_LIMIT_BYTES` value in the Python script in the `command` section
- `Dockerfile`:
  - GPU VRAM limit: Change the `limit_gb=10` value in `limit_gpu.py` (see the sketch below)
  - CPU thread count: Adjust environment variables such as `OMP_NUM_THREADS`
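For reference, a VRAM cap like the one described above can be applied with PyTorch's per-process memory fraction API. The sketch below is illustrative only and assumes `limit_gpu.py` uses this mechanism; the actual script in the repository may differ.

```python
# Illustrative sketch of a per-process VRAM cap (assumed mechanism, not the repo's limit_gpu.py).
import torch

def limit_vram(limit_gb: float = 10.0, device: int = 0) -> None:
    """Cap this process's CUDA allocations to roughly limit_gb gigabytes."""
    if not torch.cuda.is_available():
        return
    total_bytes = torch.cuda.get_device_properties(device).total_memory
    fraction = min(1.0, (limit_gb * 1024 ** 3) / total_bytes)
    # The caching allocator then refuses allocations beyond this fraction of total VRAM.
    torch.cuda.set_per_process_memory_fraction(fraction, device)

if __name__ == "__main__":
    limit_vram(limit_gb=10.0)
```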
After making changes, rebuild the container:
```bash
docker-compose down
docker-compose build
docker-compose up -d
```

To set up from a fresh clone instead, run:

```bash
git clone https://github.com/goodaymmm/LoRA_Java_Python.git
cd LoRA_Java_Python
docker-compose build --no-cache
docker-compose up -d
```

To check container resource allocation:

```bash
docker stats codellama-finetuning
```

If there are issues with resource usage, adjust the memory and CPU settings in `docker-compose.yml`.
Connect to the container and download the CodeLLaMA model from Hugging Face. The downloaded model will be saved locally and can be used for subsequent fine-tuning.
```bash
# Connect to container
docker-compose exec codellama-finetuning bash

# Log in to Hugging Face (token required)
huggingface-cli login

# Download model (ELYZA-japanese-CodeLlama-7b-instruct)
mkdir -p ./models/codellama-base

# First, tokenizer only
python -c "from transformers import AutoTokenizer; tokenizer = AutoTokenizer.from_pretrained('elyza/ELYZA-japanese-CodeLlama-7b-instruct'); tokenizer.save_pretrained('./models/codellama-base')"

# Next, model weights (explicitly on CPU)
python -c "from transformers import AutoModelForCausalLM; import torch; model = AutoModelForCausalLM.from_pretrained('elyza/ELYZA-japanese-CodeLlama-7b-instruct', torch_dtype=torch.float16, device_map='cpu'); model.save_pretrained('./models/codellama-base')"
```
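To confirm the model was saved correctly, a quick local load test such as the following can be run inside the container. This is a minimal, optional sketch and not part of the repository's scripts.

```python
# Optional sanity check: load the locally saved tokenizer and model (illustrative only).
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

MODEL_DIR = "./models/codellama-base"

tokenizer = AutoTokenizer.from_pretrained(MODEL_DIR)
model = AutoModelForCausalLM.from_pretrained(MODEL_DIR, torch_dtype=torch.float16, device_map="cpu")

# A 7B model should report roughly 6-7 billion parameters.
print(f"Parameters: {model.num_parameters() / 1e9:.2f}B")
print(tokenizer("def hello():")["input_ids"][:10])
```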
An API token is required for GitHub scraping. Set the environment variable, then run the scraper:

```bash
export GITHUB_TOKEN="your_github_token_here"
python src/scrapers/github_scraper.py
```

Additional parameters:

- `--min_stars`: Minimum number of stars (default: 1000)
- `--max_file_size`: Maximum file size (in MB, default: 1)
- `--max_repos`: Maximum repositories per language (default: 100)
- `--max_depth`: Maximum depth for directory traversal (default: 2)
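For illustration, the standalone sketch below shows one way to query the GitHub search API for highly starred repositories using the same token. It is not the repository's `github_scraper.py`; the actual scraping logic and defaults there may differ.

```python
# Illustrative only: list highly starred repositories per language via the GitHub search API.
import os
import requests

def search_repos(language: str, min_stars: int = 1000, max_repos: int = 100):
    headers = {"Authorization": f"token {os.environ['GITHUB_TOKEN']}"}
    repos, page = [], 1
    while len(repos) < max_repos:
        resp = requests.get(
            "https://api.github.com/search/repositories",
            headers=headers,
            params={
                "q": f"language:{language} stars:>={min_stars}",
                "sort": "stars",
                "order": "desc",
                "per_page": 100,
                "page": page,
            },
            timeout=30,
        )
        resp.raise_for_status()
        items = resp.json().get("items", [])
        if not items:
            break
        repos.extend(items)
        page += 1
    return [r["full_name"] for r in repos[:max_repos]]

if __name__ == "__main__":
    for lang in ("Java", "Python"):
        print(lang, search_repos(lang)[:5])
```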
```bash
python src/scrapers/atcoder_scraper.py --output_dir data/atcoder
```

The AtCoder scraper is also optimized for parallel processing: it uses pigz for parallel extraction so large datasets are decompressed quickly (see the sketch after the parameter list).

Additional parameters:

- `--max_workers`: Maximum number of workers for parallel processing (default: 10)
- `--no_cleanup`: Do not delete temporary files
- `--temp_dir`: Temporary directory path (default: data/temp)
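The sketch below illustrates the general idea of pigz-based parallel extraction: several archives are unpacked concurrently, each through a multi-threaded pigz pipeline. The function names and directory layout are assumptions for illustration; the actual scraper's implementation may differ.

```python
# Illustrative only: extract several .tar.gz archives in parallel, each via a pigz pipeline.
import subprocess
from concurrent.futures import ThreadPoolExecutor, as_completed
from pathlib import Path

def extract_with_pigz(archive: Path, dest: Path) -> Path:
    """Unpack one archive; pigz decompresses with multiple threads, tar unpacks the stream."""
    dest.mkdir(parents=True, exist_ok=True)
    subprocess.run(
        ["tar", "--use-compress-program=pigz", "-xf", str(archive), "-C", str(dest)],
        check=True,
    )
    return dest

def extract_all(archives, temp_dir: str = "data/temp", max_workers: int = 10):
    futures = {}
    with ThreadPoolExecutor(max_workers=max_workers) as pool:
        for archive in map(Path, archives):
            out_dir = Path(temp_dir) / archive.name.replace(".tar.gz", "")
            futures[pool.submit(extract_with_pigz, archive, out_dir)] = archive
        for future in as_completed(futures):
            print(f"extracted {futures[future]} -> {future.result()}")
```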
Note: If pigz is not installed on your system, the scraper will attempt to install it automatically; if that fails due to permission issues, install it manually:

```bash
# Ubuntu/Debian
sudo apt-get install -y pigz

# CentOS/RHEL/Fedora
sudo yum install -y pigz

# macOS (Homebrew)
brew install pigz
```
Format scraped data into JSON and create datasets:

```bash
python src/preprocessing/data_cleaning.py
```
Clean the created datasets:

```bash
python src/preprocessing/enhanced_data_cleaning.py
```
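As a rough illustration of the kind of cleaning pass applied at this stage, the sketch below deduplicates JSONL records and drops malformed or oversized samples. The file paths and threshold are assumptions for illustration; the repository's cleaning scripts implement their own rules.

```python
# Illustrative JSONL cleaning pass (paths and threshold are assumed, not taken from the repo).
import hashlib
import json

def clean_jsonl(src: str, dst: str, max_chars: int = 20_000) -> None:
    seen, kept = set(), 0
    with open(src, encoding="utf-8") as fin, open(dst, "w", encoding="utf-8") as fout:
        for line in fin:
            line = line.strip()
            if not line:
                continue
            try:
                record = json.loads(line)
            except json.JSONDecodeError:
                continue  # drop malformed rows
            text = json.dumps(record, sort_keys=True, ensure_ascii=False)
            if len(text) > max_chars:
                continue  # drop oversized samples
            digest = hashlib.sha256(text.encode("utf-8")).hexdigest()
            if digest in seen:
                continue  # drop exact duplicates
            seen.add(digest)
            fout.write(json.dumps(record, ensure_ascii=False) + "\n")
            kept += 1
    print(f"kept {kept} records -> {dst}")

if __name__ == "__main__":
    clean_jsonl("data/processed/train.jsonl", "data/processed/train_clean.jsonl")
```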
Fine-tune the locally downloaded CodeLLaMA model:

```bash
python src/training/train_lora.py \
  --train_file data/processed/train_premium_heavy.jsonl \
  --output_dir models/codellama-lora-premium \
  --num_epochs 2 --learning_rate 2e-4 --batch_size 2
```

You can change the weighting of the imported datasets and adjust the parameters; tune them to fit your environment and plan.
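For background, the core of LoRA fine-tuning with `peft` looks roughly like the following. This is an illustrative sketch, not the repository's `train_lora.py`; the rank, alpha, and target modules shown here are assumptions.

```python
# Illustrative LoRA setup with peft (hyperparameters are assumed, not taken from train_lora.py).
import torch
from peft import LoraConfig, get_peft_model
from transformers import AutoModelForCausalLM

base = AutoModelForCausalLM.from_pretrained(
    "./models/codellama-base",
    torch_dtype=torch.float16,
    device_map="auto",
)

lora_config = LoraConfig(
    r=16,                # rank of the low-rank update matrices
    lora_alpha=32,       # scaling factor
    lora_dropout=0.05,
    target_modules=["q_proj", "v_proj"],  # attention projections to adapt
    task_type="CAUSAL_LM",
)

model = get_peft_model(base, lora_config)
# Only the small LoRA adapter weights are trainable; the 7B base model stays frozen.
model.print_trainable_parameters()
```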
Perform inference (code generation) using the fine-tuned model.
```bash
python src/inference/inference.py \
  --base_model ./models/codellama-base \
  --peft_model ./models/codellama-lora-premium
```
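Under the hood, inference with a LoRA adapter amounts to loading the base model and attaching the saved adapter with `peft`. The sketch below is illustrative and independent of `inference.py`, whose options and prompt formatting may differ.

```python
# Illustrative: attach the LoRA adapter to the base model and generate code.
import torch
from peft import PeftModel
from transformers import AutoModelForCausalLM, AutoTokenizer

BASE = "./models/codellama-base"
ADAPTER = "./models/codellama-lora-premium"

tokenizer = AutoTokenizer.from_pretrained(BASE)
base_model = AutoModelForCausalLM.from_pretrained(BASE, torch_dtype=torch.float16, device_map="auto")
model = PeftModel.from_pretrained(base_model, ADAPTER)
model.eval()

prompt = "Write a Python function that reverses a string."  # example prompt
inputs = tokenizer(prompt, return_tensors="pt").to(model.device)
with torch.no_grad():
    outputs = model.generate(**inputs, max_new_tokens=256, do_sample=False)
print(tokenizer.decode(outputs[0], skip_special_tokens=True))
```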
If you want to check the fine-tuned model's score (HumanEval), clone the BigCode evaluation harness repository, move the following file into it, and then run it.
BigCode's GitHub repository is here.
Follow its setup instructions up through the `pip install` step, then proceed to the next step here.
The default model to evaluate is set to codellama-lora-premium; if you have changed it, adjust `run_humaneval_full164.sh` accordingly.
```bash
mv run_humaneval_full164.sh ./bigcode-evaluation-harness/
bash ./bigcode-evaluation-harness/run_humaneval_full164.sh
```

Output locations:

- Scraped data: `data/{github,atcoder}`
- Preprocessed data: `data/processed`
- Base model: `models/codellama-base`
- Fine-tuned model: `models/codellama-lora-ultra-clean`
- Inference results: Specified output file (default: `generation_results.txt`)
- A Hugging Face account is required
- Model download may take time (due to large model size)
- Ensure sufficient disk space (approximately 14GB required for 7B model)
This project is released under the MIT License.
This model is based on:
- Llama 2 / CodeLlama by Meta Platforms, Inc., licensed under the Llama 2 Community License
- ELYZA-japanese-CodeLlama-7b-instruct by ELYZA, Inc.
- Fine-tuned using Project CodeNet by IBM Research, licensed under CC BY-SA 4.0
This fine-tuned model is distributed under CC BY-SA 4.0 license, subject to the Llama 2 Community License terms.
- Llama 2: Touvron et al., "Llama 2: Open Foundation and Fine-Tuned Chat Models" (2023)
- CodeLlama: Rozière et al., "Code Llama: Open Foundation Models for Code" (2023)
- Project CodeNet: Puri et al., "Project CodeNet: A Large-Scale AI for Code Dataset for Learning a Diversity of Coding Tasks" (2021)
- ELYZA-japanese-CodeLlama-7b: Sasaki et al. (2023) - HuggingFace