
CodeLLaMA LoRA Fine-tuning System

This repository provides a system for applying LoRA fine-tuning to the CodeLLaMA model using Java and Python code data.

An article is being prepared on the development process.

* Note: The project was developed in Japanese, so code comments, output, and option labels are in Japanese.

Japanese description → 日本語版はこちらから (Japanese version available here)

Requirements

  • NVIDIA GPU (RTX 4070 Ti Super 16GB or higher recommended)
  • Docker environment
  • 32GB or more of physical memory

System Optimization

This system is optimized for execution with the following specifications:

  • CPU: Intel i7-14700K (20 cores: 8P+12E) equivalent or higher
  • Memory: 64GB or more
  • GPU: RTX 4070 Ti Super (VRAM 16GB) equivalent or higher

The Docker environment is optimized according to specified resource limits:

  • CPU: 16 cores allocated to the container (80% of the host)
  • Memory: up to 48GB available, 44GB reserved (4GB margin left for WSL)
  • GPU: RTX 4070 Ti Super VRAM limited to 10GB
  • Shared memory: 16GB allocated (for large dataset processing)

How to Change Resource Allocation

To adjust resource allocation, edit the following files:

  1. docker-compose.yml:

    • CPU core count: Change the value of deploy.resources.reservations.cpus (e.g., '16')
    • Memory limit: Change deploy.resources.limits.memory and deploy.resources.reservations.memory
    • GPU VRAM limit: Adjust the PYTORCH_CUDA_ALLOC_CONF environment variable and the VRAM_LIMIT_BYTES value in the Python script in the command section
  2. Dockerfile:

    • GPU VRAM limit: Change the limit_gb=10 value in limit_gpu.py (see the sketch after the rebuild commands below)
    • CPU thread count: Adjust environment variables such as OMP_NUM_THREADS

After making changes, rebuild the container:

docker-compose down
docker-compose build
docker-compose up -d
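
For reference, here is a minimal sketch of how a per-process VRAM cap like the one in limit_gpu.py can be implemented with PyTorch (the actual script may differ; limit_gb mirrors the limit_gb=10 value mentioned above):

import torch

def limit_gpu(limit_gb: float = 10, device: int = 0) -> None:
    """Cap this process's CUDA allocations to roughly limit_gb of VRAM."""
    total_bytes = torch.cuda.get_device_properties(device).total_memory
    fraction = min(1.0, (limit_gb * 1024 ** 3) / total_bytes)
    # PyTorch rejects allocations that would exceed this fraction of total VRAM
    torch.cuda.set_per_process_memory_fraction(fraction, device)

if torch.cuda.is_available():
    limit_gpu(limit_gb=10)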

Setup

1. Clone the Repository

git clone https://github.com/goodaymmm/LoRA_Java_Python.git
cd LoRA_Java_Python

2. Build and Start Docker Container

docker-compose build --no-cache
docker-compose up -d

To check container resource allocation:

docker stats codellama-finetuning

If there are issues with resource usage, adjust the memory and CPU settings in docker-compose.yml.

3. Download CodeLLaMA Model

Connect to the container and download the CodeLLaMA model from Hugging Face. The downloaded model will be saved locally and can be used for subsequent fine-tuning.

# Connect to container
docker-compose exec codellama-finetuning bash

# Log in to Hugging Face (token required)
huggingface-cli login

# Download model (ELYZA-japanese-CodeLlama-7b-instruct model)
mkdir -p ./models/codellama-base

# First, tokenizer only
python -c "from transformers import AutoTokenizer; tokenizer = AutoTokenizer.from_pretrained('elyza/ELYZA-japanese-CodeLlama-7b-instruct'); tokenizer.save_pretrained('./models/codellama-base')"

# Next, model (explicitly using CPU)
python -c "from transformers import AutoModelForCausalLM; import torch; model = AutoModelForCausalLM.from_pretrained('elyza/ELYZA-japanese-CodeLlama-7b-instruct', torch_dtype=torch.float16, device_map='cpu'); model.save_pretrained('./models/codellama-base')"

4. Set API Tokens

An API token is required for GitHub scraping. Set the environment variable as follows:

export GITHUB_TOKEN="your_github_token_here"
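
The scraper presumably reads this token from the environment. As a rough illustration of token-authenticated GitHub API access (the scraper's actual requests may differ; the search query shown is illustrative):

import os
import requests

token = os.environ["GITHUB_TOKEN"]
headers = {
    "Authorization": f"token {token}",
    "Accept": "application/vnd.github+json",
}

# Search for popular Java repositories (similar in spirit to --min_stars; query is illustrative)
resp = requests.get(
    "https://api.github.com/search/repositories",
    params={"q": "language:java stars:>=1000", "per_page": 5},
    headers=headers,
    timeout=30,
)
resp.raise_for_status()
for repo in resp.json()["items"]:
    print(repo["full_name"], repo["stargazers_count"])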

Scrape GitHub Code

python src/scrapers/github_scraper.py

Additional parameters:

  • --min_stars: Minimum number of stars (default: 1000)
  • --max_file_size: Maximum file size (in MB, default: 1)
  • --max_repos: Maximum repositories per language (default: 100)
  • --max_depth: Maximum depth for directory traversal (default: 2)
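
For example, to restrict scraping to very popular repositories with shallower traversal (the values are illustrative):

python src/scrapers/github_scraper.py --min_stars 5000 --max_repos 50 --max_depth 1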

Scrape AtCoder Code

python src/scrapers/atcoder_scraper.py --output_dir data/atcoder

The AtCoder scraper is also optimized for parallel processing and uses pigz for parallel extraction, so large datasets are extracted and processed quickly.

Additional parameters:

  • --max_workers: Maximum number of workers for parallel processing (default: 10)
  • --no_cleanup: Option to not delete temporary files
  • --temp_dir: Temporary directory path (default: data/temp)

Note: If pigz is not installed on your system, the scraper attempts to install it automatically; if that fails due to permission issues, install it manually:

# Ubuntu/Debian
sudo apt-get install -y pigz

# CentOS/RHEL/Fedora
sudo yum install -y pigz

# macOS (Homebrew)
brew install pigz
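
For reference, a minimal sketch of pigz-based parallel extraction (the scraper's internals may differ; the archive paths and the .tar.gz assumption are illustrative):

import subprocess
from concurrent.futures import ThreadPoolExecutor
from pathlib import Path

def extract_archive(archive: Path, dest_root: Path) -> Path:
    """Decompress one .tar.gz archive with pigz (parallel gzip) into dest_root."""
    dest = dest_root / archive.stem
    dest.mkdir(parents=True, exist_ok=True)
    # tar delegates decompression to pigz, which uses all available CPU cores
    subprocess.run(
        ["tar", "--use-compress-program=pigz", "-xf", str(archive), "-C", str(dest)],
        check=True,
    )
    return dest

def extract_all(archives, temp_dir="data/temp", max_workers=10):
    """Extract several archives concurrently (mirrors the --max_workers option)."""
    dest_root = Path(temp_dir)
    with ThreadPoolExecutor(max_workers=max_workers) as pool:
        return list(pool.map(lambda a: extract_archive(Path(a), dest_root), archives))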

Data Preprocessing

Format scraped data into JSON and create datasets.

python src/preprocessing/data_cleaning.py

Clean the created datasets.

python src/preprocessing/enhanced_data_cleaning.py
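
The exact rules live in the two scripts above; as a rough illustration of the kind of cleaning pass involved (the "code" field name and the thresholds are assumptions, not the scripts' actual schema):

import json

def clean_jsonl(src_path, dst_path, max_chars=20000):
    """Illustrative cleaning pass: drop empty, oversized, and duplicate samples."""
    seen = set()
    with open(src_path, encoding="utf-8") as src, open(dst_path, "w", encoding="utf-8") as dst:
        for line in src:
            record = json.loads(line)
            code = record.get("code", "")          # assumed field name
            if not code or len(code) > max_chars:  # drop empty or oversized samples
                continue
            if code in seen:                       # drop exact duplicates
                continue
            seen.add(code)
            dst.write(json.dumps(record, ensure_ascii=False) + "\n")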

LoRA Fine-tuning

Fine-tune the locally downloaded CodeLLaMA model.

python src/training/train_lora.py \
  --train_file data/processed/train_premium_heavy.jsonl \
  --output_dir models/codellama-lora-premium \
  --num_epochs 2 --learning_rate 2e-4 --batch_size 2

You can switch to a differently weighted dataset and adjust the training parameters.
Tune them to suit your environment and plan.
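
Internally, LoRA fine-tuning with the PEFT library follows a pattern like the sketch below (the hyperparameters and target modules here are illustrative, not necessarily the values train_lora.py uses):

import torch
from transformers import AutoModelForCausalLM
from peft import LoraConfig, get_peft_model

model = AutoModelForCausalLM.from_pretrained(
    "./models/codellama-base", torch_dtype=torch.float16, device_map="auto"
)
lora_config = LoraConfig(
    r=16,                                  # rank of the low-rank update matrices
    lora_alpha=32,                         # scaling factor applied to the update
    lora_dropout=0.05,
    target_modules=["q_proj", "v_proj"],   # attention projections (illustrative)
    task_type="CAUSAL_LM",
)
model = get_peft_model(model, lora_config)
model.print_trainable_parameters()         # only the small adapter matrices are trainable

With PEFT, typically only the adapter weights need to be saved, which keeps the fine-tuned artifact small compared with the full 7B base model.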

Inference

Perform inference (code generation) using the fine-tuned model.

Interactive Mode

python src/inference/inference.py \
  --base_model ./models/codellama-base \
  --peft_model ./models/codellama-lora-premium
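
Under the hood, inference loads the base model and then attaches the LoRA adapter; a minimal sketch (the prompt and generation settings are illustrative):

import torch
from transformers import AutoModelForCausalLM, AutoTokenizer
from peft import PeftModel

tokenizer = AutoTokenizer.from_pretrained("./models/codellama-base")
base = AutoModelForCausalLM.from_pretrained(
    "./models/codellama-base", torch_dtype=torch.float16, device_map="auto"
)
model = PeftModel.from_pretrained(base, "./models/codellama-lora-premium")

prompt = "Write a Java method that reverses a string."
inputs = tokenizer(prompt, return_tensors="pt").to(model.device)
output = model.generate(**inputs, max_new_tokens=256, temperature=0.2, do_sample=True)
print(tokenizer.decode(output[0], skip_special_tokens=True))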

Benchmark Measurement

To measure the LoRA model's score, clone the evaluation harness repository from BigCode, move the file below into it, and then run it.

Installing Bigcode

The BigCode evaluation harness (bigcode-evaluation-harness) is available on GitHub. Complete its setup up to the pip install step, then proceed to the next step.

Measurement

The default model for measurement is codellama-lora-premium.
If you trained to a different output directory, adjust run_humaneval_full164.sh accordingly.

mv run_humaneval_full164.sh ./bigcode-evaluation-harness/
bash ./bigcode-evaluation-harness/run_humaneval_full164.sh

Data Storage Locations

  • Scraped data: data/{github,atcoder}
  • Preprocessed data: data/processed
  • Base model: models/codellama-base
  • Fine-tuned model: models/codellama-lora-ultra-clean
  • Inference results: Specified output file (default: generation_results.txt)

Notes on Model Download

  • A Hugging Face account is required
  • Model download may take time (due to large model size)
  • Ensure sufficient disk space (approximately 14GB required for 7B model)

License

This project is released under the MIT License.

This model is based on:

  • Llama 2 / CodeLlama by Meta Platforms, Inc., licensed under the Llama 2 Community License
  • ELYZA-japanese-CodeLlama-7b-instruct by ELYZA, Inc.
  • Fine-tuned using Project CodeNet by IBM Research, licensed under CC BY-SA 4.0

This fine-tuned model is distributed under the CC BY-SA 4.0 license, subject to the Llama 2 Community License terms.

References

  • Llama 2: Touvron et al., "Llama 2: Open Foundation and Fine-Tuned Chat Models" (2023)
  • CodeLlama: Rozière et al., "Code Llama: Open Foundation Models for Code" (2023)
  • Project CodeNet: Puri et al., "Project CodeNet: A Large-Scale AI for Code Dataset for Learning a Diversity of Coding Tasks" (2021)
  • ELYZA-japanese-CodeLlama-7b: Sasaki et al. (2023) - HuggingFace
