davidlee1229/HARVEST

Spring 2025 AIFirst Project Starter Code

Project Introduction

As the final project for the Spring 2025 course [ECE 2806] AI Foundations, we present AI-First GPT, a large language model (LLM) customized for Artificial Intelligence (AI) subjects. The work is divided into two parts. First, we analyze the transformer architecture in depth and identify its latent vector representation. Next, we run ablation studies on training loss, validation loss, and overall model performance to determine an effective training recipe. To build a specialized LLM, we introduce HARVEST, a framework for customizing LLMs by generating a biased sub-dataset from a given dataset using a specialized cropping algorithm. HARVEST identifies topics relevant to the AI-First curriculum and biases the model toward the resulting sub-dataset, enabling it to specialize in course-specific content. To validate the framework, we visualize the distinct data distributions of the fine-tuned sub-dataset in a 2D latent space using PCA and t-SNE.
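To make the idea of a biased sub-dataset concrete, here is a minimal sketch that keeps only the paragraphs of a corpus that mention course-related keywords. It is an illustration only and not the cropping algorithm implemented in `bias_all_subjects.py`; the function name `crop_biased_subset`, the `min_hits` parameter, and the keyword list are all hypothetical.

```python
import re

def crop_biased_subset(text, keywords, min_hits=1):
    """Keep paragraphs that mention at least `min_hits` distinct keywords.

    Hypothetical stand-in for a topic-based cropping step; the real
    HARVEST algorithm may work very differently.
    """
    patterns = [
        re.compile(r"\b" + re.escape(k) + r"\b", re.IGNORECASE)
        for k in keywords
    ]
    kept = []
    # Treat blank-line-separated blocks as paragraphs.
    for para in re.split(r"\n\s*\n", text):
        para = para.strip()
        if not para:
            continue
        hits = sum(1 for p in patterns if p.search(para))
        if hits >= min_hits:
            kept.append(para)
    return kept

corpus = (
    "The transformer uses self-attention over tokens.\n\n"
    "The weather was pleasant that afternoon.\n\n"
    "Gradient descent minimizes the training loss."
)
# Assumed example keywords, not taken from the course material.
subset = crop_biased_subset(corpus, ["transformer", "gradient descent", "attention"])
print(len(subset), "paragraphs kept")
```

Raising `min_hits` makes the filter stricter, which trades sub-dataset size for topical focus.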

Checklist

  1. Git setting
    cd ~
    git clone https://github.gatech.edu/mlee864/Ming-yu-Lee.git
    cd ./Ming-yu-Lee
  2. Check Disk Space
  3. Conda Environment
    conda create -n aifirst
    conda activate aifirst
    pip install -r requirements.txt
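
The checklist does not say how to check disk space. One possible way, sketched here with Python's standard library, is to query the free space of the current volume before training. The helper names and the 10 GiB threshold are assumptions for illustration; the project does not state a required amount.

```python
import shutil

def free_gib(path="."):
    """Return the free space, in GiB, of the volume that contains `path`."""
    usage = shutil.disk_usage(path)
    return usage.free / 2**30

def has_enough_space(path=".", required_gib=10.0):
    """Check free space against a threshold (10 GiB is an assumed value)."""
    return free_gib(path) >= required_gib

print(f"Free space: {free_gib():.1f} GiB")
if not has_enough_space():
    print("Warning: less free space than the assumed 10 GiB threshold.")
```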

Part 1 - Ablations - Training Recipe

  1. Training The Model
    cd ~/Ming-yu-Lee/final_part1/
    bash script/train.sh
  2. Testing The Parameters
    cd ~/Ming-yu-Lee/final_part1/
    bash script/test.sh

Part 2 - Model Customization

  1. Generated text files: ~/Ming-yu-Lee/final_part2/bias_generate/txt contains the text files parsed with the data extraction tool CloudConvert.

  2. Generated clean text files: ~/Ming-yu-Lee/final_part2/bias_generate/clean_txt contains the clean text files generated with ChatGPT-4o.

  3. Generate the biased text file
    cd ~/Ming-yu-Lee/final_part2
    python ./bias_generate/bias_all_subjects.py

  4. Train the model with the clean and biased text files
    bash script/train.sh

  5. Visualize the latent vectors in 2D
    bash script/2d_bias.sh
    -> Check the visualized (PCA/t-SNE/SVM) images in ./final_part2/2d_plot/

  6. Fine-tune and check the results
    bash script/tune.sh
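
For readers unfamiliar with the 2D visualization step, the following is a minimal sketch of projecting latent vectors onto their first two principal components with NumPy. It is not the code in `script/2d_bias.sh`; the function `pca_project_2d` is hypothetical, and the random arrays stand in for real latent vectors from the clean and biased datasets.

```python
import numpy as np

def pca_project_2d(latents):
    """Project row vectors onto their first two principal components."""
    X = np.asarray(latents, dtype=float)
    X = X - X.mean(axis=0)  # center each feature
    # Rows of vt are the principal directions, ordered by explained variance.
    _, _, vt = np.linalg.svd(X, full_matrices=False)
    return X @ vt[:2].T

# Synthetic stand-ins for latent vectors (assumed 16 dimensions).
rng = np.random.default_rng(0)
clean = rng.normal(0.0, 1.0, size=(50, 16))
biased = rng.normal(3.0, 1.0, size=(50, 16))

points = pca_project_2d(np.vstack([clean, biased]))
print(points.shape)
```

The resulting two columns can be passed to any scatter-plot routine, with one color per dataset, to see whether the two distributions separate.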

Results

[Image: result]

Acknowledgements/Resources

This code base draws on the NLP courses taught by Andrej Karpathy. His GitHub repository can be found at https://github.com/karpathy/nn-zero-to-hero. You are free to use his resources and incorporate his ideas. However, as discussed earlier, all training must be done by you.

About

HARVEST: A Dataset Customizing Framework for AI Subjects
