Spring 2025 AIFirst Project Starter Code

Project Introduction

As the final project for the Spring 2025 course [ECE 2806] AI Foundations, we present the development of AI-First GPT, a large language model (LLM) customized for Artificial Intelligence (AI) subjects. The work is divided into two parts. First, we conduct an in-depth analysis of the transformer architecture and identify the latent vector representation. Next, we perform ablation studies on training loss, validation loss, and overall model performance to determine an effective training recipe. To build a specialized LLM, we introduce HARVEST, a framework for customizing LLMs by generating a biased sub-dataset from a given dataset using a specialized cropping algorithm. HARVEST identifies topics relevant to the AI-First curriculum and biases the model toward the resulting sub-dataset, enabling it to specialize in course-specific content. To validate the framework, we visualize the distinct data distributions of the fine-tuned sub-dataset in 2D latent space using PCA and t-SNE.
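
This README does not spell out HARVEST's cropping algorithm, so the following is only a minimal sketch of one way a topic-biased sub-dataset could be cut from a larger corpus, assuming a simple keyword-density rule. The function names, the keyword list, and the threshold are hypothetical illustrations and are not taken from bias_generate/bias_all_subjects.py.

```python
import re


def keyword_score(passage, keywords):
    """Fraction of tokens in `passage` that appear in the keyword set."""
    tokens = re.findall(r"[a-z0-9]+", passage.lower())
    if not tokens:
        return 0.0
    hits = sum(1 for token in tokens if token in keywords)
    return hits / len(tokens)


def crop_biased_subset(passages, keywords, threshold=0.1):
    """Keep only passages whose keyword density reaches `threshold`.

    Assumption: topic relevance is approximated by keyword density.
    The actual HARVEST criterion may differ.
    """
    keyword_set = {k.lower() for k in keywords}
    return [p for p in passages if keyword_score(p, keyword_set) >= threshold]


# Hypothetical toy corpus and topic keywords.
corpus = [
    "The transformer uses attention over tokens.",
    "Bake the bread for forty minutes.",
]
ai_keywords = ["transformer", "attention", "tokens"]
subset = crop_biased_subset(corpus, ai_keywords)
print(subset)
```

A density threshold rather than a raw hit count keeps long, mostly off-topic passages from being selected just because they mention a keyword once.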

Checklist

Git Setup

cd ~
git clone https://github.gatech.edu/mlee864/Ming-yu-Lee.git
cd ./Ming-yu-Lee

Check Disk Space
Conda Environment

# Include python and pip so that the pip install below targets this environment
conda create -n aifirst python pip
conda activate aifirst
pip install -r requirements.txt
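
The checklist asks for a disk-space check but gives no command for it. One possible way to do the check from Python, using only the standard library, is sketched below. The helper names are hypothetical, and the amount of space the project actually needs is not stated here, so the required size is left as a parameter.

```python
import shutil


def free_gib(path):
    """Free space, in GiB, on the filesystem that contains `path`."""
    return shutil.disk_usage(path).free / 2**30


def has_enough_space(path, required_gib):
    """True if at least `required_gib` GiB are free at `path`.

    `required_gib` is an assumption supplied by the caller; this README
    does not state how much space training needs.
    """
    return free_gib(path) >= required_gib


print(f"{free_gib('.'):.1f} GiB free in the current directory")
```

Running this from the directory where checkpoints and datasets will be written gives the most relevant number, since that directory may sit on a different filesystem than the home directory.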

Part 1 - Ablations - Training Recipe

Training The Model

cd ~/Ming-yu-Lee/final_part1/
bash script/train.sh

Testing The Parameters

cd ~/Ming-yu-Lee/final_part1/
bash script/test.sh

Part 2 - Model Customization

Generated text files: the files in ~/Ming-yu-Lee/final_part2/bias_generate/txt are the parsed text extracted with the data extraction tool CloudConvert.
Generated clean text files: the files in ~/Ming-yu-Lee/final_part2/bias_generate/clean_txt are the cleaned text generated with ChatGPT 4o.
Generate Biased Text File

cd ~/Ming-yu-Lee/final_part2
python ./bias_generate/bias_all_subjects.py

Train the Model with the Clean and Biased Text Files

bash script/train.sh

Visualize Latent Vectors in 2D

bash script/2d_bias.sh

-> Check the visualized (PCA/t-SNE/SVM) images in ./final_part2/2d_plot/

Finetune and Check the Results

bash script/tune.sh
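
To illustrate what the 2D visualization step does conceptually, here is a minimal pure-Python sketch of PCA that projects vectors onto their top two principal components using power iteration. It is not the code behind script/2d_bias.sh, which may well rely on a library implementation; the function names and the toy "clean" and "biased" vectors are hypothetical.

```python
import math
import random


def center(rows):
    """Subtract the per-dimension mean from every vector."""
    n, d = len(rows), len(rows[0])
    means = [sum(r[j] for r in rows) / n for j in range(d)]
    return [[r[j] - means[j] for j in range(d)] for r in rows]


def covariance(centered):
    """Sample covariance matrix (d x d) of already-centered vectors."""
    n, d = len(centered), len(centered[0])
    return [[sum(r[i] * r[j] for r in centered) / (n - 1) for j in range(d)]
            for i in range(d)]


def matvec(matrix, vector):
    """Matrix-vector product for lists of lists."""
    return [sum(a * b for a, b in zip(row, vector)) for row in matrix]


def top_eigenpair(cov, iters=200, seed=0):
    """Power iteration: dominant eigenvector and eigenvalue of `cov`."""
    rng = random.Random(seed)
    v = [rng.gauss(0.0, 1.0) for _ in cov]
    norm = math.sqrt(sum(x * x for x in v))
    v = [x / norm for x in v]
    for _ in range(iters):
        w = matvec(cov, v)
        norm = math.sqrt(sum(x * x for x in w))
        if norm == 0.0:  # zero matrix: keep the current unit vector
            break
        v = [x / norm for x in w]
    eigenvalue = sum(a * b for a, b in zip(v, matvec(cov, v)))
    return v, eigenvalue


def pca_2d(rows):
    """Project each vector onto the top two principal components."""
    centered = center(rows)
    cov = covariance(centered)
    d = len(cov)
    components = []
    for _ in range(2):
        v, lam = top_eigenpair(cov)
        components.append(v)
        # Deflation: remove the component just found from the covariance.
        cov = [[cov[i][j] - lam * v[i] * v[j] for j in range(d)]
               for i in range(d)]
    return [[sum(a * b for a, b in zip(r, c)) for c in components]
            for r in centered]


# Hypothetical 4-dimensional "latent vectors" from two data sources.
rng = random.Random(1)
clean = [[rng.gauss(0.0, 0.1) for _ in range(4)] for _ in range(5)]
biased = [[10.0 + rng.gauss(0.0, 0.1), 10.0 + rng.gauss(0.0, 0.1),
           rng.gauss(0.0, 0.1), rng.gauss(0.0, 0.1)] for _ in range(5)]
points = pca_2d(clean + biased)
for label, point in zip(["clean"] * 5 + ["biased"] * 5, points):
    print(label, [round(x, 2) for x in point])
```

If the two sources occupy distinct regions of latent space, as in this toy data, they fall on opposite sides of the first principal axis, which is the kind of separation the plots in ./final_part2/2d_plot/ are meant to show.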

Results

Acknowledgements/Resources

This codebase draws on the NLP courses taught by Andrej Karpathy. His GitHub repository can be found at https://github.com/karpathy/nn-zero-to-hero. You are free to use his resources and incorporate his ideas. However, as discussed earlier, all training must be done by you.

