As a final project for the course [ECE 2806] AI Foundations of Spring 2025, we present the development of an AI-First GPT, a large language model (LLM) specifically customized for Artificial Intelligence (AI) subjects. The work is divided into two parts. First, we conduct an in-depth analysis of the transformer architecture and identify the latent vector representation. Next, we perform ablation studies on training loss, validation loss, and overall model performance to determine an effective training recipe. To achieve our goal of building a specialized LLM, we introduce HARVEST—a framework for customizing LLMs by generating a biased sub-dataset from a given dataset using a specialized cropping algorithm. HARVEST identifies topics relevant to the AI-First curriculum and biases the model toward the resulting sub-dataset, enabling it to specialize in course-specific content. To validate our framework, we visualize the distinct data distributions of the fine-tuned sub-dataset in 2D latent space using PCA and t-SNE.
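The 2D PCA projection mentioned above can be illustrated with a minimal sketch. This is not the repository's plotting code: the function name `pca_2d` and the randomly generated "latent vectors" are hypothetical stand-ins, and the only assumption is that each sample's latent representation is a fixed-length NumPy vector.

```python
import numpy as np


def pca_2d(latents: np.ndarray) -> np.ndarray:
    """Project (n_samples, n_features) vectors onto their top two principal components."""
    # Center each feature so the principal directions describe variance, not the mean.
    centered = latents - latents.mean(axis=0, keepdims=True)
    # SVD of the centered data: rows of vt are the principal directions,
    # ordered by decreasing explained variance.
    _, _, vt = np.linalg.svd(centered, full_matrices=False)
    return centered @ vt[:2].T


if __name__ == "__main__":
    rng = np.random.default_rng(0)
    # Hypothetical latent vectors for a general dataset and a biased sub-dataset.
    general = rng.normal(loc=0.0, scale=1.0, size=(100, 64))
    biased = rng.normal(loc=3.0, scale=1.0, size=(100, 64))
    points = pca_2d(np.vstack([general, biased]))
    print(points.shape)  # (200, 2)
```

Plotting the two halves of `points` in different colors would show whether the two groups separate along the first principal components. t-SNE is a nonlinear method and would need a separate implementation or library.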
- Git setting
cd ~
git clone https://github.gatech.edu/mlee864/Ming-yu-Lee.git
cd ./Ming-yu-Lee
- Check Disk Space
- Conda Environment
conda create -n aifirst
conda activate aifirst
pip install -r requirements.txt
- Training The Model
cd ~/Ming-yu-Lee/final_part1/
bash script/train.sh
- Testing The Parameters
cd ~/Ming-yu-Lee/final_part1/
bash script/test.sh
- Generated Text File: ~/Ming-yu-Lee/final_part2/bias_generate/txt contains the text files parsed with the data extraction tool CloudConvert.
- Generated Clean Text File: ~/Ming-yu-Lee/final_part2/bias_generate/clean_txt contains the clean text files generated with ChatGPT (GPT-4o).
- Generate Biased Text File
cd ~/Ming-yu-Lee/final_part2
python ./bias_generate/bias_all_subjects.py
- Train Model with Clean and Biased Text Files
bash script/train.sh
- Visualize Latent Vectors in 2D
bash script/2d_bias.sh
-> Check the visualized (PCA/t-SNE/SVM) images in ./final_part2/2d_plot/
- Fine-tune and Check the Results
bash script/tune.sh

This codebase draws from the NLP courses taught by Andrej Karpathy. His GitHub repository can be found at https://github.com/karpathy/nn-zero-to-hero. You are free to use his resources and incorporate his ideas. However, as discussed earlier, all training must be done by you.