ZeroMAT: Zero-training MATerial Autonomous Analysis with Large Language Model and Retrieval-Augmented Generation

Hyungjun Kim, Dohun Kang, Jaejun Lee, Jiwon Sun, and Seunghan Lee

Elif Ertekin Group, University of Illinois at Urbana-Champaign
Chris Wolverton Group, Northwestern University
Implementation by Jaejun Lee

Project Overview

This repository demonstrates the performance and efficiency of TabPFN, a transformer-based foundation model for tabular data, on small, previously unseen datasets. We use Retrieval-Augmented Generation (RAG) with the OpenAI API and the Materials Project database to extract domain-specific knowledge, and rely on TabPFN's in-context learning to make predictions without task-specific training.

Key Innovations

RAG-Enhanced Feature Selection: AI-driven domain knowledge extraction for optimal descriptor selection
Intelligent Clustering: Handles TabPFN's 10k sample limitation through chemistry-aware data segmentation
Real-time Adaptability: Dynamic feature updates based on specific materials properties
Efficient Training: About 29x faster than LLM-Prop fine-tuning in the benchmark reported under Performance Results (38 min vs. 78.32 s)
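
The chemistry-aware segmentation idea can be illustrated with a minimal sketch. It assumes each record carries some chemistry label (the `anion` key and the `segment_by_chemistry` helper are hypothetical, not taken from the repository) and packs same-chemistry groups into segments that stay under the 10k sample limit:

```python
from collections import defaultdict

MAX_SAMPLES = 10_000  # TabPFN sample limit quoted in this README


def segment_by_chemistry(records, key_fn, max_samples=MAX_SAMPLES):
    """Group records by a chemistry key, then pack groups into segments
    holding at most max_samples rows each."""
    groups = defaultdict(list)
    for record in records:
        groups[key_fn(record)].append(record)

    segments, current = [], []
    for key in sorted(groups):
        rows = groups[key]
        # A single group larger than the limit is split into chunks.
        for start in range(0, len(rows), max_samples):
            chunk = rows[start:start + max_samples]
            # Start a new segment when the chunk would overflow the current one.
            if len(current) + len(chunk) > max_samples:
                segments.append(current)
                current = []
            current.extend(chunk)
    if current:
        segments.append(current)
    return segments


# Toy data with a small limit so the behaviour is easy to inspect.
toy = [{"anion": "O"}] * 5 + [{"anion": "S"}] * 4 + [{"anion": "N"}] * 2
toy_segments = segment_by_chemistry(toy, key_fn=lambda r: r["anion"], max_samples=5)
print([len(s) for s in toy_segments])
```

Each segment can then be handed to a separate TabPFN call. The actual clustering criterion used in this project may differ.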

Key Advantages

Superior Performance: TabPFN outperforms LLM-based fine-tuning on small unseen datasets (R² of 0.5788 without RAG and 0.8261 with RAG, versus 0.3881)
Training Efficiency: Training completes in roughly a minute instead of 38 minutes, using less than 1 GB of GPU memory instead of about 8 GB
Robust Feature Handling: Leverages TabPFN's in-context learning to handle:
- Variable feature lengths without retraining
- High proportions of missing (NaN) values
- Direct deployment on new data without task-specific retraining
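
As a sketch of what variable feature lengths and missing values look like in practice, the snippet below aligns records with different feature sets onto one column order and leaves gaps as NaN instead of imputing them. The helper names and the sample values are illustrative assumptions, not code or data from this repository:

```python
import math


def build_feature_matrix(records, feature_names):
    """Return one row per record, aligned to feature_names.
    Absent or None values become NaN rather than being imputed."""
    matrix = []
    for record in records:
        row = []
        for name in feature_names:
            value = record.get(name)
            row.append(float(value) if value is not None else math.nan)
        matrix.append(row)
    return matrix


def nan_fraction(matrix):
    """Fraction of entries in the matrix that are NaN."""
    cells = [value for row in matrix for value in row]
    return sum(math.isnan(value) for value in cells) / len(cells)


# Hypothetical records: the second material has no bulk modulus entry.
records = [
    {"density": 5.1, "bulk_modulus": 120.0},
    {"density": 4.2},
]
X = build_feature_matrix(records, ["density", "bulk_modulus"])
print(X, nan_fraction(X))
```

A matrix built this way can be passed to the model as-is, which is the property the list above relies on.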

Installation & Setup

Prerequisites

Python 3.12.11
CUDA-compatible GPU (recommended for TabPFN)
Google Colab or Colab Pro/Pro+ (recommended due to GPU memory requirements)

Requirements

pip install pymatgen==2025.6.14
pip install mp-api==0.45.8
pip install openai
pip install tabpfn

Alternatively, install everything at once:

pip install -r requirements.txt

Required API Keys

OpenAI API Key
- Get from: https://platform.openai.com/api-keys
- Edit rag.py line 8: API_KEY = "your-openai-key-here"
Materials Project API Key
- Get from: https://next-gen.materialsproject.org/api
- Set as environment variable: export MP_API_KEY="your-mp-key-here"
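
A minimal sketch of checking for both keys before any API call is made. It assumes both keys are supplied through environment variables; `OPENAI_API_KEY` is an assumption here, since this repository sets the OpenAI key inside rag.py, and `load_api_keys` is a hypothetical helper:

```python
import os

REQUIRED_KEYS = ("OPENAI_API_KEY", "MP_API_KEY")


def load_api_keys(env=None):
    """Read the API keys from a mapping (os.environ by default) and
    fail early with a clear message if any of them is missing."""
    env = os.environ if env is None else env
    keys = {name: env.get(name, "").strip() for name in REQUIRED_KEYS}
    missing = [name for name, value in keys.items() if not value]
    if missing:
        raise RuntimeError("Missing API key(s): " + ", ".join(missing))
    return keys
```

Calling `load_api_keys()` at the start of a script surfaces a missing key immediately instead of partway through a long fetch.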

Installation

# Clone repository
git clone https://github.com/Ahri111/ZEROMAT.git
cd ZEROMAT
conda create -n yourenv python=3.12.11
conda activate yourenv

# Install dependencies
pip install -r requirements.txt
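
After installation, a quick check like the following can confirm that the environment is usable. This is a sketch: the module names are the usual import names of the packages listed under Requirements, and the minimum Python version is an assumption based on the prerequisite above:

```python
import importlib.util
import sys

# Import names for pymatgen, mp-api, openai and tabpfn.
REQUIRED_MODULES = ("pymatgen", "mp_api", "openai", "tabpfn")


def missing_modules(names=REQUIRED_MODULES):
    """Return the top-level modules that cannot be found in this environment."""
    return [name for name in names if importlib.util.find_spec(name) is None]


def python_version_ok(minimum=(3, 12)):
    """Check that the interpreter is at least the given (major, minor) version."""
    return sys.version_info[:2] >= minimum


print("Python version OK:", python_version_ok())
print("Missing modules:", missing_modules())
```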

Core Components

1. RAG Feature Recommender (`rag.py`)

Purpose: Intelligent feature selection using domain expertise from GPT-4

Key Features:

Physics-informed feature recommendations
Interactive chat interface for query refinement
Automatic parsing and validation of recommended features
Integration with Materials Project feature mapping
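
To make the retrieval-augmented step concrete, here is a simplified sketch of the general pattern: retrieve the most relevant knowledge snippets for a question, then place them in the prompt sent to the language model. The word-overlap scoring, the snippet texts, and the function names are illustrative stand-ins and do not describe how rag.py is implemented:

```python
import re


def tokenize(text):
    """Lowercase the text and return its set of word-like tokens."""
    return set(re.findall(r"[a-z0-9_]+", text.lower()))


def retrieve(query, snippets, top_k=2):
    """Rank snippets by word overlap with the query (a stand-in for a real retriever)."""
    query_tokens = tokenize(query)
    ranked = sorted(snippets, key=lambda s: len(query_tokens & tokenize(s)), reverse=True)
    return ranked[:top_k]


def build_prompt(query, snippets, top_k=2):
    """Assemble a prompt that pairs the question with the retrieved context."""
    context = "\n".join(f"- {s}" for s in retrieve(query, snippets, top_k))
    return (
        "Use the context to recommend descriptors.\n"
        f"Context:\n{context}\n"
        f"Question: {query}\n"
        "Answer with a numbered list: '<n>. <feature_name> - <reason>'."
    )


# Hypothetical knowledge snippets.
snippets = [
    "band_gap is the energy gap between valence and conduction bands",
    "bulk_modulus measures resistance to compression",
    "density is mass per unit volume",
]
query = "Which features matter for predicting band_gap in perovskites?"
print(build_prompt(query, snippets, top_k=1))
```

The resulting string would be sent to the chat model. Asking for a fixed answer format keeps the response easy to parse afterwards.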

Usage:

python rag.py

The script then prompts for a mode:

Choose mode:
1. Interactive chat
2. Quick test

Example Interaction:
Your question: What are the most important features for predicting bandgap in perovskites?

AI Response:
**RECOMMENDED FEATURES:**
1. formation_energy_per_atom - Thermodynamic stability indicator
2. density - Structural compactness affects electronic properties
3. band_gap - Direct electronic property correlation
4. dielectric_total - Electronic screening effects
5. bulk_modulus - Mechanical stiffness relates to bonding

The recommended features are saved to feature_recommendations.txt.
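
Parsing and validating a response in the format shown above could look like the sketch below. The regular expression, the allowed-feature set, and the tab-separated output layout are assumptions made for illustration, and the real rag.py may handle this differently:

```python
import re

# Matches lines such as "1. density - Structural compactness ...".
LINE_PATTERN = re.compile(r"^\s*\d+\.\s*([A-Za-z_][A-Za-z0-9_]*)\s*-\s*(.+?)\s*$")


def parse_recommendations(response_text, allowed_features):
    """Split numbered recommendations into accepted (name, reason) pairs
    and a list of rejected names that are not in allowed_features."""
    accepted, rejected = [], []
    for line in response_text.splitlines():
        match = LINE_PATTERN.match(line)
        if not match:
            continue  # headings and free text are ignored
        name, reason = match.group(1), match.group(2)
        if name in allowed_features:
            accepted.append((name, reason))
        else:
            rejected.append(name)
    return accepted, rejected


def save_recommendations(accepted, path="feature_recommendations.txt"):
    """Write one 'name<TAB>reason' line per accepted feature (assumed layout)."""
    with open(path, "w", encoding="utf-8") as handle:
        for name, reason in accepted:
            handle.write(f"{name}\t{reason}\n")


response = """**RECOMMENDED FEATURES:**
1. formation_energy_per_atom - Thermodynamic stability indicator
2. density - Structural compactness affects electronic properties
3. made_up_feature - Not an available field
"""
allowed = {"formation_energy_per_atom", "density", "bulk_modulus"}
accepted, rejected = parse_recommendations(response, allowed)
print(accepted, rejected)
```

Validating against a known set of field names guards against the model suggesting descriptors that cannot be fetched.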

2. Materials Project Data Fetcher (`production_mp_fetcher.py`)

Purpose: Production-ready data augmentation using Materials Project database

Key Features:

Batch processing for large datasets (configurable batch sizes)
Automatic material ID validation and cleaning
Robust error handling and retry mechanisms
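
The three features above can be sketched together as follows. This is a generic illustration and not the code in production_mp_fetcher.py: `fetch_batch` is a placeholder for whatever function actually queries the Materials Project, and the ID pattern, batch size, and backoff schedule are assumptions:

```python
import re
import time

# Assumed ID format: "mp-" or "mvc-" followed by digits.
MP_ID_PATTERN = re.compile(r"^(mp|mvc)-\d+$")


def clean_material_ids(raw_ids):
    """Strip whitespace, lowercase, and drop malformed or duplicate IDs (order preserved)."""
    seen, cleaned = set(), []
    for raw in raw_ids:
        candidate = str(raw).strip().lower()
        if MP_ID_PATTERN.match(candidate) and candidate not in seen:
            seen.add(candidate)
            cleaned.append(candidate)
    return cleaned


def batched(items, batch_size):
    """Yield consecutive slices of at most batch_size items."""
    for start in range(0, len(items), batch_size):
        yield items[start:start + batch_size]


def fetch_with_retry(fetch_batch, batch, max_attempts=3, base_delay=1.0, sleep=time.sleep):
    """Call fetch_batch(batch), retrying with exponential backoff on any exception."""
    for attempt in range(1, max_attempts + 1):
        try:
            return fetch_batch(batch)
        except Exception:
            if attempt == max_attempts:
                raise  # give up after the final attempt
            sleep(base_delay * 2 ** (attempt - 1))


def fetch_all(raw_ids, fetch_batch, batch_size=100, **retry_kwargs):
    """Clean the IDs, then fetch them batch by batch with retries."""
    results = []
    for batch in batched(clean_material_ids(raw_ids), batch_size):
        results.extend(fetch_with_retry(fetch_batch, batch, **retry_kwargs))
    return results
```

Passing `fetch_batch` in as a parameter keeps the batching and retry logic independent of the API client, so it can be exercised with a stub function before any real requests are made.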

Usage:

export MP_API_KEY="your-materials-project-key"
python production_mp_fetcher.py

Requirements for running production_mp_fetcher.py: the MP_API_KEY environment variable must be set (see Required API Keys) and the packages in requirements.txt must be installed.

3. Script(Colab).ipynb

Purpose: Evaluates the performance of our proposed method against BERT + fine-tuning. We recommend running this notebook in Colab rather than on a local machine.

Quick Start

Performance Results

Comprehensive Performance Comparison

Approach	Dataset Size	Training Time	GPU Memory (Training)	R² Score	MAE (eV)
LLM-Prop + Fine-tuning	Small (<10k)	38 min	~8 GB	0.3881	0.7999
LLM-Prop + TabPFN	Small (<10k)	62.34 s	<1 GB	0.5788	0.6559
LLM-Prop + TabPFN + RAG	Small (<10k)	78.32 s	<1 GB	0.8261	0.3652
LLM-Prop + TabPFN + RAG	Large (40k)	157.40 s	<1 GB	0.9876	0.0060
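
For reference, the two accuracy metrics in the table can be computed as in the sketch below. It assumes the standard definitions of R² and mean absolute error, and the sample values are made up for illustration:

```python
def mean_absolute_error(y_true, y_pred):
    """Average absolute difference between targets and predictions."""
    return sum(abs(t - p) for t, p in zip(y_true, y_pred)) / len(y_true)


def r2_score(y_true, y_pred):
    """Coefficient of determination: 1 - SS_res / SS_tot."""
    mean_true = sum(y_true) / len(y_true)
    ss_res = sum((t - p) ** 2 for t, p in zip(y_true, y_pred))
    ss_tot = sum((t - mean_true) ** 2 for t in y_true)
    return 1.0 - ss_res / ss_tot


# Made-up band gaps in eV, for illustration only.
y_true = [1.0, 2.0, 3.0, 4.0]
y_pred = [1.5, 2.0, 2.5, 4.0]
print(mean_absolute_error(y_true, y_pred), r2_score(y_true, y_pred))
```

MAE is in the same unit as the target (eV here), while R² is unitless and equals 1 for perfect predictions.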

Key Performance Highlights

RAG Integration Impact: 43% improvement in R² score (0.5788 → 0.8261)
Training Speed: About 29x faster than LLM-Prop fine-tuning (38 min → 78.32 s)
Memory Efficiency: More than 8x lower GPU memory use (~8 GB → <1 GB)
Accuracy: Highest R² among the small-dataset runs (0.8261) with RAG enhancement
Error Reduction: 44% lower MAE compared to LLM-Prop + TabPFN alone
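
The percentages and speedups quoted in this section follow directly from the table values. A short worked calculation (the helper names are just for illustration):

```python
def relative_change(before, after):
    """Fractional change from before to after."""
    return (after - before) / before


def speedup(baseline_seconds, new_seconds):
    """How many times faster the new run is than the baseline."""
    return baseline_seconds / new_seconds


FINE_TUNING_SECONDS = 38 * 60  # 38 min baseline

r2_gain = relative_change(0.5788, 0.8261)          # TabPFN -> TabPFN + RAG
mae_drop = -relative_change(0.6559, 0.3652)        # reduction in MAE
rag_speedup = speedup(FINE_TUNING_SECONDS, 78.32)
tabpfn_speedup = speedup(FINE_TUNING_SECONDS, 62.34)
extra_seconds = 78.32 - 62.34                      # cost of adding RAG

print(f"R2 gain: {r2_gain:.1%}, MAE drop: {mae_drop:.1%}")
print(f"Speedup with RAG: {rag_speedup:.1f}x, without RAG: {tabpfn_speedup:.1f}x")
print(f"Extra time for RAG: {extra_seconds:.2f} s")
```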

Method-Specific Results

LLM-Prop + Fine-tuning (Baseline):

Longest training time (38 minutes)
Poorest performance (R² = 0.3881, MAE = 0.7999)
High computational cost

LLM-Prop + TabPFN:

37x speed improvement over fine-tuning
Moderate performance gains (R² = 0.5788)
Much lower GPU memory use (<1 GB vs. ~8 GB)

LLM-Prop + TabPFN + RAG (Our Method):

Best overall performance (R² = 0.8261, MAE = 0.3652)
Minimal additional training time (+16 seconds)
Superior domain knowledge integration

Clustering Performance

Dataset Size: Successfully handles ~40k materials
Cluster Coherence: Maintains chemical similarity within clusters
Scalability: Automatic splitting for large datasets
Coverage: 98%+ successful materials property prediction
