Gemini — LLM Evaluation & Classification Experiments

Compact, reproducible experiments demonstrating how to evaluate LLM-based classification strategies (zero-shot, one-shot, few-shot) and compare their performance.

Why this repo matters (for recruiters)

  • Shows end-to-end LLM integration: adapter, data pipeline, experiment drivers, and evaluation.
  • Demonstrates prompt engineering and measurement of outcomes (accuracy / F1 / comparison reports).
  • Produces reproducible artifacts so results can be audited by reviewers.

Core features

  • Experiment drivers: zero_shot_classification.py, one_shot_classification.py, few_shot_classification.py.
  • LLM client adapter: gemini_client.py (reads credentials from environment).
  • Evaluation utilities: evaluate_llm_predictions.py, compare_results.py, analyze_saved_results.py.
  • Sanity checks: test_api_keys.py, test_setup.py.
  • Results: CSV/JSON artifacts written to results/ (sample artifacts included).

Quick start (Windows PowerShell)

  1. Create the environment and install dependencies
python -m venv .venv
.\.venv\Scripts\Activate.ps1
pip install -r requirements.txt
  2. Add credentials

Copy .env.example to .env and fill in your API key and any provider URL; a small sanity check is sketched after this list. Do NOT commit secrets.

  3. Validate the setup
python test_api_keys.py
python test_setup.py
  4. Run an experiment
python zero_shot_classification.py
python one_shot_classification.py
python few_shot_classification.py
  5. Analyze results
python run_analysis.py
python show_results.py
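
The credential step above can be sanity-checked with a few lines like the sketch below. This is a hypothetical snippet, not test_api_keys.py itself: it assumes python-dotenv is available and that the key is stored in .env under a variable named GOOGLE_API_KEY, which may not match the names this repo actually uses.

# Hypothetical check: assumes python-dotenv and a GOOGLE_API_KEY entry in .env.
import os
from dotenv import load_dotenv

load_dotenv()  # copy variables from .env into the process environment
api_key = os.getenv("GOOGLE_API_KEY")
if not api_key:
    raise SystemExit("GOOGLE_API_KEY is missing; copy .env.example to .env and fill it in.")
print(f"API key found ({len(api_key)} characters)")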

Files & what to look at

  • gemini_client.py — where API calls are made; the first place to add retries or change provider settings (a minimal adapter sketch follows this list).
  • data_loader.py — dataset expectations and preprocessing.
  • *_classification.py — experiment drivers that generate outputs and save to results/.
  • results/ — contains sample outputs: predictions CSVs and metrics JSON files.
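
The adapter sketch mentioned above might look roughly like this. It is illustrative only and assumes the google-generativeai package and a GOOGLE_API_KEY environment variable; the real gemini_client.py may use different names and settings.

# Illustrative adapter sketch (hypothetical); assumes the google-generativeai package.
import os
import google.generativeai as genai

genai.configure(api_key=os.environ["GOOGLE_API_KEY"])  # credentials come from the environment

def classify(prompt: str, model_name: str = "gemini-1.5-flash") -> str:
    # Send one prompt and return the raw text; retry/backoff logic would wrap this call.
    model = genai.GenerativeModel(model_name)
    response = model.generate_content(prompt)
    return response.text

Keeping the adapter this small is what makes it easy to swap in another provider, as noted in the design notes below.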

Outputs produced

  • *_results_*.csv — model predictions with metadata.
  • *_metrics_*.json — computed evaluation metrics for a run.
  • comparison_report_*.json — aggregated comparisons across runs.
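
A quick way to inspect these artifacts after a run is sketched below. The file names and metric keys are placeholders that follow the patterns above, and the snippet assumes pandas is installed; adjust both to the actual files in results/.

# Hypothetical post-run inspection; file names and metric keys are placeholders.
import json
import pandas as pd

predictions = pd.read_csv("results/zero_shot_results_example.csv")   # predictions with metadata
with open("results/zero_shot_metrics_example.json") as f:
    metrics = json.load(f)

print(predictions.head())
print("accuracy:", metrics.get("accuracy"), "f1:", metrics.get("f1"))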

Design notes

  • Keep prompts modular and colocated with the driver scripts for easy experimentation.
  • Store full raw outputs and parsed labels to allow post-hoc analysis (one possible record shape is sketched below).
  • Keep client adapter minimal so it can be swapped for other providers.
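
One possible record shape for the second note, shown only as an assumption about what each saved row might carry (the repo's actual columns may differ):

# Hypothetical per-example record: keep the raw model output next to the parsed label.
record = {
    "example_id": 42,                   # row identifier from the dataset
    "prompt": "Classify the sentiment of: ...",
    "raw_response": "Label: positive",  # untouched model output, kept for re-parsing later
    "parsed_label": "positive",         # label extracted from the raw response
    "gold_label": "positive",           # ground truth used for metrics
    "strategy": "few_shot",             # zero_shot / one_shot / few_shot
}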

Recommended next steps (I can implement any of these)

  • Add .github/workflows/ci.yml to run test_setup.py on pushes.
  • Add a short results/README.md summarizing included sample outputs.
  • Add robust retry/backoff in gemini_client.py for production stability (a possible shape is sketched after this list).
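
For the retry/backoff item, one possible shape is exponential backoff with jitter, sketched below. The call_model argument and the broad exception handling are placeholders; in real code the except clause should be narrowed to the provider's transient error types.

# Sketch of exponential backoff with jitter around a hypothetical call_model() function.
import random
import time

def call_with_retries(call_model, prompt, max_attempts=5, base_delay=1.0):
    for attempt in range(1, max_attempts + 1):
        try:
            return call_model(prompt)
        except Exception:  # narrow to transient provider errors in real code
            if attempt == max_attempts:
                raise
            delay = base_delay * (2 ** (attempt - 1)) + random.uniform(0, 0.5)
            time.sleep(delay)  # wait longer after each failure before retrying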

Contributing

  1. Fork and branch.
  2. Run tests and add new ones for new behavior.
  3. Submit a PR describing changes and sample outputs where relevant.

License

Add a LICENSE file or update this section.
