Gemini — LLM Evaluation & Classification Experiments

Compact, reproducible experiments demonstrating how to evaluate LLM-based classification strategies (zero-shot, one-shot, few-shot) and compare their performance.

Why this repo matters (for recruiters)

  • Shows end-to-end LLM integration: adapter, data pipeline, experiment drivers, and evaluation.
  • Demonstrates prompt engineering and measurement of outcomes (accuracy / F1 / comparison reports).
  • Produces reproducible artifacts so results can be audited by reviewers.

Core features

  • Experiment drivers: zero_shot_classification.py, one_shot_classification.py, few_shot_classification.py.
  • LLM client adapter: gemini_client.py (reads credentials from environment).
  • Evaluation utilities: evaluate_llm_predictions.py, compare_results.py, analyze_saved_results.py.
  • Sanity checks: test_api_keys.py, test_setup.py.
  • Results: CSV/JSON artifacts written to results/ (sample artifacts included).

Quick start (Windows PowerShell)

  1. Create the environment and install dependencies
python -m venv .venv
.\.venv\Scripts\Activate.ps1
pip install -r requirements.txt
  2. Add credentials

Copy .env.example to .env and fill in your API key and any provider URL; a small sanity check is sketched after this list. Do NOT commit secrets.

  3. Validate the setup
python test_api_keys.py
python test_setup.py
  4. Run an experiment
python zero_shot_classification.py
python one_shot_classification.py
python few_shot_classification.py
  5. Analyze results
python run_analysis.py
python show_results.py
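
The credential step above can be sanity-checked with a few lines like the sketch below. This is a hypothetical snippet, not test_api_keys.py itself: it assumes python-dotenv is available and that the key is stored in .env under a variable named GOOGLE_API_KEY, which may not match the names this repo actually uses.

# Hypothetical check: assumes python-dotenv and a GOOGLE_API_KEY entry in .env.
import os
from dotenv import load_dotenv

load_dotenv()  # copy variables from .env into the process environment
api_key = os.getenv("GOOGLE_API_KEY")
if not api_key:
    raise SystemExit("GOOGLE_API_KEY is missing; copy .env.example to .env and fill it in.")
print(f"API key found ({len(api_key)} characters)")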

Files & what to look at

  • gemini_client.py — where API calls are made; the first place to add retries or change provider settings (a minimal adapter sketch follows this list).
  • data_loader.py — dataset expectations and preprocessing.
  • *_classification.py — experiment drivers that generate outputs and save to results/.
  • results/ — contains sample outputs: predictions CSVs and metrics JSON files.
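
The adapter sketch mentioned above might look roughly like this. It is illustrative only and assumes the google-generativeai package and a GOOGLE_API_KEY environment variable; the real gemini_client.py may use different names and settings.

# Illustrative adapter sketch (hypothetical); assumes the google-generativeai package.
import os
import google.generativeai as genai

genai.configure(api_key=os.environ["GOOGLE_API_KEY"])  # credentials come from the environment

def classify(prompt: str, model_name: str = "gemini-1.5-flash") -> str:
    # Send one prompt and return the raw text; retry/backoff logic would wrap this call.
    model = genai.GenerativeModel(model_name)
    response = model.generate_content(prompt)
    return response.text

Keeping the adapter this small is what makes it easy to swap in another provider, as noted in the design notes below.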

Outputs produced

  • *_results_*.csv — model predictions with metadata.
  • *_metrics_*.json — computed evaluation metrics for a run.
  • comparison_report_*.json — aggregated comparisons across runs.
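
A quick way to inspect these artifacts after a run is sketched below. The file names and metric keys are placeholders that follow the patterns above, and the snippet assumes pandas is installed; adjust both to the actual files in results/.

# Hypothetical post-run inspection; file names and metric keys are placeholders.
import json
import pandas as pd

predictions = pd.read_csv("results/zero_shot_results_example.csv")   # predictions with metadata
with open("results/zero_shot_metrics_example.json") as f:
    metrics = json.load(f)

print(predictions.head())
print("accuracy:", metrics.get("accuracy"), "f1:", metrics.get("f1"))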

Design notes

  • Keep prompts modular and colocated with the driver scripts for easy experimentation.
  • Store full raw outputs and parsed labels to allow post-hoc analysis (one possible record shape is sketched below).
  • Keep client adapter minimal so it can be swapped for other providers.
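
One possible record shape for the second note, shown only as an assumption about what each saved row might carry (the repo's actual columns may differ):

# Hypothetical per-example record: keep the raw model output next to the parsed label.
record = {
    "example_id": 42,                   # row identifier from the dataset
    "prompt": "Classify the sentiment of: ...",
    "raw_response": "Label: positive",  # untouched model output, kept for re-parsing later
    "parsed_label": "positive",         # label extracted from the raw response
    "gold_label": "positive",           # ground truth used for metrics
    "strategy": "few_shot",             # zero_shot / one_shot / few_shot
}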

Recommended next steps (I can implement any of these)

  • Add .github/workflows/ci.yml to run test_setup.py on pushes.
  • Add a short results/README.md summarizing included sample outputs.
  • Add robust retry/backoff in gemini_client.py for production stability (a possible shape is sketched after this list).
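
For the retry/backoff item, one possible shape is exponential backoff with jitter, sketched below. The call_model argument and the broad exception handling are placeholders; in real code the except clause should be narrowed to the provider's transient error types.

# Sketch of exponential backoff with jitter around a hypothetical call_model() function.
import random
import time

def call_with_retries(call_model, prompt, max_attempts=5, base_delay=1.0):
    for attempt in range(1, max_attempts + 1):
        try:
            return call_model(prompt)
        except Exception:  # narrow to transient provider errors in real code
            if attempt == max_attempts:
                raise
            delay = base_delay * (2 ** (attempt - 1)) + random.uniform(0, 0.5)
            time.sleep(delay)  # wait longer after each failure before retrying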

Contributing

  1. Fork and branch.
  2. Run tests and add new ones for new behavior.
  3. Submit a PR describing changes and sample outputs where relevant.

License

Add a LICENSE file or update this section.
