Multi-Agent ML Chatbot and Evaluation

This folder contains the multi-agent ML chatbot, a local RAG pipeline, a Streamlit UI, and an evaluation script.

agents.py: the multi-agent chatbot runtime.
rag.py: local RAG indexing and similarity search for data/machine-learning.pdf.
app_streamlit.py: Streamlit chat frontend.
eval.py: the automated evaluation pipeline (test set generation + LLM-as-a-judge scoring).

Project Structure

src/
├── agents.py          # Multi-agent chatbot orchestration and CLI loop
├── rag.py             # PDF chunking, embedding, JSON vector store, retrieval
├── app_streamlit.py   # Streamlit frontend
├── eval.py            # Dataset generation + evaluation pipeline
├── test_set.json      # Input test dataset (generated or user-provided)
└── eval_results.json  # Evaluation output (summary + per-case details)
data/
├── machine-learning.pdf
└── machine_learning_vector_store.json  # generated locally, not required before indexing

How `agents.py` Works

Head_Agent orchestrates several sub-agents in a fixed route:

Context_Rewriter_Agent
Obnoxious_Agent
Query_Agent(plan)
Query_Agent(search) with local JSON RAG by default, or Pinecone when selected
Relevant_Documents_Agent
Answering_Agent

Routing Logic (per user turn)

1. Rewrite the latest user query into a standalone query using recent conversation history.
2. Detect obnoxious/rude input:
   - If detected, return a refusal (Refused: Obnoxious query detected.).
3. Plan whether to search the vector store (SEARCH or NO_SEARCH).
4. If searching:
   - Embed the query and retrieve the top-k chunks from the local JSON vector store or Pinecone.
   - Judge document relevance.
   - If not relevant, return a refusal (Refused: Retrieved documents are not relevant.).
5. Generate the final answer grounded in the retrieved documents.

The script also prints the runtime path, e.g.:

Context_Rewriter_Agent -> Obnoxious_Agent -> Query_Agent(plan) -> Query_Agent(search) -> Relevant_Documents_Agent -> Answering_Agent
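
The routing steps and the printed runtime path can be sketched as follows. This is a minimal illustration, not the code in agents.py: the function `route_turn` and its callable parameters are hypothetical stand-ins for the sub-agents, and only the agent names, the SEARCH/NO_SEARCH labels, and the refusal strings come from this README.

```python
from typing import Callable, List, Tuple


def route_turn(
    query: str,
    history: List[str],
    rewrite: Callable[[str, List[str]], str],
    is_obnoxious: Callable[[str], bool],
    plan: Callable[[str], str],
    search: Callable[[str], List[str]],
    is_relevant: Callable[[str, List[str]], bool],
    answer: Callable[[str, List[str]], str],
) -> Tuple[str, str]:
    """Return (response, agent_path) for one user turn (hypothetical helper)."""
    path = ["Context_Rewriter_Agent"]
    standalone = rewrite(query, history)

    path.append("Obnoxious_Agent")
    if is_obnoxious(standalone):
        return "Refused: Obnoxious query detected.", " -> ".join(path)

    path.append("Query_Agent(plan)")
    docs: List[str] = []
    if plan(standalone) == "SEARCH":
        path.append("Query_Agent(search)")
        docs = search(standalone)
        path.append("Relevant_Documents_Agent")
        if not is_relevant(standalone, docs):
            return "Refused: Retrieved documents are not relevant.", " -> ".join(path)

    path.append("Answering_Agent")
    return answer(standalone, docs), " -> ".join(path)


# Stub sub-agents so the sketch runs without any API calls.
stubs = dict(
    rewrite=lambda q, h: q,
    plan=lambda q: "SEARCH",
    search=lambda q: ["Overfitting is fitting noise in the training data."],
    is_relevant=lambda q, docs: True,
    answer=lambda q, docs: docs[0],
)

full_response, full_path = route_turn(
    "What is overfitting?", [], is_obnoxious=lambda q: False, **stubs
)
rude_response, rude_path = route_turn(
    "Answer me, you fool.", [], is_obnoxious=lambda q: True, **stubs
)
print(full_path)
print(rude_path)
```

With these stubs, the first call walks the full six-step path and the second stops after Obnoxious_Agent with the refusal message.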

How `eval.py` Works

eval.py evaluates chatbot behavior across six test categories (50 cases in total):

obnoxious (10 cases)
irrelevant (10 cases)
relevant (10 cases)
small_talk (5 cases)
hybrid (8 cases)
multi_turn (7 conversations)

Main components:

TestDatasetGenerator: builds synthetic prompts (with fallback fixed prompts).
LLM_Judge: behavior-only binary scoring (score: 0/1).
EvaluationPipeline: runs all tests, stores per-case results, computes summary metrics.
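
How these components fit together can be sketched as below. This is an assumption-laden illustration rather than the code in eval.py: `run_cases`, `stub_bot`, and `stub_judge` are hypothetical names, the real judge calls an LLM instead of applying a string rule, and the record keys other than those listed under Output Format are guesses.

```python
from typing import Callable, Dict, List, Tuple

# Case counts per category, as listed in this README.
CASES_PER_CATEGORY = {
    "obnoxious": 10,
    "irrelevant": 10,
    "relevant": 10,
    "small_talk": 5,
    "hybrid": 8,
    "multi_turn": 7,
}


def run_cases(
    cases: List[Dict[str, str]],
    bot: Callable[[str], Tuple[str, str]],
    judge: Callable[[str, str, str], bool],
) -> List[Dict[str, object]]:
    """Run each case through the bot, then score the behavior as 0 or 1."""
    results = []
    for case in cases:
        response, agent_path = bot(case["user_input"])
        passed = judge(case["category"], case["user_input"], response)
        results.append({
            "category": case["category"],  # assumed field name
            "user_input": case["user_input"],
            "bot_response": response,
            "agent_path": agent_path,
            "score": 1 if passed else 0,
        })
    return results


def stub_bot(user_input: str) -> Tuple[str, str]:
    """Placeholder chatbot: refuses anything containing an insult."""
    if "idiot" in user_input:
        return "Refused: Obnoxious query detected.", "Context_Rewriter_Agent -> Obnoxious_Agent"
    return "Overfitting means fitting noise.", "... -> Answering_Agent"


def stub_judge(category: str, user_input: str, response: str) -> bool:
    """Toy behavior-only rule: obnoxious input should be refused, other input answered."""
    refused = response.startswith("Refused:")
    return refused if category == "obnoxious" else not refused


demo_cases = [
    {"category": "obnoxious", "user_input": "Explain overfitting, you idiot."},
    {"category": "relevant", "user_input": "What is overfitting?"},
]
demo_results = run_cases(demo_cases, stub_bot, stub_judge)
print(sum(CASES_PER_CATEGORY.values()), [r["score"] for r in demo_results])
```

The judge only sees the category, the input, and the response, which mirrors the behavior-only scoring described above: it checks whether the bot refused or answered, not whether the answer is factually correct.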

Prerequisites

Install dependencies in your environment:

cd /Users/gongjin/Downloads/LLM_course/multi-agent
pip install -r requirements.txt

Set the OpenAI key:

export OPENAI_API_KEY="..."

Local RAG Usage

Build the local vector store from the ML textbook PDF in data/:

cd /Users/gongjin/Downloads/LLM_course/multi-agent
python src/rag.py --pdf data/machine-learning.pdf --out data/machine_learning_vector_store.json

Run the chatbot in the CLI:

cd /Users/gongjin/Downloads/LLM_course/multi-agent
RAG_BACKEND=local python src/agents.py

Run the Streamlit frontend:

cd /Users/gongjin/Downloads/LLM_course/multi-agent
streamlit run src/app_streamlit.py

The frontend can also build/rebuild the local vector store from the sidebar.
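
For intuition, similarity search over a JSON vector store can be sketched as a cosine-similarity ranking. This is a minimal sketch under stated assumptions: the record layout (`text` and `embedding` keys) and the helper names `cosine` and `top_k` are hypothetical, the toy 2-dimensional vectors stand in for real embeddings, and rag.py may store and score chunks differently.

```python
import json
import math
from typing import Dict, List, Sequence


def cosine(a: Sequence[float], b: Sequence[float]) -> float:
    """Cosine similarity; returns 0.0 if either vector has zero length."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0


def top_k(query_embedding: Sequence[float], records: List[Dict], k: int = 3) -> List[Dict]:
    """Return the k records whose embeddings are most similar to the query."""
    ranked = sorted(
        records,
        key=lambda r: cosine(query_embedding, r["embedding"]),
        reverse=True,
    )
    return ranked[:k]


# Assumed on-disk layout: a JSON list of {"text": ..., "embedding": [...]} records.
store_json = """
[
  {"text": "gradient descent", "embedding": [1.0, 0.0]},
  {"text": "decision trees", "embedding": [0.0, 1.0]},
  {"text": "stochastic gradient descent", "embedding": [0.9, 0.1]}
]
"""
records = json.loads(store_json)
hits = top_k([1.0, 0.0], records, k=2)
print([h["text"] for h in hits])
```

A brute-force scan like this is linear in the number of chunks, which is usually acceptable for a single textbook and avoids an external vector database.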

Pinecone Mode

The old Pinecone-backed RAG path is still supported:

export OPENAI_API_KEY="..."
export PINECONE_API_KEY="..."
export PINECONE_INDEX_NAME="machine-learning-textbook"
export PINECONE_NAMESPACE="ns-2500"
RAG_BACKEND=pinecone python src/agents.py
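
Backend selection from the environment might look like the sketch below. The helper `select_backend` is hypothetical, and whether agents.py validates these variables at all is an assumption; only the variable names and the local default come from this README. The namespace variable is treated as optional here, which is also an assumption.

```python
import os
from typing import Mapping

# Assumed to be mandatory in Pinecone mode; the namespace is treated as optional.
REQUIRED_PINECONE_VARS = ("PINECONE_API_KEY", "PINECONE_INDEX_NAME")


def select_backend(env: Mapping[str, str]) -> str:
    """Return "local" or "pinecone" based on RAG_BACKEND (defaults to local)."""
    backend = env.get("RAG_BACKEND", "local").strip().lower()
    if backend not in ("local", "pinecone"):
        raise ValueError(f"Unknown RAG_BACKEND: {backend!r}")
    if backend == "pinecone":
        missing = [name for name in REQUIRED_PINECONE_VARS if not env.get(name)]
        if missing:
            raise EnvironmentError("Missing Pinecone settings: " + ", ".join(missing))
    return backend


# Checked against plain dicts here; pass os.environ in a real run.
default_backend = select_backend({})
pinecone_backend = select_backend({
    "RAG_BACKEND": "pinecone",
    "PINECONE_API_KEY": "placeholder-key",
    "PINECONE_INDEX_NAME": "machine-learning-textbook",
})
print(default_backend, pinecone_backend)
```

Failing fast on missing settings surfaces configuration mistakes before the first query rather than in the middle of a conversation.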

Evaluation Usage

Run the evaluation with an existing or auto-generated test set:

cd /Users/gongjin/Downloads/LLM_course/multi-agent/src
python eval.py --test_set test_set.json --out eval_results.json --judge_model gpt-4.1-nano

Generate test set only:

cd src
python eval.py --create_test_set --test_set test_set.json

Output Format

eval_results.json contains:

summary:
- total, passed, accuracy
- by_category with per-category totals and accuracies
results:
- full per-case records (user_input/conversation, bot_response, agent_path, score)
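
The summary block can be derived from the per-case records roughly as follows. This is a sketch, not the code in eval.py: the `summarize` helper, the `category` key on each record, and the key names inside `by_category` are assumptions.

```python
from collections import defaultdict
from typing import Dict, List


def summarize(results: List[Dict]) -> Dict:
    """Aggregate binary scores into overall and per-category accuracy."""
    total = len(results)
    passed = sum(r["score"] for r in results)

    per_category = defaultdict(lambda: {"total": 0, "passed": 0})
    for r in results:
        bucket = per_category[r["category"]]  # "category" is an assumed key
        bucket["total"] += 1
        bucket["passed"] += r["score"]

    return {
        "total": total,
        "passed": passed,
        "accuracy": passed / total if total else 0.0,
        "by_category": {
            name: {**bucket, "accuracy": bucket["passed"] / bucket["total"]}
            for name, bucket in per_category.items()
        },
    }


sample_results = [
    {"category": "obnoxious", "score": 1},
    {"category": "obnoxious", "score": 0},
    {"category": "relevant", "score": 1},
    {"category": "relevant", "score": 1},
]
summary = summarize(sample_results)
print(summary["accuracy"], summary["by_category"]["obnoxious"]["accuracy"])
```

Because every score is 0 or 1, accuracy is simply the fraction of passed cases, both overall and within each category.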

Notes

The default chat/completion model in both scripts is gpt-4.1-nano.
The embedding model in agents.py is text-embedding-3-small.
Query_Agent implements Pinecone namespace fallback logic for when the preferred namespace has no matches.
