This folder contains the multi-agent ML chatbot, a local RAG pipeline, a Streamlit UI, and an evaluation script.
- agents.py: the multi-agent chatbot runtime.
- rag.py: local RAG indexing and similarity search for data/machine-learning.pdf.
- app_streamlit.py: Streamlit chat frontend.
- eval.py: the automated evaluation pipeline (test set generation + LLM-as-a-judge scoring).
src/
├── agents.py # Multi-agent chatbot orchestration and CLI loop
├── rag.py # PDF chunking, embedding, JSON vector store, retrieval
├── app_streamlit.py # Streamlit frontend
├── eval.py # Dataset generation + evaluation pipeline
├── test_set.json # Input test dataset (generated or user-provided)
└── eval_results.json # Evaluation output (summary + per-case details)
data/
├── machine-learning.pdf
└── machine_learning_vector_store.json # generated locally, not required before indexing
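The similarity search that rag.py runs over the JSON vector store can be pictured roughly as follows. This is a minimal sketch, not the actual implementation: the record layout (a list of objects with "text" and "embedding" keys), the function names, and the toy 3-dimensional vectors are all assumptions made for illustration. Real embeddings would come from the embedding model.

```python
import json
import math


def cosine_similarity(a, b):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0.0 or norm_b == 0.0:
        return 0.0  # treat zero vectors as matching nothing
    return dot / (norm_a * norm_b)


def top_k(query_embedding, records, k=3):
    """Return the k (score, text) pairs most similar to the query, best first."""
    scored = [
        (cosine_similarity(query_embedding, r["embedding"]), r["text"])
        for r in records
    ]
    scored.sort(key=lambda pair: pair[0], reverse=True)
    return scored[:k]


# Hypothetical store contents; a real store would be read from the JSON file.
store_json = json.dumps([
    {"text": "Gradient descent updates weights.", "embedding": [1.0, 0.0, 0.0]},
    {"text": "Decision trees split on features.", "embedding": [0.0, 1.0, 0.0]},
    {"text": "SGD uses mini-batches.", "embedding": [0.9, 0.1, 0.0]},
])
records = json.loads(store_json)

hits = top_k([1.0, 0.0, 0.0], records, k=2)
for score, text in hits:
    print(f"{score:.3f}  {text}")
```

A linear scan like this is usually adequate for a single textbook's worth of chunks; a larger corpus would call for a dedicated vector index.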
Head_Agent orchestrates several sub-agents in a fixed route:
- Context_Rewriter_Agent
- Obnoxious_Agent
- Query_Agent(plan)
- Query_Agent(search) with local JSON RAG by default, or Pinecone when selected
- Relevant_Documents_Agent
- Answering_Agent
- Rewrite the latest user query into a standalone query using recent conversation history.
- Detect obnoxious/rude input:
  - If detected, return a refusal (Refused: Obnoxious query detected.).
- Plan whether to search the vector store (SEARCH or NO_SEARCH).
- If searching:
  - Embed the query and retrieve the top-k chunks from the local JSON vector store or Pinecone.
  - Judge document relevance.
  - If not relevant, return a refusal (Refused: Retrieved documents are not relevant.).
- Generate the final answer grounded in the retrieved documents.
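The control flow of these steps can be sketched with plain callables standing in for the LLM-backed agents. This is an illustrative sketch only: the `route` function, its parameter names, and the stub lambdas are hypothetical, and the assumption that a NO_SEARCH plan goes straight to the answering step with no documents is not confirmed by this README.

```python
from typing import Callable, List, Tuple

OBNOXIOUS_REFUSAL = "Refused: Obnoxious query detected."
IRRELEVANT_REFUSAL = "Refused: Retrieved documents are not relevant."


def route(
    query: str,
    rewrite: Callable[[str], str],
    is_obnoxious: Callable[[str], bool],
    plan: Callable[[str], str],
    search: Callable[[str], List[str]],
    is_relevant: Callable[[str, List[str]], bool],
    answer: Callable[[str, List[str]], str],
) -> Tuple[str, List[str]]:
    """Run one turn and return (response, agent_path)."""
    path = ["Context_Rewriter_Agent"]
    standalone = rewrite(query)

    path.append("Obnoxious_Agent")
    if is_obnoxious(standalone):
        return OBNOXIOUS_REFUSAL, path

    path.append("Query_Agent(plan)")
    docs: List[str] = []
    if plan(standalone) == "SEARCH":
        path.append("Query_Agent(search)")
        docs = search(standalone)
        path.append("Relevant_Documents_Agent")
        if not is_relevant(standalone, docs):
            return IRRELEVANT_REFUSAL, path
    # Assumption: NO_SEARCH skips retrieval and answers without documents.

    path.append("Answering_Agent")
    return answer(standalone, docs), path


# Stub agents standing in for the real LLM calls.
stubs = dict(
    rewrite=lambda q: q.strip(),
    is_obnoxious=lambda q: "idiot" in q.lower(),
    plan=lambda q: "SEARCH",
    search=lambda q: ["Overfitting means fitting noise in the training data."],
    is_relevant=lambda q, docs: bool(docs),
    answer=lambda q, docs: docs[0] if docs else "Hello!",
)

response, path = route("What is overfitting?", **stubs)
print(" -> ".join(path))

refusal, short_path = route("Answer me, you idiot", **stubs)
print(refusal)
```

Returning the path alongside the response keeps early exits visible: a refused query yields a shorter path that ends at the agent that refused it.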
The script also prints the runtime path, e.g.:
Context_Rewriter_Agent -> Obnoxious_Agent -> Query_Agent(plan) -> Query_Agent(search) -> Relevant_Documents_Agent -> Answering_Agent
eval.py evaluates the chatbot behavior with six categories:
- obnoxious (10 cases)
- irrelevant (10 cases)
- relevant (10 cases)
- small_talk (5 cases)
- hybrid (8 cases)
- multi_turn (7 conversations)
Main components:
- TestDatasetGenerator: builds synthetic prompts (with fallback fixed prompts).
- LLM_Judge: behavior-only binary scoring (score: 0/1).
- EvaluationPipeline: runs all tests, stores per-case results, computes summary metrics.
Install dependencies in your environment:
cd /Users/gongjin/Downloads/LLM_course/multi-agent
pip install -r requirements.txt
Set the OpenAI key:
export OPENAI_API_KEY="..."
Build the local vector store from the copied ML textbook:
cd /Users/gongjin/Downloads/LLM_course/multi-agent
python src/rag.py --pdf data/machine-learning.pdf --out data/machine_learning_vector_store.json
Run the chatbot in the CLI:
cd /Users/gongjin/Downloads/LLM_course/multi-agent
RAG_BACKEND=local python src/agents.py
Run the Streamlit frontend:
cd /Users/gongjin/Downloads/LLM_course/multi-agent
streamlit run src/app_streamlit.py
The frontend can also build/rebuild the local vector store from the sidebar.
The old Pinecone-backed RAG path is still supported:
export OPENAI_API_KEY="..."
export PINECONE_API_KEY="..."
export PINECONE_INDEX_NAME="machine-learning-textbook"
export PINECONE_NAMESPACE="ns-2500"
RAG_BACKEND=pinecone python src/agents.py
Run evaluation with an existing or auto-generated test set:
cd /Users/gongjin/Downloads/LLM_course/multi-agent/src
python eval.py --test_set test_set.json --out eval_results.json --judge_model gpt-4.1-nano
Generate the test set only:
cd src
python eval.py --create_test_set --test_set test_set.json
eval_results.json contains:
- summary:
  - total, passed, accuracy
  - by_category with per-category totals and accuracies
- results:
  - full per-case records (user_input/conversation, bot_response, agent_path, score)
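As a rough illustration of how summary metrics of this shape could be derived from binary per-case scores, consider the sketch below. It is not the code in eval.py: the `summarize` function is hypothetical, the presence of a "category" field on each record is an assumption, and the inner layout of by_category (total, passed, accuracy) is guessed from the description above.

```python
from collections import defaultdict


def summarize(results):
    """Aggregate 0/1 scores into overall and per-category accuracy."""
    total = len(results)
    passed = sum(r["score"] for r in results)

    buckets = defaultdict(lambda: {"total": 0, "passed": 0})
    for r in results:
        bucket = buckets[r["category"]]  # "category" key is an assumption
        bucket["total"] += 1
        bucket["passed"] += r["score"]

    by_category = {
        name: {**b, "accuracy": b["passed"] / b["total"]}
        for name, b in buckets.items()
    }
    return {
        "total": total,
        "passed": passed,
        "accuracy": passed / total if total else 0.0,
        "by_category": by_category,
    }


# Made-up records for illustration only, not real evaluation output.
example_results = [
    {"category": "obnoxious", "score": 1},
    {"category": "obnoxious", "score": 1},
    {"category": "relevant", "score": 1},
    {"category": "relevant", "score": 0},
]
summary = summarize(example_results)
print(summary["accuracy"], summary["by_category"]["relevant"]["accuracy"])
```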
- Default chat/completion model in both scripts is gpt-4.1-nano.
- Embedding model in agents.py is text-embedding-3-small.
- Pinecone namespace fallback logic is implemented in Query_Agent if the preferred namespace has no matches.
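The namespace fallback idea can be sketched independently of the Pinecone client. In the sketch below, `search_with_fallback`, the `query_fn` wrapper, the "ns-backup" namespace, and the fake index are all hypothetical. Which namespaces Query_Agent actually falls back to, and in what order, is not stated here.

```python
from typing import Callable, Dict, List, Sequence


def search_with_fallback(
    query_fn: Callable[[str], List[dict]],
    preferred: str,
    fallbacks: Sequence[str],
) -> List[dict]:
    """Try the preferred namespace first, then each fallback in order."""
    for namespace in [preferred, *fallbacks]:
        matches = query_fn(namespace)
        if matches:
            return matches
    return []  # nothing matched in any namespace


# Fake index standing in for a real vector database query.
fake_index: Dict[str, List[dict]] = {
    "ns-2500": [],  # preferred namespace has no matches
    "ns-backup": [{"id": "chunk-7", "score": 0.82}],  # hypothetical namespace
}
queried: List[str] = []


def fake_query(namespace: str) -> List[dict]:
    queried.append(namespace)  # record which namespaces were tried
    return fake_index.get(namespace, [])


matches = search_with_fallback(fake_query, "ns-2500", ["ns-backup"])
print(queried, matches)
```

Passing the query function in as a parameter keeps the fallback loop testable without network access or API keys.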