
Academic Research Assistant: Professor-Specific Chatbot

A specialized chatbot system that emulates academic professors using a hybrid Retrieval-Augmented Generation (RAG) + Knowledge Graph architecture powered by Google's Gemini AI models.

Project Overview

This project builds a conversational AI system that:

  1. Collects research papers by a specific professor (currently using Risa Wechsler as an example)
  2. Processes papers into both vector embeddings and a knowledge graph
  3. Uses dual retrieval with intelligent fusion to provide comprehensive responses
  4. Hosts the chatbot through a web interface with conversation continuity

Architecture

The system combines vector search and graph traversal for comprehensive knowledge retrieval:

Hybrid RAG + Knowledge Graph Approach

Single Advanced Mode:

  • KG-Enriched Sequential: Knowledge Graph intelligence enhances vector search

KG-Enriched Sequential Pipeline

The system uses an intelligent sequential approach that combines the strengths of graph and vector retrieval (a minimal sketch follows this list):

  • Pipeline Flow: User Query → KG Retrieval → LLM Filter → Query Enrichment → Vector Search → Results
  • LLM Filtering: Uses Gemini 2.5 Flash (temperature=0.0) to filter KG results for relevance
  • Smart Enrichment: Original query enhanced with filtered KG context while preserving intent
  • Length Management: Intelligent truncation prevents token overflow while maintaining query quality
  • Fail-Fast Design: Clear error reporting without fallbacks that mask issues
  • Cost Optimized: Single LLM call per query for efficient operation
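
The sketch below illustrates this pipeline in Python. The names kg_client, llm, and vector_db are placeholders for the project's actual components, and the prompt wording and truncation budget are assumptions, not the repository's exact values:

    MAX_ENRICHED_CHARS = 2000  # assumed truncation budget

    def kg_enriched_retrieve(query: str, kg_client, llm, vector_db, k: int = 5):
        # 1. KG retrieval: candidate facts for the query
        kg_facts = kg_client.retrieve(query)

        # 2. Single LLM filter call (temperature=0.0) keeps relevant facts only
        prompt = (
            f"Question: {query}\n\nCandidate facts:\n"
            + "\n".join(kg_facts)
            + "\n\nReturn only the facts relevant to the question."
        )
        filtered = llm.generate(prompt, temperature=0.0)

        # 3. Enrich the original query, truncating only the appended
        #    background so the user's intent is never cut
        background = filtered[: max(0, MAX_ENRICHED_CHARS - len(query))]
        enriched = f"{query}\n\nRelevant background:\n{background}"

        # 4. Vector search over the enriched query; failures propagate
        #    directly (fail-fast), with no silent fallback path
        return vector_db.similarity_search(enriched, k=k)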

Conversational Agent Interface

All interactions run through a ReAct agent with full conversation memory (see the sketch after this list):

  • Session Memory: Maintains context across conversation turns using LangGraph
  • ReAct Pattern: Follows Thought → Action → Observation → Final Answer loop
  • Tool Integration: Uses document search tool with all retrieval modes
  • Session Isolation: Different conversations maintain separate contexts
  • Reasoning Transparency: Provides visible reasoning steps in responses
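
A minimal sketch of this setup using LangGraph's prebuilt ReAct agent; the tool body is a placeholder for the repository's actual retriever:

    from langchain_core.tools import tool
    from langchain_google_genai import ChatGoogleGenerativeAI
    from langgraph.checkpoint.memory import MemorySaver
    from langgraph.prebuilt import create_react_agent

    @tool
    def search_documents(query: str) -> str:
        """Search the professor's papers for passages relevant to the query."""
        return "...retrieved passages..."  # placeholder for the real retriever

    llm = ChatGoogleGenerativeAI(model="gemini-2.5-flash")
    agent = create_react_agent(llm, tools=[search_documents],
                               checkpointer=MemorySaver())

    # thread_id isolates sessions: each id carries its own checkpointed state
    config = {"configurable": {"thread_id": "session-42"}}
    agent.invoke({"messages": [("user", "What is dark matter?")]}, config)
    agent.invoke({"messages": [("user", "Why is it important?")]}, config)

Reusing the same thread_id on later calls replays the checkpointed conversation, which is what gives each session isolated, persistent memory.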

KG-Enriched Sequential Architecture

flowchart LR
    %% Input & Condensation (LLM)
    subgraph Input[" "]
      U["User Query"]:::io --> QC{{"Query Condenser<br/>LLM"}}:::llm
    end

    %% KG-Enriched Sequential Pipeline
    subgraph KGPipeline["KG-Enriched Sequential Pipeline"]
      direction TB
      G[("Knowledge Graph<br/>Neo4j")]:::store
      F{{"LLM Filter"}}:::llm
      E["Query Enrichment<br/>Original + KG Context"]:::algo
    end

    %% Vector Search
    subgraph VectorSearch["Enhanced Vector Search"]
      V[("Vector DB<br/>FAISS")]:::store
    end

    %% Answer Generation (LLM + Agent)
    subgraph Generation["ReAct Agent Generation"]
      direction TB
      R{{"ReAct Agent<br/>with Memory"}}:::llm
      A["Final Answer<br/>+ Sources + Reasoning"]:::io
    end

    %% Flow connections
    QC -->|"Standalone Question"| G
    G -->|"Raw KG Results"| F
    F -->|"Filtered Context"| E
    E -->|"Enriched Query"| V
    V -->|"Enhanced Results"| R
    R --> A

    %% Styles
    classDef io fill:#ffffff,stroke:#222,stroke-width:2px,color:#111
    classDef llm fill:#fff4e6,stroke:#b85c00,stroke-width:2px,color:#000
    classDef store fill:#e3f2fd,stroke:#1565c0,stroke-width:2px,color:#000
    classDef algo fill:#e8f5e9,stroke:#1b5e20,stroke-width:2px,color:#000

    style Input fill:none,stroke:none
    style KGPipeline fill:none,stroke:#ddd,stroke-width:1px
    style VectorSearch fill:none,stroke:#ddd,stroke-width:1px
    style Generation fill:none,stroke:#ddd,stroke-width:1px

Primary-Author Corpus Preference

The system strongly prioritizes the primary-author corpus (papers/) over the non‑primary set (papers_np/). This is applied as a post‑vector reranker inside the KG‑enriched pipeline and is fully configurable via environment variables.

  • Default behavior: ensure a minimum share of primary sources in the final top‑k (e.g., ≥80% of top‑5 when available), while preserving the original order within each corpus.
  • Configuration (env):
    • PRIMARY_AUTHOR_BIAS_ENABLED=true
    • PRIMARY_AUTHOR_PREFIX=papers/
    • PRIMARY_AUTHOR_MIN_SHARE=0.8
    • PRIMARY_AUTHOR_FINAL_K=5

This bias reflects that the target professor’s own papers should be favored when relevant, without excluding clearly superior non‑primary matches when primary supply is limited.
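
A minimal sketch of such a reranker, assuming each retrieved document exposes its source path in metadata (defaults mirror the environment variables above):

    import math

    def rerank_primary_bias(docs, prefix="papers/", min_share=0.8, final_k=5):
        """Guarantee a minimum share of primary-author sources in the final
        top-k while preserving the original order within each corpus."""
        primary = [d for d in docs if d.metadata["source"].startswith(prefix)]
        other = [d for d in docs if not d.metadata["source"].startswith(prefix)]

        # min_share=0.8, final_k=5 -> reserve 4 primary slots when available
        quota = min(len(primary), math.ceil(min_share * final_k))
        picked = primary[:quota] + other[: final_k - quota]

        # if the non-primary pool ran short, backfill with more primary docs
        if len(picked) < final_k:
            picked += primary[quota : quota + final_k - len(picked)]
        return picked[:final_k]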

Conversation Context Management

The system maintains conversation history through a dual-context approach (sketched after the list below):

  1. Document Retrieval Context

    • For follow-up questions, previous queries are included in the retrieval query
    • Example: If a user asks "What is dark matter?" followed by "Why is it important?", the retrieval query becomes "Context: What is dark matter? Question: Why is it important?"
    • This helps the system retrieve documents relevant to the entire conversation flow
  2. Response Generation Context

    • Stores the last 3 conversation exchanges (question-answer pairs)
    • Includes this conversational history in the prompt to the LLM
    • Explicitly instructs the model to maintain continuity with previous exchanges
    • Preserves the "thread" of conversation across multiple turns
  3. Single-Stage RAG Implementation

    • Uses a custom document QA chain with direct document processing
    • Manually retrieves documents using context-enhanced queries
    • Combines system instructions, conversation history, and retrieved documents in a carefully crafted prompt
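
A sketch of the two context mechanisms (helper names are illustrative; history is a list of question-answer pairs):

    def build_retrieval_query(question, history):
        # Retrieval context: fold prior questions into follow-up queries
        if not history:
            return question
        prior = " ".join(q for q, _ in history)
        return f"Context: {prior} Question: {question}"

    def build_prompt(system, question, history, docs):
        # Generation context: keep only the last 3 exchanges
        turns = "\n".join(f"Q: {q}\nA: {a}" for q, a in history[-3:])
        passages = "\n\n".join(d.page_content for d in docs)
        return (f"{system}\n\nConversation so far:\n{turns}\n\n"
                f"Retrieved passages:\n{passages}\n\nQuestion: {question}")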

This architecture ensures the chatbot can handle follow-up questions naturally, maintain professor-specific knowledge, and provide responses that feel like a cohesive conversation rather than isolated Q&A pairs.

Components

Core System

  • chatbot.py: Main chatbot with dual retrieval modes and query condensation
  • llm_provider.py: Gemini AI integration for embeddings and text generation
  • app.py: Web application with conversation interface

Knowledge Processing

  • paper_collector.py: Downloads research papers by target professor
  • rag_processor.py: Creates FAISS vector database from papers
  • graph_rag/index.py: Builds Neo4j knowledge graph with entity extraction
  • graph_rag/neo4j_client.py: Graph retrieval with semantic neighborhood expansion (illustrated below)
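
For illustration, a one-hop neighborhood expansion over matched entities might look like the following Cypher; the actual query and schema in graph_rag/neo4j_client.py may differ:

    # Illustrative Cypher only; $entities holds names matched from the query.
    NEIGHBORHOOD_QUERY = """
    MATCH (e:Entity)-[r]-(n:Entity)
    WHERE e.name IN $entities
    RETURN e.name AS entity, type(r) AS relation, n.name AS neighbor
    LIMIT $limit
    """

    # e.g. with the official neo4j driver:
    # session.run(NEIGHBORHOOD_QUERY, entities=["dark matter"], limit=25)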

Fusion & Retrieval

  • retrieval/fusion.py: Reciprocal Rank Fusion algorithm and token budget management (sketched below)
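
Reciprocal Rank Fusion scores each document as a sum of 1/(k + rank) over the retrievers that returned it; a minimal sketch of the algorithm (the signature is illustrative, not the module's actual interface):

    from collections import defaultdict

    def reciprocal_rank_fusion(ranked_lists, k=60):
        """Fuse ranked lists of doc ids: score(d) = sum_i 1 / (k + rank_i(d))."""
        scores = defaultdict(float)
        for ranking in ranked_lists:
            for rank, doc_id in enumerate(ranking, start=1):
                scores[doc_id] += 1.0 / (k + rank)
        return sorted(scores, key=scores.get, reverse=True)

    # fused = reciprocal_rank_fusion([faiss_ids, neo4j_ids])

The constant k=60 is the value commonly used in the RRF literature; whether this module uses the same default is an assumption.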

Testing

  • test_dual_mode_integration.py: Integration tests for dual retrieval
  • test_phase3_dual_retrieval.py: Fusion algorithm unit tests
  • test_real_e2e_dual_mode.py: End-to-end tests with real components
  • test_graphrag_comprehensive.py: Knowledge graph functionality tests

Setup Instructions

Quick Start (Dual Mode)

  1. Clone the repository:

    git clone https://github.com/SandyYuan/astro-rag.git
    cd astro-rag
  2. Install dependencies:

    pip install -r requirements.txt
  3. Set up environment variables in .env file:

    # Google AI
    GOOGLE_API_KEY=your_google_api_key_here
    
    # Neo4j (for graph mode)
    NEO4J_URI=bolt://localhost:7687
    NEO4J_USER=neo4j
    NEO4J_PASSWORD=your_password
    
    # System automatically uses KG-enriched sequential retrieval (only mode available)
    # Agent mode with conversation memory is always enabled
  4. Set up Neo4j (for graph functionality):

    # Install Neo4j Desktop or use Docker
    docker run -p 7474:7474 -p 7687:7687 -e NEO4J_AUTH=neo4j/password neo4j:5.15
  5. Build the knowledge base:

    # Create FAISS vector database
    python rag_processor.py
    
    # Build Neo4j knowledge graph
    python -m graph_rag.index
  6. Start the web application:

    python app.py
  7. Access the chatbot at http://localhost:8000

If you would like to build your own literature database or emulate a different professor:

  1. Configure the target professor (defaults to Risa Wechsler as an example):

    # In paper_collector.py, modify:
    collector = PaperCollector(author_name="Professor Name")
  2. Run the paper collector to gather research content:

    python paper_collector.py
    
  3. Process the papers to build the RAG system:

    python rag_processor.py
    
  4. Start the web application:

    python app.py
    
  5. Access the chatbot at http://localhost:8000

Usage

Once the web application is running, you can interact with the chatbot through the web interface:

  1. Type your question in the input field
  2. Press "Send" or hit Enter
  3. The chatbot will respond based on the professor's research papers
  4. Sources used to generate the response will be displayed below each answer

Customization

Adjusting the Chatbot Personality

You can modify the system prompt in chatbot.py to refine how the chatbot emulates the professor.

Adding More Papers

To expand the knowledge base, run the paper collector again with a higher max_papers value:

from paper_collector import PaperCollector  # local module from this repo

collector = PaperCollector(author_name="Professor Name")
papers = collector.collect_papers(max_papers=50)

Then reprocess the papers to update the vector database.

Dependencies

Core AI & ML

  • Google Generative AI: Gemini 2.5 Flash for text generation and text-embedding-004 for embeddings
  • LangChain: RAG pipeline and document processing
  • FAISS: High-performance vector similarity search
  • Neo4j: Knowledge graph database with Cypher queries

Fusion & Retrieval

  • Reciprocal Rank Fusion: Multi-retriever result combination
  • Maximum Marginal Relevance (MMR): Diverse document selection (sketched below)
  • Semantic Entity Extraction: LLM-powered knowledge graph construction
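
For reference, a minimal greedy MMR selection loop over precomputed similarity scores (purely illustrative, not this project's implementation):

    def mmr_select(query_sims, doc_sims, k=5, lam=0.5):
        """Greedy MMR: balance query relevance against redundancy with
        already-selected documents (lam=1.0 is pure relevance)."""
        selected = []
        candidates = list(range(len(query_sims)))
        while candidates and len(selected) < k:
            def score(i):
                redundancy = max((doc_sims[i][j] for j in selected), default=0.0)
                return lam * query_sims[i] - (1 - lam) * redundancy
            best = max(candidates, key=score)
            selected.append(best)
            candidates.remove(best)
        return selected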

Web & Infrastructure

  • FastAPI: Modern web framework for the chat interface
  • Scholarly: Academic paper collection from Google Scholar
  • Python 3.11+: Core runtime environment

Testing

  • pytest: Comprehensive test suite (unit, integration, E2E)
  • unittest.mock: Component mocking for isolated testing

Performance & Capabilities

Retrieval Quality

  • FAISS Mode: Excellent for document similarity and methodological details
  • Neo4j Mode: Superior for entity relationships and scientific parameter queries
  • Dual Mode: Best overall quality with ~2x more diverse sources

Performance Metrics

  • Response Time: 8-18 seconds for complex queries
  • Dual Mode Overhead: ~4% slower than a single mode
  • Source Coverage: 5-10 sources per response with intelligent deduplication
  • Conversation Continuity: Multi-turn context resolution with query condensation

Key Features

  • Query Condensation: Resolves conversational ambiguity ("What about S8?" → "What is the S8 tension in cosmology?"); see the sketch after this list
  • Intelligent Fusion: Combines complementary sources from vector and graph retrieval
  • Scientific Accuracy: Entity filtering ensures focus on scientific parameters vs generic terms
  • Comprehensive Testing: 100% test coverage with real component validation
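
A sketch of what the condensation step might look like; the prompt wording and the llm.generate wrapper are assumptions, not the repository's exact code:

    CONDENSE_PROMPT = """Given the conversation below, rewrite the final
    follow-up as a single standalone question.

    Conversation:
    {history}

    Follow-up question: {question}
    Standalone question:"""

    def condense_query(llm, history, question):
        # history is a list of (question, answer) pairs
        prompt = CONDENSE_PROMPT.format(
            history="\n".join(f"Q: {q}\nA: {a}" for q, a in history),
            question=question,
        )
        return llm.generate(prompt, temperature=0.0)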

Notes

  • Requires Google API key with Gemini access and Neo4j database
  • Optimized for scientific/academic content with entity-relationship focus
  • Production-ready with comprehensive test coverage
  • For educational and research purposes
