A specialized chatbot system that emulates academic professors using a hybrid Retrieval-Augmented Generation (RAG) + Knowledge Graph architecture powered by Google's Gemini AI models.
This project builds a conversational AI system that:
- Collects research papers by a specific professor (currently using Risa Wechsler as an example)
- Processes papers into both vector embeddings and a knowledge graph
- Uses dual retrieval with intelligent fusion to provide comprehensive responses
- Hosts the chatbot through a web interface with conversation continuity
The system combines vector search and graph traversal for comprehensive knowledge retrieval:
Single Advanced Mode:
- KG-Enriched Sequential: knowledge graph context is used to enrich the vector search query

The system uses a sequential approach that combines the strengths of both retrieval methods:
- Pipeline Flow: User Query → KG Retrieval → LLM Filter → Query Enrichment → Vector Search → Results
- LLM Filtering: Uses Gemini 2.5 Flash (temperature=0.0) to filter KG results for relevance
- Smart Enrichment: Original query enhanced with filtered KG context while preserving intent
- Length Management: Intelligent truncation prevents token overflow while maintaining query quality
- Fail-Fast Design: Clear error reporting without fallbacks that mask issues
- Cost Optimized: Single LLM call per query for efficient operation
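A minimal sketch of this pipeline is shown below. The function names and signatures (`kg_enriched_search`, `kg_client.retrieve`, and the character budget) are illustrative placeholders, not the project's actual interfaces, which live in `chatbot.py` and `graph_rag/neo4j_client.py`.

```python
# Illustrative sketch of the KG-enriched sequential pipeline.
# Names and signatures are hypothetical, not the project's real API.

def kg_enriched_search(query: str, kg_client, llm, vector_store, k: int = 5):
    # 1. KG retrieval: pull candidate facts from the Neo4j knowledge graph
    kg_results = kg_client.retrieve(query)

    # 2. LLM filter: keep only KG facts relevant to the query (temperature=0.0)
    filter_prompt = (
        "Question: {q}\n\nFacts:\n{facts}\n\n"
        "Return only the facts that help answer the question."
    ).format(q=query, facts="\n".join(kg_results))
    filtered_context = llm.invoke(filter_prompt).content

    # 3. Query enrichment: append filtered KG context, truncating so the
    #    enriched query stays within the token budget while preserving intent
    max_context_chars = 2000  # illustrative budget
    enriched_query = f"{query}\n\nRelevant background:\n{filtered_context[:max_context_chars]}"

    # 4. Vector search over the enriched query
    return vector_store.similarity_search(enriched_query, k=k)
```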
All interactions use a ReAct agent with full conversation memory:
- Session Memory: Maintains context across conversation turns using LangGraph
- ReAct Pattern: Follows Thought → Action → Observation → Final Answer loop
- Tool Integration: Uses document search tool with all retrieval modes
- Session Isolation: Different conversations maintain separate contexts
- Reasoning Transparency: Provides visible reasoning steps in responses
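The following is a minimal sketch of a LangGraph ReAct agent with per-session memory, assuming the standard `create_react_agent` helper and an in-memory checkpointer; the model string and the tool body are placeholders, not the project's code.

```python
# Sketch: LangGraph ReAct agent with per-thread conversation memory.
from langchain_core.tools import tool
from langchain_google_genai import ChatGoogleGenerativeAI
from langgraph.checkpoint.memory import MemorySaver
from langgraph.prebuilt import create_react_agent

@tool
def search_documents(query: str) -> str:
    """Search the professor's papers via the KG-enriched retrieval pipeline."""
    return "...retrieved passages..."  # placeholder for the real retrieval call

llm = ChatGoogleGenerativeAI(model="gemini-2.5-flash", temperature=0.0)  # illustrative model name
agent = create_react_agent(llm, tools=[search_documents], checkpointer=MemorySaver())

# Each thread_id gets an isolated conversation history (session isolation).
config = {"configurable": {"thread_id": "session-42"}}
reply = agent.invoke({"messages": [("user", "What is dark matter?")]}, config)
print(reply["messages"][-1].content)
```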
```mermaid
flowchart LR
    %% Input & Condensation (LLM)
    subgraph Input[" "]
        U["User Query"]:::io --> QC{{"Query Condenser<br/>LLM"}}:::llm
    end

    %% KG-Enriched Sequential Pipeline
    subgraph KGPipeline["KG-Enriched Sequential Pipeline"]
        direction TB
        G[("Knowledge Graph<br/>Neo4j")]:::store
        F{{"LLM Filter"}}:::llm
        E["Query Enrichment<br/>Original + KG Context"]:::algo
    end

    %% Vector Search
    subgraph VectorSearch["Enhanced Vector Search"]
        V[("Vector DB<br/>FAISS")]:::store
    end

    %% Answer Generation (LLM + Agent)
    subgraph Generation["ReAct Agent Generation"]
        direction TB
        R{{"ReAct Agent<br/>with Memory"}}:::llm
        A["Final Answer<br/>+ Sources + Reasoning"]:::io
    end

    %% Flow connections
    QC -->|"Standalone Question"| G
    G -->|"Raw KG Results"| F
    F -->|"Filtered Context"| E
    E -->|"Enriched Query"| V
    V -->|"Enhanced Results"| R
    R --> A

    %% Styles
    classDef io fill:#ffffff,stroke:#222,stroke-width:2px,color:#111
    classDef llm fill:#fff4e6,stroke:#b85c00,stroke-width:2px,color:#000
    classDef store fill:#e3f2fd,stroke:#1565c0,stroke-width:2px,color:#000
    classDef algo fill:#e8f5e9,stroke:#1b5e20,stroke-width:2px,color:#000
    style Input fill:none,stroke:none
    style KGPipeline fill:none,stroke:#ddd,stroke-width:1px
    style VectorSearch fill:none,stroke:#ddd,stroke-width:1px
    style Generation fill:none,stroke:#ddd,stroke-width:1px
```
The system strongly prioritizes the primary-author corpus (`papers/`) over the non-primary set (`papers_np/`). This priority is applied as a post-vector reranker inside the KG-enriched pipeline and is fully configurable via environment variables.
- Default behavior: ensure a minimum share of primary sources in the final top‑k (e.g., ≥80% of top‑5 when available), while preserving the original order within each corpus.
- Configuration (env):

  ```bash
  PRIMARY_AUTHOR_BIAS_ENABLED=true
  PRIMARY_AUTHOR_PREFIX=papers/
  PRIMARY_AUTHOR_MIN_SHARE=0.8
  PRIMARY_AUTHOR_FINAL_K=5
  ```
This bias reflects that the target professor’s own papers should be favored when relevant, without excluding clearly superior non‑primary matches when primary supply is limited.
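A minimal sketch of such a reranker is shown below, driven by the environment variables listed above. The function name and metadata layout (`rerank`, `metadata["source"]`) are illustrative assumptions, not the project's exact implementation.

```python
import os

# Illustrative post-vector reranker enforcing a minimum share of primary-author
# sources in the final top-k; not the project's exact code.
ENABLED = os.getenv("PRIMARY_AUTHOR_BIAS_ENABLED", "true").lower() == "true"
PREFIX = os.getenv("PRIMARY_AUTHOR_PREFIX", "papers/")
MIN_SHARE = float(os.getenv("PRIMARY_AUTHOR_MIN_SHARE", "0.8"))
FINAL_K = int(os.getenv("PRIMARY_AUTHOR_FINAL_K", "5"))

def rerank(docs):
    """docs: vector-search results ordered by relevance; each has metadata['source'] (assumed)."""
    if not ENABLED:
        return docs[:FINAL_K]
    primary = [d for d in docs if d.metadata["source"].startswith(PREFIX)]
    other = [d for d in docs if not d.metadata["source"].startswith(PREFIX)]
    # Reserve the required share of slots for primary-author papers (when available),
    # then fill the rest with the best non-primary matches, preserving order
    # within each corpus.
    n_primary = min(len(primary), max(1, round(MIN_SHARE * FINAL_K)))
    selected = primary[:n_primary] + other[: FINAL_K - n_primary]
    return selected[:FINAL_K]
```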
The system maintains conversation history through an innovative dual-context approach:
- **Document Retrieval Context**
  - For follow-up questions, previous queries are included in the retrieval query
  - Example: if a user asks "What is dark matter?" followed by "Why is it important?", the retrieval query becomes "Context: What is dark matter? Question: Why is it important?"
  - This helps the system retrieve documents relevant to the entire conversation flow
- **Response Generation Context**
  - Stores the last 3 conversation exchanges (question-answer pairs)
  - Includes this conversational history in the prompt to the LLM
  - Explicitly instructs the model to maintain continuity with previous exchanges
  - Preserves the "thread" of conversation across multiple turns
- **Single-Stage RAG Implementation**
  - Uses a custom document QA chain with direct document processing
  - Manually retrieves documents using context-enhanced queries
  - Combines system instructions, conversation history, and retrieved documents in a carefully crafted prompt
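A minimal sketch of how these two contexts could be assembled is shown below; the helper names and prompt wording are illustrative, not lifted from `chatbot.py`.

```python
# Sketch of the dual-context approach: one context for retrieval, one for generation.

def build_retrieval_query(history, question):
    """Fold previous user questions into the retrieval query for follow-ups."""
    if not history:
        return question
    prev_questions = " ".join(q for q, _ in history)
    return f"Context: {prev_questions} Question: {question}"

def build_prompt(system_instructions, history, question, documents):
    """Combine system instructions, the last 3 exchanges, and retrieved documents."""
    recent = history[-3:]  # last 3 question-answer pairs
    convo = "\n".join(f"User: {q}\nAssistant: {a}" for q, a in recent)
    docs = "\n\n".join(d.page_content for d in documents)
    return (
        f"{system_instructions}\n\n"
        f"Conversation so far:\n{convo}\n\n"
        f"Relevant excerpts from the professor's papers:\n{docs}\n\n"
        f"Maintain continuity with the conversation and answer: {question}"
    )
```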
This architecture ensures the chatbot can handle follow-up questions naturally, maintain professor-specific knowledge, and provide responses that feel like a cohesive conversation rather than isolated Q&A pairs.
- `chatbot.py`: Main chatbot with dual retrieval modes and query condensation
- `llm_provider.py`: Gemini AI integration for embeddings and text generation
- `app.py`: Web application with conversation interface

- `paper_collector.py`: Downloads research papers by the target professor
- `rag_processor.py`: Creates the FAISS vector database from papers
- `graph_rag/index.py`: Builds the Neo4j knowledge graph with entity extraction
- `graph_rag/neo4j_client.py`: Graph retrieval with semantic neighborhood expansion

- `retrieval/fusion.py`: Reciprocal Rank Fusion algorithm and token budget management

- `test_dual_mode_integration.py`: Integration tests for dual retrieval
- `test_phase3_dual_retrieval.py`: Fusion algorithm unit tests
- `test_real_e2e_dual_mode.py`: End-to-end tests with real components
- `test_graphrag_comprehensive.py`: Knowledge graph functionality tests
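`retrieval/fusion.py` implements Reciprocal Rank Fusion; the snippet below is the textbook RRF formula (with the conventional constant k=60) as a generic illustration, not the project's exact code, which also handles deduplication and token budgets.

```python
from collections import defaultdict

def reciprocal_rank_fusion(result_lists, k: int = 60):
    """Score each document by summing 1 / (k + rank) across all rankings.

    result_lists: list of ranked lists of document IDs (one per retriever).
    """
    scores = defaultdict(float)
    for ranking in result_lists:
        for rank, doc_id in enumerate(ranking, start=1):
            scores[doc_id] += 1.0 / (k + rank)
    # Highest fused score first
    return sorted(scores, key=scores.get, reverse=True)

# Example: fuse a vector-search ranking with a graph-search ranking
fused = reciprocal_rank_fusion([["d1", "d2", "d3"], ["d2", "d4", "d1"]])
```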
1. Clone the repository:

   ```bash
   git clone https://github.com/SandyYuan/astro-rag.git
   cd astro-rag
   ```

2. Install dependencies:

   ```bash
   pip install -r requirements.txt
   ```

3. Set up environment variables in a `.env` file:

   ```bash
   # Google AI
   GOOGLE_API_KEY=your_google_api_key_here

   # Neo4j (for graph mode)
   NEO4J_URI=bolt://localhost:7687
   NEO4J_USER=neo4j
   NEO4J_PASSWORD=your_password

   # The system automatically uses KG-enriched sequential retrieval (the only mode available)
   # Agent mode with conversation memory is always enabled
   ```

4. Set up Neo4j (for graph functionality):

   ```bash
   # Install Neo4j Desktop or use Docker
   docker run -p 7474:7474 -p 7687:7687 -e NEO4J_AUTH=neo4j/password neo4j:5.15
   ```

5. Build the knowledge base:

   ```bash
   # Create the FAISS vector database
   python rag_processor.py

   # Build the Neo4j knowledge graph
   python -m graph_rag.index
   ```

6. Start the web application:

   ```bash
   python app.py
   ```

7. Access the chatbot at `http://localhost:8000`.
If you would like to run your own literature database or emulate a different professor:
1. Configure the target professor (defaults to Risa Wechsler as an example):

   ```python
   # In paper_collector.py, modify:
   collector = PaperCollector(author_name="Professor Name")
   ```

2. Run the paper collector to gather research content:

   ```bash
   python paper_collector.py
   ```

3. Process the papers to build the RAG system:

   ```bash
   python rag_processor.py
   ```

4. Start the web application:

   ```bash
   python app.py
   ```

5. Access the chatbot at `http://localhost:8000`.
Once the web application is running, you can interact with the chatbot through the web interface:
- Type your question in the input field
- Press "Send" or hit Enter
- The chatbot will respond based on the professor's research papers
- Sources used to generate the response will be displayed below each answer
You can modify the system prompt in `chatbot.py` to refine how the chatbot emulates the professor.

To expand the knowledge base, run the paper collector again with a higher `max_papers` value:

```python
collector = PaperCollector(author_name="Professor Name")
papers = collector.collect_papers(max_papers=50)
```

Then rerun `python rag_processor.py` to reprocess the papers and update the vector database.
- Google Generative AI: Gemini 2.5 Flash for text generation and embeddings (text-embedding-004)
- LangChain: RAG pipeline and document processing
- FAISS: High-performance vector similarity search
- Neo4j: Knowledge graph database with Cypher queries
- Reciprocal Rank Fusion: Multi-retriever result combination
- Maximum Marginal Relevance (MMR): Diverse document selection
- Semantic Entity Extraction: LLM-powered knowledge graph construction
- FastAPI: Modern web framework for the chat interface
- Scholarly: Academic paper collection from Google Scholar
- Python 3.11+: Core runtime environment
- pytest: Comprehensive test suite (unit, integration, E2E)
- unittest.mock: Component mocking for isolated testing
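As an illustration of the FAISS + MMR combination, the sketch below loads a local FAISS index through LangChain and retrieves with Maximum Marginal Relevance. The index path, embedding model identifier, and search parameters are example values, not the project's settings.

```python
# Illustrative: FAISS index loaded via LangChain, queried with MMR for diverse results.
from langchain_community.vectorstores import FAISS
from langchain_google_genai import GoogleGenerativeAIEmbeddings

embeddings = GoogleGenerativeAIEmbeddings(model="models/text-embedding-004")
store = FAISS.load_local("faiss_index", embeddings, allow_dangerous_deserialization=True)

# MMR balances relevance against diversity among the selected documents.
retriever = store.as_retriever(
    search_type="mmr",
    search_kwargs={"k": 5, "fetch_k": 20, "lambda_mult": 0.5},
)
docs = retriever.invoke("What constraints does weak lensing place on sigma_8?")
```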
- FAISS Mode: Excellent for document similarity and methodological details
- Neo4j Mode: Superior for entity relationships and scientific parameter queries
- Dual Mode: Best overall quality with ~2x more diverse sources
- Response Time: 8-18 seconds for complex queries
- Dual Mode Overhead: Only 4% slower than single modes
- Source Coverage: 5-10 sources per response with intelligent deduplication
- Conversation Continuity: Multi-turn context resolution with query condensation
- Query Condensation: Resolves conversational ambiguity ("What about S8?" → "What is the S8 tension in cosmology?")
- Intelligent Fusion: Combines complementary sources from vector and graph retrieval
- Scientific Accuracy: Entity filtering ensures focus on scientific parameters vs generic terms
- Comprehensive Testing: 100% test coverage with real component validation
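A minimal sketch of the query-condensation step referenced above is shown below; the prompt wording and model string are illustrative, not the exact ones used in `chatbot.py`.

```python
# Sketch: rewrite a conversational follow-up into a standalone question.
from langchain_google_genai import ChatGoogleGenerativeAI

llm = ChatGoogleGenerativeAI(model="gemini-2.5-flash", temperature=0.0)  # illustrative model name

def condense(history: list[tuple[str, str]], follow_up: str) -> str:
    convo = "\n".join(f"User: {q}\nAssistant: {a}" for q, a in history)
    prompt = (
        "Given the conversation below, rewrite the final user question as a "
        "standalone question that can be understood without the conversation.\n\n"
        f"{convo}\n\nFollow-up question: {follow_up}\n\nStandalone question:"
    )
    return llm.invoke(prompt).content

# e.g. condense([("What is the sigma_8 tension?", "...")], "What about S8?")
# could yield something like "What is the S8 tension in cosmology?"
```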
- Requires Google API key with Gemini access and Neo4j database
- Optimized for scientific/academic content with entity-relationship focus
- Production-ready with comprehensive test coverage
- For educational and research purposes