An AI-powered system that answers questions about Kafka, React, and Spark documentation using retrieval-augmented generation (RAG) with a conversational interface.
- Conversational Interface: Multi-turn conversations with follow-up questions
- Documentation Search: Processes HTML and Markdown documentation files
- Vector-based Retrieval: Uses FAISS and TF-IDF for efficient information retrieval
- LLM Integration: Uses OpenAI for answer generation and conversation management
- Citation System: Provides proper citations to source documentation
- Streamlit UI: Interactive chat interface with conversation memory
- Python 3.10+
- Streamlit
- OpenAI API key
- FAISS
- BeautifulSoup4
- Markdown
- scikit-learn
- NLTK
- python-dotenv
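For reference, a requirements.txt matching this list might look as follows (the package names, such as `faiss-cpu` for FAISS, and the absence of version pins are assumptions; pin versions as needed):

```
streamlit
openai
faiss-cpu
beautifulsoup4
markdown
scikit-learn
nltk
python-dotenv
```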
- Extract the zip file:

```bash
unzip documentation-assistant.zip
cd documentation-assistant
```

- Create and activate a virtual environment:

```bash
# Using venv
python -m venv venv
source venv/bin/activate  # On Windows: venv\Scripts\activate

# OR using conda
conda create -n doc-assistant python=3.10
conda activate doc-assistant
```

- Install the required dependencies:

```bash
pip install -r requirements.txt
```

- Set up environment variables:

```bash
# Create a .env file in the project root
echo "OPENAI_API_KEY=your_api_key_here" > .env
```
Before using the AI assistant, you need to process the documentation files:

- Place your documentation files in the files/ directory (HTML and/or Markdown formats)
- Run the document processing script:

```bash
python process_documents.py
```

This will:
- Parse all documentation files
- Extract content and metadata
- Create document chunks
- Build a searchable index
- Save the processed documents and index
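In rough outline, the processing stage can be pictured like this (a simplified sketch; the function names and chunk sizes are illustrative, not the actual API of document_parser.py or document_indexer.py):

```python
# Simplified sketch of document processing (illustrative names and sizes)
from pathlib import Path

def chunk_text(text: str, size: int = 500, overlap: int = 50) -> list[str]:
    """Split text into overlapping word-based chunks for indexing."""
    words = text.split()
    step = size - overlap
    return [" ".join(words[i:i + size]) for i in range(0, len(words), step)]

documents = []
for path in Path("files").rglob("*"):
    if path.suffix.lower() in {".html", ".md"}:
        raw = path.read_text(encoding="utf-8")  # HTML/Markdown parsing omitted
        for chunk in chunk_text(raw):
            documents.append({"source": path.name, "content": chunk})
# The chunks are then indexed (TF-IDF / FAISS) and saved under processed_files/
```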
To start the Streamlit chat interface:

```bash
python -m streamlit run streamlit_app.py
```

This will open a browser window with the chat interface where you can ask questions about the documentation.
- Document Processing Pipeline:
  - document_parser.py: Parses HTML and Markdown documents
  - document_indexer.py: Creates searchable indices from processed documents
  - process_documents.py: Main script to process all documentation
- Query Processing & LLM Integration:
  - query_answerer.py: Core RAG implementation with LLM integration
  - OpenAI API integration for answer generation
- User Interface:
  - streamlit_app.py: Streamlit-based conversational interface
- Query Clarity Check:
  - Incoming user queries are analyzed for clarity and specificity
  - If a query is unclear, follow-up questions are generated
- Context-Aware Retrieval:
  - Conversation history is used to enhance retrieval
  - Queries are reformulated based on context when appropriate
- RAG Processing (a minimal sketch follows this list):
  - Relevant documents are retrieved from the vector store
  - The retrieved documents and the query are sent to the LLM
  - The LLM generates a comprehensive answer with citations
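Here is a minimal sketch of that RAG step using the OpenAI Python client; the `retriever.search` interface and the prompt wording are assumptions, not the actual code in query_answerer.py:

```python
# Minimal RAG sketch (assumed interfaces; not the actual query_answerer.py code)
from openai import OpenAI

client = OpenAI()  # picks up OPENAI_API_KEY from the environment

def answer(query: str, retriever, k: int = 4) -> str:
    chunks = retriever.search(query, k=k)  # hypothetical: top-k scored chunks
    context = "\n\n".join(f"[{c['source']}] {c['content']}" for c in chunks)
    response = client.chat.completions.create(
        model="gpt-3.5-turbo",
        messages=[
            {"role": "system",
             "content": "Answer from the context below and cite sources "
                        "in [brackets].\n\n" + context},
            {"role": "user", "content": query},
        ],
    )
    return response.choices[0].message.content
```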
The system uses several carefully designed prompts; an illustrative sketch of the clarity check follows the list:
- Query Clarity Prompt:
  - Determines if a user question is specific enough to search for
  - Generates follow-up questions for ambiguous queries
  - Focuses on extracting the single most important missing piece of information
- Query Reformulation Prompt:
  - Combines original questions with follow-up clarifications
  - Creates coherent search queries that focus on documentation terms
  - Ensures specific technologies (Kafka, React, Spark) are properly prioritized
- Context Consideration Prompt:
  - Detects when a query references the previous conversation
  - Enhances queries with relevant context from conversation history
  - Handles pronouns and implicit references
- Answer Generation Prompt:
  - Instructs the model to use HTML formatting for better display
  - Citation guidelines: cite only when information is actually used
  - Specific formatting for different response types (lists, code, etc.)
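For illustration, a clarity check could be implemented along these lines; the prompt wording here is a stand-in, not the project's actual prompt:

```python
# Illustrative clarity-check prompt (a stand-in, not the project's wording)
CLARITY_PROMPT = """Decide whether the user's question is specific enough to search
the Kafka, React, and Spark documentation. If it is, reply with the single word
CLEAR. If not, reply with one follow-up question that asks for the single most
important missing piece of information.

Question: {question}"""

def check_clarity(client, question: str) -> str:
    response = client.chat.completions.create(
        model="gpt-3.5-turbo",
        messages=[{"role": "user",
                   "content": CLARITY_PROMPT.format(question=question)}],
    )
    return response.choices[0].message.content.strip()
```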
- Query Reformulation Accuracy:
  - GPT-3.5-turbo sometimes over-interprets minimal context
  - It may connect related technologies even when they are not explicitly mentioned
  - Example: when asked about "streaming", it might pull in both Kafka and Spark contexts
- Context Window Limitations:
  - Conversation history is limited by the model's context window
  - Very long conversations may lose earlier context
- Citation Precision:
  - Citations are based on retrieved documents, not the model's knowledge
  - The system may not always cite the most relevant source
- Follow-up Question Quality:
  - Follow-up questions might sometimes be too generic
  - Multiple rounds may be needed for very ambiguous queries
These limitations are acceptable for an MVP and could be addressed in future versions with more advanced models (GPT-4) or fine-tuning.
Based on analysis of conversation patterns, the following improvements are recommended for future iterations:
- Implement a pre-processing step that analyzes queries for partial technology matches
- Create a mapping of common terms to technologies (e.g., "cluster" → check all technologies), as sketched after this list
- Use a lightweight classifier to detect the likely technology domain even with ambiguous terms
- Add common synonyms and related concepts to each technology (e.g., "node", "instance", "cluster")
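A minimal sketch of such a mapping (the term sets are illustrative assumptions):

```python
# Illustrative term-to-technology mapping for query pre-processing
TECH_TERMS = {
    "kafka": {"broker", "topic", "partition", "consumer", "offset"},
    "react": {"component", "hook", "props", "state", "jsx"},
    "spark": {"rdd", "dataframe", "executor", "driver", "shuffle"},
}

def candidate_technologies(query: str) -> list[str]:
    """Return the technologies whose characteristic terms appear in the query."""
    q = query.lower()
    matches = [tech for tech, terms in TECH_TERMS.items()
               if any(term in q for term in terms)]
    # Ambiguous terms such as "cluster" belong to several technologies, so
    # fall back to checking all of them when nothing matches.
    return matches or list(TECH_TERMS)
```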
- Modify the query clarity prompt to detect domain-specific but vague queries
- When no results are found, have the LLM generate follow-up clarification options first
- Use a confidence threshold: if retrieval confidence falls below a set level, ask for clarification before responding negatively (see the sketch after this list)
- Include "did you mean..." suggestions based on similar terms in the documentation
- Update the HTML formatting in the prompt to use superscript tags for citations (example markup below)
- Create a more elegant styling for citations with smaller, less intrusive formatting
- Consider moving detailed citations to footnotes with hover effects for details
- Use a more natural in-text citation style (e.g., "According to Spark documentation...")
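For instance, the answer prompt could ask the model for markup like the following (illustrative only; the anchor id and title are assumptions):

```python
# Illustrative superscript citation markup the answer prompt could request
EXAMPLE_ANSWER = (
    "Spark executors run tasks and keep data in memory for the application"
    '<sup><a href="#ref-1" title="Spark documentation">[1]</a></sup>.'
)
```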
These enhancements could be implemented with relatively minor changes to the existing prompts and UI elements, without requiring architectural changes to the overall system.
- document_parser.py: Parses HTML and Markdown documents
- document_indexer.py: Creates searchable indices from processed documents
- query_answerer.py: Answers questions using the document index and LLM
- process_documents.py: Main script to process all documentation
- streamlit_app.py: Streamlit chat interface
- files/: Source documentation files
- processed_files/: Directory containing processed documents and indices
- Upgrade to embedding-based vector search (sketched below)
- Add re-ranking of retrieved documents
- Implement caching for frequent queries
- Switch to a more advanced LLM like GPT-4
- Add document-grounded fact checking
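As one example of the first item, an embedding-based index could be built with FAISS and OpenAI embeddings; a sketch under assumed model and package choices (`text-embedding-3-small`, `faiss-cpu`):

```python
# Sketch: embedding-based vector search with FAISS (assumed model and packages)
import faiss
import numpy as np
from openai import OpenAI

client = OpenAI()

def embed(texts: list[str]) -> np.ndarray:
    response = client.embeddings.create(model="text-embedding-3-small", input=texts)
    return np.array([item.embedding for item in response.data], dtype="float32")

chunks = ["Kafka brokers store topic partitions.",
          "React hooks manage component state."]
vectors = embed(chunks)
faiss.normalize_L2(vectors)                  # cosine similarity via inner product
index = faiss.IndexFlatIP(vectors.shape[1])
index.add(vectors)

query = embed(["How does Kafka store data?"])
faiss.normalize_L2(query)
scores, ids = index.search(query, 1)         # ids[0][0] indexes into chunks
```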
- If OpenAI API calls fail, verify your API key in the .env file
- For processing errors, ensure NLTK has the required data:
```python
import nltk
nltk.download('punkt')
```