An MLOps-driven LLM assistant that learns your unique writing style from your online content. It implements end-to-end LLMOps best practices with an FTI pipeline (Features → Training → Inference), integrates Retrieval-Augmented Generation (RAG) for context grounding, and automates pipeline orchestration through ZenML.
- Features
- Dependencies
- Project Architecture
- Project Structure
- Installation
- Infrastructure
- Pipelines
- Model Training & Evaluation
- RAG System
- Experiment Tracking & Prompt Monitoring
- Quality Check
- Run Project
- Roadmap
- Disclaimer
- Acknowledgments
- Attribution
- License
Key capabilities of the project:
- 📝 Data collection & generation
- 🔄 LLM training pipeline
- 📊 Retrieval-Augmented Generation (RAG) system
- 🔍 Comprehensive monitoring
- 🧪 Testing and evaluation framework
You can download and use the final trained model on Hugging Face - ahmedshahriar/GhostWriterLlama-3.2-1B-DPO.
To install and run the project locally, the following dependencies are required.
| Tool | Version | Purpose | Installation Link |
|---|---|---|---|
| pyenv | ≥2.6.12 | Multiple Python versions (optional) | Install Guide |
| Python | 3.11 | Runtime environment | Download |
| Poetry | ≥2.2.1 | Package management | Install Guide |
| Docker | ≥28.5.1 | Containerization | Install Guide |
The project also uses the following cloud services:
| Service | Purpose |
|---|---|
| Hugging Face | Model registry (fine-tuned LLM) |
| Google Colab | LLM training (SFT, DPO) |
| Kaggle | LLM Evaluation (Custom Framework) |
| Comet ML | Experiment tracker (e.g., train/eval metrics) |
| Opik | Prompt monitoring (tracing RAG inference) |
| ZenML | Orchestrator and artifacts layer |
| MongoDB | NoSQL database (DWH) |
| Qdrant | Vector database (RAG) |
| GitHub Actions | CI/CD pipeline (QA, Container Registry) |
%%{init: {
'theme': 'base',
'flowchart': {
'curve': 'linear',
'nodeSpacing': 40,
'rankSpacing': 60,
'htmlLabels': true
}
}}%%
flowchart TB
%% ---- classes ----
classDef stage fill:#f9f9f9,stroke:#888,stroke-width:1px,corner-radius:4px,color:#111;
classDef node fill:#fff,stroke:#444,stroke-width:1px,corner-radius:4px,color:#000;
classDef faint fill:#f3f4f6,stroke:#7a869a,color:#1f2937;
classDef optional stroke-dasharray: 3 3,stroke:#888,color:#111,fill:#fafafa;
%% === Data Collection Pipeline ===
subgraph DC["<b>Data Collection Pipeline</b>"]
direction TB
BP["Articles<br>(Blog Posts)"]:::node
GH["GitHub"]:::node
ETL[["ETL"]]:::faint
NDB[("NoSQL DWH<br/>(MongoDB)")]:::node
BP --> ETL
GH --> ETL
ETL --> NDB
end
class DC stage;
%% === Feature Pipeline ===
subgraph FP["<div style='width:17em; height:8em; display:flex;'><b>Feature Pipeline</b></div>"]
direction TB
SRC_HUB{{"Data Sources"}}:::node
ART["Articles"]:::node
COD["Code"]:::node
MERGE["Data for Training & RAG"]:::faint
SRC_HUB --> ART
SRC_HUB --> COD
ART --> MERGE
COD --> MERGE
end
class FP stage;
%% connect Data Collection → Feature Pipeline
NDB --> SRC_HUB
%% === Logical Feature Store ===
subgraph LFS["<div style='width:10em; height:8em; padding-right:12em; display:flex;'><b>Logical Feature Store</b></div>"]
direction TB
ID[("Instruct Dataset")]:::node
RC[["Retrieval Client"]]:::faint
VDB[("Vector DB<br/>(Qdrant)")]:::node
ID --> RC --> VDB
end
class LFS stage;
%% connect Feature Pipeline → Feature Store
MERGE --> ID
%% === Training Pipeline ===
subgraph TP["<b>Training Pipeline</b>"]
direction TB
ET["Experiment Tracker<br/>(Comet ML)"]:::optional
FT[["LLM Training<br/>(SFT, DPO)"]]:::faint
TST["Test LLM Candidate"]:::node
MR[("Model Registry<br/>(HuggingFace)")]:::node
ET --> FT
ID -- "Training Data" --> FT
FT -- "LLM Production Candidate" --> MR
MR --> TST
TST -- "Accepted LLM" --> MR
end
class TP stage;
%% === Inference Pipeline ===
subgraph IP["<div style='width:17.5em; height:8em; display:flex;'><b>Inference Pipeline</b></div>"]
direction TB
DEP["Deploy"]:::node
LT["GhostWriter LLM"]:::node
RCI[["Retrieval Client"]]:::faint
API["REST API"]:::node
MON["Prompt & System Monitoring<br/>(Opik)"]:::optional
DEP --> LT --> RCI --> API
API -- "Write a post about..." --> MON
MON -- "Generated Post" --> API
end
class IP stage;
%% cross-pipeline connections
MR --> DEP
VDB -. "RAG Data" .-> RCI
Here is the directory overview:
.
├── configs/ # Pipeline configuration files (YAML)
├── core/ # Core project package
│ ├── application/ # Crawlers, preprocessing, RAG orchestration, embedding singleton
│ ├── domain/ # Pydantic models: documents, datasets, vectors, prompts, etc.
│ ├── infrastructure/ # DB connectors (mongodb, qdrant), FastAPI app wiring, Opik config
│ ├── model/ # LLM fine-tuning (SFT, DPO), evaluation (LLMs as Judge), inference modules
│ └── settings.py # .env/ZenML-secrets driven settings
├── pipelines/ # ML pipeline definitions (ETL, Feature Engineering, Dataset Generation, Export)
├── steps/ # Pipeline components (Modular ZenML pipeline steps)
│ ├── etl/ # ETL steps
│ ├── export/ # Data export steps
│ ├── feature_engineering/ # Feature engineering steps
│ └── dataset_generation/ # Dataset generation steps
├── tests/ # Test examples
├── tools/ # Utility scripts
│ ├── run.py # Entry point to run ZenML pipelines
│ ├── ml_service.py # Starts the REST API inference server
│ ├── rag.py # Demonstrates usage of the RAG retrieval module
│ └── data_warehouse.py # Export/import data from MongoDB data warehouse
└── pyproject.toml # dependencies, Poe tasks and project metadata
The core/ package is the heart of the project, implementing both the RAG and LLM train/inference pipelines. It follows Domain-Driven Design (DDD) principles, promoting modularity:
- domain/: Core business entities and structures (Pydantic models for documents, datasets, vectors, etc.)
- application/: Encapsulates business logic, crawlers, data preprocessing flows, and the RAG implementation
- model/: LLM training, evaluation, and inference
- infrastructure/: External service integrations (Qdrant, MongoDB, FastAPI, Opik)
Source-code dependencies only point inward (outer → inner). Inner layers know nothing about outer ones—this is the core “dependency rule” of DDD (DDD Reference, PDF).
- The domain layer is independent.
- The application layer depends on domain and exposes use cases.
- The model layer (LLM training/tuning/inference) may depend on application and domain as needed. (Note: model/ here refers to ML/LLM code, not the DDD "domain model".)
- The infrastructure layer depends on the inner layers (application, domain, and, where applicable, model).

┌────────────────────────────┐
│          configs/          │
│  (YAML pipeline settings)  │
└────────────┬───────────────┘
             │
             ▼
┌────────────────────────────┐
│         pipelines/         │
│ ZenML pipeline definitions │
└────────────┬───────────────┘
             │
             ▼
┌────────────────────────────┐
│           steps/           │
│    Reusable ZenML steps    │
│ (ETL, FE, dataset gen...)  │
└────────────┬───────────────┘
             │
             ▼
┌──────────────────────────────────────────────┐
│                    core/                     │
│  Main package implementing LLM + RAG logic   │
├──────────────────────────────────────────────┤
│ domain/         → Pydantic entities/models   │
│ application/    → Crawlers, preprocessing,   │
│                   RAG orchestration          │
│ model/          → LLM training + inference   │
│ infrastructure/ → DB connectors, FastAPI     │
│ settings.py     → Environment-driven config  │
└──────────────────────┬───────────────────────┘
                       │
                       ▼
┌───────────────────────────────┐
│            tools/             │
│  CLI scripts & API launchers  │
│ (run.py, ml_service.py etc.)  │
└─────────────┬─────────────────┘
              │
              ▼
┌───────────────────────────────┐
│            tests/             │
│  CI examples & validations    │
└───────────────────────────────┘
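To make the dependency rule concrete, here is a minimal, hypothetical sketch of the layering. The class and module names are illustrative and do not mirror the project's actual code: a domain entity, an application use case that imports only the domain, and an infrastructure adapter that depends on the inner layers.

```python
# Illustrative only: hypothetical classes showing the inward-pointing
# dependency rule described above (not the project's actual code).
from pydantic import BaseModel


# --- domain/ : pure entities, no outward imports --------------------------
class CleanedDocument(BaseModel):
    id: str
    author_full_name: str
    content: str


# --- application/ : use cases that depend only on the domain layer --------
class CleanDocumentsUseCase:
    def execute(self, raw_texts: list[str]) -> list[CleanedDocument]:
        # hypothetical cleaning logic; real preprocessing lives in core/application/
        return [
            CleanedDocument(id=str(i), author_full_name="unknown", content=t.strip())
            for i, t in enumerate(raw_texts)
        ]


# --- infrastructure/ : adapters that depend on the inner layers -----------
class MongoDocumentRepository:
    """Hypothetical adapter; the real connectors live in core/infrastructure/."""

    def save(self, docs: list[CleanedDocument]) -> None:
        for doc in docs:
            print(f"would persist {doc.id} to MongoDB")
```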
Contains YAML configuration files for ZenML pipelines, steps, and runtime parameters.
Contains ZenML ML pipelines that serve as entry points for data preparation, feature engineering, model training, evaluation, and RAG updates.
Each pipeline coordinates multiple reusable steps (from steps/) to define complete end-to-end workflows.
This folder contains ZenML step implementations, each encapsulating a single, reusable operation within a ZenML pipeline. Examples include:
- etl/ – Data ingestion, cleaning, and transformation
- export/ – Model or dataset export utilities
- feature_engineering/ – Feature generation and transformation logic
- dataset_generation/ – Creation of labeled or synthetic datasets for training
Steps perform specific tasks (e.g., data loading, preprocessing) and can be combined within the ML pipelines.
Utility scripts that simplify interaction with the system:
- run.py: Runs ZenML pipelines directly from the CLI.
- ml_service.py: Starts the REST API inference server.
- rag.py: Demonstrates usage of the RAG retrieval and document ranking functionality.
- data_warehouse.py: Handles JSON-based import/export to/from the MongoDB data warehouse.
These scripts serve as both operational utilities and usage examples.
Includes lightweight test examples, covering functional and integration use cases. They are primarily designed for CI/CD validation and as templates for extending test coverage.
Follow these steps to set up the project locally.
Start by cloning the repository and navigating to the project directory:
git clone https://github.com/ahmedshahriar/llm-ghostwriter.git
cd llm-ghostwriter

Next, prepare the Python environment and related dependencies.
The project requires Python 3.11. Either use your global Python installation or set up a project-specific version using pyenv.
Verify Python version:
python --version # Should show Python 3.11.x

- Verify pyenv installation:

pyenv --version # Should show pyenv 2.6.12 or later

- Install Python 3.11.8:

pyenv install 3.11.8

- Verify the installation:

python --version # Should show Python 3.11.8

- Confirm Python version in the project directory:

python --version
# Output: Python 3.11.8

Note
The project includes a .python-version file that automatically sets the correct Python version when you're in the project directory.
The project uses Poetry for dependency management.
- Verify Poetry installation:
poetry --version # Should show Poetry version 2.2.1 or later

- Set up the project environment and install dependencies:
poetry env use 3.11
poetry install
poetry run pre-commit install

This will:
- Configure Poetry to use Python 3.11
- Install project dependencies
- Set up pre-commit hooks for code verification
Use Poe the Poet as the task runner for project scripts.
- Start a Poetry shell:
poetry shell

- Run project commands using Poe the Poet:

poetry poe ...

🔧 Troubleshooting Poe the Poet Installation
If issues occur with poethepoet, project commands can be run directly through Poetry:
- Locate the command definition in pyproject.toml
- Execute it using poetry run followed by the command itself
Instead of:
poetry poe local-infrastructure-up

Use the direct command from pyproject.toml:

poetry run <actual-command-from-pyproject-toml>

Note: All project commands are defined in the [tool.poe.tasks] section of pyproject.toml.
After dependencies are installed, create a .env file at the repository root and define credentials for external services.
Storing secrets in .env prevents them from being committed to GitHub.
- First, copy the example env file by running the following:
cp .env.example .env # The file must be at your repository's root!

- Fill in the essential variables to enable local execution:
Provide an authentication token for OpenAI’s API:
OPENAI_API_KEY=your_api_key_here

→ Check out this tutorial to learn how to provide one from OpenAI.
Provide a Hugging Face access token:
HUGGINGFACE_ACCESS_TOKEN=your_token_here

→ Check out this tutorial to learn how to provide one from Hugging Face.
To authenticate to Comet ML (required only during training) and Opik, fill out the COMET_API_KEY env var with your authentication token.
COMET_API_KEY=your_api_key_here

→ Check out this tutorial to learn how to get started with Opik. You can also access Opik's dashboard using 🔗 this link.
To deploy the project, configure the external databases MongoDB and Qdrant.
Update the DATABASE_HOST variable with the MongoDB cluster URL:
DATABASE_HOST=your_mongodb_url→ Check out this tutorial to learn how to create and host a MongoDB cluster for free.
Enable Qdrant Cloud and set the endpoint and API key:
Change USE_QDRANT_CLOUD to true, set QDRANT_CLOUD_URL with the URL pointing to your cloud Qdrant cluster, and set QDRANT_APIKEY with its API key.
USE_QDRANT_CLOUD=true
QDRANT_CLOUD_URL=your_qdrant_cloud_url
QDRANT_APIKEY=your_qdrant_api_key

Important
Additional configuration options are available in settings.py. Any variable in the Settings class can be configured through the .env file.
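As a rough illustration of how such env-driven settings are commonly modeled with pydantic-settings (the field names below mirror the variables above, but the actual Settings class in settings.py may differ in fields and defaults):

```python
# Hypothetical sketch of an .env-driven settings class using pydantic-settings;
# the real core/settings.py may define different fields, defaults, and sources.
from pydantic_settings import BaseSettings, SettingsConfigDict


class Settings(BaseSettings):
    model_config = SettingsConfigDict(env_file=".env", env_file_encoding="utf-8")

    OPENAI_API_KEY: str | None = None
    HUGGINGFACE_ACCESS_TOKEN: str | None = None
    COMET_API_KEY: str | None = None

    DATABASE_HOST: str = "mongodb://localhost:27017"  # placeholder default
    USE_QDRANT_CLOUD: bool = False
    QDRANT_CLOUD_URL: str | None = None
    QDRANT_APIKEY: str | None = None


settings = Settings()  # values in .env override the defaults above
```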
For local execution, the project runs MongoDB and Qdrant instances through Docker. A local ZenML server is also started via the Python package.
Warning
Docker (>= v28.5.1) must be installed.
Start the entire local development stack:
poetry poe local-infrastructure-up

Stop the ZenML server and all Docker containers:

poetry poe local-infrastructure-down

Warning
When running on macOS, before starting the server, export the following environment variable:
export OBJC_DISABLE_INITIALIZE_FORK_SAFETY=YES
Otherwise, the connection between the local server and pipeline will break. 🔗 More details in this issue.
This is done by default when using Poe the Poet.
Start the real-time inference REST API:
poetry poe run-inference-ml-service

Dashboard URL: http://localhost:8237
Default credentials:
username: default
→ Learn more: ZenML.
REST API URL: http://localhost:6333
Dashboard URL: http://localhost:6333/dashboard
→ Learn more: Qdrant with Docker.
Database URI: mongodb://ghost:[email protected]:27017 (local development only)
Database name: ghost_db
Default credentials:
username: ghost
password: ghost
→ Learn more: MongoDB with Docker.
All the ML pipelines will be orchestrated behind the scenes by ZenML. A few exceptions exist when running utility scripts, such as exporting or importing from the data warehouse.
The ZenML pipelines are the entry point for most processes throughout this project. They are under the pipelines/ folder. Thus, when you want to understand or debug a workflow, starting with the ZenML pipeline is the best approach.
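For orientation, a ZenML pipeline wires steps together roughly as in the minimal sketch below; the step and pipeline names here are hypothetical, while the project's real definitions live in pipelines/ and steps/.

```python
# Minimal ZenML sketch with hypothetical step/pipeline names; the project's
# actual pipelines live in pipelines/ and their steps in steps/.
from zenml import pipeline, step


@step
def crawl_links(links: list[str]) -> list[str]:
    # pretend each link yields one raw document
    return [f"raw content from {link}" for link in links]


@step
def clean_documents(raw_docs: list[str]) -> list[str]:
    # toy cleaning step; the real logic lives in core/application/
    return [doc.strip().lower() for doc in raw_docs]


@pipeline
def digital_data_etl_demo(links: list[str]):
    raw = crawl_links(links)
    clean_documents(raw)


if __name__ == "__main__":
    # requires an initialized ZenML stack (e.g., the local one started above)
    digital_data_etl_demo(links=["https://example.com/post-1"])
```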
To see the pipelines running and their results:
- go to your ZenML dashboard
- go to the Pipelines section
- click on a specific pipeline (e.g., digital_data_etl)
- click on a specific run (e.g., digital_data_etl_run_2025_09_20_14_00_52)
- click on a specific step or artifact of the DAG to find more details about it
Here are some sample screenshots from the ETL pipeline run:
Now, let's explore all the pipelines you can run.
Run the data collection ETL:
poetry poe run-digital-data-etl

To add additional links to collect from, go to configs/digital_data_etl_[author_name].yaml and add them to the links field.
Run the feature engineering pipeline:
poetry poe run-feature-engineering-pipeline

Generate the instruct dataset:

poetry poe run-generate-instruct-datasets-pipeline

Generate the preference dataset:

poetry poe run-generate-preference-datasets-pipeline

Export the data from the data warehouse to JSON files:

poetry poe run-export-data-warehouse-to-json

Import data to the data warehouse from JSON files (by default, it imports the data from the data/data_warehouse_raw_data directory):

poetry poe run-import-data-warehouse-from-json

Export ZenML artifacts to JSON:

poetry poe run-export-artifact-to-json-pipeline

This will export the following ZenML artifacts to the output folder as JSON files (it will use their latest version):
- cleaned_documents.json
- instruct_datasets.json
- preference_datasets.json
- raw_documents.json
You can configure what artifacts to export by tweaking the configs/export_artifact_to_json.yaml configuration file.
Model fine-tuning and evaluation were conducted in Google Colab using GPU accelerators.
The training code is available in the following notebooks (under model/finetuning/ and model/evaluation/):
Dependencies:
- Environment: Google Colab and Kaggle (T4/A100 GPU)
- Base model: Llama-3.2-1B
- Models released:
- Uses a 95% training / 5% test split for validation during training.
- Performs LoRA-based fine-tuning using Unsloth for accelerated training and reduced memory footprint.
- Utilizes mixed-precision training (fp16/bf16) and the 8-bit Adam optimizer (weight decay=0.01) for efficiency. Depending on the GPU, it automatically uses FP16 or BF16 for the activations.
- Metrics logged to Comet ML for live monitoring
- Loads the instruct dataset (datasets/ahmedshahriar/llmGhostWriter) and supplements it with additional domain-curated samples (datasets/yahma/alpaca-cleaned) to improve coverage and stylistic diversity.
- Initializes the base model Llama-3.2-1B.
- Formats all samples using an Alpaca-style chat template to standardize instruction–response structure and ensure clean prompt alignment
- Trains via the SFTTrainer with key parameters: learning_rate=3e-4, num_train_epochs=3, batch_size=2, and gradient accumulation (effective batch size 16); see the sketch after this list.
- Final checkpoints published to HF Hub: GhostWriterLlama-3.2-1B.
- Comet ML Metrics dashboard: GhostWriterLlama-3.2-1B-SFT Experiment
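The sketch below shows roughly how such an SFT run could be configured with TRL using the hyperparameters listed above. It omits the Unsloth/LoRA wiring and the Alpaca-style formatting used in the actual notebook; the dataset split name and the "text" column are assumptions, and argument names follow recent TRL versions.

```python
# Simplified SFT sketch with TRL; the actual notebook (core/model/finetuning/)
# adds Unsloth/LoRA and Alpaca-style prompt formatting.
from datasets import load_dataset
from transformers import AutoModelForCausalLM, AutoTokenizer
from trl import SFTConfig, SFTTrainer

base_model = "meta-llama/Llama-3.2-1B"
model = AutoModelForCausalLM.from_pretrained(base_model)
tokenizer = AutoTokenizer.from_pretrained(base_model)

# assumes the dataset exposes a formatted "text" column and a "train" split;
# the notebook builds the text via an Alpaca-style template
dataset = load_dataset("ahmedshahriar/llmGhostWriter", split="train")

args = SFTConfig(
    output_dir="ghostwriter-sft",
    learning_rate=3e-4,
    num_train_epochs=3,
    per_device_train_batch_size=2,
    gradient_accumulation_steps=8,  # effective batch size of 16
)

trainer = SFTTrainer(model=model, args=args, train_dataset=dataset, processing_class=tokenizer)
trainer.train()
```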
- Initializes the SFT-tuned checkpoint as the base: GhostWriterLlama-3.2-1B.
- Uses a (prompt, chosen, rejected) preference dataset (datasets/ahmedshahriar/llmGhostWriter-dpo) formatted with an Alpaca-style template; the uniform prompt shape reduces prompt-induced variance during preference learning.
- Trains with DPOTrainer (see the sketch after this list) using:
- beta=0.5 to strengthen reference-model regularization. It controls the importance of the reference model; a standard value of 0.1 works well in most scenarios.
- A lower learning rate (from 3e-4 for SFT to 2e-6 here) and 1 training epoch instead of 3.
- The max_seq_length parameter is broken down into two new parameters: max_prompt_length (prompt only) and max_length (prompt and answer).
- Final checkpoints published to HF Hub: GhostWriterLlama-3.2-1B-DPO
- Comet ML Metrics dashboard: GhostWriterLlama-3.2-1B-DPO Experiment
beta=0.5 was chosen because the trained model used overly formal language at lower values; keeping the policy closer to the reference model helps fix this issue.
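A comparable sketch for the DPO stage with TRL, using the beta, learning rate, epoch count, and length parameters described above. The split name and the length values are assumptions, and processing_class is the argument name in recent TRL versions (older versions use tokenizer).

```python
# Simplified DPO sketch with TRL; the notebook (core/model/finetuning/) adds
# Unsloth/LoRA and the Alpaca-style formatting described above.
from datasets import load_dataset
from transformers import AutoModelForCausalLM, AutoTokenizer
from trl import DPOConfig, DPOTrainer

sft_checkpoint = "ahmedshahriar/GhostWriterLlama-3.2-1B"  # SFT model as the starting point
model = AutoModelForCausalLM.from_pretrained(sft_checkpoint)
tokenizer = AutoTokenizer.from_pretrained(sft_checkpoint)

# (prompt, chosen, rejected) preference dataset mentioned above; split name assumed
dataset = load_dataset("ahmedshahriar/llmGhostWriter-dpo", split="train")

args = DPOConfig(
    output_dir="ghostwriter-dpo",
    beta=0.5,               # stronger pull toward the reference model than the usual 0.1
    learning_rate=2e-6,     # much lower than the 3e-4 used for SFT
    num_train_epochs=1,
    max_prompt_length=512,  # illustrative lengths; tune to your data
    max_length=1024,
)

trainer = DPOTrainer(model=model, args=args, train_dataset=dataset, processing_class=tokenizer)
trainer.train()
```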
✏️ Response Comparison Example: SFT vs. DPO Output
Prompt:
“Write a paragraph to introduce supervised fine-tuning.”
SFT Model (GhostWriterLlama-3.2-1B):
Supervised fine-tuning is a technique used to improve a language model by teaching it to predict the next word based on examples provided by a labeled dataset. This process involves training the model to understand the relationships between words and how to predict the next one given a set of previous tokens. Supervised fine-tuning is commonly used with language models like GPT-3 and is considered a state-of-the-art approach for text generation.
DPO Model (GhostWriterLlama-3.2-1B-DPO):
Supervised fine-tuning is a technique used in natural language processing and machine learning to improve the performance of a model by training it on a large dataset of labeled examples. This process involves feeding the model examples of correct and incorrect responses, and using this information to adjust the model's parameters to better predict the correct response. Supervised fine-tuning is commonly used in tasks such as text classification, sentiment analysis, and machine translation, where the goal is to train the model to accurately classify or translate between languages. By fine-tuning the model on a labeled dataset, the model learns to better recognize and respond to the specific prompts and expectations of the task at hand.
💡 The DPO-tuned model produces a more detailed, technically grounded explanation—showing improved factual completeness and task awareness—while maintaining clarity and relevance for educational or applied writing contexts.
Evaluation was conducted via core/model/evaluation/llm_ghostwriter_evaluate.ipynb, which performs automated answer generation and LLM-based scoring to evaluate accuracy and stylistic alignment across models.
- The custom evaluation framework uses a judge LLM (GPT-4.1-nano) to evaluate model outputs on the test split of the instruction dataset.
- Each prediction is scored on a 1–3 scale across two criteria:
- Accuracy: factual correctness and completeness of the response
- Style: tone and writing suitability for blogs and articles
- The framework feeds test prompts to both models, collects outputs, and scores them via the judge LLM prompt.
- Results are analyzed through qualitative and quantitative evaluation to assess overall writing quality and alignment.
- Generates model outputs for each test instruction using vLLM for fast batched inference.
- Evaluates three models - Meta-Llama-3.2-1B-Instruct as the reference baseline, alongside both GhostWriter LLM variants (SFT-tuned and DPO-tuned).
- Prompts are formatted consistently using a unified chat template.
- Outputs are generated with diversity-oriented sampling (temperature=0.8, top_p=0.95, min_p=0.05) and saved to the Hugging Face Hub for later evaluation and analysis; a generation sketch follows this list.
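A condensed sketch of this generation step with vLLM and the sampling settings above; the chat-template formatting and dataset handling from the evaluation notebook are omitted, and the example prompt is illustrative.

```python
# Batched answer generation with vLLM using the sampling settings listed above;
# prompt formatting and dataset handling are simplified.
from vllm import LLM, SamplingParams

# swap in each model under evaluation (baseline, SFT, DPO)
llm = LLM(model="ahmedshahriar/GhostWriterLlama-3.2-1B-DPO")

sampling = SamplingParams(temperature=0.8, top_p=0.95, min_p=0.05, max_tokens=512)

test_instructions = [
    "Write a paragraph to introduce supervised fine-tuning.",  # example prompt from above
]

outputs = llm.generate(test_instructions, sampling)
for out in outputs:
    print(out.outputs[0].text)
```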
- Uses GPT-4.1-nano as a judge LLM to evaluate model outputs on accuracy and style using a 1–3 scoring scale.
- Each instruction–answer pair is assessed through a structured JSON evaluation prompt
- The process runs in parallel batches for efficiency, with results stored as structured JSON objects.
- Extracted scores (accuracy, style) are added to the dataset and finally pushed to the Hugging Face Hub for further analysis and comparison across models.
- Results Dataset:
| Model | Accuracy ↑ | Style ↑ |
|---|---|---|
| GhostWriterLlama-3.2-1B | 2.42 | 2.92 |
| GhostWriterLlama-3.2-1B-DPO | 2.48 | 2.95 |
| Llama-3.2-1B-Instruct (baseline) | 2.64 | 2.91 |
🧩 The DPO-tuned variant slightly improves both stylistic alignment and factual accuracy over the base fine-tuned model, approaching the performance of the strong Llama-3.2-1B-Instruct baseline. While the baseline retains a small edge in accuracy due to its extensive instruction tuning, the DPO model produces more natural, accessible writing better suited for blog and creative content.
Note
The current inference pipeline API runs on the DPO-tuned model, using the transformers library for generation.
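For reference, generating directly with the released DPO checkpoint via transformers (outside the REST API) can look like the sketch below; the generation settings are illustrative, and the project's service wraps similar logic behind FastAPI.

```python
# Standalone generation with the published DPO model via transformers;
# sampling parameters here are illustrative, not the service's exact settings.
from transformers import pipeline

generator = pipeline("text-generation", model="ahmedshahriar/GhostWriterLlama-3.2-1B-DPO")

prompt = "Write a paragraph to introduce supervised fine-tuning."
result = generator(prompt, max_new_tokens=256, do_sample=True, temperature=0.8, top_p=0.95)
print(result[0]["generated_text"])
```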
Future updates will integrate cloud-based training pipelines (AWS SageMaker/Azure ML) within ZenML cloud for automated deployment and scaling.
- Warehouse → Vector DB: MongoDB → Qdrant
- Embeddings: SentenceTransformers (sentence-transformers/all-MiniLM-L6-v2, configurable)
- Re-ranker: cross-encoder/ms-marco-MiniLM-L4-v2 (configurable)
- LLMs: OpenAI (gpt-4.1-nano) or local via Ollama (llama3.2:3b) (configurable)
- Processing: LangChain
- Serving: FastAPI (HTTP) → LLM endpoint
- Orchestration: ZenML
- Goal: Keep a logical feature store in sync with MongoDB and prepare data for RAG (chunk and embed) and training (clean text artifacts).
- Why batch: Small volume (thousands → tens of thousands), minute-level freshness is fine. Batch = simpler, cheaper, easier to operate than streaming.
- Outputs:
- RAG-ready index: chunk embeddings + metadata in Qdrant.
- Training snapshot: cleaned documents, versioned via ZenML, for dataset creation.
MongoDB → Clean → Chunk → Embed → Qdrant (vectors with metadata)
└──► ZenML Artifacts (cleaned-doc snapshots)
- Provide a reliable path from raw documents to features used by RAG retrieval and fine-tuning pipelines.
- Keep the feature store synchronized with the warehouse with acceptable freshness for RAG.
- Stay simple to run today; leave a clear path to CDC/streaming when scale demands.
- Documents in MongoDB (articles, repos/code files, posts).
- Metadata: document ID, source, url, author, created_at, content_type, etc.
- Cleaned documents stored as records/artifacts for fine-tuning dataset creation.
- Chunk embeddings with metadata indexed in Qdrant for retrieval.
- Operational metrics/logs to monitor freshness and pipeline health.
- Extract (MongoDB)
- Pull latest/updated documents on a schedule or manual trigger.
- Read in batches to control memory and enable retries.
- Clean
- Normalize encodings/whitespace; replace inline URLs with placeholders.
- Remove emojis/boilerplate; deduplicate (document and paragraph level).
- Attach lineage and provenance (document ID, source, created_at, updated_at).
- Emit a cleaned snapshot for training (versioned via ZenML or as metadata-only rows in Qdrant if needed).
- Chunk
- Split by content type: wider windows for code; paragraph/sentence for articles.
- Respect the embedding model's maximum input length; assign a stable chunk ID; retain the parent document ID for provenance.
- Embed
- Encode each chunk with a configurable model (default: sentence-transformers/all-MiniLM-L6-v2); see the sketch after this list.
- Batch on CPU/GPU; record failures and retry/fallback.
- Load to Feature Store
- Vectors + metadata → Qdrant (upsert semantics).
- Cleaned docs snapshot → ZenML artifacts (or Qdrant metadata) for downstream fine-tuning.
- Retrieval Client
- Thin client hits Qdrant with filters (source/type/date) and returns top-k + metadata for prompt assembly or dataset curation.
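A condensed sketch of the chunk → embed → upsert portion of this flow, assuming the default embedding model; the collection name and payload fields are placeholders rather than the project's actual schema.

```python
# Minimal chunk-embed-upsert sketch; collection name, payload fields, and the
# single hard-coded chunk are simplified stand-ins for the real pipeline steps.
import uuid

from qdrant_client import QdrantClient
from qdrant_client.models import Distance, PointStruct, VectorParams
from sentence_transformers import SentenceTransformer

embedder = SentenceTransformer("sentence-transformers/all-MiniLM-L6-v2")
client = QdrantClient(url="http://localhost:6333")

collection = "cleaned_articles"  # hypothetical collection name
client.recreate_collection(
    collection_name=collection,
    vectors_config=VectorParams(
        size=embedder.get_sentence_embedding_dimension(), distance=Distance.COSINE
    ),
)

chunks = [
    {"document_id": "doc-1", "chunk_id": "doc-1-0", "text": "Example paragraph about MLOps..."},
]
vectors = embedder.encode([c["text"] for c in chunks], batch_size=32)

client.upsert(
    collection_name=collection,
    points=[
        PointStruct(id=str(uuid.uuid4()), vector=vec.tolist(), payload=chunk)
        for chunk, vec in zip(chunks, vectors)
    ],
)
```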
- Online store (for RAG): The vector database (Qdrant) holds chunk embeddings and metadata used for retrieval at inference time.
- It can also keep the cleaned documents as metadata-only rows to support fast lookups and filtering without touching the warehouse.
- Offline store (for Training): ZenML artifacts store versioned cleaned text snapshots that are used to build reproducible fine-tuning datasets.
- Training pipeline reads these artifacts, not from the MongoDB warehouse, which makes runs reproducible and easy to roll back.
- Why two snapshots: This keeps the shared MongoDB warehouse generic and avoids per–use-case "cleaned" tables. Training and inference pipelines read exclusively from the feature store, providing one consistent surface with clear lineage (document ID and chunk ID) and a straightforward path to rebuild or refresh the online index from the same cleaned snapshot.
- Current approach is periodic batch polling that reads the warehouse (MongoDB) and upserts into Qdrant. Works well for this use case (small data).
- When volume grows, upgrade options:
- Timestamp-based pull to fetch only new or updated rows.
- Trigger-based replication for full CRUD tracking.
- Log-based CDC with a queue and a stream processor for low latency and delete handling at scale.
- Embedding model, chunk sizes/overlaps, batch sizes, collection names, and schedules are config-driven.
- Defaults favor portability and modest hardware; can be swapped for stronger models as resources allow.
- Small to medium datasets; near–real-time not required.
- Tokenization at runtime to stay flexible across models/tasks.
- Clear migration path to streaming/CDC when volume, latency, or delete-handling demands it.
Result: a simple, reproducible batch pipeline that keeps retrieval assets fresh while also feeding training with curated, versioned text.
Turn a user query into a grounded answer by expanding the query, retrieving the most relevant chunks from the vector database, re-ranking them, and calling the LLM inference endpoint with a compact, cited context.
- Self-Query
- Use self-querying to pull the author identifier (name or ID) from the question; if none is found, pass the query through unchanged.
- Resolve the identifier to a canonical author record and attach structured fields (e.g., author_id, author_full_name) to the query; use these as strict filters alongside vector similarity to ensure results come from the intended author.
- Query Expansion
- Generate several diverse reformulations of the query (semantic paraphrases and filter-aware variants).
- Each expansion preserves or tightens the filters discovered during self-query.
In this project, the OpenAI model gpt-4.1-nano (via the OpenAI API) and, for the local setup, llama3.2:3b (via Ollama) were used for self-query and query expansion. Both are configurable and can be swapped for open-source LLMs.
- Filtered Vector Search
- Run n parallel searches (one per expanded query) in Qdrant while applying hard metadata filters (author, source type, time window) to narrow the search space and improve accuracy and latency.
- Merge results from all expansions and remove near-duplicates or stale chunks using parent document identifiers, yielding up to n×k candidates (n = total queries after expansion; k = top-k per query; e.g., n=4, k=8 → 32).
- Optionally budget results per source category (for example, articles, posts, repositories) so no single source dominates the context.
- Re-ranking
- Score every candidate against the original question with a cross-encoder or lightweight LLM re-ranker (default: cross-encoder/ms-marco-MiniLM-L4-v2), reorder by relevance, and keep only the top-k chunks for context (a search-and-re-rank sketch follows this list).
- This sharpens precision, reduces noise and prompt size, and pairs well with query expansion: first cast a wider net, then focus on the most relevant evidence.
- Context Builder
- Trim content to the prompt token budget while preserving diversity and coverage.
- Maintain chunk order within each document when helpful for readability.
- Attach citations for every chunk (document ID, chunk ID, and canonical URL).
- Prompt Assembly
- Compose the final prompt: system instructions, guardrails, the user query, and the ranked context with citations.
- LLM Inference
- Send the prompt to the LLM inference endpoint over HTTP API (FastAPI).
- Return the answer together with the list of cited sources. Optional debug information can include the expanded queries and timing.
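To illustrate the filtered search and re-ranking steps above, here is a condensed sketch against Qdrant with the default cross-encoder; the collection name, payload fields (author_full_name, text), and the author value are assumptions rather than the project's exact schema.

```python
# Filtered vector search + cross-encoder re-ranking sketch; payload field and
# collection names are assumptions, not the project's exact schema.
from qdrant_client import QdrantClient
from qdrant_client.models import FieldCondition, Filter, MatchValue
from sentence_transformers import CrossEncoder, SentenceTransformer

query = "Write a post about supervised fine-tuning."
embedder = SentenceTransformer("sentence-transformers/all-MiniLM-L6-v2")
client = QdrantClient(url="http://localhost:6333")

hits = client.search(
    collection_name="cleaned_articles",            # hypothetical collection
    query_vector=embedder.encode(query).tolist(),
    query_filter=Filter(
        must=[FieldCondition(key="author_full_name", match=MatchValue(value="Jane Doe"))]
    ),
    limit=8,  # k per expanded query
)

# Re-rank the merged candidates against the original query and keep the best few.
reranker = CrossEncoder("cross-encoder/ms-marco-MiniLM-L4-v2")
scored = reranker.predict([(query, h.payload["text"]) for h in hits])  # "text" field assumed
top_chunks = [h for _, h in sorted(zip(scored, hits), key=lambda p: p[0], reverse=True)[:3]]
```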
- User query text and, if provided, explicit filters.
- Qdrant collections containing chunk embeddings and metadata (document ID, source type, author, timestamps, URL).
- Final answer grounded in retrieved context.
- Citations that reference the contributing documents and chunks.
- Optional diagnostics: expanded queries, number of results at each stage, latency per stage.
- Number of query expansions.
- Top-k per search and the number of chunks kept after re-ranking.
- Per-source result budgets to balance articles, posts, and repositories.
- Metadata filters applied automatically or supplied by the caller.
- Token budget for the context and the LLM model or endpoint name.
- Re-ranker model choice.
- Observability: log expansions, hit counts, re-rank scores, prompt tokens, and latency for search, re-rank, and LLM inference. Track end-to-end freshness when queries include time filters.
- Failure handling: if re-ranking fails, fall back to vector similarity order; if a metadata filter is invalid, drop it and warn rather than failing the request.
- Safety: enforce max context size, strip dangerous markup, and cap the number of external links included as citations.
Result: a predictable and debuggable runtime that converts a single query into a well-grounded answer with transparent sources.
The training experiments are tracked using Comet ML. You can explore the public project workspace here → LLMGhostWriter on Comet
Both Supervised Fine-Tuning (SFT) and Direct Preference Optimization (DPO) experiments are logged, including:
- Training and validation losses
- Hyperparameters, evaluation metrics, and generated outputs
- System resource usage (GPU, memory, network)
🖼️ Example Screenshots: Experiment Tracking on Comet ML
Experiment tracking in Comet ML with SFT metrics
Experiment tracking in Comet ML with DPO metrics
✅ All metrics, traces, and visualizations are automatically logged when the COMET_API_KEY environment variable is configured.
Prompt–response traces from the RAG pipeline can be visualized interactively using Opik, allowing step-by-step insight into retrieval and generation behavior.
It provides insights into:
- Prompt inputs and generated outputs
- Retrieval context and document sources
- Latency and resource usage
- Error tracking and debugging
Example screenshots of RAG trace visualizations:
CI checks are automated via .github/workflows/ci.yml on every pull request.
# Linting & formatting with Ruff via Poetry
poetry poe lint-check # Check code style/lint errors
poetry poe format-check # Check coding style/formatting
poetry poe lint-fix # Apply automatic lint fixes
poetry poe format-fix # Apply automatic format fixes
poetry poe test # Run all the tests

The .github/workflows/cd.yml workflow automates building and publishing the Docker image to the GitHub Container Registry (GHCR).
With infrastructure up and .env configured, follow the steps below to run the system end-to-end:
- Collect data: poetry poe run-digital-data-etl
- Compute features: poetry poe run-feature-engineering-pipeline
- Generate the instruct dataset: poetry poe run-generate-instruct-datasets-pipeline
- Generate the preference alignment dataset: poetry poe run-generate-preference-datasets-pipeline
- Call only the RAG retrieval module: poetry poe call-rag-retrieval-module
- Start the end-to-end RAG server: poetry poe run-inference-ml-service
- Test the RAG server: poetry poe call-inference-ml-service (configure the payload in tests/payloads/payload.json.example)
- Export settings to ZenML: poetry poe export-settings-to-zenml
- Delete settings from ZenML: poetry poe delete-settings-zenml
- Cloud-based training pipelines (AWS SageMaker/Azure ML) within ZenML cloud for automated deployment and scaling.
- Advanced evaluation framework integrating human-in-the-loop feedback.
- Enhanced RAG capabilities with multi-modal data sources.
- Strengthen domain independence by removing outward references.
- Introduce thin ports for storage/vector services and wire them in the application.
This repository is for educational and research use only. It may reference or link to publicly available third-party content. All copyrights and trademarks remain with their respective owners. No endorsement or affiliation is implied.
By using this project you agree to:
- Comply with the licenses and terms of service of any third-party sources (including robots.txt and rate limits) when collecting or using data.
- Use the provided code, datasets, and model artifacts at your own risk; they are provided “as is” without warranties of any kind.
- Avoid including personal, proprietary, or sensitive data in configuration, prompts, logs, or datasets.
Data scope. This repo does not claim ownership of third-party content. If datasets in this project include links, metadata, or limited excerpts for evaluation or retrieval indexing, they are provided solely for research/illustrative purposes. If you redistribute any third-party text, you are responsible for ensuring you have the right to do so (or removing that content).
Models. Trained models or checkpoints may reflect patterns learned from training data. You are responsible for ensuring any downstream use complies with applicable rights and licenses of the underlying sources.
Non-commercial intent: This project is intended for non-commercial research and education.
Contact/takedown: If you believe something here should be removed, please open a GitHub issue with the relevant URLs.
This project draws on public posts, code, and tutorials from Comet ML, the Hugging Face blog, GitHub, ZenML, Medium, and other community sources. Thanks to the authors whose educational materials helped shape the pipeline design and evaluation approach.
This project references public writings by the following authors/sites for research and evaluation. Please see their original posts for full content and licensing:
- Chip Huyen: huyenchip.com
- Jay Alammar: newsletter.languagemodels.co
- Code is licensed under Apache-2.0 (see LICENSE).
- Datasets and model artifacts are subject to their original licenses; see the respective source pages for details.
- By using this project, you agree to the terms outlined in the Disclaimer section.








