An MLOps-driven LLM assistant that learns your unique writing style from your online content. It implements end-to-end LLMOps best practices with an FTI pipeline (Features → Training → Inference), integrates Retrieval-Augmented Generation (RAG) for context grounding, and automates pipeline orchestration through ZenML.
- Features
- Dependencies
- Project Architecture
- Project Structure
- Installation
- Infrastructure
- Pipelines
- Model Training & Evaluation
- RAG System
- Experiment Tracking & Prompt Monitoring
- Quality Check
- Run Project
- Roadmap
- Disclaimer
- Acknowledgments
- Attribution
- License
Key capabilities of the project:
- 📝 Data collection & generation
- 🔄 LLM training pipeline
- 📊 Retrieval-Augmented Generation (RAG) system
- 🔍 Comprehensive monitoring
- 🧪 Testing and evaluation framework
You can download and use the final trained model on Hugging Face - ahmedshahriar/GhostWriterLlama-3.2-1B-DPO.
To install and run the project locally, the following dependencies are required.
| Tool | Version | Purpose | Installation Link |
|---|---|---|---|
| pyenv | ≥2.6.12 | Multiple Python versions (optional) | Install Guide |
| Python | 3.11 | Runtime environment | Download |
| Poetry | ≥2.2.1 | Package management | Install Guide |
| Docker | ≥28.5.1 | Containerization | Install Guide |
The project also uses the following cloud services:
| Service | Purpose |
|---|---|
| Hugging Face | Model registry (fine-tuned LLM) |
| Google Colab | LLM training (SFT, DPO) |
| Kaggle | LLM Evaluation (Custom Framework) |
| Comet ML | Experiment tracker (e.g., train/eval metrics) |
| Opik | Prompt monitoring (tracing RAG inference) |
| ZenML | Orchestrator and artifacts layer |
| MongoDB | NoSQL database (DWH) |
| Qdrant | Vector database (RAG) |
| GitHub Actions | CI/CD pipeline (QA, Container Registry) |
%%{init: {
'theme': 'base',
'flowchart': {
'curve': 'linear',
'nodeSpacing': 40,
'rankSpacing': 60,
'htmlLabels': true
}
}}%%
flowchart TB
%% ---- classes ----
classDef stage fill:#f9f9f9,stroke:#888,stroke-width:1px,corner-radius:4px,color:#111;
classDef node fill:#fff,stroke:#444,stroke-width:1px,corner-radius:4px,color:#000;
classDef faint fill:#f3f4f6,stroke:#7a869a,color:#1f2937;
classDef optional stroke-dasharray: 3 3,stroke:#888,color:#111,fill:#fafafa;
%% === Data Collection Pipeline ===
subgraph DC["<b>Data Collection Pipeline</b>"]
direction TB
BP["Articles<br>(Blog Posts)"]:::node
GH["GitHub"]:::node
ETL[["ETL"]]:::faint
NDB[("NoSQL DWH<br/>(MongoDB)")]:::node
BP --> ETL
GH --> ETL
ETL --> NDB
end
class DC stage;
%% === Feature Pipeline ===
subgraph FP["<div style='width:17em; height:8em; display:flex;'><b>Feature Pipeline</b></div>"]
direction TB
SRC_HUB{{"Data Sources"}}:::node
ART["Articles"]:::node
COD["Code"]:::node
MERGE["Data for Training & RAG"]:::faint
SRC_HUB --> ART
SRC_HUB --> COD
ART --> MERGE
COD --> MERGE
end
class FP stage;
%% connect Data Collection → Feature Pipeline
NDB --> SRC_HUB
%% === Logical Feature Store ===
subgraph LFS["<div style='width:10em; height:8em; padding-right:12em; display:flex;'><b>Logical Feature Store</b></div>"]
direction TB
ID[("Instruct Dataset")]:::node
RC[["Retrieval Client"]]:::faint
VDB[("Vector DB<br/>(Qdrant)")]:::node
ID --> RC --> VDB
end
class LFS stage;
%% connect Feature Pipeline → Feature Store
MERGE --> ID
%% === Training Pipeline ===
subgraph TP["<b>Training Pipeline</b>"]
direction TB
ET["Experiment Tracker<br/>(Comet ML)"]:::optional
FT[["LLM Training<br/>(SFT, DPO)"]]:::faint
TST["Test LLM Candidate"]:::node
MR[("Model Registry<br/>(HuggingFace)")]:::node
ET --> FT
ID -- "Training Data" --> FT
FT -- "LLM Production Candidate" --> MR
MR --> TST
TST -- "Accepted LLM" --> MR
end
class TP stage;
%% === Inference Pipeline ===
subgraph IP["<div style='width:17.5em; height:8em; display:flex;'><b>Inference Pipeline</b></div>"]
direction TB
DEP["Deploy"]:::node
LT["GhostWriter LLM"]:::node
RCI[["Retrieval Client"]]:::faint
API["REST API"]:::node
MON["Prompt & System Monitoring<br/>(Opik)"]:::optional
DEP --> LT --> RCI --> API
API -- "Write a post about..." --> MON
MON -- "Generated Post" --> API
end
class IP stage;
%% cross-pipeline connections
MR --> DEP
VDB -. "RAG Data" .-> RCI
Here is the directory overview:
.
├── configs/ # Pipeline configuration files (YAML)
├── core/ # Core project package
│ ├── application/ # Crawlers, preprocessing, RAG orchestration, embedding singleton
│ ├── domain/ # Pydantic models: documents, datasets, vectors, prompts, etc.
│ ├── infrastructure/ # DB connectors (mongodb, qdrant), FastAPI app wiring, Opik config
│ ├── model/ # LLM fine-tuning (SFT, DPO), evaluation (LLMs as Judge), inference modules
│ └── settings.py # .env/ZenML-secrets driven settings
├── pipelines/ # ML pipeline definitions (ETL, Feature Engineering, Dataset Generation, Export)
├── steps/ # Pipeline components (Modular ZenML pipeline steps)
│ ├── etl/ # ETL steps
│ ├── export/ # Data export steps
│ ├── feature_engineering/ # Feature engineering steps
│ └── dataset_generation/ # Dataset generation steps
├── tests/ # Test examples
├── tools/ # Utility scripts
│ ├── run.py # Entry point to run ZenML pipelines
│ ├── ml_service.py # Starts the REST API inference server
│ ├── rag.py # Demonstrates usage of the RAG retrieval module
│ └── data_warehouse.py # Export/import data from MongoDB data warehouse
└── pyproject.toml # dependencies, Poe tasks and project metadata
The core/ package is the heart of the project, implementing both the RAG and LLM train/inference pipelines. It follows Domain-Driven Design (DDD) principles, promoting modularity:
- domain/: Core business entities and structures (Pydantic models for documents, datasets, vectors, etc.)
- application/: Encapsulates business logic, crawlers, data preprocessing flows, and the RAG implementation
- model/: LLM training, evaluation, and inference
- infrastructure/: External service integrations (Qdrant, MongoDB, FastAPI, Opik)
Source-code dependencies only point inward (outer → inner). Inner layers know nothing about outer ones—this is the core “dependency rule” of DDD (DDD Reference, PDF).
- The domain layer is independent.
- The application layer depends on domain and exposes use cases.
- The model layer (LLM training/tuning/inference) may depend on application and domain as needed. (Note: model/ here refers to ML/LLM code, not the DDD "domain model".)
- The infrastructure layer depends on the inner layers (application, domain, and, where applicable, model).

┌────────────────────────────┐
│          configs/          │
│  (YAML pipeline settings)  │
└────────────┬───────────────┘
             │
             ▼
┌────────────────────────────┐
│         pipelines/         │
│ ZenML pipeline definitions │
└────────────┬───────────────┘
             │
             ▼
┌────────────────────────────┐
│           steps/           │
│    Reusable ZenML steps    │
│ (ETL, FE, dataset gen...)  │
└────────────┬───────────────┘
             │
             ▼
┌──────────────────────────────────────────────┐
│                    core/                     │
│  Main package implementing LLM + RAG logic   │
├──────────────────────────────────────────────┤
│ domain/         → Pydantic entities/models   │
│ application/    → Crawlers, preprocessing,   │
│                   RAG orchestration          │
│ model/          → LLM training + inference   │
│ infrastructure/ → DB connectors, FastAPI     │
│ settings.py     → Environment-driven config  │
└──────────────────────┬───────────────────────┘
                       │
                       ▼
┌───────────────────────────────┐
│            tools/             │
│  CLI scripts & API launchers  │
│ (run.py, ml_service.py etc.)  │
└─────────────┬─────────────────┘
              │
              ▼
┌───────────────────────────────┐
│            tests/             │
│  CI examples & validations    │
└───────────────────────────────┘
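To make the dependency rule concrete, here is a minimal, hypothetical sketch of the layering. The class and module names are illustrative and do not mirror the project's actual code: a domain entity, an application use case that imports only the domain, and an infrastructure adapter that depends on the inner layers.

```python
# Illustrative only: hypothetical classes showing the inward-pointing
# dependency rule described above (not the project's actual code).
from pydantic import BaseModel


# --- domain/ : pure entities, no outward imports --------------------------
class CleanedDocument(BaseModel):
    id: str
    author_full_name: str
    content: str


# --- application/ : use cases that depend only on the domain layer --------
class CleanDocumentsUseCase:
    def execute(self, raw_texts: list[str]) -> list[CleanedDocument]:
        # hypothetical cleaning logic; real preprocessing lives in core/application/
        return [
            CleanedDocument(id=str(i), author_full_name="unknown", content=t.strip())
            for i, t in enumerate(raw_texts)
        ]


# --- infrastructure/ : adapters that depend on the inner layers -----------
class MongoDocumentRepository:
    """Hypothetical adapter; the real connectors live in core/infrastructure/."""

    def save(self, docs: list[CleanedDocument]) -> None:
        for doc in docs:
            print(f"would persist {doc.id} to MongoDB")
```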
Contains YAML configuration files for ZenML pipelines, steps, and runtime parameters.
Contains ZenML ML pipelines that serve as entry points for data preparation, feature engineering, model training, evaluation, and RAG updates.
Each pipeline coordinates multiple reusable steps (from steps/) to define complete end-to-end workflows.
This folder contains ZenML step implementations, each encapsulating a single, reusable operation within a ZenML pipeline. Examples include:
- etl/ – Data ingestion, cleaning, and transformation
- export/ – Model or dataset export utilities
- feature_engineering/ – Feature generation and transformation logic
- dataset_generation/ – Creation of labeled or synthetic datasets for training
Steps perform specific tasks (e.g., data loading, preprocessing) and can be combined within the ML pipelines.
Utility scripts that simplify interaction with the system:
- run.py: Runs ZenML pipelines directly from the CLI.
- ml_service.py: Starts the REST API inference server.
- rag.py: Demonstrates usage of the RAG retrieval and document ranking functionality.
- data_warehouse.py: Handles JSON-based import/export to/from the MongoDB data warehouse.
These scripts serve as both operational utilities and usage examples.
Includes lightweight test examples, covering functional and integration use cases. They are primarily designed for CI/CD validation and as templates for extending test coverage.
Follow these steps to set up the project locally.
Start by cloning the repository and navigating to the project directory:
git clone https://github.com/ahmedshahriar/llm-ghostwriter.git
cd llm-ghostwriter

Next, prepare the Python environment and related dependencies.
The project requires Python 3.11. Either use your global Python installation or set up a project-specific version using pyenv.
Verify Python version:
python --version # Should show Python 3.11.x

- Verify pyenv installation:

pyenv --version # Should show pyenv 2.6.12 or later

- Install Python 3.11.8:

pyenv install 3.11.8

- Verify the installation:

python --version # Should show Python 3.11.8

- Confirm Python version in the project directory:

python --version
# Output: Python 3.11.8

Note
The project includes a .python-version file that automatically sets the correct Python version when you're in the project directory.
The project uses Poetry for dependency management.
- Verify Poetry installation:
poetry --version # Should show Poetry version 2.2.1 or later

- Set up the project environment and install dependencies:
poetry env use 3.11
poetry install
poetry run pre-commit install

This will:
- Configure Poetry to use Python 3.11
- Install project dependencies
- Set up pre-commit hooks for code verification
Use Poe the Poet as the task runner for project scripts.
- Start a Poetry shell:
poetry shell

- Run project commands using Poe the Poet:

poetry poe ...

🔧 Troubleshooting Poe the Poet Installation
If issues occur with poethepoet, project commands can be run directly through Poetry:
- Locate the command definition in pyproject.toml
- Execute it using poetry run followed by the command itself
Instead of:
poetry poe local-infrastructure-up

Use the direct command from pyproject.toml:

poetry run <actual-command-from-pyproject-toml>

Note: All project commands are defined in the [tool.poe.tasks] section of pyproject.toml.
After dependencies are installed, create a .env file at the repository root and define credentials for external services.
Storing secrets in .env prevents them from being committed to GitHub.
- First, copy the example env file by running the following:
cp .env.example .env # The file must be at your repository's root!

- Fill in the essential variables to enable local execution:
Provide an authentication token for OpenAI’s API:
OPENAI_API_KEY=your_api_key_here

→ Check out this tutorial to learn how to provide one from OpenAI.
Provide a Hugging Face access token:
HUGGINGFACE_ACCESS_TOKEN=your_token_here

→ Check out this tutorial to learn how to provide one from Hugging Face.
To authenticate to Comet ML (required only during training) and Opik, fill out the COMET_API_KEY env var with your authentication token.
COMET_API_KEY=your_api_key_here

→ Check out this tutorial to learn how to get started with Opik. You can also access Opik's dashboard using 🔗 this link.
To deploy the project, configure the external databases MongoDB and Qdrant.
Update the DATABASE_HOST variable with the MongoDB cluster URL:
DATABASE_HOST=your_mongodb_url→ Check out this tutorial to learn how to create and host a MongoDB cluster for free.
Enable Qdrant Cloud and set the endpoint and API key:
Change USE_QDRANT_CLOUD to true, set QDRANT_CLOUD_URL with the URL pointing to your cloud Qdrant cluster, and set QDRANT_APIKEY with its API key.
USE_QDRANT_CLOUD=true
QDRANT_CLOUD_URL=your_qdrant_cloud_url
QDRANT_APIKEY=your_qdrant_api_key

Important
Additional configuration options are available in settings.py. Any variable in the Settings class can be configured through the .env file.
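As a rough illustration of how such env-driven settings are commonly modeled with pydantic-settings (the field names below mirror the variables above, but the actual Settings class in settings.py may differ in fields and defaults):

```python
# Hypothetical sketch of an .env-driven settings class using pydantic-settings;
# the real core/settings.py may define different fields, defaults, and sources.
from pydantic_settings import BaseSettings, SettingsConfigDict


class Settings(BaseSettings):
    model_config = SettingsConfigDict(env_file=".env", env_file_encoding="utf-8")

    OPENAI_API_KEY: str | None = None
    HUGGINGFACE_ACCESS_TOKEN: str | None = None
    COMET_API_KEY: str | None = None

    DATABASE_HOST: str = "mongodb://localhost:27017"  # placeholder default
    USE_QDRANT_CLOUD: bool = False
    QDRANT_CLOUD_URL: str | None = None
    QDRANT_APIKEY: str | None = None


settings = Settings()  # values in .env override the defaults above
```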
For local execution, the project runs MongoDB and Qdrant instances through Docker. A local ZenML server is also started via the Python package.
Warning
Docker (>= v28.5.1) must be installed.
Start the entire local development stack:
poetry poe local-infrastructure-up

Stop the ZenML server and all Docker containers:

poetry poe local-infrastructure-down

Warning
When running on macOS, before starting the server, export the following environment variable:
export OBJC_DISABLE_INITIALIZE_FORK_SAFETY=YES
Otherwise, the connection between the local server and pipeline will break. 🔗 More details in this issue.
This is done by default when using Poe the Poet.
Start the real-time inference REST API:
poetry poe run-inference-ml-service

Dashboard URL: http://localhost:8237
Default credentials:
username: default
→ Learn more: ZenML.
REST API URL: http://localhost:6333
Dashboard URL: http://localhost:6333/dashboard
→ Learn more: Qdrant with Docker.
Database URI: mongodb://ghost:[email protected]:27017 (local development only)
Database name: ghost_db
Default credentials:
username: ghost
password: ghost
→ Learn more: MongoDB with Docker.
All the ML pipelines will be orchestrated behind the scenes by ZenML. A few exceptions exist when running utility scripts, such as exporting or importing from the data warehouse.
The ZenML pipelines are the entry point for most processes throughout this project. They are under the pipelines/ folder. Thus, when you want to understand or debug a workflow, starting with the ZenML pipeline is the best approach.
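For orientation, a ZenML pipeline wires steps together roughly as in the minimal sketch below; the step and pipeline names here are hypothetical, while the project's real definitions live in pipelines/ and steps/.

```python
# Minimal ZenML sketch with hypothetical step/pipeline names; the project's
# actual pipelines live in pipelines/ and their steps in steps/.
from zenml import pipeline, step


@step
def crawl_links(links: list[str]) -> list[str]:
    # pretend each link yields one raw document
    return [f"raw content from {link}" for link in links]


@step
def clean_documents(raw_docs: list[str]) -> list[str]:
    # toy cleaning step; the real logic lives in core/application/
    return [doc.strip().lower() for doc in raw_docs]


@pipeline
def digital_data_etl_demo(links: list[str]):
    raw = crawl_links(links)
    clean_documents(raw)


if __name__ == "__main__":
    # requires an initialized ZenML stack (e.g., the local one started above)
    digital_data_etl_demo(links=["https://example.com/post-1"])
```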
To see the pipelines running and their results:
- go to your ZenML dashboard
- go to the Pipelines section
- click on a specific pipeline (e.g., digital_data_etl)
- click on a specific run (e.g., digital_data_etl_run_2025_09_20_14_00_52)
- click on a specific step or artifact of the DAG to find more details about it
Here are some sample screenshots from the ETL pipeline run:
Now, let's explore all the pipelines you can run.
Run the data collection ETL:
poetry poe run-digital-data-etl

To add additional links to collect from, go to configs/digital_data_etl_[author_name].yaml and add them to the links field.
Run the feature engineering pipeline:
poetry poe run-feature-engineering-pipeline

Generate the instruct dataset:

poetry poe run-generate-instruct-datasets-pipeline

Generate the preference dataset:

poetry poe run-generate-preference-datasets-pipeline

Export the data from the data warehouse to JSON files:

poetry poe run-export-data-warehouse-to-json

Import data to the data warehouse from JSON files (by default, it imports the data from the data/data_warehouse_raw_data directory):

poetry poe run-import-data-warehouse-from-json

Export ZenML artifacts to JSON:

poetry poe run-export-artifact-to-json-pipeline

This will export the following ZenML artifacts to the output folder as JSON files (it will use their latest version):
- cleaned_documents.json
- instruct_datasets.json
- preference_datasets.json
- raw_documents.json
You can configure what artifacts to export by tweaking the configs/export_artifact_to_json.yaml configuration file.
Model fine-tuning and evaluation were conducted in Google Colab using GPU accelerators.
The training code is available in the following notebooks (under model/finetuning/ and model/evaluation/):
Dependencies:
- Environment: Google Colab and Kaggle (T4/A100 GPU)
- Base model: Llama-3.2-1B
- Models released:
- Uses a 95% training / 5% test split for validation during training.
- Performs LoRA-based fine-tuning using Unsloth for accelerated training and reduced memory footprint.
- Utilizes mixed-precision training (fp16/bf16) and the 8-bit Adam optimizer (weight decay=0.01) for efficiency. Depending on the GPU, it automatically uses FP16 or BF16 for the activations.
- Metrics logged to Comet ML for live monitoring
- Loads the instruct dataset (datasets/ahmedshahriar/llmGhostWriter) and supplements it with additional domain-curated samples (datasets/yahma/alpaca-cleaned) to improve coverage and stylistic diversity.
- Initializes the base model Llama-3.2-1B.
- Formats all samples using an Alpaca-style chat template to standardize instruction–response structure and ensure clean prompt alignment
- Trains via the SFTTrainer with key parameters: learning_rate=3e-4, num_train_epochs=3, batch_size=2, and gradient accumulation (effective batch size 16); see the sketch after this list.
- Final checkpoints published to HF Hub: GhostWriterLlama-3.2-1B.
- Comet ML Metrics dashboard: GhostWriterLlama-3.2-1B-SFT Experiment
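The sketch below shows roughly how such an SFT run could be configured with TRL using the hyperparameters listed above. It omits the Unsloth/LoRA wiring and the Alpaca-style formatting used in the actual notebook; the dataset split name and the "text" column are assumptions, and argument names follow recent TRL versions.

```python
# Simplified SFT sketch with TRL; the actual notebook (core/model/finetuning/)
# adds Unsloth/LoRA and Alpaca-style prompt formatting.
from datasets import load_dataset
from transformers import AutoModelForCausalLM, AutoTokenizer
from trl import SFTConfig, SFTTrainer

base_model = "meta-llama/Llama-3.2-1B"
model = AutoModelForCausalLM.from_pretrained(base_model)
tokenizer = AutoTokenizer.from_pretrained(base_model)

# assumes the dataset exposes a formatted "text" column and a "train" split;
# the notebook builds the text via an Alpaca-style template
dataset = load_dataset("ahmedshahriar/llmGhostWriter", split="train")

args = SFTConfig(
    output_dir="ghostwriter-sft",
    learning_rate=3e-4,
    num_train_epochs=3,
    per_device_train_batch_size=2,
    gradient_accumulation_steps=8,  # effective batch size of 16
)

trainer = SFTTrainer(model=model, args=args, train_dataset=dataset, processing_class=tokenizer)
trainer.train()
```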
- Initializes the SFT-tuned checkpoint as the base: GhostWriterLlama-3.2-1B.
- Uses a (prompt, chosen, rejected) preference dataset (datasets/ahmedshahriar/llmGhostWriter-dpo) formatted with an Alpaca-style template; the uniform prompt shape reduces prompt-induced variance during preference learning.
- Trains with DPOTrainer (see the sketch after this list) using:
- beta=0.5 to strengthen reference-model regularization. It controls the importance of the reference model; a standard value of 0.1 works well in most scenarios.
- A lower learning rate (from 3e-4 for SFT to 2e-6 here) and 1 training epoch instead of 3.
- The max_seq_length parameter is broken down into two new parameters: max_prompt_length (prompt only) and max_length (prompt and answer).
- Final checkpoints published to HF Hub: GhostWriterLlama-3.2-1B-DPO
- Comet ML Metrics dashboard: GhostWriterLlama-3.2-1B-DPO Experiment
beta=0.5 was chosen because the trained model used overly formal language at lower values; keeping the policy closer to the reference model helps fix this issue.
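A comparable sketch for the DPO stage with TRL, using the beta, learning rate, epoch count, and length parameters described above. The split name and the length values are assumptions, and processing_class is the argument name in recent TRL versions (older versions use tokenizer).

```python
# Simplified DPO sketch with TRL; the notebook (core/model/finetuning/) adds
# Unsloth/LoRA and the Alpaca-style formatting described above.
from datasets import load_dataset
from transformers import AutoModelForCausalLM, AutoTokenizer
from trl import DPOConfig, DPOTrainer

sft_checkpoint = "ahmedshahriar/GhostWriterLlama-3.2-1B"  # SFT model as the starting point
model = AutoModelForCausalLM.from_pretrained(sft_checkpoint)
tokenizer = AutoTokenizer.from_pretrained(sft_checkpoint)

# (prompt, chosen, rejected) preference dataset mentioned above; split name assumed
dataset = load_dataset("ahmedshahriar/llmGhostWriter-dpo", split="train")

args = DPOConfig(
    output_dir="ghostwriter-dpo",
    beta=0.5,               # stronger pull toward the reference model than the usual 0.1
    learning_rate=2e-6,     # much lower than the 3e-4 used for SFT
    num_train_epochs=1,
    max_prompt_length=512,  # illustrative lengths; tune to your data
    max_length=1024,
)

trainer = DPOTrainer(model=model, args=args, train_dataset=dataset, processing_class=tokenizer)
trainer.train()
```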
✏️ Response Comparison Example: SFT vs. DPO Output
Prompt:
“Write a paragraph to introduce supervised fine-tuning.”
SFT Model (GhostWriterLlama-3.2-1B):
Supervised fine-tuning is a technique used to improve a language model by teaching it to predict the next word based on examples provided by a labeled dataset. This process involves training the model to understand the relationships between words and how to predict the next one given a set of previous tokens. Supervised fine-tuning is commonly used with language models like GPT-3 and is considered a state-of-the-art approach for text generation.
DPO Model (GhostWriterLlama-3.2-1B-DPO):
Supervised fine-tuning is a technique used in natural language processing and machine learning to improve the performance of a model by training it on a large dataset of labeled examples. This process involves feeding the model examples of correct and incorrect responses, and using this information to adjust the model's parameters to better predict the correct response. Supervised fine-tuning is commonly used in tasks such as text classification, sentiment analysis, and machine translation, where the goal is to train the model to accurately classify or translate between languages. By fine-tuning the model on a labeled dataset, the model learns to better recognize and respond to the specific prompts and expectations of the task at hand.
💡 The DPO-tuned model produces a more detailed, technically grounded explanation—showing improved factual completeness and task awareness—while maintaining clarity and relevance for educational or applied writing contexts.
Evaluation was conducted via core/model/evaluation/llm_ghostwriter_evaluate.ipynb, which performs automated answer generation and LLM-based scoring to evaluate accuracy and stylistic alignment across models.
- The custom evaluation framework uses a judge LLM (GPT-4.1-nano) to evaluate model outputs on the test split of the instruction dataset.
- Each prediction is scored on a 1–3 scale across two criteria:
- Accuracy: factual correctness and completeness of the response
- Style: tone and writing suitability for blogs and articles
- The framework feeds test prompts to both models, collects outputs, and scores them via the judge LLM prompt.
- Results are analyzed through qualitative and quantitative evaluation to assess overall writing quality and alignment.
- Generates model outputs for each test instruction using vLLM for fast batched inference.
- Evaluates three models - Meta-Llama-3.2-1B-Instruct as the reference baseline, alongside both GhostWriter LLM variants (SFT-tuned and DPO-tuned).
- Prompts are formatted consistently using a unified chat template.
- Outputs are generated with diversity-oriented sampling (temperature=0.8, top_p=0.95, min_p=0.05) and saved to the Hugging Face Hub for later evaluation and analysis; a generation sketch follows this list.
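A condensed sketch of this generation step with vLLM and the sampling settings above; the chat-template formatting and dataset handling from the evaluation notebook are omitted, and the example prompt is illustrative.

```python
# Batched answer generation with vLLM using the sampling settings listed above;
# prompt formatting and dataset handling are simplified.
from vllm import LLM, SamplingParams

# swap in each model under evaluation (baseline, SFT, DPO)
llm = LLM(model="ahmedshahriar/GhostWriterLlama-3.2-1B-DPO")

sampling = SamplingParams(temperature=0.8, top_p=0.95, min_p=0.05, max_tokens=512)

test_instructions = [
    "Write a paragraph to introduce supervised fine-tuning.",  # example prompt from above
]

outputs = llm.generate(test_instructions, sampling)
for out in outputs:
    print(out.outputs[0].text)
```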
- Uses GPT-4.1-nano as a judge LLM to evaluate model outputs on accuracy and style using a 1–3 scoring scale.
- Each instruction–answer pair is assessed through a structured JSON evaluation prompt
- The process runs in parallel batches for efficiency, with results stored as structured JSON objects.
- Extracted scores (accuracy, style) are added to the dataset and finally pushed to the Hugging Face Hub for further analysis and comparison across models.
- Results Dataset:
| Model | Accuracy ↑ | Style ↑ |
|---|---|---|
| GhostWriterLlama-3.2-1B | 2.42 | 2.92 |
| GhostWriterLlama-3.2-1B-DPO | 2.48 | 2.95 |
| Llama-3.2-1B-Instruct (baseline) | 2.64 | 2.91 |
🧩 The DPO-tuned variant slightly improves both stylistic alignment and factual accuracy over the base fine-tuned model, approaching the performance of the strong Llama-3.2-1B-Instruct baseline. While the baseline retains a small edge in accuracy due to its extensive instruction tuning, the DPO model produces more natural, accessible writing better suited for blog and creative content.
Note
The current inference pipeline API runs on the DPO-tuned model, using the transformers library for generation.
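For reference, generating directly with the released DPO checkpoint via transformers (outside the REST API) can look like the sketch below; the generation settings are illustrative, and the project's service wraps similar logic behind FastAPI.

```python
# Standalone generation with the published DPO model via transformers;
# sampling parameters here are illustrative, not the service's exact settings.
from transformers import pipeline

generator = pipeline("text-generation", model="ahmedshahriar/GhostWriterLlama-3.2-1B-DPO")

prompt = "Write a paragraph to introduce supervised fine-tuning."
result = generator(prompt, max_new_tokens=256, do_sample=True, temperature=0.8, top_p=0.95)
print(result[0]["generated_text"])
```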
Future updates will integrate cloud-based training pipelines (AWS SageMaker/Azure ML) within ZenML cloud for automated deployment and scaling.
- Warehouse → Vector DB: MongoDB → Qdrant
- Embeddings: SentenceTransformers (sentence-transformers/all-MiniLM-L6-v2, configurable)
- Re-ranker: cross-encoder/ms-marco-MiniLM-L4-v2 (configurable)
- LLMs: OpenAI (gpt-4.1-nano) or local via Ollama (llama3.2:3b) (configurable)
- Processing: LangChain
- Serving: FastAPI (HTTP) → LLM endpoint
- Orchestration: ZenML
- Goal: Keep a logical feature store in sync with MongoDB and prepare data for RAG (chunk and embed) and training (clean text artifacts).
- Why batch: Small volume (thousands → tens of thousands), minute-level freshness is fine. Batch = simpler, cheaper, easier to operate than streaming.
- Outputs:
- RAG-ready index: chunk embeddings + metadata in Qdrant.
- Training snapshot: cleaned documents, versioned via ZenML, for dataset creation.
MongoDB → Clean → Chunk → Embed → Qdrant (vectors with metadata)
└──► ZenML Artifacts (cleaned-doc snapshots)
- Provide a reliable path from raw documents to features used by RAG retrieval and fine-tuning pipelines.
- Keep the feature store synchronized with the warehouse with acceptable freshness for RAG.
- Stay simple to run today; leave a clear path to CDC/streaming when scale demands.
- Documents in MongoDB (articles, repos/code files, posts).
- Metadata: document ID, source, url, author, created_at, content_type, etc.
- Cleaned documents stored as records/artifacts for fine-tuning dataset creation.
- Chunk embeddings with metadata indexed in Qdrant for retrieval.
- Operational metrics/logs to monitor freshness and pipeline health.
- Extract (MongoDB)
- Pull latest/updated documents on a schedule or manual trigger.
- Read in batches to control memory and enable retries.
- Clean
- Normalize encodings/whitespace; replace inline URLs with placeholders.
- Remove emojis/boilerplate; deduplicate (document and paragraph level).
- Attach lineage and provenance (document ID, source, created_at, updated_at).
- Emit a cleaned snapshot for training (versioned via ZenML or as metadata-only rows in Qdrant if needed).
- Chunk
- Split by content type: wider windows for code; paragraph/sentence for articles.
- Respect the embedding model's maximum input length; assign a stable chunk ID; retain the parent document ID for provenance.
- Embed
- Encode each chunk with a configurable model (default: sentence-transformers/all-MiniLM-L6-v2); see the sketch after this list.
- Batch on CPU/GPU; record failures and retry/fallback.
- Load to Feature Store
- Vectors + metadata → Qdrant (upsert semantics).
- Cleaned docs snapshot → ZenML artifacts (or Qdrant metadata) for downstream fine-tuning.
- Retrieval Client
- Thin client hits Qdrant with filters (source/type/date) and returns top-k + metadata for prompt assembly or dataset curation.
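A condensed sketch of the chunk → embed → upsert portion of this flow, assuming the default embedding model; the collection name and payload fields are placeholders rather than the project's actual schema.

```python
# Minimal chunk-embed-upsert sketch; collection name, payload fields, and the
# single hard-coded chunk are simplified stand-ins for the real pipeline steps.
import uuid

from qdrant_client import QdrantClient
from qdrant_client.models import Distance, PointStruct, VectorParams
from sentence_transformers import SentenceTransformer

embedder = SentenceTransformer("sentence-transformers/all-MiniLM-L6-v2")
client = QdrantClient(url="http://localhost:6333")

collection = "cleaned_articles"  # hypothetical collection name
client.recreate_collection(
    collection_name=collection,
    vectors_config=VectorParams(
        size=embedder.get_sentence_embedding_dimension(), distance=Distance.COSINE
    ),
)

chunks = [
    {"document_id": "doc-1", "chunk_id": "doc-1-0", "text": "Example paragraph about MLOps..."},
]
vectors = embedder.encode([c["text"] for c in chunks], batch_size=32)

client.upsert(
    collection_name=collection,
    points=[
        PointStruct(id=str(uuid.uuid4()), vector=vec.tolist(), payload=chunk)
        for chunk, vec in zip(chunks, vectors)
    ],
)
```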
- Online store (for RAG): The vector database (Qdrant) holds chunk embeddings and metadata used for retrieval at inference time.
- It can also keep the cleaned documents as metadata-only rows to support fast lookups and filtering without touching the warehouse.
- Offline store (for Training): ZenML artifacts store versioned cleaned text snapshots that are used to build reproducible fine-tuning datasets.
- Training pipeline reads these artifacts, not from the MongoDB warehouse, which makes runs reproducible and easy to roll back.
- Why two snapshots: This keeps the shared MongoDB warehouse generic and avoids per–use-case "cleaned" tables. Training and inference pipelines read exclusively from the feature store, providing one consistent surface with clear lineage (document ID and chunk ID) and a straightforward path to rebuild or refresh the online index from the same cleaned snapshot.
- Current approach is periodic batch polling that reads the warehouse (MongoDB) and upserts into Qdrant. Works well for this use case (small data).
- When volume grows, upgrade options:
- Timestamp-based pull to fetch only new or updated rows.
- Trigger-based replication for full CRUD tracking.
- Log-based CDC with a queue and a stream processor for low latency and delete handling at scale.
- Embedding model, chunk sizes/overlaps, batch sizes, collection names, and schedules are config-driven.
- Defaults favor portability and modest hardware; can be swapped for stronger models as resources allow.
- Small to medium datasets; near–real-time not required.
- Tokenization at runtime to stay flexible across models/tasks.
- Clear migration path to streaming/CDC when volume, latency, or delete-handling demands it.
Result: a simple, reproducible batch pipeline that keeps retrieval assets fresh while also feeding training with curated, versioned text.
Turn a user query into a grounded answer by expanding the query, retrieving the most relevant chunks from the vector database, re-ranking them, and calling the LLM inference endpoint with a compact, cited context.
- Self-Query
- Use self-querying to pull the author identifier (name or ID) from the question; if none is found, pass the query through unchanged.
- Resolve the identifier to a canonical author record and attach structured fields (e.g., author_id, author_full_name) to the query; use these as strict filters alongside vector similarity to ensure results come from the intended author.
- Query Expansion
- Generate several diverse reformulations of the query (semantic paraphrases and filter-aware variants).
- Each expansion preserves or tightens the filters discovered during self-query.
In this project, the OpenAI model gpt-4.1-nano (via the OpenAI API) and, for the local setup, llama3.2:3b (via Ollama) were used for self-query and query expansion. Both are configurable and can be swapped for open-source LLMs.
- Filtered Vector Search
- Run n parallel searches (one per expanded query) in Qdrant while applying hard metadata filters (author, source type, time window) to narrow the search space and improve accuracy and latency.
- Merge results from all expansions and remove near-duplicates or stale chunks using parent document identifiers, yielding up to n×k candidates (n = total queries after expansion; k = top-k per query; e.g., n=4, k=8 → 32).
- Optionally budget results per source category (for example, articles, posts, repositories) so no single source dominates the context.
- Re-ranking
- Score every candidate against the original question with a cross-encoder or lightweight LLM re-ranker (default: cross-encoder/ms-marco-MiniLM-L4-v2), reorder by relevance, and keep only the top-k chunks for context (a search-and-re-rank sketch follows this list).
- This sharpens precision, reduces noise and prompt size, and pairs well with query expansion: first cast a wider net, then focus on the most relevant evidence.
- Context Builder
- Trim content to the prompt token budget while preserving diversity and coverage.
- Maintain chunk order within each document when helpful for readability.
- Attach citations for every chunk (document ID, chunk ID, and canonical URL).
- Prompt Assembly
- Compose the final prompt: system instructions, guardrails, the user query, and the ranked context with citations.
- LLM Inference
- Send the prompt to the LLM inference endpoint over HTTP API (FastAPI).
- Return the answer together with the list of cited sources. Optional debug information can include the expanded queries and timing.
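To illustrate the filtered search and re-ranking steps above, here is a condensed sketch against Qdrant with the default cross-encoder; the collection name, payload fields (author_full_name, text), and the author value are assumptions rather than the project's exact schema.

```python
# Filtered vector search + cross-encoder re-ranking sketch; payload field and
# collection names are assumptions, not the project's exact schema.
from qdrant_client import QdrantClient
from qdrant_client.models import FieldCondition, Filter, MatchValue
from sentence_transformers import CrossEncoder, SentenceTransformer

query = "Write a post about supervised fine-tuning."
embedder = SentenceTransformer("sentence-transformers/all-MiniLM-L6-v2")
client = QdrantClient(url="http://localhost:6333")

hits = client.search(
    collection_name="cleaned_articles",            # hypothetical collection
    query_vector=embedder.encode(query).tolist(),
    query_filter=Filter(
        must=[FieldCondition(key="author_full_name", match=MatchValue(value="Jane Doe"))]
    ),
    limit=8,  # k per expanded query
)

# Re-rank the merged candidates against the original query and keep the best few.
reranker = CrossEncoder("cross-encoder/ms-marco-MiniLM-L4-v2")
scored = reranker.predict([(query, h.payload["text"]) for h in hits])  # "text" field assumed
top_chunks = [h for _, h in sorted(zip(scored, hits), key=lambda p: p[0], reverse=True)[:3]]
```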
- User query text and, if provided, explicit filters.
- Qdrant collections containing chunk embeddings and metadata (document ID, source type, author, timestamps, URL).
- Final answer grounded in retrieved context.
- Citations that reference the contributing documents and chunks.
- Optional diagnostics: expanded queries, number of results at each stage, latency per stage.
- Number of query expansions.
- Top-k per search and the number of chunks kept after re-ranking.
- Per-source result budgets to balance articles, posts, and repositories.
- Metadata filters applied automatically or supplied by the caller.
- Token budget for the context and the LLM model or endpoint name.
- Re-ranker model choice.
- Observability: log expansions, hit counts, re-rank scores, prompt tokens, and latency for search, re-rank, and LLM inference. Track end-to-end freshness when queries include time filters.
- Failure handling: if re-ranking fails, fall back to vector similarity order; if a metadata filter is invalid, drop it and warn rather than failing the request.
- Safety: enforce max context size, strip dangerous markup, and cap the number of external links included as citations.
Result: a predictable and debuggable runtime that converts a single query into a well-grounded answer with transparent sources.
The training experiments are tracked using Comet ML. You can explore the public project workspace here → LLMGhostWriter on Comet
Both Supervised Fine-Tuning (SFT) and Direct Preference Optimization (DPO) experiments are logged, including:
- Training and validation losses
- Hyperparameters, evaluation metrics, and generated outputs
- System resource usage (GPU, memory, network)
🖼️ Example Screenshots: Experiment Tracking on Comet ML
Experiment tracking in Comet ML with SFT metrics
Experiment tracking in Comet ML with DPO metrics
✅ All metrics, traces, and visualizations are automatically logged when the COMET_API_KEY environment variable is configured.
Prompt–response traces from the RAG pipeline can be visualized interactively using Opik, allowing step-by-step insight into retrieval and generation behavior.
It provides insights into:
- Prompt inputs and generated outputs
- Retrieval context and document sources
- Latency and resource usage
- Error tracking and debugging
Example screenshots of RAG trace visualizations:
CI checks are automated via .github/workflows/ci.yml on every pull request.
# Linting & formatting with Ruff via Poetry
poetry poe lint-check # Check code style/lint errors
poetry poe format-check # Check coding style/formatting
poetry poe lint-fix # Apply automatic lint fixes
poetry poe format-fix # Apply automatic format fixes
poetry poe test # Run all the tests

The .github/workflows/cd.yml workflow automates building and publishing the Docker image to the GitHub Container Registry (GHCR).
With infrastructure up and .env configured, follow the steps below to run the system end-to-end:
- Collect data: poetry poe run-digital-data-etl
- Compute features: poetry poe run-feature-engineering-pipeline
- Generate the instruct dataset: poetry poe run-generate-instruct-datasets-pipeline
- Generate the preference alignment dataset: poetry poe run-generate-preference-datasets-pipeline
- Call only the RAG retrieval module: poetry poe call-rag-retrieval-module
- Start the end-to-end RAG server: poetry poe run-inference-ml-service
- Test the RAG server: poetry poe call-inference-ml-service (configure the payload in tests/payloads/payload.json.example)
- Export settings to ZenML: poetry poe export-settings-to-zenml
- Delete settings from ZenML: poetry poe delete-settings-zenml
- Cloud-based training pipelines (AWS SageMaker/Azure ML) within ZenML cloud for automated deployment and scaling.
- Advanced evaluation framework integrating human-in-the-loop feedback.
- Enhanced RAG capabilities with multi-modal data sources.
- Strengthen domain independence by removing outward references.
- Introduce thin ports for storage/vector services and wire them in the application.
This repository is for educational and research use only. It may reference or link to publicly available third-party content. All copyrights and trademarks remain with their respective owners. No endorsement or affiliation is implied.
By using this project you agree to:
- Comply with the licenses and terms of service of any third-party sources (including robots.txt and rate limits) when collecting or using data.
- Use the provided code, datasets, and model artifacts at your own risk; they are provided “as is” without warranties of any kind.
- Avoid including personal, proprietary, or sensitive data in configuration, prompts, logs, or datasets.
Data scope. This repo does not claim ownership of third-party content. If datasets in this project include links, metadata, or limited excerpts for evaluation or retrieval indexing, they are provided solely for research/illustrative purposes. If you redistribute any third-party text, you are responsible for ensuring you have the right to do so (or removing that content).
Models. Trained models or checkpoints may reflect patterns learned from training data. You are responsible for ensuring any downstream use complies with applicable rights and licenses of the underlying sources.
Non-commercial intent: This project is intended for non-commercial research and education.
Contact/takedown: If you believe something here should be removed, please open a GitHub issue with the relevant URLs.
This project draws on public posts, code, and tutorials from Comet ML, the Hugging Face blog, GitHub, ZenML, Medium, and other community sources. Thanks to the authors whose educational materials helped shape the pipeline design and evaluation approach.
This project references public writings by the following authors/sites for research and evaluation. Please see their original posts for full content and licensing:
- Chip Huyen: huyenchip.com
- Jay Alammar: newsletter.languagemodels.co
- Code is licensed under Apache-2.0 (see LICENSE).
- Datasets and model artifacts are subject to their original licenses; see the respective source pages for details.
- By using this project, you agree to the terms outlined in the Disclaimer section.








