rag-ingest is a simple pipeline for ingesting documents into a vector store and serving them for retrieval-augmented generation (RAG). I use this project to ingest the user manuals of all the devices in my house and then chat with Llama 3.1 to troubleshoot any issues I run into. I created this repo for educational purposes only: CPUs are far less efficient than GPUs at inference, so for a real use case it makes sense to run on CUDA-capable hardware for significant performance gains.
This project allows you to:
- Prepare and convert language models to the GGUF format.
- Split and embed text documents using CPU-agnostic libraries.
- Store embeddings in a vector store for fast similarity search.
Key features:

- CPU-Agnostic: Use libraries that run efficiently on any CPU without requiring GPU acceleration.
- Modular Pipeline: Separate steps for model preparation, document ingestion, and query serving.
- Reproducible: Clear instructions to download models, convert formats, and ingest data.
Requirements:

- Python 3.8+
- llama-cpp-python (for gguf model loading)
- sentence-transformers or transformers (CPU mode)
- faiss-cpu or chromadb
- Hugging Face CLI (`huggingface_hub`)
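A minimal install of the CPU-only stack might look like this (package names are from the list above; pick either `faiss-cpu` or `chromadb`):

```bash
# CPU-only dependencies; choose faiss-cpu or chromadb for the vector store
pip install llama-cpp-python sentence-transformers faiss-cpu huggingface_hub
```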
The instructions below are generic, so you can download the model of your choice. I went with Llama 3.1 8B: I downloaded the model files and then converted them to GGUF format. The steps for my own setup are documented in SETUP.md.
- **Get a pre-trained model**

  ```bash
  pip install huggingface_hub
  huggingface-cli login
  huggingface-cli repo clone <model-id> ./model
  ```
- **Convert to GGUF**

  GGUF is the llama.cpp “general-purpose” file format for quantized inference.

  ```bash
  git clone https://github.com/ggerganov/llama.cpp
  cd llama.cpp
  # build the converter
  make
  # convert
  ./convert-ggml-to-gguf models/<model>.bin models/<model>.gguf
  ```

  GGUF is a binary format optimized for the llama.cpp runtime: it supports quantized weights, loads quickly, and is CPU-friendly.
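  Once converted, the GGUF file can be loaded for CPU inference via `llama-cpp-python`. A minimal sketch; the model path is the placeholder from the conversion command above, and the prompt is only a smoke test:

  ```python
  from llama_cpp import Llama

  # Load the quantized model on CPU (n_ctx is the context window in tokens)
  llm = Llama(model_path="models/<model>.gguf", n_ctx=4096)

  # Quick smoke test of the loaded model
  out = llm("Q: What is a vector store? A:", max_tokens=64)
  print(out["choices"][0]["text"])
  ```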
With the model ready, document ingestion consists of three steps:

- **Text Splitting**

  Use a library like `langchain` or `nltk` to split large documents into smaller chunks, as sketched below.
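  A minimal sketch using LangChain's `RecursiveCharacterTextSplitter`; the chunk size, overlap, and input file are illustrative assumptions, and in newer LangChain versions the import lives in `langchain_text_splitters`:

  ```python
  from langchain.text_splitter import RecursiveCharacterTextSplitter

  # Split a manual into overlapping chunks sized for embedding
  splitter = RecursiveCharacterTextSplitter(chunk_size=500, chunk_overlap=50)
  with open("manuals/dishwasher.txt") as f:  # hypothetical input file
      chunks = splitter.split_text(f.read())
  ```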
- **Embedding**

  Compute embeddings with `sentence-transformers` or `transformers` in CPU mode:

  ```python
  from transformers import AutoTokenizer, AutoModel

  tokenizer = AutoTokenizer.from_pretrained("sentence-transformers/all-MiniLM-L6-v2")
  model = AutoModel.from_pretrained("sentence-transformers/all-MiniLM-L6-v2")
  ```
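  The `transformers` snippet above only loads the encoder; the `sentence-transformers` wrapper returns pooled embeddings directly. A sketch that reuses the `chunks` list from the splitting step (an assumption of this example):

  ```python
  from sentence_transformers import SentenceTransformer
  import numpy as np

  # CPU by default; all-MiniLM-L6-v2 produces one 384-dimensional vector per chunk
  encoder = SentenceTransformer("sentence-transformers/all-MiniLM-L6-v2")
  embeddings = encoder.encode(chunks, convert_to_numpy=True).astype(np.float32)
  ```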
- **Vector Store**

  Store embeddings using FAISS or Chroma:

  ```python
  import faiss

  index = faiss.IndexFlatL2(embeddings.shape[1])
  index.add(embeddings)
  ```

  A vector store is a database of numeric embeddings that supports fast similarity search. It allows you to retrieve the most relevant document chunks given a query embedding.
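  Retrieval is then a nearest-neighbour lookup against this index. A sketch, assuming the `encoder`, `chunks`, and `index` objects from the previous steps:

  ```python
  # Embed the query with the same model used for the document chunks
  question = "Why is my dishwasher leaking?"  # hypothetical user question
  query_vec = encoder.encode([question], convert_to_numpy=True).astype(np.float32)

  # Return the 5 closest chunks (L2 distance, matching IndexFlatL2)
  distances, indices = index.search(query_vec, 5)
  relevant_chunks = [chunks[i] for i in indices[0]]
  ```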
The end-to-end flow:

```mermaid
flowchart TB
%% Model prep
subgraph ModelPrep["Model Preparation"]
A["Download HF Model"] --> B["Convert to GGUF"]
B --> C["Load GGUF Model"]
end
%% Document ingestion
subgraph Ingest["Ingestion"]
D["Raw Documents"] --> E["Text Splitting"]
E --> F["Compute Embeddings"]
F --> G["Store Embeddings in Vector Store"]
end
%% User query path
subgraph Query["Query"]
H["User Query"] --> I["Embed Query"]
I --> J["Vector Store Lookup"]
J --> K["Retrieve Chunks"]
K --> L["LLM Inference (gguf)"]
L --> M["Answer"]
end
```
Getting started:

- Clone this repo:

  ```bash
  git clone https://github.com/akram0zaki/rag-ingest.git
  cd rag-ingest
  ```

- Follow the Model Preparation and Document Ingestion steps above.
- Run your query script pointing at the vector store and GGUF model, as sketched below.
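A minimal sketch of such a query script, tying the pieces together. The index path, chunk storage, model names, and prompt format are illustrative assumptions rather than anything fixed by this repo, and the FAISS index is assumed to have been saved earlier with `faiss.write_index`:

```python
import faiss
import numpy as np
from llama_cpp import Llama
from sentence_transformers import SentenceTransformer

# Illustrative paths; point these at your own index and converted model
INDEX_PATH = "store/index.faiss"
MODEL_PATH = "models/<model>.gguf"

encoder = SentenceTransformer("sentence-transformers/all-MiniLM-L6-v2")
index = faiss.read_index(INDEX_PATH)
llm = Llama(model_path=MODEL_PATH, n_ctx=4096)

def answer(question, chunks, k=5):
    # Embed the question and fetch the k most similar chunks
    q_vec = encoder.encode([question], convert_to_numpy=True).astype(np.float32)
    _, idx = index.search(q_vec, k)
    context = "\n\n".join(chunks[i] for i in idx[0])

    # Simple RAG prompt: retrieved context first, then the user's question
    prompt = (
        "Use the context to answer the question.\n\n"
        f"Context:\n{context}\n\n"
        f"Question: {question}\nAnswer:"
    )
    out = llm(prompt, max_tokens=256)
    return out["choices"][0]["text"].strip()
```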