dongwonmoon/Yaml-Pipe

🧩 YamlPipe

A lightweight, YAML-driven ETL pipeline that transforms text data into vector embeddings —
with zero boilerplate, full flexibility, and seamless database integration.


🚀 Overview

YamlPipe lets you build end-to-end ETL pipelines for vector embedding workflows — all defined in a single YAML file.

It’s designed for AI developers, data engineers, and RAG (Retrieval-Augmented Generation) builders who want simplicity without losing flexibility.

With YamlPipe, you can:

  • ✅ Load data from files, web, S3, or Postgres
  • 🧠 Chunk text dynamically using multiple strategies
  • ⚙️ Generate embeddings with OpenAI or Sentence Transformers
  • 🧩 Store vectors in LanceDB or ChromaDB
  • 💻 Run everything via CLI or Web UI (Streamlit)
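
The stages listed above run in sequence: load, chunk, embed, store. The sketch below shows only that flow with toy stand-ins. The function names and the fake "embedding" are assumptions made for illustration and are not YamlPipe's actual code.

```python
from typing import Iterable, List, Tuple


def load(docs: Iterable[str]) -> List[str]:
    # Stand-in for a source (files, web, S3, Postgres): drop empty documents.
    return [d.strip() for d in docs if d.strip()]


def chunk(text: str, size: int) -> List[str]:
    # Stand-in for a chunker: fixed-width character slices, no overlap.
    return [text[i:i + size] for i in range(0, len(text), size)]


def embed(piece: str) -> List[float]:
    # Toy "embedding": [length, vowel count]. A real embedder calls a model.
    vowels = sum(c in "aeiou" for c in piece.lower())
    return [float(len(piece)), float(vowels)]


def run_pipeline(docs: Iterable[str], size: int = 8) -> List[Tuple[str, List[float]]]:
    store = []  # stand-in for a vector database sink
    for text in load(docs):
        for piece in chunk(text, size):
            store.append((piece, embed(piece)))
    return store


if __name__ == "__main__":
    for piece, vector in run_pipeline(["hello world"], size=5):
        print(repr(piece), vector)
```

In a real pipeline each stand-in would be replaced by the component named in the YAML file.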

🧠 Features

  • YAML-based Configuration – define your pipeline once, run it anywhere
  • Pluggable Components – modular architecture for each stage
  • Advanced Chunking – recursive_character, markdown, or adaptive
  • Multiple Embedding Models – sentence_transformer and openai
  • Vector Database Integration – lancedb or chromadb
  • CLI & Streamlit UI – full control, both terminal and browser
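
One common way to make components pluggable is a registry that maps the `type` string from the YAML file to a class. The sketch below illustrates that pattern only. `REGISTRY`, `register`, `build`, and `FixedChunker` are hypothetical names, and YamlPipe's internals may be organized differently.

```python
from typing import Any, Dict, Tuple

# Hypothetical registry: (stage, type name) -> component class.
REGISTRY: Dict[Tuple[str, str], type] = {}


def register(stage: str, type_name: str):
    """Class decorator that records a component under its YAML `type`."""
    def decorator(cls: type) -> type:
        REGISTRY[(stage, type_name)] = cls
        return cls
    return decorator


@register("chunker", "fixed")
class FixedChunker:
    # Illustrative component; not one of YamlPipe's chunkers.
    def __init__(self, chunk_size: int = 200):
        self.chunk_size = chunk_size

    def chunk(self, text: str):
        step = self.chunk_size
        return [text[i:i + step] for i in range(0, len(text), step)]


def build(stage: str, spec: Dict[str, Any]):
    """Instantiate the component described by one section of a parsed config."""
    key = (stage, spec["type"])
    if key not in REGISTRY:
        raise ValueError(f"unknown {stage} type: {spec['type']!r}")
    return REGISTRY[key](**spec.get("config", {}))


if __name__ == "__main__":
    chunker = build("chunker", {"type": "fixed", "config": {"chunk_size": 3}})
    print(chunker.chunk("abcdefg"))
```

With this layout, adding a new component means writing one class and decorating it, with no change to the code that reads the config.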

⚡ Installation

git clone https://github.com/dongwonmoon/Yaml-Pipe.git
cd Yaml-Pipe
pip install -r requirements.txt

🧩 Quick Start

python main.py init
python main.py run -c pipelines/pipeline.yaml

Example Pipeline

source:
  type: local_files
  config:
    path: ./data
    glob_pattern: "*.txt"

chunker:
  type: adaptive
  config:
    chunk_size: 200
    chunk_overlap: 40

embedder:
  type: sentence_transformer
  config:
    model_name: "jhgan/ko-sbert-nli"

sink:
  type: chromadb
  config:
    path: "./chroma_data"
    collection_name: "my_documents"
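
To show how `chunk_size` and `chunk_overlap` interact, here is a minimal sliding-window sketch. It assumes both values are measured in characters and uses a plain fixed window. It is not the `adaptive` strategy from the example, and the function name `sliding_chunks` is hypothetical.

```python
from typing import List


def sliding_chunks(text: str, chunk_size: int = 200, chunk_overlap: int = 40) -> List[str]:
    """Split text into windows of chunk_size that share chunk_overlap characters."""
    if chunk_size <= 0:
        raise ValueError("chunk_size must be positive")
    if not 0 <= chunk_overlap < chunk_size:
        raise ValueError("chunk_overlap must be in [0, chunk_size)")

    step = chunk_size - chunk_overlap  # how far each window advances
    chunks = []
    for start in range(0, len(text), step):
        chunks.append(text[start:start + chunk_size])
        if start + chunk_size >= len(text):
            break  # this window already reached the end of the text
    return chunks


if __name__ == "__main__":
    sample = "".join(chr(97 + i % 26) for i in range(500))
    parts = sliding_chunks(sample, chunk_size=200, chunk_overlap=40)
    print([len(p) for p in parts])
```

With the values from the example, each window advances 160 characters, so the last 40 characters of one chunk are repeated at the start of the next.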

🌐 Web Interface

streamlit run app.py

Use the dashboard to visualize your pipelines, test search results, and monitor ingestion progress.


💡 Why YamlPipe?

  • No more boilerplate ETL code — define everything in YAML
  • Designed for RAG, embedding pipelines, and AI data workflows
  • Fully open-source and easily extendable

🧭 Roadmap

  • Add Milvus / Pinecone sinks
  • Support LangChain / LlamaIndex integrations
  • Add benchmarking and pipeline visualization

🤝 Contributing

Contributions are always welcome!
Fork the repo, create a feature branch, and submit a PR.
New ideas, documentation improvements, and bug reports are all appreciated.


⭐ Support

If YamlPipe helps you, please consider giving it a star 🌟
Every star motivates continued development and new features!


🪪 License

MIT © dongwonmoon
