A modular pipeline for scraping, parsing, and processing content into JSONL suitable for Retrieval-Augmented Generation (RAG) ingestion.
- 🌐 URL scraping (HTML snapshots, main content extraction, PDF detection)
- 📄 PDF parsing (via `pdfplumber`)
- 💾 Local caching (`cache/raw` for raw HTML/PDF text)
- 🧠 Sliding window processing with DeepSeek (`deepseek-reasoner`)
- ✅ Deduplication & JSONL output (`cache/rag_ready`)
- 🔧 Modular design (`rag_pipeline/` package with submodules)
- ☁️ GCS storage integration
- 📦 Automated Pinecone ingestion
- ⏰ Cloud Scheduler for periodic refresh
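
To illustrate the sliding window feature listed above, here is a minimal sketch of overlapping chunking. It is not the code in `rag_pipeline/processing/sliding_window.py`: the function name `sliding_windows`, the character-based windows, and the `size` and `overlap` defaults are all assumptions made for the example.

```python
from typing import Iterator


def sliding_windows(text: str, size: int = 2000, overlap: int = 200) -> Iterator[str]:
    """Yield overlapping character windows that together cover the whole text.

    Hypothetical helper: window size and overlap are illustrative defaults.
    """
    if size <= 0 or not 0 <= overlap < size:
        raise ValueError("need size > 0 and 0 <= overlap < size")
    step = size - overlap  # how far the window advances each time
    start = 0
    while start < len(text):
        yield text[start:start + size]
        if start + size >= len(text):
            break  # this window already reached the end of the text
        start += step
```

Each window would then be sent to the model separately. The overlap keeps a sentence that straddles a boundary visible in two consecutive windows, which is also why a deduplication step is needed afterwards.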
flowchart LR
A[URLs] --> B[Scraper/PDF Parser] --> C[cache/raw]
C --> D[Sliding Window + DeepSeek] --> E[cache/rag_ready JSONL]
flowchart TD
A[URL list<br/>config/urls.txt] --> B[Scraper<br/>HTML/PDF detect]
B -->|Save raw HTML| C[cache/raw]
B -->|Download PDFs| D[PDF Parser]
D -->|Save raw text| C
C --> E[Sliding Window Parser<br/>DeepSeek API]
E -->|Deduplicate + Clean| F[cache/rag_ready<br/>JSONL]
F --> G[(RAG ingestion<br/>Vector DB / Pinecone)]
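
The "Deduplicate + Clean" step in the flowchart could look roughly like the sketch below. It assumes exact-match deduplication on whitespace-normalized, lowercased text and the one-field `{"text": ...}` record shape; the helper names `dedupe_chunks` and `to_jsonl` are hypothetical and may not match the project's storage module.

```python
import hashlib
import json
from typing import Iterable, List


def dedupe_chunks(chunks: Iterable[str]) -> List[str]:
    """Collapse whitespace, drop empty chunks, and keep the first copy of each text."""
    seen = set()
    unique = []
    for chunk in chunks:
        cleaned = " ".join(chunk.split())  # normalize runs of whitespace
        if not cleaned:
            continue
        # Hash the lowercased text so duplicates differing only in case are merged.
        key = hashlib.sha256(cleaned.lower().encode("utf-8")).hexdigest()
        if key not in seen:
            seen.add(key)
            unique.append(cleaned)
    return unique


def to_jsonl(chunks: Iterable[str]) -> str:
    """Serialize chunks as one JSON object per line."""
    return "".join(json.dumps({"text": c}, ensure_ascii=False) + "\n" for c in chunks)
```

Hashing keeps the `seen` set small for long chunks. Near-duplicate detection (for example, fuzzy matching) is out of scope for this sketch.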
.
├── cache/
│   ├── raw/            # raw scraped HTML/PDF text
│   └── rag_ready/      # processed JSONL output
├── config/
│   └── urls.txt        # list of target URLs
├── docker-compose.yml
├── Dockerfile
├── requirements.txt
├── README.md
└── rag_pipeline/
    ├── cli.py
    ├── main.py
    ├── scraping/
    │   ├── scraper.py
    │   └── pdf_parser.py
    ├── processing/
    │   ├── ai_client.py
    │   └── sliding_window.py
    ├── storage/
    │   └── storage.py
    └── utils/
        └── logger.py
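
As a rough illustration of how `config/urls.txt` might be consumed, the sketch below reads one URL per line. The format rules (blank lines ignored, `#` starting a comment, repeated URLs dropped) and the names `parse_url_lines` and `load_urls` are assumptions for the example, not documented behavior of the pipeline.

```python
from pathlib import Path
from typing import Iterable, List


def parse_url_lines(lines: Iterable[str]) -> List[str]:
    """Return URLs in file order, skipping blanks, '#' comments, and repeats."""
    urls: List[str] = []
    for line in lines:
        entry = line.strip()
        if not entry or entry.startswith("#"):
            continue
        if entry not in urls:
            urls.append(entry)
    return urls


def load_urls(path: str = "config/urls.txt") -> List[str]:
    """Read the URL list from disk (the default path follows the tree above)."""
    return parse_url_lines(Path(path).read_text(encoding="utf-8").splitlines())
```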
- Clone the repo
- Create `.env` with your DeepSeek key: `DEEPSEEK_API_KEY=your_key_here`
- Build the image: `docker-compose build`
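
A minimal sketch of how the key from `.env` could be checked at startup, assuming Docker Compose passes `DEEPSEEK_API_KEY` into the container environment. The helper `get_deepseek_key` is hypothetical; the project's `ai_client.py` may read the key differently.

```python
import os
from typing import Mapping, Optional


def get_deepseek_key(env: Optional[Mapping[str, str]] = None) -> str:
    """Return the DeepSeek API key, failing early with a clear message if absent."""
    env = os.environ if env is None else env
    key = env.get("DEEPSEEK_API_KEY", "").strip()
    if not key:
        raise RuntimeError("DEEPSEEK_API_KEY is not set; add it to .env")
    return key
```

Failing before any scraping starts avoids filling `cache/raw` and then stopping at the first model call.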
Run the CLI to select a URL or run all:

    docker-compose run --rm scraper

Run the whole pipeline on all URLs in `config/urls.txt`:

    docker-compose run --rm scraper python -m rag_pipeline.main

Run it on a single URL:

    docker-compose run --rm scraper python -m rag_pipeline.main https://example.com/page

Example JSONL (`cache/rag_ready/irb_manual.jsonl`):
{"text": "Informed consent requires disclosure of risks and benefits..."}
{"text": "Investigators must maintain accurate and complete study records..."}
{"text": "IRB review ensures compliance with federal regulations..."}
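
Before ingestion, the JSONL output can be read back one record per line. The sketch below assumes every record carries a non-empty string `text` field, as in the example above; `iter_texts` is a hypothetical helper, not part of the package.

```python
import json
from typing import Iterable, Iterator


def iter_texts(lines: Iterable[str]) -> Iterator[str]:
    """Yield the 'text' field of each JSONL record, skipping blank lines."""
    for number, line in enumerate(lines, start=1):
        if not line.strip():
            continue
        record = json.loads(line)
        text = record.get("text")
        if not isinstance(text, str) or not text:
            raise ValueError(f"line {number}: missing 'text' field")
        yield text
```

Typical use would be to open `cache/rag_ready/irb_manual.jsonl` with `encoding="utf-8"` and pass the file object to `iter_texts`, then hand each text to the embedding or vector database client.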