Read the Process for a detailed breakdown.
A robust, scalable data ingestion engine for gathering news articles from sources worldwide, built with Scrapy and designed for deployment on Zyte Scrapy Cloud.
The zerdisha-scrapy project is the dedicated data collection component of the Zerdisha ecosystem. It serves as an independent, specialized system responsible for:
- Web Crawling: Systematically discovering and accessing news articles from various sources
- Data Extraction: Parsing and structuring article content into standardized formats
- Data Pipeline: Processing and preparing scraped data for consumption by the main Zerdisha application
This project operates independently from the main Zerdisha API and frontend, enabling specialized deployment strategies and focused development of data ingestion capabilities.
The project follows a clean architectural pattern where data ingestion is completely decoupled from data analysis and presentation:
```
┌─────────────────────┐    ┌─────────────────────┐    ┌─────────────────────┐
│   zerdisha-scrapy   │───▶│    Data Storage     │───▶│      Zerdisha       │
│  (Data Collection)  │    │  (Structured Data)  │    │  (Analysis & API)   │
└─────────────────────┘    └─────────────────────┘    └─────────────────────┘
```
This separation enables:
- Independent Deployment: Deploy scrapers to specialized platforms like Zyte Scrapy Cloud
- Scalability: Scale data collection independently from application logic
- Maintainability: Focused codebase with clear responsibilities
- Flexibility: Easy to add new sources without affecting the main application
All scraped data follows the ArticleItem schema, ensuring consistency across all sources:
- url: Canonical article URL
- source_name: News source identifier
- title: Article headline
- full_text: Complete article content
- author: Article author (when available)
- publication_date: Original publication timestamp (ISO 8601)
- scraped_at: Collection timestamp (ISO 8601)
- spider_name: Collecting spider identifier
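For reference, a schema like this is conventionally declared with Scrapy's Item and Field classes. The following is a minimal sketch based on the field list above; the project's actual items.py may differ in detail:

```python
import scrapy


class ArticleItem(scrapy.Item):
    """Standardized container for one scraped article (illustrative sketch)."""

    url = scrapy.Field()               # canonical article URL
    source_name = scrapy.Field()       # news source identifier
    title = scrapy.Field()             # article headline
    full_text = scrapy.Field()         # complete article content
    author = scrapy.Field()            # article author, when available
    publication_date = scrapy.Field()  # original publication timestamp (ISO 8601)
    scraped_at = scrapy.Field()        # collection timestamp (ISO 8601)
    spider_name = scrapy.Field()       # collecting spider identifier
```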
The spiders implement robust publication date extraction with multiple fallback strategies:
- Primary extraction: Parse publication dates from article page elements
- URL structure fallback: Extract dates from URL patterns (e.g., /2025/07/03/)
- Validation: All dates are validated and formatted to the ISO 8601 standard
- Error handling: Graceful handling when dates cannot be determined
This ensures reliable date information even when news sources don't provide easily parseable publication dates.
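To make the fallback chain concrete, a simplified helper might look like the sketch below. The metadata selector and the URL pattern are illustrative assumptions, not the selectors used by the actual spiders:

```python
import re
from datetime import datetime, timezone
from typing import Optional

from scrapy.http import Response

# Hypothetical URL pattern such as https://example.com/2025/07/03/some-article
URL_DATE_PATTERN = re.compile(r"/(\d{4})/(\d{2})/(\d{2})/")


def extract_publication_date(response: Response) -> Optional[str]:
    """Return an ISO 8601 publication date, or None if it cannot be determined."""
    # 1. Primary: try a metadata element on the article page (selector is an assumption).
    raw = response.xpath("//meta[@property='article:published_time']/@content").get()
    if raw:
        try:
            return datetime.fromisoformat(raw.replace("Z", "+00:00")).isoformat()
        except ValueError:
            pass  # fall through to the URL-based strategy

    # 2. Fallback: recover the date from the URL structure (e.g., /2025/07/03/).
    match = URL_DATE_PATTERN.search(response.url)
    if match:
        year, month, day = map(int, match.groups())
        try:
            return datetime(year, month, day, tzinfo=timezone.utc).isoformat()
        except ValueError:
            pass  # invalid date components in the URL

    # 3. Graceful failure: the caller decides how to handle a missing date.
    return None
```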
Prerequisites:
- Python 3.8+
- pip package manager or Poetry
- Clone the repository:

  ```bash
  git clone https://github.com/awebisam/zerdisha-scrapy.git
  cd zerdisha-scrapy
  ```

- Install dependencies:

  Using pip:

  ```bash
  pip install -r requirements.txt
  ```

  Using Poetry (recommended):

  ```bash
  poetry install
  poetry shell
  ```

- Verify installation:

  ```bash
  scrapy version
  ```
- List available spiders:

  ```bash
  scrapy list
  ```

- Run the Kathmandu Post spider:

  ```bash
  scrapy crawl kathmandupost
  ```

- Run with limited items for testing:

  ```bash
  scrapy crawl kathmandupost -s CLOSESPIDER_ITEMCOUNT=5
  ```

- Run with output to file:

  ```bash
  scrapy crawl kathmandupost -o articles.json
  ```
```
zerdisha-scrapy/
├── scrapy.cfg                 # Scrapy project configuration
├── zerdisha_scrapers/         # Main project package
│   ├── __init__.py
│   ├── items.py               # Data structure definitions (ArticleItem)
│   ├── middlewares.py         # Custom middleware components
│   ├── pipelines.py           # Data processing pipelines
│   ├── settings.py            # Project settings and configuration
│   └── spiders/               # Spider implementations
│       ├── __init__.py
│       └── kathmandupost.py   # Kathmandu Post hybrid RSS/Scrapy spider
└── README.md                  # This file
```
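As a pointer to where post-extraction processing lives, a hypothetical validation pipeline in pipelines.py could look like the following sketch (illustrative only, not the project's actual pipeline):

```python
from scrapy import Spider
from scrapy.exceptions import DropItem

from zerdisha_scrapers.items import ArticleItem


class RequiredFieldsPipeline:
    """Drop items missing fields that every source must provide (illustrative sketch)."""

    REQUIRED_FIELDS = ("url", "source_name", "title", "full_text")

    def process_item(self, item: ArticleItem, spider: Spider) -> ArticleItem:
        missing = [field for field in self.REQUIRED_FIELDS if not item.get(field)]
        if missing:
            raise DropItem(f"Missing required fields {missing} in item from {spider.name}")
        return item
```

A pipeline like this would be enabled through the ITEM_PIPELINES setting in settings.py.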
We follow strict coding standards to ensure high-quality, maintainable code:
- Strict Typing: All Python code must include type hints - this is non-negotiable

  ```python
  def parse_article(self, response: Response) -> Generator[ArticleItem, None, None]:
  ```

- Comprehensive Documentation: Every module, class, and function must include clear docstrings using Google Style

  ```python
  def extract_title(self, response: Response) -> Optional[str]:
      """Extract article title from the response.

      Args:
          response: The HTTP response object containing the page.

      Returns:
          The extracted title string, or None if not found.
      """
  ```

- Modern Python Practices: Use current Python idioms and Scrapy best practices

- Readability First: Write code for humans first, machines second
- Fork and clone the repository
- Create a feature branch for your changes
- Follow the coding standards outlined above
- Test your changes thoroughly
- Submit a pull request with a clear description
When creating new spiders, use the kathmandupost.py spider as your template. This spider demonstrates best practices including:
- Hybrid RSS/Scrapy approach: Efficient article discovery via RSS with comprehensive content extraction
- Robust date extraction: Multiple fallback strategies for publication date extraction
- Error handling: Comprehensive exception handling and logging
- URL parsing: Smart date extraction from URL structure when article pages don't provide clear dates
Ensure your spider (a minimal skeleton follows this checklist):
- Inherits from scrapy.Spider
- Uses strict typing throughout
- Implements comprehensive logging with self.logger
- Properly populates ArticleItem instances
- Handles errors gracefully with try/except blocks
- Includes thorough documentation
- Extracts publication dates reliably (with fallback strategies)
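For orientation, a stripped-down skeleton that touches each point on the checklist might look like the following. The feed URL, source name, and selectors are placeholders for illustration, not values taken from the actual kathmandupost.py spider:

```python
from typing import Generator, Optional

import scrapy
from scrapy.http import Response

from zerdisha_scrapers.items import ArticleItem


class ExampleNewsSpider(scrapy.Spider):
    """Hybrid RSS/Scrapy spider skeleton (illustrative, not a real source)."""

    name = "example_news"
    start_urls = ["https://example-news.com/rss"]  # placeholder RSS feed

    def parse(self, response: Response) -> Generator[scrapy.Request, None, None]:
        """Discover article URLs from the RSS feed and schedule full-page requests."""
        for link in response.xpath("//item/link/text()").getall():
            yield scrapy.Request(link, callback=self.parse_article)

    def parse_article(self, response: Response) -> Generator[ArticleItem, None, None]:
        """Extract one ArticleItem from a full article page."""
        try:
            item = ArticleItem()
            item["url"] = response.url
            item["source_name"] = "Example News"
            item["spider_name"] = self.name
            item["title"] = response.xpath("//h1/text()").get()
            item["full_text"] = " ".join(
                response.xpath("//article//p/text()").getall()
            )
            # Publication date, author, and scraped_at population (with fallbacks) would go here.
            yield item
        except Exception as exc:  # log and continue rather than crash the crawl
            self.logger.error("Failed to parse %s: %s", response.url, exc)
```

The real spiders add the publication date fallback logic and fuller error handling described above.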
```bash
# Generate a new spider using Scrapy's generator
scrapy genspider news_source example-news.com
# Then customize it following our standards and the kathmandupost.py template
```

This project is designed for deployment on Zyte Scrapy Cloud, which provides:
- Managed Scrapy hosting
- Automatic scaling
- Monitoring and logging
- Data export capabilities
Deployment configurations and instructions will be added as the project matures.
This project is released into the public domain under The Unlicense. See LICENSE for details.