Thanks to visit codestin.com
Credit goes to lib.rs

#web-scraping #web-crawler

ferrisfetcher

A cutting-edge, high-level web scraping library crafted in Rust

1 unstable release

0.1.0 Dec 1, 2025

#303 in HTTP client

Apache-2.0

88KB
1.5K SLoC

FerrisFetcher

Crates.io Documentation License Rust

FerrisFetcher is a cutting-edge, high-level web scraping library crafted in Rust. Leveraging Tokio's asynchronous prowess for concurrent operations and Reqwest's efficient HTTP handling, FerrisFetcher provides a powerful, performant, and user-friendly web scraping solution.

โœจ Features

  • ๐Ÿš€ High Performance: Async HTTP client with connection pooling and concurrent request handling
  • ๐ŸŽฏ Precise Extraction: CSS selector-based data extraction with configurable rules
  • โšก Concurrent Scraping: Built-in concurrency management with configurable limits
  • ๐Ÿ›ก๏ธ Respectful Scraping: Rate limiting, retry mechanisms, and configurable delays
  • ๐Ÿ”ง Flexible Configuration: Comprehensive configuration options for all scraping needs
  • ๐Ÿ“Š Rich Metadata: Automatic extraction of page metadata, titles, descriptions, and structured data
  • ๐Ÿ” Advanced Parsing: Robust HTML parsing with support for malformed content
  • ๐Ÿ“ˆ Statistics: Built-in request statistics and performance monitoring
  • ๐ŸŽจ Builder API: Fluent builder pattern for easy configuration
  • ๐Ÿงช Well Tested: Comprehensive test suite with high code coverage

๐Ÿš€ Quick Start

Add FerrisFetcher to your Cargo.toml:

[dependencies]
ferrisfetcher = "0.1.0"
tokio = { version = "1.0", features = ["full"] }

Basic Usage

use ferrisfetcher::FerrisFetcher;

#[tokio::main]
async fn main() -> Result<(), Box<dyn std::error::Error>> {
    let fetcher = FerrisFetcher::new()?;
    let result = fetcher.scrape("https://example.com").await?;
    
    println!("Title: {}", result.title.unwrap_or_default());
    println!("Status: {}", result.status_code);
    
    Ok(())
}

Advanced Usage with Custom Configuration

use ferrisfetcher::{FerrisFetcher, Config};
use std::time::Duration;

#[tokio::main]
async fn main() -> Result<(), Box<dyn std::error::Error>> {
    let config = Config::new()
        .with_user_agent("MyScraper/1.0")
        .with_timeout(Duration::from_secs(30))
        .with_max_concurrent_requests(10)
        .without_rate_limit();

    let fetcher = FerrisFetcher::with_config(config)?;
    let result = fetcher.scrape("https://example.com").await?;
    
    println!("Scraped: {} ({}ms)", result.url, result.scrape_time_ms);
    
    Ok(())
}

๐Ÿ“– Examples

Data Extraction with Rules

use ferrisfetcher::{
    FerrisFetcher, 
    ExtractionRuleBuilder, 
    ExtractionType
};

#[tokio::main]
async fn main() -> Result<(), Box<dyn std::error::Error>> {
    let mut fetcher = FerrisFetcher::new()?;
    
    // Add extraction rules
    fetcher.add_extraction_rule(
        ExtractionRuleBuilder::new("headings", "h1, h2, h3")
            .extraction_type(ExtractionType::Text)
            .multiple(true)
            .build()
    );
    
    fetcher.add_extraction_rule(
        ExtractionRuleBuilder::new("links", "a[href]")
            .extraction_type(ExtractionType::Attribute)
            .attribute("href")
            .multiple(true)
            .build()
    );

    let result = fetcher.scrape("https://example.com").await?;
    
    // Access extracted data
    if let Some(headings) = result.extracted_data.get("headings") {
        println!("Found {} headings", headings.len());
        for heading in headings {
            println!("  - {}", heading);
        }
    }
    
    Ok(())
}

Concurrent Scraping

use ferrisfetcher::FerrisFetcher;

#[tokio::main]
async fn main() -> Result<(), Box<dyn std::error::Error>> {
    let fetcher = FerrisFetcher::new()?;
    
    let urls = vec![
        "https://example.com",
        "https://example.org", 
        "https://example.net",
    ];
    
    let results = fetcher.scrape_multiple(&urls).await?;
    
    println!("Successfully scraped {} URLs", results.len());
    for result in results {
        println!("  {} - {}", result.url, result.title.unwrap_or_default());
    }
    
    Ok(())
}

Using Preset Extraction Rules

use ferrisfetcher::{FerrisFetcher, Config, presets};

#[tokio::main]
async fn main() -> Result<(), Box<dyn std::error::Error>> {
    let fetcher = FerrisFetcher::with_config_and_rules(
        Config::new(),
        presets::article() // Predefined rules for article extraction
    )?;
    
    let result = fetcher.scrape("https://news.example.com/article").await?;
    
    // Automatically extracted article data
    println!("Title: {:?}", result.extracted_data.get("title"));
    println!("Author: {:?}", result.extracted_data.get("author"));
    println!("Content: {:?}", result.extracted_data.get("content"));
    
    Ok(())
}

Builder Pattern

use ferrisfetcher::{FerrisFetcherBuilder, ExtractionRuleBuilder, ExtractionType};
use std::time::Duration;

#[tokio::main]
async fn main() -> Result<(), Box<dyn std::error::Error>> {
    let fetcher = FerrisFetcherBuilder::new()
        .user_agent("AdvancedScraper/1.0")
        .timeout(Duration::from_secs(20))
        .max_concurrent_requests(8)
        .without_rate_limit()
        .header("X-Custom-Header", "value")?
        .add_rule(
            ExtractionRuleBuilder::new("images", "img[src]")
                .extraction_type(ExtractionType::Attribute)
                .attribute("src")
                .multiple(true)
                .build()
        )
        .build()?;

    let result = fetcher.scrape("https://example.com").await?;
    println!("Found {} images", 
        result.extracted_data.get("images").map_or(0, |v| v.len()));
    
    Ok(())
}

๐Ÿ“š Documentation

๐Ÿ—๏ธ Architecture

FerrisFetcher is built with a modular architecture:

  • HTTP Client: High-performance async HTTP client with Reqwest
  • HTML Parser: Robust HTML parsing with CSS selector support
  • Data Extractor: Configurable rule-based data extraction engine
  • Concurrency Manager: Tokio-based concurrent request handling
  • Configuration System: Flexible configuration with sensible defaults

๐Ÿงช Testing

Run the test suite:

cargo test

Run examples:

cargo run --example basic_scraping
cargo run --example data_extraction
cargo run --example concurrent_scraping
cargo run --example builder_pattern

๐Ÿ“Š Performance Benchmarks

FerrisFetcher is designed for high performance:

  • Single Request: < 100ms average response time
  • Concurrent Requests: Handle 100+ concurrent connections
  • Memory Usage: < 50MB for typical scraping workloads
  • CPU Efficiency: Minimal CPU overhead during I/O operations

๐Ÿ›ก๏ธ Respectful Scraping

FerrisFetcher promotes ethical scraping practices:

  • Rate Limiting: Configurable delays between requests
  • User Agent: Proper identification of the scraper
  • Retry Policies: Intelligent retry with exponential backoff
  • Timeout Protection: Prevents hanging requests

๐Ÿ”ง Configuration Options

FerrisFetcher supports extensive configuration:

  • HTTP timeouts and connection settings
  • Concurrent request limits
  • Rate limiting and delays
  • Retry policies with exponential backoff
  • Custom headers and user agents
  • Proxy support
  • Cookie management
  • Redirect handling

๐Ÿค Contributing

We welcome contributions! Please see our Contributing Guidelines for details.

Development Setup

git clone https://github.com/M1tsumi/FerrisFetcher.git
cd FerrisFetcher
cargo build
cargo test

๐Ÿ“„ License

This project is licensed under the Apache License 2.0 - see the LICENSE file for details.

๐Ÿ™ Acknowledgments

  • Built with Tokio for async runtime
  • HTTP requests powered by Reqwest
  • HTML parsing with Scraper
  • Inspired by the Rust community's web scraping needs

๐Ÿ“ž Support


FerrisFetcher - The Rust web scraping solution that's fast, reliable, and respectful. ๐Ÿฆ€โœจ

Dependencies

~20โ€“40MB
~486K SLoC