# FerrisFetcher

FerrisFetcher is a high-level web scraping library written in Rust. Built on Tokio for asynchronous, concurrent I/O and Reqwest for HTTP, it aims to make web scraping fast, reliable, and ergonomic.
## Features

- **High performance**: async HTTP client with connection pooling and concurrent request handling
- **Precise extraction**: CSS selector-based data extraction with configurable rules
- **Concurrent scraping**: built-in concurrency management with configurable limits
- **Respectful scraping**: rate limiting, retry mechanisms, and configurable delays
- **Flexible configuration**: comprehensive options for all scraping needs
- **Rich metadata**: automatic extraction of page titles, descriptions, and structured data
- **Robust parsing**: HTML parsing that tolerates malformed content
- **Statistics**: built-in request statistics and performance monitoring
- **Builder API**: fluent builder pattern for easy configuration
- **Well tested**: comprehensive test suite with high code coverage
## Quick Start

Add FerrisFetcher to your `Cargo.toml`:

```toml
[dependencies]
ferrisfetcher = "0.1.0"
tokio = { version = "1.0", features = ["full"] }
```
### Basic Usage

```rust
use ferrisfetcher::FerrisFetcher;

#[tokio::main]
async fn main() -> Result<(), Box<dyn std::error::Error>> {
    let fetcher = FerrisFetcher::new()?;
    let result = fetcher.scrape("https://example.com").await?;

    println!("Title: {}", result.title.unwrap_or_default());
    println!("Status: {}", result.status_code);

    Ok(())
}
```
### Advanced Usage with Custom Configuration

```rust
use ferrisfetcher::{Config, FerrisFetcher};
use std::time::Duration;

#[tokio::main]
async fn main() -> Result<(), Box<dyn std::error::Error>> {
    let config = Config::new()
        .with_user_agent("MyScraper/1.0")
        .with_timeout(Duration::from_secs(30))
        .with_max_concurrent_requests(10)
        .without_rate_limit();

    let fetcher = FerrisFetcher::with_config(config)?;
    let result = fetcher.scrape("https://example.com").await?;

    println!("Scraped: {} ({}ms)", result.url, result.scrape_time_ms);

    Ok(())
}
```
## Examples

### Data Extraction with Rules

```rust
use ferrisfetcher::{ExtractionRuleBuilder, ExtractionType, FerrisFetcher};

#[tokio::main]
async fn main() -> Result<(), Box<dyn std::error::Error>> {
    let mut fetcher = FerrisFetcher::new()?;

    // Add extraction rules.
    fetcher.add_extraction_rule(
        ExtractionRuleBuilder::new("headings", "h1, h2, h3")
            .extraction_type(ExtractionType::Text)
            .multiple(true)
            .build(),
    );

    fetcher.add_extraction_rule(
        ExtractionRuleBuilder::new("links", "a[href]")
            .extraction_type(ExtractionType::Attribute)
            .attribute("href")
            .multiple(true)
            .build(),
    );

    let result = fetcher.scrape("https://example.com").await?;

    // Access the extracted data.
    if let Some(headings) = result.extracted_data.get("headings") {
        println!("Found {} headings", headings.len());
        for heading in headings {
            println!("  - {}", heading);
        }
    }

    Ok(())
}
```
### Concurrent Scraping

```rust
use ferrisfetcher::FerrisFetcher;

#[tokio::main]
async fn main() -> Result<(), Box<dyn std::error::Error>> {
    let fetcher = FerrisFetcher::new()?;
    let urls = vec![
        "https://example.com",
        "https://example.org",
        "https://example.net",
    ];

    let results = fetcher.scrape_multiple(&urls).await?;

    println!("Successfully scraped {} URLs", results.len());
    for result in results {
        println!("  {} - {}", result.url, result.title.unwrap_or_default());
    }

    Ok(())
}
```
### Using Preset Extraction Rules

```rust
use ferrisfetcher::{presets, Config, FerrisFetcher};

#[tokio::main]
async fn main() -> Result<(), Box<dyn std::error::Error>> {
    let fetcher = FerrisFetcher::with_config_and_rules(
        Config::new(),
        presets::article(), // predefined rules for article extraction
    )?;

    let result = fetcher.scrape("https://news.example.com/article").await?;

    // Automatically extracted article data.
    println!("Title: {:?}", result.extracted_data.get("title"));
    println!("Author: {:?}", result.extracted_data.get("author"));
    println!("Content: {:?}", result.extracted_data.get("content"));

    Ok(())
}
```
### Builder Pattern

```rust
use ferrisfetcher::{ExtractionRuleBuilder, ExtractionType, FerrisFetcherBuilder};
use std::time::Duration;

#[tokio::main]
async fn main() -> Result<(), Box<dyn std::error::Error>> {
    let fetcher = FerrisFetcherBuilder::new()
        .user_agent("AdvancedScraper/1.0")
        .timeout(Duration::from_secs(20))
        .max_concurrent_requests(8)
        .without_rate_limit()
        .header("X-Custom-Header", "value")?
        .add_rule(
            ExtractionRuleBuilder::new("images", "img[src]")
                .extraction_type(ExtractionType::Attribute)
                .attribute("src")
                .multiple(true)
                .build(),
        )
        .build()?;

    let result = fetcher.scrape("https://example.com").await?;

    println!(
        "Found {} images",
        result.extracted_data.get("images").map_or(0, |v| v.len())
    );

    Ok(())
}
```
## Architecture

FerrisFetcher is built from modular components:

- **HTTP client**: high-performance async HTTP client built on Reqwest
- **HTML parser**: robust HTML parsing with CSS selector support
- **Data extractor**: configurable, rule-based data extraction engine
- **Concurrency manager**: Tokio-based concurrent request handling
- **Configuration system**: flexible configuration with sensible defaults
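The concurrency manager's job, capping how many requests are in flight at once, can be illustrated with a small std-only sketch. FerrisFetcher itself does this with Tokio tasks; `run_bounded` and its signature are our illustration of the pattern, not the crate's API:

```rust
use std::sync::{mpsc, Arc, Mutex};
use std::thread;

/// Process `jobs` with at most `limit` workers running at once.
/// Results may arrive in any order, just as concurrent scrapes would.
fn run_bounded(jobs: Vec<String>, limit: usize, work: fn(&str) -> String) -> Vec<String> {
    let (job_tx, job_rx) = mpsc::channel::<String>();
    // mpsc receivers are single-consumer, so workers share one behind a mutex.
    let job_rx = Arc::new(Mutex::new(job_rx));
    let (res_tx, res_rx) = mpsc::channel::<String>();

    let handles: Vec<_> = (0..limit)
        .map(|_| {
            let rx = Arc::clone(&job_rx);
            let tx = res_tx.clone();
            thread::spawn(move || loop {
                // Take the next job; the lock is released before doing the work.
                let job = { rx.lock().unwrap().recv() };
                match job {
                    Ok(j) => tx.send(work(&j)).unwrap(),
                    Err(_) => break, // queue closed: no more jobs
                }
            })
        })
        .collect();
    drop(res_tx); // only worker-held result senders remain

    for job in jobs {
        job_tx.send(job).unwrap();
    }
    drop(job_tx); // close the queue so idle workers exit

    // Collect until every worker has finished and dropped its sender.
    let results: Vec<String> = res_rx.iter().collect();
    for h in handles {
        h.join().unwrap();
    }
    results
}

fn main() {
    let urls: Vec<String> = ["https://example.com", "https://example.org", "https://example.net"]
        .iter()
        .map(|s| s.to_string())
        .collect();
    // Pretend-fetch each URL with at most 2 in flight at a time.
    let results = run_bounded(urls, 2, |u| format!("fetched {u}"));
    assert_eq!(results.len(), 3);
    println!("{} pages fetched", results.len());
}
```

The same shape carries over to async: replace the worker threads with spawned Tokio tasks gated by a semaphore, and the channels with their `tokio::sync` counterparts.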
## Testing

Run the test suite:

```shell
cargo test
```

Run the examples:

```shell
cargo run --example basic_scraping
cargo run --example data_extraction
cargo run --example concurrent_scraping
cargo run --example builder_pattern
```
## Performance

FerrisFetcher is designed for high throughput:

- **Single request**: under 100 ms average response time, network conditions permitting
- **Concurrent requests**: handles 100+ concurrent connections
- **Memory usage**: under 50 MB for typical scraping workloads
- **CPU efficiency**: minimal CPU overhead while waiting on I/O
## Respectful Scraping

FerrisFetcher promotes ethical scraping practices:

- **Rate limiting**: configurable delays between requests
- **User agent**: proper identification of the scraper
- **Retry policies**: retries with exponential backoff rather than hammering a failing host
- **Timeout protection**: prevents requests from hanging indefinitely
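The exponential-backoff idea behind the retry policy reduces to a small pure function: the delay doubles with each failed attempt, up to a cap. A sketch of the technique (`backoff_delay` is our name for illustration, not FerrisFetcher's API):

```rust
use std::time::Duration;

/// Delay before retry number `attempt` (0-based): `base * 2^attempt`,
/// capped at `max` and saturating instead of overflowing.
fn backoff_delay(attempt: u32, base: Duration, max: Duration) -> Duration {
    let factor = 2u32.saturating_pow(attempt);
    base.checked_mul(factor).map(|d| d.min(max)).unwrap_or(max)
}

fn main() {
    let base = Duration::from_millis(250);
    let max = Duration::from_secs(8);
    // Delays grow 250ms, 500ms, 1s, 2s, 4s, then stay pinned at the 8s cap.
    for attempt in 0..6 {
        println!("attempt {attempt}: wait {:?}", backoff_delay(attempt, base, max));
    }
}
```

In a real retry loop you would sleep for this duration (with Tokio, `tokio::time::sleep`) between attempts, often adding random jitter so many clients do not retry in lockstep.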
## Configuration Options

FerrisFetcher supports extensive configuration:

- HTTP timeouts and connection settings
- Concurrent request limits
- Rate limiting and delays
- Retry policies with exponential backoff
- Custom headers and user agents
- Proxy support
- Cookie management
- Redirect handling
## Contributing

We welcome contributions! Please see the Contributing Guidelines for details.

### Development Setup

```shell
git clone https://github.com/M1tsumi/FerrisFetcher.git
cd FerrisFetcher
cargo build
cargo test
```
## License

This project is licensed under the Apache License 2.0; see the LICENSE file for details.
## Acknowledgments

- Built on Tokio for the async runtime
- HTTP requests powered by Reqwest
- HTML parsing with Scraper
- Inspired by the Rust community's web scraping needs
## Support

- Documentation
- Issue Tracker
- Discussions

*FerrisFetcher: fast, reliable, and respectful web scraping in Rust.*