Intelligent data acquisition framework for GitHub and web sources

wzdnzd/harvester

Harvester - Universal Data Acquisition Framework

📖 中文文档 (Chinese documentation) | English | 🔗 More Tools

A universal, adaptive data acquisition framework designed for comprehensive information acquisition from multiple sources including GitHub, network mapping platforms (FOFA, Shodan), and arbitrary web endpoints. While the current implementation focuses on AI service provider key discovery as a practical example, the framework is architected for extensibility to support diverse data acquisition scenarios.


⭐⭐⭐ If this project helps you, please give it a star! Your support motivates us to keep improving and adding new features.


Project Goals

The system aims to build a universal data acquisition framework primarily targeting:

  • GitHub: Code repositories, issues, commits, and API endpoints
  • Network Mapping Platforms:
    • FOFA - Cyberspace mapping and asset discovery
    • Shodan - Internet-connected device search engine
  • Arbitrary Web Endpoints: Custom APIs, web services, and data sources
  • Extensible Architecture: Plugin-based system for easy integration of new data sources

Current Data Source Support

| Data Source | Status | Description |
|-------------|--------|-------------|
| GitHub API | ✅ Implemented | Full API integration with rate limiting |
| GitHub Web | ✅ Implemented | Web scraping with intelligent parsing |
| FOFA | 🚧 Planned | Cyberspace asset discovery integration |
| Shodan | 🚧 Planned | IoT and network device enumeration |
| Custom APIs | 🚧 Planned | Generic REST/GraphQL API adapter |

Architecture

Layered Architecture

graph TB
    %% Entry Layer
    subgraph Entry["Entry Layer"]
        CLI["CLI Interface<br/>(main.py)"]
        App["Application Core<br/>(main.py)"]
    end

    %% Management Layer
    subgraph Management["Management Layer"]
        TaskMgr["Task Manager<br/>(manager/task.py)"]
        Pipeline["Pipeline Manager<br/>(manager/pipeline.py)"]
        WorkerMgr["Worker Manager<br/>(manager/worker.py)"]
        QueueMgr["Queue Manager<br/>(manager/queue.py)"]
        StatusMgr["Status Manager<br/>(manager/status.py)"]
        Shutdown["Shutdown Coordinator<br/>(manager/shutdown.py)"]
    end

    %% Processing Layer
    subgraph Processing["Processing Layer"]
        StageBase["Stage Framework<br/>(stage/base.py)"]
        StageImpl["Stage Implementations<br/>(stage/definition.py)"]
        StageReg["Stage Registry<br/>(stage/registry.py)"]
        StageFactory["Stage Factory<br/>(stage/factory.py)"]
        StageResolver["Dependency Resolver<br/>(stage/resolver.py)"]
    end

    %% Service Layer
    subgraph Service["Service Layer"]
        SearchSvc["Search Service<br/>(search/client.py)"]
        SearchProviders["Search Providers<br/>(search/provider/)"]
        RefineSvc["Query Refinement<br/>(refine/)"]
        RefineEngine["Refine Engine<br/>(refine/engine.py)"]
        RefineOptimizer["Query Optimizer<br/>(refine/optimizer.py)"]
    end

    %% Core Domain Layer
    subgraph Core["Core Domain Layer"]
        Models["Domain Models & Tasks<br/>(core/models.py)"]
        Types["Type System<br/>(core/types.py)"]
        Enums["Enumerations<br/>(core/enums.py)"]
        Metrics["Metrics<br/>(core/metrics.py)"]
        Auth["Authentication<br/>(core/auth.py)"]
    end

    %% Infrastructure Layer
    subgraph Infrastructure["Infrastructure Layer"]
        Config["Configuration<br/>(config/)"]
        Tools["Tools & Utilities<br/>(tools/)"]
        Constants["Constants<br/>(constant/)"]
        Storage["Storage & Persistence<br/>(storage/)"]
    end

    %% State Management Layer
    subgraph StateLayer["State Management Layer"]
        StateCollector["State Collector<br/>(state/collector.py)"]
        StateDisplay["Display Engine<br/>(state/display.py)"]
        StateBuilder["Status Builder<br/>(state/builder.py)"]
        StateModels["State Models<br/>(state/models.py)"]
        StateMonitor["State Monitor<br/>(state/monitor.py)"]
        StateEnums["State Enums<br/>(state/enums.py)"]
        StateTypes["State Types<br/>(state/types.py)"]
    end

    %% External Systems
    subgraph External["External Systems"]
        GitHub["GitHub<br/>(API + Web)"]
        AIServices["AI Service<br/>Providers"]
        FileSystem["File System<br/>(Local Storage)"]
    end

    %% Dependencies (Top-down)
    Entry --> Management
    Management --> Processing
    Processing --> Service
    Service --> Core

    %% Infrastructure dependencies
    Entry -.-> Infrastructure
    Management -.-> Infrastructure
    Processing -.-> Infrastructure
    Service -.-> Infrastructure
    Core -.-> Infrastructure

    %% State management dependencies
    Entry -.-> StateLayer
    Management -.-> StateLayer

    %% External dependencies
    Service --> External
    Infrastructure --> External

System Architecture Overview

graph TB
    %% User Interface Layer
    subgraph UserLayer["User Interface Layer"]
        User[User]
        CLI[Command Line Interface]
        ConfigMgmt[Configuration Management]
    end

    %% Application Management Layer
    subgraph AppLayer["Application Management Layer"]
        MainApp[Main Application]
        TaskManager[Task Manager]
        StatusManager[Status Manager]
        ResourceManager[Resource Manager]
        ShutdownManager[Shutdown Manager]
    end

    %% Core Pipeline Engine
    subgraph PipelineCore["Pipeline Engine"]
        %% Stage Management System
        subgraph StageSystem["Stage Management System"]
            StageRegistry[Stage Registry]
            DependencyResolver[Dependency Resolver]
            StageFactory[Stage Factory]
        end

        %% Queue Management System
        subgraph QueueSystem["Queue Management System"]
            QueueManager[Queue Manager]
            WorkerManager[Worker Manager]
            MonitoringSystem[System Monitor]
        end

        %% Processing Stages
        subgraph ProcessingStages["Processing Stages"]
            SearchStage[Search Stage]
            GatherStage[Gather Stage]
            CheckStage[Check Stage]
            InspectStage[Inspect Stage]
        end
    end

    %% Search Provider Ecosystem
    subgraph ProviderEcosystem["Search Provider Ecosystem"]
        ProviderRegistry[Provider Registry]
        BaseProvider[Base Provider]
        OpenAIProvider[OpenAI-like Provider]
        CustomProviders[Custom Providers]
    end

    %% Advanced Processing Engines
    subgraph ProcessingEngines["Processing Engines"]
        SearchClient[Search Client]

        %% Query Optimization Engine
        subgraph QueryOptimizer["Query Optimization Engine"]
            RefineEngine[Refine Engine]
            RegexParser[Regex Parser]
            SplittabilityAnalyzer[Splittability Analyzer]
            EnumerationOptimizer[Enumeration Optimizer]
            QueryGenerator[Query Generator]
            OptimizationStrategies[Optimization Strategies]

            %% Internal Flow
            RefineEngine --> RegexParser
            RegexParser --> SplittabilityAnalyzer
            SplittabilityAnalyzer --> EnumerationOptimizer
            EnumerationOptimizer --> OptimizationStrategies
            OptimizationStrategies --> QueryGenerator
        end

        ValidationEngine[API Key Validation]
        RecoveryEngine[Task Recovery]
    end

    %% State & Data Management
    subgraph StateManagement["State & Data Management"]
        StateCollector[State Collector]
        DisplayEngine[Display Engine]
        StatusBuilder[Status Builder]
        StateMonitor[State Monitor]
        PersistenceLayer[Persistence Layer]
        SnapshotManager[Snapshot Manager]
        ResultManager[Result Manager]
    end

    %% Infrastructure Services
    subgraph Infrastructure["Infrastructure Services"]
        RateLimiting[Rate Limiting]
        CredentialMgmt[Credential Management]
        AgentRotation[User Agent Rotation]
        LoggingSystem[Logging System]
        RetryFramework[Retry Framework]
        ResourcePool[Resource Pool]
    end

    %% External Systems
    subgraph External["External Systems"]
        GitHubAPI[GitHub API]
        GitHubWeb[GitHub Web Interface]
        AIServiceAPIs[AI Service APIs]
        FileSystem[Local File System]
    end

    %% User Interactions
    User --> CLI
    User --> ConfigMgmt
    CLI --> MainApp
    ConfigMgmt --> MainApp

    %% Application Flow
    MainApp --> TaskManager
    MainApp --> StatusManager
    MainApp --> ResourceManager
    MainApp --> ShutdownManager
    TaskManager --> StageRegistry
    TaskManager --> QueueManager

    %% Stage Management Flow
    StageRegistry --> DependencyResolver
    StageRegistry --> StageFactory
    DependencyResolver --> ProcessingStages
    StageFactory --> ProcessingStages

    %% Queue Management Flow
    QueueManager --> WorkerManager
    QueueManager --> MonitoringSystem
    WorkerManager --> ProcessingStages

    %% Stage Dependencies (Pipeline)
    SearchStage --> GatherStage
    GatherStage --> CheckStage
    CheckStage --> InspectStage

    %% Processing Engine Integration
    SearchStage --> SearchClient
    SearchStage --> QueryOptimizer
    CheckStage --> ValidationEngine
    ProcessingStages --> RecoveryEngine

    %% Provider Integration
    SearchClient --> ProviderRegistry
    ProviderRegistry --> BaseProvider
    BaseProvider --> OpenAIProvider
    BaseProvider --> CustomProviders

    %% State Management Integration
    ProcessingStages --> StateCollector
    QueueManager --> StateCollector
    StateCollector --> DisplayEngine
    StateCollector --> StatusBuilder
    StateMonitor --> DisplayEngine
    ProcessingStages --> PersistenceLayer
    PersistenceLayer --> SnapshotManager
    PersistenceLayer --> ResultManager

    %% Infrastructure Integration
    SearchClient -.-> RateLimiting
    ResourceManager -.-> CredentialMgmt
    ResourceManager -.-> AgentRotation
    MainApp -.-> LoggingSystem
    ProcessingStages -.-> RetryFramework
    Infrastructure -.-> ResourcePool

    %% External Connections
    SearchClient --> GitHubAPI
    SearchClient --> GitHubWeb
    ValidationEngine --> AIServiceAPIs
    PersistenceLayer --> FileSystem

    %% Styling
    classDef userClass fill:#e3f2fd,stroke:#1976d2,stroke-width:2px
    classDef appClass fill:#f3e5f5,stroke:#7b1fa2,stroke-width:2px
    classDef coreClass fill:#e8f5e8,stroke:#388e3c,stroke-width:3px
    classDef providerClass fill:#fff3e0,stroke:#f57c00,stroke-width:2px
    classDef engineClass fill:#fce4ec,stroke:#c2185b,stroke-width:2px
    classDef stateClass fill:#f1f8e9,stroke:#689f38,stroke-width:2px
    classDef infraClass fill:#f5f5f5,stroke:#616161,stroke-width:2px
    classDef externalClass fill:#ffebee,stroke:#d32f2f,stroke-width:2px

    class User,CLI,ConfigMgmt userClass
    class MainApp,TaskManager,StatusManager,ResourceManager,ShutdownManager appClass
    class StageRegistry,DependencyResolver,StageFactory,QueueManager,WorkerManager,MonitoringSystem,SearchStage,GatherStage,CheckStage,InspectStage coreClass
    class ProviderRegistry,BaseProvider,OpenAIProvider,CustomProviders providerClass
    class SearchClient,QueryOptimizer,ValidationEngine,RecoveryEngine engineClass
    class StateCollector,StateMonitor,DisplayEngine,StatusBuilder,PersistenceLayer,SnapshotManager,ResultManager stateClass
    class RateLimiting,CredentialMgmt,AgentRotation,LoggingSystem,RetryFramework,ResourcePool infraClass
    class GitHubAPI,GitHubWeb,AIServiceAPIs,FileSystem externalClass

The project follows a layered architecture. The processing flow between its core components and the responsibilities of each layer are described in the sections below.

Multi-Stage Processing Flow

sequenceDiagram
    participant CLI as CLI
    participant App as Application
    participant TM as TaskManager
    participant Pipeline as Pipeline
    participant Search as SearchStage
    participant Gather as GatherStage
    participant Check as CheckStage
    participant Inspect as InspectStage
    participant Storage as Storage
    participant Monitor as StatusManager

    %% Initialization Phase
    CLI->>App: 1. Start Application
    App->>App: 2. Load Configuration
    App->>TM: 3. Create TaskManager
    TM->>TM: 4. Initialize Providers
    TM->>Pipeline: 5. Create Pipeline
    Pipeline->>Search: 6. Register SearchStage
    Pipeline->>Gather: 7. Register GatherStage
    Pipeline->>Check: 8. Register CheckStage
    Pipeline->>Inspect: 9. Register InspectStage
    App->>Monitor: 10. Start Status Manager

    %% Processing Phase
    loop Multi-Stage Processing
        TM->>Search: 11. Submit Search Tasks
        Search->>Search: 12. Query GitHub with Optimization
        Search->>Gather: 13. Forward Search Results

        Gather->>Gather: 14. Acquire Detailed Information
        Gather->>Check: 15. Forward Extracted Keys

        Check->>Check: 16. Validate API Keys
        Check->>Inspect: 17. Forward Valid Keys

        Inspect->>Inspect: 18. Inspect API Capabilities
        Inspect->>Storage: 19. Save Results

        Pipeline->>Monitor: 20. Update Status
        Monitor->>App: 21. Display Progress
    end

    %% Recovery and Persistence
    loop Background Operations
        Storage->>Storage: Auto-save Results
        Storage->>Storage: Create Snapshots
        Pipeline->>Pipeline: Task Recovery
        Monitor->>Monitor: Collect Metrics
    end

    %% Completion Phase
    Pipeline->>Pipeline: 22. Check Completion
    Pipeline->>Storage: 23. Final Persistence
    Pipeline->>Monitor: 24. Final Status Report
    App->>TM: 25. Graceful Shutdown
    TM->>Storage: 26. Save State

Architecture Layers

1. Presentation Layer

  • CLI Interface (main.py): Command-line entry point with argument parsing and application lifecycle
  • Configuration System (config/): YAML-based configuration management with validation and schemas

2. Application Layer

  • Application Core (main.py): Main application lifecycle and orchestration
  • Task Management (manager/task.py): Provider coordination and task distribution
  • Resource Coordination (tools/coordinator.py): Global resource management and coordination
  • Shutdown Management (manager/shutdown.py): Graceful shutdown coordination
  • Status Management (manager/status.py): Application status management and coordination
  • Worker Management (manager/worker.py): Worker thread management and scaling
  • Queue Management (manager/queue.py): Multi-queue coordination and management

3. Business Service Layer

  • Pipeline Engine (manager/pipeline.py): Multi-stage processing orchestration with DAG execution
  • Stage System (stage/): Pluggable processing stages with dependency resolution and factory pattern
  • Search Service (search/): GitHub code search with provider abstraction and optimization
  • Query Refinement (refine/): Intelligent query optimization with strategy pattern and mathematical foundations

4. Domain Layer

  • Core Models & Tasks (core/models.py): Business domain objects, data structures, and task definitions
  • Type System (core/types.py): Interface definitions and contracts
  • Business Enums (core/enums.py): Domain enumerations and constants
  • Metrics & Analytics (core/metrics.py): Performance measurement and KPI tracking
  • Authentication (core/auth.py): Authentication and authorization logic
  • Custom Exceptions (core/exceptions.py): Domain-specific exception handling

5. Infrastructure Layer

  • Storage & Persistence (storage/): Result storage, recovery, and snapshot management
    • Atomic Operations (storage/atomic.py): Atomic file operations with fsync
    • Result Management (storage/persistence.py): Multi-format result persistence
    • Task Recovery (storage/recovery.py): Task recovery mechanisms
    • Shard Management (storage/shard.py): NDJSON shard management with rotation
    • Snapshot Management (storage/snapshot.py): Backup and restore functionality
  • Tools & Utilities (tools/): Infrastructure tools and utilities
    • Logging System (tools/logger.py): Structured logging with API key redaction
    • Rate Limiting (tools/ratelimit.py): Adaptive rate control with token bucket algorithm
    • Load Balancing (tools/balancer.py): Resource distribution strategies
    • Credential Management (tools/credential.py): Secure credential rotation and management
    • Agent Management (tools/agent.py): User-agent rotation for web scraping
    • Pattern Matching (tools/patterns.py): Pattern matching utilities and helpers
    • Retry Framework (tools/retry.py): Unified retry mechanisms with backoff strategies
    • Resource Pooling (tools/resources.py): Resource pool management and optimization
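As an illustration of the "atomic file operations with fsync" idea listed under Storage & Persistence, the following is a minimal sketch and not the contents of storage/atomic.py. The function name `atomic_write_json` is hypothetical; the technique shown is the common write-to-temp-file, fsync, then rename pattern.

```python
import json
import os
import tempfile


def atomic_write_json(path: str, payload: dict) -> None:
    """Write payload to path so readers never observe a half-written file.

    Hypothetical helper for illustration only.
    """
    directory = os.path.dirname(os.path.abspath(path))
    # The temp file lives in the target directory so the final rename
    # stays on the same filesystem.
    fd, tmp_path = tempfile.mkstemp(dir=directory, suffix=".tmp")
    try:
        with os.fdopen(fd, "w", encoding="utf-8") as handle:
            json.dump(payload, handle)
            handle.flush()
            os.fsync(handle.fileno())  # force the data to disk before renaming
        os.replace(tmp_path, path)  # atomic swap on POSIX and Windows
    except BaseException:
        # Clean up the temp file if anything failed before the swap.
        try:
            os.unlink(tmp_path)
        except FileNotFoundError:
            pass
        raise
```

If the process dies before `os.replace`, the previous file is left untouched, which is the property a recovery mechanism generally relies on.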

6. State Management Layer

  • State Collection (state/collector.py): System metrics gathering and aggregation
  • Display Engine (state/display.py): User-friendly progress visualization and formatting
  • Status Builder (state/builder.py): Status data construction and transformation
  • State Models (state/models.py): Monitoring data structures and metrics
  • State Monitoring (state/monitor.py): Real-time state monitoring and tracking
  • State Enumerations (state/enums.py): State-related enumerations and constants
  • State Types (state/types.py): State type definitions and interfaces

Processing Stages

The system implements a 4-stage pipeline for comprehensive data acquisition and validation:

  1. Search Stage (stage/definition.py:SearchStage):

    • Intelligent GitHub code search with advanced query optimization
    • Multi-provider search support (API + Web)
    • Query refinement using mathematical optimization algorithms
    • Rate-limited search execution with adaptive throttling
  2. Gather Stage (stage/definition.py:GatherStage):

    • Detailed information acquisition from search results
    • Content extraction and parsing
    • Pattern matching for key identification
    • Structured data collection and normalization
  3. Check Stage (stage/definition.py:CheckStage):

    • API key validation against actual service endpoints
    • Authentication verification and capability testing
    • Service availability and response validation
    • Error handling and retry mechanisms
  4. Inspect Stage (stage/definition.py:InspectStage):

    • API capability inspection for validated keys
    • Model enumeration and feature detection
    • Service limits and quota analysis
    • Comprehensive capability profiling
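The hand-off between stages can be pictured as worker threads connected by queues. The sketch below is an assumption-laden illustration, not the code in stage/definition.py or manager/pipeline.py: `run_stage`, `run_pipeline`, and the toy stage functions are hypothetical names, and each stage is modeled as a function that maps one input item to zero or more output items.

```python
import queue
import threading
from typing import Any, Callable, Iterable, List, Optional, Tuple

_STOP = object()  # sentinel telling a worker thread to exit

StageFunc = Callable[[Any], Iterable[Any]]


def run_stage(func: StageFunc, inbox: queue.Queue,
              outbox: Optional[queue.Queue], workers: int) -> List[threading.Thread]:
    """Start `workers` threads that apply func to every item from inbox."""
    def loop() -> None:
        while True:
            item = inbox.get()
            if item is _STOP:
                break
            for result in func(item):
                if outbox is not None:
                    outbox.put(result)

    threads = [threading.Thread(target=loop) for _ in range(workers)]
    for thread in threads:
        thread.start()
    return threads


def run_pipeline(seeds: Iterable[Any], stages: List[Tuple[StageFunc, int]]) -> List[Any]:
    """Run seeds through the stages in order and return the final outputs."""
    queues = [queue.Queue() for _ in range(len(stages) + 1)]
    groups = [run_stage(func, queues[i], queues[i + 1], count)
              for i, (func, count) in enumerate(stages)]
    for seed in seeds:
        queues[0].put(seed)
    # Shut down stage by stage so every stage drains its inbox first.
    for i, ((_, count), threads) in enumerate(zip(stages, groups)):
        for _ in range(count):
            queues[i].put(_STOP)
        for thread in threads:
            thread.join()
    results = []
    while not queues[-1].empty():
        results.append(queues[-1].get())
    return results


# Toy stand-ins for the four stages; the worker counts mirror pipeline.threads.
def search(query):
    return [f"{query}-hit{i}" for i in range(2)]

def gather(hit):
    return [hit.upper()]

def check(item):
    return [item] if item.endswith("0") else []

def inspect_item(item):
    return [{"item": item, "inspected": True}]


if __name__ == "__main__":
    print(run_pipeline(["a", "b"],
                       [(search, 1), (gather, 4), (check, 2), (inspect_item, 1)]))
```

With more than one worker per stage, output order is not guaranteed, so callers that need a stable order should sort the results.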

Advanced Query Optimization Engine

The system features a sophisticated Query Optimization Engine with mathematical foundations:

Core Components

  1. Regex Parser

    • Advanced regex pattern parsing with support for complex syntax
    • Handles escaped characters, character classes, and quantifiers
    • Converts patterns into analyzable segment structures
  2. Splittability Analyzer

    • Mathematical analysis of pattern divisibility
    • Recursive depth limiting for safety
    • Value threshold analysis for optimization feasibility
    • Resource cost estimation for performance control
  3. Enumeration Optimizer

    • Intelligent enumeration strategy selection
    • Multi-dimensional optimization (depth, breadth, value)
    • Combinatorial analysis for optimal segment selection
    • Topological sorting for dependency resolution
  4. Query Generator

    • Generates optimized query variants from enumeration strategies
    • Supports configurable enumeration depth
    • Produces mathematically optimal query distributions
    • Maintains query semantic equivalence
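To make "configurable enumeration depth" concrete: a query whose next characters come from a known character set can be split into narrower variants by fixing the first `depth` characters, and together the variants cover the same space as the original. This is a simplified sketch under that assumption, not the algorithm in refine/; `enumerate_queries` is a hypothetical name.

```python
from itertools import product
from typing import List


def enumerate_queries(prefix: str, charset: str, depth: int) -> List[str]:
    """Return one query per combination of `depth` characters after prefix.

    The number of variants is len(charset) ** depth, which is why the depth
    has to be bounded in practice.
    """
    if depth < 0:
        raise ValueError("depth must be non-negative")
    return [prefix + "".join(combo) for combo in product(charset, repeat=depth)]


if __name__ == "__main__":
    print(enumerate_queries("tok-", "ab", 2))
```

Depth 0 returns the original prefix unchanged, and each extra level multiplies the query count by the size of the character set, which is the cost a splittability analysis would need to weigh.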

Optimization Algorithms

  • Mathematical Modeling: Uses mathematical principles to analyze regex patterns
  • Enumeration Strategy: Intelligent selection of optimal enumeration depth and combinations
  • Resource Management: Prevents resource exhaustion through intelligent limiting
  • Performance Optimization: Singleton pattern ensures optimal memory usage

Supported Data Sources & Use Cases

🔍 Current Implementation (AI Service Discovery)

  • OpenAI and compatible interfaces
  • Anthropic Claude
  • Azure OpenAI
  • Google Gemini
  • AWS Bedrock
  • GooeyAI
  • Stability AI
  • Baidu ERNIE Bot (百度文心一言)
  • Zhipu AI (智谱AI)
  • Custom providers

🌐 Planned Data Sources

  • FOFA: Cyberspace asset discovery and network mapping
  • Shodan: Internet-connected device enumeration
  • Custom REST APIs: Generic API integration framework
  • GraphQL Endpoints: Flexible query-based data acquisition
  • Web Scraping: JavaScript-rendered content and dynamic sites
  • Database Connectors: Direct database query capabilities

📊 Potential Use Cases

  • Data Mining: Large-scale information extraction and analysis

Key Features

🌐 Universal Data Acquisition

  • Multi-Source Support: GitHub, FOFA, Shodan, and custom endpoints
  • Adaptive Query Engine: Intelligent optimization for different data sources
  • Protocol Agnostic: REST, GraphQL, WebSocket, and web scraping support
  • Rate Limiting: Per-source intelligent rate control and quota management

🏗️ Advanced Architecture

  • Dynamic Stage System: Configurable processing pipelines with DAG execution
  • Plugin Architecture: Extensible framework for custom data sources and processors
  • Dependency Resolution: Automatic stage ordering and dependency management
  • Handler Registration: Pluggable processors for flexible data transformation
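Automatic stage ordering amounts to a topological sort of the stage dependency graph. The sketch below uses the standard-library `graphlib` module (available in the Python versions this project requires); the `STAGES` mapping and `resolve_order` are hypothetical and only mirror the search → gather → check → inspect chain described in this document, not the code in stage/resolver.py.

```python
from graphlib import CycleError, TopologicalSorter
from typing import Dict, List, Set

# Each stage maps to the set of stages it depends on.
STAGES: Dict[str, Set[str]] = {
    "search": set(),
    "gather": {"search"},
    "check": {"gather"},
    "inspect": {"check"},
}


def resolve_order(dependencies: Dict[str, Set[str]]) -> List[str]:
    """Return stage names so that every stage follows its dependencies.

    Raises graphlib.CycleError if the dependencies contain a cycle.
    """
    return list(TopologicalSorter(dependencies).static_order())


if __name__ == "__main__":
    print(resolve_order(STAGES))
```

A cyclic configuration (for example, two stages that depend on each other) fails fast with `CycleError` instead of producing an invalid pipeline.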

⚡ High Performance

  • Asynchronous Processing: Multi-threaded task execution with intelligent queuing
  • Adaptive Load Balancing: Dynamic resource allocation based on workload
  • Query Optimization: Mathematical modeling for optimal search strategies
  • Resource Monitoring: Real-time performance tracking and bottleneck detection

🛡️ Enterprise Ready

  • Fault Tolerance: Comprehensive error handling, retry mechanisms, and recovery
  • State Persistence: Queue state recovery and graceful shutdown capabilities
  • Security: Credential management, API key redaction, and secure storage
  • Monitoring: Real-time analytics, alerting, and performance visualization
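API key redaction in logs can be implemented with a standard `logging.Filter`. The following is a minimal sketch, not the implementation in tools/logger.py: `redact` and `RedactingFilter` are hypothetical names, and the regex is the `key_pattern` from the basic configuration in this document.

```python
import logging
import re

# key_pattern taken from the basic configuration example.
KEY_RE = re.compile(r"sk(?:-proj)?-[a-zA-Z0-9]{20}T3BlbkFJ[a-zA-Z0-9]{20}")


def redact(text: str, keep: int = 6) -> str:
    """Replace each matched key with its first `keep` characters plus '***'."""
    return KEY_RE.sub(lambda match: match.group(0)[:keep] + "***", text)


class RedactingFilter(logging.Filter):
    """Logging filter that masks matching keys before a record is emitted."""

    def filter(self, record: logging.LogRecord) -> bool:
        # Format the message first so keys passed as arguments are covered too.
        record.msg = redact(record.getMessage())
        record.args = ()
        return True


if __name__ == "__main__":
    logging.basicConfig(level=logging.INFO)
    logger = logging.getLogger("demo")
    logger.addFilter(RedactingFilter())
    fake_key = "sk-" + "a" * 20 + "T3BlbkFJ" + "b" * 20  # synthetic value
    logger.info("validated key=%s", fake_key)
```

Attaching the filter to a logger (or to a handler) keeps redaction in one place, so individual call sites do not need to remember to mask values.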

System Requirements

Dependencies

  • Python: 3.10+
  • Libraries: PyYAML
  • Optional: uvloop (Linux/macOS performance boost)
  • Development: pytest, black, mypy (for contributors)

Quick Start

📚 For comprehensive documentation, tutorials, and advanced usage guides, please visit DeepWiki

  1. Installation

    git clone https://github.com/wzdnzd/harvester.git
    cd harvester
    pip install -r requirements.txt
  2. Configuration

Choose one of the following methods to create your configuration:

Method 1: Generate default configuration

python main.py --create-config

Method 2: Copy from examples

# For basic configuration
cp examples/config-simple.yaml config.yaml

# For full configuration with all options
cp examples/config-full.yaml config.yaml

Edit the configuration file:

  • Set your GitHub session token or API key
  • Configure provider search patterns
  • Adjust rate limits and thread counts

Configuration Guide

The system provides two configuration templates:

  1. Basic Configuration - Suitable for quick start:

    # Global application settings
    global:
      workspace: "./data"  # Working directory
      github_credentials:
        sessions:
          - "your_github_session_here"  # GitHub session token
        strategy: "round_robin"  # Load balancing strategy
    
    # Pipeline stage configuration
    pipeline:
      threads:
        search: 1    # Search threads (keep low)
        gather: 4   # Acquisition threads
        check: 2     # Validation threads
        inspect: 1    # API capability inspection threads
    
    # System monitoring settings
    monitoring:
      update_interval: 2.0    # Monitoring update interval
      error_threshold: 0.1    # Error rate threshold
    
    # Data persistence configuration
    persistence:
      auto_restore: true      # Auto restore state on startup
      shutdown_timeout: 30    # Shutdown timeout in seconds
    
    # Global rate limiting configuration
    ratelimits:
      github_web:
        base_rate: 0.5       # Base rate in requests per second
        burst_limit: 2       # Maximum burst size
        adaptive: true       # Enable adaptive rate limiting
    
    # Provider task configurations
    tasks:
      - name: "openai"         # Provider name
        enabled: true          # Enable/disable provider
        provider_type: "openai"
        use_api: false         # Use GitHub API for searching
        
        # Pipeline stage settings
        stages:
          search: true         # Enable search stage
          gather: true         # Enable acquisition stage
          check: true          # Enable validation stage
          inspect: true        # Enable API capability inspection
        
        # Pattern matching configuration
        patterns:
          key_pattern: "sk(?:-proj)?-[a-zA-Z0-9]{20}T3BlbkFJ[a-zA-Z0-9]{20}"
        
        # Search conditions
        conditions:
          - query: '"T3BlbkFJ"'
  2. Full Configuration - Includes all advanced options:

    • display: Display and monitoring settings
    • global: Global system configuration
    • pipeline: Pipeline stage configuration
    • monitoring: System monitoring parameters
    • persistence: Data persistence settings
    • worker: Worker pool configuration
    • ratelimits: Rate limiting settings
    • tasks: Provider task configurations
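The `strategy: "round_robin"` setting under `github_credentials` suggests that configured sessions are used in turn. A minimal, thread-safe sketch of that idea follows; the `RoundRobin` class is hypothetical and is not the code in tools/balancer.py or tools/credential.py.

```python
import itertools
import threading
from typing import Iterable, TypeVar

T = TypeVar("T")


class RoundRobin:
    """Hand out items in a fixed rotation, safe to call from several threads."""

    def __init__(self, items: Iterable[T]) -> None:
        pool = list(items)
        if not pool:
            raise ValueError("at least one item is required")
        self._cycle = itertools.cycle(pool)
        self._lock = threading.Lock()

    def next(self) -> T:
        with self._lock:
            return next(self._cycle)


if __name__ == "__main__":
    sessions = RoundRobin(["session-1", "session-2"])  # placeholder values
    print([sessions.next() for _ in range(5)])
```

The lock matters because several worker threads may request a credential at the same time, and advancing a shared iterator concurrently is not safe.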

Advanced Task Configuration

📋 For complete configuration examples, please refer to the templates in examples/ (config-simple.yaml and config-full.yaml).

The tasks section is the core of the configuration, defining what providers to search and how to process them. Refer to the basic configuration example above for a complete tasks configuration.

Key Configuration Options

  • name: Unique identifier for the task
  • provider_type: Determines validation method (openai, openai_like, anthropic, gemini, etc.)
  • api: API endpoint configuration for key validation
  • patterns.key_pattern: Regex pattern to identify valid API keys
  • conditions: Search queries to find potential keys
  • stages: Enable/disable specific processing stages
  • extras.directory: Custom output directory for results
  3. Running
    python main.py                  # Use default config
    python main.py -c custom.yaml   # Use custom config
    python main.py --validate       # Validate config
    python main.py --log-level DEBUG # Enable debug logging

Directory Structure

harvester/
├── config/           # Configuration management
│   ├── accessor.py   # Configuration access utilities
│   ├── defaults.py   # Default configuration values
│   ├── loader.py     # Configuration loading
│   ├── schemas.py    # Configuration schemas
│   ├── validator.py  # Configuration validation
│   └── __init__.py   # Package initialization
├── constant/         # System constants
│   ├── monitoring.py # Monitoring constants
│   ├── runtime.py    # Runtime constants
│   ├── search.py     # Search constants
│   ├── system.py     # System constants
│   └── __init__.py   # Package initialization
├── core/             # Core domain models
│   ├── auth.py       # Authentication
│   ├── enums.py      # System enumerations
│   ├── exceptions.py # Custom exceptions
│   ├── metrics.py    # Performance metrics
│   ├── models.py     # Core data models & task definitions
│   ├── types.py      # Core type definitions
│   └── __init__.py   # Package initialization
├── examples/         # Configuration examples
│   ├── config-full.yaml    # Complete configuration template
│   └── config-simple.yaml  # Basic configuration template
├── manager/          # Task and resource management
│   ├── base.py       # Base management classes
│   ├── pipeline.py   # Pipeline management
│   ├── queue.py      # Queue management
│   ├── shutdown.py   # Shutdown coordination
│   ├── status.py     # Status management
│   ├── task.py       # Task management
│   ├── worker.py     # Worker thread management
│   └── __init__.py   # Package initialization
├── refine/           # Query optimization
│   ├── config.py     # Refine configuration
│   ├── engine.py     # Optimization engine
│   ├── generator.py  # Query generation
│   ├── optimizer.py  # Query optimization
│   ├── parser.py     # Query parsing
│   ├── segment.py    # Pattern segmentation
│   ├── splittability.py # Splittability analysis
│   ├── strategies.py # Optimization strategies
│   ├── types.py      # Refine type definitions
│   └── __init__.py   # Package initialization
├── search/           # Search engines
│   ├── client.py     # Search client
│   ├── provider/     # Provider implementations
│   │   ├── anthropic.py    # Anthropic provider
│   │   ├── azure.py        # Azure OpenAI provider
│   │   ├── base.py         # Base provider class
│   │   ├── bedrock.py      # AWS Bedrock provider
│   │   ├── doubao.py       # ByteDance Doubao provider
│   │   ├── gemini.py       # Google Gemini provider
│   │   ├── gooeyai.py      # GooeyAI provider
│   │   ├── openai.py       # OpenAI provider
│   │   ├── openai_like.py  # OpenAI-compatible provider
│   │   ├── qianfan.py      # Baidu Qianfan provider
│   │   ├── registry.py     # Provider registry
│   │   ├── stabilityai.py  # Stability AI provider
│   │   ├── vertex.py       # Google Vertex AI provider
│   │   └── __init__.py     # Package initialization
│   └── __init__.py   # Package initialization
├── stage/            # Pipeline stages
│   ├── base.py       # Base stage classes
│   ├── definition.py # Stage implementations
│   ├── factory.py    # Stage factory
│   ├── registry.py   # Stage registry
│   ├── resolver.py   # Dependency resolver
│   └── __init__.py   # Package initialization
├── state/            # State management
│   ├── builder.py    # Status builder
│   ├── collector.py  # State collection
│   ├── display.py    # Display engine
│   ├── enums.py      # State enumerations
│   ├── models.py     # State data models
│   ├── monitor.py    # State monitoring
│   ├── types.py      # State type definitions
│   └── __init__.py   # Package initialization
├── storage/          # Storage and persistence
│   ├── atomic.py     # Atomic file operations
│   ├── persistence.py # Result persistence
│   ├── recovery.py   # Task recovery
│   ├── shard.py      # NDJSON shard management
│   ├── snapshot.py   # Snapshot management
│   └── __init__.py   # Package initialization
├── tools/            # Tools and utilities
│   ├── agent.py      # User agent management
│   ├── balancer.py   # Load balancing
│   ├── coordinator.py # Resource coordination
│   ├── credential.py # Credential management
│   ├── logger.py     # Logging system
│   ├── patterns.py   # Pattern matching utilities
│   ├── ratelimit.py  # Rate limiting
│   ├── resources.py  # Resource pooling
│   ├── retry.py      # Retry framework
│   ├── utils.py      # General utilities
│   └── __init__.py   # Package initialization
├── .dockerignore     # Docker ignore rules
├── .gitignore        # Git ignore rules
├── Dockerfile        # Docker container configuration
├── entrypoint.sh     # Docker entrypoint script
├── LICENSE           # License file
├── main.py           # Entry point and application core
├── README.md         # English documentation
├── README.zh-CN.md   # Chinese documentation
├── requirements.txt  # Python dependencies
└── __init__.py       # Root package initialization

Advanced Features

  1. Real-time Monitoring

    • Task status tracking
    • Performance metrics collection
    • Resource usage monitoring
    • Alert system
  2. Configuration Flexibility

    • Multi-provider configuration
    • Custom search patterns
    • Adjustable performance parameters
    • Dynamic resource allocation
  3. Extensibility

    • Plugin-style providers
    • Custom pipeline stages
    • Configurable monitoring system
    • Flexible recovery strategies

Troubleshooting

Common Issues

1. Installation Problems

# Issue: pip install fails
# Solution: Create a virtual environment, then upgrade pip inside it
python -m venv venv

# Linux/macOS
source venv/bin/activate

# Windows
venv\Scripts\activate

python -m pip install --upgrade pip
pip install -r requirements.txt

2. Configuration Errors

# Issue: Configuration validation fails
# Solution: Validate configuration file
python main.py --validate

# Issue: Missing configuration file
# Solution: Create from example
cp examples/config-simple.yaml config.yaml

3. Rate Limiting Issues

# Issue: Too many API requests
# Solution: Adjust rate limits in config
rate_limits:
  github_api:
    base_rate: 0.1  # Reduce rate
    adaptive: true  # Enable adaptive limiting
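
To make the effect of lowering `base_rate` concrete, here is a minimal token-bucket sketch. It assumes `base_rate` means "requests per second", which the configuration above does not state, and the `TokenBucket` class is hypothetical rather than the code in `tools/ratelimit.py`.

```python
import time


class TokenBucket:
    """Hypothetical token bucket; not Harvester's actual rate limiter."""

    def __init__(self, rate, burst=1, clock=time.monotonic):
        self.rate = rate            # tokens added per second (assumed meaning of base_rate)
        self.burst = burst          # maximum number of tokens that can accumulate
        self.clock = clock          # injectable clock, which makes the class easy to test
        self.tokens = float(burst)
        self.updated = clock()

    def _refill(self):
        # Add tokens for the time elapsed since the last update, capped at burst.
        now = self.clock()
        self.tokens = min(self.burst, self.tokens + (now - self.updated) * self.rate)
        self.updated = now

    def try_acquire(self):
        """Take one token if available and report whether the request may proceed."""
        self._refill()
        if self.tokens >= 1:
            self.tokens -= 1
            return True
        return False

    def wait_time(self):
        """Seconds until the next token becomes available."""
        self._refill()
        return max(0.0, (1 - self.tokens) / self.rate)
```

Under that assumption, `base_rate: 0.1` allows roughly one request every ten seconds once the initial burst is spent.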

4. Memory Issues

# Issue: High memory usage
# Solution: Reduce batch sizes and thread counts
pipeline:
  threads:
    search: 1
    gather: 2  # Reduce from default
persistence:
  batch_size: 25  # Reduce from default 50
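
Why a smaller `batch_size` lowers peak memory can be illustrated with a short sketch: only one batch of records is held in memory at a time. The `batched` helper below is hypothetical and is not taken from `storage/persistence.py`.

```python
from itertools import islice


def batched(records, batch_size=25):
    """Yield lists of at most batch_size items without materializing the whole input."""
    iterator = iter(records)
    # islice pulls the next batch_size items; an empty list ends the loop.
    while batch := list(islice(iterator, batch_size)):
        yield batch
```

A writer that flushes each yielded batch to disk before requesting the next one keeps at most `batch_size` records resident.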

5. Network Connectivity

# Issue: Connection timeouts
# Solution: Increase timeout values
api:
  timeout: 60  # Increase from default 30
  retries: 5   # Increase retry attempts
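
The sketch below shows one common way a `retries` setting is applied: exponential backoff with jitter. It is an illustration under assumptions, not the logic in `tools/retry.py`; the `retry` function, its parameters, and the choice of exception types are all hypothetical.

```python
import random
import time


def retry(func, retries=5, base_delay=1.0, max_delay=30.0, sleep=time.sleep):
    """Call func until it succeeds or `retries` additional attempts are used up."""
    for attempt in range(retries + 1):
        try:
            return func()
        except (ConnectionError, TimeoutError):
            if attempt == retries:
                raise  # out of attempts: surface the last error
            # Double the delay on each failure, capped at max_delay.
            delay = min(max_delay, base_delay * 2 ** attempt)
            # Jitter spreads retries out so concurrent workers do not retry in lockstep.
            sleep(delay * random.uniform(0.5, 1.0))
```

Raising `retries` increases the worst-case wait considerably, so it is usually worth raising `timeout` first and checking whether the errors persist.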

Debug Mode

# Enable debug logging
python main.py --log-level DEBUG

# Save debug output to file
python main.py --log-level DEBUG > debug.log 2>&1

Security Considerations

Credential Management

  • Never commit credentials to version control
  • Use environment variables for sensitive configuration
  • Rotate credentials regularly to minimize exposure risk
  • Implement least privilege access for API keys

Data Protection

# Example: Secure credential configuration
global:
  github_credentials:
    sessions:
      - "${GITHUB_SESSION_1}"  # Use environment variables
      - "${GITHUB_SESSION_2}"
    tokens:
      - "${GITHUB_TOKEN_1}"
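
This README does not say whether Harvester expands `${NAME}` placeholders itself. If the configuration loader does not, a small pre-processing step along these lines could resolve them after the YAML has been parsed. The `expand_env` helper is hypothetical.

```python
import os
import re

# Matches ${NAME} where NAME is a valid environment variable identifier.
_PLACEHOLDER = re.compile(r"\$\{([A-Za-z_][A-Za-z0-9_]*)\}")


def expand_env(value, env=None):
    """Recursively replace ${NAME} in strings; raise KeyError if NAME is unset."""
    env = os.environ if env is None else env
    if isinstance(value, dict):
        return {key: expand_env(item, env) for key, item in value.items()}
    if isinstance(value, list):
        return [expand_env(item, env) for item in value]
    if isinstance(value, str):
        def substitute(match):
            name = match.group(1)
            if name not in env:
                raise KeyError(f"environment variable {name} is not set")
            return env[name]
        return _PLACEHOLDER.sub(substitute, value)
    return value  # numbers, booleans, and None pass through unchanged
```

Failing loudly on a missing variable avoids silently sending the literal string `${GITHUB_TOKEN_1}` as a credential.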

Privacy Considerations

  • Respect robots.txt and website terms of service
  • Implement rate limiting to avoid overwhelming target services
  • Log redaction automatically removes sensitive data from logs
  • Data retention policies should comply with applicable regulations
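
Log redaction of the kind mentioned above can be approximated with a standard `logging.Filter`. The sketch below is not the implementation in `tools/logger.py`; the two patterns (GitHub-style tokens and `sk-` style keys) and the masking format are assumptions chosen for illustration.

```python
import logging
import re

# Assumed patterns: GitHub-style tokens (ghp_, gho_, ...) and "sk-" style API keys.
SECRET_PATTERN = re.compile(r"\b(?:gh[pousr]_[A-Za-z0-9]{20,}|sk-[A-Za-z0-9_-]{20,})")


def redact(text):
    """Keep the first four characters of each match and mask the rest."""
    return SECRET_PATTERN.sub(lambda match: match.group(0)[:4] + "***", text)


class RedactingFilter(logging.Filter):
    """Rewrite each record so that secrets never reach a handler."""

    def filter(self, record):
        # Format the message first so secrets passed as arguments are covered too.
        record.msg = redact(record.getMessage())
        record.args = ()
        return True
```

Attaching the filter to a handler, for example with `handler.addFilter(RedactingFilter())`, applies it to every record that handler emits.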

Compliance Guidelines

  • Review legal requirements before using in production
  • Obtain necessary permissions for data collection
  • Implement data anonymization where required
  • Document data processing activities for compliance

Important Notes

  1. Limitations

    • Respect GitHub API usage limits
    • Configure rate limits appropriately
    • Mind memory usage
    • Handle sensitive data carefully
  2. Best Practices

    • Use appropriate thread counts
    • Back up results regularly
    • Monitor error rates
    • Handle alerts promptly

TODO & Roadmap

🏗️ Core Architecture Improvements

Data Source Abstraction

  • Abstract Data Source Interface: Create a unified interface for all data sources
    • Define DataSourceProvider base class with standard methods (search, gather, validate)
    • Implement adapter pattern for different API formats (REST, GraphQL, WebSocket)
    • Add configuration schema for data source registration
    • Support dynamic data source loading and hot-swapping
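
Since this item is still on the roadmap, the following is only one possible shape for such an interface. The class name `DataSourceProvider` and the method names come from the list above; the signatures, the `InMemorySource` toy provider, and the `run` driver are hypothetical.

```python
from abc import ABC, abstractmethod
from typing import Iterable


class DataSourceProvider(ABC):
    """Hypothetical base class; method names follow the roadmap item above."""

    name: str

    @abstractmethod
    def search(self, query: str) -> Iterable[str]:
        """Return identifiers (for example URLs) of candidate items."""

    @abstractmethod
    def gather(self, item_id: str) -> dict:
        """Fetch the full record for one item."""

    @abstractmethod
    def validate(self, record: dict) -> bool:
        """Decide whether a gathered record is usable."""


class InMemorySource(DataSourceProvider):
    """Toy provider backed by a dict, used only to exercise the interface."""

    name = "memory"

    def __init__(self, records):
        self.records = records

    def search(self, query):
        return [key for key, rec in self.records.items() if query in rec.get("text", "")]

    def gather(self, item_id):
        return self.records[item_id]

    def validate(self, record):
        return bool(record.get("text"))


def run(provider: DataSourceProvider, query: str) -> list:
    """Drive any provider through the same search -> gather -> validate flow."""
    records = (provider.gather(item_id) for item_id in provider.search(query))
    return [rec for rec in records if provider.validate(rec)]
```

With an interface like this, a FOFA or Shodan adapter would only need to implement the three methods, and the driver code would stay unchanged.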

Stage System Enhancement

  • Flexible Stage Definition: Move beyond the current 4-stage limitation
    • Create StageDefinition configuration format (YAML/JSON)
    • Implement dynamic stage loading from configuration files
    • Add stage composition and conditional execution
    • Support user-defined stage workflows and DAG customization
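
As a rough illustration of configuration-driven stage ordering, the standard library's `graphlib` can turn declared dependencies into an execution order. The stage names and the dictionary format below are assumptions for the sketch, not the project's `StageDefinition` format, which does not exist yet.

```python
from graphlib import CycleError, TopologicalSorter

# Hypothetical stage definitions: each stage lists the stages it depends on.
STAGES = {
    "search": [],
    "gather": ["search"],
    "check": ["gather"],
    "inspect": ["check"],
}


def execution_order(stages):
    """Return stage names in an order that satisfies every dependency.

    Raises graphlib.CycleError if the definitions contain a dependency cycle.
    """
    return list(TopologicalSorter(stages).static_order())
```

Because the order is derived from the dependencies, adding a fifth stage would mean adding one dictionary entry rather than editing the pipeline code. `graphlib` requires Python 3.9 or later.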

Handler/Processor Registration System

  • Pluggable Processing Architecture: Replace fixed function calls with configurable handlers
    • Implement HandlerRegistry for stage-specific processors
    • Create ProcessorInterface with standardized input/output contracts
    • Add handler discovery mechanism (annotation-based or configuration-driven)
    • Support middleware chains for request/response processing
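
One way the registry and middleware-chain ideas above could fit together is a decorator-based registry that runs handlers in registration order. The `HandlerRegistry` name comes from the roadmap item; everything else in this sketch, including the `register` and `dispatch` methods, is hypothetical.

```python
from collections import defaultdict
from typing import Callable


class HandlerRegistry:
    """Hypothetical registry mapping a stage name to an ordered handler chain."""

    def __init__(self):
        self._handlers = defaultdict(list)

    def register(self, stage: str) -> Callable:
        """Decorator that appends a handler to the chain for `stage`."""
        def decorator(func):
            self._handlers[stage].append(func)
            return func
        return decorator

    def dispatch(self, stage: str, payload):
        """Pass the payload through every handler registered for `stage`, in order."""
        if stage not in self._handlers:
            raise KeyError(f"no handler registered for stage {stage!r}")
        for handler in self._handlers[stage]:
            payload = handler(payload)
        return payload


registry = HandlerRegistry()


@registry.register("gather")
def strip_whitespace(text):
    return text.strip()


@registry.register("gather")
def lowercase(text):
    return text.lower()
```

Calling `registry.dispatch("gather", payload)` then applies `strip_whitespace` followed by `lowercase`, which is the middleware-chain behavior in miniature.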

🌐 Data Source Integrations

Network Mapping Platforms

  • FOFA Integration

    • Implement FOFA API client with authentication
    • Add FOFA-specific query optimization
  • Shodan Integration

    • Support data querying and extraction from Shodan

Generic Web Sources

  • Universal Web Scraper
    • Build configurable web scraping engine
    • Add support for JavaScript-rendered content (Selenium/Playwright)
    • Implement anti-bot detection bypass mechanisms
    • Create content extraction rule engine

🔧 Framework Enhancements

Configuration & Extensibility

  • Plugin System
    • Design plugin architecture with lifecycle management
    • Create plugin marketplace and discovery mechanism
    • Add plugin sandboxing and security validation
    • Implement plugin dependency resolution

Performance & Scalability

  • Distributed Processing
    • Add support for distributed task execution (Celery/RQ)
    • Implement horizontal scaling with load balancing
    • Create cluster management and node discovery
    • Add distributed state synchronization

Security

  • Enhanced Security Features
    • Implement credential encryption and secure storage
    • Create rate limiting policies per data source

📊 Monitoring & Analytics

Advanced Monitoring

  • Real-time Analytics Dashboard
    • Build web-based monitoring interface
    • Add real-time metrics visualization
    • Implement alerting and notification system
    • Create performance profiling and bottleneck analysis

🚀 Advanced Features

API & Integration

  • RESTful API Server
    • Build comprehensive REST API for external integration
    • Implement webhook support for real-time notifications
    • Create SDK libraries for popular programming languages

Contributing

Contributions are welcome! Before submitting a pull request, please ensure:

  1. Tests are updated
  2. Code follows style guidelines
  3. Documentation is added where necessary
  4. All tests pass

Priority Areas for Contributors

  • 🔥 High Priority: Data source abstraction and FOFA/Shodan integration
  • 🔥 High Priority: Stage system flexibility and handler registration
  • 🔥 High Priority: Plugin architecture and extensibility framework
  • 🔥 Medium Priority: Performance optimization and distributed processing
  • 🔥 Medium Priority: Web-based monitoring dashboard

License

This project is licensed under the Creative Commons Attribution-NonCommercial 4.0 International License (CC BY-NC 4.0). See the LICENSE file for details.

Disclaimer

⚠️ IMPORTANT NOTICE

This project is developed solely for educational and technical research purposes. Users should exercise caution and responsibility when using this software.

Key Points:

  • This software is intended for learning, research, and educational use only
  • Users must comply with all applicable laws and regulations in their jurisdiction
  • Users are responsible for ensuring their usage complies with the terms of service of any third-party platforms or APIs
  • The project authors do not recommend, encourage, or endorse the use of this software for illegally obtaining others' API keys or credentials
  • The project authors assume no responsibility for any disputes, legal issues, or damages arising from the use of this software
  • Commercial use is strictly prohibited without explicit written permission
  • Users should respect the intellectual property rights and privacy of others

By using this software, you acknowledge that you have read, understood, and agree to these terms. Use at your own risk.

Contact

For questions or other inquiries, please contact the project maintainers through GitHub Issues.
