π δΈζζζ‘£ | English | π More Tools
A universal, adaptive data acquisition framework designed for comprehensive information acquisition from multiple sources including GitHub, network mapping platforms (FOFA, Shodan), and arbitrary web endpoints. While the current implementation focuses on AI service provider key discovery as a practical example, the framework is architected for extensibility to support diverse data acquisition scenarios.
βββ If this project helps you, please give it a star! Your support motivates us to keep improving and adding new features.
The system aims to build a universal data acquisition framework primarily targeting:
- GitHub: Code repositories, issues, commits, and API endpoints
- Network Mapping Platforms:
- Arbitrary Web Endpoints: Custom APIs, web services, and data sources
- Extensible Architecture: Plugin-based system for easy integration of new data sources
| Data Source | Status | Description |
|---|---|---|
| GitHub API | β Implemented | Full API integration with rate limiting |
| GitHub Web | β Implemented | Web scraping with intelligent parsing |
| FOFA | π§ Planned | Cyberspace asset discovery integration |
| Shodan | π§ Planned | IoT and network device enumeration |
| Custom APIs | π§ Planned | Generic REST/GraphQL API adapter |
graph TB
%% Entry Layer
subgraph Entry["Entry Layer"]
CLI["CLI Interface<br/>(main.py)"]
App["Application Core<br/>(main.py)"]
end
%% Management Layer
subgraph Management["Management Layer"]
TaskMgr["Task Manager<br/>(manager/task.py)"]
Pipeline["Pipeline Manager<br/>(manager/pipeline.py)"]
WorkerMgr["Worker Manager<br/>(manager/worker.py)"]
QueueMgr["Queue Manager<br/>(manager/queue.py)"]
StatusMgr["Status Manager<br/>(manager/status.py)"]
Shutdown["Shutdown Coordinator<br/>(manager/shutdown.py)"]
end
%% Processing Layer
subgraph Processing["Processing Layer"]
StageBase["Stage Framework<br/>(stage/base.py)"]
StageImpl["Stage Implementations<br/>(stage/definition.py)"]
StageReg["Stage Registry<br/>(stage/registry.py)"]
StageFactory["Stage Factory<br/>(stage/factory.py)"]
StageResolver["Dependency Resolver<br/>(stage/resolver.py)"]
end
%% Service Layer
subgraph Service["Service Layer"]
SearchSvc["Search Service<br/>(search/client.py)"]
SearchProviders["Search Providers<br/>(search/provider/)"]
RefineSvc["Query Refinement<br/>(refine/)"]
RefineEngine["Refine Engine<br/>(refine/engine.py)"]
RefineOptimizer["Query Optimizer<br/>(refine/optimizer.py)"]
end
%% Core Domain Layer
subgraph Core["Core Domain Layer"]
Models["Domain Models & Tasks<br/>(core/models.py)"]
Types["Type System<br/>(core/types.py)"]
Enums["Enumerations<br/>(core/enums.py)"]
Metrics["Metrics<br/>(core/metrics.py)"]
Auth["Authentication<br/>(core/auth.py)"]
end
%% Infrastructure Layer
subgraph Infrastructure["Infrastructure Layer"]
Config["Configuration<br/>(config/)"]
Tools["Tools & Utilities<br/>(tools/)"]
Constants["Constants<br/>(constant/)"]
Storage["Storage & Persistence<br/>(storage/)"]
end
%% State Management Layer
subgraph StateLayer["State Management Layer"]
StateCollector["State Collector<br/>(state/collector.py)"]
StateDisplay["Display Engine<br/>(state/display.py)"]
StateBuilder["Status Builder<br/>(state/builder.py)"]
StateModels["State Models<br/>(state/models.py)"]
StateMonitor["State Monitor<br/>(state/monitor.py)"]
StateEnums["State Enums<br/>(state/enums.py)"]
StateTypes["State Types<br/>(state/types.py)"]
end
%% External Systems
subgraph External["External Systems"]
GitHub["GitHub<br/>(API + Web)"]
AIServices["AI Service<br/>Providers"]
FileSystem["File System<br/>(Local Storage)"]
end
%% Dependencies (Top-down)
Entry --> Management
Management --> Processing
Processing --> Service
Service --> Core
%% Infrastructure dependencies
Entry -.-> Infrastructure
Management -.-> Infrastructure
Processing -.-> Infrastructure
Service -.-> Infrastructure
Core -.-> Infrastructure
%% State management dependencies
Entry -.-> StateLayer
Management -.-> StateLayer
%% External dependencies
Service --> External
Infrastructure --> External
graph TB
%% User Interface Layer
subgraph UserLayer["User Interface Layer"]
User[User]
CLI[Command Line Interface]
ConfigMgmt[Configuration Management]
end
%% Application Management Layer
subgraph AppLayer["Application Management Layer"]
MainApp[Main Application]
TaskManager[Task Manager]
StatusManager[Status Manager]
ResourceManager[Resource Manager]
ShutdownManager[Shutdown Manager]
end
%% Core Pipeline Engine
subgraph PipelineCore["Pipeline Engine"]
%% Stage Management System
subgraph StageSystem["Stage Management System"]
StageRegistry[Stage Registry]
DependencyResolver[Dependency Resolver]
StageFactory[Stage Factory]
end
%% Queue Management System
subgraph QueueSystem["Queue Management System"]
QueueManager[Queue Manager]
WorkerManager[Worker Manager]
MonitoringSystem[System Monitor]
end
%% Processing Stages
subgraph ProcessingStages["Processing Stages"]
SearchStage[Search Stage]
GatherStage[Gather Stage]
CheckStage[Check Stage]
InspectStage[Inspect Stage]
end
end
%% Search Provider Ecosystem
subgraph ProviderEcosystem["Search Provider Ecosystem"]
ProviderRegistry[Provider Registry]
BaseProvider[Base Provider]
OpenAIProvider[OpenAI-like Provider]
CustomProviders[Custom Providers]
end
%% Advanced Processing Engines
subgraph ProcessingEngines["Processing Engines"]
SearchClient[Search Client]
%% Query Optimization Engine
subgraph QueryOptimizer["Query Optimization Engine"]
RefineEngine[Refine Engine]
RegexParser[Regex Parser]
SplittabilityAnalyzer[Splittability Analyzer]
EnumerationOptimizer[Enumeration Optimizer]
QueryGenerator[Query Generator]
OptimizationStrategies[Optimization Strategies]
%% Internal Flow
RefineEngine --> RegexParser
RegexParser --> SplittabilityAnalyzer
SplittabilityAnalyzer --> EnumerationOptimizer
EnumerationOptimizer --> OptimizationStrategies
OptimizationStrategies --> QueryGenerator
end
ValidationEngine[API Key Validation]
RecoveryEngine[Task Recovery]
end
%% State & Data Management
subgraph StateManagement["State & Data Management"]
StateCollector[State Collector]
DisplayEngine[Display Engine]
StatusBuilder[Status Builder]
StateMonitor[State Monitor]
PersistenceLayer[Persistence Layer]
SnapshotManager[Snapshot Manager]
ResultManager[Result Manager]
end
%% Infrastructure Services
subgraph Infrastructure["Infrastructure Services"]
RateLimiting[Rate Limiting]
CredentialMgmt[Credential Management]
AgentRotation[User Agent Rotation]
LoggingSystem[Logging System]
RetryFramework[Retry Framework]
ResourcePool[Resource Pool]
end
%% External Systems
subgraph External["External Systems"]
GitHubAPI[GitHub API]
GitHubWeb[GitHub Web Interface]
AIServiceAPIs[AI Service APIs]
FileSystem[Local File System]
end
%% User Interactions
User --> CLI
User --> ConfigMgmt
CLI --> MainApp
ConfigMgmt --> MainApp
%% Application Flow
MainApp --> TaskManager
MainApp --> StatusManager
MainApp --> ResourceManager
MainApp --> ShutdownManager
TaskManager --> StageRegistry
TaskManager --> QueueManager
%% Stage Management Flow
StageRegistry --> DependencyResolver
StageRegistry --> StageFactory
DependencyResolver --> ProcessingStages
StageFactory --> ProcessingStages
%% Queue Management Flow
QueueManager --> WorkerManager
QueueManager --> MonitoringSystem
WorkerManager --> ProcessingStages
%% Stage Dependencies (Pipeline)
SearchStage --> GatherStage
GatherStage --> CheckStage
CheckStage --> InspectStage
%% Processing Engine Integration
SearchStage --> SearchClient
SearchStage --> QueryOptimizer
CheckStage --> ValidationEngine
ProcessingStages --> RecoveryEngine
%% Provider Integration
SearchClient --> ProviderRegistry
ProviderRegistry --> BaseProvider
BaseProvider --> OpenAIProvider
BaseProvider --> CustomProviders
%% State Management Integration
ProcessingStages --> StateCollector
QueueManager --> StateCollector
StateCollector --> DisplayEngine
StateCollector --> StatusBuilder
StateMonitor --> DisplayEngine
ProcessingStages --> PersistenceLayer
PersistenceLayer --> SnapshotManager
PersistenceLayer --> ResultManager
%% Infrastructure Integration
SearchClient -.-> RateLimiting
ResourceManager -.-> CredentialMgmt
ResourceManager -.-> AgentRotation
MainApp -.-> LoggingSystem
ProcessingStages -.-> RetryFramework
Infrastructure -.-> ResourcePool
%% External Connections
SearchClient --> GitHubAPI
SearchClient --> GitHubWeb
ValidationEngine --> AIServiceAPIs
PersistenceLayer --> FileSystem
%% Styling
classDef userClass fill:#e3f2fd,stroke:#1976d2,stroke-width:2px
classDef appClass fill:#f3e5f5,stroke:#7b1fa2,stroke-width:2px
classDef coreClass fill:#e8f5e8,stroke:#388e3c,stroke-width:3px
classDef providerClass fill:#fff3e0,stroke:#f57c00,stroke-width:2px
classDef engineClass fill:#fce4ec,stroke:#c2185b,stroke-width:2px
classDef stateClass fill:#f1f8e9,stroke:#689f38,stroke-width:2px
classDef infraClass fill:#f5f5f5,stroke:#616161,stroke-width:2px
classDef externalClass fill:#ffebee,stroke:#d32f2f,stroke-width:2px
class User,CLI,ConfigMgmt userClass
class MainApp,TaskManager,StatusManager,ResourceManager,ShutdownManager appClass
class StageRegistry,DependencyResolver,StageFactory,QueueManager,WorkerManager,MonitoringSystem,SearchStage,GatherStage,CheckStage,InspectStage coreClass
class ProviderRegistry,BaseProvider,OpenAIProvider,CustomProviders providerClass
class SearchClient,QueryOptimizer,ValidationEngine,RecoveryEngine engineClass
class StateCollector,StateMonitor,DisplayEngine,StatusBuilder,PersistenceLayer,SnapshotManager,ResultManager stateClass
class RateLimiting,CredentialMgmt,AgentRotation,LoggingSystem,RetryFramework,ResourcePool infraClass
class GitHubAPI,GitHubWeb,AIServiceAPIs,FileSystem externalClass
The project follows a layered architecture with the following core components:
sequenceDiagram
participant CLI as CLI
participant App as Application
participant TM as TaskManager
participant Pipeline as Pipeline
participant Search as SearchStage
participant Gather as GatherStage
participant Check as CheckStage
participant Inspect as InspectStage
participant Storage as Storage
participant Monitor as StatusManager
%% Initialization Phase
CLI->>App: 1. Start Application
App->>App: 2. Load Configuration
App->>TM: 3. Create TaskManager
TM->>TM: 4. Initialize Providers
TM->>Pipeline: 5. Create Pipeline
Pipeline->>Search: 6. Register SearchStage
Pipeline->>Gather: 7. Register GatherStage
Pipeline->>Check: 8. Register CheckStage
Pipeline->>Inspect: 9. Register InspectStage
App->>Monitor: 10. Start Status Manager
%% Processing Phase
loop Multi-Stage Processing
TM->>Search: 11. Submit Search Tasks
Search->>Search: 12. Query GitHub with Optimization
Search->>Gather: 13. Forward Search Results
Gather->>Gather: 14. Acquire Detailed Information
Gather->>Check: 15. Forward Extracted Keys
Check->>Check: 16. Validate API Keys
Check->>Inspect: 17. Forward Valid Keys
Inspect->>Inspect: 18. Inspect API Capabilities
Inspect->>Storage: 19. Save Results
Pipeline->>Monitor: 20. Update Status
Monitor->>App: 21. Display Progress
end
%% Recovery and Persistence
loop Background Operations
Storage->>Storage: Auto-save Results
Storage->>Storage: Create Snapshots
Pipeline->>Pipeline: Task Recovery
Monitor->>Monitor: Collect Metrics
end
%% Completion Phase
Pipeline->>Pipeline: 22. Check Completion
Pipeline->>Storage: 23. Final Persistence
Pipeline->>Monitor: 24. Final Status Report
App->>TM: 25. Graceful Shutdown
TM->>Storage: 26. Save State
- CLI Interface (
main.py): Command-line entry point with argument parsing and application lifecycle - Configuration System (
config/): YAML-based configuration management with validation and schemas
- Application Core (
main.py): Main application lifecycle and orchestration - Task Management (
manager/task.py): Provider coordination and task distribution - Resource Coordination (
tools/coordinator.py): Global resource management and coordination - Shutdown Management (
manager/shutdown.py): Graceful shutdown coordination - Status Management (
manager/status.py): Application status management and coordination - Worker Management (
manager/worker.py): Worker thread management and scaling - Queue Management (
manager/queue.py): Multi-queue coordination and management
- Pipeline Engine (
manager/pipeline.py): Multi-stage processing orchestration with DAG execution - Stage System (
stage/): Pluggable processing stages with dependency resolution and factory pattern - Search Service (
search/): GitHub code search with provider abstraction and optimization - Query Refinement (
refine/): Intelligent query optimization with strategy pattern and mathematical foundations
- Core Models & Tasks (
core/models.py): Business domain objects, data structures, and task definitions - Type System (
core/types.py): Interface definitions and contracts - Business Enums (
core/enums.py): Domain enumerations and constants - Metrics & Analytics (
core/metrics.py): Performance measurement and KPI tracking - Authentication (
core/auth.py): Authentication and authorization logic - Custom Exceptions (
core/exceptions.py): Domain-specific exception handling - Custom Exceptions (
core/exceptions.py): Domain-specific exception handling
- Storage & Persistence (
storage/): Result storage, recovery, and snapshot management- Atomic Operations (
storage/atomic.py): Atomic file operations with fsync - Result Management (
storage/persistence.py): Multi-format result persistence - Task Recovery (
storage/recovery.py): Task recovery mechanisms - Shard Management (
storage/shard.py): NDJSON shard management with rotation - Snapshot Management (
storage/snapshot.py): Backup and restore functionality
- Atomic Operations (
- Tools & Utilities (
tools/): Infrastructure tools and utilities- Logging System (
tools/logger.py): Structured logging with API key redaction - Rate Limiting (
tools/ratelimit.py): Adaptive rate control with token bucket algorithm - Load Balancing (
tools/balancer.py): Resource distribution strategies - Credential Management (
tools/credential.py): Secure credential rotation and management - Agent Management (
tools/agent.py): User-agent rotation for web scraping - Pattern Matching (
tools/patterns.py): Pattern matching utilities and helpers - Retry Framework (
tools/retry.py): Unified retry mechanisms with backoff strategies - Resource Pooling (
tools/resources.py): Resource pool management and optimization
- Logging System (
- State Collection (
state/collector.py): System metrics gathering and aggregation - Display Engine (
state/display.py): User-friendly progress visualization and formatting - Status Builder (
state/builder.py): Status data construction and transformation - State Models (
state/models.py): Monitoring data structures and metrics - State Monitoring (
state/monitor.py): Real-time state monitoring and tracking - State Enumerations (
state/enums.py): State-related enumerations and constants - State Types (
state/types.py): State type definitions and interfaces
The system implements a 4-stage pipeline for comprehensive data acquisition and validation:
-
Search Stage (
stage/definition.py:SearchStage):- Intelligent GitHub code search with advanced query optimization
- Multi-provider search support (API + Web)
- Query refinement using mathematical optimization algorithms
- Rate-limited search execution with adaptive throttling
-
Gather Stage (
stage/definition.py:GatherStage):- Detailed information acquisition from search results
- Content extraction and parsing
- Pattern matching for key identification
- Structured data collection and normalization
-
Check Stage (
stage/definition.py:CheckStage):- API key validation against actual service endpoints
- Authentication verification and capability testing
- Service availability and response validation
- Error handling and retry mechanisms
-
Inspect Stage (
stage/definition.py:InspectStage):- API capability inspection for validated keys
- Model enumeration and feature detection
- Service limits and quota analysis
- Comprehensive capability profiling
The system features a sophisticated Query Optimization Engine with mathematical foundations:
-
Regex Parser
- Advanced regex pattern parsing with support for complex syntax
- Handles escaped characters, character classes, and quantifiers
- Converts patterns into analyzable segment structures
-
Splittability Analyzer
- Mathematical analysis of pattern divisibility
- Recursive depth limiting for safety
- Value threshold analysis for optimization feasibility
- Resource cost estimation for performance control
-
Enumeration Optimizer
- Intelligent enumeration strategy selection
- Multi-dimensional optimization (depth, breadth, value)
- Combinatorial analysis for optimal segment selection
- Topological sorting for dependency resolution
-
Query Generator
- Generates optimized query variants from enumeration strategies
- Supports configurable enumeration depth
- Produces mathematically optimal query distributions
- Maintains query semantic equivalence
- Mathematical Modeling: Uses mathematical principles to analyze regex patterns
- Enumeration Strategy: Intelligent selection of optimal enumeration depth and combinations
- Resource Management: Prevents resource exhaustion through intelligent limiting
- Performance Optimization: Singleton pattern ensures optimal memory usage
- OpenAI and compatible interfaces
- Anthropic Claude
- Azure OpenAI
- Google Gemini
- AWS Bedrock
- GooeyAI
- Stability AI
- ηΎεΊ¦ζεΏδΈθ¨
- ζΊθ°±AI
- Custom providers
- FOFA: Cyberspace asset discovery and network mapping
- Shodan: Internet-connected device enumeration
- Custom REST APIs: Generic API integration framework
- GraphQL Endpoints: Flexible query-based data acquisition
- Web Scraping: JavaScript-rendered content and dynamic sites
- Database Connectors: Direct database query capabilities
- Data Mining: Large-scale information extraction and analysis
- Multi-Source Support: GitHub, FOFA, Shodan, and custom endpoints
- Adaptive Query Engine: Intelligent optimization for different data sources
- Protocol Agnostic: REST, GraphQL, WebSocket, and web scraping support
- Rate Limiting: Per-source intelligent rate control and quota management
- Dynamic Stage System: Configurable processing pipelines with DAG execution
- Plugin Architecture: Extensible framework for custom data sources and processors
- Dependency Resolution: Automatic stage ordering and dependency management
- Handler Registration: Pluggable processors for flexible data transformation
- Asynchronous Processing: Multi-threaded task execution with intelligent queuing
- Adaptive Load Balancing: Dynamic resource allocation based on workload
- Query Optimization: Mathematical modeling for optimal search strategies
- Resource Monitoring: Real-time performance tracking and bottleneck detection
- Fault Tolerance: Comprehensive error handling, retry mechanisms, and recovery
- State Persistence: Queue state recovery and graceful shutdown capabilities
- Security: Credential management, API key redaction, and secure storage
- Monitoring: Real-time analytics, alerting, and performance visualization
- Python: 3.10+
- Libraries:
PyYAML - Optional:
uvloop(Linux/macOS performance boost) - Development:
pytest,black,mypy(for contributors)
π For comprehensive documentation, tutorials, and advanced usage guides, please visit DeepWiki
-
Installation
git clone https://github.com/wzdnzd/harvester.git cd harvester pip install -r requirements.txt -
Configuration
Choose one of the following methods to create your configuration
Method 1: Generate default configuration
python main.py --create-configMethod 2: Copy from examples
# For basic configuration
cp examples/config-simple.yaml config.yaml
# For full configuration with all options
cp examples/config-full.yaml config.yamlEdit the configuration file:
- Set your Github session token or API key
- Configure provider search patterns
- Adjust rate limits and thread counts
The system provides two configuration templates:
-
Basic Configuration - Suitable for quick start:
# Global application settings global: workspace: "./data" # Working directory github_credentials: sessions: - "your_github_session_here" # GitHub session token strategy: "round_robin" # Load balancing strategy # Pipeline stage configuration pipeline: threads: search: 1 # Search threads (keep low) gather: 4 # Acquisition threads check: 2 # Validation threads inspect: 1 # API capability inspection threads # System monitoring settings monitoring: update_interval: 2.0 # Monitoring update interval error_threshold: 0.1 # Error rate threshold # Data persistence configuration persistence: auto_restore: true # Auto restore state on startup shutdown_timeout: 30 # Shutdown timeout in seconds # Global rate limiting configuration ratelimits: github_web: base_rate: 0.5 # Base rate in requests per second burst_limit: 2 # Maximum burst size adaptive: true # Enable adaptive rate limiting # Provider task configurations tasks: - name: "openai" # Provider name enabled: true # Enable/disable provider provider_type: "openai" use_api: false # Use GitHub API for searching # Pipeline stage settings stages: search: true # Enable search stage gather: true # Enable acquisition stage check: true # Enable validation stage inspect: true # Enable API capability inspection # Pattern matching configuration patterns: key_pattern: "sk(?:-proj)?-[a-zA-Z0-9]{20}T3BlbkFJ[a-zA-Z0-9]{20}" # Search conditions conditions: - query: '"T3BlbkFJ"'
-
Full Configuration - Includes all advanced options:
display: Display and monitoring settingsglobal: Global system configurationpipeline: Pipeline stage configurationmonitoring: System monitoring parameterspersistence: Data persistence settingsworker: Worker pool configurationratelimits: Rate limiting settingstasks: Provider task configurations
π For complete configuration examples, please refer to:
examples/config-full.yaml- Comprehensive configuration with all available optionsexamples/config-simple.yaml- Basic configuration for quick start
The tasks section is the core of the configuration, defining what providers to search and how to process them. Refer to the basic configuration example above for a complete tasks configuration.
name: Unique identifier for the taskprovider_type: Determines validation method (openai,openai_like,anthropic,gemini, etc.)api: API endpoint configuration for key validationpatterns.key_pattern: Regex pattern to identify valid API keysconditions: Search queries to find potential keysstages: Enable/disable specific processing stagesextras.directory: Custom output directory for results
- Running
python main.py # Use default config python main.py -c custom.yaml # Use custom config python main.py --validate # Validate config python main.py --log-level DEBUG # Enable debug logging
harvester/
βββ config/ # Configuration management
β βββ accessor.py # Configuration access utilities
β βββ defaults.py # Default configuration values
β βββ loader.py # Configuration loading
β βββ schemas.py # Configuration schemas
β βββ validator.py # Configuration validation
β βββ __init__.py # Package initialization
βββ constant/ # System constants
β βββ monitoring.py # Monitoring constants
β βββ runtime.py # Runtime constants
β βββ search.py # Search constants
β βββ system.py # System constants
β βββ __init__.py # Package initialization
βββ core/ # Core domain models
β βββ auth.py # Authentication
β βββ enums.py # System enumerations
β βββ exceptions.py # Custom exceptions
β βββ metrics.py # Performance metrics
β βββ models.py # Core data models & task definitions
β βββ types.py # Core type definitions
β βββ __init__.py # Package initialization
βββ examples/ # Configuration examples
β βββ config-full.yaml # Complete configuration template
β βββ config-simple.yaml # Basic configuration template
βββ manager/ # Task and resource management
β βββ base.py # Base management classes
β βββ pipeline.py # Pipeline management
β βββ queue.py # Queue management
β βββ shutdown.py # Shutdown coordination
β βββ status.py # Status management
β βββ task.py # Task management
β βββ worker.py # Worker thread management
β βββ __init__.py # Package initialization
βββ refine/ # Query optimization
β βββ config.py # Refine configuration
β βββ engine.py # Optimization engine
β βββ generator.py # Query generation
β βββ optimizer.py # Query optimization
β βββ parser.py # Query parsing
β βββ segment.py # Pattern segmentation
β βββ splittability.py # Splittability analysis
β βββ strategies.py # Optimization strategies
β βββ types.py # Refine type definitions
β βββ __init__.py # Package initialization
βββ search/ # Search engines
β βββ client.py # Search client
β βββ provider/ # Provider implementations
β β βββ anthropic.py # Anthropic provider
β β βββ azure.py # Azure OpenAI provider
β β βββ base.py # Base provider class
β β βββ bedrock.py # AWS Bedrock provider
β β βββ doubao.py # ByteDance Doubao provider
β β βββ gemini.py # Google Gemini provider
β β βββ gooeyai.py # GooeyAI provider
β β βββ openai.py # OpenAI provider
β β βββ openai_like.py # OpenAI-compatible provider
β β βββ qianfan.py # Baidu Qianfan provider
β β βββ registry.py # Provider registry
β β βββ stabilityai.py # Stability AI provider
β β βββ vertex.py # Google Vertex AI provider
β β βββ __init__.py # Package initialization
β βββ __init__.py # Package initialization
βββ stage/ # Pipeline stages
β βββ base.py # Base stage classes
β βββ definition.py # Stage implementations
β βββ factory.py # Stage factory
β βββ registry.py # Stage registry
β βββ resolver.py # Dependency resolver
β βββ __init__.py # Package initialization
βββ state/ # State management
β βββ builder.py # Status builder
β βββ collector.py # State collection
β βββ display.py # Display engine
β βββ enums.py # State enumerations
β βββ models.py # State data models
β βββ monitor.py # State monitoring
β βββ types.py # State type definitions
β βββ __init__.py # Package initialization
βββ storage/ # Storage and persistence
β βββ atomic.py # Atomic file operations
β βββ persistence.py # Result persistence
β βββ recovery.py # Task recovery
β βββ shard.py # NDJSON shard management
β βββ snapshot.py # Snapshot management
β βββ __init__.py # Package initialization
βββ tools/ # Tools and utilities
β βββ agent.py # User agent management
β βββ balancer.py # Load balancing
β βββ coordinator.py # Resource coordination
β βββ credential.py # Credential management
β βββ logger.py # Logging system
β βββ patterns.py # Pattern matching utilities
β βββ ratelimit.py # Rate limiting
β βββ resources.py # Resource pooling
β βββ retry.py # Retry framework
β βββ utils.py # General utilities
β βββ __init__.py # Package initialization
βββ .dockerignore # Docker ignore rules
βββ .gitignore # Git ignore rules
βββ Dockerfile # Docker container configuration
βββ entrypoint.sh # Docker entrypoint script
βββ LICENSE # License file
βββ main.py # Entry point and application core
βββ README.md # English documentation
βββ README.zh-CN.md # Chinese documentation
βββ requirements.txt # Python dependencies
βββ __init__.py # Root package initialization
-
Real-time Monitoring
- Task status tracking
- Performance metrics collection
- Resource usage monitoring
- Alert system
-
Configuration Flexibility
- Multi-provider configuration
- Custom search patterns
- Adjustable performance parameters
- Dynamic resource allocation
-
Extensibility
- Plugin-style providers
- Custom pipeline stages
- Configurable monitoring system
- Flexible recovery strategies
# Issue: pip install fails
# Solution: Upgrade pip and use virtual environment
python -m pip install --upgrade pip
python -m venv venv
# Linux/macOS
source venv/bin/activate
# Windows
venv\Scripts\activate
pip install -r requirements.txt# Issue: Configuration validation fails
# Solution: Validate configuration file
python main.py --validate
# Issue: Missing configuration file
# Solution: Create from example
cp examples/config-simple.yaml config.yaml# Issue: Too many API requests
# Solution: Adjust rate limits in config
rate_limits:
github_api:
base_rate: 0.1 # Reduce rate
adaptive: true # Enable adaptive limiting# Issue: High memory usage
# Solution: Reduce batch sizes and thread counts
pipeline:
threads:
search: 1
gather: 2 # Reduce from default
persistence:
batch_size: 25 # Reduce from default 50# Issue: Connection timeouts
# Solution: Increase timeout values
api:
timeout: 60 # Increase from default 30
retries: 5 # Increase retry attempts# Enable debug logging
python main.py --log-level DEBUG
# Save debug output to file
python main.py --log-level DEBUG > debug.log 2>&1- Never commit credentials to version control
- Use environment variables for sensitive configuration
- Rotate credentials regularly to minimize exposure risk
- Implement least privilege access for API keys
# Example: Secure credential configuration
global:
github_credentials:
sessions:
- "${GITHUB_SESSION_1}" # Use environment variables
- "${GITHUB_SESSION_2}"
tokens:
- "${GITHUB_TOKEN_1}"- Respect robots.txt and website terms of service
- Implement rate limiting to avoid overwhelming target services
- Log redaction automatically removes sensitive data from logs
- Data retention policies should comply with applicable regulations
- Review legal requirements before using in production
- Obtain necessary permissions for data collection
- Implement data anonymization where required
- Document data processing activities for compliance
-
Limitations
- Respect Github API usage limits
- Configure rate limits appropriately
- Mind memory usage
- Handle sensitive data carefully
-
Best Practices
- Use appropriate thread counts
- Backup results regularly
- Monitor error rates
- Handle alerts promptly
- Abstract Data Source Interface: Create a unified interface for all data sources
- Define
DataSourceProviderbase class with standard methods (search,gather,validate) - Implement adapter pattern for different API formats (REST, GraphQL, WebSocket)
- Add configuration schema for data source registration
- Support dynamic data source loading and hot-swapping
- Define
- Flexible Stage Definition: Move beyond the current 4-stage limitation
- Create
StageDefinitionconfiguration format (YAML/JSON) - Implement dynamic stage loading from configuration files
- Add stage composition and conditional execution
- Support user-defined stage workflows and DAG customization
- Create
- Pluggable Processing Architecture: Replace fixed function calls with configurable handlers
- Implement
HandlerRegistryfor stage-specific processors - Create
ProcessorInterfacewith standardized input/output contracts - Add handler discovery mechanism (annotation-based or configuration-driven)
- Support middleware chains for request/response processing
- Implement
-
FOFA Integration
- Implement FOFA API client with authentication
- Add FOFA-specific query optimization
-
Shodan Integration
- Support data querying and extraction from Shodan
- Universal Web Scraper
- Build configurable web scraping engine
- Add support for JavaScript-rendered content (Selenium/Playwright)
- Implement anti-bot detection bypass mechanisms
- Create content extraction rule engine
- Plugin System
- Design plugin architecture with lifecycle management
- Create plugin marketplace and discovery mechanism
- Add plugin sandboxing and security validation
- Implement plugin dependency resolution
- Distributed Processing
- Add support for distributed task execution (Celery/RQ)
- Implement horizontal scaling with load balancing
- Create cluster management and node discovery
- Add distributed state synchronization
- Enhanced Security Features
- Implement credential encryption and secure storage
- Create rate limiting policies per data source
- Real-time Analytics Dashboard
- Build web-based monitoring interface
- Add real-time metrics visualization
- Implement alerting and notification system
- Create performance profiling and bottleneck analysis
- RESTful API Server
- Build comprehensive REST API for external integration
- Implement webhook support for real-time notifications
- Create SDK libraries for popular programming languages
Contributions are welcome! Before submitting a pull request, please ensure:
- Tests are updated
- Code follows style guidelines
- Documentation is added where necessary
- All tests pass
- π₯ High Priority: Data source abstraction and FOFA/Shodan integration
- π₯ High Priority: Stage system flexibility and handler registration
- π₯ High Priority: Plugin architecture and extensibility framework
- π₯ Medium Priority: Performance optimization and distributed processing
- π₯ Medium Priority: Web-based monitoring dashboard
This project is licensed under the Creative Commons Attribution-NonCommercial 4.0 International License (CC BY-NC 4.0). See the LICENSE file for details.
This project is developed solely for educational and technical research purposes. Users should exercise caution and responsibility when using this software.
Key Points:
- This software is intended for learning, research, and educational use only
- Users must comply with all applicable laws and regulations in their jurisdiction
- Users are responsible for ensuring their usage complies with the terms of service of any third-party platforms or APIs
- The project authors do not recommend, encourage, or endorse the use of this software for illegally obtaining others' API keys or credentials
- The project authors assume no responsibility for any disputes, legal issues, or damages arising from the use of this software
- Commercial use is strictly prohibited without explicit written permission
- Users should respect the intellectual property rights and privacy of others
By using this software, you acknowledge that you have read, understood, and agree to these terms. Use at your own risk.
For questions or other inquiries during usage, please contact the project maintainers through GitHub Issues.