CrawlGPT is a web content crawler with LLM-powered RAG (Retrieval-Augmented Generation) capabilities. It extracts content from URLs, summarizes it, and lets you query the results in natural language through an LLM.
- Intelligent Web Crawling
  - Async web content extraction using Playwright
  - Smart rate limiting and validation
  - Configurable crawling strategies
- Advanced Content Processing
  - Automatic text chunking and summarization
  - Vector embeddings via FAISS
  - Context-aware response generation
- Streamlit Chat Interface
  - Clean, responsive UI
  - Real-time content processing
  - Conversation history
  - User authentication
- Vector Database
  - FAISS-powered similarity search
  - Efficient content retrieval
  - Persistent storage
- User Management
  - SQLite database backend
  - Secure password hashing
  - Chat history tracking
- Monitoring & Utils
  - Request metrics collection
  - Progress tracking
  - Data import/export
  - Content validation
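
The chunking and similarity-search features listed above can be illustrated with a small, dependency-free sketch. This is not CrawlGPT's implementation: the names `chunk_text`, `embed`, `cosine`, and `top_k` are hypothetical, the word-count "embedding" is a stand-in for a real embedding model, and the brute-force cosine search is a stand-in for a FAISS index.

```python
import math
import re
from collections import Counter


def chunk_text(text, chunk_size=50, overlap=10):
    """Split text into word-based chunks; consecutive chunks share `overlap` words."""
    if not 0 <= overlap < chunk_size:
        raise ValueError("overlap must be >= 0 and smaller than chunk_size")
    words = text.split()
    step = chunk_size - overlap
    chunks = []
    for start in range(0, len(words), step):
        chunks.append(" ".join(words[start:start + chunk_size]))
        if start + chunk_size >= len(words):
            break  # the last chunk already reaches the end of the text
    return chunks


def embed(text):
    """Toy embedding (assumption): lowercase word counts instead of a learned vector."""
    return Counter(re.findall(r"[a-z0-9]+", text.lower()))


def cosine(a, b):
    """Cosine similarity between two sparse word-count vectors."""
    dot = sum(a[term] * b[term] for term in a.keys() & b.keys())
    norm_a = math.sqrt(sum(v * v for v in a.values()))
    norm_b = math.sqrt(sum(v * v for v in b.values()))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0


def top_k(query, chunks, k=3):
    """Return the k (score, chunk) pairs most similar to the query, best first."""
    query_vec = embed(query)
    scored = [(cosine(query_vec, embed(chunk)), chunk) for chunk in chunks]
    scored.sort(key=lambda pair: pair[0], reverse=True)
    return scored[:k]
```

In a RAG pipeline of this shape, the chunks returned by `top_k` would be passed to the LLM as context for answering the user's question. The overlap between chunks is a common way to avoid cutting a sentence's context at a chunk boundary.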
Demo video (streamlit-chat_app-2025-01-25-23-01-66.webm): example of CrawlGPT in action.
- Python >= 3.8
- Operating System: OS Independent
- Required packages are handled by the setup script.
- Clone the repository, then change into the project directory:
      cd CRAWLGPT
- Run the Setup Script:
      python -m setup_env
  This script installs dependencies, creates a virtual environment, and prepares the project.
- Update Your Environment Variables:
  - Create or modify the .env file.
  - Add your Groq API key and Ollama API key:
        GROQ_API_KEY=your_groq_api_key_here
        OLLAMA_API_TOKEN=your_ollama_api_key_here
- Activate the Virtual Environment:
      source .venv/bin/activate   # On Unix/macOS
      .venv\Scripts\activate      # On Windows
- Run the Application:
      python -m streamlit run src/crawlgpt/ui/chat_app.py
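
Before launching the app, it can help to confirm that the keys from the environment-variable step are visible to Python. The following is a minimal sketch, not part of CrawlGPT: `load_env_file` and `missing_keys` are hypothetical helper names, and the check covers only the two variable names shown in that step. It uses `load_dotenv` from python-dotenv (listed among the project's dependencies) when that package is importable.

```python
import os

# Variable names taken from the .env step of the installation instructions.
REQUIRED_KEYS = ("GROQ_API_KEY", "OLLAMA_API_TOKEN")


def load_env_file():
    """Load a .env file into os.environ if python-dotenv is installed."""
    try:
        from dotenv import load_dotenv  # provided by the python-dotenv package
    except ImportError:
        return False  # rely on variables already exported in the shell
    return load_dotenv()


def missing_keys(env=None, required=REQUIRED_KEYS):
    """Return the required variable names that are unset or blank in `env`."""
    env = os.environ if env is None else env
    return [key for key in required if not env.get(key, "").strip()]
```

A typical use would be to call `load_env_file()` once at startup and then stop with a clear message if `missing_keys()` returns a non-empty list, rather than failing later on the first API request.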
streamlit==1.41.1
groq==0.15.0
sentence-transformers==3.3.1
faiss-cpu==1.9.0.post1
crawl4ai==0.4.247
python-dotenv==1.0.1
pydantic==2.10.5
aiohttp==3.11.11
beautifulsoup4==4.12.3
numpy==2.2.0
tqdm==4.67.1
playwright>=1.41.0
asyncio>=3.4.3

pytest==8.3.4
pytest-mockito==0.0.4
black==24.2.0
isort==5.13.0
flake8==7.0.0
crawlgpt/
├── src/
│   └── crawlgpt/
│       ├── core/                         # Core functionality
│       │   ├── database.py               # SQL database handling
│       │   ├── LLMBasedCrawler.py        # Main crawler implementation
│       │   ├── DatabaseHandler.py        # Vector database (FAISS)
│       │   └── SummaryGenerator.py       # Text summarization
│       ├── ui/                           # User Interface
│       │   ├── chat_app.py               # Main Streamlit app
│       │   ├── chat_ui.py                # Development UI
│       │   └── login.py                  # Authentication UI
│       └── utils/                        # Utilities
│           ├── content_validator.py      # URL/content validation
│           ├── data_manager.py           # Import/export handling
│           ├── helper_functions.py       # General helpers
│           ├── monitoring.py             # Metrics collection
│           └── progress.py               # Progress tracking
├── tests/                                # Test suite
│   └── test_core/
│       ├── test_database_handler.py      # Vector DB tests
│       ├── test_integration.py           # Integration tests
│       ├── test_llm_based_crawler.py     # Crawler tests
│       └── test_summary_generator.py     # Summarizer tests
├── .github/                              # CI/CD
│   └── workflows/
│       └── Push_to_hf.yaml               # HuggingFace sync
├── Docs/
│   └── MiniDoc.md                        # Documentation
├── .dockerignore                         # Docker exclusions
├── .gitignore                            # Git exclusions
├── Dockerfile                            # Container config
├── LICENSE                               # MIT License
├── README.md                             # Project documentation
├── README_hf.md                          # HuggingFace README
├── pyproject.toml                        # Project metadata
├── pytest.ini                            # Test configuration
├── crawlgpt.db                           # Database
└── setup_env.py                          # Environment setup
Run all tests:
    python -m pytest
The tests include unit tests for core functionality and integration tests for end-to-end workflows.
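
As an illustration of the unit-test style that `python -m pytest` collects, here is a minimal sketch. The `is_valid_url` helper is hypothetical and defined inline so the example is self-contained; it is not the API of the project's `content_validator.py`.

```python
from urllib.parse import urlparse


def is_valid_url(url):
    """Hypothetical validator: accept only http(s) URLs that include a host."""
    parsed = urlparse(url)
    return parsed.scheme in ("http", "https") and bool(parsed.netloc)


# pytest collects functions whose names start with `test_`.
def test_accepts_http_and_https():
    assert is_valid_url("https://example.com/page")
    assert is_valid_url("http://example.com")


def test_rejects_other_schemes_and_missing_hosts():
    assert not is_valid_url("ftp://example.com")
    assert not is_valid_url("example.com")
    assert not is_valid_url("https://")
```

With pytest's default discovery rules, saving this in a file whose name matches `test_*.py` is enough for the two test functions to be picked up and run.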
This project is licensed under the MIT License - see the LICENSE file for details.
- Inspired by the potential of GPT models for intelligent content processing.
- Special thanks to the creators of Crawl4ai, Groq, FAISS, and Playwright for their powerful tools.
- Jatin Mehra
Contributions are welcome! Please feel free to submit a Pull Request. For major changes, open an issue first to discuss your proposal.
- Fork the Project.
- Create your Feature Branch:
      git checkout -b feature/AmazingFeature
- Commit your Changes:
      git commit -m 'Add some AmazingFeature'
- Push to the Branch:
      git push origin feature/AmazingFeature
- Open a Pull Request.