A Python tool for scraping JFK files from the National Archives website, handling pagination, and transforming data from PDF to Markdown to JSON for use as a "Lite LLM" dataset. Includes advanced optimization, performance monitoring, and custom GPT integration for the complete collection of 1,123 declassified documents.
- Project Overview
- Features
- System Architecture
- Installation
- Usage
- Output Structure
- PDF to Markdown Conversion
- Testing
- Advanced Optimization
- GPT Integration
- Project Structure
- Contributing
- License
This project aims to:
- Scrape JFK file URLs from the National Archives website
- Handle pagination across approximately 113 pages with 1,123 entries
- Download PDF files from the extracted URLs with parallel processing
- Convert PDF files to Markdown format with PDF2MD wrapper and OCR capabilities
- Transform Markdown to JSON format with robust conversion methods
- Store the processed data for later use as a "Lite LLM" dataset
- Provide optimization for large-scale processing with adaptive resource management
- Include comprehensive performance monitoring and visualization
- Create a custom GPT (JFK Files Archivist) with the processed data
- Provide tools for querying and analyzing the declassified documents
- Robust Web Scraping: Handles pagination, rate limiting, and network retries
- Parallel Processing: Concurrent downloads with adaptive thread management
- Smart PDF Processing:
  - Automatic detection of scanned vs. digital documents
  - OCR support for scanned documents with quality options
  - Document repair capabilities for problematic PDFs
- Enhanced Markdown Conversion:
  - Multiple conversion strategies
  - Quality validation and post-processing
  - Fallback mechanisms for reliability
- JSON Transformation:
  - Structured format for GPT integration
  - Document metadata extraction
  - Full-text and section-based organization
- Performance Optimization:
  - Adaptive thread pool with resource monitoring
  - Checkpointing for resumable operations
  - Memory usage optimization for large-scale processing
- Monitoring & Visualization:
  - Real-time performance metrics
  - Resource usage tracking
  - Visual progress indicators and charts
- GPT Integration:
  - Configurable GPT capabilities
  - Optimized knowledge upload
  - Test suite for query validation
The JFK Files Scraper follows a pipeline architecture with these stages:
flowchart LR
Scrape[Web Scraping] --> Download[PDF Download]
Download --> PDFtoMD[PDF to Markdown]
PDFtoMD --> MDtoJSON[Markdown to JSON]
MDtoJSON --> Store[Storage]
Store --> GPTPrep[GPT Preparation]
GPTPrep --> GPTUpload[GPT Upload]
Key components:
- Crawler: Handles webpage navigation and link extraction
- Downloader: Manages parallel retrieval and storage of PDF files
- Transformer: Coordinates the PDF → Markdown → JSON pipeline
- Storage: Handles file I/O and data persistence
- Performance Monitor: Tracks resource usage and optimization opportunities
- GPT Integrator: Prepares and uploads data for GPT knowledge base
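Conceptually, these components chain together in a simple sequence. The sketch below is illustrative only; the stage functions are placeholders rather than the project's actual APIs:

```python
# Minimal, illustrative sketch of the pipeline shape; the stage bodies are
# placeholders, not the project's actual implementations.
from pathlib import Path
from typing import Callable, List, Tuple

def scrape_urls(data_dir: Path) -> None: ...        # Crawler: paginate and collect PDF links
def download_pdfs(data_dir: Path) -> None: ...      # Downloader: parallel PDF retrieval
def pdfs_to_markdown(data_dir: Path) -> None: ...   # Transformer: PDF -> Markdown (OCR if needed)
def markdown_to_json(data_dir: Path) -> None: ...   # Transformer: Markdown -> JSON
def prepare_gpt_upload(data_dir: Path) -> None: ... # GPT preparation: consolidate JSON for upload

STAGES: List[Tuple[str, Callable[[Path], None]]] = [
    ("scrape", scrape_urls),
    ("download", download_pdfs),
    ("pdf_to_md", pdfs_to_markdown),
    ("md_to_json", markdown_to_json),
    ("gpt_prep", prepare_gpt_upload),
]

def run_pipeline(data_dir: Path = Path("data")) -> None:
    for name, stage in STAGES:
        print(f"running stage: {name}")
        stage(data_dir)  # a real run would also checkpoint between stages

if __name__ == "__main__":
    run_pipeline()
```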
This project uses a Python environment with specific dependencies. Set up using either:
Option 1: Python venv
# Create and activate virtual environment
python -m venv jfk-env-py310
source jfk-env-py310/bin/activate # Linux/macOS
# OR
jfk-env-py310\Scripts\activate # Windows
# Install dependencies
pip install -r config/requirements.txt
Option 2: Conda Environment
# Create and activate conda environment
conda create -n jfkfiles_env python=3.10
conda activate jfkfiles_env
# Install dependencies
pip install -r config/requirements.txt
For automatic environment activation with direnv:
# Create .envrc file
echo "layout python3" > .envrc
direnv allow
The project requires various Python packages and system dependencies:
Python Packages
pip install -r config/requirements.txt
Key Python dependencies include:
- Crawl4AI for web scraping
- PyMuPDF (fitz) for PDF processing
- pytesseract and pdf2image for OCR
- psutil for system monitoring
- matplotlib for visualization
- openai and tiktoken for GPT integration
For OCR functionality, install these system dependencies:
Linux (Ubuntu/Debian)
sudo apt-get update
sudo apt-get install -y tesseract-ocr poppler-utils
macOS
brew install tesseract poppler
Windows
- Download and install Tesseract OCR
- Add Tesseract to your PATH environment variable
- Install poppler from poppler-windows
- Add poppler bin directory to your PATH
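Whichever platform you are on, a quick way to confirm that Python can see both Tesseract and poppler is a short check like the following (a generic snippet, not one of the project scripts):

```python
# Quick sanity check that the OCR system dependencies are visible to Python.
import shutil

import pytesseract  # pip install pytesseract

# Tesseract: raises TesseractNotFoundError if the binary is not on PATH.
print("tesseract:", pytesseract.get_tesseract_version())

# poppler: pdf2image shells out to pdftoppm/pdfinfo, so they must be on PATH.
for tool in ("pdftoppm", "pdfinfo"):
    print(f"{tool}:", shutil.which(tool) or "NOT FOUND")
```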
Verify your OCR installation:
scripts/run_pdf2md_diagnostic.sh
For GPT functionality:
pip install openai tiktoken
Create a .env file with your OpenAI API key:
OPENAI_API_KEY=your_api_key_here
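The GPT scripts read this key at runtime. As a hedged example of how a script might load it, assuming the common python-dotenv helper and the 1.x openai client (the project's own scripts may load it differently):

```python
# Minimal example of reading the key from .env; illustrative only.
import os

from dotenv import load_dotenv  # pip install python-dotenv (assumed helper)
from openai import OpenAI       # assumes the 1.x openai client

load_dotenv()  # reads OPENAI_API_KEY from a .env file in the working directory
api_key = os.getenv("OPENAI_API_KEY")
if not api_key:
    raise SystemExit("OPENAI_API_KEY is not set; check your .env file")

client = OpenAI(api_key=api_key)  # the client also reads the env var by default
```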
python src/jfk_scraper.py
Command-line options:
--url Base URL for the JFK records page (default: https://www.archives.gov/research/jfk/release-2025)
--start-page Page to start scraping from (default: 1)
--end-page Page to end scraping at (default: scrape all pages)
--limit Limit the number of files to process
--test Run in test mode with a single PDF
--threads Number of parallel download threads (default: 5)
--rate-limit Delay between starting new downloads in seconds (default: 0.5)
--checkpoint-interval Save checkpoint after processing this many files (default: 10)
--ocr Enable OCR processing for scanned documents
--force-ocr Force OCR processing for all documents
--ocr-quality OCR quality setting: low, medium, high (default: high)
--resume Resume from last checkpoint if available
--clean Clean all checkpoints before starting
--log-level Set logging level (DEBUG, INFO, WARNING, ERROR, CRITICAL)
--max-workers Maximum number of concurrent downloads
--scrape-all Scrape all 113 pages and process all 1,123 files
--organize Organize PDFs into subdirectories by collection (default: True)
--flat Save PDFs in a flat directory structure
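As a rough illustration of how such a flag surface can be defined with argparse, the sketch below covers a subset of the options above; it is not the project's actual parser in src/jfk_scraper.py:

```python
# Illustrative argparse sketch covering a subset of the documented flags;
# defaults mirror the documented ones, but this is not the project's parser.
import argparse

def build_parser() -> argparse.ArgumentParser:
    p = argparse.ArgumentParser(description="JFK files scraper (sketch)")
    p.add_argument("--url", default="https://www.archives.gov/research/jfk/release-2025")
    p.add_argument("--start-page", type=int, default=1)
    p.add_argument("--end-page", type=int, default=None)   # default: scrape all pages
    p.add_argument("--limit", type=int, default=None)
    p.add_argument("--threads", type=int, default=5)
    p.add_argument("--rate-limit", type=float, default=0.5)
    p.add_argument("--checkpoint-interval", type=int, default=10)
    p.add_argument("--ocr", action="store_true")
    p.add_argument("--force-ocr", action="store_true")
    p.add_argument("--ocr-quality", choices=["low", "medium", "high"], default="high")
    p.add_argument("--resume", action="store_true")
    p.add_argument("--scrape-all", action="store_true")
    return p

if __name__ == "__main__":
    print(build_parser().parse_args())
```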
Basic Usage
# Scrape and process a few files for testing
python src/jfk_scraper.py --start-page 1 --end-page 2 --limit 5
OCR Processing
# Process all PDFs with OCR support
python src/jfk_scraper.py --ocr
# Control OCR quality (low, medium, high - default is high)
python src/jfk_scraper.py --ocr --ocr-quality medium
# Force OCR for all documents (even if they appear to be digital)
python src/jfk_scraper.py --force-ocr --ocr-quality high
# Run in test mode with OCR
python src/jfk_scraper.py --test --force-ocr --ocr-quality high
OCR with Convenience Script
scripts/run_with_ocr.sh --scrape-all --ocr
scripts/run_with_ocr.sh --scrape-all --ocr --ocr-quality medium
scripts/run_with_ocr.sh --scrape-all --force-ocr
Testing
# Test with a single file
python src/jfk_scraper.py --test
Large-Scale Processing
# Process specific page range with resource optimization
python src/jfk_scraper.py --start-page 5 --end-page 10 --max-workers 8 --checkpoint-interval 20
# Process all files with maximum optimization
python src/jfk_scraper.py --scrape-all --max-workers 12 --checkpoint-interval 50
# Resume processing from last checkpoint
python src/jfk_scraper.py --resume --max-workers 8
There are two options for monitoring performance:
Using the Performance Monitoring Module
python -m src.performance_monitoring --mode monitor
Using the Simplified Monitor Script
# View current status
python src/utils/monitor_progress.py --mode status
# Continuous monitoring
python src/utils/monitor_progress.py --mode monitor
# Generate detailed report
python src/utils/monitor_progress.py --mode report
Both tools support these monitoring options:
--mode Operation mode: 'monitor' for continuous monitoring, 'status' for current status, 'report' for one-time report
--interval Metrics collection interval in seconds (default: 5)
--report-interval Report generation interval in seconds (default: 300)
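Metrics like these are typically gathered with psutil, which is among the project's dependencies. The following is a minimal, illustrative sampler, not the project's performance_monitoring module:

```python
# Minimal psutil-based metrics sampler; illustrative only, not the project's
# performance_monitoring module.
import csv
import os
import time
from datetime import datetime, timezone

import psutil

def sample_metrics(path: str = "metrics/metrics.csv",
                   interval: float = 5.0, samples: int = 12) -> None:
    os.makedirs(os.path.dirname(path), exist_ok=True)
    with open(path, "a", newline="") as f:
        writer = csv.writer(f)
        for _ in range(samples):
            writer.writerow([
                datetime.now(timezone.utc).isoformat(),
                psutil.cpu_percent(interval=None),  # CPU utilisation since last call
                psutil.virtual_memory().percent,    # RAM in use (%)
                psutil.disk_usage("/").percent,     # disk in use (%)
            ])
            f.flush()
            time.sleep(interval)

if __name__ == "__main__":
    sample_metrics()
```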
The project is organized with the following directory structure:
jfk-files/
├── config/                  # Configuration files
├── data/                    # Data files
│   ├── json/                # Individual JSON files for each document
│   ├── lite_llm/            # Processed JSON data for Lite LLM dataset
│   │   ├── consolidated_jfk_files.json  # Combined file for GPT upload
│   │   └── gpt_configuration.json       # GPT configuration settings
│   ├── markdown/            # Converted Markdown files
│   └── pdfs/                # Downloaded PDF files
│       └── nara-104/        # Organized by collection
├── docs/                    # Documentation files
├── env/                     # Environment files
├── logs/                    # Log files
├── memory-bank/             # Project memory and context
├── metrics/                 # Performance monitoring data
│   ├── charts/              # Generated performance visualization charts
│   ├── metrics.csv          # CSV file with detailed metrics history
│   └── metrics.json         # Latest performance report in JSON format
├── scripts/                 # Helper scripts
├── src/                     # Source code
├── test_data/               # Test data files
├── test_output/             # Test output files
└── tests/                   # Test suite
The project uses a custom PDF2MD implementation with extensive capabilities:
- Smart Document Format Detection: Automatically detects if a PDF is scanned or digital
- Multi-tier Conversion Strategy: Uses different approaches based on document type
- OCR Support: Integrated OCR for scanned documents using pytesseract
- Adaptive Quality Settings: Low, medium, and high-quality OCR modes (150, 200, 300 DPI)
- Post-processing: Improves markdown output with consistent formatting
- Fallback Mechanisms: Multiple fallback strategies if primary conversion fails
- Performance Optimization: Efficient resource usage for large-scale processing
- Document Repair: Handles problematic PDFs with repair capabilities
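The core of the detection-plus-OCR flow can be sketched as follows, using PyMuPDF, pdf2image, and pytesseract. This is a simplified illustration; the project's pdf2md wrapper layers repair, fallback, and post-processing logic on top of it:

```python
# Simplified sketch of scanned-vs-digital detection and DPI-tiered OCR;
# the project's pdf2md wrapper adds repair and fallback logic on top of this.
import fitz  # PyMuPDF
import pytesseract
from pdf2image import convert_from_path

OCR_DPI = {"low": 150, "medium": 200, "high": 300}  # quality tiers from the docs

def looks_scanned(pdf_path: str, min_chars_per_page: int = 50) -> bool:
    """Treat the PDF as scanned if its pages carry almost no extractable text."""
    with fitz.open(pdf_path) as doc:
        text_chars = sum(len(page.get_text()) for page in doc)
        return text_chars < min_chars_per_page * max(doc.page_count, 1)

def pdf_to_markdown(pdf_path: str, ocr_quality: str = "high", force_ocr: bool = False) -> str:
    if force_ocr or looks_scanned(pdf_path):
        # OCR path: rasterize each page at the tier's DPI, then run Tesseract.
        pages = convert_from_path(pdf_path, dpi=OCR_DPI[ocr_quality])
        text = "\n\n".join(pytesseract.image_to_string(img) for img in pages)
    else:
        # Digital path: extract the embedded text layer directly.
        with fitz.open(pdf_path) as doc:
            text = "\n\n".join(page.get_text() for page in doc)
    return f"# {pdf_path}\n\n{text}"
```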
Test the PDF to Markdown conversion with these utilities:
# Test with default settings (no OCR)
python tests/test_pdf2md.py path/to/your/document.pdf
# Test with OCR enabled
python tests/test_ocr_flow.py --pdf path/to/your/document.pdf --force-ocr
# Compare different OCR quality levels
python tests/test_ocr_flow.py --pdf path/to/your/document.pdf --compare --output test_output
# Test different OCR quality settings
python tests/test_ocr_flow.py --pdf path/to/your/document.pdf --force-ocr --quality low
python tests/test_ocr_flow.py --pdf path/to/your/document.pdf --force-ocr --quality medium
python tests/test_ocr_flow.py --pdf path/to/your/document.pdf --force-ocr --quality high
Run the OCR diagnostic script to verify your system setup:
scripts/run_pdf2md_diagnostic.sh
The project includes a comprehensive test suite for validating all components:
# Run all tests
pytest tests/
# Run specific test categories
python tests/test_validate_pdf_to_markdown.py
python tests/test_verify_markdown_structure.py
python tests/test_markdown_to_json_validation.py
python tests/test_verify_json_lite_llm.py
python src/gpt/test_gpt_queries.py
python tests/test_end_to_end.py
# Test with verbose output
pytest -v tests/test_api.py
For large-scale processing with advanced optimization features:
# Use the optimization module directly
python -c "from src.optimization import optimize_full_scale_processing; optimize_full_scale_processing()"
# Run with all optimization flags
python src/jfk_scraper.py --scrape-all --max-workers auto --checkpoint-interval 30 --ocr --ocr-quality high
The optimization module provides the following (a rough sketch of the core ideas appears after this list):
- Adaptive thread pool that adjusts based on system resources
- Memory usage monitoring and throttling to prevent OOM errors
- Enhanced checkpointing with atomic writes and versioning
- Optimized PDF processing with parallel OCR for suitable documents
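The sketch below illustrates adaptive worker sizing, memory throttling, and atomic checkpoint writes using psutil and the standard library. It is illustrative only, not the project's optimization module:

```python
# Illustrative sketch of adaptive worker sizing, memory throttling, and atomic
# checkpoints; not the project's optimization module, just the general idea.
import json
import os
import tempfile
from concurrent.futures import ThreadPoolExecutor

import psutil

def pick_worker_count(max_workers: int = 12, mem_ceiling_pct: float = 85.0) -> int:
    """Scale workers with available CPUs, but back off when memory is tight."""
    cpu_based = min(max_workers, (os.cpu_count() or 2) * 2)
    if psutil.virtual_memory().percent > mem_ceiling_pct:
        return max(1, cpu_based // 2)  # throttle to reduce the risk of OOM errors
    return cpu_based

def process_in_batches(items, handler, batch_size: int = 20) -> None:
    """Re-size the pool between batches so memory pressure shrinks the next batch."""
    items = list(items)
    for start in range(0, len(items), batch_size):
        with ThreadPoolExecutor(max_workers=pick_worker_count()) as pool:
            list(pool.map(handler, items[start:start + batch_size]))

def write_checkpoint_atomically(state: dict, path: str = "checkpoint.json") -> None:
    """Write to a temp file and rename, so a crash never leaves a half-written checkpoint."""
    fd, tmp = tempfile.mkstemp(dir=os.path.dirname(path) or ".", suffix=".tmp")
    with os.fdopen(fd, "w") as f:
        json.dump(state, f)
    os.replace(tmp, path)  # atomic replace on both POSIX and Windows
```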
The project includes components for creating a custom GPT with the JFK files collection:
- src/gpt/gpt_config.py: Configuration for the JFK Files Archivist GPT
- src/gpt/upload_to_gpt.py: Script for uploading consolidated JSON to GPT
- src/gpt/test_gpt_queries.py: Test script for validating GPT query capabilities
- src/gpt/refine_instructions.py: Script for refining GPT instructions based on test results
- src/gpt/documentation/gpt_usage_guidelines.md: Comprehensive usage guidelines
The JFK Files Archivist GPT provides access to and analysis of the complete collection of declassified JFK files.
Setup
# Configure the GPT settings
python -m src.gpt.gpt_config
# Upload the consolidated JSON file to GPT knowledge
python -m src.gpt.upload_to_gpt
# Test the GPT with sample queries
python -m src.gpt.test_gpt_queries
Capabilities
- Retrieve specific documents by record ID
- Search across documents for topics, people, or events
- Analyze connections between documents
- Get historical context for the documents
For detailed usage guidelines, see src/gpt/documentation/gpt_usage_guidelines.md.
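As an illustration of the kind of lookup the consolidated knowledge file supports, the snippet below loads data/lite_llm/consolidated_jfk_files.json and searches for a record ID. The field names (documents, record_id, full_text) and the sample ID are assumptions for this example; the actual schema is defined by the project's conversion scripts:

```python
# Illustrative lookup against the consolidated JSON; field names such as
# "documents", "record_id", and "full_text" are assumed for this example and
# may differ from the project's actual schema.
import json
from pathlib import Path

def find_document(record_id: str,
                  path: Path = Path("data/lite_llm/consolidated_jfk_files.json")) -> dict | None:
    data = json.loads(path.read_text(encoding="utf-8"))
    for doc in data.get("documents", []):
        if doc.get("record_id") == record_id:
            return doc
    return None

if __name__ == "__main__":
    doc = find_document("104-10004-10143")  # hypothetical record ID, for illustration only
    if doc:
        print(doc.get("full_text", "")[:500])
```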
jfk-files/
├── config/                              # Configuration files
│   ├── requirements.txt                 # Python dependencies
│   ├── project_sync.yaml                # Project sync configuration
│   └── setup_ocr_env.sh                 # OCR environment setup script
├── data/                                # Data files
│   ├── json/                            # Individual JSON files for each document
│   ├── lite_llm/                        # Processed JSON data for Lite LLM dataset
│   │   ├── consolidated_jfk_files.json  # Combined file for GPT upload
│   │   ├── gpt_configuration.json       # GPT configuration settings
│   │   └── validation_report.md         # Validation report for GPT data
│   ├── markdown/                        # Converted Markdown files
│   └── pdfs/                            # Downloaded PDF files
│       └── nara-104/                    # Organized by collection
├── docs/                                # Documentation
│   ├── CODE_OF_CONDUCT.md               # Code of conduct guidelines
│   ├── CONTRIBUTING.md                  # Contribution guidelines
│   ├── INSTALLATION.md                  # Installation guide
│   ├── LICENSE                          # License information
│   ├── README.md                        # This file
│   ├── RELEASE_NOTES.md                 # Release notes
│   ├── ROADMAP.md                       # Project roadmap
│   ├── RUN.md                           # Running instructions
│   ├── SECURITY.md                      # Security guidelines
│   ├── TASKLIST.md                      # Project task list
│   └── refactoring_summary.md           # Summary of refactoring changes
├── env/                                 # Environment files
│   ├── activate_env.sh                  # Environment activation script
│   ├── activate_jfk_env.sh              # JFK environment activation script
│   └── jfk-env-py310/                   # Python virtual environment
├── logs/                                # Log files
│   ├── jfk_scraper.log                  # Main scraper log
│   ├── jfk_scraper_errors.log           # Error logs
│   └── run_output.log                   # Run output logs
├── metrics/                             # Performance metrics
│   ├── charts/                          # Performance visualization charts
│   ├── marker_diagnosis_results.txt     # Marker diagnosis results
│   ├── metrics.json                     # Metrics in JSON format
│   ├── metrics.csv                      # Metrics in CSV format
│   └── pdf2md_diagnosis_results.txt     # PDF2MD diagnosis results
├── scripts/                             # Helper scripts
│   ├── combine_json_files.py            # Script to combine JSON files
│   ├── format_gpt_json.py               # GPT JSON formatting
│   ├── generate_project_overview.sh     # Generate project overview
│   ├── run_pdf2md_diagnostic.sh         # OCR diagnostics
│   ├── run_test.py                      # Test runner
│   ├── run_with_ocr.sh                  # OCR convenience script
│   ├── setup.py                         # Setup script
│   └── validate_gpt_json.py             # Validate GPT JSON files
├── src/                                 # Source code modules
│   ├── __init__.py
│   ├── gpt/                             # GPT integration
│   │   ├── configure_capabilities.py
│   │   ├── documentation/
│   │   │   └── gpt_usage_guidelines.md
│   │   ├── gpt_config.py
│   │   ├── refine_instructions.py
│   │   ├── run_gpt_config.py
│   │   ├── test_gpt_queries.py
│   │   └── upload_to_gpt.py
│   ├── jfk_scraper.py                   # Main script
│   ├── optimization.py                  # Optimization utilities
│   ├── performance_monitoring.py        # Performance tracking
│   └── utils/                           # Core utilities
│       ├── __init__.py
│       ├── batch_utils.py               # Batch processing
│       ├── checkpoint_utils.py          # Checkpointing
│       ├── conversion_utils.py          # Format conversion
│       ├── download_utils.py            # File downloading
│       ├── logging_utils.py             # Logging
│       ├── minimal_marker.py            # PDF to MD compatibility
│       ├── monitor_progress.py          # Progress monitoring
│       ├── pdf2md/                      # PDF to Markdown conversion
│       │   ├── __init__.py
│       │   ├── pdf2md.py                # Core PDF to MD functionality
│       │   └── pdf2md_diagnostic.py     # PDF2MD diagnostics
│       ├── pdf2md_wrapper.py            # Enhanced PDF conversion
│       ├── pdf_utils.py                 # PDF processing utilities
│       ├── scrape_utils.py              # Web scraping
│       └── storage.py                   # Data storage
├── test_data/                           # Test data files
│   └── test_document.pdf                # Sample PDF for testing
├── test_output/                         # Test output files
│   ├── test_document_minimal.md         # Minimal test output
│   ├── test_document_with_ocr_high.md   # High-quality OCR test
│   ├── test_document_with_ocr_low.md    # Low-quality OCR test
│   ├── test_document_with_ocr_medium.md # Medium-quality OCR test
│   └── test_document_without_ocr.md     # Non-OCR test output
└── tests/                               # Test suite
    ├── test_api.py
    ├── test_bridge.py
    ├── test_dependencies.py
    ├── test_download_pdf.py
    ├── test_end_to_end.py
    ├── test_integration.py
    ├── test_markdown_to_json.py
    ├── test_marker_scanned_pdf.py
    ├── test_ocr_flow.py
    ├── test_ocr_minimal.py
    ├── test_pdf2md.py
    ├── test_pdf_to_markdown.py
    ├── test_scrape.py
    ├── test_storage.py
    ├── test_validate_pdf_to_markdown.py
    ├── test_validation.py
    ├── test_verify_json_lite_llm.py
    ├── test_verify_markdown_structure.py
    └── ztest_markdown_to_json_validation.py
Contributions to the JFK Files Scraper project are welcome! Here's how you can contribute:
- Fork the Repository: Create your own fork of the project
- Create a Feature Branch:
git checkout -b feature/your-feature-name
- Make Your Changes: Implement your feature or bug fix
- Write Tests: Add tests for your changes
- Run the Test Suite: Ensure all tests pass
- Commit Your Changes:
git commit -m "Add your feature"
- Push to Your Branch:
git push origin feature/your-feature-name
- Create a Pull Request: Submit a PR to the main repository
Please follow these guidelines:
- Follow the existing code style and conventions
- Write clear, concise commit messages
- Document your changes
- Add or update tests as necessary
- Ensure your code passes all tests
This project is licensed under the MIT License - see the LICENSE file for details.
Disclaimer: This project is for educational and research purposes only. All JFK files are publicly available from the National Archives website. Please use this tool responsibly and in accordance with the National Archives' terms of service.