
JFK Files Scraper

License: MIT · Python 3.10+ · OCR Support · GPT Integration

A Python tool for scraping JFK files from the National Archives website, handling pagination, and transforming data from PDF to Markdown to JSON for use as a "Lite LLM" dataset. Includes advanced optimization, performance monitoring, and custom GPT integration for the complete collection of 1,123 declassified documents.

📋 Table of Contents

  • 🔍 Project Overview
  • ✨ Features
  • 🏗️ System Architecture
  • 📥 Installation
  • 🚀 Usage
  • 📂 Output Structure
  • 📄 PDF to Markdown Conversion
  • 🧪 Testing
  • ⚙️ Advanced Optimization
  • 🤖 GPT Integration
  • 📁 Project Structure
  • 🤝 Contributing
  • 📜 License

🔍 Project Overview

This project aims to:

  1. Scrape JFK file URLs from the National Archives website
  2. Handle pagination across approximately 113 pages with 1,123 entries
  3. Download PDF files from the extracted URLs with parallel processing
  4. Convert PDF files to Markdown format with PDF2MD wrapper and OCR capabilities
  5. Transform Markdown to JSON format with robust conversion methods
  6. Store the processed data for later use as a "Lite LLM" dataset
  7. Provide optimization for large-scale processing with adaptive resource management
  8. Include comprehensive performance monitoring and visualization
  9. Create a custom GPT (JFK Files Archivist) with the processed data
  10. Provide tools for querying and analyzing the declassified documents

✨ Features

  • Robust Web Scraping: Handles pagination, rate limiting, and network retries
  • Parallel Processing: Concurrent downloads with adaptive thread management
  • Smart PDF Processing:
    • Automatic detection of scanned vs. digital documents
    • OCR support for scanned documents with quality options
    • Document repair capabilities for problematic PDFs
  • Enhanced Markdown Conversion:
    • Multiple conversion strategies
    • Quality validation and post-processing
    • Fallback mechanisms for reliability
  • JSON Transformation (see the illustrative record shape after this list):
    • Structured format for GPT integration
    • Document metadata extraction
    • Full-text and section-based organization
  • Performance Optimization:
    • Adaptive thread pool with resource monitoring
    • Checkpointing for resumable operations
    • Memory usage optimization for large-scale processing
  • Monitoring & Visualization:
    • Real-time performance metrics
    • Resource usage tracking
    • Visual progress indicators and charts
  • GPT Integration:
    • Configurable GPT capabilities
    • Optimized knowledge upload
    • Test suite for query validation
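
To make the JSON output concrete, here is an illustrative record shape. This is a sketch only: the field names and values below are hypothetical, and the actual schema produced by the pipeline may differ.

# Illustrative shape of one transformed document (all fields/values hypothetical)
record = {
    "record_id": "104-10001-10001",   # NARA-style record number (example value)
    "title": "Example document title",
    "source_url": "https://www.archives.gov/...",   # truncated placeholder
    "full_text": "Complete extracted text of the document...",
    "sections": [
        {"heading": "Section 1", "text": "..."},
    ],
    "metadata": {"pages": 4, "ocr": True, "ocr_quality": "high"},
}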

🏗️ System Architecture

The JFK Files Scraper follows a pipeline architecture with these stages:

flowchart LR
    Scrape[Web Scraping] --> Download[PDF Download]
    Download --> PDFtoMD[PDF to Markdown]
    PDFtoMD --> MDtoJSON[Markdown to JSON]
    MDtoJSON --> Store[Storage]
    Store --> GPTPrep[GPT Preparation]
    GPTPrep --> GPTUpload[GPT Upload]

Key components:

  • Crawler: Handles webpage navigation and link extraction
  • Downloader: Manages parallel retrieval and storage of PDF files
  • Transformer: Coordinates the PDF → Markdown → JSON pipeline
  • Storage: Handles file I/O and data persistence
  • Performance Monitor: Tracks resource usage and optimization opportunities
  • GPT Integrator: Prepares and uploads data for GPT knowledge base
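
As an illustration of the Crawler stage, here is a minimal sketch using only the standard library. The real project uses Crawl4AI, and the pagination query parameter below is an assumption for illustration, not the actual archives.gov scheme.

# Minimal crawler sketch (stdlib only; the project itself uses Crawl4AI)
import re
from urllib.request import urlopen

def scrape_pdf_urls(base_url, start_page=1, end_page=1):
    urls = []
    for page in range(start_page, end_page + 1):
        # "?page=N" is an assumed pagination scheme, for illustration only
        page_url = base_url if page == 1 else f"{base_url}?page={page}"
        html = urlopen(page_url).read().decode("utf-8", errors="replace")
        urls += re.findall(r'href="([^"]+\.pdf)"', html)
    return urls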

📥 Installation

Development Environment

This project uses a Python environment with specific dependencies. Set up using either:

Option 1: Python venv

# Create and activate virtual environment
python -m venv jfk-env-py310
source jfk-env-py310/bin/activate  # Linux/macOS
# OR
jfk-env-py310\Scripts\activate  # Windows

# Install dependencies
pip install -r config/requirements.txt

Option 2: Conda Environment

# Create and activate conda environment
conda create -n jfkfiles_env python=3.10
conda activate jfkfiles_env

# Install dependencies
pip install -r config/requirements.txt

For automatic environment activation with direnv:

# Create .envrc file
echo "layout python3" > .envrc
direnv allow

Dependencies

The project requires various Python packages and system dependencies:

Python Packages

pip install -r config/requirements.txt

Key Python dependencies include:

  • Crawl4AI for web scraping
  • PyMuPDF (fitz) for PDF processing
  • pytesseract and pdf2image for OCR
  • psutil for system monitoring
  • matplotlib for visualization
  • openai and tiktoken for GPT integration

OCR Setup

For OCR functionality, install these system dependencies:

Linux (Ubuntu/Debian)

sudo apt-get update
sudo apt-get install -y tesseract-ocr poppler-utils

macOS

brew install tesseract poppler

Windows

  1. Download and install Tesseract OCR
  2. Add Tesseract to your PATH environment variable
  3. Install poppler from poppler-windows
  4. Add poppler bin directory to your PATH

Verify your OCR installation:

scripts/run_pdf2md_diagnostic.sh
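
For a quick inline sanity check that Python can reach the Tesseract binary:

# Raises if the tesseract executable is not on your PATH
import pytesseract
print(pytesseract.get_tesseract_version())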

GPT Integration Setup

For GPT functionality:

pip install openai tiktoken

Create a .env file with your OpenAI API key:

OPENAI_API_KEY=your_api_key_here
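
One common way to pick the key up from .env at runtime is python-dotenv (an assumption here; use whatever loader your environment provides):

# Load OPENAI_API_KEY from .env into the process environment
import os
from dotenv import load_dotenv

load_dotenv()  # reads the .env file in the current directory
api_key = os.environ["OPENAI_API_KEY"]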

🚀 Usage

Basic Operation

python src/jfk_scraper.py

Command-Line Options

--url               Base URL for the JFK records page (default: https://www.archives.gov/research/jfk/release-2025)
--start-page        Page to start scraping from (default: 1)
--end-page          Page to end scraping at (default: scrape all pages)
--limit             Limit the number of files to process
--test              Run in test mode with a single PDF
--threads           Number of parallel download threads (default: 5)
--rate-limit        Delay between starting new downloads in seconds (default: 0.5)
--checkpoint-interval Save checkpoint after processing this many files (default: 10)
--ocr               Enable OCR processing for scanned documents
--force-ocr         Force OCR processing for all documents
--ocr-quality       OCR quality setting: low, medium, high (default: high)
--resume            Resume from last checkpoint if available
--clean             Clean all checkpoints before starting
--log-level         Set logging level (DEBUG, INFO, WARNING, ERROR, CRITICAL)
--max-workers       Maximum number of concurrent downloads
--scrape-all        Scrape all 113 pages and process all 1,123 files
--organize          Organize PDFs into subdirectories by collection (default: True)
--flat              Save PDFs in a flat directory structure
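
To illustrate what --threads and --rate-limit control, here is a minimal, stdlib-only download sketch; the project's actual downloader lives in src/utils/download_utils.py and may differ:

import pathlib
import time
from concurrent.futures import ThreadPoolExecutor
from urllib.request import urlretrieve

def download_all(urls, out_dir="data/pdfs", threads=5, rate_limit=0.5):
    out = pathlib.Path(out_dir)
    out.mkdir(parents=True, exist_ok=True)
    with ThreadPoolExecutor(max_workers=threads) as pool:
        futures = []
        for url in urls:
            dest = str(out / url.rsplit("/", 1)[-1])
            futures.append(pool.submit(urlretrieve, url, dest))
            time.sleep(rate_limit)  # --rate-limit: pause between *starting* downloads
        for future in futures:
            future.result()  # re-raise any download error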

Example Commands

Basic Usage

# Scrape and process a few files for testing
python src/jfk_scraper.py --start-page 1 --end-page 2 --limit 5

OCR Processing

# Process all PDFs with OCR support
python src/jfk_scraper.py --ocr

# Control OCR quality (low, medium, high - default is high)
python src/jfk_scraper.py --ocr --ocr-quality medium 

# Force OCR for all documents (even if they appear to be digital)
python src/jfk_scraper.py --force-ocr --ocr-quality high

# Run in test mode with OCR
python src/jfk_scraper.py --test --force-ocr --ocr-quality high

OCR with Convenience Script

scripts/run_with_ocr.sh --scrape-all --ocr
scripts/run_with_ocr.sh --scrape-all --ocr --ocr-quality medium
scripts/run_with_ocr.sh --scrape-all --force-ocr

Testing

# Test with a single file
python src/jfk_scraper.py --test

Large-Scale Processing

# Process specific page range with resource optimization
python src/jfk_scraper.py --start-page 5 --end-page 10 --max-workers 8 --checkpoint-interval 20

# Process all files with maximum optimization
python src/jfk_scraper.py --scrape-all --max-workers 12 --checkpoint-interval 50

# Resume processing from last checkpoint
python src/jfk_scraper.py --resume --max-workers 8

Performance Monitoring

Two options for monitoring performance:

Using the Performance Monitoring Module

python -m src.performance_monitoring --mode monitor

Using the Simplified Monitor Script

# View current status
python src/utils/monitor_progress.py --mode status

# Continuous monitoring
python src/utils/monitor_progress.py --mode monitor

# Generate detailed report
python src/utils/monitor_progress.py --mode report

Both tools support these monitoring options:

--mode              Operation mode: 'monitor' for continuous monitoring, 'status' for current status, 'report' for one-time report
--interval          Metrics collection interval in seconds (default: 5)
--report-interval   Report generation interval in seconds (default: 300)
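
As an example of consuming the collected metrics afterwards, a small plotting sketch; the column names are assumptions about the metrics.csv schema, so adjust them to the real header row:

# Plot one metric over time from metrics/metrics.csv
# ("timestamp" and "cpu_percent" are assumed column names)
import csv
import matplotlib.pyplot as plt

with open("metrics/metrics.csv") as f:
    rows = list(csv.DictReader(f))

plt.plot([r["timestamp"] for r in rows], [float(r["cpu_percent"]) for r in rows])
plt.xlabel("timestamp")
plt.ylabel("CPU %")
plt.savefig("metrics/charts/cpu_usage.png")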

📂 Output Structure

The project is organized with the following directory structure:

jfk-files/
├── config/                # Configuration files
├── data/                  # Data files
│   ├── json/              # Individual JSON files for each document
│   ├── lite_llm/          # Processed JSON data for Lite LLM dataset
│   │   ├── consolidated_jfk_files.json   # Combined file for GPT upload
│   │   └── gpt_configuration.json        # GPT configuration settings
│   ├── markdown/          # Converted Markdown files
│   └── pdfs/              # Downloaded PDF files
│       └── nara-104/      # Organized by collection
├── docs/                  # Documentation files
├── env/                   # Environment files
├── logs/                  # Log files
├── memory-bank/           # Project memory and context
├── metrics/               # Performance monitoring data
│   ├── charts/            # Generated performance visualization charts
│   ├── metrics.csv        # CSV file with detailed metrics history
│   └── metrics.json       # Latest performance report in JSON format
├── scripts/               # Helper scripts
├── src/                   # Source code
├── test_data/             # Test data files
├── test_output/           # Test output files
└── tests/                 # Test suite

📄 PDF to Markdown Conversion

The project uses a custom PDF2MD implementation with extensive capabilities:

Key Features

  • Smart Document Format Detection: Automatically detects if a PDF is scanned or digital (see the sketch after this list)
  • Multi-tier Conversion Strategy: Uses different approaches based on document type
  • OCR Support: Integrated OCR for scanned documents using pytesseract
  • Adaptive Quality Settings: Low, medium, and high-quality OCR modes (150, 200, 300 DPI)
  • Post-processing: Improves markdown output with consistent formatting
  • Fallback Mechanisms: Multiple fallback strategies if primary conversion fails
  • Performance Optimization: Efficient resource usage for large-scale processing
  • Document Repair: Handles problematic PDFs with repair capabilities
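
A condensed sketch of the detect-then-convert flow described above, using PyMuPDF, pdf2image, and pytesseract. The character-count heuristic, threshold, and function names are illustrative; the project's pdf2md wrapper implements its own, more elaborate logic.

import fitz  # PyMuPDF
import pytesseract
from pdf2image import convert_from_path

OCR_DPI = {"low": 150, "medium": 200, "high": 300}

def looks_scanned(pdf_path, min_chars_per_page=25):
    # Heuristic: little or no embedded text usually means a scanned document
    with fitz.open(pdf_path) as doc:
        if doc.page_count == 0:
            return False
        total = sum(len(page.get_text()) for page in doc)
        return total / doc.page_count < min_chars_per_page

def extract_text(pdf_path, quality="high", force_ocr=False):
    if force_ocr or looks_scanned(pdf_path):
        # Rasterize at the DPI for the requested quality, then OCR each page
        pages = convert_from_path(pdf_path, dpi=OCR_DPI[quality])
        return "\n\n".join(pytesseract.image_to_string(p) for p in pages)
    with fitz.open(pdf_path) as doc:
        return "\n\n".join(page.get_text() for page in doc)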

Testing OCR Capabilities

Test the PDF to Markdown conversion with these utilities:

# Test with default settings (no OCR)
python tests/test_pdf2md.py path/to/your/document.pdf

# Test with OCR enabled
python tests/test_ocr_flow.py --pdf path/to/your/document.pdf --force-ocr

# Compare different OCR quality levels
python tests/test_ocr_flow.py --pdf path/to/your/document.pdf --compare --output test_output

# Test different OCR quality settings
python tests/test_ocr_flow.py --pdf path/to/your/document.pdf --force-ocr --quality low
python tests/test_ocr_flow.py --pdf path/to/your/document.pdf --force-ocr --quality medium
python tests/test_ocr_flow.py --pdf path/to/your/document.pdf --force-ocr --quality high

Run the OCR diagnostic script to verify your system setup:

scripts/run_pdf2md_diagnostic.sh

🧪 Testing

The project includes a comprehensive test suite for validating all components:

# Run all tests
pytest tests/

# Run specific test categories
python tests/test_validate_pdf_to_markdown.py
python tests/test_verify_markdown_structure.py
python tests/test_markdown_to_json_validation.py
python tests/test_verify_json_lite_llm.py
python src/gpt/test_gpt_queries.py
python tests/test_end_to_end.py

# Test with verbose output
pytest -v tests/test_api.py

⚙️ Advanced Optimization

For large-scale processing with advanced optimization features:

# Use the optimization module directly
python -c "from src.optimization import optimize_full_scale_processing; optimize_full_scale_processing()"

# Run with all optimization flags
python src/jfk_scraper.py --scrape-all --max-workers auto --checkpoint-interval 30 --ocr --ocr-quality high

The optimization module provides:

  • Adaptive thread pool that adjusts based on system resources
  • Memory usage monitoring and throttling to prevent OOM errors
  • Enhanced checkpointing with atomic writes and versioning
  • Optimized PDF processing with parallel OCR for suitable documents
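
As an example of the checkpointing technique, an atomic write can be done by dumping to a temporary file and renaming it into place; a minimal sketch (the path and payload are illustrative):

import json
import os
import tempfile

def save_checkpoint(state, path="checkpoint.json"):
    # Write to a temp file in the target directory, then atomically replace,
    # so a crash mid-write never leaves a truncated checkpoint behind.
    fd, tmp = tempfile.mkstemp(dir=os.path.dirname(path) or ".", suffix=".tmp")
    try:
        with os.fdopen(fd, "w") as f:
            json.dump(state, f)
            f.flush()
            os.fsync(f.fileno())
        os.replace(tmp, path)
    except BaseException:
        os.unlink(tmp)
        raise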

🤖 GPT Integration

The project includes components for creating a custom GPT with the JFK files collection:

GPT Components

  • src/gpt/gpt_config.py: Configuration for the JFK Files Archivist GPT
  • src/gpt/upload_to_gpt.py: Script for uploading consolidated JSON to GPT
  • src/gpt/test_gpt_queries.py: Test script for validating GPT query capabilities
  • src/gpt/refine_instructions.py: Script for refining GPT instructions based on test results
  • src/gpt/documentation/gpt_usage_guidelines.md: Comprehensive usage guidelines

Using the JFK Files Archivist GPT

The JFK Files Archivist GPT provides access to and analysis of the complete collection of declassified JFK files.

Setup

# Configure the GPT settings
python -m src.gpt.gpt_config

# Upload the consolidated JSON file to GPT knowledge
python -m src.gpt.upload_to_gpt

# Test the GPT with sample queries
python -m src.gpt.test_gpt_queries

Capabilities

  1. Retrieve specific documents by record ID
  2. Search across documents for topics, people, or events
  3. Analyze connections between documents
  4. Get historical context for the documents
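
For instance, a test query can be issued with the openai client as sketched below; the model name and prompt are placeholders, and the project's test_gpt_queries.py may structure its tests differently:

from openai import OpenAI

client = OpenAI()  # reads OPENAI_API_KEY from the environment
response = client.chat.completions.create(
    model="gpt-4o",  # placeholder model name
    messages=[{"role": "user", "content": "What topics appear in record 104-10001-10001?"}],
)
print(response.choices[0].message.content)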

For detailed usage guidelines, see src/gpt/documentation/gpt_usage_guidelines.md

📁 Project Structure

jfk-files/
├── config/                     # Configuration files
│   ├── requirements.txt        # Python dependencies
│   ├── project_sync.yaml       # Project sync configuration
│   └── setup_ocr_env.sh        # OCR environment setup script
├── data/                       # Data files
│   ├── json/                   # Individual JSON files for each document
│   ├── lite_llm/               # Processed JSON data for Lite LLM dataset
│   │   ├── consolidated_jfk_files.json   # Combined file for GPT upload
│   │   ├── gpt_configuration.json        # GPT configuration settings
│   │   └── validation_report.md          # Validation report for GPT data
│   ├── markdown/               # Converted Markdown files
│   └── pdfs/                   # Downloaded PDF files
│       └── nara-104/           # Organized by collection
├── docs/                       # Documentation
│   ├── CODE_OF_CONDUCT.md      # Code of conduct guidelines
│   ├── CONTRIBUTING.md         # Contribution guidelines
│   ├── INSTALLATION.md         # Installation guide
│   ├── LICENSE                 # License information
│   ├── README.md               # This file
│   ├── RELEASE_NOTES.md        # Release notes
│   ├── ROADMAP.md              # Project roadmap
│   ├── RUN.md                  # Running instructions
│   ├── SECURITY.md             # Security guidelines
│   ├── TASKLIST.md             # Project task list
│   └── refactoring_summary.md  # Summary of refactoring changes
├── env/                        # Environment files
│   ├── activate_env.sh         # Environment activation script
│   ├── activate_jfk_env.sh     # JFK environment activation script
│   └── jfk-env-py310/          # Python virtual environment
├── logs/                       # Log files
│   ├── jfk_scraper.log         # Main scraper log
│   ├── jfk_scraper_errors.log  # Error logs
│   └── run_output.log          # Run output logs
├── metrics/                    # Performance metrics
│   ├── charts/                 # Performance visualization charts
│   ├── marker_diagnosis_results.txt # Marker diagnosis results
│   ├── metrics.json            # Metrics in JSON format
│   ├── metrics.csv             # Metrics in CSV format
│   └── pdf2md_diagnosis_results.txt # PDF2MD diagnosis results
├── scripts/                    # Helper scripts
│   ├── combine_json_files.py   # Script to combine JSON files
│   ├── format_gpt_json.py      # GPT JSON formatting
│   ├── generate_project_overview.sh # Generate project overview
│   ├── run_pdf2md_diagnostic.sh # OCR diagnostics
│   ├── run_test.py             # Test runner
│   ├── run_with_ocr.sh         # OCR convenience script
│   ├── setup.py                # Setup script
│   └── validate_gpt_json.py    # Validate GPT JSON files
├── src/                        # Source code modules
│   ├── __init__.py
│   ├── gpt/                    # GPT integration
│   │   ├── configure_capabilities.py
│   │   ├── documentation/
│   │   │   └── gpt_usage_guidelines.md
│   │   ├── gpt_config.py
│   │   ├── refine_instructions.py
│   │   ├── run_gpt_config.py
│   │   ├── test_gpt_queries.py
│   │   └── upload_to_gpt.py
│   ├── jfk_scraper.py          # Main script
│   ├── optimization.py         # Optimization utilities
│   ├── performance_monitoring.py # Performance tracking
│   └── utils/                  # Core utilities
│       ├── __init__.py
│       ├── batch_utils.py      # Batch processing
│       ├── checkpoint_utils.py # Checkpointing
│       ├── conversion_utils.py # Format conversion
│       ├── download_utils.py   # File downloading
│       ├── logging_utils.py    # Logging
│       ├── minimal_marker.py   # PDF to MD compatibility
│       ├── monitor_progress.py # Progress monitoring
│       ├── pdf2md/             # PDF to Markdown conversion
│       │   ├── __init__.py
│       │   ├── pdf2md.py       # Core PDF to MD functionality
│       │   └── pdf2md_diagnostic.py # PDF2MD diagnostics
│       ├── pdf2md_wrapper.py   # Enhanced PDF conversion
│       ├── pdf_utils.py        # PDF processing utilities
│       ├── scrape_utils.py     # Web scraping
│       └── storage.py          # Data storage
├── test_data/                  # Test data files
│   └── test_document.pdf       # Sample PDF for testing
├── test_output/                # Test output files
│   ├── test_document_minimal.md # Minimal test output
│   ├── test_document_with_ocr_high.md # High-quality OCR test
│   ├── test_document_with_ocr_low.md # Low-quality OCR test
│   ├── test_document_with_ocr_medium.md # Medium-quality OCR test
│   └── test_document_without_ocr.md # Non-OCR test output
└── tests/                      # Test suite
    ├── test_api.py
    ├── test_bridge.py
    ├── test_dependencies.py
    ├── test_download_pdf.py
    ├── test_end_to_end.py
    ├── test_integration.py
    ├── test_markdown_to_json.py
    ├── test_marker_scanned_pdf.py
    ├── test_ocr_flow.py
    ├── test_ocr_minimal.py
    ├── test_pdf2md.py
    ├── test_pdf_to_markdown.py
    ├── test_scrape.py
    ├── test_storage.py
    ├── test_validate_pdf_to_markdown.py
    ├── test_validation.py
    ├── test_verify_json_lite_llm.py
    ├── test_verify_markdown_structure.py
    └── ztest_markdown_to_json_validation.py

🀝 Contributing

Contributions to the JFK Files Scraper project are welcome! Here's how you can contribute:

  1. Fork the Repository: Create your own fork of the project
  2. Create a Feature Branch: git checkout -b feature/your-feature-name
  3. Make Your Changes: Implement your feature or bug fix
  4. Write Tests: Add tests for your changes
  5. Run the Test Suite: Ensure all tests pass
  6. Commit Your Changes: git commit -m "Add your feature"
  7. Push to Your Branch: git push origin feature/your-feature-name
  8. Create a Pull Request: Submit a PR to the main repository

Please follow these guidelines:

  • Follow the existing code style and conventions
  • Write clear, concise commit messages
  • Document your changes
  • Add or update tests as necessary
  • Ensure your code passes all tests

📜 License

This project is licensed under the MIT License - see the LICENSE file for details.


Disclaimer: This project is for educational and research purposes only. All JFK files are publicly available from the National Archives website. Please use this tool responsibly and in accordance with the National Archives' terms of service.
