A comprehensive multi-source aviation accident data pipeline that aggregates flight crash information from multiple authoritative sources (Aviation Safety Network and NTSB) into a unified data warehouse. This repository provides tools to scrape, process, and analyze aviation accident data for research, analysis, and reporting purposes.
Watch the YouTube video:
https://www.youtube.com/watch?v=T0hBFzDCDDM
- Project Overview
- Data Sources
- Repository Structure
- Prerequisites
- Quick Start
- Detailed Setup Instructions
- Scraping Instructions
- Database Schema
- Project Architecture
- Troubleshooting
- Contributing
This project provides a complete ETL (Extract, Transform, Load) pipeline for aviation accident data. It combines data from three major sources:
- Aviation Safety Network (ASN) - Comprehensive accident database from aviation-safety.net
- NTSB - National Transportation Safety Board official accident reports
- CSV Data - Additional structured accident data
The data is organized into a star-schema data warehouse with dimensions for dates, times, locations, aircraft, and operators, enabling sophisticated analytical queries.
- Source: https://aviation-safety.net
- Coverage: 2010-2025 (and ongoing)
- Method: Web scraping with realistic delays and browser fingerprinting
- Data: Comprehensive accident summaries with links to full reports
- Source: https://data.ntsb.gov
- Coverage: Monthly case files from 2010 onwards
- Method: API-based extraction
- Data: Detailed accident investigation reports and case information
- Source: Structured CSV files
- Usage: Supplementary accident information and historical data
flightCrashData/
├── ASN_scraping/ # Aviation Safety Network scraper
│ ├── scraper.py # Main ASN web scraper
│ ├── aviation_accidents_YYYY.json # Year-specific accident data
│ ├── merged_all_accidents.json # Combined raw data
│ ├── merged_all_accidents_cleaned.json # Cleaned/deduplicated data
│ ├── proxies.txt # Proxy configuration (optional)
│ └── scraper_progress.json # Resume capability for interrupted scrapes
│
├── NTSB_scraping/ # NTSB data extraction
│ ├── script.py # Main NTSB scraper
│ ├── merge_extracted_json.py # JSON merging utility
│ ├── unzip_with_rename.py # Archive extraction utility
│ └── ntsb_data/
│ ├── extracted/ # Monthly case files and metadata
│ └── readme.txt # NTSB-specific documentation
│
├── TL/ # Transform & Load (ETL)
│ ├── commands.sql # Database schema and ETL SQL
│ ├── load_staging.py # Python script to load staging tables
│ ├── ASN.json # Processed ASN data
│ ├── NTSB.json # Processed NTSB data
│ ├── CSV.csv # CSV source data
│ └── README.md # ETL-specific documentation
│
└── README.md # This file
- OS: Windows, macOS, or Linux
- Python: 3.8 or higher
- Database: PostgreSQL 12 or higher
Install required packages:
pip install requests
pip install beautifulsoup4
pip install curl-cffi
pip install psycopg2-binary
pip install lxml
Or use the requirements file (create if needed):
pip install -r requirements.txt
- PostgreSQL server running and accessible
- Sufficient disk space for accident data (estimated 500MB-2GB)
- Database user with CREATE privileges
cd TL
psql -U postgres -h localhost -d postgres -f commands.sql
This creates the complete schema with dimensions and fact tables.
cd ASN_scraping
python scraper.py
The repository already includes pre-scraped data from 2010-2025.
cd TL
python load_staging.py
This loads data from JSON files into PostgreSQL staging tables.
# Connect to the database (run from the shell)
psql -U postgres -d FlightAccidentMain
-- Example: Count accidents by year
SELECT
DATE_PART('year', d.full_date) as year,
COUNT(*) as accident_count
FROM fact_accidents fa
JOIN dim_date d ON fa.date_key = d.date_key
GROUP BY DATE_PART('year', d.full_date)
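ORDER BY year DESC;
psql -U postgres -h localhost
-- Quote the name: unquoted identifiers are folded to lowercase, while psql -d and \c match the name case-sensitively
CREATE DATABASE "FlightAccidentMain";
\c FlightAccidentMain
psql -U postgres -d FlightAccidentMain -f TL/commands.sql
This SQL script creates: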
- Dimension Tables: date, time, location, aircraft, operator
- Fact Table: fact_accidents (central fact table)
- Staging Tables: stg_source1_aviation_safety, stg_source2_ntsb, stg_source3_csv
- Indexes: For optimal query performance
\dt -- List all tables
\di -- List all indexes
You should see all dimension, fact, and staging tables listed.
The repository includes pre-scraped data files:
- ASN_scraping/merged_all_accidents_cleaned.json
- NTSB_scraping/ntsb_data/extracted/*.json
- TL/CSV.csv
Location: ASN_scraping/scraper.py
Features:
- Scrapes aviation-safety.net from 2010-2025
- Realistic delays and rotating user agents
- Resume capability (saves progress to scraper_progress.json)
- Proxy support (configure in proxies.txt)
- Browser fingerprinting to avoid blocking
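The randomized delays and rotating user agents listed above could look roughly like the following. This is a minimal sketch, not the code in scraper.py: the function names `pick_delay`, `next_headers`, and `polite_pause`, and the `USER_AGENTS` pool, are assumptions made for illustration.

```python
import random
import time

# Hypothetical pool: the user agents actually used by scraper.py are not listed in this README.
USER_AGENTS = [
    "Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/120.0 Safari/537.36",
    "Mozilla/5.0 (Macintosh; Intel Mac OS X 10_15_7) AppleWebKit/605.1.15 (KHTML, like Gecko) Version/17.0 Safari/605.1.15",
    "Mozilla/5.0 (X11; Linux x86_64; rv:121.0) Gecko/20100101 Firefox/121.0",
]


def pick_delay(min_delay=0.2, max_delay=0.5):
    """Return a random pause in seconds between min_delay and max_delay."""
    if min_delay > max_delay:
        raise ValueError("min_delay must not exceed max_delay")
    return random.uniform(min_delay, max_delay)


def next_headers():
    """Build request headers with a randomly chosen User-Agent."""
    return {"User-Agent": random.choice(USER_AGENTS)}


def polite_pause(min_delay=0.2, max_delay=0.5, sleep=time.sleep):
    """Sleep for a randomized interval and return how long the pause was."""
    delay = pick_delay(min_delay, max_delay)
    sleep(delay)
    return delay
```

Calling `polite_pause()` before each request and passing `next_headers()` to the HTTP client is one way to combine the two; the `sleep` parameter exists only so the pause can be replaced in tests.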
Run:
cd ASN_scraping
python scraper.py
Output:
- Year-specific files: aviation_accidents_2010.json through aviation_accidents_2025.json
- Merged file: merged_all_accidents.json
- Cleaned file: merged_all_accidents_cleaned.json (deduplicated)
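The relationship between the three kinds of output file can be sketched as a merge followed by a deduplication pass. The example below is an illustration under assumptions: it supposes each per-year file holds a JSON list of records, and the key fields `date`, `registration`, and `location` are hypothetical, since the README does not state how the scraper identifies duplicates.

```python
import json
from pathlib import Path


def merge_year_files(directory, pattern="aviation_accidents_*.json"):
    """Concatenate the record lists from every per-year file, in filename order."""
    merged = []
    for path in sorted(Path(directory).glob(pattern)):
        with path.open(encoding="utf-8") as fh:
            merged.extend(json.load(fh))  # assumes each file holds a JSON list
    return merged


def deduplicate(records, key_fields=("date", "registration", "location")):
    """Keep the first record seen for each combination of key_fields (hypothetical keys)."""
    seen = set()
    unique = []
    for record in records:
        key = tuple(record.get(field) for field in key_fields)
        if key not in seen:
            seen.add(key)
            unique.append(record)
    return unique
```

Writing `merge_year_files(...)` to merged_all_accidents.json and `deduplicate(...)` of that list to merged_all_accidents_cleaned.json would mirror the file layout described above.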
Configuration (in scraper.py):
# Modify years to scrape
years_to_scrape = [2023, 2024, 2025]
# Adjust delays (seconds)
min_delay = 0.2
max_delay = 0.5
Location: NTSB_scraping/script.py
Features:
- Extracts monthly accident data from NTSB API
- Handles date ranges efficiently
- Exports to JSON format
- Merge utility for combining outputs
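Because the NTSB data is extracted month by month, the date-range handling can be pictured as a generator of month boundaries. The sketch below is only an illustration: `month_ranges` is a hypothetical helper, and how script.py actually passes dates to the NTSB API is not documented here.

```python
import calendar
from datetime import date


def month_ranges(start_year, start_month, end_year, end_month):
    """Yield (first_day, last_day) for each month in the inclusive range."""
    year, month = start_year, start_month
    while (year, month) <= (end_year, end_month):
        last_day = calendar.monthrange(year, month)[1]  # number of days in the month
        yield date(year, month, 1), date(year, month, last_day)
        month += 1
        if month > 12:
            year, month = year + 1, 1
```

For example, `list(month_ranges(2010, 1, 2010, 3))` produces three pairs covering January through March 2010, each of which could drive one monthly request.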
Run:
cd NTSB_scraping
python script.py
Output:
- Monthly case files in ntsb_data/extracted/
- Format: YYYY_MM_cases2025-11-23_14-43.json
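When working with the extracted files it can help to recover the year and month from the file name. The following sketch assumes that YYYY is a four-digit year, MM a two-digit month, and that everything between `cases` and `.json` is an opaque suffix; `parse_case_filename` is a hypothetical helper, not part of the repository.

```python
import re

# Assumed layout: <YYYY>_<MM>_cases<suffix>.json
CASE_FILE = re.compile(r"^(?P<year>\d{4})_(?P<month>\d{2})_cases(?P<suffix>.*)\.json$")


def parse_case_filename(name):
    """Return (year, month, suffix) for a monthly case file, or None if the name does not match."""
    match = CASE_FILE.match(name)
    if match is None:
        return None
    return int(match["year"]), int(match["month"]), match["suffix"]
```

Files that do not follow the pattern, such as readme.txt, simply yield `None` and can be skipped.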
Merge Extracted Data:
python merge_extracted_json.py
Ensure the following files exist in the TL/ directory:
- ASN.json: From ASN scraper (merged and cleaned)
- NTSB.json: From NTSB scraper (merged monthly files)
- CSV.csv: CSV data source
cd TL
python load_staging.py
What this does:
- Reads the ASN and NTSB JSON files and the CSV file
- Extracts unique identifiers for each accident
- Loads raw data into staging tables:
  - stg_source1_aviation_safety
  - stg_source2_ntsb
  - stg_source3_csv
- Marks records with status PENDING for processing
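The steps above can be sketched as a pure function that turns raw records into staging rows. This is an assumption-laden illustration rather than the contents of load_staging.py: the `id_field` argument and the three-column row layout (identifier, raw JSON, status) are hypothetical, because the staging table columns are defined in commands.sql and not reproduced in this README.

```python
import json


def to_staging_rows(records, id_field, status="PENDING"):
    """Turn raw records into (source_unique_id, raw_json, status) tuples.

    id_field names the key holding each record's identifier (hypothetical;
    the real field differs per source).
    """
    rows = []
    for record in records:
        unique_id = record.get(id_field)
        if unique_id is None:
            continue  # skip records that cannot be identified
        rows.append((str(unique_id), json.dumps(record, sort_keys=True), status))
    return rows
```

Rows in this shape could then be handed to a psycopg2 cursor's `executemany` with an INSERT statement matching the real staging columns.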
SELECT COUNT(*) FROM stg_source1_aviation_safety;
SELECT COUNT(*) FROM stg_source2_ntsb;
SELECT COUNT(*) FROM stg_source3_csv;
- Ensure Python 3.8+ is installed
- Install dependencies: pip install curl-cffi beautifulsoup4 requests
- The script includes delays to be respectful to the website
cd ASN_scraping
python scraper.py
Edit scraper.py to customize:
# Target years
target_years = [2023, 2024, 2025]
# Delays between requests (seconds)
delay_min = 0.2
delay_max = 0.5
# Proxy (optional)
proxy = None # or "http://user:password@host:port"
The scraper automatically saves progress to scraper_progress.json. If interrupted:
python scraper.py # Automatically resumes from last checkpoint
| File | Description |
|---|---|
| aviation_accidents_YYYY.json | Per-year data files |
| merged_all_accidents.json | Combined raw data |
| merged_all_accidents_cleaned.json | Deduplicated/cleaned data |
| scraper_progress.json | Resume checkpoint |
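The resume checkpoint in scraper_progress.json can be pictured as a small JSON file that records which years are finished. The sketch below is a guess at the idea, not the scraper's actual format: the `completed_years` key and all three function names are hypothetical.

```python
import json
import os
from pathlib import Path


def load_progress(path="scraper_progress.json"):
    """Return saved progress, or an empty checkpoint if no file exists yet."""
    p = Path(path)
    if not p.exists():
        return {"completed_years": []}
    with p.open(encoding="utf-8") as fh:
        return json.load(fh)


def save_progress(progress, path="scraper_progress.json"):
    """Write the checkpoint via a temporary file so an interruption cannot leave it half-written."""
    tmp = f"{path}.tmp"
    with open(tmp, "w", encoding="utf-8") as fh:
        json.dump(progress, fh)
    os.replace(tmp, path)  # replaces the old checkpoint in one step


def years_remaining(target_years, progress):
    """Return the target years that are not yet marked as completed."""
    done = set(progress.get("completed_years", []))
    return [year for year in target_years if year not in done]
```

On restart, a scraper built this way would call `years_remaining(target_years, load_progress())` and continue with whatever is left.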
- Ensure Python 3.8+ is installed
- Install dependencies: pip install requests
- No authentication required for NTSB API
cd NTSB_scraping
python script.py
Files are saved to ntsb_data/extracted/:
- Format: YYYY_MM_cases2025-11-23_14-43.json
- Includes monthly case summaries and metadata
python merge_extracted_json.py
This combines all monthly files into a single NTSB.json.
dim_date
- Purpose: Temporal dimension for accident dates
- Key Fields: full_date, year, quarter, month, day_of_month, day_name, month_name
- Key: date_key
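To make the date dimension's fields concrete, the sketch below derives one row from a calendar date. It is an illustration only: the `YYYYMMDD` integer used for `date_key` is an assumption (the real key may be a serial), and `build_dim_date_row` is a hypothetical helper rather than part of the ETL SQL.

```python
from datetime import date

DAY_NAMES = ["Monday", "Tuesday", "Wednesday", "Thursday", "Friday", "Saturday", "Sunday"]
MONTH_NAMES = [
    "January", "February", "March", "April", "May", "June",
    "July", "August", "September", "October", "November", "December",
]


def build_dim_date_row(d):
    """Derive the date dimension attributes for a single calendar date."""
    return {
        "date_key": d.year * 10000 + d.month * 100 + d.day,  # assumed YYYYMMDD key
        "full_date": d,
        "year": d.year,
        "quarter": (d.month - 1) // 3 + 1,
        "month": d.month,
        "day_of_month": d.day,
        "day_name": DAY_NAMES[d.weekday()],  # weekday() returns 0 for Monday
        "month_name": MONTH_NAMES[d.month - 1],
    }
```

The names are spelled out explicitly instead of using `strftime`, so the result does not depend on the machine's locale.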
dim_time
- Purpose: Temporal dimension for accident times
- Key Fields: time_value, hour, minute, second
- Key: time_key
dim_location
- Purpose: Geographic and airport information
- Key Fields: country, state_province, city, airport_code, airport_name, latitude, longitude
- Key: location_key
- Unique: (data_source, source_location_id)
dim_aircraft
- Purpose: Aircraft specifications and details
- Key Fields: type_name, manufacturer, model, registration_number, serial_number, number_of_engines
- Key: aircraft_key
- Unique: (data_source, source_aircraft_id)
dim_operator
- Purpose: Airline and flight operation information
- Key Fields: operator_name, operator_type, owner_name, flight_operation_type
- Key: operator_key
- Unique: (data_source, source_operator_id)
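The uniqueness rule on (data_source, source id) means that each source record maps to exactly one surrogate key per dimension. The in-memory sketch below illustrates that get-or-create behaviour; `DimensionKeyRegistry` is a hypothetical class, and in the actual warehouse the same guarantee would come from the unique constraint in PostgreSQL (for instance via `INSERT ... ON CONFLICT`), not from Python.

```python
class DimensionKeyRegistry:
    """Assign one surrogate key per (data_source, source_id) pair."""

    def __init__(self):
        self._keys = {}
        self._next_key = 1

    def get_or_create(self, data_source, source_id):
        """Return the existing surrogate key for the pair, or allocate a new one."""
        natural_key = (data_source, source_id)
        if natural_key not in self._keys:
            self._keys[natural_key] = self._next_key
            self._next_key += 1
        return self._keys[natural_key]
```

Note that the same source id coming from two different sources yields two different keys, which is why `data_source` is part of the unique pair.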
fact_accidents
- Purpose: Central fact table containing all accident records
- Key Fields:
  - Flight information: flight_number, route_departure, route_destination
  - Injury metrics: total_aboard, fatalities_total, fatalities_crew, fatalities_passengers, ground_fatalities
  - Data source tracking: data_source, source_unique_id
- Foreign Keys: date_key, time_key, location_key, aircraft_key, operator_key
- Key: accident_id
Intermediate tables for data ingestion:
- stg_source1_aviation_safety: ASN raw data
- stg_source2_ntsb: NTSB raw data
- stg_source3_csv: CSV data
ASN Website NTSB API CSV Files
↓ ↓ ↓
scraper.py script.py CSV reader
↓ ↓ ↓
JSON Files JSON Files JSON Format
↓ ↓ ↓
┌──────────────────────────────────────┐
│ Staging Tables (PostgreSQL) │
│ - stg_source1_aviation_safety │
│ - stg_source2_ntsb │
│ - stg_source3_csv │
└──────────────────────────────────────┘
↓
┌──────────────────────────────────────┐
│ Transformation & Deduplication │
│ (ETL process) │
└──────────────────────────────────────┘
↓
┌──────────────────────────────────────┐
│ Dimension & Fact Tables │
│ - dim_date, dim_time │
│ - dim_location, dim_aircraft │
│ - dim_operator │
│ - fact_accidents │
└──────────────────────────────────────┘
↓
┌──────────────────────────────────────┐
│ Analytics & Reporting │
│ (Your queries here) │
└──────────────────────────────────────┘
- Data Extraction: Python (requests, curl_cffi, BeautifulSoup)
- Data Storage: PostgreSQL 12+
- Data Processing: Python, SQL
- Format: JSON, CSV, JSONB (PostgreSQL)
Problem: psql: could not translate host name "localhost" to address
Solution:
# Use IP address instead
psql -U postgres -h 127.0.0.1 -d FlightAccidentMain
# Or check PostgreSQL service status (Windows)
Get-Service PostgreSQL*
Problem: 403 Forbidden or Connection Timeout
Solution:
- Increase delays in scraper configuration
- Use proxy (configure in proxies.txt)
- Restart after a few minutes
delay_min = 1.0 # 1 second minimum
delay_max = 3.0 # 3 second maximum
Problem: duplicate key value violates unique constraint
Solution:
-- Clear staging tables and retry
TRUNCATE TABLE stg_source1_aviation_safety CASCADE;
TRUNCATE TABLE stg_source2_ntsb CASCADE;
TRUNCATE TABLE stg_source3_csv CASCADE;
Problem: Python runs out of memory with large JSON files
Solution:
- Process data in smaller time ranges
- Delete intermediate files after merging
- Increase system virtual memory
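Processing data in smaller time ranges can be done by reading one per-year file at a time instead of the merged file. The sketch below is a minimal illustration under the assumption that each per-year file holds a JSON list; `iter_records_by_file` and `in_batches` are hypothetical helpers.

```python
import json
from pathlib import Path


def iter_records_by_file(directory, pattern="aviation_accidents_*.json"):
    """Yield (filename, record) pairs while holding only one file in memory at a time."""
    for path in sorted(Path(directory).glob(pattern)):
        with path.open(encoding="utf-8") as fh:
            records = json.load(fh)  # assumes each file holds a JSON list
        for record in records:
            yield path.name, record


def in_batches(iterable, size):
    """Group items into lists of at most `size` elements."""
    batch = []
    for item in iterable:
        batch.append(item)
        if len(batch) == size:
            yield batch
            batch = []
    if batch:
        yield batch  # final, possibly shorter batch
```

Looping over `in_batches(iter_records_by_file("ASN_scraping"), 500)` keeps peak memory close to the size of the largest single year file plus one batch.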
Problem: database "FlightAccidentMain" does not exist
Solution:
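-- Create database (quoted so the mixed-case name is preserved; unquoted names are folded to lowercase)
CREATE DATABASE "FlightAccidentMain";
# Re-run schema (from the shell)
psql -U postgres -d FlightAccidentMain -f TL/commands.sql
-- Accidents and fatalities per year
SELECT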
DATE_PART('year', d.full_date) as year,
COUNT(*) as accident_count,
SUM(fa.fatalities_total) as total_fatalities
FROM fact_accidents fa
JOIN dim_date d ON fa.date_key = d.date_key
GROUP BY DATE_PART('year', d.full_date)
ORDER BY year DESC;
-- Top 10 aircraft types by number of fatal accidents
SELECT
da.type_name,
COUNT(*) as accident_count,
AVG(fa.fatalities_total) as avg_fatalities
FROM fact_accidents fa
JOIN dim_aircraft da ON fa.aircraft_key = da.aircraft_key
WHERE fa.fatalities_total > 0
GROUP BY da.type_name
ORDER BY accident_count DESC
LIMIT 10;
-- Accidents and fatalities by country in recent years
SELECT
dl.country,
COUNT(*) as accident_count,
SUM(fa.fatalities_total) as total_fatalities
FROM fact_accidents fa
JOIN dim_location dl ON fa.location_key = dl.location_key
JOIN dim_date d ON fa.date_key = d.date_key
WHERE DATE_PART('year', d.full_date) >= DATE_PART('year', CURRENT_DATE) - 5
GROUP BY dl.country
ORDER BY accident_count DESC;
To contribute to this project:
- Report issues or suggest improvements via GitHub Issues
- Submit enhancements via Pull Requests
- Ensure data quality and completeness
- Test scraping changes before submitting
For questions or support regarding this repository, please open an issue or contact the project maintainers.
Please refer to the LICENSE file in the repository for licensing information.
Last Updated: December 2025
Data Coverage: 2010-2025
Sources: Aviation Safety Network, NTSB, CSV Data