Flight Crash Data Repository

A multi-source aviation accident data pipeline that aggregates flight crash information from the Aviation Safety Network (ASN), the NTSB, and supplementary CSV files into a unified data warehouse. The repository provides tools to scrape, process, and analyze this data for research, analysis, and reporting.

Watch the YouTube video:

https://www.youtube.com/watch?v=T0hBFzDCDDM

Project Overview

This project provides a complete ETL (Extract, Transform, Load) pipeline for aviation accident data. It combines data from three major sources:

Aviation Safety Network (ASN) - Comprehensive accident database from aviation-safety.net
NTSB - National Transportation Safety Board official accident reports
CSV Data - Additional structured accident data

The data is organized into a star-schema data warehouse with dimensions for dates, times, locations, aircraft, and operators, enabling sophisticated analytical queries.

Data Sources

Aviation Safety Network (ASN)

Source: https://aviation-safety.net
Coverage: 2010-2025 (and ongoing)
Method: Web scraping with realistic delays and browser fingerprinting
Data: Comprehensive accident summaries with links to full reports

NTSB (National Transportation Safety Board)

Source: https://data.ntsb.gov
Coverage: Monthly case files from 2010 onwards
Method: API-based extraction
Data: Detailed accident investigation reports and case information

CSV Data

Source: Structured CSV files
Usage: Supplementary accident information and historical data

Repository Structure

flightCrashData/
├── ASN_scraping/                    # Aviation Safety Network scraper
│   ├── scraper.py                   # Main ASN web scraper
│   ├── aviation_accidents_YYYY.json # Year-specific accident data
│   ├── merged_all_accidents.json    # Combined raw data
│   ├── merged_all_accidents_cleaned.json  # Cleaned/deduplicated data
│   ├── proxies.txt                  # Proxy configuration (optional)
│   └── scraper_progress.json        # Resume capability for interrupted scrapes
│
├── NTSB_scraping/                   # NTSB data extraction
│   ├── script.py                    # Main NTSB scraper
│   ├── merge_extracted_json.py      # JSON merging utility
│   ├── unzip_with_rename.py         # Archive extraction utility
│   └── ntsb_data/
│       ├── extracted/               # Monthly case files and metadata
│       └── readme.txt               # NTSB-specific documentation
│
├── TL/                              # Transform & Load (ETL)
│   ├── commands.sql                 # Database schema and ETL SQL
│   ├── load_staging.py              # Python script to load staging tables
│   ├── ASN.json                     # Processed ASN data
│   ├── NTSB.json                    # Processed NTSB data
│   ├── CSV.csv                      # CSV source data
│   └── README.md                    # ETL-specific documentation
│
└── README.md                        # This file

Prerequisites

System Requirements

OS: Windows, macOS, or Linux
Python: 3.8 or higher
Database: PostgreSQL 12 or higher

Python Dependencies

Install required packages:

pip install requests
pip install beautifulsoup4
pip install curl-cffi
pip install psycopg2-binary
pip install lxml

Or use the requirements file (create if needed):

pip install -r requirements.txt

Database Setup

PostgreSQL server running and accessible
Sufficient disk space for accident data (estimated 500MB-2GB)
Database user with CREATE privileges

Quick Start

1. Initialize the Database

cd TL
psql -U postgres -h localhost -d postgres -f commands.sql

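This creates the complete schema with dimension and fact tables. The command above connects to the default postgres database; if commands.sql does not create and switch to FlightAccidentMain itself, create that database first and pass -d FlightAccidentMain instead, as described under Detailed Setup Instructions.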

2. Run the ASN Scraper (Optional - Data Already Included)

cd ASN_scraping
python scraper.py

The repository already includes pre-scraped data from 2010-2025.

3. Load Data into Database

cd TL
python load_staging.py

This loads data from the JSON and CSV source files into PostgreSQL staging tables.

4. Query Your Data

# From the shell: connect to the database
psql -U postgres -d FlightAccidentMain

-- Example: Count accidents by year
SELECT
    DATE_PART('year', d.full_date) as year,
    COUNT(*) as accident_count
FROM fact_accidents fa
JOIN dim_date d ON fa.date_key = d.date_key
GROUP BY DATE_PART('year', d.full_date)
ORDER BY year DESC;

Detailed Setup Instructions

1. Database Initialization

Step 1: Connect to PostgreSQL

psql -U postgres -h localhost

Step 2: Create the Database

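-- Quote the name: PostgreSQL folds unquoted identifiers to lowercase,
-- while psql -d FlightAccidentMain looks for the exact mixed-case name.
CREATE DATABASE "FlightAccidentMain";
\c FlightAccidentMain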

Step 3: Run Schema Creation

psql -U postgres -d FlightAccidentMain -f TL/commands.sql

This SQL script creates:

Dimension Tables: date, time, location, aircraft, operator
Fact Table: fact_accidents (central fact table)
Staging Tables: stg_source1_aviation_safety, stg_source2_ntsb, stg_source3_csv
Indexes: For optimal query performance

Verify Installation

\dt  -- List all tables
\di  -- List all indexes

You should see all dimension, fact, and staging tables listed.

2. Data Collection

Option A: Using Pre-collected Data (Recommended)

The repository includes pre-scraped data files:

ASN_scraping/merged_all_accidents_cleaned.json
NTSB_scraping/ntsb_data/extracted/*.json
TL/CSV.csv

Option B: Collect Fresh Data

ASN Scraper

Location: ASN_scraping/scraper.py

Features:

Scrapes aviation-safety.net from 2010-2025
Realistic delays and rotating user agents
Resume capability (saves progress to scraper_progress.json)
Proxy support (configure in proxies.txt)
Browser fingerprinting to avoid blocking

Run:

cd ASN_scraping
python scraper.py

Output:

Year-specific files: aviation_accidents_2010.json through aviation_accidents_2025.json
Merged file: merged_all_accidents.json
Cleaned file: merged_all_accidents_cleaned.json (deduplicated)
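
As a rough illustration of the merge-and-deduplicate step, the sketch below combines per-year record lists and keeps the first record seen for each key. The function name and the key fields `date`, `registration`, and `location` are assumptions made for this example; scraper.py may identify duplicates differently.

```python
from typing import Dict, Iterable, List, Tuple

# Hypothetical key fields; the real cleaned file may be keyed differently.
KEY_FIELDS: Tuple[str, ...] = ("date", "registration", "location")


def dedupe_accidents(yearly_batches: Iterable[List[Dict]]) -> List[Dict]:
    """Merge per-year record lists, keeping the first record seen per key."""
    seen = set()
    merged = []
    for batch in yearly_batches:
        for record in batch:
            # Normalise so case and whitespace differences do not defeat the match.
            key = tuple(str(record.get(f, "")).strip().lower() for f in KEY_FIELDS)
            if key in seen:
                continue
            seen.add(key)
            merged.append(record)
    return merged
```

Keeping the first occurrence preserves the order of the per-year files, which makes the cleaned output easy to diff against the raw merge.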

Configuration (in scraper.py):

# Years to scrape
target_years = [2023, 2024, 2025]

# Delays between requests (seconds)
delay_min = 0.2
delay_max = 0.5
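
The minimum and maximum delay settings above suggest a randomized pause between requests. A minimal sketch of that idea follows; `polite_sleep` is a hypothetical helper written for this example, not a function taken from scraper.py.

```python
import random
import time


def polite_sleep(min_delay: float, max_delay: float, sleep=time.sleep) -> float:
    """Pause for a random duration in [min_delay, max_delay] and return it."""
    if min_delay < 0 or max_delay < min_delay:
        raise ValueError("expected 0 <= min_delay <= max_delay")
    pause = random.uniform(min_delay, max_delay)
    sleep(pause)  # injectable so the pause can be skipped in tests
    return pause
```

Calling `polite_sleep(0.2, 0.5)` before each request spreads requests out unevenly, which is gentler on the server than a fixed interval.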

NTSB Scraper

Location: NTSB_scraping/script.py

Features:

Extracts monthly accident data from NTSB API
Handles date ranges efficiently
Exports to JSON format
Merge utility for combining outputs
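
Monthly extraction needs a list of month boundaries to request. The sketch below shows one way to generate them; `month_ranges` is a hypothetical helper, and how script.py actually builds its date ranges is not documented here.

```python
import calendar
from datetime import date
from typing import Iterator, Tuple


def month_ranges(start: date, end: date) -> Iterator[Tuple[date, date]]:
    """Yield (first_day, last_day) for each calendar month from start to end."""
    year, month = start.year, start.month
    while (year, month) <= (end.year, end.month):
        last_day = calendar.monthrange(year, month)[1]
        yield date(year, month, 1), date(year, month, last_day)
        # Advance to the next month, rolling over the year after December.
        year, month = (year + 1, 1) if month == 12 else (year, month + 1)
```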

Run:

cd NTSB_scraping
python script.py

Output:

Monthly case files in ntsb_data/extracted/
Format: YYYY_MM_cases<timestamp>.json, where the suffix is a timestamp such as 2025-11-23_14-43
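
Assuming the monthly files follow the naming pattern shown above (a YYYY_MM prefix, the literal `cases`, then a timestamp-like suffix), a small parser can recover the year and month from a file name. The regular expression and function name are illustrative, not part of the repository.

```python
import re
from typing import Optional, Tuple

# Pattern inferred from the documented file-name format; the suffix is kept as opaque text.
CASE_FILE_RE = re.compile(r"^(\d{4})_(\d{2})_cases(.*)\.json$")


def parse_case_filename(name: str) -> Optional[Tuple[int, int, str]]:
    """Return (year, month, suffix) for a monthly case file, or None if the name does not match."""
    match = CASE_FILE_RE.match(name)
    if match is None:
        return None
    year, month, suffix = match.groups()
    return int(year), int(month), suffix
```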

Merge Extracted Data:

python merge_extracted_json.py

3. Data Loading

Prepare Data Files

Ensure the following files exist in the TL/ directory:

ASN.json - From ASN scraper (merged and cleaned)
NTSB.json - From NTSB scraper (merged monthly files)
CSV.csv - CSV data source

Load into Staging Tables

cd TL
python load_staging.py

What this does:

Reads the ASN and NTSB JSON files and the CSV source file
Extracts unique identifiers for each accident
Loads raw data into staging tables:
- stg_source1_aviation_safety
- stg_source2_ntsb
- stg_source3_csv
Marks records with status PENDING for processing
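
The steps above can be sketched roughly as follows. The column names `source_unique_id`, `raw_data`, and `status` in the INSERT statement, and both function names, are assumptions for illustration; check TL/commands.sql and load_staging.py for the actual layout.

```python
import json
from typing import Dict, Iterable, List, Tuple

# Assumed column layout; the real staging schema is defined in TL/commands.sql.
INSERT_SQL = (
    "INSERT INTO stg_source1_aviation_safety (source_unique_id, raw_data, status) "
    "VALUES (%s, %s, %s)"
)


def build_staging_rows(records: Iterable[Dict], id_field: str) -> List[Tuple[str, str, str]]:
    """Turn raw records into (id, JSON text, status) tuples, skipping records with no ID."""
    rows = []
    for record in records:
        source_id = record.get(id_field)
        if source_id is None:
            continue
        rows.append((str(source_id), json.dumps(record), "PENDING"))
    return rows


def load_rows(conn, rows: List[Tuple[str, str, str]]) -> None:
    """Insert the rows through an open psycopg2 connection."""
    with conn.cursor() as cur:
        cur.executemany(INSERT_SQL, rows)
    conn.commit()
```

Separating row building from the database call keeps the first part testable without a running PostgreSQL server.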

Verify Loading

SELECT COUNT(*) FROM stg_source1_aviation_safety;
SELECT COUNT(*) FROM stg_source2_ntsb;
SELECT COUNT(*) FROM stg_source3_csv;

Scraping Instructions

Aviation Safety Network (ASN)

Before You Start

Ensure Python 3.8+ is installed
Install dependencies: pip install curl-cffi beautifulsoup4 requests
The script includes delays to be respectful to the website

Running the Scraper

cd ASN_scraping
python scraper.py

Configuration

Edit scraper.py to customize:

# Target years
target_years = [2023, 2024, 2025]

# Delays between requests (seconds)
delay_min = 0.2
delay_max = 0.5

# Proxy (optional)
proxy = None  # or "http://user:password@host:port"

Resume Interrupted Scrapes

The scraper automatically saves progress to scraper_progress.json. If interrupted:

python scraper.py  # Automatically resumes from last checkpoint
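
A checkpoint file of this kind can be as simple as a JSON list of finished years. The sketch below assumes a `completed_years` field and hypothetical helper names; the real layout of scraper_progress.json may differ.

```python
import json
import os
from typing import List, Set


def load_done_years(path: str) -> Set[int]:
    """Return the years recorded as complete, or an empty set if no checkpoint exists."""
    if not os.path.exists(path):
        return set()
    with open(path, "r", encoding="utf-8") as fh:
        return set(json.load(fh).get("completed_years", []))


def mark_year_done(path: str, year: int) -> None:
    """Record a finished year, writing via a temp file so a crash cannot leave a partial checkpoint."""
    done = load_done_years(path) | {year}
    tmp_path = path + ".tmp"
    with open(tmp_path, "w", encoding="utf-8") as fh:
        json.dump({"completed_years": sorted(done)}, fh)
    os.replace(tmp_path, path)


def remaining_years(path: str, target_years: List[int]) -> List[int]:
    """Years still to scrape, in their original order."""
    done = load_done_years(path)
    return [y for y in target_years if y not in done]
```

On restart, the scraper would iterate over `remaining_years(...)` and call `mark_year_done(...)` after each year finishes.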

Output Files

| File | Description |
| --- | --- |
| `aviation_accidents_YYYY.json` | Per-year data files |
| `merged_all_accidents.json` | Combined raw data |
| `merged_all_accidents_cleaned.json` | Deduplicated/cleaned data |
| `scraper_progress.json` | Resume checkpoint |

NTSB Scraper

Before You Start

Ensure Python 3.8+ is installed
Install dependencies: pip install requests
No authentication required for NTSB API

Running the Scraper

cd NTSB_scraping
python script.py

Output

Files are saved to ntsb_data/extracted/:

Format: YYYY_MM_cases<timestamp>.json, where the suffix is a timestamp such as 2025-11-23_14-43
Includes monthly case summaries and metadata

Merge Results

python merge_extracted_json.py

This combines all monthly files into a single NTSB.json.
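
For readers who want to see what such a merge involves, here is a minimal sketch. It assumes each monthly file contains either a JSON list of cases or a single case object; the function name is hypothetical and merge_extracted_json.py may work differently.

```python
import glob
import json
import os
from typing import List


def merge_monthly_files(extracted_dir: str, output_path: str) -> int:
    """Concatenate every monthly JSON file in extracted_dir into one JSON array."""
    merged: List = []
    # Sorted so that the YYYY_MM prefix yields chronological order.
    for path in sorted(glob.glob(os.path.join(extracted_dir, "*.json"))):
        with open(path, "r", encoding="utf-8") as fh:
            data = json.load(fh)
        # Assumption: a file holds either a list of cases or a single case object.
        merged.extend(data if isinstance(data, list) else [data])
    with open(output_path, "w", encoding="utf-8") as fh:
        json.dump(merged, fh)
    return len(merged)
```

Write the output outside `extracted_dir`; otherwise a second run would pick up the merged file as if it were another monthly input.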

Database Schema

Dimension Tables

`dim_date`

Purpose: Temporal dimension for accident dates
Key Fields: full_date, year, quarter, month, day_of_month, day_name, month_name
Key: date_key
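
To make the fields concrete, the sketch below derives one dim_date row from a calendar date. The YYYYMMDD integer used for `date_key` is an assumption (a common warehouse convention); the key scheme in commands.sql may differ.

```python
from datetime import date
from typing import Dict

DAY_NAMES = ("Monday", "Tuesday", "Wednesday", "Thursday", "Friday", "Saturday", "Sunday")
MONTH_NAMES = (
    "January", "February", "March", "April", "May", "June",
    "July", "August", "September", "October", "November", "December",
)


def dim_date_row(d: date) -> Dict:
    """Build the dim_date attributes for a single calendar date."""
    return {
        "date_key": d.year * 10000 + d.month * 100 + d.day,  # assumed YYYYMMDD key
        "full_date": d.isoformat(),
        "year": d.year,
        "quarter": (d.month - 1) // 3 + 1,
        "month": d.month,
        "day_of_month": d.day,
        "day_name": DAY_NAMES[d.weekday()],  # weekday(): Monday is 0
        "month_name": MONTH_NAMES[d.month - 1],
    }
```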

`dim_time`

Purpose: Temporal dimension for accident times
Key Fields: time_value, hour, minute, second
Key: time_key

`dim_location`

Purpose: Geographic and airport information
Key Fields: country, state_province, city, airport_code, airport_name, latitude, longitude
Key: location_key
Unique: (data_source, source_location_id)

`dim_aircraft`

Purpose: Aircraft specifications and details
Key Fields: type_name, manufacturer, model, registration_number, serial_number, number_of_engines
Key: aircraft_key
Unique: (data_source, source_aircraft_id)

`dim_operator`

Purpose: Airline and flight operation information
Key Fields: operator_name, operator_type, owner_name, flight_operation_type
Key: operator_key
Unique: (data_source, source_operator_id)
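
The unique constraints on (data_source, source id) mean each source record maps to exactly one dimension row. The in-memory sketch below illustrates that get-or-create behaviour; it is a stand-in for explanation only, and in the database the same effect would come from the unique constraint plus the ETL SQL.

```python
from typing import Dict, Tuple


class DimensionKeyCache:
    """Hand out one surrogate key per (data_source, source_id) pair."""

    def __init__(self) -> None:
        self._keys: Dict[Tuple[str, str], int] = {}

    def get_or_create(self, data_source: str, source_id: str) -> int:
        natural_key = (data_source, source_id)
        if natural_key not in self._keys:
            # New pair: assign the next sequential surrogate key.
            self._keys[natural_key] = len(self._keys) + 1
        return self._keys[natural_key]
```

Including `data_source` in the key keeps identical IDs from different sources apart, so an ASN record and an NTSB record never collapse into one row by accident.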

Fact Table

`fact_accidents`

Purpose: Central fact table containing all accident records
Key Fields:
- Flight information: flight_number, route_departure, route_destination
- Injury metrics: total_aboard, fatalities_total, fatalities_crew, fatalities_passengers, ground_fatalities
- Data source tracking: data_source, source_unique_id
Foreign Keys: date_key, time_key, location_key, aircraft_key, operator_key
Key: accident_id

Staging Tables

Intermediate tables for data ingestion:

stg_source1_aviation_safety - ASN raw data
stg_source2_ntsb - NTSB raw data
stg_source3_csv - CSV data

Project Architecture

Data Flow

ASN Website          NTSB API           CSV Files
      ↓                  ↓                   ↓
   scraper.py        script.py          CSV reader
      ↓                  ↓                   ↓
  JSON Files        JSON Files          JSON Format
      ↓                  ↓                   ↓
   ┌──────────────────────────────────────┐
   │    Staging Tables (PostgreSQL)       │
   │  - stg_source1_aviation_safety       │
   │  - stg_source2_ntsb                  │
   │  - stg_source3_csv                   │
   └──────────────────────────────────────┘
                       ↓
   ┌──────────────────────────────────────┐
   │   Transformation & Deduplication     │
   │   (ETL process)                      │
   └──────────────────────────────────────┘
                       ↓
   ┌──────────────────────────────────────┐
   │  Dimension & Fact Tables             │
   │  - dim_date, dim_time                │
   │  - dim_location, dim_aircraft        │
   │  - dim_operator                      │
   │  - fact_accidents                    │
   └──────────────────────────────────────┘
                       ↓
   ┌──────────────────────────────────────┐
   │  Analytics & Reporting               │
   │  (Your queries here)                 │
   └──────────────────────────────────────┘

Technology Stack

Data Extraction: Python (requests, curl_cffi, BeautifulSoup)
Data Storage: PostgreSQL 12+
Data Processing: Python, SQL
Format: JSON, CSV, JSONB (PostgreSQL)

Troubleshooting

Database Connection Issues

Problem: psql: could not translate host name "localhost" to address

Solution:

# Use IP address instead
psql -U postgres -h 127.0.0.1 -d FlightAccidentMain

# Or check PostgreSQL service status (Windows)
Get-Service PostgreSQL*

Scraper Getting Blocked

Problem: 403 Forbidden or Connection Timeout

Solution:

Increase delays in scraper configuration
Use proxy (configure in proxies.txt)
Restart after a few minutes

delay_min = 1.0  # 1 second minimum
delay_max = 3.0  # 3 second maximum
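
Beyond raising the fixed delays, a scraper can back off progressively when it sees a blocking response. The sketch below is a generic pattern with hypothetical names; `fetch` stands for any callable that performs one request and returns its HTTP status code.

```python
import time
from typing import Callable

RETRY_STATUSES = (403, 429)  # responses treated as "blocked" in this sketch


def fetch_with_backoff(
    fetch: Callable[[], int],
    max_attempts: int = 4,
    base_delay: float = 1.0,
    sleep: Callable[[float], None] = time.sleep,
) -> int:
    """Call fetch() up to max_attempts times, doubling the wait after each blocked response."""
    if max_attempts < 1:
        raise ValueError("max_attempts must be at least 1")
    delay = base_delay
    status = fetch()
    for _ in range(max_attempts - 1):
        if status not in RETRY_STATUSES:
            break
        sleep(delay)
        delay *= 2
        status = fetch()
    return status
```

The function returns the last status either way, so the caller decides whether to give up or switch to a proxy.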

Duplicate Key Errors During Loading

Problem: duplicate key value violates unique constraint

Solution:

-- Clear staging tables and retry
TRUNCATE TABLE stg_source1_aviation_safety CASCADE;
TRUNCATE TABLE stg_source2_ntsb CASCADE;
TRUNCATE TABLE stg_source3_csv CASCADE;

Out of Memory During Scraping

Problem: Python runs out of memory with large JSON files

Solution:

Process data in smaller time ranges
Delete intermediate files after merging
Increase system virtual memory
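
One way to process data in smaller pieces is to handle records in fixed-size batches instead of building one large list. A minimal sketch, with a hypothetical helper name:

```python
from itertools import islice
from typing import Iterable, Iterator, List, TypeVar

T = TypeVar("T")


def batched(items: Iterable[T], size: int) -> Iterator[List[T]]:
    """Yield successive lists of at most `size` items without materialising the whole input."""
    if size < 1:
        raise ValueError("size must be at least 1")
    iterator = iter(items)
    while True:
        batch = list(islice(iterator, size))
        if not batch:
            return
        yield batch
```

Each batch can be written to disk or loaded into staging before the next one is read, which keeps peak memory proportional to the batch size.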

Database Not Found

Problem: database "FlightAccidentMain" does not exist

Solution:

-- Create the database (quoted so the mixed-case name is preserved)
CREATE DATABASE "FlightAccidentMain";

# Then, from the shell, re-run the schema
psql -U postgres -d FlightAccidentMain -f TL/commands.sql

Example Queries

Count Accidents by Year

SELECT
    DATE_PART('year', d.full_date) as year,
    COUNT(*) as accident_count,
    SUM(fa.fatalities_total) as total_fatalities
FROM fact_accidents fa
JOIN dim_date d ON fa.date_key = d.date_key
GROUP BY DATE_PART('year', d.full_date)
ORDER BY year DESC;

Top 10 Most Dangerous Aircraft Types

SELECT
    da.type_name,
    COUNT(*) as accident_count,
    AVG(fa.fatalities_total) as avg_fatalities
FROM fact_accidents fa
JOIN dim_aircraft da ON fa.aircraft_key = da.aircraft_key
WHERE fa.fatalities_total > 0
GROUP BY da.type_name
ORDER BY accident_count DESC
LIMIT 10;

Accidents by Country (Last 5 Years)

SELECT
    dl.country,
    COUNT(*) as accident_count,
    SUM(fa.fatalities_total) as total_fatalities
FROM fact_accidents fa
JOIN dim_location dl ON fa.location_key = dl.location_key
JOIN dim_date d ON fa.date_key = d.date_key
WHERE d.full_date >= CURRENT_DATE - INTERVAL '5 years'
GROUP BY dl.country
ORDER BY accident_count DESC;

Contributing

To contribute to this project:

Report issues or suggest improvements via GitHub Issues
Submit enhancements via Pull Requests
Ensure data quality and completeness
Test scraping changes before submitting

Contact & Support

For questions or support regarding this repository, please open an issue or contact the project maintainers.

License

Please refer to the LICENSE file in the repository for licensing information.

Last Updated: December 2025
Data Coverage: 2010-2025
Sources: Aviation Safety Network, NTSB, CSV Data
