
A cannabis public-data dashboard and website that automates the collection and display of publicly available cannabis information from state and federal government data sources

phreakin/cannabis

Cannabis Data Aggregator

Project: ${PROJECT_NAME}
Author: ${AUTHOR}
Website: ${AUTHOR_WEBSITE}
Contact: ${AUTHOR_EMAIL}


Automated collection and management of US cannabis open data from state and federal APIs. Aggregates dispensary locations, license data, sales figures, grower/processor records, and more into a unified database — with a web dashboard for management and export.


Features

  • 50+ pre-configured data sources across ~20 states + federal agencies
  • Multiple formats: Socrata SODA, JSON REST API, CSV, GeoJSON
  • Automated scheduling with cron and interval-based jobs (APScheduler)
  • Web dashboard for monitoring, managing sources/schedules, and browsing data
  • Map view of GPS-tagged records (Leaflet.js)
  • Export: CSV, JSON, GeoJSON, Excel (multi-sheet by category)
  • REST API for programmatic access to all collected data
  • Admin-configurable: add/remove sources, change schedules, toggle sources
  • Hash-based deduplication to avoid storing duplicate records
  • SQLite by default; MySQL and PostgreSQL supported via DATABASE_URL
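The hash-based deduplication mentioned above can be sketched roughly like this. This is a minimal illustration, not the project's actual implementation — the function names and the in-memory `seen` set are hypothetical stand-ins for what the real code does against the database:

```python
import hashlib
import json

def record_hash(record: dict) -> str:
    """Hash a record's canonical JSON form so identical records collide."""
    # sort_keys gives a stable serialization regardless of key order
    canonical = json.dumps(record, sort_keys=True, separators=(",", ":"))
    return hashlib.sha256(canonical.encode("utf-8")).hexdigest()

seen: set[str] = set()

def is_duplicate(record: dict) -> bool:
    """Return True if an identical record was already seen; remember it otherwise."""
    h = record_hash(record)
    if h in seen:
        return True
    seen.add(h)
    return False

# Same content in a different key order hashes identically:
a = {"name": "Green Leaf", "city": "Denver"}
b = {"city": "Denver", "name": "Green Leaf"}
assert record_hash(a) == record_hash(b)
```

Because the hash is computed over the normalized record, re-collecting an unchanged source produces no new rows.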

Quick Start

1. Install dependencies

python -m venv .venv
source .venv/bin/activate      # Windows: .venv\Scripts\activate
pip install -r requirements.txt

2. Configure environment

cp .env.example .env
# Edit .env — at minimum set your Socrata app tokens for better rate limits

3. Initialize database

python main.py --mode setup

4. Seed sources and schedules from YAML config

python main.py --mode seed

5. Run

# Dashboard + background scheduler (recommended)
python main.py

# Or with make:
make run

Open http://localhost:5000 in your browser.


Usage

Run modes

python main.py                          # Dashboard + Scheduler (default)
python main.py --mode dashboard         # Web dashboard only
python main.py --mode scheduler         # Background scheduler only
python main.py --mode setup             # Initialize database
python main.py --mode seed              # Load sources/schedules from YAML
python main.py --mode seed --force      # Re-seed, overwriting existing

# Trigger collection manually
python main.py --mode collect --source co_med_licensees
python main.py --mode collect --all
python main.py --mode collect --all --state CO
python main.py --mode collect --all --category dispensary

Scripts

# Direct script access
python scripts/setup_db.py --check                    # DB health check
python scripts/seed_sources.py --dry-run              # Preview seed
python scripts/seed_sources.py --sources-only --force # Re-seed sources only
python scripts/run_collector.py --list                # List enabled sources
python scripts/run_collector.py --source co_med_licensees
python scripts/run_collector.py --all --state WA

# Export data
python scripts/export_data.py --format csv
python scripts/export_data.py --format geojson --state CO
python scripts/export_data.py --format xlsx --category dispensary
python scripts/export_data.py --format json --output my_export.json --limit 50000
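To illustrate what a GeoJSON export contains, GPS-tagged records become a FeatureCollection roughly like the sketch below. This shows the output shape only, not the exporter's actual code; the `latitude`/`longitude` field names follow the standard schema, everything else is an assumption:

```python
def to_feature_collection(records: list[dict]) -> dict:
    """Convert GPS-tagged records into a GeoJSON FeatureCollection."""
    features = []
    for rec in records:
        lat, lon = rec.get("latitude"), rec.get("longitude")
        if lat is None or lon is None:
            continue  # skip records without coordinates
        features.append({
            "type": "Feature",
            # GeoJSON coordinate order is [longitude, latitude]
            "geometry": {"type": "Point", "coordinates": [float(lon), float(lat)]},
            "properties": {k: v for k, v in rec.items()
                           if k not in ("latitude", "longitude")},
        })
    return {"type": "FeatureCollection", "features": features}

fc = to_feature_collection([
    {"name": "Green Leaf", "latitude": 39.74, "longitude": -104.99},
    {"name": "No GPS record"},
])
assert len(fc["features"]) == 1
assert fc["features"][0]["geometry"]["coordinates"] == [-104.99, 39.74]
```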

Make targets

make install       # pip install -r requirements.txt
make setup         # Initialize database
make seed          # Seed sources and schedules
make run           # Start dashboard + scheduler
make dashboard     # Dashboard only
make scheduler     # Scheduler only
make collect       # Collect all enabled sources
make collect SOURCE=co_med_licensees   # Collect specific source
make export        # Export to CSV
make clean         # Remove cached files

Configuration

Environment variables (.env)

| Variable | Default | Description |
| --- | --- | --- |
| DATABASE_URL | sqlite:///data/cannabis_aggregator.db | Database connection |
| FLASK_HOST | 0.0.0.0 | Dashboard host |
| FLASK_PORT | 5000 | Dashboard port |
| FLASK_SECRET_KEY | dev-secret-key-... | Session secret |
| FLASK_DEBUG | false | Debug mode |
| SCHEDULER_TIMEZONE | America/Chicago | Cron timezone |
| SCHEDULER_MAX_WORKERS | 5 | Concurrent collection threads |
| LOG_LEVEL | INFO | Logging level |
| CO_APP_TOKEN | | Colorado Socrata app token |
| WA_APP_TOKEN | | Washington Socrata app token |

(see .env.example for all)

config/sources.yaml

Defines all data sources. Key fields per source:

- id: co_med_licensees            # Unique identifier
  name: "Colorado MED Licensees"
  state: CO
  agency: Colorado MED
  category: licensee
  format: soda                    # soda | json | csv | geojson
  url: https://data.colorado.gov/resource/sqs8-2una.json
  enabled: true
  api_key_env: CO_APP_TOKEN       # Optional env var for auth
  pagination:
    type: offset                  # offset | page | cursor | link
    page_size: 1000
  field_mapping:                  # Maps source fields → standard schema
    name: licensee_name
    license_number: license_no
    address: street_address
    city: city
    state: state
    zip_code: zip
    latitude: latitude
    longitude: longitude
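To show how `pagination: offset` and `field_mapping` fit together, here is a rough sketch of the collection loop. The real collectors live in src/collectors/; `fetch_page` below is a stand-in for an HTTP call (e.g. a Socrata request with `$offset`/`$limit`), and the helper names are illustrative only:

```python
def apply_field_mapping(raw: dict, mapping: dict) -> dict:
    """Map source field names to the standard schema (standard_field: source_field)."""
    return {std: raw.get(src) for std, src in mapping.items()}

def collect(fetch_page, mapping: dict, page_size: int = 1000) -> list[dict]:
    """Offset pagination: request pages until a short or empty page signals the end."""
    records, offset = [], 0
    while True:
        page = fetch_page(offset=offset, limit=page_size)
        records.extend(apply_field_mapping(r, mapping) for r in page)
        if len(page) < page_size:
            break
        offset += page_size
    return records

# Stand-in for an HTTP request returning one page of raw source records
def fake_fetch(offset: int, limit: int) -> list[dict]:
    data = [{"licensee_name": f"Biz {i}", "license_no": str(i)} for i in range(5)]
    return data[offset:offset + limit]

mapping = {"name": "licensee_name", "license_number": "license_no"}
rows = collect(fake_fetch, mapping, page_size=2)
assert rows[0] == {"name": "Biz 0", "license_number": "0"}
assert len(rows) == 5
```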

config/schedules.yaml

- id: sched_co_med_weekly
  name: "Colorado MED Licensees - Weekly"
  source_id: co_med_licensees
  enabled: true
  schedule_type: cron
  cron:
    minute: 0
    hour: 2
    day_of_week: sun    # Every Sunday at 2:00 AM
  priority: 2
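A schedule entry like the one above maps naturally onto the keyword arguments of APScheduler's CronTrigger. A sketch of that translation (the actual wiring is in src/scheduler/manager.py; this helper is illustrative and assumes the cron block uses CronTrigger's field names, as the example above does):

```python
def cron_kwargs(schedule: dict) -> dict:
    """Turn a schedules.yaml entry's cron block into CronTrigger-style kwargs."""
    cron = schedule.get("cron", {})
    # CronTrigger accepts these field names directly, so the dict can be
    # splatted into e.g. scheduler.add_job(run_source, "cron", **cron_kwargs(entry))
    allowed = {"year", "month", "day", "week", "day_of_week", "hour", "minute", "second"}
    return {k: v for k, v in cron.items() if k in allowed}

entry = {
    "id": "sched_co_med_weekly",
    "source_id": "co_med_licensees",
    "schedule_type": "cron",
    "cron": {"minute": 0, "hour": 2, "day_of_week": "sun"},
}
assert cron_kwargs(entry) == {"minute": 0, "hour": 2, "day_of_week": "sun"}
```

Unspecified fields fall back to CronTrigger's defaults, which is why the weekly example only needs minute, hour, and day_of_week.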

Data Sources Included

Federal

| Source | Format | Category |
| --- | --- | --- |
| USDA AMS Hemp Producers | CSV | Hemp |
| DEA Registrant Locations | JSON | Pharmacy |
| ProPublica Congress API | JSON | Legislation |
| FDA NDC Drug Products | JSON | Pharmacy |

States (sample)

| State | Agency | Data |
| --- | --- | --- |
| Colorado | MED | Licensees, Sales, Market Rates |
| Washington | WSLCB | Licensees, Sales, Violations |
| Oregon | OLCC | Licensees, GeoJSON Dispensaries |
| California | DCC | Licensees |
| Oklahoma | OMMA | Dispensaries, Growers, Processors, Transporters |
| Illinois | IDFPR | Cannabis Licenses, Monthly Sales |
| Massachusetts | CCC | Licensees, Weekly Sales |
| Michigan | CRA | Licenses, Sales |
| New York | OCM | All Licenses, Dispensaries |
| New Jersey | CRC | Licenses |
| Alaska | AMCO | License Database |
| Connecticut | DCP | Cannabis Licenses |
| DC | ABCA | Cannabis Licenses |
| New Mexico | RLD | Cannabis Licenses |

(+ 10 more states)

Multi-state

| Source | Format | Notes |
| --- | --- | --- |
| OpenStreetMap/Overpass | GeoJSON | Free dispensary POI data |
| NCSL Cannabis Laws | JSON | State law tracker |
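The OpenStreetMap/Overpass source presumably queries the `shop=cannabis` POI tag. For reference, a minimal Overpass QL query of that shape — the exact query the collector sends may differ, and the bounding box values here (south, west, north, east) are placeholders:

```
[out:json][timeout:25];
node["shop"="cannabis"](39.5,-106.0,40.0,-104.5);
out;
```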

Dashboard Pages

| URL | Description |
| --- | --- |
| / | Dashboard overview with stats and charts |
| /sources | Manage data sources (add, edit, toggle, run) |
| /schedules | Manage collection schedules |
| /data | Browse collected records with filters |
| /data/map | Leaflet map of GPS-tagged locations |
| /data/logs | Collection run logs |
| /data/exports | Export data + API documentation |
| /data/settings | App settings |

REST API

Base URL: http://localhost:5000/api

GET  /api/records                    Paginated records (filters: state, category, source_id, has_gps, search)
GET  /api/records/{id}               Single record
GET  /api/records/geojson            GeoJSON FeatureCollection of GPS records
GET  /api/records/export             File download (format=csv|json|geojson)
GET  /api/sources                    List sources
POST /api/sources                    Create source
PUT  /api/sources/{id}               Update source
POST /api/sources/{id}/toggle        Enable/disable
POST /api/sources/{id}/run           Trigger collection now
GET  /api/schedules                  List schedules
POST /api/schedules                  Create schedule
PUT  /api/schedules/{id}             Update schedule
POST /api/schedules/{id}/toggle      Enable/disable
GET  /api/runs                       Collection run history
GET  /api/logs                       Log entries
GET  /api/stats/categories           Record counts by category
GET  /api/stats/states               Record counts by state
POST /api/scheduler/sync             Sync scheduler jobs from DB
POST /api/seed                       Seed from YAML config
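The records endpoint can be queried with any HTTP client. A small stdlib-only sketch — the query parameters match the filters listed above, but the shape of the JSON response body is not specified here, so `fetch_records` simply returns the decoded JSON:

```python
import json
import urllib.parse
import urllib.request

BASE = "http://localhost:5000/api"

def records_url(**filters) -> str:
    """Build a /api/records query URL from filter keyword arguments."""
    query = urllib.parse.urlencode({k: v for k, v in filters.items() if v is not None})
    return f"{BASE}/records?{query}" if query else f"{BASE}/records"

def fetch_records(**filters) -> dict:
    """GET the paginated records endpoint and decode the JSON body."""
    with urllib.request.urlopen(records_url(**filters)) as resp:
        return json.load(resp)

# Example (requires the dashboard to be running):
#   data = fetch_records(state="CO", category="dispensary", has_gps="true")
print(records_url(state="CO", category="dispensary"))
# → http://localhost:5000/api/records?state=CO&category=dispensary
```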

Adding a New Data Source

  1. Add to config/sources.yaml:

    - id: my_state_licenses
      name: "My State Cannabis Licenses"
      state: XX
      agency: My State Agency
      category: licensee
      format: soda        # or csv, json, geojson
      url: https://data.mystate.gov/resource/xxxx-xxxx.json
      enabled: true
      pagination:
        type: offset
        page_size: 1000
      field_mapping:
        name: business_name
        license_number: license_id
        city: city
        state: state_code
  2. Add a schedule to config/schedules.yaml:

    - id: sched_my_state_weekly
      name: "My State Licenses - Weekly"
      source_id: my_state_licenses
      enabled: true
      schedule_type: cron
      cron:
        minute: 0
        hour: 3
        day_of_week: sun
  3. Seed the database:

    python main.py --mode seed
    # or in the dashboard: Settings → Seed Sources from YAML
  4. Test with a manual collection:

    python scripts/run_collector.py --source my_state_licenses

Project Structure

cannabis-data-aggregator/
├── main.py                    Entry point
├── requirements.txt
├── .env.example               Environment template
├── docker-compose.yml
├── Makefile
├── config/
│   ├── sources.yaml           Data source definitions (50+ sources)
│   ├── schedules.yaml         Collection schedules
│   └── settings.yaml          Global settings
├── src/
│   ├── collectors/
│   │   ├── base.py            BaseCollector (HTTP, rate limiting, retries)
│   │   ├── api_collector.py   JSON REST + Socrata SODA
│   │   ├── csv_collector.py   CSV/TSV with auto-encoding detection
│   │   └── geojson_collector.py  GeoJSON + Overpass API
│   ├── processors/
│   │   └── normalizer.py      Field mapping, standardization
│   ├── scheduler/
│   │   └── manager.py         APScheduler + collection job runner
│   ├── storage/
│   │   ├── models.py          SQLAlchemy models
│   │   └── database.py        DB init, session management
│   └── dashboard/
│       ├── app.py             Flask app factory
│       ├── routes/            Blueprint routes
│       ├── templates/         Jinja2 HTML templates
│       └── static/            CSS, JavaScript
├── scripts/
│   ├── setup_db.py            Database initialization
│   ├── seed_sources.py        Seed from YAML
│   ├── run_collector.py       CLI collection runner
│   └── export_data.py         CLI data exporter
├── data/
│   ├── raw/                   Temporary raw files
│   └── processed/             Exported data files
└── logs/                      Application logs

Docker

Prerequisites

  • Docker Desktop 24+ (or Docker Engine + Compose plugin v2)
  • .env file configured — copy and edit .env.example first

Quick Start — SQLite (zero-config)

Runs the app with a local SQLite database stored in ./data/. No external services needed.

cp .env.example .env
# Edit .env: set FLASK_SECRET_KEY and any API tokens you want
# Ensure DATABASE_URL is set to SQLite (default):
#   DATABASE_URL=sqlite:///data/cannabis_aggregator.db

# Build and start
docker compose up --build -d

# First-run: initialize database and load sources
docker compose exec app python main.py --mode setup
docker compose exec app python main.py --mode seed

Open http://localhost:5000 in your browser.


With MySQL

Starts the app plus a MySQL 8.0 container using the mysql profile.

# In .env, configure:
#   DATABASE_URL=mysql+pymysql://cannabis:Passw0rd@db:3306/cannabis_data
#   MYSQL_ROOT_PASSWORD=your-root-password
#   MYSQL_DATABASE=cannabis_data
#   MYSQL_USER=cannabis
#   MYSQL_PASSWORD=your-password

docker compose --profile mysql up --build -d

# Wait ~10 seconds for MySQL to be ready, then initialize:
docker compose exec app python main.py --mode setup
docker compose exec app python main.py --mode seed

Common Commands

# Follow application logs
docker compose logs -f app

# Run a manual collection of all enabled sources
docker compose exec app python main.py --mode collect --all

# Collect a specific source
docker compose exec app python main.py --mode collect --source co_med_licensees

# Export data
docker compose exec app python scripts/export_data.py --format csv

# Open a shell in the container
docker compose exec app bash

# Restart the app (picks up .env changes)
docker compose restart app

# Stop all services (keeps volumes)
docker compose down

# Stop and destroy all data (irreversible)
docker compose down -v

Volumes & Mounts

| Mount | Purpose |
| --- | --- |
| ./data:/app/data | Database file, exports, raw & processed data |
| ./logs:/app/logs | Rotating application log files |
| ./config:/app/config | sources.yaml, schedules.yaml — editable live |
| mysql_data (named) | MySQL data directory (persists across restarts) |

Tip: Because ./config is bind-mounted, you can edit config/sources.yaml and add new data sources without rebuilding the image. Just docker compose restart app to reload.


Building the Image Standalone

# Build
docker build -t cannabis-aggregator .

# Run (SQLite, data persisted to host ./data)
docker run -d \
  --name cannabis-aggregator \
  -p 5000:5000 \
  -v "$(pwd)/data:/app/data" \
  -v "$(pwd)/logs:/app/logs" \
  -v "$(pwd)/config:/app/config" \
  --env-file .env \
  cannabis-aggregator

Production Notes

| Setting | Recommendation |
| --- | --- |
| FLASK_SECRET_KEY | Generate with openssl rand -hex 32 — never use the default |
| FLASK_ENV | Set to production |
| FLASK_DEBUG | Set to false |
| Database | Use MySQL (or PostgreSQL via DATABASE_URL=postgresql://...) for multi-user deployments |
| Reverse proxy | Place nginx or Caddy in front — expose only port 5000 internally |
| TLS | Terminate SSL at the reverse proxy, not in Flask |
| Updates | docker compose pull && docker compose up --build -d |

License

MIT
