Project:
Website: ${AUTHOR_WEBSITE}
Contact: ${AUTHOR_EMAIL}
Automated collection and management of US cannabis open data from state and federal APIs. Aggregates dispensary locations, license data, sales figures, grower/processor records, and more into a unified database — with a web dashboard for management and export.
- 50+ pre-configured data sources across ~20 states + federal agencies
- Multiple formats: Socrata SODA, JSON REST API, CSV, GeoJSON
- Automated scheduling with cron and interval-based jobs (APScheduler)
- Web dashboard for monitoring, managing sources/schedules, and browsing data
- Map view of GPS-tagged records (Leaflet.js)
- Export: CSV, JSON, GeoJSON, Excel (multi-sheet by category)
- REST API for programmatic access to all collected data
- Admin-configurable: add/remove sources, change schedules, toggle sources
- Hash-based deduplication to avoid storing duplicate records
- SQLite by default; PostgreSQL or MySQL supported via DATABASE_URL
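The hash-based deduplication listed above can be sketched as follows. This is an illustrative approach, not necessarily the project's exact implementation: a stable hash over a record's canonical JSON form identifies records that were already stored.

```python
import hashlib
import json

def record_hash(record: dict) -> str:
    """Stable SHA-256 over a record's sorted key/value pairs.

    Illustrative only: the real project may hash a different subset
    of fields or normalize values before hashing.
    """
    canonical = json.dumps(record, sort_keys=True, separators=(",", ":"))
    return hashlib.sha256(canonical.encode("utf-8")).hexdigest()

seen = set()

def is_duplicate(record: dict) -> bool:
    """Insert-time check: True if an identical record was already seen."""
    h = record_hash(record)
    if h in seen:
        return True
    seen.add(h)
    return False
```

Because keys are sorted before hashing, the same record always produces the same hash regardless of field order in the source response.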
python -m venv .venv
source .venv/bin/activate # Windows: .venv\Scripts\activate
pip install -r requirements.txt

cp .env.example .env
# Edit .env — at minimum set your Socrata app tokens for better rate limits

python main.py --mode setup
python main.py --mode seed

# Dashboard + background scheduler (recommended)
python main.py
# Or with make:
make run

Open http://localhost:5000 in your browser.
python main.py # Dashboard + Scheduler (default)
python main.py --mode dashboard # Web dashboard only
python main.py --mode scheduler # Background scheduler only
python main.py --mode setup # Initialize database
python main.py --mode seed # Load sources/schedules from YAML
python main.py --mode seed --force # Re-seed, overwriting existing
# Trigger collection manually
python main.py --mode collect --source co_med_licensees
python main.py --mode collect --all
python main.py --mode collect --all --state CO
python main.py --mode collect --all --category dispensary

# Direct script access
python scripts/setup_db.py --check # DB health check
python scripts/seed_sources.py --dry-run # Preview seed
python scripts/seed_sources.py --sources-only --force # Re-seed sources only
python scripts/run_collector.py --list # List enabled sources
python scripts/run_collector.py --source co_med_licensees
python scripts/run_collector.py --all --state WA
# Export data
python scripts/export_data.py --format csv
python scripts/export_data.py --format geojson --state CO
python scripts/export_data.py --format xlsx --category dispensary
python scripts/export_data.py --format json --output my_export.json --limit 50000

make install        # pip install -r requirements.txt
make setup # Initialize database
make seed # Seed sources and schedules
make run # Start dashboard + scheduler
make dashboard # Dashboard only
make scheduler # Scheduler only
make collect # Collect all enabled sources
make collect SOURCE=co_med_licensees # Collect specific source
make export # Export to CSV
make clean          # Remove cached files

| Variable | Default | Description |
|---|---|---|
| `DATABASE_URL` | `sqlite:///data/cannabis_aggregator.db` | Database connection |
| `FLASK_HOST` | `0.0.0.0` | Dashboard host |
| `FLASK_PORT` | `5000` | Dashboard port |
| `FLASK_SECRET_KEY` | `dev-secret-key-...` | Session secret |
| `FLASK_DEBUG` | `false` | Debug mode |
| `SCHEDULER_TIMEZONE` | `America/Chicago` | Cron timezone |
| `SCHEDULER_MAX_WORKERS` | `5` | Concurrent collection threads |
| `LOG_LEVEL` | `INFO` | Logging level |
| `CO_APP_TOKEN` | — | Colorado Socrata app token |
| `WA_APP_TOKEN` | — | Washington Socrata app token |
| (see `.env.example` for all) | | |
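A minimal sketch of how these variables might be read at startup. The `get_setting` helper is an assumption for illustration, not the project's actual config module; the defaults mirror the table above.

```python
import os

# Defaults mirror the environment-variable table above.
DEFAULTS = {
    "DATABASE_URL": "sqlite:///data/cannabis_aggregator.db",
    "FLASK_HOST": "0.0.0.0",
    "FLASK_PORT": "5000",
    "SCHEDULER_TIMEZONE": "America/Chicago",
    "SCHEDULER_MAX_WORKERS": "5",
    "LOG_LEVEL": "INFO",
}

def get_setting(name: str) -> str:
    """Return the environment value if set, else the documented default."""
    return os.environ.get(name, DEFAULTS.get(name, ""))
```

In practice a `.env` loader (such as `python-dotenv`) would populate `os.environ` before these lookups run.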
Defines all data sources. Key fields per source:
- id: co_med_licensees              # Unique identifier
  name: "Colorado MED Licensees"
  state: CO
  agency: Colorado MED
  category: licensee
  format: soda                      # soda | json | csv | geojson
  url: https://data.colorado.gov/resource/sqs8-2una.json
  enabled: true
  api_key_env: CO_APP_TOKEN         # Optional env var for auth
  pagination:
    type: offset                    # offset | page | cursor | link
    page_size: 1000
  field_mapping:                    # Maps source fields → standard schema
    name: licensee_name
    license_number: license_no
    address: street_address
    city: city
    state: state
    zip_code: zip
    latitude: latitude
    longitude: longitude

- id: sched_co_med_weekly
  name: "Colorado MED Licensees - Weekly"
  source_id: co_med_licensees
  enabled: true
  schedule_type: cron
  cron:
    minute: 0
    hour: 2
    day_of_week: sun                # Every Sunday at 2:00 AM
  priority: 2

| Source | Format | Category |
|---|---|---|
| USDA AMS Hemp Producers | CSV | Hemp |
| DEA Registrant Locations | JSON | Pharmacy |
| ProPublica Congress API | JSON | Legislation |
| FDA NDC Drug Products | JSON | Pharmacy |
| State | Agency | Data |
|---|---|---|
| Colorado | MED | Licensees, Sales, Market Rates |
| Washington | WSLCB | Licensees, Sales, Violations |
| Oregon | OLCC | Licensees, GeoJSON Dispensaries |
| California | DCC | Licensees |
| Oklahoma | OMMA | Dispensaries, Growers, Processors, Transporters |
| Illinois | IDFPR | Cannabis Licenses, Monthly Sales |
| Massachusetts | CCC | Licensees, Weekly Sales |
| Michigan | CRA | Licenses, Sales |
| New York | OCM | All Licenses, Dispensaries |
| New Jersey | CRC | Licenses |
| Alaska | AMCO | License Database |
| Connecticut | DCP | Cannabis Licenses |
| DC | ABCA | Cannabis Licenses |
| New Mexico | RLD | Cannabis Licenses |
| (+ 10 more states) | | |
| Source | Format | Notes |
|---|---|---|
| OpenStreetMap/Overpass | GeoJSON | Free dispensary POI data |
| NCSL Cannabis Laws | JSON | State law tracker |
| URL | Description |
|---|---|
| `/` | Dashboard overview with stats and charts |
| `/sources` | Manage data sources (add, edit, toggle, run) |
| `/schedules` | Manage collection schedules |
| `/data` | Browse collected records with filters |
| `/data/map` | Leaflet map of GPS-tagged locations |
| `/data/logs` | Collection run logs |
| `/data/exports` | Export data + API documentation |
| `/data/settings` | App settings |
Base URL: http://localhost:5000/api
GET /api/records Paginated records (filters: state, category, source_id, has_gps, search)
GET /api/records/{id} Single record
GET /api/records/geojson GeoJSON FeatureCollection of GPS records
GET /api/records/export File download (format=csv|json|geojson)
GET /api/sources List sources
POST /api/sources Create source
PUT /api/sources/{id} Update source
POST /api/sources/{id}/toggle Enable/disable
POST /api/sources/{id}/run Trigger collection now
GET /api/schedules List schedules
POST /api/schedules Create schedule
PUT /api/schedules/{id} Update schedule
POST /api/schedules/{id}/toggle Enable/disable
GET /api/runs Collection run history
GET /api/logs Log entries
GET /api/stats/categories Record counts by category
GET /api/stats/states Record counts by state
POST /api/scheduler/sync Sync scheduler jobs from DB
POST /api/seed Seed from YAML config
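For programmatic access, the query string for `GET /api/records` can be assembled like this. The filter names come from the endpoint list above; the helper itself is a hypothetical convenience, not part of the project.

```python
from urllib.parse import urlencode

BASE_URL = "http://localhost:5000/api"

def records_url(state=None, category=None, source_id=None,
                has_gps=None, search=None):
    """Build the /api/records URL, omitting any filters left unset."""
    filters = {"state": state, "category": category, "source_id": source_id,
               "has_gps": has_gps, "search": search}
    query = urlencode({k: v for k, v in filters.items() if v is not None})
    return f"{BASE_URL}/records?{query}" if query else f"{BASE_URL}/records"
```

For example, `records_url(state="CO", category="dispensary")` yields a URL you can fetch with any HTTP client.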
- Add to `config/sources.yaml`:

      - id: my_state_licenses
        name: "My State Cannabis Licenses"
        state: XX
        agency: My State Agency
        category: licensee
        format: soda              # or csv, json, geojson
        url: https://data.mystate.gov/resource/xxxx-xxxx.json
        enabled: true
        pagination:
          type: offset
          page_size: 1000
        field_mapping:
          name: business_name
          license_number: license_id
          city: city
          state: state_code

- Add a schedule to `config/schedules.yaml`:

      - id: sched_my_state_weekly
        name: "My State Licenses - Weekly"
        source_id: my_state_licenses
        enabled: true
        schedule_type: cron
        cron:
          minute: 0
          hour: 3
          day_of_week: sun

- Seed the database:

      python main.py --mode seed    # or in the dashboard: Settings → Seed Sources from YAML

- Test with a manual collection:

      python scripts/run_collector.py --source my_state_licenses
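The `field_mapping` block is the key piece: it renames source columns into the standard schema. A minimal sketch of that normalization step follows; it is illustrative only, and the project's actual normalizer in `src/processors/normalizer.py` may do more (type coercion, whitespace trimming, etc.).

```python
def apply_field_mapping(raw: dict, mapping: dict) -> dict:
    """Map raw source fields onto the standard schema.

    `mapping` is {standard_field: source_field}, as in sources.yaml.
    Missing source fields become None so downstream code always sees
    a uniform record shape.
    """
    return {std: raw.get(src) for std, src in mapping.items()}

mapping = {
    "name": "business_name",
    "license_number": "license_id",
    "city": "city",
    "state": "state_code",
}

raw = {"business_name": "Green Leaf LLC", "license_id": "XX-123",
       "city": "Springfield", "state_code": "XX", "extra": "dropped"}

normalized = apply_field_mapping(raw, mapping)
# normalized == {"name": "Green Leaf LLC", "license_number": "XX-123",
#                "city": "Springfield", "state": "XX"}
```

Fields not listed in the mapping (like `extra` above) are dropped, which keeps the unified database schema consistent across sources.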
cannabis-data-aggregator/
├── main.py Entry point
├── requirements.txt
├── .env.example Environment template
├── docker-compose.yml
├── Makefile
├── config/
│ ├── sources.yaml Data source definitions (50+ sources)
│ ├── schedules.yaml Collection schedules
│ └── settings.yaml Global settings
├── src/
│ ├── collectors/
│ │ ├── base.py BaseCollector (HTTP, rate limiting, retries)
│ │ ├── api_collector.py JSON REST + Socrata SODA
│ │ ├── csv_collector.py CSV/TSV with auto-encoding detection
│ │ └── geojson_collector.py GeoJSON + Overpass API
│ ├── processors/
│ │ └── normalizer.py Field mapping, standardization
│ ├── scheduler/
│ │ └── manager.py APScheduler + collection job runner
│ ├── storage/
│ │ ├── models.py SQLAlchemy models
│ │ └── database.py DB init, session management
│ └── dashboard/
│ ├── app.py Flask app factory
│ ├── routes/ Blueprint routes
│ ├── templates/ Jinja2 HTML templates
│ └── static/ CSS, JavaScript
├── scripts/
│ ├── setup_db.py Database initialization
│ ├── seed_sources.py Seed from YAML
│ ├── run_collector.py CLI collection runner
│ └── export_data.py CLI data exporter
├── data/
│ ├── raw/ Temporary raw files
│ └── processed/ Exported data files
└── logs/ Application logs
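Sources configured with `pagination: type: offset` are collected by fetching pages until a short page signals the end of the dataset. A self-contained sketch of that loop, where the `fetch_page` callable stands in for the real HTTP layer in `src/collectors/base.py`:

```python
from typing import Callable, Iterator

def collect_offset(fetch_page: Callable[[int, int], list],
                   page_size: int = 1000) -> Iterator[dict]:
    """Yield records from an offset-paginated source.

    fetch_page(offset, limit) returns a list of records; a page
    shorter than `limit` means the dataset is exhausted.
    """
    offset = 0
    while True:
        page = fetch_page(offset, page_size)
        yield from page
        if len(page) < page_size:
            break
        offset += page_size
```

In the real collectors, `fetch_page` would issue the HTTP request (with the rate limiting and retries the base collector provides); here it can be any callable, which also makes the loop easy to test against an in-memory list.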
- Docker Desktop 24+ (or Docker Engine + Compose plugin v2)
- `.env` file configured — copy and edit `.env.example` first
Runs the app with a local SQLite database stored in ./data/. No external services needed.
cp .env.example .env
# Edit .env: set FLASK_SECRET_KEY and any API tokens you want
# Ensure DATABASE_URL is set to SQLite (default):
# DATABASE_URL=sqlite:///data/cannabis_aggregator.db
# Build and start
docker compose up --build -d
# First-run: initialize database and load sources
docker compose exec app python main.py --mode setup
docker compose exec app python main.py --mode seed

Open http://localhost:5000 in your browser.
Starts the app plus a MySQL 8.0 container using the mysql profile.
# In .env, configure:
# DATABASE_URL=mysql+pymysql://cannabis:Passw0rd@db:3306/cannabis_data
# MYSQL_ROOT_PASSWORD=your-root-password
# MYSQL_DATABASE=cannabis_data
# MYSQL_USER=cannabis
# MYSQL_PASSWORD=your-password
docker compose --profile mysql up --build -d
# Wait ~10 s for MySQL to be ready, then initialize:
docker compose exec app python main.py --mode setup
docker compose exec app python main.py --mode seed

# Follow application logs
docker compose logs -f app
# Run a manual collection of all enabled sources
docker compose exec app python main.py --mode collect --all
# Collect a specific source
docker compose exec app python main.py --mode collect --source co_med_licensees
# Export data
docker compose exec app python scripts/export_data.py --format csv
# Open a shell in the container
docker compose exec app bash
# Restart the app (picks up .env changes)
docker compose restart app
# Stop all services (keeps volumes)
docker compose down
# Stop and destroy all data (irreversible)
docker compose down -v

| Mount | Purpose |
|---|---|
| `./data` → `/app/data` | Database file, exports, raw & processed data |
| `./logs` → `/app/logs` | Rotating application log files |
| `./config` → `/app/config` | `sources.yaml`, `schedules.yaml` — editable live |
| `mysql_data` (named) | MySQL data directory (persists across restarts) |
Tip: Because `./config` is bind-mounted, you can edit `config/sources.yaml` and add new data sources without rebuilding the image. Just `docker compose restart app` to reload.
# Build
docker build -t cannabis-aggregator .
# Run (SQLite, data persisted to host ./data)
docker run -d \
--name cannabis-aggregator \
-p 5000:5000 \
-v "$(pwd)/data:/app/data" \
-v "$(pwd)/logs:/app/logs" \
-v "$(pwd)/config:/app/config" \
--env-file .env \
  cannabis-aggregator

| Setting | Recommendation |
|---|---|
| `FLASK_SECRET_KEY` | Generate with `openssl rand -hex 32` — never use the default |
| `FLASK_ENV` | Set to `production` |
| `FLASK_DEBUG` | Set to `false` |
| Database | Use MySQL (or PostgreSQL via `DATABASE_URL=postgresql://...`) for multi-user deployments |
| Reverse proxy | Place nginx or Caddy in front — expose only port 5000 internally |
| TLS | Terminate SSL at the reverse proxy, not in Flask |
| Updates | `docker compose pull && docker compose up --build -d` |
A cannabis open-data acquisition dashboard and website that automates the collection and display of publicly available cannabis information from state and federal government data sources.