Project:
Website: ${AUTHOR_WEBSITE}
Contact: ${AUTHOR_EMAIL}
Automated collection and management of US cannabis open data from state and federal APIs. Aggregates dispensary locations, license data, sales figures, grower/processor records, and more into a unified database — with a web dashboard for management and export.
- 50+ pre-configured data sources across ~20 states + federal agencies
- Multiple formats: Socrata SODA, JSON REST API, CSV, GeoJSON
- Automated scheduling with cron and interval-based jobs (APScheduler)
- Web dashboard for monitoring, managing sources/schedules, and browsing data
- Map view of GPS-tagged records (Leaflet.js)
- Export: CSV, JSON, GeoJSON, Excel (multi-sheet by category)
- REST API for programmatic access to all collected data
- Admin-configurable: add/remove sources, change schedules, toggle sources
- Hash-based deduplication to avoid storing duplicate records
- SQLite by default; PostgreSQL or MySQL supported via DATABASE_URL
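The hash-based deduplication listed above can be sketched as follows. This is an illustrative approach, not necessarily the project's exact implementation: a stable hash over a record's canonical JSON form identifies records that were already stored.

```python
import hashlib
import json

def record_hash(record: dict) -> str:
    """Stable SHA-256 over a record's sorted key/value pairs.

    Illustrative only: the real project may hash a different subset
    of fields or normalize values before hashing.
    """
    canonical = json.dumps(record, sort_keys=True, separators=(",", ":"))
    return hashlib.sha256(canonical.encode("utf-8")).hexdigest()

seen = set()

def is_duplicate(record: dict) -> bool:
    """Insert-time check: True if an identical record was already seen."""
    h = record_hash(record)
    if h in seen:
        return True
    seen.add(h)
    return False
```

Because keys are sorted before hashing, the same record always produces the same hash regardless of field order in the source response.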
python -m venv .venv
source .venv/bin/activate # Windows: .venv\Scripts\activate
pip install -r requirements.txt

cp .env.example .env
# Edit .env — at minimum set your Socrata app tokens for better rate limits

python main.py --mode setup
python main.py --mode seed

# Dashboard + background scheduler (recommended)
python main.py
# Or with make:
make run

Open http://localhost:5000 in your browser.
python main.py # Dashboard + Scheduler (default)
python main.py --mode dashboard # Web dashboard only
python main.py --mode scheduler # Background scheduler only
python main.py --mode setup # Initialize database
python main.py --mode seed # Load sources/schedules from YAML
python main.py --mode seed --force # Re-seed, overwriting existing
# Trigger collection manually
python main.py --mode collect --source co_med_licensees
python main.py --mode collect --all
python main.py --mode collect --all --state CO
python main.py --mode collect --all --category dispensary

# Direct script access
python scripts/setup_db.py --check # DB health check
python scripts/seed_sources.py --dry-run # Preview seed
python scripts/seed_sources.py --sources-only --force # Re-seed sources only
python scripts/run_collector.py --list # List enabled sources
python scripts/run_collector.py --source co_med_licensees
python scripts/run_collector.py --all --state WA
# Export data
python scripts/export_data.py --format csv
python scripts/export_data.py --format geojson --state CO
python scripts/export_data.py --format xlsx --category dispensary
python scripts/export_data.py --format json --output my_export.json --limit 50000

make install        # pip install -r requirements.txt
make setup # Initialize database
make seed # Seed sources and schedules
make run # Start dashboard + scheduler
make dashboard # Dashboard only
make scheduler # Scheduler only
make collect # Collect all enabled sources
make collect SOURCE=co_med_licensees # Collect specific source
make export # Export to CSV
make clean          # Remove cached files

| Variable | Default | Description |
|---|---|---|
| `DATABASE_URL` | `sqlite:///data/cannabis_aggregator.db` | Database connection |
| `FLASK_HOST` | `0.0.0.0` | Dashboard host |
| `FLASK_PORT` | `5000` | Dashboard port |
| `FLASK_SECRET_KEY` | `dev-secret-key-...` | Session secret |
| `FLASK_DEBUG` | `false` | Debug mode |
| `SCHEDULER_TIMEZONE` | `America/Chicago` | Cron timezone |
| `SCHEDULER_MAX_WORKERS` | `5` | Concurrent collection threads |
| `LOG_LEVEL` | `INFO` | Logging level |
| `CO_APP_TOKEN` | — | Colorado Socrata app token |
| `WA_APP_TOKEN` | — | Washington Socrata app token |
| (see `.env.example` for all) | | |
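A minimal sketch of how these variables might be read at startup. The `get_setting` helper is an assumption for illustration, not the project's actual config module; the defaults mirror the table above.

```python
import os

# Defaults mirror the environment-variable table above.
DEFAULTS = {
    "DATABASE_URL": "sqlite:///data/cannabis_aggregator.db",
    "FLASK_HOST": "0.0.0.0",
    "FLASK_PORT": "5000",
    "SCHEDULER_TIMEZONE": "America/Chicago",
    "SCHEDULER_MAX_WORKERS": "5",
    "LOG_LEVEL": "INFO",
}

def get_setting(name: str) -> str:
    """Return the environment value if set, else the documented default."""
    return os.environ.get(name, DEFAULTS.get(name, ""))
```

In practice a `.env` loader (such as `python-dotenv`) would populate `os.environ` before these lookups run.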
Defines all data sources. Key fields per source:
- id: co_med_licensees              # Unique identifier
  name: "Colorado MED Licensees"
  state: CO
  agency: Colorado MED
  category: licensee
  format: soda                      # soda | json | csv | geojson
  url: https://data.colorado.gov/resource/sqs8-2una.json
  enabled: true
  api_key_env: CO_APP_TOKEN         # Optional env var for auth
  pagination:
    type: offset                    # offset | page | cursor | link
    page_size: 1000
  field_mapping:                    # Maps source fields → standard schema
    name: licensee_name
    license_number: license_no
    address: street_address
    city: city
    state: state
    zip_code: zip
    latitude: latitude
    longitude: longitude

- id: sched_co_med_weekly
  name: "Colorado MED Licensees - Weekly"
  source_id: co_med_licensees
  enabled: true
  schedule_type: cron
  cron:
    minute: 0
    hour: 2
    day_of_week: sun                # Every Sunday at 2:00 AM
  priority: 2

| Source | Format | Category |
|---|---|---|
| USDA AMS Hemp Producers | CSV | Hemp |
| DEA Registrant Locations | JSON | Pharmacy |
| ProPublica Congress API | JSON | Legislation |
| FDA NDC Drug Products | JSON | Pharmacy |
| State | Agency | Data |
|---|---|---|
| Colorado | MED | Licensees, Sales, Market Rates |
| Washington | WSLCB | Licensees, Sales, Violations |
| Oregon | OLCC | Licensees, GeoJSON Dispensaries |
| California | DCC | Licensees |
| Oklahoma | OMMA | Dispensaries, Growers, Processors, Transporters |
| Illinois | IDFPR | Cannabis Licenses, Monthly Sales |
| Massachusetts | CCC | Licensees, Weekly Sales |
| Michigan | CRA | Licenses, Sales |
| New York | OCM | All Licenses, Dispensaries |
| New Jersey | CRC | Licenses |
| Alaska | AMCO | License Database |
| Connecticut | DCP | Cannabis Licenses |
| DC | ABCA | Cannabis Licenses |
| New Mexico | RLD | Cannabis Licenses |
| (+ 10 more states) | | |
| Source | Format | Notes |
|---|---|---|
| OpenStreetMap/Overpass | GeoJSON | Free dispensary POI data |
| NCSL Cannabis Laws | JSON | State law tracker |
| URL | Description |
|---|---|
| `/` | Dashboard overview with stats and charts |
| `/sources` | Manage data sources (add, edit, toggle, run) |
| `/schedules` | Manage collection schedules |
| `/data` | Browse collected records with filters |
| `/data/map` | Leaflet map of GPS-tagged locations |
| `/data/logs` | Collection run logs |
| `/data/exports` | Export data + API documentation |
| `/data/settings` | App settings |
Base URL: http://localhost:5000/api
GET /api/records Paginated records (filters: state, category, source_id, has_gps, search)
GET /api/records/{id} Single record
GET /api/records/geojson GeoJSON FeatureCollection of GPS records
GET /api/records/export File download (format=csv|json|geojson)
GET /api/sources List sources
POST /api/sources Create source
PUT /api/sources/{id} Update source
POST /api/sources/{id}/toggle Enable/disable
POST /api/sources/{id}/run Trigger collection now
GET /api/schedules List schedules
POST /api/schedules Create schedule
PUT /api/schedules/{id} Update schedule
POST /api/schedules/{id}/toggle Enable/disable
GET /api/runs Collection run history
GET /api/logs Log entries
GET /api/stats/categories Record counts by category
GET /api/stats/states Record counts by state
POST /api/scheduler/sync Sync scheduler jobs from DB
POST /api/seed Seed from YAML config
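For programmatic access, the query string for `GET /api/records` can be assembled like this. The filter names come from the endpoint list above; the helper itself is a hypothetical convenience, not part of the project.

```python
from urllib.parse import urlencode

BASE_URL = "http://localhost:5000/api"

def records_url(state=None, category=None, source_id=None,
                has_gps=None, search=None):
    """Build the /api/records URL, omitting any filters left unset."""
    filters = {"state": state, "category": category, "source_id": source_id,
               "has_gps": has_gps, "search": search}
    query = urlencode({k: v for k, v in filters.items() if v is not None})
    return f"{BASE_URL}/records?{query}" if query else f"{BASE_URL}/records"
```

For example, `records_url(state="CO", category="dispensary")` yields a URL you can fetch with any HTTP client.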
- Add to `config/sources.yaml`:

      - id: my_state_licenses
        name: "My State Cannabis Licenses"
        state: XX
        agency: My State Agency
        category: licensee
        format: soda              # or csv, json, geojson
        url: https://data.mystate.gov/resource/xxxx-xxxx.json
        enabled: true
        pagination:
          type: offset
          page_size: 1000
        field_mapping:
          name: business_name
          license_number: license_id
          city: city
          state: state_code

- Add a schedule to `config/schedules.yaml`:

      - id: sched_my_state_weekly
        name: "My State Licenses - Weekly"
        source_id: my_state_licenses
        enabled: true
        schedule_type: cron
        cron:
          minute: 0
          hour: 3
          day_of_week: sun

- Seed the database:

      python main.py --mode seed    # or in the dashboard: Settings → Seed Sources from YAML

- Test with a manual collection:

      python scripts/run_collector.py --source my_state_licenses
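The `field_mapping` block is the key piece: it renames source columns into the standard schema. A minimal sketch of that normalization step follows; it is illustrative only, and the project's actual normalizer in `src/processors/normalizer.py` may do more (type coercion, whitespace trimming, etc.).

```python
def apply_field_mapping(raw: dict, mapping: dict) -> dict:
    """Map raw source fields onto the standard schema.

    `mapping` is {standard_field: source_field}, as in sources.yaml.
    Missing source fields become None so downstream code always sees
    a uniform record shape.
    """
    return {std: raw.get(src) for std, src in mapping.items()}

mapping = {
    "name": "business_name",
    "license_number": "license_id",
    "city": "city",
    "state": "state_code",
}

raw = {"business_name": "Green Leaf LLC", "license_id": "XX-123",
       "city": "Springfield", "state_code": "XX", "extra": "dropped"}

normalized = apply_field_mapping(raw, mapping)
# normalized == {"name": "Green Leaf LLC", "license_number": "XX-123",
#                "city": "Springfield", "state": "XX"}
```

Fields not listed in the mapping (like `extra` above) are dropped, which keeps the unified database schema consistent across sources.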
cannabis-data-aggregator/
├── main.py Entry point
├── requirements.txt
├── .env.example Environment template
├── docker-compose.yml
├── Makefile
├── config/
│ ├── sources.yaml Data source definitions (50+ sources)
│ ├── schedules.yaml Collection schedules
│ └── settings.yaml Global settings
├── src/
│ ├── collectors/
│ │ ├── base.py BaseCollector (HTTP, rate limiting, retries)
│ │ ├── api_collector.py JSON REST + Socrata SODA
│ │ ├── csv_collector.py CSV/TSV with auto-encoding detection
│ │ └── geojson_collector.py GeoJSON + Overpass API
│ ├── processors/
│ │ └── normalizer.py Field mapping, standardization
│ ├── scheduler/
│ │ └── manager.py APScheduler + collection job runner
│ ├── storage/
│ │ ├── models.py SQLAlchemy models
│ │ └── database.py DB init, session management
│ └── dashboard/
│ ├── app.py Flask app factory
│ ├── routes/ Blueprint routes
│ ├── templates/ Jinja2 HTML templates
│ └── static/ CSS, JavaScript
├── scripts/
│ ├── setup_db.py Database initialization
│ ├── seed_sources.py Seed from YAML
│ ├── run_collector.py CLI collection runner
│ └── export_data.py CLI data exporter
├── data/
│ ├── raw/ Temporary raw files
│ └── processed/ Exported data files
└── logs/ Application logs
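Sources configured with `pagination: type: offset` are collected by fetching pages until a short page signals the end of the dataset. A self-contained sketch of that loop, where the `fetch_page` callable stands in for the real HTTP layer in `src/collectors/base.py`:

```python
from typing import Callable, Iterator

def collect_offset(fetch_page: Callable[[int, int], list],
                   page_size: int = 1000) -> Iterator[dict]:
    """Yield records from an offset-paginated source.

    fetch_page(offset, limit) returns a list of records; a page
    shorter than `limit` means the dataset is exhausted.
    """
    offset = 0
    while True:
        page = fetch_page(offset, page_size)
        yield from page
        if len(page) < page_size:
            break
        offset += page_size
```

In the real collectors, `fetch_page` would issue the HTTP request (with the rate limiting and retries the base collector provides); here it can be any callable, which also makes the loop easy to test against an in-memory list.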
- Docker Desktop 24+ (or Docker Engine + Compose plugin v2)
- `.env` file configured — copy and edit `.env.example` first
Runs the app with a local SQLite database stored in ./data/. No external services needed.
cp .env.example .env
# Edit .env: set FLASK_SECRET_KEY and any API tokens you want
# Ensure DATABASE_URL is set to SQLite (default):
# DATABASE_URL=sqlite:///data/cannabis_aggregator.db
# Build and start
docker compose up --build -d
# First-run: initialize database and load sources
docker compose exec app python main.py --mode setup
docker compose exec app python main.py --mode seed

Open http://localhost:5000 in your browser.
Starts the app plus a MySQL 8.0 container using the mysql profile.
# In .env, configure:
# DATABASE_URL=mysql+pymysql://cannabis:Passw0rd@db:3306/cannabis_data
# MYSQL_ROOT_PASSWORD=your-root-password
# MYSQL_DATABASE=cannabis_data
# MYSQL_USER=cannabis
# MYSQL_PASSWORD=your-password
docker compose --profile mysql up --build -d
# Wait ~10 s for MySQL to be ready, then initialize:
docker compose exec app python main.py --mode setup
docker compose exec app python main.py --mode seed

# Follow application logs
docker compose logs -f app
# Run a manual collection of all enabled sources
docker compose exec app python main.py --mode collect --all
# Collect a specific source
docker compose exec app python main.py --mode collect --source co_med_licensees
# Export data
docker compose exec app python scripts/export_data.py --format csv
# Open a shell in the container
docker compose exec app bash
# Restart the app (picks up .env changes)
docker compose restart app
# Stop all services (keeps volumes)
docker compose down
# Stop and destroy all data (irreversible)
docker compose down -v

| Mount | Purpose |
|---|---|
| `./data` → `/app/data` | Database file, exports, raw & processed data |
| `./logs` → `/app/logs` | Rotating application log files |
| `./config` → `/app/config` | `sources.yaml`, `schedules.yaml` — editable live |
| `mysql_data` (named) | MySQL data directory (persists across restarts) |
Tip: Because `./config` is bind-mounted, you can edit `config/sources.yaml` and add new data sources without rebuilding the image. Just `docker compose restart app` to reload.
# Build
docker build -t cannabis-aggregator .
# Run (SQLite, data persisted to host ./data)
docker run -d \
--name cannabis-aggregator \
-p 5000:5000 \
-v "$(pwd)/data:/app/data" \
-v "$(pwd)/logs:/app/logs" \
-v "$(pwd)/config:/app/config" \
--env-file .env \
  cannabis-aggregator

| Setting | Recommendation |
|---|---|
| `FLASK_SECRET_KEY` | Generate with `openssl rand -hex 32` — never use the default |
| `FLASK_ENV` | Set to `production` |
| `FLASK_DEBUG` | Set to `false` |
| Database | Use MySQL (or PostgreSQL via `DATABASE_URL=postgresql://...`) for multi-user deployments |
| Reverse proxy | Place nginx or Caddy in front — expose only port 5000 internally |
| TLS | Terminate SSL at the reverse proxy, not in Flask |
| Updates | `docker compose pull && docker compose up --build -d` |
A cannabis open-data acquisition dashboard and website that automates the collection and display of publicly available cannabis information from state and federal government data sources.