Attribution

This repository makes use of and builds on external code from the following sources:

- `datasets/mosaiks`: Adapted from the Global Policy Lab's mosaiks-paper repository, which provides code for feature extraction and dataset handling using MOSAIKS features.
- `sampling/`: Adapted from TypiClust's deep-al module, which includes implementations for active learning and sampling methods.

We thank the authors of these repositories for making their code available.
This repository contains a complete data processing pipeline for optimized sampling analysis across three datasets:
- USAVars
- India SECC
- Togo soil fertility (Note: not yet publicly released)
- Data Download
- Featurization
- Train-Test Split
- Generate GeoDataFrames
- Group Creation
- Initial Sampling
- Running Sampling
- Bash Scripts
- Results and Analysis
- Contributing
USAVars: download using torchgeo.
- Docs: TorchGeo USAVars Dataset
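If you prefer to download programmatically, a minimal sketch using torchgeo's built-in `USAVars` dataset class might look like the following; the root path and label selection are placeholders to adapt to your setup:

```python
from torchgeo.datasets import USAVars

# Download USAVars imagery and labels into a local root directory.
# Root path and label choice are placeholders; adjust to your setup.
ds = USAVars(
    root="/path/to/your/data/root",
    split="train",
    labels=["treecover", "population"],
    download=True,
)
print(len(ds))
```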
India SECC: process/download from this repository: https://github.com/emilylaiken/satellite-fairness-replication
- `mosaiks_features_by_shrug_condensed_regions_25_max_tiles_100_india.csv`
  - Description: Precomputed MOSAIKS features (4000-dim)
  - Columns:
    - `condensed_shrug_id`: Unique ID per unit
    - `Feature0` to `Feature3999`: Satellite features
- `grouped.csv`
  - Description: Contains labels
  - Columns:
    - `condensed_shrug_id` (matching above)
    - `secc_cons_pc_combined`: Target variable
- `villages_with_regions.shp`
  - Description: Shapefile with spatial polygons
  - Columns:
    - `condensed_`: Will be renamed to `condensed_shrug_id`
    - `geometry`: Polygon geometries
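As a rough sketch of how these three files fit together (file locations are placeholders; column names follow the descriptions above), the features, labels, and geometries can be joined on the shared unit ID:

```python
import pandas as pd
import geopandas as gpd

# Paths are placeholders; point them at wherever the raw India SECC files live.
features = pd.read_csv("mosaiks_features_by_shrug_condensed_regions_25_max_tiles_100_india.csv")
labels = pd.read_csv("grouped.csv")
villages = gpd.read_file("villages_with_regions.shp").rename(
    columns={"condensed_": "condensed_shrug_id"}
)

# Join features, labels, and polygons on the shared unit identifier.
merged = features.merge(labels, on="condensed_shrug_id").merge(
    villages[["condensed_shrug_id", "geometry"]], on="condensed_shrug_id"
)
print(merged.shape)
```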
Togo: not yet available. Will be released by the Togolese Ministry of Agriculture.
Run:
cd datasets
python featurization.py \
--dataset_name USAVars \
--data_root /path/to/your/data/root \
--labels treecover,population \
--num_features 4096
Follow instructions at: satellite-fairness-replication
- Save features as a `.pkl` file (dict format) with keys: `'X'`, `'ids_X'`, and `'latlon'`.
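For illustration only, the expected dictionary layout of that `.pkl` file looks roughly like this; the array shapes and file name below are placeholders:

```python
import pickle
import numpy as np

# Placeholder arrays illustrating the expected layout; the real arrays come
# from the satellite-fairness-replication pipeline.
features = {
    "X": np.zeros((10, 4000), dtype=np.float32),          # MOSAIKS features, one row per unit
    "ids_X": np.array([f"unit_{i}" for i in range(10)]),  # unit identifiers
    "latlon": np.zeros((10, 2)),                          # latitude/longitude per unit
}

with open("india_secc_features.pkl", "wb") as f:
    pickle.dump(features, f)
```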
Run `format_data.py`:
cd datasets
python format_data.py \
--save \
--label population \
--feature_path /path/to/featurized/data/CONTUS_UAR_torchgeo4096.pkl  # use another label or the India feature file as needed
Creates an 80/20 split, saved as a `.pkl` file with keys `{X, y, latlon}_{train, test}`.
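To sanity-check the output, you can load the split file and inspect its keys; the path below is a placeholder:

```python
import pickle

with open("/path/to/featurized/data/split.pkl", "rb") as f:  # placeholder path
    split = pickle.load(f)

# Expect X_train, X_test, y_train, y_test, latlon_train, and latlon_test.
for key, value in split.items():
    print(key, getattr(value, "shape", None))
```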
GeoDataFrames are used for clusters and region-based sampling strategies.
First, download US county shape files from census.gov (Here we use 2015 shape files):
cd groups
python usavars_generate_gdfs.py \
--labels population,treecover \
--input_folder ../0_data/features/usavars \
--year 2015 \
--county_shp ../0_data/boundaries/us/us_county_2015 \
--output_dir ./admin_gdfs/usavars
Admin levels are pre-included in the shapefile. No processing needed.
Use:
python togo_generate_gdfs.py
- Admin Groups: States, regions
- Image Groups: Feature-based KMeans clustering
- NLCD Groups: Land cover classes (U.S. only)
Generate county-level groups for a single dataset (US population):
python generate_groups.py \
--datasets "usavars_pop" \
--gdf_paths "../0_data/admin_gdfs/usavars/gdf_counties_population_2015.geojson" \
--id_cols "id" \
--shape_files "../0_data/boundaries/us/us_county_2015" \
--group_type "counties" \
--group_cols "combined_county_id"
To run for other datasets (`usavars_tc`, `india_secc`, `togo`), replace the `--datasets`, `--gdf_paths`, and `--id_cols` arguments accordingly.
Note: the following groups should be made:
- USAVars: `state`, `county`
- India SECC: `state`, `district`
- Togo: `region`, `cantons`
The `image_clusters.py` script generates clusters from image features.
python image_clusters.py \
--dataset USAVars \
--num_clusters 8 \
--feature_path /path/to/features.pkl \
--output_path /path/to/save/clusters_{num_clusters}.pkl
# --dataset: USAVars, India SECC, or Togo; --num_clusters: 8 or 3
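Conceptually, the clustering step amounts to running KMeans on the feature matrix and saving one cluster label per unit. A minimal sketch follows, assuming the feature `.pkl` format described above; the exact output format of `image_clusters.py` may differ:

```python
import pickle
from sklearn.cluster import KMeans

# Load the featurized data (placeholder path; same dict format as above).
with open("/path/to/features.pkl", "rb") as f:
    data = pickle.load(f)

# Cluster units in MOSAIKS feature space (8 clusters here, matching the command above).
kmeans = KMeans(n_clusters=8, random_state=0, n_init=10)
cluster_labels = kmeans.fit_predict(data["X"])

# Save a mapping from unit ID to cluster label.
with open("/path/to/save/clusters_8.pkl", "wb") as f:
    pickle.dump(dict(zip(data["ids_X"], cluster_labels)), f)
```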
- Download the 2016 NLCD TIFF from: https://earthexplorer.usgs.gov/
- Run `nlcd_groups.py` with the following required arguments:
  - `--input_dir`: Directory containing NAIP image files
  - `--nlcd_path`: Path to the existing NLCD .tif raster file
  - `--output_dir`: Directory to save output files
  - `--dataset_name`: Dataset name for output filenames (default: "usavars")
  - `--file_pattern`: File pattern to match in the input directory (default: "*.tif")
  - `--k_min`: Minimum number of clusters to test (default: 2)
  - `--k_max`: Maximum number of clusters to test (default: 10)

Specifically, make sure k = 8.
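For reference, an example invocation might look like the following. All paths are placeholders, and pinning both `--k_min` and `--k_max` to 8 is one way to force k = 8, assuming the script selects k within that range:

python nlcd_groups.py \
--input_dir /path/to/naip/images \
--nlcd_path /path/to/nlcd_2016.tif \
--output_dir /path/to/output/groups \
--dataset_name usavars \
--k_min 8 \
--k_max 8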
Use Jupyter Notebooks:
usavars_initial_sample.ipynb
india_secc_initial_sample.ipynb
togo_initial_sample.ipynb
Save the outputs as `.pkl` files to: `0_data/initial_samples/{dataset}/...`
Attention: to solve the optimization problem, cvxpy is run with the MOSEK solver. You need a license to use MOSEK, which can be requested from https://www.mosek.com/products/academic-licenses/.
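To verify that cvxpy can actually find your MOSEK license (typically placed at `$HOME/mosek/mosek.lic` or pointed to by `MOSEKLM_LICENSE_FILE`), a tiny toy problem is enough:

```python
import cvxpy as cp

# Toy problem purely to confirm the MOSEK solver and license are available.
x = cp.Variable()
problem = cp.Problem(cp.Minimize(cp.square(x - 1)))
problem.solve(solver=cp.MOSEK)
print(problem.status, x.value)
```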
Edit config files in:
sampling/configs/{dataset}/*.yaml
Set correct paths, especially `DATASET.ROOT_DIR`.
Example usage:
python train.py \
--cfg ../configs/usavars/RIDGE_POP.yaml \
--exp-name experiment_1 \
--sampling_fn greedycost \
--budget 1000 \
--initial_set_str test \
--seed 42 \
--unit_assignment_path ../../0_data/groups/usavars_pop/counties_assignments.pkl \
--id_path '../../0_data/initial_samples/usavars/population/cluster_sampling/fixedstrata_Idaho_16-Louisiana_22-Mississippi_28-New Mexico_35-Pennsylvania_42/sample_state_combined_county_id_5ppc_150_size_seed_1.pkl' \
--cost_func uniform \
--cost_name uniform \
--group_assignment_path ../../0_data/groups/usavars_pop/states_assignments.pkl \
--group_type states
| Argument | Description |
|---|---|
| `--cost_array_path` | Path to a NumPy cost array |
| `--unit_assignment_path` | Path to unit assignment file |
| `--region_assignment_path` | Path to region assignment file |
| `--util_lambda` | Utility lambda for optimization |
| `--alpha` | Alpha parameter for cost-based sampling |
| `--points_per_unit` | For unit-based sampling |
run_india.sh
run_togo.sh
run_usavars_population.sh
run_usavars_treecover.sh
run_india_rep_states.sh
run_togo_rep_regions.sh
run_usavars_population_rep_states.sh
run_usavars_treecover_rep_states.sh
run_india_rep_image_8.sh
run_togo_cluster_rep_image_8.sh
run_usavars_population_rep_image_8.sh
run_usavars_treecover_rep_image_8.sh
(U.S. only)
run_usavars_population_rep_nlcd.sh
run_usavars_treecover_rep_nlcd.sh
run_india_cluster_multiple.sh
run_togo_cluster_multiple_initial_set.sh
run_togo_cost_diff.sh
Go to the `summarize/` directory.
python parse_out_log.py --multiple True # or False
python generate_latex_table.py
python plot_multiple_initial_set.py
python plot_alpha.py
- Make sure all data paths in config files are set correctly.
- Check that required `.pkl` files (features, splits, groups) exist.