A collection of links to support the production of modern development data products made from both conventional (survey, census) and big data sources (satellite, mobile, text).
- IPUMS - census and survey data from around the world
- Open Knowledge Foundation Open Data Portal List
- The UN Data Portal
- AWS registry of Open Data
- World Bank micro data catalog
- World Bank Open Data
- US gov open data portal
- European Union data portal
- AWS data exchange for discovery of third party data
- Google Data Search optimized for data discovery
- IBM data asset xchange for enterprise data
- Open Academic Graph links 3 billion academic graphs
- Carto Data Explorer package to discover satellite and other public data sources
- Kaggle Data Sets https://www.kaggle.com/datasets
- UCI machine learning library https://archive.ics.uci.edu/ml/index.php
- Open Street Map crowd-sourced geospatial information
- World Pop demographic and population data
- Academic Torrents is a repository of research data
- Google Big Query Public Data Catalog
- Development Data Lab SHRUG high resolution admin open data for India
- Development Data Partnerships, data sharing with World Bank, IMF, IDB
- DCAT Catalog Vocabulary Standards
- Radiant Hub Python client
- Harvard dataverse repository of data and code
- Geo4Dev repository of geospatial data and tutorials
- Figshare is a repository for researchers to share data and work
- HDX [Common Operational Data Sets](https://data.humdata.org/dashboards/cod)
- Coleridge Rich Context initiative promotes data discovery with API and knowledge graph to link data, code, papers
- Open Buildings is an [open data repository of building footprints](https://sites.research.google/open-buildings/)

Mobile Location Data Providers (GPS)

- [SafeGraph](https://www.safegraph.com/)
- Place data for public good
- [Veraset](https://www.veraset.com/)
- [Unacast](https://www.unacast.com/)
- [Foursquare](https://foursquare.com/)
- [Placer](https://www.placer.ai/)
- [Mapbox Location Telemetry](https://www.mapbox.com/telemetry)
- [Cuebiq](https://www.cuebiq.com/)
- Figure 8 is a service for data labeling and data annotation
- Amazon Mechanical Turk tutorial on collecting data with MTurk, and labelling data with MTurk and SageMaker Ground Truth
- Hive Data service for data labeling
- AWS SageMaker Ground Truth data labeling, tasking service
- Azavea groundwork imagery data labelling
- Prolific platform for recruiting participants for data collection
- Samasource microtasking and labeling service
- RapidPro Surveyor tool for data collection
- Facebook Marketing API for data collection tutorial
- Premise is a crowdsourced network for ground insights
- Native Data is a crowd network for local insights
- QField ground data collection tool
- Observe is an Open Street Map collection and validation tool from Development Seed and Radiant Solutions (GitHub repo)
- Understanding ground survey requirements for Remote Sensing Measurements
- Kobotoolbox is an open source suite of tools for data collection
- MapSwipe is a smartphone app to crowdsource labelling of critical infrastructure and populations
- Fieldscope is a map-based data collection tool for citizen monitoring of environmental and social change
- Open Data Kit is a data collection tool suite
- Groundwork tool by Azavea for labeling imagery
- Label Maker data prep for satellite machine learning
- Cleanlab flags errors in data sets and helps to clean them up
- Overture provides open interoperable map data
- Data mesh standard for data catalogs
- ARIES for SEEA is for Natural Capital Accounting
- Radiant Earth guide on Geo-referencing ground data for ML
- Tidy Data by Hadley Wickham provides guidance on data structure and cleaning for integration
- Grid based sampling design framework for household surveys
- GeoStat guidance on spatial grid statistics
- H3 hexagonal hierarchical geospatial indexing system
- NIST Big Data Interoperability Framework (NBDIF)
- Analysis Ready Data defined
- IPUMS Census and Survey documentation
- Data Documentation Initiative (DDI) specification for documenting survey and observational study data
- Dublin Core Metadata Initiative provides good practice on documentation of data
- Schema.org is a specification to document data for search indexing
- SDMX global initiative to improve statistical metadata exchange
- G-DIF Geostatistical Integration Framework to combine different sources of spatial information
- Guide for geo-referencing ground data for machine learning
- GeoJSON format for encoding variety of geographic data structures
- Cloud Optimized GeoTIFF (COG) is a format for cloud-native processing
- STAC standard to catalog ecosystem of geospatial assets
- rio-cogeo is a plug-in for rasterio to publish COGs
- GRID3 framework facilitates the projection of satellite, survey, and census spatial data into a gridded format
- Downscaling tutorial for granular measurements
- Sentinel Hub Earth Observation, eo-learn Python framework for machine learning
- Cumulus supports cloud-native processing for EOSDIS
- Create and use COG Mosaics Cog-geo
- Multi-resolution approximation approach for [spatial-temporal modeling](https://www.sciencedirect.com/science/article/pii/S2211675320300592?via%3Dihub)
- CoData DDI interoperability initiative for core data
- Practical tools for designing and weighting survey samples
- Federal Geographic Data Committee geospatial standards
- SAR Data Labelling Guide
- Paris21 Guide for Geospatial Data Integration for Official Statistics
- GeoParquet is a specification for storing geospatial vector data
- Earth on AWS repository of satellite data sets
- Microsoft Planetary Computer
- USGS earth explorer to search for public satellite data
- Soar is an atlas of imagery from satellites and drones
- Copernicus Open Access Hub provides free access to Sentinel sensors
- IADF GRSS Search Catalog of reference and benchmark data sets
- Image Hunter to search for commercial satellite imagery
- NASA Earth Data Search [indexes data from over 50 sensors](https://search.earthdata.nasa.gov/search)
- Maxar Open Data program
- ExactEarth is a platform for AIS data for vessel monitoring
- EOS LandViewer is a free GIS database that gives access to the most widely used satellite imagery
- Wikisatnet is a training data repository to perform large scale pre-training on satellite imagery
- Functional Map of the World Data [satellite training data set](https://github.com/fMoW/dataset)
- Planet Explorer portal to explore Planet and public data
- Global Change Master Directory enhanced search for data and tools for earth observation
- Carto Observatory is an enterprise catalog of curated data sets
- Remote Pixel is a search service for limited set of satellite data sets
- Awesome Satellite Data sets repository
- SpaceNet is a corpus of commercial satellite and labelled imagery for machine learning research
- Sentinel Hub is a cloud API for satellite imagery
- PopGrid Collaborative is a collaborative for settlement and population data gridded products
- ESRI Living Atlas of geographic information
- State of Satellites knowledge product to compare characteristics of sensors
- Radiant Earth labelled [Open Library for Machine Learning](https://mlhub.earth/?utm_source=Radiant+Newsletter&utm_campaign=3bfe028ab1-March%2FApril+2018+Newsletter_COPY_01&utm_medium=email&utm_term=0_bb6bbe767b-3bfe028ab1-98785447)
- GeoNetwork is a catalog application to publish geospatial data
- Radiant Earth MLHub hosts training data sets with API
- bird.i is an aggregator platform for high resolution satellite imagery from leading providers
- Mapillary street view [Vista data set](https://www.mapillary.com/dataset/vistas?pKey=aFWuj_m4nGoq3-tDz5KAqQ)
- Open Street Map [OSM street map tool](https://observablehq.com/d/176fbd0640a04220)
- AWS Sagemaker platform
- A python framework for Synthetic Aperture Radar data processing
- eo-Learn package for satellite image processing in Python
- Solaris open source machine learning library for geospatial imagery
- Rastervision open source framework for deep learning on satellite imagery
- Pangeo community platform for big data science
- Robosat library feature extraction tool for extracting geospatial OSM features
- Point Data Abstraction Library PDAL
- Sentinel Toolboxes
- Orfeo toolbox open source processing of satellite imagery
- Open Source Geospatial content management
- SNAP common architecture for Sentinel toolboxes
- Raster Foundry platform to extract features from satellite imagery
- GeoPandas geospatial library
- GeosPy package for geospatial inference
- GeoMesa is an open source suite of tools that enables large-scale geospatial querying and analytics on distributed computing systems
- Chip-n-scale is a queue arranger that helps run machine learning models on satellite imagery at scale
- Google model search to find optimal models
- Foot is a tool by WorldPop to extract building footprints from satellite imagery
- Spatial Satellite feature Processing python tool
- PyViz site contains python tools for data visualization
- YOLOv3 trains classifiers using a bounding box approach
- xView is an overhead imagery training data set
- Vector Pipe is an open source library for working with OSM and writing geometries to vector tile areas
- Descartes Lab provides platform for satellite data processing and modeling tools
- GeoTrellis is an open source library for geo-processing
- RasterVision is an open source framework for deep learning on satellite imagery
- Top python libraries used in data science
- Geospatial Python Resource Guide
- IBM Open source ML model asset exchange
- Python Geospatial Collection of Python tools for Geospatial analysis and visualization
- Awesome Data Fusion
- Spatial SQL book for modern GIS practices
- World Settlement Footprint (WSF) global human settlement mask derived from Landsat and Sentinel
- Global Urban Footprint (GUF) global human settlements derived from Airbus sensors
- Grid3 bottom up gridded data sets
- WorldPop guide to gridded products and tools
- SDSN guide to gridded population products in international development
- Open AI
- Bloom
- Hugging Face is an AI developer platform with models and tools for LLMs and NLP
- ZeroGPT provides free spaces and GPU for LLM development
- LlamaIndex is a data framework for connecting custom data sources to large language models
- Weights & Biases AI developer platform
- Open BMB is a repository for Open Lab for Big Model Base
- RoBERTa builds on BERT
- BART is a sequence to sequence model
- T5 is a sequence to sequence model
- Gemma is Google's Open LLM
- Mistral 7B is an open source, efficient LLM
- PaLM is one of Google's LLMs
- LinkTransformer is a unified package for record linkage
- Ai2 Dolma from Allen AI is an Open Corpus of Data for LLM pre-training
- Jina is an open source text embedding model
- GDELT project, large collection of free online news, articles, text data
- GDELT BigQuery event database
- Practical guide on text analytics with Python Keras
- Paper on framework for [massive language sentence embeddings](https://arxiv.org/abs/1812.10464)
- BloombergGPT is an LLM for finance
- Trends in NLP medium post
- Paper on "Robust Sparse methods on de-anonymization of large data sets"
- Good Practices for collecting online data report from UK NCRM
- Good Practices for NLP with examples repository
- Albert
- INCEpTION platform is a semantic annotation and knowledge management tool
- Stanza, Stanford NLP library
- GPT-2 large transformer-based language model
- Microsoft Project Turing Natural Language Generation (T-NLG) model
- GLUE - Benchmark framework for performance of NLP models
- The Pile is a web data set for training text models
- LAION-5B is an open image-text data set
- T5 is a text transformer unified model
- Factiva content API
- Snorkel for programmatically building training data
- Tutorial on using Facebook API for demographic research
- Kaggle text data sets
- Awesome NLP curated list of NLP data
- NLTK python library for NLP
- Apache Open NLP is a machine learning toolkit for NLP
- Newspaper3K is a python package for text scraping
- Newsapi is an API to search news articles
- Social Watcher is a [Python tool to watch changes in Instagram and Twitter accounts](https://pypi.org/project/social-watcher/)
- PYSocialWatcher is a python tool for facebook data collection
- Social Marketing API data collection tutorial
- Twitter data collection and analysis tool
- Holistic Evaluation of Language Models HELM
- TimeGPT is a foundational model for time series forecasting
- Langchain is a toolset to link LLMs to real world applications
- Flowkit software toolkit from Flowminder supports mobile phone data access, management and analysis
- University of Tokyo CDR Analysis Toolkit
- GFDRR, Purdue and MindEarth developed Mobilkit algorithms and tutorials for mobile analytics
- SciKit Mobility python package for mobility analytics
- Cuebiq data is an aggregator of data from mobile smartphone apps
- Mapbox Telemetry v4 collects telemetry data: [information from map app and device](https://www.mapbox.com/telemetry/); see also the [Mapbox Movement v4 announcement](https://www.mapbox.com/blog/enhanced-data-coverage-with-mapbox-movement-v4)
- Orbital Insight foot traffic information
- Google Open Mobile developer API
- Bandicoot open source python toolkit to analyze mobile data
- Mirage traffic data collection
- OpenCellID cell tower open data
- ITU Handbook on Mobile Data Statistics
- ITU methodology guide on using big data for official statistics
- DIAL MD4D handbook
- World Bank mobility task force resources
- Cider is Python software to analyze mobile phone data
- PyTorch for Graphs
- Workshop on Graph Neural Nets
- UN Global Pulse paper on building ethics into privacy frameworks
- Checklist data project https://www.oreilly.com/radar/of-oaths-and-checklists/
- Tutorial on privacy preserving methods in PyTorch
- Paper in Nature on a model for privacy-conscientious use of mobile data
- TensorFlow library for training machine learning models with differential privacy
- Docker https://towardsdatascience.com/learn-enough-docker-to-be-useful-b7ba70caeb4b
- United Nations principles on personal data
- Global Pulse checklist for Data Science Projects
- UN Handbook for Privacy Preserving Techniques
- USAID Data Privacy Methods
- Mapbox prioritizing Privacy when using location data from maps
- Equitable Algorithms testimony to Congress from Rayid Ghani
- Online version of Privacy, Big Data, and the Public Good
- Deon is a checklist for privacy and ethics
- GSMA Covid19 privacy guidelines
- GSMA Mobile Data and Big Data Analytics privacy guidelines
- IHSN Guidance on microdata anonymization
- R CRAN statistical disclosure control methods package
- Data Science for Social Good Project Ethics Checklist
- Benchmark initiative is working on the Locus Principles for responsible location data
- UN OCHA's peer review framework for predictive analytics
- UN Ethical AI Paper
- ICRC handbook on data protection in humanitarian action
- ICDPPC Resolution on Privacy and Humanitarian action
- GSMA Mobile AI Covid Paper
- US White House Blueprint for an AI Bill of Rights
- Github guide to open source for the social sector
- Anaconda suite has a range of data science packages for R, Python, Jupyter
- Mamba is a conda-like cross-platform package manager
- Kaggle is a data science environment with tools, data and learning resources
- Cookie Cutter Data Science is a standard, flexible repo structure to support collaborative data science
- Google Earth Engine is a platform for geospatial data science
- Domino Data Lab data science tooling
- Cookie Cutter Data Science, popular package from DrivenData for data science project repos
- AMP Lab maintains an open source stack for high performance computing (BDAS)
- Docker Tutorial
- MLFlow and machine learning project governance
- Docker https://matthewdharris.com/2017/11/27/a-more-reproducible-research-with-the-liftr-package-for-r/
- Awesome List of python machine learning frameworks
- Data Science for Social Good project scoping guide
- Curated papers on applied ML with production examples mostly private sector
- Guide for making machine learning interpretable
- Triage is a general purpose risk modelling and prediction toolkit for social policy problems
- Modernization of Official Statistics HLG-MOS has working papers and projects using machine learning and wiki
- UN Global Platform marketplace for methods
- Template for machine learning design
- Apache Airflow is a platform for managing data science workflows
- Prefect is a python based platform to automate data science workflows
- Curated list of machine learning interpretability resources
- IBM Open AI toolkit
- Argo CD is a GitOps continuous delivery tool for Kubernetes
- Linux community open data license agreement
- Open Data Standards
- Creative Commons Attribution 4.0 (CC-BY 4.0)
- MIT open source license for data and data products
- Open Geospatial training Python and QGIS courseware
- Free pdf of book that has taught a generation of data scientists The Elements of Statistical Learning
- Bit by Bit, online book version of Social Research in Digital Age by Matthew Salganik, also see bit by bit teaching materials
- Reusability and Transparency in Analytics (BITSS)
- Git and Github tutorials
- DataScience repo with Data Science for Social Good learning resources
- Berkeley Initiative for Transparency in the Social Sciences (BITSS)
- ONS handbook on machine learning in imputation
- On new data sources for the production of official statistics
- USAID guide on [using ML in international development](https://www.ictworks.org/wp-content/uploads/2021/02/usaid-guide-artificial-intelligence.pdf)
- World Bank Data in Action Toolkit for data product design
- UN SDG indicators
- Python for Geosciences tutorial
- GeoStats Guy learning resources
- World Bank Data in Action Toolkit for designing and scoping data projects
- Running RStudio with Google Earth Engine
- A Gallery of interesting Jupyter Notebooks
- Flowminder CDR Aggregate Fundamentals
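
As a small illustration of the GeoJSON format listed above, the sketch below turns a few geo-referenced survey records into a GeoJSON `FeatureCollection` using only the Python standard library. The household records, field names, and helper functions (`make_point_feature`, `make_feature_collection`) are hypothetical and exist only for this example; real survey data would need its own schema and coordinate checks.

```python
import json


def make_point_feature(lon, lat, properties):
    """Build a GeoJSON Feature holding a single Point geometry.

    GeoJSON (RFC 7946) orders coordinates as [longitude, latitude].
    """
    if not (-180 <= lon <= 180 and -90 <= lat <= 90):
        raise ValueError("coordinates out of range")
    return {
        "type": "Feature",
        "geometry": {"type": "Point", "coordinates": [lon, lat]},
        "properties": dict(properties),
    }


def make_feature_collection(features):
    """Wrap an iterable of Features in a FeatureCollection."""
    return {"type": "FeatureCollection", "features": list(features)}


# Hypothetical survey records, invented for illustration only.
households = [
    {"id": "hh-001", "lon": 36.82, "lat": -1.29, "size": 4},
    {"id": "hh-002", "lon": 36.85, "lat": -1.31, "size": 6},
]

collection = make_feature_collection(
    make_point_feature(
        h["lon"], h["lat"], {"id": h["id"], "household_size": h["size"]}
    )
    for h in households
)

# Serialize to text that GIS tools such as QGIS or GeoPandas can read.
geojson_text = json.dumps(collection, indent=2)
print(geojson_text)
```

Keeping survey attributes in `properties` and the location in `geometry` is what lets the same file be joined later with gridded or satellite-derived layers.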