Modern Python library for downloading, parsing and engineering DATASUS public health datasets. SUSFlow provides:
- resilient FTP access to DATASUS
- a local cache that mirrors the FTP tree
- transparent decompression of proprietary .dbc files to tabular data
- helpers to load datasets as pandas DataFrames ready for analysis
This repository focuses on practical reproducibility and safe access to legacy public data systems.
Portuguese (Brazil) documentation and module index: Português do Brasil
- Module documentation in docs/en/ (layouts, variable dictionaries, notes)
- Library code: susflow/
- Utilities: tools/ (FTP mapping and inspection)
Quick links
- CNES — health establishments
- PNI — immunizations
- SIM — mortality
- SINAN — notifiable diseases
- SINASC — live births
- SIASUS — ambulatory information system (SUS)
- SIHSUS — hospital information system (SUS)
- IBGE — population estimates
- FTP file patterns summary
Install in editable mode during development (includes dev/lint/test tools via the dev extra):
git clone https://github.com/OncoAtlas/susflow.git
cd susflow
python -m venv .venv
. ./.venv/bin/activate
pip install -U pip
pip install -e ".[dev]"
Install from PyPI (recommended for most users):
pip install susflow
To install a specific released version:
pip install susflow==0.2.0
Core runtime dependencies are declared in pyproject.toml. Optional extras:
- susflow[dev]: development tools (ruff, black, isort, pytest, coverage, etc.)
- susflow[polars]: Polars output support via engine="polars"
- susflow[pyarrow]: PyArrow output and Parquet sidecar cache support
- susflow[parquet]: Parquet sidecar cache (pyarrow)
- susflow[polars,pyarrow]: full engine= and cache flexibility
Each DATASUS system is available under susflow.systems (including the newer ibge_pop). APIs are lightweight: list_files, download and read helpers manage discovery, download and conversion.
The read() functions now support additional options:
- engine="pandas" | "polars" | "pyarrow": return native objects instead of a pandas DataFrame.
- parquet=True: enable a local .parquet sidecar cache for faster repeated reads.
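Because the non-pandas engines depend on optional extras, a script may want to check what is installed before choosing an engine. The sketch below is not part of SUSFlow: pick_engine is a hypothetical helper, and the mapping from engine names to importable modules is an assumption based on the extras listed above.

```python
from importlib.util import find_spec

# Assumed mapping from engine name to the top-level module it needs.
_ENGINE_MODULES = {"pandas": "pandas", "polars": "polars", "pyarrow": "pyarrow"}


def pick_engine(preferred, fallback="pandas"):
    """Return `preferred` if its backing module is importable, else `fallback`.

    Hypothetical helper for illustration; SUSFlow itself may report a missing
    extra differently.
    """
    if preferred not in _ENGINE_MODULES:
        raise ValueError(f"unknown engine: {preferred!r}")
    # find_spec returns None for a top-level module that is not installed.
    installed = find_spec(_ENGINE_MODULES[preferred]) is not None
    return preferred if installed else fallback
```

The result can then be passed along, for example sinasc.read(uf="SP", year=2020, engine=pick_engine("polars")).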
Example: SINASC (Live Births)
from susflow.systems import sinasc
# list files for a state
sinasc.list_files(uf="SP")
# download and return a pandas.DataFrame
df = sinasc.read(uf="SP", year=2020)
Example: PNI (Vaccinations)
from susflow.systems import pni
df = pni.read(uf="RJ", year=2015)
Example: Using the new engine and parquet options, plus batch downloads
from susflow import download_batch
from susflow.systems import sinasc
# Read with Polars (requires susflow[polars])
df = sinasc.read(uf="SP", year=2020, engine="polars")
# Enable Parquet sidecar cache (requires susflow[pyarrow] or [parquet])
df = sinasc.read(uf="SP", year=2020, parquet=True)
# Concurrent downloads
paths = download_batch([
    ("sinasc", {"uf": "SP", "year": 2020}),
    ("sinasc", {"uf": "RJ", "year": 2021}),
])
SUSFlow also provides a susflow CLI (installed with the package, e.g. via pip install susflow):
susflow --help
susflow sinasc list --uf SP
susflow cnes download SP 2023 --type ST
The CLI supports list and download for the main systems. For full control (including engine= and the Parquet cache), use the Python API.
New in recent releases: ibge_pop module for population estimates.
By default, downloads are stored under ~/.susflow/cache/, mirroring the FTP paths. If a requested file is already present locally, the library skips the download and reads directly from the cache. To force a re-download, set force=True on the download/reader helpers.
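The cache behaviour described above can be pictured with a small sketch. This is an illustration under stated assumptions, not SUSFlow's actual implementation: cache_path_for, fetch and the downloader callback are hypothetical names, and the remote path in the usage note is a made-up placeholder.

```python
from pathlib import Path, PurePosixPath

# Default cache root, as stated in this README.
CACHE_ROOT = Path.home() / ".susflow" / "cache"


def cache_path_for(remote_path, cache_root=CACHE_ROOT):
    """Map a remote FTP path to a local path that mirrors the same tree."""
    parts = [p for p in PurePosixPath(remote_path).parts if p != "/"]
    if ".." in parts:
        # Refuse paths that could escape the cache directory.
        raise ValueError(f"unsafe remote path: {remote_path!r}")
    return Path(cache_root).joinpath(*parts)


def fetch(remote_path, downloader, cache_root=CACHE_ROOT, force=False):
    """Return the cached file, calling `downloader` only when needed.

    `downloader(remote_path, local_path)` is a hypothetical callback that
    writes the remote file to `local_path`.
    """
    local = cache_path_for(remote_path, cache_root)
    if local.exists() and not force:
        return local  # cache hit: skip the download
    local.parent.mkdir(parents=True, exist_ok=True)
    downloader(remote_path, local)
    return local
```

With this shape, a second call to fetch("/example/dir/FILE.dbc", downloader) returns the cached file without touching the network, while force=True always invokes the downloader again.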
- Downcast numeric types and convert repeated strings to category to reduce memory.
- Convert commonly used datasets to Parquet once and reuse local Parquet caches.
- For very large datasets, prefer processing in chunks or using DuckDB/Polars to avoid excessive RAM use.
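The first tip above can be sketched with plain pandas. The shrink function and its max_unique_ratio threshold are hypothetical and chosen only for illustration; the column names in the usage note are invented and do not describe any DATASUS layout.

```python
import pandas as pd


def shrink(df, max_unique_ratio=0.5):
    """Return a copy of `df` with smaller dtypes where that is safe.

    Integers and floats are downcast with pandas.to_numeric; string columns
    with few distinct values are converted to the category dtype.
    """
    out = df.copy()
    for name in out.columns:
        col = out[name]
        if pd.api.types.is_integer_dtype(col):
            out[name] = pd.to_numeric(col, downcast="integer")
        elif pd.api.types.is_float_dtype(col):
            # Downcasting floats to float32 trades precision for memory.
            out[name] = pd.to_numeric(col, downcast="float")
        elif pd.api.types.is_string_dtype(col) and len(col) > 0:
            if col.nunique() / len(col) <= max_unique_ratio:
                out[name] = col.astype("category")
    return out
```

For example, shrink(pd.DataFrame({"uf": ["SP", "RJ"] * 500, "year": [2020] * 1000})) keeps the same values while storing uf as a category and year as a narrower integer type. Comparing df.memory_usage(deep=True).sum() before and after shows the saving on a given dataset.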
After pip install -e ".[dev]" (see Installation above), the tools are available. Run the checks locally:
. ./.venv/bin/activate
ruff check .
black --check .
isort --check-only .
pytest -q
- Unit tests should mock FTP and file IO; see tests/unit/ for examples.
- Integration tests that access live FTP data should be opt-in and run manually (network-dependent).
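As a rough illustration of the first guideline, the sketch below patches ftplib.FTP from the standard library so that no network connection is made. The list_remote function, the host name and the directory are hypothetical stand-ins and are not taken from SUSFlow's code or from tests/unit/.

```python
import ftplib
from unittest import mock


def list_remote(host, directory):
    """Hypothetical helper: return the file names in an FTP directory."""
    with ftplib.FTP(host) as ftp:
        ftp.login()  # anonymous login
        ftp.cwd(directory)
        return ftp.nlst()


def test_list_remote_uses_no_network():
    # Replace the FTP class so that no real connection is opened.
    with mock.patch("ftplib.FTP") as ftp_cls:
        ftp = ftp_cls.return_value.__enter__.return_value
        ftp.nlst.return_value = ["A.dbc", "B.dbc"]

        assert list_remote("ftp.example.org", "/some/dir") == ["A.dbc", "B.dbc"]

        ftp_cls.assert_called_once_with("ftp.example.org")
        ftp.login.assert_called_once_with()
        ftp.cwd.assert_called_once_with("/some/dir")
```

Patching at the ftplib.FTP boundary keeps the test fast and deterministic; a live-FTP variant of the same check would belong with the opt-in integration tests.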
tools/mapear_ftp.py helps locate and audit DATASUS FTP directory structures when paths change. It can save structured maps to tools/mapas/ for offline analysis.
See CONTRIBUTING.md for guidelines: coding style, tests, and PR workflow. See docs/en/coverage.md for coverage instructions.
This project is released under the MIT License — see LICENSE.
Open issues and pull requests are welcome. For larger changes please open an issue to discuss scope before implementing.