SUSFlow

Modern Python library for downloading, parsing and engineering DATASUS public health datasets. SUSFlow provides:

resilient FTP access to DATASUS
a local cache that mirrors the FTP tree
transparent decompression of proprietary .dbc files to tabular data
helpers to load datasets as pandas DataFrames ready for analysis

This repository focuses on practical reproducibility and safe access to legacy public data systems.

Portuguese (Brazil) documentation and module index: Português do Brasil

Installation

Install in editable mode during development (includes dev/lint/test tools via the dev extra):

git clone https://github.com/OncoAtlas/susflow.git
cd susflow
python -m venv .venv
. ./.venv/bin/activate
pip install -U pip
pip install -e ".[dev]"

Install from PyPI (recommended for most users):

pip install susflow

To install a specific released version:

pip install susflow==0.2.0

Core runtime dependencies are declared in pyproject.toml. Optional extras:

susflow[dev] — development tools (ruff, black, isort, pytest, coverage, etc.)
susflow[polars] — Polars output support via engine="polars"
susflow[pyarrow] — PyArrow output + Parquet sidecar cache support
susflow[parquet] — Parquet sidecar cache (pyarrow)
susflow[polars,pyarrow] — for full engine= and cache flexibility

Basic usage

Each DATASUS system is available as a module under susflow.systems (including the newer ibge_pop). The API is deliberately small: the list_files, download and read helpers handle discovery, download and conversion.

The read() functions now support additional options:

engine="pandas" | "polars" | "pyarrow": choose the output type. "pandas" is the default; "polars" and "pyarrow" return that library's native objects instead of a pandas DataFrame.
parquet=True: enable a local .parquet sidecar cache for faster repeated reads.

Example: SINASC (Live Births)

from susflow.systems import sinasc

# list files for a state
sinasc.list_files(uf="SP")

# download and return a pandas.DataFrame
df = sinasc.read(uf="SP", year=2020)

Example: PNI (Vaccinations)

from susflow.systems import pni
df = pni.read(uf="RJ", year=2015)

Example: Using new engine and parquet options + batch downloads

from susflow import download_batch
from susflow.systems import sinasc

# Read with Polars (requires susflow[polars])
df = sinasc.read(uf="SP", year=2020, engine="polars")

# Enable Parquet sidecar cache (requires susflow[pyarrow] or [parquet])
df = sinasc.read(uf="SP", year=2020, parquet=True)

# Concurrent downloads
paths = download_batch([
    ("sinasc", {"uf": "SP", "year": 2020}),
    ("sinasc", {"uf": "RJ", "year": 2021}),
])

Command-line interface

SUSFlow also provides a susflow command-line tool, installed together with the package (for example via pip install susflow):

susflow --help
susflow sinasc list --uf SP
susflow cnes download SP 2023 --type ST

The CLI supports list and download for the main systems. For full control (including engine= and parquet cache), use the Python API.

New in recent releases: ibge_pop module for population estimates.

Caching behavior

By default, downloads are stored under ~/.susflow/cache/, mirroring the FTP paths. If a requested file is already present locally, the library skips the download and reads directly from the cache. To force a re-download, pass force=True to the download or read helpers.
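The cache lookup described above can be sketched roughly as follows. This is an illustration of the idea, not SUSFlow's actual implementation: cache_path_for, fetch and the downloader callback are hypothetical names, and the remote path in the usage note is made up.

```python
from pathlib import Path, PurePosixPath

# Default cache root as stated in the text above.
CACHE_ROOT = Path.home() / ".susflow" / "cache"


def cache_path_for(ftp_path: str, cache_root: Path = CACHE_ROOT) -> Path:
    """Map a remote FTP path to its mirrored location under the cache root."""
    # Strip the leading slash so the remote path nests under cache_root.
    relative = PurePosixPath(ftp_path.lstrip("/"))
    return cache_root.joinpath(*relative.parts)


def fetch(ftp_path, downloader, cache_root=CACHE_ROOT, force=False):
    """Return the local file, calling downloader only on a miss or when forced.

    downloader is a hypothetical callable taking (remote_path, local_path).
    """
    local = cache_path_for(ftp_path, cache_root)
    if force or not local.exists():
        local.parent.mkdir(parents=True, exist_ok=True)
        downloader(ftp_path, local)
    return local
```

With this layout, a remote file such as /example/dir/FILE.dbc would land at ~/.susflow/cache/example/dir/FILE.dbc, and a second call for the same path would not invoke the downloader unless force=True is given.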

Performance guidance

Downcast numeric types and convert repeated strings to category to reduce memory.
Convert commonly used datasets to Parquet once and reuse local Parquet caches.
For very large datasets prefer processing in chunks or using DuckDB/Polars to avoid excessive RAM.
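As an illustration of the first tip, the sketch below downcasts numeric columns and converts low-cardinality string columns to category using plain pandas. The shrink helper and its max_unique_ratio threshold are assumptions made for this example and are not part of SUSFlow.

```python
import pandas as pd


def shrink(df: pd.DataFrame, max_unique_ratio: float = 0.5) -> pd.DataFrame:
    """Return a copy with smaller numeric dtypes and categorical strings."""
    out = df.copy()
    for col in out.columns:
        series = out[col]
        if pd.api.types.is_integer_dtype(series):
            # Pick the smallest integer dtype that can hold the values.
            out[col] = pd.to_numeric(series, downcast="integer")
        elif pd.api.types.is_float_dtype(series):
            out[col] = pd.to_numeric(series, downcast="float")
        elif pd.api.types.is_string_dtype(series) and len(series) > 0:
            # Only convert when values repeat often (hypothetical threshold).
            if series.nunique() / len(series) <= max_unique_ratio:
                out[col] = series.astype("category")
    return out
```

Comparing df.memory_usage(deep=True).sum() before and after the call gives a quick estimate of the savings for a given dataset. Columns with mostly unique strings are left untouched, since category brings little benefit there.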

Developer tools and linters

After pip install -e ".[dev]" (see Installation above), the tools are available. Run the checks locally:

. ./.venv/bin/activate
ruff check .
black --check .
isort --check-only .
pytest -q

Testing strategy

Unit tests should mock FTP and file IO; see tests/unit/ for examples.
Integration tests that access live FTP data should be opt-in and run manually (network-dependent).
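A unit test that mocks FTP might look like the sketch below, which uses only the standard library. list_remote is a hypothetical helper written for this example (it is not a SUSFlow function), and the host name and file names are placeholders.

```python
import ftplib
from unittest import mock


def list_remote(host: str, directory: str) -> list[str]:
    """Hypothetical helper: return the file names in an FTP directory."""
    with ftplib.FTP(host) as ftp:
        ftp.login()  # anonymous login
        return ftp.nlst(directory)


def test_list_remote_uses_no_network():
    # Replace ftplib.FTP so that no real connection is opened.
    with mock.patch("ftplib.FTP") as ftp_cls:
        # The object bound by "with ftplib.FTP(host) as ftp".
        ftp = ftp_cls.return_value.__enter__.return_value
        ftp.nlst.return_value = ["A2020.dbc", "B2020.dbc"]

        result = list_remote("ftp.example.invalid", "/some/dir")

        assert result == ["A2020.dbc", "B2020.dbc"]
        ftp_cls.assert_called_once_with("ftp.example.invalid")
        ftp.login.assert_called_once_with()
        ftp.nlst.assert_called_once_with("/some/dir")
```

Because the FTP class itself is patched, the test runs offline and deterministically, which is what keeps the unit suite separate from the opt-in integration tests.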

Utilities

tools/mapear_ftp.py helps locate and audit DATASUS FTP directory structures when paths change. It can save structured maps to tools/mapas/ for offline analysis.

Contributing

See CONTRIBUTING.md for guidelines: coding style, tests, and PR workflow. See docs/en/coverage.md for coverage instructions.

License

This project is released under the MIT License — see LICENSE.

Contact

Open issues and pull requests are welcome. For larger changes please open an issue to discuss scope before implementing.
