SUSFlow

Modern Python library for downloading, parsing and engineering DATASUS public health datasets. SUSFlow provides:

  • resilient FTP access to DATASUS
  • a local cache that mirrors the FTP tree
  • transparent decompression of proprietary .dbc files to tabular data
  • helpers to load datasets as pandas DataFrames ready for analysis

This repository focuses on practical reproducibility and safe access to legacy public data systems.

Portuguese (Brazil) documentation and module index: Português do Brasil

Contents

  • Module documentation in docs/en/ (layouts, variable dictionaries, notes)
  • Library code: susflow/
  • Utilities: tools/ (FTP mapping and inspection)

Installation

Install in editable mode during development (includes dev/lint/test tools via the dev extra):

git clone https://github.com/OncoAtlas/susflow.git
cd susflow
python -m venv .venv
. ./.venv/bin/activate
pip install -U pip
pip install -e ".[dev]"

Install from PyPI (recommended for most users):

pip install susflow

To install a specific released version:

pip install susflow==0.2.0

Core runtime dependencies are declared in pyproject.toml. Optional extras:

  • susflow[dev] — development tools (ruff, black, isort, pytest, coverage, etc.)
  • susflow[polars] — Polars output support via engine="polars"
  • susflow[pyarrow] — PyArrow output + Parquet sidecar cache support
  • susflow[parquet] — Parquet sidecar cache (pyarrow)
  • susflow[polars,pyarrow] — for full engine= and cache flexibility

Basic usage

Each DATASUS system is available under susflow.systems (including the newer ibge_pop). APIs are lightweight: list_files, download and read helpers manage discovery, download and conversion.

The read() functions now support additional options:

  • engine="pandas" | "polars" | "pyarrow": selects the output type. The default is a pandas DataFrame; "polars" and "pyarrow" return that library's native object instead.
  • parquet=True: enables a local .parquet sidecar cache for faster repeated reads.

Example: SINASC (Live Births)

from susflow.systems import sinasc

# list files for a state
sinasc.list_files(uf="SP")

# download and return a pandas.DataFrame
df = sinasc.read(uf="SP", year=2020)

Example: PNI (Vaccinations)

from susflow.systems import pni
df = pni.read(uf="RJ", year=2015)

Example: Using new engine and parquet options + batch downloads

from susflow import download_batch
from susflow.systems import sinasc

# Read with Polars (requires susflow[polars])
df = sinasc.read(uf="SP", year=2020, engine="polars")

# Enable Parquet sidecar cache (requires susflow[pyarrow] or [parquet])
df = sinasc.read(uf="SP", year=2020, parquet=True)

# Concurrent downloads
paths = download_batch([
    ("sinasc", {"uf": "SP", "year": 2020}),
    ("sinasc", {"uf": "RJ", "year": 2021}),
])

Command-line interface

SUSFlow also provides a susflow command-line tool, installed together with the package (for example via pip install susflow):

susflow --help
susflow sinasc list --uf SP
susflow cnes download SP 2023 --type ST

The CLI supports list and download for the main systems. For full control (including engine= and parquet cache), use the Python API.

New in recent releases: ibge_pop module for population estimates.

Caching behavior

By default, downloads are stored under ~/.susflow/cache/ in a directory tree that mirrors the FTP paths. If a requested file is already present locally, the library skips the download and reads directly from the cache. To force a re-download, pass force=True to the download or read helpers.
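
The cache rule described above can be illustrated with a small standalone sketch. It is not SUSFlow's internal code: the names cache_path_for and fetch, the downloader callable and the example FTP path are all hypothetical, and only the ~/.susflow/cache/ root and the force flag come from this README.

```python
from pathlib import Path, PurePosixPath
from typing import Callable

# Default cache root as described in this README.
CACHE_ROOT = Path.home() / ".susflow" / "cache"


def cache_path_for(ftp_path: str, root: Path = CACHE_ROOT) -> Path:
    """Map a remote FTP path onto the same relative path under the cache root."""
    # Drop the leading "/" and any "."/".." parts so the result stays inside root.
    parts = [p for p in PurePosixPath(ftp_path).parts if p not in ("/", ".", "..")]
    return root.joinpath(*parts)


def fetch(
    ftp_path: str,
    downloader: Callable[[str], bytes],  # hypothetical: returns the remote file's bytes
    root: Path = CACHE_ROOT,
    force: bool = False,
) -> Path:
    """Return the local path, downloading only on a cache miss or when force=True."""
    target = cache_path_for(ftp_path, root)
    if target.exists() and not force:
        return target  # cache hit: no network access
    target.parent.mkdir(parents=True, exist_ok=True)
    target.write_bytes(downloader(ftp_path))
    return target
```

Calling fetch twice with the same path invokes the downloader once; passing force=True invokes it again and overwrites the cached file.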

Performance guidance

  • Downcast numeric types and convert repeated strings to category to reduce memory.
  • Convert commonly used datasets to Parquet once and reuse local Parquet caches.
  • For very large datasets prefer processing in chunks or using DuckDB/Polars to avoid excessive RAM.
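
As a rough illustration of the first point, the sketch below downcasts numeric columns and converts low-cardinality text columns to category using plain pandas. The helper name shrink, its max_unique_ratio threshold and the column names in the usage note are assumptions made for this example; they are not part of SUSFlow.

```python
import pandas as pd
from pandas.api import types as pdt


def shrink(df: pd.DataFrame, max_unique_ratio: float = 0.5) -> pd.DataFrame:
    """Return a copy of df with smaller dtypes where this is safe."""
    out = df.copy()
    for col in out.columns:
        s = out[col]
        if pdt.is_integer_dtype(s):
            # e.g. int64 -> int8/int16/int32, depending on the value range
            out[col] = pd.to_numeric(s, downcast="integer")
        elif pdt.is_float_dtype(s):
            # float64 -> float32 when the values survive the conversion
            out[col] = pd.to_numeric(s, downcast="float")
        elif pdt.is_object_dtype(s) or pdt.is_string_dtype(s):
            # Repeated strings are stored once each in a category column.
            if s.nunique(dropna=True) <= max_unique_ratio * len(s):
                out[col] = s.astype("category")
    return out
```

For example, shrink(pd.DataFrame({"uf": ["SP", "SP", "RJ", "SP"], "code": [1, 2, 3, 4]})) yields a category column for uf and a small integer dtype for code. Comparing df.memory_usage(deep=True).sum() before and after shows the saving on real data.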

Developer tools and linters

After pip install -e ".[dev]" (see Installation above), the tools are available. Run the checks locally:

. ./.venv/bin/activate
ruff check .
black --check .
isort --check-only .
pytest -q

Testing strategy

  • Unit tests should mock FTP and file IO; see tests/unit/ for examples.
  • Integration tests that access live FTP data should be opt-in and run manually (network-dependent).
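
One possible shape for such a unit test, using only the standard library, is sketched below. The function list_remote_files is a hypothetical stand-in for whatever code talks to the FTP server, and the host and directory are placeholders; the repository's own tests in tests/unit/ may be organised differently.

```python
import ftplib
from unittest import mock


def list_remote_files(host: str, directory: str) -> list[str]:
    """Hypothetical helper: return the sorted file names in a remote directory."""
    with ftplib.FTP(host) as ftp:
        ftp.login()  # anonymous login
        return sorted(ftp.nlst(directory))


def test_list_remote_files_without_network() -> None:
    # Replace ftplib.FTP so no socket is ever opened.
    with mock.patch("ftplib.FTP") as ftp_cls:
        # The object bound by "with ftplib.FTP(...) as ftp" is __enter__'s return value.
        fake_ftp = ftp_cls.return_value.__enter__.return_value
        fake_ftp.nlst.return_value = ["b.dbc", "a.dbc"]

        result = list_remote_files("ftp.example.invalid", "/some/dir")

        assert result == ["a.dbc", "b.dbc"]
        ftp_cls.assert_called_once_with("ftp.example.invalid")
        fake_ftp.login.assert_called_once_with()
        fake_ftp.nlst.assert_called_once_with("/some/dir")
```

Because the FTP class is patched where it is looked up, the test runs offline and can be collected by pytest like any other test function.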

Utilities

tools/mapear_ftp.py helps locate and audit DATASUS FTP directory structures when paths change. It can save structured maps to tools/mapas/ for offline analysis.

Contributing

See CONTRIBUTING.md for guidelines: coding style, tests, and PR workflow. See docs/en/coverage.md for coverage instructions.

License

This project is released under the MIT License — see LICENSE.

Contact

Open issues and pull requests are welcome. For larger changes please open an issue to discuss scope before implementing.

About

High-performance ETL pipeline and standardized local data lake builder for DATASUS public health data.
