Brazilian public data integration platform for scientific research

🇧🇷 Guaraci: Brazilian Public Data Integration Platform

Python 3.11+ | License: MIT | Code style: black

A comprehensive toolkit for accessing, integrating, and analyzing Brazilian public data, with initial focus on public health and Neglected Tropical Diseases (NTDs).

🎯 Overview

Guaraci addresses a critical gap in Brazilian public health data accessibility. While databases exist for high-visibility diseases like COVID-19 and tuberculosis, Neglected Tropical Diseases (NTDs) remain underrepresented in computational epidemiology. Guaraci provides:

  • Unified Access: Single interface to multiple Brazilian health databases (DATASUS, SINAN, SIH, SIM, SIA)
  • Scientific Reproducibility: Standardized, versioned datasets with complete metadata
  • Performance Optimized: Concurrent downloads and memory-efficient processing
  • Multiple Interfaces: Both Python API and CLI for different use cases

🚀 Quick Start

Installation via pip

Choose the variant that matches your needs:

  • Core (no DATASUS and no API): pip install guaraci
  • DATASUS (PySUS: SINAN/SIM/SIH): pip install "guaraci[datasus]"
  • API (FastAPI/uvicorn/httpx): pip install "guaraci[api]"
  • Full (all extras): pip install "guaraci[full]"
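
To see which optional extras are usable in the current environment, a small check along these lines may help. It is only a sketch: it assumes the extras provide the importable modules `pysus`, `fastapi`, `uvicorn`, and `httpx` (inferred from the package names listed above, not confirmed against Guaraci's packaging metadata), and `installed_extras` is a hypothetical helper, not part of Guaraci.

```python
from importlib.util import find_spec

# Assumed mapping from extra name to the modules it is expected to provide.
EXTRA_MODULES = {
    "datasus": ["pysus"],
    "api": ["fastapi", "uvicorn", "httpx"],
}


def installed_extras(extra_modules=EXTRA_MODULES):
    """Return {extra: bool}, True when every module of that extra is importable."""
    return {
        extra: all(find_spec(module) is not None for module in modules)
        for extra, modules in extra_modules.items()
    }


if __name__ == "__main__":
    for extra, ok in installed_extras().items():
        print(f"guaraci[{extra}]: {'available' if ok else 'missing'}")
```

If an extra is reported as missing, reinstalling with the matching `pip install "guaraci[...]"` command above should add it.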

Docker Setup (Recommended)

# Clone the repository
git clone https://github.com/autoaihub/guaraci.git
cd guaraci

# Build the Docker image
docker build -t guaraci .

# Run Guaraci commands
docker run --rm -it -v "$(pwd):/app" guaraci python -m guaraci.cli.main --help

Download SINAN Data (Docker)

# Download data for specific diseases and years
docker run --rm -it -v "$(pwd):/app" guaraci \
  python -m guaraci.cli.sinan_cli download 2020 2022 \
  --diseases DENG ZIKA --format csv

# Download single disease for one year
docker run --rm -it -v "$(pwd):/app" guaraci \
  python -m guaraci.cli.sinan_cli download 2020 2020 \
  --diseases RAIV --format csv

Download SIM Data (Docker)

# Download SIM (CID10) for SP/RJ, 2019–2020
docker run --rm -it -v "$(pwd):/app" guaraci \
  python -m guaraci.cli.sim_cli download 2019 2020 \
  --groups CID10 --states SP RJ --format csv

# Summary by basic cause (CAUSABAS)
docker run --rm -it -v "$(pwd):/app" guaraci \
  python -m guaraci.cli.sim_cli summary CID10 --by CAUSABAS

Download SIH Data (Docker)

# Download SIH (reduced AIH, group RD) for SP, months 1-3 of 2019–2020
docker run --rm -it -v "$(pwd):/app" guaraci \
  python -m guaraci.cli.sih_cli download 2019 2020 \
  --groups RD --states SP --months 1 2 3 --format csv

# Summary by state (UF)
docker run --rm -it -v "$(pwd):/app" guaraci \
  python -m guaraci.cli.sih_cli summary RD --by UF_ZI
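
As a rough illustration of how the download arguments above combine, the selection can be enumerated as below. This assumes the year range is inclusive and that every state, year, and month combination is requested; the CLI's actual expansion logic is not shown in this README, and `expand_requests` is a hypothetical helper, not part of Guaraci.

```python
from itertools import product


def expand_requests(start_year, end_year, states, months):
    """List every (state, year, month) combination in an inclusive year range."""
    years = range(start_year, end_year + 1)
    return list(product(states, years, months))


if __name__ == "__main__":
    # Mirrors: download 2019 2020 --states SP --months 1 2 3
    for state, year, month in expand_requests(2019, 2020, ["SP"], [1, 2, 3]):
        print(f"{state} {year}-{month:02d}")
```

Under these assumptions the SIH example covers six state-month units, which gives a quick way to estimate how much data a command will request before running it.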

Python API (Inside Docker)

# Interactive Python session
docker run --rm -it -v "$(pwd):/app" guaraci python

# Then in Python:
from guaraci.datasus import SinanDataSource

# Initialize SINAN data source
sinan = SinanDataSource()

# Download data
sinan.download(start_year=2020, end_year=2020, diseases=['RAIV'])

# Load as DataFrame
df = sinan.load_dataframe('RAIV')

# Apply filters
filtered = sinan.filter(df, uf='SP')

# Export results
sinan.export(filtered, format='csv', name='raiva_sp')
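
Conceptually, a call such as `sinan.filter(df, uf='SP')` keeps only the rows that match every keyword criterion. The sketch below shows that idea on plain dictionaries. It is not Guaraci's implementation: the records are made up, the field names `uf` and `ano` are placeholders rather than confirmed SINAN column names, and `filter_records` is a hypothetical helper.

```python
def filter_records(records, **criteria):
    """Keep records whose fields equal every keyword criterion."""
    return [
        record
        for record in records
        if all(record.get(field) == value for field, value in criteria.items())
    ]


# Made-up toy records, for illustration only.
rows = [
    {"uf": "SP", "ano": 2020},
    {"uf": "RJ", "ano": 2020},
    {"uf": "SP", "ano": 2021},
]

if __name__ == "__main__":
    print(filter_records(rows, uf="SP"))
    print(filter_records(rows, uf="SP", ano=2021))
```

Criteria are combined with a logical AND, so adding more keywords can only narrow the result.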

Available CLI Commands

For detailed help on each database, use:

docker run --rm -it -v "$(pwd):/app" guaraci python -m guaraci.cli.sinan_cli --help
docker run --rm -it -v "$(pwd):/app" guaraci python -m guaraci.cli.sim_cli --help
docker run --rm -it -v "$(pwd):/app" guaraci python -m guaraci.cli.sih_cli --help

# Show platform information
docker run --rm guaraci python -m guaraci.cli.main info

# Download SINAN data
docker run --rm -it -v "$(pwd):/app" guaraci \
  python -m guaraci.cli.sinan_cli download 2020 2020 --diseases DENG --format csv

# Download SIM data
docker run --rm -it -v "$(pwd):/app" guaraci \
  python -m guaraci.cli.sim_cli download 2019 2020 --groups CID10 --states SP RJ --format csv

# Download SIH data
docker run --rm -it -v "$(pwd):/app" guaraci \
  python -m guaraci.cli.sih_cli download 2019 2020 --groups RD --states SP --months 1 2 3 --format csv

# Filter existing data (after download)
docker run --rm -it -v "$(pwd):/app" guaraci \
  python -m guaraci.cli.sinan_cli filter DENG --uf SP --output filtered_dengue

# Generate summary statistics
docker run --rm -it -v "$(pwd):/app" guaraci \
  python -m guaraci.cli.sinan_cli summary DENG --by UF --metric count

# Get information about available fields
docker run --rm -it -v "$(pwd):/app" guaraci \
  python -m guaraci.cli.sinan_cli info DENG
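
When these commands need to run from a script rather than a terminal, it can help to build the argument list in Python. The sketch below only assembles the SINAN download command using the flags shown above; `sinan_download_command` is a hypothetical helper, and `-it` is left out on the assumption that the script runs without an interactive terminal.

```python
import os
import shlex


def sinan_download_command(start_year, end_year, diseases, fmt="csv", workdir=None):
    """Build the `docker run` argument list for a SINAN download."""
    # Mount the given host directory (default: current directory) at /app.
    host_dir = str(workdir) if workdir is not None else os.getcwd()
    return [
        "docker", "run", "--rm", "-v", f"{host_dir}:/app", "guaraci",
        "python", "-m", "guaraci.cli.sinan_cli", "download",
        str(start_year), str(end_year),
        "--diseases", *diseases,
        "--format", fmt,
    ]


if __name__ == "__main__":
    cmd = sinan_download_command(2020, 2022, ["DENG", "ZIKA"])
    print(shlex.join(cmd))
    # To execute it (requires Docker and the locally built `guaraci` image):
    # import subprocess; subprocess.run(cmd, check=True)
```

Passing a list to `subprocess.run` avoids shell quoting problems with paths that contain spaces.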

📊 Supported Data Sources

SINAN (Sistema de Informação de Agravos de Notificação)

  • Focus: Notifiable diseases surveillance
  • Coverage: 2007-present
  • Diseases: All SINAN diseases with emphasis on NTDs
  • Format: Parquet, CSV, SQLite

Supported Neglected Tropical Diseases

  • ANIM - Accidents Involving Venomous Animals
  • CHAG - Chagas Disease
  • CHIK - Chikungunya
  • DENG - Dengue
  • ESQU - Schistosomiasis
  • HANS - Leprosy (Hansen's Disease)
  • LEIV - Visceral Leishmaniasis
  • LTAN - Tegumentary Leishmaniasis
  • RAIV - Human Rabies
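
For scripts that take disease codes as input, the list above can be kept as a lookup table. The sketch below separates requested codes into those on the NTD list and the rest. The codes come from the list above and the names are English glosses; `partition_codes` is a hypothetical helper. Codes outside this list are not necessarily invalid, since the examples in this README also download `ZIKA`.

```python
# Codes from the NTD list above, with English names.
NTD_CODES = {
    "ANIM": "Accidents involving venomous animals",
    "CHAG": "Chagas disease",
    "CHIK": "Chikungunya",
    "DENG": "Dengue",
    "ESQU": "Schistosomiasis",
    "HANS": "Leprosy (Hansen's disease)",
    "LEIV": "Visceral leishmaniasis",
    "LTAN": "Tegumentary leishmaniasis",
    "RAIV": "Human rabies",
}


def partition_codes(codes):
    """Split codes into (on the NTD list, not on the NTD list), upper-cased."""
    normalized = [code.strip().upper() for code in codes]
    listed = [code for code in normalized if code in NTD_CODES]
    other = [code for code in normalized if code not in NTD_CODES]
    return listed, other


if __name__ == "__main__":
    listed, other = partition_codes(["deng", "ZIKA", "raiv"])
    print("NTD list:", listed)
    print("Other SINAN codes:", other)
```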

SIM (Sistema de Informações sobre Mortalidade)

  • Focus: Mortality (CID10/CID9, i.e. ICD-10/ICD-9)
  • Coverage: Recent decades (as available on the DATASUS FTP)
  • Format: Parquet, CSV, SQLite
  • Groups: CID10 (default), CID9

SIH (Sistema de Informações Hospitalares)

  • Focus: Hospital admissions funded by SUS (AIH)
  • Coverage: 1992–present (as available on the DATASUS FTP)
  • Format: Parquet, CSV, SQLite
  • Groups: RD (reduced AIH, default), RJ, ER, SP, CH, CM

🛠 Development Setup

Docker-Based Development (Recommended)

# Clone repository
git clone https://github.com/autoaihub/guaraci.git
cd guaraci

# Build the Docker image
docker build -t guaraci .

# Run tests
docker run --rm guaraci python -m pytest tests/ -v

# Interactive development shell
docker run --rm -it -v "$(pwd):/app" guaraci bash

# Run specific commands
docker run --rm -it -v "$(pwd):/app" guaraci python -c "import guaraci; print(guaraci.__version__)"

Windows Users

# Use full paths for volume mounting
docker run --rm -it -v "C:\path\to\guaraci:/app" guaraci python -m guaraci.cli.main info

# Example with actual path (single line)
docker run --rm -it -v "C:\Users\username\Documents\guaraci:/app" guaraci python -m guaraci.cli.sinan_cli download 2020 2020 --diseases RAIV --format csv

# Multi-line with PowerShell backtick continuation
docker run --rm -it -v "C:\Users\username\Documents\guaraci:/app" guaraci `
  python -m guaraci.cli.sinan_cli download 2020 2020 --diseases RAIV --format csv

📖 Documentation

Configuration

Guaraci can be configured using environment variables in Docker:

# Run with custom configuration
docker run --rm -it -v "$(pwd):/app" \
  -e GUARACI_DATA_ROOT=/app/data \
  -e GUARACI_LOG_LEVEL=DEBUG \
  -e GUARACI_MAX_CONCURRENT_DOWNLOADS=10 \
  guaraci python -m guaraci.cli.sinan_cli download 2020 2020 --diseases DENG
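
The sketch below illustrates how the three environment variables above could be read and validated in plain Python. It is not Guaraci's configuration code: the `Settings` class, the `load_settings` function, and all default values are placeholders chosen for illustration.

```python
import os
from dataclasses import dataclass
from pathlib import Path


@dataclass(frozen=True)
class Settings:
    data_root: Path
    log_level: str
    max_concurrent_downloads: int


def load_settings(env=None):
    """Read GUARACI_* variables, falling back to placeholder defaults."""
    env = os.environ if env is None else env
    # "4" is an illustrative default, not Guaraci's documented value.
    max_downloads = int(env.get("GUARACI_MAX_CONCURRENT_DOWNLOADS", "4"))
    if max_downloads < 1:
        raise ValueError("GUARACI_MAX_CONCURRENT_DOWNLOADS must be at least 1")
    return Settings(
        data_root=Path(env.get("GUARACI_DATA_ROOT", "data")),
        log_level=env.get("GUARACI_LOG_LEVEL", "INFO").upper(),
        max_concurrent_downloads=max_downloads,
    )


if __name__ == "__main__":
    print(load_settings())
```

Accepting an explicit mapping makes the function easy to test without touching the real process environment.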

Advanced Usage (Python API in Docker)

# Start interactive Python session
docker run --rm -it -v "$(pwd):/app" guaraci python

# Then in Python:
from guaraci.datasus import SinanDataSource
from guaraci.core.config import config

# View current configuration
print(f"Data root: {config.data_root}")
print(f"Max downloads: {config.max_concurrent_downloads}")

# Initialize SINAN data source
sinan = SinanDataSource()

# Download with specific parameters
sinan.download(2020, 2021, diseases=['DENG'])

# Load and process data
df = sinan.load_dataframe('DENG')

# Advanced filtering
filtered = sinan.filter(
    df,
    uf='SP',
    municipio='São Paulo',
    ano=2021
)

# Generate summary statistics
summary = sinan.summary(filtered, by='CS_SEXO', metric='count')
print(summary)

# Export results
sinan.export(filtered, format='csv', name='dengue_sp_2021')
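
A summary with `by='CS_SEXO'` and `metric='count'` amounts to counting rows per distinct value of one column. The sketch below shows that computation on made-up records using only the standard library. It is not Guaraci's implementation, and `count_by` is a hypothetical helper.

```python
from collections import Counter


def count_by(records, field):
    """Count records per distinct value of `field`, most frequent first."""
    return Counter(record[field] for record in records).most_common()


# Made-up toy records, for illustration only.
rows = [
    {"CS_SEXO": "F", "UF": "SP"},
    {"CS_SEXO": "M", "UF": "SP"},
    {"CS_SEXO": "F", "UF": "SP"},
]

if __name__ == "__main__":
    print(count_by(rows, "CS_SEXO"))  # prints [('F', 2), ('M', 1)]
```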

🧪 Testing

All testing is done within Docker containers:

# Run all tests
docker run --rm guaraci python -m pytest tests/ -v

# Run with coverage
docker run --rm guaraci python -m pytest tests/ --cov=guaraci --cov-report=term-missing

# Run specific test file
docker run --rm guaraci python -m pytest tests/test_utils.py -v

# Test installation
docker run --rm guaraci python test_install.py

🤝 Contributing

We welcome contributions! Please see our Contributing Guidelines for details.

Development Workflow

  1. Fork the repository
  2. Create a feature branch (git checkout -b feature/amazing-feature)
  3. Make your changes
  4. Add tests for new functionality
  5. Run the test suite (pytest)
  6. Commit your changes (git commit -m 'Add amazing feature')
  7. Push to the branch (git push origin feature/amazing-feature)
  8. Open a Pull Request

📄 License

This project is licensed under the MIT License - see the LICENSE file for details.

👥 Authors & Contributors

  • Luis Felipe Vogel Lopes, Lead Developer (v0.2 and ongoing), vogel@usp.br
    Responsible for the full modernization of Guaraci, including modular architecture, Docker-first workflow, Pydantic configuration system, enhanced CLI, and full testing suite.

  • Pedro Guilherme dos Reis Teixeira, Original Author (v0.1), pedro.guilherme2305@usp.br
    Created the initial Guaraci prototype and early SINAN integration.

  • Prof. Robson Parmezan Bonidia, Scientific Advisor – ICMC/USP

  • Prof. André Carlos Ponce de Leon Ferreira de Carvalho, Scientific Advisor – ICMC/USP

🙏 Acknowledgments

  • PySUS - Foundation for DATASUS integration
  • Polars - High-performance DataFrame library
  • ICMC/USP - Institutional support
  • Brazilian Ministry of Health - Data provision through DATASUS

📚 Citation

If you use Guaraci in your research, please cite:

@software{guaraci2025,
  title     = {Guaraci: Brazilian Public Data Integration Platform},
  author    = {Lopes, Luis Felipe Vogel and Teixeira, Pedro Guilherme dos Reis and Bonidia, Robson Parmezan and Carvalho, André Carlos Ponce de Leon Ferreira de},
  year      = {2025},
  version   = {0.2},
  url       = {https://github.com/autoaihub/guaraci}
}

📝 Changelog

See CHANGELOG.md for version history and what's new.
