Brazilian public data integration platform for scientific research

🇧🇷 Guaraci: Brazilian Public Data Integration Platform

Python 3.11+ | License: MIT | Code style: black

A comprehensive toolkit for accessing, integrating, and analyzing Brazilian public data, with initial focus on public health and Neglected Tropical Diseases (NTDs).

🎯 Overview

Guaraci addresses a critical gap in Brazilian public health data accessibility. While databases exist for high-visibility diseases like COVID-19 and tuberculosis, Neglected Tropical Diseases (NTDs) remain underrepresented in computational epidemiology. Guaraci provides:

  • Unified Access: Single interface to multiple Brazilian health databases (DATASUS, SINAN, SIH, SIM, SIA)
  • Scientific Reproducibility: Standardized, versioned datasets with complete metadata
  • Performance Optimized: Concurrent downloads and memory-efficient processing
  • Multiple Interfaces: Both Python API and CLI for different use cases
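As a rough illustration of what "concurrent downloads" means in practice, the sketch below runs a fetch function over several (disease, year) pairs with a bounded thread pool. It is not Guaraci's implementation: `fetch_all`, `fake_fetch`, and the file names it returns are hypothetical stand-ins, and no network access takes place.

```python
from concurrent.futures import ThreadPoolExecutor


def fetch_all(tasks, fetch, max_workers=4):
    """Call fetch(task) for every task, with at most max_workers running at once.

    Results are returned in the same order as the input tasks.
    """
    with ThreadPoolExecutor(max_workers=max_workers) as pool:
        return list(pool.map(fetch, tasks))


def fake_fetch(task):
    """Hypothetical stand-in for a real download; only builds a file name."""
    disease, year = task
    return f"{disease}_{year}.parquet"


# One task per disease and year, as in the quick-start examples.
tasks = [(disease, year) for disease in ("DENG", "RAIV") for year in (2020, 2021)]
files = fetch_all(tasks, fake_fetch, max_workers=2)
print(files)
```

Capping `max_workers` keeps the number of simultaneous connections to the data server small, which is the same idea behind the `GUARACI_MAX_CONCURRENT_DOWNLOADS` setting.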

🚀 Quick Start

Installation via pip

Choose the option that fits your needs:

  • Core (without DATASUS or API): pip install guaraci
  • DATASUS (PySUS: SINAN/SIM/SIH): pip install "guaraci[datasus]"
  • API (FastAPI/uvicorn/httpx): pip install "guaraci[api]"
  • Full (all extras): pip install "guaraci[full]"

Docker Setup (Recommended)

# Clone the repository
git clone https://github.com/autoaihub/guaraci.git
cd guaraci

# Build the Docker image
docker build -t guaraci .

# Run Guaraci commands
docker run --rm -it -v "$(pwd):/app" guaraci python -m guaraci.cli.main --help

Download SINAN Data (Docker)

# Download data for specific diseases and years
docker run --rm -it -v "$(pwd):/app" guaraci \
  python -m guaraci.cli.sinan_cli download 2020 2022 \
  --diseases DENG ZIKA --format csv

# Download single disease for one year
docker run --rm -it -v "$(pwd):/app" guaraci \
  python -m guaraci.cli.sinan_cli download 2020 2020 \
  --diseases RAIV --format csv

Python API (Inside Docker)

# Interactive Python session
docker run --rm -it -v "$(pwd):/app" guaraci python

# Then in Python:
from guaraci.datasus import SinanDataSource

# Initialize SINAN data source
sinan = SinanDataSource()

# Download data
sinan.download(start_year=2020, end_year=2020, diseases=['RAIV'])

# Load as DataFrame
df = sinan.load_dataframe('RAIV')

# Apply filters
filtered = sinan.filter(df, uf='SP')

# Export results
sinan.export(filtered, format='csv', name='raiva_sp')

Available CLI Commands

# Show platform information
docker run --rm guaraci python -m guaraci.cli.main info

# Download SINAN data
docker run --rm -it -v "$(pwd):/app" guaraci \
  python -m guaraci.cli.sinan_cli download 2020 2020 --diseases DENG --format csv

# Filter existing data (after download)
docker run --rm -it -v "$(pwd):/app" guaraci \
  python -m guaraci.cli.sinan_cli filter DENG --uf SP --output filtered_dengue

# Generate summary statistics
docker run --rm -it -v "$(pwd):/app" guaraci \
  python -m guaraci.cli.sinan_cli summary DENG --by UF --metric count

# Get information about available fields
docker run --rm -it -v "$(pwd):/app" guaraci \
  python -m guaraci.cli.sinan_cli info DENG

📊 Supported Data Sources

SINAN (Sistema de Informação de Agravos de Notificação)

  • Focus: Notifiable diseases surveillance
  • Coverage: 2007-present
  • Diseases: All SINAN diseases with emphasis on NTDs
  • Format: Parquet, CSV, SQLite

Supported Neglected Tropical Diseases

  • ANIM - Acidentes por Animais Peçonhentos
  • CHAG - Doença de Chagas
  • CHIK - Chikungunya
  • DENG - Dengue
  • ESQU - Esquistossomose
  • HANS - Hanseníase
  • LEIV - Leishmaniose Visceral
  • LTAN - Leishmaniose Tegumentar
  • RAIV - Raiva Humana

🛠 Development Setup

Docker-Based Development (Recommended)

# Clone repository
git clone https://github.com/autoaihub/guaraci.git
cd guaraci

# Build the Docker image
docker build -t guaraci .

# Run tests
docker run --rm guaraci python -m pytest tests/ -v

# Interactive development shell
docker run --rm -it -v "$(pwd):/app" guaraci bash

# Run specific commands
docker run --rm -it -v "$(pwd):/app" guaraci python -c "import guaraci; print(guaraci.__version__)"

Windows Users

# Use full paths for volume mounting
docker run --rm -it -v "C:\path\to\guaraci:/app" guaraci python -m guaraci.cli.main info

# Example with actual path (single line)
docker run --rm -it -v "C:\Users\username\Documents\guaraci:/app" guaraci python -m guaraci.cli.sinan_cli download 2020 2020 --diseases RAIV --format csv

# Multi-line with PowerShell backtick continuation
docker run --rm -it -v "C:\Users\username\Documents\guaraci:/app" guaraci `
  python -m guaraci.cli.sinan_cli download 2020 2020 --diseases RAIV --format csv
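If you drive Docker from a Python script, the host path can be resolved programmatically so the same code works on Windows, macOS, and Linux. This is a minimal sketch, assuming the image is tagged `guaraci` as in the build step; `docker_run_args` is a hypothetical helper, and `-it` is omitted because scripted runs are usually non-interactive.

```python
from pathlib import Path


def docker_run_args(project_dir, *command, image="guaraci"):
    """Build the argument list for `docker run`, mounting project_dir at /app.

    The host path is made absolute, which is what Docker expects for a
    bind mount on every platform.
    """
    host_path = Path(project_dir).resolve()
    return ["docker", "run", "--rm", "-v", f"{host_path}:/app", image, *command]


args = docker_run_args(".", "python", "-m", "guaraci.cli.main", "info")
print(args)

# To actually run it (requires Docker and the built image):
# import subprocess
# subprocess.run(args, check=True)
```

Passing the arguments as a list avoids shell quoting issues with paths that contain spaces.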

📖 Documentation

Configuration

Guaraci can be configured using environment variables in Docker:

# Run with custom configuration
docker run --rm -it -v "$(pwd):/app" \
  -e GUARACI_DATA_ROOT=/app/data \
  -e GUARACI_LOG_LEVEL=DEBUG \
  -e GUARACI_MAX_CONCURRENT_DOWNLOADS=10 \
  guaraci python -m guaraci.cli.sinan_cli download 2020 2020 --diseases DENG
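To show how such variables are typically consumed, here is a standard-library sketch that reads the three variables above into a typed settings object. It is not Guaraci's own configuration code (which, per the authors' notes, is built on Pydantic); the `EnvSettings` class, the `load_settings` function, and the default values are assumptions made for illustration.

```python
import os
from dataclasses import dataclass
from pathlib import Path


@dataclass(frozen=True)
class EnvSettings:
    data_root: Path
    log_level: str
    max_concurrent_downloads: int


def load_settings(env=None):
    """Read GUARACI_* variables from env (defaults to os.environ).

    The fallback values below are illustrative, not Guaraci's real defaults.
    """
    env = os.environ if env is None else env
    max_downloads = int(env.get("GUARACI_MAX_CONCURRENT_DOWNLOADS", "4"))
    if max_downloads < 1:
        raise ValueError("GUARACI_MAX_CONCURRENT_DOWNLOADS must be at least 1")
    return EnvSettings(
        data_root=Path(env.get("GUARACI_DATA_ROOT", "data")),
        log_level=env.get("GUARACI_LOG_LEVEL", "INFO").upper(),
        max_concurrent_downloads=max_downloads,
    )


settings = load_settings({
    "GUARACI_DATA_ROOT": "/app/data",
    "GUARACI_LOG_LEVEL": "debug",
    "GUARACI_MAX_CONCURRENT_DOWNLOADS": "10",
})
print(settings)
```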

Advanced Usage (Python API in Docker)

# Start interactive Python session
docker run --rm -it -v "$(pwd):/app" guaraci python

# Then in Python:
from guaraci.datasus import SinanDataSource
from guaraci.core.config import config

# View current configuration
print(f"Data root: {config.data_root}")
print(f"Max downloads: {config.max_concurrent_downloads}")

# Initialize the SINAN data source
sinan = SinanDataSource()

# Download with specific parameters
sinan.download(2020, 2021, diseases=['DENG'])

# Load and process data
df = sinan.load_dataframe('DENG')

# Advanced filtering
filtered = sinan.filter(
    df,
    uf='SP',
    municipio='São Paulo',
    ano=2021
)

# Generate summary statistics
summary = sinan.summary(filtered, by='CS_SEXO', metric='count')
print(summary)

# Export results
sinan.export(filtered, format='csv', name='dengue_sp_2021')
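Once results are exported to CSV, a count-by-column summary like the one above can be reproduced or cross-checked with the standard library alone. The sketch below assumes the exported file has a header row containing the grouping column; `count_by` is a hypothetical helper and the sample rows are invented for the example.

```python
import csv
import io
from collections import Counter


def count_by(csv_file, column):
    """Count rows per distinct value of `column` in an open CSV file."""
    reader = csv.DictReader(csv_file)
    if reader.fieldnames is None or column not in reader.fieldnames:
        raise KeyError(f"Column {column!r} not found in CSV header")
    return Counter(row[column] for row in reader)


# Invented sample data standing in for an exported file.
sample = io.StringIO("UF,CS_SEXO\nSP,F\nSP,M\nSP,F\n")
counts = count_by(sample, "CS_SEXO")
print(counts.most_common())
```

For a real export, open the file with `open(path, newline="", encoding="utf-8")` and pass the handle to `count_by`.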

🧪 Testing

All testing is done within Docker containers:

# Run all tests
docker run --rm guaraci python -m pytest tests/ -v

# Run with coverage
docker run --rm guaraci python -m pytest tests/ --cov=guaraci --cov-report=term-missing

# Run specific test file
docker run --rm guaraci python -m pytest tests/test_utils.py -v

# Test installation
docker run --rm guaraci python test_install.py

🤝 Contributing

We welcome contributions! Please see our Contributing Guidelines for details.

Development Workflow

  1. Fork the repository
  2. Create a feature branch (git checkout -b feature/amazing-feature)
  3. Make your changes
  4. Add tests for new functionality
  5. Run the test suite (pytest)
  6. Commit your changes (git commit -m 'Add amazing feature')
  7. Push to the branch (git push origin feature/amazing-feature)
  8. Open a Pull Request

📄 License

This project is licensed under the MIT License - see the LICENSE file for details.

👥 Authors & Contributors

  • Luis Felipe Vogel Lopes, Lead Developer (v0.2 and ongoing), vogel@usp.br
    Responsible for the full modernization of Guaraci, including modular architecture, Docker-first workflow, Pydantic configuration system, enhanced CLI, and full testing suite.

  • Pedro Guilherme dos Reis Teixeira, Original Author (v0.1), pedro.guilherme2305@usp.br
    Created the initial Guaraci prototype and early SINAN integration.

  • Prof. Robson Parmezan Bonidia, Scientific Advisor, ICMC/USP

  • Prof. André Carlos Ponce de Leon Ferreira de Carvalho, Scientific Advisor, ICMC/USP

🙏 Acknowledgments

  • PySUS - Foundation for DATASUS integration
  • Polars - High-performance DataFrame library
  • ICMC/USP - Institutional support
  • Brazilian Ministry of Health - Data provision through DATASUS

📚 Citation

If you use Guaraci in your research, please cite:

@software{guaraci2025,
  title     = {Guaraci: Brazilian Public Data Integration Platform},
  author    = {Lopes, Luis Felipe Vogel and Teixeira, Pedro Guilherme dos Reis and Bonidia, Robson Parmezan and Carvalho, André Carlos Ponce de Leon Ferreira de},
  year      = {2025},
  version   = {0.2},
  url       = {https://github.com/autoaihub/guaraci}
}

📝 Changelog

See CHANGELOG.md for the version history and release notes.
