Utilities for interacting with Azure Data Lake.

Development Status
- 2 - Pre-Alpha
Intended Audience
- Developers
License
- OSI Approved :: Apache Software License
Natural Language
- English

centraal-dataframework

centraal-dataframework is a Python library that implements practices for using Azure Functions efficiently to run data transformation and data quality processes. Transformations run with the pandas library, and quality rules run with Great Expectations.

Getting started with the framework

Use the notebook as a reference. Some basic steps:

1. Install the library:

pip install centraal-dataframework
2. Make sure the following environment variables are set. For local development, a .env file together with the python-dotenv library is recommended. Once the function app is deployed, these variables must be configured in its Application settings.
- AZURE_STORAGE_CONNECTION_STRING: connection string for the data lake.
- CONTENEDOR_VALIDACIONES: container holding the validations that Great Expectations will run.
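
A quick way to catch a missing setting before a task fails is to check the two variables at startup. The following is a minimal sketch, not part of centraal-dataframework: the helper name `missing_settings` and the placeholder value are assumptions made for illustration, and only the two variable names come from the list above.

```python
import os

# Variable names taken from the list above.
REQUIRED_SETTINGS = ("AZURE_STORAGE_CONNECTION_STRING", "CONTENEDOR_VALIDACIONES")


def missing_settings(environ=None):
    """Return the required variable names that are unset or empty.

    `environ` defaults to os.environ; passing a dict makes the check easy to test.
    """
    env = os.environ if environ is None else environ
    return [name for name in REQUIRED_SETTINGS if not env.get(name)]


# For local development, python-dotenv can populate os.environ from a .env file
# before the check runs:
#     from dotenv import load_dotenv
#     load_dotenv()

# Hypothetical environment with only one of the two variables set.
example_env = {"AZURE_STORAGE_CONNECTION_STRING": "<connection string>"}
print(missing_settings(example_env))  # ['CONTENEDOR_VALIDACIONES']
```

Calling `missing_settings()` without arguments checks the real process environment, which is what a deployed function app would see from its Application settings.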
3. Create your tasks using the appropriate decorator, task_dq or task. Use the notebook in the documentation as a reference.

4. Create the `yaml` configuration file. By default, the framework looks for it in the working directory under the name `centraal_dataframework.yaml`:

```yml
#---contents of centraal_dataframework.yaml---
url_logicapp_email: https://prod-33.eastus.logic.azure.com:443/workflows/xxxxx
emails_notificar:
    - nombre.apellido@centraal.studio
    - nombre.apellido@correo.com
    # ...
tareas:
    # each key must match the name of a defined task function.
    nombre_funcion:
        dias: '*'
        horas: 8,12,20
    segundo_nombre_funcion:
        dias:  0
        horas: 8,12,20
```

Note: '*' can be used in dias and horas to indicate that the task should run every day or every hour. Days are numbered from Monday (0) to Sunday (6). Some examples of how this works:

- dias: '*' and horas: '*' -> run every day, every hour.
- dias: '*' and horas: '12' -> run every day, only at 12:00.
- dias: '*' and horas: 1,8,12,15 -> run every day, only at 1:00, 8:00, 12:00 and 15:00 (3 pm).
- dias: 0,3,6 and horas: '*' -> run only on Monday (0), Thursday (3) and Sunday (6), every hour.
- dias: 1,4 and horas: 20 -> run only on Tuesday (1) and Friday (4), at 20:00.
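
The matching rule described in these examples can be sketched in a few lines of Python. This is an illustration of the rule only, not the library's actual implementation: the helper names `_matches` and `should_run` are hypothetical, and the sketch assumes that day numbers follow Python's `datetime.weekday()` convention (Monday is 0), which agrees with the examples above.

```python
from datetime import datetime


def _matches(spec, value):
    """Return True if `value` is allowed by `spec` ('*' or comma-separated integers)."""
    text = str(spec).strip()
    if text == "*":
        return True
    allowed = {int(part) for part in text.split(",") if part.strip()}
    return value in allowed


def should_run(dias, horas, when):
    """Decide whether a task with the given `dias`/`horas` settings is due at `when`."""
    return _matches(dias, when.weekday()) and _matches(horas, when.hour)


# December 26, 2023 was a Tuesday (weekday 1).
print(should_run("1,4", 20, datetime(2023, 12, 26, 20, 0)))    # True
print(should_run("0,3,6", "*", datetime(2023, 12, 26, 9, 0)))  # False
```

Converting each setting with `str()` first lets the same code handle values that YAML parses as integers (`dias: 0`) and values it parses as strings (`horas: 8,12,20`).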

5. Create the function app, then add the framework and your tasks:

"""---contents of function_app.py---"""
import azure.functions as func
from centraal_dataframework.blueprints import framework
# custom modules must be imported
from other.module.logica_pandas import *
from other.module.logica_calidad import *
# if you have another module, import it as well
# from otro.modulo import logica
# ...
app = func.FunctionApp()
# add the framework
app.register_functions(framework)

6. Deploy the Azure Function. Documentation and tooling to simplify this step are coming soon.

Architecture

The overall architecture of the library is based on the following services:

[Architecture diagram]

The initial design of the library API consists of the following objects:

- config.yml
- runner
- task
    1. log
    2. alerta
- dq-task: a specific type of task for reporting data quality (dq) tasks.

---
title: Initial design of the library API
---
  graph TD;
      config[config.yml]:::cdf--uses-->runner:::cdf;
      m[Manual call]:::manual-->httptf[Http Trigger function]:::az;
      httptf --> runner;
      runner --enqueues task --> qtf(Queue Trigger Function):::az ;
      qtf --> task:::cdf
      task --uses--> log:::cdf --uses-->alerta:::cdf
      task --> dqt[dq-task]:::cdf
      dqt --> log
      classDef manual stroke:#f66,stroke-width:2px,stroke-dasharray: 5 5
      classDef az stroke:#2D9BF0,stroke-width:2px
      classDef cdf stroke:#FAC710
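
The flow in the diagram (a runner reads the configuration, enqueues task names, and a queue-triggered worker executes each task and logs the result) can be approximated with a small in-memory sketch. Everything here is hypothetical and only mirrors the shape of the design: `TASKS`, `LOG`, `runner` and `queue_worker` are illustrative names, the `task` decorator is a stand-in rather than the library's own, and a `deque` replaces the Azure queue.

```python
from collections import deque

TASKS = {}  # task name -> callable, filled in by the decorator
LOG = []    # stand-in for the framework's log object


def task(func):
    """Register a function under its own name (stand-in for a task decorator)."""
    TASKS[func.__name__] = func
    return func


def runner(config, queue):
    """Enqueue every task named in the configuration (schedule filtering omitted)."""
    for name in config["tareas"]:
        queue.append(name)


def queue_worker(queue):
    """Pop task names, run the registered function, and log the outcome."""
    while queue:
        name = queue.popleft()
        func = TASKS.get(name)
        if func is None:
            # In the diagram, this is where the log would hand off to an alert.
            LOG.append((name, "error: task not registered"))
            continue
        func()
        LOG.append((name, "ok"))


@task
def nombre_funcion():
    pass  # a real task would transform data with pandas here


queue = deque()
runner({"tareas": {"nombre_funcion": {}, "segundo_nombre_funcion": {}}}, queue)
queue_worker(queue)
print(LOG)
# [('nombre_funcion', 'ok'), ('segundo_nombre_funcion', 'error: task not registered')]
```

The second entry shows why task keys in the configuration must match the names of defined functions: a name with no registered function cannot be dispatched.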

PyPI Release Checklist

NOTE: based on the checklist from the original template.

Before Your First Release

Visit PyPI to make sure your package name is unused.

For Every Release

Open pull requests and merge all changes from the feature branches into master/main.
Update CHANGELOG.md manually and make sure it follows the Keep a Changelog standard. Note that the GitHub workflow reads the changelog and extracts the release notes automatically.
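
To see why the changelog format matters, here is a rough sketch of how release notes could be pulled out of a Keep a Changelog file. It is not the project's GitHub workflow: the function name `extract_release_notes` is hypothetical, and the sketch assumes each release is introduced by a `## [version]` heading.

```python
import re


def extract_release_notes(changelog_text, version):
    """Return the body of the `## [version]` section, or None if it is absent."""
    # Matches second-level headings only, e.g. "## [0.1.1] - 2023-12-12".
    heading = re.compile(r"^## \[?([^\]\s]+)\]?.*$", re.MULTILINE)
    matches = list(heading.finditer(changelog_text))
    for index, match in enumerate(matches):
        if match.group(1) == version:
            start = match.end()
            # The section ends where the next release heading starts.
            if index + 1 < len(matches):
                end = matches[index + 1].start()
            else:
                end = len(changelog_text)
            return changelog_text[start:end].strip()
    return None
```

If a release heading deviates from the expected format, a lookup like this returns `None`, and an automated release would then have no notes to publish.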

Commit the changelog changes:

git add CHANGELOG.md
git commit -m "Changelog for upcoming release 0.1.1."

Update the version number; this automatically creates a commit and a tag (`minor` can also be `patch` or `major`):
```
poetry run bump2version minor
```
Run the tests locally to be safe:
```
poetry run tox
```
Push these commits to master/main:
```
git push
```
Before proceeding to the next step, check that the workflows triggered by this push have passed.
Push the tags (created by bump2version) to master/main, creating the new release on both GitHub and PyPI:
```
git push --tags
```
Only tag names starting with 'v' (lower case) trigger the GitHub release workflow.
Check the PyPI listing page to make sure that the README, release notes, and roadmap display properly. If the tox run passed, this should be fine, since twine check already runs as part of tox.

About This Checklist

This checklist is adapted from https://cookiecutter-pypackage.readthedocs.io/en/latest/pypi_release_checklist.html.

It assumes that you are using all features of Cookiecutter PyPackage.

Release history

- 0.1.3 (Dec 26, 2023), this version
- 0.1.2 (Dec 18, 2023)
- 0.1.1 (Dec 12, 2023)
