ETL for Banco Macro: downloads a PDF from the URL you provide, extracts its data, and creates chunks that it uploads to S3 (AWS)
Project description
Macro Flow / PDF Processing Flow
macro_flow is a modular Python library designed for ETL pipelines over PDF documents. It downloads a PDF from a URL, transforms it into text chunks, and stores the result in a format optimized for later use in embeddings or RAG (Retrieval Augmented Generation) workflows.
🚀 Features
- Downloads PDFs from external URLs.
- Transforms PDFs into structured text chunks.
- Exports to Parquet for efficient analysis or model training.
- Optional upload to Amazon S3 or other destinations.
- Flexible configuration via .env.
- Usable as a Python library or from the command line.
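The chunking step in the list above can be pictured with a small sketch. macro_flow's actual splitting strategy is not documented here, so this assumes the simplest approach: fixed-size character windows with a small overlap. The names `chunk_text`, `chunk_size`, and `overlap` are hypothetical and are not part of the library's API.

```python
def chunk_text(text: str, chunk_size: int = 500, overlap: int = 50) -> list[str]:
    """Split text into fixed-size character windows that overlap slightly.

    Illustrative only: this is an assumed strategy, not macro_flow's own.
    """
    if chunk_size <= 0:
        raise ValueError("chunk_size must be positive")
    if not 0 <= overlap < chunk_size:
        raise ValueError("overlap must satisfy 0 <= overlap < chunk_size")

    step = chunk_size - overlap  # distance between the starts of two chunks
    chunks = []
    for start in range(0, len(text), step):
        chunks.append(text[start:start + chunk_size])
        # Stop once the current window has reached the end of the text,
        # so no trailing chunk is made up entirely of overlap.
        if start + chunk_size >= len(text):
            break
    return chunks
```

The overlap keeps a sentence that straddles a boundary visible in both neighbouring chunks, which tends to help retrieval in RAG pipelines at the cost of some duplicated text.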
⚙️ Required environment variables
Define a .env file in the root of your project with the following variables:
AWS_ACCESS_KEY_ID=
AWS_SECRET_ACCESS_KEY=
AWS_DEFAULT_REGION=us-east-1
AWS_BUCKET_NAME=chunks-data-parquet-2025
AWS_S3_PREFIX=chunks-s3
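Before running a pipeline, it can be useful to check that all of these variables are actually set. The sketch below is one way to do that with the standard library only. It assumes the values from .env have already been loaded into the process environment (for example by a tool such as python-dotenv); `S3Settings` and `load_settings` are hypothetical helpers, not part of macro_flow.

```python
import os
from dataclasses import dataclass
from typing import Mapping

# Variable names taken from the .env listing above.
REQUIRED_VARS = (
    "AWS_ACCESS_KEY_ID",
    "AWS_SECRET_ACCESS_KEY",
    "AWS_DEFAULT_REGION",
    "AWS_BUCKET_NAME",
    "AWS_S3_PREFIX",
)


@dataclass(frozen=True)
class S3Settings:
    """Hypothetical container for the settings; not a macro_flow class."""
    access_key_id: str
    secret_access_key: str
    region: str
    bucket: str
    prefix: str


def load_settings(env: Mapping[str, str] = os.environ) -> S3Settings:
    """Read the required variables, failing early if any is missing or empty."""
    missing = [name for name in REQUIRED_VARS if not env.get(name)]
    if missing:
        raise RuntimeError("Missing environment variables: " + ", ".join(missing))
    return S3Settings(
        access_key_id=env["AWS_ACCESS_KEY_ID"],
        secret_access_key=env["AWS_SECRET_ACCESS_KEY"],
        region=env["AWS_DEFAULT_REGION"],
        bucket=env["AWS_BUCKET_NAME"],
        prefix=env["AWS_S3_PREFIX"],
    )
```

Failing early with the list of missing names is usually easier to debug than a credentials error raised later, in the middle of an upload.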
From the command line:
macroflow --url https://example.com/documento.pdf
As a Python library:
from macro_flow.main import MacroEtl
etl = MacroEtl(url="https://example.com/documento.pdf")
etl.run()
🧠 Requirements
Python >= 3.9
Configured .env file
Internet connection to download PDFs
Valid AWS credentials (if using S3)
🔮 Use cases
Preprocessing documents for embeddings.
Building RAG pipelines from PDF corpora.
Integration with cloud storage systems.
🪪 License
MIT © Facu Vega https://github.com/facuvegaingenieer
Download files
File details
Details for the file macro_flow-0.0.4.tar.gz.
File metadata
- Download URL: macro_flow-0.0.4.tar.gz
- Upload date:
- Size: 5.7 kB
- Tags: Source
- Uploaded using Trusted Publishing? No
- Uploaded via: twine/6.2.0 CPython/3.9.23
File hashes
| Algorithm | Hash digest |
|---|---|
| SHA256 | fe6639114858f5da96c4fb3329cc2bebd8632c47f2b917b67d56cd7de7e3c739 |
| MD5 | 3172d30b33f30cb6af96a554911567e9 |
| BLAKE2b-256 | eb81db7c90a4bd9fe3c4d2c016edeb4865087d3322bfa1bdeb5ec2b113a75390 |
File details
Details for the file macro_flow-0.0.4-py3-none-any.whl.
File metadata
- Download URL: macro_flow-0.0.4-py3-none-any.whl
- Upload date:
- Size: 7.7 kB
- Tags: Python 3
- Uploaded using Trusted Publishing? No
- Uploaded via: twine/6.2.0 CPython/3.9.23
File hashes
| Algorithm | Hash digest |
|---|---|
| SHA256 | 80a53bf3968ea17d08b9994f95cccde49467e42c1717b615cd22e8746d6ad25c |
| MD5 | c0aeb5f8c8532b6ae137b5bfa30ea099 |
| BLAKE2b-256 | 638adf50087a01096ec225588f93cc4e7b4da38eb2106651abf56bcbc9a51fa6 |
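If you download one of these files by hand, its SHA256 digest can be compared against the value listed above. The following is a minimal sketch using only the standard library; `sha256_of` and `verify_sha256` are hypothetical helper names.

```python
import hashlib
import hmac


def sha256_of(path: str, block_size: int = 65536) -> str:
    """Return the hex SHA256 digest of a file, reading it in blocks."""
    digest = hashlib.sha256()
    with open(path, "rb") as fh:
        # iter() with a sentinel keeps calling fh.read until it returns b"".
        for block in iter(lambda: fh.read(block_size), b""):
            digest.update(block)
    return digest.hexdigest()


def verify_sha256(path: str, expected_hex: str) -> bool:
    """Compare a file's digest with an expected hex string, ignoring case."""
    return hmac.compare_digest(sha256_of(path), expected_hex.strip().lower())
```

For the source distribution, for instance, you would call `verify_sha256` with the path to your local copy of macro_flow-0.0.4.tar.gz and the SHA256 value from its table; a result of `False` means the file differs from the one that was published.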