
ETL for Banco Macro: downloads a PDF from the URL you provide, extracts its data, and creates chunks that it uploads to S3 (AWS).

Project description

Macro Flow / PDF Processing Flow

macro_flow is a modular Python library for ETL pipelines over PDF documents. It fetches a PDF from a URL, transforms it into text chunks, and stores them in a format optimized for later use in embeddings or RAG (Retrieval-Augmented Generation) workflows.


🚀 Features

  • Download PDFs from external URLs.
  • Transform PDFs into structured text chunks.
  • Export to Parquet for efficient analysis or model training.
  • Optional upload to Amazon S3 or other destinations.
  • Flexible configuration via .env.
  • Usable as a Python library or from the command line.
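
To make the chunking step concrete, here is a minimal sketch of fixed-size chunking with overlap. It is not macro_flow's actual implementation: the function name `chunk_text`, the character-based splitting, and the `chunk_size` and `overlap` parameters are all assumptions chosen for illustration.

```python
from typing import List


def chunk_text(text: str, chunk_size: int = 500, overlap: int = 50) -> List[str]:
    """Split text into character chunks of at most chunk_size,
    where consecutive chunks share `overlap` characters.

    Hypothetical helper for illustration; macro_flow may chunk differently.
    """
    if chunk_size <= 0:
        raise ValueError("chunk_size must be positive")
    if not 0 <= overlap < chunk_size:
        raise ValueError("overlap must satisfy 0 <= overlap < chunk_size")

    step = chunk_size - overlap  # how far the window advances each time
    chunks: List[str] = []
    for start in range(0, len(text), step):
        chunks.append(text[start:start + chunk_size])
        # Stop once the current window has reached the end of the text,
        # so no trailing chunk made only of overlap is produced.
        if start + chunk_size >= len(text):
            break
    return chunks
```

Overlap keeps sentences that straddle a boundary visible in both neighboring chunks, which tends to help retrieval in RAG-style workflows at the cost of some duplicated storage.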

⚙️ Required environment variables

Define a .env file in the root of your project with the following variables:

AWS_ACCESS_KEY_ID=
AWS_SECRET_ACCESS_KEY=
AWS_DEFAULT_REGION=us-east-1 
AWS_BUCKET_NAME=chunks-data-parquet-2025
AWS_S3_PREFIX=chunks-s3
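
A quick way to catch a missing or empty setting before running the pipeline is to check the variables listed above. The sketch below assumes all five are required and that they have already been loaded into the process environment (for example by a .env loader); the helper name `missing_settings` is hypothetical and not part of macro_flow.

```python
import os
from typing import List, Mapping, Optional

# Variable names taken from the .env listing above.
REQUIRED_VARS = (
    "AWS_ACCESS_KEY_ID",
    "AWS_SECRET_ACCESS_KEY",
    "AWS_DEFAULT_REGION",
    "AWS_BUCKET_NAME",
    "AWS_S3_PREFIX",
)


def missing_settings(env: Optional[Mapping[str, str]] = None) -> List[str]:
    """Return the names of required variables that are unset or blank.

    Hypothetical helper: it only inspects a mapping (os.environ by default)
    and does not read the .env file itself.
    """
    env = os.environ if env is None else env
    return [name for name in REQUIRED_VARS if not env.get(name, "").strip()]
```

An empty list means every variable is present; otherwise the returned names can be reported to the user before any download or upload is attempted.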

Command-line usage:

macroflow --url https://example.com/documento.pdf

Python usage:

from macro_flow.main import MacroEtl

etl = MacroEtl(url="https://example.com/documento.pdf")
etl.run()
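
Since both entry points take a URL, a lightweight check before starting the pipeline can catch obvious mistakes early. The following is a sketch using only the standard library; `looks_like_pdf_url` is a hypothetical helper, and whether macro_flow performs its own validation is not stated here.

```python
from urllib.parse import urlparse


def looks_like_pdf_url(url: str) -> bool:
    """Heuristic check: http(s) scheme, a host, and a path ending in .pdf.

    Hypothetical helper. Some servers deliver PDFs from URLs without a
    .pdf suffix, so a False result is a warning, not proof of a bad URL.
    """
    parsed = urlparse(url)
    return (
        parsed.scheme in ("http", "https")
        and bool(parsed.netloc)
        and parsed.path.lower().endswith(".pdf")
    )
```

A caller could run this check first and only construct `MacroEtl(url=...)` when it passes, or log a warning and proceed anyway.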

🧠 Requirements

  • Python >= 3.9
  • Configured .env file
  • Internet connection to download PDFs
  • Valid AWS credentials (if using S3)

🔮 Use cases

  • Preprocessing documents for embeddings.
  • Building RAG pipelines from PDF corpora.
  • Integration with cloud storage systems.

🪪 License

MIT © Facu Vega https://github.com/facuvegaingenieer
