
ETL for Banco Macro: downloads a PDF from the URL you provide, extracts its data, and creates chunks that it uploads to S3 (AWS).


Macro Flow / PDF Processing Flow

macro_flow is a modular Python library for ETL pipelines over PDF documents. It downloads a PDF from a URL, splits it into text chunks, and stores them in a format suited to later use in embeddings or RAG (Retrieval-Augmented Generation) workflows.
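The chunking step can be illustrated with a small, self-contained sketch. This README does not document how macro_flow actually splits text, so the function name `chunk_text`, the fixed character window, and the `chunk_size` and `overlap` parameters are all assumptions made for illustration only.

```python
from typing import List


def chunk_text(text: str, chunk_size: int = 500, overlap: int = 50) -> List[str]:
    """Split text into fixed-size character chunks that overlap by `overlap` characters.

    Hypothetical helper: not part of macro_flow's documented API.
    """
    if chunk_size <= 0:
        raise ValueError("chunk_size must be positive")
    if not 0 <= overlap < chunk_size:
        raise ValueError("overlap must satisfy 0 <= overlap < chunk_size")

    # Collapse runs of whitespace, which PDF text extraction tends to produce.
    normalized = " ".join(text.split())
    step = chunk_size - overlap

    chunks = []
    for start in range(0, len(normalized), step):
        chunks.append(normalized[start:start + chunk_size])
        # Stop once this chunk reaches the end of the text.
        if start + chunk_size >= len(normalized):
            break
    return chunks
```

Overlapping windows are a common choice for RAG pipelines because a sentence cut at a chunk boundary still appears whole in one of the two neighbouring chunks. A real pipeline might split on sentences or tokens instead of characters.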


🚀 Features

  • Downloads PDFs from external URLs.
  • Transforms PDFs into structured text chunks.
  • Exports to Parquet for efficient analysis or model training.
  • Optional upload to Amazon S3 or other destinations.
  • Flexible configuration via .env.
  • Usable as a Python library or from the command line.
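To make the S3 upload step more concrete, here is a rough sketch of how an object key for the Parquet output could be derived from the source URL. The naming scheme and the helper `build_s3_key` are hypothetical. This README does not say how macro_flow names the objects it uploads. The default prefix is the example value of AWS_S3_PREFIX used in this README.

```python
import posixpath
from urllib.parse import unquote, urlparse


def build_s3_key(pdf_url: str, prefix: str = "chunks-s3") -> str:
    """Derive an S3 object key such as 'chunks-s3/documento.parquet' from a PDF URL.

    Hypothetical naming scheme: not taken from macro_flow itself.
    """
    # Use only the path component, so query strings and fragments are ignored.
    path = unquote(urlparse(pdf_url).path)
    # Fall back to a generic name when the URL path has no file name.
    filename = posixpath.basename(path) or "document.pdf"
    stem, _extension = posixpath.splitext(filename)
    # S3 keys use forward slashes, so posixpath is used on every platform.
    return posixpath.join(prefix.strip("/"), f"{stem}.parquet")
```

For the URL used elsewhere in this README, `build_s3_key("https://example.com/documento.pdf")` gives `chunks-s3/documento.parquet`.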

⚙️ Required environment variables

Create a .env file in the root of your project with the following variables:

AWS_ACCESS_KEY_ID=
AWS_SECRET_ACCESS_KEY=
AWS_DEFAULT_REGION=us-east-1 
AWS_BUCKET_NAME=chunks-data-parquet-2025
AWS_S3_PREFIX=chunks-s3
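Before running the pipeline, it can help to check that none of these variables is missing or empty. The sketch below uses only the standard library. The helpers `parse_env` and `missing_vars` are hypothetical and are not part of macro_flow, whose own loading mechanism is not described in this README.

```python
from typing import Dict, List

# Variable names taken from the .env example above.
REQUIRED_VARS = (
    "AWS_ACCESS_KEY_ID",
    "AWS_SECRET_ACCESS_KEY",
    "AWS_DEFAULT_REGION",
    "AWS_BUCKET_NAME",
    "AWS_S3_PREFIX",
)


def parse_env(text: str) -> Dict[str, str]:
    """Parse simple KEY=VALUE lines, skipping blank lines and # comments."""
    values: Dict[str, str] = {}
    for raw_line in text.splitlines():
        line = raw_line.strip()
        if not line or line.startswith("#") or "=" not in line:
            continue
        key, _, value = line.partition("=")
        # Trim surrounding whitespace and optional quotes around the value.
        values[key.strip()] = value.strip().strip('"').strip("'")
    return values


def missing_vars(values: Dict[str, str]) -> List[str]:
    """Return the required variable names that are absent or empty."""
    return [name for name in REQUIRED_VARS if not values.get(name)]
```

Applied to the template above, where the two credential fields are left blank, `missing_vars` would report AWS_ACCESS_KEY_ID and AWS_SECRET_ACCESS_KEY.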

Command-line usage:

macroflow --url https://example.com/documento.pdf

Usage as a Python library:

from macro_flow.main import MacroEtl

etl = MacroEtl(url="https://example.com/documento.pdf")
etl.run()

🧠 Requirements

  • Python >= 3.9
  • Configured .env file
  • Internet connection to download PDFs
  • Valid AWS credentials (if using S3)

🔮 Use cases

  • Preprocessing documents for embeddings.
  • Building RAG pipelines from PDF corpora.
  • Integration with cloud storage systems.

🪪 License

MIT © Facu Vega https://github.com/facuvegaingenieer

