
ETL for Banco Macro: downloads a PDF from the URL you provide, extracts its data, and creates chunks that it uploads to S3 (AWS).

Project description

Macro Flow / PDF Processing Flow

macro_flow is a modular Python library for ETL pipelines over PDF documents. It downloads a PDF from a URL, splits it into text chunks, and stores them in a format suited to later use in embeddings or RAG (Retrieval-Augmented Generation) workflows.


🚀 Features

  • Downloads PDFs from external URLs.
  • Transforms PDFs into structured text chunks.
  • Exports to Parquet for efficient analysis or model training.
  • Optional upload to Amazon S3 or other destinations.
  • Flexible configuration through a .env file.
  • Usable as a Python library or from the command line.
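
The chunking step listed above can be pictured with a small sketch. The package description does not say how macro_flow splits text, so the fixed-size character windows with overlap used here are an assumption, and the names `chunk_text` and `to_records` are hypothetical helpers, not part of the library.

```python
from typing import List


def chunk_text(text: str, chunk_size: int = 500, overlap: int = 50) -> List[str]:
    """Split text into fixed-size character windows that overlap slightly.

    Assumption: macro_flow's real chunking strategy is undocumented; this is
    only one common approach.
    """
    if chunk_size <= 0:
        raise ValueError("chunk_size must be positive")
    if not 0 <= overlap < chunk_size:
        raise ValueError("overlap must be in [0, chunk_size)")

    step = chunk_size - overlap
    chunks = []
    for start in range(0, len(text), step):
        chunks.append(text[start:start + chunk_size])
        if start + chunk_size >= len(text):
            break  # this window already reached the end of the text
    return chunks


def to_records(chunks: List[str], source_url: str) -> List[dict]:
    """Attach simple metadata so each chunk can become one row in a table."""
    return [
        {"source_url": source_url, "chunk_id": i, "text": chunk}
        for i, chunk in enumerate(chunks)
    ]
```

A list of records shaped like this maps naturally onto one Parquet row per chunk, which is the export format the feature list mentions.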

⚙️ Required environment variables

Define a .env file in the root of your project with the following variables:

AWS_ACCESS_KEY_ID=
AWS_SECRET_ACCESS_KEY=
AWS_DEFAULT_REGION=us-east-1 
AWS_BUCKET_NAME=chunks-data-parquet-2025
AWS_S3_PREFIX=chunks-s3
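
Before running the pipeline it can help to check that these variables are actually set. The sketch below reads them from the process environment and fails early if one is missing or empty. It assumes the .env file has already been loaded into the environment (for example by a dotenv loader), and `read_s3_settings` is a hypothetical helper, not a macro_flow function. Whether the library itself requires all five values when S3 upload is skipped is not stated in the description.

```python
import os
from typing import Mapping, Optional

# The five variable names shown in the .env example above.
REQUIRED_VARS = (
    "AWS_ACCESS_KEY_ID",
    "AWS_SECRET_ACCESS_KEY",
    "AWS_DEFAULT_REGION",
    "AWS_BUCKET_NAME",
    "AWS_S3_PREFIX",
)


def read_s3_settings(env: Optional[Mapping[str, str]] = None) -> dict:
    """Return the variables as a dict, raising if any is missing or empty."""
    env = os.environ if env is None else env
    # strip() tolerates stray whitespace around a value in the .env file.
    values = {name: env.get(name, "").strip() for name in REQUIRED_VARS}
    missing = [name for name, value in values.items() if not value]
    if missing:
        raise RuntimeError("Missing environment variables: " + ", ".join(missing))
    return values
```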

Command-line usage:

macroflow --url https://example.com/documento.pdf

Python usage:

from macro_flow.main import MacroEtl

etl = MacroEtl(url="https://example.com/documento.pdf")
etl.run()

🧠 Requirements

  • Python >= 3.9
  • Configured .env file
  • Internet connection to download PDFs
  • Valid AWS credentials (if using S3)
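
To illustrate the download requirement, here is a minimal sketch of fetching a PDF with the standard library and rejecting responses that are not PDFs. It is not macro_flow's implementation, which the description does not show. `download_pdf` and `looks_like_pdf` are hypothetical names, and the check relies only on the fact that PDF files normally begin with the `%PDF-` signature.

```python
import urllib.request

PDF_MAGIC = b"%PDF-"


def looks_like_pdf(data: bytes) -> bool:
    """Return True if the bytes start with the PDF signature '%PDF-'."""
    return data.startswith(PDF_MAGIC)


def download_pdf(url: str, timeout: float = 30.0) -> bytes:
    """Fetch the URL and return its bytes, rejecting non-PDF responses."""
    with urllib.request.urlopen(url, timeout=timeout) as response:
        data = response.read()
    if not looks_like_pdf(data):
        raise ValueError(f"{url} did not return a PDF document")
    return data
```

Checking the signature up front catches the common case where a URL returns an HTML error page instead of the document.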

🔮 Use cases

  • Preprocessing documents for embeddings.
  • Building RAG pipelines from PDF corpora.
  • Integration with cloud storage systems.

🪪 License

MIT © Facu Vega https://github.com/facuvegaingenieer

Download files

Source Distribution

macro_flow-0.0.2.tar.gz (5.7 kB)

Uploaded Source

Built Distribution

macro_flow-0.0.2-py3-none-any.whl (7.7 kB)

Uploaded Python 3

File details

Details for the file macro_flow-0.0.2.tar.gz.

File metadata

  • Download URL: macro_flow-0.0.2.tar.gz
  • Upload date:
  • Size: 5.7 kB
  • Tags: Source
  • Uploaded using Trusted Publishing? No
  • Uploaded via: twine/6.2.0 CPython/3.9.23

File hashes

Hashes for macro_flow-0.0.2.tar.gz
Algorithm Hash digest
SHA256 e54e86ac0f29a65d2f5f1e9231cdaa58580093ffa9c20b4468c62f6bfd336bb7
MD5 ddc57284e754b22623407b1d51c56932
BLAKE2b-256 5de74d40f441ecf456c1628d71f30d5fe5c29ccd5edb82ad3ad4dbfa21256696

File details

Details for the file macro_flow-0.0.2-py3-none-any.whl.

File metadata

  • Download URL: macro_flow-0.0.2-py3-none-any.whl
  • Upload date:
  • Size: 7.7 kB
  • Tags: Python 3
  • Uploaded using Trusted Publishing? No
  • Uploaded via: twine/6.2.0 CPython/3.9.23

File hashes

Hashes for macro_flow-0.0.2-py3-none-any.whl
Algorithm Hash digest
SHA256 20cc5558c1e0c2e81cd6ff0db4b4a8061a5f1452a329535d1772f17285aaa16a
MD5 95e9b22596c9bf733e041f72d3384248
BLAKE2b-256 24e48783f62aa0fea14acc9d25fd8951888cb1861dbd4090af59ac9d91ba71c6
