ETL for Banco Macro: transforms chunks of bank PDFs into 768-dimensional embeddings and uploads them to S3.

Project description

macro-embedding-flow

macro-embedding-flow is a Python library for transforming Parquet files into 768-dimensional embeddings using Sentence Transformers models and uploading them back to S3 after processing.


English

Description

This library allows you to process a Parquet chunk file from an S3 bucket and generate 768-dimensional embeddings. The result is saved as a new Parquet file and uploaded to the same S3 bucket, in the path specified by the AWS_S3_PREFIX_EMBEDDINGS environment variable.
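
The description does not say how the output location is built from the input URL. As a rough sketch only, assuming the output keeps the input file name and is placed under the prefix taken from AWS_S3_PREFIX_EMBEDDINGS, the mapping might look like the following. The helpers parse_s3_url and build_output_key are hypothetical and are not part of the library's documented API.

```python
import posixpath
from urllib.parse import urlparse


def parse_s3_url(s3_url):
    """Split an s3://bucket/key URL into (bucket, key)."""
    parsed = urlparse(s3_url)
    if parsed.scheme != "s3" or not parsed.netloc:
        raise ValueError(f"Not a valid S3 URL: {s3_url!r}")
    return parsed.netloc, parsed.path.lstrip("/")


def build_output_key(s3_url, prefix):
    """Hypothetical rule: keep the input file name, place it under the prefix."""
    _, key = parse_s3_url(s3_url)
    filename = posixpath.basename(key)
    return posixpath.join(prefix.strip("/"), filename)


bucket, key = parse_s3_url("s3://mi-bucket/path/al/chunk.parquet")
output_key = build_output_key("s3://mi-bucket/path/al/chunk.parquet", "embeddings/")
```

The actual naming rule used by the library may differ; check the uploaded objects in your bucket to confirm.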

Installation

pip install macro-embedding-flow

Set the following environment variables before using the library:

AWS_ACCESS_KEY_ID=
AWS_SECRET_ACCESS_KEY=
AWS_DEFAULT_REGION=
AWS_BUCKET_NAME=
AWS_S3_PREFIX_EMBEDDINGS=
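
Because the library depends on these variables, it can help to verify them before starting a job. The snippet below is a minimal sketch using only the standard library; the helper missing_env_vars is hypothetical, and whether the library itself treats every variable as mandatory is an assumption.

```python
import os

# Variable names taken from the list above.
REQUIRED_VARS = (
    "AWS_ACCESS_KEY_ID",
    "AWS_SECRET_ACCESS_KEY",
    "AWS_DEFAULT_REGION",
    "AWS_BUCKET_NAME",
    "AWS_S3_PREFIX_EMBEDDINGS",
)


def missing_env_vars(environ=None):
    """Return the required variables that are unset or empty."""
    env = os.environ if environ is None else environ
    return [name for name in REQUIRED_VARS if not env.get(name)]


missing = missing_env_vars()
if missing:
    print("Missing environment variables: " + ", ".join(missing))
```

Running this check first surfaces a configuration problem before any S3 access or embedding work begins.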

Basic usage

from macro_embedding_flow import transform

# S3 URL of the parquet chunk file
s3_url = "s3://my-bucket/path/to/chunk.parquet"

# Transform and upload embeddings
transform(s3_url)
