ETL for Banco Macro: transforms chunks extracted from bank PDFs into 768-dimensional embeddings and uploads them to S3.
Project description
macro-embedding-flow
macro-embedding-flow is a Python library for transforming Parquet files into 768-dimensional embeddings using Sentence Transformers models and uploading them back to S3 after processing.
Description
This library allows you to process a Parquet chunk file from an S3 bucket and generate 768-dimensional embeddings. The result is saved as a new Parquet file and uploaded to the same S3 bucket, in the path specified by the AWS_S3_PREFIX_EMBEDDINGS environment variable.
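The path handling described above can be sketched as follows. This is only an illustration: `parse_s3_url` and `build_output_key` are hypothetical helper names, not part of the library's documented API, and the assumption that the output keeps the input file name under the configured prefix is not confirmed by the source.

```python
import posixpath
from urllib.parse import urlparse


def parse_s3_url(s3_url):
    """Split an s3://bucket/key URL into (bucket, key)."""
    parsed = urlparse(s3_url)
    if parsed.scheme != "s3" or not parsed.netloc or not parsed.path.strip("/"):
        raise ValueError(f"Not a valid S3 URL: {s3_url!r}")
    return parsed.netloc, parsed.path.lstrip("/")


def build_output_key(input_key, prefix):
    """Place the input file name under the embeddings prefix.

    Assumption: the output file reuses the input file name.
    """
    filename = posixpath.basename(input_key)
    return posixpath.join(prefix.strip("/"), filename)
```

For example, with `AWS_S3_PREFIX_EMBEDDINGS=embeddings/`, the key `path/to/chunk.parquet` would map to `embeddings/chunk.parquet` under this sketch.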
Installation
pip install macro-embedding-flow
Set the following environment variables before using the library:
AWS_ACCESS_KEY_ID=
AWS_SECRET_ACCESS_KEY=
AWS_DEFAULT_REGION=
AWS_BUCKET_NAME=
AWS_S3_PREFIX_EMBEDDINGS=
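Because a missing variable would otherwise only surface in the middle of a run, it may help to check the configuration up front. The sketch below is an assumption-laden example: `load_config` and `REQUIRED_VARS` are hypothetical names, and whether the library itself treats all five variables as mandatory is not stated in the source.

```python
import os

# Variable names taken from the list above; treating all of them
# as required is an assumption made for this sketch.
REQUIRED_VARS = (
    "AWS_ACCESS_KEY_ID",
    "AWS_SECRET_ACCESS_KEY",
    "AWS_DEFAULT_REGION",
    "AWS_BUCKET_NAME",
    "AWS_S3_PREFIX_EMBEDDINGS",
)


def load_config(env=None):
    """Return the required settings, or raise if any is unset or empty."""
    env = os.environ if env is None else env
    missing = [name for name in REQUIRED_VARS if not env.get(name)]
    if missing:
        raise RuntimeError("Missing environment variables: " + ", ".join(missing))
    return {name: env[name] for name in REQUIRED_VARS}
```

Calling `load_config()` with no arguments reads from the process environment; passing a dictionary makes the check easy to exercise in tests.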
## Basic usage
from macro_embedding_flow import transform
# S3 URL of the parquet chunk file
s3_url = "s3://my-bucket/path/to/chunk.parquet"
# Transform and upload embeddings
transform(s3_url)
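When several chunk files need processing, a small wrapper can call the function once per URL and record which ones failed. This is a minimal sketch: `transform_many` is a hypothetical helper, and it assumes the transform function signals failure by raising an exception, which the source does not confirm.

```python
def transform_many(s3_urls, transform_fn):
    """Apply transform_fn to each URL and report successes and failures.

    transform_fn is passed in so that the library's transform (or a stub
    in tests) can be used. Assumption: it raises an exception on failure.
    """
    succeeded = []
    failed = {}
    for url in s3_urls:
        try:
            transform_fn(url)
        except Exception as exc:  # keep going so one bad file does not stop the batch
            failed[url] = str(exc)
        else:
            succeeded.append(url)
    return succeeded, failed
```

With the library installed, this could be used as `transform_many(urls, transform)`, where `transform` is imported from `macro_embedding_flow` as shown above.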
Project details
File details
Details for the file macro_embedding_flow-0.0.1.tar.gz.
File metadata
- Download URL: macro_embedding_flow-0.0.1.tar.gz
- Upload date:
- Size: 6.1 kB
- Tags: Source
- Uploaded using Trusted Publishing? No
- Uploaded via: twine/6.2.0 CPython/3.9.23
File hashes
| Algorithm | Hash digest |
|---|---|
| SHA256 | 9fbd3217de416954d11e702eb53ac4bc80160daf4ac5a1542bed7dbcd32210a3 |
| MD5 | 56edf3c4269435645a9f00bff9b06161 |
| BLAKE2b-256 | ef6a39839002974a844b6d608d579d337c9ae8a4ce6eaaaa6b1cbd5db0d24982 |
File details
Details for the file macro_embedding_flow-0.0.1-py3-none-any.whl.
File metadata
- Download URL: macro_embedding_flow-0.0.1-py3-none-any.whl
- Upload date:
- Size: 8.3 kB
- Tags: Python 3
- Uploaded using Trusted Publishing? No
- Uploaded via: twine/6.2.0 CPython/3.9.23
File hashes
| Algorithm | Hash digest |
|---|---|
| SHA256 | f4334e3c751045510fce7351263ce28e1ee02b5b45c55b788b961fab40f481ba |
| MD5 | d3aa047bf640ec35880b29692dd0d510 |
| BLAKE2b-256 | aacb9dbe3982842323341879458cf108c35705962ae86b21b3f84d09c2919af4 |