Contextual RAG with Cloud Solutions

Project description

wizit_context_ingestor

A powerful document processing and ingestion system that leverages AI services for document transcription, analysis, and semantic chunking.

Features

  • Document transcription using AWS and Google Cloud AI services
  • Semantic chunking of documents for better context understanding
  • Vector storage integration with PostgreSQL
  • Support for both local and cloud storage (S3)
  • Synthetic data generation capabilities
  • RAG (Retrieval-Augmented Generation) implementation

Prerequisites

  • Python 3.11 or higher
  • Poetry for dependency management
  • AWS credentials (for AWS services)
  • Google Cloud credentials (for GCP services)
  • PostgreSQL database (for vector storage)
  • Supabase account (for data storage)

Installation

  1. Clone the repository:
    git clone https://github.com/yourusername/mega-ingestor.git
    cd mega-ingestor
  2. Install dependencies using Poetry:
    poetry install
  3. Set up your environment variables by copying the example.env file:
    cp example.env .env
  4. Fill in the .env file with your credentials and configuration.
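A missing credential is easier to catch before the pipeline starts than halfway through a transcription. The sketch below is a minimal, standard-library-only check of a .env file. The variable names in REQUIRED_VARS are hypothetical placeholders (the real keys are whatever example.env defines), and the helper functions are illustrative rather than part of the package.

```python
from pathlib import Path

# Hypothetical variable names for illustration only; replace them with
# the keys that actually appear in example.env.
REQUIRED_VARS = ("AWS_ACCESS_KEY_ID", "AWS_SECRET_ACCESS_KEY", "PG_CONNECTION_STRING")


def parse_env_text(text: str) -> dict[str, str]:
    """Parse simple KEY=VALUE lines, skipping blank lines and # comments."""
    values: dict[str, str] = {}
    for raw_line in text.splitlines():
        line = raw_line.strip()
        if not line or line.startswith("#") or "=" not in line:
            continue
        key, _, value = line.partition("=")
        # Drop surrounding whitespace and optional quotes around the value.
        values[key.strip()] = value.strip().strip('"').strip("'")
    return values


def missing_vars(values: dict[str, str], required=REQUIRED_VARS) -> list[str]:
    """Return the required keys that are absent or empty."""
    return [name for name in required if not values.get(name)]


if __name__ == "__main__":
    env_path = Path(".env")
    if env_path.exists():
        missing = missing_vars(parse_env_text(env_path.read_text()))
        print("Missing variables:", ", ".join(missing) or "none")
    else:
        print("No .env file found; copy example.env first.")
```

The parser handles only plain KEY=VALUE lines; multi-line values or shell-style expansion would need a dedicated library.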

Usage

The project provides several main functionalities:

Document Transcription

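# cloud_transcribe_document is assumed to be exported by main.py alongside transcribe_document
from main import cloud_transcribe_document, transcribe_document

# Transcribe a document using AWS services
transcribe_document("your-document.pdf")

# Transcribe a document using Google Cloud services
cloud_transcribe_document("your-document.pdf")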

Context Chunking

from main import context_chunks_in_document

# Get semantic chunks from a document
context_chunks_in_document("your-document.pdf")
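To make the idea of context-aware chunks concrete, the sketch below shows the general pattern behind contextual chunking: split the text into pieces, then attach a short document-level context to each piece before it is embedded. This is not the package's implementation. It uses a naive paragraph packer as a stand-in for AI-driven semantic splitting, and the names Chunk, split_paragraphs, and add_context are hypothetical.

```python
from dataclasses import dataclass


@dataclass
class Chunk:
    index: int
    text: str      # the raw chunk text
    context: str   # short note on where the chunk sits in the document

    def for_embedding(self) -> str:
        """Text that would be passed to an embedding model."""
        return f"{self.context}\n\n{self.text}"


def split_paragraphs(document: str, max_chars: int = 500) -> list[str]:
    """Naive stand-in for semantic splitting: pack paragraphs up to max_chars."""
    pieces: list[str] = []
    current = ""
    for paragraph in (p.strip() for p in document.split("\n\n")):
        if not paragraph:
            continue
        candidate = f"{current}\n\n{paragraph}" if current else paragraph
        if current and len(candidate) > max_chars:
            # Adding this paragraph would overflow, so close the current piece.
            pieces.append(current)
            current = paragraph
        else:
            current = candidate
    if current:
        pieces.append(current)
    return pieces


def add_context(document_title: str, pieces: list[str]) -> list[Chunk]:
    """Attach a one-line context to each piece.

    A template is used here; a contextual-RAG setup would typically have a
    language model write this description instead.
    """
    total = len(pieces)
    return [
        Chunk(i, piece, f"From '{document_title}', part {i + 1} of {total}.")
        for i, piece in enumerate(pieces)
    ]
```

Storing the context separately from the raw text keeps the original chunk available for display while the combined string is what gets embedded.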

Running Memory Profiler

To run the memory profiler, use the following command:

python -m memray run test_redis.py

Project Structure

mega-ingestor/
├── src/
│   ├── application/
│   ├── infra/
│   └── ...
├── data/
├── credentials/
├── main.py
├── app.py
└── pyproject.toml

Dependencies

  • llama-parse
  • langchain-experimental
  • langchain-google-vertexai
  • pymupdf
  • supabase
  • vecs
  • langchain-postgres
  • boto3
  • langchain-aws

Building the Package

    poetry build

Publishing the Package

Export the CodeArtifact authorization token first, then register the repository so that the URL picks up the token:

    export CODEARTIFACT_AUTH_TOKEN=$(aws codeartifact get-authorization-token --domain tbbc-mega-ingestor --domain-owner 411728455297 --region us-east-1 --query authorizationToken --output text --profile <your-profile>)
    poetry config repositories.tbbcmegaingestor https://aws:$CODEARTIFACT_AUTH_TOKEN@tbbc-mega-ingestor-411728455297.d.codeartifact.us-east-1.amazonaws.com/pypi/tbbc-mega-ingestor-lib/

Finally, publish the package:

    poetry publish -r tbbcmegaingestor

Usage Options

For transcriptions

You can provide a number of retries and a transcription quality threshold. (TODO: document these options.)
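Since these options are not documented yet, the sketch below only illustrates how a retry count and a quality threshold usually interact: retry until a transcription scores at or above the threshold, and fall back to the best attempt otherwise. The function name, the retries and quality_threshold parameters, and the transcribe and score callables are all hypothetical; the package's actual interface may differ.

```python
from typing import Callable


def transcribe_with_retries(
    transcribe: Callable[[str], str],
    score: Callable[[str], float],
    path: str,
    retries: int = 2,
    quality_threshold: float = 0.8,
) -> tuple[str, float]:
    """Run one attempt plus up to `retries` more; return the best (text, score)."""
    if retries < 0:
        raise ValueError("retries must be zero or greater")
    best_text, best_score = "", float("-inf")
    for _ in range(retries + 1):
        text = transcribe(path)
        quality = score(text)
        if quality > best_score:
            best_text, best_score = text, quality
        if quality >= quality_threshold:
            # Good enough: stop early instead of spending more calls.
            break
    return best_text, best_score
```

Returning the best attempt rather than the last one means a failed final retry cannot discard a better earlier result.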

License

This project is licensed under the Apache License - see the LICENSE file for details.

TODO

  • Do not transcribe logos
  • Support for more cloud providers

Authors

  • Daniel Quesada (https://github.com/daquesada)
  • Jeison Patiño (https://github.com/jeison-patino)
  • Javier Fernandez (https://github.com/javimaufermu)
  • Esteban Cerón (https://github.com/estebance)

Download files

  • Source Distribution: wizit_context_ingestor-0.5.1b5.tar.gz (24.4 kB)
  • Built Distribution: wizit_context_ingestor-0.5.1b5-py3-none-any.whl (40.8 kB, Python 3)

File hashes

Hashes for wizit_context_ingestor-0.5.1b5.tar.gz

Algorithm    Hash digest
SHA256       f3b962ded97c00a9b0846ec01c44e26bad1f2001afc949ea1e51e92c738ed4b2
MD5          c7b4e4fad733c82bc3e9ca1db283c75a
BLAKE2b-256  db5f987fba8574e3d4fd800a5f1c7d5a3967f0f389247a6e3d63a885341842b8

Hashes for wizit_context_ingestor-0.5.1b5-py3-none-any.whl

Algorithm    Hash digest
SHA256       e79c897c3a5ed252cb89de3ef1437fe9ad517b33d80b3cc8a4fae6f6537cf749
MD5          fa7b31a86d3dfda1f472e6e9f66c714b
BLAKE2b-256  c72ec789f2af605fc336c521a306dfedf6b1d3b27df0e8cd0ef2d1436999f0b9
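The published digests can be checked locally after downloading a file. The sketch below uses only Python's standard hashlib module; the function names are illustrative, and the file path passed in is whatever location the download was saved to.

```python
import hashlib
from pathlib import Path


def sha256_of_file(path: Path, chunk_size: int = 1 << 20) -> str:
    """Return the hex SHA256 digest of a file, reading it in chunks."""
    digest = hashlib.sha256()
    with path.open("rb") as handle:
        # iter() with a sentinel stops when read() returns empty bytes.
        for block in iter(lambda: handle.read(chunk_size), b""):
            digest.update(block)
    return digest.hexdigest()


def matches_expected(path: Path, expected_hex: str) -> bool:
    """Compare a file's digest with a published one (case-insensitive)."""
    return sha256_of_file(path) == expected_hex.strip().lower()
```

Calling matches_expected with the downloaded file and the SHA256 value listed for it should return True for an intact download.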
