Contextual RAG with Cloud Solutions
wizit_context_ingestor
A powerful document processing and ingestion system that leverages AI services for document transcription, analysis, and semantic chunking.
Features
- Document transcription using AWS and Google Cloud AI services
- Semantic chunking of documents for better context understanding
- Vector storage integration with PostgreSQL
- Support for both local and cloud storage (S3)
- Synthetic data generation capabilities
- RAG (Retrieval-Augmented Generation) implementation
Prerequisites
- Python 3.11 or higher
- Poetry for dependency management
- AWS credentials (for AWS services)
- Google Cloud credentials (for GCP services)
- PostgreSQL database (for vector storage)
- Supabase account (for data storage)
Installation
- Clone the repository:
git clone https://github.com/yourusername/mega-ingestor.git
cd mega-ingestor
- Install dependencies using Poetry:
poetry install
- Set up your environment variables by copying the example.env file:
cp example.env .env
- Fill in the .env file with your credentials and configuration.
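Once the .env file is filled in, a quick check can catch missing values before the ingestor runs. The sketch below parses simple `KEY=VALUE` lines and reports which required keys are absent or empty. The key names in `REQUIRED_KEYS` are hypothetical placeholders chosen for this example; the real ones are whatever `example.env` lists.

```python
from pathlib import Path

# Hypothetical key names for illustration; use the keys from example.env.
REQUIRED_KEYS = ["AWS_REGION", "PG_CONNECTION_STRING"]


def parse_env(text: str) -> dict[str, str]:
    """Parse simple KEY=VALUE lines, skipping blanks and # comments."""
    values: dict[str, str] = {}
    for raw in text.splitlines():
        line = raw.strip()
        if not line or line.startswith("#") or "=" not in line:
            continue
        key, _, value = line.partition("=")
        # Drop optional surrounding quotes around the value.
        values[key.strip()] = value.strip().strip('"').strip("'")
    return values


def missing_keys(text: str, required: list[str]) -> list[str]:
    """Return the required keys that are absent or empty."""
    values = parse_env(text)
    return [key for key in required if not values.get(key)]


if __name__ == "__main__":
    env_path = Path(".env")
    text = env_path.read_text() if env_path.exists() else ""
    print("Missing:", missing_keys(text, REQUIRED_KEYS))
```

This parser is deliberately minimal and does not handle multi-line values or `export` prefixes.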
Usage
The project provides several main functionalities:
Document Transcription
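# Assumes cloud_transcribe_document is exposed by main.py alongside transcribe_document
from main import transcribe_document, cloud_transcribe_document
# Transcribe a document using AWS services
transcribe_document("your-document.pdf")
# Transcribe a document using Google Cloud services
cloud_transcribe_document("your-document.pdf")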
Context Chunking
from main import context_chunks_in_document
# Get semantic chunks from a document
context_chunks_in_document("your-document.pdf")
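To illustrate the idea behind semantic chunking, here is a toy sketch that starts a new chunk wherever two adjacent sentences share little vocabulary. It is not the implementation behind `context_chunks_in_document`; real pipelines typically compare embedding vectors rather than word counts, and the function names and the 0.2 threshold are assumptions made for this example.

```python
import math
import re
from collections import Counter


def _vector(sentence: str) -> Counter:
    """Bag-of-words counts, a stand-in for a real embedding."""
    return Counter(re.findall(r"[a-z0-9]+", sentence.lower()))


def _cosine(a: Counter, b: Counter) -> float:
    """Cosine similarity between two word-count vectors."""
    dot = sum(a[w] * b[w] for w in a.keys() & b.keys())
    norm_a = math.sqrt(sum(v * v for v in a.values()))
    norm_b = math.sqrt(sum(v * v for v in b.values()))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0


def toy_semantic_chunks(text: str, threshold: float = 0.2) -> list[str]:
    """Group consecutive sentences; break when similarity drops below threshold."""
    sentences = [s.strip() for s in re.split(r"(?<=[.!?])\s+", text) if s.strip()]
    if not sentences:
        return []
    chunks = [[sentences[0]]]
    for prev, cur in zip(sentences, sentences[1:]):
        if _cosine(_vector(prev), _vector(cur)) < threshold:
            chunks.append([cur])  # topic shift: start a new chunk
        else:
            chunks[-1].append(cur)
    return [" ".join(chunk) for chunk in chunks]


if __name__ == "__main__":
    sample = (
        "The invoice total is 40 dollars. "
        "The invoice total is due in March. "
        "Cats sleep most of the day."
    )
    for chunk in toy_semantic_chunks(sample):
        print(repr(chunk))
```

Comparing only adjacent sentences keeps the sketch short; it ignores longer-range context that an embedding-based chunker would capture.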
Running Memory Profiler
To run the memory profiler, use the following command:
python -m memray run test_redis.py
Project Structure
mega-ingestor/
├── src/
│ ├── application/
│ ├── infra/
│ └── ...
├── data/
├── credentials/
├── main.py
├── app.py
└── pyproject.toml
Dependencies
- llama-parse
- langchain-experimental
- langchain-google-vertexai
- pymupdf
- supabase
- vecs
- langchain-postgres
- boto3
- langchain-aws
Generate the Package with Poetry
poetry build
Publish the Package
Export the CodeArtifact token first, because the shell expands it into the repository URL when the poetry config command runs:
export CODEARTIFACT_AUTH_TOKEN=$(aws codeartifact get-authorization-token --domain tbbc-mega-ingestor --domain-owner 411728455297 --region us-east-1 --query authorizationToken --output text --profile <your-profile>)
poetry config repositories.tbbcmegaingestor https://aws:$CODEARTIFACT_AUTH_TOKEN@tbbc-mega-ingestor-411728455297.d.codeartifact.us-east-1.amazonaws.com/pypi/tbbc-mega-ingestor-lib/
Finally, publish to the configured repository:
poetry publish -r tbbcmegaingestor
Transcription Options
TODO: document how to provide the number of retries and a transcription quality threshold.
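Since these options are not documented yet, the following is only a sketch of how a retry count and a quality threshold could interact. `transcribe_with_retries`, its parameters, and the assumption that a transcription call returns a `(text, score)` pair are all hypothetical and are not part of the package's documented API.

```python
from typing import Callable, Tuple

# Hypothetical shape: a callable that returns (text, quality score in [0, 1]).
Transcriber = Callable[[str], Tuple[str, float]]


def transcribe_with_retries(
    transcriber: Transcriber,
    path: str,
    max_retries: int = 3,
    quality_threshold: float = 0.8,
) -> Tuple[str, float]:
    """Call `transcriber` until the score meets the threshold or attempts run out.

    Returns the best-scoring attempt seen, even if it is below the threshold.
    """
    if max_retries < 1:
        raise ValueError("max_retries must be at least 1")
    best_text, best_score = "", float("-inf")
    for _ in range(max_retries):
        text, score = transcriber(path)
        if score > best_score:
            best_text, best_score = text, score
        if score >= quality_threshold:
            break  # good enough, stop early
    return best_text, best_score


if __name__ == "__main__":
    # Fake transcriber that improves on each call, for demonstration only.
    scores = iter([0.5, 0.7, 0.9])

    def fake(path: str) -> Tuple[str, float]:
        return f"text of {path}", next(scores)

    print(transcribe_with_retries(fake, "your-document.pdf"))
```

Returning the best attempt rather than raising on failure is one possible design; a caller that needs a hard guarantee can compare the returned score against the threshold itself.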
License
This project is licensed under the Apache License - see the LICENSE file for details.
TODO
- Do not transcribe logos
- Support for more cloud providers
Authors
- [Daniel Quesada](https://github.com/daquesada)
- [Jeison Patiño](https://github.com/jeison-patino)
- [Javier Fernandez](https://github.com/javimaufermu)
- [Esteban Cerón](https://github.com/estebance)
Download files
File details
Details for the file wizit_context_ingestor-0.4.7.tar.gz.
File metadata
- Download URL: wizit_context_ingestor-0.4.7.tar.gz
- Upload date:
- Size: 30.0 kB
- Tags: Source
- Uploaded using Trusted Publishing? No
- Uploaded via: uv/0.7.9
File hashes
| Algorithm | Hash digest |
|---|---|
| SHA256 | 53264facfa2731fe2d23e53af773b18965d795b0e64977727d06456a1bb5273f |
| MD5 | 700d1ae2feeb4ca72922b8d125c11e8a |
| BLAKE2b-256 | f3952c2bd3f6fd6e4429155ab96abd600a32fa1a7231a8a1f185726b3713c1a8 |
File details
Details for the file wizit_context_ingestor-0.4.7-py3-none-any.whl.
File metadata
- Download URL: wizit_context_ingestor-0.4.7-py3-none-any.whl
- Upload date:
- Size: 48.8 kB
- Tags: Python 3
- Uploaded using Trusted Publishing? No
- Uploaded via: uv/0.7.9
File hashes
| Algorithm | Hash digest |
|---|---|
| SHA256 | 52923c2f6b01ccbd9598f148c9453d3ae76c0afe9b46b6528ead12c6c5dae2f2 |
| MD5 | 3be7c6bbd49806a495f0ed95a1aef4d9 |
| BLAKE2b-256 | c1585dff4da5466c911bdf6f6a513cb15461ade81daed2642c40eb0c0966ad76 |
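The published digests can be used to verify a downloaded file. The sketch below uses only the standard library and checks the wheel against the SHA256 digest listed above; it assumes the file has already been downloaded to the working directory, and the `sha256_of` helper name is chosen for this example.

```python
import hashlib
from pathlib import Path

# SHA256 digest listed above for the 0.4.7 wheel.
EXPECTED_SHA256 = "52923c2f6b01ccbd9598f148c9453d3ae76c0afe9b46b6528ead12c6c5dae2f2"


def sha256_of(path: Path, chunk_size: int = 1 << 20) -> str:
    """Hash the file in chunks so large files are not loaded into memory."""
    digest = hashlib.sha256()
    with path.open("rb") as handle:
        for block in iter(lambda: handle.read(chunk_size), b""):
            digest.update(block)
    return digest.hexdigest()


if __name__ == "__main__":
    wheel = Path("wizit_context_ingestor-0.4.7-py3-none-any.whl")
    if wheel.exists():
        print("OK" if sha256_of(wheel) == EXPECTED_SHA256 else "MISMATCH")
    else:
        print(f"{wheel} not found; download it first")
```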