Vectorization & RAG Toolkit

veco-ai

veco-ai is a Python toolkit (Python 3.10-3.11) that converts a broad range of document types (text, PDF, Word, PowerPoint, images, audio, and video) into vector representations that can be queried through Retrieval-Augmented Generation (RAG).
Embeddings are stored inside a FAISS index and can optionally be persisted to JSON (fallback), SQLite, or MongoDB. The integrated RAG interface lets you query knowledge bases via local Ollama models.
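
To make the persistence idea concrete, here is a minimal sketch of how chunk texts and their embeddings could be written to and read back from SQLite using only the standard library. The table name, column layout, and the functions save_vectors() and load_vectors() are assumptions made for illustration; the schema veco-ai actually uses is not described here.

```python
import sqlite3
from array import array


def save_vectors(conn, records):
    """Store (source, text, embedding) tuples; embeddings are packed as float32 bytes.

    Hypothetical schema for illustration only, not veco-ai's real layout.
    """
    conn.execute(
        "CREATE TABLE IF NOT EXISTS chunks ("
        "id INTEGER PRIMARY KEY, "
        "source TEXT NOT NULL, "
        "text TEXT NOT NULL, "
        "embedding BLOB NOT NULL)"
    )
    conn.executemany(
        "INSERT INTO chunks (source, text, embedding) VALUES (?, ?, ?)",
        [(src, text, array("f", vec).tobytes()) for src, text, vec in records],
    )
    conn.commit()


def load_vectors(conn):
    """Return all stored rows as (source, text, embedding-as-list) in insertion order."""
    rows = conn.execute(
        "SELECT source, text, embedding FROM chunks ORDER BY id"
    ).fetchall()
    result = []
    for src, text, blob in rows:
        vec = array("f")
        vec.frombytes(blob)  # unpack the float32 bytes back into numbers
        result.append((src, text, list(vec)))
    return result
```

After loading, the vectors would typically be added to an in-memory index again, since the index itself is rebuilt from the stored embeddings in this sketch.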

Features

  • Automatic input detection for text, PDF, Word, PowerPoint, images, audio, and video
  • Text extraction via pdfplumber, python-docx, python-pptx, pytesseract, moviepy, and whisper
  • Speaker diarization (optional, via veco_diarization.py)
  • Vision extensions:
    • OCR via pytesseract
    • CNN classification (torchvision ResNet)
    • External image captioning (optional, via veco_pic_describe.py)
  • Chunking with overlap for RAG-ready embeddings
  • Optional summaries generated with Ollama models (stored separately, never used as embedding input)
  • FAISS index for efficient retrieval
  • Persistence backends: JSON (fallback, stand-alone), SQLite, or MongoDB
  • RAG queries: end-to-end helper query() that retrieves context and requests an answer from an Ollama model
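
The "chunking with overlap" feature listed above can be illustrated with a small word-based splitter. This is a sketch under assumptions: the function name chunk_text(), its parameters, and the choice to count words rather than characters or tokens are all hypothetical and may differ from what veco-ai does internally.

```python
def chunk_text(text, chunk_size=200, overlap=50):
    """Split text into word chunks where consecutive chunks share `overlap` words."""
    if chunk_size <= 0:
        raise ValueError("chunk_size must be positive")
    if not 0 <= overlap < chunk_size:
        raise ValueError("overlap must satisfy 0 <= overlap < chunk_size")

    words = text.split()
    step = chunk_size - overlap  # how far the window advances each time
    chunks = []
    for start in range(0, len(words), step):
        chunks.append(" ".join(words[start:start + chunk_size]))
        if start + chunk_size >= len(words):
            break  # the last window already reached the end of the text
    return chunks
```

The overlap means a sentence that straddles a chunk boundary still appears whole in at least one chunk, which tends to help retrieval quality at the cost of some duplicated text.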

Project Structure

.
|-- veco_ai/
|   |-- __init__.py
|   |-- veco_ai.py           # Core vectorization library
|   |-- veco_diarization.py  # Optional speaker diarization pipeline
|   `-- veco_pic_describe.py # Optional image captioning helpers
|-- test/
|   `-- veco_test.py         # Example usage script
|-- requirements.txt
|-- pyproject.toml / setup.py
|-- test_data/               # Sample files for testing
|-- vector_db.json           # Example JSON database (fallback storage)
`-- UML/                     # Architecture diagrams

Dependencies

  • Python 3.10 or 3.11
  • torch, torchaudio, torchvision (CPU wheels via PyPI; follow the official PyTorch guide for CUDA)
  • sentence-transformers
  • faiss-cpu
  • openai-whisper
  • pdfplumber
  • pytesseract
  • pillow
  • moviepy
  • python-docx
  • python-pptx
  • numpy and scipy
  • ollama
  • webrtcvad-wheels, librosa, soundfile, speechbrain
  • (See requirements.txt / pyproject.toml for exact versions)

Installation

1. Create a virtual environment

./setup_venv.ps1
# or
python3.11 -m venv .venv
.venv\Scripts\activate      # Windows
source .venv/bin/activate   # Linux/macOS

2. Install the base dependencies

pip install veco_ai

For local development instead of the published wheel, install from source:

pip install -r requirements.txt
# or
pip install -e .

3. Configure PyTorch (optional)

Follow the official PyTorch installation guide for your GPU/CPU setup.
For a CPU-only environment, the default pip installation of the requirements is sufficient.

Usage

Example script (test/veco_test.py)

python test/veco_test.py

The script loads or creates vector_db.json, vectorizes all files in the test_data/ folder, and saves the updated database.
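
The folder loop described above can be sketched as follows. The helper vectorize_folder() and its callback parameter are hypothetical; how the bundled script is actually written is not shown here.

```python
from pathlib import Path


def vectorize_folder(folder, vectorize_one):
    """Call `vectorize_one(path)` for every regular file in `folder`.

    Files are visited in sorted order so runs are reproducible.
    Returns the list of processed file names.
    """
    processed = []
    for path in sorted(Path(folder).iterdir()):
        if path.is_file():  # skip sub-directories
            vectorize_one(str(path))
            processed.append(path.name)
    return processed
```

With veco-ai, the callback could plausibly be the vectorize method of a Vectorize instance, followed by a call to save_database("vector_db.json"), both of which appear in the Python usage example in this README.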

Direct usage in Python

from veco_ai import Vectorize

# JSON fallback backend
veco = Vectorize(preload_json_path="vector_db.json")

# Vectorize a file
veco.vectorize("path/to/file.pdf", use_compression=True)

# Persist the database
veco.save_database("vector_db.json")

# Run a RAG query (Ollama required)
res = veco.query(
    database="vector_db.json",
    question="What is this document about?",
    llm_model="gemma3:12b",
)
print(res["answer"])
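
For readers curious what a RAG query involves, the retrieval half can be sketched in pure Python: score stored chunks against the question embedding, keep the best matches, and assemble a prompt. The names cosine(), top_k(), and build_prompt(), the entry layout, and the prompt wording are assumptions for illustration. veco-ai uses a FAISS index for the search step; the brute-force cosine ranking below is a simplified stand-in, not the library's implementation.

```python
import math


def cosine(a, b):
    """Cosine similarity of two equal-length vectors (0.0 if either is all zeros)."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0


def top_k(query_vec, entries, k=3):
    """Return the k entries whose 'embedding' is most similar to query_vec."""
    ranked = sorted(
        entries,
        key=lambda e: cosine(query_vec, e["embedding"]),
        reverse=True,
    )
    return ranked[:k]


def build_prompt(question, chunks):
    """Join the retrieved chunk texts into a context block placed before the question."""
    context = "\n\n".join(c["text"] for c in chunks)
    return (
        "Answer using only the context below.\n\n"
        f"Context:\n{context}\n\n"
        f"Question: {question}"
    )
```

The resulting prompt string is what would then be sent to a local model; that call is left out here because it depends on a running Ollama server.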

Architecture

The central class is Vectorize:

  • Input detection: identifies the file type
  • Text extraction: uses type-specific libraries
  • Optional compression: generates summaries through Ollama
  • Chunking: splits text into overlapping segments
  • Embedding: performed with sentence-transformers
  • Storage: FAISS index plus JSON/SQLite/MongoDB backends
  • RAG: retrieves relevant context and optionally queries an Ollama model
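
The first stage of the pipeline above, input detection, could be as simple as a lookup on the file extension. The mapping and the function detect_kind() below are hypothetical; the extensions veco-ai recognizes and the way it detects them (extension, MIME type, or content sniffing) are not specified here.

```python
from pathlib import Path

# Hypothetical extension-to-kind table, for illustration only.
EXTENSION_KINDS = {
    ".txt": "text",
    ".md": "text",
    ".pdf": "pdf",
    ".docx": "word",
    ".pptx": "powerpoint",
    ".png": "image",
    ".jpg": "image",
    ".jpeg": "image",
    ".wav": "audio",
    ".mp3": "audio",
    ".mp4": "video",
    ".mov": "video",
}


def detect_kind(path):
    """Map a file path to a coarse input kind based on its (case-insensitive) suffix."""
    return EXTENSION_KINDS.get(Path(path).suffix.lower(), "unknown")
```

A dispatcher could then route each kind to the matching extractor, for example pdfplumber for "pdf" or whisper for "audio", as listed under Features.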

Development

Install the development extras to run linting and tests:

pip install ".[dev]"
pytest

License

The project is released under the terms of CC0 1.0 Universal.
