Vectorization & RAG Toolkit

veco-ai

veco-ai is a Python toolkit (Python 3.10-3.11) that converts a broad range of document types - text, PDF, Word, PowerPoint, images, audio, and video - into vector representations that can be queried through Retrieval Augmented Generation (RAG).
Embeddings are stored inside a FAISS index and can optionally be persisted to JSON (fallback), SQLite, or MongoDB. The integrated RAG interface lets you query knowledge bases via local Ollama models.

Features

Automatic input detection for text, PDF, Word, PowerPoint, images, audio, and video
Text extraction via pdfplumber, python-docx, python-pptx, pytesseract, moviepy, and whisper
Speaker diarization (optional through veco_diarization.py)
Vision extensions:
- OCR via pytesseract
- CNN classification (torchvision ResNet)
- External image captioning (optional via veco_pic_describe)
Chunking with overlap for RAG-ready embeddings
Optional summaries generated with Ollama models (stored separately, never used as embedding input)
FAISS index for efficient retrieval
Persistence backends: JSON (fallback, stand-alone), SQLite, or MongoDB
RAG queries: End-to-end helper (query()) that retrieves context and triggers an Ollama response
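
The "chunking with overlap" feature can be pictured with a small sketch. It assumes fixed-size character windows; the function name chunk_text and its parameters are hypothetical and are not part of the veco-ai API, which may well split by tokens or sentences instead.

```python
def chunk_text(text: str, chunk_size: int = 200, overlap: int = 50) -> list[str]:
    """Split text into character windows that share `overlap` characters.

    Hypothetical helper for illustration only; not veco-ai's implementation.
    """
    if chunk_size <= 0:
        raise ValueError("chunk_size must be positive")
    if not 0 <= overlap < chunk_size:
        raise ValueError("overlap must satisfy 0 <= overlap < chunk_size")

    step = chunk_size - overlap  # how far each new window advances
    chunks = []
    for start in range(0, len(text), step):
        chunks.append(text[start:start + chunk_size])
        if start + chunk_size >= len(text):
            break  # this window already reaches the end of the text
    return chunks


print(chunk_text("abcdefghij", chunk_size=4, overlap=2))
# ['abcd', 'cdef', 'efgh', 'ghij']
```

Each chunk repeats the tail of the previous one, so a sentence that straddles a boundary still appears whole in at least one chunk.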

Project Structure

.
|-- veco_ai/
|   |-- __init__.py
|   |-- veco_ai.py           # Core vectorization library
|   |-- veco_diarization.py  # Optional speaker diarization pipeline
|   `-- veco_pic_describe.py # Optional image captioning helpers
|-- test/
|   `-- veco_test.py         # Example usage script
|-- requirements.txt
|-- pyproject.toml / setup.py
|-- test_data/               # Sample files for testing
|-- vector_db.json           # Example JSON database (fallback storage)
`-- UML/                     # Architecture diagrams

Dependencies

Python 3.10 or 3.11
torch, torchaudio, torchvision (CPU wheels via PyPI; follow the official PyTorch guide for CUDA)
sentence-transformers
faiss-cpu
openai-whisper
pdfplumber
pytesseract
pillow
moviepy
python-docx
python-pptx
numpy and scipy
ollama
webrtcvad-wheels, librosa, soundfile, speechbrain
(See requirements.txt / pyproject.toml for exact versions)

Installation

1. Create a virtual environment

./setup_venv.ps1
# or
python3.11 -m venv .venv
.venv\Scripts\activate      # Windows
source .venv/bin/activate   # Linux/macOS

2. Install the base dependencies

pip install veco_ai

For local development instead of the published wheel, install from source:

pip install -r requirements.txt
# or
pip install -e .

3. Configure PyTorch (optional)

Follow the official PyTorch installation guide for your GPU/CPU setup.
For a CPU-only environment, the default wheels that pip installs from the requirements are sufficient.
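
To check which device the installed PyTorch build can use, a quick probe such as the following may help. The helper name pick_device is hypothetical; torch.cuda.is_available() is the standard PyTorch call, and the sketch falls back to "cpu" when PyTorch is not installed at all.

```python
def pick_device() -> str:
    """Return "cuda" if a CUDA-enabled PyTorch build sees a GPU, else "cpu"."""
    try:
        import torch
    except ImportError:
        return "cpu"  # PyTorch is not installed, so there is no GPU support
    return "cuda" if torch.cuda.is_available() else "cpu"


print(pick_device())
```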

Usage

Example script (`tests/veco_test.py`)

python tests/veco_test.py

The script loads or creates vector_db.json, vectorizes all files in the test_data/ folder, and saves the updated database.
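
The folder loop that the script performs might look roughly like the sketch below. This is not the contents of veco_test.py: vectorize_folder, its suffix filter, and the injected callable are assumptions made for illustration, so that the loop runs without veco-ai installed.

```python
from pathlib import Path
from typing import Callable, Iterable


def vectorize_folder(
    folder: str,
    vectorize: Callable[[str], object],
    suffixes: Iterable[str] = (".txt", ".pdf", ".docx", ".pptx"),
) -> list[str]:
    """Call `vectorize` on every matching file in `folder`, in sorted order.

    Hypothetical helper; returns the names of the files that were processed.
    """
    allowed = {s.lower() for s in suffixes}
    processed = []
    for path in sorted(Path(folder).iterdir()):
        # Skip sub-directories and file types outside the allowed set.
        if path.is_file() and path.suffix.lower() in allowed:
            vectorize(str(path))
            processed.append(path.name)
    return processed
```

With the library installed, the callable would be the vectorize method of a Vectorize instance, for example vectorize_folder("test_data", veco.vectorize), followed by veco.save_database("vector_db.json").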

Direct usage in Python

from veco_ai import Vectorize

# JSON fallback backend
veco = Vectorize(preload_json_path="vector_db.json")

# Vectorize a file
veco.vectorize("path/to/file.pdf", use_compression=True)

# Persist the database
veco.save_database("vector_db.json")

# Run a RAG query (Ollama required)
res = veco.query(
    database="vector_db.json",
    question="What is this document about?",
    llm_model="gemma3:12b",
)
print(res["answer"])

Architecture

The central class is Vectorize:

Input detection: identifies the file type
Text extraction: uses type-specific libraries
Optional compression: generates summaries through Ollama
Chunking: splits text into overlapping segments
Embedding: performed with sentence-transformers
Storage: FAISS index plus JSON/SQLite/MongoDB backends
RAG: retrieves relevant context and optionally queries an Ollama model
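
The retrieval and RAG steps can be sketched in plain Python. This is a brute-force stand-in for the FAISS lookup, using toy vectors instead of sentence-transformers embeddings; the names cosine, top_k, and build_prompt and the prompt wording are hypothetical, not veco-ai's internals.

```python
import math


def cosine(a: list[float], b: list[float]) -> float:
    """Cosine similarity of two equal-length vectors (0.0 if either is zero)."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0 or norm_b == 0:
        return 0.0
    return dot / (norm_a * norm_b)


def top_k(query_vec, chunk_vecs, chunks, k=2):
    """Return the k chunks whose vectors are most similar to the query."""
    scored = sorted(
        zip((cosine(query_vec, v) for v in chunk_vecs), chunks),
        key=lambda pair: pair[0],
        reverse=True,
    )
    return [chunk for _, chunk in scored[:k]]


def build_prompt(question: str, context_chunks: list[str]) -> str:
    """Assemble the retrieved context and the question into one prompt string."""
    context = "\n---\n".join(context_chunks)
    return f"Answer using only this context:\n{context}\n\nQuestion: {question}"


vectors = [[1.0, 0.0], [0.0, 1.0], [0.9, 0.1]]
texts = ["chunk A", "chunk B", "chunk C"]
best = top_k([1.0, 0.0], vectors, texts, k=2)
print(best)  # ['chunk A', 'chunk C']
print(build_prompt("What is this document about?", best))
```

A FAISS index performs the same nearest-neighbour ranking far more efficiently on large collections; the resulting prompt is what would be handed to the Ollama model.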

Development

Install the development extras to run linting and tests:

pip install ".[dev]"
pytest

License

The project is released under the terms of CC0 1.0 Universal.

Project details

This version

0.1.0

Oct 20, 2025

Download files

Source Distribution

veco_ai-0.1.0.tar.gz (28.7 kB)

Uploaded Oct 20, 2025 Source

Built Distribution

veco_ai-0.1.0-py3-none-any.whl (26.4 kB)

Uploaded Oct 20, 2025 Python 3

File details

Details for the file veco_ai-0.1.0.tar.gz.

File metadata

Download URL: veco_ai-0.1.0.tar.gz
Upload date: Oct 20, 2025
Size: 28.7 kB
Tags: Source
Uploaded using Trusted Publishing? No
Uploaded via: twine/6.2.0 CPython/3.12.9

File hashes

Hashes for veco_ai-0.1.0.tar.gz
Algorithm	Hash digest
SHA256	`033dad670ca0c780ea2926a93ef5ae547fbc8e69ae8760f774e662d4a50dd73d`
MD5	`d43c9fd8fe85da813823638079c6be80`
BLAKE2b-256	`2940c64bf9c2cf71a93efc8ee3f53084b8ee8802497a9ba6ce7f241d2f34bf10`

File details

Details for the file veco_ai-0.1.0-py3-none-any.whl.

File metadata

Download URL: veco_ai-0.1.0-py3-none-any.whl
Upload date: Oct 20, 2025
Size: 26.4 kB
Tags: Python 3
Uploaded using Trusted Publishing? No
Uploaded via: twine/6.2.0 CPython/3.12.9

File hashes

Hashes for veco_ai-0.1.0-py3-none-any.whl
Algorithm	Hash digest
SHA256	`b4a238154f295f4836a274c5cf4edb0a07955d0328851801b5ce73f7b945b13f`
MD5	`e3a37431335eddb16fccf78379f1be5d`
BLAKE2b-256	`25c1ebe159a2af3a53ce41eb5e3f8e2a569ba2090d821f7baba079cdd0fe27a6`
