Vectorization & RAG Toolkit
veco-ai
veco-ai is a Python toolkit (Python 3.10-3.11) that converts a broad range of document types - text, PDF, Word, PowerPoint, images, audio, and video - into vector representations that can be queried through Retrieval-Augmented Generation (RAG).
Embeddings are stored inside a FAISS index and can optionally be persisted to JSON (fallback), SQLite, or MongoDB. The integrated RAG interface lets you query knowledge bases via local Ollama models.
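To make the persistence idea concrete, here is a minimal sketch of storing chunk embeddings in SQLite with only the standard library. The table name `chunks`, its columns, and the helpers `save_chunks` and `load_chunks` are hypothetical and chosen for illustration; they are not veco-ai's actual schema or API.

```python
import json
import sqlite3


def save_chunks(conn, chunks):
    """Store chunk records; each embedding is serialized as JSON text."""
    # Hypothetical schema, not the one veco-ai uses internally.
    conn.execute(
        "CREATE TABLE IF NOT EXISTS chunks ("
        "id INTEGER PRIMARY KEY, source TEXT, text TEXT, embedding TEXT)"
    )
    conn.executemany(
        "INSERT INTO chunks (source, text, embedding) VALUES (?, ?, ?)",
        [(c["source"], c["text"], json.dumps(c["embedding"])) for c in chunks],
    )
    conn.commit()


def load_chunks(conn):
    """Read all chunk records back in insertion order."""
    rows = conn.execute(
        "SELECT source, text, embedding FROM chunks ORDER BY id"
    ).fetchall()
    return [
        {"source": source, "text": text, "embedding": json.loads(embedding)}
        for source, text, embedding in rows
    ]


# Round trip through an in-memory database.
conn = sqlite3.connect(":memory:")
save_chunks(conn, [{"source": "a.txt", "text": "hello", "embedding": [0.1, 0.2]}])
restored = load_chunks(conn)
```

In a setup like this, the vector index would be rebuilt from the restored embeddings at load time, so the database only needs to hold plain records.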
Features
- Automatic input detection for text, PDF, Word, PowerPoint, images, audio, and video
- Text extraction via pdfplumber, python-docx, python-pptx, pytesseract, moviepy, and whisper
- Speaker diarization (optional through veco_diarization.py)
- Vision extensions:
  - OCR via pytesseract
  - CNN classification (torchvision ResNet)
  - External image captioning (optional via veco_pic_describe)
- Chunking with overlap for RAG-ready embeddings
- Optional summaries generated with Ollama models (stored separately, never used as embedding input)
- FAISS index for efficient retrieval
- Persistence backends: JSON (fallback, stand-alone), SQLite, or MongoDB
- RAG queries: End-to-end helper (query()) that retrieves context and triggers an Ollama response
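The "chunking with overlap" feature listed above can be illustrated with a short sketch. It assumes character-based windows; the function name `chunk_text` and the default sizes are hypothetical, and veco-ai's own chunker may split on tokens or sentences instead.

```python
def chunk_text(text, chunk_size=200, overlap=50):
    """Split text into windows of chunk_size characters that share
    `overlap` characters with the previous window."""
    if chunk_size <= 0:
        raise ValueError("chunk_size must be positive")
    if not 0 <= overlap < chunk_size:
        raise ValueError("overlap must satisfy 0 <= overlap < chunk_size")

    step = chunk_size - overlap  # how far each window advances
    chunks = []
    for start in range(0, len(text), step):
        chunks.append(text[start:start + chunk_size])
        # Stop once a window has reached the end of the text,
        # so no trailing fragment is emitted twice.
        if start + chunk_size >= len(text):
            break
    return chunks


example = chunk_text("abcdefghij", chunk_size=4, overlap=2)
# example == ["abcd", "cdef", "efgh", "ghij"]
```

The overlap means a sentence that straddles a chunk boundary still appears whole in at least one chunk, which tends to help retrieval quality at the cost of some duplicated storage.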
Project Structure
.
|-- veco_ai/
| |-- __init__.py
| |-- veco_ai.py # Core vectorization library
| |-- veco_diarization.py # Optional speaker diarization pipeline
| `-- veco_pic_describe.py # Optional image captioning helpers
|-- test/
| `-- veco_test.py # Example usage script
|-- requirements.txt
|-- pyproject.toml / setup.py
|-- test_data/ # Sample files for testing
|-- vector_db.json # Example JSON database (fallback storage)
`-- UML/ # Architecture diagrams
Dependencies
- Python 3.10 or 3.11
- torch, torchaudio, torchvision (CPU wheels via PyPI; follow the official PyTorch guide for CUDA)
- sentence-transformers
- faiss-cpu
- openai-whisper
- pdfplumber
- pytesseract
- pillow
- moviepy
- python-docx
- python-pptx
- numpy and scipy
- ollama
- webrtcvad-wheels, librosa, soundfile, speechbrain
- (See requirements.txt / pyproject.toml for exact versions)
Installation
1. Create a virtual environment
./setup_venv.ps1
# or
python3.11 -m venv .venv
.venv\Scripts\activate # Windows
source .venv/bin/activate # Linux/macOS
2. Install the base dependencies
pip install veco_ai
For local development instead of the published wheel, install from source:
pip install -r requirements.txt
# or
pip install -e .
3. Configure PyTorch (optional)
Follow the official PyTorch installation guide for your GPU/CPU setup.
For a CPU-only environment, the default pip install from the requirements is sufficient.
Usage
Example script (tests/veco_test.py)
python tests/veco_test.py
The script loads or creates vector_db.json, vectorizes all files in the test_data/ folder, and saves the updated database.
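The folder-processing step described above could look roughly like the sketch below. It relies only on the `vectorize` and `save_database` methods that appear in this README's usage example; the helper `vectorize_folder` and the `RecordingVectorizer` stand-in are hypothetical, and the non-recursive, sorted traversal is an assumption rather than the script's confirmed behavior.

```python
import tempfile
from pathlib import Path


def vectorize_folder(veco, folder, database_path):
    """Vectorize every regular file in `folder`, then save the database.

    `veco` is any object with vectorize(path) and save_database(path),
    for example a veco_ai.Vectorize instance.
    """
    processed = []
    for path in sorted(Path(folder).iterdir()):
        if path.is_file():  # skip sub-directories (assumption)
            veco.vectorize(str(path))
            processed.append(path.name)
    veco.save_database(database_path)
    return processed


class RecordingVectorizer:
    """Stand-in for veco_ai.Vectorize so the sketch runs without the package."""

    def __init__(self):
        self.calls = []
        self.saved_to = None

    def vectorize(self, path):
        self.calls.append(Path(path).name)

    def save_database(self, path):
        self.saved_to = path


# Demonstrate the loop on a temporary folder with two small files.
with tempfile.TemporaryDirectory() as tmp:
    (Path(tmp) / "b.txt").write_text("beta")
    (Path(tmp) / "a.txt").write_text("alpha")
    demo = RecordingVectorizer()
    processed = vectorize_folder(demo, tmp, "vector_db.json")
```

With the real package installed, the stand-in would be replaced by a `Vectorize` instance and the temporary folder by test_data/.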
Direct usage in Python
from veco_ai import Vectorize
# JSON fallback backend
veco = Vectorize(preload_json_path="vector_db.json")
# Vectorize a file
veco.vectorize("path/to/file.pdf", use_compression=True)
# Persist the database
veco.save_database("vector_db.json")
# Run a RAG query (Ollama required)
res = veco.query(
    database="vector_db.json",
    question="What is this document about?",
    llm_model="gemma3:12b",
)
print(res["answer"])
Architecture
The central class is Vectorize:
- Input detection: identifies the file type
- Text extraction: uses type-specific libraries
- Optional compression: generates summaries through Ollama
- Chunking: splits text into overlapping segments
- Embedding: performed with sentence-transformers
- Storage: FAISS index plus JSON/SQLite/MongoDB backends
- RAG: retrieves relevant context and optionally queries an Ollama model
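The retrieval step at the end of this pipeline can be sketched without any dependencies. The example below uses a brute-force cosine-similarity ranking as a stand-in for the FAISS index, and the prompt template is hypothetical; the names `retrieve` and `build_prompt` are illustrative and not part of veco-ai. The call to the Ollama model is deliberately left out.

```python
import math


def cosine_similarity(a, b):
    """Cosine of the angle between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm if norm else 0.0


def retrieve(query_vec, entries, top_k=2):
    """Return the top_k entries most similar to query_vec.

    Brute-force scan; a FAISS index serves the same purpose at scale.
    """
    ranked = sorted(
        entries,
        key=lambda e: cosine_similarity(query_vec, e["embedding"]),
        reverse=True,
    )
    return ranked[:top_k]


def build_prompt(question, context_chunks):
    """Combine retrieved chunks and the question (hypothetical template)."""
    context = "\n\n".join(c["text"] for c in context_chunks)
    return f"Answer using only this context:\n{context}\n\nQuestion: {question}"


# Toy two-dimensional embeddings, made up for the demonstration.
entries = [
    {"text": "FAISS stores the vectors.", "embedding": [1.0, 0.0]},
    {"text": "Whisper transcribes audio.", "embedding": [0.0, 1.0]},
    {"text": "Indexes speed up search.", "embedding": [0.9, 0.1]},
]
hits = retrieve([1.0, 0.0], entries, top_k=2)
prompt = build_prompt("Where are vectors stored?", hits)
```

In a real run, the query vector would come from the same sentence-transformers model that embedded the chunks, and the assembled prompt would be passed to the chosen Ollama model.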
Development
Install the development extras to run linting and tests:
pip install ".[dev]"  # quoted so shells such as zsh do not expand the brackets
pytest
License
The project is released under the terms of CC0 1.0 Universal.
File details
Details for the file veco_ai-0.1.0.tar.gz.
File metadata
- Download URL: veco_ai-0.1.0.tar.gz
- Upload date:
- Size: 28.7 kB
- Tags: Source
- Uploaded using Trusted Publishing? No
- Uploaded via: twine/6.2.0 CPython/3.12.9
File hashes
| Algorithm | Hash digest |
|---|---|
| SHA256 | 033dad670ca0c780ea2926a93ef5ae547fbc8e69ae8760f774e662d4a50dd73d |
| MD5 | d43c9fd8fe85da813823638079c6be80 |
| BLAKE2b-256 | 2940c64bf9c2cf71a93efc8ee3f53084b8ee8802497a9ba6ce7f241d2f34bf10 |
File details
Details for the file veco_ai-0.1.0-py3-none-any.whl.
File metadata
- Download URL: veco_ai-0.1.0-py3-none-any.whl
- Upload date:
- Size: 26.4 kB
- Tags: Python 3
- Uploaded using Trusted Publishing? No
- Uploaded via: twine/6.2.0 CPython/3.12.9
File hashes
| Algorithm | Hash digest |
|---|---|
| SHA256 | b4a238154f295f4836a274c5cf4edb0a07955d0328851801b5ce73f7b945b13f |
| MD5 | e3a37431335eddb16fccf78379f1be5d |
| BLAKE2b-256 | 25c1ebe159a2af3a53ce41eb5e3f8e2a569ba2090d821f7baba079cdd0fe27a6 |