A Python package for seamless vectorization of any content type

These details have not been verified by PyPI

Project description

anyvec

AnyVec is an open-source Python package that vectorizes any type of file (text, images, audio, video, or code) through a single, unified interface. Embedding different data types, such as text and images, traditionally requires different models and separate code paths. AnyVec hides that complexity behind one API, regardless of file type.

Supported File Types

Category	Extensions
Text/Docs	.txt, .rtf, .md, .doc, .docx, .odt
PDF	.pdf
Presentation	.ppt, .pptx, .ppsx, .pptm, .odp
Spreadsheet	.xls, .xlsx, .ods
EPUB	.epub
Templates	.dotm, .dotx, .docm
Image	.png, .jpg, .jpeg, .jpe, .bmp, .gif, .tiff, .ico, .icns, .heic, .avif, .webp, .psd
Audio	.mp3, .wav, .ogg, .m4a
Video	.mp4, .avi, .mov, .mkv, .webm, .mpeg, .mpg
Code	.py, .js, .ts, .tsx, .jsx, .java, .cpp, .c, .h, .hpp, .cs, .go, .rb, .php, .pl, .sh, .swift, .scala, .lua, .f90, .f95, .erl, .exs, .bat, .sql, .lisp, .vb, .ipynb, .xml, .yml, .yaml, .json, .kt, .rst, .html

For the most up-to-date list, see the mime_handlers dictionary in the codebase.

Processing Flow

1. File Type Detection: AnyVec uses MIME type and file extension to determine the file type.
2. Extraction: The relevant extractor parses text, images, or audio from the file.
3. Vectorization: The extracted content is sent to a CLIP-like model via API for embedding.
4. Unified Output: You get back text and image vectors, regardless of input type.
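
As a rough illustration of the detection step, the sketch below maps a file name to a coarse category. It is not AnyVec's implementation: the `EXTENSION_CATEGORIES` table and the `detect_category` function are hypothetical names, and the table is only a small subset of the extensions listed above. The sketch checks the extension first and falls back to Python's standard `mimetypes` module.

```python
import mimetypes
from pathlib import Path

# Hypothetical, abbreviated table; AnyVec's real mapping lives in its
# mime_handlers dictionary and may be organized differently.
EXTENSION_CATEGORIES = {
    ".txt": "text", ".md": "text",
    ".pdf": "pdf",
    ".png": "image", ".jpg": "image",
    ".mp3": "audio", ".wav": "audio",
    ".mp4": "video", ".mkv": "video",
    ".py": "code", ".json": "code",
}


def detect_category(file_name: str) -> str:
    """Return a coarse category for file_name, or "unknown"."""
    ext = Path(file_name).suffix.lower()
    if ext in EXTENSION_CATEGORIES:
        return EXTENSION_CATEGORIES[ext]

    # Fall back to the top-level MIME type guessed by the standard library.
    mime, _encoding = mimetypes.guess_type(file_name)
    if mime:
        top_level = mime.split("/", 1)[0]
        if top_level in {"text", "image", "audio", "video"}:
            return top_level
    return "unknown"


for name in ("report.PDF", "clip.mkv", "script.py"):
    print(name, "->", detect_category(name))
```

Checking the extension table first keeps source files such as `.py` in the code category, since their guessed MIME type would otherwise place them under plain text.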

Detailed Processing Flow

Text Files:

Extracts raw text using format-appropriate parsers.
Returns extracted text for vectorization.

Image Files:

Returns the image data as base64-encoded JPEGs or PNGs.
Optionally, OCR (optical character recognition) can be performed for text extraction.
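
To make the base64 step concrete, here is a minimal sketch using only the standard library. The helper names `encode_image_b64` and `decode_image_b64` are hypothetical and are not part of AnyVec. The 8-byte PNG file signature stands in for real image data.

```python
import base64


def encode_image_b64(image_bytes: bytes) -> str:
    """Return raw image bytes as an ASCII base64 string."""
    return base64.b64encode(image_bytes).decode("ascii")


def decode_image_b64(encoded: str) -> bytes:
    """Reverse encode_image_b64 and return the original bytes."""
    return base64.b64decode(encoded)


# The PNG file signature is used here in place of a full image.
png_signature = b"\x89PNG\r\n\x1a\n"
encoded = encode_image_b64(png_signature)
print(encoded)  # iVBORw0KGgo=
```

Base64 text is safe to embed in a JSON request body, which is one plausible reason for using it when sending images to an HTTP API.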

Audio Files:

Audio bytes are sent to a transcription server (e.g., OpenAI Whisper).
The server returns the transcribed text, which is then vectorized.

Video Files:

The video is processed in two ways:
1. Audio Extraction & Transcription:
  - Audio is extracted from the video using MoviePy.
  - The extracted audio is sent to the /transcribe endpoint in your inference container.
  - The returned transcript is used for vectorization.
2. Frame Extraction:
  - Frames are extracted at n-second intervals using OpenCV.
  - Frames are returned as base64-encoded JPEGs for downstream processing or vectorization.
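
The frame extraction step can be sketched as follows. This is an assumption-laden illustration and not AnyVec's code: `frame_indices` and `extract_frames_b64` are hypothetical names, and the rounding rule for the sampling step is a choice made for this example. The OpenCV calls themselves (`cv2.VideoCapture`, `cv2.CAP_PROP_FPS`, `cv2.CAP_PROP_FRAME_COUNT`, `cv2.CAP_PROP_POS_FRAMES`, `cv2.imencode`) are standard.

```python
import base64


def frame_indices(total_frames: int, fps: float, interval_seconds: float) -> list[int]:
    """Indices of the frames to sample, one every interval_seconds."""
    if interval_seconds <= 0:
        raise ValueError("interval_seconds must be positive")
    if total_frames <= 0 or fps <= 0:
        return []
    # Assumption: round to the nearest whole frame, never less than 1.
    step = max(1, round(fps * interval_seconds))
    return list(range(0, total_frames, step))


def extract_frames_b64(video_path: str, interval_seconds: float) -> list[str]:
    """Sample frames from video_path and return them as base64 JPEG strings."""
    import cv2  # opencv-python; imported lazily so frame_indices works without it

    cap = cv2.VideoCapture(video_path)
    try:
        fps = cap.get(cv2.CAP_PROP_FPS)
        total = int(cap.get(cv2.CAP_PROP_FRAME_COUNT))
        frames = []
        for index in frame_indices(total, fps, interval_seconds):
            cap.set(cv2.CAP_PROP_POS_FRAMES, index)  # seek to the frame
            ok, frame = cap.read()
            if not ok:
                break
            ok, buffer = cv2.imencode(".jpg", frame)  # encode as JPEG in memory
            if ok:
                frames.append(base64.b64encode(buffer.tobytes()).decode("ascii"))
        return frames
    finally:
        cap.release()


print(frame_indices(total_frames=100, fps=25.0, interval_seconds=2.0))  # [0, 50]
```

Separating the index arithmetic from the OpenCV I/O keeps the sampling logic easy to test without a video file.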

Return Values:

For text, audio, and video: returns extracted text (or transcript) and/or images (frames).
For images: returns images and optionally OCR text.

Quick Start / Usage

Installation

pip install anyvec

AnyVec sends extracted content to an inference server for embedding. You can skip building the server locally and pull the latest public image directly from Docker Hub:

docker pull mxy680/clip-inference:latest

Then run the container:

docker run --rm -it -p 8000:8080 mxy680/clip-inference:latest

The API will be available at http://localhost:8000.

To run the container in detached mode (in the background), use:

docker run -d -p 8000:8080 mxy680/clip-inference:latest

The API will still be available at http://localhost:8000 while the container runs in the background.
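
Before calling the client, it can help to confirm that the published port is accepting connections. The sketch below is a generic TCP check using the standard library; `is_port_open` is a hypothetical helper, not part of AnyVec, and it only verifies that something is listening on the port, not that the API itself is healthy.

```python
import socket


def is_port_open(host: str, port: int, timeout: float = 2.0) -> bool:
    """Return True if a TCP connection to host:port succeeds within timeout."""
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        # Covers connection refused, timeouts, and unreachable hosts.
        return False


if is_port_open("localhost", 8000):
    print("Port 8000 is accepting connections")
else:
    print("Nothing is listening on port 8000 yet")
```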

Basic Example

from anyvec.client import AnyVecClient
from anyvec.models import VectorizationPayload

# Point the client at the inference container started above
client = AnyVecClient("http://localhost:8000")

# Read the PDF as raw bytes
with open("example.pdf", "rb") as f:
    file_content = f.read()

# Pass the file name along with the bytes
payload = VectorizationPayload(file_content=file_content, file_name="example.pdf")
result = client.vectorize(payload)
print("Vectorization result:", result)
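
The same pattern extends to several files. The helper below is a sketch, not part of AnyVec: `vectorize_directory` is a hypothetical name, and it relies only on the `vectorize` method and the `file_content` / `file_name` keyword arguments shown in the example above. The client and the payload class are passed in as arguments so the helper can be exercised without a running server.

```python
from pathlib import Path


def vectorize_directory(client, make_payload, directory, extensions):
    """Vectorize every file in directory whose extension is in extensions.

    client       -- any object with a vectorize(payload) method
    make_payload -- callable accepting file_content= and file_name=
    Returns a dict mapping file name to the vectorization result.
    """
    wanted = {ext.lower() for ext in extensions}
    results = {}
    for path in sorted(Path(directory).iterdir()):
        if path.is_file() and path.suffix.lower() in wanted:
            payload = make_payload(
                file_content=path.read_bytes(),
                file_name=path.name,
            )
            results[path.name] = client.vectorize(payload)
    return results
```

With the objects from the basic example, a call might look like `vectorize_directory(client, VectorizationPayload, "docs", {".pdf", ".md"})`, where `docs` is a placeholder directory name.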

Project details

Release history

Version	Release date
0.1.8 (this version)	May 10, 2025
0.1.7	May 9, 2025
0.1.6	May 9, 2025
0.1.5	May 7, 2025
0.1.4	May 7, 2025
0.1.3	May 2, 2025
0.1.2	Apr 29, 2025
0.1.1	Apr 29, 2025
0.1.0	Apr 29, 2025

Download files

Source Distribution

anyvec-0.1.8.tar.gz (17.8 kB)

Uploaded May 10, 2025 Source

Built Distribution

anyvec-0.1.8-py3-none-any.whl (23.7 kB)

Uploaded May 10, 2025 Python 3

File details

Details for the file anyvec-0.1.8.tar.gz.

File metadata

Download URL: anyvec-0.1.8.tar.gz
Upload date: May 10, 2025
Size: 17.8 kB
Tags: Source
Uploaded using Trusted Publishing? No
Uploaded via: poetry/2.0.1 CPython/3.13.2 Darwin/24.3.0

File hashes

Hashes for anyvec-0.1.8.tar.gz
Algorithm	Hash digest
SHA256	`170a9114b87da8c5857e98099a960a2f6e9374f804e581eb287be17a0e00ba8d`
MD5	`2a6f6db69c50da96e37ea762cf4a11c0`
BLAKE2b-256	`3077d8fc4e0aa225c62f1db9cdfd521c4065d6bdfa88795bc44ab5302ab9b091`

File details

Details for the file anyvec-0.1.8-py3-none-any.whl.

File metadata

Download URL: anyvec-0.1.8-py3-none-any.whl
Upload date: May 10, 2025
Size: 23.7 kB
Tags: Python 3
Uploaded using Trusted Publishing? No
Uploaded via: poetry/2.0.1 CPython/3.13.2 Darwin/24.3.0

File hashes

Hashes for anyvec-0.1.8-py3-none-any.whl
Algorithm	Hash digest
SHA256	`f315efdd3f8fbb34394d5ecebcccb7a860ccb647a8f6f91a566885fcc5051850`
MD5	`d9881c3a58f5c77a014f19338a09e25a`
BLAKE2b-256	`30156db058dd9fb50268e5760da3c0b3b833b298a6d31d927711a920d53df0b6`
