A Python package for seamless vectorization of any content type
anyvec
AnyVec is an open-source Python package that makes it easy to vectorize any type of file — text, images, audio, video, or code — through a single, unified interface. Traditionally, embedding different data types (like text vs. images) requires different models and disparate code paths. AnyVec abstracts away these complexities, allowing you to work with a unified API for all your vectorization needs, regardless of file type.
Supported File Types
| Category | Extensions / MIME Types |
|---|---|
| Text/Docs | .txt, .rtf, .md, .doc, .docx, .odt |
| Presentation | .ppt, .pptx, .ppsx, .pptm, .odp |
| Spreadsheet | .xls, .xlsx, .ods |
| EPUB | .epub |
| Templates | .dotm, .dotx, .docm |
| Image | .png, .jpg, .jpeg, .jpe, .bmp, .gif, .tiff, .ico, .icns, .heic, .avif, .webp, .psd |
| Audio | .mp3, .wav, .ogg, .m4a |
| Video | .mp4, .avi, .mov, .mkv, .webm, .mpeg, .mpg |
| Code | .py, .js, .ts, .tsx, .jsx, .java, .cpp, .c, .h, .hpp, .cs, .go, .rb, .php, .pl, .sh, .swift, .scala, .lua, .f90, .f95, .erl, .exs, .bat, .sql, .lisp, .vb, .ipynb, .xml, .yml, .yaml, .json, .kt, .rst, .html |
For the most up-to-date list, see the `mime_handlers` dictionary in the codebase.
Processing Flow
- File Type Detection: AnyVec uses MIME type and file extension to determine the file type.
- Extraction: The relevant extractor parses text, images, or audio from the file.
- Vectorization: The extracted content is sent to a CLIP-like model via API for embedding.
- Unified Output: You get back text and image vectors, regardless of input type.
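The detection step above can be sketched with the standard library alone. This is a minimal illustration, not AnyVec's actual implementation: `detect_category` and the `EXTENSION_CATEGORIES` table are hypothetical names, and the real mapping lives in the package's `mime_handlers` dictionary.

```python
import mimetypes
from pathlib import Path

# Hypothetical, abbreviated category table for illustration only.
# The package's own mime_handlers dictionary is the source of truth.
EXTENSION_CATEGORIES = {
    ".txt": "text",
    ".md": "text",
    ".png": "image",
    ".jpg": "image",
    ".mp3": "audio",
    ".wav": "audio",
    ".mp4": "video",
    ".py": "code",
}


def detect_category(file_name: str) -> str:
    """Guess a coarse category from the MIME type, falling back to the extension."""
    mime_type, _ = mimetypes.guess_type(file_name)
    if mime_type:
        top_level = mime_type.split("/", 1)[0]
        # Only trust the MIME type for unambiguous media categories.
        if top_level in {"image", "audio", "video"}:
            return top_level
    suffix = Path(file_name).suffix.lower()
    return EXTENSION_CATEGORIES.get(suffix, "unknown")
```

Checking the MIME type first and the extension second mirrors the order described in the list; files matching neither are reported as `"unknown"` instead of raising.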
Detailed Processing Flow
Text Files:
- Extracts raw text using format-appropriate parsers.
- Returns extracted text for vectorization.
Image Files:
- Returns the image data as base64-encoded JPEGs or PNGs.
- Optionally, OCR (optical character recognition) can be performed for text extraction.
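As a rough illustration of what "base64-encoded" means here, the sketch below encodes raw image bytes to an ASCII string and decodes them back. The helper names `encode_image` and `decode_image` are hypothetical and are not part of AnyVec's API.

```python
import base64


def encode_image(image_bytes: bytes) -> str:
    """Encode raw image bytes (e.g. a JPEG or PNG file's contents) as base64 text."""
    return base64.b64encode(image_bytes).decode("ascii")


def decode_image(encoded: str) -> bytes:
    """Reverse encode_image, recovering the original bytes."""
    return base64.b64decode(encoded)
```

Base64 text is safe to embed in JSON payloads, which is presumably why image data is passed around in this form.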
Audio Files:
- Audio bytes are sent to a transcription server (e.g., OpenAI Whisper).
- The server returns the transcribed text, which is then vectorized.
Video Files:
- The video is processed in two ways:
  - Audio Extraction & Transcription:
    - Audio is extracted from the video using MoviePy.
    - The extracted audio is sent to the `/transcribe` endpoint in your inference container.
    - The returned transcript is used for vectorization.
  - Frame Extraction:
    - Frames are extracted at n-second intervals using OpenCV.
    - Frames are returned as base64-encoded JPEGs for downstream processing or vectorization.
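To make "n-second intervals" concrete, the sketch below computes which frame indices would be sampled for a given frame rate. It is pure Python and only approximates the idea; `frame_indices` is a hypothetical helper, and how AnyVec selects frames internally may differ.

```python
from typing import List


def frame_indices(fps: float, frame_count: int, every_seconds: float) -> List[int]:
    """Return the indices of frames spaced roughly every_seconds apart."""
    if fps <= 0 or every_seconds <= 0:
        raise ValueError("fps and every_seconds must be positive")
    # Number of frames to skip between samples; always advance at least one frame.
    step = max(1, round(fps * every_seconds))
    return list(range(0, frame_count, step))


# A 10-second clip at 30 fps, sampled every 2 seconds.
print(frame_indices(fps=30, frame_count=300, every_seconds=2))
# → [0, 60, 120, 180, 240]
```

With OpenCV, each index could then be read by seeking with `cap.set(cv2.CAP_PROP_POS_FRAMES, index)` followed by `cap.read()` on a `cv2.VideoCapture` object.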
Return Values:
- For text, audio, and video: returns extracted text (or transcript) and/or images (frames).
- For images: returns images and optionally OCR text.
Quick Start / Usage
Installation
pip install anyvec
For inference, you can skip building locally and pull the latest public image directly from Docker Hub:
docker pull mxy680/clip-inference:latest
Then run the container:
docker run --rm -it -p 8000:8080 mxy680/clip-inference:latest
The API will be available at http://localhost:8000.
To run the container in detached mode (in the background), use:
docker run -d -p 8000:8080 mxy680/clip-inference:latest
The API will still be available at http://localhost:8000 while the container runs in the background.
Basic Example
from anyvec.client import AnyVecClient
from anyvec.models import VectorizationPayload
client = AnyVecClient("http://localhost:8000")
# Process a PDF
with open("example.pdf", "rb") as f:
    file_content = f.read()
payload = VectorizationPayload(file_content=file_content, file_name="example.pdf")
result = client.vectorize(payload)
print("Vectorization result:", result)
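To process more than one file, a small helper can walk a directory and yield the name and bytes of each file that has a supported extension. This is a sketch under stated assumptions: `iter_supported_files` is a hypothetical helper, not part of AnyVec, and the extension set is supplied by the caller.

```python
from pathlib import Path
from typing import FrozenSet, Iterator, Tuple


def iter_supported_files(
    root: str, extensions: FrozenSet[str]
) -> Iterator[Tuple[str, bytes]]:
    """Yield (file_name, file_content) for files under root with a matching extension."""
    for path in sorted(Path(root).rglob("*")):
        # Compare lowercased suffixes so ".PNG" matches ".png".
        if path.is_file() and path.suffix.lower() in extensions:
            yield path.name, path.read_bytes()
```

Each yielded pair can be passed to `VectorizationPayload(file_content=..., file_name=...)` and then to `client.vectorize(...)`, exactly as in the single-file example.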