
A Python package for seamless vectorization for any content type

anyvec

AnyVec is an open-source Python package that vectorizes any type of file (text, images, audio, video, or code) through a single interface. Embedding different data types, such as text and images, traditionally requires different models and separate code paths. AnyVec hides that complexity behind one API that works the same way regardless of file type.


Supported File Types

Category        Extensions / MIME Types
Text/Docs       .txt, .rtf, .md, .doc, .docx, .odt
PDF             .pdf
Presentation    .ppt, .pptx, .ppsx, .pptm, .odp
Spreadsheet     .xls, .xlsx, .ods
EPUB            .epub
Templates       .dotm, .dotx, .docm
Image           .png, .jpg, .jpeg, .jpe, .bmp, .gif, .tiff, .ico, .icns, .heic, .avif, .webp, .psd
Audio           .mp3, .wav, .ogg, .m4a
Video           .mp4, .avi, .mov, .mkv, .webm, .mpeg, .mpg
Code            .py, .js, .ts, .tsx, .jsx, .java, .cpp, .c, .h, .hpp, .cs, .go, .rb, .php, .pl, .sh, .swift, .scala, .lua, .f90, .f95, .erl, .exs, .bat, .sql, .lisp, .vb, .ipynb, .xml, .yml, .yaml, .json, .kt, .rst, .html

For the most up-to-date list, see the mime_handlers dictionary in the codebase.

Processing Flow

  1. File Type Detection: AnyVec uses MIME type and file extension to determine the file type.
  2. Extraction: The relevant extractor parses text, images, or audio from the file.
  3. Vectorization: The extracted content is sent to a CLIP-like model via API for embedding.
  4. Unified Output: You get back text and image vectors, regardless of input type.
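
The detection step can be illustrated with a minimal sketch built on Python's standard mimetypes module. This is not AnyVec's implementation: the EXTENSION_CATEGORIES table and the detect_category function are hypothetical names, and checking the extension before falling back to the MIME type is an assumed order.

```python
import mimetypes
from pathlib import Path

# Hypothetical, abbreviated category table for illustration only.
# AnyVec's real mapping lives in its mime_handlers dictionary.
EXTENSION_CATEGORIES = {
    ".txt": "text", ".md": "text",
    ".pdf": "pdf",
    ".png": "image", ".jpg": "image",
    ".mp3": "audio", ".wav": "audio",
    ".mp4": "video", ".mov": "video",
    ".py": "code",
}

def detect_category(file_name: str) -> str:
    """Return a coarse category for a file name, or "unknown"."""
    # 1. Try the (case-insensitive) file extension first.
    ext = Path(file_name).suffix.lower()
    if ext in EXTENSION_CATEGORIES:
        return EXTENSION_CATEGORIES[ext]
    # 2. Fall back to the top-level MIME type guessed from the name.
    mime, _ = mimetypes.guess_type(file_name)
    if mime:
        top_level = mime.split("/", 1)[0]
        if top_level in {"text", "image", "audio", "video"}:
            return top_level
    return "unknown"

print(detect_category("report.PDF"))
print(detect_category("photo.jpeg"))
```

An extractor for each category could then be looked up from the returned string.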

Detailed Processing Flow

Text Files:

  • Extracts raw text using format-appropriate parsers.
  • Returns extracted text for vectorization.

Image Files:

  • Returns the image data as base64-encoded JPEGs or PNGs.
  • Optionally, OCR (optical character recognition) can be performed for text extraction.
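
As a rough illustration of the base64 step, the sketch below encodes raw image bytes into an ASCII string and decodes them back. The function names are hypothetical and only the standard base64 module is used; how AnyVec itself loads or re-encodes images is not shown here.

```python
import base64

def encode_image_bytes(image_bytes: bytes) -> str:
    """Encode raw image bytes (e.g. the contents of a PNG or JPEG file) as base64 text."""
    return base64.b64encode(image_bytes).decode("ascii")

def decode_image_string(encoded: str) -> bytes:
    """Reverse of encode_image_bytes: recover the original bytes."""
    return base64.b64decode(encoded)

# The 8-byte PNG signature stands in for a full image file here.
png_signature = b"\x89PNG\r\n\x1a\n"
encoded = encode_image_bytes(png_signature)
restored = decode_image_string(encoded)
print(encoded, restored == png_signature)
```

Base64 text is safe to embed in JSON payloads, which is one common reason to transmit images this way.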

Audio Files:

  • Audio bytes are sent to a transcription server (e.g., OpenAI Whisper).
  • The server returns the transcribed text, which is then vectorized.
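
The sketch below shows one way audio bytes could be packaged into an HTTP request for a transcription server. It is an assumption-heavy illustration: the /transcribe path is the one named for the inference container in the video section, but the raw-bytes body, the Content-Type header, and the build_transcription_request helper are guesses, not AnyVec's documented wire format.

```python
import urllib.request

def build_transcription_request(base_url: str, audio_bytes: bytes) -> urllib.request.Request:
    """Build (but do not send) a POST request carrying raw audio bytes.

    Assumption: the server accepts the audio as the raw request body.
    The real endpoint may expect multipart form data or JSON instead.
    """
    url = base_url.rstrip("/") + "/transcribe"
    return urllib.request.Request(
        url,
        data=audio_bytes,
        headers={"Content-Type": "application/octet-stream"},
        method="POST",
    )

request = build_transcription_request("http://localhost:8000/", b"fake-audio-bytes")
print(request.get_method(), request.full_url)
# Sending would be: urllib.request.urlopen(request), with the inference container running.
```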

Video Files:

  • The video is processed in two ways:
    1. Audio Extraction & Transcription:
      • Audio is extracted from the video using MoviePy.
      • The extracted audio is sent to the /transcribe endpoint in your inference container.
      • The returned transcript is used for vectorization.
    2. Frame Extraction:
      • Frames are extracted at n-second intervals using OpenCV.
      • Frames are returned as base64-encoded JPEGs for downstream processing or vectorization.
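
To make "n-second intervals" concrete, here is a small sketch of the arithmetic for choosing which frames to sample. The frame_indices helper is hypothetical and does not call OpenCV; with OpenCV one would typically read the frame rate and frame count from a cv2.VideoCapture (CAP_PROP_FPS, CAP_PROP_FRAME_COUNT) and seek to each index, but AnyVec's actual sampling logic may differ.

```python
def frame_indices(total_frames: int, fps: float, interval_seconds: float) -> list[int]:
    """Return the indices of frames spaced roughly interval_seconds apart."""
    if fps <= 0 or interval_seconds <= 0:
        raise ValueError("fps and interval_seconds must be positive")
    # Number of frames between samples; at least 1 so the range always advances.
    step = max(1, round(fps * interval_seconds))
    return list(range(0, total_frames, step))

# A 10-second clip at 30 fps, sampled every 2 seconds.
print(frame_indices(total_frames=300, fps=30.0, interval_seconds=2.0))
```

Sampling by frame index rather than decoding every frame keeps the work proportional to the number of frames kept.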

Return Values:

  • For text, audio, and video: returns extracted text (or transcript) and/or images (frames).
  • For images: returns images and optionally OCR text.

Quick Start / Usage

Installation

pip install anyvec

For inference, you can skip building locally and pull the latest public image directly from Docker Hub:

docker pull mxy680/clip-inference:latest

Then run the container:

docker run --rm -it -p 8000:8080 mxy680/clip-inference:latest

The API will be available at http://localhost:8000.

To run the container in detached mode (in the background), use:

docker run -d -p 8000:8080 mxy680/clip-inference:latest

The API will still be available at http://localhost:8000 while the container runs in the background.


Basic Example

from anyvec.client import AnyVecClient
from anyvec.models import VectorizationPayload

client = AnyVecClient("http://localhost:8000")

# Process a PDF
with open("example.pdf", "rb") as f:
    file_content = f.read()
payload = VectorizationPayload(file_content=file_content, file_name="example.pdf")
result = client.vectorize(payload)
print("Vectorization result:", result)
