A simple toolkit for generating vector embeddings across multiple providers and models

EmbedKit

A unified interface for text and image embeddings, supporting multiple providers.

Installation

pip install embedkit

Usage

Text Embeddings

from embedkit import EmbedKit
from embedkit.classes import Model, CohereInputType, SnowflakeInputType

# Initialize with ColPali
kit = EmbedKit.colpali(
    model=Model.ColPali.COLPALI_V1_3,  # or COLSMOL_256M, COLSMOL_500M
    text_batch_size=16,  # Optional: process text in batches of 16
    image_batch_size=8,  # Optional: process images in batches of 8
)

# Get embeddings
result = kit.embed_text("Hello world")
print(result.model_provider)
print(result.input_type)
print(result.objects[0].embedding.shape)  # Returns 2D array for ColPali
print(result.objects[0].source_b64)

# Initialize with Cohere
kit = EmbedKit.cohere(
    model=Model.Cohere.EMBED_V4_0,
    api_key="your-api-key",
    text_input_type=CohereInputType.SEARCH_QUERY,  # or SEARCH_DOCUMENT
    text_batch_size=64,  # Optional: process text in batches of 64
    image_batch_size=8,  # Optional: process images in batches of 8
)

# Get embeddings
result = kit.embed_text("Hello world")
print(result.model_provider)
print(result.input_type)
print(result.objects[0].embedding.shape)  # Returns 1D array for Cohere
print(result.objects[0].source_b64)

# Initialize with Jina
kit = EmbedKit.jina(
    model=Model.Jina.CLIP_V2,
    api_key="your-api-key",
    text_batch_size=32,  # Optional: process text in batches of 32
    image_batch_size=8,  # Optional: process images in batches of 8
)

# Get embeddings
result = kit.embed_text("Hello world")
print(result.model_provider)
print(result.input_type)
print(result.objects[0].embedding.shape)  # Returns 1D array for Jina
print(result.objects[0].source_b64)

# Initialize with Snowflake
kit = EmbedKit.snowflake(
    model=Model.Snowflake.ARCTIC_EMBED_L_V2_0,  # or ARCTIC_EMBED_M_V1_5
    text_input_type=SnowflakeInputType.QUERY,  # or DOCUMENT
    text_batch_size=32,  # Optional: process text in batches of 32
)

# Get embeddings
result = kit.embed_text("Hello world")
print(result.model_provider)
print(result.input_type)
print(result.objects[0].embedding.shape)  # Returns 1D array for Snowflake
print(result.objects[0].source_b64)
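
The single-vector (1D) embeddings returned by Cohere, Jina, and Snowflake are typically compared with cosine similarity. The following is a minimal sketch of that comparison, not part of EmbedKit itself: `cosine_similarity` and `rank_by_similarity` are hypothetical helper names, and the short lists stand in for the values you would read from `result.objects[i].embedding`. The helpers accept any sequence of floats, so a 1D NumPy array should work as well.

```python
import math
from typing import List, Sequence, Tuple


def cosine_similarity(a: Sequence[float], b: Sequence[float]) -> float:
    """Cosine similarity between two single-vector (1D) embeddings."""
    if len(a) != len(b):
        raise ValueError("embeddings must have the same dimension")
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0.0 or norm_b == 0.0:
        return 0.0  # convention chosen for this sketch: zero vectors match nothing
    return dot / (norm_a * norm_b)


def rank_by_similarity(
    query: Sequence[float], documents: Sequence[Sequence[float]]
) -> List[Tuple[int, float]]:
    """Return (document index, score) pairs, most similar first."""
    scores = [(i, cosine_similarity(query, doc)) for i, doc in enumerate(documents)]
    return sorted(scores, key=lambda pair: pair[1], reverse=True)


# Stand-in vectors (assumption): real ones would come from
# result.objects[i].embedding of a single-vector provider.
query_vec = [1.0, 0.0, 0.0]
doc_vecs = [[0.0, 1.0, 0.0], [2.0, 0.0, 0.0], [1.0, 1.0, 0.0]]
ranking = rank_by_similarity(query_vec, doc_vecs)
```

For Cohere and Snowflake, the query would usually be embedded with the query input type and the documents with the document input type shown above.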

Image Embeddings

from pathlib import Path

# Get embeddings for an image (use a kit created with an image-capable provider, e.g. ColPali, Cohere, or Jina)
image_path = Path("path/to/image.png")
result = kit.embed_image(image_path)

print(result.model_provider)
print(result.input_type)
print(result.objects[0].embedding.shape)  # 2D for ColPali, 1D for Cohere/Jina
print(result.objects[0].source_b64)  # Base64 encoded image
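
Since `source_b64` holds the source as a base64 string, it can be turned back into a file with the standard library. The sketch below assumes the field contains plain base64 of the image bytes (no `data:` URI prefix), which the description above does not confirm. `save_source` is a hypothetical helper, and the payload is a stand-in rather than real EmbedKit output.

```python
import base64
import tempfile
from pathlib import Path


def save_source(source_b64: str, out_path: Path) -> int:
    """Decode a base64 string, write the raw bytes, and return the byte count."""
    raw = base64.b64decode(source_b64)
    out_path.write_bytes(raw)
    return len(raw)


# Stand-in payload (assumption): the 8-byte PNG signature, encoded the way
# result.objects[0].source_b64 is assumed to be encoded.
fake_b64 = base64.b64encode(b"\x89PNG\r\n\x1a\n").decode("ascii")

with tempfile.TemporaryDirectory() as tmp:
    target = Path(tmp) / "source.png"
    written = save_source(fake_b64, target)
    restored = target.read_bytes()
```

Because the field is declared as `Optional[str]`, check it for `None` before decoding.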

PDF Embeddings

from pathlib import Path

# Get embeddings for a PDF (use a kit created with an image-capable provider, e.g. ColPali, Cohere, or Jina)
pdf_path = Path("path/to/document.pdf")
result = kit.embed_pdf(pdf_path)

print(result.model_provider)
print(result.input_type)
print(result.objects[0].embedding.shape)  # 2D for ColPali, 1D for Cohere/Jina
print(result.objects[0].source_b64)  # Base64 encoded PDF page
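
ColPali's 2D embeddings are multi-vector representations, which are normally scored with late interaction (MaxSim) rather than a single cosine similarity. The sketch below illustrates that scoring for ranking PDF pages against a query. It is not an EmbedKit function: `maxsim_score` and `best_page` are hypothetical names, and it assumes each 2D array has shape (number of vectors, dimension) and that each entry of `result.objects` corresponds to one page.

```python
from typing import List, Sequence, Tuple

Matrix = Sequence[Sequence[float]]  # rows are individual embedding vectors


def maxsim_score(query_vectors: Matrix, page_vectors: Matrix) -> float:
    """Late-interaction score: for each query vector, take its best dot
    product against any page vector, then sum those maxima."""
    total = 0.0
    for q in query_vectors:
        total += max(sum(x * y for x, y in zip(q, p)) for p in page_vectors)
    return total


def best_page(query_vectors: Matrix, pages: Sequence[Matrix]) -> Tuple[int, List[float]]:
    """Return the index of the highest-scoring page and all page scores."""
    scores = [maxsim_score(query_vectors, page) for page in pages]
    best = max(range(len(scores)), key=scores.__getitem__)
    return best, scores


# Stand-in data (assumption): a 2-vector query and three pages. Real values
# would come from kit.embed_text(...) and kit.embed_pdf(...) with ColPali.
query = [[1.0, 0.0], [0.0, 1.0]]
pages = [
    [[1.0, 0.0], [1.0, 0.0]],  # matches only the first query vector
    [[1.0, 0.0], [0.0, 1.0]],  # matches both query vectors
    [[0.5, 0.5]],              # partial match for both
]
best_index, page_scores = best_page(query, pages)
```

For large documents a vectorized NumPy or PyTorch implementation would be considerably faster; the nested loops here are only meant to make the computation explicit.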

Response Format

The embedding methods return an EmbeddingResponse object with the following structure:

from typing import List, Optional

import numpy as np

class EmbeddingObject:
    embedding: np.ndarray  # 1D array for most providers; 2D array for ColPali
    source_b64: Optional[str]  # Base64-encoded source for images and PDFs

class EmbeddingResponse:
    model_name: str
    model_provider: str
    input_type: str
    objects: List[EmbeddingObject]

Supported Models

ColPali

  • Model.ColPali.COLPALI_V1_3
  • Model.ColPali.COLSMOL_256M
  • Model.ColPali.COLSMOL_500M

Cohere

  • Model.Cohere.EMBED_V4_0
  • Model.Cohere.EMBED_ENGLISH_V3_0
  • Model.Cohere.EMBED_ENGLISH_LIGHT_V3_0
  • Model.Cohere.EMBED_MULTILINGUAL_V3_0
  • Model.Cohere.EMBED_MULTILINGUAL_LIGHT_V3_0

Jina

  • Model.Jina.CLIP_V2

Snowflake

  • Model.Snowflake.ARCTIC_EMBED_L_V2_0 - Large model optimized for high accuracy
  • Model.Snowflake.ARCTIC_EMBED_M_V1_5 - Medium model balanced for speed and accuracy

Development

Running Tests

Tests are organized by provider and can be run selectively using pytest markers:

# Run all tests
pytest

# Run tests for specific providers
pytest -m cohere    # Run only Cohere tests
pytest -m colpali   # Run only ColPali tests
pytest -m jina      # Run only Jina tests
pytest -m snowflake # Run only Snowflake tests

# Run tests for multiple providers
pytest -m "cohere or jina"

# Run all tests except a specific provider
pytest -m "not cohere"

# Additional pytest options
pytest -v           # Verbose output
pytest -s           # Show print statements
pytest -x           # Stop on first failure
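
For `pytest -m <provider>` to select tests, each test needs the matching marker (for example `@pytest.mark.cohere`), and the markers should be registered so pytest does not warn about unknown marks. One way to register them is a `pytest_configure` hook in `conftest.py`. This is a sketch of that approach; the project's actual configuration is not shown here and may differ, and the marker descriptions are made up for illustration.

```python
# conftest.py (hypothetical; the project's real setup may use pyproject.toml
# or pytest.ini instead)

# Marker names taken from the commands above; descriptions are illustrative.
PROVIDER_MARKERS = {
    "cohere": "tests that exercise the Cohere provider",
    "colpali": "tests that exercise the ColPali provider",
    "jina": "tests that exercise the Jina provider",
    "snowflake": "tests that exercise the Snowflake provider",
}


def pytest_configure(config):
    """Register one marker per provider so `pytest -m <name>` runs cleanly."""
    for name, description in PROVIDER_MARKERS.items():
        config.addinivalue_line("markers", f"{name}: {description}")
```

With the markers registered, a test decorated with `@pytest.mark.cohere` is picked up by `pytest -m cohere` and skipped by `pytest -m "not cohere"`.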

Requirements

  • Python 3.10+

License

MIT
