Matryoshka ColBERT: Multi-dimensional ColBERT embeddings with PyLate

colbert-matryoshka

This package provides MatryoshkaColBERT, a ColBERT model that produces Matryoshka embeddings through multiple linear heads (Jina-ColBERT-v2 style). It supports four embedding dimensions (32, 64, 96, 128), each served by a separate projection head.
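To make the "separate projection heads" idea concrete, here is a minimal, dependency-free sketch. It is not the package's implementation: the class `MultiHeadProjector`, its `project` method, the `hidden_size` value, and the random weights are all hypothetical, chosen only to show one independent linear map per output dimension and an active-dimension switch.

```python
import random


class MultiHeadProjector:
    """Toy stand-in (hypothetical) for a model with one linear head per dimension."""

    def __init__(self, hidden_size, dims=(32, 64, 96, 128), seed=0):
        rng = random.Random(seed)
        # One independent weight matrix (dim x hidden_size) per supported dimension.
        self.heads = {
            dim: [[rng.uniform(-0.1, 0.1) for _ in range(hidden_size)] for _ in range(dim)]
            for dim in dims
        }
        self.active_dim = max(dims)

    def set_active_dim(self, dim):
        if dim not in self.heads:
            raise ValueError(f"Unsupported dimension {dim}; choose from {sorted(self.heads)}")
        self.active_dim = dim

    def project(self, token_states):
        """Map each token's hidden-state vector to the active dimension."""
        weights = self.heads[self.active_dim]
        return [
            [sum(w * x for w, x in zip(row, state)) for row in weights]
            for state in token_states
        ]


projector = MultiHeadProjector(hidden_size=8)
tokens = [[0.5] * 8, [1.0] * 8, [-0.25] * 8]  # three fake token hidden states
projector.set_active_dim(64)
projected = projector.project(tokens)
print(len(projected), len(projected[0]))  # 3 64
```

In this sketch, switching dimensions selects a different weight matrix instead of slicing a prefix off one larger vector, which is the distinction the "multiple linear heads" wording points at.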

Installation

pip install colbert-matryoshka
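After installing, one way to confirm which version is present is to query the package metadata with the standard library. The helper name `installed_version` is an illustrative choice; `importlib.metadata` is part of Python 3.8 and later.

```python
from importlib import metadata


def installed_version(distribution):
    """Return the installed version string, or None if the distribution is absent."""
    try:
        return metadata.version(distribution)
    except metadata.PackageNotFoundError:
        return None


version = installed_version("colbert-matryoshka")
print(version if version else "colbert-matryoshka is not installed")
```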

Quick Start

from colbert_matryoshka import MatryoshkaColBERT

# Load model
model = MatryoshkaColBERT.from_pretrained("dragonkue/colbert-ko-0.1b")

# Set embedding dimension (32, 64, 96, or 128)
model.set_active_dim(128)

# Encode queries and documents
query_embeddings = model.encode(["검색 쿼리"], is_query=True)  # "search query"
doc_embeddings = model.encode(["문서 내용"], is_query=False)   # "document content"

print(f"Query shape: {query_embeddings[0].shape}")  # (num_tokens, 128)
print(f"Doc shape: {doc_embeddings[0].shape}")      # (num_tokens, 128)
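Because each text is encoded as one vector per token, a query and a document are compared token by token instead of with a single dot product. ColBERT-style models typically use a MaxSim (late interaction) score for this. The sketch below shows that scoring rule on plain Python lists with made-up two-dimensional vectors; the function name `maxsim` is illustrative, and whether this model's embeddings are normalized is not stated here, so treat the numbers as purely demonstrative.

```python
def maxsim(query_tokens, doc_tokens):
    """Sum, over query tokens, of the best dot product against any document token."""
    total = 0.0
    for q in query_tokens:
        total += max(sum(a * b for a, b in zip(q, d)) for d in doc_tokens)
    return total


# Toy token embeddings (2 dimensions instead of 32-128, values invented).
query = [[1.0, 0.0], [0.0, 1.0]]
doc_a = [[0.9, 0.1], [0.2, 0.8], [0.5, 0.5]]
doc_b = [[0.1, 0.1], [0.3, 0.2]]

score_a = maxsim(query, doc_a)
score_b = maxsim(query, doc_b)
print(score_a > score_b)  # True: doc_a has a close match for each query token
```

The same rule applies at any of the supported dimensions; only the length of each token vector changes.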

Retrieval with PyLate

from colbert_matryoshka import MatryoshkaColBERT
from pylate import indexes, retrieve

# Load model
model = MatryoshkaColBERT.from_pretrained("dragonkue/colbert-ko-0.1b")
model.set_active_dim(128)

# Initialize PLAID index
index = indexes.PLAID(
    index_folder="pylate-index",
    index_name="index",
    override=True,
)

# Encode and index documents
documents_ids = ["1", "2", "3"]
# "This is the first document", "This is the second document", "This is the third document"
documents = ["첫번째 문서입니다", "두번째 문서입니다", "세번째 문서입니다"]

documents_embeddings = model.encode(documents, is_query=False)
index.add_documents(
    documents_ids=documents_ids,
    documents_embeddings=documents_embeddings,
)

# Retrieve
retriever = retrieve.ColBERT(index=index)
queries_embeddings = model.encode(["첫번째 문서 검색"], is_query=True)  # "search for the first document"

scores = retriever.retrieve(
    queries_embeddings=queries_embeddings,
    k=3,
)
print(scores)
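The printed `scores` can be post-processed into a per-query report. The sketch below assumes the retriever returns one list per query, each containing dictionaries with `"id"` and `"score"` keys; verify that shape against your installed PyLate version. The helper `top_hits` and the sample values are hypothetical and are not real model output.

```python
def top_hits(results, query_texts):
    """Pair each query with its (id, score) hits, best score first."""
    report = {}
    for query, hits in zip(query_texts, results):
        ranked = sorted(hits, key=lambda hit: hit["score"], reverse=True)
        report[query] = [(hit["id"], hit["score"]) for hit in ranked]
    return report


# Invented data in the assumed shape; scores are placeholders.
example_results = [
    [{"id": "2", "score": 7.5}, {"id": "1", "score": 9.25}, {"id": "3", "score": 3.0}],
]
report = top_hits(example_results, ["first document search"])
for query, hits in report.items():
    print(query, hits)
```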

Reranking

from colbert_matryoshka import MatryoshkaColBERT
from pylate import rank

# Load model
model = MatryoshkaColBERT.from_pretrained("dragonkue/colbert-ko-0.1b")
model.set_active_dim(128)

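# "artificial intelligence technology", "Korean natural language processing"
queries = ["인공지능 기술", "한국어 자연어처리"]

# Candidates per query:
#   query 1: "a document about AI and machine learning", "a cooking recipe document"
#   query 2: "Korean NLP research", "English grammar explanation", "programming tutorial"
documents = [
    ["AI와 머신러닝에 대한 문서", "요리 레시피 문서"],
    ["한국어 NLP 연구", "영어 문법 설명", "프로그래밍 튜토리얼"],
]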

documents_ids = [
    [1, 2],
    [1, 3, 2],
]

# Encode
queries_embeddings = model.encode(queries, is_query=True)
documents_embeddings = [model.encode(docs, is_query=False) for docs in documents]

# Rerank
reranked = rank.rerank(
    documents_ids=documents_ids,
    queries_embeddings=queries_embeddings,
    documents_embeddings=documents_embeddings,
)
print(reranked)
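In the reranking example the ids are scoped per query (id `1` refers to a different document for each query), so mapping results back to text needs a per-query lookup. The sketch below assumes `reranked` is a list with one entry per query, each a list of dictionaries with `"id"` and `"score"` keys; confirm this against your PyLate version. The helper `attach_text`, the English placeholder documents, and the scores are hypothetical.

```python
def attach_text(reranked, documents_ids, documents):
    """Replace each reranked id with the matching document text for its query."""
    output = []
    for hits, ids, docs in zip(reranked, documents_ids, documents):
        text_by_id = dict(zip(ids, docs))  # lookup is local to this query
        output.append([(text_by_id[hit["id"]], hit["score"]) for hit in hits])
    return output


# Invented data in the assumed shape; scores are placeholders.
example_documents = [["doc about AI", "recipe doc"], ["Korean NLP", "grammar", "tutorial"]]
example_ids = [[1, 2], [1, 3, 2]]
example_reranked = [
    [{"id": 1, "score": 8.0}, {"id": 2, "score": 2.5}],
    [{"id": 1, "score": 9.0}, {"id": 2, "score": 4.0}, {"id": 3, "score": 1.5}],
]
with_text = attach_text(example_reranked, example_ids, example_documents)
for hits in with_text:
    print(hits)
```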

Available Models

Model                      Dimensions        Language
dragonkue/colbert-ko-0.1b  32, 64, 96, 128   Korean
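The choice among the listed dimensions mainly trades retrieval quality against storage. As a rough worked example, the sketch below computes the raw size of uncompressed token embeddings for each dimension. The corpus size, the 100 tokens per document, and float32 storage are assumptions for illustration only; a PLAID index compresses embeddings, so real index sizes will differ.

```python
DIMS = (32, 64, 96, 128)


def index_bytes(num_docs, tokens_per_doc, dim, bytes_per_value=4):
    """Raw size of uncompressed token embeddings (float32 by default)."""
    return num_docs * tokens_per_doc * dim * bytes_per_value


for dim in DIMS:
    size = index_bytes(num_docs=1_000_000, tokens_per_doc=100, dim=dim)
    print(f"dim={dim:>3}: {size / 1e9:.1f} GB")
# dim= 32: 12.8 GB
# dim= 64: 25.6 GB
# dim= 96: 38.4 GB
# dim=128: 51.2 GB
```

Under these assumptions, dropping from 128 to 32 dimensions cuts raw embedding storage by a factor of four.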

License

Apache-2.0
