Matryoshka ColBERT: Multi-dimensional ColBERT embeddings with PyLate
colbert-matryoshka
This package provides MatryoshkaColBERT, a ColBERT model that produces Matryoshka embeddings through multiple linear heads, in the style of Jina-ColBERT-v2. Each supported embedding dimension (32, 64, 96, 128) has its own projection head.
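To make the idea of separate projection heads concrete, here is a minimal pure-Python sketch. It is not the package's implementation: the class MultiHeadProjector, the tiny HIDDEN_SIZE, and the random weights are all hypothetical stand-ins, and the toy set_active_dim method only mirrors the name used in the Quick Start. It assumes each head is an independent linear map from the backbone's hidden state to one of the listed dimensions.

```python
import random

HIDDEN_SIZE = 8                 # hypothetical backbone width, kept tiny for readability
HEAD_DIMS = (32, 64, 96, 128)   # dimensions listed in this README


def make_head(in_dim, out_dim, rng):
    """Return a random (out_dim x in_dim) matrix standing in for a trained linear head."""
    return [[rng.uniform(-1.0, 1.0) for _ in range(in_dim)] for _ in range(out_dim)]


class MultiHeadProjector:
    """Toy illustration: one independent linear projection per supported dimension."""

    def __init__(self, hidden_size, dims, seed=0):
        rng = random.Random(seed)
        self.heads = {d: make_head(hidden_size, d, rng) for d in dims}
        self.active_dim = max(dims)

    def set_active_dim(self, dim):
        if dim not in self.heads:
            raise ValueError(f"unsupported dimension {dim}; choose from {sorted(self.heads)}")
        self.active_dim = dim

    def project(self, token_states):
        """Project each token's hidden state with the currently active head."""
        weights = self.heads[self.active_dim]
        return [
            [sum(w * x for w, x in zip(row, state)) for row in weights]
            for state in token_states
        ]


projector = MultiHeadProjector(HIDDEN_SIZE, HEAD_DIMS)
token_states = [[0.1 * (i + j) for j in range(HIDDEN_SIZE)] for i in range(5)]  # 5 fake tokens

projector.set_active_dim(64)
projected = projector.project(token_states)
print(len(projected), len(projected[0]))  # 5 64
```

Because every head has its own weights in this sketch, switching the active dimension selects a different projection instead of truncating a single 128-dimensional vector.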
Installation
pip install colbert-matryoshka
Quick Start
from colbert_matryoshka import MatryoshkaColBERT
# Load model
model = MatryoshkaColBERT.from_pretrained("dragonkue/colbert-ko-0.1b")
# Set embedding dimension (32, 64, 96, or 128)
model.set_active_dim(128)
# Encode queries and documents
query_embeddings = model.encode(["검색 쿼리"], is_query=True)  # "search query"
doc_embeddings = model.encode(["문서 내용"], is_query=False)  # "document content"
print(f"Query shape: {query_embeddings[0].shape}")  # (num_tokens, 128)
print(f"Doc shape: {doc_embeddings[0].shape}")  # (num_tokens, 128)
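Since each token yields one vector of the active dimension, the choice of dimension scales storage linearly. The following back-of-the-envelope sketch illustrates that arithmetic. The corpus size, average token count, and float16 storage are assumptions picked for illustration, and raw_index_bytes is a hypothetical helper; an index such as PLAID applies its own compression, so real sizes will differ.

```python
DIMS = (32, 64, 96, 128)   # dimensions listed in this README
BYTES_PER_VALUE = 2        # assumption: float16 storage, no compression
NUM_DOCS = 1_000_000       # hypothetical corpus size
TOKENS_PER_DOC = 200       # hypothetical average document length in tokens


def raw_index_bytes(num_docs, tokens_per_doc, dim, bytes_per_value):
    """One vector of `dim` values is stored for every document token."""
    return num_docs * tokens_per_doc * dim * bytes_per_value


for dim in DIMS:
    size_gb = raw_index_bytes(NUM_DOCS, TOKENS_PER_DOC, dim, BYTES_PER_VALUE) / 1e9
    print(f"dim={dim:>3}: {size_gb:.1f} GB")
```

Under these assumptions the raw embeddings take 12.8 GB at 32 dimensions and 51.2 GB at 128, a fourfold difference.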
Retrieval with PyLate
from colbert_matryoshka import MatryoshkaColBERT
from pylate import indexes, retrieve
# Load model
model = MatryoshkaColBERT.from_pretrained("dragonkue/colbert-ko-0.1b")
model.set_active_dim(128)
# Initialize PLAID index
index = indexes.PLAID(
    index_folder="pylate-index",
    index_name="index",
    override=True,
)
# Encode and index documents
documents_ids = ["1", "2", "3"]
# "This is the first document", "This is the second document", "This is the third document"
documents = ["첫번째 문서입니다", "두번째 문서입니다", "세번째 문서입니다"]
documents_embeddings = model.encode(documents, is_query=False)
index.add_documents(
    documents_ids=documents_ids,
    documents_embeddings=documents_embeddings,
)
# Retrieve
retriever = retrieve.ColBERT(index=index)
queries_embeddings = model.encode(["첫번째 문서 검색"], is_query=True)  # "search for the first document"
scores = retriever.retrieve(
    queries_embeddings=queries_embeddings,
    k=3,
)
print(scores)
Reranking
from colbert_matryoshka import MatryoshkaColBERT
from pylate import rank
# Load model
model = MatryoshkaColBERT.from_pretrained("dragonkue/colbert-ko-0.1b")
model.set_active_dim(128)
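# "Artificial intelligence technology", "Korean natural language processing"
queries = ["인공지능 기술", "한국어 자연어처리"]
documents = [
    # "A document about AI and machine learning", "A cooking recipe document"
    ["AI와 머신러닝에 대한 문서", "요리 레시피 문서"],
    # "Korean NLP research", "English grammar explanation", "Programming tutorial"
    ["한국어 NLP 연구", "영어 문법 설명", "프로그래밍 튜토리얼"],
]
# IDs of each query's candidate documents, in the same order as `documents`
documents_ids = [
    [1, 2],
    [1, 3, 2],
]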
# Encode
queries_embeddings = model.encode(queries, is_query=True)
documents_embeddings = [model.encode(docs, is_query=False) for docs in documents]
# Rerank
reranked = rank.rerank(
    documents_ids=documents_ids,
    queries_embeddings=queries_embeddings,
    documents_embeddings=documents_embeddings,
)
print(reranked)
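For intuition about what reranking computes, the sketch below implements the MaxSim late-interaction score that ColBERT-style models use: for each query token, take the best dot product against any document token, then sum over query tokens. This is a pure-Python illustration, not PyLate's code. The helpers dot, maxsim_score, and rerank_by_maxsim are hypothetical names, the 2-dimensional vectors are toy data, and the (doc_id, score) output format is an assumption, not necessarily what rank.rerank returns.

```python
def dot(u, v):
    """Dot product of two equal-length vectors."""
    return sum(a * b for a, b in zip(u, v))


def maxsim_score(query_tokens, doc_tokens):
    """Sum, over query tokens, of the best dot product against any document token."""
    return sum(max(dot(q, d) for d in doc_tokens) for q in query_tokens)


def rerank_by_maxsim(query_tokens, docs_tokens, doc_ids):
    """Return (doc_id, score) pairs sorted from highest to lowest score."""
    scored = [
        (doc_id, maxsim_score(query_tokens, doc))
        for doc_id, doc in zip(doc_ids, docs_tokens)
    ]
    return sorted(scored, key=lambda pair: pair[1], reverse=True)


# Toy 2-dimensional token embeddings (real ones would have 32 to 128 dimensions).
query = [[1.0, 0.0], [0.0, 1.0]]
docs = [
    [[1.0, 0.0], [0.0, 1.0]],   # matches both query tokens
    [[0.5, 0.0], [0.0, 0.0]],   # weak match on the first query token only
]
ranking = rerank_by_maxsim(query, docs, doc_ids=[1, 2])
print(ranking)  # [(1, 2.0), (2, 0.5)]
```

The score depends only on token-level dot products, so the same procedure applies at any of the supported dimensions as long as queries and documents are encoded with the same active dimension.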
Available Models
| Model | Dimensions | Language |
|---|---|---|
| dragonkue/colbert-ko-0.1b | 32, 64, 96, 128 | Korean |
License
Apache-2.0