Multiocular
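A very simple helper library to keep track of locations in tokenized text. Works with HuggingFace Transformers. You mark positions in a string with a separator character, and multiocular.tokenize returns the tokenized text with the separators removed, along with the token index of each marked position.
Named after the multiocular O (ꙮ), which is also the default separator character, because it helps you see everywhere inside your tokenized string.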
Example
import multiocular
from multiocular import SEP
from transformers import AutoTokenizer
tok = AutoTokenizer.from_pretrained("meta-llama/Llama-3.1-8B-Instruct", use_fast=True)
chat = [
    {"role": "user", "content": f"What is the capital of{SEP} France?"},
    {"role": "assistant", "content": f"The capital of{SEP} France is{SEP} Paris."},
]
message = tok.apply_chat_template(chat, tokenize=False)
print(message)
# <|begin_of_text|><|start_header_id|>system<|end_header_id|>
#
# Cutting Knowledge Date: December 2023
# Today Date: 26 Jul 2024
#
# <|eot_id|><|start_header_id|>user<|end_header_id|>
#
# What is the capital ofꙮ France?<|eot_id|><|start_header_id|>assistant<|end_header_id|>
#
# The capital ofꙮ France isꙮ Paris.<|eot_id|>
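# Tokenize the message; the SEP markers are removed from the tokens, and
# points holds one token index per marker, in order of appearance.
tokens, points = multiocular.tokenize(tok, message)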
print((tokens, points))
# ({'input_ids': [128000, 128000, 128006, 9125, 128007, 271, 38766, 1303, 33025, 2696, 25, 6790, 220, 2366, 18, 198, 15724, 2696, 25, 220, 1627, 10263, 220, 2366, 19, 271, 128009, 128006, 882, 128007, 271, 3923, 374, 279, 6864, 315, 9822, 30, 128009, 128006, 78191, 128007, 271, 791, 6864, 315, 9822, 374, 12366, 13, 128009], 'attention_mask': [1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1]}, [36, 46, 48])
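# Each point is the index of the token that follows its marker.
france1, france2, paris = points
# Decode everything before the " Paris" token.
print(tok.decode(tokens.input_ids[:paris]))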
# <|begin_of_text|><|begin_of_text|><|start_header_id|>system<|end_header_id|>
#
# Cutting Knowledge Date: December 2023
# Today Date: 26 Jul 2024
#
# <|eot_id|><|start_header_id|>user<|end_header_id|>
#
# What is the capital of France?<|eot_id|><|start_header_id|>assistant<|end_header_id|>
#
# The capital of France is
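To make the idea behind the example above more concrete, here is a minimal sketch of how separator-based position tracking could work. It is not multiocular's actual implementation: the helper names strip_markers, toy_tokenize, and marker_token_indices are hypothetical, and a toy whitespace tokenizer stands in for a real one so the block runs without downloading a model. With a HuggingFace fast tokenizer, the character spans would presumably come from its offset mapping instead.

```python
# Hypothetical sketch: NOT multiocular's real implementation.
import re

SEP = "\uA66E"  # the multiocular O, used here as the marker character


def strip_markers(text, sep=SEP):
    """Remove every marker; return the clean text and, for each marker,
    the character offset in the clean text where it used to sit."""
    pieces = text.split(sep)
    offsets = []
    position = 0
    for piece in pieces[:-1]:
        position += len(piece)
        offsets.append(position)
    return "".join(pieces), offsets


def toy_tokenize(text):
    """Stand-in tokenizer (an assumption for this sketch): each token is
    optional leading whitespace plus a run of non-space characters.
    Returns a list of (token, start, end) character spans."""
    return [(m.group(), m.start(), m.end()) for m in re.finditer(r"\s*\S+", text)]


def marker_token_indices(spans, offsets):
    """Map each character offset to the index of the first token that
    starts at or after it (len(spans) if the marker is at the very end)."""
    indices = []
    for offset in offsets:
        index = next(
            (i for i, (_, start, _) in enumerate(spans) if start >= offset),
            len(spans),
        )
        indices.append(index)
    return indices


clean, offsets = strip_markers(f"The capital of{SEP} France is{SEP} Paris.")
spans = toy_tokenize(clean)
france, paris = marker_token_indices(spans, offsets)
print("".join(token for token, _, _ in spans[:paris]))
# The capital of France is
```

As in the example above, slicing the token list up to a marker's index gives everything that precedes the marked position.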
File details
Details for the file multiocular-0.1.0.tar.gz.
File metadata
- Download URL: multiocular-0.1.0.tar.gz
- Upload date:
- Size: 2.5 kB
- Tags: Source
- Uploaded using Trusted Publishing? No
- Uploaded via: uv/0.6.14
File hashes
| Algorithm | Hash digest |
|---|---|
| SHA256 | b9306327326845e6a091402c4b1fabacb634850332b9cf8e04f3d8ded3135959 |
| MD5 | fefd5fd19a342f02d575612040b96333 |
| BLAKE2b-256 | a26feb1b3de269afd3339ad272928ef205caedceb2214ff2974f29f65106b377 |
File details
Details for the file multiocular-0.1.0-py3-none-any.whl.
File metadata
- Download URL: multiocular-0.1.0-py3-none-any.whl
- Upload date:
- Size: 2.7 kB
- Tags: Python 3
- Uploaded using Trusted Publishing? No
- Uploaded via: uv/0.6.14
File hashes
| Algorithm | Hash digest |
|---|---|
| SHA256 | 2bd4e781623545fd5b4dbdac0684088e8b823b74f823e90a6452adddc4874df0 |
| MD5 | 43d7cf9d8c02b905d3ab47b507498937 |
| BLAKE2b-256 | 3cce060bdf5b507c740f2c62d4e7755188e525e9c89e6a111bac6967c7eeaa98 |