Multiocular

A very simple helper library for keeping track of locations in tokenized text. Works with Hugging Face Transformers.

Named after the multiocular O (ꙮ), which is also the default separator character, because it helps you see everywhere inside your tokenized string.

Example

import multiocular
from multiocular import SEP
from transformers import AutoTokenizer

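# Load a fast tokenizer for the chat model.
tok = AutoTokenizer.from_pretrained("meta-llama/Llama-3.1-8B-Instruct", use_fast=True)

# Put SEP at every position you want to locate after tokenization.
chat = [
    {"role": "user", "content": f"What is the capital of{SEP} France?"},
    {"role": "assistant", "content": f"The capital of{SEP} France is{SEP} Paris."},
]
# Render the chat to a plain string; the separators survive templating.
message = tok.apply_chat_template(chat, tokenize=False)
print(message)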
# <|begin_of_text|><|start_header_id|>system<|end_header_id|>
# 
# Cutting Knowledge Date: December 2023
# Today Date: 26 Jul 2024
# 
# <|eot_id|><|start_header_id|>user<|end_header_id|>
# 
# What is the capital ofꙮ France?<|eot_id|><|start_header_id|>assistant<|end_header_id|>
# 
# The capital ofꙮ France isꙮ Paris.<|eot_id|>

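# tokenize() returns the encoding with the separators removed,
# plus one token index per separator, in order of appearance.
tokens, points = multiocular.tokenize(tok, message)
print((tokens, points))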
# ({'input_ids': [128000, 128000, 128006, 9125, 128007, 271, 38766, 1303, 33025, 2696, 25, 6790, 220, 2366, 18, 198, 15724, 2696, 25, 220, 1627, 10263, 220, 2366, 19, 271, 128009, 128006, 882, 128007, 271, 3923, 374, 279, 6864, 315, 9822, 30, 128009, 128006, 78191, 128007, 271, 791, 6864, 315, 9822, 374, 12366, 13, 128009], 'attention_mask': [1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1]}, [36, 46, 48])

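# Each point is the index of the first token after its separator,
# so it can be used directly as a slice bound.
france1, france2, paris = points
print(tok.decode(tokens.input_ids[:paris]))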
# <|begin_of_text|><|begin_of_text|><|start_header_id|>system<|end_header_id|>
# 
# Cutting Knowledge Date: December 2023
# Today Date: 26 Jul 2024
# 
# <|eot_id|><|start_header_id|>user<|end_header_id|>
# 
# What is the capital of France?<|eot_id|><|start_header_id|>assistant<|end_header_id|>
# 
# The capital of France is
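
To make the idea behind the example concrete, here is a minimal sketch of how separator positions could be turned into token indices. It is not multiocular's actual implementation: `strip_markers`, `toy_tokenize`, and `char_to_token_points` are hypothetical helpers, and the regex tokenizer is a stand-in for a real one. With a fast Hugging Face tokenizer, the character spans could instead come from `return_offsets_mapping=True`.

```python
import re

# Assumption: the separator is the multiocular O (U+A66E), as the output above suggests.
SEP = "\ua66e"

def strip_markers(text, sep=SEP):
    """Remove every separator; return the clean text and the character
    offsets (in the clean text) where the separators used to be."""
    parts = text.split(sep)
    positions = []
    clean_len = 0
    for part in parts[:-1]:
        clean_len += len(part)
        positions.append(clean_len)
    return "".join(parts), positions

def toy_tokenize(text):
    """Stand-in tokenizer (hypothetical): each token is a word or a
    punctuation mark together with its leading whitespace.
    Returns (token, start, end) triples."""
    pattern = r"\s*\w+|\s*[^\w\s]"
    return [(m.group(), m.start(), m.end()) for m in re.finditer(pattern, text)]

def char_to_token_points(spans, positions):
    """For each character offset, count the tokens that end at or before it.
    That count is the index of the first token after the separator."""
    return [sum(1 for _, _, end in spans if end <= pos) for pos in positions]

text = f"The capital of{SEP} France is{SEP} Paris."
clean, positions = strip_markers(text)
spans = toy_tokenize(clean)
points = char_to_token_points(spans, positions)
tokens = [tok for tok, _, _ in spans]

# Everything before the second separator.
prefix = "".join(tokens[:points[1]])
print(points, repr(prefix))
```

As in the example above, each point works as an exclusive slice bound: `tokens[:points[1]]` covers the text up to, but not including, the token that followed the second separator.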
