LLM tokenizer tools


llm_tokenizers


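Introduction

A collection of tokenizers for various LLMs.

Software architecture

Software architecture description

Project installation

Clone the project locally: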

git clone https://gitee.com/sky_flash/llm_tokenizers.git

cd llm_tokenizers

Install the dependencies with pip:

pip install -r requirements.txt

Package installation guide

Usage instructions

Install with pip:

pip install llm_tokenizers

Packaging the project

Make sure the build tool is installed, then build the package:

pip install build

python -m build

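When the build finishes, the generated .whl and .tar.gz files are saved in the dist/ directory.

Install the built .whl file (the file name below is an example; use the one actually generated):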

pip install dist/llm_tokenizers-0.1.0-py3-none-any.whl
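Because the wheel's file name changes with every version, it can help to look the file up instead of typing it. The following is a small sketch, not part of the package; the helper name newest_wheel is made up for this example, and it assumes the build output is in dist/ as described above.

```python
from pathlib import Path
from typing import Optional


def newest_wheel(dist_dir: str = "dist") -> Optional[Path]:
    """Return the most recently modified .whl file in dist_dir, or None.

    Hypothetical helper for illustration; not part of llm_tokenizers.
    """
    wheels = sorted(Path(dist_dir).glob("*.whl"), key=lambda p: p.stat().st_mtime)
    return wheels[-1] if wheels else None


if __name__ == "__main__":
    wheel = newest_wheel()
    if wheel is None:
        print("No wheel found; run `python -m build` first.")
    else:
        print(f"pip install {wheel}")
```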

API usage

You can import the DeepSeekTokenizer class and use its functionality directly. A complete walkthrough follows.

from llm_tokenizers.deepseek_tokenizer import DeepSeekTokenizer

1. Get the tokenizer identifier

print(DeepSeekTokenizer.id())  # Output: deepseek

The id() method returns the tokenizer's unique identifier, which a program can use to tell which tokenizer is currently in use.

2. Encode text into a list of token IDs

text = "Hello, world!"
token_ids = DeepSeekTokenizer.encode(text)
print(token_ids)  # Output: a list of token IDs

encode(text: str) -> List[int]
Encodes the input string into the corresponding list of token IDs.

3. Decode token IDs back into the original text

decoded_text = DeepSeekTokenizer.decode(token_ids)
print(decoded_text)  # Output: Hello, world!

decode(data: Union[int, List[int], np.ndarray, torch.Tensor, tf.Tensor]) -> str
Accepts several input types and returns the decoded string.
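This README does not show how decode() handles the different input types. As a rough sketch of how such mixed inputs could be normalized to a plain list of integers before decoding, the helper below relies on duck typing. The name to_id_list is hypothetical, the helper only covers flat (one-dimensional) data, and the package's actual implementation may differ.

```python
from typing import Any, List


def to_id_list(data: Any) -> List[int]:
    """Normalize an int, a flat sequence of ints, or a tensor-like object.

    Hypothetical helper for illustration; not part of llm_tokenizers.
    """
    # A single token ID becomes a one-element list.
    if isinstance(data, int):
        return [data]
    # Objects that only expose .numpy() (as TensorFlow eager tensors do)
    # are converted to an array-like object first.
    if hasattr(data, "numpy") and not hasattr(data, "tolist"):
        data = data.numpy()
    # numpy arrays and torch tensors both expose .tolist().
    if hasattr(data, "tolist"):
        data = data.tolist()
    # A 0-d array or tensor yields a bare int from .tolist().
    if isinstance(data, int):
        return [data]
    return [int(x) for x in data]


print(to_id_list(7), to_id_list((4, 5)))
```

Checking for methods instead of importing numpy, torch, or tensorflow means none of those libraries has to be installed unless the caller already uses it.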

4. Count tokens

token_count = DeepSeekTokenizer.tokens_len(text)
print(f"Token count: {token_count}")  # Example output: Token count: 5

tokens_len(text: str)
Returns the number of tokens produced by encoding the input text.
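A common use of a token count is keeping text within a model's context limit. The sketch below groups lines into chunks under a token budget. It takes the counting function as a parameter so that it runs without the package installed; whitespace_count is a stand-in, and with the package installed DeepSeekTokenizer.tokens_len could presumably be passed instead. The names split_to_budget and whitespace_count are invented for this example.

```python
from typing import Callable, List


def split_to_budget(lines: List[str], max_tokens: int,
                    count_tokens: Callable[[str], int]) -> List[str]:
    """Greedily group lines into chunks that stay within max_tokens.

    A single line that exceeds the budget on its own becomes its own chunk.
    """
    chunks: List[str] = []
    current = ""
    for line in lines:
        candidate = f"{current}\n{line}" if current else line
        if current and count_tokens(candidate) > max_tokens:
            # Adding this line would overflow: close the chunk, start a new one.
            chunks.append(current)
            current = line
        else:
            current = candidate
    if current:
        chunks.append(current)
    return chunks


def whitespace_count(text: str) -> int:
    """Stand-in counter: one token per whitespace-separated word."""
    return len(text.split())


print(split_to_budget(["a b c", "d e", "f g h i"], 5, whitespace_count))
```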

✅ Complete usage example

from llm_tokenizers.deepseek_tokenizer import DeepSeekTokenizer

# Get the identifier
print("Tokenizer ID:", DeepSeekTokenizer.id())

# Encode text
text = "Hello, deepseek tokenizer!"
token_ids = DeepSeekTokenizer.encode(text)
print("Encoded tokens:", token_ids)

# Decode tokens
decoded = DeepSeekTokenizer.decode(token_ids)
print("Decoded text:", decoded)

# Count tokens
print("Token count:", DeepSeekTokenizer.tokens_len(text))

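📌 Notes:

DeepSeekTokenizer is a class-method-driven utility class: every method is a @classmethod, so it can be called without creating an instance.
The transformers model files it depends on should be placed in the resources/deepseek_tokenizer/ directory.
To pass np.ndarray, torch.Tensor, or tf.Tensor data, make sure the corresponding library (numpy, torch, or tensorflow) is installed.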
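To illustrate the class-method-driven pattern from the notes above, here is a toy sketch. ToyTokenizer and its byte-level "vocabulary" are invented for the example. The real class loads transformers model files instead, and its internals are not shown in this README.

```python
from typing import List


class ToyTokenizer:
    """Toy stand-in: a utility class driven entirely by @classmethod."""

    _loaded = False  # class-level state shared by every caller

    @classmethod
    def _ensure_loaded(cls) -> None:
        # A real tokenizer would load its model files here, once, on first use.
        if not cls._loaded:
            cls._loaded = True

    @classmethod
    def id(cls) -> str:
        return "toy"

    @classmethod
    def encode(cls, text: str) -> List[int]:
        cls._ensure_loaded()
        return list(text.encode("utf-8"))  # one token per UTF-8 byte

    @classmethod
    def decode(cls, token_ids: List[int]) -> str:
        cls._ensure_loaded()
        return bytes(token_ids).decode("utf-8")

    @classmethod
    def tokens_len(cls, text: str) -> int:
        return len(cls.encode(text))


# No instance is created; every call goes through the class itself.
ids = ToyTokenizer.encode("Hello, world!")
print(ToyTokenizer.id(), len(ids), ToyTokenizer.decode(ids))
```

Keeping the loaded state on the class means the expensive load happens at most once per process, which is one plausible reason for choosing this design.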

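Command-line usage

After installing the project (for example, from the .whl file), run the llm-token command:

llm-token [options]

Command-line options

| Option | Long form | Description | Example |
| --- | --- | --- | --- |
| `-t` | `--tokenizer` | Tokenizer type to use | `llm-token -t deepseek -i "Hello"` |
| `-f` | `--file` | Path of the input file | `llm-token -f ./input.txt` |
| `-u` | `--url` | URL of the input | `llm-token -u https://example.com/text` |
| `-o` | `--output` | Path of the output file | `llm-token -i "Hello" -o ./output.txt` |
| `-c` | `--count` | Count the tokens | `llm-token -c -i "Hello world"` |
| `-i` | `--input` | Text passed directly on the command line | `llm-token -i "Hello world"` |
| `--read-charset` | | Character set used to read the input file | `llm-token -f ./input.txt --read-charset gbk` |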
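The option table maps naturally onto Python's argparse. The following sketch shows how a parser with these flags could be declared. It is not the package's actual CLI code, and the defaults (deepseek for -t, utf-8 for --read-charset) are assumptions made for the example.

```python
import argparse


def build_parser() -> argparse.ArgumentParser:
    """Declare the flags listed in the option table (illustrative only)."""
    parser = argparse.ArgumentParser(prog="llm-token")
    parser.add_argument("-t", "--tokenizer", default="deepseek",  # assumed default
                        help="tokenizer type to use")
    parser.add_argument("-f", "--file", help="path of the input file")
    parser.add_argument("-u", "--url", help="URL of the input")
    parser.add_argument("-o", "--output", help="path of the output file")
    parser.add_argument("-c", "--count", action="store_true",
                        help="count the tokens")
    parser.add_argument("-i", "--input", help="text passed directly")
    parser.add_argument("--read-charset", default="utf-8",  # assumed default
                        help="character set used to read the input file")
    return parser


args = build_parser().parse_args(["-c", "-i", "Hello world"])
print(args.count, args.input, args.read_charset)
```

argparse turns --read-charset into the attribute read_charset, so the value is read as args.read_charset.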

Usage examples

Encode text passed directly:
```
llm-token -i "Hello, world!"
```

Count the tokens in a text:

llm-token -c -i "Hello, world!"
# Example output: tokens count: 5

Encode the contents of a file:
```
llm-token -f ./input.txt
```

Encode content fetched from a URL:

llm-token -u https://example.com/sample.txt

Specify the tokenizer type:

llm-token -t deepseek -i "Hello, world!"

Write the result to a file:

llm-token -i "Hello, world!" -o ./encoded_output.txt

Specify the character set used to read a file:

llm-token -f ./chinese_text.txt --read-charset gbk

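Input priority

When several input methods are specified at once, the program handles them in the following order of priority:

-i / --input (direct text input)
-f / --file (file input)
-u / --url (URL input)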
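The priority order above can be sketched as a small selection function. This assumes the list runs from highest to lowest priority. The function name pick_source and its return shape are invented for the example and are not taken from the package.

```python
from typing import Optional, Tuple


def pick_source(input_text: Optional[str] = None,
                file: Optional[str] = None,
                url: Optional[str] = None) -> Tuple[str, str]:
    """Return (kind, value) for the highest-priority input that was given."""
    # Checked in the documented order: -i first, then -f, then -u.
    for kind, value in (("input", input_text), ("file", file), ("url", url)):
        if value is not None:
            return kind, value
    raise ValueError("specify at least one of -i, -f, or -u")


print(pick_source(file="./input.txt", url="https://example.com/text"))
```

When both a file and a URL are given, the file wins and the URL is ignored.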

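Notes

At least one input method must be specified (-i, -f, or -u).
With -c, only the token count is printed; the encoded result is not.
Output goes to the console by default; use -o to write it to a file.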
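The output rules in these notes could look roughly like the sketch below. The "tokens count: N" wording follows the example output shown earlier. The helper names format_result and emit, and the exact layout of the encoded result, are assumptions made for the example.

```python
from typing import List, Optional


def format_result(token_ids: List[int], count_only: bool) -> str:
    """With count_only, report just the number of tokens; otherwise the IDs."""
    if count_only:
        return f"tokens count: {len(token_ids)}"
    return str(token_ids)


def emit(text: str, output_path: Optional[str] = None) -> None:
    """Print to the console by default, or write to output_path when given."""
    if output_path is None:
        print(text)
    else:
        with open(output_path, "w", encoding="utf-8") as fh:
            fh.write(text + "\n")


emit(format_result([101, 102, 103], count_only=True))
```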

Contributing

Fork the repository
Create a Feat_xxx branch
Commit your code
Open a Pull Request

Release history

0.1.4 (Aug 19, 2025)
0.1.3 (Aug 19, 2025)
0.1.2 (Jul 23, 2025)
0.1.1 (Jul 23, 2025), this version
0.1.0 (Jul 23, 2025)

