A simple sentencepiece encoder and decoder without any dependencies.
simple-sentencepiece
A simple sentencepiece encoder and decoder.
Note: This is not a new sentencepiece toolkit. It takes a model trained with Google's sentencepiece as input and encodes strings to ids or pieces, or decodes ids back to strings. Its advantage is that it has no dependencies (no protobuf), which makes it easier to integrate into a C++ project.
Installation
pip install simple-sentencepiece
Usage
The usage is very similar to sentencepiece; it also provides encode and decode interfaces.
from ssentencepiece import Ssentencepiece
# you can get bpe.vocab from a trained bpe model, see google's sentencepiece for details
ssp = Ssentencepiece("/path/to/bpe.vocab")
# you can also use the default models provided by this package, see below for details
ssp = Ssentencepiece("gigaspeech-500")
ssp = Ssentencepiece("zh-en-10381")
# output ids (supports both a str and a list of strs)
# if it is a list of strs, the strs are encoded in parallel
ids = ssp.encode("HELLO WORLD")
ids = ssp.encode(["HELLO WORLD", "LOVE AND PIECE"])
# output string pieces
# if it is a list of strs, the strs are encoded in parallel
pieces = ssp.encode("HELLO WORLD", out_type=str)
pieces = ssp.encode(["HELLO WORLD", "LOVE AND PIECE"], out_type=str)
# decode (supports a list of ids or a list of lists of ids)
# if it is a list of lists of ids, the lists are decoded in parallel
res = ssp.decode([1,2,3,4,5])
res = ssp.decode([[1,2,3,4], [4,5,6,7]])
# get vocab size
res = ssp.vocab_size()
# piece to id (supports both a str and a list of strs)
id = ssp.piece_to_id("<sos>")
ids = ssp.piece_to_id(["<sos>", "<blk>", "H", "L"])
# id to piece (supports both an int and a list of ints)
piece = ssp.id_to_piece(5)
pieces = ssp.id_to_piece([5, 10, 15])
Default models
| Model Name | Description | Link |
|---|---|---|
| alphabet-33 | <blk>, <unk>, <sos>, <eos>, <pad>, ', ▁ and the 26 letters of the alphabet. | alphabet-33 |
| librispeech-500 | 500 unigram pieces trained on Librispeech. | librispeech-500 |
| librispeech-5000 | 5000 unigram pieces trained on Librispeech. | librispeech-5000 |
| gigaspeech-500 | 500 unigram pieces trained on Gigaspeech. | gigaspeech-500 |
| gigaspeech-2000 | 2000 unigram pieces trained on Gigaspeech. | gigaspeech-2000 |
| gigaspeech-5000 | 5000 unigram pieces trained on Gigaspeech. | gigaspeech-5000 |
| zh-en-3876 | 3500 Chinese characters, 256 fallback bytes, 10 numbers, 10 punctuations, 100 English unigram pieces. | zh-en-3876 |
| zh-en-6876 | 6500 Chinese characters, 256 fallback bytes, 10 numbers, 10 punctuations, 100 English unigram pieces. | zh-en-6876 |
| zh-en-8481 | 8105 Chinese characters, 256 fallback bytes, 10 numbers, 10 punctuations, 100 English unigram pieces. | zh-en-8481 |
| zh-en-5776 | 3500 Chinese characters, 256 fallback bytes, 10 numbers, 10 punctuations, 2000 English unigram pieces. | zh-en-5776 |
| zh-en-8776 | 6500 Chinese characters, 256 fallback bytes, 10 numbers, 10 punctuations, 2000 English unigram pieces. | zh-en-8776 |
| zh-en-10381 | 8105 Chinese characters, 256 fallback bytes, 10 numbers, 10 punctuations, 2000 English unigram pieces. | zh-en-10381 |
| zh-en-yue-9761 | 8105 + 1280 Chinese characters (Cantonese included), 256 fallback bytes, 10 numbers, 10 punctuations, 100 English unigram pieces. | zh-en-yue-9761 |
| zh-en-yue-11661 | 8105 + 1280 Chinese characters (Cantonese included), 256 fallback bytes, 10 numbers, 10 punctuations, 2000 English unigram pieces. | zh-en-yue-11661 |
| chn_jpn_yue_eng_ko_spectok.bpe | BPE tokens used in SenseVoice ASR; supports Chinese, Japanese, Cantonese, English and Korean. | chn_jpn_yue_eng_ko_spectok.bpe |
Note: The numbers 3500, 6500 and 8105 come from the 通用规范汉字表 (Table of General Standard Chinese Characters).
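The number at the end of each zh-en model name matches the sum of the components listed in its description. The sketch below checks that arithmetic using counts copied from the table. The helper names are hypothetical, and it is an assumption (not confirmed here) that this number also equals what vocab_size() returns for the model.

```python
# Component counts copied from the table above, in the order:
# (Chinese characters, fallback bytes, numbers, punctuations, English pieces)
ZH_EN_MODELS = {
    "zh-en-3876": (3500, 256, 10, 10, 100),
    "zh-en-6876": (6500, 256, 10, 10, 100),
    "zh-en-8481": (8105, 256, 10, 10, 100),
    "zh-en-5776": (3500, 256, 10, 10, 2000),
    "zh-en-8776": (6500, 256, 10, 10, 2000),
    "zh-en-10381": (8105, 256, 10, 10, 2000),
    "zh-en-yue-9761": (8105 + 1280, 256, 10, 10, 100),
    "zh-en-yue-11661": (8105 + 1280, 256, 10, 10, 2000),
}


def size_from_name(name: str) -> int:
    """Return the numeric suffix of a model name, e.g. 3876 for 'zh-en-3876'."""
    return int(name.rsplit("-", 1)[1])


def component_total(name: str) -> int:
    """Sum the component counts listed for the given model."""
    return sum(ZH_EN_MODELS[name])


if __name__ == "__main__":
    for name in ZH_EN_MODELS:
        print(name, component_total(name), size_from_name(name))
```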
Download files
File details
Details for the file simple_sentencepiece-0.10.tar.gz.
File metadata
- Download URL: simple_sentencepiece-0.10.tar.gz
- Upload date:
- Size: 1.1 MB
- Tags: Source
- Uploaded using Trusted Publishing? No
- Uploaded via: twine/6.0.1 CPython/3.10.12
File hashes
| Algorithm | Hash digest |
|---|---|
| SHA256 | 97f0bc82ea6f03690fb25563f7a090b0192afc7db91829eec52d9b725ac6df20 |
| MD5 | 9e1287b909aeaf936ff17368e094ba43 |
| BLAKE2b-256 | ad35b2222c1b7e05820297bb228f6b9642c796dde4ed57fbe33f16a40feca424 |
File details
Details for the file simple_sentencepiece-0.10-cp313-cp313-manylinux_2_17_x86_64.manylinux2014_x86_64.whl.
File metadata
- Download URL: simple_sentencepiece-0.10-cp313-cp313-manylinux_2_17_x86_64.manylinux2014_x86_64.whl
- Upload date:
- Size: 719.3 kB
- Tags: CPython 3.13, manylinux: glibc 2.17+ x86-64
- Uploaded using Trusted Publishing? No
- Uploaded via: twine/6.0.1 CPython/3.10.12
File hashes
| Algorithm | Hash digest |
|---|---|
| SHA256 | ea97982573aa2fbf901e056277886f5fa908a69265f850fd3ca19972df357804 |
| MD5 | e67ac5585aeee4876de1d29c5d44ea9a |
| BLAKE2b-256 | 27fceb6474118a3819d679912357dba7970b83228dc9b4fcca3b61fe5d90704c |
File details
Details for the file simple_sentencepiece-0.10-cp312-cp312-manylinux_2_17_x86_64.manylinux2014_x86_64.whl.
File metadata
- Download URL: simple_sentencepiece-0.10-cp312-cp312-manylinux_2_17_x86_64.manylinux2014_x86_64.whl
- Upload date:
- Size: 719.3 kB
- Tags: CPython 3.12, manylinux: glibc 2.17+ x86-64
- Uploaded using Trusted Publishing? No
- Uploaded via: twine/6.0.1 CPython/3.10.12
File hashes
| Algorithm | Hash digest |
|---|---|
| SHA256 | 02e015dc4da719fab4e5b31602d402169ba6a20ae3c3d8d38ffabdf2f3197b74 |
| MD5 | fac4ddbe37436ea87af447613cd87567 |
| BLAKE2b-256 | 6c9abc449c9943e55b1e15b88e77181ce29859fa83864faf666417ea3034d096 |
File details
Details for the file simple_sentencepiece-0.10-cp311-cp311-manylinux_2_17_x86_64.manylinux2014_x86_64.whl.
File metadata
- Download URL: simple_sentencepiece-0.10-cp311-cp311-manylinux_2_17_x86_64.manylinux2014_x86_64.whl
- Upload date:
- Size: 719.7 kB
- Tags: CPython 3.11, manylinux: glibc 2.17+ x86-64
- Uploaded using Trusted Publishing? No
- Uploaded via: twine/6.0.1 CPython/3.10.12
File hashes
| Algorithm | Hash digest |
|---|---|
| SHA256 | 0921443829eea3ecdd323e562f36c807513b35904cfefa2ac88a67c2ac44eba6 |
| MD5 | 74dc435267c7ff2b0e7056d69e79e83b |
| BLAKE2b-256 | 88eef00d4aa5ca62a8d04543112ca878f34faaf54b6892d9092f5927f3696738 |
File details
Details for the file simple_sentencepiece-0.10-cp310-cp310-manylinux_2_17_x86_64.manylinux2014_x86_64.whl.
File metadata
- Download URL: simple_sentencepiece-0.10-cp310-cp310-manylinux_2_17_x86_64.manylinux2014_x86_64.whl
- Upload date:
- Size: 718.3 kB
- Tags: CPython 3.10, manylinux: glibc 2.17+ x86-64
- Uploaded using Trusted Publishing? No
- Uploaded via: twine/6.0.1 CPython/3.10.12
File hashes
| Algorithm | Hash digest |
|---|---|
| SHA256 | 50bdf5735f84406ea4be6a12c8daffc935518bdd38deda2e339ad31c49e1a6b7 |
| MD5 | 375edc265bbafa8dc4192e11892113da |
| BLAKE2b-256 | 2f5843d342e370302e8ff15427884c30f330baa50d044c4130ee0e527b333a37 |
File details
Details for the file simple_sentencepiece-0.10-cp39-cp39-manylinux_2_17_x86_64.manylinux2014_x86_64.whl.
File metadata
- Download URL: simple_sentencepiece-0.10-cp39-cp39-manylinux_2_17_x86_64.manylinux2014_x86_64.whl
- Upload date:
- Size: 718.7 kB
- Tags: CPython 3.9, manylinux: glibc 2.17+ x86-64
- Uploaded using Trusted Publishing? No
- Uploaded via: twine/6.0.1 CPython/3.10.12
File hashes
| Algorithm | Hash digest |
|---|---|
| SHA256 | d1a9e1720da4f146285df068fb42737e794b25722abe9e7197d320c9dc276e33 |
| MD5 | 4410016add710f3d82d8736ba195fd12 |
| BLAKE2b-256 | 2295e41dcfe0572c8ea9b1759f74d76e66b4d9b87417901f6742b6c32168bda1 |
File details
Details for the file simple_sentencepiece-0.10-cp38-cp38-manylinux_2_17_x86_64.manylinux2014_x86_64.whl.
File metadata
- Download URL: simple_sentencepiece-0.10-cp38-cp38-manylinux_2_17_x86_64.manylinux2014_x86_64.whl
- Upload date:
- Size: 718.2 kB
- Tags: CPython 3.8, manylinux: glibc 2.17+ x86-64
- Uploaded using Trusted Publishing? No
- Uploaded via: twine/6.0.1 CPython/3.10.12
File hashes
| Algorithm | Hash digest |
|---|---|
| SHA256 | 76296f9836a80ddf525f9fbcdf6d6439c6a08e56d0587188eeec0e17c52b1d09 |
| MD5 | 3f928a7c19dc3a6655231f74be0e6b98 |
| BLAKE2b-256 | abf3d12c1005fe7b229ac6bb197afada3fbeaf634491455dfdbaefef6c00f6f0 |
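The SHA256 digests listed for each file can be checked against a local download. The following is a minimal sketch using only Python's standard hashlib module; the function names `sha256_of` and `matches` are hypothetical helpers, not something shipped with this package.

```python
import hashlib
from pathlib import Path


def sha256_of(path, chunk_size: int = 1 << 20) -> str:
    """Return the hex SHA256 digest of a file, read in chunks to limit memory use."""
    digest = hashlib.sha256()
    with Path(path).open("rb") as fh:
        for chunk in iter(lambda: fh.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def matches(path, expected_hex: str) -> bool:
    """Compare a file's SHA256 digest with an expected hex string (case-insensitive)."""
    return sha256_of(path) == expected_hex.strip().lower()
```

For example, assuming the source distribution has been downloaded into the current directory, `matches("simple_sentencepiece-0.10.tar.gz", "97f0bc82ea6f03690fb25563f7a090b0192afc7db91829eec52d9b725ac6df20")` should return True for an intact file.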