A simple sentencepiece encoder and decoder with no dependencies.

simple-sentencepiece

A simple sentencepiece encoder and decoder.

Note: This is not a new sentencepiece toolkit. It takes a model trained with Google's sentencepiece as input and either encodes strings to ids/pieces or decodes ids back to strings. Its advantage is that it has no dependencies (no protobuf), which makes it easier to integrate into a C++ project.
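
To illustrate why a vocabulary file can be read without protobuf, here is a minimal sketch of a plain-text loader. It assumes the vocab file uses the tab-separated "piece, score" layout that Google's sentencepiece trainer writes alongside the model. The helper name load_vocab is hypothetical and is not part of this package; the package's own loader may differ.

```python
import io


def load_vocab(lines):
    """Parse 'piece<TAB>score' lines; the id of a piece is its line index.

    Hypothetical helper for illustration, not this package's API.
    """
    pieces, scores, piece_to_id = [], [], {}
    for line in lines:
        line = line.rstrip("\n")
        if not line:
            continue  # skip blank lines
        piece, _, score = line.partition("\t")
        piece_to_id[piece] = len(pieces)
        pieces.append(piece)
        scores.append(float(score) if score else 0.0)
    return pieces, scores, piece_to_id


# A tiny in-memory stand-in for a real bpe.vocab file.
sample = io.StringIO("<unk>\t0\n▁HELLO\t-1.5\n▁WORLD\t-2.25\n")
pieces, scores, piece_to_id = load_vocab(sample)
print(piece_to_id["▁WORLD"])  # 2
```

Because the format is plain text, the same parsing is straightforward to reproduce in C++ with the standard library alone.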

Installation

pip install simple-sentencepiece

Usage

Usage is very similar to sentencepiece: the package provides the same encode and decode interface.

from ssentencepiece import Ssentencepiece

# you can get bpe.vocab from a trained bpe model, see google's sentencepiece for details
ssp = Ssentencepiece("/path/to/bpe.vocab")

# you can also use the default models provided by this package, see below for details
ssp = Ssentencepiece("gigaspeech-500")
ssp = Ssentencepiece("zh-en-10381")

# output ids (accepts a str or a list of strs)
# a list of strs is encoded in parallel
ids = ssp.encode("HELLO WORLD")
ids = ssp.encode(["HELLO WORLD", "LOVE AND PIECE"])

# output string pieces (accepts a str or a list of strs)
# a list of strs is encoded in parallel
pieces = ssp.encode("HELLO WORLD", out_type=str)
pieces = ssp.encode(["HELLO WORLD", "LOVE AND PIECE"], out_type=str)

# decode (accepts a list of ids or a list of lists of ids)
# a list of lists of ids is decoded in parallel
res = ssp.decode([1,2,3,4,5])
res = ssp.decode([[1,2,3,4], [4,5,6,7]])

# get vocab size
res = ssp.vocab_size()

# piece to id (accepts a str or a list of strs)
piece_id = ssp.piece_to_id("<sos>")
ids = ssp.piece_to_id(["<sos>", "<blk>", "H", "L"])

# id to piece (accepts an int or a list of ints)
piece = ssp.id_to_piece(5)
pieces = ssp.id_to_piece([5, 10, 15])

Default models

| Model Name | Description | Link |
| --- | --- | --- |
| alphabet-33 | <blk>, <unk>, <sos>, <eos>, <pad>, ', and the 26 letters of the alphabet. | alphabet-33 |
| librispeech-500 | 500 unigram pieces trained on LibriSpeech. | librispeech-500 |
| librispeech-5000 | 5000 unigram pieces trained on LibriSpeech. | librispeech-5000 |
| gigaspeech-500 | 500 unigram pieces trained on GigaSpeech. | gigaspeech-500 |
| gigaspeech-2000 | 2000 unigram pieces trained on GigaSpeech. | gigaspeech-2000 |
| gigaspeech-5000 | 5000 unigram pieces trained on GigaSpeech. | gigaspeech-5000 |
| zh-en-3876 | 3500 Chinese characters, 256 fallback bytes, 10 digits, 10 punctuation marks, 100 English unigram pieces. | zh-en-3876 |
| zh-en-6876 | 6500 Chinese characters, 256 fallback bytes, 10 digits, 10 punctuation marks, 100 English unigram pieces. | zh-en-6876 |
| zh-en-8481 | 8105 Chinese characters, 256 fallback bytes, 10 digits, 10 punctuation marks, 100 English unigram pieces. | zh-en-8481 |
| zh-en-5776 | 3500 Chinese characters, 256 fallback bytes, 10 digits, 10 punctuation marks, 2000 English unigram pieces. | zh-en-5776 |
| zh-en-8776 | 6500 Chinese characters, 256 fallback bytes, 10 digits, 10 punctuation marks, 2000 English unigram pieces. | zh-en-8776 |
| zh-en-10381 | 8105 Chinese characters, 256 fallback bytes, 10 digits, 10 punctuation marks, 2000 English unigram pieces. | zh-en-10381 |
| zh-en-yue-9761 | 8105 + 1280 Chinese characters (Cantonese included), 256 fallback bytes, 10 digits, 10 punctuation marks, 100 English unigram pieces. | zh-en-yue-9761 |
| zh-en-yue-11661 | 8105 + 1280 Chinese characters (Cantonese included), 256 fallback bytes, 10 digits, 10 punctuation marks, 2000 English unigram pieces. | zh-en-yue-11661 |
| chn_jpn_yue_eng_ko_spectok.bpe | BPE tokens used in SenseVoice ASR; supports Chinese, Japanese, Cantonese, English, and Korean. | chn_jpn_yue_eng_ko_spectok.bpe |

Note: The character counts 3500, 6500, and 8105 come from 通用规范汉字表 (the Table of General Standard Chinese Characters).
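
The numeric suffix of each zh-en model name appears to be the total vocabulary size, that is, the sum of the component counts in its description. The following sketch checks that reading using only the numbers listed for the default models; the dictionary layout is an illustrative choice, not something the package exposes.

```python
# (Chinese characters, fallback bytes, digits, punctuation marks, English pieces),
# copied from the model descriptions.
components = {
    "zh-en-3876": (3500, 256, 10, 10, 100),
    "zh-en-6876": (6500, 256, 10, 10, 100),
    "zh-en-8481": (8105, 256, 10, 10, 100),
    "zh-en-5776": (3500, 256, 10, 10, 2000),
    "zh-en-8776": (6500, 256, 10, 10, 2000),
    "zh-en-10381": (8105, 256, 10, 10, 2000),
    "zh-en-yue-9761": (8105 + 1280, 256, 10, 10, 100),
    "zh-en-yue-11661": (8105 + 1280, 256, 10, 10, 2000),
}

# Collect any model whose component sum differs from the suffix in its name.
mismatched = [
    name
    for name, parts in components.items()
    if sum(parts) != int(name.rsplit("-", 1)[1])
]
print(mismatched)  # []
```

For example, zh-en-3876 is 3500 + 256 + 10 + 10 + 100 = 3876.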
