A simple sentencepiece encoder and decoder without any dependencies.

simple-sentencepiece

A simple sentencepiece encoder and decoder.

Note: This is not a new sentencepiece toolkit. It takes a model trained with Google's sentencepiece as input and encodes strings to ids or pieces, or decodes ids back to strings. Its advantage is that it has no dependencies (no protobuf), which makes it easier to integrate into a C++ project.
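
For context, the .vocab file that Google's sentencepiece writes next to a trained model is plain text: one piece and its score per line, separated by a tab. That is why it can be read without protobuf. The sketch below parses that format using only the standard library. `load_vocab` is a hypothetical helper written for illustration and is not part of this package's API, and the toy vocabulary contents are made up.

```python
import io
from typing import Dict, List, TextIO, Tuple


def load_vocab(stream: TextIO) -> Tuple[List[str], Dict[str, int], List[float]]:
    """Parse sentencepiece-style vocab lines of the form "piece<TAB>score".

    Assumption: the zero-based line index is used as the piece id.
    """
    pieces: List[str] = []
    scores: List[float] = []
    for line in stream:
        line = line.rstrip("\n")
        if not line:
            continue  # skip blank lines
        # Split on the last tab so the piece text itself is kept intact.
        piece, _, score = line.rpartition("\t")
        pieces.append(piece)
        scores.append(float(score))
    piece_to_id = {p: i for i, p in enumerate(pieces)}
    return pieces, piece_to_id, scores


# Toy vocab contents, invented for this example.
toy_vocab = io.StringIO("<unk>\t0\n\u2581HELLO\t-1.5\n\u2581WORLD\t-2.25\n")
pieces, piece_to_id, scores = load_vocab(toy_vocab)
print(pieces)
print(piece_to_id["\u2581WORLD"])
```

To read a real file, an `open(path, encoding="utf-8")` handle can be passed in place of the in-memory stream.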

Installation

pip install simple-sentencepiece

Usage

Usage is very similar to sentencepiece: it also provides encode and decode methods.

from ssentencepiece import Ssentencepiece

# you can get bpe.vocab from a trained bpe model, see google's sentencepiece for details
ssp = Ssentencepiece("/path/to/bpe.vocab")

# you can also use the default models provided by this package, see below for details
ssp = Ssentencepiece("gigaspeech-500")
ssp = Ssentencepiece("zh-en-10381")

# encode to ids (accepts a str or a list of strs)
# a list of strs is encoded in parallel
ids = ssp.encode("HELLO WORLD")
ids = ssp.encode(["HELLO WORLD", "LOVE AND PIECE"])

# encode to string pieces
# a list of strs is encoded in parallel
pieces = ssp.encode("HELLO WORLD", out_type=str)
pieces = ssp.encode(["HELLO WORLD", "LOVE AND PIECE"], out_type=str)

# decode (accepts a list of ids or a list of lists of ids)
# a list of lists of ids is decoded in parallel
res = ssp.decode([1,2,3,4,5])
res = ssp.decode([[1,2,3,4], [4,5,6,7]])

# get vocab size
res = ssp.vocab_size()

# piece to id (accepts a str or a list of strs)
piece_id = ssp.piece_to_id("<sos>")
ids = ssp.piece_to_id(["<sos>", "<blk>", "H", "L"])

# id to piece (accepts an int or a list of ints)
piece = ssp.id_to_piece(5)
pieces = ssp.id_to_piece([5, 10, 15])
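
To make the relationship between a string, its pieces, and the decoded string concrete, here is a minimal standalone sketch of score-based segmentation: dynamic programming picks the split whose pieces have the highest total score. It is an illustration under stated assumptions, not this package's implementation. `segment`, `join_pieces`, and the toy scores are all hypothetical, and the only convention borrowed from sentencepiece is the "▁" marker that stands for a space.

```python
from typing import Dict, List

SPACE = "\u2581"  # "▁", the sentencepiece whitespace marker


def segment(text: str, scores: Dict[str, float]) -> List[str]:
    """Return the highest-scoring split of `text` into known pieces."""
    norm = SPACE + text.replace(" ", SPACE)
    n = len(norm)
    best = [float("-inf")] * (n + 1)  # best[i]: best total score for norm[:i]
    back = [0] * (n + 1)              # back[i]: start index of the last piece
    best[0] = 0.0
    for end in range(1, n + 1):
        for start in range(end):
            piece = norm[start:end]
            if piece in scores and best[start] + scores[piece] > best[end]:
                best[end] = best[start] + scores[piece]
                back[end] = start
    if best[n] == float("-inf"):
        raise ValueError("text cannot be covered by the toy vocabulary")
    # Walk the back-pointers from the end to recover the pieces.
    out: List[str] = []
    end = n
    while end > 0:
        out.append(norm[back[end]:end])
        end = back[end]
    return out[::-1]


def join_pieces(pieces: List[str]) -> str:
    """Undo the whitespace marker to get the original string back."""
    return "".join(pieces).replace(SPACE, " ").lstrip(" ")


# Toy scores (log-probability-like values), invented for this example.
TOY_SCORES = {
    SPACE + "HELLO": -2.0, SPACE + "HE": -1.5, "LLO": -1.5,
    SPACE + "WORLD": -2.0, SPACE + "WOR": -1.2, "LD": -1.2,
}

toy_pieces = segment("HELLO WORLD", TOY_SCORES)
print(toy_pieces)
print(join_pieces(toy_pieces))
```

With these scores the whole-word pieces win over the shorter fragments, and joining the pieces restores the input string.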

Default models

| Model Name | Description | Link |
| --- | --- | --- |
| alphabet-33 | <blk>, <unk>, <sos>, <eos>, <pad>, ', and the 26 letters of the alphabet. | alphabet-33 |
| librispeech-500 | 500 unigram pieces trained on Librispeech. | librispeech-500 |
| librispeech-5000 | 5000 unigram pieces trained on Librispeech. | librispeech-5000 |
| gigaspeech-500 | 500 unigram pieces trained on Gigaspeech. | gigaspeech-500 |
| gigaspeech-2000 | 2000 unigram pieces trained on Gigaspeech. | gigaspeech-2000 |
| gigaspeech-5000 | 5000 unigram pieces trained on Gigaspeech. | gigaspeech-5000 |
| zh-en-3876 | 3500 Chinese characters, 256 fallback bytes, 10 numbers, 10 punctuation marks, 100 English unigram pieces. | zh-en-3876 |
| zh-en-6876 | 6500 Chinese characters, 256 fallback bytes, 10 numbers, 10 punctuation marks, 100 English unigram pieces. | zh-en-6876 |
| zh-en-8481 | 8105 Chinese characters, 256 fallback bytes, 10 numbers, 10 punctuation marks, 100 English unigram pieces. | zh-en-8481 |
| zh-en-5776 | 3500 Chinese characters, 256 fallback bytes, 10 numbers, 10 punctuation marks, 2000 English unigram pieces. | zh-en-5776 |
| zh-en-8776 | 6500 Chinese characters, 256 fallback bytes, 10 numbers, 10 punctuation marks, 2000 English unigram pieces. | zh-en-8776 |
| zh-en-10381 | 8105 Chinese characters, 256 fallback bytes, 10 numbers, 10 punctuation marks, 2000 English unigram pieces. | zh-en-10381 |
| zh-en-yue-9761 | 8105 + 1280 Chinese characters (Cantonese included), 256 fallback bytes, 10 numbers, 10 punctuation marks, 100 English unigram pieces. | zh-en-yue-9761 |
| zh-en-yue-11661 | 8105 + 1280 Chinese characters (Cantonese included), 256 fallback bytes, 10 numbers, 10 punctuation marks, 2000 English unigram pieces. | zh-en-yue-11661 |
| chn_jpn_yue_eng_ko_spectok.bpe | BPE tokens used in SenseVoice ASR; supports Chinese, Japanese, Cantonese, English, and Korean. | chn_jpn_yue_eng_ko_spectok.bpe |

Note: The character counts 3500, 6500, and 8105 come from the 通用规范汉字表 (Table of General Standard Chinese Characters).
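
The numeric suffix in each zh-en model name matches the sum of the component counts listed in its description. The short check below works through that arithmetic. The counts are copied from the table, and `ZH_EN_MODELS` is just a name chosen for this example.

```python
# Component counts per model, in the order given in the table:
# Chinese characters (plus extra Cantonese-inclusive characters where listed),
# fallback bytes, numbers, punctuation marks, English unigram pieces.
ZH_EN_MODELS = {
    "zh-en-3876": [3500, 256, 10, 10, 100],
    "zh-en-6876": [6500, 256, 10, 10, 100],
    "zh-en-8481": [8105, 256, 10, 10, 100],
    "zh-en-5776": [3500, 256, 10, 10, 2000],
    "zh-en-8776": [6500, 256, 10, 10, 2000],
    "zh-en-10381": [8105, 256, 10, 10, 2000],
    "zh-en-yue-9761": [8105, 1280, 256, 10, 10, 100],
    "zh-en-yue-11661": [8105, 1280, 256, 10, 10, 2000],
}

for name, parts in ZH_EN_MODELS.items():
    suffix = int(name.rsplit("-", 1)[1])  # e.g. "zh-en-3876" -> 3876
    total = sum(parts)
    print(f"{name}: {' + '.join(map(str, parts))} = {total}")
    assert total == suffix
```

For example, zh-en-3876 is 3500 + 256 + 10 + 10 + 100 = 3876.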

C++ Integration

The C++ core has no protobuf dependency; only a C++14 compiler and pthread are required. This makes it easy to embed simple-sentencepiece in an existing C++ system without conflicting with other protobuf versions.

A complete integration demo with CMake build scripts is provided in example/.
