A simple sentencepiece encoder and decoder without any dependency.

simple-sentencepiece

A simple sentencepiece encoder and decoder.

Note: This is not a new sentencepiece toolkit. It takes a model trained with Google's sentencepiece as input and encodes strings to ids or pieces, or decodes ids back to strings. Its advantage is that it has no dependencies (no protobuf), which makes it easier to integrate into a C++ project.
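
For orientation, the vocabulary file that Google's sentencepiece trainer writes next to the `.model` file is plain text, with one piece and its score per line, separated by a tab. The sketch below reads such a file with the standard library only. The helper name `read_vocab` and the demo contents are made up for illustration, and it is an assumption that the file handed to this package follows the same two-column layout.

```python
import tempfile
from pathlib import Path
from typing import List, Tuple


def read_vocab(path: str) -> List[Tuple[str, float]]:
    """Parse a sentencepiece-style .vocab file into (piece, score) pairs.

    Assumes one "piece<TAB>score" entry per line (hypothetical helper,
    not part of simple-sentencepiece).
    """
    entries = []
    with open(path, encoding="utf-8") as f:
        for line in f:
            line = line.rstrip("\n")
            if not line:
                continue  # skip blank lines
            # Split on the last tab so the piece itself is kept intact.
            piece, score = line.rsplit("\t", 1)
            entries.append((piece, float(score)))
    return entries


# Demo with a tiny, invented vocabulary written to a temporary file.
demo = "<unk>\t0\n\u2581HE\t-3.2\nLLO\t-4.5\n"
with tempfile.TemporaryDirectory() as tmp:
    vocab_path = Path(tmp) / "demo.vocab"
    vocab_path.write_text(demo, encoding="utf-8")
    entries = read_vocab(str(vocab_path))

print(len(entries), entries[1])
```

Inspecting the file this way can help confirm that the special symbols and the expected number of pieces are present before loading it.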

Installation

pip install simple-sentencepiece

Usage

Usage is very similar to sentencepiece: it also provides encode and decode methods.

from ssentencepiece import Ssentencepiece

# Load a vocabulary exported from a trained BPE model (see Google's sentencepiece for details)
ssp = Ssentencepiece("/path/to/bpe.vocab")

# Alternatively, load one of the default models shipped with this package (see below)
ssp = Ssentencepiece("gigaspeech-500")
ssp = Ssentencepiece("zh-en-10381")

# Encode to ids (accepts a str or a list of strs)
# A list of strs is encoded in parallel
ids = ssp.encode("HELLO WORLD")
ids = ssp.encode(["HELLO WORLD", "LOVE AND PIECE"])

# Encode to string pieces
# A list of strs is encoded in parallel
pieces = ssp.encode("HELLO WORLD", out_type=str)
pieces = ssp.encode(["HELLO WORLD", "LOVE AND PIECE"], out_type=str)

# Decode (accepts a list of ids or a list of lists of ids)
# A list of lists of ids is decoded in parallel
res = ssp.decode([1,2,3,4,5])
res = ssp.decode([[1,2,3,4], [4,5,6,7]])

# Get the vocabulary size
res = ssp.vocab_size()

# Piece to id (accepts a str or a list of strs)
piece_id = ssp.piece_to_id("<sos>")
ids = ssp.piece_to_id(["<sos>", "<blk>", "H", "L"])

# Id to piece (accepts an int or a list of ints)
piece = ssp.id_to_piece(5)
pieces = ssp.id_to_piece([5, 10, 15])
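
A common sanity check after loading a model is to encode a batch and decode it again. The sketch below relies only on the batch forms of `encode` and `decode` shown above. The function `find_roundtrip_mismatches` and the stand-in class `CodePointTokenizer` are hypothetical names introduced here so the example runs without the package installed; whether a given model round-trips a particular string exactly is not something this page states.

```python
from typing import List, Sequence, Tuple


def find_roundtrip_mismatches(tokenizer, texts: Sequence[str]) -> List[Tuple[str, str]]:
    """Return (original, decoded) pairs where decode(encode(text)) != text.

    `tokenizer` is any object with batch encode/decode methods; an
    Ssentencepiece instance is assumed to fit this shape.
    """
    batch = list(texts)
    ids = tokenizer.encode(batch)      # list of strs -> list of id lists
    decoded = tokenizer.decode(ids)    # list of id lists -> list of strs
    return [(src, out) for src, out in zip(batch, decoded) if src != out]


class CodePointTokenizer:
    """Hypothetical stand-in that maps each character to its code point."""

    def encode(self, texts: List[str]) -> List[List[int]]:
        return [[ord(ch) for ch in text] for text in texts]

    def decode(self, ids: List[List[int]]) -> List[str]:
        return ["".join(chr(i) for i in seq) for seq in ids]


# With the package installed, the stand-in could be swapped for, e.g.:
#   from ssentencepiece import Ssentencepiece
#   tokenizer = Ssentencepiece("gigaspeech-500")
tokenizer = CodePointTokenizer()
print(find_roundtrip_mismatches(tokenizer, ["HELLO WORLD", "LOVE AND PIECE"]))
```

An empty result means every string survived the round trip; any returned pair shows where the decoded text differs from the input.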

Default models

| Model Name | Description |
| --- | --- |
| alphabet-33 | `<blk>`, `<unk>`, `<sos>`, `<eos>`, `<pad>`, `'`, and the 26 letters of the alphabet. |
| librispeech-500 | 500 unigram pieces trained on LibriSpeech. |
| librispeech-5000 | 5000 unigram pieces trained on LibriSpeech. |
| gigaspeech-500 | 500 unigram pieces trained on GigaSpeech. |
| gigaspeech-2000 | 2000 unigram pieces trained on GigaSpeech. |
| gigaspeech-5000 | 5000 unigram pieces trained on GigaSpeech. |
| zh-en-3876 | 3500 Chinese characters, 256 fallback bytes, 10 numbers, 10 punctuation marks, 100 English unigram pieces. |
| zh-en-6876 | 6500 Chinese characters, 256 fallback bytes, 10 numbers, 10 punctuation marks, 100 English unigram pieces. |
| zh-en-8481 | 8105 Chinese characters, 256 fallback bytes, 10 numbers, 10 punctuation marks, 100 English unigram pieces. |
| zh-en-5776 | 3500 Chinese characters, 256 fallback bytes, 10 numbers, 10 punctuation marks, 2000 English unigram pieces. |
| zh-en-8776 | 6500 Chinese characters, 256 fallback bytes, 10 numbers, 10 punctuation marks, 2000 English unigram pieces. |
| zh-en-10381 | 8105 Chinese characters, 256 fallback bytes, 10 numbers, 10 punctuation marks, 2000 English unigram pieces. |
| zh-en-yue-9761 | 8105 + 1280 Chinese characters (Cantonese included), 256 fallback bytes, 10 numbers, 10 punctuation marks, 100 English unigram pieces. |
| zh-en-yue-11661 | 8105 + 1280 Chinese characters (Cantonese included), 256 fallback bytes, 10 numbers, 10 punctuation marks, 2000 English unigram pieces. |
| chn_jpn_yue_eng_ko_spectok.bpe | BPE tokens used in SenseVoice ASR; supports Chinese, Japanese, Cantonese, English, and Korean. |

Note: The counts 3500, 6500, and 8105 come from 通用规范汉字表 (the Table of General Standard Chinese Characters).
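
The numeric suffix of each zh-en model name matches the sum of the components listed in its description. The short check below reproduces that arithmetic from the figures in the table; the dictionary name `COMPONENTS` is just an illustrative choice.

```python
# Component counts copied from the model descriptions:
# [Chinese characters (+ Cantonese extras), fallback bytes, numbers,
#  punctuation marks, English unigram pieces]
COMPONENTS = {
    "zh-en-3876": [3500, 256, 10, 10, 100],
    "zh-en-6876": [6500, 256, 10, 10, 100],
    "zh-en-8481": [8105, 256, 10, 10, 100],
    "zh-en-5776": [3500, 256, 10, 10, 2000],
    "zh-en-8776": [6500, 256, 10, 10, 2000],
    "zh-en-10381": [8105, 256, 10, 10, 2000],
    "zh-en-yue-9761": [8105 + 1280, 256, 10, 10, 100],
    "zh-en-yue-11661": [8105 + 1280, 256, 10, 10, 2000],
}

for name, parts in COMPONENTS.items():
    # The expected total is the number after the last hyphen in the name.
    expected = int(name.rsplit("-", 1)[1])
    total = sum(parts)
    status = "ok" if total == expected else "mismatch"
    print(f"{name}: {total} ({status})")
```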
