A simple sentencepiece encoder and decoder without any dependency.

simple-sentencepiece

A simple sentencepiece encoder and decoder.

Note: This is not a new sentencepiece toolkit. It takes a model trained with Google's sentencepiece as input and encodes strings to ids or pieces, or decodes ids back to strings. Its advantage is that it has no dependencies (no protobuf), which makes it easier to integrate into a C++ project.

Installation

pip install simple-sentencepiece

Usage

Usage is very similar to sentencepiece; the package also provides encode and decode interfaces.

from ssentencepiece import Ssentencepiece

# bpe.vocab comes from a trained BPE model; see Google's sentencepiece for details
ssp = Ssentencepiece("/path/to/bpe.vocab")

# you can also use the default models provided by this package, see below for details
ssp = Ssentencepiece("gigaspeech-500")
ssp = Ssentencepiece("zh-en-10381")

# output ids (accepts a str or a list of strs)
# a list of strs is encoded in parallel
ids = ssp.encode("HELLO WORLD")
ids = ssp.encode(["HELLO WORLD", "LOVE AND PIECE"])

# output string pieces (accepts a str or a list of strs)
# a list of strs is encoded in parallel
pieces = ssp.encode("HELLO WORLD", out_type=str)
pieces = ssp.encode(["HELLO WORLD", "LOVE AND PIECE"], out_type=str)

# decode (accepts a list of ids or a list of lists of ids)
# a list of lists of ids is decoded in parallel
res = ssp.decode([1,2,3,4,5])
res = ssp.decode([[1,2,3,4], [4,5,6,7]])

# get vocab size
res = ssp.vocab_size()

# piece to id (accepts a str or a list of strs)
piece_id = ssp.piece_to_id("<sos>")
ids = ssp.piece_to_id(["<sos>", "<blk>", "H", "L"])

# id to piece (accepts an int or a list of ints)
piece = ssp.id_to_piece(5)
pieces = ssp.id_to_piece([5, 10, 15])
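
To make the piece and id mapping more concrete, the following is a minimal, self-contained sketch of a vocabulary lookup and a decode step. It is not this package's implementation. The `ToyVocab` class and the sample vocabulary are hypothetical, and the sketch assumes the layout written by Google's sentencepiece trainer: one `piece<TAB>score` entry per line, with the line index used as the id.

```python
from typing import Dict, List

# Hypothetical vocab text, one "piece<TAB>score" entry per line (assumed layout).
# "\u2581" (▁) is the marker sentencepiece uses for a preceding space.
SAMPLE_VOCAB = "\n".join([
    "<blk>\t0",
    "<unk>\t0",
    "\u2581HELLO\t-1.5",
    "\u2581WOR\t-2.0",
    "LD\t-2.5",
])


class ToyVocab:
    """Illustrative piece <-> id table; not the ssentencepiece API."""

    def __init__(self, vocab_text: str) -> None:
        self.pieces: List[str] = []
        for line in vocab_text.splitlines():
            if not line.strip():
                continue  # skip blank lines
            # Keep the piece, ignore the score column.
            self.pieces.append(line.split("\t")[0])
        # The id of a piece is its line index in the vocab.
        self.ids: Dict[str, int] = {p: i for i, p in enumerate(self.pieces)}

    def vocab_size(self) -> int:
        return len(self.pieces)

    def piece_to_id(self, piece: str) -> int:
        # Unknown pieces fall back to the id of "<unk>".
        return self.ids.get(piece, self.ids["<unk>"])

    def id_to_piece(self, idx: int) -> str:
        return self.pieces[idx]

    def decode(self, ids: List[int]) -> str:
        # Concatenate the pieces, then turn the space marker back into spaces.
        text = "".join(self.id_to_piece(i) for i in ids)
        return text.replace("\u2581", " ").strip()


toy = ToyVocab(SAMPLE_VOCAB)
ids = [toy.piece_to_id(p) for p in ["\u2581HELLO", "\u2581WOR", "LD"]]
print(ids)              # [2, 3, 4]
print(toy.decode(ids))  # HELLO WORLD
```

The sketch covers only lookup and decoding. Splitting a raw string into pieces (the encode direction) depends on the trained model and is left to the package itself.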

Default models

| Model Name | Description | Link |
|---|---|---|
| alphabet-33 | `<blk>`, `<unk>`, `<sos>`, `<eos>`, `<pad>`, `'`, and the 26 letters of the alphabet. | alphabet-33 |
| librispeech-500 | 500 unigram pieces trained on Librispeech. | librispeech-500 |
| librispeech-5000 | 5000 unigram pieces trained on Librispeech. | librispeech-5000 |
| gigaspeech-500 | 500 unigram pieces trained on Gigaspeech. | gigaspeech-500 |
| gigaspeech-2000 | 2000 unigram pieces trained on Gigaspeech. | gigaspeech-2000 |
| gigaspeech-5000 | 5000 unigram pieces trained on Gigaspeech. | gigaspeech-5000 |
| zh-en-3876 | 3500 Chinese characters, 256 fallback bytes, 10 numbers, 10 punctuation marks, 100 English unigram pieces. | zh-en-3876 |
| zh-en-6876 | 6500 Chinese characters, 256 fallback bytes, 10 numbers, 10 punctuation marks, 100 English unigram pieces. | zh-en-6876 |
| zh-en-8481 | 8105 Chinese characters, 256 fallback bytes, 10 numbers, 10 punctuation marks, 100 English unigram pieces. | zh-en-8481 |
| zh-en-5776 | 3500 Chinese characters, 256 fallback bytes, 10 numbers, 10 punctuation marks, 2000 English unigram pieces. | zh-en-5776 |
| zh-en-8776 | 6500 Chinese characters, 256 fallback bytes, 10 numbers, 10 punctuation marks, 2000 English unigram pieces. | zh-en-8776 |
| zh-en-10381 | 8105 Chinese characters, 256 fallback bytes, 10 numbers, 10 punctuation marks, 2000 English unigram pieces. | zh-en-10381 |
| zh-en-yue-9761 | 8105 + 1280 Chinese characters (Cantonese included), 256 fallback bytes, 10 numbers, 10 punctuation marks, 100 English unigram pieces. | zh-en-yue-9761 |
| zh-en-yue-11661 | 8105 + 1280 Chinese characters (Cantonese included), 256 fallback bytes, 10 numbers, 10 punctuation marks, 2000 English unigram pieces. | zh-en-yue-11661 |
| chn_jpn_yue_eng_ko_spectok.bpe | BPE tokens used in SenseVoice ASR; supports Chinese, Japanese, Cantonese, English, and Korean. | chn_jpn_yue_eng_ko_spectok.bpe |

Note: The counts 3500, 6500, and 8105 come from the 通用规范汉字表 (Table of General Standard Chinese Characters).
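
The numeric suffix in each zh-en model name matches the sum of the component counts listed in its description. The short check below works through that arithmetic. The counts are copied from the descriptions above, while the dictionary and helper names are only illustrative. Whether the suffix also equals what `vocab_size()` returns for these models is an assumption not verified here.

```python
# Component counts copied from the "Default models" descriptions:
# (Chinese characters, fallback bytes, numbers, punctuation marks, English pieces)
ZH_EN_MODELS = {
    "zh-en-3876": (3500, 256, 10, 10, 100),
    "zh-en-6876": (6500, 256, 10, 10, 100),
    "zh-en-8481": (8105, 256, 10, 10, 100),
    "zh-en-5776": (3500, 256, 10, 10, 2000),
    "zh-en-8776": (6500, 256, 10, 10, 2000),
    "zh-en-10381": (8105, 256, 10, 10, 2000),
    "zh-en-yue-9761": (8105 + 1280, 256, 10, 10, 100),
    "zh-en-yue-11661": (8105 + 1280, 256, 10, 10, 2000),
}


def name_suffix(name: str) -> int:
    """Return the trailing number of a model name, e.g. 3876 for 'zh-en-3876'."""
    return int(name.rsplit("-", 1)[1])


for name, parts in ZH_EN_MODELS.items():
    total = sum(parts)
    # e.g. "zh-en-3876: 3500 + 256 + 10 + 10 + 100 = 3876"
    print(f"{name}: {' + '.join(str(p) for p in parts)} = {total}")
    assert total == name_suffix(name)
```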

C++ Integration

The C++ core has no protobuf dependency; it requires only a C++14 compiler and pthread. This makes it easy to embed simple-sentencepiece in an existing C++ system without conflicting with any protobuf version that system already uses.

A complete integration demo with CMake build scripts is provided in example/.
