SentencePiece python wrapper
Python wrapper for SentencePiece, built with SWIG. This module wraps the sentencepiece::SentencePieceProcessor class with the following modifications:
- The Encode and Decode methods are redefined as EncodeAsIds and EncodeAsPieces, and as DecodeIds and DecodePieces, respectively.
- Model training is supported through the SentencePieceTrainer.Train method.
- The SentencePieceText proto is not supported.
- Added __len__ and __getitem__ methods. len(obj) and obj[key] return the vocab size and the vocab id of a piece, respectively.
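The last item can be illustrated without a trained model. The sketch below is a toy stand-in, not the real wrapper: the class name ToyVocab and its piece list are hypothetical and only mirror the len(obj) and obj[key] behaviour described above.

```python
class ToyVocab:
    """Toy stand-in (NOT the real wrapper) that mirrors len(obj) and obj[key]."""

    def __init__(self, pieces):
        # Position in the list plays the role of the vocab id.
        self._pieces = list(pieces)
        self._ids = {piece: i for i, piece in enumerate(self._pieces)}

    def __len__(self):
        # len(obj) -> vocab size
        return len(self._pieces)

    def __getitem__(self, piece):
        # obj[piece] -> vocab id of that piece
        return self._ids[piece]


# Hypothetical four-piece vocabulary, for illustration only.
vocab = ToyVocab(["<unk>", "<s>", "</s>", "\u2581This"])
print(len(vocab))
print(vocab["</s>"])
```

With a real SentencePieceProcessor the same two expressions are len(sp) and sp['</s>'], as in the Segmentation session.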
Build and Install SentencePiece
On Linux (x64/i686), macOS, and Windows (win32/x64), you can install the SentencePiece Python module with pip:
% pip install sentencepiece
To build and install the Python wrapper from source, first install the SentencePiece C++ library, then run the following commands:
% python setup.py build
% sudo python setup.py install
If you don't have write permission to the global site-packages directory, or don't want to install into it, run:
% python setup.py install --user
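After any of the install routes above, a quick way to confirm that the module is visible to the interpreter is to ask the import system for it. This is a minimal sketch using only the standard library; the helper name is_installed is hypothetical.

```python
import importlib.util


def is_installed(module_name="sentencepiece"):
    """Return True if the named top-level module can be found by the import system."""
    return importlib.util.find_spec(module_name) is not None


# Prints True once the wrapper has been installed for this interpreter.
print(is_installed())
```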
Usage
See the Google Colab page to run SentencePiece interactively.
Segmentation
% python
>>> import sentencepiece as spm
>>> sp = spm.SentencePieceProcessor()
>>> sp.Load("test/test_model.model")
True
>>> sp.EncodeAsPieces("This is a test")
['\xe2\x96\x81This', '\xe2\x96\x81is', '\xe2\x96\x81a', '\xe2\x96\x81', 't', 'est']
>>> sp.EncodeAsIds("This is a test")
[284, 47, 11, 4, 15, 400]
>>> sp.DecodePieces(['\xe2\x96\x81This', '\xe2\x96\x81is', '\xe2\x96\x81a', '\xe2\x96\x81', 't', 'est'])
'This is a test'
>>> sp.NBestEncodeAsPieces("This is a test", 5)
[['\xe2\x96\x81This', '\xe2\x96\x81is', '\xe2\x96\x81a', '\xe2\x96\x81', 't', 'est'], ['\xe2\x96\x81This', '\xe2\x96\x81is', '\xe2\x96\x81a', '\xe2\x96\x81', 'te', 'st'], ['\xe2\x96\x81This', '\xe2\x96\x81is', '\xe2\x96\x81a', '\xe2\x96\x81', 'te', 's', 't'], ['\xe2\x96\x81This', '\xe2\x96\x81is', '\xe2\x96\x81a', '\xe2\x96\x81', 't', 'e', 'st'], ['\xe2\x96\x81This', '\xe2\x96\x81is', '\xe2\x96\x81a', '\xe2\x96\x81', 't', 'es', 't']]
>>> for x in range(10):
...     sp.SampleEncodeAsPieces("This is a test", -1, 0.1)
...
['\xe2\x96\x81', 'T', 'h', 'i', 's', '\xe2\x96\x81', 'is', '\xe2\x96\x81a', '\xe2\x96\x81', 't', 'e', 's', 't']
['\xe2\x96\x81T', 'h', 'is', '\xe2\x96\x81is', '\xe2\x96\x81', 'a', '\xe2\x96\x81', 't', 'est']
['\xe2\x96\x81This', '\xe2\x96\x81is', '\xe2\x96\x81', 'a', '\xe2\x96\x81', 't', 'e', 'st']
['\xe2\x96\x81This', '\xe2\x96\x81is', '\xe2\x96\x81a', '\xe2\x96\x81', 't', 'e', 'st']
['\xe2\x96\x81This', '\xe2\x96\x81is', '\xe2\x96\x81a', '\xe2\x96\x81', 't', 'e', 's', 't']
['\xe2\x96\x81T', 'h', 'is', '\xe2\x96\x81', 'i', 's', '\xe2\x96\x81a', '\xe2\x96\x81', 'te', 's', 't']
['\xe2\x96\x81This', '\xe2\x96\x81', 'is', '\xe2\x96\x81a', '\xe2\x96\x81', 'te', 's', 't']
['\xe2\x96\x81This', '\xe2\x96\x81', 'i', 's', '\xe2\x96\x81a', '\xe2\x96\x81', 't', 'e', 'st']
['\xe2\x96\x81This', '\xe2\x96\x81', 'is', '\xe2\x96\x81', 'a', '\xe2\x96\x81', 't', 'e', 'st']
['\xe2\x96\x81This', '\xe2\x96\x81', 'i', 's', '\xe2\x96\x81', 'a', '\xe2\x96\x81', 'te', 's', 't']
>>> sp.DecodeIds([284, 47, 11, 4, 15, 400])
'This is a test'
>>> sp.GetPieceSize()
1000
>>> sp.IdToPiece(2)
'</s>'
>>> sp.PieceToId('</s>')
2
>>> len(sp)
1000
>>> sp['</s>']
2
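The '\xe2\x96\x81' prefix in the pieces above is the UTF-8 encoding of U+2581 (▁), which SentencePiece uses to mark whitespace. The sketch below shows, in plain Python and without loading a model, why every segmentation in the n-best and sampling output decodes to the same sentence. The helper join_pieces is hypothetical and is not the library's DecodePieces implementation; the pieces are copied from the session above and written as Python 3 strings.

```python
WS = "\u2581"  # whitespace marker used in SentencePiece pieces


def join_pieces(pieces):
    """Concatenate pieces, turn the marker back into spaces, drop the leading space."""
    text = "".join(pieces).replace(WS, " ")
    return text[1:] if text.startswith(" ") else text


# Three of the segmentations shown above for "This is a test".
best = [WS + "This", WS + "is", WS + "a", WS, "t", "est"]
second = [WS + "This", WS + "is", WS + "a", WS, "te", "st"]
sampled = [WS, "T", "h", "i", "s", WS, "is", WS + "a", WS, "t", "e", "s", "t"]

for pieces in (best, second, sampled):
    print(join_pieces(pieces))
```

Each candidate splits the same character sequence at different points, so joining the pieces recovers the same text.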
Model Training
Training is performed by passing spm_train parameters, as a single string, to the SentencePieceTrainer.Train() function.
>>> import sentencepiece as spm
>>> spm.SentencePieceTrainer.Train('--input=test/botchan.txt --model_prefix=m --vocab_size=1000')
unigram_model_trainer.cc(494) LOG(INFO) Starts training with :
input: "test/botchan.txt"
model_prefix: "m"
model_type: UNIGRAM
..snip..
unigram_model_trainer.cc(529) LOG(INFO) EM sub_iter=0 size=1239 obj=10.4055 num_tokens=36256 num_tokens/piece=29.2623
unigram_model_trainer.cc(529) LOG(INFO) EM sub_iter=1 size=1239 obj=10.3187 num_tokens=36256 num_tokens/piece=29.2623
unigram_model_trainer.cc(529) LOG(INFO) EM sub_iter=0 size=1100 obj=10.5285 num_tokens=37633 num_tokens/piece=34.2118
unigram_model_trainer.cc(529) LOG(INFO) EM sub_iter=1 size=1100 obj=10.4973 num_tokens=37630 num_tokens/piece=34.2091
trainer_interface.cc(284) LOG(INFO) Saving model: m.model
trainer_interface.cc(293) LOG(INFO) Saving vocabs: m.vocab
>>>
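Because Train() takes the spm_train flags as one string, it can be convenient to assemble that string from keyword arguments. The helper build_train_args below is a hypothetical convenience, not part of the wrapper; it only formats the flags used in the example above.

```python
def build_train_args(**options):
    """Format keyword arguments as a single '--key=value ...' flag string."""
    return " ".join("--{}={}".format(key, value) for key, value in options.items())


args = build_train_args(input="test/botchan.txt", model_prefix="m", vocab_size=1000)
print(args)

# With the wrapper installed, the string can be passed on unchanged:
#   import sentencepiece as spm
#   spm.SentencePieceTrainer.Train(args)
```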
Python2/3 String/Unicode compatibility
The SentencePiece Python wrapper accepts both Unicode strings and legacy byte strings. The output string type is determined by the input string type. The output type of the IdToPiece and DecodeIds methods is str, which is a legacy byte string in Python 2 and a Unicode string in Python 3.
- Python2:
>>> sp.EncodeAsPieces('吾輩は猫である')
['\xe2\x96\x81', '\xe5\x90\xbe', '\xe8\xbc\xa9', '\xe3\x81\xaf', '\xe7\x8c\xab', '\xe3\x81\xa7\xe3\x81\x82\xe3\x82\x8b']
>>> sp.EncodeAsPieces(u'吾輩は猫である')
[u'\u2581', u'\u543e', u'\u8f29', u'\u306f', u'\u732b', u'\u3067\u3042\u308b']
>>> sp.EncodeAsPieces(u'吾輩は猫である'.encode('utf-8'))
['\xe2\x96\x81', '\xe5\x90\xbe', '\xe8\xbc\xa9', '\xe3\x81\xaf', '\xe7\x8c\xab', '\xe3\x81\xa7\xe3\x81\x82\xe3\x82\x8b']
>>> sp.IdToPiece(10)
'\xe3\x81\xab'
>>> type(sp.IdToPiece(10))
<type 'str'>
- Python3:
>>> sp.EncodeAsPieces('吾輩は猫である')
['▁', '吾', '輩', 'は', '猫', 'である']
>>> sp.EncodeAsPieces('吾輩は猫である'.encode('utf-8'))
[b'\xe2\x96\x81', b'\xe5\x90\xbe', b'\xe8\xbc\xa9', b'\xe3\x81\xaf', b'\xe7\x8c\xab', b'\xe3\x81\xa7\xe3\x81\x82\xe3\x82\x8b']
>>>
>>> sp.IdToPiece(10)
'に'
>>> type(sp.IdToPiece(10))
<class 'str'>
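In Python 3, code that may receive either kind of piece can normalise the result to str. The following sketch uses only built-in string methods; the helper to_text is hypothetical, and the byte pieces are copied from the Python 3 session above.

```python
def to_text(piece):
    """Return a piece as str, decoding UTF-8 when it is a byte string."""
    return piece.decode("utf-8") if isinstance(piece, bytes) else piece


# Output of EncodeAsPieces for a UTF-8 encoded (bytes) input, as shown above.
byte_pieces = [b'\xe2\x96\x81', b'\xe5\x90\xbe', b'\xe8\xbc\xa9',
               b'\xe3\x81\xaf', b'\xe7\x8c\xab', b'\xe3\x81\xa7\xe3\x81\x82\xe3\x82\x8b']

text_pieces = [to_text(p) for p in byte_pieces]
print(text_pieces)
```

Decoding the byte pieces gives the same list that the str input produces directly.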