SentencePiece python wrapper
Python wrapper for SentencePiece, built with SWIG. This module wraps the sentencepiece::SentencePieceProcessor class with the following modifications:
- The Encode and Decode methods are redefined as EncodeAsIds and EncodeAsPieces, and as DecodeIds and DecodePieces, respectively.
- Model training is supported through the SentencePieceTrainer.Train method.
- SentencePieceText proto is not supported.
- Added __len__ and __getitem__ methods. len(obj) and obj[key] return the vocabulary size and the vocabulary id, respectively.
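The last point can be illustrated without a trained model. The sketch below uses a hypothetical stand-in class, FakeProcessor, that only mimics the method names of the wrapper (GetPieceSize, PieceToId); it is not the real SentencePieceProcessor, and it does not reproduce how the real class treats unknown pieces. The helper container_view_matches is also an illustrative name.

```python
class FakeProcessor:
    """Hypothetical stand-in that mimics a few wrapper method names."""

    def __init__(self, pieces):
        self._pieces = list(pieces)
        # Map each piece to its position, which plays the role of the vocab id.
        self._ids = {piece: i for i, piece in enumerate(self._pieces)}

    def GetPieceSize(self):
        return len(self._pieces)

    def PieceToId(self, piece):
        # Simplification: unknown pieces raise KeyError in this stand-in.
        return self._ids[piece]

    def __len__(self):
        return self.GetPieceSize()

    def __getitem__(self, piece):
        return self.PieceToId(piece)


def container_view_matches(sp, piece):
    """Return True when len(sp) and sp[piece] agree with the explicit methods."""
    return len(sp) == sp.GetPieceSize() and sp[piece] == sp.PieceToId(piece)


fake = FakeProcessor(["<unk>", "<s>", "</s>"])
print(container_view_matches(fake, "</s>"))
```

The same helper should accept a loaded processor in place of the stand-in, assuming it exposes the four members used here.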
Build and Install SentencePiece
On Linux (x64/i686), macOS, and Windows (win32/x64), you can install the SentencePiece Python module with pip:
% pip install sentencepiece
To build and install the Python wrapper from source, first install the SentencePiece C++ library, then run the following commands:
% python setup.py build
% sudo python setup.py install
If you don’t have write permission to the global site-packages directory or don’t want to install into it, please try:
% python setup.py install --user
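After any of the install routes above, a quick way to check that the module is visible to the current interpreter is to ask the import system for it. This is a minimal sketch using only the standard library; the helper name is_importable is an illustrative choice and not part of SentencePiece.

```python
import importlib.util


def is_importable(module_name):
    """Return True if the current interpreter can locate the named top-level module."""
    return importlib.util.find_spec(module_name) is not None


# Prints True once the package is installed for this interpreter, False otherwise.
print(is_importable("sentencepiece"))
```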
Usage
See the Google Colab page to run SentencePiece interactively.
Segmentation
% python
>>> import sentencepiece as spm
>>> sp = spm.SentencePieceProcessor()
>>> sp.Load("test/test_model.model")
True
>>> sp.EncodeAsPieces("This is a test")
['\xe2\x96\x81This', '\xe2\x96\x81is', '\xe2\x96\x81a', '\xe2\x96\x81', 't', 'est']
>>> sp.EncodeAsIds("This is a test")
[284, 47, 11, 4, 15, 400]
>>> sp.DecodePieces(['\xe2\x96\x81This', '\xe2\x96\x81is', '\xe2\x96\x81a', '\xe2\x96\x81', 't', 'est'])
'This is a test'
>>> sp.NBestEncodeAsPieces("This is a test", 5)
[['\xe2\x96\x81This', '\xe2\x96\x81is', '\xe2\x96\x81a', '\xe2\x96\x81', 't', 'est'], ['\xe2\x96\x81This', '\xe2\x96\x81is', '\xe2\x96\x81a', '\xe2\x96\x81', 'te', 'st'], ['\xe2\x96\x81This', '\xe2\x96\x81is', '\xe2\x96\x81a', '\xe2\x96\x81', 'te', 's', 't'], ['\xe2\x96\x81This', '\xe2\x96\x81is', '\xe2\x96\x81a', '\xe2\x96\x81', 't', 'e', 'st'], ['\xe2\x96\x81This', '\xe2\x96\x81is', '\xe2\x96\x81a', '\xe2\x96\x81', 't', 'es', 't']]
>>> for x in range(10):
...     sp.SampleEncodeAsPieces("This is a test", -1, 0.1)
...
['\xe2\x96\x81', 'T', 'h', 'i', 's', '\xe2\x96\x81', 'is', '\xe2\x96\x81a', '\xe2\x96\x81', 't', 'e', 's', 't']
['\xe2\x96\x81T', 'h', 'is', '\xe2\x96\x81is', '\xe2\x96\x81', 'a', '\xe2\x96\x81', 't', 'est']
['\xe2\x96\x81This', '\xe2\x96\x81is', '\xe2\x96\x81', 'a', '\xe2\x96\x81', 't', 'e', 'st']
['\xe2\x96\x81This', '\xe2\x96\x81is', '\xe2\x96\x81a', '\xe2\x96\x81', 't', 'e', 'st']
['\xe2\x96\x81This', '\xe2\x96\x81is', '\xe2\x96\x81a', '\xe2\x96\x81', 't', 'e', 's', 't']
['\xe2\x96\x81T', 'h', 'is', '\xe2\x96\x81', 'i', 's', '\xe2\x96\x81a', '\xe2\x96\x81', 'te', 's', 't']
['\xe2\x96\x81This', '\xe2\x96\x81', 'is', '\xe2\x96\x81a', '\xe2\x96\x81', 'te', 's', 't']
['\xe2\x96\x81This', '\xe2\x96\x81', 'i', 's', '\xe2\x96\x81a', '\xe2\x96\x81', 't', 'e', 'st']
['\xe2\x96\x81This', '\xe2\x96\x81', 'is', '\xe2\x96\x81', 'a', '\xe2\x96\x81', 't', 'e', 'st']
['\xe2\x96\x81This', '\xe2\x96\x81', 'i', 's', '\xe2\x96\x81', 'a', '\xe2\x96\x81', 'te', 's', 't']
>>> sp.DecodeIds([284, 47, 11, 4, 15, 400])
'This is a test'
>>> sp.GetPieceSize()
1000
>>> sp.IdToPiece(2)
'</s>'
>>> sp.PieceToId('</s>')
2
>>> len(sp)
1000
>>> sp['</s>']
2
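In the Python 2 session above, '\xe2\x96\x81' is the UTF-8 byte sequence for U+2581 ("▁"), the meta symbol SentencePiece uses to mark whitespace. The sketch below shows, under simplifying assumptions, how the pieces from the session map back to the original text. It is not the library's DecodePieces implementation, which also deals with details such as special symbols; join_pieces and META are illustrative names.

```python
META = "\u2581"  # "▁"; appears as '\xe2\x96\x81' in the Python 2 output


def join_pieces(pieces):
    """Concatenate pieces, turn each meta symbol into a space, drop one leading space."""
    text = "".join(pieces).replace(META, " ")
    return text[1:] if text.startswith(" ") else text


# The pieces returned by EncodeAsPieces("This is a test") in the session.
pieces = [META + "This", META + "is", META + "a", META, "t", "est"]
print(join_pieces(pieces))
```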
Model Training
Training is performed by passing the spm_train parameters, as a single string, to the SentencePieceTrainer.Train() function.
>>> import sentencepiece as spm
>>> spm.SentencePieceTrainer.Train('--input=test/botchan.txt --model_prefix=m --vocab_size=1000')
unigram_model_trainer.cc(494) LOG(INFO) Starts training with :
input: "test/botchan.txt"
model_prefix: "m"
model_type: UNIGRAM
..snip..
unigram_model_trainer.cc(529) LOG(INFO) EM sub_iter=0 size=1239 obj=10.4055 num_tokens=36256 num_tokens/piece=29.2623
unigram_model_trainer.cc(529) LOG(INFO) EM sub_iter=1 size=1239 obj=10.3187 num_tokens=36256 num_tokens/piece=29.2623
unigram_model_trainer.cc(529) LOG(INFO) EM sub_iter=0 size=1100 obj=10.5285 num_tokens=37633 num_tokens/piece=34.2118
unigram_model_trainer.cc(529) LOG(INFO) EM sub_iter=1 size=1100 obj=10.4973 num_tokens=37630 num_tokens/piece=34.2091
trainer_interface.cc(284) LOG(INFO) Saving model: m.model
trainer_interface.cc(293) LOG(INFO) Saving vocabs: m.vocab
>>>
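Because Train() takes the flags as one string, it can be convenient to assemble that string from keyword arguments. The helper below is a hypothetical convenience, not part of the wrapper, and it uses only the three flags shown in the session above. It does no quoting, so it assumes the values contain no spaces.

```python
def build_train_args(**options):
    """Format keyword arguments as a spm_train-style '--key=value' string."""
    return " ".join("--{}={}".format(key, value) for key, value in options.items())


args = build_train_args(input="test/botchan.txt", model_prefix="m", vocab_size=1000)
print(args)

# With the module installed and the corpus present, the string would be passed on as:
# import sentencepiece as spm
# spm.SentencePieceTrainer.Train(args)
```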
Python2/3 String/Unicode compatibility
The SentencePiece Python wrapper accepts both Unicode strings and legacy byte strings. The output string type is determined by the input string type. The IdToPiece and DecodeIds methods return str, which is a legacy byte string in Python 2 and a Unicode string in Python 3.
- Python2:
>>> sp.EncodeAsPieces('吾輩は猫である')
['\xe2\x96\x81', '\xe5\x90\xbe', '\xe8\xbc\xa9', '\xe3\x81\xaf', '\xe7\x8c\xab', '\xe3\x81\xa7\xe3\x81\x82\xe3\x82\x8b']
>>> sp.EncodeAsPieces(u'吾輩は猫である')
[u'\u2581', u'\u543e', u'\u8f29', u'\u306f', u'\u732b', u'\u3067\u3042\u308b']
>>> sp.EncodeAsPieces(u'吾輩は猫である'.encode('utf-8'))
['\xe2\x96\x81', '\xe5\x90\xbe', '\xe8\xbc\xa9', '\xe3\x81\xaf', '\xe7\x8c\xab', '\xe3\x81\xa7\xe3\x81\x82\xe3\x82\x8b']
>>> sp.IdToPiece(10)
'\xe3\x81\xab'
>>> type(sp.IdToPiece(10))
<type 'str'>
- Python3:
>>> sp.EncodeAsPieces('吾輩は猫である')
['▁', '吾', '輩', 'は', '猫', 'である']
>>> sp.EncodeAsPieces('吾輩は猫である'.encode('utf-8'))
[b'\xe2\x96\x81', b'\xe5\x90\xbe', b'\xe8\xbc\xa9', b'\xe3\x81\xaf', b'\xe7\x8c\xab', b'\xe3\x81\xa7\xe3\x81\x82\xe3\x82\x8b']
>>>
>>> sp.IdToPiece(10)
'に'
>>> type(sp.IdToPiece(10))
<class 'str'>
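Since the output type follows the input type, Python 3 code that may receive either bytes or str pieces can normalize them before further processing. The following is a minimal sketch assuming UTF-8 encoded pieces, as in the examples above; pieces_as_text is an illustrative helper, not a wrapper function.

```python
def pieces_as_text(pieces, encoding="utf-8"):
    """Decode any bytes pieces to str and leave str pieces unchanged."""
    return [p.decode(encoding) if isinstance(p, bytes) else p for p in pieces]


# The bytes output shown for Python 3 above.
byte_pieces = [
    b"\xe2\x96\x81",
    b"\xe5\x90\xbe",
    b"\xe8\xbc\xa9",
    b"\xe3\x81\xaf",
    b"\xe7\x8c\xab",
    b"\xe3\x81\xa7\xe3\x81\x82\xe3\x82\x8b",
]
print(pieces_as_text(byte_pieces))
```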