SentencePiece python wrapper
Python wrapper for SentencePiece, built with SWIG. This module wraps the sentencepiece::SentencePieceProcessor class with the following modifications:
- Encode is exposed as EncodeAsIds and EncodeAsPieces; Decode is exposed as DecodeIds and DecodePieces.
- Model training is supported through the SentencePieceTrainer.Train method.
- The SentencePieceText proto is not supported.
- __len__ and __getitem__ methods are added: len(obj) returns the vocabulary size and obj[piece] returns the vocabulary id of that piece.
Build and Install SentencePiece
On Linux (x64/i686), macOS, and Windows (win32/x64), you can install the SentencePiece Python module with pip:
% pip install sentencepiece
To build and install the Python wrapper from source, first install the SentencePiece C++ library, then run the following commands:
% python setup.py build
% sudo python setup.py install
If you don’t have write permission to the global site-packages directory, or don’t want to install into it, run this instead:
% python setup.py install --user
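After installing, it can help to confirm that the module is importable from the interpreter you intend to use. The following is a minimal sketch using only the standard library; the helper name is_installed is hypothetical and not part of SentencePiece.

```python
import importlib.util


def is_installed(module_name):
    """Return True if module_name can be found on the current import path."""
    # find_spec returns None for a top-level module that cannot be located.
    return importlib.util.find_spec(module_name) is not None


if __name__ == "__main__":
    # "sentencepiece" is the import name used throughout this page.
    print("sentencepiece importable:", is_installed("sentencepiece"))
```

If this prints False after a --user install, the interpreter running the check is probably not the one the package was installed for.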
Usage
See this Google Colab page to run SentencePiece interactively.
Segmentation
% python
>>> import sentencepiece as spm
>>> sp = spm.SentencePieceProcessor()
>>> sp.Load("test/test_model.model")
True
>>> sp.EncodeAsPieces("This is a test")
['\xe2\x96\x81This', '\xe2\x96\x81is', '\xe2\x96\x81a', '\xe2\x96\x81', 't', 'est']
>>> sp.EncodeAsIds("This is a test")
[284, 47, 11, 4, 15, 400]
>>> sp.DecodePieces(['\xe2\x96\x81This', '\xe2\x96\x81is', '\xe2\x96\x81a', '\xe2\x96\x81', 't', 'est'])
'This is a test'
>>> sp.NBestEncodeAsPieces("This is a test", 5)
[['\xe2\x96\x81This', '\xe2\x96\x81is', '\xe2\x96\x81a', '\xe2\x96\x81', 't', 'est'], ['\xe2\x96\x81This', '\xe2\x96\x81is', '\xe2\x96\x81a', '\xe2\x96\x81', 'te', 'st'], ['\xe2\x96\x81This', '\xe2\x96\x81is', '\xe2\x96\x81a', '\xe2\x96\x81', 'te', 's', 't'], ['\xe2\x96\x81This', '\xe2\x96\x81is', '\xe2\x96\x81a', '\xe2\x96\x81', 't', 'e', 'st'], ['\xe2\x96\x81This', '\xe2\x96\x81is', '\xe2\x96\x81a', '\xe2\x96\x81', 't', 'es', 't']]
>>> for x in range(10):
...     sp.SampleEncodeAsPieces("This is a test", -1, 0.1)
...
['\xe2\x96\x81', 'T', 'h', 'i', 's', '\xe2\x96\x81', 'is', '\xe2\x96\x81a', '\xe2\x96\x81', 't', 'e', 's', 't']
['\xe2\x96\x81T', 'h', 'is', '\xe2\x96\x81is', '\xe2\x96\x81', 'a', '\xe2\x96\x81', 't', 'est']
['\xe2\x96\x81This', '\xe2\x96\x81is', '\xe2\x96\x81', 'a', '\xe2\x96\x81', 't', 'e', 'st']
['\xe2\x96\x81This', '\xe2\x96\x81is', '\xe2\x96\x81a', '\xe2\x96\x81', 't', 'e', 'st']
['\xe2\x96\x81This', '\xe2\x96\x81is', '\xe2\x96\x81a', '\xe2\x96\x81', 't', 'e', 's', 't']
['\xe2\x96\x81T', 'h', 'is', '\xe2\x96\x81', 'i', 's', '\xe2\x96\x81a', '\xe2\x96\x81', 'te', 's', 't']
['\xe2\x96\x81This', '\xe2\x96\x81', 'is', '\xe2\x96\x81a', '\xe2\x96\x81', 'te', 's', 't']
['\xe2\x96\x81This', '\xe2\x96\x81', 'i', 's', '\xe2\x96\x81a', '\xe2\x96\x81', 't', 'e', 'st']
['\xe2\x96\x81This', '\xe2\x96\x81', 'is', '\xe2\x96\x81', 'a', '\xe2\x96\x81', 't', 'e', 'st']
['\xe2\x96\x81This', '\xe2\x96\x81', 'i', 's', '\xe2\x96\x81', 'a', '\xe2\x96\x81', 'te', 's', 't']
>>> sp.DecodeIds([284, 47, 11, 4, 15, 400])
'This is a test'
>>> sp.GetPieceSize()
1000
>>> sp.IdToPiece(2)
'</s>'
>>> sp.PieceToId('</s>')
2
>>> len(sp)
1000
>>> sp['</s>']
2
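The '\xe2\x96\x81' prefix in the pieces above is the UTF-8 encoding of U+2581 (▁), the symbol SentencePiece uses to mark whitespace. As a rough illustration of why DecodePieces recovers the original sentence from any of the segmentations shown, the sketch below joins pieces and maps the marker back to spaces in plain Python. It is a simplified stand-in, not the library's implementation; the names WHITESPACE_MARKER and join_pieces are hypothetical, and it assumes Python 3 str pieces.

```python
WHITESPACE_MARKER = "\u2581"  # "▁"; its UTF-8 bytes are b"\xe2\x96\x81"


def join_pieces(pieces):
    """Concatenate pieces and turn each whitespace marker back into a space."""
    text = "".join(pieces).replace(WHITESPACE_MARKER, " ")
    # The first piece carries a marker for the leading space; drop it.
    return text[1:] if text.startswith(" ") else text


pieces = ["\u2581This", "\u2581is", "\u2581a", "\u2581", "t", "est"]
print(join_pieces(pieces))
```

Because the marker travels with the pieces, different segmentations of the same sentence all join back to the same text.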
Model Training
Training is performed by passing the spm_train command-line flags, as a single string, to the SentencePieceTrainer.Train() function.
>>> import sentencepiece as spm
>>> spm.SentencePieceTrainer.Train('--input=test/botchan.txt --model_prefix=m --vocab_size=1000')
unigram_model_trainer.cc(494) LOG(INFO) Starts training with :
input: "test/botchan.txt"
model_prefix: "m"
model_type: UNIGRAM
..snip..
unigram_model_trainer.cc(529) LOG(INFO) EM sub_iter=0 size=1239 obj=10.4055 num_tokens=36256 num_tokens/piece=29.2623
unigram_model_trainer.cc(529) LOG(INFO) EM sub_iter=1 size=1239 obj=10.3187 num_tokens=36256 num_tokens/piece=29.2623
unigram_model_trainer.cc(529) LOG(INFO) EM sub_iter=0 size=1100 obj=10.5285 num_tokens=37633 num_tokens/piece=34.2118
unigram_model_trainer.cc(529) LOG(INFO) EM sub_iter=1 size=1100 obj=10.4973 num_tokens=37630 num_tokens/piece=34.2091
trainer_interface.cc(284) LOG(INFO) Saving model: m.model
trainer_interface.cc(293) LOG(INFO) Saving vocabs: m.vocab
>>>
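Because Train() takes its flags as one string, it can be convenient to assemble that string from keyword arguments. The sketch below is one way to do that; build_train_args and train_model are hypothetical helper names, not part of the wrapper, and only the --input, --model_prefix and --vocab_size flags shown above are used.

```python
def build_train_args(**flags):
    """Format keyword arguments as a '--name=value' flag string."""
    return " ".join("--{}={}".format(name, value) for name, value in flags.items())


def train_model(corpus_path, model_prefix, vocab_size):
    """Train a model; writes <model_prefix>.model and <model_prefix>.vocab."""
    # Imported here so build_train_args can be used without the module installed.
    import sentencepiece as spm

    args = build_train_args(
        input=corpus_path, model_prefix=model_prefix, vocab_size=vocab_size
    )
    spm.SentencePieceTrainer.Train(args)


print(build_train_args(input="test/botchan.txt", model_prefix="m", vocab_size=1000))
```

Note that this simple formatter does no quoting, so it assumes flag values contain no spaces.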
Python2/3 String/Unicode compatibility
The SentencePiece Python wrapper accepts both Unicode strings and legacy byte strings. The output string type follows the input string type. The IdToPiece and DecodeIds methods return str, which is a legacy byte string in Python 2 and a Unicode string in Python 3.
- Python2:
>>> sp.EncodeAsPieces('吾輩は猫である')
['\xe2\x96\x81', '\xe5\x90\xbe', '\xe8\xbc\xa9', '\xe3\x81\xaf', '\xe7\x8c\xab', '\xe3\x81\xa7\xe3\x81\x82\xe3\x82\x8b']
>>> sp.EncodeAsPieces(u'吾輩は猫である')
[u'\u2581', u'\u543e', u'\u8f29', u'\u306f', u'\u732b', u'\u3067\u3042\u308b']
>>> sp.EncodeAsPieces(u'吾輩は猫である'.encode('utf-8'))
['\xe2\x96\x81', '\xe5\x90\xbe', '\xe8\xbc\xa9', '\xe3\x81\xaf', '\xe7\x8c\xab', '\xe3\x81\xa7\xe3\x81\x82\xe3\x82\x8b']
>>> sp.IdToPiece(10)
'\xe3\x81\xab'
>>> type(sp.IdToPiece(10))
<type 'str'>
- Python3:
>>> sp.EncodeAsPieces('吾輩は猫である')
['▁', '吾', '輩', 'は', '猫', 'である']
>>> sp.EncodeAsPieces('吾輩は猫である'.encode('utf-8'))
[b'\xe2\x96\x81', b'\xe5\x90\xbe', b'\xe8\xbc\xa9', b'\xe3\x81\xaf', b'\xe7\x8c\xab', b'\xe3\x81\xa7\xe3\x81\x82\xe3\x82\x8b']
>>>
>>> sp.IdToPiece(10)
'に'
>>> type(sp.IdToPiece(10))
<class 'str'>
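When the same code has to handle both input styles, one option is to normalize pieces to Unicode text after encoding. This is a minimal sketch for Python 3, assuming the pieces are UTF-8 encoded as in the examples above; to_text is a hypothetical helper, not part of the wrapper.

```python
def to_text(piece):
    """Return piece as str, decoding UTF-8 when it is a byte string."""
    if isinstance(piece, bytes):
        return piece.decode("utf-8")
    return piece


byte_pieces = [b"\xe2\x96\x81", b"\xe5\x90\xbe", b"\xe8\xbc\xa9"]
print([to_text(p) for p in byte_pieces])
```

Pieces that are already str pass through unchanged, so the helper can be applied to the output of EncodeAsPieces regardless of the input type.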