SentencePiece python wrapper
Project description
# SentencePiece Python Wrapper
Python wrapper for SentencePiece with SWIG. This module wraps sentencepiece::SentencePieceProcessor class with the following modifications:
* Encode and Decode methods are re-defined as EncodeAsIds, EncodeAsPieces, DecodeIds and DecodePieces respectevely.
* Support model training with SentencePieceTrainer.Train method.
* SentencePieceText proto is not supported.
* Added __len__ and __getitem__ methods. len(obj) and obj[key] returns vocab size and vocab id respectively.
## Build and Install SentencePiece
You can simply use pip comand to install SentencePiece python module.
```
% pip install sentencepiece
```
To build and install the wrapper manually, you need to install SentencePiece C++ in advance, and then try the following commands:
```
% python setup.py build
% sudo python setup.py install
```
If you don’t have write permission to the global site-packages directory or don’t want to install into it, please try:
```
% python setup.py install --user
```
## Usage
### Segmentation
```
% python
>>> import sentencepiece as spm
>>> sp = spm.SentencePieceProcessor()
>>> sp.Load("test/test_model.model")
True
>>> sp.EncodeAsPieces("This is a test")
['\xe2\x96\x81This', '\xe2\x96\x81is', '\xe2\x96\x81a', '\xe2\x96\x81', 't', 'est']
>>> sp.EncodeAsIds("This is a test")
[284, 47, 11, 4, 15, 400]
>>> sp.DecodePieces(['\xe2\x96\x81This', '\xe2\x96\x81is', '\xe2\x96\x81a', '\xe2\x96\x81', 't', 'est'])
'This is a test'
>>> sp.NBestEncode("This is a test", 5)
[['\xe2\x96\x81This', '\xe2\x96\x81is', '\xe2\x96\x81a', '\xe2\x96\x81', 't', 'est'], ['\xe2\x96\x81This', '\xe2\x96\x81is', '\xe2\x96\x81a', '\xe2\x96\x81', 'te', 'st'], ['\xe2\x96\x81This', '\xe2\x96\x81is', '\xe2\x96\x81a', '\xe2\x96\x81', 'te', 's', 't'], ['\xe2\x96\x81This', '\xe2\x96\x81is', '\xe2\x96\x81a', '\xe2\x96\x81', 't', 'e', 'st'], ['\xe2\x96\x81This', '\xe2\x96\x81is', '\xe2\x96\x81a', '\xe2\x96\x81', 't', 'es', 't']]
>>> for x in range(10):
... sp.SampleEncode("This is a test", -1, 0.1)
...
['\xe2\x96\x81', 'T', 'h', 'i', 's', '\xe2\x96\x81', 'is', '\xe2\x96\x81a', '\xe2\x96\x81', 't', 'e', 's', 't']
['\xe2\x96\x81T', 'h', 'is', '\xe2\x96\x81is', '\xe2\x96\x81', 'a', '\xe2\x96\x81', 't', 'est']
['\xe2\x96\x81This', '\xe2\x96\x81is', '\xe2\x96\x81', 'a', '\xe2\x96\x81', 't', 'e', 'st']
['\xe2\x96\x81This', '\xe2\x96\x81is', '\xe2\x96\x81a', '\xe2\x96\x81', 't', 'e', 'st']
['\xe2\x96\x81This', '\xe2\x96\x81is', '\xe2\x96\x81a', '\xe2\x96\x81', 't', 'e', 's', 't']
['\xe2\x96\x81T', 'h', 'is', '\xe2\x96\x81', 'i', 's', '\xe2\x96\x81a', '\xe2\x96\x81', 'te', 's', 't']
['\xe2\x96\x81This', '\xe2\x96\x81', 'is', '\xe2\x96\x81a', '\xe2\x96\x81', 'te', 's', 't']
['\xe2\x96\x81This', '\xe2\x96\x81', 'i', 's', '\xe2\x96\x81a', '\xe2\x96\x81', 't', 'e', 'st']
['\xe2\x96\x81This', '\xe2\x96\x81', 'is', '\xe2\x96\x81', 'a', '\xe2\x96\x81', 't', 'e', 'st']
['\xe2\x96\x81This', '\xe2\x96\x81', 'i', 's', '\xe2\x96\x81', 'a', '\xe2\x96\x81', 'te', 's', 't']
>>> sp.DecodeIds([284, 47, 11, 4, 15, 400])
'This is a test'
>>> sp.GetPieceSize()
1000
>>> sp.IdToPiece(2)
'</s>'
>>> sp.PieceToId('</s>')
2
>>> len(sp)
1000
>>> sp['</s>']
2
```
### Model Training
Training is peformed by passing parameters of [spm_train](https://github.com/google/sentencepiece#train-sentencepiece-model) to SentencePieceTrainer.Train() function.
```
>>> import sentencepiece as spm
>>> spm.SentencePieceTrainer.Train('--input=test/botchan.txt --model_prefix=m --vocab_size=1000')
unigram_model_trainer.cc(494) LOG(INFO) Starts training with :
input: "test/botchan.txt"
model_prefix: "m"
model_type: UNIGRAM
..snip..
unigram_model_trainer.cc(529) LOG(INFO) EM sub_iter=0 size=1239 obj=10.4055 num_tokens=36256 num_tokens/piece=29.2623
unigram_model_trainer.cc(529) LOG(INFO) EM sub_iter=1 size=1239 obj=10.3187 num_tokens=36256 num_tokens/piece=29.2623
unigram_model_trainer.cc(529) LOG(INFO) EM sub_iter=0 size=1100 obj=10.5285 num_tokens=37633 num_tokens/piece=34.2118
unigram_model_trainer.cc(529) LOG(INFO) EM sub_iter=1 size=1100 obj=10.4973 num_tokens=37630 num_tokens/piece=34.2091
trainer_interface.cc(284) LOG(INFO) Saving model: m.model
trainer_interface.cc(293) LOG(INFO) Saving vocabs: m.vocab
>>>
```
## Python2/3 String/Unicode compatibility
Sentencepiece python wrapper accepts both Unicode string and legacy byte string.
The output string type is determined by the input string type.
The output type of IdToPiece/DecodeIds methods is *str*, but note that it is a legacy byte string in Python2 and Unicode string in Python3 respectively.
* Python2:
```
>>> sp.Encode('吾輩は猫である')
['\xe2\x96\x81', '\xe5\x90\xbe', '\xe8\xbc\xa9', '\xe3\x81\xaf', '\xe7\x8c\xab', '\xe3\x81\xa7\xe3\x81\x82\xe3\x82\x8b']
>>> sp.Encode(u'吾輩は猫である')
[u'\u2581', u'\u543e', u'\u8f29', u'\u306f', u'\u732b', u'\u3067\u3042\u308b']
>>> sp.Encode(u'吾輩は猫である'.encode('utf-8'))
['\xe2\x96\x81', '\xe5\x90\xbe', '\xe8\xbc\xa9', '\xe3\x81\xaf', '\xe7\x8c\xab', '\xe3\x81\xa7\xe3\x81\x82\xe3\x82\x8b']
>>> sp.IdToPiece(10)
'\xe3\x81\xab'
>>> type(sp.IdToPiece(10))
<type 'str'>
```
* Python3:
```
>>> sp.Encode('吾輩は猫である')
['▁', '吾', '輩', 'は', '猫', 'である']
>>> sp.Encode('吾輩は猫である'.encode('utf-8'))
[b'\xe2\x96\x81', b'\xe5\x90\xbe', b'\xe8\xbc\xa9', b'\xe3\x81\xaf', b'\xe7\x8c\xab', b'\xe3\x81\xa7\xe3\x81\x82\xe3\x82\x8b']
>>>
>>> sp.IdToPiece(10)
'に'
>>> type(sp.IdToPiece(10))
<class 'str'>
```
Python wrapper for SentencePiece with SWIG. This module wraps sentencepiece::SentencePieceProcessor class with the following modifications:
* Encode and Decode methods are re-defined as EncodeAsIds, EncodeAsPieces, DecodeIds and DecodePieces respectevely.
* Support model training with SentencePieceTrainer.Train method.
* SentencePieceText proto is not supported.
* Added __len__ and __getitem__ methods. len(obj) and obj[key] returns vocab size and vocab id respectively.
## Build and Install SentencePiece
You can simply use pip comand to install SentencePiece python module.
```
% pip install sentencepiece
```
To build and install the wrapper manually, you need to install SentencePiece C++ in advance, and then try the following commands:
```
% python setup.py build
% sudo python setup.py install
```
If you don’t have write permission to the global site-packages directory or don’t want to install into it, please try:
```
% python setup.py install --user
```
## Usage
### Segmentation
```
% python
>>> import sentencepiece as spm
>>> sp = spm.SentencePieceProcessor()
>>> sp.Load("test/test_model.model")
True
>>> sp.EncodeAsPieces("This is a test")
['\xe2\x96\x81This', '\xe2\x96\x81is', '\xe2\x96\x81a', '\xe2\x96\x81', 't', 'est']
>>> sp.EncodeAsIds("This is a test")
[284, 47, 11, 4, 15, 400]
>>> sp.DecodePieces(['\xe2\x96\x81This', '\xe2\x96\x81is', '\xe2\x96\x81a', '\xe2\x96\x81', 't', 'est'])
'This is a test'
>>> sp.NBestEncode("This is a test", 5)
[['\xe2\x96\x81This', '\xe2\x96\x81is', '\xe2\x96\x81a', '\xe2\x96\x81', 't', 'est'], ['\xe2\x96\x81This', '\xe2\x96\x81is', '\xe2\x96\x81a', '\xe2\x96\x81', 'te', 'st'], ['\xe2\x96\x81This', '\xe2\x96\x81is', '\xe2\x96\x81a', '\xe2\x96\x81', 'te', 's', 't'], ['\xe2\x96\x81This', '\xe2\x96\x81is', '\xe2\x96\x81a', '\xe2\x96\x81', 't', 'e', 'st'], ['\xe2\x96\x81This', '\xe2\x96\x81is', '\xe2\x96\x81a', '\xe2\x96\x81', 't', 'es', 't']]
>>> for x in range(10):
... sp.SampleEncode("This is a test", -1, 0.1)
...
['\xe2\x96\x81', 'T', 'h', 'i', 's', '\xe2\x96\x81', 'is', '\xe2\x96\x81a', '\xe2\x96\x81', 't', 'e', 's', 't']
['\xe2\x96\x81T', 'h', 'is', '\xe2\x96\x81is', '\xe2\x96\x81', 'a', '\xe2\x96\x81', 't', 'est']
['\xe2\x96\x81This', '\xe2\x96\x81is', '\xe2\x96\x81', 'a', '\xe2\x96\x81', 't', 'e', 'st']
['\xe2\x96\x81This', '\xe2\x96\x81is', '\xe2\x96\x81a', '\xe2\x96\x81', 't', 'e', 'st']
['\xe2\x96\x81This', '\xe2\x96\x81is', '\xe2\x96\x81a', '\xe2\x96\x81', 't', 'e', 's', 't']
['\xe2\x96\x81T', 'h', 'is', '\xe2\x96\x81', 'i', 's', '\xe2\x96\x81a', '\xe2\x96\x81', 'te', 's', 't']
['\xe2\x96\x81This', '\xe2\x96\x81', 'is', '\xe2\x96\x81a', '\xe2\x96\x81', 'te', 's', 't']
['\xe2\x96\x81This', '\xe2\x96\x81', 'i', 's', '\xe2\x96\x81a', '\xe2\x96\x81', 't', 'e', 'st']
['\xe2\x96\x81This', '\xe2\x96\x81', 'is', '\xe2\x96\x81', 'a', '\xe2\x96\x81', 't', 'e', 'st']
['\xe2\x96\x81This', '\xe2\x96\x81', 'i', 's', '\xe2\x96\x81', 'a', '\xe2\x96\x81', 'te', 's', 't']
>>> sp.DecodeIds([284, 47, 11, 4, 15, 400])
'This is a test'
>>> sp.GetPieceSize()
1000
>>> sp.IdToPiece(2)
'</s>'
>>> sp.PieceToId('</s>')
2
>>> len(sp)
1000
>>> sp['</s>']
2
```
### Model Training
Training is peformed by passing parameters of [spm_train](https://github.com/google/sentencepiece#train-sentencepiece-model) to SentencePieceTrainer.Train() function.
```
>>> import sentencepiece as spm
>>> spm.SentencePieceTrainer.Train('--input=test/botchan.txt --model_prefix=m --vocab_size=1000')
unigram_model_trainer.cc(494) LOG(INFO) Starts training with :
input: "test/botchan.txt"
model_prefix: "m"
model_type: UNIGRAM
..snip..
unigram_model_trainer.cc(529) LOG(INFO) EM sub_iter=0 size=1239 obj=10.4055 num_tokens=36256 num_tokens/piece=29.2623
unigram_model_trainer.cc(529) LOG(INFO) EM sub_iter=1 size=1239 obj=10.3187 num_tokens=36256 num_tokens/piece=29.2623
unigram_model_trainer.cc(529) LOG(INFO) EM sub_iter=0 size=1100 obj=10.5285 num_tokens=37633 num_tokens/piece=34.2118
unigram_model_trainer.cc(529) LOG(INFO) EM sub_iter=1 size=1100 obj=10.4973 num_tokens=37630 num_tokens/piece=34.2091
trainer_interface.cc(284) LOG(INFO) Saving model: m.model
trainer_interface.cc(293) LOG(INFO) Saving vocabs: m.vocab
>>>
```
## Python2/3 String/Unicode compatibility
Sentencepiece python wrapper accepts both Unicode string and legacy byte string.
The output string type is determined by the input string type.
The output type of IdToPiece/DecodeIds methods is *str*, but note that it is a legacy byte string in Python2 and Unicode string in Python3 respectively.
* Python2:
```
>>> sp.Encode('吾輩は猫である')
['\xe2\x96\x81', '\xe5\x90\xbe', '\xe8\xbc\xa9', '\xe3\x81\xaf', '\xe7\x8c\xab', '\xe3\x81\xa7\xe3\x81\x82\xe3\x82\x8b']
>>> sp.Encode(u'吾輩は猫である')
[u'\u2581', u'\u543e', u'\u8f29', u'\u306f', u'\u732b', u'\u3067\u3042\u308b']
>>> sp.Encode(u'吾輩は猫である'.encode('utf-8'))
['\xe2\x96\x81', '\xe5\x90\xbe', '\xe8\xbc\xa9', '\xe3\x81\xaf', '\xe7\x8c\xab', '\xe3\x81\xa7\xe3\x81\x82\xe3\x82\x8b']
>>> sp.IdToPiece(10)
'\xe3\x81\xab'
>>> type(sp.IdToPiece(10))
<type 'str'>
```
* Python3:
```
>>> sp.Encode('吾輩は猫である')
['▁', '吾', '輩', 'は', '猫', 'である']
>>> sp.Encode('吾輩は猫である'.encode('utf-8'))
[b'\xe2\x96\x81', b'\xe5\x90\xbe', b'\xe8\xbc\xa9', b'\xe3\x81\xaf', b'\xe7\x8c\xab', b'\xe3\x81\xa7\xe3\x81\x82\xe3\x82\x8b']
>>>
>>> sp.IdToPiece(10)
'に'
>>> type(sp.IdToPiece(10))
<class 'str'>
```
Project details
Release history Release notifications | RSS feed
Download files
Download the file for your platform. If you're not sure which to choose, learn more about installing packages.
Source Distribution
sentencepiece-0.0.6.tar.gz
(492.3 kB
view hashes)
Built Distributions
Close
Hashes for sentencepiece-0.0.6-cp36-cp36m-manylinux1_x86_64.whl
Algorithm | Hash digest | |
---|---|---|
SHA256 | 1155894efefa102d759c9fc7df37c7ff0a8f08e4ee484196b175116a75e73b8c |
|
MD5 | 3093b8328c333634d231a0591550ef18 |
|
BLAKE2b-256 | 89a5780157c15e2d4e64ebe7e233ed27d4ddeb5b8100fbd1cd361b4d0d87fa71 |
Close
Hashes for sentencepiece-0.0.6-cp36-cp36m-manylinux1_i686.whl
Algorithm | Hash digest | |
---|---|---|
SHA256 | ae26889c635848e20a8ea9a22ec337b774ced751b475eb21d1aff858e42b38c9 |
|
MD5 | 4ad968a56a9616de7fbe0a56a9c2bcf8 |
|
BLAKE2b-256 | b2db064f42ca61d26f02e980ed9d0b220541a08451a552f3164f480c4f629375 |
Close
Hashes for sentencepiece-0.0.6-cp35-cp35m-manylinux1_x86_64.whl
Algorithm | Hash digest | |
---|---|---|
SHA256 | 0b69486f2b3532735a34307befc1173ed27d71fdcf10cbe94ca6e5d43e717c59 |
|
MD5 | c127e60d0fa5a9267f7527af7a49239b |
|
BLAKE2b-256 | 7ef73e502d84e4b03a7f8ee39eea9e079402b73020bca8e6a13171441fcf704e |
Close
Hashes for sentencepiece-0.0.6-cp35-cp35m-manylinux1_i686.whl
Algorithm | Hash digest | |
---|---|---|
SHA256 | 9a1d149534fe528336a168bc9cdcfaea9c0b0a58d427dac0c765bba188a817da |
|
MD5 | 63e1882fd1e528278db03b061bff5a23 |
|
BLAKE2b-256 | 3a66dd9e79d228d9d677fdcc6566bfb3e548c7b420530cfb233cb6a65b6cd3bb |
Close
Hashes for sentencepiece-0.0.6-cp34-cp34m-manylinux1_x86_64.whl
Algorithm | Hash digest | |
---|---|---|
SHA256 | 5e03c1782e81df6c26c532632f6231764906e42efde36fbd89224c2f56a2c54e |
|
MD5 | 798c392614e641835b125c1f7b3df7d2 |
|
BLAKE2b-256 | 0bf72de8a524d0e44adb0052f9ee67258048b6f8ce64aacbb35586dbe42ae608 |
Close
Hashes for sentencepiece-0.0.6-cp34-cp34m-manylinux1_i686.whl
Algorithm | Hash digest | |
---|---|---|
SHA256 | 35d2b7be827d784e2ea7bb8c0ce6e8fa28ed07521b9d0a6a4afe1f5769fcb742 |
|
MD5 | 25001e52f744ae2355dd8b897b4d269d |
|
BLAKE2b-256 | 080795ff4738933ee109dd2edc5ac6ed6ce0d8e3b17761cbce734e863edfbef8 |
Close
Hashes for sentencepiece-0.0.6-cp33-cp33m-manylinux1_x86_64.whl
Algorithm | Hash digest | |
---|---|---|
SHA256 | 46f3ff24afd75b62c67aceedfe9900f7325ce67d3aac3c06b0cc8e0a971cebcf |
|
MD5 | f2063f9efd393d281b2b3a7807b40aa7 |
|
BLAKE2b-256 | 0d0230bed1d5fb6e4c2729e6b9596f82d8e9b59c87919bfbea7878f47901006b |
Close
Hashes for sentencepiece-0.0.6-cp33-cp33m-manylinux1_i686.whl
Algorithm | Hash digest | |
---|---|---|
SHA256 | 786964dfab3f9061969b114cc1aa18460b0525ef20fe73f98dc6f6858442f372 |
|
MD5 | 1ee2ff5123983be569db41dbc49313b4 |
|
BLAKE2b-256 | 56976daa76674d24e235accead0f1e4eb811839e28f368350deb6d03f622309f |
Close
Hashes for sentencepiece-0.0.6-cp27-cp27mu-manylinux1_x86_64.whl
Algorithm | Hash digest | |
---|---|---|
SHA256 | 2795c949885a9e13ddceb8f1f364892e69cb49a674534af3b20405b2d087dbd3 |
|
MD5 | b2a05788fbbcbe932cbe00412cc866df |
|
BLAKE2b-256 | 2bd219c235803c524252363155c49514ec725d62e3738f460d33850c40f1df9b |
Close
Hashes for sentencepiece-0.0.6-cp27-cp27mu-manylinux1_i686.whl
Algorithm | Hash digest | |
---|---|---|
SHA256 | f200a927950bf1bdfb4b9c4724459e74fbe48b458dfb98afdf6a74921ad4dfa4 |
|
MD5 | 41d8ca99c47fe52689c066b6e15f6f30 |
|
BLAKE2b-256 | d972f0341c31a4b88ff04cdb547f76dc1f4208f455a09fc7da9685916b29c24e |
Close
Hashes for sentencepiece-0.0.6-cp27-cp27m-manylinux1_x86_64.whl
Algorithm | Hash digest | |
---|---|---|
SHA256 | 69df81c3aad62aa40d8b85d96c0ab77c70e1fe069c3cd294b2e600fd4c2c7ba0 |
|
MD5 | 1d5c9ba6f8c4c3d71cf7c8aeb81df3df |
|
BLAKE2b-256 | a95b55729d35a0f41db657b7a306d6767ecdfa7bc0dfdc1253b0d46d56f1a553 |
Close
Hashes for sentencepiece-0.0.6-cp27-cp27m-manylinux1_i686.whl
Algorithm | Hash digest | |
---|---|---|
SHA256 | f7b7aa8b96c41f149b031e1b638f4d2abd82a4a0e12b1909760a97b58dd0ce6b |
|
MD5 | ca0a402a7465a9ef9adbb939c2c2e784 |
|
BLAKE2b-256 | 0fc27c81e65d7de15a927a9c8c2855267363982f6cfcaa1e0d0f64b84f2f298c |