# SentencePiece Python Wrapper

Python wrapper for SentencePiece, built with SWIG. This module wraps the sentencepiece::SentencePieceProcessor class with the following modifications:
* The Encode and Decode methods are redefined as EncodeAsIds, EncodeAsPieces, DecodeIds, and DecodePieces, respectively.
* Model training is supported through the SentencePieceTrainer.Train method.
* The SentencePieceText proto is not supported.
* `__len__` and `__getitem__` methods are added: `len(obj)` returns the vocabulary size and `obj[key]` returns the vocabulary id of the piece `key`.

## Build and Install SentencePiece
You can install the SentencePiece Python module with pip:

```
% pip install sentencepiece
```

To build and install the wrapper manually, first install the SentencePiece C++ library, then run the following commands:
```
% python setup.py build
% sudo python setup.py install
```

If you don’t have write permission to the global site-packages directory or don’t want to install into it, please try:
```
% python setup.py install --user
```

## Usage

### Segmentation
```
% python
>>> import sentencepiece as spm
>>> sp = spm.SentencePieceProcessor()
>>> sp.Load("test/test_model.model")
True
>>> sp.EncodeAsPieces("This is a test")
['\xe2\x96\x81This', '\xe2\x96\x81is', '\xe2\x96\x81a', '\xe2\x96\x81', 't', 'est']
>>> sp.EncodeAsIds("This is a test")
[284, 47, 11, 4, 15, 400]
>>> sp.DecodePieces(['\xe2\x96\x81This', '\xe2\x96\x81is', '\xe2\x96\x81a', '\xe2\x96\x81', 't', 'est'])
'This is a test'
>>> sp.NBestEncode("This is a test", 5)
[['\xe2\x96\x81This', '\xe2\x96\x81is', '\xe2\x96\x81a', '\xe2\x96\x81', 't', 'est'], ['\xe2\x96\x81This', '\xe2\x96\x81is', '\xe2\x96\x81a', '\xe2\x96\x81', 'te', 'st'], ['\xe2\x96\x81This', '\xe2\x96\x81is', '\xe2\x96\x81a', '\xe2\x96\x81', 'te', 's', 't'], ['\xe2\x96\x81This', '\xe2\x96\x81is', '\xe2\x96\x81a', '\xe2\x96\x81', 't', 'e', 'st'], ['\xe2\x96\x81This', '\xe2\x96\x81is', '\xe2\x96\x81a', '\xe2\x96\x81', 't', 'es', 't']]
>>> for x in range(10):
...     sp.SampleEncode("This is a test", -1, 0.1)
...
['\xe2\x96\x81', 'T', 'h', 'i', 's', '\xe2\x96\x81', 'is', '\xe2\x96\x81a', '\xe2\x96\x81', 't', 'e', 's', 't']
['\xe2\x96\x81T', 'h', 'is', '\xe2\x96\x81is', '\xe2\x96\x81', 'a', '\xe2\x96\x81', 't', 'est']
['\xe2\x96\x81This', '\xe2\x96\x81is', '\xe2\x96\x81', 'a', '\xe2\x96\x81', 't', 'e', 'st']
['\xe2\x96\x81This', '\xe2\x96\x81is', '\xe2\x96\x81a', '\xe2\x96\x81', 't', 'e', 'st']
['\xe2\x96\x81This', '\xe2\x96\x81is', '\xe2\x96\x81a', '\xe2\x96\x81', 't', 'e', 's', 't']
['\xe2\x96\x81T', 'h', 'is', '\xe2\x96\x81', 'i', 's', '\xe2\x96\x81a', '\xe2\x96\x81', 'te', 's', 't']
['\xe2\x96\x81This', '\xe2\x96\x81', 'is', '\xe2\x96\x81a', '\xe2\x96\x81', 'te', 's', 't']
['\xe2\x96\x81This', '\xe2\x96\x81', 'i', 's', '\xe2\x96\x81a', '\xe2\x96\x81', 't', 'e', 'st']
['\xe2\x96\x81This', '\xe2\x96\x81', 'is', '\xe2\x96\x81', 'a', '\xe2\x96\x81', 't', 'e', 'st']
['\xe2\x96\x81This', '\xe2\x96\x81', 'i', 's', '\xe2\x96\x81', 'a', '\xe2\x96\x81', 'te', 's', 't']
>>> sp.DecodeIds([284, 47, 11, 4, 15, 400])
'This is a test'
>>> sp.GetPieceSize()
1000
>>> sp.IdToPiece(2)
'</s>'
>>> sp.PieceToId('</s>')
2
>>> len(sp)
1000
>>> sp['</s>']
2
```
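
The `\xe2\x96\x81` bytes in the output above are the UTF-8 encoding of U+2581 (▁), the meta symbol SentencePiece uses to mark whitespace. As a rough illustration of how pieces map back to text, the pure-Python sketch below joins the pieces and turns the marker back into spaces. The name `detokenize` is hypothetical and not part of the wrapper, and the sketch assumes the default behaviour in which a dummy whitespace is prepended to the input; the real DecodePieces also handles normalization and special pieces.

```python
# -*- coding: utf-8 -*-
# Illustrative sketch only: `detokenize` is a hypothetical helper, not a
# function provided by the sentencepiece module.

WHITESPACE_MARKER = u"\u2581"  # "▁", encoded in UTF-8 as b"\xe2\x96\x81"


def detokenize(pieces):
    """Join Unicode pieces and map the whitespace marker back to spaces."""
    text = u"".join(pieces).replace(WHITESPACE_MARKER, u" ")
    # Assumption: one dummy leading whitespace was added during encoding,
    # so a single leading space is dropped here.
    return text[1:] if text.startswith(u" ") else text


pieces = [u"\u2581This", u"\u2581is", u"\u2581a", u"\u2581", u"t", u"est"]
print(detokenize(pieces))  # This is a test
```

In practice, sp.DecodePieces should be used; the sketch only shows why the marker appears in front of each word.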

### Model Training
Training is performed by passing the parameters of [spm_train](https://github.com/google/sentencepiece#train-sentencepiece-model) to the SentencePieceTrainer.Train() function.

```
>>> import sentencepiece as spm
>>> spm.SentencePieceTrainer.Train('--input=test/botchan.txt --model_prefix=m --vocab_size=1000')
unigram_model_trainer.cc(494) LOG(INFO) Starts training with :
input: "test/botchan.txt"
model_prefix: "m"
model_type: UNIGRAM
..snip..
unigram_model_trainer.cc(529) LOG(INFO) EM sub_iter=0 size=1239 obj=10.4055 num_tokens=36256 num_tokens/piece=29.2623
unigram_model_trainer.cc(529) LOG(INFO) EM sub_iter=1 size=1239 obj=10.3187 num_tokens=36256 num_tokens/piece=29.2623
unigram_model_trainer.cc(529) LOG(INFO) EM sub_iter=0 size=1100 obj=10.5285 num_tokens=37633 num_tokens/piece=34.2118
unigram_model_trainer.cc(529) LOG(INFO) EM sub_iter=1 size=1100 obj=10.4973 num_tokens=37630 num_tokens/piece=34.2091
trainer_interface.cc(284) LOG(INFO) Saving model: m.model
trainer_interface.cc(293) LOG(INFO) Saving vocabs: m.vocab
>>>
```
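
Because Train() takes a single string of spm_train flags, it can be convenient to assemble that string from a list of options. The sketch below shows one possible way to do this; `build_train_args` is a hypothetical helper, not part of the wrapper, and the flag names are the ones used in the example above.

```python
# Illustrative sketch only: `build_train_args` is a hypothetical helper, not
# part of the sentencepiece module.


def build_train_args(options):
    """Format (name, value) pairs as a '--name=value ...' flag string."""
    return " ".join("--{}={}".format(name, value) for name, value in options)


args = build_train_args([
    ("input", "test/botchan.txt"),
    ("model_prefix", "m"),
    ("vocab_size", 1000),
])
print(args)
# --input=test/botchan.txt --model_prefix=m --vocab_size=1000

# With the wrapper installed, the string would then be passed on:
#   import sentencepiece as spm
#   spm.SentencePieceTrainer.Train(args)
```

The helper does no quoting, so it only suits values without spaces.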

## Python2/3 String/Unicode compatibility
The SentencePiece Python wrapper accepts both Unicode strings and legacy byte strings.
The output string type is determined by the input string type.
The output type of the IdToPiece and DecodeIds methods is *str*, which is a legacy byte string in Python 2 and a Unicode string in Python 3.

* Python2:
```
>>> sp.Encode('吾輩は猫である')
['\xe2\x96\x81', '\xe5\x90\xbe', '\xe8\xbc\xa9', '\xe3\x81\xaf', '\xe7\x8c\xab', '\xe3\x81\xa7\xe3\x81\x82\xe3\x82\x8b']
>>> sp.Encode(u'吾輩は猫である')
[u'\u2581', u'\u543e', u'\u8f29', u'\u306f', u'\u732b', u'\u3067\u3042\u308b']
>>> sp.Encode(u'吾輩は猫である'.encode('utf-8'))
['\xe2\x96\x81', '\xe5\x90\xbe', '\xe8\xbc\xa9', '\xe3\x81\xaf', '\xe7\x8c\xab', '\xe3\x81\xa7\xe3\x81\x82\xe3\x82\x8b']
>>> sp.IdToPiece(10)
'\xe3\x81\xab'
>>> type(sp.IdToPiece(10))
<type 'str'>
```

* Python3:
```
>>> sp.Encode('吾輩は猫である')
['▁', '吾', '輩', 'は', '猫', 'である']
>>> sp.Encode('吾輩は猫である'.encode('utf-8'))
[b'\xe2\x96\x81', b'\xe5\x90\xbe', b'\xe8\xbc\xa9', b'\xe3\x81\xaf', b'\xe7\x8c\xab', b'\xe3\x81\xa7\xe3\x81\x82\xe3\x82\x8b']
>>>
>>> sp.IdToPiece(10)
'に'
>>> type(sp.IdToPiece(10))
<class 'str'>
```
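
Since the output type follows the input type, code that must run on both Python 2 and Python 3 may want to normalize pieces to Unicode after encoding. Below is a minimal sketch, assuming that byte-string pieces are UTF-8 encoded as in the examples above; `to_unicode` is a hypothetical helper, not part of the wrapper.

```python
# -*- coding: utf-8 -*-
# Illustrative sketch only: `to_unicode` is a hypothetical helper, not part of
# the sentencepiece module.


def to_unicode(pieces):
    """Return pieces as Unicode strings, decoding UTF-8 byte strings."""
    return [p.decode("utf-8") if isinstance(p, bytes) else p for p in pieces]


# The first two byte-string pieces from the examples above.
byte_pieces = [b"\xe2\x96\x81", b"\xe5\x90\xbe"]
text_pieces = to_unicode(byte_pieces)
print(text_pieces == [u"\u2581", u"\u543e"])  # True
```

Pieces that are already Unicode pass through unchanged, so the helper can be applied regardless of which input type was used.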

