
# SentencePiece Python Wrapper

Python wrapper for SentencePiece with SWIG. This module wraps the sentencepiece::SentencePieceProcessor class with the following modifications:
* The Encode and Decode methods are re-defined as EncodeAsIds, EncodeAsPieces, DecodeIds, and DecodePieces respectively.
* Model training is supported via the SentencePieceTrainer.Train method.
* The SentencePieceText proto is not supported.
* Added __len__ and __getitem__ methods: len(obj) returns the vocab size and obj[key] returns the vocab id.

## Build and Install SentencePiece
For Linux (x64/i686) environments, you can simply use the pip command to install the SentencePiece Python module.

```
% pip install sentencepiece
```

Note that binary wheel packages are not available for non-Linux environments, including macOS, Windows, and Linux (arm).
On those platforms, you need to install the [SentencePiece C++](https://github.com/google/sentencepiece#c-from-source) library in advance.

To build and install the Python wrapper manually, please install [SentencePiece C++](https://github.com/google/sentencepiece#c-from-source) and try the following commands:
```
% python setup.py build
% sudo python setup.py install
```

If you don’t have write permission to the global site-packages directory or don’t want to install into it, please try:
```
% python setup.py install --user
```

## Usage

### Segmentation
```
% python
>>> import sentencepiece as spm
>>> sp = spm.SentencePieceProcessor()
>>> sp.Load("test/test_model.model")
True
>>> sp.EncodeAsPieces("This is a test")
['\xe2\x96\x81This', '\xe2\x96\x81is', '\xe2\x96\x81a', '\xe2\x96\x81', 't', 'est']
>>> sp.EncodeAsIds("This is a test")
[284, 47, 11, 4, 15, 400]
>>> sp.DecodePieces(['\xe2\x96\x81This', '\xe2\x96\x81is', '\xe2\x96\x81a', '\xe2\x96\x81', 't', 'est'])
'This is a test'
>>> sp.NBestEncode("This is a test", 5)
[['\xe2\x96\x81This', '\xe2\x96\x81is', '\xe2\x96\x81a', '\xe2\x96\x81', 't', 'est'], ['\xe2\x96\x81This', '\xe2\x96\x81is', '\xe2\x96\x81a', '\xe2\x96\x81', 'te', 'st'], ['\xe2\x96\x81This', '\xe2\x96\x81is', '\xe2\x96\x81a', '\xe2\x96\x81', 'te', 's', 't'], ['\xe2\x96\x81This', '\xe2\x96\x81is', '\xe2\x96\x81a', '\xe2\x96\x81', 't', 'e', 'st'], ['\xe2\x96\x81This', '\xe2\x96\x81is', '\xe2\x96\x81a', '\xe2\x96\x81', 't', 'es', 't']]
>>> for x in range(10):
...   sp.SampleEncode("This is a test", -1, 0.1)
...
['\xe2\x96\x81', 'T', 'h', 'i', 's', '\xe2\x96\x81', 'is', '\xe2\x96\x81a', '\xe2\x96\x81', 't', 'e', 's', 't']
['\xe2\x96\x81T', 'h', 'is', '\xe2\x96\x81is', '\xe2\x96\x81', 'a', '\xe2\x96\x81', 't', 'est']
['\xe2\x96\x81This', '\xe2\x96\x81is', '\xe2\x96\x81', 'a', '\xe2\x96\x81', 't', 'e', 'st']
['\xe2\x96\x81This', '\xe2\x96\x81is', '\xe2\x96\x81a', '\xe2\x96\x81', 't', 'e', 'st']
['\xe2\x96\x81This', '\xe2\x96\x81is', '\xe2\x96\x81a', '\xe2\x96\x81', 't', 'e', 's', 't']
['\xe2\x96\x81T', 'h', 'is', '\xe2\x96\x81', 'i', 's', '\xe2\x96\x81a', '\xe2\x96\x81', 'te', 's', 't']
['\xe2\x96\x81This', '\xe2\x96\x81', 'is', '\xe2\x96\x81a', '\xe2\x96\x81', 'te', 's', 't']
['\xe2\x96\x81This', '\xe2\x96\x81', 'i', 's', '\xe2\x96\x81a', '\xe2\x96\x81', 't', 'e', 'st']
['\xe2\x96\x81This', '\xe2\x96\x81', 'is', '\xe2\x96\x81', 'a', '\xe2\x96\x81', 't', 'e', 'st']
['\xe2\x96\x81This', '\xe2\x96\x81', 'i', 's', '\xe2\x96\x81', 'a', '\xe2\x96\x81', 'te', 's', 't']
>>> sp.DecodeIds([284, 47, 11, 4, 15, 400])
'This is a test'
>>> sp.GetPieceSize()
1000
>>> sp.IdToPiece(2)
'</s>'
>>> sp.PieceToId('</s>')
2
>>> len(sp)
1000
>>> sp['</s>']
2
```
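The `'\xe2\x96\x81'` pieces in the session above are the U+2581 meta symbol ("▁") that SentencePiece uses to mark whitespace. As an illustration only (not part of the wrapper's API), piece-to-text decoding can be sketched as joining the pieces and mapping that marker back to a space:

```python
def decode_pieces(pieces):
    """Sketch of what DecodePieces does conceptually.

    SentencePiece marks word-initial whitespace with U+2581 ('▁'),
    so decoding is concatenation plus marker replacement.
    """
    return "".join(pieces).replace("\u2581", " ").lstrip()

pieces = ["\u2581This", "\u2581is", "\u2581a", "\u2581", "t", "est"]
print(decode_pieces(pieces))  # → This is a test
```

This is why the different segmentations returned by NBestEncode and SampleEncode all decode to the same surface string: only the piece boundaries differ, not the underlying characters.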

### Model Training
Training is performed by passing the parameters of [spm_train](https://github.com/google/sentencepiece#train-sentencepiece-model) to the SentencePieceTrainer.Train() function.

```
>>> import sentencepiece as spm
>>> spm.SentencePieceTrainer.Train('--input=test/botchan.txt --model_prefix=m --vocab_size=1000')
unigram_model_trainer.cc(494) LOG(INFO) Starts training with :
input: "test/botchan.txt"
model_prefix: "m"
model_type: UNIGRAM
..snip..
unigram_model_trainer.cc(529) LOG(INFO) EM sub_iter=0 size=1239 obj=10.4055 num_tokens=36256 num_tokens/piece=29.2623
unigram_model_trainer.cc(529) LOG(INFO) EM sub_iter=1 size=1239 obj=10.3187 num_tokens=36256 num_tokens/piece=29.2623
unigram_model_trainer.cc(529) LOG(INFO) EM sub_iter=0 size=1100 obj=10.5285 num_tokens=37633 num_tokens/piece=34.2118
unigram_model_trainer.cc(529) LOG(INFO) EM sub_iter=1 size=1100 obj=10.4973 num_tokens=37630 num_tokens/piece=34.2091
trainer_interface.cc(284) LOG(INFO) Saving model: m.model
trainer_interface.cc(293) LOG(INFO) Saving vocabs: m.vocab
>>>
```
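Train() takes a single flag string in spm_train format. As a convenience sketch (build_train_args is a hypothetical helper, not part of the library), such a string can be assembled from keyword arguments:

```python
def build_train_args(**params):
    # Assemble an spm_train-style flag string, e.g.
    # '--input=... --model_prefix=... --vocab_size=...'.
    return " ".join(f"--{k}={v}" for k, v in sorted(params.items()))

args = build_train_args(input="test/botchan.txt", model_prefix="m", vocab_size=1000)
print(args)  # → --input=test/botchan.txt --model_prefix=m --vocab_size=1000
```

The resulting string can then be passed directly as spm.SentencePieceTrainer.Train(args).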

## Python2/3 String/Unicode compatibility
The SentencePiece Python wrapper accepts both Unicode strings and legacy byte strings.
The output string type is determined by the input string type.
The output type of the IdToPiece/DecodeIds methods is *str*, but note that this is a legacy byte string in Python 2 and a Unicode string in Python 3.

* Python2:
```
>>> sp.Encode('吾輩は猫である')
['\xe2\x96\x81', '\xe5\x90\xbe', '\xe8\xbc\xa9', '\xe3\x81\xaf', '\xe7\x8c\xab', '\xe3\x81\xa7\xe3\x81\x82\xe3\x82\x8b']
>>> sp.Encode(u'吾輩は猫である')
[u'\u2581', u'\u543e', u'\u8f29', u'\u306f', u'\u732b', u'\u3067\u3042\u308b']
>>> sp.Encode(u'吾輩は猫である'.encode('utf-8'))
['\xe2\x96\x81', '\xe5\x90\xbe', '\xe8\xbc\xa9', '\xe3\x81\xaf', '\xe7\x8c\xab', '\xe3\x81\xa7\xe3\x81\x82\xe3\x82\x8b']
>>> sp.IdToPiece(10)
'\xe3\x81\xab'
>>> type(sp.IdToPiece(10))
<type 'str'>
```

* Python3:
```
>>> sp.Encode('吾輩は猫である')
['▁', '吾', '輩', 'は', '猫', 'である']
>>> sp.Encode('吾輩は猫である'.encode('utf-8'))
[b'\xe2\x96\x81', b'\xe5\x90\xbe', b'\xe8\xbc\xa9', b'\xe3\x81\xaf', b'\xe7\x8c\xab', b'\xe3\x81\xa7\xe3\x81\x82\xe3\x82\x8b']
>>>
>>> sp.IdToPiece(10)
'に'
>>> type(sp.IdToPiece(10))
<class 'str'>
```
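The byte-string pieces in the Python 3 session are simply the UTF-8 encodings of the Unicode pieces, so converting between the two representations is a plain encode/decode round trip (an illustration, not a wrapper API):

```python
# Bytes pieces as returned for bytes input in the Python 3 session above.
bytes_pieces = [b"\xe2\x96\x81", b"\xe5\x90\xbe", b"\xe8\xbc\xa9",
                b"\xe3\x81\xaf", b"\xe7\x8c\xab",
                b"\xe3\x81\xa7\xe3\x81\x82\xe3\x82\x8b"]

# Decoding each piece as UTF-8 recovers the Unicode pieces.
str_pieces = [p.decode("utf-8") for p in bytes_pieces]
print(str_pieces)  # → ['▁', '吾', '輩', 'は', '猫', 'である']
```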


