# SentencePiece Python Wrapper

Python wrapper for SentencePiece, built with SWIG. This module wraps the sentencepiece::SentencePieceProcessor class with the following modifications:
* The Encode and Decode methods are redefined as EncodeAsIds, EncodeAsPieces, DecodeIds, and DecodePieces.
* Model training is supported through the SentencePieceTrainer.Train method.
* The SentencePieceText proto is not supported.
* `__len__` and `__getitem__` methods are added: len(obj) returns the vocabulary size, and obj[key] returns the vocabulary id of the piece given as key.

## Build and Install SentencePiece
You can install the SentencePiece Python module with the pip command.

```
% pip install sentencepiece
```

To build and install the wrapper manually, install the SentencePiece C++ library first, and then run the following commands:
```
% python setup.py build
% sudo python setup.py install
```

If you don't have write permission to the global site-packages directory, or don't want to install into it, run:
```
% python setup.py install --user
```
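After installing by any of the routes above, a quick check that the module is visible to the interpreter can save some debugging. The snippet below is only a sketch: `is_installed` is a hypothetical helper, not part of SentencePiece, and it relies on the standard library's `importlib.util.find_spec` (Python 3.4 or later).

```python
import importlib.util


def is_installed(module_name):
    """Return True if a top-level module can be found on the import path."""
    # find_spec returns None when no top-level module of that name exists.
    return importlib.util.find_spec(module_name) is not None


# Prints True once the wrapper has been installed into this interpreter.
print("sentencepiece importable:", is_installed("sentencepiece"))
```

If this prints False after a `--user` install, the likely cause is that a different Python interpreter is being used than the one that ran setup.py.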

## Usage

### Segmentation
```
% python
>>> import sentencepiece as spm
>>> sp = spm.SentencePieceProcessor()
>>> sp.Load("test/test_model.model")
True
>>> sp.EncodeAsPieces("This is a test")
['\xe2\x96\x81This', '\xe2\x96\x81is', '\xe2\x96\x81a', '\xe2\x96\x81', 't', 'est']
>>> sp.EncodeAsIds("This is a test")
[284, 47, 11, 4, 15, 400]
>>> sp.DecodePieces(['\xe2\x96\x81This', '\xe2\x96\x81is', '\xe2\x96\x81a', '\xe2\x96\x81', 't', 'est'])
'This is a test'
>>> sp.NBestEncode("This is a test", 5)
[['\xe2\x96\x81This', '\xe2\x96\x81is', '\xe2\x96\x81a', '\xe2\x96\x81', 't', 'est'], ['\xe2\x96\x81This', '\xe2\x96\x81is', '\xe2\x96\x81a', '\xe2\x96\x81', 'te', 'st'], ['\xe2\x96\x81This', '\xe2\x96\x81is', '\xe2\x96\x81a', '\xe2\x96\x81', 'te', 's', 't'], ['\xe2\x96\x81This', '\xe2\x96\x81is', '\xe2\x96\x81a', '\xe2\x96\x81', 't', 'e', 'st'], ['\xe2\x96\x81This', '\xe2\x96\x81is', '\xe2\x96\x81a', '\xe2\x96\x81', 't', 'es', 't']]
>>> for x in range(10):
...     sp.SampleEncode("This is a test", -1, 0.1)
...
['\xe2\x96\x81', 'T', 'h', 'i', 's', '\xe2\x96\x81', 'is', '\xe2\x96\x81a', '\xe2\x96\x81', 't', 'e', 's', 't']
['\xe2\x96\x81T', 'h', 'is', '\xe2\x96\x81is', '\xe2\x96\x81', 'a', '\xe2\x96\x81', 't', 'est']
['\xe2\x96\x81This', '\xe2\x96\x81is', '\xe2\x96\x81', 'a', '\xe2\x96\x81', 't', 'e', 'st']
['\xe2\x96\x81This', '\xe2\x96\x81is', '\xe2\x96\x81a', '\xe2\x96\x81', 't', 'e', 'st']
['\xe2\x96\x81This', '\xe2\x96\x81is', '\xe2\x96\x81a', '\xe2\x96\x81', 't', 'e', 's', 't']
['\xe2\x96\x81T', 'h', 'is', '\xe2\x96\x81', 'i', 's', '\xe2\x96\x81a', '\xe2\x96\x81', 'te', 's', 't']
['\xe2\x96\x81This', '\xe2\x96\x81', 'is', '\xe2\x96\x81a', '\xe2\x96\x81', 'te', 's', 't']
['\xe2\x96\x81This', '\xe2\x96\x81', 'i', 's', '\xe2\x96\x81a', '\xe2\x96\x81', 't', 'e', 'st']
['\xe2\x96\x81This', '\xe2\x96\x81', 'is', '\xe2\x96\x81', 'a', '\xe2\x96\x81', 't', 'e', 'st']
['\xe2\x96\x81This', '\xe2\x96\x81', 'i', 's', '\xe2\x96\x81', 'a', '\xe2\x96\x81', 'te', 's', 't']
>>> sp.DecodeIds([284, 47, 11, 4, 15, 400])
'This is a test'
>>> sp.GetPieceSize()
1000
>>> sp.IdToPiece(2)
'</s>'
>>> sp.PieceToId('</s>')
2
>>> len(sp)
1000
>>> sp['</s>']
2
```
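The byte sequence `\xe2\x96\x81` in the output above is the UTF-8 encoding of U+2581 (▁), the symbol SentencePiece uses to mark whitespace. The sketch below illustrates why every segmentation in the session maps back to the same sentence. `pieces_to_text` is a hypothetical helper written for illustration only, not how DecodePieces is implemented; it assumes the pieces are Unicode strings and that a single leading space should be dropped.

```python
# U+2581 (LOWER ONE EIGHTH BLOCK) marks whitespace in SentencePiece output.
META = "\u2581"


def pieces_to_text(pieces):
    """Illustrative detokenizer: join the pieces and turn each marker into a space."""
    text = "".join(pieces).replace(META, " ")
    # Drop the single space produced by the marker at the start of the sentence.
    return text[1:] if text.startswith(" ") else text


# The Python 2 style byte string shown above decodes to the marker plus "This".
assert b"\xe2\x96\x81This".decode("utf-8") == META + "This"

pieces = [META + "This", META + "is", META + "a", META, "t", "est"]
print(pieces_to_text(pieces))
```

Because the marker is part of the pieces themselves, the original spacing can be recovered from the pieces alone, which is why the n-best and sampled segmentations all represent the same input text.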

### Model Training
Training is performed by passing the parameters of [spm_train](https://github.com/google/sentencepiece#train-sentencepiece-model) to the SentencePieceTrainer.Train() function.

```
>>> import sentencepiece as spm
>>> spm.SentencePieceTrainer.Train('--input=test/botchan.txt --model_prefix=m --vocab_size=1000')
unigram_model_trainer.cc(494) LOG(INFO) Starts training with :
input: "test/botchan.txt"
model_prefix: "m"
model_type: UNIGRAM
..snip..
unigram_model_trainer.cc(529) LOG(INFO) EM sub_iter=0 size=1239 obj=10.4055 num_tokens=36256 num_tokens/piece=29.2623
unigram_model_trainer.cc(529) LOG(INFO) EM sub_iter=1 size=1239 obj=10.3187 num_tokens=36256 num_tokens/piece=29.2623
unigram_model_trainer.cc(529) LOG(INFO) EM sub_iter=0 size=1100 obj=10.5285 num_tokens=37633 num_tokens/piece=34.2118
unigram_model_trainer.cc(529) LOG(INFO) EM sub_iter=1 size=1100 obj=10.4973 num_tokens=37630 num_tokens/piece=34.2091
trainer_interface.cc(284) LOG(INFO) Saving model: m.model
trainer_interface.cc(293) LOG(INFO) Saving vocabs: m.vocab
>>>
```
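Since Train() takes a single string of spm_train flags, it can be convenient to assemble that string from a dictionary. The following is a minimal sketch: `build_train_args` is a hypothetical helper, not part of the wrapper, and it assumes Python 3.7 or later so that the dictionary keeps its insertion order. Only the three flags from the example above are used.

```python
def build_train_args(options):
    """Format a dict of spm_train options as a '--key=value ...' string."""
    return " ".join("--{}={}".format(key, value) for key, value in options.items())


args = build_train_args({
    "input": "test/botchan.txt",
    "model_prefix": "m",
    "vocab_size": 1000,
})
print(args)

# With the wrapper installed and the input file present, the string can be
# passed on exactly as in the session above:
#   import sentencepiece as spm
#   spm.SentencePieceTrainer.Train(args)
```

This sketch does no quoting or escaping, so values that contain spaces would need extra care.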

## Python2/3 String/Unicode compatibility
The SentencePiece Python wrapper accepts both Unicode strings and legacy byte strings.
The output string type is determined by the input string type.
The output type of the IdToPiece and DecodeIds methods is *str*, which is a legacy byte string in Python 2 and a Unicode string in Python 3.

* Python2:
```
>>> sp.Encode('吾輩は猫である')
['\xe2\x96\x81', '\xe5\x90\xbe', '\xe8\xbc\xa9', '\xe3\x81\xaf', '\xe7\x8c\xab', '\xe3\x81\xa7\xe3\x81\x82\xe3\x82\x8b']
>>> sp.Encode(u'吾輩は猫である')
[u'\u2581', u'\u543e', u'\u8f29', u'\u306f', u'\u732b', u'\u3067\u3042\u308b']
>>> sp.Encode(u'吾輩は猫である'.encode('utf-8'))
['\xe2\x96\x81', '\xe5\x90\xbe', '\xe8\xbc\xa9', '\xe3\x81\xaf', '\xe7\x8c\xab', '\xe3\x81\xa7\xe3\x81\x82\xe3\x82\x8b']
>>> sp.IdToPiece(10)
'\xe3\x81\xab'
>>> type(sp.IdToPiece(10))
<type 'str'>
```

* Python3:
```
>>> sp.Encode('吾輩は猫である')
['▁', '吾', '輩', 'は', '猫', 'である']
>>> sp.Encode('吾輩は猫である'.encode('utf-8'))
[b'\xe2\x96\x81', b'\xe5\x90\xbe', b'\xe8\xbc\xa9', b'\xe3\x81\xaf', b'\xe7\x8c\xab', b'\xe3\x81\xa7\xe3\x81\x82\xe3\x82\x8b']
>>>
>>> sp.IdToPiece(10)
'に'
>>> type(sp.IdToPiece(10))
<class 'str'>
```
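Because the output type follows the input type, code that handles both kinds of input may want to normalize pieces to Unicode before comparing or printing them. The sketch below targets Python 3 and uses a hypothetical helper, `to_text`, that is not part of the wrapper. The byte values are copied from the Python 3 session above, and UTF-8 is assumed as the encoding.

```python
def to_text(piece, encoding="utf-8"):
    """Return the piece as a Unicode string, decoding it if it is bytes."""
    if isinstance(piece, bytes):
        return piece.decode(encoding)
    return piece


# Pieces as returned for a UTF-8 encoded bytes input in the session above.
byte_pieces = [
    b"\xe2\x96\x81", b"\xe5\x90\xbe", b"\xe8\xbc\xa9",
    b"\xe3\x81\xaf", b"\xe7\x8c\xab", b"\xe3\x81\xa7\xe3\x81\x82\xe3\x82\x8b",
]
text_pieces = [to_text(p) for p in byte_pieces]
print(text_pieces)
```

Applied to pieces that are already Unicode strings, the helper returns them unchanged, so it is safe to call on either form of output.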

