
# SentencePiece Python Wrapper

Python wrapper for SentencePiece, built with SWIG. This module wraps the sentencepiece::SentencePieceProcessor class with the following modifications:
* The Encode and Decode methods are re-defined as EncodeAsIds, EncodeAsPieces, DecodeIds, and DecodePieces, respectively.
* Supports model training via the SentencePieceTrainer.Train method.
* The SentencePieceText proto is not supported.
* Adds `__len__` and `__getitem__` methods: `len(obj)` and `obj[key]` return the vocab size and a vocab id, respectively.

## Build and Install SentencePiece
For Linux (x64/i686) environments, you can simply use the pip command to install the SentencePiece Python module.

```
% pip install sentencepiece
```

Note that a binary wheel package is not available for non-Linux environments, including macOS, Windows, and Linux (arm).
On those platforms, you need to install the [SentencePiece C++](https://github.com/google/sentencepiece#c-from-source) library in advance.

To build and install the Python wrapper manually, install [SentencePiece C++](https://github.com/google/sentencepiece#c-from-source) first and then run the following commands:
```
% python setup.py build
% sudo python setup.py install
```

If you don't have write permission to the global site-packages directory, or don't want to install into it, try:
```
% python setup.py install --user
```

## Usage

### Segmentation
```
% python
>>> import sentencepiece as spm
>>> sp = spm.SentencePieceProcessor()
>>> sp.Load("test/test_model.model")
True
>>> sp.EncodeAsPieces("This is a test")
['\xe2\x96\x81This', '\xe2\x96\x81is', '\xe2\x96\x81a', '\xe2\x96\x81', 't', 'est']
>>> sp.EncodeAsIds("This is a test")
[284, 47, 11, 4, 15, 400]
>>> sp.DecodePieces(['\xe2\x96\x81This', '\xe2\x96\x81is', '\xe2\x96\x81a', '\xe2\x96\x81', 't', 'est'])
'This is a test'
>>> sp.NBestEncode("This is a test", 5)
[['\xe2\x96\x81This', '\xe2\x96\x81is', '\xe2\x96\x81a', '\xe2\x96\x81', 't', 'est'], ['\xe2\x96\x81This', '\xe2\x96\x81is', '\xe2\x96\x81a', '\xe2\x96\x81', 'te', 'st'], ['\xe2\x96\x81This', '\xe2\x96\x81is', '\xe2\x96\x81a', '\xe2\x96\x81', 'te', 's', 't'], ['\xe2\x96\x81This', '\xe2\x96\x81is', '\xe2\x96\x81a', '\xe2\x96\x81', 't', 'e', 'st'], ['\xe2\x96\x81This', '\xe2\x96\x81is', '\xe2\x96\x81a', '\xe2\x96\x81', 't', 'es', 't']]
>>> for x in range(10):
...   sp.SampleEncode("This is a test", -1, 0.1)
...
['\xe2\x96\x81', 'T', 'h', 'i', 's', '\xe2\x96\x81', 'is', '\xe2\x96\x81a', '\xe2\x96\x81', 't', 'e', 's', 't']
['\xe2\x96\x81T', 'h', 'is', '\xe2\x96\x81is', '\xe2\x96\x81', 'a', '\xe2\x96\x81', 't', 'est']
['\xe2\x96\x81This', '\xe2\x96\x81is', '\xe2\x96\x81', 'a', '\xe2\x96\x81', 't', 'e', 'st']
['\xe2\x96\x81This', '\xe2\x96\x81is', '\xe2\x96\x81a', '\xe2\x96\x81', 't', 'e', 'st']
['\xe2\x96\x81This', '\xe2\x96\x81is', '\xe2\x96\x81a', '\xe2\x96\x81', 't', 'e', 's', 't']
['\xe2\x96\x81T', 'h', 'is', '\xe2\x96\x81', 'i', 's', '\xe2\x96\x81a', '\xe2\x96\x81', 'te', 's', 't']
['\xe2\x96\x81This', '\xe2\x96\x81', 'is', '\xe2\x96\x81a', '\xe2\x96\x81', 'te', 's', 't']
['\xe2\x96\x81This', '\xe2\x96\x81', 'i', 's', '\xe2\x96\x81a', '\xe2\x96\x81', 't', 'e', 'st']
['\xe2\x96\x81This', '\xe2\x96\x81', 'is', '\xe2\x96\x81', 'a', '\xe2\x96\x81', 't', 'e', 'st']
['\xe2\x96\x81This', '\xe2\x96\x81', 'i', 's', '\xe2\x96\x81', 'a', '\xe2\x96\x81', 'te', 's', 't']
>>> sp.DecodeIds([284, 47, 11, 4, 15, 400])
'This is a test'
>>> sp.GetPieceSize()
1000
>>> sp.IdToPiece(2)
'</s>'
>>> sp.PieceToId('</s>')
2
>>> len(sp)
1000
>>> sp['</s>']
2
```
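The escaped byte sequences in the output above are simply the UTF-8 encoding of U+2581 ('▁'), the meta symbol SentencePiece uses to mark word boundaries. As a rough sketch (a simplification, not the library's actual implementation), decoding pieces amounts to concatenation plus restoring spaces:

```python
# '\xe2\x96\x81' is the UTF-8 encoding of U+2581 ('▁'), the word-boundary marker.
assert b'\xe2\x96\x81'.decode('utf-8') == '\u2581'

pieces = ['\u2581This', '\u2581is', '\u2581a', '\u2581', 't', 'est']

# Conceptually, DecodePieces concatenates the pieces and turns each '▁'
# back into a space.
text = ''.join(pieces).replace('\u2581', ' ').strip()
print(text)  # This is a test
```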

### Model Training
Training is performed by passing the parameters of [spm_train](https://github.com/google/sentencepiece#train-sentencepiece-model) to the SentencePieceTrainer.Train() function.

```
>>> import sentencepiece as spm
>>> spm.SentencePieceTrainer.Train('--input=test/botchan.txt --model_prefix=m --vocab_size=1000')
unigram_model_trainer.cc(494) LOG(INFO) Starts training with :
input: "test/botchan.txt"
model_prefix: "m"
model_type: UNIGRAM
..snip..
unigram_model_trainer.cc(529) LOG(INFO) EM sub_iter=0 size=1239 obj=10.4055 num_tokens=36256 num_tokens/piece=29.2623
unigram_model_trainer.cc(529) LOG(INFO) EM sub_iter=1 size=1239 obj=10.3187 num_tokens=36256 num_tokens/piece=29.2623
unigram_model_trainer.cc(529) LOG(INFO) EM sub_iter=0 size=1100 obj=10.5285 num_tokens=37633 num_tokens/piece=34.2118
unigram_model_trainer.cc(529) LOG(INFO) EM sub_iter=1 size=1100 obj=10.4973 num_tokens=37630 num_tokens/piece=34.2091
trainer_interface.cc(284) LOG(INFO) Saving model: m.model
trainer_interface.cc(293) LOG(INFO) Saving vocabs: m.vocab
>>>
```
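Train() takes a single flag string, so when parameters come from variables it can be convenient to assemble that string programmatically. A small sketch using the same spm_train flags shown above (the `params` dict and its assembly are illustrative, not part of the library API):

```python
# Build the flag string for SentencePieceTrainer.Train() from a dict of
# spm_train parameters (dicts preserve insertion order in Python 3.7+).
params = {
    'input': 'test/botchan.txt',
    'model_prefix': 'm',
    'vocab_size': 1000,
}
args = ' '.join('--{}={}'.format(key, value) for key, value in params.items())
print(args)  # --input=test/botchan.txt --model_prefix=m --vocab_size=1000
# spm.SentencePieceTrainer.Train(args)  # requires sentencepiece to be installed
```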

## Python 2/3 String/Unicode compatibility
The SentencePiece Python wrapper accepts both Unicode strings and legacy byte strings.
The output string type is determined by the input string type.
The output type of the IdToPiece/DecodeIds methods is *str*, but note that this is a legacy byte string in Python 2 and a Unicode string in Python 3.

* Python2:
```
>>> sp.Encode('吾輩は猫である')
['\xe2\x96\x81', '\xe5\x90\xbe', '\xe8\xbc\xa9', '\xe3\x81\xaf', '\xe7\x8c\xab', '\xe3\x81\xa7\xe3\x81\x82\xe3\x82\x8b']
>>> sp.Encode(u'吾輩は猫である')
[u'\u2581', u'\u543e', u'\u8f29', u'\u306f', u'\u732b', u'\u3067\u3042\u308b']
>>> sp.Encode(u'吾輩は猫である'.encode('utf-8'))
['\xe2\x96\x81', '\xe5\x90\xbe', '\xe8\xbc\xa9', '\xe3\x81\xaf', '\xe7\x8c\xab', '\xe3\x81\xa7\xe3\x81\x82\xe3\x82\x8b']
>>> sp.IdToPiece(10)
'\xe3\x81\xab'
>>> type(sp.IdToPiece(10))
<type 'str'>
```

* Python3:
```
>>> sp.Encode('吾輩は猫である')
['▁', '吾', '輩', 'は', '猫', 'である']
>>> sp.Encode('吾輩は猫である'.encode('utf-8'))
[b'\xe2\x96\x81', b'\xe5\x90\xbe', b'\xe8\xbc\xa9', b'\xe3\x81\xaf', b'\xe7\x8c\xab', b'\xe3\x81\xa7\xe3\x81\x82\xe3\x82\x8b']
>>>
>>> sp.IdToPiece(10)
'に'
>>> type(sp.IdToPiece(10))
<class 'str'>
```
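The two Python 3 outputs above carry the same data in different types: decoding the UTF-8 byte pieces recovers the Unicode pieces. A small sketch:

```python
# Byte-string pieces from the Python 3 example above.
byte_pieces = [b'\xe2\x96\x81', b'\xe5\x90\xbe', b'\xe8\xbc\xa9',
               b'\xe3\x81\xaf', b'\xe7\x8c\xab',
               b'\xe3\x81\xa7\xe3\x81\x82\xe3\x82\x8b']

# Decoding each piece as UTF-8 yields the Unicode-string output.
unicode_pieces = [p.decode('utf-8') for p in byte_pieces]
print(unicode_pieces)  # ['▁', '吾', '輩', 'は', '猫', 'である']
```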



