# SentencePiece Python Wrapper

Python wrapper for SentencePiece, built with SWIG. This module wraps the sentencepiece::SentencePieceProcessor class with the following modifications:
* The Encode and Decode methods are redefined as EncodeAsIds, EncodeAsPieces, DecodeIds, and DecodePieces, respectively.
* Model training is supported through the SentencePieceTrainer.Train method.
* The SentencePieceText proto is not supported.
* `__len__` and `__getitem__` methods are added. len(obj) returns the vocabulary size and obj[key] returns the vocabulary id of a piece.

## Build and Install SentencePiece
On Linux (x64/i686), you can install the SentencePiece Python module with the pip command.

```
% pip install sentencepiece
```

Note that binary wheel packages are not available for non-Linux environments, including macOS, Windows, and Linux (arm).
On those platforms, you need to install the [SentencePiece C++](https://github.com/google/sentencepiece#c-from-source) library in advance.

To build and install the Python wrapper manually, install [SentencePiece C++](https://github.com/google/sentencepiece#c-from-source) and run the following commands:
```
% python setup.py build
% sudo python setup.py install
```

If you don’t have write permission to the global site-packages directory or don’t want to install into it, please try:
```
% python setup.py install --user
```
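
After installing by either route, it can help to confirm that the module is importable before going further. The following is a minimal sketch using only the standard library; the helper name `is_installed` is hypothetical and not part of SentencePiece.

```python
import importlib.util


def is_installed(module_name):
    """Return True if a top-level module can be found on the import path."""
    return importlib.util.find_spec(module_name) is not None


if __name__ == "__main__":
    if is_installed("sentencepiece"):
        print("sentencepiece is importable")
    else:
        print("sentencepiece not found; see the install steps above")
```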

## Usage

### Segmentation
```
% python
>>> import sentencepiece as spm
>>> sp = spm.SentencePieceProcessor()
>>> sp.Load("test/test_model.model")
True
>>> sp.EncodeAsPieces("This is a test")
['\xe2\x96\x81This', '\xe2\x96\x81is', '\xe2\x96\x81a', '\xe2\x96\x81', 't', 'est']
>>> sp.EncodeAsIds("This is a test")
[284, 47, 11, 4, 15, 400]
>>> sp.DecodePieces(['\xe2\x96\x81This', '\xe2\x96\x81is', '\xe2\x96\x81a', '\xe2\x96\x81', 't', 'est'])
'This is a test'
>>> sp.NBestEncodeAsPieces("This is a test", 5)
[['\xe2\x96\x81This', '\xe2\x96\x81is', '\xe2\x96\x81a', '\xe2\x96\x81', 't', 'est'], ['\xe2\x96\x81This', '\xe2\x96\x81is', '\xe2\x96\x81a', '\xe2\x96\x81', 'te', 'st'], ['\xe2\x96\x81This', '\xe2\x96\x81is', '\xe2\x96\x81a', '\xe2\x96\x81', 'te', 's', 't'], ['\xe2\x96\x81This', '\xe2\x96\x81is', '\xe2\x96\x81a', '\xe2\x96\x81', 't', 'e', 'st'], ['\xe2\x96\x81This', '\xe2\x96\x81is', '\xe2\x96\x81a', '\xe2\x96\x81', 't', 'es', 't']]
>>> for x in range(10):
...     sp.SampleEncodeAsPieces("This is a test", -1, 0.1)
...
['\xe2\x96\x81', 'T', 'h', 'i', 's', '\xe2\x96\x81', 'is', '\xe2\x96\x81a', '\xe2\x96\x81', 't', 'e', 's', 't']
['\xe2\x96\x81T', 'h', 'is', '\xe2\x96\x81is', '\xe2\x96\x81', 'a', '\xe2\x96\x81', 't', 'est']
['\xe2\x96\x81This', '\xe2\x96\x81is', '\xe2\x96\x81', 'a', '\xe2\x96\x81', 't', 'e', 'st']
['\xe2\x96\x81This', '\xe2\x96\x81is', '\xe2\x96\x81a', '\xe2\x96\x81', 't', 'e', 'st']
['\xe2\x96\x81This', '\xe2\x96\x81is', '\xe2\x96\x81a', '\xe2\x96\x81', 't', 'e', 's', 't']
['\xe2\x96\x81T', 'h', 'is', '\xe2\x96\x81', 'i', 's', '\xe2\x96\x81a', '\xe2\x96\x81', 'te', 's', 't']
['\xe2\x96\x81This', '\xe2\x96\x81', 'is', '\xe2\x96\x81a', '\xe2\x96\x81', 'te', 's', 't']
['\xe2\x96\x81This', '\xe2\x96\x81', 'i', 's', '\xe2\x96\x81a', '\xe2\x96\x81', 't', 'e', 'st']
['\xe2\x96\x81This', '\xe2\x96\x81', 'is', '\xe2\x96\x81', 'a', '\xe2\x96\x81', 't', 'e', 'st']
['\xe2\x96\x81This', '\xe2\x96\x81', 'i', 's', '\xe2\x96\x81', 'a', '\xe2\x96\x81', 'te', 's', 't']
>>> sp.DecodeIds([284, 47, 11, 4, 15, 400])
'This is a test'
>>> sp.GetPieceSize()
1000
>>> sp.IdToPiece(2)
'</s>'
>>> sp.PieceToId('</s>')
2
>>> len(sp)
1000
>>> sp['</s>']
2
```
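
The `'\xe2\x96\x81'` prefix in the output above is the UTF-8 encoding of U+2581 (▁), which SentencePiece uses to mark where a space stood in the original text. As a rough illustration of why DecodePieces recovers the input, the sketch below joins pieces and turns the marker back into spaces. It is a pure-Python approximation for this example only, not the library's implementation, and `join_pieces` is a hypothetical helper.

```python
# U+2581 (LOWER ONE EIGHTH BLOCK); its UTF-8 bytes are e2 96 81.
SPACE_MARKER = "\u2581"


def join_pieces(pieces):
    """Approximate detokenization: concatenate pieces, then restore spaces."""
    text = "".join(pieces).replace(SPACE_MARKER, " ")
    # The first piece starts with a marker even though the input had no
    # leading space, so drop that one leading space if present.
    return text[1:] if text.startswith(" ") else text


pieces = ["\u2581This", "\u2581is", "\u2581a", "\u2581", "t", "est"]
print(join_pieces(pieces))
```

For real use, prefer sp.DecodePieces, which also handles cases this sketch ignores.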

### Model Training
Training is performed by passing the parameters of [spm_train](https://github.com/google/sentencepiece#train-sentencepiece-model) to the SentencePieceTrainer.Train() function.

```
>>> import sentencepiece as spm
>>> spm.SentencePieceTrainer.Train('--input=test/botchan.txt --model_prefix=m --vocab_size=1000')
unigram_model_trainer.cc(494) LOG(INFO) Starts training with :
input: "test/botchan.txt"
model_prefix: "m"
model_type: UNIGRAM
..snip..
unigram_model_trainer.cc(529) LOG(INFO) EM sub_iter=0 size=1239 obj=10.4055 num_tokens=36256 num_tokens/piece=29.2623
unigram_model_trainer.cc(529) LOG(INFO) EM sub_iter=1 size=1239 obj=10.3187 num_tokens=36256 num_tokens/piece=29.2623
unigram_model_trainer.cc(529) LOG(INFO) EM sub_iter=0 size=1100 obj=10.5285 num_tokens=37633 num_tokens/piece=34.2118
unigram_model_trainer.cc(529) LOG(INFO) EM sub_iter=1 size=1100 obj=10.4973 num_tokens=37630 num_tokens/piece=34.2091
trainer_interface.cc(284) LOG(INFO) Saving model: m.model
trainer_interface.cc(293) LOG(INFO) Saving vocabs: m.vocab
>>>
```
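
Because Train() takes a single string of spm_train-style flags, it can be convenient to assemble that string from a dictionary. The helper below, `build_train_args`, is a hypothetical convenience and not part of the wrapper. It only formats `--key=value` pairs and assumes values contain no spaces. The call to Train() is left commented out so the snippet runs without the library or a corpus.

```python
def build_train_args(options):
    """Format a dict of options as a '--key=value ...' argument string.

    Relies on dict insertion order (guaranteed from Python 3.7).
    """
    return " ".join(
        "--{}={}".format(key, value) for key, value in options.items()
    )


args = build_train_args({
    "input": "test/botchan.txt",
    "model_prefix": "m",
    "vocab_size": 1000,
})
print(args)

# With sentencepiece installed and the corpus present:
# import sentencepiece as spm
# spm.SentencePieceTrainer.Train(args)
```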

## Python2/3 String/Unicode compatibility
The SentencePiece Python wrapper accepts both Unicode strings and legacy byte strings.
The output string type is determined by the input string type.
The output type of the IdToPiece and DecodeIds methods is *str*, which is a legacy byte string in Python2 and a Unicode string in Python3.

* Python2:
```
>>> sp.EncodeAsPieces('吾輩は猫である')
['\xe2\x96\x81', '\xe5\x90\xbe', '\xe8\xbc\xa9', '\xe3\x81\xaf', '\xe7\x8c\xab', '\xe3\x81\xa7\xe3\x81\x82\xe3\x82\x8b']
>>> sp.EncodeAsPieces(u'吾輩は猫である')
[u'\u2581', u'\u543e', u'\u8f29', u'\u306f', u'\u732b', u'\u3067\u3042\u308b']
>>> sp.EncodeAsPieces(u'吾輩は猫である'.encode('utf-8'))
['\xe2\x96\x81', '\xe5\x90\xbe', '\xe8\xbc\xa9', '\xe3\x81\xaf', '\xe7\x8c\xab', '\xe3\x81\xa7\xe3\x81\x82\xe3\x82\x8b']
>>> sp.IdToPiece(10)
'\xe3\x81\xab'
>>> type(sp.IdToPiece(10))
<type 'str'>
```

* Python3:
```
>>> sp.EncodeAsPieces('吾輩は猫である')
['▁', '吾', '輩', 'は', '猫', 'である']
>>> sp.EncodeAsPieces('吾輩は猫である'.encode('utf-8'))
[b'\xe2\x96\x81', b'\xe5\x90\xbe', b'\xe8\xbc\xa9', b'\xe3\x81\xaf', b'\xe7\x8c\xab', b'\xe3\x81\xa7\xe3\x81\x82\xe3\x82\x8b']
>>>
>>> sp.IdToPiece(10)
'に'
>>> type(sp.IdToPiece(10))
<class 'str'>
```
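
Since the piece type follows the input type, code that must handle both cases can normalize pieces to text up front. The sketch below assumes Python3 and UTF-8 encoded byte pieces. `to_text` is a hypothetical helper, and the byte values are copied from the Python3 output above.

```python
def to_text(piece, encoding="utf-8"):
    """Return a piece as str, decoding it first if it is a byte string."""
    if isinstance(piece, bytes):
        return piece.decode(encoding)
    return piece


byte_pieces = [
    b'\xe2\x96\x81', b'\xe5\x90\xbe', b'\xe8\xbc\xa9',
    b'\xe3\x81\xaf', b'\xe7\x8c\xab', b'\xe3\x81\xa7\xe3\x81\x82\xe3\x82\x8b',
]
text_pieces = [to_text(p) for p in byte_pieces]
print(text_pieces)
```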

