# SentencePiece Python Wrapper

Python wrapper for SentencePiece, built with SWIG. This module wraps the sentencepiece::SentencePieceProcessor class with the following modifications:
* The Encode and Decode methods are re-defined as EncodeAsIds and EncodeAsPieces, and DecodeIds and DecodePieces, respectively.
* Model training is supported through the SentencePieceTrainer.Train method.
* The SentencePieceText proto is not supported.
* `__len__` and `__getitem__` methods are added. len(obj) returns the vocabulary size and obj[key] returns the vocabulary id of a piece.

## Build and Install SentencePiece
For Linux (x64/i686) environments, you can simply use the pip command to install the SentencePiece Python module.

```
% pip install sentencepiece
```

Note that binary wheel packages are not available for other environments, including macOS, Windows, and Linux (arm).
On those platforms, you need to install the [SentencePiece C++](https://github.com/google/sentencepiece#c-from-source) library in advance.

To build and install the Python wrapper manually, install [SentencePiece C++](https://github.com/google/sentencepiece#c-from-source) and then run the following commands:
```
% python setup.py build
% sudo python setup.py install
```

If you don’t have write permission to the global site-packages directory or don’t want to install into it, please try:
```
% python setup.py install --user
```

## Usage

### Segmentation
```
% python
>>> import sentencepiece as spm
>>> sp = spm.SentencePieceProcessor()
>>> sp.Load("test/test_model.model")
True
>>> sp.EncodeAsPieces("This is a test")
['\xe2\x96\x81This', '\xe2\x96\x81is', '\xe2\x96\x81a', '\xe2\x96\x81', 't', 'est']
>>> sp.EncodeAsIds("This is a test")
[284, 47, 11, 4, 15, 400]
>>> sp.DecodePieces(['\xe2\x96\x81This', '\xe2\x96\x81is', '\xe2\x96\x81a', '\xe2\x96\x81', 't', 'est'])
'This is a test'
>>> sp.NBestEncodeAsPieces("This is a test", 5)
[['\xe2\x96\x81This', '\xe2\x96\x81is', '\xe2\x96\x81a', '\xe2\x96\x81', 't', 'est'], ['\xe2\x96\x81This', '\xe2\x96\x81is', '\xe2\x96\x81a', '\xe2\x96\x81', 'te', 'st'], ['\xe2\x96\x81This', '\xe2\x96\x81is', '\xe2\x96\x81a', '\xe2\x96\x81', 'te', 's', 't'], ['\xe2\x96\x81This', '\xe2\x96\x81is', '\xe2\x96\x81a', '\xe2\x96\x81', 't', 'e', 'st'], ['\xe2\x96\x81This', '\xe2\x96\x81is', '\xe2\x96\x81a', '\xe2\x96\x81', 't', 'es', 't']]
>>> for x in range(10):
...     sp.SampleEncodeAsPieces("This is a test", -1, 0.1)
...
['\xe2\x96\x81', 'T', 'h', 'i', 's', '\xe2\x96\x81', 'is', '\xe2\x96\x81a', '\xe2\x96\x81', 't', 'e', 's', 't']
['\xe2\x96\x81T', 'h', 'is', '\xe2\x96\x81is', '\xe2\x96\x81', 'a', '\xe2\x96\x81', 't', 'est']
['\xe2\x96\x81This', '\xe2\x96\x81is', '\xe2\x96\x81', 'a', '\xe2\x96\x81', 't', 'e', 'st']
['\xe2\x96\x81This', '\xe2\x96\x81is', '\xe2\x96\x81a', '\xe2\x96\x81', 't', 'e', 'st']
['\xe2\x96\x81This', '\xe2\x96\x81is', '\xe2\x96\x81a', '\xe2\x96\x81', 't', 'e', 's', 't']
['\xe2\x96\x81T', 'h', 'is', '\xe2\x96\x81', 'i', 's', '\xe2\x96\x81a', '\xe2\x96\x81', 'te', 's', 't']
['\xe2\x96\x81This', '\xe2\x96\x81', 'is', '\xe2\x96\x81a', '\xe2\x96\x81', 'te', 's', 't']
['\xe2\x96\x81This', '\xe2\x96\x81', 'i', 's', '\xe2\x96\x81a', '\xe2\x96\x81', 't', 'e', 'st']
['\xe2\x96\x81This', '\xe2\x96\x81', 'is', '\xe2\x96\x81', 'a', '\xe2\x96\x81', 't', 'e', 'st']
['\xe2\x96\x81This', '\xe2\x96\x81', 'i', 's', '\xe2\x96\x81', 'a', '\xe2\x96\x81', 'te', 's', 't']
>>> sp.DecodeIds([284, 47, 11, 4, 15, 400])
'This is a test'
>>> sp.GetPieceSize()
1000
>>> sp.IdToPiece(2)
'</s>'
>>> sp.PieceToId('</s>')
2
>>> len(sp)
1000
>>> sp['</s>']
2
```
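
The `'\xe2\x96\x81'` prefix in the output above is the UTF-8 encoding of U+2581 (`▁`), the meta symbol SentencePiece uses to mark whitespace. As a rough illustration of how the pieces relate to the original text, the sketch below joins them and maps the marker back to a space. The helper `join_pieces` is hypothetical and only approximates what DecodePieces returns for this example; it is not the library's implementation and needs no model file.

```python
# -*- coding: utf-8 -*-
# Illustrative only: emulate the piece-to-text mapping with plain Python.

META = u'\u2581'  # whitespace marker, shown as '\xe2\x96\x81' in Python 2 output

# The byte escape printed by Python 2 is just the UTF-8 form of the marker.
assert b'\xe2\x96\x81'.decode('utf-8') == META


def join_pieces(pieces):
    """Hypothetical helper: concatenate pieces and turn markers into spaces."""
    text = u''.join(pieces).replace(META, u' ')
    # Drop the single leading space introduced by the first marker.
    return text[1:] if text.startswith(u' ') else text


# Same segmentation as the EncodeAsPieces result above, written as Unicode.
pieces = [META + u'This', META + u'is', META + u'a', META, u't', u'est']
restored = join_pieces(pieces)
print(restored)
```

For real decoding, prefer sp.DecodePieces or sp.DecodeIds, which are the methods the wrapper provides.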

### Model Training
Training is performed by passing the parameters of [spm_train](https://github.com/google/sentencepiece#train-sentencepiece-model) to the SentencePieceTrainer.Train() function.

```
>>> import sentencepiece as spm
>>> spm.SentencePieceTrainer.Train('--input=test/botchan.txt --model_prefix=m --vocab_size=1000')
unigram_model_trainer.cc(494) LOG(INFO) Starts training with :
input: "test/botchan.txt"
model_prefix: "m"
model_type: UNIGRAM
..snip..
unigram_model_trainer.cc(529) LOG(INFO) EM sub_iter=0 size=1239 obj=10.4055 num_tokens=36256 num_tokens/piece=29.2623
unigram_model_trainer.cc(529) LOG(INFO) EM sub_iter=1 size=1239 obj=10.3187 num_tokens=36256 num_tokens/piece=29.2623
unigram_model_trainer.cc(529) LOG(INFO) EM sub_iter=0 size=1100 obj=10.5285 num_tokens=37633 num_tokens/piece=34.2118
unigram_model_trainer.cc(529) LOG(INFO) EM sub_iter=1 size=1100 obj=10.4973 num_tokens=37630 num_tokens/piece=34.2091
trainer_interface.cc(284) LOG(INFO) Saving model: m.model
trainer_interface.cc(293) LOG(INFO) Saving vocabs: m.vocab
>>>
```
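
Because Train() takes a single string of spm_train flags, it can be convenient to assemble that string from structured settings. The following is a minimal sketch; `build_train_args` is a hypothetical helper, not part of the wrapper, and it assumes that no value contains spaces. The flags are the ones used in the example above.

```python
def build_train_args(options):
    """Join (flag, value) pairs into a '--flag=value ...' string.

    Hypothetical helper: it does no quoting or validation.
    """
    return ' '.join('--{}={}'.format(flag, value) for flag, value in options)


train_args = build_train_args([
    ('input', 'test/botchan.txt'),
    ('model_prefix', 'm'),
    ('vocab_size', 1000),
])
print(train_args)

# With sentencepiece installed, the string is passed on unchanged:
# import sentencepiece as spm
# spm.SentencePieceTrainer.Train(train_args)
```

A list of pairs is used instead of a dict so that the flag order is the same on every Python version.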

## Python2/3 String/Unicode compatibility
The SentencePiece Python wrapper accepts both Unicode strings and legacy byte strings.
The output string type is determined by the input string type.
The output type of the IdToPiece and DecodeIds methods is *str*, which is a legacy byte string in Python 2 and a Unicode string in Python 3.

* Python2:
```
>>> sp.EncodeAsPieces('吾輩は猫である')
['\xe2\x96\x81', '\xe5\x90\xbe', '\xe8\xbc\xa9', '\xe3\x81\xaf', '\xe7\x8c\xab', '\xe3\x81\xa7\xe3\x81\x82\xe3\x82\x8b']
>>> sp.EncodeAsPieces(u'吾輩は猫である')
[u'\u2581', u'\u543e', u'\u8f29', u'\u306f', u'\u732b', u'\u3067\u3042\u308b']
>>> sp.EncodeAsPieces(u'吾輩は猫である'.encode('utf-8'))
['\xe2\x96\x81', '\xe5\x90\xbe', '\xe8\xbc\xa9', '\xe3\x81\xaf', '\xe7\x8c\xab', '\xe3\x81\xa7\xe3\x81\x82\xe3\x82\x8b']
>>> sp.IdToPiece(10)
'\xe3\x81\xab'
>>> type(sp.IdToPiece(10))
<type 'str'>
```

* Python3:
```
>>> sp.EncodeAsPieces('吾輩は猫である')
['▁', '吾', '輩', 'は', '猫', 'である']
>>> sp.EncodeAsPieces('吾輩は猫である'.encode('utf-8'))
[b'\xe2\x96\x81', b'\xe5\x90\xbe', b'\xe8\xbc\xa9', b'\xe3\x81\xaf', b'\xe7\x8c\xab', b'\xe3\x81\xa7\xe3\x81\x82\xe3\x82\x8b']
>>>
>>> sp.IdToPiece(10)
'に'
>>> type(sp.IdToPiece(10))
<class 'str'>
```
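
Since the output type follows the input type, code that mixes byte and Unicode inputs may want to normalize pieces to text before comparing or printing them. The sketch below shows one way to do that; `to_text` is a hypothetical helper and assumes the pieces are UTF-8 encoded, as in the examples above. The sample data is copied from the Python 3 byte-string output, so no model is needed to run it.

```python
# -*- coding: utf-8 -*-

def to_text(piece, encoding='utf-8'):
    """Hypothetical helper: return a piece as Unicode text.

    Byte strings are decoded; text is returned unchanged. On Python 2,
    `bytes` is `str`, so legacy byte strings are decoded as well.
    """
    if isinstance(piece, bytes):
        return piece.decode(encoding)
    return piece


# Pieces as returned for a byte-string input in the Python 3 session above.
byte_pieces = [b'\xe2\x96\x81', b'\xe5\x90\xbe', b'\xe8\xbc\xa9',
               b'\xe3\x81\xaf', b'\xe7\x8c\xab',
               b'\xe3\x81\xa7\xe3\x81\x82\xe3\x82\x8b']

text_pieces = [to_text(p) for p in byte_pieces]
print(text_pieces)
```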

