# SentencePiece Python Wrapper

Python wrapper for SentencePiece, built with SWIG. This module wraps the sentencepiece::SentencePieceProcessor class with the following modifications:
* The Encode and Decode methods are re-defined as EncodeAsIds, EncodeAsPieces, DecodeIds and DecodePieces respectively.
* Model training is supported through the SentencePieceTrainer.Train method.
* The SentencePieceText proto is not supported.
* `__len__` and `__getitem__` methods are added: len(obj) and obj[key] return the vocab size and the vocab id respectively.
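
The last point can be pictured with a small stand-in class. This is only a sketch of the len(obj) / obj[key] semantics described above, not the wrapper's actual implementation: `ToyVocab` and its three-piece vocabulary are hypothetical, and raising `KeyError` for an unknown piece is an assumption of this toy, not a statement about the real module.

```python
class ToyVocab:
    """Hypothetical stand-in that mimics len(obj) and obj[key] on a vocabulary."""

    def __init__(self, pieces):
        # Position in the list plays the role of the vocab id.
        self._pieces = list(pieces)
        self._ids = {piece: index for index, piece in enumerate(self._pieces)}

    def __len__(self):
        # len(obj) -> vocab size
        return len(self._pieces)

    def __getitem__(self, piece):
        # obj[piece] -> vocab id (this toy raises KeyError for unknown pieces)
        return self._ids[piece]


vocab = ToyVocab(["<unk>", "<s>", "</s>"])
print(len(vocab))     # vocab size
print(vocab["</s>"])  # vocab id of '</s>'
```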

## Build and Install SentencePiece
You can install the SentencePiece Python module with pip:

```
% pip install sentencepiece
```

To build and install the wrapper manually, you need to install SentencePiece C++ in advance, and then try the following commands:
```
% python setup.py build
% sudo python setup.py install
```

If you don’t have write permission to the global site-packages directory or don’t want to install into it, please try:
```
% python setup.py install --user
```

## Usage

### Segmentation
```
% python
>>> import sentencepiece as spm
>>> sp = spm.SentencePieceProcessor()
>>> sp.Load("test/test_model.model")
True
>>> sp.EncodeAsPieces("This is a test")
['\xe2\x96\x81This', '\xe2\x96\x81is', '\xe2\x96\x81a', '\xe2\x96\x81', 't', 'est']
>>> sp.EncodeAsIds("This is a test")
[284, 47, 11, 4, 15, 400]
>>> sp.DecodePieces(['\xe2\x96\x81This', '\xe2\x96\x81is', '\xe2\x96\x81a', '\xe2\x96\x81', 't', 'est'])
'This is a test'
>>> sp.NBestEncode("This is a test", 5)
[['\xe2\x96\x81This', '\xe2\x96\x81is', '\xe2\x96\x81a', '\xe2\x96\x81', 't', 'est'], ['\xe2\x96\x81This', '\xe2\x96\x81is', '\xe2\x96\x81a', '\xe2\x96\x81', 'te', 'st'], ['\xe2\x96\x81This', '\xe2\x96\x81is', '\xe2\x96\x81a', '\xe2\x96\x81', 'te', 's', 't'], ['\xe2\x96\x81This', '\xe2\x96\x81is', '\xe2\x96\x81a', '\xe2\x96\x81', 't', 'e', 'st'], ['\xe2\x96\x81This', '\xe2\x96\x81is', '\xe2\x96\x81a', '\xe2\x96\x81', 't', 'es', 't']]
>>> for x in range(10):
...     sp.SampleEncode("This is a test", -1, 0.1)
...
['\xe2\x96\x81', 'T', 'h', 'i', 's', '\xe2\x96\x81', 'is', '\xe2\x96\x81a', '\xe2\x96\x81', 't', 'e', 's', 't']
['\xe2\x96\x81T', 'h', 'is', '\xe2\x96\x81is', '\xe2\x96\x81', 'a', '\xe2\x96\x81', 't', 'est']
['\xe2\x96\x81This', '\xe2\x96\x81is', '\xe2\x96\x81', 'a', '\xe2\x96\x81', 't', 'e', 'st']
['\xe2\x96\x81This', '\xe2\x96\x81is', '\xe2\x96\x81a', '\xe2\x96\x81', 't', 'e', 'st']
['\xe2\x96\x81This', '\xe2\x96\x81is', '\xe2\x96\x81a', '\xe2\x96\x81', 't', 'e', 's', 't']
['\xe2\x96\x81T', 'h', 'is', '\xe2\x96\x81', 'i', 's', '\xe2\x96\x81a', '\xe2\x96\x81', 'te', 's', 't']
['\xe2\x96\x81This', '\xe2\x96\x81', 'is', '\xe2\x96\x81a', '\xe2\x96\x81', 'te', 's', 't']
['\xe2\x96\x81This', '\xe2\x96\x81', 'i', 's', '\xe2\x96\x81a', '\xe2\x96\x81', 't', 'e', 'st']
['\xe2\x96\x81This', '\xe2\x96\x81', 'is', '\xe2\x96\x81', 'a', '\xe2\x96\x81', 't', 'e', 'st']
['\xe2\x96\x81This', '\xe2\x96\x81', 'i', 's', '\xe2\x96\x81', 'a', '\xe2\x96\x81', 'te', 's', 't']
>>> sp.DecodeIds([284, 47, 11, 4, 15, 400])
'This is a test'
>>> sp.GetPieceSize()
1000
>>> sp.IdToPiece(2)
'</s>'
>>> sp.PieceToId('</s>')
2
>>> len(sp)
1000
>>> sp['</s>']
2
```
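
In the output above, '\xe2\x96\x81' is the UTF-8 encoding of U+2581 ("▁"), the marker SentencePiece puts where a space preceded the piece. The following pure-Python sketch shows how the example pieces line up with the decoded text. `join_pieces` is a hypothetical helper written for illustration; the real DecodePieces may treat special symbols and normalization differently.

```python
WORD_BOUNDARY = "\u2581"  # "▁"; encoded in UTF-8 as the bytes e2 96 81


def join_pieces(pieces):
    """Hypothetical helper: rebuild text from a list of Unicode pieces."""
    text = "".join(pieces).replace(WORD_BOUNDARY, " ")
    # The first piece starts with the marker, so drop the leading space it leaves.
    return text[1:] if text.startswith(" ") else text


pieces = ["\u2581This", "\u2581is", "\u2581a", "\u2581", "t", "est"]
print(join_pieces(pieces))  # This is a test
```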

### Model Training
Training is performed by passing the parameters of [spm_train](https://github.com/google/sentencepiece#train-sentencepiece-model) to the SentencePieceTrainer.Train() function.

```
>>> import sentencepiece as spm
>>> spm.SentencePieceTrainer.Train('--input=test/botchan.txt --model_prefix=m --vocab_size=1000')
unigram_model_trainer.cc(494) LOG(INFO) Starts training with :
input: "test/botchan.txt"
model_prefix: "m"
model_type: UNIGRAM
..snip..
unigram_model_trainer.cc(529) LOG(INFO) EM sub_iter=0 size=1239 obj=10.4055 num_tokens=36256 num_tokens/piece=29.2623
unigram_model_trainer.cc(529) LOG(INFO) EM sub_iter=1 size=1239 obj=10.3187 num_tokens=36256 num_tokens/piece=29.2623
unigram_model_trainer.cc(529) LOG(INFO) EM sub_iter=0 size=1100 obj=10.5285 num_tokens=37633 num_tokens/piece=34.2118
unigram_model_trainer.cc(529) LOG(INFO) EM sub_iter=1 size=1100 obj=10.4973 num_tokens=37630 num_tokens/piece=34.2091
trainer_interface.cc(284) LOG(INFO) Saving model: m.model
trainer_interface.cc(293) LOG(INFO) Saving vocabs: m.vocab
>>>
```
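
Because Train() takes a single flag string, it can be convenient to assemble that string from a mapping. The helper below is a minimal sketch: `build_train_args` is a hypothetical name, it assumes values contain no spaces, and it relies on dicts preserving insertion order (Python 3.7+). The call to Train() is left as a comment so the snippet runs without a corpus file.

```python
def build_train_args(params):
    """Hypothetical helper: turn {'name': value} into '--name=value ...'."""
    return " ".join("--{}={}".format(name, value) for name, value in params.items())


args = build_train_args({
    "input": "test/botchan.txt",
    "model_prefix": "m",
    "vocab_size": 1000,
})
print(args)

# With the wrapper installed and the corpus present, the string is passed as-is:
# import sentencepiece as spm
# spm.SentencePieceTrainer.Train(args)
```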

## Python2/3 String/Unicode compatibility
The SentencePiece Python wrapper accepts both Unicode strings and legacy byte strings.
The output string type is determined by the input string type.
The output type of the IdToPiece and DecodeIds methods is *str*; note that this is a legacy byte string in Python 2 and a Unicode string in Python 3.

* Python2:
```
>>> sp.Encode('吾輩は猫である')
['\xe2\x96\x81', '\xe5\x90\xbe', '\xe8\xbc\xa9', '\xe3\x81\xaf', '\xe7\x8c\xab', '\xe3\x81\xa7\xe3\x81\x82\xe3\x82\x8b']
>>> sp.Encode(u'吾輩は猫である')
[u'\u2581', u'\u543e', u'\u8f29', u'\u306f', u'\u732b', u'\u3067\u3042\u308b']
>>> sp.Encode(u'吾輩は猫である'.encode('utf-8'))
['\xe2\x96\x81', '\xe5\x90\xbe', '\xe8\xbc\xa9', '\xe3\x81\xaf', '\xe7\x8c\xab', '\xe3\x81\xa7\xe3\x81\x82\xe3\x82\x8b']
>>> sp.IdToPiece(10)
'\xe3\x81\xab'
>>> type(sp.IdToPiece(10))
<type 'str'>
```

* Python3:
```
>>> sp.Encode('吾輩は猫である')
['▁', '吾', '輩', 'は', '猫', 'である']
>>> sp.Encode('吾輩は猫である'.encode('utf-8'))
[b'\xe2\x96\x81', b'\xe5\x90\xbe', b'\xe8\xbc\xa9', b'\xe3\x81\xaf', b'\xe7\x8c\xab', b'\xe3\x81\xa7\xe3\x81\x82\xe3\x82\x8b']
>>>
>>> sp.IdToPiece(10)
'に'
>>> type(sp.IdToPiece(10))
<class 'str'>
```
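
Since the piece type follows the input type, code that may receive either kind of output in Python 3 can normalize everything to *str* before further processing. This is a small sketch under that assumption; `to_text` is a hypothetical helper, and UTF-8 is assumed as the encoding, matching the byte strings shown above.

```python
def to_text(piece, encoding="utf-8"):
    """Hypothetical helper: return a piece as str, decoding bytes if needed."""
    if isinstance(piece, bytes):
        return piece.decode(encoding)
    return piece


# First three byte pieces from the Python3 example above.
byte_pieces = [b"\xe2\x96\x81", b"\xe5\x90\xbe", b"\xe8\xbc\xa9"]
text_pieces = [to_text(piece) for piece in byte_pieces]
print(text_pieces)  # ['▁', '吾', '輩']
```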

