# SentencePiece Python Wrapper

Python wrapper for SentencePiece, built with SWIG. This module wraps the sentencepiece::SentencePieceProcessor class with the following modifications:
* The Encode and Decode methods are redefined as EncodeAsIds, EncodeAsPieces, DecodeIds, and DecodePieces, respectively.
* Model training is supported through the SentencePieceTrainer.Train method.
* The SentencePieceText proto is not supported.
* `__len__` and `__getitem__` methods are added: `len(obj)` returns the vocab size, and `obj[key]` returns the vocab id of the piece `key`.

## Build and Install SentencePiece
For Linux (x64/i686) environments, you can simply use the pip command to install the SentencePiece Python module.

```
% pip install sentencepiece
```

Note that binary wheel packages are available only for Linux (x64/i686); they are not available for other environments, including macOS, Windows, and Linux (arm).
On those platforms, you need to install the [SentencePiece C++](https://github.com/google/sentencepiece#c-from-source) library in advance.

To build and install the Python wrapper manually, install [SentencePiece C++](https://github.com/google/sentencepiece#c-from-source) and run the following commands:
```
% python setup.py build
% sudo python setup.py install
```

If you don’t have write permission to the global site-packages directory or don’t want to install into it, please try:
```
% python setup.py install --user
```
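After installing, one quick way to confirm that the module is visible to the interpreter you intend to use is to look it up with the standard library. This is only an illustrative Python 3 sketch; the helper name `is_installed` is hypothetical and not part of SentencePiece.

```python
import importlib.util


def is_installed(module_name):
    """Return True if the current interpreter can locate `module_name`."""
    # find_spec returns None when a top-level module cannot be found.
    return importlib.util.find_spec(module_name) is not None


if __name__ == "__main__":
    print("sentencepiece importable:", is_installed("sentencepiece"))
```

If this reports `False` after a `--user` install, the interpreter running the check is probably not the one that ran `setup.py`.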

## Usage

### Segmentation
```
% python
>>> import sentencepiece as spm
>>> sp = spm.SentencePieceProcessor()
>>> sp.Load("test/test_model.model")
True
>>> sp.EncodeAsPieces("This is a test")
['\xe2\x96\x81This', '\xe2\x96\x81is', '\xe2\x96\x81a', '\xe2\x96\x81', 't', 'est']
>>> sp.EncodeAsIds("This is a test")
[284, 47, 11, 4, 15, 400]
>>> sp.DecodePieces(['\xe2\x96\x81This', '\xe2\x96\x81is', '\xe2\x96\x81a', '\xe2\x96\x81', 't', 'est'])
'This is a test'
>>> sp.NBestEncodeAsPieces("This is a test", 5)
[['\xe2\x96\x81This', '\xe2\x96\x81is', '\xe2\x96\x81a', '\xe2\x96\x81', 't', 'est'], ['\xe2\x96\x81This', '\xe2\x96\x81is', '\xe2\x96\x81a', '\xe2\x96\x81', 'te', 'st'], ['\xe2\x96\x81This', '\xe2\x96\x81is', '\xe2\x96\x81a', '\xe2\x96\x81', 'te', 's', 't'], ['\xe2\x96\x81This', '\xe2\x96\x81is', '\xe2\x96\x81a', '\xe2\x96\x81', 't', 'e', 'st'], ['\xe2\x96\x81This', '\xe2\x96\x81is', '\xe2\x96\x81a', '\xe2\x96\x81', 't', 'es', 't']]
>>> for x in range(10):
...     sp.SampleEncodeAsPieces("This is a test", -1, 0.1)
...
['\xe2\x96\x81', 'T', 'h', 'i', 's', '\xe2\x96\x81', 'is', '\xe2\x96\x81a', '\xe2\x96\x81', 't', 'e', 's', 't']
['\xe2\x96\x81T', 'h', 'is', '\xe2\x96\x81is', '\xe2\x96\x81', 'a', '\xe2\x96\x81', 't', 'est']
['\xe2\x96\x81This', '\xe2\x96\x81is', '\xe2\x96\x81', 'a', '\xe2\x96\x81', 't', 'e', 'st']
['\xe2\x96\x81This', '\xe2\x96\x81is', '\xe2\x96\x81a', '\xe2\x96\x81', 't', 'e', 'st']
['\xe2\x96\x81This', '\xe2\x96\x81is', '\xe2\x96\x81a', '\xe2\x96\x81', 't', 'e', 's', 't']
['\xe2\x96\x81T', 'h', 'is', '\xe2\x96\x81', 'i', 's', '\xe2\x96\x81a', '\xe2\x96\x81', 'te', 's', 't']
['\xe2\x96\x81This', '\xe2\x96\x81', 'is', '\xe2\x96\x81a', '\xe2\x96\x81', 'te', 's', 't']
['\xe2\x96\x81This', '\xe2\x96\x81', 'i', 's', '\xe2\x96\x81a', '\xe2\x96\x81', 't', 'e', 'st']
['\xe2\x96\x81This', '\xe2\x96\x81', 'is', '\xe2\x96\x81', 'a', '\xe2\x96\x81', 't', 'e', 'st']
['\xe2\x96\x81This', '\xe2\x96\x81', 'i', 's', '\xe2\x96\x81', 'a', '\xe2\x96\x81', 'te', 's', 't']
>>> sp.DecodeIds([284, 47, 11, 4, 15, 400])
'This is a test'
>>> sp.GetPieceSize()
1000
>>> sp.IdToPiece(2)
'</s>'
>>> sp.PieceToId('</s>')
2
>>> len(sp)
1000
>>> sp['</s>']
2
```
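The `\xe2\x96\x81` byte sequences in the output above are the UTF-8 encoding of U+2581 (`▁`), the meta symbol SentencePiece uses to mark whitespace. As a rough illustration of how pieces map back to text, the sketch below joins the pieces and turns the meta symbol back into spaces. It is a simplified stand-in, not the library's DecodePieces implementation, and `pieces_to_text` is a hypothetical helper name.

```python
# Python 3 sketch: the pieces here are Unicode strings, not byte strings.
META_SYMBOL = "\u2581"  # "▁", the whitespace marker


def pieces_to_text(pieces):
    """Join pieces and restore spaces (simplified; ignores normalization)."""
    text = "".join(pieces).replace(META_SYMBOL, " ")
    # The first piece starts with the meta symbol, so drop one leading space.
    return text[1:] if text.startswith(" ") else text


pieces = ["\u2581This", "\u2581is", "\u2581a", "\u2581", "t", "est"]
print(pieces_to_text(pieces))  # This is a test
```

In real code, prefer `sp.DecodePieces`, which uses the loaded model rather than this approximation.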

### Model Training
Training is performed by passing the parameters of [spm_train](https://github.com/google/sentencepiece#train-sentencepiece-model) to the SentencePieceTrainer.Train() function.

```
>>> import sentencepiece as spm
>>> spm.SentencePieceTrainer.Train('--input=test/botchan.txt --model_prefix=m --vocab_size=1000')
unigram_model_trainer.cc(494) LOG(INFO) Starts training with :
input: "test/botchan.txt"
model_prefix: "m"
model_type: UNIGRAM
..snip..
unigram_model_trainer.cc(529) LOG(INFO) EM sub_iter=0 size=1239 obj=10.4055 num_tokens=36256 num_tokens/piece=29.2623
unigram_model_trainer.cc(529) LOG(INFO) EM sub_iter=1 size=1239 obj=10.3187 num_tokens=36256 num_tokens/piece=29.2623
unigram_model_trainer.cc(529) LOG(INFO) EM sub_iter=0 size=1100 obj=10.5285 num_tokens=37633 num_tokens/piece=34.2118
unigram_model_trainer.cc(529) LOG(INFO) EM sub_iter=1 size=1100 obj=10.4973 num_tokens=37630 num_tokens/piece=34.2091
trainer_interface.cc(284) LOG(INFO) Saving model: m.model
trainer_interface.cc(293) LOG(INFO) Saving vocabs: m.vocab
>>>
```
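Because Train() takes a single string of spm_train flags, it can be convenient to assemble that string from key/value pairs. A minimal sketch follows; `build_train_args` is a hypothetical helper, not part of the wrapper, and it assumes that no value contains spaces.

```python
def build_train_args(options):
    """Format (key, value) pairs as '--key=value' flags separated by spaces."""
    return " ".join("--{}={}".format(key, value) for key, value in options)


args = build_train_args([
    ("input", "test/botchan.txt"),
    ("model_prefix", "m"),
    ("vocab_size", 1000),
])
print(args)  # --input=test/botchan.txt --model_prefix=m --vocab_size=1000

# With the wrapper installed, the string can then be passed to the trainer:
# import sentencepiece as spm
# spm.SentencePieceTrainer.Train(args)
```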

## Python2/3 String/Unicode compatibility
The SentencePiece Python wrapper accepts both Unicode strings and legacy byte strings.
The output string type is determined by the input string type.
The output type of the IdToPiece/DecodeIds methods is *str*; note that this is a legacy byte string in Python 2 and a Unicode string in Python 3.

* Python2:
```
>>> sp.EncodeAsPieces('吾輩は猫である')
['\xe2\x96\x81', '\xe5\x90\xbe', '\xe8\xbc\xa9', '\xe3\x81\xaf', '\xe7\x8c\xab', '\xe3\x81\xa7\xe3\x81\x82\xe3\x82\x8b']
>>> sp.EncodeAsPieces(u'吾輩は猫である')
[u'\u2581', u'\u543e', u'\u8f29', u'\u306f', u'\u732b', u'\u3067\u3042\u308b']
>>> sp.EncodeAsPieces(u'吾輩は猫である'.encode('utf-8'))
['\xe2\x96\x81', '\xe5\x90\xbe', '\xe8\xbc\xa9', '\xe3\x81\xaf', '\xe7\x8c\xab', '\xe3\x81\xa7\xe3\x81\x82\xe3\x82\x8b']
>>> sp.IdToPiece(10)
'\xe3\x81\xab'
>>> type(sp.IdToPiece(10))
<type 'str'>
```

* Python3:
```
>>> sp.EncodeAsPieces('吾輩は猫である')
['▁', '吾', '輩', 'は', '猫', 'である']
>>> sp.EncodeAsPieces('吾輩は猫である'.encode('utf-8'))
[b'\xe2\x96\x81', b'\xe5\x90\xbe', b'\xe8\xbc\xa9', b'\xe3\x81\xaf', b'\xe7\x8c\xab', b'\xe3\x81\xa7\xe3\x81\x82\xe3\x82\x8b']
>>>
>>> sp.IdToPiece(10)
'に'
>>> type(sp.IdToPiece(10))
<class 'str'>
```
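Since the output type follows the input type, code that may receive either kind of input sometimes needs to normalize pieces to Unicode text. The following Python 3 sketch shows one way to do that; `to_text` is a hypothetical helper, and it assumes byte pieces are UTF-8 encoded, as in the examples above.

```python
def to_text(piece, encoding="utf-8"):
    """Return `piece` as a Unicode string, decoding it if it is a byte string."""
    if isinstance(piece, bytes):
        return piece.decode(encoding)
    return piece


byte_pieces = [b"\xe2\x96\x81", b"\xe5\x90\xbe", b"\xe8\xbc\xa9"]
print([to_text(p) for p in byte_pieces])  # ['▁', '吾', '輩']
```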

