SentencePiece python wrapper
Python wrapper for SentencePiece, built with SWIG. This module wraps the sentencepiece::SentencePieceProcessor class with the following modifications:
* The Encode and Decode methods are re-defined as EncodeAsIds and EncodeAsPieces, and as DecodeIds and DecodePieces, respectively.
* The SentencePieceText proto is not supported.
* Added __len__ and __getitem__ methods. len(obj) returns the vocabulary size, and obj[key] returns the id of the piece key.
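The container-style methods in the last item can be checked against the explicit method calls. The following is a minimal sketch, assuming `sp` is an already loaded SentencePieceProcessor; `container_view_is_consistent` is a hypothetical helper name and is not part of the wrapper.

```python
def container_view_is_consistent(sp, piece="</s>"):
    """Return True if len(sp) and sp[piece] agree with the explicit calls.

    `sp` is assumed to be a loaded SentencePieceProcessor (or any object
    exposing the same methods); `piece` is assumed to be in the vocabulary.
    """
    # len(sp) should report the vocabulary size, like GetPieceSize().
    same_size = len(sp) == sp.GetPieceSize()
    # sp[piece] should report the piece's id, like PieceToId(piece).
    same_id = sp[piece] == sp.PieceToId(piece)
    return same_size and same_id
```

With the model from the Usage section loaded, `container_view_is_consistent(sp)` would be expected to return True.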
## Build and Install SentencePiece
You need to install SentencePiece before installing this Python wrapper.
You can install the SentencePiece Python module with the pip command:
```
% pip install sentencepiece
```
To install the wrapper manually, try the following commands:
```
% python setup.py build
% sudo python setup.py install
```
If you don’t have write permission to the global site-packages directory or don’t want to install into it, please try:
```
% python setup.py install --user
```
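After any of the install commands above, one way to confirm that the module is importable is to ask the import system for it. This is a minimal sketch using only the Python 3 standard library; `is_installed` is a hypothetical helper name.

```python
import importlib.util


def is_installed(module_name="sentencepiece"):
    """Return True if a top-level module with this name can be found."""
    # find_spec returns None when no such top-level module is installed.
    return importlib.util.find_spec(module_name) is not None


if __name__ == "__main__":
    print("sentencepiece installed:", is_installed())
```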
## Usage
```
% python
>>> import sentencepiece as spm
>>> sp = spm.SentencePieceProcessor()
>>> sp.Load("test/test_model.model")
True
>>> sp.EncodeAsPieces("This is a test")
['\xe2\x96\x81This', '\xe2\x96\x81is', '\xe2\x96\x81a', '\xe2\x96\x81', 't', 'est']
>>> sp.EncodeAsIds("This is a test")
[284, 47, 11, 4, 15, 400]
>>> sp.DecodePieces(['\xe2\x96\x81This', '\xe2\x96\x81is', '\xe2\x96\x81a', '\xe2\x96\x81', 't', 'est'])
'This is a test'
>>> sp.DecodeIds([284, 47, 11, 4, 15, 400])
'This is a test'
>>> sp.GetPieceSize()
1000
>>> sp.IdToPiece(2)
'</s>'
>>> sp.PieceToId('</s>')
2
>>> len(sp)
1000
>>> sp['</s>']
2
```
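The session above can be folded into a small round-trip check. The sketch below assumes `sp` is a loaded SentencePieceProcessor exposing the methods used in the session; `roundtrip` is a hypothetical helper name, not part of the wrapper.

```python
def roundtrip(sp, text):
    """Encode `text` both ways and decode each result back to a string.

    `sp` is assumed to be a loaded SentencePieceProcessor.
    """
    ids = sp.EncodeAsIds(text)        # list of integer vocabulary ids
    pieces = sp.EncodeAsPieces(text)  # list of subword pieces
    return {
        "ids": ids,
        "pieces": pieces,
        "from_ids": sp.DecodeIds(ids),
        "from_pieces": sp.DecodePieces(pieces),
    }
```

For the model used above, both `from_ids` and `from_pieces` would be expected to equal the input, as in the DecodeIds and DecodePieces calls in the session.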
## Python2/3 String/Unicode compatibility
The SentencePiece Python wrapper accepts both Unicode strings and legacy byte strings.
The output string type is determined by the input string type.
The output type of the IdToPiece and DecodeIds methods is *str*, which is a legacy byte string in Python2 and a Unicode string in Python3.
* Python2:
```
>>> sp.EncodeAsPieces('吾輩は猫である')
['\xe2\x96\x81', '\xe5\x90\xbe', '\xe8\xbc\xa9', '\xe3\x81\xaf', '\xe7\x8c\xab', '\xe3\x81\xa7\xe3\x81\x82\xe3\x82\x8b']
>>> sp.EncodeAsPieces(u'吾輩は猫である')
[u'\u2581', u'\u543e', u'\u8f29', u'\u306f', u'\u732b', u'\u3067\u3042\u308b']
>>> sp.EncodeAsPieces(u'吾輩は猫である'.encode('utf-8'))
['\xe2\x96\x81', '\xe5\x90\xbe', '\xe8\xbc\xa9', '\xe3\x81\xaf', '\xe7\x8c\xab', '\xe3\x81\xa7\xe3\x81\x82\xe3\x82\x8b']
>>> sp.IdToPiece(10)
'\xe3\x81\xab'
>>> type(sp.IdToPiece(10))
<type 'str'>
```
* Python3:
```
>>> sp.EncodeAsPieces('吾輩は猫である')
['▁', '吾', '輩', 'は', '猫', 'である']
>>> sp.EncodeAsPieces('吾輩は猫である'.encode('utf-8'))
[b'\xe2\x96\x81', b'\xe5\x90\xbe', b'\xe8\xbc\xa9', b'\xe3\x81\xaf', b'\xe7\x8c\xab', b'\xe3\x81\xa7\xe3\x81\x82\xe3\x82\x8b']
>>>
>>> sp.IdToPiece(10)
'に'
>>> type(sp.IdToPiece(10))
<class 'str'>
```
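Because the piece type follows the input type, code that handles both kinds of input may want to normalize pieces to Unicode text. The following is a minimal sketch under the assumption that byte-string pieces are UTF-8 encoded, as in the examples above; `to_text` and `pieces_to_text` are hypothetical helper names, not part of the wrapper.

```python
def to_text(piece, encoding="utf-8"):
    """Return `piece` as Unicode text, decoding it if it is a byte string."""
    if isinstance(piece, bytes):
        # Byte-string pieces are assumed to be UTF-8 encoded.
        return piece.decode(encoding)
    return piece


def pieces_to_text(pieces):
    """Normalize a list of pieces (byte strings or Unicode) to Unicode text."""
    return [to_text(p) for p in pieces]
```

Unicode pieces pass through unchanged, so the helper can be applied regardless of which input type was given to the encoder.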
File details
Details for the file sentencepiece-0.0.3.tar.gz.
File metadata
- Download URL: sentencepiece-0.0.3.tar.gz
- Upload date:
- Size: 380.4 kB
- Tags: Source
- Uploaded using Trusted Publishing? No
File hashes
| Algorithm | Hash digest |
|---|---|
| SHA256 | 7dbefd4ba670d5862e7925bf11792d535074dc06af8015558c6642648b770358 |
| MD5 | 2fda2a126abc519bd593664ea8aa5258 |
| BLAKE2b-256 | 96adb1532b8bb8f20a392189de09d93eb4c7c95ccf98c93b55e9ab00db1bc61f |
File details
Details for the file sentencepiece-0.0.3-cp36-cp36m-manylinux1_x86_64.whl.
File metadata
- Download URL: sentencepiece-0.0.3-cp36-cp36m-manylinux1_x86_64.whl
- Upload date:
- Size: 1.1 MB
- Tags: CPython 3.6m
- Uploaded using Trusted Publishing? No
File hashes
| Algorithm | Hash digest |
|---|---|
| SHA256 | fa63c25bd64ff160e221845bdf7aae31ce60d46394ea005cc0af55ee4e64dcfe |
| MD5 | 6e8ad3d6ce851e6f47d064d7bc6563ac |
| BLAKE2b-256 | 8a54dee2ab7e10681b145d0fe99a2c69f63c95f9e9a406ccb3031dc38acea6ca |
File details
Details for the file sentencepiece-0.0.3-cp35-cp35m-manylinux1_x86_64.whl.
File metadata
- Download URL: sentencepiece-0.0.3-cp35-cp35m-manylinux1_x86_64.whl
- Upload date:
- Size: 1.1 MB
- Tags: CPython 3.5m
- Uploaded using Trusted Publishing? No
File hashes
| Algorithm | Hash digest |
|---|---|
| SHA256 | f0d2ddd93417a3482250479b2e1714e732b1354ee5586544a4a02945899ab7f1 |
| MD5 | 8424ddd9b83da202b0c6eabef5aff206 |
| BLAKE2b-256 | ff4e9367660052d355e5e27607de5ad824638f07d97d391c18bd654de00aa98e |
File details
Details for the file sentencepiece-0.0.3-cp34-cp34m-manylinux1_x86_64.whl.
File metadata
- Download URL: sentencepiece-0.0.3-cp34-cp34m-manylinux1_x86_64.whl
- Upload date:
- Size: 1.1 MB
- Tags: CPython 3.4m
- Uploaded using Trusted Publishing? No
File hashes
| Algorithm | Hash digest |
|---|---|
| SHA256 | 3792e66ed2b890e08fa3a239509c66a10cd9e3639e27de61ee3c2fd9efb2fe3c |
| MD5 | ca219d615419e74134c7d443f7b4a590 |
| BLAKE2b-256 | a801d572060751a2bf16c03ac29f1571ee35ccd64b8dfd18cecaa0020c7f6635 |
File details
Details for the file sentencepiece-0.0.3-cp33-cp33m-manylinux1_x86_64.whl.
File metadata
- Download URL: sentencepiece-0.0.3-cp33-cp33m-manylinux1_x86_64.whl
- Upload date:
- Size: 1.1 MB
- Tags: CPython 3.3m
- Uploaded using Trusted Publishing? No
File hashes
| Algorithm | Hash digest |
|---|---|
| SHA256 | 06be6816ba4647bbf0318284137fb0b75e4393c47f77883695a3fe52db13d181 |
| MD5 | 9018d0e5fba7b112bfa2397d3a94464e |
| BLAKE2b-256 | e27488e33c49b99a2bc9d12772813e098bcc43eec12d856682a623ef4da55aaf |
File details
Details for the file sentencepiece-0.0.3-cp27-cp27mu-manylinux1_x86_64.whl.
File metadata
- Download URL: sentencepiece-0.0.3-cp27-cp27mu-manylinux1_x86_64.whl
- Upload date:
- Size: 1.1 MB
- Tags: CPython 2.7mu
- Uploaded using Trusted Publishing? No
File hashes
| Algorithm | Hash digest |
|---|---|
| SHA256 | 68c3e246f11cf7156cc5dcaf2481f2c1e9b444815e0f56ca6ce9f4b36e4bcd6e |
| MD5 | 430fb989ad70eddb7cbebcc52817ea17 |
| BLAKE2b-256 | 7ac6711acbc013de2caecba62cc10a99d7f04cd69e6652592139ff2e9514df6c |
File details
Details for the file sentencepiece-0.0.3-cp27-cp27m-manylinux1_x86_64.whl.
File metadata
- Download URL: sentencepiece-0.0.3-cp27-cp27m-manylinux1_x86_64.whl
- Upload date:
- Size: 1.1 MB
- Tags: CPython 2.7m
- Uploaded using Trusted Publishing? No
File hashes
| Algorithm | Hash digest |
|---|---|
| SHA256 | d8b21dbcc2d150d2bc4aa62a72ab1cea29676913df6b00b30604c0f72d7c40c2 |
| MD5 | b120405627903dee9aa1d9fc3409f792 |
| BLAKE2b-256 | 9812b14eda27802edf1222803e1d33334423ee757fde1fe294fb6ed1a57e7570 |