# SentencePiece Python Wrapper

Python wrapper for SentencePiece, built with SWIG. This module wraps the sentencepiece::SentencePieceProcessor class with the following modifications:
* The Encode and Decode methods are redefined as EncodeAsIds, EncodeAsPieces, DecodeIds and DecodePieces, respectively.
* The SentencePieceText proto is not supported.
* Added `__len__` and `__getitem__` methods. `len(obj)` and `obj[key]` return the vocabulary size and the id of a piece, respectively.
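
The container-style access described in the last bullet can be illustrated with a toy stand-in. `ToyVocab` below is hypothetical and is not the SWIG-generated class. It only mirrors the documented behavior, where `len(obj)` gives the vocabulary size and `obj[piece]` gives the piece id. The three-piece vocabulary is made up for illustration.

```python
class ToyVocab:
    """Hypothetical stand-in mimicking the documented __len__/__getitem__ behavior."""

    def __init__(self, pieces):
        self._pieces = list(pieces)
        # Map each piece to its position in the vocabulary.
        self._ids = {piece: i for i, piece in enumerate(self._pieces)}

    def GetPieceSize(self):
        return len(self._pieces)

    def PieceToId(self, piece):
        return self._ids[piece]

    def __len__(self):
        # len(obj) delegates to the vocabulary size.
        return self.GetPieceSize()

    def __getitem__(self, piece):
        # obj[piece] delegates to the piece-to-id lookup.
        return self.PieceToId(piece)


vocab = ToyVocab(["<unk>", "<s>", "</s>"])
print(len(vocab))     # 3
print(vocab["</s>"])  # 2
```

The toy raises `KeyError` for a piece it does not know. How the real wrapper treats unknown pieces is not covered here.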

## Build and Install SentencePiece
You need to install SentencePiece before installing this Python wrapper.

You can use the pip command to install the SentencePiece Python module.

```
% pip install sentencepiece
```

To install the wrapper manually, try the following commands:
```
% python setup.py build
% sudo python setup.py install
```

If you don’t have write permission to the global site-packages directory or don’t want to install into it, please try:
```
% python setup.py install --user
```
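
After installing, one way to confirm that the module is visible to the interpreter is a small check like the sketch below. It assumes Python 3.4 or later (for `importlib.util.find_spec`). The helper name `is_installed` is made up for this example and is not part of the wrapper.

```python
import importlib.util


def is_installed(module_name="sentencepiece"):
    """Return True if the named top-level module can be found on the import path."""
    return importlib.util.find_spec(module_name) is not None


# Prints True when the wrapper is importable, False otherwise.
print(is_installed("sentencepiece"))
```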

## Usage

```
% python
>>> import sentencepiece as spm
>>> sp = spm.SentencePieceProcessor()
>>> sp.Load("test/test_model.model")
True
>>> sp.EncodeAsPieces("This is a test")
['\xe2\x96\x81This', '\xe2\x96\x81is', '\xe2\x96\x81a', '\xe2\x96\x81', 't', 'est']
>>> sp.EncodeAsIds("This is a test")
[284, 47, 11, 4, 15, 400]
>>> sp.DecodePieces(['\xe2\x96\x81This', '\xe2\x96\x81is', '\xe2\x96\x81a', '\xe2\x96\x81', 't', 'est'])
'This is a test'
>>> sp.DecodeIds([284, 47, 11, 4, 15, 400])
'This is a test'
>>> sp.GetPieceSize()
1000
>>> sp.IdToPiece(2)
'</s>'
>>> sp.PieceToId('</s>')
2
>>> len(sp)
1000
>>> sp['</s>']
2
```
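
The session above encodes and decodes the same sentence by hand. A minimal sketch of the same round trip as a reusable helper follows. `round_trip` is a hypothetical name; it relies only on the `EncodeAsIds` and `DecodeIds` methods shown above and works with any object that provides them.

```python
def round_trip(processor, text):
    """Encode text to ids, decode the ids back, and return (ids, decoded)."""
    ids = processor.EncodeAsIds(text)
    decoded = processor.DecodeIds(ids)
    return ids, decoded


# Usage with a loaded processor (requires a model file, so it is left commented out):
# ids, decoded = round_trip(sp, "This is a test")
# assert decoded == "This is a test"
```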

## Python2/3 String/Unicode compatibility
The SentencePiece Python wrapper accepts both Unicode strings and legacy byte strings.
The output string type is determined by the input string type.
The output type of the IdToPiece/DecodeIds methods is *str*, which is a legacy byte string in Python2 and a Unicode string in Python3.

* Python2:
```
>>> sp.Encode('吾輩は猫である')
['\xe2\x96\x81', '\xe5\x90\xbe', '\xe8\xbc\xa9', '\xe3\x81\xaf', '\xe7\x8c\xab', '\xe3\x81\xa7\xe3\x81\x82\xe3\x82\x8b']
>>> sp.Encode(u'吾輩は猫である')
[u'\u2581', u'\u543e', u'\u8f29', u'\u306f', u'\u732b', u'\u3067\u3042\u308b']
>>> sp.Encode(u'吾輩は猫である'.encode('utf-8'))
['\xe2\x96\x81', '\xe5\x90\xbe', '\xe8\xbc\xa9', '\xe3\x81\xaf', '\xe7\x8c\xab', '\xe3\x81\xa7\xe3\x81\x82\xe3\x82\x8b']
>>> sp.IdToPiece(10)
'\xe3\x81\xab'
>>> type(sp.IdToPiece(10))
<type 'str'>
```

* Python3:
```
>>> sp.Encode('吾輩は猫である')
['▁', '吾', '輩', 'は', '猫', 'である']
>>> sp.Encode('吾輩は猫である'.encode('utf-8'))
[b'\xe2\x96\x81', b'\xe5\x90\xbe', b'\xe8\xbc\xa9', b'\xe3\x81\xaf', b'\xe7\x8c\xab', b'\xe3\x81\xa7\xe3\x81\x82\xe3\x82\x8b']
>>>
>>> sp.IdToPiece(10)
'に'
>>> type(sp.IdToPiece(10))
<class 'str'>
```
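
When byte-string input has produced byte-string pieces in Python3, the pieces can be converted to Unicode by decoding them as UTF-8. The sketch below assumes Python 3 and uses the byte pieces from the example above. `to_text` is a hypothetical helper, not part of the wrapper.

```python
# Byte-string pieces copied from the Python3 example above.
byte_pieces = [b'\xe2\x96\x81', b'\xe5\x90\xbe', b'\xe8\xbc\xa9',
               b'\xe3\x81\xaf', b'\xe7\x8c\xab',
               b'\xe3\x81\xa7\xe3\x81\x82\xe3\x82\x8b']


def to_text(piece, encoding="utf-8"):
    """Return piece as a Unicode string, decoding it first if it is a byte string."""
    if isinstance(piece, bytes):
        return piece.decode(encoding)
    return piece


text_pieces = [to_text(p) for p in byte_pieces]
print(text_pieces)  # ['▁', '吾', '輩', 'は', '猫', 'である']
```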

