# SentencePiece Python Wrapper

Python wrapper for SentencePiece, built with SWIG. This module wraps the `sentencepiece::SentencePieceProcessor` class with the following modifications:
* `Encode` is redefined as `EncodeAsIds` and `EncodeAsPieces`, and `Decode` as `DecodeIds` and `DecodePieces`.
* The `SentencePieceText` proto is not supported.
* `__len__` and `__getitem__` methods are added: `len(obj)` returns the vocabulary size, and `obj[key]` returns the vocabulary id of `key`.
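
As a rough illustration of the `__len__` and `__getitem__` additions, the sketch below dumps a vocabulary into a plain dictionary. It relies only on `len(sp)`, `sp[piece]`, and `IdToPiece`, all of which appear in the Usage section. `build_vocab` and `FakeProcessor` are hypothetical names introduced here so that the snippet runs without a model file; they are not part of the wrapper.

```python
def build_vocab(sp):
    """Return a {piece: id} dict for a processor-like object.

    Assumes `sp` supports len(sp) and sp.IdToPiece(i), as the wrapper
    described above does.
    """
    return {sp.IdToPiece(i): i for i in range(len(sp))}


class FakeProcessor(object):
    """Hypothetical stand-in that mimics the wrapper's vocabulary methods."""

    def __init__(self, pieces):
        self._pieces = list(pieces)

    def GetPieceSize(self):
        return len(self._pieces)

    def IdToPiece(self, i):
        return self._pieces[i]

    def PieceToId(self, piece):
        return self._pieces.index(piece)

    # The two additions listed above, expressed via the methods they mirror.
    def __len__(self):
        return self.GetPieceSize()

    def __getitem__(self, piece):
        return self.PieceToId(piece)


# Toy three-piece vocabulary, for illustration only.
fake = FakeProcessor(["<unk>", "<s>", "</s>"])
vocab = build_vocab(fake)
```

With the real wrapper, a loaded `SentencePieceProcessor` would take the place of `fake`.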

## Build and Install SentencePiece
You need to install SentencePiece before installing this Python wrapper.

You can use the pip command to install the SentencePiece Python module:

```
% pip install sentencepiece
```

To install the wrapper manually, try the following commands:
```
% python setup.py build
% sudo python setup.py install
```

If you don’t have write permission to the global site-packages directory or don’t want to install into it, please try:
```
% python setup.py install --user
```
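
After either install route, one way to confirm that Python can find the module is to look it up without importing it. This is a minimal sketch for Python 3 only (`importlib.util.find_spec` is not available in Python 2), and `is_installed` is a hypothetical helper name rather than anything the wrapper provides.

```python
import importlib.util  # Python 3 only


def is_installed(module_name):
    """Return True if a top-level module can be located on sys.path."""
    return importlib.util.find_spec(module_name) is not None


# True once `pip install sentencepiece` (or setup.py install) has succeeded.
has_spm = is_installed("sentencepiece")
```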

## Usage

```
% python
>>> import sentencepiece as spm
>>> sp = spm.SentencePieceProcessor()
>>> sp.Load("test/test_model.model")
True
>>> sp.EncodeAsPieces("This is a test")
['\xe2\x96\x81This', '\xe2\x96\x81is', '\xe2\x96\x81a', '\xe2\x96\x81', 't', 'est']
>>> sp.EncodeAsIds("This is a test")
[284, 47, 11, 4, 15, 400]
>>> sp.DecodePieces(['\xe2\x96\x81This', '\xe2\x96\x81is', '\xe2\x96\x81a', '\xe2\x96\x81', 't', 'est'])
'This is a test'
>>> sp.DecodeIds([284, 47, 11, 4, 15, 400])
'This is a test'
>>> sp.GetPieceSize()
1000
>>> sp.IdToPiece(2)
'</s>'
>>> sp.PieceToId('</s>')
2
>>> len(sp)
1000
>>> sp['</s>']
2
```
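
The session above can also be wrapped in a small round-trip check. The sketch below assumes only the `EncodeAsIds` and `DecodeIds` methods shown in the session; `roundtrip` and `ToyProcessor` are hypothetical names, and the toy class is a whitespace tokenizer that stands in for a loaded model so the snippet runs on its own.

```python
def roundtrip(sp, text):
    """Encode `text` to ids, decode the ids back, and return both."""
    ids = sp.EncodeAsIds(text)
    decoded = sp.DecodeIds(ids)
    return ids, decoded


class ToyProcessor(object):
    """Hypothetical whitespace tokenizer exposing the same method names."""

    def __init__(self, words):
        self._words = list(words)

    def EncodeAsIds(self, text):
        # Look up each whitespace-separated word in the toy vocabulary.
        return [self._words.index(w) for w in text.split()]

    def DecodeIds(self, ids):
        return " ".join(self._words[i] for i in ids)


toy = ToyProcessor(["This", "is", "a", "test"])
ids, decoded = roundtrip(toy, "This is a test")
```

With the real wrapper, pass the `sp` object loaded in the session above in place of `toy`.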

## Python2/3 String/Unicode compatibility
The SentencePiece Python wrapper accepts both Unicode strings and legacy byte strings.
The output string type is determined by the input string type.
The output type of the IdToPiece and DecodeIds methods is *str*, which is a legacy byte string in Python2 and a Unicode string in Python3.

* Python2:
```
>>> sp.Encode('吾輩は猫である')
['\xe2\x96\x81', '\xe5\x90\xbe', '\xe8\xbc\xa9', '\xe3\x81\xaf', '\xe7\x8c\xab', '\xe3\x81\xa7\xe3\x81\x82\xe3\x82\x8b']
>>> sp.Encode(u'吾輩は猫である')
[u'\u2581', u'\u543e', u'\u8f29', u'\u306f', u'\u732b', u'\u3067\u3042\u308b']
>>> sp.Encode(u'吾輩は猫である'.encode('utf-8'))
['\xe2\x96\x81', '\xe5\x90\xbe', '\xe8\xbc\xa9', '\xe3\x81\xaf', '\xe7\x8c\xab', '\xe3\x81\xa7\xe3\x81\x82\xe3\x82\x8b']
>>> sp.IdToPiece(10)
'\xe3\x81\xab'
>>> type(sp.IdToPiece(10))
<type 'str'>
```

* Python3:
```
>>> sp.Encode('吾輩は猫である')
['▁', '吾', '輩', 'は', '猫', 'である']
>>> sp.Encode('吾輩は猫である'.encode('utf-8'))
[b'\xe2\x96\x81', b'\xe5\x90\xbe', b'\xe8\xbc\xa9', b'\xe3\x81\xaf', b'\xe7\x8c\xab', b'\xe3\x81\xa7\xe3\x81\x82\xe3\x82\x8b']
>>>
>>> sp.IdToPiece(10)
'に'
>>> type(sp.IdToPiece(10))
<class 'str'>
```
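
Because the piece type follows the input type, code that handles both cases may want to normalize pieces to Unicode first. The following is a minimal sketch, assuming pieces are UTF-8 encoded as in the examples above; `to_text` and `join_pieces` are hypothetical helpers, not wrapper methods, and `DecodePieces` remains the proper way to detokenize.

```python
META = u"\u2581"  # the whitespace marker seen at the start of pieces above


def to_text(piece):
    """Return `piece` as a Unicode string, decoding UTF-8 bytes if needed."""
    if isinstance(piece, bytes):  # in Python 2, `bytes` is the legacy str
        return piece.decode("utf-8")
    return piece


def join_pieces(pieces):
    """Concatenate pieces and turn the marker back into spaces."""
    text = u"".join(to_text(p) for p in pieces)
    return text.replace(META, u" ").lstrip(u" ")


# Byte-string pieces copied from the example output above.
byte_pieces = [b'\xe2\x96\x81', b'\xe5\x90\xbe', b'\xe8\xbc\xa9',
               b'\xe3\x81\xaf', b'\xe7\x8c\xab',
               b'\xe3\x81\xa7\xe3\x81\x82\xe3\x82\x8b']
sentence = join_pieces(byte_pieces)
```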

