sentencepiece

SentencePiece python wrapper

Development Status
- 5 - Production/Stable
Environment
- Console
Intended Audience
- Developers
- Science/Research
License
- OSI Approved :: Apache Software License
Operating System
- Unix
Programming Language
- Python
Topic
- Software Development :: Libraries :: Python Modules
- Text Processing :: Linguistic

Project description

# SentencePiece Python Wrapper

Python wrapper for SentencePiece, built with SWIG. This module wraps the sentencepiece::SentencePieceProcessor class with the following modifications:
* The Encode and Decode methods are redefined as EncodeAsIds, EncodeAsPieces, DecodeIds and DecodePieces, respectively.
* Model training is supported through the SentencePieceTrainer.Train method.
* The SentencePieceText proto is not supported.
* Added __len__ and __getitem__ methods. len(obj) and obj[key] return the vocab size and the vocab id, respectively.
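
The last point can be pictured with a small stand-in class. This is only a sketch of the Python protocol involved, not the wrapper's SWIG-generated code; `ToyVocab` and its three-piece vocabulary are hypothetical and exist purely for illustration.

```python
class ToyVocab:
    """Hypothetical stand-in that mimics the len()/[] behaviour described above."""

    def __init__(self, pieces):
        self._pieces = list(pieces)
        self._ids = {piece: i for i, piece in enumerate(self._pieces)}

    def GetPieceSize(self):
        return len(self._pieces)

    def PieceToId(self, piece):
        return self._ids[piece]

    # len(obj) is defined in terms of the vocabulary size.
    def __len__(self):
        return self.GetPieceSize()

    # obj[piece] is defined in terms of the piece-to-id lookup.
    def __getitem__(self, piece):
        return self.PieceToId(piece)


vocab = ToyVocab(["<unk>", "<s>", "</s>"])
print(len(vocab))     # 3
print(vocab["</s>"])  # 2
```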

## Build and Install SentencePiece
You can install the SentencePiece Python module with pip:

```
% pip install sentencepiece
```

To build and install the wrapper manually, first install the SentencePiece C++ library, then run the following commands:
```
% python setup.py build
% sudo python setup.py install
```

If you don’t have write permission to the global site-packages directory or don’t want to install into it, please try:
```
% python setup.py install --user
```

## Usage

### Segmentation
```
% python
>>> import sentencepiece as spm
>>> sp = spm.SentencePieceProcessor()
>>> sp.Load("test/test_model.model")
True
>>> sp.EncodeAsPieces("This is a test")
['\xe2\x96\x81This', '\xe2\x96\x81is', '\xe2\x96\x81a', '\xe2\x96\x81', 't', 'est']
>>> sp.EncodeAsIds("This is a test")
[284, 47, 11, 4, 15, 400]
>>> sp.DecodePieces(['\xe2\x96\x81This', '\xe2\x96\x81is', '\xe2\x96\x81a', '\xe2\x96\x81', 't', 'est'])
'This is a test'
>>> sp.NBestEncode("This is a test", 5)
[['\xe2\x96\x81This', '\xe2\x96\x81is', '\xe2\x96\x81a', '\xe2\x96\x81', 't', 'est'], ['\xe2\x96\x81This', '\xe2\x96\x81is', '\xe2\x96\x81a', '\xe2\x96\x81', 'te', 'st'], ['\xe2\x96\x81This', '\xe2\x96\x81is', '\xe2\x96\x81a', '\xe2\x96\x81', 'te', 's', 't'], ['\xe2\x96\x81This', '\xe2\x96\x81is', '\xe2\x96\x81a', '\xe2\x96\x81', 't', 'e', 'st'], ['\xe2\x96\x81This', '\xe2\x96\x81is', '\xe2\x96\x81a', '\xe2\x96\x81', 't', 'es', 't']]
>>> for x in range(10):
...     sp.SampleEncode("This is a test", -1, 0.1)
...
['\xe2\x96\x81', 'T', 'h', 'i', 's', '\xe2\x96\x81', 'is', '\xe2\x96\x81a', '\xe2\x96\x81', 't', 'e', 's', 't']
['\xe2\x96\x81T', 'h', 'is', '\xe2\x96\x81is', '\xe2\x96\x81', 'a', '\xe2\x96\x81', 't', 'est']
['\xe2\x96\x81This', '\xe2\x96\x81is', '\xe2\x96\x81', 'a', '\xe2\x96\x81', 't', 'e', 'st']
['\xe2\x96\x81This', '\xe2\x96\x81is', '\xe2\x96\x81a', '\xe2\x96\x81', 't', 'e', 'st']
['\xe2\x96\x81This', '\xe2\x96\x81is', '\xe2\x96\x81a', '\xe2\x96\x81', 't', 'e', 's', 't']
['\xe2\x96\x81T', 'h', 'is', '\xe2\x96\x81', 'i', 's', '\xe2\x96\x81a', '\xe2\x96\x81', 'te', 's', 't']
['\xe2\x96\x81This', '\xe2\x96\x81', 'is', '\xe2\x96\x81a', '\xe2\x96\x81', 'te', 's', 't']
['\xe2\x96\x81This', '\xe2\x96\x81', 'i', 's', '\xe2\x96\x81a', '\xe2\x96\x81', 't', 'e', 'st']
['\xe2\x96\x81This', '\xe2\x96\x81', 'is', '\xe2\x96\x81', 'a', '\xe2\x96\x81', 't', 'e', 'st']
['\xe2\x96\x81This', '\xe2\x96\x81', 'i', 's', '\xe2\x96\x81', 'a', '\xe2\x96\x81', 'te', 's', 't']
>>> sp.DecodeIds([284, 47, 11, 4, 15, 400])
'This is a test'
>>> sp.GetPieceSize()
1000
>>> sp.IdToPiece(2)
'</s>'
>>> sp.PieceToId('</s>')
2
>>> len(sp)
1000
>>> sp['</s>']
2
```
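
The pieces in the session above are printed as escaped UTF-8 byte strings because it was run under Python 2. The sequence `\xe2\x96\x81` is the UTF-8 encoding of U+2581 (▁), the character SentencePiece uses to mark whitespace. The following Python 3 sketch decodes the pieces into readable text and joins them back together; the final join only mirrors the `DecodePieces` result shown above and is not the library's implementation.

```python
# Pieces exactly as printed by the session above (UTF-8 byte strings).
raw_pieces = [b"\xe2\x96\x81This", b"\xe2\x96\x81is", b"\xe2\x96\x81a",
              b"\xe2\x96\x81", b"t", b"est"]

# Decode each piece from UTF-8 to get readable text.
pieces = [p.decode("utf-8") for p in raw_pieces]
print(pieces)  # ['▁This', '▁is', '▁a', '▁', 't', 'est']

# Join the pieces, map the U+2581 marker back to a space,
# and drop the leading space that the first marker produces.
text = "".join(pieces).replace("\u2581", " ").lstrip(" ")
print(text)  # This is a test
```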

### Model Training
Training is performed by passing the parameters of [spm_train](https://github.com/google/sentencepiece#train-sentencepiece-model) to the SentencePieceTrainer.Train() function.

```
>>> import sentencepiece as spm
>>> spm.SentencePieceTrainer.Train('--input=test/botchan.txt --model_prefix=m --vocab_size=1000')
unigram_model_trainer.cc(494) LOG(INFO) Starts training with :
input: "test/botchan.txt"
model_prefix: "m"
model_type: UNIGRAM
..snip..
unigram_model_trainer.cc(529) LOG(INFO) EM sub_iter=0 size=1239 obj=10.4055 num_tokens=36256 num_tokens/piece=29.2623
unigram_model_trainer.cc(529) LOG(INFO) EM sub_iter=1 size=1239 obj=10.3187 num_tokens=36256 num_tokens/piece=29.2623
unigram_model_trainer.cc(529) LOG(INFO) EM sub_iter=0 size=1100 obj=10.5285 num_tokens=37633 num_tokens/piece=34.2118
unigram_model_trainer.cc(529) LOG(INFO) EM sub_iter=1 size=1100 obj=10.4973 num_tokens=37630 num_tokens/piece=34.2091
trainer_interface.cc(284) LOG(INFO) Saving model: m.model
trainer_interface.cc(293) LOG(INFO) Saving vocabs: m.vocab
>>>
```
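
Because `Train()` takes a single flag string, it can be convenient to assemble that string from keyword arguments. The helper below is a minimal sketch; `build_train_args` is a hypothetical name, not part of the wrapper, and it performs no validation of flag names or quoting of values.

```python
def build_train_args(**params):
    """Join keyword arguments into an spm_train-style flag string."""
    return " ".join("--{}={}".format(key, value) for key, value in params.items())


args = build_train_args(input="test/botchan.txt", model_prefix="m", vocab_size=1000)
print(args)  # --input=test/botchan.txt --model_prefix=m --vocab_size=1000

# With the wrapper installed, the string would be passed on as in the session above:
# import sentencepiece as spm
# spm.SentencePieceTrainer.Train(args)
```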

## Python2/3 String/Unicode compatibility
The SentencePiece Python wrapper accepts both Unicode strings and legacy byte strings.
The output string type is determined by the input string type.
The output type of the IdToPiece and DecodeIds methods is *str*, which is a legacy byte string in Python 2 and a Unicode string in Python 3.

* Python2:
```
>>> sp.Encode('吾輩は猫である')
['\xe2\x96\x81', '\xe5\x90\xbe', '\xe8\xbc\xa9', '\xe3\x81\xaf', '\xe7\x8c\xab', '\xe3\x81\xa7\xe3\x81\x82\xe3\x82\x8b']
>>> sp.Encode(u'吾輩は猫である')
[u'\u2581', u'\u543e', u'\u8f29', u'\u306f', u'\u732b', u'\u3067\u3042\u308b']
>>> sp.Encode(u'吾輩は猫である'.encode('utf-8'))
['\xe2\x96\x81', '\xe5\x90\xbe', '\xe8\xbc\xa9', '\xe3\x81\xaf', '\xe7\x8c\xab', '\xe3\x81\xa7\xe3\x81\x82\xe3\x82\x8b']
>>> sp.IdToPiece(10)
'\xe3\x81\xab'
>>> type(sp.IdToPiece(10))
<type 'str'>
```

* Python3:
```
>>> sp.Encode('吾輩は猫である')
['▁', '吾', '輩', 'は', '猫', 'である']
>>> sp.Encode('吾輩は猫である'.encode('utf-8'))
[b'\xe2\x96\x81', b'\xe5\x90\xbe', b'\xe8\xbc\xa9', b'\xe3\x81\xaf', b'\xe7\x8c\xab', b'\xe3\x81\xa7\xe3\x81\x82\xe3\x82\x8b']
>>>
>>> sp.IdToPiece(10)
'に'
>>> type(sp.IdToPiece(10))
<class 'str'>
```
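
Since the output type follows the input type, code that may receive either kind of piece can normalize it before further processing. The helper below is a sketch for Python 3; `to_text` is a hypothetical name, and the sample pieces are copied from the Python 3 session above.

```python
def to_text(piece, encoding="utf-8"):
    """Return a Unicode string whether `piece` is bytes or already text."""
    if isinstance(piece, bytes):
        return piece.decode(encoding)
    return piece


# The first three pieces from each Python 3 call shown above.
from_bytes_input = [b"\xe2\x96\x81", b"\xe5\x90\xbe", b"\xe8\xbc\xa9"]
from_str_input = ["▁", "吾", "輩"]

normalized = [to_text(p) for p in from_bytes_input]
print(normalized == from_str_input)  # True
```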

Release history

- 0.2.0: Feb 19, 2024
- 0.1.99: May 2, 2023
- 0.1.98: Apr 12, 2023
- 0.1.97: Aug 7, 2022
- 0.1.96: Jun 18, 2021
- 0.1.95: Jan 10, 2021
- 0.1.94: Oct 24, 2020
- 0.1.92 (yanked): Jun 8, 2020. Reason this release was yanked: Crash bug is reported (confirming)
- 0.1.91: May 21, 2020
- 0.1.90: May 13, 2020
- 0.1.86: Apr 24, 2020
- 0.1.85: Dec 15, 2019
- 0.1.83: Aug 16, 2019
- 0.1.82: Apr 13, 2019
- 0.1.81: Mar 22, 2019
- 0.1.8: Jan 11, 2019
- 0.1.7: Dec 26, 2018
- 0.1.6: Nov 12, 2018
- 0.1.5: Oct 29, 2018
- 0.1.4: Aug 26, 2018
- 0.1.3: Jul 30, 2018
- 0.1.2: Jul 13, 2018
- 0.1.1: Jun 26, 2018
- 0.1.0: Jun 10, 2018
- 0.0.9: May 11, 2018
- 0.0.7: Apr 29, 2018
- 0.0.6 (this version): Apr 18, 2018
- 0.0.5: Apr 9, 2018
- 0.0.4: Feb 28, 2018
- 0.0.3: Dec 17, 2017
- 0.0.2: Nov 8, 2017
- 0.0.0: Aug 28, 2017

Download files

Download the file for your platform. If you're not sure which to choose, learn more about installing packages.

Source Distribution

sentencepiece-0.0.6.tar.gz (492.3 kB)

Uploaded Apr 18, 2018 Source

Built Distributions

sentencepiece-0.0.6-cp36-cp36m-manylinux1_x86_64.whl (1.5 MB)

Uploaded Apr 18, 2018 CPython 3.6m

sentencepiece-0.0.6-cp36-cp36m-manylinux1_i686.whl (1.5 MB)

Uploaded Apr 18, 2018 CPython 3.6m

sentencepiece-0.0.6-cp35-cp35m-manylinux1_x86_64.whl (1.5 MB)

Uploaded Apr 18, 2018 CPython 3.5m

sentencepiece-0.0.6-cp35-cp35m-manylinux1_i686.whl (1.5 MB)

Uploaded Apr 18, 2018 CPython 3.5m

sentencepiece-0.0.6-cp34-cp34m-manylinux1_x86_64.whl (1.5 MB)

Uploaded Apr 18, 2018 CPython 3.4m

sentencepiece-0.0.6-cp34-cp34m-manylinux1_i686.whl (1.5 MB)

Uploaded Apr 18, 2018 CPython 3.4m

sentencepiece-0.0.6-cp33-cp33m-manylinux1_x86_64.whl (1.5 MB)

Uploaded Apr 18, 2018 CPython 3.3m

sentencepiece-0.0.6-cp33-cp33m-manylinux1_i686.whl (1.5 MB)

Uploaded Apr 18, 2018 CPython 3.3m

sentencepiece-0.0.6-cp27-cp27mu-manylinux1_x86_64.whl (1.5 MB)

Uploaded Apr 18, 2018 CPython 2.7mu

sentencepiece-0.0.6-cp27-cp27mu-manylinux1_i686.whl (1.5 MB)

Uploaded Apr 18, 2018 CPython 2.7mu

sentencepiece-0.0.6-cp27-cp27m-manylinux1_x86_64.whl (1.5 MB)

Uploaded Apr 18, 2018 CPython 2.7m

sentencepiece-0.0.6-cp27-cp27m-manylinux1_i686.whl (1.5 MB)

Uploaded Apr 18, 2018 CPython 2.7m
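
When picking a built distribution from the list above, the wheel filename itself names the interpreter, ABI and platform it targets. The sketch below splits a filename into those fields; `parse_wheel_name` is a hypothetical helper and assumes the five-field form used by every wheel listed here (no optional build tag).

```python
def parse_wheel_name(filename):
    """Split a wheel filename into its dash-separated fields.

    Assumes the five-field form: name-version-python-abi-platform.whl
    """
    if not filename.endswith(".whl"):
        raise ValueError("not a wheel filename: {!r}".format(filename))
    stem = filename[:-len(".whl")]
    name, version, python_tag, abi_tag, platform_tag = stem.split("-")
    return {
        "name": name,
        "version": version,
        "python_tag": python_tag,
        "abi_tag": abi_tag,
        "platform_tag": platform_tag,
    }


info = parse_wheel_name("sentencepiece-0.0.6-cp36-cp36m-manylinux1_x86_64.whl")
print(info["python_tag"], info["abi_tag"], info["platform_tag"])
# cp36 cp36m manylinux1_x86_64
```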

Hashes for sentencepiece-0.0.6.tar.gz

Algorithm	Hash digest
SHA256	`d101a584732789a2c1b3e633a0b1a8e367c86d2a607c5b911080e88f50483f4d`
MD5	`7f9da5fb9d1d490cfb36e16466605920`
BLAKE2b-256	`8a1e5eaa80c2a01d0d40face64d7de2d525c9d77d6ffb15f74377eb373552d7a`

Hashes for sentencepiece-0.0.6-cp36-cp36m-manylinux1_x86_64.whl

Algorithm	Hash digest
SHA256	`1155894efefa102d759c9fc7df37c7ff0a8f08e4ee484196b175116a75e73b8c`
MD5	`3093b8328c333634d231a0591550ef18`
BLAKE2b-256	`89a5780157c15e2d4e64ebe7e233ed27d4ddeb5b8100fbd1cd361b4d0d87fa71`

Hashes for sentencepiece-0.0.6-cp36-cp36m-manylinux1_i686.whl

Algorithm	Hash digest
SHA256	`ae26889c635848e20a8ea9a22ec337b774ced751b475eb21d1aff858e42b38c9`
MD5	`4ad968a56a9616de7fbe0a56a9c2bcf8`
BLAKE2b-256	`b2db064f42ca61d26f02e980ed9d0b220541a08451a552f3164f480c4f629375`

Hashes for sentencepiece-0.0.6-cp35-cp35m-manylinux1_x86_64.whl

Algorithm	Hash digest
SHA256	`0b69486f2b3532735a34307befc1173ed27d71fdcf10cbe94ca6e5d43e717c59`
MD5	`c127e60d0fa5a9267f7527af7a49239b`
BLAKE2b-256	`7ef73e502d84e4b03a7f8ee39eea9e079402b73020bca8e6a13171441fcf704e`

Hashes for sentencepiece-0.0.6-cp35-cp35m-manylinux1_i686.whl

Algorithm	Hash digest
SHA256	`9a1d149534fe528336a168bc9cdcfaea9c0b0a58d427dac0c765bba188a817da`
MD5	`63e1882fd1e528278db03b061bff5a23`
BLAKE2b-256	`3a66dd9e79d228d9d677fdcc6566bfb3e548c7b420530cfb233cb6a65b6cd3bb`

Hashes for sentencepiece-0.0.6-cp34-cp34m-manylinux1_x86_64.whl

Algorithm	Hash digest
SHA256	`5e03c1782e81df6c26c532632f6231764906e42efde36fbd89224c2f56a2c54e`
MD5	`798c392614e641835b125c1f7b3df7d2`
BLAKE2b-256	`0bf72de8a524d0e44adb0052f9ee67258048b6f8ce64aacbb35586dbe42ae608`

Hashes for sentencepiece-0.0.6-cp34-cp34m-manylinux1_i686.whl

Algorithm	Hash digest
SHA256	`35d2b7be827d784e2ea7bb8c0ce6e8fa28ed07521b9d0a6a4afe1f5769fcb742`
MD5	`25001e52f744ae2355dd8b897b4d269d`
BLAKE2b-256	`080795ff4738933ee109dd2edc5ac6ed6ce0d8e3b17761cbce734e863edfbef8`

Hashes for sentencepiece-0.0.6-cp33-cp33m-manylinux1_x86_64.whl

Algorithm	Hash digest
SHA256	`46f3ff24afd75b62c67aceedfe9900f7325ce67d3aac3c06b0cc8e0a971cebcf`
MD5	`f2063f9efd393d281b2b3a7807b40aa7`
BLAKE2b-256	`0d0230bed1d5fb6e4c2729e6b9596f82d8e9b59c87919bfbea7878f47901006b`

Hashes for sentencepiece-0.0.6-cp33-cp33m-manylinux1_i686.whl

Algorithm	Hash digest
SHA256	`786964dfab3f9061969b114cc1aa18460b0525ef20fe73f98dc6f6858442f372`
MD5	`1ee2ff5123983be569db41dbc49313b4`
BLAKE2b-256	`56976daa76674d24e235accead0f1e4eb811839e28f368350deb6d03f622309f`

Hashes for sentencepiece-0.0.6-cp27-cp27mu-manylinux1_x86_64.whl

Algorithm	Hash digest
SHA256	`2795c949885a9e13ddceb8f1f364892e69cb49a674534af3b20405b2d087dbd3`
MD5	`b2a05788fbbcbe932cbe00412cc866df`
BLAKE2b-256	`2bd219c235803c524252363155c49514ec725d62e3738f460d33850c40f1df9b`

Hashes for sentencepiece-0.0.6-cp27-cp27mu-manylinux1_i686.whl

Algorithm	Hash digest
SHA256	`f200a927950bf1bdfb4b9c4724459e74fbe48b458dfb98afdf6a74921ad4dfa4`
MD5	`41d8ca99c47fe52689c066b6e15f6f30`
BLAKE2b-256	`d972f0341c31a4b88ff04cdb547f76dc1f4208f455a09fc7da9685916b29c24e`

Hashes for sentencepiece-0.0.6-cp27-cp27m-manylinux1_x86_64.whl

Algorithm	Hash digest
SHA256	`69df81c3aad62aa40d8b85d96c0ab77c70e1fe069c3cd294b2e600fd4c2c7ba0`
MD5	`1d5c9ba6f8c4c3d71cf7c8aeb81df3df`
BLAKE2b-256	`a95b55729d35a0f41db657b7a306d6767ecdfa7bc0dfdc1253b0d46d56f1a553`

Hashes for sentencepiece-0.0.6-cp27-cp27m-manylinux1_i686.whl

Algorithm	Hash digest
SHA256	`f7b7aa8b96c41f149b031e1b638f4d2abd82a4a0e12b1909760a97b58dd0ce6b`
MD5	`ca0a402a7465a9ef9adbb939c2c2e784`
BLAKE2b-256	`0fc27c81e65d7de15a927a9c8c2855267363982f6cfcaa1e0d0f64b84f2f298c`
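
The digests listed above can be compared against a downloaded file with Python's standard `hashlib` module. The following is a minimal sketch: `file_hexdigest` and `digest_matches` are hypothetical helper names, and the `"blake2b-256"` label is assumed here to mean BLAKE2b with a 32-byte digest.

```python
import hashlib


def file_hexdigest(path, algorithm="sha256", chunk_size=1 << 20):
    """Return the hex digest of a file, reading it in chunks."""
    if algorithm == "blake2b-256":
        # Assumption: BLAKE2b-256 means BLAKE2b truncated to a 32-byte digest.
        hasher = hashlib.blake2b(digest_size=32)
    else:
        hasher = hashlib.new(algorithm)
    with open(path, "rb") as handle:
        # iter() with a sentinel stops once read() returns an empty bytes object.
        for chunk in iter(lambda: handle.read(chunk_size), b""):
            hasher.update(chunk)
    return hasher.hexdigest()


def digest_matches(path, expected_hex, algorithm="sha256"):
    """Compare a file's digest with an expected hex string, ignoring case."""
    return file_hexdigest(path, algorithm) == expected_hex.lower()


# Usage, assuming the source distribution has been downloaded to the
# current directory:
# digest_matches(
#     "sentencepiece-0.0.6.tar.gz",
#     "d101a584732789a2c1b3e633a0b1a8e367c86d2a607c5b911080e88f50483f4d",
# )
```

Reading in chunks keeps memory use flat regardless of file size, which matters little for these 1.5 MB wheels but costs nothing.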