Byte pair encoding for graceful handling of rare words in NLP
BPE
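Also known as byte pair encoding. This package learns a vocabulary and a byte pair encoding from the whitespace-separated text you provide, so that rare or unseen words can be represented as sequences of smaller subword tokens.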
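To make the idea concrete, here is a minimal sketch of the classic merge-based formulation of byte pair encoding: repeatedly find the most frequent adjacent symbol pair and fuse it into a new symbol. It illustrates the general technique and is not necessarily how this package learns its vocabulary. The names `merge_pair`, `learn_bpe_merges`, and `apply_merges` are hypothetical helpers invented for this example.

```python
from collections import Counter


def merge_pair(symbols, pair):
    """Fuse every adjacent occurrence of `pair` in a sequence of symbols."""
    out, i = [], 0
    while i < len(symbols):
        if i + 1 < len(symbols) and (symbols[i], symbols[i + 1]) == pair:
            out.append(symbols[i] + symbols[i + 1])
            i += 2
        else:
            out.append(symbols[i])
            i += 1
    return out


def learn_bpe_merges(text, num_merges):
    """Learn up to `num_merges` merges from whitespace-separated text."""
    # Each distinct word starts as a tuple of characters with its frequency.
    words = Counter(tuple(word) for word in text.lower().split())
    merges = []
    for _ in range(num_merges):
        # Count adjacent symbol pairs, weighted by word frequency.
        pairs = Counter()
        for symbols, freq in words.items():
            for pair in zip(symbols, symbols[1:]):
                pairs[pair] += freq
        if not pairs:
            break  # every word is already a single symbol
        # Most frequent pair; ties are broken by comparing the pairs themselves.
        best = max(pairs, key=lambda p: (pairs[p], p))
        merges.append(best)
        # Rewrite the word table with the chosen pair fused.
        updated = Counter()
        for symbols, freq in words.items():
            updated[tuple(merge_pair(symbols, best))] += freq
        words = updated
    return merges


def apply_merges(word, merges):
    """Segment a single word by replaying the learned merges in order."""
    symbols = list(word)
    for pair in merges:
        symbols = merge_pair(symbols, pair)
    return symbols


merges = learn_bpe_merges("low low low lower lowest", 2)
print(merges)                         # [('o', 'w'), ('l', 'ow')]
print(apply_merges("lowly", merges))  # ['low', 'l', 'y']
```

The unseen word "lowly" is still representable: the frequent prefix "low" becomes one token and the remainder falls back to single characters.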
Usage
$ pip3 install bpe
from bpe import Encoder
# Generated with http://pythonpsum.com
test_corpus = '''
Object raspberrypi functools dict kwargs. Gevent raspberrypi functools. Dunder raspberrypi decorator dict didn't lambda zip import pyramid, she lambda iterate?
Kwargs raspberrypi diversity unit object gevent. Import fall integration decorator unit django yield functools twisted. Dunder integration decorator he she future. Python raspberrypi community pypy. Kwargs integration beautiful test reduce gil python closure. Gevent he integration generator fall test kwargs raise didn't visor he itertools...
Reduce integration coroutine bdfl he python. Cython didn't integration while beautiful list python didn't nit!
Object fall diversity 2to3 dunder script. Python fall for: integration exception dict kwargs dunder pycon. Import raspberrypi beautiful test import six web. Future integration mercurial self script web. Return raspberrypi community test she stable.
Django raspberrypi mercurial unit import yield raspberrypi visual rocksdahouse. Dunder raspberrypi mercurial list reduce class test scipy helmet zip?
'''
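# Vocabulary size of 200, with pct_bpe=0.88 reserving most of it for byte-pair tokens
# rather than whole words. Params chosen for demonstration purposes.
encoder = Encoder(200, pct_bpe=0.88)
# Fit on an iterable of strings, here one per line of the corpus
encoder.fit(test_corpus.split('\n'))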
example = "Vizzini: He didn't fall? INCONCEIVABLE!"
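# Words seen in the corpus ('he', 'fall') come back whole; anything else is split
# into subword pieces wrapped in __sow / __eow (start/end of word) markers
print(encoder.tokenize(example))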
# ['__sow', 'vi', 'z', 'zi', 'ni', '__eow', '__sow', ':', '__eow', 'he', 'didn', "'", 't', 'fall', '__sow', '?', '__eow', '__sow', 'in', 'co', 'n', 'ce', 'iv', 'ab', 'le', '__eow', '__sow', '!', '__eow']
# transform yields one list of vocabulary indices per input string
print(next(encoder.transform([example])))
# [26, 108, 79, 104, 72, 24, 26, 117, 24, 9, 11, 8, 12, 10, 26, 90, 24, 26, 154, 56, 37, 149, 80, 169, 84, 24, 26, 156, 24]
# The round trip is lossy: text comes back lowercased, with tokens separated by spaces
print(next(encoder.inverse_transform(encoder.transform([example]))))
# vizzini : he didn ' t fall ? inconceivable !
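To show how the marker tokens in the `tokenize` output relate to the round-tripped string, the following sketch regroups subword pieces into words by hand. It uses only plain Python and the token list printed above. `group_subwords` is a hypothetical helper, not part of the package, and the marker strings are taken from the sample output rather than from the library's documentation.

```python
def group_subwords(tokens, sow="__sow", eow="__eow"):
    """Rebuild words from a token list that mixes whole words and subword runs."""
    words, current = [], None
    for token in tokens:
        if token == sow:
            current = []                 # start collecting subword pieces
        elif token == eow:
            if current is not None:
                words.append("".join(current))
            current = None               # back to whole-word mode
        elif current is not None:
            current.append(token)        # piece of the word being rebuilt
        else:
            words.append(token)          # whole-word token, already complete
    return words


# Token list copied from the tokenize() output above.
tokens = ['__sow', 'vi', 'z', 'zi', 'ni', '__eow', '__sow', ':', '__eow',
          'he', 'didn', "'", 't', 'fall', '__sow', '?', '__eow',
          '__sow', 'in', 'co', 'n', 'ce', 'iv', 'ab', 'le', '__eow',
          '__sow', '!', '__eow']

print(" ".join(group_subwords(tokens)))
# vizzini : he didn ' t fall ? inconceivable !
```

The result matches the `inverse_transform` output shown above, which suggests the markers carry all the information needed to reassemble words from their pieces.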