Byte pair encoding for graceful handling of rare words in NLP

BPE

Also known as Byte Pair Encoding. Learns a vocabulary and a byte pair encoding from the provided whitespace-separated text.
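
To make the idea of "learning a byte pair encoding" concrete, here is a minimal sketch of the classic greedy merge procedure: repeatedly find the most frequent adjacent symbol pair and fuse it into one symbol. This is an illustration only and is not necessarily how this package builds its vocabulary. The function name `learn_bpe_merges`, its parameters, and the tie-breaking rule are all assumptions made for the example.

```python
from collections import Counter


def learn_bpe_merges(text, num_merges):
    """Learn up to `num_merges` pair merges from whitespace-separated text.

    Hypothetical helper for illustration; not part of the bpe package.
    """
    # Each word starts as a tuple of single characters, with its frequency.
    word_counts = Counter(tuple(word) for word in text.lower().split())
    merges = []
    for _ in range(num_merges):
        # Count adjacent symbol pairs, weighted by how often the word occurs.
        pair_counts = Counter()
        for symbols, count in word_counts.items():
            for pair in zip(symbols, symbols[1:]):
                pair_counts[pair] += count
        if not pair_counts:
            break  # every word is already a single symbol
        # Most frequent pair wins; ties go to the lexicographically larger
        # pair so the result is deterministic (an arbitrary choice).
        best = max(pair_counts, key=lambda p: (pair_counts[p], p))
        # Rewrite every word with the chosen pair fused into one symbol.
        merged = Counter()
        for symbols, count in word_counts.items():
            out, i = [], 0
            while i < len(symbols):
                if i + 1 < len(symbols) and (symbols[i], symbols[i + 1]) == best:
                    out.append(symbols[i] + symbols[i + 1])
                    i += 2
                else:
                    out.append(symbols[i])
                    i += 1
            merged[tuple(out)] += count
        word_counts = merged
        merges.append(best)
    return merges


merges = learn_bpe_merges("low low low lower lowest", 2)
print(merges)
```

Frequent character sequences end up as single vocabulary entries, while rare words can still be spelled out from smaller pieces.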

Usage

$ pip3 install bpe

from bpe import Encoder

# Generated with http://pythonpsum.com
test_corpus = '''
    Object raspberrypi functools dict kwargs. Gevent raspberrypi functools. Dunder raspberrypi decorator dict didn't lambda zip import pyramid, she lambda iterate?
    Kwargs raspberrypi diversity unit object gevent. Import fall integration decorator unit django yield functools twisted. Dunder integration decorator he she future. Python raspberrypi community pypy. Kwargs integration beautiful test reduce gil python closure. Gevent he integration generator fall test kwargs raise didn't visor he itertools...
    Reduce integration coroutine bdfl he python. Cython didn't integration while beautiful list python didn't nit!
    Object fall diversity 2to3 dunder script. Python fall for: integration exception dict kwargs dunder pycon. Import raspberrypi beautiful test import six web. Future integration mercurial self script web. Return raspberrypi community test she stable.
    Django raspberrypi mercurial unit import yield raspberrypi visual rocksdahouse. Dunder raspberrypi mercurial list reduce class test scipy helmet zip?
'''

encoder = Encoder(200, pct_bpe=0.88)  # params chosen for demonstration purposes
encoder.fit(test_corpus.split('\n'))

example = "Vizzini: He didn't fall? INCONCEIVABLE!"
print(encoder.tokenize(example))
# ['__sow', 'vi', 'z', 'zi', 'ni', '__eow', '__sow', ':', '__eow', 'he', 'didn', "'", 't', 'fall', '__sow', '?', '__eow', '__sow', 'in', 'co', 'n', 'ce', 'iv', 'ab', 'le', '__eow', '__sow', '!', '__eow']
print(next(encoder.transform([example])))
# [26, 108, 79, 104, 72, 24, 26, 117, 24, 9, 11, 8, 12, 10, 26, 90, 24, 26, 154, 56, 37, 149, 80, 169, 84, 24, 26, 156, 24]
print(next(encoder.inverse_transform(encoder.transform([example]))))
# vizzini : he didn ' t fall ? inconceivable !
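
In the sample output above, words the encoder had to split into subword pieces appear wrapped in `__sow` / `__eow` markers, while the other words appear as single tokens. The sketch below shows how such a token list maps back to the lowercased, space-separated string printed by `inverse_transform`. It is a standalone illustration that does not call the library: the helper name `detokenize` is hypothetical, and the marker strings are copied from the sample output rather than read from the package.

```python
SOW, EOW = "__sow", "__eow"  # marker strings as they appear in the sample output


def detokenize(tokens):
    """Rebuild a space-separated string from a token list (hypothetical helper)."""
    words, pieces, in_word = [], [], False
    for tok in tokens:
        if tok == SOW:
            # Start collecting subword pieces for one word.
            pieces, in_word = [], True
        elif tok == EOW:
            # Close the word by concatenating its pieces.
            words.append("".join(pieces))
            in_word = False
        elif in_word:
            pieces.append(tok)
        else:
            # Token outside any marker pair: already a whole word.
            words.append(tok)
    return " ".join(words)


tokens = ['__sow', 'vi', 'z', 'zi', 'ni', '__eow', '__sow', ':', '__eow',
          'he', 'didn', "'", 't', 'fall', '__sow', '?', '__eow',
          '__sow', 'in', 'co', 'n', 'ce', 'iv', 'ab', 'le', '__eow',
          '__sow', '!', '__eow']
print(detokenize(tokens))
```

Note that the round trip is lossy: casing is dropped and punctuation is separated from words, as the `inverse_transform` output above shows.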
