UBPE Tokenizer

UBPE stands for Universal Byte-Pair Encoding. "Universal" means that it works not only with strings but with general sequences too.

The package provides Universal Byte-Pair Encoding tokenizers:

  • UBPEClassic -- an optimized version of the classic BPE algorithm
  • UBPE -- a novel approach to BPE tokenization that lets you choose among multiple candidate encodings according to a tf-idf score or another metric; the best encoding found by this implementation was shorter than the one produced by the classic implementation
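
For readers new to BPE, the sketch below shows classic byte-pair merging applied to general sequences rather than only to strings. It is an illustration of the general technique, not the `ubpe` API or its implementation: the names `Merged`, `merge_pair`, `train_bpe`, and `encode` are hypothetical, and the stopping rule (stop when the most frequent pair occurs fewer than twice) is an assumption made for the example.

```python
from collections import Counter
from dataclasses import dataclass
from typing import Hashable, Iterable, List, Sequence, Tuple


@dataclass(frozen=True)
class Merged:
    """Hypothetical marker for a token created by the i-th merge."""
    index: int


Pair = Tuple[Hashable, Hashable]


def merge_pair(seq: Sequence[Hashable], pair: Pair, new_token: Hashable) -> List[Hashable]:
    """Replace each non-overlapping occurrence of `pair` (left to right) with `new_token`."""
    out: List[Hashable] = []
    i = 0
    while i < len(seq):
        if i + 1 < len(seq) and (seq[i], seq[i + 1]) == pair:
            out.append(new_token)
            i += 2
        else:
            out.append(seq[i])
            i += 1
    return out


def train_bpe(corpus: Iterable[Sequence[Hashable]], n_merges: int) -> List[Tuple[Pair, Merged]]:
    """Learn up to `n_merges` merge rules from a corpus of sequences of hashable items."""
    seqs = [list(s) for s in corpus]
    merges: List[Tuple[Pair, Merged]] = []
    for _ in range(n_merges):
        # Count adjacent pairs across the whole corpus.
        counts: Counter = Counter()
        for s in seqs:
            counts.update(zip(s, s[1:]))
        if not counts:
            break
        pair, freq = counts.most_common(1)[0]
        if freq < 2:  # assumption: merging a pair seen once is not useful
            break
        new_token = Merged(len(merges))
        merges.append((pair, new_token))
        seqs = [merge_pair(s, pair, new_token) for s in seqs]
    return merges


def encode(seq: Sequence[Hashable], merges: List[Tuple[Pair, Merged]]) -> List[Hashable]:
    """Apply the learned merges to a new sequence, in the order they were learned."""
    out = list(seq)
    for pair, new_token in merges:
        out = merge_pair(out, pair, new_token)
    return out


if __name__ == "__main__":
    # The items are integers here, but any hashable items (characters, bytes, tuples) work.
    rules = train_bpe([[1, 2, 3, 1, 2], [1, 2, 4]], n_merges=1)
    print(encode([1, 2, 3, 1, 2], rules))
```

Because the sketch only requires that items be hashable and comparable, the same functions accept strings, byte strings, or lists of arbitrary IDs, which is the sense of "universal" described above.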

Guides and theory

Installation

I am planning to deliver several implementations of the algorithm, so the package is split into a general import package (this one) and separate implementation packages (for now, only a native Python one). To install, use:

pip install ubpe[native]

Contribution

I am pretty sure that the native Python implementation is not optimal, so feel free to propose optimizations and report bugs.

A Cython implementation and a Rust implementation with Python bindings are planned (not to compete with Hugging Face, just because).

P.S. If you work at Hugging Face, you can write to me and hire me. Please.


Source Distribution

ubpe-0.1.1.tar.gz (2.3 kB view details)

Uploaded Source

Built Distribution

ubpe-0.1.1-py3-none-any.whl (2.9 kB view details)

Uploaded Python 3

File details

Details for the file ubpe-0.1.1.tar.gz.

File metadata

  • Download URL: ubpe-0.1.1.tar.gz
  • Upload date:
  • Size: 2.3 kB
  • Tags: Source
  • Uploaded using Trusted Publishing? Yes
  • Uploaded via: twine/6.1.0 CPython/3.13.7

File hashes

Hashes for ubpe-0.1.1.tar.gz
Algorithm Hash digest
SHA256 fcc88e97597f4fb68a9d2a086b31d68f533aa31fb6452b8c39cb744df7c30a1f
MD5 b94e0e4d79f945e589e5bb5fd4bd7e91
BLAKE2b-256 c8c4ae38d81fd3b404d277f99d6d775ac8b337c090b24b0d2b4e41c8bcc4188f

Provenance

The following attestation bundles were made for ubpe-0.1.1.tar.gz:

Publisher: pypi-publish.yml on Scurrra/ubpe

Attestations: Values shown here reflect the state when the release was signed and may no longer be current.

File details

Details for the file ubpe-0.1.1-py3-none-any.whl.

File metadata

  • Download URL: ubpe-0.1.1-py3-none-any.whl
  • Upload date:
  • Size: 2.9 kB
  • Tags: Python 3
  • Uploaded using Trusted Publishing? Yes
  • Uploaded via: twine/6.1.0 CPython/3.13.7

File hashes

Hashes for ubpe-0.1.1-py3-none-any.whl
Algorithm Hash digest
SHA256 8ae76a8b1fb39905e3316dc57713da0e42c518a6b775b0abd7d3c9482cbc1f34
MD5 96e762eefd702850c77a2c9b279a9cce
BLAKE2b-256 a3ddf592718bf0cf543b048be0d3e5a735ebdf097f6cdf2859d57b2e946eb2b5

Provenance

The following attestation bundles were made for ubpe-0.1.1-py3-none-any.whl:

Publisher: pypi-publish.yml on Scurrra/ubpe
