
TokenGeeX is an efficient tokenizer for code based on UnigramLM and TokenMonster.


TokenGeeX - Efficient Tokenizer for CodeGeeX

This repository holds the code for the TokenGeeX Rust crate and Python package. TokenGeeX is a tokenizer for CodeGeeX aimed at code and Chinese text. It is based on UnigramLM (Taku Kudo, 2018).
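For readers unfamiliar with UnigramLM, the sketch below shows the general idea of unigram segmentation: each vocabulary entry has a log-probability, and the tokenizer picks the split of the input with the highest total score using dynamic programming (Viterbi). This is an illustration of the technique only, not TokenGeeX's implementation. The function name `viterbi_segment` and the toy vocabulary with its scores are made up for this example.

```python
import math


def viterbi_segment(text, log_probs):
    """Return the highest-scoring segmentation of `text` under a unigram model.

    `log_probs` maps each vocabulary token to its log-probability.
    """
    n = len(text)
    # best[i] = (score of the best segmentation of text[:i], start of its last token)
    best = [(-math.inf, -1)] * (n + 1)
    best[0] = (0.0, -1)
    max_len = max(len(token) for token in log_probs)

    for end in range(1, n + 1):
        for start in range(max(0, end - max_len), end):
            piece = text[start:end]
            if piece in log_probs and best[start][0] > -math.inf:
                score = best[start][0] + log_probs[piece]
                if score > best[end][0]:
                    best[end] = (score, start)

    if best[n][0] == -math.inf:
        raise ValueError("text cannot be segmented with this vocabulary")

    # Walk the back-pointers from the end of the text to recover the tokens.
    tokens = []
    end = n
    while end > 0:
        start = best[end][1]
        tokens.append(text[start:end])
        end = start
    return tokens[::-1]


# Hypothetical toy vocabulary; the scores are invented for illustration.
toy_vocab = {
    "def": -2.0, "de": -4.0, "d": -6.0, "e": -5.0, "f": -5.0,
    " ": -1.5,
    "main": -3.0, "m": -6.0, "a": -5.0, "i": -5.0, "n": -5.0,
}

tokens = viterbi_segment("def main", toy_vocab)
print(tokens)  # ['def', ' ', 'main']
```

Longer tokens win here because a single high-probability entry scores better than the sum of several low-probability pieces.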

CLI

Exact

The most restrictive pattern. It does not allow punctuation to be mixed in with words, strictly adheres to code structure, and does not allow words that mix casing. Digits are encoded as a single token.

RUST_LOG=debug tokengeex regex --output data/exact.regex \
    $(for idiom in any-char lowercase-word uppercase-word capitalized-word english-contraction chinese-word indent few-repeated-punct-space; do echo "-i ${idiom} "; done)
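The `$(for idiom in ...; do echo "-i ${idiom} "; done)` construct in these commands is a shell loop that emits one `-i <idiom>` pair per idiom name. As a minimal sketch, the same argument list can be built in Python. The flag names and idiom names are copied from the command above. The variable names are illustrative, and the snippet only prints the command rather than running it.

```python
import shlex

# Idiom names copied from the Exact command above.
idioms = [
    "any-char",
    "lowercase-word",
    "uppercase-word",
    "capitalized-word",
    "english-contraction",
    "chinese-word",
    "indent",
    "few-repeated-punct-space",
]

# Start with the fixed part of the command, then add one "-i <idiom>" pair per idiom.
args = ["tokengeex", "regex", "--output", "data/exact.regex"]
for idiom in idioms:
    args += ["-i", idiom]

# Render the argument list as a single shell-safe string for inspection.
# RUST_LOG=debug is an environment variable and is set separately when running.
command = shlex.join(args)
print(command)
```

The other patterns on this page follow the same shape with a different output path and idiom list.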

Exact+

The pattern used for the merge step of exact vocabularies.

RUST_LOG=debug tokengeex regex --output data/exact-plus.regex \
    $(for idiom in any-char word english-word french-word chinese-word english-contraction punct-word newline-indent repeated-punct-space; do echo "-i ${idiom} "; done)

General

General-purpose pattern which is loosely analogous to GPT-4's pattern. Numbers of up to three digits are allowed.

RUST_LOG=debug tokengeex regex --output data/general.regex \
    $(for idiom in any-char word english-word french-word chinese-word english-contraction short-number punct-word newline-indent repeated-punct-space; do echo "-i ${idiom} "; done)
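To illustrate what "numbers of up to three digits" means in practice, the sketch below splits a digit run into chunks of at most three digits. The regular expression is an assumed stand-in for the `short-number` idiom, not the expression TokenGeeX generates. It is similar in spirit to the `\p{N}{1,3}` alternative found in GPT-4's `cl100k_base` pattern.

```python
import re

# Hypothetical stand-in for a "short-number" idiom: runs of one to three digits.
SHORT_NUMBER = re.compile(r"\d{1,3}")


def split_number(digits):
    """Split a digit string into left-to-right chunks of at most three digits."""
    return SHORT_NUMBER.findall(digits)


chunks = split_number("1234567")
print(chunks)  # ['123', '456', '7']
```

Because matching is greedy and proceeds left to right, a long number is cut into three-digit groups with any remainder at the end.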

General+

The pattern used for the merge step of general vocabularies.

TODO!

Idiomatic

Permissive pattern that allows some common idioms and multi-word tokens to form.

TODO!

Idiomatic+

The pattern used for the merge step of idiomatic vocabularies.

TODO!

Loose

Permits a wide range of patterns and idioms. Highest compression.

TODO!


