Skip to main content

Partition Cover Approach to Tokenization

Project description

A partition cover approach to tokenization

In this work, we formulate tokenization as an optimization objective, show that it is NP-hard via a simple reduction from vertex cover, and propose a polynomial-time greedy algorithm GreedTok. Our formulation naturally relaxes to the well-studied weighted maximum coverage problem which has a simple $(1 - 1/e)$-approximation greedy algorithm.

GreedTok

  1. If using python wrapper

    a. Using pip (use the lightweight source code w/o data/notebooks):

    wget "https://github.com/PreferredAI/pcatt/archive/refs/tags/v0.13.tar.gz"
    unzip pcatt-0.13.zip -d pcatt
    cd pcatt
    pip install -r requirements.txt
    pip install .
    

    b. Or compile manually e.g. (have to specify links)

    c++ -O3 -Wall -shared -std=c++20 \
    -fPIC $(python3 -m pybind11 --includes) \
    -I$CONDA_PREFIX/include/ \
    -I$CONDA_PREFIX/include/tbb \
    -I$CONDA_PREFIX/include/oneapi \
    -L$CONDA_PREFIX/lib/ \
    -l tbb \
    ./pcatt/greedy_builder.cpp \
    -o ./pcatt/greedy_builder$(python3-config --extension-suffix) 
    

    c. import and use! Examples in eval_tokenizer_example.ipynb

  2. If using C++ files directly

    a. Install dependencies for C++ code, we use oneTBB to parallelize the code, simplest way is to use Conda or pip:

    conda install tbb-devel
    

    b. Compile greedy_cache.py e.g.:

    c++ -O3 -std=c++20 \
    -I$CONDA_PREFIX/include/ \
    -I$CONDA_PREFIX/include/tbb \
    -I$CONDA_PREFIX/include/oneapi \
    -L$CONDA_PREFIX/lib/ \
    -l tbb \
    pcatt/greedy_cache.cpp \
    -o pcatt/greedy.exe 
    

    c. Prepare inputs (refer to cpp_inputs for examples):

    • counts: a file with '\n' delimited integers
    • words: a file with ' ' (space) delimited words

    d. Run compiled program (currently looks for domain inputs in fixed path under cpp_inputs/*) ./greedy.exe <domain> <k> e. Now we obtained our ranked token sequence (refer to cpp_outputs for examples):

    • merges: the number of covers at each step, delimited by '\n'
    • tokens: byte sequences in hex-format, delimited by '\n'

Evaluations in eval_notebook.ipynb

Citation

TBD

Project details


Download files

Download the file for your platform. If you're not sure which to choose, learn more about installing packages.

Source Distribution

greedtok-0.13.tar.gz (10.6 kB view details)

Uploaded Source

File details

Details for the file greedtok-0.13.tar.gz.

File metadata

  • Download URL: greedtok-0.13.tar.gz
  • Upload date:
  • Size: 10.6 kB
  • Tags: Source
  • Uploaded using Trusted Publishing? No
  • Uploaded via: twine/6.0.1 CPython/3.13.1

File hashes

Hashes for greedtok-0.13.tar.gz
Algorithm Hash digest
SHA256 eb3a1b8fa34530563da85f5fef39fafb513a53f88df99d531398c15391e038d5
MD5 b13b0bdbecbcf8e8f930eb116fa7b6fd
BLAKE2b-256 96c531250566e1c0af17f671dd65038f3b7c45b9fb66c331f56a244ec73caef7

See more details on using hashes here.

Supported by

AWS Cloud computing and Security Sponsor Datadog Monitoring Depot Continuous Integration Fastly CDN Google Download Analytics Pingdom Monitoring Sentry Error logging StatusPage Status page