DOM-aware tokenization for 🤗 Hugging Face language models
Project description
DOM tokenizers
Installation
With pip
pip install dom-tokenizers[train]
From sources
git clone https://github.com/gbenson/dom-tokenizers.git
cd dom-tokenizers
python3 -m venv .venv
. .venv/bin/activate
pip install --upgrade pip
pip install -e .[dev,train]
Train a tokenizer
On the command line
Check everything's working using a small dataset of around 300 examples:
train-tokenizer gbenson/interesting-dom-snapshots
Train a tokenizer with a 10,000-token vocabulary using a dataset of 4,536 examples and upload it to the Hub:
train-tokenizer gbenson/webui-dom-snapshots -n 10000 -N 4536
huggingface-cli login
huggingface-cli upload dom-tokenizer-10k
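Under the hood, training produces an ordinary Hugging Face tokenizer. As a purely generic sketch of that kind of training loop, not dom-tokenizers' own DOM-aware pipeline, here is a minimal example using the Hugging Face tokenizers library directly; the corpus and vocabulary size are made up for illustration:

```python
from tokenizers import Tokenizer, models, pre_tokenizers, trainers

# A tiny, hypothetical corpus of serialized DOM snapshots.
corpus = [
    '<html><body><p>Hello world</p></body></html>',
    '<div class="nav"><a href="/">Home</a></div>',
]

# Generic WordPiece training with whitespace pre-tokenization;
# dom-tokenizers layers DOM-aware splitting on top of machinery
# like this rather than splitting on whitespace alone.
tokenizer = Tokenizer(models.WordPiece(unk_token="[UNK]"))
tokenizer.pre_tokenizer = pre_tokenizers.Whitespace()
trainer = trainers.WordPieceTrainer(vocab_size=100, special_tokens=["[UNK]"])
tokenizer.train_from_iterator(corpus, trainer)

print(tokenizer.get_vocab_size())
print(tokenizer.encode("<p>Hello</p>").tokens)
```

The resulting tokenizer can be saved with `tokenizer.save("tokenizer.json")` in the standard format the Hub expects.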
Download files
Source Distributions
No source distribution files are available for this release.
Built Distribution
Hashes for dom_tokenizers-0.0.14-py3-none-any.whl

Algorithm | Hash digest
---|---
SHA256 | 58d90d965b4a82830f8725538d214b3d7f52e39513d5d66d8dbb1c6ce46aaa7b
MD5 | 611c8b3c04c6b1346dbc4524c48a326e
BLAKE2b-256 | 2e01fb35c4707ace4ab5a44bfd16a3a524aafc32f97c8b4eb39e7ab79bf32817