DOM-aware tokenizers for 🤗 Hugging Face language models
Installation
With pip
pip install dom-tokenizers[train]
From sources
git clone https://github.com/gbenson/dom-tokenizers.git
cd dom-tokenizers
python3 -m venv .venv
. .venv/bin/activate
pip install --upgrade pip
pip install -e .[dev,train]
Load a pretrained tokenizer from the Hub
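Once a tokenizer is on the Hub it loads like any other Hugging Face tokenizer. A minimal sketch, assuming the repository id `gbenson/dom-tokenizer-10k` (matching the upload example below; substitute whichever tokenizer you trained or found):

```python
from transformers import AutoTokenizer

# Hypothetical repository id: replace with the DOM tokenizer you want to use.
tokenizer = AutoTokenizer.from_pretrained("gbenson/dom-tokenizer-10k")

# The loaded object behaves like any other fast tokenizer.
print(len(tokenizer))  # vocabulary size
```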
Train your own
On the command line
Check everything's working using a small dataset of around 300 examples:
train-tokenizer gbenson/interesting-dom-snapshots
Train a tokenizer with a 10,000-token vocabulary using a dataset of 4,536 examples and upload it to the Hub:
train-tokenizer gbenson/webui-dom-snapshots -n 10000 -N 4536
huggingface-cli login
huggingface-cli upload dom-tokenizer-10k
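For a feel of what a training run like this does, here is a self-contained sketch using the Hugging Face `tokenizers` library. It is an illustration only, not the project's actual pipeline: it trains a tiny WordPiece vocabulary on three hand-written DOM-like strings instead of a Hub dataset of snapshots.

```python
from tokenizers import Tokenizer
from tokenizers.models import WordPiece
from tokenizers.pre_tokenizers import Whitespace
from tokenizers.trainers import WordPieceTrainer

# Toy stand-in for a DOM snapshot dataset.
corpus = [
    '<div class="nav"><a href="/home">Home</a></div>',
    '<ul class="menu"><li>About</li><li>Contact</li></ul>',
    '<form action="/login"><input type="text" name="user"></form>',
]

# Build and train a small WordPiece tokenizer over the corpus.
tokenizer = Tokenizer(WordPiece(unk_token="[UNK]"))
tokenizer.pre_tokenizer = Whitespace()
trainer = WordPieceTrainer(vocab_size=200, special_tokens=["[UNK]", "[CLS]", "[SEP]"])
tokenizer.train_from_iterator(corpus, trainer)

# Tokenize a fresh DOM-like string with the trained vocabulary.
encoding = tokenizer.encode('<div class="nav">Home</div>')
print(encoding.tokens)
```

The real `train-tokenizer` command additionally understands DOM structure; this sketch only shows the vocabulary-training machinery such a pipeline can sit on.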
Download files

Source Distribution: dom_tokenizers-0.0.5.tar.gz (55.4 kB)

Built Distribution: dom_tokenizers-0.0.5-py3-none-any.whl
Hashes for dom_tokenizers-0.0.5-py3-none-any.whl

Algorithm | Hash digest
---|---
SHA256 | 117f19fdfa5a339f27f1b6470288b0cdb69f474d5f0bb0331497910f25774a9c
MD5 | 1cddbbffa0254e4e968be695aa5cf535
BLAKE2b-256 | 08e5beeb69da7fbe01c33f5de80bb65d59cb970ba6966d0e2f550a2f83694067