DOM tokenizers
DOM-aware tokenization for Hugging Face language models.
TL;DR
Input:
<html>
<head>
<meta http-equiv="content-type" content="text/html; charset=utf-8">
<meta name="viewport" content="width=device-width">
<title>Hello world</title>
<script>
document.getElementById("demo").innerHTML = "Hello JavaScript!";
</script>
...
Output:
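What the trained tokenizer actually emits depends on its learned vocabulary, but the rough idea is that markup structure (tag names, attribute names, attribute values, text) becomes flat word-like tokens rather than raw angle-bracket soup. As a loose illustration only (this is not dom-tokenizers' algorithm, and the class name here is made up), a naive version can be sketched with Python's standard-library html.parser:

```python
from html.parser import HTMLParser
import re


class NaiveDOMTokenizer(HTMLParser):
    """Illustrative only: emit tag names, attribute names/values,
    and text content as flat lowercase word tokens."""

    def __init__(self):
        super().__init__()
        self.tokens = []

    def _words(self, text):
        # Split on anything non-alphanumeric, so e.g.
        # "http-equiv" becomes ["http", "equiv"]
        return [w.lower() for w in re.split(r"[^A-Za-z0-9]+", text) if w]

    def handle_starttag(self, tag, attrs):
        self.tokens.append(tag)
        for name, value in attrs:
            self.tokens.extend(self._words(name))
            if value:
                self.tokens.extend(self._words(value))

    def handle_data(self, data):
        self.tokens.extend(self._words(data))


def tokenize(html):
    parser = NaiveDOMTokenizer()
    parser.feed(html)
    return parser.tokens


print(tokenize('<meta http-equiv="content-type" content="text/html">'))
# ['meta', 'http', 'equiv', 'content', 'type', 'content', 'text', 'html']
```

The real tokenizer is trained on DOM snapshots and produces a vocabulary suited to language models, but the sketch captures the gist: HTML syntax is parsed, not treated as plain text.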
Installation
With pip
pip install dom-tokenizers[train]
From sources
git clone https://github.com/gbenson/dom-tokenizers.git
cd dom-tokenizers
python3 -m venv .venv
. .venv/bin/activate
pip install --upgrade pip
pip install -e .[dev,train]
Train a tokenizer
On the command line
Check everything's working using a small dataset of around 300 examples:
train-tokenizer gbenson/interesting-dom-snapshots
Train a tokenizer with a 10,000-token vocabulary using a dataset of 4,536 examples and upload it to the Hub:
train-tokenizer gbenson/webui-dom-snapshots -n 10000 -N 4536
huggingface-cli login
huggingface-cli upload dom-tokenizer-10k