High-performance string segmentation using BiLSTM-CRF
DKSplit
v0.3.1: Model training moved to EuroHPC infrastructure (Leonardo Booster, NVIDIA A100). ~3% accuracy improvement over v0.2.x on real-world domains. API unchanged.
String segmentation using BiLSTM-CRF. Splits concatenated words into meaningful parts.
DKSplit is a lightweight model trained on millions of labeled samples covering domain names, brand names, tech terms, and multilingual phrases. It uses a BiLSTM-CRF architecture (9.47M parameters) exported to ONNX with INT8 quantization, delivering fast CPU inference in a 9 MB package.
DKSplit was originally built for domain name analysis at DomainKits, but it works well on any concatenated text: hashtags, URLs, identifiers, compound strings.
Install
pip install dksplit
Usage
import dksplit
dksplit.split("chatgptlogin")
# ['chatgpt', 'login']
dksplit.split("kubernetescluster")
# ['kubernetes', 'cluster']
dksplit.split("mercibeaucoup")
# ['merci', 'beaucoup']
dksplit.split_batch(["openaikey", "microsoftoffice", "bitcoinprice"])
# [['openai', 'key'], ['microsoft', 'office'], ['bitcoin', 'price']]
What's New in v0.3.1
Model training moved from AWS to the EuroHPC Leonardo Booster (NVIDIA A100), with an optimized training configuration for better generalization. Accuracy improved on real-world domains, especially for brand names, multilingual inputs, and edge cases. The API is unchanged.
pip install --upgrade dksplit
Examples of improvements:
| Input | v0.2.x | v0.3.1 |
|---|---|---|
| cloudflarecdn | cloud flare cdn | cloudflare cdn |
| databricks | data bricks | databricks |
| instacart | insta cart | instacart |
| robinhood | robin hood | robinhood |
| mailchimp | mail chimp | mailchimp |
Benchmark
Dataset
1,000 newly registered .com domains, randomly sampled from the ABTdomain.com daily feed (April 8, 2026). No filtering or cherry-picking. Ground truth was established through multi-model cross-validation (BiLSTM, Qwen 9B LoRA, Gemma 31B) followed by a human audit.
The dataset and evaluation script are available on GitHub.
Results
Accuracy on 1,000 randomly sampled real-world .com domains, human-audited ground truth:
| Model | Accuracy |
|---|---|
| DKSplit v0.3.1 | 85.0% |
| DKSplit v0.2.x | 82.8% |
| WordSegment | 54.0% |
| WordNinja | 46.1% |
DKSplit outperforms WordSegment by 31 percentage points and WordNinja by 39 percentage points.
Note: The accuracy above is measured against a single reference segmentation. Domain names are inherently ambiguous. For example, "tiantian5" could be "tiantian 5" (Chinese compound name) or "tian tian 5" (two separate syllables); "noranite" could be "nora nite" or an intact brand; "pikahug" could be "pika hug" or an intact brand name. Our audit found ~5% of test samples have multiple valid segmentations. Accounting for these, effective accuracy is closer to 90%.
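The difference between strict and lenient scoring can be illustrated with a small sketch. It is not the project's evaluation script: `exact_match_accuracy` is a hypothetical helper, and the predictions below are made up for illustration rather than produced by DKSplit. The idea is that a prediction counts as correct if it equals any accepted reference for that input.

```python
from typing import Dict, List, Sequence


def exact_match_accuracy(
    predictions: Dict[str, List[str]],
    references: Dict[str, Sequence[List[str]]],
) -> float:
    """Fraction of inputs whose prediction equals any accepted reference."""
    if not references:
        raise ValueError("references must not be empty")
    hits = 0
    for text, accepted in references.items():
        # A missing prediction (None) never matches a reference.
        if predictions.get(text) in accepted:
            hits += 1
    return hits / len(references)


# Made-up predictions for illustration only (not DKSplit output).
predictions = {
    "tiantian5": ["tian", "tian", "5"],
    "noranite": ["nora", "nite"],
    "pikahug": ["pika", "hug"],
    "bluesky": ["blues", "ky"],
}

# Strict: one reference segmentation per input.
strict = {
    "tiantian5": [["tiantian", "5"]],
    "noranite": [["noranite"]],
    "pikahug": [["pika", "hug"]],
    "bluesky": [["blue", "sky"]],
}

# Lenient: ambiguous inputs list every segmentation judged valid.
lenient = {
    "tiantian5": [["tiantian", "5"], ["tian", "tian", "5"]],
    "noranite": [["noranite"], ["nora", "nite"]],
    "pikahug": [["pika", "hug"], ["pikahug"]],
    "bluesky": [["blue", "sky"]],
}

strict_acc = exact_match_accuracy(predictions, strict)    # 1 of 4 correct
lenient_acc = exact_match_accuracy(predictions, lenient)  # 3 of 4 correct
print(strict_acc, lenient_acc)
```

On this toy set the strict score is 0.25 and the lenient score is 0.75. The gap comes entirely from inputs with more than one accepted segmentation, which is the effect the note describes.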
Comparison
| Input | DKSplit v0.3.1 | WordSegment | WordNinja |
|---|---|---|---|
| chatgptprompts | chatgpt prompts | chat gpt prompts | chat gp t prompts |
| tensorflowserving | tensorflow serving | tensor flow serving | tensor flow serving |
| spotifywrapped | spotify wrapped | spot if y wrapped | spot if y wrapped |
| ethereumwallet | ethereum wallet | e there um wallet | e there um wallet |
| cloudflarecdn | cloudflare cdn | cloud flare cdn | cloud flare cd n |
| kubernetescluster | kubernetes cluster | ku bernet es cluster | ku berne tes cluster |
| hackathonwinners | hackathon winners | hackathon winners | hack a th on winners |
| whatsappstatus | whatsapp status | what sapp status | what s app status |
| drwatsonai | dr watson ai | dr watson a i | dr watson a i |
| escribirenvozalta | escribir en voz alta | escribir env oz alta | es crib ire nv oz alta |
| tuvasou | tu vas ou | tuva sou | tuva so u |
| candidiasenuncamais | candidiase nunca mais | candid iase nunca mais | can didi as e nun cama is |
| robertdeniro | robert de niro | robert deniro | robert deniro |
| mercibeaucoup | merci beaucoup | merci beaucoup | mer ci beau coup |
How It Works
DKSplit treats segmentation as a sequence labeling task.
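As a rough illustration of what sequence labeling means here, each character can be given a tag, and the segmentation is read off from the tags. The sketch below assumes a two-tag scheme (`B` starts a new word, `I` continues the current one). This scheme and both helper names are assumptions for illustration; the tag set DKSplit actually uses is not documented here.

```python
from typing import List


def segments_to_labels(segments: List[str]) -> List[str]:
    """Turn a segmentation into one B/I tag per character."""
    labels: List[str] = []
    for seg in segments:
        if not seg:
            continue  # ignore empty segments
        labels.extend(["B"] + ["I"] * (len(seg) - 1))
    return labels


def labels_to_segments(text: str, labels: List[str]) -> List[str]:
    """Rebuild the segments from per-character B/I tags."""
    if len(text) != len(labels):
        raise ValueError("need exactly one label per character")
    segments: List[str] = []
    for ch, label in zip(text, labels):
        if label == "B" or not segments:
            segments.append(ch)   # start a new segment
        else:
            segments[-1] += ch    # extend the current segment
    return segments


labels = segments_to_labels(["chatgpt", "login"])
print("".join(labels))                             # BIIIIIIBIIII
print(labels_to_segments("chatgptlogin", labels))  # ['chatgpt', 'login']
```

Under this framing, training data is a set of strings with per-character tags, and inference predicts the tag sequence for a new string.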
The training data includes:
- LLM-labeled domain name segmentations
- Brand names
- Personal name combinations
- Multilingual phrases (English, French, German, Spanish, and more)
- Tech product names and terminology
At inference, the BiLSTM runs as an INT8-quantized ONNX model and CRF decoding is performed in NumPy — no GPU required.
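CRF decoding is usually done with the Viterbi algorithm, which needs only array operations. The following is a minimal NumPy sketch of that general technique, not DKSplit's own decoder: the function name, the two-tag setup (0 = start of word, 1 = continuation), and the score values are all assumptions chosen for illustration. In a BiLSTM-CRF, the emission scores would come from the network and the transition scores from the trained CRF layer.

```python
import numpy as np


def viterbi_decode(emissions: np.ndarray, transitions: np.ndarray) -> list:
    """Return the highest-scoring tag sequence.

    emissions:   (seq_len, num_tags) score of each tag at each position.
    transitions: (num_tags, num_tags), entry [i, j] scores moving from tag i to tag j.
    Assumes seq_len >= 1.
    """
    seq_len, num_tags = emissions.shape
    score = emissions[0].copy()
    backptr = np.zeros((seq_len, num_tags), dtype=np.int64)
    for t in range(1, seq_len):
        # candidate[i, j]: best path ending in tag i at t-1, then stepping to tag j
        candidate = score[:, None] + transitions
        backptr[t] = candidate.argmax(axis=0)
        score = candidate.max(axis=0) + emissions[t]
    # Follow the back-pointers from the best final tag.
    best = [int(score.argmax())]
    for t in range(seq_len - 1, 0, -1):
        best.append(int(backptr[t, best[-1]]))
    return best[::-1]


# Illustrative scores: position 1 slightly prefers tag 0 on its own,
# but the 0 -> 0 transition is heavily penalized.
emissions = np.array([[1.0, 0.0], [0.6, 0.5], [0.0, 1.0]])
transitions = np.array([[-2.0, 0.0], [0.0, 0.0]])
path = viterbi_decode(emissions, transitions)
print(path)  # [0, 1, 1]
```

Picking the best tag at each position independently would give `[0, 0, 1]` here. The transition penalty is what moves the middle tag to 1, which is the kind of global consistency a CRF layer adds on top of per-character scores.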
Features
- Brand-aware: Recognizes thousands of brands, tech products, and proper nouns
- Multilingual: Handles English, French, German, Spanish, and romanized text
- Lightweight: 9 MB model, minimal dependencies (numpy + onnxruntime)
- Offline: No API keys, no internet required
Limitations
- Characters: Only a-z and 0-9. Input is automatically lowercased.
- Max length: 64 characters.
- Script: Latin script only. Non-Latin scripts (汉字, かな, 한글, العربية) are not supported.
- Ambiguity: Some inputs are genuinely ambiguous. DKSplit optimizes for the most common interpretation.
- Rare languages: Accuracy is highest on English and major European languages.
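Given the character and length limits above, callers may want to clean raw text before segmenting it. The helper below is a sketch under assumptions: `prepare_input` is a hypothetical name, and dropping unsupported characters is one possible policy. How DKSplit itself treats characters outside a-z and 0-9 is not stated here.

```python
import re
from typing import Optional

MAX_LEN = 64  # maximum input length listed in the limitations
_NOT_ALLOWED = re.compile(r"[^a-z0-9]")


def prepare_input(raw: str) -> Optional[str]:
    """Lowercase, drop characters outside a-z and 0-9, reject empty or over-long results."""
    cleaned = _NOT_ALLOWED.sub("", raw.lower())
    if not cleaned or len(cleaned) > MAX_LEN:
        return None
    return cleaned


print(prepare_input("#ChatGPT-Login"))  # chatgptlogin
print(prepare_input("汉字"))             # None (nothing left after filtering)
print(prepare_input("a" * 65))          # None (longer than 64 characters)
```

A cleaned string can then be passed to `dksplit.split`, and a `None` result signals that the input should be skipped or handled separately.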
Requirements
- Python >= 3.8
- numpy
- onnxruntime
Links
- Website: domainkits.com, ABTdomain.com
- GitHub: github.com/ABTdomain/dksplit
- PyPI: pypi.org/project/dksplit
- Issues: GitHub Issues
License
This project is licensed under the Apache License 2.0.
Please attribute as: DKsplit by ABTdomain
Acknowledgements
The v0.3.1 model was trained on the Leonardo Booster supercomputer at CINECA, Italy, with computing resources provided by the EuroHPC Joint Undertaking through the Playground Access program (project AIFAC_P02_281). We thank EuroHPC JU for enabling SMEs to explore new possibilities with world-class HPC infrastructure.