High-performance string segmentation using BiLSTM-CRF
DKSplit
v0.2.3: Retrained model with expanded brand and name coverage. ~7% accuracy improvement on real-world domains. API unchanged; just run pip install --upgrade dksplit.
String segmentation using BiLSTM-CRF. Splits concatenated words into meaningful parts.
DKSplit is a lightweight model trained on millions of labeled samples covering domain names, brand names, tech terms, and multilingual phrases. It uses a BiLSTM-CRF architecture (9.47M parameters) exported to ONNX with INT8 quantization, delivering fast CPU inference in a 9 MB package.
DKSplit was originally built for domain name analysis at DomainKits, but it works well on any concatenated text: hashtags, URLs, identifiers, and compound strings.
Install
pip install dksplit
Usage
import dksplit
dksplit.split("chatgptlogin")
# ['chatgpt', 'login']
dksplit.split("kubernetescluster")
# ['kubernetes', 'cluster']
dksplit.split("mercibeaucoup")
# ['merci', 'beaucoup']
dksplit.split_batch(["openaikey", "microsoftoffice", "bitcoinprice"])
# [['openai', 'key'], ['microsoft', 'office'], ['bitcoin', 'price']]
What's New in v0.2.3
Retrained model with significantly expanded brand and name coverage. The API is unchanged — just upgrade.
pip install --upgrade dksplit
Examples of improvements:
| Input | v0.1.0 | v0.2.3 |
|---|---|---|
| cloudflarecdn | cloud flare cdn | cloudflare cdn |
| snowdenyes | snow den yes | snowden yes |
| robertdeniro | robert deniro | robert de niro |
Benchmark
Dataset
1,000 newly registered .com domains, randomly sampled from the ABTdomain.com daily feed (February 8, 2026). No filtering or cherry-picking: this is a raw random sample of real-world domain registrations. GPT-5.2 segmentation is used as the reference answer.
The dataset and evaluation script are available on GitHub.
Results
Accuracy (agreement with GPT-5.2 on 1,000 real domains):
| Model | Accuracy |
|---|---|
| DKSplit v0.2.3 | 80.5% |
| WordSegment | 59.1% |
| WordNinja | 47.6% |
DKSplit outperforms WordSegment by 21 percentage points and WordNinja by 33 percentage points.
Note: The remaining ~20% disagreement with GPT-5.2 largely comes from rare languages, invented words, and genuinely ambiguous cases (e.g., is christianalucas → "christiana lucas" or "christian a lucas"?). On standard English inputs, agreement is significantly higher.
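As a rough illustration of how an agreement score like the one above could be computed, here is a minimal sketch. It assumes accuracy means exact-match agreement between a model's segmentation and the reference segmentation for each domain; the actual evaluation script on GitHub may differ. The function name `agreement` and the toy data are hypothetical and are not taken from the benchmark.

```python
from typing import Dict, List


def agreement(predictions: Dict[str, List[str]],
              reference: Dict[str, List[str]]) -> float:
    """Fraction of reference inputs whose predicted segmentation matches exactly.

    Assumption: a prediction only counts if every word boundary agrees.
    """
    if not reference:
        raise ValueError("reference set is empty")
    matches = sum(
        1 for text, expected in reference.items()
        if predictions.get(text) == expected
    )
    return matches / len(reference)


# Toy data for illustration only (not the benchmark dataset).
reference = {
    "chatgptlogin": ["chatgpt", "login"],
    "mercibeaucoup": ["merci", "beaucoup"],
}
predictions = {
    "chatgptlogin": ["chat", "gpt", "login"],   # one boundary too many
    "mercibeaucoup": ["merci", "beaucoup"],     # exact match
}
print(agreement(predictions, reference))  # 0.5
```

Exact match is a strict metric: a segmentation with a single misplaced boundary scores the same as one that is entirely wrong.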
Comparison
| Input | DKSplit v0.2.3 | WordSegment | WordNinja |
|---|---|---|---|
| chatgptlogin | chatgpt login | chat gpt login | chat gp t login |
| cloudflarecdn | cloudflare cdn | cloud flare cdn | cloud flare cd n |
| kubernetescluster | kubernetes cluster | ku bernet es cluster | ku berne tes cluster |
| instagramlogin | instagram login | insta gram login | insta gram login |
| ethereumwallet | ethereum wallet | e there um wallet | e there um wallet |
| spotifyplaylist | spotify playlist | spot if y playlist | spot if y playlist |
| lululemonoutlet | lululemon outlet | lululemon outlet | lulu lemon outlet |
| tensorflowlite | tensorflow lite | tensor flow lite | tensor flow lite |
| mercibeaucoup | merci beaucoup | merci beaucoup | mer ci beau coup |
| robertdeniro | robert de niro | robert deniro | robert deniro |
| snowdenyes | snowden yes | snowden yes | snow deny es |
| youtubedownloader | youtube downloader | youtube downloader | youtube down loader |
How It Works
DKSplit treats segmentation as a sequence labeling task.
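To make "sequence labeling" concrete, the sketch below assigns one label per character and cuts the string wherever a word begins. The two-tag scheme (`B` for the first character of a word, `I` for every other character) is an assumption chosen for illustration; the tag set DKSplit actually uses is not documented here. Both helper functions are hypothetical and not part of the dksplit API.

```python
from typing import List


def segments_to_labels(segments: List[str]) -> List[str]:
    """Encode a known segmentation as one label per character."""
    labels: List[str] = []
    for seg in segments:
        if seg:  # skip empty segments
            labels.extend(["B"] + ["I"] * (len(seg) - 1))
    return labels


def labels_to_segments(text: str, labels: List[str]) -> List[str]:
    """Decode per-character labels back into words by cutting before every B."""
    if len(text) != len(labels):
        raise ValueError("need exactly one label per character")
    segments: List[str] = []
    for ch, label in zip(text, labels):
        if label == "B" or not segments:
            segments.append(ch)      # start a new word
        else:
            segments[-1] += ch       # extend the current word
    return segments


labels = segments_to_labels(["chatgpt", "login"])
print("".join(labels))                              # BIIIIIIBIIII
print(labels_to_segments("chatgptlogin", labels))   # ['chatgpt', 'login']
```

Under this framing, a training example is a string paired with its label sequence, and the model's job at inference is to predict the labels.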
The training data includes:
- LLM-labeled domain name segmentations
- Brand names
- Personal name combinations
- Multilingual phrases (English, French, German, Spanish, and more)
- Tech product names and terminology
At inference, the BiLSTM runs as an INT8-quantized ONNX model and CRF decoding is performed in NumPy — no GPU required.
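CRF decoding is typically done with the Viterbi algorithm, which finds the highest-scoring tag sequence given per-character scores from the network and a table of tag-to-tag transition scores. The following is a minimal sketch of that algorithm, written in plain Python for readability rather than NumPy. It is not DKSplit's implementation: the function name, the two-tag setup (0 = begin, 1 = inside), and the toy scores are assumptions, and start/end transition scores are omitted.

```python
from typing import List, Sequence


def viterbi_decode(emissions: Sequence[Sequence[float]],
                   transitions: Sequence[Sequence[float]]) -> List[int]:
    """Return the highest-scoring tag sequence.

    emissions[t][j]   : score of tag j at position t (what a BiLSTM would emit)
    transitions[i][j] : score of moving from tag i to tag j (CRF parameters)
    """
    if not emissions:
        return []
    num_tags = len(transitions)
    score = list(emissions[0])            # best path score ending in each tag
    backpointers: List[List[int]] = []
    for t in range(1, len(emissions)):
        new_score: List[float] = []
        pointers: List[int] = []
        for j in range(num_tags):
            # Best previous tag for landing on tag j at position t.
            best_prev = max(range(num_tags),
                            key=lambda i: score[i] + transitions[i][j])
            pointers.append(best_prev)
            new_score.append(score[best_prev] + transitions[best_prev][j]
                             + emissions[t][j])
        score = new_score
        backpointers.append(pointers)
    # Walk back from the best final tag to recover the full path.
    tag = max(range(num_tags), key=lambda i: score[i])
    path = [tag]
    for pointers in reversed(backpointers):
        tag = pointers[tag]
        path.append(tag)
    path.reverse()
    return path


# Toy scores: tag 0 = begin, tag 1 = inside; begin -> begin is penalized.
emissions = [[2.0, 0.0], [0.4, 0.5], [0.0, 1.0]]
transitions = [[-1.0, 0.0], [0.0, 0.0]]
print(viterbi_decode(emissions, transitions))  # [0, 1, 1]
```

The transition table is what distinguishes this from picking the best tag at each position independently: a strongly penalized transition can override a locally preferred tag.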
Features
- Brand-aware: Recognizes thousands of brands, tech products, and proper nouns
- Multilingual: Handles English, French, German, Spanish, and romanized text
- Lightweight: 9 MB model, minimal dependencies (numpy + onnxruntime)
- Offline: No API keys, no internet required
Limitations
- Characters: Only a-z and 0-9. Input is automatically lowercased.
- Max length: 64 characters.
- Script: Latin script only. Non-Latin scripts (汉字, かな, 한글, العربية) are not supported.
- Ambiguity: Some inputs are genuinely ambiguous. DKSplit optimizes for the most common interpretation.
- Rare languages: Accuracy is highest on English and major European languages.
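Because the list above does not say how out-of-range input is handled, callers may want to check strings before passing them to the splitter. The sketch below is one way to do that. `prepare_input` is a hypothetical helper, not part of dksplit; it only encodes the limits stated above (a-z, 0-9, at most 64 characters) and makes no claim about what the library itself does with invalid input.

```python
import re

MAX_LENGTH = 64                        # limit quoted in the Limitations list
_ALLOWED = re.compile(r"[a-z0-9]+")    # documented character set


def prepare_input(text: str) -> str:
    """Lowercase text and verify it fits the documented limits.

    Hypothetical pre-check; raises ValueError instead of guessing a fix.
    """
    candidate = text.lower()
    if not candidate:
        raise ValueError("input is empty")
    if not _ALLOWED.fullmatch(candidate):
        raise ValueError(
            f"unsupported characters in {text!r}: only a-z and 0-9 are documented"
        )
    if len(candidate) > MAX_LENGTH:
        raise ValueError(
            f"input has {len(candidate)} characters, limit is {MAX_LENGTH}"
        )
    return candidate


print(prepare_input("ChatGPTLogin"))  # chatgptlogin
```

Raising an error rather than silently stripping characters keeps the decision with the caller, who may prefer to split on hyphens or dots first and segment each piece separately.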
Requirements
- Python >= 3.8
- numpy
- onnxruntime
Links
- Website: domainkits.com, ABTdomain.com
- GitHub: github.com/ABTdomain/dksplit
- PyPI: pypi.org/project/dksplit
- Issues: GitHub Issues
License
This project is licensed under the Apache License 2.0.
Please attribute as: DKsplit by ABTdomain