High-performance string segmentation using BiLSTM-CRF

DKSplit

v0.3.1: Model training moved to EuroHPC infrastructure (Leonardo Booster, NVIDIA A100). About 3% better accuracy than v0.2.x on real-world domains (82.8% to 85.0% on the 1,000-domain benchmark). API unchanged.

String segmentation using BiLSTM-CRF. Splits concatenated words into meaningful parts.

DKSplit is a lightweight model trained on millions of labeled samples covering domain names, brand names, tech terms, and multilingual phrases. It uses a BiLSTM-CRF architecture (9.47M parameters) exported to ONNX with INT8 quantization, delivering fast CPU inference with a 9 MB model.

DKSplit was originally built for domain name analysis at DomainKits, but it works well on any concatenated text: hashtags, URLs, identifiers, compound strings.

Install

pip install dksplit

Usage

import dksplit

dksplit.split("chatgptlogin")
# ['chatgpt', 'login']

dksplit.split("kubernetescluster")
# ['kubernetes', 'cluster']

dksplit.split("mercibeaucoup")
# ['merci', 'beaucoup']

dksplit.split_batch(["openaikey", "microsoftoffice", "bitcoinprice"])
# [['openai', 'key'], ['microsoft', 'office'], ['bitcoin', 'price']]
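Since the library targets domain names, a small wrapper can strip the TLD and hyphens before segmentation. The sketch below is an illustration, not part of DKSplit: `split_domain` is a hypothetical helper, and it takes the splitter as an argument so it runs without the package installed. It assumes the input is a plain registered name such as `name.tld`, so the leftmost label is the one to segment.

```python
from typing import Callable, List


def split_domain(domain: str, splitter: Callable[[str], List[str]]) -> List[str]:
    """Segment the leftmost label of a domain name (hypothetical helper).

    `splitter` is any function that maps a string to a list of words,
    for example dksplit.split.
    """
    # "ChatGPTLogin.com" -> "chatgptlogin"
    label = domain.strip().lower().split(".")[0]
    words: List[str] = []
    # Hyphens already mark word boundaries, so segment each chunk separately.
    for chunk in label.split("-"):
        if chunk:
            words.extend(splitter(chunk))
    return words
```

With the package installed, `split_domain("chatgptlogin.com", dksplit.split)` should pass `"chatgptlogin"` to the model, the same input as the first usage example.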

What's New in v0.3.1

Model training moved from AWS to the EuroHPC Leonardo Booster (NVIDIA A100), with a training configuration tuned for better generalization. Accuracy improves on real-world domains, especially for brand names, multilingual inputs, and edge cases. The API is unchanged.

pip install --upgrade dksplit

Examples of improvements:

| Input | v0.2.x | v0.3.1 |
|---|---|---|
| cloudflarecdn | cloud flare cdn | cloudflare cdn |
| databricks | data bricks | databricks |
| instacart | insta cart | instacart |
| robinhood | robin hood | robinhood |
| mailchimp | mail chimp | mailchimp |

Benchmark

Dataset

1,000 newly registered .com domains randomly sampled from the ABTdomain.com daily feed (April 8, 2026). No filtering or cherry-picking. Ground truth was established through multi-model cross-validation (BiLSTM, Qwen 9B LoRA, Gemma 31B) and human audit.

The dataset and evaluation script are available on GitHub.

Results

Accuracy on 1,000 randomly sampled real-world .com domains, human-audited ground truth:

| Model | Accuracy |
|---|---|
| DKSplit v0.3.1 | 85.0% |
| DKSplit v0.2.x | 82.8% |
| WordSegment | 54.0% |
| WordNinja | 46.1% |

DKSplit outperforms WordSegment by 31 percentage points and WordNinja by 39 percentage points.

Note: The accuracy above is measured against a single reference segmentation. Domain names are inherently ambiguous. For example, tiantian5 could be tiantian 5 (Chinese compound name) or tian tian 5 (two separate syllables); noranite could be nora nite or an intact brand; pikahug could be pika hug or an intact brand name. Our audit found ~5% of test samples have multiple valid segmentations. Accounting for these, effective accuracy is closer to 90%.
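The difference between strict and ambiguity-aware scoring can be sketched as follows. This is not the project's evaluation script: the `accuracy` function and the three-item sample are hypothetical, and the predictions are made up for illustration rather than taken from the benchmark. Each input maps to a set of accepted segmentations, and a prediction counts as correct if it matches any of them.

```python
from typing import Dict, List, Set, Tuple

Segmentation = Tuple[str, ...]


def accuracy(predictions: Dict[str, List[str]],
             references: Dict[str, Set[Segmentation]]) -> float:
    """Fraction of inputs whose prediction matches any accepted segmentation."""
    if not references:
        raise ValueError("references must not be empty")
    correct = sum(
        1 for name, valid in references.items()
        if tuple(predictions.get(name, ())) in valid
    )
    return correct / len(references)


# Illustrative predictions only, not benchmark output.
predictions = {
    "tiantian5": ["tian", "tian", "5"],
    "pikahug": ["pika", "hug"],
    "chatgptlogin": ["chatgpt", "login"],
}

# Strict scoring: exactly one reference per input.
strict = {
    "tiantian5": {("tiantian", "5")},
    "pikahug": {("pika", "hug")},
    "chatgptlogin": {("chatgpt", "login")},
}

# Lenient scoring: ambiguous inputs accept every valid segmentation.
lenient = {
    "tiantian5": {("tiantian", "5"), ("tian", "tian", "5")},
    "pikahug": {("pika", "hug"), ("pikahug",)},
    "chatgptlogin": {("chatgpt", "login")},
}

strict_score = accuracy(predictions, strict)    # 2 of 3 correct
lenient_score = accuracy(predictions, lenient)  # 3 of 3 correct
```

On this toy sample the strict score penalizes `tian tian 5` even though the note above treats it as a valid reading, which is the effect behind the gap between 85% and roughly 90%.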

Comparison

| Input | DKSplit v0.3.1 | WordSegment | WordNinja |
|---|---|---|---|
| chatgptprompts | chatgpt prompts | chat gpt prompts | chat gp t prompts |
| tensorflowserving | tensorflow serving | tensor flow serving | tensor flow serving |
| spotifywrapped | spotify wrapped | spot if y wrapped | spot if y wrapped |
| ethereumwallet | ethereum wallet | e there um wallet | e there um wallet |
| cloudflarecdn | cloudflare cdn | cloud flare cdn | cloud flare cd n |
| kubernetescluster | kubernetes cluster | ku bernet es cluster | ku berne tes cluster |
| hackathonwinners | hackathon winners | hackathon winners | hack a th on winners |
| whatsappstatus | whatsapp status | what sapp status | what s app status |
| drwatsonai | dr watson ai | dr watson a i | dr watson a i |
| escribirenvozalta | escribir en voz alta | escribir env oz alta | es crib ire nv oz alta |
| tuvasou | tu vas ou | tuva sou | tuva so u |
| candidiasenuncamais | candidiase nunca mais | candid iase nunca mais | can didi as e nun cama is |
| robertdeniro | robert de niro | robert deniro | robert deniro |
| mercibeaucoup | merci beaucoup | merci beaucoup | mer ci beau coup |

How It Works

DKSplit treats segmentation as a sequence labeling task.
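To make the sequence-labeling framing concrete, the sketch below tags every character as either the beginning of a word (`B`) or inside one (`I`), and shows that the tags are enough to recover the split. The two-tag scheme and both function names are assumptions for illustration; this README does not state which tag set DKSplit actually uses.

```python
from typing import List


def words_to_tags(words: List[str]) -> List[str]:
    """Turn a segmentation into one tag per character (assumed B/I scheme)."""
    tags: List[str] = []
    for word in words:
        if word:
            # First character opens a word, the rest continue it.
            tags.extend(["B"] + ["I"] * (len(word) - 1))
    return tags


def tags_to_words(text: str, tags: List[str]) -> List[str]:
    """Rebuild the words from the raw string and its per-character tags."""
    words: List[str] = []
    for char, tag in zip(text, tags):
        if tag == "B" or not words:
            words.append(char)   # start a new word
        else:
            words[-1] += char    # extend the current word
    return words


tags = words_to_tags(["merci", "beaucoup"])
restored = tags_to_words("mercibeaucoup", tags)
```

Under this framing, the model's job is to predict the tag sequence for an unseen string; the words then follow mechanically from the tags.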

The training data includes:

  • LLM-labeled domain name segmentations
  • Brand names
  • Personal name combinations
  • Multilingual phrases (English, French, German, Spanish, and more)
  • Tech product names and terminology

At inference, the BiLSTM runs as an INT8-quantized ONNX model and CRF decoding is performed in NumPy — no GPU required.
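CRF decoding is usually done with the Viterbi algorithm, which finds the highest-scoring tag sequence given per-character scores from the network and a tag-to-tag transition matrix. The following is a minimal sketch of that algorithm, not DKSplit's code: it uses plain Python lists so it runs without dependencies (the same recurrence vectorizes directly in NumPy), and the function name, score layout, and toy numbers are all assumptions.

```python
from typing import List


def viterbi(emissions: List[List[float]],
            transitions: List[List[float]]) -> List[int]:
    """Return the best tag index per position.

    emissions[t][k]   : score of tag k at position t (from the network)
    transitions[i][j] : score of moving from tag i to tag j
    """
    if not emissions:
        return []
    n_tags = len(transitions)
    score = list(emissions[0])
    backpointers: List[List[int]] = []

    for emit in emissions[1:]:
        new_score: List[float] = []
        pointers: List[int] = []
        for j in range(n_tags):
            # Best previous tag to arrive at tag j.
            best_i = max(range(n_tags),
                         key=lambda i: score[i] + transitions[i][j])
            pointers.append(best_i)
            new_score.append(score[best_i] + transitions[best_i][j] + emit[j])
        score = new_score
        backpointers.append(pointers)

    # Walk back from the best final tag.
    best = max(range(n_tags), key=lambda k: score[k])
    path = [best]
    for pointers in reversed(backpointers):
        best = pointers[best]
        path.append(best)
    path.reverse()
    return path


# Toy scores for a 3-character input and 2 tags (hypothetical values).
emissions = [[2.0, 0.0], [0.0, 1.0], [0.6, 0.5]]
free = [[0.0, 0.0], [0.0, 0.0]]        # no transition preference
penalized = [[0.0, 0.0], [-5.0, 0.0]]  # discourage moving from tag 1 to tag 0

path_free = viterbi(emissions, free)
path_penalized = viterbi(emissions, penalized)
```

With no transition preference the result is just the per-position argmax; with the penalty, the last position switches tags because the transition cost outweighs the small emission advantage. That interaction is what the CRF layer adds over independent per-character predictions.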

Features

  • Brand-aware: Recognizes thousands of brands, tech products, and proper nouns
  • Multilingual: Handles English, French, German, Spanish, and romanized text
  • Lightweight: 9 MB model, minimal dependencies (numpy + onnxruntime)
  • Offline: No API keys, no internet required

Limitations

  • Characters: Only a-z and 0-9. Input is automatically lowercased.
  • Max length: 64 characters.
  • Script: Latin script only. Non-Latin scripts (汉字, かな, 한글, العربية) are not supported.
  • Ambiguity: Some inputs are genuinely ambiguous. DKSplit optimizes for the most common interpretation.
  • Rare languages: Accuracy is highest on English and major European languages.
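Because this README does not say how the library reacts to input outside these limits, callers may want to filter strings before passing them in. The sketch below is one way to do that on the caller's side; `prepare` and `MAX_LEN` are hypothetical names, and the checks simply mirror the character set and length limit listed above.

```python
import re
from typing import Optional

MAX_LEN = 64                           # documented maximum input length
_ALLOWED = re.compile(r"[a-z0-9]+")    # documented character set


def prepare(text: str) -> Optional[str]:
    """Return a lowercased string that fits the documented limits, else None."""
    candidate = text.strip().lower()
    if len(candidate) > MAX_LEN:
        return None
    if not _ALLOWED.fullmatch(candidate):
        # Rejects empty strings, hyphens, dots, and non-Latin scripts.
        return None
    return candidate
```

A caller can skip or handle separately any input for which `prepare` returns `None`, and pass the rest to `dksplit.split` or `dksplit.split_batch`.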

Requirements

  • Python >= 3.8
  • numpy
  • onnxruntime

License

This project is licensed under the Apache License 2.0.

Please attribute as: DKSplit by ABTdomain

Acknowledgements

EuroHPC JU   Co-funded by the EU

The v0.3.1 model was trained on the Leonardo Booster supercomputer at CINECA, Italy, with computing resources provided by the EuroHPC Joint Undertaking through the Playground Access program (project AIFAC_P02_281). We thank EuroHPC JU for enabling SMEs to explore new possibilities with world-class HPC infrastructure.
