High-performance string segmentation using BiLSTM-CRF

DKSplit

v0.2.3 — Retrained model with expanded brand and name coverage. ~7% accuracy improvement on real-world domains. API unchanged — just pip install --upgrade dksplit.

String segmentation using BiLSTM-CRF. Splits concatenated words into meaningful parts.

DKSplit is a lightweight model trained on millions of labeled samples covering domain names, brand names, tech terms, and multilingual phrases. It uses a BiLSTM-CRF architecture (9.47M parameters) exported to ONNX with INT8 quantization, delivering fast CPU inference in a 9 MB package.

DKSplit was originally built for domain name analysis at DomainKits, but it works well on any concatenated text: hashtags, URLs, identifiers, and compound strings.

Install

pip install dksplit

Usage

import dksplit

dksplit.split("chatgptlogin")
# ['chatgpt', 'login']

dksplit.split("kubernetescluster")
# ['kubernetes', 'cluster']

dksplit.split("mercibeaucoup")
# ['merci', 'beaucoup']

dksplit.split_batch(["openaikey", "microsoftoffice", "bitcoinprice"])
# [['openai', 'key'], ['microsoft', 'office'], ['bitcoin', 'price']]

What's New in v0.2.3

Retrained model with significantly expanded brand and name coverage. The API is unchanged — just upgrade.

pip install --upgrade dksplit

Examples of improvements:

Input         | v0.1.0          | v0.2.3
cloudflarecdn | cloud flare cdn | cloudflare cdn
snowdenyes    | snow den yes    | snowden yes
robertdeniro  | robert deniro   | robert de niro

Benchmark

Dataset

1,000 newly registered .com domains, randomly sampled from the ABTdomain.com daily feed (February 8, 2026). The sample is raw: no filtering or cherry-picking was applied to these real-world domain registrations. GPT-5.2 segmentation is used as the reference answer.

The dataset and evaluation script are available on GitHub.

Results

Accuracy (agreement with GPT-5.2 on 1,000 real domains):

Model          | Accuracy
DKSplit v0.2.3 | 80.5%
WordSegment    | 59.1%
WordNinja      | 47.6%

DKSplit outperforms WordSegment by 21.4 percentage points and WordNinja by 32.9 percentage points.
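As a rough illustration of how an agreement score like the one above could be computed, here is a minimal sketch. It assumes agreement means an exact match between the predicted and reference token lists, which the text does not spell out. The `agreement` function is hypothetical, not the project's evaluation script, and the four rows are taken from the Comparison table (DKSplit column standing in for the reference, WordSegment column as the predictions), not from the benchmark dataset.

```python
from typing import Dict, List


def agreement(predictions: Dict[str, List[str]],
              reference: Dict[str, List[str]]) -> float:
    """Fraction of reference inputs whose predicted segmentation matches exactly."""
    if not reference:
        raise ValueError("reference set is empty")
    matches = sum(
        1 for text, expected in reference.items()
        if predictions.get(text) == expected
    )
    return matches / len(reference)


# Illustrative stand-in for a reference set (assumption: NOT the real benchmark data).
reference = {
    "chatgptlogin": ["chatgpt", "login"],
    "lululemonoutlet": ["lululemon", "outlet"],
    "mercibeaucoup": ["merci", "beaucoup"],
    "robertdeniro": ["robert", "de", "niro"],
}

# Predictions to score, copied from the WordSegment column of the Comparison table.
predictions = {
    "chatgptlogin": ["chat", "gpt", "login"],
    "lululemonoutlet": ["lululemon", "outlet"],
    "mercibeaucoup": ["merci", "beaucoup"],
    "robertdeniro": ["robert", "deniro"],
}

score = agreement(predictions, reference)
print(f"{score:.1%}")  # 50.0%
```

Exact match is a strict criterion: a segmentation that gets one boundary wrong counts the same as one that gets every boundary wrong.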

Note: The remaining ~20% disagreement with GPT-5.2 comes largely from rare languages, invented words, and genuinely ambiguous cases (e.g., is christianalucas "christiana lucas" or "christian a lucas"?). On standard English inputs, agreement is significantly higher.

Comparison

Input             | DKSplit v0.2.3     | WordSegment          | WordNinja
chatgptlogin      | chatgpt login      | chat gpt login       | chat gp t login
cloudflarecdn     | cloudflare cdn     | cloud flare cdn      | cloud flare cd n
kubernetescluster | kubernetes cluster | ku bernet es cluster | ku berne tes cluster
instagramlogin    | instagram login    | insta gram login     | insta gram login
ethereumwallet    | ethereum wallet    | e there um wallet    | e there um wallet
spotifyplaylist   | spotify playlist   | spot if y playlist   | spot if y playlist
lululemonoutlet   | lululemon outlet   | lululemon outlet     | lulu lemon outlet
tensorflowlite    | tensorflow lite    | tensor flow lite     | tensor flow lite
mercibeaucoup     | merci beaucoup     | merci beaucoup       | mer ci beau coup
robertdeniro      | robert de niro     | robert deniro        | robert deniro
snowdenyes        | snowden yes        | snowden yes          | snow deny es
youtubedownloader | youtube downloader | youtube downloader   | youtube down loader

How It Works

DKSplit treats segmentation as a sequence labeling task.
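To make the sequence-labeling framing concrete, the sketch below shows one common way to encode a segmentation as per-character labels and decode it back. The two-tag scheme (B = first character of a word, I = inside a word) and both helper functions are assumptions for illustration; the tag set DKSplit actually uses is not stated here.

```python
from typing import List


def segments_to_labels(segments: List[str]) -> List[str]:
    """Encode a segmentation as one label per character: B starts a word, I continues it."""
    labels: List[str] = []
    for segment in segments:
        if not segment:
            raise ValueError("segments must be non-empty")
        labels.extend(["B"] + ["I"] * (len(segment) - 1))
    return labels


def labels_to_segments(text: str, labels: List[str]) -> List[str]:
    """Decode per-character labels back into words."""
    if len(text) != len(labels):
        raise ValueError("need exactly one label per character")
    segments: List[str] = []
    for char, label in zip(text, labels):
        if label == "B" or not segments:
            segments.append(char)   # start a new word
        else:
            segments[-1] += char    # extend the current word
    return segments


labels = segments_to_labels(["chatgpt", "login"])
print("".join(labels))                             # BIIIIIIBIIII
print(labels_to_segments("chatgptlogin", labels))  # ['chatgpt', 'login']
```

Under this framing, the model's job is to predict the label string from the characters; turning labels back into words is mechanical.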

The training data includes:

  • LLM-labeled domain name segmentations
  • Brand names
  • Personal name combinations
  • Multilingual phrases (English, French, German, Spanish, and more)
  • Tech product names and terminology

At inference, the BiLSTM runs as an INT8-quantized ONNX model and CRF decoding is performed in NumPy — no GPU required.
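CRF decoding of this kind is typically done with the Viterbi algorithm, which finds the highest-scoring tag sequence given per-position emission scores (here, what the BiLSTM would output) and tag-to-tag transition scores. The following is a minimal pure-Python sketch of that general technique, not DKSplit's NumPy implementation; the `viterbi` function, the two-tag setup (0 = B, 1 = I), and all score values are made up for illustration.

```python
from typing import List, Sequence


def viterbi(emissions: Sequence[Sequence[float]],
            transitions: Sequence[Sequence[float]]) -> List[int]:
    """Return the highest-scoring tag sequence for a linear-chain CRF.

    emissions[t][j]   -- score of tag j at position t
    transitions[i][j] -- score of moving from tag i to tag j
    """
    if not emissions:
        return []
    num_tags = len(emissions[0])
    scores = list(emissions[0])          # best score of any path ending in each tag
    backpointers: List[List[int]] = []   # best previous tag, per position and tag

    for t in range(1, len(emissions)):
        new_scores: List[float] = []
        pointers: List[int] = []
        for j in range(num_tags):
            best_prev = max(range(num_tags),
                            key=lambda i: scores[i] + transitions[i][j])
            pointers.append(best_prev)
            new_scores.append(
                scores[best_prev] + transitions[best_prev][j] + emissions[t][j]
            )
        backpointers.append(pointers)
        scores = new_scores

    # Walk the backpointers from the best final tag to recover the full path.
    best = max(range(num_tags), key=lambda j: scores[j])
    path = [best]
    for pointers in reversed(backpointers):
        best = pointers[best]
        path.append(best)
    path.reverse()
    return path


# Hypothetical scores for a 4-character input; tag 0 = B, tag 1 = I.
emissions = [[2.0, 0.0], [0.0, 1.0], [2.0, 1.0], [0.0, 2.0]]
transitions = [[-1.0, 0.5],   # B->B penalized, B->I favored
               [0.0, 0.5]]    # I->B neutral,   I->I favored

print(viterbi(emissions, transitions))  # [0, 1, 0, 1]
```

The decoded path `[0, 1, 0, 1]` reads as B I B I, so a four-character input would be split into two two-character words. The work is a small dynamic program over positions and tags, which is consistent with the claim that no GPU is needed.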

Features

  • Brand-aware: Recognizes thousands of brands, tech products, and proper nouns
  • Multilingual: Handles English, French, German, Spanish, and romanized text
  • Lightweight: 9 MB model, minimal dependencies (numpy + onnxruntime)
  • Offline: No API keys, no internet required

Limitations

  • Characters: Only a-z and 0-9. Input is automatically lowercased.
  • Max length: 64 characters.
  • Script: Latin script only. Non-Latin scripts (汉字, かな, 한글, العربية) are not supported.
  • Ambiguity: Some inputs are genuinely ambiguous. DKSplit optimizes for the most common interpretation.
  • Rare languages: Accuracy is highest on English and major European languages.
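Callers who want to check the character and length limits before handing a string to the library could use a small pre-check like the sketch below. `prepare_input` is a hypothetical helper, not part of the dksplit API, and returning `None` for out-of-range input is a choice made for this example; how the library itself handles such input is not described here.

```python
import re
from typing import Optional

MAX_LENGTH = 64                       # limit stated under Limitations
_ALLOWED = re.compile(r"[a-z0-9]+")   # a-z and 0-9 only


def prepare_input(text: str) -> Optional[str]:
    """Lowercase text and return it if it fits the documented limits, else None."""
    candidate = text.lower()
    if len(candidate) > MAX_LENGTH:
        return None
    if not _ALLOWED.fullmatch(candidate):
        return None
    return candidate


print(prepare_input("ChatGPTLogin"))    # chatgptlogin
print(prepare_input("merci-beaucoup"))  # None (hyphen is outside a-z, 0-9)
print(prepare_input("a" * 65))          # None (longer than 64 characters)
```

A non-`None` result can then be passed on to `dksplit.split`.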

Requirements

  • Python >= 3.8
  • numpy
  • onnxruntime

License

This project is licensed under the Apache License 2.0.

Please attribute as: DKsplit by ABTdomain
