German compound word splitter using Rust + FST
charsplit-fst
A memory-efficient Rust port of the CharSplit algorithm for German compound splitting, using Finite State Transducers (FST).
Overview
charsplit-fst implements the CharSplit algorithm for splitting German compound words into their component parts. It uses about 89% less memory than the original Python implementation (19.6 MB vs 180 MB at runtime) by storing its data in Finite State Transducer (FST) structures.
Based on CharSplit by Don Tuggener: https://github.com/dtuggener/CharSplit
Features
- 51% smaller data files: 39 MB JSON → 18.2 MB FST
- 89% lower memory usage: 19.6 MB vs 180 MB runtime
- UTF-8 safe: Proper character-based indexing for German Unicode characters
- Python bindings via PyO3
- WebAssembly demo for browser-based usage
- CLI tool for batch processing
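The UTF-8 point matters because German umlauts and ß take two bytes each in UTF-8, so byte offsets and character offsets diverge (Rust `str` slicing, for instance, works on byte offsets and panics when a cut lands inside a character). The Python sketch below only illustrates that general issue; it is not taken from the charsplit-fst source.

```python
word = "Raststätte"
encoded = word.encode("utf-8")

# 10 characters, but 11 bytes: "ä" is encoded as two bytes (0xC3 0xA4).
print(len(word), len(encoded))

# Splitting by character index always yields valid strings.
head, tail = word[:7], word[7:]
print(head, tail)  # the split falls right after "ä"

# Splitting the raw bytes at the same numeric offset cuts "ä" in half,
# so the first half is no longer valid UTF-8.
try:
    encoded[:7].decode("utf-8")
    byte_split_ok = True
except UnicodeDecodeError as err:
    byte_split_ok = False
    print("byte-based split failed:", err.reason)
```

A splitter that tracks candidate positions as character indices avoids this class of error.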
Installation
Python
Available on PyPI.
pip install charsplit-fst
Rust
cargo add charsplit-fst
Quick Start
Python
from charsplit_fst import Splitter
splitter = Splitter()
results = splitter.split_compound("Autobahnraststätte")
# Returns: [(0.795, 'Autobahn', 'Raststätte'), ...]
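Judging from the return value shown above, each result appears to be a `(score, first_part, second_part)` tuple. Assuming that shape, a minimal sketch for selecting the highest-scoring split could look like this. The helper `best_split` is hypothetical, and `sample_results` stands in for real output so the snippet runs without the package installed; only its first entry comes from the example above, the second is made up for illustration.

```python
from typing import List, Tuple

Split = Tuple[float, str, str]  # assumed shape: (score, first_part, second_part)


def best_split(results: List[Split]) -> Split:
    """Return the candidate with the highest score."""
    if not results:
        raise ValueError("no split candidates")
    return max(results, key=lambda r: r[0])


# Stand-in for splitter.split_compound("Autobahnraststätte").
sample_results: List[Split] = [
    (0.795, "Autobahn", "Raststätte"),  # from the example above
    (-0.5, "Auto", "Bahnraststätte"),   # hypothetical lower-scoring candidate
]

score, first, second = best_split(sample_results)
print(f"{first} + {second} (score {score})")
```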
Rust
use charsplit_fst::Splitter;
let splitter = Splitter::new()?;
let results = splitter.split_compound("Autobahnraststätte");
CLI
cargo run --bin charsplit-fst -- Autobahnraststätte
Algorithm
The algorithm splits German compounds using n-gram probability scoring:
Score formula: start_prob - in_prob + pre_prob
Where:
- start_prob: Maximum prefix probability of second part
- in_prob: Minimum infix probability crossing split boundary
- pre_prob: Maximum suffix probability of first part
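To make the formula concrete, here is a minimal Python sketch of how candidate splits could be scored and compared. The function name `split_score` and all probability values are hypothetical illustrations, not part of the charsplit-fst API; in the library itself the three probabilities come from the n-gram data.

```python
def split_score(start_prob: float, in_prob: float, pre_prob: float) -> float:
    """Combine the three n-gram probabilities for one candidate split point.

    start_prob: maximum prefix probability of the second part
    in_prob:    minimum infix probability crossing the split boundary
    pre_prob:   maximum suffix probability of the first part
    """
    return start_prob - in_prob + pre_prob


# Hypothetical probabilities for two candidate split points (illustrative only).
candidates = {
    "Autobahn|raststätte": split_score(start_prob=0.6, in_prob=0.1, pre_prob=0.3),
    "Auto|bahnraststätte": split_score(start_prob=0.2, in_prob=0.4, pre_prob=0.3),
}

# The split with the highest score is the preferred one.
best = max(candidates, key=candidates.get)
print(best, round(candidates[best], 3))
```

A high `in_prob` lowers the score: it means the characters around the boundary commonly occur inside a single word, which counts against splitting there.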
Performance
- Memory: 19.6 MB RSS (vs 180 MB for Python)
- Data size: 18.2 MB on disk (vs 39 MB JSON)
Web Demo
A browser-based demo using WebAssembly is available in web-demo/.
# Build the WASM version
./build-wasm.sh
# Serve from project root
python -m http.server 8000
# Open http://localhost:8000/web-demo/
The demo runs entirely in the browser using WebAssembly. No server-side processing is required.
Browser support: The demo loads Brotli-compressed data via the DecompressionStream API where supported and falls back to uncompressed data in browsers that lack it, so it works in all modern browsers.
Development
# Build
cargo build --release
# Run tests
cargo test
# Build Python bindings
maturin develop
# Build WASM
./build-wasm.sh
Acknowledgments
This project is a Rust port of CharSplit by Don Tuggener.
- Algorithm: Based on Tuggener (2016), Incremental Coreference Resolution for German, University of Zurich.
- Original Implementation: dtuggener/CharSplit (https://github.com/dtuggener/CharSplit) (MIT Licensed).
- Data: The n-gram probabilities are derived from the model provided by the original author.
License
MIT OR Apache-2.0
See LICENSE-MIT and LICENSE-APACHE-2.0 for details.
Download files
File details
Details for the file charsplit_fst-0.1.4.tar.gz.
File metadata
- Download URL: charsplit_fst-0.1.4.tar.gz
- Upload date:
- Size: 8.8 MB
- Tags: Source
- Uploaded using Trusted Publishing? Yes
- Uploaded via: twine/6.1.0 CPython/3.13.12
File hashes
| Algorithm | Hash digest |
|---|---|
| SHA256 | 0365f37c98180c25289d8a21e056c192fb130467d3289dbee2ebde533d7ed3a3 |
| MD5 | d18b3bee294b28e7186709cb6803af8b |
| BLAKE2b-256 | c191cdb2672df6ccfeeee6927f5f4ac991515f37fce1a1620580697baaabc30a |
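To check a downloaded file against the SHA256 digest listed above, one option is to compute the digest locally. This is a minimal sketch using only Python's standard library; the function name `sha256_of` and the local file path are placeholders.

```python
import hashlib
from pathlib import Path


def sha256_of(path: Path, chunk_size: int = 1 << 20) -> str:
    """Return the hex SHA256 digest of a file, reading it in chunks."""
    digest = hashlib.sha256()
    with path.open("rb") as fh:
        for chunk in iter(lambda: fh.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


# SHA256 published for charsplit_fst-0.1.4.tar.gz (copied from the table above).
EXPECTED = "0365f37c98180c25289d8a21e056c192fb130467d3289dbee2ebde533d7ed3a3"

archive = Path("charsplit_fst-0.1.4.tar.gz")  # placeholder: adjust to your download location
if archive.exists():
    print("match" if sha256_of(archive) == EXPECTED else "MISMATCH")
```

The same function works for the wheel files by swapping in the corresponding digest.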
Provenance
The following attestation bundles were made for charsplit_fst-0.1.4.tar.gz:
Publisher: publish.yml on steadfastgaze/charsplit-fst
- Statement:
  - Statement type: https://in-toto.io/Statement/v1
  - Predicate type: https://docs.pypi.org/attestations/publish/v1
  - Subject name: charsplit_fst-0.1.4.tar.gz
  - Subject digest: 0365f37c98180c25289d8a21e056c192fb130467d3289dbee2ebde533d7ed3a3
- Sigstore transparency entry: 1587197517
- Sigstore integration time:
- Permalink: steadfastgaze/charsplit-fst@cb1061e03d13eb1afb5f2f57425f40bfe32340c7
- Branch / Tag: refs/tags/v0.1.4
- Owner: https://github.com/steadfastgaze
- Access: public
- Token Issuer: https://token.actions.githubusercontent.com
- Runner Environment: github-hosted
- Publication workflow: publish.yml@cb1061e03d13eb1afb5f2f57425f40bfe32340c7
- Trigger Event: release
File details
Details for the file charsplit_fst-0.1.4-cp312-abi3-win_amd64.whl.
File metadata
- Download URL: charsplit_fst-0.1.4-cp312-abi3-win_amd64.whl
- Upload date:
- Size: 8.9 MB
- Tags: CPython 3.12+, Windows x86-64
- Uploaded using Trusted Publishing? Yes
- Uploaded via: twine/6.1.0 CPython/3.13.12
File hashes
| Algorithm | Hash digest |
|---|---|
| SHA256 | e4946737d5bc751385af6c8e4f96ecf9dca96022b20005cda0c054410438d0b5 |
| MD5 | 57fafb6364da89de3f5266185ec9aae7 |
| BLAKE2b-256 | 97136cd4ddc55bc03c9baabecaf68ce12a0460d8589e6093683962c6e123ee62 |
Provenance
The following attestation bundles were made for charsplit_fst-0.1.4-cp312-abi3-win_amd64.whl:
Publisher: publish.yml on steadfastgaze/charsplit-fst
- Statement:
  - Statement type: https://in-toto.io/Statement/v1
  - Predicate type: https://docs.pypi.org/attestations/publish/v1
  - Subject name: charsplit_fst-0.1.4-cp312-abi3-win_amd64.whl
  - Subject digest: e4946737d5bc751385af6c8e4f96ecf9dca96022b20005cda0c054410438d0b5
- Sigstore transparency entry: 1587198670
- Sigstore integration time:
- Permalink: steadfastgaze/charsplit-fst@cb1061e03d13eb1afb5f2f57425f40bfe32340c7
- Branch / Tag: refs/tags/v0.1.4
- Owner: https://github.com/steadfastgaze
- Access: public
- Token Issuer: https://token.actions.githubusercontent.com
- Runner Environment: github-hosted
- Publication workflow: publish.yml@cb1061e03d13eb1afb5f2f57425f40bfe32340c7
- Trigger Event: release
File details
Details for the file charsplit_fst-0.1.4-cp312-abi3-musllinux_1_2_x86_64.whl.
File metadata
- Download URL: charsplit_fst-0.1.4-cp312-abi3-musllinux_1_2_x86_64.whl
- Upload date:
- Size: 9.1 MB
- Tags: CPython 3.12+, musllinux: musl 1.2+ x86-64
- Uploaded using Trusted Publishing? Yes
- Uploaded via: twine/6.1.0 CPython/3.13.12
File hashes
| Algorithm | Hash digest |
|---|---|
| SHA256 | c2a3330354c58d9d5d3aae291512850fcd44d9e3f0937ae14769d7e32e08ee5f |
| MD5 | d704611466100458943fb1e5b0889625 |
| BLAKE2b-256 | a80f13454510076b4154a18c16d96d1041aac12550fb62fa7e045de4da55ba8f |
Provenance
The following attestation bundles were made for charsplit_fst-0.1.4-cp312-abi3-musllinux_1_2_x86_64.whl:
Publisher: publish.yml on steadfastgaze/charsplit-fst
- Statement:
  - Statement type: https://in-toto.io/Statement/v1
  - Predicate type: https://docs.pypi.org/attestations/publish/v1
  - Subject name: charsplit_fst-0.1.4-cp312-abi3-musllinux_1_2_x86_64.whl
  - Subject digest: c2a3330354c58d9d5d3aae291512850fcd44d9e3f0937ae14769d7e32e08ee5f
- Sigstore transparency entry: 1587198345
- Sigstore integration time:
- Permalink: steadfastgaze/charsplit-fst@cb1061e03d13eb1afb5f2f57425f40bfe32340c7
- Branch / Tag: refs/tags/v0.1.4
- Owner: https://github.com/steadfastgaze
- Access: public
- Token Issuer: https://token.actions.githubusercontent.com
- Runner Environment: github-hosted
- Publication workflow: publish.yml@cb1061e03d13eb1afb5f2f57425f40bfe32340c7
- Trigger Event: release
File details
Details for the file charsplit_fst-0.1.4-cp312-abi3-manylinux_2_28_x86_64.whl.
File metadata
- Download URL: charsplit_fst-0.1.4-cp312-abi3-manylinux_2_28_x86_64.whl
- Upload date:
- Size: 9.0 MB
- Tags: CPython 3.12+, manylinux: glibc 2.28+ x86-64
- Uploaded using Trusted Publishing? Yes
- Uploaded via: twine/6.1.0 CPython/3.13.12
File hashes
| Algorithm | Hash digest |
|---|---|
| SHA256 | e72bf11e0bd6b92ec50326d58ce4d74bd0fbb7ee695dd5bba923faf1f3dbc351 |
| MD5 | 7c60855149536022052269fe78b50004 |
| BLAKE2b-256 | 33a379046534d66ce3935eade88e57d98bf5d92b3e02793a77821cf079e81aeb |
Provenance
The following attestation bundles were made for charsplit_fst-0.1.4-cp312-abi3-manylinux_2_28_x86_64.whl:
Publisher: publish.yml on steadfastgaze/charsplit-fst
- Statement:
  - Statement type: https://in-toto.io/Statement/v1
  - Predicate type: https://docs.pypi.org/attestations/publish/v1
  - Subject name: charsplit_fst-0.1.4-cp312-abi3-manylinux_2_28_x86_64.whl
  - Subject digest: e72bf11e0bd6b92ec50326d58ce4d74bd0fbb7ee695dd5bba923faf1f3dbc351
- Sigstore transparency entry: 1587198774
- Sigstore integration time:
- Permalink: steadfastgaze/charsplit-fst@cb1061e03d13eb1afb5f2f57425f40bfe32340c7
- Branch / Tag: refs/tags/v0.1.4
- Owner: https://github.com/steadfastgaze
- Access: public
- Token Issuer: https://token.actions.githubusercontent.com
- Runner Environment: github-hosted
- Publication workflow: publish.yml@cb1061e03d13eb1afb5f2f57425f40bfe32340c7
- Trigger Event: release
File details
Details for the file charsplit_fst-0.1.4-cp312-abi3-macosx_11_0_arm64.whl.
File metadata
- Download URL: charsplit_fst-0.1.4-cp312-abi3-macosx_11_0_arm64.whl
- Upload date:
- Size: 9.1 MB
- Tags: CPython 3.12+, macOS 11.0+ ARM64
- Uploaded using Trusted Publishing? Yes
- Uploaded via: twine/6.1.0 CPython/3.13.12
File hashes
| Algorithm | Hash digest |
|---|---|
| SHA256 | bf868474c131b79bf341e3025ea0bf0366f56b16b76643f106a9a15d45113c77 |
| MD5 | c53e938c6331e454dca7305256656d07 |
| BLAKE2b-256 | 83b9ca3951cd55181b37b76131f1a4f53fbb3a531ddc5526de423586100c673c |
Provenance
The following attestation bundles were made for charsplit_fst-0.1.4-cp312-abi3-macosx_11_0_arm64.whl:
Publisher: publish.yml on steadfastgaze/charsplit-fst
- Statement:
  - Statement type: https://in-toto.io/Statement/v1
  - Predicate type: https://docs.pypi.org/attestations/publish/v1
  - Subject name: charsplit_fst-0.1.4-cp312-abi3-macosx_11_0_arm64.whl
  - Subject digest: bf868474c131b79bf341e3025ea0bf0366f56b16b76643f106a9a15d45113c77
- Sigstore transparency entry: 1587197979
- Sigstore integration time:
- Permalink: steadfastgaze/charsplit-fst@cb1061e03d13eb1afb5f2f57425f40bfe32340c7
- Branch / Tag: refs/tags/v0.1.4
- Owner: https://github.com/steadfastgaze
- Access: public
- Token Issuer: https://token.actions.githubusercontent.com
- Runner Environment: github-hosted
- Publication workflow: publish.yml@cb1061e03d13eb1afb5f2f57425f40bfe32340c7
- Trigger Event: release