German compound word splitter using Rust + FST

charsplit-fst

A memory-efficient Rust port of the CharSplit algorithm for German compound splitting, using Finite State Transducers (FST).

Overview

charsplit-fst implements the CharSplit algorithm for splitting German compound words into their component parts. It uses 89% less memory than the original Python implementation by storing its data in Finite State Transducer (FST) structures.

Based on CharSplit by Don Tuggener: https://github.com/dtuggener/CharSplit

Features

  • 51% smaller data files: 39 MB JSON → 18.2 MB FST
  • 89% lower memory usage: 19.6 MB vs 180 MB runtime
  • UTF-8 safe: Proper character-based indexing for German Unicode characters
  • Python bindings via PyO3
  • WebAssembly demo for browser-based usage
  • CLI tool for batch processing
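The UTF-8 point above matters because German umlauts and ß occupy two bytes each in UTF-8, so a character index and a byte index into the same word can differ (Rust string slices are byte-indexed and panic when cut inside a character). The following is a minimal Python sketch of that difference; `split_at_char` is a hypothetical helper written for illustration and is not part of the charsplit-fst API.

```python
word = "Raststätte"

# Character count vs. UTF-8 byte count: "ä" is one character but two bytes.
char_len = len(word)
byte_len = len(word.encode("utf-8"))


def split_at_char(text: str, index: int) -> tuple[str, str]:
    """Split text at a character (code point) index. Hypothetical helper."""
    return text[:index], text[index:]


# Splitting by character index always yields two valid strings.
first, second = split_at_char(word, 7)

# Splitting the raw bytes at the same numeric index cuts "ä" in half,
# so the first half is no longer valid UTF-8.
raw = word.encode("utf-8")
try:
    raw[:7].decode("utf-8")
    byte_split_ok = True
except UnicodeDecodeError:
    byte_split_ok = False

print(char_len, byte_len, first, second, byte_split_ok)
```

A splitter that enumerates split positions by character rather than by byte avoids this failure mode for every position in the word.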

Installation

Python

Available on PyPI.

pip install charsplit-fst

Rust

cargo add charsplit-fst

Quick Start

Python

from charsplit_fst import Splitter

splitter = Splitter()
results = splitter.split_compound("Autobahnraststätte")
# Returns: [(0.795, 'Autobahn', 'Raststätte'), ...]
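
The result shown above is a list of `(score, first_part, second_part)` tuples. As a rough sketch of how a caller might pick one candidate from such a list, the helper below selects the highest-scoring tuple. `best_split` and its `threshold` parameter are hypothetical and not part of the library, and the second tuple in `sample` is made up purely for illustration.

```python
from typing import List, Optional, Tuple

Split = Tuple[float, str, str]


def best_split(results: List[Split], threshold: float = 0.0) -> Optional[Split]:
    """Return the highest-scoring candidate, or None if no score exceeds threshold.

    Hypothetical helper; the threshold default is an assumption of this sketch.
    """
    if not results:
        return None
    top = max(results, key=lambda r: r[0])
    return top if top[0] > threshold else None


# Only the first tuple comes from the README; the second is an invented example.
sample = [(0.795, "Autobahn", "Raststätte"), (-0.5, "Auto", "Bahnraststätte")]

choice = best_split(sample)
if choice is not None:
    score, first, second = choice
    print(f"{first} + {second} (score {score})")
```

Using `max` rather than taking the first element avoids relying on the list being sorted, which the example output alone does not guarantee.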

Rust

use charsplit_fst::Splitter;

let splitter = Splitter::new()?;
let results = splitter.split_compound("Autobahnraststätte");

CLI

cargo run --bin charsplit-fst -- Autobahnraststätte

Algorithm

The algorithm scores each candidate split position of a German compound using n-gram probabilities:

Score formula: start_prob - in_prob + pre_prob

Where:

  • start_prob: Maximum prefix probability of second part
  • in_prob: Minimum infix probability crossing split boundary
  • pre_prob: Maximum suffix probability of first part
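
To make the formula concrete, here is a minimal sketch that applies it to every split position of a word. It is not the library's implementation: the three probability tables are tiny invented dictionaries, `score_split` and `max_n` are hypothetical names, and treating an n-gram missing from a table as contributing 0.0 is an assumption of this sketch.

```python
import math


def score_split(word, i, prefix, infix, suffix, max_n=4):
    """Score splitting `word` before character index `i` (hypothetical helper)."""
    w = word.lower()
    first, second = w[:i], w[i:]

    # pre_prob: maximum probability over suffixes of the first part.
    pre_prob = max(
        suffix.get(first[-n:], 0.0) for n in range(1, min(max_n, len(first)) + 1)
    )
    # start_prob: maximum probability over prefixes of the second part.
    start_prob = max(
        prefix.get(second[:n], 0.0) for n in range(1, min(max_n, len(second)) + 1)
    )
    # in_prob: minimum probability over known n-grams that cross the boundary,
    # i.e. substrings w[a:b] with a < i < b and length at most max_n.
    crossing = [
        infix[w[a:b]]
        for a in range(max(0, i - max_n + 1), i)
        for b in range(i + 1, min(len(w), a + max_n) + 1)
        if w[a:b] in infix
    ]
    in_prob = min(crossing, default=0.0)

    return start_prob - in_prob + pre_prob


# Invented toy tables, for illustration only.
suffix_probs = {"aus": 0.6, "us": 0.4}
prefix_probs = {"boo": 0.7, "bo": 0.5}
infix_probs = {"sb": 0.1, "usbo": 0.2, "au": 0.9}

word = "Hausboot"
scores = {
    i: score_split(word, i, prefix_probs, infix_probs, suffix_probs)
    for i in range(1, len(word))
}
best = max(scores, key=scores.get)
print(word[:best], word[best:], round(scores[best], 3))
```

With these toy values the position between "Haus" and "boot" scores highest: the first part ends in a likely word ending, the second part begins with a likely word beginning, and the n-grams straddling the boundary are unlikely to occur inside a single word.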

Performance

  • Memory: 19.6 MB RSS (vs 180 MB for Python)
  • Data size: 18.2 MB on disk (vs 39 MB JSON)

Web Demo

A browser-based demo using WebAssembly is available in web-demo/.

# Build the WASM version
./build-wasm.sh

# Serve from project root
python -m http.server 8000
# Open http://localhost:8000/web-demo/

The demo runs entirely in the browser using WebAssembly; no server-side processing is required.

Browser support: the demo uses Brotli decompression via the DecompressionStream API where the browser supports it and falls back to uncompressed data otherwise, so it works in all modern browsers.

Development

# Build
cargo build --release

# Run tests
cargo test

# Build Python bindings
maturin develop

# Build WASM
./build-wasm.sh

Acknowledgments

This project is a Rust port of CharSplit by Don Tuggener.

  • Algorithm: Based on Tuggener (2016), Incremental Coreference Resolution for German, University of Zurich.
  • Original Implementation: dtuggener/CharSplit (https://github.com/dtuggener/CharSplit) (MIT Licensed).
  • Data: The n-gram probabilities are derived from the model provided by the original author.

License

MIT OR Apache-2.0

See LICENSE-MIT and LICENSE-APACHE-2.0 for details.

Download files

Source Distribution

  • charsplit_fst-0.1.4.tar.gz (8.8 MB): Source

Built Distributions

  • charsplit_fst-0.1.4-cp312-abi3-win_amd64.whl (8.9 MB): CPython 3.12+, Windows x86-64
  • charsplit_fst-0.1.4-cp312-abi3-musllinux_1_2_x86_64.whl (9.1 MB): CPython 3.12+, musllinux (musl 1.2+) x86-64
  • charsplit_fst-0.1.4-cp312-abi3-manylinux_2_28_x86_64.whl (9.0 MB): CPython 3.12+, manylinux (glibc 2.28+) x86-64
  • charsplit_fst-0.1.4-cp312-abi3-macosx_11_0_arm64.whl (9.1 MB): CPython 3.12+, macOS 11.0+ ARM64
