German compound word splitter using Rust + FST

charsplit-fst

A memory-efficient Rust port of the CharSplit algorithm for German compound splitting, using Finite State Transducers (FST).

Overview

Charsplit-fst implements the CharSplit algorithm for splitting German compound words into their component parts. It achieves 89% memory reduction compared to the original Python implementation by using Finite State Transducer (FST) data structures.

Based on CharSplit by Don Tuggener: https://github.com/dtuggener/CharSplit

Features

  • 51% smaller data files: 39 MB JSON → 18.2 MB FST
  • 89% lower memory usage: 19.6 MB vs 180 MB runtime
  • UTF-8 safe: Proper character-based indexing for German Unicode characters
  • Python bindings via PyO3
  • WebAssembly demo for browser-based usage
  • CLI tool for batch processing
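The "UTF-8 safe" point above matters because German umlauts take two bytes in UTF-8, so a byte offset and a character offset can disagree. Rust strings are UTF-8 byte sequences, and slicing one in the middle of a character panics. The following is a minimal Python sketch of that distinction, not the library's code; the helper names `is_char_boundary` and `split_at_char` are made up for this example.

```python
def is_char_boundary(data: bytes, offset: int) -> bool:
    """Return True if `offset` does not fall inside a multi-byte UTF-8 character."""
    if offset == 0 or offset == len(data):
        return True
    # UTF-8 continuation bytes have the bit pattern 10xxxxxx.
    return (data[offset] & 0xC0) != 0x80


def split_at_char(word: str, index: int) -> tuple:
    """Split by character index, which is always safe for a Python str."""
    return word[:index], word[index:]


word = "Raststätte"
data = word.encode("utf-8")

# Ten characters, but eleven bytes, because "ä" encodes to two bytes.
print(len(word), len(data))

# Byte offset 7 lands in the middle of "ä"; byte offset 6 is its start.
print(is_char_boundary(data, 6), is_char_boundary(data, 7))

# Character index 7 is a valid split point.
print(split_at_char(word, 7))
```

Enumerating split points by character rather than by byte means no candidate split can land inside "ä", "ö", "ü", or "ß".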

Installation

Python

Available on PyPI.

pip install charsplit-fst

Rust

cargo add charsplit-fst

Quick Start

Python

from charsplit_fst import Splitter

splitter = Splitter()
results = splitter.split_compound("Autobahnraststätte")
# Returns: [(0.795, 'Autobahn', 'Raststätte'), ...]
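A common next step is to pick one split from the returned list. The sketch below works on plain `(score, first, second)` tuples shaped like the output shown above, so it runs without the package installed. The helper `best_split`, its `min_score` threshold, and the second sample entry are assumptions made for illustration; they are not part of the library.

```python
from typing import List, Optional, Tuple

Split = Tuple[float, str, str]


def best_split(results: List[Split], min_score: float = 0.0) -> Optional[Tuple[str, str]]:
    """Return the highest-scoring (first, second) pair, or None if nothing qualifies."""
    if not results:
        return None
    # Use max() rather than results[0] so no ordering of the list is assumed.
    score, first, second = max(results, key=lambda item: item[0])
    if score < min_score:
        return None  # treat the word as a non-compound
    return first, second


# The first entry comes from the README; the second is invented for the example.
sample: List[Split] = [
    (0.795, "Autobahn", "Raststätte"),
    (-0.5, "Auto", "Bahnraststätte"),
]

print(best_split(sample))
```

The threshold of 0.0 is only a placeholder; a suitable cut-off for deciding that a word is not a compound depends on the data.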

Rust

use charsplit_fst::Splitter;

// `?` propagates any construction error, so run this inside a function that returns a `Result`.
let splitter = Splitter::new()?;
let results = splitter.split_compound("Autobahnraststätte");

CLI

cargo run --bin charsplit-fst -- Autobahnraststätte

Algorithm

The algorithm splits German compounds by scoring candidate split positions with n-gram probabilities:

Score formula: start_prob - in_prob + pre_prob

Where:

  • start_prob: Maximum prefix probability of the second part
  • in_prob: Minimum infix probability of n-grams crossing the split boundary
  • pre_prob: Maximum suffix probability of the first part
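The formula above can be sketched in a few lines of Python. This is an illustration under stated assumptions, not the crate's implementation. The three probability tables are tiny invented stand-ins for the real model. The n-gram length limit `MAX_N`, the default of 0.0 when no crossing n-gram is known, and the function names `score_split` and `rank_splits` are all choices made for this example.

```python
from typing import Dict, List, Tuple

# Toy tables invented for this example; the real model ships with the package.
PREFIX: Dict[str, float] = {"ras": 0.4, "rast": 0.6}
INFIX: Dict[str, float] = {"nr": 0.05, "hnra": 0.02}
SUFFIX: Dict[str, float] = {"ahn": 0.5, "bahn": 0.7}
MAX_N = 4  # longest n-gram considered (assumption)


def score_split(word: str, pos: int) -> float:
    """Score splitting `word` before character index `pos` (0 < pos < len(word))."""
    first, second = word[:pos], word[pos:]
    # start_prob: best-scoring prefix n-gram of the second part.
    start_prob = max(PREFIX.get(second[:n], 0.0) for n in range(1, MAX_N + 1))
    # pre_prob: best-scoring suffix n-gram of the first part.
    pre_prob = max(SUFFIX.get(first[-n:], 0.0) for n in range(1, MAX_N + 1))
    # in_prob: lowest probability among known n-grams that straddle the boundary.
    crossing = [
        INFIX[word[start:start + n]]
        for n in range(2, MAX_N + 1)
        for start in range(max(0, pos - n + 1), pos)
        if start + n <= len(word) and word[start:start + n] in INFIX
    ]
    in_prob = min(crossing, default=0.0)  # 0.0 when nothing is known (assumption)
    return start_prob - in_prob + pre_prob


def rank_splits(word: str) -> List[Tuple[float, str, str]]:
    """Score every character position and return candidates, best first."""
    lower = word.lower()
    candidates = [
        (score_split(lower, pos), lower[:pos].capitalize(), lower[pos:].capitalize())
        for pos in range(1, len(lower))
    ]
    return sorted(candidates, reverse=True)


best = rank_splits("Autobahnraststätte")[0]
print(f"{best[0]:.2f}", best[1], best[2])
```

With these toy tables the boundary between "Autobahn" and "Raststätte" wins. The first part ends in a likely word ending ("bahn"), the second part begins with a likely word start ("rast"), and the n-grams crossing the boundary ("nr", "hnra") are rare inside words, so little is subtracted.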

Performance

  • Memory: 19.6 MB RSS (vs 180 MB for Python)
  • Data size: 18.2 MB on disk (vs 39 MB JSON)

Web Demo

A browser-based demo using WebAssembly is available in web-demo/.

# Build the WASM version
./build-wasm.sh

# Serve from project root
python -m http.server 8000
# Open http://localhost:8000/web-demo/

The demo runs entirely in the browser using WebAssembly; no server-side processing is required.

Browser support: where the browser's DecompressionStream API supports Brotli, the demo loads Brotli-compressed data and decompresses it in the browser. Otherwise it falls back to uncompressed data. It works in all modern browsers.

Development

# Build
cargo build --release

# Run tests
cargo test

# Build Python bindings
maturin develop

# Build WASM
./build-wasm.sh

Acknowledgments

This project is a Rust port of CharSplit by Don Tuggener.

  • Algorithm: Based on Tuggener (2016), Incremental Coreference Resolution for German, University of Zurich.
  • Original Implementation: dtuggener/CharSplit (https://github.com/dtuggener/CharSplit) (MIT Licensed).
  • Data: The n-gram probabilities are derived from the model provided by the original author.

License

MIT OR Apache-2.0

See LICENSE-MIT and LICENSE-APACHE-2.0 for details.

Download files

Source Distribution

  • charsplit_fst-0.1.3.tar.gz (8.8 MB): Source

Built Distributions

  • charsplit_fst-0.1.3-cp312-abi3-win_amd64.whl (8.9 MB): CPython 3.12+, Windows x86-64
  • charsplit_fst-0.1.3-cp312-abi3-musllinux_1_2_x86_64.whl (9.1 MB): CPython 3.12+, musllinux (musl 1.2+), x86-64
  • charsplit_fst-0.1.3-cp312-abi3-manylinux_2_28_x86_64.whl (9.0 MB): CPython 3.12+, manylinux (glibc 2.28+), x86-64
  • charsplit_fst-0.1.3-cp312-abi3-macosx_11_0_arm64.whl (9.1 MB): CPython 3.12+, macOS 11.0+, ARM64
