German compound word splitter using Rust + FST

charsplit-fst

A memory-efficient Rust port of the CharSplit algorithm for German compound splitting, using Finite State Transducers (FST).

Overview

charsplit-fst implements the CharSplit algorithm for splitting German compound words into their component parts. By storing the n-gram data in Finite State Transducer (FST) structures, it uses about 89% less runtime memory than the original Python implementation (19.6 MB vs 180 MB RSS).
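To give a rough intuition for why a shared-structure index can be smaller than a flat key list, here is a minimal sketch that counts nodes in a plain prefix trie. This is a simplification and not the crate's data structure: a real FST also shares suffixes and stores output values on its transitions. The n-gram keys are made up for illustration.

```python
def count_trie_nodes(keys):
    """Count the nodes of a plain prefix trie built from `keys`."""
    root = {}
    nodes = 0
    for key in keys:
        node = root
        for ch in key:
            if ch not in node:
                node[ch] = {}  # new branch: one extra node
                nodes += 1
            node = node[ch]
    return nodes


# Hypothetical n-gram keys, chosen only for illustration.
ngrams = ["rast", "rasts", "raststätte", "rat", "rate"]

total_chars = sum(len(k) for k in ngrams)  # cost of storing every key in full
shared_nodes = count_trie_nodes(ngrams)    # cost once common prefixes are shared

print(total_chars, shared_nodes)
```

Keys that start with the same characters reuse the same path, so the node count grows more slowly than the total number of characters.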

Based on CharSplit by Don Tuggener: https://github.com/dtuggener/CharSplit

Features

  • Roughly half-size data files: 39 MB JSON → 18.2 MB FST
  • 89% lower memory usage: 19.6 MB vs 180 MB runtime
  • UTF-8 safe: Proper character-based indexing for German Unicode characters
  • Python bindings via PyO3
  • WebAssembly demo for browser-based usage
  • CLI tool for batch processing
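The UTF-8 point matters because characters such as "ä" occupy two bytes, so a byte offset and a character offset can disagree. The sketch below is not taken from the library; it only shows, in plain Python, what goes wrong when a split position is treated as a byte offset. The helper name `byte_split_is_valid` is hypothetical.

```python
word = "Raststätte"
encoded = word.encode("utf-8")

# 'ä' needs two bytes in UTF-8, so the byte length exceeds the character count.
char_count = len(word)
byte_count = len(encoded)

# Splitting by character index always yields two valid strings.
char_split = (word[:7], word[7:])


def byte_split_is_valid(data, offset):
    """Return True if cutting `data` at byte `offset` leaves valid UTF-8 on both sides."""
    try:
        data[:offset].decode("utf-8")
        data[offset:].decode("utf-8")
        return True
    except UnicodeDecodeError:
        return False


print(char_count, byte_count, char_split)
print(byte_split_is_valid(encoded, 6), byte_split_is_valid(encoded, 7))
```

In Rust, slicing a `str` at a byte offset that falls inside a multi-byte character panics, which is why a port has to map split positions to character boundaries.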

Installation

Python

Available on PyPI.

pip install charsplit-fst

Rust

cargo add charsplit-fst

Quick Start

Python

from charsplit_fst import Splitter

splitter = Splitter()
results = splitter.split_compound("Autobahnraststätte")
# Returns: [(0.795, 'Autobahn', 'Raststätte'), ...]
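A common next step is to pick the highest-scoring candidate. The sketch below assumes only the `(score, first, second)` tuple shape shown above and does not rely on the result order. `best_split` is a hypothetical helper, and the second sample entry is invented so the snippet runs without the package installed.

```python
def best_split(results):
    """Return the highest-scoring (score, first, second) tuple, or None if empty."""
    if not results:
        return None
    return max(results, key=lambda item: item[0])


# Sample data shaped like the documented return value; the second entry is made up.
sample = [
    (0.795, "Autobahn", "Raststätte"),
    (-0.4, "Auto", "Bahnraststätte"),
]

score, first, second = best_split(sample)
print(f"{first} + {second} (score {score})")
```

With the real package, `sample` would be replaced by the list returned from `splitter.split_compound(...)`.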

Rust

use charsplit_fst::Splitter;

// The `?` operator requires an enclosing function that returns a Result.
let splitter = Splitter::new()?;
let results = splitter.split_compound("Autobahnraststätte");

CLI

cargo run --bin charsplit-fst -- Autobahnraststätte

Algorithm

The algorithm scores every candidate split position of a German compound using n-gram probabilities:

Score formula: start_prob - in_prob + pre_prob

Where:

  • start_prob: Maximum prefix probability of second part
  • in_prob: Minimum infix probability crossing split boundary
  • pre_prob: Maximum suffix probability of first part
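The score formula above can be sketched in a few lines of Python. This is a simplified illustration, not the library's implementation: the probability tables are tiny invented dictionaries, the minimum n-gram length of 3 is an assumption, and so are the defaults for unseen n-grams (0.0 for prefixes and suffixes, 1.0 for infixes). The function names are hypothetical.

```python
MIN_N = 3  # assumed minimum n-gram length and minimum part length


def score_split(word, pos, prefix_probs, infix_probs, suffix_probs):
    """Score a split of `word` at character index `pos`: start_prob - in_prob + pre_prob."""
    first, second = word[:pos], word[pos:]
    # n-grams that begin the second part
    starts = [second[:n] for n in range(MIN_N, len(second) + 1)]
    # n-grams that end the first part
    ends = [first[-n:] for n in range(MIN_N, len(first) + 1)]
    # n-grams with at least one character on each side of the boundary
    crossing = [
        word[i:j]
        for i in range(pos)
        for j in range(pos + 1, len(word) + 1)
        if j - i >= MIN_N
    ]
    start_prob = max((prefix_probs.get(g, 0.0) for g in starts), default=0.0)
    pre_prob = max((suffix_probs.get(g, 0.0) for g in ends), default=0.0)
    in_prob = min((infix_probs.get(g, 1.0) for g in crossing), default=1.0)
    return start_prob - in_prob + pre_prob


def best_position(word, prefix_probs, infix_probs, suffix_probs):
    """Return the best (score, first, second) over all candidate positions, or None."""
    scored = [
        (score_split(word, p, prefix_probs, infix_probs, suffix_probs), word[:p], word[p:])
        for p in range(MIN_N, len(word) - MIN_N + 1)
    ]
    return max(scored) if scored else None


# Invented toy probabilities for "hausboot" (Haus + Boot).
prefix_probs = {"boo": 0.6, "boot": 0.9}
suffix_probs = {"aus": 0.5, "haus": 0.8}
infix_probs = {"usb": 0.2, "sbo": 0.1}

result = best_position("hausboot", prefix_probs, infix_probs, suffix_probs)
print(result)
```

A high prefix probability for the second part and a high suffix probability for the first part raise the score. The infix term is subtracted, so a boundary is penalized when every n-gram spanning it commonly occurs inside words.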

Performance

  • Memory: 19.6 MB RSS (vs 180 MB for Python)
  • Data size: 18.2 MB on disk (vs 39 MB JSON)

Web Demo

A browser-based demo using WebAssembly is available in web-demo/.

# Build the WASM version
./build-wasm.sh

# Serve from project root
python -m http.server 8000
# Open http://localhost:8000/web-demo/

The demo runs entirely in the browser using WebAssembly; no server-side processing is required.

Browser support: the demo tries to load Brotli-compressed data via the DecompressionStream API where the browser supports it, and falls back to uncompressed data otherwise. It works in all modern browsers.

Development

# Build
cargo build --release

# Run tests
cargo test

# Build Python bindings
maturin develop

# Build WASM
./build-wasm.sh

Acknowledgments

This project is a Rust port of CharSplit by Don Tuggener.

  • Algorithm: Based on Tuggener (2016), Incremental Coreference Resolution for German, University of Zurich.
  • Original Implementation: dtuggener/CharSplit (https://github.com/dtuggener/CharSplit) (MIT Licensed).
  • Data: The n-gram probabilities are derived from the model provided by the original author.

License

MIT OR Apache-2.0

See LICENSE-MIT and LICENSE-APACHE-2.0 for details.

Download files

Source Distributions

No source distribution files are available for this release.

Built Distributions

  • charsplit_fst-0.1.2-cp313-cp313-win_amd64.whl (8.9 MB): CPython 3.13, Windows x86-64
  • charsplit_fst-0.1.2-cp313-cp313-musllinux_1_2_x86_64.whl (9.1 MB): CPython 3.13, musllinux (musl 1.2+) x86-64
  • charsplit_fst-0.1.2-cp313-cp313-manylinux_2_28_x86_64.whl (9.0 MB): CPython 3.13, manylinux (glibc 2.28+) x86-64
  • charsplit_fst-0.1.2-cp313-cp313-macosx_11_0_arm64.whl (9.1 MB): CPython 3.13, macOS 11.0+ ARM64
  • charsplit_fst-0.1.2-cp312-cp312-win_amd64.whl (8.9 MB): CPython 3.12, Windows x86-64
  • charsplit_fst-0.1.2-cp312-cp312-musllinux_1_2_x86_64.whl (9.1 MB): CPython 3.12, musllinux (musl 1.2+) x86-64
  • charsplit_fst-0.1.2-cp312-cp312-manylinux_2_28_x86_64.whl (9.0 MB): CPython 3.12, manylinux (glibc 2.28+) x86-64
  • charsplit_fst-0.1.2-cp312-cp312-macosx_11_0_arm64.whl (9.1 MB): CPython 3.12, macOS 11.0+ ARM64