
Split German words


split-words

CharSplit repackaged with Poetry and published on PyPI as split-words. All credit goes to the original author.

pip install split-words
# replace charsplit with split_words in the following, e.g.
from split_words import Splitter
...

CharSplit - An ngram-based compound splitter for German


Splits a German compound into its body and head, e.g.

Autobahnraststätte -> Autobahn - Raststätte

Implementation of the method described in the appendix of the thesis:

Tuggener, Don (2016). Incremental Coreference Resolution for German. University of Zurich, Faculty of Arts.

TL;DR: The method calculates the probabilities of n-grams occurring at the beginning, at the end, and in the middle of words, and uses them to identify the most likely position for a split.
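To make the idea concrete, here is a minimal toy sketch. It is not CharSplit's actual scoring formula: the probability tables, the function names `score_position` and `rank_splits`, and the way the three values are combined are all assumptions made up for illustration. A split position scores well when the n-gram before it often ends words, the n-gram after it often starts words, and the n-gram spanning it rarely occurs inside words.

```python
# Hypothetical toy probabilities, invented for illustration only.
# A real model would estimate these from a large list of nouns.
START_PROBS = {"rast": 0.6, "bahn": 0.3, "stät": 0.2}    # n-gram begins a word
END_PROBS = {"bahn": 0.7, "auto": 0.4, "rast": 0.3}      # n-gram ends a word
MIDDLE_PROBS = {"hnra": 0.05, "toba": 0.5, "stst": 0.4}  # n-gram sits inside a word


def score_position(word, i, n=4):
    """Score a split of `word` before index i from the n-grams around the boundary."""
    w = word.lower()
    left_end = w[max(0, i - n):i]                   # would end the left part
    right_start = w[i:i + n]                        # would start the right part
    across = w[max(0, i - n // 2):i + n // 2]       # spans the boundary
    return (END_PROBS.get(left_end, 0.0)
            + START_PROBS.get(right_start, 0.0)
            - MIDDLE_PROBS.get(across, 0.0))


def rank_splits(word, min_len=3, n=4):
    """Return (score, left, right) tuples for every candidate position, best first."""
    candidates = [
        (score_position(word, i, n), word[:i].capitalize(), word[i:].capitalize())
        for i in range(min_len, len(word) - min_len + 1)
    ]
    return sorted(candidates, reverse=True)


best = rank_splits("Autobahnraststätte")[0]
print(best[1], "-", best[2])
```

With these made-up tables, the boundary between "bahn" and "rast" scores highest because both neighbouring n-grams are likely word edges and the n-gram across the boundary is an unlikely word interior.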

The method achieves ~95% accuracy for head detection on the GermaNet compound test set.

A model trained on 1 million German nouns from Wikipedia is provided.

Usage

Train a new model:

training.py --input_file --output_file

Run this from the command line, where input_file contains one word (noun) per line and output_file is a JSON file that receives the computed n-gram probabilities.
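The following sketch illustrates what such a training step might look like. It is not the code in training.py: the function names, the fixed n-gram length, and the JSON layout with the keys "prefix", "infix", and "suffix" are assumptions chosen for illustration, and the real model file may be organised differently.

```python
import json
from collections import Counter


def ngram_position_probs(words, n=3):
    """Relative frequency of each n-gram at the start, inside, and at the end of words."""
    start, middle, end, total = Counter(), Counter(), Counter(), Counter()
    for word in words:
        w = word.strip().lower()
        if len(w) < n:
            continue  # too short to contain an n-gram of this length
        last = len(w) - n
        for i in range(last + 1):
            gram = w[i:i + n]
            total[gram] += 1
            if i == 0:
                start[gram] += 1
            if i == last:
                end[gram] += 1
            if 0 < i < last:
                middle[gram] += 1
    # Hypothetical layout: one table per position, keyed by n-gram.
    return {
        "prefix": {g: start[g] / total[g] for g in total},
        "infix": {g: middle[g] / total[g] for g in total},
        "suffix": {g: end[g] / total[g] for g in total},
    }


def train(input_path, output_path, n=3):
    """Read one word per line and write the probability tables as JSON."""
    with open(input_path, encoding="utf-8") as f:
        probs = ngram_position_probs(f, n=n)
    with open(output_path, "w", encoding="utf-8") as f:
        json.dump(probs, f, ensure_ascii=False)


probs = ngram_position_probs(["Autobahn", "Bahnhof"], n=4)
print(probs["prefix"]["bahn"], probs["suffix"]["bahn"])
```

In this toy corpus "bahn" occurs twice, once at the end of "Autobahn" and once at the start of "Bahnhof", so it gets the same relative frequency in both tables.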

Compound splitting

In Python:

>>> from charsplit import Splitter
>>> splitter = Splitter()
>>> splitter.split_compound("Autobahnraststätte")

returns a list of all possible splits, ranked by their score, e.g.

[(0.7945872450631273, 'Autobahn', 'Raststätte'),
(-0.7143290887876655, 'Auto', 'Bahnraststätte'),
(-1.1132332878581173, 'Autobahnrast', 'Stätte'), ...]
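Because the result is a plain list of (score, body, head) tuples, it is easy to post-process. The helper below is a hypothetical convenience function, not part of the package: it keeps only the top-ranked split and rejects it when the score does not exceed a threshold. The default threshold of 0.0 is an assumption and should be tuned on your own data. The sample list reuses the three tuples shown above so the snippet runs without the package installed.

```python
def best_split(ranked_splits, threshold=0.0):
    """Return (body, head) of the highest-scoring split, or None if it scores too low."""
    if not ranked_splits:
        return None
    score, body, head = max(ranked_splits, key=lambda item: item[0])
    return (body, head) if score > threshold else None


# The example scores from above, standing in for splitter.split_compound(...).
sample = [
    (0.7945872450631273, "Autobahn", "Raststätte"),
    (-0.7143290887876655, "Auto", "Bahnraststätte"),
    (-1.1132332878581173, "Autobahnrast", "Stätte"),
]

print(best_split(sample))      # ('Autobahn', 'Raststätte')
print(best_split(sample[1:]))  # None, because the best remaining score is negative
```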

By default, Splitter uses the data from the file charsplit/ngram_probs.json. If you retrained the model, you may specify a custom file with

>>> splitter = Splitter(ngram_path=<json_data_file_with_ngram_probs>)
