Split German words

Project description

split-words

CharSplit repacked with Poetry and published on PyPI as split-words; all credit goes to the original author.

pip install split-words
# in your code, replace charsplit with split_words, e.g.
from split_words import Splitter
...

CharSplit - An ngram-based compound splitter for German

Splits a German compound into its body and head, e.g.

Autobahnraststätte -> Autobahn - Raststätte

Implementation of the method described in the appendix of the thesis:

Tuggener, Don (2016). Incremental Coreference Resolution for German. University of Zurich, Faculty of Arts.

TL;DR: The method calculates the probabilities of ngrams occurring at the beginning, in the middle, and at the end of words, and identifies the most likely split position.
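
A minimal sketch of that idea in Python, not the package's actual implementation (function and variable names are illustrative): count how often each ngram occurs at the start, at the end, and in the middle of the words in a training list.

from collections import Counter

def ngram_position_counts(words, n=3):
    # Tally each ngram by the position it occupies within a word:
    # word-initial, word-final, or in between.
    prefix, suffix, infix = Counter(), Counter(), Counter()
    for word in words:
        word = word.lower()
        for i in range(len(word) - n + 1):
            gram = word[i:i + n]
            if i == 0:
                prefix[gram] += 1
            elif i == len(word) - n:
                suffix[gram] += 1
            else:
                infix[gram] += 1
    return prefix, suffix, infix

A splitter can then score each candidate split point by how strongly the ngram ending the left part resembles a word ending and the ngram starting the right part resembles a word beginning.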

The method achieves ~95% accuracy for head detection on the GermaNet compound test set.

A pre-trained model is provided, trained on 1 million German nouns from Wikipedia.

Usage

Train a new model:

training.py --input_file <input_file> --output_file <output_file>

from the command line, where <input_file> contains one word (noun) per line and <output_file> is a JSON file that receives the computed ngram probabilities.
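
Assuming the two flags take file paths as values, an invocation might look like this (file names are illustrative):

python training.py --input_file german_nouns.txt --output_file my_ngram_probs.json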

Compound splitting

In Python:

>>> from charsplit import Splitter
>>> splitter = Splitter()
>>> splitter.split_compound("Autobahnraststätte")

returns a list of all possible splits, ranked by their score, e.g.

[(0.7945872450631273, 'Autobahn', 'Raststätte'),
(-0.7143290887876655, 'Auto', 'Bahnraststätte'),
(-1.1132332878581173, 'Autobahnrast', 'Stätte'), ...]
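
Since the list is sorted by score, the top-ranked split is the first element, e.g.

>>> score, body, head = splitter.split_compound("Autobahnraststätte")[0]
>>> body, head
('Autobahn', 'Raststätte')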

By default, Splitter uses the data from the file charsplit/ngram_probs.json. If you have retrained the model, you can specify a custom file with

>>> splitter = Splitter(ngram_path=<json_data_file_with_ngram_probs>)

Download files

Download the file for your platform. If you're not sure which to choose, learn more about installing packages in the Python Packaging User Guide.

Source Distribution

split_words-0.1.3.tar.gz (12.9 MB)

Uploaded Source

Built Distribution

split_words-0.1.3-py3-none-any.whl (13.3 MB)

Uploaded Python 3

File details

Details for the file split_words-0.1.3.tar.gz.

File metadata

  • Download URL: split_words-0.1.3.tar.gz
  • Upload date:
  • Size: 12.9 MB
  • Tags: Source
  • Uploaded using Trusted Publishing? No
  • Uploaded via: poetry/1.5.1 CPython/3.8.10 Windows/10

File hashes

Hashes for split_words-0.1.3.tar.gz

  • SHA256: 7dbb69a85a722355c32de8520fe4d083aa9b4020cbe2cb44175758260fdadeb0
  • MD5: 7edf0d1941190040ce0b900003c4c02e
  • BLAKE2b-256: 6a28f9c3c5e886b9e2a0088e50c0f87c3080b345be46bc7bf540e18ae5ab0e94

See the PyPI documentation for more details on using hashes.
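
As a small sketch, the SHA256 digest above can be checked with Python's standard hashlib (the local file name is assumed):

import hashlib

# Compute the SHA256 digest of the downloaded archive and compare it
# with the published value.
with open("split_words-0.1.3.tar.gz", "rb") as f:
    digest = hashlib.sha256(f.read()).hexdigest()

print(digest == "7dbb69a85a722355c32de8520fe4d083aa9b4020cbe2cb44175758260fdadeb0")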

File details

Details for the file split_words-0.1.3-py3-none-any.whl.

File metadata

  • Download URL: split_words-0.1.3-py3-none-any.whl
  • Upload date:
  • Size: 13.3 MB
  • Tags: Python 3
  • Uploaded using Trusted Publishing? No
  • Uploaded via: poetry/1.5.1 CPython/3.8.10 Windows/10

File hashes

Hashes for split_words-0.1.3-py3-none-any.whl

  • SHA256: 2bd73849a7786a2a1892d87a2cde2ecf78b69bcdc9146a2f38680e74d3908274
  • MD5: 9d4b666dfd060d6693d4efaf2f2f7da3
  • BLAKE2b-256: 0066c753381236dd46ccecb423292ae16e44768588ac922093868291080448c1

See the PyPI documentation for more details on using hashes.
