# split-words

Split German words

`CharSplit` repacked with poetry and published on PyPI as `split-words` -- all credit goes to the original author.

```bash
pip install split-words
```
```python
# replace charsplit with split_words in the following, e.g.
from split_words import Splitter

splitter = Splitter()
splitter.split_compound("Autobahnraststätte")
```
## CharSplit - An ngram-based compound splitter for German
Splits a German compound into its body and head, e.g. `Autobahnraststätte -> Autobahn - Raststätte`.
Implementation of the method described in the appendix of the thesis:
Tuggener, Don (2016). Incremental Coreference Resolution for German. University of Zurich, Faculty of Arts.
TL;DR: The method calculates the probabilities of ngrams occurring at the beginning, in the middle, and at the end of words, and identifies the most likely position for a split.
The method achieves ~95% accuracy for head detection on the GermaNet compound test set.
A model trained on 1 million German nouns from Wikipedia is provided.
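
For intuition, here is a minimal sketch of that idea. This is not the exact scoring formula from the thesis appendix; `prefix_probs` and `suffix_probs` are hypothetical dictionaries mapping ngrams to their probability of occurring word-finally and word-initially, respectively.

```python
# A minimal sketch of ngram-based split scoring, NOT the exact CharSplit
# formula (see the thesis appendix for that). prefix_probs / suffix_probs
# are hypothetical dicts: ngram -> probability of occurring at the end /
# at the beginning of a word.
def score_splits(word, prefix_probs, suffix_probs, n=3):
    candidates = []
    for i in range(n, len(word) - n + 1):
        left, right = word[:i], word[i:]
        # Reward splits where `left` ends like a word and `right` starts like one.
        score = (prefix_probs.get(left[-n:].lower(), 0.0)
                 + suffix_probs.get(right[:n].lower(), 0.0))
        candidates.append((score, left, right.capitalize()))
    return sorted(candidates, reverse=True)
```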
## Usage
### Train a new model

Run

```
training.py --input_file --output_file
```

from the command line, where `input_file` contains one word (noun) per line and `output_file` is a JSON file with the computed ngram probabilities.
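
For example, assuming your noun list lives in a file named `nouns.txt` (a hypothetical name), the invocation would look something like:

```bash
# nouns.txt is a hypothetical word list (one noun per line); the output name
# matches the default file the Splitter loads.
python training.py --input_file nouns.txt --output_file ngram_probs.json
```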
### Compound splitting

In Python:
```python
>>> from charsplit import Splitter
>>> splitter = Splitter()
>>> splitter.split_compound("Autobahnraststätte")
```
This returns a list of all possible splits, ranked by their score, e.g.

```python
[(0.7945872450631273, 'Autobahn', 'Raststätte'),
 (-0.7143290887876655, 'Auto', 'Bahnraststätte'),
 (-1.1132332878581173, 'Autobahnrast', 'Stätte'), ...]
```
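
Since the candidates are sorted best-first, picking the most likely split is just a matter of taking the first tuple (the variable names below are illustrative):

```python
>>> score, body, head = splitter.split_compound("Autobahnraststätte")[0]
>>> body, head
('Autobahn', 'Raststätte')
```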
By default, `Splitter` uses the data from the file `charsplit/ngram_probs.json`. If you retrained the model, you may specify a custom file with

```python
>>> splitter = Splitter(ngram_path=<json_data_file_with_ngram_probs>)
```
## File details

Details for the file `split_words-0.1.3.tar.gz`.

### File metadata
- Download URL: split_words-0.1.3.tar.gz
- Upload date:
- Size: 12.9 MB
- Tags: Source
- Uploaded using Trusted Publishing? No
- Uploaded via: poetry/1.5.1 CPython/3.8.10 Windows/10
### File hashes

Algorithm | Hash digest
---|---
SHA256 | `7dbb69a85a722355c32de8520fe4d083aa9b4020cbe2cb44175758260fdadeb0`
MD5 | `7edf0d1941190040ce0b900003c4c02e`
BLAKE2b-256 | `6a28f9c3c5e886b9e2a0088e50c0f87c3080b345be46bc7bf540e18ae5ab0e94`
## File details

Details for the file `split_words-0.1.3-py3-none-any.whl`.

### File metadata
- Download URL: split_words-0.1.3-py3-none-any.whl
- Upload date:
- Size: 13.3 MB
- Tags: Python 3
- Uploaded using Trusted Publishing? No
- Uploaded via: poetry/1.5.1 CPython/3.8.10 Windows/10
### File hashes

Algorithm | Hash digest
---|---
SHA256 | `2bd73849a7786a2a1892d87a2cde2ecf78b69bcdc9146a2f38680e74d3908274`
MD5 | `9d4b666dfd060d6693d4efaf2f2f7da3`
BLAKE2b-256 | `0066c753381236dd46ccecb423292ae16e44768588ac922093868291080448c1`
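
If you want to check a downloaded file against the digests above, here is a small sketch using Python's standard `hashlib`, assuming the wheel sits in the current directory:

```python
import hashlib

# Compare a local copy of the wheel against the SHA256 digest listed above.
with open("split_words-0.1.3-py3-none-any.whl", "rb") as f:
    digest = hashlib.sha256(f.read()).hexdigest()

assert digest == "2bd73849a7786a2a1892d87a2cde2ecf78b69bcdc9146a2f38680e74d3908274"
```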