Split German words
Project description
split-words
CharSplit repackaged with Poetry and published on PyPI as split-words.
All credit goes to the original author.
pip install split-words
# replace charsplit with split_words in the following, e.g.
from split_words import Splitter
...
CharSplit - An ngram-based compound splitter for German
Splits a German compound into its body and head, e.g.
Autobahnraststätte -> Autobahn - Raststätte
Implementation of the method described in the appendix of the thesis:
Tuggener, Don (2016). Incremental Coreference Resolution for German. University of Zurich, Faculty of Arts.
TL;DR: The method calculates the probabilities of n-grams occurring at the beginning, in the middle, and at the end of words, and identifies the most likely position for a split.
The method achieves ~95% accuracy for head detection on the GermaNet compound test set.
A model is provided, trained on 1 million German nouns from Wikipedia.
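For intuition, here is a minimal sketch of that idea using raw counts instead of probabilities (all names are hypothetical; this is not the package's actual implementation):

```python
from collections import Counter

def ngram_position_counts(words, n=3):
    """Count how often each character n-gram occurs at the
    beginning, in the middle, and at the end of training words."""
    prefix, infix, suffix = Counter(), Counter(), Counter()
    for word in words:
        word = word.lower()
        for i in range(len(word) - n + 1):
            gram = word[i:i + n]
            if i == 0:
                prefix[gram] += 1
            elif i == len(word) - n:
                suffix[gram] += 1
            else:
                infix[gram] += 1
    return prefix, infix, suffix

def rank_splits(word, prefix, infix, suffix, n=3):
    """Rank candidate split positions: a plausible split point
    follows an n-gram that often ends words and precedes an
    n-gram that often begins words."""
    word = word.lower()
    candidates = []
    for pos in range(n, len(word) - n + 1):
        left = word[pos - n:pos]   # n-gram ending at the split point
        right = word[pos:pos + n]  # n-gram starting at the split point
        score = suffix[left] + prefix[right] - infix[left] - infix[right]
        candidates.append((score, word[:pos], word[pos:]))
    return sorted(candidates, reverse=True)
```

The real model works with probabilities rather than raw counts and includes the refinements described in the thesis appendix, but the ranking principle is the same.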
Usage
Train a new model:
training.py --input_file <input_file> --output_file <output_file>
from the command line, where <input_file> contains one word (noun) per line and <output_file> is the JSON file the computed n-gram probabilities are written to.
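For example, with hypothetical file names (each line of the input holding one noun):

```
training.py --input_file german_nouns.txt --output_file my_ngram_probs.json
```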
Compound splitting
In Python
>>> from charsplit import Splitter
>>> splitter = Splitter()
>>> splitter.split_compound("Autobahnraststätte")
returns a list of all possible splits, ranked by their score, e.g.
[(0.7945872450631273, 'Autobahn', 'Raststätte'),
(-0.7143290887876655, 'Auto', 'Bahnraststätte'),
(-1.1132332878581173, 'Autobahnrast', 'Stätte'), ...]
By default, Splitter uses the data from the file charsplit/ngram_probs.json. If you have retrained the model, you may specify a custom file with
>>> splitter = Splitter(ngram_path=<json_data_file_with_ngram_probs>)
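Putting the two steps together, a retrained model is used like the default one (the file name below is illustrative):

```python
from split_words import Splitter

# load a retrained model (path is illustrative)
splitter = Splitter(ngram_path="my_ngram_probs.json")

# split_compound returns (score, left, right) tuples, best-scoring first
score, left, right = splitter.split_compound("Autobahnraststätte")[0]
print(left, right)  # Autobahn Raststätte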
File details
Details for the file split_words-0.1.3.tar.gz.
File metadata
- Download URL: split_words-0.1.3.tar.gz
- Upload date:
- Size: 12.9 MB
- Tags: Source
- Uploaded using Trusted Publishing? No
- Uploaded via: poetry/1.5.1 CPython/3.8.10 Windows/10
File hashes
| Algorithm | Hash digest |
|---|---|
| SHA256 | `7dbb69a85a722355c32de8520fe4d083aa9b4020cbe2cb44175758260fdadeb0` |
| MD5 | `7edf0d1941190040ce0b900003c4c02e` |
| BLAKE2b-256 | `6a28f9c3c5e886b9e2a0088e50c0f87c3080b345be46bc7bf540e18ae5ab0e94` |
File details
Details for the file split_words-0.1.3-py3-none-any.whl.
File metadata
- Download URL: split_words-0.1.3-py3-none-any.whl
- Upload date:
- Size: 13.3 MB
- Tags: Python 3
- Uploaded using Trusted Publishing? No
- Uploaded via: poetry/1.5.1 CPython/3.8.10 Windows/10
File hashes
| Algorithm | Hash digest |
|---|---|
| SHA256 | `2bd73849a7786a2a1892d87a2cde2ecf78b69bcdc9146a2f38680e74d3908274` |
| MD5 | `9d4b666dfd060d6693d4efaf2f2f7da3` |
| BLAKE2b-256 | `0066c753381236dd46ccecb423292ae16e44768588ac922093868291080448c1` |