Splits a compound into its body and head. So far German and Dutch are supported.

License
- OSI Approved :: GNU General Public License v3 (GPLv3)
Operating System
- OS Independent
Programming Language
- Python :: 3

Project description

CharSplit - An ngram-based compound splitter for German

Splits a German compound into its body and head, e.g.

Autobahnraststätte -> Autobahn - Raststätte

Implementation of the method described in the appendix of the thesis:

Tuggener, Don (2016). Incremental Coreference Resolution for German. University of Zurich, Faculty of Arts.

TL;DR

The method calculates probabilities of ngrams occurring at the beginning, end and in the middle of words and identifies the most likely position for a split.
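The idea can be illustrated with a deliberately simplified sketch. This is not the CharSplit implementation: the fixed n-gram length, the scoring formula (probability that the n-gram before the split ends a word plus probability that the n-gram after it starts a word), and the names `train` and `best_split` are all assumptions made for illustration.

```python
from collections import Counter

def train(words, n=3):
    """Count n-grams at the start of words, at the end of words, and anywhere."""
    start, end, anywhere = Counter(), Counter(), Counter()
    for w in words:
        w = w.lower()
        if len(w) < n:
            continue
        start[w[:n]] += 1
        end[w[-n:]] += 1
        for i in range(len(w) - n + 1):
            anywhere[w[i:i + n]] += 1
    return start, end, anywhere

def best_split(word, model, n=3):
    """Return (score, body, head) for the highest-scoring split position, or None."""
    start, end, anywhere = model
    w = word.lower()
    best = None
    for i in range(n, len(w) - n + 1):
        left, right = w[i - n:i], w[i:i + n]
        # How often does the left n-gram end a word / the right n-gram start one?
        p_end = end[left] / anywhere[left] if anywhere[left] else 0.0
        p_start = start[right] / anywhere[right] if anywhere[right] else 0.0
        score = p_end + p_start
        if best is None or score > best[0]:
            best = (score, word[:i], word[i:].capitalize())
    return best

# Tiny toy training set; the provided model is trained on far more data.
model = train(["Autobahn", "Raststätte", "Bahn", "Auto", "Stätte", "Rast"])
print(best_split("Autobahnraststätte", model))  # (2.0, 'Autobahn', 'Raststätte')
```

With so little training data the scores are only meaningful relative to each other; the real method uses its own scoring and a much larger model.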

The method achieves ~95% accuracy for head detection on the GermaNet compound test set.

A model trained on 1 million German nouns from Wikipedia is provided.

Usage

Train a new model:

$ python char_split_train.py <your_train_file>

where <your_train_file> contains one word (noun) per line.

Compound splitting

From command line:

$ python char_split.py <word>

Outputs all possible splits, ranked by their score, e.g.

$ python char_split.py Autobahnraststätte
0.84096566854	Autobahn	Raststätte
-0.54568851959	Auto	Bahnraststätte
-0.719082070993	Autobahnrast	Stätte
...

As a module:

$ python
>>> from compound_split import char_split
>>> char_split.split_compound('Autobahnraststätte')
[[0.7945872450631273, 'Autobahn', 'Raststätte'],
 [-0.7143290887876655, 'Auto', 'Bahnraststätte'],  
 [-1.1132332878581173, 'Autobahnrast', 'Stätte'],  
 [-1.4010051533086552, 'Aut', 'Obahnraststätte'],  
 [-2.3447843979244944, 'Autobahnrasts', 'Tätte'],  
 [-2.4761904761904763, 'Autobahnra', 'Ststätte'],  
 [-2.4761904761904763, 'Autobahnr', 'Aststätte'],  
 [-2.5733333333333333, 'Autob', 'Ahnraststätte'],  
 [-2.604651162790698, 'Autobahnras', 'Tstätte'],  
 [-2.7142857142857144, 'Autobah', 'Nraststätte'],  
 [-2.730248306997743, 'Autobahnrastst', 'Ätte'],  
 [-2.8033113109925973, 'Autobahnraststä', 'Tte'],  
 [-3.0, 'Autoba', 'Hnraststätte']]
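Since the result is a list of `[score, body, head]` entries, a caller usually only needs the top-ranked one. The helper below is a minimal sketch; the name `pick_split` and the threshold of 0 are assumptions, not part of the package, and the sample data is copied from the first three entries of the output above so that the snippet runs without the package installed.

```python
def pick_split(candidates, threshold=0.0):
    """Return (body, head) of the best candidate, or None if its score is below threshold."""
    if not candidates:
        return None
    score, body, head = max(candidates, key=lambda c: c[0])
    return (body, head) if score >= threshold else None

# First three candidates from the split_compound output shown above.
candidates = [
    [0.7945872450631273, 'Autobahn', 'Raststätte'],
    [-0.7143290887876655, 'Auto', 'Bahnraststätte'],
    [-1.1132332878581173, 'Autobahnrast', 'Stätte'],
]
print(pick_split(candidates))  # ('Autobahn', 'Raststätte')
```

A suitable threshold depends on the data; treating a low best score as "do not split" is a design choice, not documented behaviour.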

Document splitting

From command line:

$ python doc_split.py <dict>

Reads everything from standard input and writes out the same, with the best splits separated by the middle dot character ·.

Each word is split as many times as possible based on the file <dict>, which contains German words, one per line (comment lines beginning with # are allowed).

The name of the default dictionary is in the file doc_config.py.
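For reference, a file in the dictionary format described above can be read with a few lines of Python. This is only a sketch of the format, not the loader used by doc_split; the function name `read_wordlist` is hypothetical, and skipping blank lines is an assumption.

```python
import io

def read_wordlist(lines):
    """Collect words from an iterable of lines, skipping '#' comments and blank lines."""
    words = set()
    for line in lines:
        word = line.strip()
        if not word or word.startswith('#'):
            continue
        words.add(word)
    return words

# An in-memory stand-in for a dictionary file.
sample = io.StringIO("# sample entries\nGesetz\nVerbot\n\nLaden\n")
print(sorted(read_wordlist(sample)))  # ['Gesetz', 'Laden', 'Verbot']
```

For a real file, pass an open file object, e.g. `open(path, encoding='utf-8')`.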

Note that the doc_split module retains a cache of words already split, so long documents will typically be processed proportionately faster than short ones. The cache is discarded when the program ends.

$ cat sentence1.txt
Um die in jeder Hinsicht zufriedenzustellen, tüftelt er einen Weg aus,
sinnlose Bürokratie wie Ladenschlußgesetz und Nachtbackverbot auszutricksen.  
$ python doc_split.py <sentence1.txt  
Um die in jeder Hinsicht zufriedenzustellen, tüftelt er einen Weg aus,
sinnlose Bürokratie wie Laden·schluß·gesetz und Nacht·back·verbot auszutricksen.
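The effect of the word cache mentioned above can be sketched with `functools.lru_cache`. Everything here is illustrative: `TOY_SPLITS` is a hard-coded stand-in for the real splitter, and the names `split_word` and `split_text` are assumptions rather than functions of the package.

```python
import re
from functools import lru_cache

MIDDLE_DOT = '·'

# Toy lookup table standing in for the real splitter.
TOY_SPLITS = {'Handteller': ['Hand', 'Teller'], 'Regiepult': ['Regie', 'Pult']}

@lru_cache(maxsize=None)
def split_word(word):
    """Split one word; repeated words are answered from the cache."""
    parts = TOY_SPLITS.get(word)
    if parts is None:
        return word
    # Keep the first part as written, lowercase the following parts.
    return MIDDLE_DOT.join([parts[0]] + [p.lower() for p in parts[1:]])

def split_text(text):
    """Replace every word in the text by its split form."""
    return re.sub(r'\w+', lambda m: split_word(m.group(0)), text)

print(split_text('Der Marquis schlug mit dem Handteller auf sein Regiepult.'))
# Der Marquis schlug mit dem Hand·teller auf sein Regie·pult.
```

Like the cache described above, this one lives only as long as the process; `split_word.cache_info()` shows how many lookups were served from it.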

As a module:

$ python
>>> from compound_split import doc_split
>>> # Constant containing a middle dot
>>> doc_split.MIDDLE_DOT
'·'
>>> # Split a word as much as possible, return a list
>>> doc_split.maximal_split('Verfassungsschutzpräsident')
['Verfassungs', 'Schutz', 'Präsident']
>>> # Split a word as much as possible, return a word with middle dots
>>> doc_split.doc_split('Verfassungsschutzpräsident')
'Verfassungs·schutz·präsident'
>>> # Split all splittable words in a sentence
>>> doc_split.doc_split('Der Marquis schlug mit dem Handteller auf sein Regiepult.')
'Der Marquis schlug mit dem Hand·teller auf sein Regie·pult.'

Document splitting server

Because startup takes a while, you can run the document splitter as a simple server, which makes responses quicker.

$ python doc_server [ -d ] <dict> <port>

The server will load <dict> and listen on <port>. The client must send the raw data in UTF-8 encoding to the port and then close the write side of the connection; the server then returns the split data.

The option -d causes the server to return a sorted dictionary of split words instead. Each word is on a single line, with the original word followed by a tab character followed by the split word.
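The tab-separated output of the -d option is easy to turn into a Python dictionary. The sketch below assumes exactly the line format described above; the function name `parse_split_dictionary` is hypothetical and the sample reply is made up for illustration, not captured from a real server.

```python
def parse_split_dictionary(reply):
    """Map each original word to its split form, given 'word<TAB>split' lines."""
    mapping = {}
    for line in reply.splitlines():
        if not line.strip():
            continue
        original, _, split = line.partition('\t')
        mapping[original] = split
    return mapping

# Illustrative reply in the documented format.
reply = "Handteller\tHand·teller\nRegiepult\tRegie·pult\n"
print(parse_split_dictionary(reply))
# {'Handteller': 'Hand·teller', 'Regiepult': 'Regie·pult'}
```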

Because of Python restrictions, the server is single-threaded.

The default dictionary and port are in the file doc_config.py.

A trivial client is provided:

$ python doc_client <port> <host>

Reads a document from standard input, sends it to the server running on <host> and <port>, and writes the server's output to standard output. Thus it has the same interface as doc_split (except that the dictionary cannot be specified), but should run somewhat faster.

The default host and port are in the file doc_config.py.
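If you prefer to talk to the server from your own code, the protocol described above (send UTF-8 data, close the write side, read the reply) can be sketched with the standard `socket` module. This is not the provided client; the function name `send_document` is hypothetical, and it assumes the server replies in UTF-8 and closes the connection once it has sent everything.

```python
import socket

def send_document(text, host, port):
    """Send text to a splitting server and return its reply as a string."""
    with socket.create_connection((host, port)) as sock:
        sock.sendall(text.encode('utf-8'))
        # Closing the write side tells the server that the input is complete.
        sock.shutdown(socket.SHUT_WR)
        chunks = []
        while True:
            data = sock.recv(4096)
            if not data:  # server closed the connection
                break
            chunks.append(data)
    return b''.join(chunks).decode('utf-8')

# Example call (host and port are placeholders for your own setup):
# print(send_document('Der Handteller', 'localhost', 9999))
```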

Downloading dictionaries

To download German and Dutch dictionaries for doc_split and doc_server:

$ cd dicts
$ sh getdicts

This will download the spelling plugins from the LibreOffice site, extract the wordlists, and write five files into the current directory. It also leaves a good many files in /tmp, which are no longer needed afterwards.

The dictionaries de-DE.dic, de-AT.dic, and de-CH.dic are fairly extensive (about 250,000 words each) and provide current German, Austrian, and Swiss spelling.
The file de-1901.dic provides the spelling used between 1901 and 1996.
The file misc.dic is a collection of nouns that are mis-split and are therefore included in the dictionary so that they won't be split.
The file legal.dic contains legal terms. Remove it before running getdicts if you don't want it to be included.
The file de-mixed.dic is a merger of all of the other files.
The file nl-NL.dic is from OpenOffice and provides Dutch spelling (not currently used).

You can add your own wordlists before running getdicts if you want. They must be plain UTF-8 text with one word per line, and their file names must begin with the correct language code (de for German).

If the program does not split aggressively enough for your purposes, you may want to find and use a smaller dictionary.

Only exact matches are looked up in these dictionaries, which can cause the following problem: "Beschwerden" is not split because the dictionaries contain only "Beschwerde". One solution would be to run the compound splitting only on lemmatized text, with dictionaries containing lemmatized words. TODO: implement this, or make it possible to run the splitter on a list of tokens.
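Until that is implemented, a crude workaround is to retry the lookup with common inflectional endings removed. The sketch below is not a lemmatizer and is not part of the package: the suffix list and the name `in_dictionary` are assumptions, and the heuristic will produce false matches for some words.

```python
# A few common German inflectional endings; deliberately incomplete.
SUFFIXES = ('en', 'er', 'es', 'n', 'e', 's')

def in_dictionary(word, dictionary):
    """Check the exact word first, then the word with one ending stripped."""
    if word in dictionary:
        return True
    for suffix in SUFFIXES:
        if word.endswith(suffix) and word[:-len(suffix)] in dictionary:
            return True
    return False

dictionary = {'Beschwerde'}
print(in_dictionary('Beschwerden', dictionary))  # True
print(in_dictionary('Autobahn', dictionary))     # False
```

A proper lemmatizer would be more reliable; this only shows where such a step would fit into the lookup.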

TODO: Write more documentation


Release history

1.0.2 (this version): Oct 8, 2020
1.0.2.dev1 to 1.0.2.dev4 (pre-releases): Oct 8, 2020
1.0.1: Oct 2, 2020
1.0.1.dev1 to 1.0.1.dev2 (pre-releases): Oct 2, 2020
1.0.0: Oct 2, 2020
0.1.0.dev1 to 0.1.0.dev5 (pre-releases): Oct 1, 2020

Download files


Source Distribution

compound_split-1.0.2.tar.gz (18.9 MB)

Uploaded Oct 8, 2020 Source

Built Distribution


compound_split-1.0.2-py3-none-any.whl (19.4 MB)

Uploaded Oct 8, 2020 Python 3

File details

Details for the file compound_split-1.0.2.tar.gz.

File metadata

Download URL: compound_split-1.0.2.tar.gz
Upload date: Oct 8, 2020
Size: 18.9 MB
Tags: Source
Uploaded using Trusted Publishing? No
Uploaded via: twine/3.1.1 pkginfo/1.5.0.1 requests/2.23.0 setuptools/47.1.1 requests-toolbelt/0.9.1 tqdm/4.46.1 CPython/3.7.2

File hashes

Hashes for compound_split-1.0.2.tar.gz
Algorithm	Hash digest
SHA256	`862f782e6a528c110295b33dc37094338fadf5d825c75ef2614592b3dce3a658`
MD5	`339dc5fa8eeb0ebc1e7bfff5364dfcf3`
BLAKE2b-256	`bae235961725c69e542f12917a48899cf3a4820a1d8399f4d5f90c6a59d0a48c`


File details

Details for the file compound_split-1.0.2-py3-none-any.whl.

File metadata

Download URL: compound_split-1.0.2-py3-none-any.whl
Upload date: Oct 8, 2020
Size: 19.4 MB
Tags: Python 3
Uploaded using Trusted Publishing? No
Uploaded via: twine/3.1.1 pkginfo/1.5.0.1 requests/2.23.0 setuptools/47.1.1 requests-toolbelt/0.9.1 tqdm/4.46.1 CPython/3.7.2

File hashes

Hashes for compound_split-1.0.2-py3-none-any.whl
Algorithm	Hash digest
SHA256	`68de08c7172694352e3638ec480207ec9f95ac84aad4a0051ba7199e4f195772`
MD5	`8e1830940ead9dfa088d1136d4db9ee1`
BLAKE2b-256	`c486c5b9faf27d65fad28d2218620a71046777ad54d5742e68e469a61e7c84cb`

