Compound word splitter, dictionary-based
wikdict-compound
This library splits compound words into their individual parts. It uses a large dictionary that includes inflected forms and keeps language-specific rules to a minimum in order to support a variety of languages. The dictionaries come from Wiktionary via WikDict and are licensed under Creative Commons BY-SA.
Installation
Install this library using pip:
pip install wikdict-compound
Usage
Create Required Databases
To use wikdict-compound, you need a database with the required compound splitting dictionaries. These are created from the WikDict dictionaries at https://download.wikdict.com/dictionaries/sqlite/2/. For each language you want to use:
- Download the corresponding WikDict SQLite dictionary (e.g. de.sqlite3 for German).
- Execute make_db(lang, input_path, output_path), where input_path contains the WikDict dictionary and output_path is the directory where the generated compound splitting db should be placed.
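The two steps above can be sketched in Python as follows. This is a minimal sketch, not the library's documented workflow: it assumes that each dictionary is published as <lang>.sqlite3 directly under the download URL, that make_db can be imported from the top-level wikdict_compound package, and that input_path is a directory. The helper names dictionary_url and prepare_language and the directory name wikdict_dbs are hypothetical.

```python
from pathlib import Path
from urllib.request import urlretrieve

WIKDICT_BASE_URL = "https://download.wikdict.com/dictionaries/sqlite/2/"


def dictionary_url(lang: str) -> str:
    # Assumption: dictionaries are named <lang>.sqlite3 under the base URL,
    # following the de.sqlite3 example for German.
    return f"{WIKDICT_BASE_URL}{lang}.sqlite3"


def prepare_language(lang: str, input_dir: str = "wikdict_dbs",
                     output_dir: str = "compound_dbs") -> None:
    # Imported lazily so dictionary_url() works without the library installed.
    # Assumption: make_db is exposed at the package's top level.
    from wikdict_compound import make_db

    input_path = Path(input_dir)
    output_path = Path(output_dir)
    input_path.mkdir(parents=True, exist_ok=True)
    output_path.mkdir(parents=True, exist_ok=True)

    # Step 1: download the WikDict dictionary unless it is already present.
    target = input_path / f"{lang}.sqlite3"
    if not target.exists():
        urlretrieve(dictionary_url(lang), str(target))

    # Step 2: build the compound splitting db for this language.
    # Assumption: make_db accepts path-like arguments; convert to str if not.
    make_db(lang, input_path, output_path)
```

Calling prepare_language("de") would then leave a German splitting db in compound_dbs, which is the directory passed as db_path in the splitting example.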
Split Compound Words
>>> from wikdict_compound import split_compound
>>> split_compound(db_path='compound_dbs', lang='de', compound='Bücherkiste')
Solution(parts=[
Part(written_rep='Buch', score=63.57055093514545, match='bücher'),
Part(written_rep='Kiste', score=33.89508861315521, match='kiste')
])
The returned Solution object has a parts attribute, which contains the individual word parts in the correct order. Each part comes with the matched substring of the compound and a matching score (mostly interesting when comparing different splitting possibilities for the same word).
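To illustrate how the parts attribute might be consumed, here is a small sketch. It relies only on the attribute names visible in the output above (parts, written_rep, score, match). The stand-in object built with SimpleNamespace mirrors that output so the snippet runs without a database; the helper names part_words and describe are hypothetical.

```python
from types import SimpleNamespace


def part_words(solution):
    # Dictionary forms of the parts, in the order they occur in the compound.
    return [part.written_rep for part in solution.parts]


def describe(solution):
    # One line per part: matched substring, dictionary form, and score.
    return [
        f"{part.match} -> {part.written_rep} (score {part.score:.1f})"
        for part in solution.parts
    ]


# Stand-in that mirrors the Bücherkiste output shown above.
example = SimpleNamespace(parts=[
    SimpleNamespace(written_rep="Buch", score=63.57055093514545, match="bücher"),
    SimpleNamespace(written_rep="Kiste", score=33.89508861315521, match="kiste"),
])

print(part_words(example))
for line in describe(example):
    print(line)
```

With a real database, the object returned by split_compound could be passed to the same helpers in place of the stand-in.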
Supported Languages and Splitting Quality
The results for each language are compared against compound word information from Wikidata. For each language a success range is given, where the higher value includes all compounds where a splitting could be found while the lower value only counts those where the results are the same as on Wikidata. Since some words have multiple valid splittings and the Wikidata entries are not perfect either, the true success rate should be somewhere within this range.
- de: 81.8%-97.7% success, tested over 2984 cases
- en: 69.6%-98.2% success, tested over 16061 cases
- es: 27.5%-75.6% success, tested over 1000 cases
- fi: 78.5%-96.9% success, tested over 65 cases
- fr: 15.2%-36.3% success, tested over 328 cases
- it: 18.4%-60.3% success, tested over 136 cases
- nl: 33.3%-100.0% success, tested over 3 cases
- pl: 30.9%-90.9% success, tested over 220 cases
- sv: 75.7%-97.8% success, tested over 5979 cases
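The success range described above can be computed along the following lines. This is a sketch of the arithmetic only, not the project's actual evaluation code: the function name success_range and the input format (pairs of predicted and reference parts, with None when no splitting was found) are assumptions made for illustration.

```python
def success_range(cases):
    """Return (lower, upper) success rates as fractions.

    cases: iterable of (predicted, reference) pairs, where predicted is a
    list of parts or None if no splitting was found, and reference is the
    list of parts from the reference data.
    """
    total = found = exact = 0
    for predicted, reference in cases:
        total += 1
        if predicted is None:
            continue
        found += 1  # counts towards the upper bound
        if list(predicted) == list(reference):
            exact += 1  # counts towards the lower bound
    if total == 0:
        raise ValueError("no test cases given")
    return exact / total, found / total


# Made-up cases: one exact match, one differing split, two without a result.
cases = [
    (["Buch", "Kiste"], ["Buch", "Kiste"]),
    (["a", "bc"], ["ab", "c"]),
    (None, ["x", "y"]),
    (None, ["p", "q"]),
]
lower, upper = success_range(cases)
print(f"{lower:.1%}-{upper:.1%} success, tested over {len(cases)} cases")
```

Small test sets make the range coarse: with three cases, one exact match and a splitting found for all three yields 33.3%-100.0%.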
Development
To contribute to this library, first check out the code. Then create a new virtual environment:
cd wikdict-compound
python -m venv .venv
source .venv/bin/activate
Now install the dependencies and test dependencies:
pip install -e '.[test]'
Related Resources
The approach is similar to the one described in Simple Compound Splitting for German (Weller-Di Marco, MWE 2017). I can also recommend the paper as an overview of the problems of, and approaches to, splitting German compound words.