wikdict-compound

Compound word splitter, dictionary-based

This library splits compound words into their individual parts. It uses a large dictionary, including inflected forms, and keeps the number of language-specific rules to a minimum in order to support a variety of languages. The dictionaries come from Wiktionary via WikDict and are licensed under Creative Commons BY-SA.
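The general idea of dictionary-based splitting can be sketched with a toy example. This is only an illustration with a made-up three-entry dictionary, not the library's actual algorithm, which uses full WikDict dictionaries and scoring:

```python
# Toy illustration of dictionary-based compound splitting.
# Inflected forms map back to their headword, so "bücher" yields "Buch".
DICTIONARY = {
    "bücher": "Buch",   # inflected form -> headword
    "buch": "Buch",
    "kiste": "Kiste",
}

def split(word):
    """Return a list of headwords covering the whole word, or None."""
    word = word.lower()
    if not word:
        return []
    # Try longer prefixes first, so "bücher" wins over shorter matches.
    for i in range(len(word), 0, -1):
        prefix = word[:i]
        if prefix in DICTIONARY:
            rest = split(word[i:])
            if rest is not None:
                return [DICTIONARY[prefix]] + rest
    return None

print(split("Bücherkiste"))  # ['Buch', 'Kiste']
```

A real splitter additionally has to handle linking elements (like the German "-s-") and rank competing splits, which is where the dictionary scores below come in.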

Installation

Install this library using pip:

pip install wikdict-compound

Usage

Create Required Databases

To use wikdict-compound, you need a database with the required compound splitting dictionaries. These are created from the WikDict dictionaries at https://download.wikdict.com/dictionaries/sqlite/2/. For each language you want to use:

  • Download the corresponding WikDict SQLite dictionary (e.g. de.sqlite3 for German).
  • Execute make_db(lang, input_path, output_path), where input_path contains the WikDict dictionary and output_path is the directory where the generated compound splitting database should be placed.
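Assuming the WikDict files have already been downloaded to a local directory, the setup step might look like this. The directory names and the exact import location of make_db are assumptions based on the description above:

```python
from pathlib import Path

input_path = Path("wikdict_dbs")     # contains e.g. de.sqlite3 (assumed layout)
output_path = Path("compound_dbs")   # generated splitting DBs go here
output_path.mkdir(exist_ok=True)

try:
    from wikdict_compound import make_db  # import location is an assumption
    for lang in ["de", "en", "sv"]:
        # Only build databases for languages whose dictionary is present.
        if (input_path / f"{lang}.sqlite3").exists():
            make_db(lang, input_path, output_path)
except ImportError:
    print("wikdict-compound is not installed")
```

This only needs to be done once per language; afterwards the generated databases in output_path can be reused for all split_compound calls.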

Split Compound Words

>>> from wikdict_compound import split_compound
>>> split_compound(db_path='compound_dbs', lang='de', compound='Bücherkiste')
Solution(parts=[
    Part(written_rep='Buch', score=63.57055093514545, match='bücher'),
    Part(written_rep='Kiste', score=33.89508861315521, match='kiste')
])

The returned solution object has a parts attribute, which contains the separate word parts in the correct order, along with the matched word part and a matching score (mostly interesting when comparing different splitting possibilities for the same word).
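Since the scores matter mainly for comparing alternatives, here is a self-contained illustration of ranking candidate splittings by their summed part scores. It uses stand-in tuples with made-up numbers, not the library's actual types or scoring:

```python
from collections import namedtuple

# Stand-in for the library's Part objects (illustration only).
Part = namedtuple("Part", ["written_rep", "score", "match"])

# Two hypothetical candidate splittings of the same word.
candidates = [
    [Part("Buch", 63.6, "bücher"), Part("Kiste", 33.9, "kiste")],
    [Part("Bü", 1.2, "bü"), Part("Cher", 0.8, "cher"), Part("Kiste", 33.9, "kiste")],
]

# Prefer the candidate whose parts have the highest total score.
best = max(candidates, key=lambda parts: sum(p.score for p in parts))
print([p.written_rep for p in best])  # ['Buch', 'Kiste']
```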

Supported Languages and Splitting Quality

The results for each language are compared against compound word information from Wikidata. For each language, a success range is given: the higher value includes all compounds for which any splitting could be found, while the lower value counts only those whose results match Wikidata exactly. Since some words have multiple valid splittings and the Wikidata entries are not perfect either, the true success rate should lie somewhere within this range.

  • de: 81.8%-97.7% success, tested over 2984 cases
  • en: 69.6%-98.2% success, tested over 16061 cases
  • es: 27.5%-75.6% success, tested over 1000 cases
  • fi: 78.5%-96.9% success, tested over 65 cases
  • fr: 15.2%-36.3% success, tested over 328 cases
  • it: 18.4%-60.3% success, tested over 136 cases
  • nl: 33.3%-100.0% success, tested over 3 cases
  • pl: 30.9%-90.9% success, tested over 220 cases
  • sv: 75.7%-97.8% success, tested over 5979 cases

Development

To contribute to this library, first check out the code. Then create a new virtual environment:

cd wikdict-compound
python -m venv .venv
source .venv/bin/activate

Now install the dependencies and test dependencies:

pip install -e '.[test]'

Related Resources

The approach is similar to the one described in Simple Compound Splitting for German (Weller-Di Marco, MWE 2017). I can also recommend the paper as an overview of the problems and approaches in splitting German compound words.
