The hyphenation library of LibreOffice and Firefox wrapped for Python

(c) 2008-2021 PyHyphen developers

Contact: fhaxbox66@gmail.com

Project home: https://github.com/dr-leo/PyHyphen

Mailing list: https://groups.google.com/group/pyhyphen

0. Quickstart

With Python 3.7 or higher and a current version of pip, issue:

$ pip install pyhyphen
$ python
>>> from hyphen import Hyphenator
>>> # Download and install the hyphenation dict for German, if needed
>>> h = Hyphenator('de_DE')  # `language` defaults to 'en_US'
>>> s='Politikverdrossenheit'
>>> h.pairs(s)
[['Po', 'litikverdrossenheit'],
['Poli', 'tikverdrossenheit'],
['Politik', 'verdrossenheit'],
['Politikver', 'drossenheit'],
['Politikverdros', 'senheit'],
['Politikverdrossen', 'heit']]
>>> h.syllables(s)
['Po', 'li', 'tik', 'ver', 'dros', 'sen', 'heit']
>>> h.wrap(s, 5)
['Poli-', 'tikverdrossenheit']
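
The three results above are closely related, and the relationship can be sketched in plain Python. The snippet below is only an illustration, not PyHyphen's implementation: the helper names pairs_from_syllables and wrap_from_pairs are hypothetical, returning an empty list when nothing fits is an assumption, and the sketch ignores hyphenation with replacements, where the parts of a word are not simple slices of it.

```python
def pairs_from_syllables(syllables):
    """Return every [head, tail] split that falls on a syllable boundary."""
    return [
        ["".join(syllables[:i]), "".join(syllables[i:])]
        for i in range(1, len(syllables))
    ]


def wrap_from_pairs(pairs, width, hyphen="-"):
    """Pick the longest head that still fits in `width` with the hyphen appended.

    Returning [] when no head fits is an assumption made for this sketch.
    """
    fitting = [pair for pair in pairs if len(pair[0]) + len(hyphen) <= width]
    if not fitting:
        return []
    head, tail = max(fitting, key=lambda pair: len(pair[0]))
    return [head + hyphen, tail]


# Syllables copied from the quickstart session above.
syllables = ['Po', 'li', 'tik', 'ver', 'dros', 'sen', 'heit']
pairs = pairs_from_syllables(syllables)
print(pairs[2])                   # ['Politik', 'verdrossenheit']
print(wrap_from_pairs(pairs, 5))  # ['Poli-', 'tikverdrossenheit']
```

With a width of 5, 'Poli-' is the longest head that fits once the hyphen is counted, which matches the output of h.wrap(s, 5) shown above.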

1. Overview

PyHyphen is a Pythonic interface to the hyphenation library used in projects such as LibreOffice and the Mozilla suite. It comes with tools to download, install and uninstall hyphenation dictionaries from LibreOffice’s Git repository. PyHyphen provides the hyphen package. Its submodule hyphen.textwrap2 is a modified version of the familiar textwrap module that wraps text to a specified width, hyphenating words where needed. See the code examples below.

PyHyphen supports Python 3.7 or higher.

1.1 Content of the hyphen package

The ‘hyphen’ package contains the following:

  • the class hyphen.Hyphenator: each instance can hyphenate and wrap words using a dictionary compatible with the hyphenation feature of LibreOffice and Mozilla. Required dictionaries are downloaded automatically at runtime if they are not already installed.

  • the module dictools contains functions for downloading and installing dictionaries from a configurable repository. After installation of PyHyphen, the LibreOffice repository is used by default. Dictionaries are stored in the platform-specific user’s app directory.

  • ‘hyphen.hnj’ is the C extension module that does all the groundwork. It contains the high-quality C library libhyphen, which supports hyphenation with replacements as well as compound words.

1.2 The module ‘textwrap2’

This module is an enhanced, though backwards-compatible, version of the module ‘textwrap’ from the Python standard library. It adds hyphenation functionality to ‘textwrap’. To this end, a new keyword parameter ‘use_hyphenator’, which defaults to None, has been added to the __init__ method of the TextWrapper class. It can be set to any hyphenator object.
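
To illustrate the idea of hyphenation-aware wrapping, here is a rough, self-contained sketch. It is not the code of textwrap2: FakeHyphenator and fill_with_hyphenation are hypothetical names, the split points are hard-coded so that no dictionary is needed, and the only thing assumed about a hyphenator object is a wrap(word, width) method that behaves like the one in the Hyphenator examples.

```python
class FakeHyphenator:
    """Stand-in with a wrap(word, width) method; split points are hard-coded."""

    SPLITS = {"beautiful": ["beau", "ti", "ful"]}

    def wrap(self, word, width):
        parts = self.SPLITS.get(word, [word])
        # Try the longest head first; the hyphen counts towards the width.
        for i in range(len(parts) - 1, 0, -1):
            head = "".join(parts[:i]) + "-"
            if len(head) <= width:
                return [head, "".join(parts[i:])]
        return []  # assumption: no split possible within `width`


def fill_with_hyphenation(text, width, hyphenator=None):
    """Greedy line filling; breaks a word only if the hyphenator offers a split."""
    lines, current = [], ""
    for word in text.split():
        candidate = word if not current else current + " " + word
        if len(candidate) <= width:
            current = candidate
            continue
        if hyphenator is not None:
            # Room left on the current line, minus the separating space.
            room = width - len(current) - (1 if current else 0)
            split = hyphenator.wrap(word, room)
            if split:
                head, word = split
                current = head if not current else current + " " + head
        if current:
            lines.append(current)
        current = word
    if current:
        lines.append(current)
    return "\n".join(lines)


text = "a truly beautiful day"
plain = fill_with_hyphenation(text, 13)
hyphenated = fill_with_hyphenation(text, 13, hyphenator=FakeHyphenator())
print(plain)
print(hyphenated)
```

Without a hyphenator the word ‘beautiful’ moves to the next line as a whole; with one, its head ‘beau-’ fills the remaining room on the first line. The sketch does not re-split a remainder that is itself longer than the width.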

2. Code examples

>>> from hyphen import Hyphenator
>>> # Create some hyphenators
>>> h_de = Hyphenator('de_DE')
>>> h_en = Hyphenator('en_US')
>>> # Now hyphenate some words
>>> h_en.pairs('beautiful')
[['beau', 'tiful'], ['beauti', 'ful']]
>>> h_en.wrap('beautiful', 6)
['beau-', 'tiful']
>>> h_en.wrap('beautiful', 7)
['beauti-', 'ful']
>>> h_en.syllables('beautiful')
['beau', 'ti', 'ful']
>>> from hyphen.textwrap2 import fill
>>> print(fill('very long text...', width=40, use_hyphenator=h_en))

Simply creating a Hyphenator object for a language automatically downloads the corresponding dictionary if it is not yet installed. For the HTTP connection to the LibreOffice server, PyHyphen uses the familiar requests library (https://www.python-requests.org). Requests are fully configurable to handle proxies etc. Alternatively, dictionaries may be manually installed and listed with the dictools module:

>>> from hyphen.dictools import install, list_installed
>>> # Download and install some dictionaries in the default directory using the default
>>> # repository, usually the LibreOffice website
>>> for lang in ['de_DE', 'en_US']:
...     install(lang)  # provide kwargs to configure the HTTP request
...
>>> # Show locales of installed dictionaries
>>> list_installed()
['de', 'de_DE', 'en_PH', 'en_US']

3. Installation

PyHyphen is pip-installable from PyPI. In most scenarios the easiest way to install PyHyphen is to type from the shell prompt:

$ pip install pyhyphen

Besides the source distribution, there is a wheel on PyPI for Windows. As the C extension uses the limited C API, the wheel should work on all Python versions >= 3.7.

Building PyHyphen from source under Linux or macOS should be straightforward. On Windows, the wheel is installed by default, so no C compiler is needed.

4. Managing dictionaries

The dictools module contains a non-exhaustive list of available language strings that can be used to instantiate Hyphenator objects as shown above:

>>> from hyphen import dictools
>>> dictools.LANGUAGES
['af_ZA', 'an_ES', 'ar', 'be_BY', 'bg_BG', 'bn_BD', 'br_FR', 'ca', 'cs_CZ',
'da_DK', 'de', 'el_GR', 'en', 'es_ES', 'et_EE', 'fr_FR', 'gd_GB', 'gl', 'gu_IN',
'he_IL', 'hi_IN', 'hr_HR', 'hu_HU', 'it_IT', 'ku_TR', 'lt_LT', 'lv_LV', 'ne_NP',
'nl_NL', 'no', 'oc_FR', 'pl_PL', 'prj', 'pt_BR', 'pt_PT', 'ro', 'ru_RU',
'si_LK', 'sk_SK', 'sl_SI', 'sr', 'sv_SE', 'sw_TZ', 'te_IN', 'th_TH', 'uk_UA',
'zu_ZA']
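
Note that the list mixes full locales such as 'fr_FR' with bare language codes such as 'de'. The following sketch shows one way to look up a locale in such a list by falling back to its language part. It is an illustration only: listed_as is a hypothetical helper, not part of dictools, and LANGUAGES_SAMPLE is a hand-picked subset of the list above. Because the list is non-exhaustive, a missing entry does not prove that no dictionary exists.

```python
# Hand-picked subset of the dictools.LANGUAGES output shown above.
LANGUAGES_SAMPLE = ['af_ZA', 'ca', 'de', 'en', 'fr_FR', 'pt_BR', 'pt_PT']


def listed_as(locale, languages):
    """Return the list entry that covers `locale`, or None if there is none.

    Hypothetical helper: tries the full locale first, then the bare
    language code before the underscore.
    """
    if locale in languages:
        return locale
    language = locale.split("_", 1)[0]
    if language in languages:
        return language
    return None


print(listed_as('fr_FR', LANGUAGES_SAMPLE))  # fr_FR
print(listed_as('de_DE', LANGUAGES_SAMPLE))  # de
print(listed_as('xx_YY', LANGUAGES_SAMPLE))  # None
```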

The downloaded dictionary files are stored in a local data folder, along with a dictionaries.json file that lists the downloaded files and the associated locales:

$ ls ~/.local/share/pyhyphen
dictionaries.json  hyph_de_DE.dic  hyph_en_US.dic

$ cat ~/.local/share/pyhyphen/dictionaries.json
{
  "de": {
    "file": "hyph_de_DE.dic",
    "url": "http://cgit.freedesktop.org/libreoffice/dictionaries/plain/de/hyph_de_DE.dic"
  },
  "de_DE": {
    "file": "hyph_de_DE.dic",
    "url": "http://cgit.freedesktop.org/libreoffice/dictionaries/plain/de/hyph_de_DE.dic"
  },
  "en_PH": {
    "file": "hyph_en_US.dic",
    "url": "http://cgit.freedesktop.org/libreoffice/dictionaries/plain/en/hyph_en_US.dic"
  },
  "en_US": {
    "file": "hyph_en_US.dic",
    "url": "http://cgit.freedesktop.org/libreoffice/dictionaries/plain/en/hyph_en_US.dic"
  }
}

Each entry of the dictionaries.json file is keyed by locale and contains both the file name of the dictionary and the URL from which it was downloaded. Several locales can point to the same dictionary file.
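
Since the registry is plain JSON, it can be inspected with the standard library. The sketch below groups locales by the dictionary file they share. It parses an inline copy of the listing above so that it runs anywhere; reading the real file from the data folder instead is left to the reader, and the variable names are chosen for this example only.

```python
import json
from collections import defaultdict

# Inline copy of the dictionaries.json listing shown above.
REGISTRY_JSON = """
{
  "de": {
    "file": "hyph_de_DE.dic",
    "url": "http://cgit.freedesktop.org/libreoffice/dictionaries/plain/de/hyph_de_DE.dic"
  },
  "de_DE": {
    "file": "hyph_de_DE.dic",
    "url": "http://cgit.freedesktop.org/libreoffice/dictionaries/plain/de/hyph_de_DE.dic"
  },
  "en_PH": {
    "file": "hyph_en_US.dic",
    "url": "http://cgit.freedesktop.org/libreoffice/dictionaries/plain/en/hyph_en_US.dic"
  },
  "en_US": {
    "file": "hyph_en_US.dic",
    "url": "http://cgit.freedesktop.org/libreoffice/dictionaries/plain/en/hyph_en_US.dic"
  }
}
"""

registry = json.loads(REGISTRY_JSON)

# Map each dictionary file name to the sorted locales that use it.
locales_by_file = defaultdict(list)
for locale, entry in registry.items():
    locales_by_file[entry["file"]].append(locale)
locales_by_file = {name: sorted(locs) for name, locs in locales_by_file.items()}

for filename, locales in sorted(locales_by_file.items()):
    print(filename, locales)
```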

5. Contributing and reporting bugs

Questions can be asked in the Google group (https://groups.google.com/group/pyhyphen), or you can simply send an e-mail to the authors.

Browse or fork the repository and report bugs at PyHyphen’s project site on GitHub.

Before submitting a PR, run the unit tests:
$ python -m unittest

6. License

Without prejudice to third party licenses, PyHyphen is distributed under the Apache 2.0 license. PyHyphen ships with third party code including the hyphenation library hyphen.c and a patched version of the Python standard module textwrap.

7. Changelog

New in version 4.0.0 (2021-02-15):

This is a big release. The entire code-base has been overhauled. A cross-Py-version wheel for Windows and the use of the excellent requests package for HTTP connections are but some of the highlights.

  • hyphen.Hyphenator:

    • support of hyphenation of upper-cased words as in version 2.x

    • better error-handling

    • human-friendly str representation of Hyphenator objects

  • Builds:

    • single-source package version (requires setuptools >= 47.0)

    • CI: move to GitHub Actions. Build ABI3-compatible wheel for Windows

  • C extension:

    • partial rewrite to support the limited API (PEP 384)

    • multi-phase initialization of the module

    • upgrade hyphen.c from hunspell

    • clean-ups

  • hyphen.dictools:

    • use requests instead of urllib for HTTP connections

    • make HTTP connections configurable through kwargs passed to requests.get

    • improve error-handling

    • fix URL generation in some cases

    • clean-ups

  • make textwrap2 a submodule of hyphen

  • remove wraptext script

New in Version 3.0.1:

Fix source distribution which did not include C header files.

New in Version 3.0.0:

  • lazy dictionary install at runtime

  • switch to user-specific data directory for storing dictionaries

  • unit tests

  • migration from distutils to setuptools and simplified setup

  • get rid of config module and config scripts

  • upgrade textwrap2 to latest python2 and python3 versions; add CLI script to wrap text files with hyphenation

  • improve detection of dictionary location

  • Remove Windows binaries from the source distribution. Provide wheels instead thanks to the awesome cibuildwheel tool.

New in Version 2.0.9:

  • add support for Python 3.6

New in Version 2.0.8:

  • fix python 3 install

  • fix install from source

New in Version 2.0.7:

  • add win binary for AMD64, win27

  • make it pip-installable (PR1)

  • minor fixes

New in Version 2.0.5:

  • remove pre-compiled win32 C extension for Python 2.6, add one for Python 3.4

  • avoid unicode error in config.py while installing on some Windows systems

New in Version 2.0.4:

  • Update C library to v2.8.6

New in Version 2.0.2:

  • minor bugfixes and refactorings

New in Version 2.0.1:

  • updated URL for LibreOffice’s dictionaries

  • no longer attempt to hyphenate uppercased words such as ‘LONDON’. This feature had to be dropped to work around a likely bug in the C extension which, under Python 3.3, caused the hyphenator to return words starting with a capital letter as lowercase.

New in Version 2.0

The hyphen.dictools module has been completely rewritten. This was required by the switch from OpenOffice to LibreOffice, which no longer supports the old formats for dictionaries and metadata. These changes made it impossible to release a stable v1.0. The new dictionary management is more flexible and powerful. There is now a registry for locally installed hyphenation dictionaries. Each dictionary can have its own file path. It is thus possible to add persistent metadata on pre-existing hyphenation dictionaries, e.g. from a LibreOffice installation. Each dictionary, and hence each Hyphenator, can now be associated with multiple locales such as ‘en_US’ and ‘en_NZ’. These changes cause some backwards-incompatible API changes. Further changes are:

  • Hyphenator.info is of a container type for ‘url’, ‘locales’ and ‘filepath’ of the dictionary.

  • the Hyphenator.language attribute deprecated in v1.0 has been removed

  • download and install dictionaries from LibreOffice’s git repository by default

  • dictools.install(‘xx_YY’) will install all dictionaries found for the ‘xx’ language and associate them with all relevant locales as described in the dictionaries.xcu file in LibreOffice’s git repository.

  • upgraded the C library libhyphen to v2.8.3

  • use lib2to3 instead of separate code bases

  • dropped support for Python 2.4 and 2.5

  • support Python 3.3

New in version 1.0

  • Upgraded the C library libhyphen to v2.7 which brings significant improvements, most notably correct treatment of already hyphenated words such as ‘Python-powered’

  • use a CSV file from the oo website with meta information on dictionaries for installation of dictionaries and instantiation of hyphenators. Apps can access the metadata on all downloadable dicts through the new module-level attribute hyphen.dict_info or for each hyphenator through the ‘info’ attribute.

  • Hyphenator objects have an ‘info’ attribute which is a Python dictionary with meta information on the hyphenation dictionary. The ‘language’ attribute is deprecated. Note: These new features add complexity to the installation process as the metadata and dictionary files are downloaded at install time. These features have to be tested in various environments before declaring the package stable.

  • Streamlined the installation process

  • The en_US hyphenation dictionary has been removed from the package. Instead, the dictionaries for en_US and the local language are automatically downloaded at install time.

  • restructured the package and merged 2.x and 3.x setup files

  • switch from svn to hg

  • added win32 binary of the C extension module for Python32, currently no binaries for Python 2.4 and 2.5

New in version 0.10

  • added win32 binary for Python 2.7

  • renamed the ‘hyphenator’ class to the more conventional ‘Hyphenator’. ‘hyphenator’ is deprecated.
