Python package containing various utilities relevant in the field of digital humanities.

Digital Humanities Utilities

Python 3.6+ package containing various utilities relevant in the field of digital humanities.

$ pip install dh-utils

Unicode utilities

Convert Greek beta code to unicode:

>>> from dh_utils import unicode as u
>>> u.beta2uni('lo/gos')
'λόγος'

This is a wrapper around the CLTK converter. We also used this converter to create the inverse:

>>> u.uni2beta('λόγος')
'lo/gos'

Decompose any unicode string:

>>> u.decompose('λόγος')
λ U+03bb GREEK SMALL LETTER LAMDA
ο U+03bf GREEK SMALL LETTER OMICRON
́ U+0301 COMBINING ACUTE ACCENT
γ U+03b3 GREEK SMALL LETTER GAMMA
ο U+03bf GREEK SMALL LETTER OMICRON
ς U+03c2 GREEK SMALL LETTER FINAL SIGMA
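
The idea behind such a decomposition can be sketched with the standard library alone. This is a minimal illustration, not the package's implementation: the helper name `decompose_lines` is hypothetical, and it assumes the output above corresponds to canonical decomposition (NFD) followed by a lookup of each code point's Unicode name.

```python
import unicodedata


def decompose_lines(text):
    """Return one 'char U+xxxx NAME' line per code point of the NFD form.

    Hypothetical helper for illustration; not part of dh_utils.
    """
    lines = []
    # NFD splits precomposed characters into base letter + combining marks.
    for char in unicodedata.normalize("NFD", text):
        name = unicodedata.name(char, "<unnamed>")
        lines.append(f"{char} U+{ord(char):04x} {name}")
    return lines


for line in decompose_lines("\u03bb\u03cc\u03b3\u03bf\u03c2"):  # 'λόγος'
    print(line)
```

The accented omicron is a single precomposed code point in the input, which is why the decomposed form has six entries for a five-letter word.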

TEI utilities

Convert markdown to TEI

A basic converter from markdown to TEI has been added. It converts a markdown file like:

Some paragraph block

> A blockquote

1. An
2. Ordered
3. List

Another paragraph block with _italics_ and __bold__, and:

* An
* Unordered
* List

using a snippet like

>>> from dh_utils import tei as t
>>> with open('file.md') as f:
...     t.md2tei(f.read())

to the following TEI XML:

<p>Some paragraph block</p>
<quote>
  <p>A blockquote</p>
</quote>
<list rend="numbered">
  <item>An</item>
  <item>Ordered</item>
  <item>List</item>
</list>
<p>Another paragraph block with <hi rend="italic">italics</hi> and <hi rend="bold">bold</hi>, and:</p>
<list rend="bulleted">
  <item>An</item>
  <item>Unordered</item>
  <item>List</item>
</list>

The function md2tei is syntactic sugar for the markdown extension ToTEI, which can be used in combination with other extensions as follows:

>>> from markdown import markdown
>>> from dh_utils.tei import ToTEI
>>> markdown('some text', extensions=[ToTEI()]) # Other extensions can be added to this list

The extension ToTEI in turn consists solely of the postprocessor TEIPostprocessor, which converts the generated HTML to TEI XML. It has priority 0, which in most cases means that it will run after all other postprocessors have finished. If any other behaviour or prioritization is required, this postprocessor can also be imported directly and used in a custom markdown extension.
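
A markdown postprocessor works on the serialized output as a plain string, so the kind of rewriting involved can be sketched without the markdown library at all. The following is an illustration under stated assumptions, not the code of TEIPostprocessor: the names `TAG_MAP` and `html_to_tei` are hypothetical, the mapping is inferred from the example output above, and only attribute-less tags are handled.

```python
import re

# Hypothetical mapping from HTML tag names to (TEI tag, attribute string),
# inferred from the example output in this README.
TAG_MAP = {
    "blockquote": ("quote", ""),
    "ol": ("list", ' rend="numbered"'),
    "ul": ("list", ' rend="bulleted"'),
    "li": ("item", ""),
    "em": ("hi", ' rend="italic"'),
    "strong": ("hi", ' rend="bold"'),
}

# Matches opening and closing tags without attributes, e.g. <em> or </em>.
TAG_RE = re.compile(r"<(/?)(\w+)>")


def html_to_tei(html):
    """Rewrite known HTML tags to their TEI counterparts; leave others as-is."""

    def replace(match):
        closing, name = match.groups()
        if name not in TAG_MAP:
            return match.group(0)  # e.g. <p> is the same in both vocabularies
        tei_name, attrs = TAG_MAP[name]
        return f"</{tei_name}>" if closing else f"<{tei_name}{attrs}>"

    return TAG_RE.sub(replace, html)


print(html_to_tei("<p>Some <em>italics</em> and <strong>bold</strong></p>"))
```

Because a step like this only sees the final string, running it last (as the low priority suggests) means it also picks up any HTML emitted by other extensions.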

Tag languages

Tag languages in a given string based on its script:

>>> t.tag('A line containing the Hebrew אגוז מלך inline', 'Hebr')
'A line containing the Hebrew <foreign xml:lang="he-Hebr">אגוז מלך</foreign> inline'
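
How script-based tagging of this kind might work can be sketched with a regular expression over a Unicode block. This is a simplified illustration, not the package's implementation: the names `HEBREW_RUN` and `tag_hebrew` are hypothetical, and the sketch assumes that a "run" is a sequence of characters from the Hebrew block (U+0590 to U+05FF) joined by whitespace.

```python
import re

# One or more Hebrew-block words, allowing whitespace between them so that
# a multi-word phrase is wrapped in a single element.
HEBREW_RUN = re.compile(r"[\u0590-\u05FF]+(?:\s+[\u0590-\u05FF]+)*")


def tag_hebrew(text, language_code="he-Hebr"):
    """Wrap each run of Hebrew-script text in a TEI <foreign> element.

    Hypothetical helper for illustration; not part of dh_utils.
    """

    def wrap(match):
        return f'<foreign xml:lang="{language_code}">{match.group(0)}</foreign>'

    return HEBREW_RUN.sub(wrap, text)


print(tag_hebrew("A line with \u05d0\u05d2\u05d5\u05d6 \u05de\u05dc\u05da inline"))
```

The whitespace following the last Hebrew word is left outside the element, because the pattern only consumes whitespace when more Hebrew text follows it.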

It is also possible to tag a given language based on its script in a TEI XML document (NB: file will be overwritten!):

>>> t.tag_xml('path/to/file.xml', 'Arab')

The available scripts are stored in AVAILABLE_SCRIPTS and are enumerated below:

>>> t.AVAILABLE_SCRIPTS
['Arab', 'Copt', 'Hebr', 'Latn', 'Cyrl']

By default, each script is tagged with a default language-script code (stored in DEFAULT_LCS). This can be overridden using the language_code keyword argument:

>>> t.tag_xml('path/to/file.xml', 'Cyrl', language_code='ov-Cyrs')

Refsdecl generator

The refsdecl generator creates refsdecl elements as etree XML elements:

from dh_utils.tei import refsdecl_generator

refs_decl = refsdecl_generator.generate_for_file("./path/to/file")
refs_decls = refsdecl_generator.generate_for_path("./path/to/files")

It can also be used through the command line interface:

python -m dh_utils.tei.refsdecl_generator [--update] [PATH]

By default, it does not update the file but outputs the refsdecl xml to the terminal. If the --update flag is given, the file is updated with the generated refsdecl.
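
For orientation, the shape of a refsdecl element can be sketched with the standard library. This is an illustration under stated assumptions and says nothing about how the generator derives its output: the helper `build_refsdecl` is hypothetical, and the layout (a refsDecl with n="CTS" containing cRefPattern children, deepest citation level first) follows the CTS-style convention used by CapiTainS-compatible files.

```python
import xml.etree.ElementTree as ET


def build_refsdecl(levels):
    """Build a refsDecl element from (name, xpath_step) pairs, outermost first.

    Hypothetical helper for illustration; not part of dh_utils.
    """
    refs_decl = ET.Element("refsDecl", {"n": "CTS"})
    patterns = []
    xpath = "/tei:TEI/tei:text/tei:body"
    for depth, (name, step) in enumerate(levels, start=1):
        # Each citation level adds one XPath step keyed on its @n attribute.
        xpath += f"/{step}[@n='${depth}']"
        # A reference such as "1.12" has one group per level, joined by dots.
        match_pattern = r"\.".join([r"(\w+)"] * depth)
        patterns.append((name, match_pattern, f"#xpath({xpath})"))
    # The deepest level is listed first.
    for name, match_pattern, replacement in reversed(patterns):
        ET.SubElement(
            refs_decl,
            "cRefPattern",
            {
                "n": name,
                "matchPattern": match_pattern,
                "replacementPattern": replacement,
            },
        )
    return refs_decl


element = build_refsdecl([("book", "tei:div"), ("line", "tei:l")])
print(ET.tostring(element, encoding="unicode"))
```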

MyCapytain-compatible critical apparatus

The Python API MyCapytain serves only the main text of a CTS-structured text version and does not support stand-off annotation, bibliographies, critical apparatuses, etc. To address the last limitation, we have developed a script that generates a separate text version of the critical apparatus, which can be served through MyCapytain. Brill's Scholarly Editions uses these separate text versions, which can be displayed in parallel.

The following snippet creates such a critical apparatus file from textgroup.work.edition-extension.xml, located in path/to/data/textgroup/work, and saves it as textgroup.work.edition-appcrit1.xml:

>>> import crit_app as ca
>>> data_dir = "path/to/data/textgroup/work"
>>> filename = "textgroup.work.edition-extension.xml"
>>> ca_ext = "appcrit1" # Or any other extension
>>> ca.create(filename, ca_ext, data_dir)
