Parse docx file containing article digest content

Project description

digest-parser

Parse docx file containing digest content and produce output in other formats.

The contents of the .docx must follow a specific formatting scheme for it to be understood; each section of content is prefaced by a bold formatted title, such as DIGEST TITLE, and the content below it is used to populate the section of the output.
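To make the formatting scheme concrete, the following is a minimal sketch of how content with bold section titles could be grouped into sections. It is not digestparser's own implementation: it assumes the content has already been extracted into lines where a title appears as `<b>TITLE</b>`, similar to the output printed in Example 1, and the function name `split_sections` is hypothetical.

```python
import re
from collections import OrderedDict

# Matches a line consisting only of a bold title, e.g. "<b>DIGEST TITLE</b>"
TITLE_PATTERN = re.compile(r"^<b>(.+?)</b>$")


def split_sections(content):
    """Group lines of content under the bold title that precedes them."""
    sections = OrderedDict()
    current_title = None
    for line in content.splitlines():
        line = line.strip()
        if not line:
            continue
        match = TITLE_PATTERN.match(line)
        if match:
            # A bold title starts a new section
            current_title = match.group(1)
            sections[current_title] = []
        elif current_title is not None:
            # Any other line belongs to the most recent title
            sections[current_title].append(line)
    return sections


# Made-up sample content for illustration
sample = "<b>AUTHOR</b>\nAnonymous\n<b>DIGEST TITLE</b>\nFishing for errors in the tests"
sections = split_sections(sample)
print(sections["AUTHOR"])
```

Lines that appear before the first bold title are ignored in this sketch, which mirrors the idea that content is only meaningful once it follows a recognised title.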

There are four types of output that can be produced after parsing a .docx file:

  1. DOCX output,
  2. JATS XML output,
  3. JSON output, compatible with the eLife API schema, or
  4. Medium output, which can be used to create a new post on the Medium service through its API; some values can optionally be overwritten if a JATS XML research article file is supplied.

Optionally, a .zip file can contain the .docx file along with an optional graphic image file. The image caption can be included in the .docx and will be added to the Image object.
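As an illustration of the .zip layout described above, here is a minimal sketch using only the standard library that looks for a .docx file and an image file inside an archive. The helper name `find_zip_members`, the list of image extensions, and the image file name are assumptions made for this example; digestparser's own handling may differ.

```python
import io
import zipfile

# Image extensions treated as graphics here are an assumption for illustration
IMAGE_EXTENSIONS = (".jpg", ".jpeg", ".png", ".gif", ".tif", ".tiff")


def find_zip_members(zip_file):
    """Return (docx_name, image_name) found in the zip; either may be None."""
    docx_name = None
    image_name = None
    with zipfile.ZipFile(zip_file) as archive:
        for name in archive.namelist():
            lower_name = name.lower()
            if docx_name is None and lower_name.endswith(".docx"):
                docx_name = name
            elif image_name is None and lower_name.endswith(IMAGE_EXTENSIONS):
                image_name = name
    return docx_name, image_name


# Build a small in-memory zip with placeholder files to demonstrate
buffer = io.BytesIO()
with zipfile.ZipFile(buffer, "w") as archive:
    archive.writestr("DIGEST 99999.docx", b"placeholder")
    archive.writestr("IMAGE 99999.jpeg", b"placeholder")
buffer.seek(0)
print(find_zip_members(buffer))
```

Because the image is optional, the second value is simply `None` when the archive holds only a .docx file.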

Requirements

Parsing .docx files depends on the python-docx Python library, as defined in the installation requirements files.

Configuration

The digest.cfg configuration file provided in this repository can be changed to produce slightly different output, depending on the situation. It includes settings to change the Medium post content, the .docx output file name, and the IIIF image server URL paths.
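The idea of per-situation settings can be sketched with the standard library `configparser` module. The section and option names below (`output_file_name`, `iiif_image_url`) and the URL are hypothetical and are not taken from the real digest.cfg; consult the file in the repository for the actual option names.

```python
import configparser

# Hypothetical settings: the section and option names are illustrative only
SAMPLE_CONFIG = """
[DEFAULT]
output_file_name = digest.docx

[elife]
output_file_name = elife_digest.docx
iiif_image_url = https://iiif.example.org/digests/
"""

parser = configparser.ConfigParser(interpolation=None)
parser.read_string(SAMPLE_CONFIG)

# Options in a named section override those in [DEFAULT]
settings = dict(parser["elife"])
print(settings["output_file_name"])
```

Keeping shared values in a default section and overriding them in a named section is one common way to support slightly different output per situation.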

Example usage

This library is meant to be integrated into another operational system; however, the following examples use interactive Python:

Example 1 - Simple conversion of a .docx to JATS XML

>>> from digestparser import parse
>>> content = parse.parse_content("tests/test_data/DIGEST 99999.docx")
>>> print(content)
<b>AUTHOR</b>
Anonymous
<b>DIGEST TITLE</b>

Example 2 - Parse a .docx into a Digest object and then output JSON

>>> from digestparser import build
>>> from digestparser import json_output
>>> from digestparser.conf import raw_config, parse_raw_config
>>> digest = build.build_digest("tests/test_data/DIGEST 99999.zip")
>>> digest_config = parse_raw_config(raw_config("elife"))
>>> print(json_output.digest_json(digest, digest_config))
OrderedDict([('id', 'None'), ('title', 'Fishing for errors in the\xa0tests'), ('impactStatement', ...
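The value printed in Example 2 is an OrderedDict rather than a JSON string. A minimal sketch of serialising such a value with the standard library `json` module follows; `digest_data` is a stand-in built from the first two keys of the truncated output above, not the full return value.

```python
import json
from collections import OrderedDict

# Stand-in for the value returned by json_output.digest_json() in Example 2;
# only the first two keys shown in the truncated output are reproduced here
digest_data = OrderedDict(
    [("id", "None"), ("title", "Fishing for errors in the\xa0tests")]
)

# ensure_ascii=False keeps the non-breaking space as a literal character
json_string = json.dumps(digest_data, indent=4, ensure_ascii=False)
print(json_string)
```

The key order of the OrderedDict is preserved in the serialised string.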

Example 3 - Parse a .zip and then output Medium post content

>>> from digestparser import medium_post
>>> from digestparser.conf import raw_config, parse_raw_config
>>> digest_config = parse_raw_config(raw_config("elife"))
>>> print(medium_post.build_medium_content("tests/test_data/DIGEST 99999.zip", digest_config=digest_config))
OrderedDict([('title', 'Fishing for errors in the\xa0tests'), ('contentFormat', 'html'), ...

License

Licensed under MIT.
