Skip to main content

Parse docx file containing article digest content

Project description

digest-parser

Parse docx file containing digest content and produce output in other formats.

The contents of the .docx must follow a specific formatting scheme for it to be understood; each section of content is prefaced by a bold formatted title, such as DIGEST TITLE, and the content below it is used to populate the section of the output.

There are four types of output content which can be after parsing a .docx file:

  1. DOCX output,
  2. JATS XML output format,
  3. JSON format, compatible with eLife API schema, or
  4. Medium format, which can be used to create a new post at Medium service using their API, which can optionally overwrite some values if supplied a JATS XML research article file

Optionally, a .zip file can contain the .docx file and an optional graphic image file,. The image caption content can be included in the .docx and will be added to the Image object.

Requirements

Parsing .docx files uses Python library dependency python-docx, as defined in the installation requirements files.

Configuration

The digest.cfg configuration file provided in this repository can be changed in order to produce slightly different output, depending on the situation. It includes a way to change the Medium post content, .docx output file name, and to change IIIF image server URL paths.

Example usage

This library is meant to be integrated into another operational system, however the following are examples using interactive Python:

Example 1 - Simple conversion of a .docx to JATS XML

>>> from digestparser import parse
>>> content = parse.parse_content("tests/test_data/DIGEST 99999.docx")
>>> print(content)
<b>AUTHOR</b>
Anonymous
<b>DIGEST TITLE</b>

Example 2 - Parse a .docx into Digest object and then output JSON

>>> from digestparser import build
>>> from digestparser import json_output
>>> from digestparser.conf import raw_config, parse_raw_config
>>> digest = build.build_digest("tests/test_data/DIGEST 99999.zip")
>>> digest_config = parse_raw_config(raw_config("elife"))
>>> print(json_output.digest_json(digest, digest_config))
OrderedDict([('id', 'None'), ('title', 'Fishing for errors in the\xa0tests'), ('impactStatement', ...

Example 3 - Parse a .zip and then output Medium post content

>>> from digestparser import medium_post
>>> from digestparser.conf import raw_config, parse_raw_config
>>> digest_config = parse_raw_config(raw_config("elife"))
>>> print(medium_post.build_medium_content("tests/test_data/DIGEST 99999.zip", digest_config=digest_config))
OrderedDict([('title', 'Fishing for errors in the\xa0tests'), ('contentFormat', 'html'), ...

License

Licensed under MIT.

Project details


Download files

Download the file for your platform. If you're not sure which to choose, learn more about installing packages.

Source Distribution

digestparser-0.2.0.tar.gz (22.5 kB view hashes)

Uploaded Source

Built Distribution

digestparser-0.2.0-py3-none-any.whl (16.4 kB view hashes)

Uploaded Python 3

Supported by

AWS AWS Cloud computing and Security Sponsor Datadog Datadog Monitoring Fastly Fastly CDN Google Google Download Analytics Microsoft Microsoft PSF Sponsor Pingdom Pingdom Monitoring Sentry Sentry Error logging StatusPage StatusPage Status page