Parse docx file containing article digest content
digest-parser
Parse a .docx file containing digest content and produce output in other formats.
The contents of the .docx must follow a specific formatting scheme for it to be understood: each section of content is prefaced by a bold formatted title, such as DIGEST TITLE, and the content below it is used to populate the corresponding section of the output.
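To illustrate the formatting scheme described above, here is a minimal sketch of grouping content under the bold title that precedes it. It assumes the paragraphs have already been rendered as text with bold runs wrapped in <b>...</b> markup; the function name split_sections is hypothetical and is not part of digestparser, and the library's own parsing may work differently.

```python
import re
from collections import OrderedDict

# Matches a line that consists solely of a bold title, e.g. "<b>DIGEST TITLE</b>"
BOLD_TITLE = re.compile(r"^<b>(?P<title>.+?)</b>$")


def split_sections(text):
    """Group each non-empty line under the bold title that precedes it.

    Hypothetical helper for illustration only; not the digestparser API.
    """
    sections = OrderedDict()
    current = None
    for line in text.splitlines():
        line = line.strip()
        if not line:
            continue
        match = BOLD_TITLE.match(line)
        if match:
            # A bold title starts a new section
            current = match.group("title")
            sections[current] = []
        elif current is not None:
            # Plain content belongs to the most recent title
            sections[current].append(line)
    return sections


sample = (
    "<b>AUTHOR</b>\n"
    "Anonymous\n"
    "<b>DIGEST TITLE</b>\n"
    "Fishing for errors in the tests"
)
sections = split_sections(sample)
```

Content that appears before the first bold title is ignored in this sketch, which is a design choice made here for simplicity rather than documented behaviour.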
There are four types of output content which can be produced after parsing a .docx file:
- DOCX output,
- JATS XML output format,
- JSON format, compatible with eLife API schema, or
- Medium format, which can be used to create a new post on the Medium service using their API; some of its values can optionally be overwritten when a JATS XML research article file is supplied
Alternatively, the input can be a .zip file containing the .docx file and, optionally, a graphic image file. The image caption content can be included in the .docx and will be added to the Image object.
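As a rough sketch of how a .zip input of this kind could be inspected, the following uses only the standard library to locate the .docx and an optional image inside an archive. The function name find_docx_and_image and the list of image extensions are assumptions made for illustration; digestparser may use different rules.

```python
import io
import os
import zipfile

# Assumed set of image extensions; the library may accept a different set
IMAGE_EXTENSIONS = {".jpg", ".jpeg", ".png", ".gif", ".tif", ".tiff"}


def find_docx_and_image(zip_file):
    """Return (docx_name, image_name) found in the zip; image_name may be None.

    Hypothetical helper for illustration only; not the digestparser API.
    """
    docx_name = None
    image_name = None
    with zipfile.ZipFile(zip_file) as archive:
        for name in archive.namelist():
            if name.endswith("/"):
                continue  # skip directory entries
            extension = os.path.splitext(name)[1].lower()
            if extension == ".docx" and docx_name is None:
                docx_name = name
            elif extension in IMAGE_EXTENSIONS and image_name is None:
                image_name = name
    return docx_name, image_name


# Build a small in-memory zip with placeholder contents to demonstrate
buffer = io.BytesIO()
with zipfile.ZipFile(buffer, "w") as archive:
    archive.writestr("DIGEST 99999.docx", b"placeholder")
    archive.writestr("digest-99999.jpg", b"placeholder")
buffer.seek(0)
result = find_docx_and_image(buffer)
```

Only the first matching file of each kind is kept, so an archive with several images would need an explicit rule for choosing between them.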
Requirements
Parsing .docx files depends on the Python library python-docx, as defined in the installation requirements files.
Configuration
The digest.cfg configuration file provided in this repository can be changed to produce slightly different output, depending on the situation. It includes settings to change the Medium post content, the .docx output file name, and the IIIF image server URL paths.
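For orientation, here is a minimal sketch of reading per-section settings from an INI-style file with the standard library configparser module. The option names output_file_name and iiif_base_url, their values, and the section layout are hypothetical placeholders, not the actual contents of digest.cfg; consult the file in the repository for the real option names.

```python
import configparser

# Hypothetical config text; option names and values are illustrative only
EXAMPLE_CFG = """
[DEFAULT]
output_file_name = digest.docx

[elife]
output_file_name = elife-digest.docx
iiif_base_url = https://iiif.example.org/digests/
"""

parser = configparser.ConfigParser(interpolation=None)
parser.read_string(EXAMPLE_CFG)

# Values in a named section override those in DEFAULT
settings = dict(parser["elife"])
fallback = dict(parser["DEFAULT"])
```

With this layout, a named section only needs to list the options that differ from the defaults.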
Example usage
This library is meant to be integrated into another operational system; however, the following examples use interactive Python:
Example 1 - Simple conversion of a .docx to JATS XML
>>> from digestparser import parse
>>> content = parse.parse_content("tests/test_data/DIGEST 99999.docx")
>>> print(content)
<b>AUTHOR</b>
Anonymous
<b>DIGEST TITLE</b>
Example 2 - Parse a .docx into Digest object and then output JSON
>>> from digestparser import build
>>> from digestparser import json_output
>>> from digestparser.conf import raw_config, parse_raw_config
>>> digest = build.build_digest("tests/test_data/DIGEST 99999.zip")
>>> digest_config = parse_raw_config(raw_config("elife"))
>>> print(json_output.digest_json(digest, digest_config))
OrderedDict([('id', 'None'), ('title', 'Fishing for errors in the\xa0tests'), ('impactStatement', ...
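The value printed above is an OrderedDict rather than a JSON string. A minimal sketch of serialising such a value with the standard library follows; the digest_data variable is a stand-in that reproduces only the two fields visible in the truncated output, not the full structure returned by the library.

```python
import json
from collections import OrderedDict

# Stand-in for the value returned by json_output.digest_json(); only the two
# fields visible in the truncated output are reproduced here
digest_data = OrderedDict([
    ("id", "None"),
    ("title", "Fishing for errors in the\xa0tests"),
])

# ensure_ascii=False keeps the non-breaking space as a literal character
# instead of escaping it as \u00a0
json_text = json.dumps(digest_data, indent=4, ensure_ascii=False)

# Loading the text back yields the same keys in the same order
round_trip = json.loads(json_text)
```

Whether to keep ensure_ascii=False depends on the consumer: leaving it at the default escapes non-ASCII characters, which is safer for systems that do not handle UTF-8 reliably.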
Example 3 - Parse a .zip and then output Medium post content
>>> from digestparser import medium_post
>>> from digestparser.conf import raw_config, parse_raw_config
>>> digest_config = parse_raw_config(raw_config("elife"))
>>> print(medium_post.build_medium_content("tests/test_data/DIGEST 99999.zip", digest_config=digest_config))
OrderedDict([('title', 'Fishing for errors in the\xa0tests'), ('contentFormat', 'html'), ...
License
Licensed under MIT.