A tool to standardise text and table data extracted from full text publications.

Maintainers

jmp111

Auto-CORPus

Requires Python 3.10+

The Automated pipeline for Consistent Outputs from Research Publications (Auto-CORPus) is a tool for the standardisation and conversion of publication HTML to three convenient machine-interpretable outputs to support biomedical text analytics. Firstly, Auto-CORPus can be configured to convert HTML from various publication sources to BioC format. Secondly, Auto-CORPus transforms publication tables to a JSON format to store, exchange and annotate table data between text analytics systems. Finally, Auto-CORPus extracts abbreviations declared within publication text and provides an abbreviations JSON output that relates an abbreviation with the full definition.

We present a JSON format for sharing table content and metadata that is based on the BioC format. The JSON schema for the tables JSON can be found within the keyFiles directory.
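As a rough illustration of consuming the BioC output, the sketch below walks a BioC JSON collection and yields the text of each passage. It assumes the general BioC JSON layout (a collection holding `documents`, each holding `passages` with a `text` field); the helper names `load_bioc` and `iter_passage_texts`, the path in the docstring, and the sample data are hypothetical and are not part of Auto-CORPus.

```python
import json
from pathlib import Path


def load_bioc(path):
    """Load a BioC JSON file, e.g. a hypothetical output/PMC1_bioc.json."""
    return json.loads(Path(path).read_text(encoding="utf-8"))


def iter_passage_texts(bioc_collection):
    """Yield (document id, passage text) pairs from a BioC JSON collection."""
    for document in bioc_collection.get("documents", []):
        for passage in document.get("passages", []):
            yield document.get("id"), passage.get("text", "")


# Tiny hand-written sample in BioC JSON shape; not real Auto-CORPus output.
sample = {
    "source": "example",
    "documents": [
        {
            "id": "PMC1",
            "passages": [
                {"offset": 0, "infons": {}, "text": "Title of the article"},
                {"offset": 21, "infons": {}, "text": "First paragraph."},
            ],
        }
    ],
}

for doc_id, text in iter_passage_texts(sample):
    print(doc_id, text)
```

The fields available in the `infons` of each passage depend on the Auto-CORPus version and config, so check a real output file before relying on specific keys.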

The documentation for Auto-CORPus is available on our GitHub Pages site.

Installation

Install with pip:

```
pip install autocorpus
```

If you want to process PDF files (only available with Auto-CORPus >v1.1.0), you will need to install additional (large!) dependencies. To install Auto-CORPus with PDF processing support, run:

```
pip install autocorpus[pdf]
```

Usage

You can run Auto-CORPus on a single file like so:

```
auto-corpus -b PMC -t "output" -f "path/to/html/file" -o JSON
```

Auto-CORPus can also process whole directories:

```
auto-corpus -b PMC -t "output" -f "path/to/directory/of/html/files" -o JSON
```

Available arguments

| Flag | Name             | Description                                               |
| ---- | ---------------- | --------------------------------------------------------- |
| `-f` | Input File Path  | File or directory to run Auto-CORPus on                   |
| `-t` | Output File Path | Directory path where Auto-CORPus should save output files |
| `-c` | Config           | Which config file to use                                  |
| `-o` | Output Format    | Either `JSON` or `XML` (defaults to `JSON`)               |
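If you want to drive the command line tool from a Python script, one option is to assemble the argument list and hand it to `subprocess`. The sketch below uses only the flags that appear in the usage examples and the table above (`-b`, `-t`, `-f`, `-o`); the function names `build_command` and `run_auto_corpus` are hypothetical helpers, and the sketch assumes the `auto-corpus` executable is on your `PATH`.

```python
import subprocess


def build_command(input_path, output_dir, builtin_config="PMC", output_format="JSON"):
    """Return the auto-corpus argument list for one file or directory."""
    if output_format not in {"JSON", "XML"}:
        raise ValueError("output_format must be 'JSON' or 'XML'")
    return [
        "auto-corpus",
        "-b", builtin_config,   # config selection, as in the usage examples
        "-t", output_dir,       # directory where output files are saved
        "-f", input_path,       # file or directory to process
        "-o", output_format,    # JSON (default) or XML
    ]


def run_auto_corpus(input_path, output_dir, **options):
    """Run auto-corpus and raise CalledProcessError if it exits non-zero."""
    return subprocess.run(build_command(input_path, output_dir, **options), check=True)


print(build_command("path/to/html/file", "output"))
```

Passing the arguments as a list avoids shell quoting problems with paths that contain spaces.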

Config files

If you wish to contribute or edit a config file then please follow the instructions in the config guide.

Auto-CORPus is able to parse HTML from different publishers, which utilise different HTML structures and naming conventions. This is made possible by the inclusion of config files which tell Auto-CORPus how to identify specific sections of the article/table within the source HTML. We have supplied a config template along with example config files for PubMed Central, PLOS Genetics and Nature Genetics in the configs directory. Users of Auto-CORPus can submit their own config files for different sources via the issues tab.

Auto-CORPus recognises two types of input file:

- Full text HTML documents covering the entire article
- HTML files which describe a single table

Auto-CORPus does not provide functionality to retrieve input files directly from the publisher. Input file retrieval must be completed by the user in a way which the publisher permits.

Auto-CORPus relies on a standard naming convention to recognise the files and identify the correct order of tables. The naming convention can be seen below:

- Full article HTML: `{any_name_you_want}.html`
  - `{any_name_you_want}` is the name Auto-CORPus uses to group an article with its linked table/image files.
- Linked table HTML: `{any_name_you_want}_table_X.html`
  - `{any_name_you_want}` must be identical to the name given to the full text file, followed by `_table_X`, where X is the table number.
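To check that your files follow this naming convention before running Auto-CORPus, a small helper like the one below can split a file name into its shared base name and table number. This is a minimal sketch, not Auto-CORPus's own parsing code: the names `parse_input_name` and `ParsedName` are hypothetical, and it assumes X is written as plain digits.

```python
import re
from pathlib import Path
from typing import NamedTuple, Optional

# Matches stems such as "PMC1_table_2"; assumes the table number is plain digits.
TABLE_STEM = re.compile(r"^(?P<base>.+)_table_(?P<number>\d+)$")


class ParsedName(NamedTuple):
    base: str                     # the {any_name_you_want} part
    table_number: Optional[int]   # None for a full article file


def parse_input_name(filename):
    """Split an input file name into its base name and optional table number."""
    stem = Path(filename).stem
    match = TABLE_STEM.match(stem)
    if match:
        return ParsedName(match.group("base"), int(match.group("number")))
    return ParsedName(stem, None)


for name in ["PMC1.html", "PMC1_table_1.html", "PMC1_table_2.html"]:
    print(name, parse_input_name(name))
```

All three example files share the base name `PMC1`, which is what lets them be grouped together.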

If a single file is passed via the file path, that file is processed in the most suitable manner. If a directory is passed, Auto-CORPus first groups files by the common element in their file names (`{any_name_you_want}`) and processes all related files at once. Related files in separate directories are not processed together. Files processed together are written to the same output files. An example input and output directory is shown below:

Input:

PMC1.html
PMC1_table_1.html
PMC1_table_2.html
/subdir
    PMC1_table_3.html
    PMC1_table_4.html

Output:

PMC1_bioc.json
PMC1_abbreviations.json
PMC1_tables.json (contains tables 1 & 2 and any tables described within the main text)
/subdir
    PMC1_tables.json (contains tables 3 & 4 only)
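The grouping behaviour in this example can be mimicked with a short script, which is handy for predicting which output files to expect from a directory tree. This is a sketch under stated assumptions rather than the tool's actual logic: `group_inputs` and `expected_outputs` are hypothetical names, and the rule that BioC and abbreviations files appear only when the full text file is present is inferred from the example above.

```python
import re
from collections import defaultdict
from pathlib import PurePosixPath

# Assumes the table number in "_table_X" is plain digits.
TABLE_STEM = re.compile(r"^(?P<base>.+)_table_\d+$")


def group_inputs(paths):
    """Group input files by (directory, base name); directories never mix."""
    groups = defaultdict(list)
    for raw in paths:
        path = PurePosixPath(raw)
        match = TABLE_STEM.match(path.stem)
        base = match.group("base") if match else path.stem
        groups[(str(path.parent), base)].append(path.name)
    return dict(groups)


def expected_outputs(base, file_names):
    """Guess output names for one group (inferred from the example, not the code)."""
    outputs = []
    if f"{base}.html" in file_names:
        outputs += [f"{base}_bioc.json", f"{base}_abbreviations.json"]
    outputs.append(f"{base}_tables.json")
    return outputs


inputs = [
    "PMC1.html",
    "PMC1_table_1.html",
    "PMC1_table_2.html",
    "subdir/PMC1_table_3.html",
    "subdir/PMC1_table_4.html",
]
for (directory, base), names in group_inputs(inputs).items():
    print(directory, base, expected_outputs(base, names))
```

Keying the groups on the directory as well as the base name is what keeps tables 3 and 4 in their own `PMC1_tables.json` under `subdir`.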

A log file is produced in the output directory providing details of the day/time Auto-CORPus was run, the arguments used and information about which files were successfully/unsuccessfully processed with a relevant error message.

For developers

This is a Python application that uses Poetry for packaging and dependency management. It also provides pre-commit hooks for various linters and formatters, and automated tests using pytest and GitHub Actions.

To get started:

1. Download and install Poetry following the instructions for your OS.
2. Clone this repository and make it your working directory.
3. (Optionally) download private test data for additional regression tests. This data cannot be redistributed publicly, so it is only available to members of the omicsNLP organisation.

   ```
   git submodule update --init
   ```

4. Set up the virtual environment:

   ```
   poetry install --all-extras
   ```

   Note: The `--all-extras` flag is needed because of the additional dependencies required for analysing extra file types (PDF, Word, Excel, etc).

5. Activate the virtual environment (alternatively, ensure any Python-related command is preceded by `poetry run`):

   ```
   poetry shell
   ```

6. Install the git hooks:

   ```
   pre-commit install
   ```

To run the main app on a single file, for example:

```
auto-corpus -c "autocorpus/configs/config_pmc.json" -t "output" -f "path/to/html/file" -o JSON
```

To run the main app on a directory of files, for example:

```
auto-corpus -c "autocorpus/configs/config_pmc.json" -t "output" -f "path/to/directory/of/html/files" -o JSON
```

Release history

- 1.1.1 (this version): Jun 3, 2025
- 1.1.0: Jan 7, 2025

Download files

Source Distribution

autocorpus-1.1.1.tar.gz (71.3 kB)

Uploaded Jun 3, 2025 Source

Built Distribution

autocorpus-1.1.1-py3-none-any.whl (89.4 kB)

Uploaded Jun 3, 2025 Python 3

File details

Details for the file autocorpus-1.1.1.tar.gz.

File metadata

Download URL: autocorpus-1.1.1.tar.gz
Upload date: Jun 3, 2025
Size: 71.3 kB
Tags: Source
Uploaded using Trusted Publishing? Yes
Uploaded via: twine/6.1.0 CPython/3.12.9

File hashes

Hashes for autocorpus-1.1.1.tar.gz
Algorithm	Hash digest
SHA256	`366680df0a42ac268f66abef6352a1c1eda176fc3a72e70cf035ccc455ad1bb3`
MD5	`ce3b0d6e7eae8e578e8a026ea27df40b`
BLAKE2b-256	`40d7af625c5844b0baff42147b66166889752243b7d70942b9ec8c5e03b3c6b8`
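A downloaded file can be checked against the SHA256 digest listed above with Python's standard `hashlib` module. The sketch below is one way to do it; the function names `sha256_of_file` and `matches_expected` are illustrative, and the file path in the usage comment assumes the archive was saved in the current directory.

```python
import hashlib

# SHA256 digest listed above for autocorpus-1.1.1.tar.gz.
EXPECTED_SHA256 = "366680df0a42ac268f66abef6352a1c1eda176fc3a72e70cf035ccc455ad1bb3"


def sha256_of_file(path, chunk_size=1 << 20):
    """Return the hex SHA256 digest of a file, reading it in chunks."""
    digest = hashlib.sha256()
    with open(path, "rb") as handle:
        for chunk in iter(lambda: handle.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def matches_expected(path, expected=EXPECTED_SHA256):
    """Return True if the file's SHA256 digest equals the expected value."""
    return sha256_of_file(path) == expected.lower()


# Usage (assumes the archive is in the current directory):
# print(matches_expected("autocorpus-1.1.1.tar.gz"))
```

Reading in chunks keeps memory use constant regardless of file size. `pip` can also enforce hashes itself through its hash-checking mode if you pin the digest in a requirements file.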

Provenance

The following attestation bundles were made for autocorpus-1.1.1.tar.gz:

Publisher: release.yml on omicsNLP/Auto-CORPus

Attestations: Values shown here reflect the state when the release was signed and may no longer be current.

Statement:
- Statement type: https://in-toto.io/Statement/v1
- Predicate type: https://docs.pypi.org/attestations/publish/v1
- Subject name: autocorpus-1.1.1.tar.gz
- Subject digest: 366680df0a42ac268f66abef6352a1c1eda176fc3a72e70cf035ccc455ad1bb3
- Sigstore transparency entry: 228986197
- Sigstore integration time: Jun 3, 2025
Source repository:
- Permalink: omicsNLP/Auto-CORPus@d6079962e052b2c77e1aca722a6e7443c1230df4
- Branch / Tag: refs/tags/v1.1.1
- Owner: https://github.com/omicsNLP
- Access: public
Publication detail:
- Token Issuer: https://token.actions.githubusercontent.com
- Runner Environment: github-hosted
- Publication workflow: release.yml@d6079962e052b2c77e1aca722a6e7443c1230df4
- Trigger Event: release

File details

Details for the file autocorpus-1.1.1-py3-none-any.whl.

File metadata

Download URL: autocorpus-1.1.1-py3-none-any.whl
Upload date: Jun 3, 2025
Size: 89.4 kB
Tags: Python 3
Uploaded using Trusted Publishing? Yes
Uploaded via: twine/6.1.0 CPython/3.12.9

File hashes

Hashes for autocorpus-1.1.1-py3-none-any.whl
Algorithm	Hash digest
SHA256	`aaa0996f3bd8e5ee2b9afee3f770919801a69291a10353ac57fb75be08f6cbde`
MD5	`a16c03967838f9335e262c35857c78d6`
BLAKE2b-256	`02ee7438d3f6a8dcf88eb11b86a57e8189d66e93670aeff1ec1139e6720b6d86`

Provenance

The following attestation bundles were made for autocorpus-1.1.1-py3-none-any.whl:

Publisher: release.yml on omicsNLP/Auto-CORPus

Attestations: Values shown here reflect the state when the release was signed and may no longer be current.

Statement:
- Statement type: https://in-toto.io/Statement/v1
- Predicate type: https://docs.pypi.org/attestations/publish/v1
- Subject name: autocorpus-1.1.1-py3-none-any.whl
- Subject digest: aaa0996f3bd8e5ee2b9afee3f770919801a69291a10353ac57fb75be08f6cbde
- Sigstore transparency entry: 228986205
- Sigstore integration time: Jun 3, 2025
Source repository:
- Permalink: omicsNLP/Auto-CORPus@d6079962e052b2c77e1aca722a6e7443c1230df4
- Branch / Tag: refs/tags/v1.1.1
- Owner: https://github.com/omicsNLP
- Access: public
Publication detail:
- Token Issuer: https://token.actions.githubusercontent.com
- Runner Environment: github-hosted
- Publication workflow: release.yml@d6079962e052b2c77e1aca722a6e7443c1230df4
- Trigger Event: release
