Archive journal articles into Portico and PMC


PubArchiver

A program that creates archives of articles from specific journal sites (currently microPublication and Prompt) for sending to Portico and PMC.

Authors: Michael Hucka, Tom Morrell
Repository: https://github.com/caltechlibrary/pubarchiver
License: BSD/MIT derivative – see the LICENSE file for more information

Introduction
Installation
Usage
Getting help and support
Contributing
License
Authors and history
Acknowledgments

☀ Introduction

The Caltech Library is the publisher of a few academic journals and provides services for them. The services include archiving in a dark archive (specifically, Portico) as well as submitting articles to PMC. The archiving process involves pulling down articles from the journals and packaging them up in a format suitable for sending to the archives. PubArchiver is a program to help automate this process.

✺ Installation

There are multiple ways of installing PubArchiver. Please choose the alternative that suits you.

Alternative 1: installing PubArchiver using `pipx`

You can use pipx to install PubArchiver. Pipx installs it into a separate Python environment that isolates PubArchiver's dependencies from other Python programs on your system, yet the resulting pubarchiver command will be executable from any shell, like any normal program on your computer. If you do not already have pipx on your system, consult Pipx's installation guide for instructions; it can be installed in a variety of easy ways. Once you have pipx on your system, you can install PubArchiver with the following command:

pipx install pubarchiver

Pipx can also let you run PubArchiver directly using pipx run pubarchiver, although in that case, you must always prefix every pubarchiver command with pipx run. Consult the documentation for pipx run for more information.

Alternative 2: installing PubArchiver using `pip`

The instructions below assume you have a Python 3 interpreter installed on your computer. Note that the default on macOS at least through 10.14 (Mojave) is Python 2 – please first install Python version 3 and familiarize yourself with running Python programs on your system before proceeding further.

On Linux, macOS, and Windows operating systems, you should be able to install pubarchiver with pip for Python 3. To install pubarchiver from the Python package repository (PyPI), run the following command:

python3 -m pip install pubarchiver

As an alternative to getting it from PyPI, you can use pip to install pubarchiver directly from GitHub:

python3 -m pip install git+https://github.com/caltechlibrary/pubarchiver.git

If you already installed PubArchiver once before, and want to update to the latest version, add --upgrade to the end of either command line above.

Alternative 3: installing PubArchiver from sources

If you prefer to install PubArchiver directly from the source code, you can do that too. To get a copy of the files, you can clone the GitHub repository:

git clone https://github.com/caltechlibrary/pubarchiver

Alternatively, you can download the files as a ZIP archive directly from your browser using this link: https://github.com/caltechlibrary/pubarchiver/archive/refs/heads/main.zip

Next, after getting a copy of the files, run setup.py inside the code directory:

cd pubarchiver
python3 setup.py install

▶︎ Usage

PubArchiver is a command-line program. The installation process should put a program named pubarchiver in a location normally searched by your shell interpreter. For help with usage at any time, run pubarchiver with the option --help (or -h for short).

pubarchiver -h

Basic usage

Options to pubarchiver use a dash (-) as the prefix character on macOS and Linux, and forward slash (/) on Windows.

The journal whose articles are to be archived must be indicated using the required option --journal (or -j for short). To see a list of supported journals, you can use --journal list like this:

pubarchiver --journal list

If not given any additional options besides a --journal option to select the journal, pubarchiver will proceed to contact the journal website as well as either DataCite or Crossref, and create an archive containing articles and their metadata for all articles published to date by the journal. The options below can be used to select articles and influence other pubarchiver behaviors.

Printing information without doing anything

The option --list-dois (or -l for short) can be used to obtain a list of all DOIs for all articles published by the selected journal. When --list-dois is used, pubarchiver prints the list to the terminal and exits without doing further work. This can be useful if you intend to use the --doi-file option discussed below.

If given the option --preview (or -p for short), pubarchiver will only print a list of articles it will archive and stop short of creating the archive. This is useful to see what would be produced without actually doing it. Note, however, that because it does not attempt to download the articles and associated files, it cannot report errors that might occur when actually creating an archive. Consequently, do not use the output of --preview as a prediction of eventual success or failure.

Selecting the archive format and archive output location

The value supplied after the option --dest (or -d for short) can be used to tell pubarchiver the intended destination where the archive will be sent. The option changes the structure and content of the archive created by pubarchiver. The possible alternatives are portico and pmc. Portico is assumed to be the default destination if no --dest option is given.

By default, pubarchiver will write its output to a new subdirectory it creates in the directory from which pubarchiver is being run. The option --output-dir (or -o for short) can be used to select a different location. For example,

pubarchiver --journal micropublication --output-dir /tmp/micropub

The materials for each article will be written to an individual subdirectory named after the DOI of the article. The output for each article will consist of an XML metadata file describing the article, the article itself in PDF format, and (if the journal provides JATS) a subdirectory named jats containing the article in JATS XML format along with any image that may appear in the article. The image is always converted to uncompressed TIFF format, because it is considered a good preservation format. The PMC structure follows the naming and delivery specifications defined at https://www.ncbi.nlm.nih.gov/pmc/pub/filespec-delivery/.

Unless the option --no-zip (or -Z for short) is given, the output will be archived in ZIP format. If the output structure (as determined by the --dest option) is being generated for PMC, each article will be put into its own individual ZIP archive; otherwise, the default action is to put the collected output of all articles into a single ZIP archive file. The option --no-zip makes pubarchiver leave the output unarchived in the directory determined by the --output-dir option.
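To make the described per-article output concrete, here is a minimal Python sketch that inspects one unarchived article directory (as left behind by --no-zip) and reports whether the pieces described above are present. The function name summarize_article_dir is hypothetical, and the check assumes the metadata and article files carry .xml and .pdf extensions; the exact file names that pubarchiver uses are not specified here.

```python
from pathlib import Path


def summarize_article_dir(article_dir):
    """Report which expected pieces exist in one article's output directory.

    Assumptions (not confirmed by the PubArchiver documentation): the
    metadata file ends in .xml and the article file ends in .pdf.
    """
    article_dir = Path(article_dir)
    # Only look at files directly inside the article directory.
    files = [p for p in article_dir.iterdir() if p.is_file()]
    return {
        "has_xml": any(p.suffix.lower() == ".xml" for p in files),
        "has_pdf": any(p.suffix.lower() == ".pdf" for p in files),
        # The jats subdirectory is present only if the journal provides JATS.
        "has_jats": (article_dir / "jats").is_dir(),
    }
```

A check like this can be run over each article subdirectory after an archiving run to spot articles with a missing PDF before anything is sent to an archive.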

Selecting a subset of articles

If the option --after-date is given, pubarchiver will download only articles whose publication dates are after the given date. Valid date descriptors are those accepted by the Python dateparser library. Make sure to enclose descriptions within single or double quotes. Examples:

  pubarchiver --after-date "2014-08-29"   ....
  pubarchiver --after-date "12 Dec 2014"  ....
  pubarchiver --after-date "July 4, 2013"  ....
  pubarchiver --after-date "2 weeks ago"  ....

The option --doi-file (or -f for short) can be used to tell pubarchiver to read a file containing DOIs and only fetch those particular articles instead of asking the journal for all articles. The format of the file indicated after the --doi-file option must be a simple text file containing one DOI per line.
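As an illustration of the file format, the following Python sketch writes a list of DOIs to a text file with one DOI per line. The helper name write_doi_file and the DOIs in the usage note are hypothetical placeholders; dropping blank entries is a choice made for this sketch, not documented pubarchiver behavior.

```python
from pathlib import Path


def write_doi_file(dois, path):
    """Write DOIs to a plain text file, one DOI per line.

    Surrounding whitespace is stripped and empty entries are dropped
    (an assumption of this sketch). Returns the number of DOIs written.
    """
    cleaned = [d.strip() for d in dois if d.strip()]
    text = "".join(d + "\n" for d in cleaned)
    Path(path).write_text(text, encoding="utf-8")
    return len(cleaned)
```

For example, write_doi_file(["10.1234/example.0001", "10.1234/example.0002"], "dois.txt") would produce a two-line file that could then be passed to pubarchiver with --doi-file dois.txt.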

The date filter applied by --after-date runs after the list of articles has been read using the --doi-file option (if present), so it can be used to filter by date the articles whose DOIs are provided.

Writing a report

As it works, pubarchiver writes information to the terminal about the articles it puts into the archive, including whether any problems are encountered. To save this information to a file, use the option --rep-file (or -r for short), which will make pubarchiver write a report file. By default, the format of the report file is CSV; the option --rep-fmt (or -s for short) can be used to select csv or html (or both) as the format. The title of the report will be based on the current date, unless the option --rep-title (or -t for short) is used to supply a different title.

Additional command-line options

When pubarchiver downloads the JATS XML version of articles from the journal site, it will by default validate the XML content against the JATS DTD. To skip the XML validation step, use the option --no-check (or -X for short).

pubarchiver will print informational messages as it works. To reduce messages to only warnings and errors, use the option --quiet (or -q for short). Also, output is color-coded by default unless the --no-color option (or -C for short) is given; this option can be helpful if the color control sequences create problems for your terminal emulator.

If given the --debug option (or -@ for short), this program will output a detailed real-time trace of what it is doing. The output will be written to the given destination, which can be a dash character (-) to indicate console output, or a file path.

If given the --version option (or -V for short), this program will print version information and exit without doing anything else.

Return values

This program exits with a return code of 0 if no problems are encountered while fetching data from the server. It returns a nonzero value otherwise, following conventions for use in shells such as bash which only understand return code values of 0 to 255. If no network is detected, it returns a value of 1; if it is interrupted (e.g., using ctrl-c) it returns a value of 2; if it encounters a fatal error, it returns a value of 3. If it encounters any non-fatal problems (such as a missing PDF file or JATS validation error), it returns a nonzero value equal to 100 + the number of articles that had failures. Summarizing the possible return codes:

Return value	Meaning
`0`	No errors were encountered – success
`1`	No network detected – cannot proceed
`2`	The user interrupted program execution
`3`	An exception or fatal error occurred
`100` + n	Encountered non-fatal problems on a total of n articles
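For scripts that wrap pubarchiver, the table above can be turned into a small lookup. The following Python sketch is one possible way to do it; the function name describe_exit_code and the message wording are illustrative and not part of PubArchiver itself.

```python
def describe_exit_code(code: int) -> str:
    """Translate a pubarchiver exit code into a short message."""
    fixed = {
        0: "No errors were encountered",
        1: "No network detected",
        2: "The user interrupted program execution",
        3: "An exception or fatal error occurred",
    }
    if code in fixed:
        return fixed[code]
    # Codes above 100 encode 100 + n, where n is the number of articles
    # that had non-fatal problems; shells limit exit codes to 255.
    if 100 < code <= 255:
        return f"Non-fatal problems on {code - 100} article(s)"
    return f"Unrecognized exit code {code}"
```

The code to pass in would typically come from something like subprocess.run([...]).returncode in the calling script.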

Summary of command-line options

The following table summarizes all the command line options available. (Note: on Windows computers, / must be used as the prefix character instead of -):

Short	Long form opt	Meaning	Default
`-a`A	`--after-date`A	Only get articles published after date A	Get all articles	⬥
`-C`	`--no-color`	Don't color-code info messages	Color-code terminal output
`-d`D	`--dest`D	Destination for archive: Portico or PMC	Portico
`-f`F	`--doi-file`F	Only get articles whose DOIs are in file F	Get all articles
`-j`J	`--journal`J	Work with journal J		★
`-l`	`--list-dois`	Print a list of all known DOIs & exit	Do other actions instead
`-o`O	`--output-dir`O	Write output in directory O	Write in current dir
`-p`	`--preview`	Preview what would be archived & exit	Obtain the articles
`-q`	`--quiet`	Only print important messages	Be chatty while working
`-r`R	`--rep-file`R	Write list of articles & results in file R	Don't write a report
`-s`S	`--rep-fmt`S	With `-r`, write either `html` or `csv`	`csv`
`-t`T	`--rep-title`T	With `-r`, use T as the report title	Use the current date
`-V`	`--version`	Print program version info & exit	Do other actions instead
`-X`	`--no-check`	Don't validate JATS XML files	Validate JATS XML
`-Z`	`--no-zip`	Don't put output into one ZIP archive	ZIP up the output
`-@`OUT	`--debug`OUT	Debugging mode; write trace to OUT	Normal mode	⚑

⬥ Enclose the date in quotes if it contains space characters; e.g., "12 Dec 2014".
★ Required argument.
⚑ To write to the console, use the character - (a single dash) as the value of OUT; otherwise, OUT must be the name of a file where the output should be written.
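When pubarchiver is driven from another program, the options summarized above can be assembled into an argument list. The Python sketch below shows one way to do that using only options documented in this README; the function build_command is a hypothetical helper, and it uses the dash-prefixed option forms for macOS and Linux.

```python
import shlex


def build_command(journal, dest=None, output_dir=None, after_date=None,
                  doi_file=None, report_file=None, preview=False):
    """Build a pubarchiver argument list from optional settings."""
    cmd = ["pubarchiver", "--journal", journal]  # --journal is required
    if dest:
        cmd += ["--dest", dest]              # portico or pmc
    if output_dir:
        cmd += ["--output-dir", output_dir]
    if after_date:
        cmd += ["--after-date", after_date]  # may contain spaces
    if doi_file:
        cmd += ["--doi-file", doi_file]
    if report_file:
        cmd += ["--rep-file", report_file]
    if preview:
        cmd.append("--preview")
    return cmd


if __name__ == "__main__":
    cmd = build_command("micropublication", dest="pmc",
                        after_date="12 Dec 2014", preview=True)
    # shlex.join adds the quoting needed to paste the command into a shell.
    print(shlex.join(cmd))
```

Passing the list directly to subprocess.run avoids shell quoting problems with date descriptions that contain spaces.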

⁇ Getting help and support

If you find an issue, please submit it in the GitHub issue tracker for this repository.

♬ Contributing

We would be happy to receive your help and participation with enhancing pubarchiver! Please visit the guidelines for contributing for some tips on getting started.

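☥ License

PubArchiver is distributed under a BSD/MIT derivative license; see the LICENSE file for more information.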

❡ Authors and history

Tom Morrell developed the original algorithm for extracting metadata from DataCite and creating XML files for use with Portico submissions of microPublication.org articles. Starting with that original script, Mike Hucka created the much-expanded Microarchiver program (later renamed to PubArchiver).

♥︎ Acknowledgments

The vector artwork used as a starting point for the logo for this repository was created by Cuby Design for the Noun Project. It is licensed under the Creative Commons Attribution 3.0 Unported license. The vector graphic was modified by Mike Hucka to change the color.

Nick Stiffler from the microPublication.org team helped figure out the requirements for PMC output (introduced in Microarchiver version 1.9), helped guide development of Microarchiver/PubArchiver, and engaged in many discussions about microPublication.org's needs.

PubArchiver makes use of numerous open-source packages, without which it would have been effectively impossible to develop PubArchiver with the resources we had. We want to acknowledge this debt. In alphabetical order, the packages are:

Beautiful Soup – an HTML parsing library
bun – a set of basic user interface classes and functions
commonpy – a collection of commonly-useful Python functions
crossrefapi – a python library that implements the Crossref API
dateparser – parser for human-readable dates
humanize – make numbers more easily readable by humans
lxml – an XML parsing library for Python
Pillow – a fork of the Python Imaging Library
plac – a command line argument parser
recordclass – a mutable version of Python named tuples
setuptools – library for setup.py
sidetrack – simple debug logging/tracing package
slack-cli – a command-line interface to Slack written in Bash
urllib3 – a powerful HTTP library for Python
xmltodict – a module to make working with XML feel like working with JSON

Finally, we are grateful for computing & institutional resources made available by the California Institute of Technology.

Release history

2.1.2 – Apr 12, 2022 (this version)
2.1.1 – Apr 9, 2022
2.1.0 – Apr 8, 2022

