ScienceBeam Utils

License: MIT

Provides utility functions related to the ScienceBeam project.

Please refer to the development documentation if you wish to contribute to the project.

Most tools are not yet documented. Please feel free to browse the code or tests, or raise an issue.

Pre-requisites

Apache Beam may be used for preprocessing, and its transparent FileSystems API also makes it easy to access files in the cloud.

Install

pip install apache_beam[gcp]
pip install sciencebeam-utils

CLI Tools

Find File Pairs

The preferred input layout is one directory per manuscript, each containing a gzipped PDF (.pdf.gz) and a gzipped XML file (.nxml.gz), e.g.:

  • manuscript_1/
    • manuscript_1.pdf.gz
    • manuscript_1.nxml.gz
  • manuscript_2/
    • manuscript_2.pdf.gz
    • manuscript_2.nxml.gz

Using compressed files is optional but recommended to reduce file storage cost.

The parent directory per manuscript is optional. If the files are not grouped into per-manuscript directories, the PDF and XML files must share the same name before the extension (which is recommended in general).

Run:

python -m sciencebeam_utils.tools.find_file_pairs \
--data-path <source directory> \
--source-pattern '*.pdf.gz' --xml-pattern '*.nxml.gz' \
--out <output file list csv/tsv>

e.g.:

python -m sciencebeam_utils.tools.find_file_pairs \
--data-path gs://some-bucket/some-dataset \
--source-pattern '*.pdf.gz' --xml-pattern '*.nxml.gz' \
--out gs://some-bucket/some-dataset/file-list.tsv

That will create the TSV (tab separated) file file-list.tsv with the following columns:

  • source_url
  • xml_url

That file could also be generated using any other preferred method.
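If you prefer to generate the file list yourself, the pairing rule described above (same directory, same name before the extension) can be sketched in plain Python. This is only an illustration and not the implementation used by find_file_pairs; the function names find_file_pairs and write_file_list below are hypothetical, and the sketch writes to a local path only.

```python
import csv


def find_file_pairs(paths, source_suffix='.pdf.gz', xml_suffix='.nxml.gz'):
    """Pair source and XML paths sharing a directory and a base name.

    Assumption of this sketch: a pair is identified by
    (parent directory, file name without the suffix).
    """
    sources, xmls = {}, {}
    for path in paths:
        parent, _, name = path.rpartition('/')
        if name.endswith(source_suffix):
            sources[(parent, name[:-len(source_suffix)])] = path
        elif name.endswith(xml_suffix):
            xmls[(parent, name[:-len(xml_suffix)])] = path
    # Keep only keys that have both a source and an XML file.
    return [(sources[key], xmls[key]) for key in sorted(sources) if key in xmls]


def write_file_list(pairs, out_path):
    """Write the pairs as a TSV file with source_url and xml_url columns."""
    with open(out_path, 'w', newline='', encoding='utf-8') as f:
        writer = csv.writer(f, delimiter='\t')
        writer.writerow(['source_url', 'xml_url'])
        writer.writerows(pairs)
```

Source files without a matching XML file (and vice versa) are silently skipped here; whether the actual tool does the same is not documented in this README.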

Split File List

To separate the file list into a training, validation and test dataset, the following script can be used:

python -m sciencebeam_utils.tools.split_csv_dataset \
--input <csv/tsv file list> \
--train 0.5 --validation 0.2 --test 0.3 --random --fill

e.g.:

python -m sciencebeam_utils.tools.split_csv_dataset \
--input gs://some-bucket/some-dataset/file-list.tsv \
--train 0.5 --validation 0.2 --test 0.3 --random --fill

That will create three separate files in the same directory:

  • file-list-train.tsv
  • file-list-validation.tsv
  • file-list-test.tsv

The file pairs will be randomly selected (--random), and one group will also include any remaining file pairs that would otherwise be left out due to rounding (--fill).

As with the previous step, you may decide to use your own process instead.
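For reference, a random split with a fill step could look roughly like the following. This is a minimal sketch and not the split_csv_dataset implementation; in particular, rounding down and assigning the leftover rows to the first group are assumptions, as are the function name split_rows and its parameters.

```python
import random


def split_rows(rows, ratios, shuffle=True, fill=True, seed=None):
    """Split rows into named groups sized by the given ratios.

    ratios: mapping of group name to fraction, e.g. {'train': 0.5, ...}.
    fill: if True, rows left over after rounding down are added to the
          first group (an assumption of this sketch).
    """
    rows = list(rows)
    if shuffle:
        random.Random(seed).shuffle(rows)
    # Round each group size down, which may leave a few rows unassigned.
    sizes = {name: int(len(rows) * ratio) for name, ratio in ratios.items()}
    if fill:
        first = next(iter(sizes))
        sizes[first] += len(rows) - sum(sizes.values())
    result, start = {}, 0
    for name, size in sizes.items():
        result[name] = rows[start:start + size]
        start += size
    return result
```

Passing a fixed seed makes the split reproducible, which helps keep the resulting files stable between runs.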

Note: once you have started using these files, they should no longer be changed.

Get Output Files

Since ScienceBeam is intended to convert files, there will be output files. To make the output filenames explicit, they are also kept in a file list. This tool generates that file list (for this purpose, it doesn't matter whether the files actually exist).

e.g.

python -m sciencebeam_utils.tools.get_output_files \
  --source-file-list path/to/source/file-list-train.tsv \
  --source-file-column=source_url \
  --output-file-suffix=.xml \
  --output-file-list path/to/results/file-list.lst

By adding the --check argument, it will check whether the output files exist (see below).
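To illustrate the idea of deriving output filenames from source filenames, here is a small sketch. How get_output_files actually maps paths is not described in this README, so the rule below (strip a known source extension, append the output suffix, place the file in a single output directory) is purely an assumption, and get_output_file is a hypothetical helper.

```python
def get_output_file(source_url, output_dir, output_suffix='.xml',
                    source_suffixes=('.pdf.gz', '.pdf')):
    """Derive an output path from a source URL (assumed mapping, see above)."""
    name = source_url.rpartition('/')[2]
    # Strip the first matching source extension, if any.
    for suffix in source_suffixes:
        if name.endswith(suffix):
            name = name[:-len(suffix)]
            break
    return output_dir.rstrip('/') + '/' + name + output_suffix


source_urls = ['gs://some-bucket/some-dataset/manuscript_1/manuscript_1.pdf.gz']
output_files = [get_output_file(url, 'path/to/results') for url in source_urls]
```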

Check File List

After generating an output file list, this tool can be used to check whether the output files exist or are complete.

e.g.

python -m sciencebeam_utils.tools.check_file_list \
  --file-list path/to/results/file-list.lst \
  --file-column=source_url \
  --limit=100

This will check the first 100 output files and report on them. The command will fail if none of the output files exist.
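The behaviour described above can be approximated with a few lines of Python. This is a sketch for local files only and not the check_file_list implementation; the function name check_files and the choice of FileNotFoundError as the failure are assumptions.

```python
import os


def check_files(file_list, limit=None, exists=os.path.exists):
    """Check the first `limit` files and return (existing, missing).

    Raises FileNotFoundError if files were checked and none of them exist.
    The `exists` parameter allows swapping in another existence check.
    """
    selected = list(file_list)[:limit]
    existing = [path for path in selected if exists(path)]
    missing = [path for path in selected if not exists(path)]
    if selected and not existing:
        raise FileNotFoundError('none of the %d checked files exist' % len(selected))
    return existing, missing
```

For files in cloud storage, the `exists` argument could be replaced by a check from a library that supports the relevant storage backend.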
