ScienceBeam Utils
Provides utility functions related to the ScienceBeam project.
Please refer to the development documentation if you wish to contribute to the project.
Most tools are not yet documented. Please feel free to browse the code or tests, or raise an issue.
Pre-requisites
- Python 3
- Apache Beam
Apache Beam may be used for preprocessing, but also for its transparent FileSystems API, which makes it easy to access files in the cloud.
Install
pip install apache_beam[gcp]
pip install sciencebeam-utils
CLI Tools
Find File Pairs
The preferred input layout is a directory containing a gzipped PDF (.pdf.gz) and gzipped XML (.nxml.gz), e.g.:
- manuscript_1/
  - manuscript_1.pdf.gz
  - manuscript_1.nxml.gz
- manuscript_2/
  - manuscript_2.pdf.gz
  - manuscript_2.nxml.gz
Using compressed files is optional but recommended to reduce file storage cost.
The parent directory per manuscript is optional. Without it, the PDF and XML file names must be identical before the extension (which is recommended in general).
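The pairing rule described above can be illustrated with a small, standard-library-only sketch. It is not the implementation used by find_file_pairs; the function name `find_file_pairs` and its parameters are hypothetical and only mirror the two cases in the text: one PDF and one XML per manuscript directory, or matching names before the extension.

```python
import posixpath
from collections import defaultdict


def find_file_pairs(paths, source_suffix='.pdf.gz', xml_suffix='.nxml.gz'):
    """Pair source and XML paths (illustrative sketch, not the tool's code)."""
    # Group the source and XML files by their parent directory.
    by_dir = defaultdict(lambda: ([], []))
    for path in sorted(paths):
        sources, xmls = by_dir[posixpath.dirname(path)]
        if path.endswith(source_suffix):
            sources.append(path)
        elif path.endswith(xml_suffix):
            xmls.append(path)

    pairs = []
    for dirname in sorted(by_dir):
        sources, xmls = by_dir[dirname]
        if len(sources) == 1 and len(xmls) == 1:
            # One manuscript per directory: the file names may differ.
            pairs.append((sources[0], xmls[0]))
            continue
        # Shared directory: the name before the extension must be identical.
        xml_by_stem = {p[:-len(xml_suffix)]: p for p in xmls}
        for source in sources:
            xml = xml_by_stem.get(source[:-len(source_suffix)])
            if xml:
                pairs.append((source, xml))
    return pairs
```

Files without a matching counterpart are silently skipped in this sketch; a real process may prefer to report them.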
Run:
python -m sciencebeam_utils.tools.find_file_pairs \
--data-path <source directory> \
--source-pattern '*.pdf.gz' --xml-pattern '*.nxml.gz' \
--out <output file list csv/tsv>
e.g.:
python -m sciencebeam_utils.tools.find_file_pairs \
--data-path gs://some-bucket/some-dataset \
--source-pattern '*.pdf.gz' --xml-pattern '*.nxml.gz' \
--out gs://some-bucket/some-dataset/file-list.tsv
That will create the TSV (tab separated) file file-list.tsv with the following columns:
- source_url
- xml_url
That file could also be generated using any other preferred method.
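As one possible alternative, the file list could be written with Python's csv module. The sketch below only assumes the two column names listed above; the helper names `write_file_list` and `read_file_list` are hypothetical and not part of sciencebeam-utils.

```python
import csv


def write_file_list(pairs, out):
    """Write (source_url, xml_url) pairs as a tab separated file list."""
    writer = csv.writer(out, delimiter='\t', lineterminator='\n')
    writer.writerow(['source_url', 'xml_url'])  # header row with the column names
    writer.writerows(pairs)


def read_file_list(fp):
    """Read a tab separated file list back into a list of dicts."""
    return list(csv.DictReader(fp, delimiter='\t'))
```

For a local file this could be used as `with open('file-list.tsv', 'w', newline='') as out: write_file_list(pairs, out)`.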
Split File List
To split the file list into training, validation and test datasets, the following script can be used:
python -m sciencebeam_utils.tools.split_csv_dataset \
--input <csv/tsv file list> \
--train 0.5 --validation 0.2 --test 0.3 --random --fill
e.g.:
python -m sciencebeam_utils.tools.split_csv_dataset \
--input gs://some-bucket/some-dataset/file-list.tsv \
--train 0.5 --validation 0.2 --test 0.3 --random --fill
That will create three separate files in the same directory:
file-list-train.tsv
file-list-validation.tsv
file-list-test.tsv
The file pairs will be randomly selected (--random), and one group will also include all remaining file pairs that would otherwise not be included due to rounding (--fill).
As with the previous step, you may decide to use your own process instead.
Note: once you have started using those files, they should no longer be changed.
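For those who prefer their own process, the effect of the ratios, --random and --fill can be sketched as follows. This is a minimal illustration rather than the tool's implementation: the function `split_rows` is hypothetical, sizes are assumed to be rounded down, and the choice of the last group as the one receiving the remainder is an assumption.

```python
import random


def split_rows(rows, ratios, shuffle=False, fill=False, seed=None):
    """Split rows into consecutive chunks sized by the given ratios."""
    rows = list(rows)
    if shuffle:
        # A fixed seed keeps the split reproducible between runs.
        random.Random(seed).shuffle(rows)
    # Rounding down may leave a few rows unassigned.
    sizes = [int(len(rows) * ratio) for ratio in ratios]
    if fill:
        # Assumption: the last group takes whatever is left over.
        sizes[-1] += len(rows) - sum(sizes)
    chunks = []
    start = 0
    for size in sizes:
        chunks.append(rows[start:start + size])
        start += size
    return chunks
```

With seven rows and the ratios 0.5, 0.2 and 0.3, the rounded-down sizes are 3, 1 and 2, so one row would be dropped unless fill is enabled.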
Get Output Files
Since ScienceBeam is intended to convert files, there will be output files. To make the output filenames explicit, they are also kept in a file list. This tool generates that file list (for this purpose, it doesn't matter whether the files actually exist).
e.g.
python -m sciencebeam_utils.tools.get_output_files \
--source-file-list path/to/source/file-list-train.tsv \
--source-file-column=source_url \
--output-file-suffix=.xml \
--output-file-list path/to/results/file-list.lst
By adding the --check argument, it will check whether the output files exist (see below).
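To give a rough idea of what such an output file list contains, the sketch below derives an output path from a source path. The exact naming rule used by get_output_files is not documented here, so everything in this example is an assumption: the function `get_output_file`, the idea of keeping the path relative to a source base directory, and the stripping of `.gz` plus one further extension before appending the suffix.

```python
import posixpath


def get_output_file(source_url, source_base, output_base, suffix):
    """Map a source path to an output path (assumed naming, for illustration)."""
    base = source_base.rstrip('/') + '/'
    if not source_url.startswith(base):
        raise ValueError('%r is not below %r' % (source_url, source_base))
    relative = source_url[len(base):]
    # Drop the compression extension, then the document extension.
    if relative.endswith('.gz'):
        relative = relative[:-len('.gz')]
    stem, _ = posixpath.splitext(relative)
    return posixpath.join(output_base, stem + suffix)
```

Under these assumptions, `path/to/source/manuscript_1/manuscript_1.pdf.gz` with the suffix `.xml` would map to `path/to/results/manuscript_1/manuscript_1.xml`.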
Check File List
After generating an output file list, this tool can be used to check whether the output files exist or are complete.
e.g.
python -m sciencebeam_utils.tools.check_file_list \
--file-list path/to/results/file-list.lst \
--file-column=source_url \
--limit=100
This will check the first 100 output files and report on them. The command will fail if none of the output files exist.
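The behaviour described above (check up to a limit, report, fail if nothing exists) can be sketched for local files as follows. This is not the code behind check_file_list; the function `check_files` and its `exists` parameter are hypothetical, and the sketch uses `os.path.exists`, so it would not see files in cloud storage.

```python
import os


def check_files(file_list, limit=None, exists=os.path.exists):
    """Count existing files among the first `limit` entries of a file list."""
    checked = file_list[:limit]  # limit=None checks every entry
    missing = [path for path in checked if not exists(path)]
    found = len(checked) - len(missing)
    if checked and found == 0:
        # Mirror the described behaviour: fail if none of the files exist.
        raise RuntimeError('none of the %d checked files exist' % len(checked))
    return found, missing
```

The returned count and list of missing paths can then be printed or logged as a report.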