modular framework for parsing, mapping and transforming STS data

Parscival

Description

Parscival is a data parsing and transformation tool. Even though it can potentially interpret any input format and subsequently produce any output format, it was originally designed to process PubMed .nbib files and export them to the CorText Graph format.

Data parsing and transforming is performed according to an experimental specification described in a YAML file. For an example see here.

The output parsed data is saved by default using the HDF5 binary data format. HDF5 is an open source file format that supports large, complex, heterogeneous data. It is designed for fast I/O processing and storage.

To enable parallel (on-the-fly) access to the HDF5 data it produces, Parscival uses klepto, a Python library that provides fast and flexible access to large amounts of storage.

To define how the parsed data is transformed into an arbitrary output format, Parscival implements a lightweight plugin architecture. For example, with the render-template plugin, the output can be described simply as a Jinja template. For an example of how to transform the parsed data into json see here.
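The plugin idea can be illustrated with a minimal sketch. This is not Parscival's actual plugin API: the PLUGINS registry, the register decorator and the render_template function are hypothetical names, the documents are made up, and the standard library's string.Template stands in for Jinja so that the example has no dependencies.

```python
import json
from string import Template

# Hypothetical plugin registry: maps a plugin name to a callable.
PLUGINS = {}


def register(name):
    """Decorator that records a function under a plugin name."""
    def decorator(func):
        PLUGINS[name] = func
        return func
    return decorator


@register("render-template")
def render_template(documents, template_text):
    """Render one line per document (string.Template stands in for Jinja)."""
    template = Template(template_text)
    return "\n".join(template.substitute(doc) for doc in documents)


# Made-up parsed documents, for illustration only.
# json.dumps quotes and escapes the titles so each rendered line is valid JSON.
documents = [
    {"id": 0, "title": json.dumps("First title")},
    {"id": 1, "title": json.dumps("Second title")},
]

# Look the plugin up by name, as a spec-driven tool might do.
output = PLUGINS["render-template"](documents, '{"id": $id, "title": $title}')
print(output)
```

Keeping the output layout in a template rather than in code means a new output format only needs a new template, not a new plugin.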

Install

pip install parscival

Usage

usage: parscival [-h] [--version] [-v] [-vv] FILE_PARSER_SPEC FILE_OUTPUT FILE_DATASET [FILE_DATASET ...]

A modular framework for parsing, mapping and transforming data

positional arguments:
  FILE_PARSER_SPEC     parser specification
  FILE_OUTPUT          parsed data output
  FILE_DATASET         input dataset

optional arguments:
  -h, --help           show this help message and exit
  --version            show program's version number and exit
  -v, --verbose        set loglevel to INFO
  -vv, --very-verbose  set loglevel to DEBUG
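For readers who want to see how such a command line could be declared, here is a sketch using argparse that mirrors the usage text above. It is an assumption about the shape of the interface, not Parscival's source code: the build_parser function, the dest names and the version string are placeholders.

```python
import argparse
import logging


def build_parser():
    """Build a parser that mirrors the usage text (illustrative only)."""
    parser = argparse.ArgumentParser(
        prog="parscival",
        description="A modular framework for parsing, mapping and transforming data",
    )
    # Placeholder version string; the real value comes from the package.
    parser.add_argument("--version", action="version",
                        version="%(prog)s (version placeholder)")
    # Both verbosity flags write to the same destination.
    parser.add_argument("-v", "--verbose", dest="loglevel",
                        action="store_const", const=logging.INFO,
                        help="set loglevel to INFO")
    parser.add_argument("-vv", "--very-verbose", dest="loglevel",
                        action="store_const", const=logging.DEBUG,
                        help="set loglevel to DEBUG")
    parser.add_argument("file_parser_spec", metavar="FILE_PARSER_SPEC",
                        help="parser specification")
    parser.add_argument("file_output", metavar="FILE_OUTPUT",
                        help="parsed data output")
    # nargs="+" accepts one or more input datasets.
    parser.add_argument("file_datasets", metavar="FILE_DATASET", nargs="+",
                        help="input dataset")
    return parser


# Parse an explicit argument list instead of sys.argv, for demonstration.
args = build_parser().parse_args(
    ["-v", "pubmed-nbib.yaml", "/tmp/out.json", "a.nbib", "b.nbib"]
)
print(args.loglevel == logging.INFO, args.file_datasets)
```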

Examples

# converts documents from pesticides-s.nbib into pesticides.cortext.json as described by pubmed-nbib.yaml
parscival -v src/parscival_specs/pubmed-nbib.yaml /tmp/pesticides.cortext.json tests/datasets/pesticides-s.nbib

# converts documents from both pesticides-s.nbib and hetercat-s.nbib into pesticides.cortext.db as described by pubmed-nbib.yaml
parscival -v src/parscival_specs/pubmed-nbib.yaml /tmp/pesticides.cortext.db tests/datasets/pesticides-s.nbib tests/datasets/hetercat-s.nbib

Supported formats

Input

  • PubMed nbib: PubMed is a free search engine accessing primarily the MEDLINE database of references and abstracts on life sciences and biomedical topics. The parsing spec is available here. You can find a more detailed description in the related documentation.
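To give a feel for the input, the sketch below reads the tagged layout used by .nbib files: a short tag, a "- " separator, a value, and indented continuation lines. It is a simplified stand-in written for this README, not the YAML-driven parser that Parscival uses, and the parse_nbib function and the sample records are invented for illustration.

```python
import re

# A tag is two to four capital letters, padded with spaces, then "- ".
TAG_LINE = re.compile(r"^([A-Z]{2,4}) {0,2}- (.*)$")


def parse_nbib(text):
    """Return a list of records; each record maps a tag to a list of values."""
    records = []
    current = None
    last_tag = None
    for line in text.splitlines():
        if not line.strip():
            # A blank line closes the current record.
            current = None
            last_tag = None
            continue
        match = TAG_LINE.match(line)
        if match:
            tag, value = match.groups()
            if current is None:
                current = {}
                records.append(current)
            # Tags such as AU may repeat, so every tag holds a list.
            current.setdefault(tag, []).append(value.strip())
            last_tag = tag
        elif current is not None and last_tag is not None:
            # Indented line: continuation of the previous value.
            current[last_tag][-1] += " " + line.strip()
    return records


# Made-up sample in the tagged layout (not real PubMed data).
SAMPLE = """PMID- 1
TI  - A made-up title that
      wraps onto a second line
AU  - Doe J
AU  - Roe R

PMID- 2
TI  - Another made-up title
"""

records = parse_nbib(SAMPLE)
print(len(records), records[0]["AU"])
```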

Parsing

The intermediate parsing data is stored using the CorText Graph format:

Field      Value                               Type            Description
file       sourceFile(fieldName)               text            source file for the data
id         fieldName.doc[0,n-1]                integer         ID of each document
rank       fieldName.doc[id][0,m-1]            integer         field cardinal index
parserank  fieldName.doc[id][rank][0,p-1]      integer         parsed cardinal index
data       fieldName.doc[id][rank][parserank]  [text,integer]  parsed data
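One way to read the table is that every parsed value of a field is addressed by three indices: the document id, the rank of the value within the field, and the parserank of each piece obtained by parsing that value. The sketch below flattens a nested list into such rows. The to_rows function, the nested-list layout and the sample data are assumptions made for illustration; they are not taken from Parscival's code.

```python
def to_rows(field_docs, source_file):
    """Yield (file, id, rank, parserank, data) tuples for one field.

    field_docs[id][rank][parserank] is assumed to hold one parsed value.
    """
    for doc_id, values in enumerate(field_docs):
        for rank, parsed in enumerate(values):
            for parserank, data in enumerate(parsed):
                yield (source_file, doc_id, rank, parserank, data)


# Made-up "author" field for two documents: the first document has two
# authors, and each author value is split into two parsed pieces.
authors = [
    [["Doe", "J"], ["Roe", "R"]],
    [["Poe", "E"]],
]

rows = list(to_rows(authors, "example.nbib"))
for row in rows:
    print(row)
```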

Output

  • cortext.json: Parsed data is converted to json using the cortext.json template

  • cortext.sqlite: Parsed data is converted to a sqlite script using the cortext.sqlite template. If requested by the parsing spec, the resulting sqlite script can be interpreted and thus converted to a binary database.
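Interpreting a sqlite script into a database can be done with Python's standard sqlite3 module, as in the sketch below. The script content, the title table and the script_to_database function are hypothetical; the script that the cortext.sqlite template actually produces may look different.

```python
import sqlite3

# Hypothetical script, shaped after the CorText Graph fields listed above.
SCRIPT = """
CREATE TABLE title (file TEXT, id INTEGER, rank INTEGER, parserank INTEGER, data TEXT);
INSERT INTO title VALUES ('example.nbib', 0, 0, 0, 'A made-up title');
INSERT INTO title VALUES ('example.nbib', 1, 0, 0, 'Another made-up title');
"""


def script_to_database(script, path=":memory:"):
    """Execute a multi-statement SQL script and return the open connection."""
    connection = sqlite3.connect(path)
    # executescript runs every statement in the script in order.
    connection.executescript(script)
    connection.commit()
    return connection


# ":memory:" keeps the demo off disk; pass a file path to get a binary database.
connection = script_to_database(SCRIPT)
count = connection.execute("SELECT COUNT(*) FROM title").fetchone()[0]
print(count)
```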

Requirements

Parscival has been set up using PyScaffold, a project generator for bootstrapping high-quality Python packages. For details and usage information on PyScaffold see https://pyscaffold.org.

This project uses PyScaffold in combination with Tox, a generic virtualenv management and test command line tool acting as frontend to Continuous Integration servers. A list with all the available tasks is obtained via the tox -av command.

To prepare your environment you will need to install the following dependencies:

pip install -U pip setuptools
pip install -U tox

Deployment

virtualenv .venv
source .venv/bin/activate
# ... edit setup.cfg to add dependencies ...
pip install -e .
tox

# to compile docs
tox -e docs

# to build distribution
tox -e build

Credits

Parscival is being developed by the CorTexT Platform and Cogniteva SAS.
