modular framework for parsing, mapping and transforming STS data
Parscival
Description
Parscival is a modular framework for ingesting, parsing, mapping, curating, validating and storing textual data. It was originally designed to process STS inputs and export them to any arbitrary format.
Parsing and transformation are performed according to an experimental specification described in a YAML file. For an example see here.
The output data is saved by default using the HDF5 binary data format. HDF5 is an open source file format that supports large, complex, heterogeneous data. It is designed for fast I/O processing and storage.
To enable parallel (on-the-fly) access to the HDF5 data produced, Parscival uses klepto, a Python library that provides fast and flexible access to large amounts of storage.
To define how the data is transformed into an arbitrary output format, Parscival implements a lightweight plugin architecture. For example, with the render-template plugin, the output can simply be described as a Jinja template. For an example of how to transform the data into JSON see here.
Install
pip install parscival
Usage
usage: parscival [-h] [--job-id JOB_ID] [--version] [-v] [-vv] FILE_PARSER_SPEC FILE_OUTPUT FILE_DATASET [FILE_DATASET ...]
A modular framework for ingesting, parsing, mapping, curating, validating and storing heterogeneous data
positional arguments:
FILE_PARSER_SPEC parscival specification
FILE_OUTPUT processed data output
FILE_DATASET input dataset
options:
-h, --help show this help message and exit
--job-id JOB_ID job identifier for logging
--version show program's version number and exit
-v, --verbose set loglevel to INFO
-vv, --very-verbose set loglevel to DEBUG
Examples
# converts documents from pesticides-s.nbib into pesticides.cortext.json as described by pubmed-nbib.yaml
parscival -v src/parscival_specs/pubmed/pubmed-nbib.yaml /tmp/pesticides.cortext.json tests/datasets/pesticides-s.nbib
# converts documents from both pesticides-s.nbib and hetercat-s.nbib into pesticides.cortext.db as described by pubmed-nbib.yaml
parscival -v src/parscival_specs/pubmed/pubmed-nbib.yaml /tmp/pesticides.cortext.db tests/datasets/pesticides-s.nbib tests/datasets/hetercat-s.nbib
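The same invocations can be scripted from Python. The sketch below only assembles the argument list from the usage text above and hands it to `subprocess.run`. The helper names `build_parscival_command` and `run_parscival` are hypothetical, not part of Parscival, and running the command assumes the `parscival` executable is on `PATH`.

```python
import subprocess
from typing import List, Optional, Sequence


def build_parscival_command(
    spec: str,
    output: str,
    datasets: Sequence[str],
    verbose: bool = False,
    job_id: Optional[str] = None,
) -> List[str]:
    """Assemble a parscival command line following the documented usage."""
    if not datasets:
        raise ValueError("at least one FILE_DATASET is required")
    cmd = ["parscival"]
    if job_id is not None:
        cmd += ["--job-id", job_id]
    if verbose:
        cmd.append("-v")
    # Positional order: FILE_PARSER_SPEC FILE_OUTPUT FILE_DATASET [FILE_DATASET ...]
    cmd += [spec, output, *datasets]
    return cmd


def run_parscival(cmd: List[str]) -> int:
    """Run the command and return its exit code (needs parscival installed)."""
    return subprocess.run(cmd, check=False).returncode


if __name__ == "__main__":
    command = build_parscival_command(
        "src/parscival_specs/pubmed/pubmed-nbib.yaml",
        "/tmp/pesticides.cortext.json",
        ["tests/datasets/pesticides-s.nbib"],
        verbose=True,
    )
    # Only print the command here; call run_parscival(command) to execute it.
    print(" ".join(command))
```

Keeping command construction separate from execution makes the argument order easy to check without having Parscival installed.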
Supported formats
Sources
- PubMed (.nbib): PubMed is a free search engine accessing primarily the MEDLINE database of references and abstracts on life sciences and biomedical topics. The parsing spec is available here. You can find a more detailed description in the related documentation.
- Europresse (.html): Europresse is a comprehensive database providing access to a vast range of news and information from various sources. The parsing specification allows for extracting structured data from Europresse HTML files.
Intermediate data
The intermediate data is stored using the CorText Graph format:
Field | Value | Type | Description |
---|---|---|---|
file | sourceFile(fieldName) | text | source file for the data |
id | fieldName.doc[0,n-1] | integer | ID of each document |
rank | fieldName.doc[id][0,m-1] | integer | field cardinal index |
parserank | fieldName.doc[id][rank][0,p-1] | integer | parsed cardinal index |
data | fieldName.doc[id][rank][parserank] | [text,integer] | parsed data |
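To make the indexing concrete, the sketch below flattens a nested Python structure into rows carrying the five columns of the table above. The nested-list layout, the function name `flatten_field`, and the sample values are assumptions made for illustration; they are not Parscival's internal representation.

```python
from typing import Iterator, List, Tuple, Union

Value = Union[str, int]
Row = Tuple[str, int, int, int, Value]


def flatten_field(source_file: str, docs: List[List[List[Value]]]) -> Iterator[Row]:
    """Yield (file, id, rank, parserank, data) rows for one field.

    docs[id][rank][parserank] mirrors fieldName.doc[id][rank][parserank].
    """
    for doc_id, ranks in enumerate(docs):
        for rank, parsed in enumerate(ranks):
            for parserank, data in enumerate(parsed):
                yield (source_file, doc_id, rank, parserank, data)


# Hypothetical "author" field: each author value is split into two parsed parts.
author_docs = [
    [["Smith", "J"], ["Doe", "A"]],  # document 0: two authors
    [["Martin", "C"]],               # document 1: one author
]

rows = list(flatten_field("pesticides-s.nbib", author_docs))
for row in rows:
    print(row)
```

Here `id` counts documents, `rank` counts the values of the field within a document, and `parserank` counts the pieces obtained from parsing a single value.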
Output
- cortext.json: intermediate data is converted to json using the cortext.json template
- cortext.sqlite: intermediate data is converted to a sqlite script using the cortext.sqlite template. If requested by the processing spec, the resulting sqlite script can be interpreted and thus converted to a binary database.
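As a rough illustration of what interpreting a sqlite script means, the sketch below executes a small SQL script with Python's standard `sqlite3` module and writes a binary database file. The script contents, the table name `title`, and the helper `script_to_database` are hypothetical; the real script is whatever the cortext.sqlite template renders, and Parscival's own conversion step may work differently.

```python
import os
import sqlite3
import tempfile

# Hypothetical script; columns follow the intermediate data table.
SCRIPT = """
CREATE TABLE title (file TEXT, id INTEGER, rank INTEGER, parserank INTEGER, data TEXT);
INSERT INTO title VALUES ('pesticides-s.nbib', 0, 0, 0, 'An example title');
INSERT INTO title VALUES ('pesticides-s.nbib', 1, 0, 0, 'Another example title');
"""


def script_to_database(script: str, db_path: str) -> None:
    """Execute a SQL script against db_path, creating the binary database."""
    conn = sqlite3.connect(db_path)
    try:
        conn.executescript(script)
        conn.commit()
    finally:
        conn.close()


with tempfile.TemporaryDirectory() as tmp:
    db_path = os.path.join(tmp, "example.db")
    script_to_database(SCRIPT, db_path)
    # Reopen the file to check that the rows were persisted.
    conn = sqlite3.connect(db_path)
    count = conn.execute("SELECT COUNT(*) FROM title").fetchone()[0]
    conn.close()

print(count)
```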
Requirements
Parscival has been set up using PyScaffold, a project generator for bootstrapping high-quality Python packages. For details and usage information on PyScaffold see https://pyscaffold.org.
This project uses PyScaffold in combination with Tox, a generic virtualenv management
and test command line tool acting as frontend to Continuous Integration servers.
A list with all the available tasks is obtained via the tox -av command.
To prepare your environment you will need to install the following dependencies:
pip install -U pip setuptools
pip install -U tox
Development
To facilitate development, you can use Docker to run Parscival and set up a remote debugging environment.
Running the Docker Container
You can run the Docker container with the following command:
docker run -it \
-v ./test:/tmp/test \
parscival
This command will:
- Start an interactive terminal session within the Docker container.
- Mount the ./test directory from your host machine to /tmp/test in the container.
Building Documentation
To build the project documentation inside the Docker container using tox, you can execute the following command from your host machine:
docker run -it \
-v ./docs/_build/html:/app/parscival/docs/_build/html \
parscival \
tox -e docs
This command will:
- Start an interactive terminal session within the Docker container.
- Mount the ./docs/_build/html directory from your host machine to /app/parscival/docs/_build/html in the container.
Deployment
virtualenv .venv
source .venv/bin/activate
# ... if needed, edit setup.cfg to add dependencies ...
pip install .
tox
# to build distribution
tox -e build
Documentation
To compile the Parscival documentation, run:
tox -e docs
Dependencies
- libhdf5-dev: Provides the development files for the HDF5 (Hierarchical Data Format version 5) library. HDF5 is designed to store and organize large amounts of data, making it suitable for high-performance data processing applications.
- Python >= 3.9: Ensures compatibility with Parscival >= 0.7. This version supports the necessary libraries and features used in the project.
Environment variables
- PARSCIVAL_PLUGINS_PATHS: Specifies the directories where Parscival should look for plugins.
- PARSCIVAL_PLUGIN_RENDER_TEMPLATE_DIR: Specifies the directory where Parscival should look for default rendering templates used by plugins.
- PARSCIVAL_LOG_PATH: Specifies the directory where Parscival should keep the logging activity.
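When Parscival is launched from a script, these variables can be set for the child process only. The sketch below builds such an environment; the helper `parscival_env` and the example paths are hypothetical. How several plugin directories should be separated in PARSCIVAL_PLUGINS_PATHS is not stated here, so the sketch passes the value through unchanged.

```python
import os
from typing import Dict, Mapping, Optional


def parscival_env(
    plugins_paths: Optional[str] = None,
    template_dir: Optional[str] = None,
    log_path: Optional[str] = None,
    base: Optional[Mapping[str, str]] = None,
) -> Dict[str, str]:
    """Return a copy of the environment with the Parscival variables set."""
    env = dict(os.environ if base is None else base)
    overrides = {
        "PARSCIVAL_PLUGINS_PATHS": plugins_paths,
        "PARSCIVAL_PLUGIN_RENDER_TEMPLATE_DIR": template_dir,
        "PARSCIVAL_LOG_PATH": log_path,
    }
    for name, value in overrides.items():
        # Leave a variable untouched when no value is given for it.
        if value is not None:
            env[name] = value
    return env


# Example paths are placeholders; pass the result as env= to subprocess.run.
env = parscival_env(log_path="/tmp/parscival-logs", base={})
print(env)
```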
Learn more
To learn more about Parscival, compile the documentation by executing the following command: tox -e docs
Alternatively, you may directly refer to some raw documentation pages linked below:
General
- How to process HTML documents with Parscival
- How to process plain text key-value documents with Parscival
Plugins
Parscival specification examples
Credits
Parscival is being developed by the CorTexT Platform and Cogniteva SAS.