Parscival

Description

Parscival is a modular framework designed to ingest, parse, map, curate, validate, and store heterogeneous textual data. It is especially tailored to handle Scientific and Technical Sources (STS) and can export them to arbitrary formats such as JSON, SQLite, or custom templates.

Parscival is not limited to parsing; it enables a complete, adaptive orchestration of the data lifecycle, from ingestion to transformation, curation to validation, and finally to structured export.

Data parsing and transformation are performed according to an experimental specification described in a YAML file. For an example, see here.
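The actual schema lives in the project's YAML specification files; as a rough illustration only, the lifecycle stages named above can be pictured as a nested structure. The keys and values below are hypothetical, not Parscival's real schema:

```python
# Hypothetical sketch of a pipeline specification, expressed as a Python
# dict for illustration -- the real schema is defined in the YAML spec files.
spec = {
    "ingest": {"format": "nbib"},              # how raw files are read
    "parse": {"keys": ["TI", "AB", "AU"]},     # fields to extract
    "map": {"TI": "title", "AB": "abstract"},  # source -> target names
    "curate": {"strip_whitespace": True},      # cleanup rules
    "validate": {"required": ["title"]},       # integrity checks
    "store": {"output": "cortext.json"},       # export target
}

# The stage order mirrors the lifecycle described above.
stages = list(spec)
print(stages)
```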

The output data is saved by default using the HDF5 binary data format. HDF5 is an open-source file format that supports large, complex, heterogeneous data and is designed for fast I/O and efficient storage.

To enable parallel (on-the-fly) access to the HDF5 data produced, Parscival uses klepto, a Python library that provides fast and flexible access to large amounts of storage.

To define how the data is transformed into an arbitrary output format, Parscival implements a lightweight plugin architecture. For example, by using the render-template plugin, the output can be described simply as a Jinja template. For an example of how to transform the data into JSON, see here.
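The plugin mechanism itself is internal to Parscival; purely as a hedged sketch of how a lightweight renderer registry of this kind could work (the names and signatures here are illustrative assumptions, not Parscival's actual plugin API):

```python
import json

# Illustrative sketch of a lightweight plugin registry -- names and
# signatures are assumptions, not Parscival's real plugin interface.
RENDERERS = {}

def register_renderer(name):
    """Decorator registering an output renderer under a given name."""
    def wrap(func):
        RENDERERS[name] = func
        return func
    return wrap

@register_renderer("json")
def render_json(records):
    # Serialize the intermediate records as pretty-printed JSON.
    return json.dumps(records, indent=2)

records = [{"id": 0, "data": "example"}]
output = RENDERERS["json"](records)
print(output)
```

New output formats would then amount to registering one more function, which is the appeal of a plugin design like this.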

Install

pip install parscival

Usage

usage: parscival [-h] [--job-id JOB_ID] [--version] [--with-config [CONFIGURATION_FILES ...]] [-v] [-vv] [-vvv] FILE_PARSER_SPEC FILE_OUTPUT FILE_DATASET [FILE_DATASET ...]

A modular framework for ingesting, parsing, mapping, curating, validating and storing data

positional arguments:
  FILE_PARSER_SPEC      parscival specification
  FILE_OUTPUT           processed data output
  FILE_DATASET          input dataset

options:
  -h, --help            show this help message and exit
  --job-id JOB_ID       job identifier for logging
  --version             show program's version number and exit
  --with-config [CONFIGURATION_FILES ...]
                        YAML configuration files
  -v, --verbose         set loglevel to INFO
  -vv, --very-verbose   set loglevel to DEBUG
  -vvv, --very-very-verbose
                        set loglevel to TRACE

Examples

# converts documents from pesticides-s.nbib into pesticides.cortext.json as described by pubmed-nbib.yaml
parscival -v src/parscival_specs/pubmed/pubmed-nbib.yaml /tmp/pesticides.cortext.json tests/datasets/pesticides-s.nbib

# converts documents from both pesticides-s.nbib and hetercat-s.nbib into pesticides.db as described by pubmed-nbib.yaml
parscival -v src/parscival_specs/pubmed/pubmed-nbib.yaml /tmp/pesticides.cortext.db tests/datasets/pesticides-s.nbib tests/datasets/hetercat-s.nbib

# converts documents from the HTML dataset file europresse-sample1.html into JSON output file /tmp/test.cortext.json
# uses the parsing specification file europresse-html.yaml
# additionally, loads a supplementary YAML configuration (--with-config) from the file europress-args.yaml
parscival --with-config tests/datasets/europresse-html/europress-args.yaml -v src/parscival_specs/europresse/europresse-html.yaml /tmp/test.cortext.json tests/datasets/europress/europresse-sample1.html

Supported formats

Sources

  • PubMed (.nbib) : PubMed is a free search engine accessing primarily the MEDLINE database of references and abstracts on life sciences and biomedical topics. The parsing spec is available here. You can find a more detailed description in the related documentation.

  • Europresse (.html) : Europresse is a comprehensive database providing access to a vast range of news and information from various sources. The parsing specification allows for extracting structured data from Europresse HTML files.

Intermediate data

The intermediate data is stored using the CorText Graph format:

Field      Value                                Type            Description
file       sourceFile(fieldName)                text            source file for the data
id         fieldName.doc[0,n-1]                 integer         ID of each document
rank       fieldName.doc[id][0,m-1]             integer         field cardinal index
parserank  fieldName.doc[id][rank][0,p-1]       integer         parsed cardinal index
data       fieldName.doc[id][rank][parserank]   [text,integer]  parsed data
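Read in Python terms, each row of the table above is one flattened record of parsed data. A hedged sketch of that flattening (the field names follow the table; the function and sample values are illustrative):

```python
# Sketch of the intermediate-data layout described by the table above.
# Each record carries the source file, document id, field rank, and
# parse rank alongside the parsed value itself.
def flatten(source_file, documents):
    """Flatten nested parsed documents into CorText-style rows."""
    rows = []
    for doc_id, doc in enumerate(documents):           # id: 0..n-1
        for rank, values in enumerate(doc):            # rank: 0..m-1
            for parserank, data in enumerate(values):  # parserank: 0..p-1
                rows.append({
                    "file": source_file,
                    "id": doc_id,
                    "rank": rank,
                    "parserank": parserank,
                    "data": data,
                })
    return rows

rows = flatten("pesticides-s.nbib", [[["glyphosate", "atrazine"]]])
print(rows)
```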

Output

  • cortext.json: intermediate data is converted to JSON using the cortext.json template

  • cortext.sqlite: intermediate data is converted to an SQLite script using the cortext.sqlite template. If requested by the processing spec, the resulting SQLite script can be interpreted and thus converted to a binary database.
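The real cortext.json conversion goes through a Jinja template; as a stdlib-only sketch of the same idea, grouping flat intermediate rows by document id before serializing (the grouping logic and sample rows are assumptions for illustration):

```python
import json
from collections import defaultdict

# Stdlib-only sketch of a JSON export step: group flat intermediate rows
# by document id, then serialize. The real converter renders a Jinja template.
rows = [
    {"file": "sample.nbib", "id": 0, "rank": 0, "parserank": 0, "data": "title A"},
    {"file": "sample.nbib", "id": 1, "rank": 0, "parserank": 0, "data": "title B"},
]

by_doc = defaultdict(list)
for row in rows:
    by_doc[row["id"]].append(row["data"])

print(json.dumps(by_doc))
```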

Requirements

Parscival has been set up using PyScaffold, a project generator for bootstrapping high-quality Python packages. For details and usage information on PyScaffold see https://pyscaffold.org.

This project uses PyScaffold in combination with Tox, a generic virtualenv management and test command-line tool acting as a frontend to Continuous Integration servers. A list of all available tasks can be obtained via the tox -av command.

To prepare your environment you will need to install the following dependencies:

pip install -U pip setuptools
pip install -U tox

Development

To facilitate development, you can use Docker to run Parscival and set up a remote debugging environment.

Running the Docker Container

You can run the Docker container with the following command:

docker run -it \
    -v ./test:/tmp/test \
    parscival

This command will:

  • Start an interactive terminal session within the Docker container.
  • Mount the ./test directory from your host machine to /tmp/test in the container.

Building Documentation

To build the project documentation inside the Docker container using tox, execute the following command from your host machine:

docker run -it \
    -v ./docs/_build/html:/app/parscival/docs/_build/html \
    parscival \
    tox -e docs

This command will:

  • Start a terminal session within the Docker container.
  • Mount the ./docs/_build/html directory from your host machine to /app/parscival/docs/_build/html in the container.
  • Run tox -e docs inside the container to build the documentation.

Deployment

virtualenv .venv
source .venv/bin/activate
# ... if needed, edit setup.cfg to add dependencies ...
pip install .
tox

# to build distribution
tox -e build

Documentation

To compile the Parscival documentation, run:

tox -e docs

Dependencies

  • libhdf5-dev: Provides the development files for the HDF5 (Hierarchical Data Format version 5) library. HDF5 is designed to store and organize large amounts of data, making it suitable for high-performance data processing applications.

  • Python >= 3.9: Ensures compatibility with Parscival >= 0.7. This version supports the necessary libraries and features used in the project.

Environment variables

  • PARSCIVAL_PLUGINS_PATHS: Specifies the directories where Parscival should look for plugins.

  • PARSCIVAL_PLUGIN_RENDER_TEMPLATE_DIR: Specifies the directory where Parscival should look for default rendering templates used by plugins.

  • PARSCIVAL_LOG_PATH: Specifies the directory where Parscival should keep its logging activity.
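From Python, these variables are read like any other environment variables; a small sketch with fallback defaults (the default values here are illustrative assumptions, not Parscival's actual defaults):

```python
import os

# Read Parscival's environment variables, falling back to illustrative
# defaults when they are unset (the defaults below are assumptions).
plugins_paths = os.environ.get("PARSCIVAL_PLUGINS_PATHS", "").split(os.pathsep)
templates_dir = os.environ.get("PARSCIVAL_PLUGIN_RENDER_TEMPLATE_DIR", "templates")
log_path = os.environ.get("PARSCIVAL_LOG_PATH", "/tmp/parscival-logs")

print(plugins_paths, templates_dir, log_path)
```

Splitting PARSCIVAL_PLUGINS_PATHS on `os.pathsep` assumes a PATH-like, colon- or semicolon-separated list of directories.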

Learn more

To learn more about Parscival, compile the documentation by executing the following command: tox -e docs

Alternatively, you may directly refer to some raw documentation pages linked below:

General

Plugins

Parscival specification examples

Credits

Parscival is being developed by the CorTexT Platform and Cogniteva SAS.
