
This package preprocesses OpenCitations source dumps so that they are easily usable in the main OpenCitations processes, by deleting unused information, splitting big files, and validating identifiers.


OpenCitations: Preprocess

This software preprocesses data dumps, provided by different data sources, before they are ingested into OpenCitations. Its aim is to facilitate data parsing and extraction in the OpenCitations Meta and OpenCitations Index processes. Preprocessing is not a mandatory step of data ingestion in OpenCitations, but it is recommended when:

  1. A substantial part of the bibliographic entities represented in the dump come without citation data
  2. The dump content is redundant with respect to the scope of OpenCitations (e.g. duplicated citations that can be retrieved both as addressed and as received citations)
  3. The dump consists of a single large file that is too heavy to be processed all at once
  4. A substantial part of the data provided is not relevant to the scope of OpenCitations (e.g. discipline-specific and content-related metadata)
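
To make cases 1, 3 and 4 concrete, here is a minimal sketch of what such a preprocessing step could look like. It does not use this package's API: the function names, the JSON Lines input format, and the `id` and `references` field names are all assumptions made for illustration only.

```python
import json
from pathlib import Path
from typing import Iterable, Iterator, List, Tuple


def preprocess_records(
    lines: Iterable[str],
    keep_fields: Tuple[str, ...] = ("id", "references"),  # hypothetical field names
) -> Iterator[dict]:
    """Yield reduced records, skipping entities that carry no citation data."""
    for line in lines:
        line = line.strip()
        if not line:
            continue
        record = json.loads(line)
        # Case 1: drop entities without citation data (missing or empty list).
        if not record.get("references"):
            continue
        # Case 4: keep only the fields assumed to be relevant.
        yield {key: record[key] for key in keep_fields if key in record}


def _write_chunk(chunk: List[dict], out_dir: Path, index: int) -> Path:
    """Write one chunk of records as a JSON Lines file and return its path."""
    path = out_dir / f"chunk_{index:05d}.jsonl"
    with path.open("w", encoding="utf-8") as handle:
        for record in chunk:
            handle.write(json.dumps(record) + "\n")
    return path


def split_dump(lines: Iterable[str], out_dir: str, chunk_size: int = 1000) -> List[Path]:
    """Case 3: split one large dump into files of at most chunk_size records."""
    target = Path(out_dir)
    target.mkdir(parents=True, exist_ok=True)
    written: List[Path] = []
    chunk: List[dict] = []
    for record in preprocess_records(lines):
        chunk.append(record)
        if len(chunk) == chunk_size:
            written.append(_write_chunk(chunk, target, len(written)))
            chunk = []
    if chunk:  # flush the last, possibly smaller, chunk
        written.append(_write_chunk(chunk, target, len(written)))
    return written
```

Because `split_dump` accepts any iterable of lines, an open file handle can be passed directly, so a large dump is read line by line rather than loaded into memory at once.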

Mandatory

  • Python 3.8+

Start the tests

$ python -m unittest discover -s ./preprocessing/test -p "*.py"

License

OpenCitations Index is released under the ISC License.


Source Distribution

oc_preprocessing-0.0.5.tar.gz (23.0 kB)

Uploaded Source

Built Distribution

oc_preprocessing-0.0.5-py3-none-any.whl (32.4 kB)

Uploaded Python 3

File details

Details for the file oc_preprocessing-0.0.5.tar.gz.

File metadata

  • Download URL: oc_preprocessing-0.0.5.tar.gz
  • Size: 23.0 kB
  • Tags: Source
  • Uploaded using Trusted Publishing? No
  • Uploaded via: twine/4.0.2 CPython/3.11.0

File hashes

Hashes for oc_preprocessing-0.0.5.tar.gz
Algorithm Hash digest
SHA256 59e1c08f1f71ba96c9ae8cf662cde49bbe5c0d1386d318c3171564e57b164fba
MD5 d5c96cadacbf593e757f731081cf9cce
BLAKE2b-256 1b02de710bec42e662155c89650d8bafc9b9ca2d40f7fe127ac7b220435d528b


File details

Details for the file oc_preprocessing-0.0.5-py3-none-any.whl.

File hashes

Hashes for oc_preprocessing-0.0.5-py3-none-any.whl
Algorithm Hash digest
SHA256 c011eef4e1253c03445a4140ab7c111d6f0b9dd02c885987e0052ebadd2e2aca
MD5 ef4cfa776753c24b846e839a86792a96
BLAKE2b-256 0fec632c6ac97235fc0b360fa215144071cf2dd977d3ec47da3da9f62f0db721

