A tool for removing invalid rows from an OpenCitations metadata or citations table based on the table's validation report.

oc_pruner

A tool for removing rows from an OpenCitations metadata or citations table based on the table's validation report, with support for running complete validation and pruning pipelines.

Features

Selective filtering: Filter by issue type (all issues, or errors only) and/or ignore specific error labels
Flexible configuration: Configure via CLI arguments or configuration files
Row-level deletion: Removes entire rows containing issues
Verbose output: Detailed information about processing when needed
Complete pipeline: Run validation + pruning pipeline with multiple rounds for thorough cleaning

Quick Start

Run the Complete Pipeline

Run a full validation and pruning pipeline for metadata and citations files:

oc_pruner pipeline --meta metadata.csv --cits citations.csv --out-dir output_dir

This will:

Validate both files
Remove invalid rows
Re-validate the cleaned files
Repeat the process to catch any newly exposed issues
Perform a final validation check

The pipeline subcommand takes no filtering options. For more control, the following sections show how to prune a single CSV table (either metadata or citations) using its existing validation report.

Prune a Single Table Based On Its Existing Validation Report

Remove all issues (errors and warnings) from a CSV file:

oc_pruner --csv input.csv --report report.json --output output.csv

Or use the explicit prune subcommand:

oc_pruner prune --csv input.csv --report report.json --output output.csv

With Verbose Output

See detailed information about what's being processed:

oc_pruner prune --csv input.csv --report report.json --output output.csv --verbose

Configuration

CLI Arguments for `pipeline` mode (`pipeline` subcommand)

Argument	Abbreviation	Required	Description
`--meta PATH`	`-m`	Yes	Path to the input metadata CSV file
`--cits PATH`	`-c`	Yes	Path to the input citations CSV file
`--out-dir PATH`	`-o`	Yes	Path to the base output directory for the cleaned files and validation reports

CLI Arguments for single document mode (`prune` subcommand)

Argument	Abbreviation	Required	Description
`--csv PATH`	`-t`	Yes	Path to the input CSV file
`--report PATH`	`-r`	Yes	Path to the validation report JSON file
`--output PATH`	`-o`	Yes	Path for the output CSV file
`--config PATH`	`-c`	No	Path to configuration file (YAML or JSON)
`--error-type`	`-e`	No	Filter by error type: all or error
`--ignore-labels`	`-i`	No	Comma-separated error labels to ignore
`--verbose`	`-v`	No	Show detailed processing information
`--init-config`	—	No	Generate a configuration file template
`--list-labels`	—	No	List all valid error labels
`--help`	`-h`	No	Show help message

Configuration File

Create a configuration file for default settings. The tool looks for:

Explicitly specified file (via --config)
oc_pruner_config.yaml or oc_pruner_config.json in current directory
~/.oc_pruner_config.yaml in home directory
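
The lookup above can be pictured as a first-match search. The following is a minimal sketch, not the tool's actual code: the function name `find_config` is hypothetical, and it assumes the locations are tried in the order listed, with the YAML file tried before the JSON file in the current directory.

```python
from pathlib import Path
from typing import Optional


def find_config(explicit: Optional[str] = None,
                cwd: Optional[Path] = None,
                home: Optional[Path] = None) -> Optional[Path]:
    """Return the first configuration file found, or None if there is none."""
    if explicit is not None:
        # A path given via --config is taken as-is (assumed to win over the others).
        return Path(explicit)
    cwd = cwd or Path.cwd()
    home = home or Path.home()
    candidates = [
        cwd / "oc_pruner_config.yaml",
        cwd / "oc_pruner_config.json",
        home / ".oc_pruner_config.yaml",
    ]
    # First existing file wins; None when no candidate exists.
    return next((path for path in candidates if path.is_file()), None)
```

When nothing is found, a caller would presumably fall back to the built-in defaults.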

Generate a template:

oc_pruner --init-config

Example oc_pruner_config.yaml:

# oc_pruner Configuration File

# Filter by error type: "all" (errors and warnings) or "error" (errors only)
error_type_filter: "all"

# List of error labels to ignore (rows with these issues are kept, unless they are also affected by other issues)
ignore_error_labels:
  - "extra_space"
  - "br_id_format"

Configuration Priority

Settings are applied in this order (later settings override earlier ones):

Default values from the code
Configuration file if found
CLI arguments (highest priority)
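
One way to picture this layering is a dictionary merge in which later sources win. This is a sketch under assumptions: `resolve_settings` and `DEFAULTS` are hypothetical names, the keys are those of the example configuration file, and the default values shown are assumed from the default behaviour described above (remove all issues, ignore nothing).

```python
import copy
from typing import Any, Dict, Optional

# Assumed defaults; the real values live in the oc_pruner code.
DEFAULTS: Dict[str, Any] = {"error_type_filter": "all", "ignore_error_labels": []}


def resolve_settings(file_config: Optional[Dict[str, Any]] = None,
                     cli_args: Optional[Dict[str, Any]] = None) -> Dict[str, Any]:
    """Merge defaults, config file values, and CLI values; later sources win."""
    settings = copy.deepcopy(DEFAULTS)
    for source in (file_config or {}, cli_args or {}):
        # Skip None so an option omitted on the CLI does not erase a file value.
        settings.update({key: value for key, value in source.items() if value is not None})
    return settings
```

For instance, a file that sets `error_type_filter` to "error" is overridden by a CLI value of "all", while a CLI option left unset keeps the file value.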

Usage Examples

Run the Complete Validation + Pruning Pipeline from CLI

For thorough cleaning of OpenCitations metadata and citations files, use the pipeline command:

oc_pruner pipeline -m metadata.csv -c citations.csv -o output_dir

Pipeline Arguments:

Argument	Abbreviation	Required	Description
`--meta PATH`	`-m`	Yes	Path to original metadata CSV
`--cits PATH`	`-c`	Yes	Path to original citations CSV
`--out-dir`	`-o`	Yes	Base output directory for results

What the pipeline does:

First validation: Validates both metadata and citations files
First pruning: Removes rows with validation errors
Second validation: Re-validates the cleaned files to catch new issues
Second pruning: Removes any newly exposed errors
Final validation: Performs a sanity check on the final cleaned files

In pipeline mode, the CLI offers no way to configure which error types or labels to ignore.
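
To see why a second round can be useful, here is a toy sketch of the validate-then-prune loop. It is not the oc_pruner implementation: `run_pipeline`, `prune_rows`, and `toy_validate` are hypothetical names, and the validation rule is invented for illustration (a row is invalid when its cited id is blank or no longer appears among the citing ids).

```python
from typing import Callable, Dict, List, Set, Tuple

Row = Dict[str, str]
Validator = Callable[[List[Row]], Set[int]]  # returns the indices of invalid rows


def prune_rows(rows: List[Row], invalid: Set[int]) -> List[Row]:
    """Return the rows whose index is not flagged as invalid."""
    return [row for index, row in enumerate(rows) if index not in invalid]


def run_pipeline(rows: List[Row], validate: Validator,
                 rounds: int = 2) -> Tuple[List[Row], Set[int]]:
    """Validate and prune `rounds` times, then run one final validation."""
    for _ in range(rounds):
        rows = prune_rows(rows, validate(rows))
    return rows, validate(rows)  # final check; ideally an empty set


def toy_validate(rows: List[Row]) -> Set[int]:
    """Invented rule: cited id must be non-blank and present among citing ids."""
    known = {row["citing"] for row in rows}
    return {i for i, row in enumerate(rows)
            if not row["cited"] or row["cited"] not in known}


rows = [
    {"citing": "A", "cited": "A"},
    {"citing": "B", "cited": "C"},
    {"citing": "C", "cited": ""},  # invalid from the start
]
cleaned, remaining = run_pipeline(rows, toy_validate)
```

In this toy data the first round removes the C row; that removal leaves the B row pointing at a missing id, which only the second round catches.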

The pipeline creates the following structure in the output directory:

output_dir/
├── cleaned/
│   ├── metadata.csv       # Final cleaned metadata
│   └── citations.csv      # Final cleaned citations
└── validation_reports/
    ├── first_round/
    │   ├── metadata/
    │   └── citations/
    ├── second_round/
    │   ├── metadata/
    │   └── citations/
    └── final_round/
        ├── metadata/
        └── citations/

All operations are logged to logs/pipeline_YYYYMMDD_HHMMSS.log.
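
For scripts that post-process the results, the layout and log name above can be written down as paths. This is only a sketch: `expected_layout` and `log_name` are hypothetical helpers, and the location of the `logs/` directory itself is not assumed here.

```python
from datetime import datetime
from pathlib import Path
from typing import Dict, List

ROUNDS = ("first_round", "second_round", "final_round")
TABLES = ("metadata", "citations")


def expected_layout(out_dir: str) -> Dict[str, List[Path]]:
    """List the cleaned files and report directories shown in the tree above."""
    base = Path(out_dir)
    return {
        "cleaned": [base / "cleaned" / f"{table}.csv" for table in TABLES],
        "reports": [base / "validation_reports" / rnd / table
                    for rnd in ROUNDS for table in TABLES],
    }


def log_name(started: datetime) -> str:
    """Build a file name following the pipeline_YYYYMMDD_HHMMSS.log pattern."""
    return started.strftime("pipeline_%Y%m%d_%H%M%S.log")
```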

Remove Only Errors (Single Document)

Ignore warnings and only remove rows with errors:

oc_pruner --csv data.csv --report report.json --output clean.csv --error-type error

Ignore Specific Error Labels (Single Document)

Keep rows whose only issues carry the given labels:

oc_pruner --csv data.csv --report report.json --output clean.csv \
  --ignore-labels extra_space,br_id_format

Use Configuration File (Single Document)

Generate a config file, edit it, and run the tool from the same directory so the file is picked up automatically:

oc_pruner --init-config
# Edit oc_pruner_config.yaml
oc_pruner --csv data.csv --report report.json --output clean.csv

Combine Filters (Single Document)

Remove only rows with errors, except errors carrying the given labels:

oc_pruner --csv data.csv --report report.json --output clean.csv \
  --error-type error \
  --ignore-labels extra_space,type_format
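
When the tool is driven from a script, the same command lines can be assembled programmatically. The helper below is a hypothetical sketch (`build_prune_command` is not part of oc_pruner); it uses only the flags documented in the argument table for the `prune` subcommand.

```python
from typing import List, Optional, Sequence


def build_prune_command(csv_path: str, report_path: str, output_path: str,
                        error_type: Optional[str] = None,
                        ignore_labels: Sequence[str] = (),
                        verbose: bool = False) -> List[str]:
    """Assemble an argv list for the documented `oc_pruner prune` flags."""
    cmd = ["oc_pruner", "prune", "--csv", csv_path,
           "--report", report_path, "--output", output_path]
    if error_type is not None:
        if error_type not in ("all", "error"):
            raise ValueError("error_type must be 'all' or 'error'")
        cmd += ["--error-type", error_type]
    if ignore_labels:
        # The CLI expects a single comma-separated value.
        cmd += ["--ignore-labels", ",".join(ignore_labels)]
    if verbose:
        cmd.append("--verbose")
    return cmd


cmd = build_prune_command("data.csv", "report.json", "clean.csv",
                          error_type="error",
                          ignore_labels=["extra_space", "type_format"])
```

With oc_pruner installed, the resulting list can be passed to `subprocess.run(cmd, check=True)`.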

List Available Error Labels

See all valid error labels:

oc_pruner --list-labels

Validation Report Model

The validation report is a JSON file following the validation report schema. It consists of a list of issue objects, where each object represents a validation issue tied to specific locations in the CSV table.

Issue Object Structure

{
  "validation_level": "csv_wellformedness",
  "error_type": "error",
  "error_label": "extra_space",
  "message": "The value in this field is not expressed in compliance with the syntax...",
  "valid": false,
  "position": {
    "located_in": "item",
    "table": {
      "0": {
        "id": [1]
      }
    }
  }
}
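
As an illustration of how such an object might be read, the sketch below pulls the affected rows and fields out of `position.table`. It assumes that the keys of `table` are row indices stored as strings and that each nested key is a field name; `affected_fields` is a hypothetical helper, not part of the oc_pruner API.

```python
import json
from typing import Any, Dict, List


def affected_fields(issue: Dict[str, Any]) -> Dict[int, List[str]]:
    """Map each row index named in position.table to its affected field names."""
    table = issue.get("position", {}).get("table", {})
    return {int(row): sorted(fields) for row, fields in table.items()}


issue = json.loads("""
{
  "validation_level": "csv_wellformedness",
  "error_type": "error",
  "error_label": "extra_space",
  "message": "The value in this field is not expressed in compliance with the syntax...",
  "valid": false,
  "position": {
    "located_in": "item",
    "table": {"0": {"id": [1]}}
  }
}
""")

locations = affected_fields(issue)  # row 0, field "id"
```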

Error Labels Reference

The supported issue labels are listed in the validation report schema and the associated issues are explained in this summary table.

How It Works

Load Files: Reads the CSV file and validation report
Filter Issues: Based on configuration, determines which issues to consider
  - --error-type error: Only considers "error" type issues
  - --ignore-labels: Ignores issues with specified labels
Extract Affected Rows: For each relevant issue, extracts row numbers from the position data
Remove Rows: Removes entire rows that contain any non-ignored issue
Write Output: Saves the cleaned CSV file

Important: If a row has both an ignorable issue and a non-ignorable issue, the entire row is removed (the non-ignorable issue takes precedence).
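
The steps and the precedence rule above can be sketched with the standard `csv` module. This is a simplified illustration, not the tool's code: `rows_to_remove` and `prune_csv_text` are hypothetical names, and the sketch assumes that the keys of `position.table` are 0-based indices of data rows (header excluded).

```python
import csv
import io
from typing import Any, Dict, Iterable, Set


def rows_to_remove(report: Iterable[Dict[str, Any]],
                   error_type_filter: str = "all",
                   ignore_error_labels: Iterable[str] = ()) -> Set[int]:
    """Collect rows touched by at least one issue that is not filtered out."""
    ignored = set(ignore_error_labels)
    doomed: Set[int] = set()
    for issue in report:
        if error_type_filter == "error" and issue.get("error_type") != "error":
            continue  # errors-only mode skips warnings
        if issue.get("error_label") in ignored:
            continue  # an ignored label never removes a row on its own
        doomed.update(int(key) for key in issue.get("position", {}).get("table", {}))
    return doomed


def prune_csv_text(csv_text: str, report: Iterable[Dict[str, Any]], **filters: Any) -> str:
    """Return csv_text without the flagged data rows; the header is kept."""
    header, *rows = list(csv.reader(io.StringIO(csv_text)))
    doomed = rows_to_remove(report, **filters)
    out = io.StringIO()
    writer = csv.writer(out, lineterminator="\n")
    writer.writerow(header)
    writer.writerows(row for index, row in enumerate(rows) if index not in doomed)
    return out.getvalue()
```

Because a row is dropped as soon as any surviving issue names it, a row carrying both an ignored and a non-ignored issue is still removed.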

API Usage

You can also use oc_pruner as a Python library:

from oc_pruner import prune
from oc_pruner.config import PrunerConfig

# Create configuration
config = PrunerConfig(
    error_type_filter="all",
    ignore_error_labels=["extra_space"]
)

# Prune the CSV file
prune(
    csv_path="input.csv",
    report_path="report.json",
    output_path="output.csv",
    config=config,
    verbose=True
)

Release history

0.2.0: Apr 29, 2026
0.1.2: Mar 9, 2026 (this version)
0.1.1: Mar 6, 2026
0.1.0: Mar 4, 2026

