
UD-HF-Parquet-Tools

Tools for generating and validating Universal Dependencies datasets in Parquet format for HuggingFace.

Python 3.12+ · PyPI · License: Apache 2.0

Features

  • Generate Parquet files from Universal Dependencies CoNLL-U data
  • Validate Parquet files against original CoNLL-U with 100% fidelity checking
  • Handle CoNLL-U edge cases: double equals bug, duplicate metadata keys, empty nodes, MWTs
  • CLI and Python API for programmatic use
  • Comprehensive test suite with 60+ tests

Installation

Using uv (recommended):

uv pip install ud-hf-parquet-tools

Using pip:

pip install ud-hf-parquet-tools

Quick Start

Command Line Interface

Generate Parquet files:

# Generate for all treebanks
ud-hfp-tools generate --metadata metadata.json --output-dir parquet/

# Generate for specific treebanks
ud-hfp-tools generate --metadata metadata.json --treebanks fr_gsd,en_ewt --output-dir parquet/

# Test mode (3 treebanks only)
ud-hfp-tools generate --metadata metadata.json --test

Validate Parquet files:

# Validate from local files
ud-hfp-tools validate --local --metadata metadata.json

# Validate specific treebanks
ud-hfp-tools validate --local --treebanks fr_gsd,en_ewt

# Validate from HuggingFace Hub
ud-hfp-tools validate --revision 2.17 --treebanks fr_gsd

Python API

Generate Parquet files:

from ud_hf_parquet_tools import generate_parquet_for_treebank
from pathlib import Path
import json

# Load metadata
with open("metadata.json") as f:
    metadata = json.load(f)

# Generate for one treebank
success = generate_parquet_for_treebank(
    name="fr_gsd",
    metadata=metadata["fr_gsd"],
    ud_repos_dir=Path("UD_repos"),
    output_dir=Path("parquet"),
    verbose=True
)

Validate Parquet files:

from ud_hf_parquet_tools import validate_treebank
from pathlib import Path
import json

# Load metadata
with open("metadata.json") as f:
    metadata = json.load(f)

# Validate one treebank
results = validate_treebank(
    name="fr_gsd",
    metadata=metadata["fr_gsd"],
    parquet_dir=Path("parquet"),
    ud_repos_dir=Path("UD_repos"),
    verbose=True
)

print(f"Success: {results['success']}")
print(f"Total sentences: {results['total_sentences']}")
print(f"Total errors: {results['total_errors']}")

CoNLL-U Parsing Features

This library handles several CoNLL-U parsing edge cases to ensure 100% fidelity:

1. Double Equals Bug

The conllu library fails to parse feature values that start with =:

  • Example: Gloss==POSS becomes {'Gloss': None} instead of {'Gloss': '=POSS'}
  • Solution: Direct raw field extraction bypasses the parser
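The workaround can be sketched as follows (the function name and structure here are illustrative, not this library's actual API): split each FEATS pair on the first = only, so a value that itself begins with = survives intact.

```python
def parse_feats(raw: str) -> dict[str, str]:
    """Parse a raw FEATS field, splitting each pair on the FIRST '='
    so values that themselves start with '=' (e.g. 'Gloss==POSS')
    come back as '=POSS' instead of being dropped."""
    if raw in ("_", ""):
        return {}
    feats = {}
    for pair in raw.split("|"):
        key, _, value = pair.partition("=")  # partition splits on the first '=' only
        feats[key] = value
    return feats

print(parse_feats("Gloss==POSS|Number=Sing"))
# {'Gloss': '=POSS', 'Number': 'Sing'}
```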

2. Duplicate Metadata Keys

Some treebanks have multiple entries with the same key (e.g., multiple # media lines):

  • Problem: Dictionary-based storage keeps only the last value
  • Solution: Preserve metadata as ordered list with special markers

3. Empty Metadata Values

Lines like # text_en = (with an empty value) are ignored by the parser:

  • Solution: Raw comment extraction preserves all metadata

4. Keys Without Values

Comments like # newpar without = become {'newpar': None}:

  • Solution: Store as just "newpar" (not "newpar = None")

5. Multi-Word Tokens (MWTs)

Contractions like "du" → "de le" (French) with ID 1-2:

  • Stored with tuple IDs like (1, '-', 2)
  • Preserved with form, FEATS (for Typo=Yes), and MISC

6. Empty Nodes

Enhanced dependencies with decimal IDs like 22.1:

  • Stored with tuple IDs like (22, '.', 1)
  • Full 10-field preservation including all annotations
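Both special ID shapes above can be recognized with a small classifier that produces the tuple encodings described; the function below is an illustrative sketch, not the library's actual code.

```python
def classify_id(token_id: str):
    """Map a CoNLL-U ID field to its kind and encoding:
    '1-2'  -> MWT tuple (1, '-', 2)
    '22.1' -> empty-node tuple (22, '.', 1)
    '5'    -> plain integer word ID."""
    if "-" in token_id:
        start, end = token_id.split("-")
        return "mwt", (int(start), "-", int(end))
    if "." in token_id:
        word, sub = token_id.split(".")
        return "empty_node", (int(word), ".", int(sub))
    return "word", int(token_id)

print(classify_id("1-2"))   # ('mwt', (1, '-', 2))
print(classify_id("22.1"))  # ('empty_node', (22, '.', 1))
print(classify_id("5"))     # ('word', 5)
```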

For complete details, see CONLLU_PARSING.md, which documents:

  • All parsing issues with examples from real treebanks
  • Affected treebank counts and statistics
  • Implementation strategies and code locations
  • Testing and validation procedures
  • Known limitations and their rationale

Dataset Schema

Generated Parquet files include:

{
    "sent_id": str,              # Sentence ID
    "text": str,                 # Full sentence text
    "comments": [str],           # Metadata comments (ordered, with duplicates)
    "tokens": [str],             # Word forms (syntactic words only)
    "lemmas": [str],             # Lemmas
    "upos": [str],               # Universal POS tags (ClassLabel)
    "xpos": [str],               # Language-specific POS
    "feats": [str],              # Morphological features
    "head": [str],               # Dependency heads
    "deprel": [str],             # Dependency relations
    "deps": [str],               # Enhanced dependencies
    "misc": [str],               # Miscellaneous annotations
    "mwt": [{                    # Multi-word tokens
        "id": str,                 # e.g., "1-2"
        "form": str,
        "feats": str,              # Optional (for Typo=Yes)
        "misc": str
    }],
    "empty_nodes": [{            # Empty nodes (enhanced deps)
        "id": str,                 # e.g., "22.1"
        "form": str,
        # ... all 10 CoNLL-U fields
    }]
}

Documentation

  • CONLLU_PARSING.md: Comprehensive guide to CoNLL-U parsing issues

    • All 7 parsing challenges with examples
    • Affected treebank statistics
    • Implementation details and code locations
    • Testing and validation procedures
    • 100% fidelity achievement documentation
  • RELEASE.md: Complete guide for publishing new releases

    • Pre-release checklist
    • Version numbering guidelines
    • Git tagging and PyPI publishing workflow
    • Troubleshooting guide
  • CHANGELOG.md: Version history and release notes

  • INSTALLATION.md: Detailed installation instructions

  • CONTRIBUTING.md: Guidelines for contributors

Development

Clone and install with development dependencies:

git clone https://github.com/egon-stemle/ud-hf-parquet-tools
cd ud-hf-parquet-tools
uv venv
source .venv/bin/activate  # On Windows: .venv\Scripts\activate
uv pip install -e ".[dev]"

Run tests:

pytest

Run tests with coverage:

pytest --cov=ud_hf_parquet_tools --cov-report=html

License

Apache License 2.0 - see LICENSE for details.

Author

Egon W. Stemle <egon.stemle@eurac.edu>

Acknowledgments

This library was developed for the Universal Dependencies project to enable efficient distribution of UD treebanks via HuggingFace Datasets.
