UD-HF-Parquet-Tools
Tools for generating and validating Universal Dependencies datasets in Parquet format for HuggingFace.
Features
- Generate Parquet files from Universal Dependencies CoNLL-U data
- Validate Parquet files against original CoNLL-U with 100% fidelity checking
- Handle CoNLL-U edge cases: double equals bug, duplicate metadata keys, empty nodes, MWTs
- CLI and Python API for programmatic use
- Comprehensive test suite with 60+ tests
Installation
Using uv (recommended):
uv pip install ud-hf-parquet-tools
Using pip:
pip install ud-hf-parquet-tools
Quick Start
Command Line Interface
Generate Parquet files:
# Generate for all treebanks
ud-hfp-tools generate --metadata metadata.json --output-dir parquet/
# Generate for specific treebanks
ud-hfp-tools generate --metadata metadata.json --treebanks fr_gsd,en_ewt --output-dir parquet/
# Test mode (3 treebanks only)
ud-hfp-tools generate --metadata metadata.json --test
Validate Parquet files:
# Validate from local files
ud-hfp-tools validate --local --metadata metadata.json
# Validate specific treebanks
ud-hfp-tools validate --local --treebanks fr_gsd,en_ewt
# Validate from HuggingFace Hub
ud-hfp-tools validate --revision 2.17 --treebanks fr_gsd
Python API
Generate Parquet files:
from ud_hf_parquet_tools import generate_parquet_for_treebank
from pathlib import Path
import json

# Load metadata
with open("metadata.json") as f:
    metadata = json.load(f)

# Generate for one treebank
success = generate_parquet_for_treebank(
    name="fr_gsd",
    metadata=metadata["fr_gsd"],
    ud_repos_dir=Path("UD_repos"),
    output_dir=Path("parquet"),
    verbose=True
)
Validate Parquet files:
from ud_hf_parquet_tools import validate_treebank
from pathlib import Path
import json

# Load metadata
with open("metadata.json") as f:
    metadata = json.load(f)

# Validate one treebank
results = validate_treebank(
    name="fr_gsd",
    metadata=metadata["fr_gsd"],
    parquet_dir=Path("parquet"),
    ud_repos_dir=Path("UD_repos"),
    verbose=True
)

print(f"Success: {results['success']}")
print(f"Total sentences: {results['total_sentences']}")
print(f"Total errors: {results['total_errors']}")
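When validating several treebanks, the per-treebank result dictionaries can be combined into one report. The following is a minimal sketch that relies only on the three keys printed above (`success`, `total_sentences`, `total_errors`); the helper name `summarize` and the sample numbers are made up for illustration and are not part of the library.

```python
def summarize(results_by_treebank: dict) -> dict:
    """Aggregate validation results keyed by treebank name.

    Hypothetical helper: each value is expected to look like the
    dictionary returned for a single treebank.
    """
    failed = sorted(
        name for name, res in results_by_treebank.items() if not res["success"]
    )
    return {
        "treebanks": len(results_by_treebank),
        "failed": failed,
        "total_sentences": sum(
            res["total_sentences"] for res in results_by_treebank.values()
        ),
        "total_errors": sum(
            res["total_errors"] for res in results_by_treebank.values()
        ),
    }


# Placeholder results, not real validation output
sample = {
    "fr_gsd": {"success": True, "total_sentences": 100, "total_errors": 0},
    "en_ewt": {"success": False, "total_sentences": 50, "total_errors": 2},
}
summary = summarize(sample)
print(summary["failed"])
```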
CoNLL-U Parsing Features
This library handles several CoNLL-U parsing edge cases to ensure 100% fidelity:
1. Double Equals Bug
The conllu library fails to parse values starting with =:
- Example: Gloss==POSS becomes {'Gloss': None} instead of {'Gloss': '=POSS'}
- Solution: Direct raw field extraction bypasses the parser
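The idea of raw field extraction can be sketched without the conllu library: take the tab-separated column directly and split each `key=value` pair on the first `=` only. The helper name `parse_raw_pairs` and the sample token line are assumptions for illustration, not the library's actual code.

```python
def parse_raw_pairs(field: str) -> dict:
    """Parse a raw FEATS/MISC field such as 'Gloss==POSS|SpaceAfter=No'.

    Hypothetical helper: splits on the first '=' only, so a value that
    itself starts with '=' is kept intact.
    """
    if field == "_":  # CoNLL-U marks an unspecified field with an underscore
        return {}
    pairs = {}
    for item in field.split("|"):
        key, sep, value = item.partition("=")
        # No '=' at all means a bare key; None keeps it distinct from ''
        pairs[key] = value if sep else None
    return pairs


# A made-up token line with the 10 tab-separated CoNLL-U columns
line = "1\tkitab\tkitab\tNOUN\t_\t_\t0\troot\t_\tGloss==POSS|SpaceAfter=No"
misc_raw = line.split("\t")[9]  # MISC is the 10th column
misc = parse_raw_pairs(misc_raw)
print(misc)
```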
2. Duplicate Metadata Keys
Some treebanks have multiple entries with the same key (e.g., multiple # media lines):
- Problem: Dictionary-based storage keeps only the last value
- Solution: Preserve metadata as ordered list with special markers
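The difference between dictionary storage and an ordered list can be seen in a small, self-contained sketch. The sentence block and the helper `extract_comments` are invented for illustration; the special markers mentioned above are not modeled here.

```python
# A made-up sentence block with two '# media' lines
raw_sentence = """# sent_id = s1
# media = clip_a.wav
# media = clip_b.wav
# text = Hello
1\tHello\thello\tINTJ\t_\t_\t0\troot\t_\t_"""


def extract_comments(block: str) -> list:
    """Return comment lines in their original order, without the leading '#'."""
    comments = []
    for line in block.splitlines():
        if line.startswith("#"):
            comments.append(line[1:].strip())
    return comments


comments = extract_comments(raw_sentence)

# Dictionary storage for comparison: the second 'media' overwrites the first
as_dict = dict(c.split(" = ", 1) for c in comments)

print(comments)          # both media entries survive, in order
print(as_dict["media"])  # only the last one is left
```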
3. Empty Metadata Values
Lines like # text_en = (with empty value) are ignored by the parser:
- Solution: Raw comment extraction preserves all metadata
4. Keys Without Values
Comments like # newpar without = become {'newpar': None}:
- Solution: Store as just "newpar" (not "newpar = None")
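Cases 3 and 4 both come down to keeping three situations apart: a key with a value, a key with an empty value, and a key with no `=` at all. The sketch below shows one way to do that; `parse_comment` and `format_comment` are hypothetical names, and the sketch normalizes whitespace around `=`, which a byte-exact implementation would need to avoid.

```python
def parse_comment(line: str):
    """Split a raw comment line into (key, value).

    value is None when the line has no '=', and '' when '=' is present
    but nothing follows it. Hypothetical helper for illustration.
    """
    body = line.lstrip("#").strip()
    key, sep, value = body.partition("=")
    return key.strip(), (value.strip() if sep else None)


def format_comment(key: str, value) -> str:
    """Rebuild the stored comment string from (key, value)."""
    if value is None:
        return key  # "newpar", never "newpar = None"
    if value == "":
        return f"{key} ="  # the empty value is kept, not dropped
    return f"{key} = {value}"


for raw in ["# newpar", "# text_en = ", "# sent_id = s1"]:
    print(repr(format_comment(*parse_comment(raw))))
```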
5. Multi-Word Tokens (MWTs)
Contractions like "du" → "de le" (French) with ID 1-2:
- Stored with tuple IDs like (1, '-', 2)
- Preserved with form, FEATS (for Typo=Yes), and MISC
6. Empty Nodes
Enhanced dependencies with decimal IDs like 22.1:
- Stored with tuple IDs like (22, '.', 1)
- Full 10-field preservation including all annotations
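The tuple IDs used for MWTs and empty nodes can be derived from the raw ID column with a few lines of Python. This is a sketch of the convention described above, with the assumed helper names `parse_id` and `classify`; it is not taken from the library's source.

```python
def parse_id(raw: str):
    """Convert a CoNLL-U ID string to an int or a 3-tuple.

    '5' -> 5, '1-2' -> (1, '-', 2), '22.1' -> (22, '.', 1)
    """
    for sep in ("-", "."):
        if sep in raw:
            left, right = raw.split(sep, 1)
            return (int(left), sep, int(right))
    return int(raw)


def classify(token_id) -> str:
    """Label an ID as 'word', 'mwt' (range) or 'empty' (decimal)."""
    if isinstance(token_id, tuple):
        return "mwt" if token_id[1] == "-" else "empty"
    return "word"


ids = [parse_id(s) for s in ["1-2", "1", "2", "22.1"]]
kinds = [classify(i) for i in ids]
print(list(zip(ids, kinds)))
```

Separating the three kinds this way is what allows the token columns to hold syntactic words only, while MWTs and empty nodes are stored in their own lists.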
For complete details, see CONLLU_PARSING.md which documents:
- All parsing issues with examples from real treebanks
- Affected treebank counts and statistics
- Implementation strategies and code locations
- Testing and validation procedures
- Known limitations and their rationale
Dataset Schema
Generated Parquet files include:
{
    "sent_id": str,        # Sentence ID
    "text": str,           # Full sentence text
    "comments": [str],     # Metadata comments (ordered, with duplicates)
    "tokens": [str],       # Word forms (syntactic words only)
    "lemmas": [str],       # Lemmas
    "upos": [str],         # Universal POS tags (ClassLabel)
    "xpos": [str],         # Language-specific POS
    "feats": [str],        # Morphological features
    "head": [str],         # Dependency heads
    "deprel": [str],       # Dependency relations
    "deps": [str],         # Enhanced dependencies
    "misc": [str],         # Miscellaneous annotations
    "mwt": [{              # Multi-word tokens
        "id": str,         # e.g., "1-2"
        "form": str,
        "feats": str,      # Optional (for Typo=Yes)
        "misc": str
    }],
    "empty_nodes": [{      # Empty nodes (enhanced deps)
        "id": str,         # e.g., "22.1"
        "form": str,
        # ... all 10 CoNLL-U fields
    }]
}
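To illustrate how a row in this schema maps back to CoNLL-U token lines, here is a rough sketch that rebuilds word and MWT lines from a plain dictionary. It makes several assumptions that the schema does not state: words are numbered 1..n in list order, unspecified values are stored as "_" or an empty string, and empty nodes are left out entirely. The function name `row_to_conllu_lines` and the sample row are invented for illustration.

```python
def row_to_conllu_lines(row: dict) -> list:
    """Rebuild token lines from one dataset row (words and MWTs only)."""
    # Map the first word ID of each MWT range to its entry, e.g. "1-2" -> 1
    mwt_by_start = {int(m["id"].split("-")[0]): m for m in row["mwt"]}
    columns = ("tokens", "lemmas", "upos", "xpos", "feats",
               "head", "deprel", "deps", "misc")
    lines = []
    for index in range(len(row["tokens"])):
        word_id = index + 1  # assumption: words are numbered 1..n in order
        if word_id in mwt_by_start:
            m = mwt_by_start[word_id]
            # MWT lines carry only ID, FORM, FEATS and MISC
            lines.append("\t".join([m["id"], m["form"], "_", "_", "_",
                                    m["feats"] or "_", "_", "_", "_",
                                    m["misc"] or "_"]))
        fields = [str(word_id)] + [row[c][index] for c in columns]
        lines.append("\t".join(fields))
    return lines


# Made-up row for the French contraction "du" = "de" + "le"
row = {
    "tokens": ["de", "le", "pain"],
    "lemmas": ["de", "le", "pain"],
    "upos": ["ADP", "DET", "NOUN"],
    "xpos": ["_", "_", "_"],
    "feats": ["_", "_", "_"],
    "head": ["3", "3", "0"],
    "deprel": ["case", "det", "root"],
    "deps": ["_", "_", "_"],
    "misc": ["_", "_", "_"],
    "mwt": [{"id": "1-2", "form": "du", "feats": "", "misc": ""}],
}
lines = row_to_conllu_lines(row)
print("\n".join(lines))
```

A fidelity check in this spirit would compare such rebuilt lines with the original file; the library's own validation may work differently.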
Documentation
- CONLLU_PARSING.md: Comprehensive guide to CoNLL-U parsing issues
  - All 7 parsing challenges with examples
  - Affected treebank statistics
  - Implementation details and code locations
  - Testing and validation procedures
  - 100% fidelity achievement documentation
- RELEASE.md: Complete guide for publishing new releases
  - Pre-release checklist
  - Version numbering guidelines
  - Git tagging and PyPI publishing workflow
  - Troubleshooting guide
- CHANGELOG.md: Version history and release notes
- INSTALLATION.md: Detailed installation instructions
- CONTRIBUTING.md: Guidelines for contributors
Development
Clone and install with development dependencies:
git clone https://github.com/egon-stemle/ud-hf-parquet-tools
cd ud-hf-parquet-tools
uv venv
source .venv/bin/activate # On Windows: .venv\Scripts\activate
uv pip install -e ".[dev]"
Run tests:
pytest
Run tests with coverage:
pytest --cov=ud_hf_parquet_tools --cov-report=html
License
Apache License 2.0 - see LICENSE for details.
Author
Egon W. Stemle <egon.stemle@eurac.edu>
Acknowledgments
This library was developed for the Universal Dependencies project to enable efficient distribution of UD treebanks via HuggingFace Datasets.
Download files
File details
Details for the file ud_hf_parquet_tools-1.2.0.tar.gz.
File metadata
- Download URL: ud_hf_parquet_tools-1.2.0.tar.gz
- Upload date:
- Size: 37.9 kB
- Tags: Source
- Uploaded using Trusted Publishing? No
- Uploaded via: uv/0.9.2
File hashes
| Algorithm | Hash digest |
|---|---|
| SHA256 | 5c105d8dcccc706edf06c559f1487b50dfb9644e2eb144808894c54215d58997 |
| MD5 | f0d400d2a0dedd86e268ab9fb326cf63 |
| BLAKE2b-256 | 8b5732388049ad5c020edbc099bf72158b6614547256cbd30f36397df1cd66cb |
File details
Details for the file ud_hf_parquet_tools-1.2.0-py3-none-any.whl.
File metadata
- Download URL: ud_hf_parquet_tools-1.2.0-py3-none-any.whl
- Upload date:
- Size: 25.4 kB
- Tags: Python 3
- Uploaded using Trusted Publishing? No
- Uploaded via: uv/0.9.2
File hashes
| Algorithm | Hash digest |
|---|---|
| SHA256 | f001a8ee68131dc1e62e5af1bd9ec18a16021f3204c556b67b554e08ab5632be |
| MD5 | 46e84a17c0c42deefb574a00a32a2488 |
| BLAKE2b-256 | 5922b4577d174c1bcd990395c7ea92151af6889bef04c64f4cc79410cc8f4d76 |