Transkribus-HF
Convert Transkribus ZIP files to HuggingFace datasets with ease.
Overview
transkribus-hf is a Python package that converts Transkribus export ZIP files into HuggingFace datasets. It supports multiple export formats and can automatically upload datasets to the HuggingFace Hub.
Features
- Multiple Export Modes: Convert your Transkribus data to raw XML, text, region, line, or window datasets
- Automatic Upload: Direct integration with HuggingFace Hub
- Region & Line Extraction: Extract individual text regions and lines as separate images
- Windowed Extraction: Create sliding windows of multiple lines for data augmentation
- Preserves Metadata: Maintains reading order, region types, and other important metadata
- Command Line Interface: Easy-to-use CLI for batch processing
Installation
pip install transkribus-hf
Or install from source:
git clone https://github.com/wjbmattingly/transkribus-hf.git
cd transkribus-hf
pip install -e .
Export Modes
1. Raw XML (raw_xml)
Exports the original image with the complete PAGE XML content.
Fields:
- image: Original page image
- xml: Complete PAGE XML content
- filename: Original image filename
- project: Project name
2. Text (text) - Default
Exports the image with concatenated text from all regions.
Fields:
- image: Original page image
- text: Full text content (all regions combined)
- filename: Original image filename
- project: Project name
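As a rough illustration of what "concatenated text from all regions" can mean, the sketch below reads a PAGE XML string with the standard library and joins line text region by region. It is not the package's own code: the helper names, the newline separators, and the reliance on the `custom="readingOrder {index:N;}"` attribute for ordering are assumptions made for this example.

```python
import re
import xml.etree.ElementTree as ET

# Tiny hand-written PAGE XML sample (illustrative only).
SAMPLE_PAGE_XML = """
<PcGts xmlns="http://schema.primaresearch.org/PAGE/gts/pagecontent/2013-07-15">
  <Page imageFilename="image1.jpg" imageWidth="1000" imageHeight="1500">
    <TextRegion id="r2" custom="readingOrder {index:1;}">
      <TextLine id="r2l1" custom="readingOrder {index:0;}">
        <TextEquiv><Unicode>second region</Unicode></TextEquiv>
      </TextLine>
    </TextRegion>
    <TextRegion id="r1" custom="readingOrder {index:0;}">
      <TextLine id="r1l1" custom="readingOrder {index:0;}">
        <TextEquiv><Unicode>first line</Unicode></TextEquiv>
      </TextLine>
      <TextLine id="r1l2" custom="readingOrder {index:1;}">
        <TextEquiv><Unicode>second line</Unicode></TextEquiv>
      </TextLine>
    </TextRegion>
  </Page>
</PcGts>
"""


def local_name(tag):
    """Strip the '{namespace}' prefix that ElementTree puts on tag names."""
    return tag.rsplit("}", 1)[-1]


def children(element, name):
    """Direct children of `element` whose local tag name is `name`."""
    return [child for child in element if local_name(child.tag) == name]


def reading_order(element):
    """Reading-order index from the `custom` attribute (assumed format), else 0."""
    match = re.search(r"readingOrder\s*\{index:(\d+);", element.get("custom", ""))
    return int(match.group(1)) if match else 0


def line_text(line):
    """Text of the line-level TextEquiv/Unicode element, or an empty string."""
    for equiv in children(line, "TextEquiv"):
        for unicode_el in children(equiv, "Unicode"):
            return unicode_el.text or ""
    return ""


def page_text(xml_string):
    """Join all line texts, ordered by region and then by line reading order."""
    root = ET.fromstring(xml_string)
    regions = [el for el in root.iter() if local_name(el.tag) == "TextRegion"]
    parts = []
    for region in sorted(regions, key=reading_order):
        lines = sorted(children(region, "TextLine"), key=reading_order)
        parts.append("\n".join(line_text(line) for line in lines))
    return "\n".join(parts)


print(page_text(SAMPLE_PAGE_XML))
```

Matching on local tag names keeps the sketch independent of the exact PAGE schema version declared in the file.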
3. Region (region)
Exports each text region as a separate cropped image.
Fields:
- image: Cropped region image
- text: Region text content
- region_type: Type of region (e.g., "paragraph")
- region_id: Unique region identifier
- reading_order: Reading order of the region
- filename: Original image filename
- project: Project name
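Cropping a region or line needs a rectangle, while PAGE XML stores each shape as a polygon in a `Coords points="x,y x,y ..."` attribute. The following is a minimal sketch of turning such a point string into a bounding box. The function names are hypothetical and the package may compute its crops differently.

```python
def parse_points(points):
    """Parse a PAGE XML points string such as '10,20 30,40' into (x, y) tuples."""
    coords = []
    for pair in points.split():
        x, y = pair.split(",")
        coords.append((int(float(x)), int(float(y))))
    return coords


def bounding_box(points):
    """Return (left, upper, right, lower) enclosing every polygon point."""
    coords = parse_points(points)
    if not coords:
        raise ValueError("points string contains no coordinates")
    xs = [x for x, _ in coords]
    ys = [y for _, y in coords]
    return (min(xs), min(ys), max(xs), max(ys))


box = bounding_box("100,200 400,180 410,260 95,270")
print(box)
```

The tuple order matches the box argument of Pillow's `Image.crop`, so a page image opened with Pillow could be cropped with `page.crop(box)`.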
4. Line (line)
Exports each text line as a separate cropped image.
Fields:
- image: Cropped line image
- text: Line text content
- line_id: Unique line identifier
- line_reading_order: Reading order within the region
- region_id: Parent region identifier
- region_reading_order: Reading order of parent region
- region_type: Type of parent region
- filename: Original image filename
- project: Project name
5. Window (window) - NEW!
Exports sliding windows of multiple text lines, perfect for data augmentation and multi-line text recognition training.
Configuration:
- window_size: Number of lines per window (1, 2, 3, 4, etc.)
- overlap: Number of lines to overlap between windows (0 = no overlap)
Fields:
- image: Cropped window image (bounding box of all lines in window)
- text: Combined text from all lines in window (newline separated)
- window_size: Actual number of lines in this window
- window_index: Index of this window within the region
- line_ids: Comma-separated list of line IDs in this window
- line_reading_orders: Comma-separated list of line reading orders
- region_id: Parent region identifier
- region_reading_order: Reading order of parent region
- region_type: Type of parent region
- filename: Original image filename
- project: Project name
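To make the window fields more concrete, here is a small sketch of how a record could be assembled from a group of lines: the crop box is the union of the line boxes, the text is newline-joined, and the IDs are comma-joined. The input dictionaries and the `box` key are assumptions for illustration only; `text`, `line_ids`, and `window_size` mirror the field names listed above.

```python
def union_box(boxes):
    """Smallest (left, upper, right, lower) rectangle covering all given boxes."""
    lefts, uppers, rights, lowers = zip(*boxes)
    return (min(lefts), min(uppers), max(rights), max(lowers))


def build_window(lines):
    """Combine per-line records (dicts with 'id', 'text', 'box') into one window."""
    return {
        "text": "\n".join(line["text"] for line in lines),
        "line_ids": ",".join(line["id"] for line in lines),
        "window_size": len(lines),
        "box": union_box([line["box"] for line in lines]),  # hypothetical key
    }


lines = [
    {"id": "r1l1", "text": "first line", "box": (100, 200, 900, 260)},
    {"id": "r1l2", "text": "second line", "box": (95, 270, 880, 330)},
]
window = build_window(lines)
print(window)
```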
Examples:
- window_size=1, overlap=0: Same as line mode
- window_size=2, overlap=0: Non-overlapping pairs of lines
- window_size=3, overlap=1: 3-line windows with 1-line overlap (lines 1-3, 3-5, 5-7, etc.)
- window_size=4, overlap=2: 4-line windows with 2-line overlap (lines 1-4, 3-6, 5-8, etc.)
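The relationship between `window_size` and `overlap` can be sketched as a sliding window whose step is `window_size - overlap`. This is an illustrative helper, not the package's implementation. In particular, keeping a shorter final window when the lines run out is an assumption, suggested by the "actual number of lines" wording of the `window_size` field.

```python
def sliding_windows(items, window_size, overlap=0):
    """Split `items` into windows of `window_size` that share `overlap` items."""
    if window_size < 1:
        raise ValueError("window_size must be at least 1")
    if not 0 <= overlap < window_size:
        raise ValueError("overlap must satisfy 0 <= overlap < window_size")
    step = window_size - overlap
    windows = []
    for start in range(0, len(items), step):
        windows.append(items[start:start + window_size])
        # Stop once a window has reached the last item, so no window is
        # made up purely of lines already covered by the previous one.
        if start + window_size >= len(items):
            break
    return windows


line_numbers = list(range(1, 9))  # lines 1..8 of one region
for window in sliding_windows(line_numbers, window_size=4, overlap=2):
    print(f"lines {window[0]}-{window[-1]}")
```

With eight lines, `window_size=4` and `overlap=2`, the loop prints lines 1-4, 3-6 and 5-8, matching the last example in the list above.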
Usage
Command Line Interface
# Basic usage - convert and upload to HuggingFace Hub
transkribus-hf path/to/your/transkribus.zip --repo-id username/dataset-name
# Specify export mode
transkribus-hf path/to/your/transkribus.zip --repo-id username/dataset-name --mode region
# Window mode with 3 lines per window, 1 line overlap
transkribus-hf path/to/your/transkribus.zip --repo-id username/dataset-name --mode window --window-size 3 --overlap 1
# Convert to local directory only
transkribus-hf path/to/your/transkribus.zip --local-only --output-dir ./my_dataset
# View statistics only (including window estimates)
transkribus-hf path/to/your/transkribus.zip --stats-only --mode window --window-size 2
# Create private repository
transkribus-hf path/to/your/transkribus.zip --repo-id username/dataset-name --private
# Use custom HuggingFace token
transkribus-hf path/to/your/transkribus.zip --repo-id username/dataset-name --token your_token_here
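For batch processing, the CLI calls above can be generated from Python. The sketch below only builds argument lists from the flags shown in this section and leaves execution to the caller. `build_command` and the example paths and repository names are hypothetical.

```python
def build_command(zip_path, repo_id, mode="text", window_size=None,
                  overlap=None, private=False):
    """Build a transkribus-hf argument list using only the flags shown above."""
    command = ["transkribus-hf", str(zip_path), "--repo-id", repo_id, "--mode", mode]
    if mode == "window":
        if window_size is not None:
            command += ["--window-size", str(window_size)]
        if overlap is not None:
            command += ["--overlap", str(overlap)]
    if private:
        command.append("--private")
    return command


# Hypothetical export files mapped to hypothetical repository names.
jobs = {
    "exports/letters.zip": "username/letters-windows",
    "exports/diaries.zip": "username/diaries-windows",
}
commands = [
    build_command(path, repo, mode="window", window_size=3, overlap=1)
    for path, repo in jobs.items()
]
for command in commands:
    print(" ".join(command))
```

Each list can be passed to `subprocess.run(command, check=True)` once the paths and repository names are real.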
Python API
from transkribus_hf import TranskribusConverter
# Initialize converter
converter = TranskribusConverter("path/to/your/transkribus.zip")
# Get statistics
stats = converter.get_stats()
print(f"Total pages: {stats['total_pages']}")
print(f"Total regions: {stats['total_regions']}")
print(f"Total lines: {stats['total_lines']}")
# Convert to dataset (text mode)
dataset = converter.convert(export_mode='text')
print(f"Created dataset with {len(dataset)} examples")
# Convert to different modes
region_dataset = converter.convert(export_mode='region')
line_dataset = converter.convert(export_mode='line')
xml_dataset = converter.convert(export_mode='raw_xml')
# NEW: Window mode with different configurations
window_2_dataset = converter.convert(export_mode='window', window_size=2, overlap=0)
window_3_overlap_dataset = converter.convert(export_mode='window', window_size=3, overlap=1)
window_4_dataset = converter.convert(export_mode='window', window_size=4, overlap=2)
print(f"2-line windows: {len(window_2_dataset)} examples")
print(f"3-line windows (1 overlap): {len(window_3_overlap_dataset)} examples")
print(f"4-line windows (2 overlap): {len(window_4_dataset)} examples")
# Upload to HuggingFace Hub
repo_url = converter.upload_to_hub(
    dataset=window_3_overlap_dataset,
    repo_id="wjbmattingly/my-transkribus-windows",
    private=False
)
print(f"Dataset uploaded: {repo_url}")
# Convert and upload in one step
repo_url = converter.convert_and_upload(
    repo_id="wjbmattingly/my-transkribus-dataset",
    export_mode="window",
    window_size=2,
    overlap=1,
    private=False
)
Transkribus ZIP Structure
The package expects Transkribus ZIP files with the following structure:
transkribus_export.zip
├── project1/
│   ├── image1.jpg
│   ├── image2.jpg
│   └── page/
│       ├── image1.xml
│       └── image2.xml
├── project2/
│   ├── image3.jpg
│   └── page/
│       └── image3.xml
└── ...
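A quick way to check that an export follows this layout is to pair every image with the XML file of the same name in the sibling `page/` folder. The sketch below does this with the standard `zipfile` module on a small in-memory archive. The helper name and the set of accepted image extensions are assumptions; the package's own discovery logic may differ.

```python
import io
import zipfile
from pathlib import PurePosixPath

# Assumed image extensions; the layout above only shows .jpg files.
IMAGE_SUFFIXES = {".jpg", ".jpeg", ".png", ".tif", ".tiff"}


def pair_pages(zip_file):
    """List images that have a matching page/<stem>.xml next to them."""
    names = set(zip_file.namelist())
    pairs = []
    for name in sorted(names):
        path = PurePosixPath(name)
        if path.suffix.lower() not in IMAGE_SUFFIXES:
            continue
        xml_name = str(path.parent / "page" / (path.stem + ".xml"))
        if xml_name in names:
            pairs.append({"project": path.parent.name, "image": name, "xml": xml_name})
    return pairs


# Build a tiny archive in memory that mimics the expected structure.
buffer = io.BytesIO()
with zipfile.ZipFile(buffer, "w") as archive:
    archive.writestr("project1/image1.jpg", b"")
    archive.writestr("project1/page/image1.xml", "<PcGts/>")
    archive.writestr("project1/image2.jpg", b"")  # no XML, so it is skipped

with zipfile.ZipFile(buffer) as archive:
    pairs = pair_pages(archive)

print(pairs)
```

Images without a matching XML file are skipped here, which makes incomplete exports easy to spot by comparing the pair count with the image count.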
Window Mode Use Cases
The window mode is particularly useful for:
- Data Augmentation: Generate more training examples from existing data
- Multi-line Text Recognition: Train models to recognize multiple lines at once
- Reading Order Training: Train models to understand line sequences
- Flexible Context: Adjust context size (1-4+ lines) based on your needs
- Overlapping Context: Create overlapping examples for better generalization
Authentication
To upload datasets to HuggingFace Hub, you need to authenticate:
- Set the environment variable: export HF_TOKEN=your_token_here
- Or pass the token directly: --token your_token_here
- Or use huggingface-cli login
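When scripting uploads, it can help to decide up front which of these sources supplies the token. The helper below is a hypothetical sketch: the precedence it applies (explicit token first, then `HF_TOKEN`, otherwise `None`) is an assumption and is not documented behaviour of this package. A `None` result is meant to signal "rely on the credentials saved by `huggingface-cli login`".

```python
import os


def resolve_token(explicit_token=None, environ=None):
    """Pick a token: explicit argument first, then HF_TOKEN, else None."""
    environ = os.environ if environ is None else environ
    if explicit_token:
        return explicit_token
    # An empty HF_TOKEN is treated the same as an unset one.
    return environ.get("HF_TOKEN") or None


token = resolve_token(environ={"HF_TOKEN": "your_token_here"})
print("token found" if token else "falling back to saved login")
```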
Requirements
- Python ≥ 3.8
- datasets ≥ 2.0.0
- huggingface_hub ≥ 0.15.0
- Pillow ≥ 9.0.0
- lxml ≥ 4.6.0
- numpy ≥ 1.21.0
- tqdm ≥ 4.62.0
License
MIT License - see LICENSE file for details.
Contributing
Contributions are welcome! Please feel free to submit a Pull Request.