Convert Transkribus ZIP files to HuggingFace datasets

Project description

Transkribus-HF

Convert Transkribus ZIP files to HuggingFace datasets with ease.

Overview

transkribus-hf is a Python package that converts Transkribus export ZIP files into HuggingFace datasets. It supports multiple export formats and can automatically upload datasets to the HuggingFace Hub.

Features

  • Multiple Export Modes: Convert your Transkribus data to five dataset formats (raw_xml, text, region, line, window)
  • Automatic Upload: Direct integration with HuggingFace Hub
  • Region & Line Extraction: Extract individual text regions and lines as separate images
  • Windowed Extraction: Create sliding windows of multiple lines for data augmentation
  • Preserves Metadata: Maintains reading order, region types, and other important metadata
  • Command Line Interface: Easy-to-use CLI for batch processing

Installation

pip install transkribus-hf

Or install from source:

git clone https://github.com/wjbmattingly/transkribus-hf.git
cd transkribus-hf
pip install -e .

Export Modes

1. Raw XML (raw_xml)

Exports the original image with the complete PAGE XML content.

Fields:

  • image: Original page image
  • xml: Complete PAGE XML content
  • filename: Original image filename
  • project: Project name
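
The xml field holds a full PAGE XML document, so downstream code can parse it directly. The following is a minimal sketch using only the standard library, not the package's own parser. The helper name extract_lines and the sample document are illustrative assumptions; the element names (TextRegion, TextLine, TextEquiv, Unicode) come from the PAGE schema.

```python
import xml.etree.ElementTree as ET


def extract_lines(page_xml: str):
    """Return (region_id, line_id, text) tuples from a PAGE XML string.

    Hypothetical helper for illustration; not part of transkribus-hf.
    """
    root = ET.fromstring(page_xml)
    # Read the namespace from the root tag instead of hard-coding a schema version.
    ns_uri = root.tag[1:].split("}")[0] if root.tag.startswith("{") else ""
    ns = {"p": ns_uri} if ns_uri else {}
    prefix = "p:" if ns_uri else ""

    rows = []
    for region in root.iterfind(f".//{prefix}TextRegion", ns):
        for line in region.iterfind(f"{prefix}TextLine", ns):
            # Only the line-level transcription, not word-level TextEquiv elements.
            unicode_el = line.find(f"{prefix}TextEquiv/{prefix}Unicode", ns)
            text = unicode_el.text if unicode_el is not None and unicode_el.text else ""
            rows.append((region.get("id"), line.get("id"), text))
    return rows


# Made-up sample document for demonstration only.
sample = """<PcGts xmlns="http://schema.primaresearch.org/PAGE/gts/pagecontent/2013-07-15">
  <Page imageFilename="image1.jpg" imageWidth="100" imageHeight="100">
    <TextRegion id="r1">
      <TextLine id="r1l1"><TextEquiv><Unicode>First line</Unicode></TextEquiv></TextLine>
      <TextLine id="r1l2"><TextEquiv><Unicode>Second line</Unicode></TextEquiv></TextLine>
    </TextRegion>
  </Page>
</PcGts>"""

lines = extract_lines(sample)
print(lines)
```

With a real dataset, the same helper could be applied to the xml field of each row.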

2. Text (text) - Default

Exports the image with concatenated text from all regions.

Fields:

  • image: Original page image
  • text: Full text content (all regions combined)
  • filename: Original image filename
  • project: Project name

3. Region (region)

Exports each text region as a separate cropped image.

Fields:

  • image: Cropped region image
  • text: Region text content
  • region_type: Type of region (e.g., "paragraph")
  • region_id: Unique region identifier
  • reading_order: Reading order of the region
  • filename: Original image filename
  • project: Project name

4. Line (line)

Exports each text line as a separate cropped image.

Fields:

  • image: Cropped line image
  • text: Line text content
  • line_id: Unique line identifier
  • line_reading_order: Reading order within the region
  • region_id: Parent region identifier
  • region_reading_order: Reading order of parent region
  • region_type: Type of parent region
  • filename: Original image filename
  • project: Project name
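
Because every line row carries both reading-order fields, page-level text can be rebuilt from a line-mode dataset. The sketch below works on plain dictionaries that mimic the fields listed above. The helper name rebuild_page_text and the sample rows are illustrative, and the code assumes the reading-order values are integers that sort naturally.

```python
from collections import defaultdict


def rebuild_page_text(rows):
    """Group line rows by page and join their text in reading order.

    Hypothetical helper; assumes integer reading-order values.
    """
    pages = defaultdict(list)
    for row in rows:
        pages[(row["project"], row["filename"])].append(row)

    result = {}
    for key, page_rows in pages.items():
        # Sort by region first, then by line position within the region.
        ordered = sorted(
            page_rows,
            key=lambda r: (r["region_reading_order"], r["line_reading_order"]),
        )
        result[key] = "\n".join(r["text"] for r in ordered)
    return result


# Made-up rows, deliberately out of order.
rows = [
    {"project": "project1", "filename": "image1.jpg", "text": "second",
     "region_reading_order": 0, "line_reading_order": 1},
    {"project": "project1", "filename": "image1.jpg", "text": "third",
     "region_reading_order": 1, "line_reading_order": 0},
    {"project": "project1", "filename": "image1.jpg", "text": "first",
     "region_reading_order": 0, "line_reading_order": 0},
]

page_text = rebuild_page_text(rows)
print(page_text[("project1", "image1.jpg")])
```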

5. Window (window) - NEW!

Exports sliding windows of multiple text lines, perfect for data augmentation and multi-line text recognition training.

Configuration:

  • window_size: Number of lines per window (1, 2, 3, 4, etc.)
  • overlap: Number of lines to overlap between windows (0 = no overlap)

Fields:

  • image: Cropped window image (bounding box of all lines in window)
  • text: Combined text from all lines in window (newline separated)
  • window_size: Actual number of lines in this window
  • window_index: Index of this window within the region
  • line_ids: Comma-separated list of line IDs in this window
  • line_reading_orders: Comma-separated list of line reading orders
  • region_id: Parent region identifier
  • region_reading_order: Reading order of parent region
  • region_type: Type of parent region
  • filename: Original image filename
  • project: Project name
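
The window image is described as the bounding box of all lines in the window. One way to compute such a box, sketched here under the assumption that each line's outline comes from the points attribute of a PAGE XML Coords element (a space-separated list of x,y pairs), is to take the minimum and maximum over every vertex. The function names are illustrative and this is not necessarily how the package implements cropping.

```python
def parse_points(points: str):
    """Parse a PAGE XML points string such as "10,20 110,20" into (x, y) tuples."""
    return [tuple(int(float(v)) for v in pair.split(",")) for pair in points.split()]


def union_bbox(polygons):
    """Return (left, upper, right, lower) enclosing every vertex of every polygon."""
    xs = [x for poly in polygons for x, _ in poly]
    ys = [y for poly in polygons for _, y in poly]
    if not xs:
        raise ValueError("at least one polygon with one point is required")
    return min(xs), min(ys), max(xs), max(ys)


# Two made-up line outlines stacked vertically.
line_polygons = [
    parse_points("10,20 110,20 110,40 10,40"),
    parse_points("12,45 120,45 120,66 12,66"),
]

box = union_bbox(line_polygons)
print(box)
```

The resulting tuple uses the same (left, upper, right, lower) order that Pillow's Image.crop accepts.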

Examples:

  • window_size=1, overlap=0: Same as line mode
  • window_size=2, overlap=0: Non-overlapping pairs of lines
  • window_size=3, overlap=1: 3-line windows with 1-line overlap (lines 1-3, 3-5, 5-7, etc.)
  • window_size=4, overlap=2: 4-line windows with 2-line overlap (lines 1-4, 3-6, 5-8, etc.)
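
The relationship between window_size and overlap can be sketched as a stride of window_size minus overlap. This is an assumption about the windowing rule, inferred from the parameter descriptions, and the function name window_ranges is illustrative. The sketch also assumes that a shorter final window is kept when the lines do not divide evenly, which the "actual number of lines" wording of the window_size field suggests but does not confirm.

```python
def window_ranges(n_lines: int, window_size: int, overlap: int = 0):
    """Return 1-based inclusive (first_line, last_line) ranges for each window.

    Assumed rule: consecutive windows start (window_size - overlap) lines apart.
    """
    if window_size < 1:
        raise ValueError("window_size must be at least 1")
    if not 0 <= overlap < window_size:
        raise ValueError("overlap must be non-negative and smaller than window_size")

    step = window_size - overlap
    ranges = []
    start = 0
    while start < n_lines:
        end = min(start + window_size, n_lines)
        ranges.append((start + 1, end))
        if end == n_lines:
            break  # the last line is covered; avoid emitting a redundant tail window
        start += step
    return ranges


print(window_ranges(8, window_size=4, overlap=2))  # [(1, 4), (3, 6), (5, 8)]
print(window_ranges(5, window_size=2, overlap=0))  # [(1, 2), (3, 4), (5, 5)]
```

Under these assumptions, a region of 5 lines with non-overlapping pairs yields a final window containing a single line.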

Usage

Command Line Interface

# Basic usage - convert and upload to HuggingFace Hub
transkribus-hf path/to/your/transkribus.zip --repo-id username/dataset-name

# Specify export mode
transkribus-hf path/to/your/transkribus.zip --repo-id username/dataset-name --mode region

# Window mode with 3 lines per window, 1 line overlap
transkribus-hf path/to/your/transkribus.zip --repo-id username/dataset-name --mode window --window-size 3 --overlap 1

# Convert to local directory only
transkribus-hf path/to/your/transkribus.zip --local-only --output-dir ./my_dataset

# View statistics only (including window estimates)
transkribus-hf path/to/your/transkribus.zip --stats-only --mode window --window-size 2

# Create private repository
transkribus-hf path/to/your/transkribus.zip --repo-id username/dataset-name --private

# Use custom HuggingFace token
transkribus-hf path/to/your/transkribus.zip --repo-id username/dataset-name --token your_token_here

Python API

from transkribus_hf import TranskribusConverter

# Initialize converter
converter = TranskribusConverter("path/to/your/transkribus.zip")

# Get statistics
stats = converter.get_stats()
print(f"Total pages: {stats['total_pages']}")
print(f"Total regions: {stats['total_regions']}")
print(f"Total lines: {stats['total_lines']}")

# Convert to dataset (text mode)
dataset = converter.convert(export_mode='text')
print(f"Created dataset with {len(dataset)} examples")

# Convert to different modes
region_dataset = converter.convert(export_mode='region')
line_dataset = converter.convert(export_mode='line')
xml_dataset = converter.convert(export_mode='raw_xml')

# NEW: Window mode with different configurations
window_2_dataset = converter.convert(export_mode='window', window_size=2, overlap=0)
window_3_overlap_dataset = converter.convert(export_mode='window', window_size=3, overlap=1)
window_4_dataset = converter.convert(export_mode='window', window_size=4, overlap=2)

print(f"2-line windows: {len(window_2_dataset)} examples")
print(f"3-line windows (1 overlap): {len(window_3_overlap_dataset)} examples")
print(f"4-line windows (2 overlap): {len(window_4_dataset)} examples")

# Upload to HuggingFace Hub
repo_url = converter.upload_to_hub(
    dataset=window_3_overlap_dataset,
    repo_id="wjbmattingly/my-transkribus-windows",
    private=False
)
print(f"Dataset uploaded: {repo_url}")

# Convert and upload in one step
repo_url = converter.convert_and_upload(
    repo_id="wjbmattingly/my-transkribus-dataset",
    export_mode="window",
    window_size=2,
    overlap=1,
    private=False
)

Transkribus ZIP Structure

The package expects Transkribus ZIP files with the following structure:

transkribus_export.zip
├── project1/
│   ├── image1.jpg
│   ├── image2.jpg
│   └── page/
│       ├── image1.xml
│       └── image2.xml
├── project2/
│   ├── image3.jpg
│   └── page/
│       └── image3.xml
└── ...
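
Before converting, it can be useful to check that an export follows this layout. The sketch below pairs each image with the PAGE XML file of the same stem in the sibling page/ folder, using only the standard library. The helper name pair_pages and the set of image extensions are assumptions made for illustration; the package may accept other formats.

```python
import io
import zipfile
from pathlib import PurePosixPath

# Assumed image extensions; adjust to match your export.
IMAGE_SUFFIXES = {".jpg", ".jpeg", ".png", ".tif", ".tiff"}


def pair_pages(zf: zipfile.ZipFile):
    """Return (project, image_path, xml_path) for every image with a matching XML."""
    names = set(zf.namelist())
    pairs = []
    for name in sorted(names):
        path = PurePosixPath(name)
        if path.suffix.lower() not in IMAGE_SUFFIXES:
            continue
        # The expected XML lives at <project>/page/<image stem>.xml
        xml_name = str(path.parent / "page" / (path.stem + ".xml"))
        if xml_name in names:
            pairs.append((path.parent.name, name, xml_name))
    return pairs


# Build a tiny in-memory archive to demonstrate; file contents are placeholders.
buffer = io.BytesIO()
with zipfile.ZipFile(buffer, "w") as zf:
    zf.writestr("project1/image1.jpg", b"")
    zf.writestr("project1/page/image1.xml", "<PcGts/>")
    zf.writestr("project2/image3.jpg", b"")  # no XML, so it is skipped

with zipfile.ZipFile(buffer) as zf:
    pairs = pair_pages(zf)

print(pairs)
```

Images without a matching XML file are skipped here, which makes incomplete exports easy to spot by comparing the number of pairs with the number of images.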

Window Mode Use Cases

The window mode is particularly useful for:

  1. Data Augmentation: Generate more training examples from existing data
  2. Multi-line Text Recognition: Train models to recognize multiple lines at once
  3. Reading Order Training: Train models to understand line sequences
  4. Flexible Context: Adjust context size (1-4+ lines) based on your needs
  5. Overlapping Context: Create overlapping examples for better generalization

Authentication

To upload datasets to the HuggingFace Hub, authenticate in one of three ways:

  1. Set the HF_TOKEN environment variable: export HF_TOKEN=your_token_here
  2. Pass the token on the command line: --token your_token_here
  3. Log in once with huggingface-cli login

Requirements

  • Python ≥ 3.8
  • datasets ≥ 2.0.0
  • huggingface_hub ≥ 0.15.0
  • Pillow ≥ 9.0.0
  • lxml ≥ 4.6.0
  • numpy ≥ 1.21.0
  • tqdm ≥ 4.62.0

License

MIT License - see LICENSE file for details.

Contributing

Contributions are welcome! Please feel free to submit a Pull Request.
