Export validated JSON to standard formats.

Dorsal

Dorsal Adapters translates validated JSON records into various industry-standard formats.

Supported Formats

Document Extraction (open/document-extraction)

  • md: RAG-Optimized Markdown — Injects semantic headings, hallucination warnings, and visual placeholders directly into the text stream for LLM consumption.
  • html: Semantic HTML (.html) — Renders a responsive, visually inferred 2D DOM layout from raw spatial coordinates.
  • hocr: hOCR (.hocr.html) — An industry-standard OCR output format embedding layout, confidence scores, and style information in standard HTML.
  • tsv: Tab-Separated Values — Perfect for spreadsheet ingestion and tabular data analysis.
  • txt: Plain Text — Flattens the document layout into clean, stitched paragraphs.

Audio Transcription (open/audio-transcription)

  • srt: SubRip Text (.srt) — The most widely used plaintext subtitle format.
  • vtt: WebVTT (.vtt) — The W3C standard web subtitle format for HTML5 video players.
  • md: RAG-Optimized Markdown — Merges speaker tags, non-verbal events (e.g., [laughter]), and low-confidence warnings into clean markdown.
  • tsv: Tab-Separated Values — Organizes segments, start/end times, and speakers into a neat table.
  • txt: Plain Text — A continuous, readable transcript.
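As a rough illustration of the tsv layout described above, the sketch below writes transcription segments to tab-separated text with Python's standard csv module. It is not the adapter's implementation: the column order and the speaker key are assumptions made for this example, and only start_time, end_time, and text appear in the record shown under Standalone Usage.

```python
import csv
import io


def segments_to_tsv(segments):
    """Render segments as TSV text (hypothetical column layout)."""
    buffer = io.StringIO()
    writer = csv.writer(buffer, delimiter="\t", lineterminator="\n")
    writer.writerow(["start_time", "end_time", "speaker", "text"])
    for segment in segments:
        writer.writerow([
            segment["start_time"],
            segment["end_time"],
            segment.get("speaker", ""),  # assumed optional field
            segment["text"],
        ])
    return buffer.getvalue()


example_segments = [
    {"start_time": 0.5, "end_time": 4.75, "text": "Welcome back!"},
    {"start_time": 4.75, "end_time": 6.0, "speaker": "Guest", "text": "Thank you."},
]
tsv_text = segments_to_tsv(example_segments)
print(tsv_text)
```

Using csv.writer rather than joining strings by hand means that text containing tabs or quotes is escaped instead of silently breaking the columns.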

arXiv Metadata (dorsal/arxiv)

  • bib: BibTeX (.bib) — The bibliography format used with LaTeX.
  • csl-json: CSL-JSON (.json) — The JSON citation format used by Citation Style Language processors such as Zotero and Pandoc.
  • ris: RIS (.ris) — A tagged plain-text citation format that most reference managers can import.
  • md: RAG-Optimized Markdown — Embeds standard YAML frontmatter (ID, DOI, Categories, Year) and markdown formatting for ingestion into PKMs or RAG pipelines.
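To make the frontmatter idea concrete, here is a minimal sketch that builds a Markdown string with a YAML frontmatter block. The function name, the input keys, and the exact frontmatter layout are assumptions for illustration, not the adapter's actual output. The sample values are taken from the BibTeX example in the Usage section.

```python
def to_frontmatter_markdown(meta):
    """Build Markdown with a YAML frontmatter block (hypothetical keys)."""
    lines = ["---"]
    # Quote string values so characters such as ':' or '(' stay literal.
    lines.append('id: "' + meta["id"] + '"')
    lines.append('doi: "' + meta["doi"] + '"')
    lines.append("categories:")
    for category in meta["categories"]:
        lines.append("  - " + category)
    lines.append("year: " + str(meta["year"]))
    lines.append("---")
    lines.append("")
    lines.append("# " + meta["title"])
    return "\n".join(lines) + "\n"


markdown = to_frontmatter_markdown({
    "id": "solv-int/9304001",
    "doi": "10.1016/0375-9601(93)90705-5",
    "categories": ["solv-int"],
    "year": 1993,
    "title": "q-Discrete Toda Molecule Equation",
})
print(markdown)
```

The hand-written quoting here is deliberately naive. A value that itself contains double quotes would need a real YAML serializer.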

Installation

Dorsal Adapters is available on PyPI as dorsalhub-adapters:

pip install dorsalhub-adapters

Usage

Within Dorsal

Dorsal Adapters is a core dependency of Dorsal, so its export formats are available directly from the dorsal command line.

Example: using --export to generate a subtitle file:

$ dorsal run dorsalhub/whisper /home/video/test.mkv --export=srt
1
00:00:01,970 --> 00:00:05,970
You might be wondering how I ended up in this situation.

2
00:00:05,970 --> 00:00:08,970
Yeah that's me. A young subtitle.

3
00:00:08,970 --> 00:00:18,590
Little did I know what life had in store for me.


Outputs saved successfully:
  ↳ /home/user/sandbox/test.dorsal.json
  ↳ /home/user/sandbox/test.srt
  • --export takes the ID of any adapter available for the output schema (e.g. md for Markdown or txt for plain text).
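The cue timings in the subtitle output above use SRT's HH:MM:SS,mmm form, with a comma before the milliseconds. The following sketch shows one way to produce that form from a time in seconds. It is an illustration only: the function names are hypothetical, and it assumes times are given as seconds, which may not match how the adapter handles them internally.

```python
def format_srt_timestamp(seconds):
    """Convert a time in seconds to SRT's HH:MM:SS,mmm form."""
    # Work in whole milliseconds to avoid float drift in later steps.
    total_ms = round(seconds * 1000)
    hours, remainder = divmod(total_ms, 3_600_000)
    minutes, remainder = divmod(remainder, 60_000)
    secs, millis = divmod(remainder, 1000)
    return f"{hours:02d}:{minutes:02d}:{secs:02d},{millis:03d}"


def format_srt_cue(index, start, end, text):
    """Render one numbered cue: index, timing line, then the text."""
    timing = f"{format_srt_timestamp(start)} --> {format_srt_timestamp(end)}"
    return f"{index}\n{timing}\n{text}\n"


cue = format_srt_cue(
    1, 1.97, 5.97, "You might be wondering how I ended up in this situation."
)
print(cue)
```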

Example 2: using --export to generate a BibTeX citation from an arXiv preprint:

$ dorsal run dorsalhub/arxiv-pdf ./examples/9304001v1.pdf --export=bib
@misc{solv-int_9304001,
  title = {q-Discrete Toda Molecule Equation},
  author = {Kenji Kajiwara and Yasuhiro Ohta and Junkichi Satsuma},
  eprint = {solv-int/9304001},
  archivePrefix = {arXiv},
  primaryClass = {solv-int},
  doi = {10.1016/0375-9601(93)90705-5},
  url = {https://arxiv.org/abs/solv-int/9304001},
  year = {1993}
}
Outputs saved successfully:
  ↳ /home/user/sandbox/arxiv-pdf/9304001v1.dorsal.json
  ↳ /home/user/sandbox/arxiv-pdf/9304001v1.bib

Standalone Usage

Adapters are Python classes with methods for exporting to and parsing from the supported file formats:

  • export(record) / export_file(record, fp): Converts a JSON record into a standard format.
  • parse(content) / parse_file(fp): Best-effort conversion from a standard format back into a Dorsal JSON Record.

Example: Audio to Subtitles (SRT)

In this example, a valid open/audio-transcription record is converted into a subtitle file.

from dorsal_adapters.registry import get_adapter

# 1. The raw JSON record from your model
transcription = {
    "track_id": 1,
    "language": "eng",
    "segments": [
        {
            "start_time": 0.5,
            "end_time": 4.75,
            "text": "Welcome back! Today, my guest is the renowned chef, Jean-Pierre."
        }
    ]
}

# 2. Retrieve the adapter for the schema and target format
adapter = get_adapter("audio-transcription", "srt")

# 3. Export to the target format (.srt)
srt_string = adapter.export(transcription)
print(srt_string)

# 4. Parse the formatted string back into a Dorsal record
parsed_record = adapter.parse(srt_string)
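Parsing is best-effort because a format like SRT stores only a cue number, two timestamps, and text. Fields such as track_id and language have no place in the file, so they cannot be recovered from it. The sketch below is a simplified, self-contained parser written for illustration; it is not the library's parse implementation, and the function names are hypothetical.

```python
import re

TIMING = re.compile(
    r"(\d+):(\d{2}):(\d{2}),(\d{3}) --> (\d+):(\d{2}):(\d{2}),(\d{3})"
)


def _to_seconds(hours, minutes, seconds, millis):
    return int(hours) * 3600 + int(minutes) * 60 + int(seconds) + int(millis) / 1000


def parse_srt_segments(srt_text):
    """Recover segments from SRT text; metadata not stored in SRT is lost."""
    segments = []
    # Cues are separated by blank lines.
    for block in re.split(r"\n\s*\n", srt_text.strip()):
        lines = block.splitlines()
        if len(lines) < 2:
            continue
        match = TIMING.match(lines[1].strip())
        if match is None:
            continue  # skip malformed cues rather than fail
        groups = match.groups()
        segments.append({
            "start_time": _to_seconds(*groups[:4]),
            "end_time": _to_seconds(*groups[4:]),
            "text": "\n".join(lines[2:]),
        })
    return {"segments": segments}


sample = "1\n00:00:00,500 --> 00:00:04,750\nWelcome back!\n"
recovered = parse_srt_segments(sample)
print(recovered)
```

Skipping malformed cues instead of raising is one possible policy for best-effort parsing. A stricter tool might prefer to report them.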

Tip: To check programmatically which formats are supported for a given schema, use list_formats:

from dorsal_adapters.registry import list_formats
print(list_formats("document-extraction"))

Contributing

We welcome contributions! If you have written a translation script for an Open Validation Schema, please open a PR.

License

Dorsal Adapters is open source and provided under the Apache 2.0 license.
