Dorsal is a local-first metadata generation and management toolkit.

Dorsal

A local-first file metadata generation and management toolkit.

Dorsal is an extensible, local-first framework and command-line tool for generating, validating, and managing structured file metadata.

Dorsal provides configurable extraction and annotation pipelines for files.

Dorsal is...

  • Local First: Metadata extraction happens locally, not in the cloud. Use the CLI or Python API to run the built-in extraction models or incorporate your own.
  • Strictly Validated: All annotations are automatically checked against strict JSON Schemas and Pydantic models, ensuring predictability and easy downstream integration.
  • Batteries Included: Works with any file type, with out-of-the-box core metadata extraction for many common formats, including PDFs, Office documents, and media files.
  • Extensible: Support your own file types and metadata annotation needs. Integrate your own models easily.

Installation

Dorsal is available on PyPI as dorsalhub.

pip install dorsalhub
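
To confirm from Python that the install succeeded, the standard library can report the installed version of a distribution. This is a minimal sketch using only `importlib.metadata`; the helper name `installed_version` is made up for illustration and is not part of Dorsal.

```python
from importlib import metadata
from typing import Optional


def installed_version(distribution: str = "dorsalhub") -> Optional[str]:
    """Return the installed version of a distribution, or None if it is absent."""
    try:
        return metadata.version(distribution)
    except metadata.PackageNotFoundError:
        return None


if __name__ == "__main__":
    print(installed_version() or "dorsalhub is not installed")
```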

Authentication

To sync metadata records with DorsalHub, authenticate with an API Key (generate one on your DorsalHub settings page).

dorsal auth login

Alternatively, set the DORSAL_API_KEY environment variable.
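
A script that depends on the environment variable may want to fail early when it is missing. The sketch below only reads `DORSAL_API_KEY` from an environment mapping; how Dorsal itself resolves the key, and how it ranks the variable against `dorsal auth login`, is not described here. The helper name `resolve_api_key` is hypothetical.

```python
import os
from typing import Mapping, Optional


def resolve_api_key(env: Optional[Mapping[str, str]] = None) -> Optional[str]:
    """Return the DORSAL_API_KEY value from the environment, or None if unset or blank."""
    env = os.environ if env is None else env
    key = env.get("DORSAL_API_KEY", "").strip()
    return key or None


if __name__ == "__main__":
    if resolve_api_key() is None:
        print("DORSAL_API_KEY is not set; run `dorsal auth login` or export the variable.")
```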


CLI Usage

1. Scan a File

Generate a metadata record for a file using the default extraction pipeline.

dorsal file scan "docs/PDFSPEC.pdf"

Output:

📄 Scanning metadata for PDFSPEC.pdf
╭───────────────────────────────── File Record: PDFSPEC.pdf ─────────────────────────────────╮
│                                                                                            │
│    Hashes                                                                                  │
│       SHA-256:  3383fb2ab568ca7019834d438f9a14b9d2ccaa2f37f319373848350005779368           │
│        BLAKE3:  9abdfb32750a278d5ca550b876e94a72cd8eec82d0e506a127dfb94bd56ca4b2           │
│          TLSH:  T13465D67BB4C61D6DF893CA46571C579B8B0D71533BAEA58604BDAF0AC6338029AC3F41   │
│                                                                                            │
│    File Info                                                                               │
│     Full Path:  /mnt/c/testdata/PDFSPEC.pdf                                                │
│      Modified:  2025-04-09 15:09:05                                                        │
│          Name:  PDFSPEC.pdf                                                                │
│          Size:  1 MiB                                                                      │
│    Media Type:  application/pdf                                                            │
│                                                                                            │
│    Tags                                                                                    │
│        No tags found.                                                                      │
│                                                                                            │
│    Pdf Info                                                                                │
│            author:  Tim Bienz, Richard Cohn, James R. Meehan                               │
│             title:  Portable Document Format Reference Manual (v 1.2)                      │
│           creator:  FrameMaker 5.1.1                                                       │
│          producer:  Acrobat Distiller 3.0 for Power Macintosh                              │
│           subject:  Description of the PDF file format                                     │
│          keywords:  Acrobat PDF                                                            │
│           version:  1.2                                                                    │
│        page_count:  394                                                                    │
│     creation_date:  1996-11-12T03:08:43                                                    │
│     modified_date:  1996-11-12T07:58:15                                                    │
│                                                                                            │
│                                                                                            │
╰────────────────────────────────────────────────────────────────────────────────────────────╯
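
The SHA-256 value in the Hashes section can be cross-checked independently with Python's standard library. The sketch below streams a file through `hashlib.sha256`; it is not Dorsal's own hashing code, and the helper name `sha256_of_file` is chosen for illustration. BLAKE3 and TLSH are not in the standard library and are left out.

```python
import hashlib


def sha256_of_file(path: str, chunk_size: int = 1024 * 1024) -> str:
    """Return the hex SHA-256 digest of a file, reading it in fixed-size chunks."""
    digest = hashlib.sha256()
    with open(path, "rb") as handle:
        # iter() with a sentinel keeps reading until read() returns b"" at EOF
        for chunk in iter(lambda: handle.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()
```

Calling `sha256_of_file("docs/PDFSPEC.pdf")` on the same file should reproduce the SHA-256 line of the scan output.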

2. Push Metadata

Sync the metadata record to DorsalHub. By default, this creates a private record visible only to you.

dorsal file push "docs/PDFSPEC.pdf"

3. Run Annotation Models

Annotation Models are plug-and-play packages for Dorsal that perform file extraction, annotation, or conversion.

Explore the models available on dorsalhub.com or follow a tutorial to build your own.

You can install and run models directly from the command line:

dorsal install dorsalhub/pdf-extractor

You can also export to any format supported by Dorsal Adapters:

$ dorsal run dorsalhub/whisper /home/video/test.mkv --export=srt
1
00:00:01,970 --> 00:00:05,970
You might be wondering how I ended up in this situation.

2
00:00:05,970 --> 00:00:08,970
Yeah that's me. A young subtitle.

3
00:00:08,970 --> 00:00:18,590
Little did I know what life had in store for me.


Outputs saved successfully:
  ↳ /home/user/sandbox/test.dorsal.json
  ↳ /home/user/sandbox/test.srt
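
For downstream processing, the exported SRT text is easy to read back into structured cues. The following is a rough sketch of a small SRT reader written with the standard library only; it is not the Dorsal Adapters parser, and the names `Cue` and `parse_srt` are hypothetical. The sample text is the transcript shown above.

```python
import re
from typing import List, NamedTuple


class Cue(NamedTuple):
    index: int
    start_ms: int
    end_ms: int
    text: str


# Matches an SRT timing line such as "00:00:01,970 --> 00:00:05,970"
_TIMING = re.compile(
    r"(\d{2}):(\d{2}):(\d{2}),(\d{3}) --> (\d{2}):(\d{2}):(\d{2}),(\d{3})"
)


def _to_ms(hours: str, minutes: str, seconds: str, millis: str) -> int:
    """Convert the four timestamp fields to integer milliseconds."""
    return ((int(hours) * 60 + int(minutes)) * 60 + int(seconds)) * 1000 + int(millis)


def parse_srt(text: str) -> List[Cue]:
    """Parse SRT text into cues, skipping blocks that are not well formed."""
    cues = []
    for block in re.split(r"\n\s*\n", text.strip()):
        lines = block.strip().splitlines()
        if len(lines) < 3 or not lines[0].strip().isdigit():
            continue
        match = _TIMING.fullmatch(lines[1].strip())
        if match is None:
            continue
        parts = match.groups()
        cues.append(
            Cue(int(lines[0]), _to_ms(*parts[:4]), _to_ms(*parts[4:]), "\n".join(lines[2:]))
        )
    return cues


SAMPLE = """1
00:00:01,970 --> 00:00:05,970
You might be wondering how I ended up in this situation.

2
00:00:05,970 --> 00:00:08,970
Yeah that's me. A young subtitle.

3
00:00:08,970 --> 00:00:18,590
Little did I know what life had in store for me.
"""

if __name__ == "__main__":
    for cue in parse_srt(SAMPLE):
        print(cue.index, cue.start_ms, cue.end_ms, cue.text)
```

Timestamps are kept as integer milliseconds to avoid floating-point rounding when cues are compared or shifted.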

4. Parse, Validate, and Export

Dorsal has two companion libraries to handle data structure and interoperability:

  • Open Validation Schemas: Dorsal annotations are strictly validated against these versioned, source-agnostic JSON schemas (e.g., open/classification, open/document-extraction). This ensures predictable outputs.

  • Dorsal Adapters: A bundled utility that converts between strictly validated JSON records and standard file formats.

Example: Parse a standard file into a validated JSON record:

$ dorsal adapter parse OSR_uk_000_0020_8k.srt audio-transcription

Example: List available export formats for a schema:

$ dorsal adapter list open/document-extraction

Supported Export Formats

You can currently export validated records into the following formats:

Document Extraction (open/document-extraction):

  • Markdown (.md)
  • HTML (.html)
  • hOCR (.hocr.html)
  • TSV (.tsv)
  • Plain Text (.txt)

Audio Transcription (open/audio-transcription):

  • SRT (.srt)
  • WebVTT (.vtt)
  • Markdown (.md)
  • TSV (.tsv)
  • Plain Text (.txt)

Citation / Reference (dorsal/arxiv):

  • BibTeX (.bib)
  • CSL-JSON (.json)
  • RIS (.ris)
  • Markdown (.md)
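
The lists above can be kept as a small lookup table when a script needs to decide which schemas can produce a given file type. This sketch is a hand-copied snapshot of the lists in this section, not something queried from Dorsal; `dorsal adapter list` remains the authoritative source. The names `EXPORT_FORMATS` and `schemas_exporting_to` are illustrative.

```python
from typing import Dict, List, Tuple

# Snapshot of the export formats listed in this README, keyed by schema ID
EXPORT_FORMATS: Dict[str, Tuple[str, ...]] = {
    "open/document-extraction": (".md", ".html", ".hocr.html", ".tsv", ".txt"),
    "open/audio-transcription": (".srt", ".vtt", ".md", ".tsv", ".txt"),
    "dorsal/arxiv": (".bib", ".json", ".ris", ".md"),
}


def schemas_exporting_to(extension: str) -> List[str]:
    """Return the schema IDs whose listed export formats include the extension."""
    return [schema for schema, exts in EXPORT_FORMATS.items() if extension in exts]


if __name__ == "__main__":
    print(schemas_exporting_to(".md"))
```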

Python API

The LocalFile class runs the extraction pipeline on a specific file path.

1. Access Extracted Data

from dorsal import LocalFile

# 1. Initialize (runs the pipeline)
lf = LocalFile("docs/PDFSPEC.pdf")

# 2. Access base attributes
print(f"Hash: {lf.hash}")
print(f"Type: {lf.media_type}")

# 3. Access format-specific attributes (if available)
if lf.pdf:
    print(f"Pages: {lf.pdf.page_count}")
    print(f"Title: {lf.pdf.title}")

2. Add Tags & Annotations

# Add a simple key-value tag
lf.add_private_tag(name="project_id", value=12345)

# Add a structured annotation (validates against the 'open/classification' schema)
lf.add_classification(
    labels=[{"label": "urgent", "score": 1.0}],
    vocabulary=["urgent", "review"],
    private=True
)

# Sync the enriched record to DorsalHub
lf.push()
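
Before building a classification annotation, a caller may want to confirm that every label is drawn from the vocabulary it declares. The check below is only a plausible pre-flight test written for illustration; it is an assumption, not a statement of what the 'open/classification' schema enforces, and `labels_outside_vocabulary` is a hypothetical helper.

```python
from typing import Dict, List, Sequence


def labels_outside_vocabulary(
    labels: Sequence[Dict[str, object]], vocabulary: Sequence[str]
) -> List[str]:
    """Return the label names that do not appear in the declared vocabulary."""
    allowed = set(vocabulary)
    return [str(item.get("label")) for item in labels if item.get("label") not in allowed]


if __name__ == "__main__":
    stray = labels_outside_vocabulary(
        labels=[{"label": "urgent", "score": 1.0}],
        vocabulary=["urgent", "review"],
    )
    print(stray)
```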

3. Batch Reporting

Generate self-contained HTML dashboards for local directories.

from dorsal.api import generate_html_directory_report

generate_html_directory_report(
    dir_path="./projects",
    output_path="storage_audit.html",
    recursive=True
)
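
What the generated dashboard contains is not described here. As a loose illustration of the kind of per-directory aggregation a storage audit involves, the sketch below counts files and bytes per extension with the standard library only. It does not use Dorsal and does not reproduce the report; `summarize_directory` is a hypothetical name.

```python
from collections import defaultdict
from pathlib import Path
from typing import Dict


def summarize_directory(dir_path: str, recursive: bool = True) -> Dict[str, Dict[str, int]]:
    """Count files and total bytes per lowercase file extension under dir_path."""
    root = Path(dir_path)
    paths = root.rglob("*") if recursive else root.glob("*")
    summary: Dict[str, Dict[str, int]] = defaultdict(lambda: {"files": 0, "bytes": 0})
    for path in paths:
        if path.is_file():
            # Files without an extension are grouped under "(none)"
            key = path.suffix.lower() or "(none)"
            summary[key]["files"] += 1
            summary[key]["bytes"] += path.stat().st_size
    return dict(summary)
```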

Custom Annotation Models

You can extend Dorsal by adding custom Annotation Models to the extraction pipeline. These are Python classes that define extraction logic and the output schema.

Example: A "Hello Word" Model

This toy model finds the five most common words in a text file.

from collections import Counter
from dorsal import AnnotationModel
from dorsal.testing import run_model
from dorsal.file.helpers import build_generic_record

class HelloWord(AnnotationModel):
    def main(self):
        # Read the whole file and split it into whitespace-separated words
        with open(self.file_path, 'r') as f:
            words = f.read().split()

        # Map rank ("1" to "5") to word, most common first
        data = {str(i+1): v[0] for i, v in enumerate(Counter(words).most_common(5))}

        return build_generic_record(
            description="Top 5 most common words",
            data=data
        )

# Validate the model
result = run_model(
    annotation_model=HelloWord,
    file_path="./path/to/test/file.txt",
    schema_id="open/generic"
)

assert not result.error
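
To see what the `data` dictionary looks like without installing Dorsal, the ranking step can be run on its own. This is a standalone sketch of the same counting logic; `top_words` is a name introduced here for illustration.

```python
from collections import Counter
from typing import Dict


def top_words(text: str, n: int = 5) -> Dict[str, str]:
    """Map rank ("1", "2", ...) to the n most common whitespace-separated words."""
    ranked = Counter(text.split()).most_common(n)
    return {str(rank): word for rank, (word, _count) in enumerate(ranked, start=1)}


if __name__ == "__main__":
    print(top_words("a a a b b c"))
```

Splitting on whitespace is case-sensitive and leaves punctuation attached, so "Word", "word", and "word." are counted separately.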

You can add it to Dorsal's local file metadata extraction pipeline:

from dorsal.api import register_model
from helloword import HelloWord

# Add the model to your pipeline
register_model(
    annotation_model=HelloWord,
    schema_id="open/generic"
)

Now, each time you run dorsal file scan or LocalFile(), this model will execute automatically.


License

Dorsal is open source and provided under the Apache 2.0 license.

