Dorsal is a local-first metadata generation and management toolkit.
Dorsal is an extensible, local-first framework and command line tool for generating, validating, and managing structured file metadata.
Dorsal provides configurable extraction and annotation pipelines for files.
Dorsal is...
- Local First: Metadata extraction happens locally, not in the cloud. Use the CLI or Python API to run the built-in extraction models or incorporate your own.
- Strictly Validated: All annotations are automatically checked against strict JSON Schemas and Pydantic models, ensuring predictability and easy downstream integration.
- Batteries Included: No file-type restrictions, and out-of-the-box support for core metadata extraction from many common file types, including PDFs, Office documents, media files, and more.
- Extensible: Support your own file types and metadata annotation needs. Integrate your own models easily.
Installation
Dorsal is available on PyPI as dorsalhub.
pip install dorsalhub
Authentication
To sync metadata records with DorsalHub, authenticate with an API Key (generate one on your DorsalHub settings page).
dorsal auth login
Alternatively, set the DORSAL_API_KEY environment variable.
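For scripts that rely on the environment-variable route, a small pre-flight check can fail early when the key is missing. This is a minimal sketch using only the standard library: the helper name `require_api_key` is hypothetical and not part of Dorsal, and only the variable name DORSAL_API_KEY comes from the text above.

```python
import os


def require_api_key(env=None):
    """Return the API key from DORSAL_API_KEY, or raise if it is unset.

    NOTE: `require_api_key` is a hypothetical helper, not part of Dorsal.
    """
    # Allow a custom mapping to be passed in so the check is easy to test.
    env = os.environ if env is None else env
    key = env.get("DORSAL_API_KEY", "").strip()
    if not key:
        raise RuntimeError(
            "DORSAL_API_KEY is not set; run `dorsal auth login` or export the variable."
        )
    return key
```

Calling `require_api_key()` at the top of a script that later syncs records gives a clear error before any work is done.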
CLI Usage
1. Scan a File
Generate a metadata record for a file using the default extraction pipeline.
dorsal file scan "docs/PDFSPEC.pdf"
Output:
Scanning metadata for PDFSPEC.pdf
──────────────── File Record: PDFSPEC.pdf ────────────────

Hashes
  SHA-256: 3383fb2ab568ca7019834d438f9a14b9d2ccaa2f37f319373848350005779368
  BLAKE3: 9abdfb32750a278d5ca550b876e94a72cd8eec82d0e506a127dfb94bd56ca4b2
  TLSH: T13465D67BB4C61D6DF893CA46571C579B8B0D71533BAEA58604BDAF0AC6338029AC3F41

File Info
  Full Path: /mnt/c/testdata/PDFSPEC.pdf
  Modified: 2025-04-09 15:09:05
  Name: PDFSPEC.pdf
  Size: 1 MiB
  Media Type: application/pdf

Tags
  No tags found.

Pdf Info
  author: Tim Bienz, Richard Cohn, James R. Meehan
  title: Portable Document Format Reference Manual (v 1.2)
  creator: FrameMaker 5.1.1
  producer: Acrobat Distiller 3.0 for Power Macintosh
  subject: Description of the PDF file format
  keywords: Acrobat PDF
  version: 1.2
  page_count: 394
  creation_date: 1996-11-12T03:08:43
  modified_date: 1996-11-12T07:58:15

──────────────────────────────────────────────────────────
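The SHA-256 value reported by a scan can be cross-checked independently of Dorsal. The sketch below uses only Python's standard `hashlib`; the helper name `sha256_of` is an illustrative assumption, and nothing here reflects how Dorsal computes its hashes internally.

```python
import hashlib


def sha256_of(path, chunk_size=1024 * 1024):
    """Stream a file through SHA-256 and return the hex digest.

    NOTE: `sha256_of` is a hypothetical helper for cross-checking a record;
    it is not part of Dorsal.
    """
    digest = hashlib.sha256()
    with open(path, "rb") as f:
        # Read in fixed-size chunks so large files never load fully into memory.
        for chunk in iter(lambda: f.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()
```

Comparing `sha256_of("docs/PDFSPEC.pdf")` with the SHA-256 line of the file record is a quick way to confirm that a record describes the file you think it does.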
2. Push Metadata
Sync the metadata record to DorsalHub. By default, this creates a private record visible only to you.
dorsal file push "docs/PDFSPEC.pdf"
3. Run Annotation Models
Annotation Models are plug-and-play packages for Dorsal that perform file extraction, annotation, or conversion.
Explore the models available on dorsalhub.com or follow a tutorial to build your own.
You can install and run models directly from the command line:
dorsal install dorsalhub/pdf-extractor
You can also export to any format supported by Dorsal Adapters:
$ dorsal run dorsalhub/whisper /home/video/test.mkv --export=srt
1
00:00:01,970 --> 00:00:05,970
You might be wondering how I ended up in this situation.
2
00:00:05,970 --> 00:00:08,970
Yeah that's me. A young subtitle.
3
00:00:08,970 --> 00:00:18,590
Little did I know what life had in store for me.
Outputs saved successfully:
↳ /home/user/sandbox/test.dorsal.json
↳ /home/user/sandbox/test.srt
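An exported .srt file is plain text and easy to consume downstream. As a rough sketch, the function below parses SRT text into timed cues with the standard library only. It assumes the conventional SRT layout (index line, timing line, caption text, with a blank line between cues); `parse_srt` is a hypothetical helper and not part of Dorsal or Dorsal Adapters.

```python
import re
from datetime import timedelta

# Matches an SRT timing line such as "00:00:01,970 --> 00:00:05,970".
TIMING = re.compile(
    r"(\d{2}):(\d{2}):(\d{2}),(\d{3}) --> (\d{2}):(\d{2}):(\d{2}),(\d{3})"
)


def parse_srt(text):
    """Parse SRT text into a list of (start, end, caption) tuples.

    NOTE: hypothetical helper; assumes cues are separated by blank lines.
    """
    cues = []
    for block in re.split(r"\n\s*\n", text.strip()):
        lines = block.splitlines()
        if len(lines) < 3:
            continue  # Skip blocks without index, timing, and caption lines.
        match = TIMING.fullmatch(lines[1].strip())
        if not match:
            continue  # Skip blocks whose second line is not a timing line.
        h1, m1, s1, ms1, h2, m2, s2, ms2 = map(int, match.groups())
        start = timedelta(hours=h1, minutes=m1, seconds=s1, milliseconds=ms1)
        end = timedelta(hours=h2, minutes=m2, seconds=s2, milliseconds=ms2)
        # Multi-line captions are joined with a single space.
        cues.append((start, end, " ".join(lines[2:])))
    return cues
```

Each returned tuple holds two `timedelta` values and the caption string, so durations can be computed as `end - start`.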
4. Parse, Validate, and Export
Dorsal has two companion libraries to handle data structure and interoperability:
- Open Validation Schemas: Dorsal annotations are strictly validated against these versioned, source-agnostic JSON schemas (e.g., open/classification, open/document-extraction). This ensures predictable outputs.
- Dorsal Adapters: A bundled utility that converts between strictly validated JSON records and standard file formats.
Example: Parse a standard file into a validated JSON record:
$ dorsal adapter parse OSR_uk_000_0020_8k.srt audio-transcription
Example: List available export formats for a schema:
$ dorsal adapter list open/document-extraction
Supported Export Formats
You can currently export validated records into the following formats:
Document Extraction (open/document-extraction):
- Markdown (.md)
- HTML (.html)
- hOCR (.hocr.html)
- TSV (.tsv)
- Plain Text (.txt)
Audio Transcription (open/audio-transcription):
- SRT (.srt)
- WebVTT (.vtt)
- Markdown (.md)
- TSV (.tsv)
- Plain Text (.txt)
Citation / Reference (dorsal/arxiv):
- BibTeX (.bib)
- CSL-JSON (.json)
- RIS (.ris)
- Markdown (.md)
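For scripting, the format lists above can be kept as a simple lookup table. The sketch below is a hand-copied snapshot of this section, not something read from Dorsal; `EXPORT_FORMATS` and `schemas_supporting` are hypothetical names, and `dorsal adapter list` remains the authoritative source for what your installed version supports.

```python
# Hand-copied from the lists above; NOT loaded from Dorsal itself.
EXPORT_FORMATS = {
    "open/document-extraction": [".md", ".html", ".hocr.html", ".tsv", ".txt"],
    "open/audio-transcription": [".srt", ".vtt", ".md", ".tsv", ".txt"],
    "dorsal/arxiv": [".bib", ".json", ".ris", ".md"],
}


def schemas_supporting(extension):
    """Return the schema IDs (sorted) whose records can be exported to `extension`."""
    return sorted(
        schema for schema, exts in EXPORT_FORMATS.items() if extension in exts
    )
```

For example, `schemas_supporting(".srt")` returns only the audio-transcription schema, while `schemas_supporting(".md")` returns all three.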
Python API
The LocalFile class runs the extraction pipeline on a specific file path.
1. Access Extracted Data
from dorsal import LocalFile
# 1. Initialize (runs the pipeline)
lf = LocalFile("docs/PDFSPEC.pdf")
# 2. Access base attributes
print(f"Hash: {lf.hash}")
print(f"Type: {lf.media_type}")
# 3. Access format-specific attributes (if available)
if lf.pdf:
    print(f"Pages: {lf.pdf.page_count}")
    print(f"Title: {lf.pdf.title}")
2. Add Tags & Annotations
# Add a simple key-value tag
lf.add_private_tag(name="project_id", value=12345)
# Add a structured annotation (validates against the 'open/classification' schema)
lf.add_classification(
    labels=[{"label": "urgent", "score": 1.0}],
    vocabulary=["urgent", "review"],
    private=True,
)
# Sync the enriched record to DorsalHub
lf.push()
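Before handing labels to add_classification, it can help to sanity-check them in plain Python. The sketch below is illustrative only: the real open/classification schema is defined by Open Validation Schemas and may enforce different rules. The two constraints checked here (every label appears in the vocabulary, and any score lies between 0 and 1) are assumptions, and `check_classification` is a hypothetical helper that is not part of Dorsal.

```python
def check_classification(labels, vocabulary):
    """Return a list of problems; an empty list means the input looks consistent.

    NOTE: hypothetical pre-check. The constraints are assumptions and do not
    reproduce the actual open/classification schema.
    """
    problems = []
    allowed = set(vocabulary)
    for i, item in enumerate(labels):
        label = item.get("label")
        score = item.get("score")
        # Assumption: labels must be drawn from the declared vocabulary.
        if label not in allowed:
            problems.append(f"labels[{i}]: {label!r} is not in the vocabulary")
        # Assumption: a score, when present, is a probability-like value.
        if score is not None and not 0.0 <= score <= 1.0:
            problems.append(f"labels[{i}]: score {score} is outside [0, 1]")
    return problems
```

Dorsal's own validation still runs when the annotation is added; a check like this only surfaces obvious mistakes earlier and with friendlier messages.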
3. Batch Reporting
Generate self-contained HTML dashboards for local directories.
from dorsal.api import generate_html_directory_report
generate_html_directory_report(
    dir_path="./projects",
    output_path="storage_audit.html",
    recursive=True,
)
Custom Annotation Models
You can extend Dorsal by adding custom Annotation Models to the extraction pipeline. These are Python classes that define extraction logic and the output schema.
Example: A "Hello Word" Model
This toy model counts the top 5 words in a text file.
from collections import Counter
from dorsal import AnnotationModel
from dorsal.testing import run_model
from dorsal.file.helpers import build_generic_record
class HelloWord(AnnotationModel):
    def main(self):
        # Read the whole file and split it into whitespace-separated words
        with open(self.file_path, 'r') as f:
            words = f.read().split()
        # Map each rank ("1" to "5") to the word at that rank
        data = {str(i+1): v[0] for i, v in enumerate(Counter(words).most_common(5))}
        return build_generic_record(
            description="Top 5 most common words",
            data=data
        )

# Validate the model
result = run_model(
    annotation_model=HelloWord,
    file_path="./path/to/test/file.txt",
    schema_id="open/generic"
)
assert not result.error
You can add it to Dorsal's local file metadata extraction pipeline:
from dorsal.api import register_model
from helloword import HelloWord
# Add the model to your pipeline
register_model(
    annotation_model=HelloWord,
    schema_id="open/generic"
)
Now, each time you run dorsal file scan or LocalFile(), this model will execute automatically.
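The word-counting step inside HelloWord.main can be tried on its own, without Dorsal installed, to see the shape of the data dictionary the model produces. This is a standalone sketch of that one expression: `top_words` is a hypothetical name, and the function works on a string rather than a file path.

```python
from collections import Counter


def top_words(text, n=5):
    """Map rank ("1", "2", ...) to the n most common whitespace-separated words."""
    words = text.split()
    return {
        str(rank + 1): word
        for rank, (word, _count) in enumerate(Counter(words).most_common(n))
    }


print(top_words("a b a c b a"))  # {'1': 'a', '2': 'b', '3': 'c'}
```

Note that splitting on whitespace is case-sensitive and keeps punctuation attached, so "Word" and "word," count as different words; a real model would likely normalize the text first.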
Resources
- Documentation: Full API reference, CLI guides, and tutorials.
- DorsalHub: The hosted platform for managing your metadata.
- Issue Tracker: Report bugs or request features.
License
Dorsal is open source and provided under the Apache 2.0 license.