Extract and aggregate YAML frontmatter from Markdown files into structured JSON

Matterify

Python 3.12+ | License: MIT | Status: Alpha

Extract and aggregate YAML frontmatter from Markdown files, with optional SHA-256 hashes and file statistics.

Features

  • Recursive Markdown discovery with configurable directory exclusions
  • YAML frontmatter extraction with structured ok/illegal status reporting
  • Optional SHA-256 file hashes and file stats (size, mtime, atime)
  • Parallel scan workers for faster processing on larger vaults/projects

Quick Start

pip install matterify
matterify ./docs -o output.json

Installation

# Using uv (recommended)
uv add matterify

# Or with pip
pip install matterify

CLI Usage

matterify DIRECTORY [OPTIONS]

DIRECTORY must exist and is scanned recursively for .md and .markdown files.

Options:

  • --version - Show version information and exit
  • --debug - Enable debug logging
  • -o, --output PATH - Write JSON to PATH (default: stdout)
  • --n-procs INT - Worker process count (default: auto-detect CPU cores)
  • -v, --verbose - Show progress and summary
  • -e, --exclude TEXT - Glob pattern to exclude, e.g. **/.git or **/__pycache__ (repeatable)
  • -i, --include PATH - Additional file paths to include in scan (repeatable)
  • --hash / --no-hash - Enable/disable SHA-256 hash computation
  • --stats / --no-stats - Enable/disable file statistics (size, modified time, access time)
  • --frontmatter / --no-frontmatter - Enable/disable YAML frontmatter extraction
  • --help - Show command help and exit

When --no-frontmatter is used, metadata fields files_with_frontmatter and files_without_frontmatter are null.
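
Because those two counts can be null, code that consumes the JSON should not assume they are integers. The following is a minimal sketch, assuming the payload shape documented under JSON Output Structure; describe_counts is a hypothetical helper and not part of matterify.

```python
from __future__ import annotations

import json


def describe_counts(metadata: dict) -> str:
    """Summarise frontmatter counts, tolerating null (None) values."""
    total = metadata["total_files"]
    with_fm = metadata.get("files_with_frontmatter")
    without_fm = metadata.get("files_without_frontmatter")
    if with_fm is None or without_fm is None:
        # Assumption: null counts mean the scan ran with --no-frontmatter.
        return f"{total} files scanned (frontmatter extraction disabled)"
    return f"{total} files scanned: {with_fm} with frontmatter, {without_fm} without"


# Hand-written stand-in for the metadata a --no-frontmatter scan would emit.
payload = json.loads(
    '{"metadata": {"total_files": 10, '
    '"files_with_frontmatter": null, "files_without_frontmatter": null}}'
)
print(describe_counts(payload["metadata"]))
```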

Examples:

# Output to stdout (JSON)
matterify ./docs

# Output to file
matterify ./docs -o output.json

# Verbose output
matterify ./docs --verbose

# Disable hashes and file stats
matterify ./docs --no-hash --no-stats

# Hash + stats only (skip YAML parsing)
matterify ./docs --no-frontmatter

# Exclude directories using glob patterns
matterify ./docs -e '**/build' -e '**/.cache'

# Include additional files (any extension)
matterify ./docs -i notes.txt -i ../shared/changelog.txt

# Full help
matterify --help

Python API

Public Functions

from matterify import scan_directory

scan_directory

Scans a directory recursively and aggregates frontmatter using parallel workers. Returns a ScanResults dataclass.

from pathlib import Path
from matterify import scan_directory

result = scan_directory(Path("./docs"))

# ScanResults contains:
# - result.metadata: ScanMetadata with scan statistics
# - result.files: list of file entries with extraction results

# Access metadata
print(result.metadata.total_files)
print(result.metadata.files_with_frontmatter)
print(result.metadata.scan_duration_seconds)

# Access files
for entry in result.files:
    print(entry.file_path, entry.status)
    print(entry.stats.file_size if entry.stats else None)

Custom data callback

Pass a callback function to attach custom data to each file entry. The callback receives the raw file content as a string and may return any value, including None. The return value is stored in the custom_data field of the corresponding FileEntry.

from pathlib import Path
from matterify import scan_directory

def count_words(content: str) -> object:
    return {"word_count": len(content.split())}

result = scan_directory(Path("./docs"), callback=count_words)

for entry in result.files:
    if entry.custom_data:
        print(entry.file_path, entry.custom_data["word_count"])

Important: The callback must be a module-level function (picklable for multiprocessing), not a lambda or closure.
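
One way to check a callback before a scan is to try pickling it yourself. This is a minimal sketch, assuming the workers receive the callback through the standard pickle module, as multiprocessing pools normally do; is_picklable is a hypothetical helper and not part of matterify.

```python
import pickle
from typing import Callable


def is_picklable(func: Callable[..., object]) -> bool:
    """Return True if `func` can be pickled by reference."""
    try:
        pickle.dumps(func)
    except (pickle.PicklingError, AttributeError, TypeError):
        # Lambdas and closures cannot be looked up by name, so pickling fails.
        return False
    return True


if __name__ == "__main__":
    # A built-in is importable by name, so it pickles.
    print(is_picklable(len))
    # An anonymous function has no importable name, so it does not.
    print(is_picklable(lambda content: len(content.split())))
```

Functions pickle by reference (module plus qualified name), which is why only functions defined at module level survive the round trip.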

Public Types

from matterify import (
    FileEntry,
    ScanMetadata,
    ScanResults,
)

# FileEntry: extracted frontmatter from a single file
entry: FileEntry

# ScanMetadata: summary statistics about a scan
metadata: ScanMetadata

# ScanResults: holds metadata and file entries
result: ScanResults

JSON Output Structure

When using the CLI (stdout or --output), the JSON payload has this shape:

{
  "metadata": {
    "root": "/path/to/docs",
    "total_files": 10,
    "files_with_frontmatter": 8,
    "files_without_frontmatter": 2,
    "errors": 0,
    "scan_duration_seconds": 0.523,
    "avg_duration_per_file_ms": 52.3,
    "throughput_files_per_second": 19.1
  },
  "files": [
    {
      "file_path": "getting-started.md",
      "frontmatter": {
        "title": "Getting Started",
        "date": "2024-01-15",
        "tags": ["guide", "tutorial"]
      },
      "status": "ok",
      "error": null,
      "stats": {
        "file_size": 1234,
        "modified_time": "2024-01-15T10:30:00",
        "access_time": "2024-01-15T10:30:00"
      },
      "file_hash": "abc123..."
    }
  ]
}

status is either "ok" or "illegal".
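
Downstream code can split entries on that field. The following is a minimal sketch, assuming the payload shape shown above; paths_by_status is a hypothetical helper, and the second sample entry, including its error text, is invented for illustration.

```python
from __future__ import annotations

from collections import defaultdict


def paths_by_status(payload: dict) -> dict[str, list[str]]:
    """Group file paths from a matterify-style payload by their status field."""
    grouped: dict[str, list[str]] = defaultdict(list)
    for entry in payload["files"]:
        grouped[entry["status"]].append(entry["file_path"])
    return dict(grouped)


# Hand-written sample; only the fields used here are included.
sample = {
    "files": [
        {"file_path": "getting-started.md", "status": "ok", "error": None},
        {"file_path": "broken.md", "status": "illegal", "error": "hypothetical error text"},
    ]
}
print(paths_by_status(sample))
```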

Default Exclusions

The following glob patterns are excluded from scanning by default:

  • **/.git - Git repositories
  • **/.obsidian - Obsidian vault settings
  • **/__pycache__ - Python bytecode cache
  • **/.venv - Python virtual environments
  • **/venv - Python virtual environments
  • **/node_modules - Node.js dependencies
  • **/.mypy_cache - MyPy type checker cache
  • **/.pytest_cache - Pytest cache
  • **/.ruff_cache - Ruff linter cache

The **/ prefix matches directories at any depth. Use -e or --exclude to add custom exclusion patterns.
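
To illustrate what "at any depth" means, the sketch below approximates a **/name pattern by testing every path component against name. It is an assumption-laden illustration, not matterify's actual matcher, which may treat patterns differently; is_excluded is a hypothetical helper.

```python
from __future__ import annotations

from fnmatch import fnmatchcase
from pathlib import PurePosixPath


def is_excluded(relative_path: str, patterns: list[str]) -> bool:
    """Approximate `**/name` patterns: exclude if any path component matches `name`."""
    parts = PurePosixPath(relative_path).parts
    for pattern in patterns:
        # Strip the leading "**/" so the remainder can be compared per component.
        name = pattern.removeprefix("**/")
        if any(fnmatchcase(part, name) for part in parts):
            return True
    return False


print(is_excluded("notes/.git/config.md", ["**/.git"]))
print(is_excluded("docs/guide.md", ["**/.git", "**/build"]))
```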

Development

# Install with dev dependencies
uv sync --all-extras

# Run tests
uv run pytest

# Format and lint
uv run ruff format src/ tests/
uv run ruff check src/ tests/

# Type check
uv run mypy src/

License

MIT License - see LICENSE file for details.
