Extract and aggregate YAML frontmatter from Markdown files into structured JSON

Matterify

Python 3.12+ | License: MIT | Status: Alpha

Extract and aggregate YAML frontmatter from Markdown files, with optional SHA-256 hashes and file statistics.

Features

  • Recursive Markdown discovery with configurable directory exclusions
  • YAML frontmatter extraction with structured ok/illegal status reporting
  • Optional SHA-256 file hashes and file stats (size, mtime, atime)
  • Parallel scan workers for faster processing on larger vaults/projects
  • In-memory single-entry scan cache for repeated Python API calls
  • Cache control via force_refresh=True and clear_cache()
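
The optional hash and stats fields correspond to information the Python standard library can produce directly. The following is a minimal sketch of that idea, not matterify's actual implementation: the helper names sha256_of and file_stats are hypothetical, and the ISO timestamp formatting is an assumption based on the sample JSON output.

```python
import hashlib
import os
from datetime import datetime


def sha256_of(path, chunk_size=65536):
    """Return the hex SHA-256 digest of a file, read in chunks."""
    digest = hashlib.sha256()
    with open(path, "rb") as handle:
        for chunk in iter(lambda: handle.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def file_stats(path):
    """Return size, mtime, and atime using the key names from the JSON output."""
    st = os.stat(path)
    return {
        "file_size": st.st_size,
        # Assumption: local-time ISO 8601 strings with second precision.
        "modified_time": datetime.fromtimestamp(st.st_mtime).isoformat(timespec="seconds"),
        "access_time": datetime.fromtimestamp(st.st_atime).isoformat(timespec="seconds"),
    }
```

Reading in chunks keeps memory use flat for large files, which matters when many files are hashed in parallel.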

Quick Start

pip install matterify
matterify ./docs -o output.json

Installation

# Using uv (recommended)
uv add matterify

# Or with pip
pip install matterify

CLI Usage

matterify DIRECTORY [OPTIONS]

DIRECTORY must exist and is scanned recursively for .md and .markdown files.

Options:

  • --version - Show version information and exit
  • --debug - Enable debug logging
  • -o, --output PATH - Write JSON to PATH (default: stdout)
  • --n-procs INT - Worker process count (default: auto-detect CPU cores)
  • -v, --verbose - Show progress and summary
  • -e, --exclude TEXT - Additional directories to exclude
  • --hash / --no-hash - Enable/disable SHA-256 hash computation
  • --stats / --no-stats - Enable/disable file statistics (size, modified time, access time)
  • --frontmatter / --no-frontmatter - Enable/disable YAML frontmatter extraction
  • --help - Show command help and exit

When --no-frontmatter is used, metadata fields files_with_frontmatter and files_without_frontmatter are null.

Examples:

# Output to stdout (JSON)
matterify ./docs

# Output to file
matterify ./docs -o output.json

# Verbose output
matterify ./docs --verbose

# Disable hashes and file stats
matterify ./docs --no-hash --no-stats

# Hash + stats only (skip YAML parsing)
matterify ./docs --no-frontmatter

# Exclude additional directories
matterify ./docs -e build -e .cache

# Full help
matterify --help
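
When the CLI is driven from a script, the documented flags can be assembled programmatically. This is a rough sketch, assuming matterify is installed and on PATH; build_command and run_scan are hypothetical helper names, and only flags listed under Options are used.

```python
import json
import subprocess


def build_command(directory, output=None, excludes=(), with_hash=True, with_stats=True):
    """Assemble a matterify argv list from the documented CLI flags."""
    cmd = ["matterify", str(directory)]
    if output is not None:
        cmd += ["--output", str(output)]
    for name in excludes:
        cmd += ["--exclude", name]
    if not with_hash:
        cmd.append("--no-hash")
    if not with_stats:
        cmd.append("--no-stats")
    return cmd


def run_scan(directory, excludes=(), with_hash=True, with_stats=True):
    """Run the CLI without --output and parse the JSON printed to stdout."""
    completed = subprocess.run(
        build_command(directory, excludes=excludes,
                      with_hash=with_hash, with_stats=with_stats),
        capture_output=True,
        text=True,
        check=True,  # raise CalledProcessError on a non-zero exit code
    )
    return json.loads(completed.stdout)
```

For example, build_command("./docs", excludes=["build", ".cache"]) yields the same arguments as the "Exclude additional directories" example, with the long form of the flag.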

Python API

Public Functions

from pathlib import Path
from matterify import (
    scan_directory,
)

scan_directory

Scans a directory and aggregates frontmatter using parallel workers. Returns an AggregatedResult dataclass.

scan_directory() uses an in-memory single-entry cache keyed by directory and scan options. Repeated calls with the same inputs return the cached result unless force_refresh=True is passed. Use clear_cache() to manually invalidate the cache.

from pathlib import Path
from matterify import scan_directory

result = scan_directory(Path("./docs"))

# Bypass in-memory cache and force recompute
fresh_result = scan_directory(Path("./docs"), force_refresh=True)

# AggregatedResult contains:
# - result.metadata: ScanMetadata with scan statistics
# - result.files: list of file entries with extraction results

# Access metadata
print(result.metadata.total_files)
print(result.metadata.files_with_frontmatter)
print(result.metadata.scan_duration_seconds)

# Access files
for entry in result.files:
    print(entry.file_path, entry.status)
    print(entry.stats.file_size if entry.stats else None)

# Clear the in-memory single-entry scan cache
from matterify import clear_cache

clear_cache()

force_refresh and clear_cache() are Python API controls only; the CLI always performs a fresh scan per command invocation.
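
To make the "single-entry" behaviour concrete, here is a small stand-alone sketch of such a cache. It illustrates the semantics described above and is not matterify's code: SingleEntryCache and fake_scan are hypothetical names, and the key layout is an assumption.

```python
class SingleEntryCache:
    """Remember only the most recent result, keyed by the call inputs."""

    def __init__(self, compute):
        self._compute = compute
        self._key = None
        self._value = None
        self._filled = False

    def get(self, key, force_refresh=False):
        # Recompute on a forced refresh, an empty cache, or a different key.
        if force_refresh or not self._filled or key != self._key:
            self._value = self._compute(key)
            self._key = key
            self._filled = True
        return self._value

    def clear(self):
        """Drop the stored entry so the next call recomputes."""
        self._key = None
        self._value = None
        self._filled = False


calls = []


def fake_scan(key):
    """Stand-in for an expensive scan; records each real computation."""
    calls.append(key)
    return {"key": key, "run": len(calls)}


cache = SingleEntryCache(fake_scan)
key_docs = ("./docs", ("hash", True), ("stats", True))
first = cache.get(key_docs)
second = cache.get(key_docs)  # same inputs: served from the cache
```

Because only one entry is kept, alternating between two directories recomputes on every call; the cache only helps repeated calls with identical inputs.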

Public Types

from matterify import (
    FileEntry,
    ScanMetadata,
    AggregatedResult,
)

# FileEntry: extracted frontmatter from a single file
entry: FileEntry

# ScanMetadata: summary statistics about a scan
metadata: ScanMetadata

# AggregatedResult: holds metadata and file entries
result: AggregatedResult

JSON Output Structure

When using the CLI (stdout or --output), the payload has this shape:

{
  "metadata": {
    "source_directory": "/path/to/docs",
    "total_files": 10,
    "files_with_frontmatter": 8,
    "files_without_frontmatter": 2,
    "errors": 0,
    "scan_duration_seconds": 0.523,
    "avg_duration_per_file_ms": 52.3,
    "throughput_files_per_second": 19.1
  },
  "files": [
    {
      "file_path": "getting-started.md",
      "frontmatter": {
        "title": "Getting Started",
        "date": "2024-01-15",
        "tags": ["guide", "tutorial"]
      },
      "status": "ok",
      "error": null,
      "stats": {
        "file_size": 1234,
        "modified_time": "2024-01-15T10:30:00",
        "access_time": "2024-01-15T10:30:00"
      },
      "file_hash": "abc123..."
    }
  ]
}

status is either "ok" or "illegal".
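
A consumer of this payload might split files by status and sanity-check the timing fields. The sketch below uses a trimmed sample; the "broken.md" entry and its error text are hypothetical, and the formulas for the derived metrics are an assumption that happens to agree with the sample numbers above.

```python
import json

payload = json.loads("""
{
  "metadata": {
    "total_files": 10,
    "scan_duration_seconds": 0.523,
    "avg_duration_per_file_ms": 52.3,
    "throughput_files_per_second": 19.1
  },
  "files": [
    {"file_path": "getting-started.md",
     "frontmatter": {"title": "Getting Started"},
     "status": "ok", "error": null},
    {"file_path": "broken.md",
     "frontmatter": null,
     "status": "illegal", "error": "hypothetical parse error"}
  ]
}
""")

# Partition entries by the documented status values.
ok_files = [f["file_path"] for f in payload["files"] if f["status"] == "ok"]
illegal_files = [f["file_path"] for f in payload["files"] if f["status"] == "illegal"]

# Assumed derivation of the per-file average and throughput.
meta = payload["metadata"]
avg_ms = meta["scan_duration_seconds"] / meta["total_files"] * 1000
throughput = meta["total_files"] / meta["scan_duration_seconds"]
```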

Default Exclusions

The following directories are excluded from scanning by default:

  • .git
  • .obsidian
  • __pycache__
  • .venv
  • venv
  • node_modules
  • .mypy_cache
  • .pytest_cache
  • .ruff_cache

Use -e or --exclude to add custom exclusions.
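
The effect of the exclusion list can be pictured as pruning during a recursive walk. The following is a simplified sketch, not matterify's implementation: find_markdown is a hypothetical helper, and matching excluded names at any depth by exact directory name is an assumption.

```python
import os
from pathlib import Path

DEFAULT_EXCLUDES = frozenset({
    ".git", ".obsidian", "__pycache__", ".venv", "venv",
    "node_modules", ".mypy_cache", ".pytest_cache", ".ruff_cache",
})
MARKDOWN_SUFFIXES = {".md", ".markdown"}


def find_markdown(root, extra_excludes=()):
    """Return sorted Markdown paths under root, skipping excluded directories."""
    excluded = DEFAULT_EXCLUDES | set(extra_excludes)
    found = []
    for dirpath, dirnames, filenames in os.walk(root):
        # Editing dirnames in place stops os.walk from descending into them.
        dirnames[:] = [d for d in dirnames if d not in excluded]
        for name in filenames:
            if Path(name).suffix in MARKDOWN_SUFFIXES:
                found.append(Path(dirpath, name))
    return sorted(found)
```

Pruning at walk time avoids even listing the contents of large directories such as node_modules, which is usually cheaper than filtering paths afterwards.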

Development

# Install with dev dependencies
uv sync --all-extras

# Run tests
uv run pytest

# Format and lint
uv run ruff format src/ tests/
uv run ruff check src/ tests/

# Type check
uv run mypy src/

License

MIT License - see LICENSE file for details.

Download files

Source Distribution

matterify-0.3.1.tar.gz (60.3 kB)

Uploaded Source

Built Distribution

matterify-0.3.1-py3-none-any.whl (13.4 kB)

Uploaded Python 3

File details

Details for the file matterify-0.3.1.tar.gz.

File metadata

  • Download URL: matterify-0.3.1.tar.gz
  • Size: 60.3 kB
  • Tags: Source
  • Uploaded using Trusted Publishing? Yes
  • Uploaded via: twine/6.1.0 CPython/3.13.7

File hashes

Hashes for matterify-0.3.1.tar.gz
Algorithm Hash digest
SHA256 39f08ae2e06ff9d50c6e426d6e3bf2a157043bbe4baa616c96824b5a3e952c4c
MD5 12f7c5f4ad233cc16c0e1b8139f7a959
BLAKE2b-256 b488b20f332031b16b8fa3b620e146910f7689e8824d53b567a8cd918b9f886c

Provenance

The following attestation bundles were made for matterify-0.3.1.tar.gz:

Publisher: publish.yml on chgroeling/matterify

Attestations: Values shown here reflect the state when the release was signed and may no longer be current.

File details

Details for the file matterify-0.3.1-py3-none-any.whl.

File metadata

  • Download URL: matterify-0.3.1-py3-none-any.whl
  • Size: 13.4 kB
  • Tags: Python 3
  • Uploaded using Trusted Publishing? Yes
  • Uploaded via: twine/6.1.0 CPython/3.13.7

File hashes

Hashes for matterify-0.3.1-py3-none-any.whl
Algorithm Hash digest
SHA256 46c6a5d4647bd207d8a46ec2a48b298dc0523778012acf2c9ad70a1561df40b2
MD5 9a032b57a025d68ce32bebb2ac9647bb
BLAKE2b-256 138476f0d12141fab75b3a8888995ead896920fa54e5f6f40b2325e24a62ef99

Provenance

The following attestation bundles were made for matterify-0.3.1-py3-none-any.whl:

Publisher: publish.yml on chgroeling/matterify
