Extract and aggregate YAML frontmatter from Markdown files into structured JSON
Matterify
Extract and aggregate YAML frontmatter from Markdown files, with optional SHA-256 hashes and file statistics.
Features
- Recursive Markdown discovery with configurable directory exclusions
- YAML frontmatter extraction with structured ok/illegal status reporting
- Optional SHA-256 file hashes and file stats (size, mtime, atime)
- Parallel scan workers for faster processing on larger vaults/projects
- In-memory single-entry scan cache for repeated Python API calls
- Cache control via force_refresh=True and clear_cache()
Quick Start
pip install matterify
matterify ./docs -o output.json
Installation
# Using uv (recommended)
uv add matterify
# Or with pip
pip install matterify
CLI Usage
matterify DIRECTORY [OPTIONS]
DIRECTORY must exist and is scanned recursively for .md and .markdown files.
Options:
- --version - Show version information and exit
- --debug - Enable debug logging
- -o, --output PATH - Write JSON to file instead of stdout (if omitted, outputs to stdout)
- --n-procs INT - Worker process count (default: auto-detect CPU cores)
- -v, --verbose - Show progress and summary
- -e, --exclude TEXT - Additional directories to exclude
- --hash / --no-hash - Enable/disable SHA-256 hash computation
- --stats / --no-stats - Enable/disable file statistics (size, modified time, access time)
- --frontmatter / --no-frontmatter - Enable/disable YAML frontmatter extraction
- --help - Show command help and exit
When --no-frontmatter is used, metadata fields files_with_frontmatter and
files_without_frontmatter are null.
Examples:
# Output to stdout (JSON)
matterify ./docs
# Output to file
matterify ./docs -o output.json
# Verbose output
matterify ./docs --verbose
# Disable hashes and file stats
matterify ./docs --no-hash --no-stats
# Hash + stats only (skip YAML parsing)
matterify ./docs --no-frontmatter
# Exclude additional directories
matterify ./docs -e build -e .cache
# Full help
matterify --help
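For scripts that drive the CLI rather than the Python API, the invocations above can be assembled programmatically. The following is a minimal sketch: build_command and run_scan are hypothetical helper names, only flags documented above are used, and run_scan assumes that stdout carries nothing but the JSON payload when --verbose is not passed.

```python
import json
import subprocess


def build_command(directory, exclude=(), hashes=True, stats=True):
    """Build the matterify argument list from the flags documented above."""
    command = ["matterify", str(directory)]
    for name in exclude:
        command += ["-e", name]  # one -e per excluded directory
    if not hashes:
        command.append("--no-hash")
    if not stats:
        command.append("--no-stats")
    return command


def run_scan(directory, **options):
    """Run the CLI and parse its stdout as JSON (requires matterify on PATH)."""
    completed = subprocess.run(
        build_command(directory, **options),
        capture_output=True,
        text=True,
        check=True,  # raise CalledProcessError on a non-zero exit code
    )
    return json.loads(completed.stdout)


print(build_command("./docs", exclude=["build", ".cache"], hashes=False))
```

Passing the arguments as a list avoids shell quoting problems with directory names that contain spaces.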
Python API
Public Functions
from pathlib import Path
from matterify import (
    scan_directory,
)
scan_directory
Scans a directory and aggregates frontmatter using parallel workers. Returns an
AggregatedResult dataclass.
scan_directory() uses an in-memory single-entry cache keyed by directory and scan options.
Repeated calls with the same inputs return the cached result unless force_refresh=True is
passed. Use clear_cache() to manually invalidate the cache.
from pathlib import Path
from matterify import scan_directory
result = scan_directory(Path("./docs"))
# Bypass in-memory cache and force recompute
fresh_result = scan_directory(Path("./docs"), force_refresh=True)
# AggregatedResult contains:
# - result.metadata: ScanMetadata with scan statistics
# - result.files: list of file entries with extraction results
# Access metadata
print(result.metadata.total_files)
print(result.metadata.files_with_frontmatter)
print(result.metadata.scan_duration_seconds)
# Access files
for entry in result.files:
    print(entry.file_path, entry.status)
    print(entry.stats.file_size if entry.stats else None)
# Clear the in-memory single-entry scan cache
from matterify import clear_cache
clear_cache()
force_refresh and clear_cache() are Python API controls only; the CLI always performs
a fresh scan per command invocation.
Public Types
from matterify import (
    FileEntry,
    ScanMetadata,
    AggregatedResult,
)
# FileEntry: extracted frontmatter from a single file
entry: FileEntry
# ScanMetadata: summary statistics about a scan
metadata: ScanMetadata
# AggregatedResult: holds metadata and file entries
result: AggregatedResult
JSON Output Structure
When using the CLI (stdout or --output), the payload has this shape:
{
  "metadata": {
    "source_directory": "/path/to/docs",
    "total_files": 10,
    "files_with_frontmatter": 8,
    "files_without_frontmatter": 2,
    "errors": 0,
    "scan_duration_seconds": 0.523,
    "avg_duration_per_file_ms": 52.3,
    "throughput_files_per_second": 19.1
  },
  "files": [
    {
      "file_path": "getting-started.md",
      "frontmatter": {
        "title": "Getting Started",
        "date": "2024-01-15",
        "tags": ["guide", "tutorial"]
      },
      "status": "ok",
      "error": null,
      "stats": {
        "file_size": 1234,
        "modified_time": "2024-01-15T10:30:00",
        "access_time": "2024-01-15T10:30:00"
      },
      "file_hash": "abc123..."
    }
  ]
}
status is either "ok" or "illegal".
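A payload of this shape can be post-processed with the standard library alone. The sketch below is illustrative: the sample data is abbreviated, the second entry (including its error text and null frontmatter) is invented to show an "illegal" status, and summarize is a hypothetical helper name.

```python
import json

# Abbreviated sample following the documented shape; values are illustrative.
payload = json.loads("""
{
  "metadata": {"total_files": 2},
  "files": [
    {"file_path": "getting-started.md",
     "frontmatter": {"title": "Getting Started", "tags": ["guide", "tutorial"]},
     "status": "ok", "error": null, "stats": null, "file_hash": null},
    {"file_path": "broken.md", "frontmatter": null,
     "status": "illegal", "error": "invalid YAML", "stats": null, "file_hash": null}
  ]
}
""")


def summarize(payload):
    """Return (title by file path, paths whose status is not "ok")."""
    titles = {}
    problems = []
    for entry in payload["files"]:
        if entry["status"] != "ok":
            problems.append(entry["file_path"])
            continue
        # Tolerate a missing or null frontmatter object.
        frontmatter = entry.get("frontmatter") or {}
        if "title" in frontmatter:
            titles[entry["file_path"]] = frontmatter["title"]
    return titles, problems


titles, problems = summarize(payload)
print(titles)
print(problems)
```

To use it on real output, replace the embedded sample with json.load on a file written via -o.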
Default Exclusions
The following directories are excluded from scanning by default:
- .git
- .obsidian
- __pycache__
- .venv
- venv
- node_modules
- .mypy_cache
- .pytest_cache
- .ruff_cache
Use -e or --exclude to add custom exclusions.
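Directory exclusion during a recursive walk can be sketched as follows. This is an illustration of the general technique, not matterify's code: find_markdown is a hypothetical name, only a subset of the default exclusions is used, and matching excluded names at any depth is an assumption.

```python
import os
import tempfile
from pathlib import Path

# Subset of the default exclusions listed above, used only for this sketch.
EXCLUDED = {".git", ".obsidian", "node_modules"}
EXTENSIONS = {".md", ".markdown"}


def find_markdown(root, excluded=EXCLUDED):
    """Yield Markdown paths relative to root, skipping excluded directory names."""
    root = Path(root)
    for current, dirnames, filenames in os.walk(root):
        # Pruning dirnames in place stops os.walk from descending into them.
        dirnames[:] = sorted(d for d in dirnames if d not in excluded)
        for name in sorted(filenames):
            if Path(name).suffix in EXTENSIONS:
                yield (Path(current) / name).relative_to(root)


# Build a small throwaway tree to demonstrate the pruning.
with tempfile.TemporaryDirectory() as tmp:
    base = Path(tmp)
    (base / "guide").mkdir()
    (base / ".git").mkdir()
    (base / "index.md").write_text("# Index\n")
    (base / "guide" / "intro.markdown").write_text("# Intro\n")
    (base / "guide" / "notes.txt").write_text("not markdown\n")
    (base / ".git" / "ignored.md").write_text("# Ignored\n")
    found = [p.as_posix() for p in find_markdown(base)]

print(found)  # ['index.md', 'guide/intro.markdown']
```

Pruning at the directory level means excluded trees such as node_modules are never traversed, which matters more for speed than filtering individual files afterwards.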
Development
# Install with dev dependencies
uv sync --all-extras
# Run tests
uv run pytest
# Format and lint
uv run ruff format src/ tests/
uv run ruff check src/ tests/
# Type check
uv run mypy src/
License
MIT License - see LICENSE file for details.