File format identification framework with heuristic evaluators and feature validators for Python

Path Format

File format identification framework with heuristic evaluators and feature validators for Python.

Installation

pip install "vcti-path-format>=1.2.0"

Or declare it in the `dependencies` list of `pyproject.toml`:

dependencies = [
    "vcti-path-format>=1.2.0",
]

Quick Start

from pathlib import Path
from vcti.pathformat import (
    FormatDescriptor,
    FormatIdentifier,
    FormatRegistry,
    MatchConfidence,
)
from vcti.pathformat.evaluator import HeuristicEvaluator

# Define a format descriptor with validators
hdf5_descriptor = FormatDescriptor(
    id="hdf5-file",
    name="HDF5 File",
    evaluator=(
        HeuristicEvaluator()
        .check_magic_bytes(b"\x89HDF\r\n\x1a\n")  # GATE
        .check_extension([".h5", ".hdf5", ".he5"])  # EVIDENCE
    ),
    attributes={"path_type": "file", "structure": "hdf5"},
)

# Register in a format registry
registry = FormatRegistry()
registry.register(hdf5_descriptor)

# Identify a file
identifier = FormatIdentifier(registry)
results = identifier.identify_file_format(Path("data.h5"))

for result in results:
    print(f"{result.descriptor.name}: {result.confidence.name}")

# Get best match above a confidence threshold
best = identifier.get_best_match(
    Path("data.h5"),
    min_confidence=MatchConfidence.LIKELY,
)

Core Concepts

FormatDescriptor

Extends Descriptor[Evaluator] from vcti-plugin-catalog. Wraps an evaluator with format metadata and attributes.

FormatRegistry

Extends Registry[FormatDescriptor]. Central catalog of known formats with attribute-based filtering via registry.lookup.

FormatIdentifier

Evaluates a path against all (or filtered) registered formats and returns results sorted by confidence.

HeuristicEvaluator

Builder-pattern evaluator that aggregates validation evidence:

evaluator = (
    HeuristicEvaluator()
    .check_magic_bytes(b"\x89PNG\r\n\x1a\n")  # GATE
    .check_extension([".png"])                   # EVIDENCE
    .add_validator(custom_validator)             # Custom
)

Heuristic rules:

Failed GATE -> CERTAINLY_NOT
All passed + GATE present -> DEFINITE
All passed + no GATE -> LIKELY
Some EVIDENCE failed -> UNLIKELY
No validators -> CANT_EVALUATE
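
The rules above can be sketched in plain Python. This is a minimal illustration, not the library's implementation: `Role`, `Confidence`, `Outcome`, and `aggregate` are hypothetical stand-ins, and both the enum values and the order in which the rules are checked are assumptions.

```python
from dataclasses import dataclass
from enum import Enum, IntEnum


class Role(Enum):
    """Stand-in for the validator roles named in this README."""
    GATE = "gate"
    EVIDENCE = "evidence"


class Confidence(IntEnum):
    """Stand-in for MatchConfidence; the numeric values are assumed."""
    CANT_EVALUATE = 0
    CERTAINLY_NOT = 1
    UNLIKELY = 2
    LIKELY = 3
    DEFINITE = 4


@dataclass(frozen=True)
class Outcome:
    role: Role
    is_passed: bool


def aggregate(outcomes: list[Outcome]) -> Confidence:
    """Map a list of validator outcomes to a confidence level."""
    if not outcomes:
        # No validators -> CANT_EVALUATE
        return Confidence.CANT_EVALUATE
    if any(o.role is Role.GATE and not o.is_passed for o in outcomes):
        # Failed GATE -> CERTAINLY_NOT
        return Confidence.CERTAINLY_NOT
    if any(not o.is_passed for o in outcomes):
        # All gates passed, so any failure here is EVIDENCE -> UNLIKELY
        return Confidence.UNLIKELY
    # Everything passed: a GATE upgrades LIKELY to DEFINITE
    has_gate = any(o.role is Role.GATE for o in outcomes)
    return Confidence.DEFINITE if has_gate else Confidence.LIKELY
```

Checking the failed-GATE rule first means a wrong signature rules a format out even when the extension matches.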

Feature Validators

Validator	Role	Tier	Checks
`MagicBytesValidator`	GATE	IDENTIFICATION	File signature bytes
`ExtensionValidator`	EVIDENCE	IDENTIFICATION	File extension

Custom validators implement the FeatureValidator protocol.
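
To make the two built-in checks concrete, here is a standard-library sketch of what a signature check and an extension check could look like. The helper names `has_magic_bytes` and `has_extension` and the `offset` parameter are hypothetical, and case-insensitive extension matching is an assumption; the package's own validators may behave differently.

```python
import tempfile
from pathlib import Path

PNG_SIGNATURE = b"\x89PNG\r\n\x1a\n"


def has_magic_bytes(path: Path, signature: bytes, offset: int = 0) -> bool:
    """Return True if the file holds `signature` at `offset`."""
    try:
        with path.open("rb") as f:
            f.seek(offset)
            # Read only as many bytes as the signature needs
            return f.read(len(signature)) == signature
    except OSError:
        # Missing or unreadable files simply fail the check
        return False


def has_extension(path: Path, extensions: list[str]) -> bool:
    """Case-insensitive match on the final suffix (an assumption)."""
    return path.suffix.lower() in {ext.lower() for ext in extensions}


# Small demonstration on a temporary file
with tempfile.TemporaryDirectory() as tmp:
    sample = Path(tmp) / "image.PNG"
    sample.write_bytes(PNG_SIGNATURE + b"rest of file")
    gate_ok = has_magic_bytes(sample, PNG_SIGNATURE)
    evidence_ok = has_extension(sample, [".png"])
```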

Validation Tiers

Control evaluation depth with max_tier:

Tier	Cost	Examples
`IDENTIFICATION`	Cheap	Magic bytes, file extension
`STRUCTURE`	Medium	Schema validation, header parsing
`SEMANTIC`	Expensive	Content analysis, business logic

from vcti.pathformat import ValidationTier

# Only run cheap checks
results = identifier.identify_file_format(path, max_tier=ValidationTier.IDENTIFICATION)
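
One way to picture `max_tier` is as a cost ceiling: checks above the ceiling are skipped. The sketch below assumes tiers are ordered integers; `Tier`, `Check`, and `select_checks` are hypothetical names, and running the cheapest checks first is an illustrative choice rather than documented behaviour.

```python
from dataclasses import dataclass
from enum import IntEnum


class Tier(IntEnum):
    """Stand-in for ValidationTier; the numeric values are assumed."""
    IDENTIFICATION = 1
    STRUCTURE = 2
    SEMANTIC = 3


@dataclass(frozen=True)
class Check:
    name: str
    tier: Tier


def select_checks(checks: list[Check], max_tier: Tier) -> list[Check]:
    """Keep checks at or below max_tier, ordered cheapest first."""
    return sorted((c for c in checks if c.tier <= max_tier), key=lambda c: c.tier)


checks = [
    Check("schema", Tier.STRUCTURE),
    Check("magic-bytes", Tier.IDENTIFICATION),
    Check("content-analysis", Tier.SEMANTIC),
]
# With a STRUCTURE ceiling, the SEMANTIC check is skipped
selected = [c.name for c in select_checks(checks, Tier.STRUCTURE)]
```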

Custom Validators

Implement the FeatureValidator protocol to add domain-specific checks:

from pathlib import Path
from vcti.pathformat.evaluator import HeuristicEvaluator
from vcti.pathformat.feature_validator import (
    FeatureValidator,
    ValidationResult,
    ValidationTier,
    ValidatorRole,
)

class HeaderValidator:
    """Checks for a text header line in the first line of a file."""

    id = "header-check"
    description = "Header line validator"
    role = ValidatorRole.EVIDENCE
    tier = ValidationTier.STRUCTURE

    def __init__(self, expected_header: str):
        self.expected_header = expected_header

    def validate(self, path: Path) -> ValidationResult:
        try:
            # Read only the first line instead of loading the whole file
            with path.open(encoding="utf-8") as f:
                first_line = f.readline()
            is_passed = first_line.strip() == self.expected_header
        except (OSError, UnicodeDecodeError):
            is_passed = False
        return ValidationResult(
            validator_id=self.id,
            role=self.role,
            is_passed=is_passed,
            details=f"Header {'matches' if is_passed else 'mismatch'}",
        )

# Use with the builder pattern
evaluator = (
    HeuristicEvaluator()
    .check_extension([".csv"])
    .add_validator(HeaderValidator("id,name,value"))
)

Evaluator Caching

HeuristicEvaluator includes an LRU cache keyed by (path, max_tier):

# Default: 128 entries
evaluator = HeuristicEvaluator(cache_size=128)

# Disable caching
evaluator = HeuristicEvaluator(cache_size=0)

# Bypass cache for a single call
report = descriptor.evaluate(path, use_cache=False)

# Inspect and manage
info = evaluator.cache_info()  # (hits, misses, maxsize, currsize) or None
evaluator.clear_cache()

Cache entries become stale if file contents change. Call clear_cache() after known file modifications, or pass use_cache=False for one-off re-evaluation.
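
The staleness caveat follows from the cache key: if only `(path, max_tier)` is used, a changed file still maps to the old entry. A minimal LRU sketch shows the mechanism. `LruCache`, `get_or_compute`, and `evaluate` are hypothetical names; this is not the library's cache.

```python
from collections import OrderedDict
from pathlib import Path


class LruCache:
    """Tiny LRU cache; maxsize <= 0 disables caching."""

    def __init__(self, maxsize: int = 128):
        self.maxsize = maxsize
        self._data = OrderedDict()
        self.hits = 0
        self.misses = 0

    def get_or_compute(self, key, compute):
        if self.maxsize <= 0:
            return compute()
        if key in self._data:
            # Mark the entry as most recently used
            self._data.move_to_end(key)
            self.hits += 1
            return self._data[key]
        self.misses += 1
        value = compute()
        self._data[key] = value
        if len(self._data) > self.maxsize:
            # Evict the least recently used entry
            self._data.popitem(last=False)
        return value

    def clear(self):
        self._data.clear()


calls = []


def evaluate(path, max_tier):
    """Placeholder for an expensive evaluation."""
    calls.append((path, max_tier))
    return f"report for {path.name} at {max_tier}"


cache = LruCache(maxsize=2)
p = Path("data.h5")
key = (p, "IDENTIFICATION")
first = cache.get_or_compute(key, lambda: evaluate(p, "IDENTIFICATION"))
# Same key: served from the cache even if the file changed on disk
second = cache.get_or_compute(key, lambda: evaluate(p, "IDENTIFICATION"))
```

Because the key carries no information about file contents, only clearing the cache or bypassing it forces a fresh evaluation.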

Pre-filtering with Rules

from vcti.lookup import Rule

# Only evaluate formats with structure="hdf5"
results = identifier.identify_file_format(
    path,
    rules=[Rule("structure", "==", "hdf5")],
)
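
Conceptually, a rule narrows the set of descriptors before any validator runs. The sketch below mimics that with plain dictionaries and tuples; it does not use `vcti.lookup`, and `matches`, the `OPS` table, and the `csv-file` entry are hypothetical illustrations.

```python
import operator

# Supported comparison operators in this sketch (an assumed subset)
OPS = {"==": operator.eq, "!=": operator.ne}


def matches(attributes: dict, rules: list[tuple[str, str, object]]) -> bool:
    """True if every (key, op, value) rule holds for the attributes."""
    return all(
        key in attributes and OPS[op](attributes[key], value)
        for key, op, value in rules
    )


formats = {
    "hdf5-file": {"path_type": "file", "structure": "hdf5"},
    "csv-file": {"path_type": "file", "structure": "tabular"},
}

# Only formats whose attributes satisfy the rule would be evaluated
candidates = [
    format_id
    for format_id, attrs in formats.items()
    if matches(attrs, [("structure", "==", "hdf5")])
]
```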

Error Handling

The framework raises typed exceptions:

Exception	When
`FileNotFoundError`	Path does not exist
`PathAccessError`	Path is not a file or directory, or cannot be read
`EvaluatorError`	Base class for evaluator errors
`ValidationError`	A validator raised an unexpected exception
`InvalidValidatorError`	Invalid validator passed to builder

from vcti.pathformat import PathAccessError

try:
    results = identifier.identify_file_format(path)
except FileNotFoundError:
    print("File not found")
except PathAccessError as e:
    print(f"Cannot access path: {e}")

Ecosystem

This package is the identification engine in a three-repo system:

Package	Role
vcti-path-format	Framework: evaluators, validators, registry, identifier
vcti-path-format-attributes	Vocabulary: standardized attribute enums
vcti-path-format-descriptors	Built-in format definitions (HDF5, CAX, etc.)

Dependencies

vcti-plugin-catalog (>=1.0.1) — descriptor/registry framework
vcti-lookup (>=1.0.1) — attribute-based filtering

Project details

This version: 1.2.0 (Mar 28, 2026)

Download files

Source Distribution

vcti_path_format-1.2.0.tar.gz (20.8 kB)

Uploaded Mar 28, 2026 Source

Built Distribution

vcti_path_format-1.2.0-py3-none-any.whl (19.0 kB)

Uploaded Mar 28, 2026 Python 3

File details

Details for the file vcti_path_format-1.2.0.tar.gz.

File metadata

Download URL: vcti_path_format-1.2.0.tar.gz
Upload date: Mar 28, 2026
Size: 20.8 kB
Tags: Source
Uploaded using Trusted Publishing? Yes
Uploaded via: twine/6.1.0 CPython/3.13.7

File hashes

Hashes for vcti_path_format-1.2.0.tar.gz
Algorithm	Hash digest
SHA256	`09bf7321e1f0fe2a476121eb5f24c38590f22dbfa81461fc67ef5165c6ad1af4`
MD5	`0d96071c100f402ea138de173874bdb5`
BLAKE2b-256	`2ea20a410d727f5931501f04c9e9423a76a9f0f0abd05855ff6dff8e8478e0eb`

Provenance

The following attestation bundles were made for vcti_path_format-1.2.0.tar.gz:

Publisher: publish.yml on vcollab/vcti-python-path-format

Attestations: Values shown here reflect the state when the release was signed and may no longer be current.

Statement:
- Statement type: https://in-toto.io/Statement/v1
- Predicate type: https://docs.pypi.org/attestations/publish/v1
- Subject name: vcti_path_format-1.2.0.tar.gz
- Subject digest: 09bf7321e1f0fe2a476121eb5f24c38590f22dbfa81461fc67ef5165c6ad1af4
- Sigstore transparency entry: 1189827580
- Sigstore integration time: Mar 28, 2026
Source repository:
- Permalink: vcollab/vcti-python-path-format@b3a8e33f1e8b46cf3cd2f7b5785227c98f243acc
- Branch / Tag: refs/heads/main
- Owner: https://github.com/vcollab
- Access: private
Publication detail:
- Token Issuer: https://token.actions.githubusercontent.com
- Runner Environment: github-hosted
- Publication workflow: publish.yml@b3a8e33f1e8b46cf3cd2f7b5785227c98f243acc
- Trigger Event: workflow_dispatch

File details

Details for the file vcti_path_format-1.2.0-py3-none-any.whl.

File metadata

Download URL: vcti_path_format-1.2.0-py3-none-any.whl
Upload date: Mar 28, 2026
Size: 19.0 kB
Tags: Python 3
Uploaded using Trusted Publishing? Yes
Uploaded via: twine/6.1.0 CPython/3.13.7

File hashes

Hashes for vcti_path_format-1.2.0-py3-none-any.whl
Algorithm	Hash digest
SHA256	`b7682413cb78276d05fb2ca2bac008f8f0bfc0ec22cc824132925bd84e59c57d`
MD5	`8946498fa7b803c32151b0a127580957`
BLAKE2b-256	`5790f0f0b07c4f35547540097db200536eea01a00d21d11127f1c8f962bb5026`

Provenance

The following attestation bundles were made for vcti_path_format-1.2.0-py3-none-any.whl:

Publisher: publish.yml on vcollab/vcti-python-path-format

Attestations: Values shown here reflect the state when the release was signed and may no longer be current.

Statement:
- Statement type: https://in-toto.io/Statement/v1
- Predicate type: https://docs.pypi.org/attestations/publish/v1
- Subject name: vcti_path_format-1.2.0-py3-none-any.whl
- Subject digest: b7682413cb78276d05fb2ca2bac008f8f0bfc0ec22cc824132925bd84e59c57d
- Sigstore transparency entry: 1189827630
- Sigstore integration time: Mar 28, 2026
Source repository:
- Permalink: vcollab/vcti-python-path-format@b3a8e33f1e8b46cf3cd2f7b5785227c98f243acc
- Branch / Tag: refs/heads/main
- Owner: https://github.com/vcollab
- Access: private
Publication detail:
- Token Issuer: https://token.actions.githubusercontent.com
- Runner Environment: github-hosted
- Publication workflow: publish.yml@b3a8e33f1e8b46cf3cd2f7b5785227c98f243acc
- Trigger Event: workflow_dispatch
