Path Format

File format identification framework with heuristic evaluators and feature validators for Python.

Installation

pip install "vcti-path-format>=1.2.0"

Or, in pyproject.toml dependencies:

dependencies = [
    "vcti-path-format>=1.2.0",
]

Quick Start

from pathlib import Path
from vcti.pathformat import (
    FormatDescriptor,
    FormatIdentifier,
    FormatRegistry,
    MatchConfidence,
)
from vcti.pathformat.evaluator import HeuristicEvaluator

# Define a format descriptor with validators
hdf5_descriptor = FormatDescriptor(
    id="hdf5-file",
    name="HDF5 File",
    evaluator=(
        HeuristicEvaluator()
        .check_magic_bytes(b"\x89HDF\r\n\x1a\n")  # GATE
        .check_extension([".h5", ".hdf5", ".he5"])  # EVIDENCE
    ),
    attributes={"path_type": "file", "structure": "hdf5"},
)

# Register in a format registry
registry = FormatRegistry()
registry.register(hdf5_descriptor)

# Identify a file
identifier = FormatIdentifier(registry)
results = identifier.identify_file_format(Path("data.h5"))

for result in results:
    print(f"{result.descriptor.name}: {result.confidence.name}")

# Get best match above a confidence threshold
best = identifier.get_best_match(
    Path("data.h5"),
    min_confidence=MatchConfidence.LIKELY,
)

Core Concepts

FormatDescriptor

Extends Descriptor[Evaluator] from vcti-plugin-catalog. Wraps an evaluator with format metadata and attributes.

FormatRegistry

Extends Registry[FormatDescriptor]. Central catalog of known formats with attribute-based filtering via registry.lookup.

FormatIdentifier

Evaluates a path against all (or filtered) registered formats and returns results sorted by confidence.
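
The sort-and-threshold behaviour can be illustrated with a standalone sketch. It does not use the library: the Confidence enum, its numeric ordering, the Match tuple, the helper names, and the example format ids are all assumptions made for illustration, and the library's MatchConfidence may be defined differently.

```python
from enum import IntEnum
from typing import NamedTuple, Optional


class Confidence(IntEnum):
    # Hypothetical ordering, lowest to highest; not the library's definition.
    CERTAINLY_NOT = 0
    UNLIKELY = 1
    LIKELY = 2
    DEFINITE = 3


class Match(NamedTuple):
    format_id: str
    confidence: Confidence


def rank(matches: list[Match]) -> list[Match]:
    """Return matches sorted from most to least confident."""
    return sorted(matches, key=lambda m: m.confidence, reverse=True)


def best_match(matches: list[Match], min_confidence: Confidence) -> Optional[Match]:
    """Return the top-ranked match at or above the threshold, or None."""
    ranked = rank(matches)
    if ranked and ranked[0].confidence >= min_confidence:
        return ranked[0]
    return None


# Illustrative candidates only; these format ids are made up for the example.
candidates = [
    Match("csv-file", Confidence.UNLIKELY),
    Match("hdf5-file", Confidence.DEFINITE),
    Match("netcdf-file", Confidence.LIKELY),
]
```

With these candidates, rank(candidates) puts "hdf5-file" first, and best_match returns None when nothing reaches the requested threshold.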

HeuristicEvaluator

Builder-pattern evaluator that aggregates validation evidence:

evaluator = (
    HeuristicEvaluator()
    .check_magic_bytes(b"\x89PNG\r\n\x1a\n")  # GATE
    .check_extension([".png"])                   # EVIDENCE
    .add_validator(custom_validator)             # Custom
)

Heuristic rules:

  • Failed GATE -> CERTAINLY_NOT
  • All passed + GATE present -> DEFINITE
  • All passed + no GATE -> LIKELY
  • Some EVIDENCE failed -> UNLIKELY
  • No validators -> CANT_EVALUATE

Feature Validators

Validator            Role      Tier            Checks
MagicBytesValidator  GATE      IDENTIFICATION  File signature bytes
ExtensionValidator   EVIDENCE  IDENTIFICATION  File extension

Custom validators implement the FeatureValidator protocol.
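
To make the two built-in checks concrete, here is a standard-library sketch of what a signature check and an extension check amount to. It is not the code behind MagicBytesValidator or ExtensionValidator: the function names, the offset parameter, and the case-insensitive suffix comparison are assumptions.

```python
from pathlib import Path
from typing import Iterable


def has_magic_bytes(path: Path, signature: bytes, offset: int = 0) -> bool:
    """Return True if the file contains `signature` at `offset`."""
    try:
        with path.open("rb") as fh:
            fh.seek(offset)
            return fh.read(len(signature)) == signature
    except OSError:
        # Missing or unreadable files simply fail the check in this sketch.
        return False


def has_extension(path: Path, extensions: Iterable[str]) -> bool:
    """Return True if the path's suffix is one of `extensions` (case-insensitive)."""
    return path.suffix.lower() in {ext.lower() for ext in extensions}
```

The signature check reads only len(signature) bytes and the extension check does no I/O at all, which is why both sit in the cheapest tier.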


Validation Tiers

Control evaluation depth with max_tier:

Tier            Cost       Examples
IDENTIFICATION  Cheap      Magic bytes, file extension
STRUCTURE       Medium     Schema validation, header parsing
SEMANTIC        Expensive  Content analysis, business logic

from vcti.pathformat import ValidationTier

# Only run cheap checks
results = identifier.identify_file_format(path, max_tier=ValidationTier.IDENTIFICATION)
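
One plausible way to picture max_tier is as a cut-off on an ordered enum: validators whose tier exceeds the limit are skipped. The sketch below is standalone and assumes that behaviour; Tier, Validator, select_validators, and the numeric values are hypothetical and do not come from the library.

```python
from enum import IntEnum
from typing import NamedTuple


class Tier(IntEnum):
    # Ordered by cost, cheapest first; the numeric values are an assumption.
    IDENTIFICATION = 1
    STRUCTURE = 2
    SEMANTIC = 3


class Validator(NamedTuple):
    name: str
    tier: Tier


def select_validators(validators: list[Validator], max_tier: Tier) -> list[Validator]:
    """Keep only validators whose tier is at or below max_tier."""
    return [v for v in validators if v.tier <= max_tier]


# Illustrative validators, one per tier.
validators = [
    Validator("magic-bytes", Tier.IDENTIFICATION),
    Validator("header-check", Tier.STRUCTURE),
    Validator("content-analysis", Tier.SEMANTIC),
]
```

Here select_validators(validators, Tier.IDENTIFICATION) keeps only the magic-bytes check, so the more expensive checks never touch the file.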

Custom Validators

Implement the FeatureValidator protocol to add domain-specific checks:

from pathlib import Path
from vcti.pathformat.evaluator import HeuristicEvaluator
from vcti.pathformat.feature_validator import (
    FeatureValidator,
    ValidationResult,
    ValidationTier,
    ValidatorRole,
)

class HeaderValidator:
    """Checks that the first line of a file matches an expected header."""

    id = "header-check"
    description = "Header line validator"
    role = ValidatorRole.EVIDENCE
    tier = ValidationTier.STRUCTURE

    def __init__(self, expected_header: str):
        self.expected_header = expected_header

    def validate(self, path: Path) -> ValidationResult:
        try:
            # Read only the first line rather than loading the whole file.
            with path.open(encoding="utf-8") as fh:
                first_line = fh.readline()
            is_passed = first_line.strip() == self.expected_header
        except (OSError, UnicodeDecodeError):
            is_passed = False
        return ValidationResult(
            validator_id=self.id,
            role=self.role,
            is_passed=is_passed,
            details=f"Header {'matches' if is_passed else 'mismatch'}",
        )

# Use with the builder pattern
evaluator = (
    HeuristicEvaluator()
    .check_extension([".csv"])
    .add_validator(HeaderValidator("id,name,value"))
)

Evaluator Caching

HeuristicEvaluator includes an LRU cache keyed by (path, max_tier):

# Default: 128 entries
evaluator = HeuristicEvaluator(cache_size=128)

# Disable caching
evaluator = HeuristicEvaluator(cache_size=0)

# Bypass cache for a single call
report = descriptor.evaluate(path, use_cache=False)

# Inspect and manage
info = evaluator.cache_info()  # (hits, misses, maxsize, currsize) or None
evaluator.clear_cache()

Cache entries become stale if file contents change. Call clear_cache() after known file modifications, or pass use_cache=False for one-off re-evaluation.
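
The staleness problem is easy to reproduce with the standard library's functools.lru_cache, which also caches by argument rather than by file contents. This is an analogy under that assumption, not the evaluator's own cache; starts_with_hdf5_signature and demo are hypothetical names.

```python
import tempfile
from functools import lru_cache
from pathlib import Path

HDF5_SIGNATURE = b"\x89HDF\r\n\x1a\n"


@lru_cache(maxsize=128)
def starts_with_hdf5_signature(path: Path) -> bool:
    """Cached check; the cache key is the path, not the file contents."""
    with path.open("rb") as fh:
        return fh.read(len(HDF5_SIGNATURE)) == HDF5_SIGNATURE


def demo() -> tuple[bool, bool, bool]:
    """Return (result before change, stale cached result, result after clearing)."""
    with tempfile.TemporaryDirectory() as tmp:
        path = Path(tmp) / "data.h5"
        path.write_bytes(b"not hdf5")
        before = starts_with_hdf5_signature(path)  # cache miss: reads the file
        path.write_bytes(HDF5_SIGNATURE)           # file changes on disk
        stale = starts_with_hdf5_signature(path)   # cache hit: old answer returned
        starts_with_hdf5_signature.cache_clear()
        fresh = starts_with_hdf5_signature(path)   # cache miss: file is re-read
    return before, stale, fresh
```

The second call still reports the old answer even though the file now carries the signature; only clearing the cache forces a re-read.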


Pre-filtering with Rules

from vcti.lookup import Rule

# Only evaluate formats with structure="hdf5"
results = identifier.identify_file_format(
    path,
    rules=[Rule("structure", "==", "hdf5")],
)
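
Conceptually, pre-filtering compares each descriptor's attributes against the rules before any evaluator runs. The standalone sketch below shows that idea only. SimpleRule, OPERATORS, and matches are hypothetical stand-ins rather than the vcti.lookup API, and two behaviours are assumed: all rules must hold, and a missing attribute counts as no match.

```python
import operator
from typing import Any, Callable, NamedTuple

# Hypothetical operator table; the real Rule may support a different set.
OPERATORS: dict[str, Callable[[Any, Any], bool]] = {
    "==": operator.eq,
    "!=": operator.ne,
}


class SimpleRule(NamedTuple):
    key: str
    op: str
    value: Any


def matches(attributes: dict[str, Any], rules: list[SimpleRule]) -> bool:
    """Return True if the attributes satisfy every rule."""
    for rule in rules:
        if rule.key not in attributes:
            return False
        if not OPERATORS[rule.op](attributes[rule.key], rule.value):
            return False
    return True


# Illustrative attribute sets keyed by format id.
formats = {
    "hdf5-file": {"path_type": "file", "structure": "hdf5"},
    "csv-file": {"path_type": "file", "structure": "text"},
}
hdf5_only = [SimpleRule("structure", "==", "hdf5")]
selected = [fid for fid, attrs in formats.items() if matches(attrs, hdf5_only)]
```

Only "hdf5-file" survives the filter, so in this picture the CSV evaluator would never be invoked for the path.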

Error Handling

The framework raises typed exceptions:

Exception              When
FileNotFoundError      Path does not exist
PathAccessError        Path is not a file or directory, or cannot be read
EvaluatorError         Base class for evaluator errors
ValidationError        A validator raised an unexpected exception
InvalidValidatorError  Invalid validator passed to builder

from vcti.pathformat import PathAccessError

try:
    results = identifier.identify_file_format(path)
except FileNotFoundError:
    print("File not found")
except PathAccessError as e:
    print(f"Cannot access path: {e}")

Ecosystem

This package is the identification engine in a three-repo system:

Package                       Role
vcti-path-format              Framework: evaluators, validators, registry, identifier
vcti-path-format-attributes   Vocabulary: standardized attribute enums
vcti-path-format-descriptors  Built-in format definitions (HDF5, CAX, etc.)

Dependencies
