Path Format
File format identification framework with heuristic evaluators and feature validators for Python.
Installation
pip install "vcti-path-format>=1.2.0"
In pyproject.toml dependencies:
dependencies = [
    "vcti-path-format>=1.2.0",
]
Quick Start
from pathlib import Path

from vcti.pathformat import (
    FormatDescriptor,
    FormatIdentifier,
    FormatRegistry,
    MatchConfidence,
)
from vcti.pathformat.evaluator import HeuristicEvaluator

# Define a format descriptor with validators
hdf5_descriptor = FormatDescriptor(
    id="hdf5-file",
    name="HDF5 File",
    evaluator=(
        HeuristicEvaluator()
        .check_magic_bytes(b"\x89HDF\r\n\x1a\n")  # GATE
        .check_extension([".h5", ".hdf5", ".he5"])  # EVIDENCE
    ),
    attributes={"path_type": "file", "structure": "hdf5"},
)

# Register in a format registry
registry = FormatRegistry()
registry.register(hdf5_descriptor)

# Identify a file
identifier = FormatIdentifier(registry)
results = identifier.identify_file_format(Path("data.h5"))
for result in results:
    print(f"{result.descriptor.name}: {result.confidence.name}")

# Get best match above a confidence threshold
best = identifier.get_best_match(
    Path("data.h5"),
    min_confidence=MatchConfidence.LIKELY,
)
Core Concepts
FormatDescriptor
Extends Descriptor[Evaluator] from vcti-plugin-catalog. Wraps an evaluator
with format metadata and attributes.
FormatRegistry
Extends Registry[FormatDescriptor]. Central catalog of known formats with
attribute-based filtering via registry.lookup.
FormatIdentifier
Evaluates a path against all (or filtered) registered formats and returns results sorted by confidence.
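The ranking step can be pictured with a small standalone sketch. It does not import the package: `Confidence`, `Match`, `rank`, and `best_match` are hypothetical names, and the numeric ordering of the levels is an assumption (the README does not say where CANT_EVALUATE sorts, so it is left out here).

```python
from dataclasses import dataclass
from enum import IntEnum
from typing import Optional


class Confidence(IntEnum):
    # Assumed ordering, weakest to strongest; not taken from the library.
    CERTAINLY_NOT = 0
    UNLIKELY = 1
    LIKELY = 2
    DEFINITE = 3


@dataclass(frozen=True)
class Match:
    format_id: str
    confidence: Confidence


def rank(matches: list[Match]) -> list[Match]:
    """Return matches ordered from most to least confident."""
    return sorted(matches, key=lambda m: m.confidence, reverse=True)


def best_match(matches: list[Match], min_confidence: Confidence) -> Optional[Match]:
    """Return the top-ranked match if it meets the threshold, else None."""
    ranked = rank(matches)
    if ranked and ranked[0].confidence >= min_confidence:
        return ranked[0]
    return None


candidates = [
    Match("csv-file", Confidence.UNLIKELY),
    Match("hdf5-file", Confidence.DEFINITE),
    Match("netcdf-file", Confidence.LIKELY),
]
ranked_ids = [m.format_id for m in rank(candidates)]
```

A threshold such as `Confidence.LIKELY` then filters out weak candidates without the caller having to inspect the full list.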
HeuristicEvaluator
Builder-pattern evaluator that aggregates validation evidence:
evaluator = (
    HeuristicEvaluator()
    .check_magic_bytes(b"\x89PNG\r\n\x1a\n")  # GATE
    .check_extension([".png"])  # EVIDENCE
    .add_validator(custom_validator)  # Custom
)
Heuristic rules:
- Failed GATE -> CERTAINLY_NOT
- All passed + GATE present -> DEFINITE
- All passed + no GATE -> LIKELY
- Some EVIDENCE failed -> UNLIKELY
- No validators -> CANT_EVALUATE
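The rules above can be written out as a single pure function. This is a minimal sketch of the decision table only, not the library's implementation; `Role`, `Confidence`, `Outcome`, and `aggregate` are hypothetical names, and the order in which the rules are tested is an assumption.

```python
from dataclasses import dataclass
from enum import Enum


class Role(Enum):
    GATE = "gate"
    EVIDENCE = "evidence"


class Confidence(Enum):
    CERTAINLY_NOT = "certainly_not"
    UNLIKELY = "unlikely"
    LIKELY = "likely"
    DEFINITE = "definite"
    CANT_EVALUATE = "cant_evaluate"


@dataclass(frozen=True)
class Outcome:
    role: Role
    is_passed: bool


def aggregate(outcomes: list[Outcome]) -> Confidence:
    """Map validator outcomes to a confidence level using the listed rules."""
    if not outcomes:
        return Confidence.CANT_EVALUATE  # no validators ran
    gates = [o for o in outcomes if o.role is Role.GATE]
    if any(not o.is_passed for o in gates):
        return Confidence.CERTAINLY_NOT  # a failed GATE rules the format out
    if all(o.is_passed for o in outcomes):
        # Everything passed; a GATE makes the result definitive.
        return Confidence.DEFINITE if gates else Confidence.LIKELY
    return Confidence.UNLIKELY  # gates passed (or absent), some EVIDENCE failed
```

Testing the GATE failure before anything else means a wrong signature always wins over a matching extension.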
Feature Validators
| Validator | Role | Tier | Checks |
|---|---|---|---|
| MagicBytesValidator | GATE | IDENTIFICATION | File signature bytes |
| ExtensionValidator | EVIDENCE | IDENTIFICATION | File extension |
Custom validators implement the FeatureValidator protocol.
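For intuition, the two checks in the table can be approximated with the standard library alone. The helpers below are hypothetical and are not the package's validators; the `offset` parameter and the case-insensitive extension comparison are assumptions.

```python
import tempfile
from pathlib import Path


def has_magic_bytes(path: Path, signature: bytes, offset: int = 0) -> bool:
    """Return True if the file holds `signature` at byte `offset`."""
    try:
        with path.open("rb") as handle:
            handle.seek(offset)
            return handle.read(len(signature)) == signature
    except OSError:
        return False  # unreadable files simply fail the check


def has_extension(path: Path, extensions: list[str]) -> bool:
    """Compare the final suffix against the allowed list, ignoring case."""
    return path.suffix.lower() in {ext.lower() for ext in extensions}


PNG_SIGNATURE = b"\x89PNG\r\n\x1a\n"

with tempfile.TemporaryDirectory() as tmp:
    sample = Path(tmp) / "image.PNG"
    sample.write_bytes(PNG_SIGNATURE + b"remaining file content")
    magic_ok = has_magic_bytes(sample, PNG_SIGNATURE)
    ext_ok = has_extension(sample, [".png"])
    wrong_magic = has_magic_bytes(sample, b"\x89HDF\r\n\x1a\n")
```

Both checks read at most a few bytes or none at all, which is why they sit in the cheapest tier.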
Validation Tiers
Control evaluation depth with max_tier:
| Tier | Cost | Examples |
|---|---|---|
| IDENTIFICATION | Cheap | Magic bytes, file extension |
| STRUCTURE | Medium | Schema validation, header parsing |
| SEMANTIC | Expensive | Content analysis, business logic |
from vcti.pathformat import ValidationTier
# Only run cheap checks
results = identifier.identify_file_format(path, max_tier=ValidationTier.IDENTIFICATION)
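One way to picture what a tier cap does is as a filter over the configured checks. This is a sketch under assumptions: `Tier`, `Check`, and `select_checks` are hypothetical, and the numeric values only encode the cheap-to-expensive order from the table.

```python
from dataclasses import dataclass
from enum import IntEnum


class Tier(IntEnum):
    # Values mirror the cost order in the table; the numbers are assumed.
    IDENTIFICATION = 1
    STRUCTURE = 2
    SEMANTIC = 3


@dataclass(frozen=True)
class Check:
    name: str
    tier: Tier


def select_checks(checks: list[Check], max_tier: Tier) -> list[Check]:
    """Keep only the checks whose tier does not exceed `max_tier`."""
    return [check for check in checks if check.tier <= max_tier]


configured = [
    Check("magic-bytes", Tier.IDENTIFICATION),
    Check("extension", Tier.IDENTIFICATION),
    Check("header-check", Tier.STRUCTURE),
    Check("content-analysis", Tier.SEMANTIC),
]
cheap_only = [c.name for c in select_checks(configured, Tier.IDENTIFICATION)]
up_to_structure = [c.name for c in select_checks(configured, Tier.STRUCTURE)]
```

Capping at the cheapest tier is useful when scanning many paths, with deeper tiers reserved for the few candidates that survive.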
Custom Validators
Implement the FeatureValidator protocol to add domain-specific checks:
from pathlib import Path

from vcti.pathformat.evaluator import HeuristicEvaluator
from vcti.pathformat.feature_validator import (
    FeatureValidator,
    ValidationResult,
    ValidationTier,
    ValidatorRole,
)


class HeaderValidator:
    """Checks that the first line of a file equals an expected header."""

    id = "header-check"
    description = "Header line validator"
    role = ValidatorRole.EVIDENCE
    tier = ValidationTier.STRUCTURE

    def __init__(self, expected_header: str):
        self.expected_header = expected_header

    def validate(self, path: Path) -> ValidationResult:
        try:
            first_line = path.read_text(encoding="utf-8").split("\n", 1)[0]
            is_passed = first_line.strip() == self.expected_header
        except (OSError, UnicodeDecodeError):
            # Unreadable or non-UTF-8 files count as a failed check
            is_passed = False
        return ValidationResult(
            validator_id=self.id,
            role=self.role,
            is_passed=is_passed,
            details=f"Header {'matches' if is_passed else 'mismatch'}",
        )


# Use with the builder pattern
evaluator = (
    HeuristicEvaluator()
    .check_extension([".csv"])
    .add_validator(HeaderValidator("id,name,value"))
)
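Because FeatureValidator is described as a protocol, a validator class such as HeaderValidator needs no base class; it only has to expose the expected members. The sketch below illustrates that idea with Python's `typing.Protocol`. `SupportsValidate` is a hypothetical stand-in with a reduced member set, not the package's actual protocol definition.

```python
from pathlib import Path
from typing import Protocol, runtime_checkable


@runtime_checkable
class SupportsValidate(Protocol):
    """Hypothetical, reduced stand-in for a validator protocol."""

    id: str
    description: str

    def validate(self, path: Path) -> object: ...


class AlwaysPasses:
    # No inheritance: matching member names is what makes it conform.
    id = "always-passes"
    description = "Accepts every path"

    def validate(self, path: Path) -> object:
        return True


class MissingValidate:
    id = "broken"
    description = "Has no validate method"


conforms = isinstance(AlwaysPasses(), SupportsValidate)
rejected = isinstance(MissingValidate(), SupportsValidate)
```

A runtime `isinstance` check of this kind only confirms that the members exist; it does not verify their types or signatures.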
Evaluator Caching
HeuristicEvaluator includes an LRU cache keyed by (path, max_tier):
# Default: 128 entries
evaluator = HeuristicEvaluator(cache_size=128)
# Disable caching
evaluator = HeuristicEvaluator(cache_size=0)
# Bypass cache for a single call
report = descriptor.evaluate(path, use_cache=False)
# Inspect and manage
info = evaluator.cache_info() # (hits, misses, maxsize, currsize) or None
evaluator.clear_cache()
Cache entries become stale if file contents change. Call clear_cache() after
known file modifications, or pass use_cache=False for one-off re-evaluation.
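The staleness caveat is easy to reproduce with `functools.lru_cache`, used here purely as a stand-in: the README does not say how the evaluator's cache is implemented. `first_bytes` is a hypothetical function keyed by its arguments, in the same spirit as the (path, max_tier) key.

```python
import tempfile
from functools import lru_cache
from pathlib import Path


@lru_cache(maxsize=128)
def first_bytes(path: Path, count: int) -> bytes:
    """Read the first `count` bytes; results are cached per (path, count)."""
    with path.open("rb") as handle:
        return handle.read(count)


with tempfile.TemporaryDirectory() as tmp:
    sample = Path(tmp) / "data.bin"

    sample.write_bytes(b"AAAA")
    before = first_bytes(sample, 4)  # cache miss, reads the file

    sample.write_bytes(b"BBBB")
    stale = first_bytes(sample, 4)  # cache hit, still the old content

    first_bytes.cache_clear()
    fresh = first_bytes(sample, 4)  # cache miss again, reads new content

info = first_bytes.cache_info()  # counters restart after cache_clear()
```

The key contains only the path and the argument, not the file's contents or modification time, so the cache cannot notice the rewrite on its own.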
Pre-filtering with Rules
from vcti.lookup import Rule

# Only evaluate formats with structure="hdf5"
results = identifier.identify_file_format(
    path,
    rules=[Rule("structure", "==", "hdf5")],
)
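Conceptually, a rule narrows the candidate set by comparing descriptor attributes before any evaluator runs. The sketch below mimics that with plain dictionaries; `SimpleRule`, `OPERATORS`, and `matches` are hypothetical and do not reflect how vcti-lookup works internally. Treating a missing attribute as a non-match is an assumption.

```python
import operator
from typing import Any, NamedTuple

# Only two operators are sketched here.
OPERATORS = {"==": operator.eq, "!=": operator.ne}


class SimpleRule(NamedTuple):
    attribute: str
    op: str
    value: Any


def matches(attributes: dict[str, Any], rules: list[SimpleRule]) -> bool:
    """Return True only if every rule holds for the given attributes."""
    for rule in rules:
        if rule.attribute not in attributes:
            return False  # assumed: absent attributes never match
        if not OPERATORS[rule.op](attributes[rule.attribute], rule.value):
            return False
    return True


catalog = {
    "hdf5-file": {"path_type": "file", "structure": "hdf5"},
    "csv-file": {"path_type": "file", "structure": "text"},
    "dataset-dir": {"path_type": "directory"},
}
rules = [SimpleRule("structure", "==", "hdf5")]
selected = [name for name, attrs in catalog.items() if matches(attrs, rules)]
```

Filtering on attributes first keeps file reads proportional to the number of plausible formats rather than the size of the registry.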
Error Handling
The framework raises typed exceptions:
| Exception | When |
|---|---|
| FileNotFoundError | Path does not exist |
| PathAccessError | Path is not a file or directory, or cannot be read |
| EvaluatorError | Base class for evaluator errors |
| ValidationError | A validator raised an unexpected exception |
| InvalidValidatorError | Invalid validator passed to builder |
from vcti.pathformat import PathAccessError

try:
    results = identifier.identify_file_format(path)
except FileNotFoundError:
    print("File not found")
except PathAccessError as e:
    print(f"Cannot access path: {e}")
Ecosystem
This package is the identification engine in a three-repo system:
| Package | Role |
|---|---|
| vcti-path-format | Framework: evaluators, validators, registry, identifier |
| vcti-path-format-attributes | Vocabulary: standardized attribute enums |
| vcti-path-format-descriptors | Built-in format definitions (HDF5, CAX, etc.) |
Dependencies
- vcti-plugin-catalog (>=1.0.1) — descriptor/registry framework
- vcti-lookup (>=1.0.1) — attribute-based filtering