Know what's in your files before you open them. Deterministic file observation engine with cryptographic vector identity.

Maintainers

russalo

File Observer

Know what's in your files before you open them.

File Observer scans directories and tells you exactly what's inside — file types, metadata, conversation patterns, author fingerprints, structural signals — all in a deterministic JSON manifest. It reads everything. It changes nothing.

pip install file-observer
fo ./your-project --specialists

Scanned 4,366 files (3,526 text, 840 binary) in 31 directories.

1,163 supported (336 with specialist metadata). 3,203 unsupported extensions.
Quality: 676 clean, 3,690 degraded. 4 safety flags, 2 polyglots.

Vectors: author_aggregate found 64 distinct authors across 114 files.
chatlog matched 22 files. reference_tokens ran on 806 files (2,164 URLs,
382 paths, 262 @mentions). filename_patterns matched 84 of 4366 files.

Largest directories: tika-parsers (2,037), tika-pipes (459), tika-core (440).

That's the human-readable summary. The full manifest has per-file metadata, provenance traces, vector digests, and a signed integrity envelope.


Package	`file-observer`
CLI	`file-observer` or `fo` (shorthand)
Version	`1.0.0`
Schema	`1.0`
Python	`>= 3.12`
License	AGPL-3.0 (commercial license available)
Tests	564 passed, validated against 12 corpora / 28,756 files

Why File Observer?

Your pipeline needs to know what it's processing before it processes it. File Observer is the observation layer that sits at the front of any document pipeline — ingestion, classification, OCR, embedding, audit. It tells the pipeline what's coming without touching the files.

Deterministic. Same files + same config = identical manifest, every time. Cross-environment variance is explained, never hidden.
Auditable. Every derived field has a provenance trace — which method, which trigger, which inputs. Nothing is a black box.
Honest. null means "not observed within bounds," not "not present." Safety flags are observations, not assessments. The scanner records; the consumer interprets.
Verified. Cryptographic identity digests on every vector. HMAC-signed manifests. Chain-of-custody across incremental scans.

What it observes

25 file types, 4 capability tiers

Tier	Runs for	What it extracts
Universal	Every file	Identity, checksum, MIME, file signatures, polyglot detection, routing flags
Baseline	Text files	Encoding, preview, tags, frontmatter, chatlog detection, reference tokens, filename patterns
Structural	Text files	Title, headings, CSV headers, JSON/YAML/XML/TOML keys, technology hints
Specialist	Supported formats (opt-in)	PDF pages, image dimensions, email envelopes, spreadsheet structure, document metadata

Supported specialist formats: .pdf, .png, .jpg, .msg, .eml, .xlsx, .xls, .docx, .doc, .rtf, .jsonl
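
The universal tier's file-signature check can be pictured as comparing a file's leading bytes against known magic numbers. The sketch below is illustrative only and is not File Observer's implementation; `SIGNATURES`, `sniff_signature`, and `is_mislabeled` are hypothetical names, and only a few well-known signatures are included.

```python
from __future__ import annotations

# Hypothetical sketch: a few well-known magic numbers mapped to MIME types.
SIGNATURES = {
    b"%PDF-": "application/pdf",
    b"\x89PNG\r\n\x1a\n": "image/png",
    b"\xff\xd8\xff": "image/jpeg",
    b"PK\x03\x04": "application/zip",  # also the container used by .docx/.xlsx
}


def sniff_signature(header: bytes) -> str | None:
    """Return the MIME type whose magic number prefixes `header`, else None."""
    for magic, mime in SIGNATURES.items():
        if header.startswith(magic):
            return mime
    return None


def is_mislabeled(extension: str, header: bytes, expected: dict[str, str]) -> bool:
    """True when the extension claims one type but the bytes indicate another."""
    sniffed = sniff_signature(header)
    claimed = expected.get(extension.lower())
    return sniffed is not None and claimed is not None and sniffed != claimed
```

A file named `scan.png` whose first bytes are `%PDF-` would be reported as mislabeled by this sketch, which is the kind of signal a routing step can act on without parsing the file.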

4 observation vectors with cryptographic identity

Vector	What it finds
chatlog	Conversation patterns — turns, speakers, section markers. Works on `.txt`, `.md`, `.jsonl`.
reference_tokens	@mentions, wiki links, code blocks, URLs, emails, file paths, ticket numbers
author_aggregate	Cross-format author normalization. Spots template defaults vs real humans.
filename_patterns	Date prefixes, version markers, numbered revisions, template names, UUIDs, copy suffixes

Each vector carries an identity digest (SHA-256). Same digest = same rules + same tuning = same output. Always.
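
One way to read "same digest = same rules + same tuning" is as a hash over a canonical encoding of a vector's configuration. The following is a minimal sketch under that assumption; the actual fields that File Observer hashes are not documented here, and `identity_digest` plus the `rules` and `tuning` dictionaries are hypothetical.

```python
from __future__ import annotations

import hashlib
import json


def identity_digest(rules: dict, tuning: dict) -> str:
    """SHA-256 over a canonical JSON encoding of a vector's rules and tuning."""
    canonical = json.dumps(
        {"rules": rules, "tuning": tuning},
        sort_keys=True,            # key order must not affect the digest
        separators=(",", ":"),     # no incidental whitespace
    )
    return hashlib.sha256(canonical.encode("utf-8")).hexdigest()


# Same content in a different key order yields the same digest.
a = identity_digest({"pattern": r"@\w+"}, {"min_hits": 2, "max_bytes": 65536})
b = identity_digest({"pattern": r"@\w+"}, {"max_bytes": 65536, "min_hits": 2})
# Changing one tuning value yields a different digest.
c = identity_digest({"pattern": r"@\w+"}, {"min_hits": 3, "max_bytes": 65536})
```

Canonicalizing before hashing is the design choice that matters: without sorted keys and fixed separators, two equivalent configurations could serialize differently and produce different digests.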

Safety and integrity

Safety flags — detects JavaScript in PDFs, macros in DOCX, OLE objects in RTF, external entities in XML
Manifest checksum — SHA-256 over the canonical manifest
HMAC signatures — optional signed manifests for audit chains
Delta scanning — track added/modified/removed files across incremental scans
Per-directory summary — corpus shape visible at a glance
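
Delta scanning can be thought of as a set comparison between two path-to-checksum maps. This is a simplified sketch, not the scanner's own logic; `diff_manifests` and its input shape are assumptions made for illustration.

```python
from __future__ import annotations


def diff_manifests(previous: dict[str, str], current: dict[str, str]) -> dict[str, list[str]]:
    """Compare two {path: checksum} maps and classify each path."""
    added = sorted(p for p in current if p not in previous)
    removed = sorted(p for p in previous if p not in current)
    modified = sorted(
        p for p in current if p in previous and current[p] != previous[p]
    )
    return {"added": added, "modified": modified, "removed": removed}


previous = {"a.txt": "aa11", "b.pdf": "bb22", "c.md": "cc33"}
current = {"a.txt": "aa11", "b.pdf": "ff99", "d.csv": "dd44"}
delta = diff_manifests(previous, current)
```

Sorting each list keeps the result stable across runs, which fits the determinism goal described earlier.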

Quick start

Install

pip install file-observer

# Optional: specialist format support
pip install "file-observer[msg]"       # .msg/.doc/.xls (OLE2 formats)
pip install "file-observer[security]"  # Hardened XML parsing
pip install "file-observer[dev]"       # Full dev environment

System requirement: libmagic for content-based MIME detection.

sudo apt install libmagic1    # Debian/Ubuntu
brew install libmagic         # macOS
pip install python-magic-bin  # Windows

Scan

# Quick scan
fo ./project

# Deep scan with specialist metadata
fo ./project --specialists

# Named profile with JSONL output
fo ./project --profile deep_extract --format jsonl

# Delta scan against a previous manifest, signed
fo ./project --previous-manifest ./last.json --signing-key-file ./key

Use in code

from pathlib import Path
from scanner import Scanner, ScannerConfig, manifest_to_json

config = ScannerConfig(enable_specialists=True)
manifest = Scanner(source_dir=Path("./documents"), config=config).scan()

# Human-readable summary
print(manifest.summary)

# Find conversation logs
for f in manifest.files:
    if f.is_chatlog and f.specialist_metadata:
        chat = f.specialist_metadata["chatlog"]
        print(f"{f.path}: {chat['turn_count']} turns, {chat['speaker_labels']}")

# Triage via quality block
q = manifest.quality
print(f"{q.clean_files}/{q.total_files} clean, {q.safety_flags} safety flags")

# Write manifest
Path("manifest.json").write_text(manifest_to_json(manifest))

Every scan also produces a standalone Markdown report (report_v{version}_{timestamp}.md) — readable in any browser, shareable, no JSON parsing required.

Use cases

Document pipeline preprocessing

Point File Observer at an incoming document folder before your ingestor touches it. Know which files need OCR, which have specialist metadata, which are mislabeled, and which carry safety flags — before processing begins.

AI training data curation

Scanning AI conversation logs, knowledge bases, and document corpora? File Observer detects chatlog patterns in .txt, .md, and .jsonl files, counts turns and speakers, and surfaces reference tokens (URLs, @mentions, code blocks) across thousands of files. Built for the datasets that train and evaluate language models.
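
To make "counts turns and speakers" concrete, here is a deliberately simple heuristic for plain-text transcripts that use `Speaker: text` lines. It is a sketch only; File Observer's chatlog rules and thresholds are not shown in this document, and `TURN_RE`, `count_turns`, `looks_like_chatlog`, and the default thresholds are hypothetical.

```python
from __future__ import annotations

import re
from collections import Counter

# A short label at the start of a line, followed by a colon and some text.
TURN_RE = re.compile(r"^(?P<speaker>[A-Za-z][\w .-]{0,30}):\s+\S")


def count_turns(text: str) -> Counter:
    """Count lines that look like conversation turns, keyed by speaker label."""
    speakers: Counter = Counter()
    for line in text.splitlines():
        match = TURN_RE.match(line)
        if match:
            speakers[match.group("speaker").strip()] += 1
    return speakers


def looks_like_chatlog(text: str, min_turns: int = 4, min_speakers: int = 2) -> bool:
    """Assumed thresholds: enough turns, spread over more than one speaker."""
    speakers = count_turns(text)
    return sum(speakers.values()) >= min_turns and len(speakers) >= min_speakers


sample = (
    "User: hi there\n"
    "Assistant: hello\n"
    "User: how are you?\n"
    "Assistant: fine, thanks\n"
)
```

A heuristic this loose will also match lines such as `Note: see below`, which is why the sketch requires several turns and at least two speakers before calling a file a chatlog.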

Audit and compliance

Every field has a provenance trace. Every vector has a cryptographic identity digest. Manifests can be HMAC-signed with chain-of-custody across incremental scans. When the auditor asks "how do you know this file contains X?" — the manifest answers.
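
The checksum and signature described above can be sketched with the standard library. This assumes the manifest is canonicalized as sorted-key JSON before hashing; the real canonical form and envelope layout are not specified here, and all function names below are hypothetical.

```python
from __future__ import annotations

import hashlib
import hmac
import json


def canonical_bytes(manifest: dict) -> bytes:
    """Serialize with sorted keys and fixed separators so equal content hashes equally."""
    return json.dumps(manifest, sort_keys=True, separators=(",", ":")).encode("utf-8")


def manifest_checksum(manifest: dict) -> str:
    """SHA-256 over the canonical manifest bytes."""
    return hashlib.sha256(canonical_bytes(manifest)).hexdigest()


def sign_manifest(manifest: dict, key: bytes) -> str:
    """HMAC-SHA256 signature over the canonical manifest bytes."""
    return hmac.new(key, canonical_bytes(manifest), hashlib.sha256).hexdigest()


def verify_manifest(manifest: dict, key: bytes, signature: str) -> bool:
    """Constant-time comparison avoids leaking how many characters matched."""
    return hmac.compare_digest(sign_manifest(manifest, key), signature)


manifest = {"files": [{"path": "a.txt", "checksum": "aa11"}], "schema": "1.0"}
signature = sign_manifest(manifest, b"example-key")
tampered = {"files": [{"path": "a.txt", "checksum": "ff99"}], "schema": "1.0"}
```

The checksum detects accidental change; the HMAC additionally shows that whoever produced the manifest held the signing key.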

Knowledge management and vault analysis

Run File Observer against an Obsidian vault, a Confluence export, or a shared drive. The per-directory summary shows corpus shape instantly. Reference tokens reveal link density, cross-references, and structural patterns. Author aggregation spots template defaults vs real contributors.
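
As an illustration of what reference tokens look like in a note or wiki page, the sketch below pulls URLs, @mentions, and wiki links out of text with regular expressions. The patterns are simplified assumptions and are not the vector's actual rules; `extract_reference_tokens` is a hypothetical name.

```python
from __future__ import annotations

import re

URL_RE = re.compile(r"https?://[^\s<>\"')\]]+")
# The lookbehind skips the "@" inside email addresses such as bob@example.com.
MENTION_RE = re.compile(r"(?<![\w.@])@([A-Za-z0-9_]+)")
# [[Target]] or [[Target|label]]; only the target is captured.
WIKILINK_RE = re.compile(r"\[\[([^\[\]|]+)(?:\|[^\[\]]*)?\]\]")


def extract_reference_tokens(text: str) -> dict[str, list[str]]:
    """Return URLs, @mentions, and wiki-link targets in order of appearance."""
    return {
        "urls": [url.rstrip(".,;:!?") for url in URL_RE.findall(text)],
        "mentions": MENTION_RE.findall(text),
        "wiki_links": [target.strip() for target in WIKILINK_RE.findall(text)],
    }


note = (
    "See [[Project Plan|plan]] and https://example.com/docs, "
    "ping @alice or mail bob@example.com."
)
tokens = extract_reference_tokens(note)
```

Counting these tokens per file gives a rough measure of link density for a vault or export.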

Migration and deduplication

Moving files between systems? File Observer gives you checksums, MIME analysis, format signatures, and polyglot detection for every file. Delta scanning tracks what changed between runs. Filename patterns catch copy suffixes, numbered revisions, and UUID-named files.
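
The filename signals mentioned above can be approximated with a small table of regular expressions applied to the file stem. This is a sketch with assumed patterns and labels; the real `filename_patterns` rules may differ, and `PATTERNS` and `match_filename_patterns` are hypothetical.

```python
from __future__ import annotations

import re
from pathlib import PurePath

# (label, pattern) pairs, checked in a fixed order for stable output.
PATTERNS = [
    ("date_prefix", re.compile(r"^\d{4}-\d{2}-\d{2}(?=[ _-]|$)")),
    ("version_marker", re.compile(r"(?:^|[ _-])v\d+(?:\.\d+)*(?=[ _-]|$)", re.IGNORECASE)),
    ("copy_suffix", re.compile(r"(?: \(\d+\)|[ _-]copy(?:[ _-]?\d+)?)$", re.IGNORECASE)),
    (
        "uuid",
        re.compile(
            r"^[0-9a-f]{8}-[0-9a-f]{4}-[0-9a-f]{4}-[0-9a-f]{4}-[0-9a-f]{12}$",
            re.IGNORECASE,
        ),
    ),
]


def match_filename_patterns(filename: str) -> list[str]:
    """Return the labels of every pattern that matches the file's stem."""
    stem = PurePath(filename).stem  # drop only the final extension
    return [label for label, pattern in PATTERNS if pattern.search(stem)]
```

For example, `budget (1).xlsx` and `notes - Copy.txt` both come back as `copy_suffix`, which makes them candidates for a deduplication review.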

Security triage

Safety flags surface JavaScript in PDFs, macros in DOCX files, OLE objects in RTF, and external entities in XML — without opening or executing anything. Feed the flags into your security pipeline for automated quarantine decisions.
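
A downstream consumer might turn the flags into a quarantine list along these lines. The sketch assumes the JSON manifest has a top-level `files` array whose entries expose `path` and `safety_flags`, mirroring the data class field names listed in the API reference; the exact JSON layout and the shape of a flag entry are assumptions, and `files_to_quarantine` is a hypothetical helper.

```python
from __future__ import annotations

import json


def files_to_quarantine(manifest_json: str) -> list[str]:
    """Return paths of files that carry at least one safety flag."""
    manifest = json.loads(manifest_json)
    flagged = []
    for record in manifest.get("files", []):
        # An empty or null value means nothing was observed; it is not a
        # guarantee that the file is safe.
        if record.get("safety_flags"):
            flagged.append(record["path"])
    return sorted(flagged)
```

Whether a flag should actually trigger quarantine is a policy decision for the consumer; the scanner only records the observation.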

How it works

fo ./corpus --specialists
  |
  +-- Universal tier     Every file: checksum, MIME, signatures, routing
  +-- Baseline tier      Text files: encoding, preview, tags, chatlog detection
  +-- Structural tier    Text files: title, headings, keys, technology hints
  +-- Specialist tier    Format-specific: PDF, images, email, spreadsheets, documents
  +-- Vector pass        chatlog, reference_tokens, filename_patterns (per-file)
  +-- Corpus vectors     author_aggregate (after all files processed)
  +-- Summary            Human-readable paragraph + per-directory breakdown
  |
  +-- Output: manifest.json + report.md

A failure on one file never halts the scan. Errors are captured per file and per stage, so the manifest is always complete.
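
The per-file, per-stage error capture described here follows a common pattern: wrap each stage call, record the failure on the file's own record, and keep going. The following is a generic sketch of that pattern, not the scanner's code; `run_stages` and the stage functions are hypothetical.

```python
from __future__ import annotations

from typing import Callable


def run_stages(paths: list[str], stages: dict[str, Callable[[str], dict]]) -> list[dict]:
    """Run every stage on every path; a failing stage is recorded, not raised."""
    records = []
    for path in paths:
        record: dict = {"path": path, "errors": []}
        for stage_name, stage in stages.items():
            try:
                record.update(stage(path))
            except Exception as exc:  # capture and continue with the next stage
                record["errors"].append(
                    {"stage": stage_name, "error": f"{type(exc).__name__}: {exc}"}
                )
        records.append(record)
    return records


def size_stage(path: str) -> dict:
    return {"name_length": len(path)}


def kind_stage(path: str) -> dict:
    if path.endswith(".bin"):
        raise ValueError("unreadable")
    return {"kind": "text"}


records = run_stages(["a.txt", "bad.bin"], {"size": size_stage, "kind": kind_stage})
```

Every input path yields exactly one record, so a consumer can rely on the output covering the whole corpus even when some stages fail.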

Configurable depth

Profile	Baseline byte limit	Specialists	Use case
`fast_sort`	8 KB	Off	Quick triage, file routing
`general`	64 KB	Off	Standard observation
`deep_extract`	1 MB	On	Full metadata extraction

Per-extension overrides let you give specific formats more budget:

fo ./docs --specialists --extension-override .pdf:specialist_budget=524288
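
How a profile and a per-extension override might combine can be sketched as follows. The profile values come from the table above (read as binary kilobytes and megabytes, which is an assumption), the default `specialist_budget` comes from the configuration listing in the API reference, and the parsing of the `ext:key=value` form is inferred from the single example shown. `PROFILES`, `parse_override`, and `effective_settings` are hypothetical names.

```python
from __future__ import annotations

PROFILES = {
    "fast_sort": {"baseline_max_bytes": 8 * 1024, "enable_specialists": False},
    "general": {"baseline_max_bytes": 64 * 1024, "enable_specialists": False},
    "deep_extract": {"baseline_max_bytes": 1024 * 1024, "enable_specialists": True},
}


def parse_override(spec: str) -> tuple[str, str, int]:
    """Split '.pdf:specialist_budget=524288' into (extension, key, value)."""
    extension, _, assignment = spec.partition(":")
    key, _, value = assignment.partition("=")
    if not (extension.startswith(".") and key and value):
        raise ValueError(f"malformed override: {spec!r}")
    return extension.lower(), key, int(value)


def effective_settings(profile: str, extension: str, overrides: list[str]) -> dict:
    """Start from the profile, then apply overrides that target this extension."""
    settings = dict(PROFILES[profile], specialist_budget=131072)
    for spec in overrides:
        ext, key, value = parse_override(spec)
        if ext == extension.lower():
            settings[key] = value
    return settings


overrides = [".pdf:specialist_budget=524288"]
pdf_settings = effective_settings("deep_extract", ".pdf", overrides)
docx_settings = effective_settings("deep_extract", ".docx", overrides)
```

In this sketch only PDFs receive the larger budget; every other extension keeps the profile's defaults.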

Validated at scale

File Observer has been tested against 12 real-world corpora totaling 28,756 files with zero errors:

Corpus	Files	What it tested
Apache Tika	4,366	152 document specialists, 69 PDFs, 57 spreadsheets, 13 emails
OBS Studio	5,201	Large C/C++ project, 91 filename patterns
AutoGPT	3,945	AI platform, 208 chatlog detections, 1,612 @mentions
FastAPI	3,002	Documentation-heavy Python, chatlog tuning validation
OpenPreserve	753	Adversarial format samples, 285 PDFs
Claude Code logs	125	Real AI conversation transcripts, JSONL chatlog detection
Flask, tmux, self-scan	11K+	Diverse code repos

Documentation

Document	What it covers
HISTORY.md	Every version from v0.1 to v1.0, with specs and compliance reports
PUBLIC_CONTRACT.md	Consumer stability commitments — what you can rely on
CONVENTIONS.md	Internal naming, versioning, and tracking
v1.0.0 RFC Specification	Current release spec — schema freeze, binding contract

API Reference

Core classes

Scanner(source_dir: Path, config: ScannerConfig | None = None)
Scanner.scan() -> ScanManifest

Configuration

ScannerConfig(
    enable_specialists=False,    # Enable format-specific extraction
    preview_max_chars=1000,      # Content preview length
    sample_size=8192,            # Binary detection sample
    baseline_max_bytes=65536,    # Text decode limit
    specialist_budget=131072,    # OOXML read budget
    format="json",               # "json" or "jsonl"
    exclude_hidden=False,        # Skip dot-files
    ignore_file=None,            # Path to .scannerignore
    previous_manifest=None,      # Delta scan reference
    signing_key=None,            # HMAC signing key
)

Output

manifest_to_json(manifest)      # Pretty-printed JSON
manifest_to_jsonl(manifest)     # NDJSON streaming format
manifest_to_markdown(manifest)  # Human-readable report

Key data classes

ScanManifest — top-level: context, stats, quality, vectors_collected, summary, files[]
FileRecord — per-file: path, mime, checksum, encoding, specialist_metadata, reference_tokens, filename_patterns, safety_flags, signal_provenance, errors
ScanContext — environment fingerprint: versions, platform, dependencies
VectorRecord — vector identity, digest, scope, applied count, summary

Contributing

We welcome contributions. See CONTRIBUTING.md for the full guide.

Quick version:

Fork and clone
pip install -e ".[dev]" and run tests
Sign the CLA on your first PR
One concern per PR, tests required, determinism preserved

License

File Observer is dual-licensed:

Open source under AGPL-3.0 — use freely, contribute back
Commercial license available for SaaS, proprietary embedding, and distribution without source disclosure

Internal use under AGPL requires no commercial license. Contact russalo@russalo.com for commercial terms.

Built by Russalo. The scanner records. The consumer interprets. The identity digest makes the recording auditable.

Release history

Version	Release date
1.14.0	Jun 13, 2026
1.13.0	Jun 10, 2026
1.12.1	Jun 8, 2026
1.12.0	Jun 7, 2026
1.11.0	Jun 6, 2026
1.10.0	Jun 5, 2026
1.9.1	Jun 5, 2026
1.9.0	Jun 5, 2026
1.8.2	Jun 4, 2026
1.8.1	Jun 4, 2026
1.8.0	Jun 4, 2026
1.7.0	Jun 3, 2026
1.6.0	Jun 3, 2026
1.5.0	Jun 2, 2026
1.4.0	Jun 2, 2026
1.3.0	Jun 2, 2026
1.2.4	Jun 2, 2026
1.2.3	Jun 1, 2026
1.2.2	Jun 1, 2026
1.2.1	Jun 1, 2026
1.2.0	May 31, 2026
1.1.0	May 31, 2026
1.0.2	May 31, 2026
1.0.1	May 31, 2026
1.0.0 (this version)	May 28, 2026

Download files

Source Distribution

file_observer-1.0.0.tar.gz (92.5 kB)

Uploaded May 28, 2026 Source

Built Distribution

file_observer-1.0.0-py3-none-any.whl (56.5 kB)

Uploaded May 28, 2026 Python 3

File details

Details for the file file_observer-1.0.0.tar.gz.

File metadata

Download URL: file_observer-1.0.0.tar.gz
Upload date: May 28, 2026
Size: 92.5 kB
Tags: Source
Uploaded using Trusted Publishing? Yes
Uploaded via: twine/6.1.0 CPython/3.13.12

File hashes

Hashes for file_observer-1.0.0.tar.gz
Algorithm	Hash digest
SHA256	`319b7e1a3b614965c6dd366a2ddf5593aa484f2f84940c8cde3975d9fba87a78`
MD5	`49ff8ebda92114c23c77d5ad18301975`
BLAKE2b-256	`88e7b27e270980c209d8505e6dde50d7b6a98ca3850f11474e38ccfaa65e3703`

Provenance

The following attestation bundles were made for file_observer-1.0.0.tar.gz:

Publisher: publish.yml on russalo/file-observer

Attestations: Values shown here reflect the state when the release was signed and may no longer be current.

Statement:
- Statement type: https://in-toto.io/Statement/v1
- Predicate type: https://docs.pypi.org/attestations/publish/v1
- Subject name: file_observer-1.0.0.tar.gz
- Subject digest: 319b7e1a3b614965c6dd366a2ddf5593aa484f2f84940c8cde3975d9fba87a78
- Sigstore transparency entry: 1650555328
- Sigstore integration time: May 28, 2026
Source repository:
- Permalink: russalo/file-observer@d5ed2cc4ad76deb1e63b0417d32e361dfe89f704
- Branch / Tag: refs/tags/v1.0.0
- Owner: https://github.com/russalo
- Access: public
Publication detail:
- Token Issuer: https://token.actions.githubusercontent.com
- Runner Environment: github-hosted
- Publication workflow: publish.yml@d5ed2cc4ad76deb1e63b0417d32e361dfe89f704
- Trigger Event: release

File details

Details for the file file_observer-1.0.0-py3-none-any.whl.

File metadata

Download URL: file_observer-1.0.0-py3-none-any.whl
Upload date: May 28, 2026
Size: 56.5 kB
Tags: Python 3
Uploaded using Trusted Publishing? Yes
Uploaded via: twine/6.1.0 CPython/3.13.12

File hashes

Hashes for file_observer-1.0.0-py3-none-any.whl
Algorithm	Hash digest
SHA256	`2a48aed2227b860ed8ee231df815c6838906755edc8173ccb24040c01e0364cb`
MD5	`b6c339b5c9ff4845d049d009446fab56`
BLAKE2b-256	`ff6241498d52fa9465a0dcf07bc9567dc96f4321769170e7b221346bfbd19ebe`

Provenance

The following attestation bundles were made for file_observer-1.0.0-py3-none-any.whl:

Publisher: publish.yml on russalo/file-observer

Attestations: Values shown here reflect the state when the release was signed and may no longer be current.

Statement:
- Statement type: https://in-toto.io/Statement/v1
- Predicate type: https://docs.pypi.org/attestations/publish/v1
- Subject name: file_observer-1.0.0-py3-none-any.whl
- Subject digest: 2a48aed2227b860ed8ee231df815c6838906755edc8173ccb24040c01e0364cb
- Sigstore transparency entry: 1650555347
- Sigstore integration time: May 28, 2026
Source repository:
- Permalink: russalo/file-observer@d5ed2cc4ad76deb1e63b0417d32e361dfe89f704
- Branch / Tag: refs/tags/v1.0.0
- Owner: https://github.com/russalo
- Access: public
Publication detail:
- Token Issuer: https://token.actions.githubusercontent.com
- Runner Environment: github-hosted
- Publication workflow: publish.yml@d5ed2cc4ad76deb1e63b0417d32e361dfe89f704
- Trigger Event: release
