filedna

Discover a file's true identity.

███████╗██╗██╗     ███████╗██████╗ ███╗   ██╗ █████╗
██╔════╝██║██║     ██╔════╝██╔══██╗████╗  ██║██╔══██╗
█████╗  ██║██║     █████╗  ██║  ██║██╔██╗ ██║███████║
██╔══╝  ██║██║     ██╔══╝  ██║  ██║██║╚██╗██║██╔══██║
██║     ██║███████╗███████╗██████╔╝██║ ╚████║██║  ██║
╚═╝     ╚═╝╚══════╝╚══════╝╚═════╝ ╚═╝  ╚═══╝╚═╝  ╚═╝

Discover a file's true identity.

The Python file analysis library that trusts content, not extensions.

Quick Start · Why FileDNA? · API Reference · CLI · Real-World Use Cases

The Problem Every Developer Knows

# You've written some version of this. Every project. Every time.
def handle_upload(file_path):
    ext = file_path.split(".")[-1]         # ← trusting the extension
    if ext == "pdf":
        process_pdf(file_path)             # ← what if it's actually a PNG?
    elif ext == "docx":
        process_docx(file_path)            # ← what if it's corrupted?
    elif ext == "mp3":
        process_audio(file_path)           # ← what if it's a ZIP with malware?

Extensions lie. FileDNA doesn't.

from filedna import analyze

result = analyze("invoice.pdf")

result.real_type          # "png"   ← it's actually a PNG
result.extension_matches  # False   ← extension lied
result.risk_score         # 90      ← high risk
result.errors             # ["File is not a valid PDF (real type: png)"]

One function. Every file type. No system dependencies. No API keys.

🚀 Quick Start

pip install filedna

from filedna import analyze

result = analyze("report.pdf")

print(result.valid)           # True
print(result.real_type)       # "pdf"
print(result.risk_score)      # 0
print(result.summary)
# ✓ Valid PDF
# Pages: 34
# Language: en
# Contains tables
# Size: 4.2 MB
# Tokens: 15.4k
# Risk Score: 0

🧬 Why FileDNA?

The competition does one thing. FileDNA does everything.

Feature	`python-magic`	`filetype`	`puremagic`	`file-validator`	FileDNA
Magic byte detection	✓	✓	✓	✓	✓
No system deps (no `libmagic`)	✗	✓	✓	partial	✓
Extension mismatch detection	✗	✗	✗	✗	✓
Structural validation (is PDF parseable?)	✗	✗	✗	✗	✓
Risk score 0–100	✗	✗	✗	✗	✓
Rich metadata (pages, dimensions, duration)	✗	✗	✗	✗	✓
Embedded executable detection	✗	✗	✗	✗	✓
PII detection & redaction	✗	✗	✗	✗	✓
Duplicate file detection	✗	✗	✗	✗	✓
Batch analysis with concurrency	✗	✗	✗	✗	✓
Single unified API	✗	✗	✗	✗	✓

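python-magic requires libmagic as a system dependency (apt install libmagic1 or brew install libmagic), an extra install step that is awkward on Windows, in slim Docker images, and on serverless platforms. FileDNA is pure Python and needs no system libraries; the only optional external tool is ffprobe, for full video metadata.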

🔬 What FileDNA Detects

Extension Mismatch — The Spoofed File Problem

# photo.png renamed to invoice.pdf
result = analyze("invoice.pdf")

result.valid              # False
result.real_type          # "png"          ← magic bytes say PNG
result.mime               # "image/png"
result.extension          # "pdf"          ← what the filename claims
result.extension_matches  # False          ← they don't match
result.risk_score         # 90             ← high risk
result.errors             # ["File is not a valid PDF (real type: png)"]
result.warnings           # ["Extension mismatch"]
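
To make the idea concrete, here is a minimal standard-library sketch of magic-byte sniffing plus an extension check. It is not FileDNA's implementation: `SIGNATURES`, `sniff_type`, and `extension_matches` are hypothetical names, and the signature table covers only a handful of formats.

```python
from pathlib import Path
from typing import Optional

# A few well-known signatures found at offset 0. Real detectors know many more.
SIGNATURES = {
    b"\x89PNG\r\n\x1a\n": "png",
    b"%PDF-": "pdf",
    b"\xff\xd8\xff": "jpg",
    b"GIF87a": "gif",
    b"GIF89a": "gif",
    b"PK\x03\x04": "zip",
}


def sniff_type(header: bytes) -> Optional[str]:
    """Return a type name if the header starts with a known signature."""
    for magic, name in SIGNATURES.items():
        if header.startswith(magic):
            return name
    return None


def extension_matches(filename: str, header: bytes) -> bool:
    """Compare the declared extension with the sniffed type.

    Simplification: aliases such as .jpeg vs .jpg are not normalised here.
    """
    declared = Path(filename).suffix.lower().lstrip(".")
    return sniff_type(header) == declared
```

Reading the first 16 bytes of a file and passing them as `header` is enough for these signatures; the filename is only used for its suffix.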

Corrupted Files

result = analyze("broken.docx")

result.valid       # False
result.real_type   # "docx"
result.risk_score  # 70
result.errors      # ["DOCX is corrupted: bad ZIP structure"]
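
A DOCX file is a ZIP container, so one way to approximate this kind of structural check is with the standard `zipfile` module. The sketch below is an assumption about how such a check could look, not FileDNA's code; `check_docx` and `REQUIRED_DOCX_PARTS` are hypothetical, and the required-parts list reflects the conventional DOCX layout only.

```python
import io
import zipfile

# Parts a conventional DOCX package is expected to contain.
REQUIRED_DOCX_PARTS = {"[Content_Types].xml", "word/document.xml"}


def check_docx(data: bytes) -> list:
    """Return a list of problems; an empty list means the checks passed."""
    try:
        with zipfile.ZipFile(io.BytesIO(data)) as zf:
            bad_member = zf.testzip()  # first member with a bad CRC, or None
            if bad_member is not None:
                return [f"DOCX is corrupted: bad CRC in {bad_member}"]
            missing = REQUIRED_DOCX_PARTS - set(zf.namelist())
            if missing:
                return [f"DOCX is missing parts: {sorted(missing)}"]
    except zipfile.BadZipFile:
        return ["DOCX is corrupted: bad ZIP structure"]
    return []
```

This only checks the container; it does not parse the XML inside, so a fuller validator would go further.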

Embedded Executables

# ZIP containing a hidden .exe
result = analyze("document.zip")

result.risk_score  # 100
result.warnings    # ["Embedded executable detected: payload.exe"]
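
A rough sketch of how an archive could be scanned for executable members by name, using only the standard library. `find_executables` is a hypothetical helper, and the suffix set is taken from the examples listed in the risk table (.exe, .dll, .bat, .ps1); FileDNA may check more than names.

```python
import io
import zipfile
from pathlib import PurePosixPath

# Suffixes treated as executable in this sketch.
EXECUTABLE_SUFFIXES = {".exe", ".dll", ".bat", ".ps1"}


def find_executables(zip_bytes: bytes) -> list:
    """Return archive member names whose suffix looks executable."""
    with zipfile.ZipFile(io.BytesIO(zip_bytes)) as zf:
        return [
            name
            for name in zf.namelist()
            if PurePosixPath(name).suffix.lower() in EXECUTABLE_SUFFIXES
        ]
```

A name-based check is cheap but easy to evade, so it works best combined with signature sniffing of each member's first bytes.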

PII in Documents

from filedna import detect_pii, redact_pii

text = "Contact sarah@company.com or call +1-415-555-0192. Card: 4532015112830366"

pii = detect_pii(text)
pii.has_pii       # True
pii.types_found   # ["email", "phone_us", "credit_card"]
pii.count         # 3

# Replace all PII instantly
clean = redact_pii(text)
# "Contact [REDACTED_EMAIL] or call +[REDACTED_PHONE_US]. Card: [REDACTED_CREDIT_CARD]"
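
For intuition, pattern-based PII detection can be sketched with regular expressions plus a Luhn checksum to cut down false positives on card numbers. This is an illustrative assumption, not FileDNA's detector: `EMAIL_RE`, `CARD_RE`, `luhn_valid`, and `redact` are hypothetical, and only emails and card numbers are handled.

```python
import re

EMAIL_RE = re.compile(r"[A-Za-z0-9._%+-]+@[A-Za-z0-9.-]+\.[A-Za-z]{2,}")
CARD_RE = re.compile(r"\b\d{13,19}\b")  # unbroken digit runs only


def luhn_valid(number: str) -> bool:
    """Luhn checksum used by payment card numbers."""
    total = 0
    for i, ch in enumerate(reversed(number)):
        digit = int(ch)
        if i % 2 == 1:  # double every second digit, counting from the right
            digit *= 2
            if digit > 9:
                digit -= 9
        total += digit
    return total % 10 == 0


def redact(text: str) -> str:
    """Replace emails and Luhn-valid card numbers with [REDACTED_TYPE] tags."""
    text = EMAIL_RE.sub("[REDACTED_EMAIL]", text)
    return CARD_RE.sub(
        lambda m: "[REDACTED_CREDIT_CARD]" if luhn_valid(m.group()) else m.group(),
        text,
    )
```

Digit runs that fail the checksum are left untouched, which keeps order numbers and similar identifiers from being redacted by mistake.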

📦 Full API Reference

Core Analysis

from filedna import analyze, validate, detect_type, inspect_file, inspect_url, estimate_tokens

# Full identity report: type + validation + metadata + risk score
result = analyze("file.pdf")
result = analyze("file.pdf", skip_metadata=True)   # faster, no metadata

# Structural validation only (faster than analyze)
result = validate("file.pdf")
if not result.valid:
    print(result.errors)    # ["File is not a valid PDF"]

# Real type from magic bytes — fastest, no validation
detect_type("photo.pdf")    # → "png"    extension lied
detect_type("data.zip")     # → "docx"  actually a Word document

# Type-specific metadata
meta = inspect_file("report.pdf")
# {"pages": 34, "language": "en", "contains_tables": True, "estimated_tokens": 15423}

# URL inspection via HTTP HEAD — no download
info = inspect_url("https://example.com/file.pdf")
# {"valid": True, "real_type": "pdf", "size_human": "4.2 MB", "status_code": 200}

# LLM token count estimate
estimate_tokens("report.pdf")   # → 15423
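
A HEAD response carries headers only, so anything learned this way is what the server declares rather than what the bytes contain. The sketch below shows the general shape of such a request with `urllib`; `head_info` and its `opener` parameter are hypothetical and exist here so the function can be exercised without a network.

```python
from urllib.request import Request, urlopen


def head_info(url: str, opener=urlopen, timeout: float = 10.0) -> dict:
    """Send a HEAD request and report what the response headers claim."""
    request = Request(url, method="HEAD")
    with opener(request, timeout=timeout) as response:
        return {
            "status_code": response.status,
            "content_type": response.headers.get("Content-Type"),
            "content_length": response.headers.get("Content-Length"),
        }
```

Servers are free to omit or misreport these headers, so a download and a magic-byte check are still needed before trusting the declared type.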

The AnalysisResult Object

result = analyze("file.pdf")

result.valid              # bool   — passed all checks?
result.real_type          # str    — "pdf", "png", "mp3", "zip"...
result.mime               # str    — "application/pdf"
result.extension          # str    — declared extension from filename
result.extension_matches  # bool   — does extension match real type?
result.size_bytes         # int    — 4213567
result.size_human         # str    — "4.2 MB"
result.risk_score         # int    — 0 (clean) to 100 (dangerous)
result.warnings           # list   — ["Extension mismatch"]
result.errors             # list   — ["File is not a valid PDF"]
result.metadata           # dict   — pages, dims, duration, etc.
result.summary            # str    — human-readable multi-line summary

# Serialize to JSON
import json
print(json.dumps(result.model_dump(), indent=2))

File Identity Utilities

from pathlib import Path

from filedna import (
    extract_exif,       # GPS coords, camera model, focal length, ISO
    detect_pii,         # email, phone, credit card, SSN, IBAN, API keys
    redact_pii,         # replace PII with [REDACTED_TYPE] tags
    content_hash,       # SHA-256 + MD5 in one call
    find_duplicates,    # content-based dedup across a folder
    diff_files,         # what changed between two versions?
    analyze_many,       # batch analysis with thread pool
)

# EXIF: GPS, camera, timestamps — no manual DMS→decimal conversion
exif = extract_exif("photo.jpg")
exif.camera_make          # "Apple"
exif.camera_model         # "iPhone 15 Pro"
exif.focal_length         # 6.86   (mm)
exif.iso                  # 50
exif.datetime_taken       # "2024:03:15 14:22:31"
exif.gps.latitude         # 51.507351   (decimal degrees, ready to use)
exif.gps.google_maps_url  # "https://www.google.com/maps?q=51.507351,-0.127758"

# Content hashing — SHA-256 + MD5, streams large files
h = content_hash("contract.pdf")
h.sha256   # "a750aec01847d06d..."
h.md5      # "d7591a0ac484c964..."
h == content_hash("contract_copy.pdf")   # True if identical content

# Find duplicates in an uploads folder
groups = find_duplicates([p for p in Path("uploads").rglob("*") if p.is_file()])
for g in groups:
    print(f"{g.count} copies, {g.wasted_bytes} bytes wasted")
    # keep first, delete the rest
    for dup in g.paths[1:]:
        dup.unlink()

# Diff two document versions
diff = diff_files("contract_v1.pdf", "contract_v2.pdf")
diff.lines_added    # 6
diff.lines_removed  # 3
diff.diff_ratio     # 0.72
diff.summary        # "+6 added, -3 removed, 72% similar"
diff.unified_diff   # standard --- a/ +++ b/ format

# Batch analysis with concurrency
batch = analyze_many([p for p in Path("uploads").glob("*") if p.is_file()], max_workers=8)
batch.total            # 50
batch.succeeded        # 47
batch.duration_seconds # 1.24
# Find all high-risk files
risky = [p for p, r in batch.results.items() if r.risk_score > 50]
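
Streaming a large file through two hash objects in a single pass might look like the following sketch. `hash_stream` is a hypothetical helper built on `hashlib`, and the 1 MiB chunk size is an arbitrary choice, not a documented FileDNA default.

```python
import hashlib
from typing import BinaryIO


def hash_stream(stream: BinaryIO, chunk_size: int = 1 << 20) -> dict:
    """Compute SHA-256 and MD5 in one pass without loading the whole file."""
    sha256 = hashlib.sha256()
    md5 = hashlib.md5()
    # iter() with a sentinel keeps reading until read() returns b"".
    for chunk in iter(lambda: stream.read(chunk_size), b""):
        sha256.update(chunk)
        md5.update(chunk)
    return {"sha256": sha256.hexdigest(), "md5": md5.hexdigest()}
```

Open the file in binary mode (`open(path, "rb")`) and pass the handle in; memory use stays bounded by `chunk_size` regardless of file size.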

AI Features (Optional — Requires API Key)

from filedna.features.ai_features import AIConfig, classify_content, extract_structured

# Works with OpenAI, Anthropic, Gemini, Mistral, Ollama, and 100+ providers
config = AIConfig(
    provider="openai",
    model="gpt-4o-mini",
    fallbacks=[
        AIConfig(provider="anthropic", model="claude-haiku-4-5"),
        AIConfig(provider="gemini",    model="gemini-1.5-flash"),
    ]
)

# "Is this a legal contract or invoice?" — beyond what extensions tell you
result = classify_content(text, config=config)
result.value            # {"label": "invoice", "confidence": "high"}
result.provider_used    # "openai/gpt-4o-mini"
result.used_fallback    # True if primary failed, fallback served it
result.summary()        # full attempt audit log with ✓/✗ per call

# Extract structured fields from any document
data = extract_structured(
    text,
    schema={
        "invoice_number": "string",
        "total_amount":   "float",
        "vendor_name":    "string",
        "line_items":     "list of {description: str, amount: float}",
    },
    config=config,
)
data.value["invoice_number"]   # "INV-2024-001"
data.value["total_amount"]     # 4250.00

AI features use exponential backoff with jitter, automatic fallback chains, and error classification: rate limits are retried, while auth failures skip to the next provider immediately. Every call returns an AIResponse with a full audit trail that records which provider served it, how many retries were needed, and what failed.
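
The retry and fallback behaviour described above can be sketched in a few lines. This is a simplified illustration under stated assumptions, not the library's orchestration code: `RateLimited`, `AuthFailed`, and `call_with_fallback` are hypothetical names, and the `sleep` parameter is injectable so the logic can be exercised without waiting.

```python
import random
import time
from typing import Callable, List, Sequence, Tuple


class RateLimited(Exception):
    """Transient failure: worth retrying the same provider."""


class AuthFailed(Exception):
    """Permanent failure: skip straight to the next provider."""


def call_with_fallback(
    providers: Sequence[Tuple[str, Callable[[], str]]],
    max_retries: int = 3,
    base_delay: float = 0.5,
    sleep: Callable[[float], None] = time.sleep,
) -> Tuple[str, str, List[str]]:
    """Return (provider_name, value, audit_log) from the first provider that succeeds."""
    audit: List[str] = []
    for name, call in providers:
        for attempt in range(max_retries):
            try:
                value = call()
                audit.append(f"ok {name} attempt={attempt + 1}")
                return name, value, audit
            except RateLimited:
                audit.append(f"rate-limited {name} attempt={attempt + 1}")
                # Full jitter: wait a random time in [0, base_delay * 2**attempt].
                sleep(random.uniform(0, base_delay * 2 ** attempt))
            except AuthFailed:
                audit.append(f"auth-failed {name}")
                break  # retrying bad credentials cannot help
    raise RuntimeError("all providers failed: " + "; ".join(audit))
```

The audit list plays the role of the attempt log: each entry records one call and its outcome, in order.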

🛡️ Risk Score Engine

Scores range from 0 (clean) to 100 (dangerous). Capped at 100.

Condition	Points
Extension mismatch	+40
Corrupted / invalid structure	+50
Errors present	+30
Unreadable metadata	+20
Empty file	+30
Embedded executable (.exe, .dll, .bat, .ps1...)	+80

result = analyze("suspicious.zip")

if result.risk_score == 0:
    print("✓ Clean")
elif result.risk_score < 40:
    print("⚠ Low risk — review warnings")
elif result.risk_score < 70:
    print("⚠ Medium risk — manual review required")
else:
    print("✗ High risk — quarantine this file")
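
One plausible reading of the table is an additive score capped at 100. The sketch below implements that reading and the banding used in the snippet above; `RISK_POINTS`, `risk_score`, and `risk_band` are hypothetical, and the README does not say whether every matching row is always added, so treat the arithmetic as an assumption.

```python
# Points per condition, copied from the table above.
RISK_POINTS = {
    "extension_mismatch": 40,
    "invalid_structure": 50,
    "errors_present": 30,
    "unreadable_metadata": 20,
    "empty_file": 30,
    "embedded_executable": 80,
}


def risk_score(conditions: set) -> int:
    """Sum the points of each triggered condition, capped at 100."""
    return min(100, sum(RISK_POINTS[c] for c in conditions))


def risk_band(score: int) -> str:
    """Map a score to the bands used in the example above."""
    if score == 0:
        return "clean"
    if score < 40:
        return "low"
    if score < 70:
        return "medium"
    return "high"
```

Under this reading, an extension mismatch plus an invalid structure gives 40 + 50 = 90, and any combination that includes an embedded executable with another condition hits the cap.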

📁 Supported File Formats

Category	Formats	Validation	Metadata
Documents	PDF, DOCX, XLSX, PPTX, EPUB, CSV, TXT, MD, JSON, XML, HTML	✓	✓
Images	PNG, JPG, WebP, GIF, BMP, TIFF, SVG	✓	✓
Audio	MP3, WAV, FLAC, OGG, M4A, AAC	✓	✓
Video	MP4, MOV, MKV, WebM, AVI	✓	✓*
Archives	ZIP, TAR, GZ, BZ2, 7Z, RAR	✓	✓

*Full video metadata (fps, codec, resolution) requires ffprobe: brew install ffmpeg or apt install ffmpeg

📊 Metadata by File Type

PDF

meta = inspect_file("report.pdf")
meta["pages"]             # 34
meta["encrypted"]         # False
meta["contains_images"]   # True
meta["contains_tables"]   # True
meta["language"]          # "en"
meta["estimated_tokens"]  # 15423

DOCX

meta["paragraphs"]        # 82
meta["words"]             # 3210
meta["estimated_pages"]   # 11
meta["language"]          # "en"
meta["estimated_tokens"]  # 6780

XLSX

meta["sheets"]            # 3
meta["sheet_names"]       # ["Q1", "Q2", "Summary"]
meta["rows"]              # 1204
meta["columns"]           # 12
meta["estimated_tokens"]  # 3201

Images (PNG, JPG, WebP, GIF, BMP, TIFF)

meta["width"]             # 1920
meta["height"]            # 1080
meta["mode"]              # "RGB"
meta["dpi"]               # (72, 72)
meta["has_transparency"]  # False

Audio (MP3, WAV, FLAC, OGG, M4A)

meta["duration"]          # 213.4   (seconds)
meta["bitrate"]           # 320000  (bits/s)
meta["sample_rate"]       # 44100   (Hz)
meta["channels"]          # 2

Video (MP4, MOV, MKV, WebM, AVI)

meta["duration"]          # 92.4    (seconds)
meta["resolution"]        # "1920x1080"
meta["fps"]               # 29.97
meta["codec"]             # "h264"

Archives (ZIP, TAR, GZ)

meta["file_count"]                # 24
meta["total_uncompressed_bytes"]  # 4194304

💻 CLI

# Full analysis (JSON output)
filedna analyze report.pdf

# Human-friendly output
filedna analyze report.pdf --pretty

# Validate only — exits 0 (valid) or 1 (invalid), perfect for CI/CD
filedna validate upload.pdf && echo "safe to process"

# Detect real type — ignores the extension completely
filedna type photo.pdf
# → png

# Token count estimate
filedna tokens report.pdf
# → 15423

# URL inspection (HEAD only, no download)
filedna url https://example.com/file.pdf --pretty

--pretty output:

✓ PDF

Pages:        34
Language:     en
Contains tables
Size:         4.2 MB
Tokens:       15.4k
Risk Score:   0
MIME:         application/pdf
Ext match:    yes  ('pdf' declared)

🔧 Real-World Use Cases

File Upload Validation (Web Apps / APIs)

from filedna import analyze

def validate_upload(path: str, allowed_types: list[str]) -> dict:
    result = analyze(path, skip_metadata=True)   # fast path

    if not result.valid:
        reason = result.errors[0] if result.errors else "File failed validation"
        return {"accept": False, "reason": reason}

    if result.real_type not in allowed_types:
        return {"accept": False, "reason": f"File type '{result.real_type}' not allowed"}

    if result.risk_score > 50:
        return {"accept": False, "reason": f"High-risk file (score: {result.risk_score})"}

    return {"accept": True, "type": result.real_type, "size": result.size_human}

RAG Pipeline — Pre-flight Check Before Indexing

from filedna import analyze

MAX_TOKENS = 100_000

def preflight(path: str) -> bool:
    result = analyze(path)

    if not result.valid:
        print(f"Skipping {path}: {result.errors}")
        return False

    tokens = result.metadata.get("estimated_tokens", 0)
    if tokens > MAX_TOKENS:
        print(f"Skipping {path}: {tokens:,} tokens exceeds limit")
        return False

    return True

Scan an Entire Uploads Folder

from pathlib import Path
from filedna import analyze_many

batch = analyze_many(
    [p for p in Path("uploads").rglob("*") if p.is_file()],
    max_workers=8,
    on_progress=lambda done, total, path: print(f"{done}/{total}: {path}")
)

print(f"Processed {batch.total} files in {batch.duration_seconds}s")
print(f"Success rate: {batch.success_rate:.0%}")

# Files needing attention
for path, result in batch.results.items():
    if result.risk_score > 0:
        print(f"⚠ {path}: risk={result.risk_score}, {result.warnings}")
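
For readers curious how thread-pool batching with a progress callback fits together, here is a generic sketch using `concurrent.futures`. `run_batch` and its `worker` argument are hypothetical stand-ins; the callback signature `(done, total, path)` mirrors the `on_progress` lambda above, and nothing here is taken from FileDNA's source.

```python
import time
from concurrent.futures import ThreadPoolExecutor, as_completed
from typing import Callable, Iterable, Optional


def run_batch(
    paths: Iterable[str],
    worker: Callable[[str], object],
    max_workers: int = 8,
    on_progress: Optional[Callable[[int, int, str], None]] = None,
) -> dict:
    """Run worker(path) for every path in a thread pool and collect outcomes."""
    paths = list(paths)
    results, failures = {}, {}
    started = time.perf_counter()
    with ThreadPoolExecutor(max_workers=max_workers) as pool:
        futures = {pool.submit(worker, p): p for p in paths}
        for done, future in enumerate(as_completed(futures), start=1):
            path = futures[future]
            try:
                results[path] = future.result()
            except Exception as exc:  # one bad file should not stop the batch
                failures[path] = exc
            if on_progress:
                on_progress(done, len(paths), path)
    return {
        "results": results,
        "failures": failures,
        "duration_seconds": time.perf_counter() - started,
    }
```

Threads suit this workload because most of the time is spent on file I/O; results arrive in completion order, not input order.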

Deduplicate an Archive

from pathlib import Path
from filedna import find_duplicates

groups = find_duplicates([p for p in Path("documents").rglob("*") if p.is_file()])

total_wasted = sum(g.wasted_bytes for g in groups)
print(f"Found {len(groups)} duplicate groups, {total_wasted / 1024 / 1024:.1f} MB wasted")

for group in groups:
    print(f"\nDuplicate ({group.count}x, {group.size_bytes} bytes each):")
    for i, path in enumerate(group.paths):
        marker = "KEEP" if i == 0 else "DELETE"
        print(f"  [{marker}] {path}")
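
Content-based deduplication is commonly done by grouping files by size first and hashing only the candidates that share a size. The sketch below follows that approach with the standard library; `group_duplicates` and `wasted_bytes` are hypothetical helpers, and whether FileDNA uses the same strategy is not stated in this README.

```python
import hashlib
from collections import defaultdict
from pathlib import Path


def group_duplicates(paths: list) -> list:
    """Group files with identical content; each group has two or more paths."""
    by_size = defaultdict(list)
    for path in paths:
        if path.is_file():
            by_size[path.stat().st_size].append(path)

    groups = []
    for same_size in by_size.values():
        if len(same_size) < 2:
            continue  # a unique size cannot have a duplicate
        by_hash = defaultdict(list)
        for path in same_size:
            # read_bytes() is fine for a sketch; stream large files instead.
            by_hash[hashlib.sha256(path.read_bytes()).hexdigest()].append(path)
        groups.extend(g for g in by_hash.values() if len(g) > 1)
    return groups


def wasted_bytes(group: list) -> int:
    """Bytes recoverable by keeping a single copy from the group."""
    return group[0].stat().st_size * (len(group) - 1)
```

The size pre-filter avoids hashing files that cannot possibly match, which matters on large folders.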

EXIF GPS Extraction

from filedna import extract_exif

exif = extract_exif("photo.jpg")

if exif.has_gps:
    print(f"Location: {exif.gps}")                    # "51.507351, -0.127758"
    print(f"Maps: {exif.gps.google_maps_url}")         # ready-to-use URL
    print(f"Camera: {exif.camera_make} {exif.camera_model}")
    print(f"Settings: f/{exif.aperture}, ISO {exif.iso}, {exif.shutter_speed}")
    print(exif.summary)
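
EXIF stores coordinates as degrees, minutes, and seconds plus a hemisphere reference (N/S, E/W). The conversion to decimal degrees that such a library would do internally can be sketched as follows; `dms_to_decimal` and `maps_url` are hypothetical helpers, and the URL pattern is the one shown in the examples above.

```python
def dms_to_decimal(degrees: float, minutes: float, seconds: float, ref: str) -> float:
    """Convert degrees/minutes/seconds plus a hemisphere ref to decimal degrees."""
    value = degrees + minutes / 60 + seconds / 3600
    # South and west are negative in decimal notation.
    return -value if ref in ("S", "W") else value


def maps_url(lat: float, lon: float) -> str:
    """Build a Google Maps query URL from decimal coordinates."""
    return f"https://www.google.com/maps?q={lat:.6f},{lon:.6f}"
```

Six decimal places correspond to roughly 0.1 m of precision at the equator, which is more than a phone's GPS can deliver.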

⚙️ Architecture

filedna/
├── __init__.py              ← public API (all functions)
├── core/
│   ├── engine.py            ← analysis pipeline orchestration
│   ├── risk.py              ← risk scoring engine (0–100)
│   └── url_inspector.py     ← HTTP HEAD inspection
├── detectors/
│   └── type_detector.py     ← magic bytes + binary signatures (no libmagic)
├── validators/
│   └── file_validators.py   ← structural validation per type
├── inspectors/
│   └── metadata.py          ← metadata extraction per type
├── extractors/
│   ├── exif_extractor.py    ← EXIF + GPS extraction
│   └── text_extractor.py    ← plain text extraction (internal)
├── features/
│   ├── pipeline.py          ← PII, hashing, dedup, diff, batch
│   └── ai_features.py       ← AI layer with retry/fallback orchestration
├── models/
│   └── result.py            ← AnalysisResult (Pydantic v2)
└── cli/
    └── commands.py          ← Click CLI

Detection Pipeline

File input
    │
    ▼
Magic bytes / binary signatures ──── offset-based pattern matching
    │
    ├── ZIP container? ──────────────── peek inside → docx/xlsx/pptx/epub/zip
    │
    ├── Text content? ───────────────── sniff → json/xml/html/csv/md/txt
    │
    ├── filetype library ────────────── fallback
    │
    └── puremagic ───────────────────── fallback
    │
    ▼
Extension mismatch check
    │
    ▼
Structural validation (is it actually parseable?)
    │
    ▼
Metadata extraction
    │
    ▼
Risk score computation
    │
    ▼
AnalysisResult
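
The "peek inside" and "sniff" branches of the pipeline can be illustrated with a short sketch. The member paths below follow the conventional Office Open XML and EPUB layouts; `classify_zip` and `classify_text` are hypothetical names, and the text heuristics are far cruder than a production detector would need.

```python
import io
import json
import zipfile


def classify_zip(data: bytes) -> str:
    """Tell Office and EPUB containers apart from plain ZIP archives by their members."""
    with zipfile.ZipFile(io.BytesIO(data)) as zf:
        names = set(zf.namelist())
        if "word/document.xml" in names:
            return "docx"
        if "xl/workbook.xml" in names:
            return "xlsx"
        if "ppt/presentation.xml" in names:
            return "pptx"
        if "mimetype" in names and zf.read("mimetype").strip() == b"application/epub+zip":
            return "epub"
    return "zip"


def classify_text(data: bytes) -> str:
    """Rough text sniffing: JSON first, then HTML and XML markers, else plain text."""
    text = data.decode("utf-8", errors="replace").lstrip()
    if text[:1] in ("{", "["):
        try:
            json.loads(text)
            return "json"
        except ValueError:
            pass
    lowered = text[:200].lower()
    if lowered.startswith("<!doctype html") or lowered.startswith("<html"):
        return "html"
    if lowered.startswith("<?xml"):
        return "xml"
    return "txt"
```

Checking containers before falling back to generic signatures is what lets a file named data.zip be reported as a Word document.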

🏎️ Performance

Operation	Time
`detect_type()`	< 10ms
`validate()`	< 100ms
`analyze()`	< 500ms
`analyze_many(50 files, workers=8)`	~1.2s

All imports are lazy — dependencies only load for the relevant file type. detect_type() loads nothing extra.
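
Lazy importing generally means deferring `import` until a code path needs it. A minimal sketch of that pattern, using standard-library modules as stand-ins for heavier parsers, might look like this; `HANDLER_MODULES` and `load_handler` are hypothetical and not part of FileDNA's API.

```python
import importlib

# Map a detected type to the module that handles it (stdlib stand-ins here).
HANDLER_MODULES = {"zip": "zipfile", "wav": "wave", "csv": "csv"}


def load_handler(file_type: str):
    """Import the handler module only when a file of that type is seen."""
    return importlib.import_module(HANDLER_MODULES[file_type])
```

Python caches modules in `sys.modules`, so repeated calls for the same type pay the import cost only once.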

🔌 Installation Options

# Core (everything in this README)
pip install filedna

# With AI features (litellm for classify, extract_structured, etc.)
pip install "filedna[ai]"

# Development
pip install "filedna[dev]"

Zero system dependencies. Unlike python-magic, FileDNA does not require libmagic, so it works on Windows, macOS, Linux, Docker, and serverless without any apt install or brew install.

🛠️ Development

git clone https://github.com/filedna/filedna
cd filedna
pip install -e ".[dev]"

# Run tests (190 tests)
pytest

# With coverage
pytest --cov=filedna --cov-report=term-missing

# Lint
ruff check filedna/

🗺️ Roadmap

Version	Features
v1.2 ✓	PII detection, content hashing, deduplication, file diff, batch analysis, EXIF extraction, AI classify/extract with retry+fallback
v1.3	OCR for scanned PDFs (AI-powered), archive deep inspection, HEIC/HEIF support
v1.4	Malware heuristics (YARA rules), steganography detection, content-level dedup
v2.0	MCP server, REST API, async API, FileDNA Server

🤝 Contributing

Contributions are welcome. Please open an issue first to discuss what you want to change.

1. Fork the repo
2. Create a branch: git checkout -b feature/your-feature
3. Make your changes and add tests
4. Ensure tests pass: pytest
5. Ensure lint passes: ruff check filedna/
6. Open a pull request

📄 License

MIT — see LICENSE.

FileDNA · PyPI · Issues

If FileDNA saved you from writing boilerplate, consider giving it a ⭐

Release history

1.2.6: Jun 11, 2026
1.2.4: Jun 11, 2026 (this version)

Download files

Source Distribution

filedna-1.2.4.tar.gz (58.8 kB view details)

Uploaded Jun 11, 2026 Source

Built Distribution

filedna-1.2.4-py3-none-any.whl (57.4 kB view details)

Uploaded Jun 11, 2026 Python 3

File details

Details for the file filedna-1.2.4.tar.gz.

File metadata

Download URL: filedna-1.2.4.tar.gz
Upload date: Jun 11, 2026
Size: 58.8 kB
Tags: Source
Uploaded using Trusted Publishing? No
Uploaded via: twine/6.2.0 CPython/3.11.1

File hashes

Hashes for filedna-1.2.4.tar.gz
Algorithm	Hash digest
SHA256	`5b22d083e9418abdc43dc9618e6bed925dd3ceabc9d71270e80a927461973237`
MD5	`cefe567bf14be872d97716fddd66cc05`
BLAKE2b-256	`fcf8366d488bcf6d53e4b3a6e7e59f116f33e97681bbe1091cf79209de1050e4`

File details

Details for the file filedna-1.2.4-py3-none-any.whl.

File metadata

Download URL: filedna-1.2.4-py3-none-any.whl
Upload date: Jun 11, 2026
Size: 57.4 kB
Tags: Python 3
Uploaded using Trusted Publishing? No
Uploaded via: twine/6.2.0 CPython/3.11.1

File hashes

Hashes for filedna-1.2.4-py3-none-any.whl
Algorithm	Hash digest
SHA256	`6e9c68d3c550ce45d1f080b51639f6e4ae29f989e4c650b91afb24923d2894bc`
MD5	`fbadc162289684d28ba68dc6262f689c`
BLAKE2b-256	`558990402d3362460c46526b7f9be86bfd7a72e668c3b585c4c5f2d2c3c74d99`
