A flexible Python library for file type inference using multiple strategies including extension-based, magic number-based, and AI-powered detection

filetype-detector

A Python library for detecting file types using multiple inference strategies, including path-based extraction, magic number detection, and AI-powered content analysis.

Features

Multiple Inference Methods: Choose from lexical, magic-based, AI-powered, or cascading inference strategies
Type-Safe API: Type hints and type-safe inference method selection
Flexible Input: Supports both Path objects and string paths
Performance Optimized: The cascading inferencer runs fast Magic detection first and calls Magika only for text files
Well-Tested: Comprehensive test suite with logging support
Extensible: Base class architecture for custom inferencer implementations

Installation

Python Package

pip install filetype-detector

Or, when working inside a rye-managed checkout of the project, sync its environment:

rye sync

System Dependencies

Important: MagicInferencer and CascadingInferencer require the libmagic system library to be installed.

Ubuntu/Debian

sudo apt-get update
sudo apt-get install libmagic1

Fedora/RHEL/CentOS

sudo dnf install file-libs
# or for older versions:
# sudo yum install file-libs

Arch Linux

sudo pacman -S file

macOS

Using Homebrew:

brew install libmagic

Using MacPorts:

sudo port install file

Windows

On Windows, install python-magic-bin, which ships a prebuilt libmagic DLL:

pip install python-magic-bin

Alternatively, download a libmagic DLL manually from the file.exe releases.

Alpine Linux (Docker)

apk add --no-cache file

Verification

After installation, verify libmagic is available:

file --version

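If the command prints a version number, the file utility and the libmagic library it depends on are installed. With python-magic-bin on Windows the file command may be absent; in that case, check that import magic succeeds in Python.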
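The same check can be made from Python, which is closer to what MagicInferencer actually needs. A minimal sketch, assuming only the usual magic module from python-magic and its from_buffer function; the helper name libmagic_available is made up for this example and is not part of filetype-detector:

```python
def libmagic_available() -> bool:
    """Return True if python-magic imports and libmagic can run a detection."""
    try:
        # Importing python-magic loads the libmagic shared library.
        import magic

        # Run one tiny detection to confirm the library really works.
        magic.from_buffer(b"hello", mime=True)
    except Exception:
        # Covers a missing package, a missing shared library,
        # or an unrelated module that happens to be named "magic".
        return False
    return True


print("libmagic available:", libmagic_available())
```

If this prints False, revisit the system dependency steps for your platform before using MagicInferencer or CascadingInferencer.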

Quick Start

Using the Inferencer Map (Recommended)

The simplest way to use filetype-detector is through the centralized FILE_FORMAT_INFERENCER_MAP:

from filetype_detector.inferencer import FILE_FORMAT_INFERENCER_MAP

# Lexical inference (fastest, extension-based)
lexical_infer = FILE_FORMAT_INFERENCER_MAP[None]
extension = lexical_infer("document.pdf")  # Returns: '.pdf'

# Magic inference (content-based using magic numbers)
magic_infer = FILE_FORMAT_INFERENCER_MAP["magic"]
extension = magic_infer("file_without_ext")  # Returns: '.txt' (detected from content)

# Magika inference (AI-powered with confidence scores)
magika_infer = FILE_FORMAT_INFERENCER_MAP["magika"]
extension = magika_infer("script.py")  # Returns extension detected by AI

Using Individual Inferencers

You can also use inferencer classes directly:

from filetype_detector.lexical_inferencer import LexicalInferencer
from filetype_detector.magic_inferencer import MagicInferencer
from filetype_detector.magika_inferencer import MagikaInferencer
from filetype_detector.mixture_inferencer import CascadingInferencer

# Lexical inferencer - extracts extension from path
lexical = LexicalInferencer()
extension = lexical.infer("document.pdf")  # Returns: '.pdf'

# Magic inferencer - uses libmagic for content analysis
magic = MagicInferencer()
extension = magic.infer("file.dat")  # Returns actual type based on content

# Magika inferencer - AI-powered detection
magika = MagikaInferencer()
extension = magika.infer("script.py")  # Returns: '.py'

# Get confidence score (Magika only)
extension, score = magika.infer_with_score("data.json")  # Returns: ('.json', 0.98)

# Cascading inferencer - best of both worlds
cascading = CascadingInferencer()
extension = cascading.infer("data.txt")  # Uses Magic, then Magika for text files

Available Inferencers

1. LexicalInferencer

Fastest method that extracts file extensions directly from file paths. No content analysis is performed.

When to use: When file extensions are known to be accurate or when you need maximum performance.

from filetype_detector.lexical_inferencer import LexicalInferencer

inferencer = LexicalInferencer()
extension = inferencer.infer("document.pdf")  # Returns: '.pdf'
extension = inferencer.infer("file_without_ext")  # Returns: ''
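For intuition, the two results above match what pathlib's suffix property returns. The sketch below uses only the standard library; whether LexicalInferencer treats multi-part extensions, dotfiles, or letter case the same way is an assumption, not something this README states. The function name suffix_of is hypothetical.

```python
from pathlib import Path
from typing import Union


def suffix_of(file_path: Union[Path, str]) -> str:
    """Return the final suffix of the path, or '' when there is none."""
    return Path(file_path).suffix


print(repr(suffix_of("document.pdf")))      # '.pdf'
print(repr(suffix_of("file_without_ext")))  # ''
print(repr(suffix_of("archive.tar.gz")))    # '.gz' (only the last suffix)
print(repr(suffix_of(".bashrc")))           # '' (a leading dot is not a suffix)
```

No file is opened here, which is why a purely lexical approach is fast but cannot notice a wrong or missing extension.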

2. MagicInferencer

Uses python-magic (libmagic) to detect file types based on magic numbers and file signatures. Reliable for files with incorrect or missing extensions.

When to use: When you need content-based detection but don't need AI-level accuracy, or when working with binary files.

from filetype_detector.magic_inferencer import MagicInferencer

inferencer = MagicInferencer()
extension = inferencer.infer("file.dat")  # May return: '.pdf' (detected from content)

Raises:

FileNotFoundError: If the file does not exist
ValueError: If the path is not a file
RuntimeError: If MIME type cannot be determined or converted to an extension
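The RuntimeError case implies a step that converts a detected MIME type into an extension. A minimal sketch of such a conversion using the standard library's mimetypes module; this is an illustration of the idea, not the library's actual implementation, and mime_to_extension is a hypothetical name.

```python
import mimetypes


def mime_to_extension(mime_type: str) -> str:
    """Map a MIME type to a file extension, or raise if none is known."""
    extension = mimetypes.guess_extension(mime_type)
    if extension is None:
        # Mirrors the documented behaviour: unknown types are an error,
        # not an empty string.
        raise RuntimeError(f"No known extension for MIME type {mime_type!r}")
    return extension


print(mime_to_extension("application/pdf"))
```

Raising instead of returning an empty string keeps "no extension in the path" (a valid lexical result) distinct from "detection failed".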

3. MagikaInferencer

Uses Google's Magika AI model for advanced file type detection, especially effective for text files. Provides confidence scores and detailed type information.

When to use: When you need the highest accuracy, especially for text files, or when you need confidence scores.

from filetype_detector.magika_inferencer import MagikaInferencer

inferencer = MagikaInferencer()

# Get extension only
extension = inferencer.infer("script.py")  # Returns: '.py'

# Get extension with confidence score
extension, score = inferencer.infer_with_score("data.json")
# Returns, for example: ('.json', 0.98)

# With custom prediction mode
from magika import PredictionMode
extension, score = inferencer.infer_with_score(
    "file.txt",
    prediction_mode=PredictionMode.HIGH_CONFIDENCE
)

Raises:

FileNotFoundError: If the file does not exist
ValueError: If the path is not a file
RuntimeError: If Magika fails to analyze the file
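One way to use the confidence score is to accept the detected extension only above a threshold and otherwise fall back to something else, such as the lexical result. A small sketch operating on an (extension, score) tuple shaped like the return value of infer_with_score; the helper name, the fallback argument, and the 0.9 default are illustrative assumptions.

```python
from typing import Tuple


def accept_if_confident(
    result: Tuple[str, float], fallback: str, threshold: float = 0.9
) -> str:
    """Return the detected extension when its score meets the threshold."""
    extension, score = result
    return extension if score >= threshold else fallback


# Tuples below stand in for values returned by infer_with_score().
print(accept_if_confident((".json", 0.98), fallback=".txt"))  # .json
print(accept_if_confident((".json", 0.40), fallback=".txt"))  # .txt
```

The right threshold depends on how costly a wrong extension is for your application.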

4. CascadingInferencer (Recommended)

A smart two-stage inference strategy that combines Magic and Magika:

Stage 1: Uses Magic for all files (fast)
Stage 2: If detected as a text file (text/* MIME type), uses Magika for detailed type detection

This approach optimizes performance by only using Magika (computationally expensive) for text files where it excels, while using faster Magic detection for binary files.

When to use: Recommended default choice for balanced performance and accuracy.

System Requirements: Requires libmagic system library. See Installation section for OS-specific setup.

from filetype_detector.mixture_inferencer import CascadingInferencer

inferencer = CascadingInferencer()

# Text file - uses Magic then Magika
extension = inferencer.infer("script.py")  # Returns: '.py' (from Magika)

# Binary file - uses Magic only
extension = inferencer.infer("document.pdf")  # Returns: '.pdf' (from Magic)

# JSON file with wrong extension
extension = inferencer.infer("data.txt")  # May return: '.json' (from Magika)
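The two-stage control flow described above can be sketched without libmagic or Magika by passing the detectors in as plain callables. Everything here (the cascade function, its parameters, and the stub detectors) is hypothetical and only illustrates the branching on a text/* MIME type; it is not the CascadingInferencer source.

```python
from typing import Callable, List

Infer = Callable[[str], str]


def cascade(
    file_path: str, detect_mime: Infer, binary_infer: Infer, text_infer: Infer
) -> str:
    """Two-stage inference: cheap MIME check first, detailed model for text only."""
    mime_type = detect_mime(file_path)  # Stage 1: fast check on every file
    if mime_type.startswith("text/"):
        return text_infer(file_path)  # Stage 2: expensive detector, text only
    return binary_infer(file_path)


# Stub detectors so the control flow runs without any external dependency.
fake_mime = {"script.py": "text/x-script.python", "document.pdf": "application/pdf"}
calls: List[str] = []


def stub_text_infer(path: str) -> str:
    calls.append(path)  # record which files reach the expensive stage
    return ".py"


def stub_binary_infer(path: str) -> str:
    return ".pdf"


print(cascade("script.py", fake_mime.__getitem__, stub_binary_infer, stub_text_infer))
print(cascade("document.pdf", fake_mime.__getitem__, stub_binary_infer, stub_text_infer))
print(calls)  # only the text file reached the second stage
```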

Type-Safe Usage

InferencerType annotates the method argument, so a static type checker can flag unsupported method names before the code runs:

from filetype_detector.inferencer import InferencerType, FILE_FORMAT_INFERENCER_MAP

def process_file(file_path: str, method: InferencerType) -> str:
    inferencer_func = FILE_FORMAT_INFERENCER_MAP[method]
    extension = inferencer_func(file_path)
    return extension

# Checked statically (for example with mypy or pyright)
result1 = process_file("doc.pdf", "magic")      # ✅ Valid
result2 = process_file("doc.pdf", None)         # ✅ Valid
result3 = process_file("doc.pdf", "invalid")    # ❌ Rejected by the type checker, not a runtime TypeError
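Type hints are not enforced at runtime, so code that receives the method name from user input or configuration may still want an explicit check. The sketch below shows one way to pair a Literal-based alias with a lookup table and a runtime guard. The names Method, REGISTRY, and infer, and the placeholder detector, are assumptions for illustration; the real InferencerType and FILE_FORMAT_INFERENCER_MAP may be defined differently.

```python
from pathlib import Path
from typing import Callable, Dict, Literal, Optional

# Hypothetical alias modelled on the keys shown in this README.
Method = Optional[Literal["magic", "magika"]]


def _lexical(path: str) -> str:
    return Path(path).suffix


def _placeholder(path: str) -> str:
    # Stand-in for a content-based detector.
    return ".bin"


REGISTRY: Dict[Method, Callable[[str], str]] = {
    None: _lexical,
    "magic": _placeholder,
    "magika": _placeholder,
}


def infer(path: str, method: Method = None) -> str:
    """Look up the detector for `method`, failing clearly on unknown names."""
    try:
        detector = REGISTRY[method]
    except KeyError:
        raise ValueError(f"Unknown inference method: {method!r}") from None
    return detector(path)


print(infer("doc.pdf"))           # .pdf
print(infer("doc.pdf", "magic"))  # .bin
```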

Handling Edge Cases

Files Without Extensions

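from filetype_detector.lexical_inferencer import LexicalInferencer
from filetype_detector.magic_inferencer import MagicInferencer
from filetype_detector.mixture_inferencer import CascadingInferencer

# Lexical inferencer returns empty string
lexical = LexicalInferencer()
result = lexical.infer("file_without_ext")  # Returns: ''

# Magic/Magika inferencers detect from content
magic = MagicInferencer()
result = magic.infer("file_without_ext")  # Returns, for example: '.txt' (detected)

cascading = CascadingInferencer()
result = cascading.infer("file_without_ext")  # Returns detected extension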

Wrong File Extensions

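from filetype_detector.magic_inferencer import MagicInferencer
from filetype_detector.magika_inferencer import MagikaInferencer

# File named 'data.txt' but contains JSON
magic = MagicInferencer()
result = magic.infer("data.txt")  # May return: '.json'

magika = MagikaInferencer()
result, score = magika.infer_with_score("data.txt")  # Returns, for example: ('.json', 0.95)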

Error Handling

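MagicInferencer and MagikaInferencer raise FileNotFoundError for a missing file, ValueError when the path is not a file, and RuntimeError when detection fails: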

from filetype_detector.magic_inferencer import MagicInferencer

inferencer = MagicInferencer()

try:
    extension = inferencer.infer("nonexistent.pdf")
except FileNotFoundError:
    print("File not found")
except ValueError:
    print("Path is not a file")
except RuntimeError as e:
    print(f"Detection failed: {e}")
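When processing many files, it is often more convenient to collect failures than to stop at the first one. A minimal sketch that wraps any infer-style callable and separates results from errors; infer_many and the stub detector are hypothetical names used only for this example.

```python
from typing import Callable, Dict, Iterable, Tuple


def infer_many(
    infer: Callable[[str], str], paths: Iterable[str]
) -> Tuple[Dict[str, str], Dict[str, Exception]]:
    """Run `infer` on each path, returning (results, errors) keyed by path."""
    results: Dict[str, str] = {}
    errors: Dict[str, Exception] = {}
    for path in paths:
        try:
            results[path] = infer(path)
        except (FileNotFoundError, ValueError, RuntimeError) as exc:
            # The three exception types documented for the inferencers.
            errors[path] = exc
    return results, errors


def stub_infer(path: str) -> str:
    # Stand-in for inferencer.infer, so the example runs without libmagic.
    if path == "missing.pdf":
        raise FileNotFoundError(path)
    return ".pdf"


results, errors = infer_many(stub_infer, ["report.pdf", "missing.pdf"])
print(results)
print({path: type(exc).__name__ for path, exc in errors.items()})
```

In real use, pass a bound method such as MagicInferencer().infer in place of stub_infer.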

Testing

Run the test suite:

pytest tests/ -v

To see log output (the tests log through loguru), disable pytest's output capturing with -s:

pytest tests/ -v -s

Run specific test files:

pytest tests/test_cascading_inferencer.py -v
pytest tests/test_magic_inferencer.py -v
pytest tests/test_magika_inferencer.py -v
pytest tests/test_lexical_inferencer.py -v

Architecture

Base Class

All inferencers inherit from BaseInferencer, which defines a common interface:

from abc import ABC, abstractmethod
from typing import Union
from pathlib import Path

class BaseInferencer(ABC):
    @abstractmethod
    def infer(self, file_path: Union[Path, str]) -> str:
        """Infer file format from path."""
        raise NotImplementedError

Custom Inferencer

You can create custom inferencers by subclassing BaseInferencer:

from filetype_detector.base_inferencer import BaseInferencer
from typing import Union
from pathlib import Path

class CustomInferencer(BaseInferencer):
    def infer(self, file_path: Union[Path, str]) -> str:
        # Your custom logic here
        return ".custom"
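As a slightly fuller illustration, the sketch below implements a custom inferencer that checks a few well-known file signatures and falls back to the path suffix. To stay runnable without the package installed, it declares a local stand-in for BaseInferencer with the interface shown above; in real code you would import it from filetype_detector.base_inferencer instead. SignatureInferencer and its signature table are hypothetical.

```python
import tempfile
from abc import ABC, abstractmethod
from pathlib import Path
from typing import Union


class BaseInferencer(ABC):
    """Local stand-in mirroring the interface shown above."""

    @abstractmethod
    def infer(self, file_path: Union[Path, str]) -> str:
        raise NotImplementedError


class SignatureInferencer(BaseInferencer):
    """Detect a few formats from their leading bytes."""

    SIGNATURES = (
        (b"%PDF-", ".pdf"),
        (b"\x89PNG\r\n\x1a\n", ".png"),
        (b"GIF8", ".gif"),
    )

    def infer(self, file_path: Union[Path, str]) -> str:
        path = Path(file_path)
        if not path.exists():
            raise FileNotFoundError(f"File not found: {path}")
        if not path.is_file():
            raise ValueError(f"Path is not a file: {path}")
        with path.open("rb") as handle:
            header = handle.read(16)
        for signature, extension in self.SIGNATURES:
            if header.startswith(signature):
                return extension
        # Unknown content: fall back to whatever the path claims.
        return path.suffix


with tempfile.TemporaryDirectory() as tmp:
    sample = Path(tmp) / "report.dat"
    sample.write_bytes(b"%PDF-1.7\n")
    print(SignatureInferencer().infer(sample))  # .pdf
```

Raising the same exception types as the built-in inferencers keeps a custom class interchangeable with them in existing error handling.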

Performance Considerations

LexicalInferencer: Fastest (~microseconds), no I/O required
MagicInferencer: Fast (~milliseconds), requires file read
MagikaInferencer: Slower (~5-10ms after model load), requires file read + AI inference
CascadingInferencer: Balanced - Magic speed for binaries, Magika accuracy for text files
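The figures above are rough orders of magnitude and will vary with hardware and file size. A small timing helper, sketched here with the standard library only, can be used to measure any infer-style callable on your own files; mean_seconds is a hypothetical name, and the lexical stand-in is used only so the example runs without extra dependencies.

```python
import time
from pathlib import Path
from typing import Callable


def mean_seconds(
    func: Callable[[str], str], file_path: str, repeats: int = 1000
) -> float:
    """Return the mean wall-clock time of one call to `func(file_path)`."""
    if repeats <= 0:
        raise ValueError("repeats must be positive")
    start = time.perf_counter()
    for _ in range(repeats):
        func(file_path)
    return (time.perf_counter() - start) / repeats


# Stand-in for a lexical inferencer; pass inferencer.infer to time a real one.
elapsed = mean_seconds(lambda p: Path(p).suffix, "document.pdf")
print(f"mean per call: {elapsed * 1e6:.2f} microseconds")
```

When timing MagikaInferencer, construct it once outside the timed loop so that model loading is not counted in every call.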

Dependencies

python-magic>=0.4.27: For magic number-based file detection
magika>=1.0.1: Google's AI-powered file type detection
pytest>=8.4.2: Testing framework
loguru>=0.7.3: Logging (used in tests)

Requirements

Python >= 3.8

License

This project is open source. See LICENSE file for details.

Contributing

Contributions are welcome! Please ensure:

All tests pass: pytest tests/ -v
Code follows the existing style
New features include appropriate tests
Documentation is updated

Acknowledgments

python-magic for libmagic bindings
Google Magika for AI-powered file type detection

Version 0.2.0, released Nov 6, 2025

Download files

Source Distribution

filetype_detector-0.2.0.tar.gz (41.7 kB view details)

Uploaded Nov 6, 2025 Source

Built Distribution

filetype_detector-0.2.0-py3-none-any.whl (13.3 kB view details)

Uploaded Nov 6, 2025 Python 3

File details

Details for the file filetype_detector-0.2.0.tar.gz.

File metadata

Download URL: filetype_detector-0.2.0.tar.gz
Upload date: Nov 6, 2025
Size: 41.7 kB
Tags: Source
Uploaded using Trusted Publishing? No
Uploaded via: twine/6.2.0 CPython/3.13.1

File hashes

Hashes for filetype_detector-0.2.0.tar.gz
Algorithm	Hash digest
SHA256	`21d972aeeb3acd2cac59d160cc7f46fa9fb00448d64c313abe2b8fb65d3a5cf7`
MD5	`4f7d45d9451ee34878b47f8772e37fec`
BLAKE2b-256	`d4c63f9f6a9ef36ac9e05ab3778c466dc717533723cd32e52adfcfe48321224b`

File details

Details for the file filetype_detector-0.2.0-py3-none-any.whl.

File metadata

Download URL: filetype_detector-0.2.0-py3-none-any.whl
Upload date: Nov 6, 2025
Size: 13.3 kB
Tags: Python 3
Uploaded using Trusted Publishing? No
Uploaded via: twine/6.2.0 CPython/3.13.1

File hashes

Hashes for filetype_detector-0.2.0-py3-none-any.whl
Algorithm	Hash digest
SHA256	`9aef5cf9e8a4f077d9a0902d25dcb77316333c651da79ad979c0956f8e60ac4a`
MD5	`2d3171dd25dd43ffeb2c8d728763c927`
BLAKE2b-256	`56a62be942aff2d59c8c17ffe41d1079e905d6aa12e5de712f449347befb2266`
