A flexible Python library for file type inference using multiple strategies including extension-based, magic number-based, and AI-powered detection
filetype-detector
A Python library for detecting file types using multiple inference strategies, including path-based extraction, magic number detection, and AI-powered content analysis.
Features
- Multiple Inference Methods: Choose from lexical, magic-based, AI-powered, or cascading inference strategies
- Type-Safe API: Type hints and type-safe inference method selection
- Flexible Input: Supports both Path objects and string paths
- Performance Optimized: Cascading inferencer intelligently combines methods for optimal performance
- Well-Tested: Comprehensive test suite with logging support
- Extensible: Base class architecture for custom inferencer implementations
Installation
Python Package
pip install filetype-detector
Or, when working from a clone of the repository managed with rye:
rye sync
System Dependencies
Important: MagicInferencer and CascadingInferencer require the libmagic system library to be installed.
Ubuntu/Debian
sudo apt-get update
sudo apt-get install libmagic1
Fedora/RHEL/CentOS
sudo dnf install file-libs
# or for older versions:
# sudo yum install file-libs
Arch Linux
sudo pacman -S file
macOS
Using Homebrew:
brew install libmagic
Using MacPorts:
sudo port install file
Windows
On Windows, libmagic is not available as a system package. Install python-magic-bin, which bundles the library:
pip install python-magic-bin
Alternatively, download the libmagic DLL manually from the file.exe releases.
Alpine Linux (Docker)
apk add --no-cache file
Verification
After installation, verify libmagic is available:
file --version
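If the command prints a version number, libmagic is installed. On some distributions the file command is packaged separately from the library, so a missing command does not by itself mean libmagic is absent.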
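The check can also be made from Python, which is closer to what the inferencers rely on. The following is a minimal sketch using only the standard library; the helper name libmagic_location is hypothetical and not part of filetype-detector, and a None result is expected when the library is bundled inside a wheel (as with python-magic-bin) rather than installed system-wide.

```python
from ctypes.util import find_library
from typing import Optional


def libmagic_location() -> Optional[str]:
    """Return the name or path of the libmagic shared library, or None if not found."""
    # find_library searches the platform's standard shared-library locations.
    return find_library("magic")


location = libmagic_location()
if location is None:
    print("libmagic was not found on the standard library search path")
else:
    print(f"libmagic found: {location}")
```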
Quick Start
Using the Inferencer Map (Recommended)
The simplest way to use filetype-detector is through the centralized FILE_FORMAT_INFERENCER_MAP:
from filetype_detector.inferencer import FILE_FORMAT_INFERENCER_MAP, InferencerType
from pathlib import Path
# Lexical inference (fastest, extension-based)
lexical_infer = FILE_FORMAT_INFERENCER_MAP[None]
extension = lexical_infer("document.pdf") # Returns: '.pdf'
# Magic inference (content-based using magic numbers)
magic_infer = FILE_FORMAT_INFERENCER_MAP["magic"]
extension = magic_infer("file_without_ext") # Returns: '.txt' (detected from content)
# Magika inference (AI-powered with confidence scores)
magika_infer = FILE_FORMAT_INFERENCER_MAP["magika"]
extension = magika_infer("script.py") # Returns extension detected by AI
Using Individual Inferencers
You can also use inferencer classes directly:
from filetype_detector.lexical_inferencer import LexicalInferencer
from filetype_detector.magic_inferencer import MagicInferencer
from filetype_detector.magika_inferencer import MagikaInferencer
from filetype_detector.mixture_inferencer import CascadingInferencer
# Lexical inferencer - extracts extension from path
lexical = LexicalInferencer()
extension = lexical.infer("document.pdf") # Returns: '.pdf'
# Magic inferencer - uses libmagic for content analysis
magic = MagicInferencer()
extension = magic.infer("file.dat") # Returns actual type based on content
# Magika inferencer - AI-powered detection
magika = MagikaInferencer()
extension = magika.infer("script.py") # Returns: '.py'
# Get confidence score (Magika only)
extension, score = magika.infer_with_score("data.json") # Returns: ('.json', 0.98)
# Cascading inferencer - best of both worlds
cascading = CascadingInferencer()
extension = cascading.infer("data.txt") # Uses Magic, then Magika for text files
Available Inferencers
1. LexicalInferencer
Fastest method that extracts file extensions directly from file paths. No content analysis is performed.
When to use: When file extensions are known to be accurate or when you need maximum performance.
from filetype_detector.lexical_inferencer import LexicalInferencer
inferencer = LexicalInferencer()
extension = inferencer.infer("document.pdf") # Returns: '.pdf'
extension = inferencer.infer("file_without_ext") # Returns: ''
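Path-based inference of this kind amounts to reading the last suffix of the file name. The sketch below reproduces the two results above with pathlib. It assumes the behavior matches Path.suffix, which the library's actual implementation may or may not do (it might, for instance, normalize case). The function name lexical_extension is hypothetical.

```python
from pathlib import Path
from typing import Union


def lexical_extension(file_path: Union[Path, str]) -> str:
    """Return the final suffix of the path, or '' when there is none."""
    # No file is opened: only the path string is inspected.
    return Path(file_path).suffix


print(lexical_extension("document.pdf"))      # .pdf
print(lexical_extension("file_without_ext"))  # (empty string)
print(lexical_extension("archive.tar.gz"))    # .gz (only the last suffix)
```

Under this assumption, compound extensions such as .tar.gz are reported by their last component only.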
2. MagicInferencer
Uses python-magic (libmagic) to detect file types based on magic numbers and file signatures. Reliable for files with incorrect or missing extensions.
When to use: When you need content-based detection but don't need AI-level accuracy, or when working with binary files.
from filetype_detector.magic_inferencer import MagicInferencer
inferencer = MagicInferencer()
extension = inferencer.infer("file.dat") # May return: '.pdf' (detected from content)
Raises:
- FileNotFoundError: If the file does not exist
- ValueError: If the path is not a file
- RuntimeError: If the MIME type cannot be determined or converted to an extension
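To make "magic numbers" concrete, here is a standard-library sketch of the idea: read the first bytes of a file, match them against known signatures, and convert the resulting MIME type to an extension. It is not how MagicInferencer is implemented (that delegates to libmagic, which ships far more rules); the three-entry SIGNATURES table and the function name sniff_extension are hypothetical, and the exceptions simply mirror the contract listed above.

```python
import mimetypes
import tempfile
from pathlib import Path
from typing import Union

# Hypothetical, deliberately tiny signature table for illustration only.
SIGNATURES = {
    b"%PDF-": "application/pdf",
    b"\x89PNG\r\n\x1a\n": "image/png",
    b"PK\x03\x04": "application/zip",
}


def sniff_extension(file_path: Union[Path, str]) -> str:
    """Guess an extension from the leading bytes of a file."""
    path = Path(file_path)
    if not path.exists():
        raise FileNotFoundError(f"File not found: {path}")
    if not path.is_file():
        raise ValueError(f"Path is not a file: {path}")

    # Read only the header; these signatures sit at the very start of the file.
    with path.open("rb") as handle:
        header = handle.read(16)

    for signature, mime_type in SIGNATURES.items():
        if header.startswith(signature):
            extension = mimetypes.guess_extension(mime_type)
            if extension is None:
                raise RuntimeError(f"No extension known for {mime_type}")
            return extension
    raise RuntimeError(f"Could not determine MIME type of {path}")


# A PDF header stored under a misleading name is still recognized.
with tempfile.TemporaryDirectory() as tmp_dir:
    sample = Path(tmp_dir) / "file.dat"
    sample.write_bytes(b"%PDF-1.7\n")
    print(sniff_extension(sample))  # .pdf
```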
3. MagikaInferencer
Uses Google's Magika AI model for advanced file type detection, especially effective for text files. Provides confidence scores and detailed type information.
When to use: When you need the highest accuracy, especially for text files, or when you need confidence scores.
from filetype_detector.magika_inferencer import MagikaInferencer
inferencer = MagikaInferencer()
# Get extension only
extension = inferencer.infer("script.py") # Returns: '.py'
# Get extension with confidence score
extension, score = inferencer.infer_with_score("data.json")
# Returns: ('.json', 0.98)
# With custom prediction mode
from magika import PredictionMode
extension, score = inferencer.infer_with_score(
    "file.txt",
    prediction_mode=PredictionMode.HIGH_CONFIDENCE
)
Raises:
- FileNotFoundError: If the file does not exist
- ValueError: If the path is not a file
- RuntimeError: If Magika fails to analyze the file
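One way to put the confidence score to use is to accept a prediction only above a threshold and otherwise fall back to the extension in the file name. The sketch below is an illustration, not part of the library: accept_or_fallback is a hypothetical helper, the 0.9 threshold is an arbitrary assumption, and the (extension, score) tuple is passed in directly so the example runs without Magika installed.

```python
from pathlib import Path
from typing import Tuple, Union


def accept_or_fallback(
    file_path: Union[Path, str],
    prediction: Tuple[str, float],
    min_score: float = 0.9,
) -> str:
    """Keep a scored prediction if it is confident enough, else use the path suffix."""
    extension, score = prediction
    if score >= min_score:
        return extension
    # Low confidence: fall back to the extension written in the file name.
    return Path(file_path).suffix


print(accept_or_fallback("data.txt", (".json", 0.98)))  # .json
print(accept_or_fallback("data.txt", (".json", 0.42)))  # .txt
```

In real use the tuple would come from a call such as inferencer.infer_with_score("data.txt").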
4. CascadingInferencer (Recommended)
A smart two-stage inference strategy that combines Magic and Magika:
- Stage 1: Uses Magic for all files (fast)
- Stage 2: If detected as a text file (text/* MIME type), uses Magika for detailed type detection
This approach optimizes performance by only using Magika (computationally expensive) for text files where it excels, while using faster Magic detection for binary files.
When to use: Recommended default choice for balanced performance and accuracy.
System Requirements: Requires libmagic system library. See Installation section for OS-specific setup.
from filetype_detector.mixture_inferencer import CascadingInferencer
inferencer = CascadingInferencer()
# Text file - uses Magic then Magika
extension = inferencer.infer("script.py") # Returns: '.py' (from Magika)
# Binary file - uses Magic only
extension = inferencer.infer("document.pdf") # Returns: '.pdf' (from Magic)
# JSON file with wrong extension
extension = inferencer.infer("data.txt") # May return: '.json' (from Magika)
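The two-stage control flow can be sketched independently of either detector. In the example below the three callables are injected, and simple stubs stand in for Magic and Magika so the code runs on its own; make_cascade and the stub values are hypothetical, and the real CascadingInferencer may handle errors and edge cases differently.

```python
from typing import Callable, List


def make_cascade(
    detect_mime: Callable[[str], str],
    mime_to_extension: Callable[[str], str],
    refine_text: Callable[[str], str],
) -> Callable[[str], str]:
    """Build a two-stage inferencer from three injected callables."""

    def infer(file_path: str) -> str:
        mime_type = detect_mime(file_path)  # Stage 1: always runs
        if mime_type.startswith("text/"):
            return refine_text(file_path)   # Stage 2: text files only
        return mime_to_extension(mime_type)

    return infer


# Stub detectors standing in for Magic and Magika (illustrative values only).
fake_mime = {"script.py": "text/x-python", "document.pdf": "application/pdf"}
refined: List[str] = []  # records which paths reached the second stage


def stub_refine(path: str) -> str:
    refined.append(path)
    return ".py"


cascade = make_cascade(
    detect_mime=lambda path: fake_mime[path],
    mime_to_extension=lambda mime: {"application/pdf": ".pdf"}[mime],
    refine_text=stub_refine,
)

print(cascade("script.py"))     # .py
print(cascade("document.pdf"))  # .pdf
print(refined)                  # ['script.py']
```

Only the text file reaches the second stage, which is where the cost saving for binary files comes from.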
Type-Safe Usage
The library provides type-safe inference method selection:
from filetype_detector.inferencer import InferencerType, FILE_FORMAT_INFERENCER_MAP
def process_file(file_path: str, method: InferencerType) -> str:
    inferencer_func = FILE_FORMAT_INFERENCER_MAP[method]
    extension = inferencer_func(file_path)
    return extension
# Type-safe calls
result1 = process_file("doc.pdf", "magic") # ✅ Valid
result2 = process_file("doc.pdf", None) # ✅ Valid
result3 = process_file("doc.pdf", "invalid") # ❌ Rejected by a static type checker such as mypy
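Type hints are checked by tools such as mypy, not by the interpreter, so code that receives the method name from user input or configuration may still want a runtime guard. The sketch below shows one way such a map and guard could look. The alias MethodName, the one-entry INFERENCERS map, and infer_checked are all hypothetical stand-ins; the actual definition of InferencerType is not shown in this README.

```python
from pathlib import Path
from typing import Callable, Dict, Literal, Optional, Union

# Hypothetical alias: the real InferencerType may list other methods.
MethodName = Optional[Literal["magic", "magika"]]


def _lexical(file_path: Union[Path, str]) -> str:
    return Path(file_path).suffix


# Stand-in map with only the lexical entry filled in for this sketch.
INFERENCERS: Dict[MethodName, Callable[[Union[Path, str]], str]] = {None: _lexical}


def infer_checked(file_path: Union[Path, str], method: MethodName) -> str:
    """Look up the inferencer and fail with a clear message for unknown methods."""
    try:
        inferencer = INFERENCERS[method]
    except KeyError:
        raise ValueError(f"Unknown inference method: {method!r}") from None
    return inferencer(file_path)


print(infer_checked("doc.pdf", None))  # .pdf
```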
Handling Edge Cases
Files Without Extensions
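from filetype_detector.lexical_inferencer import LexicalInferencer
from filetype_detector.magic_inferencer import MagicInferencer
from filetype_detector.mixture_inferencer import CascadingInferencer
# Lexical inferencer returns empty string
lexical = LexicalInferencer()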
result = lexical.infer("file_without_ext") # Returns: ''
# Magic/Magika inferencers detect from content
magic = MagicInferencer()
result = magic.infer("file_without_ext") # Returns: '.txt' (detected)
cascading = CascadingInferencer()
result = cascading.infer("file_without_ext") # Returns detected extension
Wrong File Extensions
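from filetype_detector.magic_inferencer import MagicInferencer
from filetype_detector.magika_inferencer import MagikaInferencer
# File named 'data.txt' but contains JSON
magic = MagicInferencer()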
result = magic.infer("data.txt") # May return: '.json'
magika = MagikaInferencer()
result, score = magika.infer_with_score("data.txt") # Returns: ('.json', 0.95)
Error Handling
All inferencers raise appropriate exceptions:
from filetype_detector.magic_inferencer import MagicInferencer
inferencer = MagicInferencer()
try:
    extension = inferencer.infer("nonexistent.pdf")
except FileNotFoundError:
    print("File not found")
except ValueError:
    print("Path is not a file")
except RuntimeError as e:
    print(f"Detection failed: {e}")
Testing
Run the test suite:
pytest tests/ -v
With log output shown (-s disables pytest's output capture, so loguru messages reach the terminal):
pytest tests/ -v -s
Run specific test files:
pytest tests/test_cascading_inferencer.py -v
pytest tests/test_magic_inferencer.py -v
pytest tests/test_magika_inferencer.py -v
pytest tests/test_lexical_inferencer.py -v
Architecture
Base Class
All inferencers inherit from BaseInferencer, which defines a common interface:
from abc import ABC, abstractmethod
from typing import Union
from pathlib import Path
class BaseInferencer(ABC):
    @abstractmethod
    def infer(self, file_path: Union[Path, str]) -> str:
        """Infer file format from path."""
        raise NotImplementedError
Custom Inferencer
You can create custom inferencers by subclassing BaseInferencer:
from filetype_detector.base_inferencer import BaseInferencer
from typing import Union
from pathlib import Path
class CustomInferencer(BaseInferencer):
    def infer(self, file_path: Union[Path, str]) -> str:
        # Your custom logic here
        return ".custom"
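As a slightly fuller illustration, the sketch below defines a custom inferencer that keeps multi-part suffixes such as .tar.gz. So that it runs without the package installed, it declares a stand-in BaseInferencer mirroring the interface shown above; in a real project you would import the class from filetype_detector.base_inferencer instead. CompoundSuffixInferencer and its suffix list are hypothetical.

```python
from abc import ABC, abstractmethod
from pathlib import Path
from typing import Union


class BaseInferencer(ABC):
    """Stand-in mirroring the documented interface, so the sketch is self-contained."""

    @abstractmethod
    def infer(self, file_path: Union[Path, str]) -> str:
        raise NotImplementedError


class CompoundSuffixInferencer(BaseInferencer):
    """Hypothetical inferencer that keeps multi-part suffixes such as '.tar.gz'."""

    COMPOUND_SUFFIXES = (".tar.gz", ".tar.bz2", ".tar.xz")

    def infer(self, file_path: Union[Path, str]) -> str:
        name = Path(file_path).name.lower()
        for suffix in self.COMPOUND_SUFFIXES:
            if name.endswith(suffix):
                return suffix
        # Anything else falls back to the last suffix of the path.
        return Path(file_path).suffix


inferencer = CompoundSuffixInferencer()
print(inferencer.infer("backup.tar.gz"))  # .tar.gz
print(inferencer.infer("notes.txt"))      # .txt
```

Because infer is abstract, a subclass that forgets to implement it cannot be instantiated, which catches incomplete inferencers early.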
Performance Considerations
- LexicalInferencer: Fastest (~microseconds), no I/O required
- MagicInferencer: Fast (~milliseconds), requires file read
- MagikaInferencer: Slower (~5-10ms after model load), requires file read + AI inference
- CascadingInferencer: Balanced - Magic speed for binaries, Magika accuracy for text files
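These figures are rough and depend on hardware and file size, so it can be worth measuring on your own data. The following is a minimal timing sketch; average_seconds is a hypothetical helper, and a path-only lambda stands in for a real inferencer so the example runs without the package installed.

```python
import time
from pathlib import Path
from typing import Callable


def average_seconds(
    infer: Callable[[str], str], file_path: str, repeats: int = 1000
) -> float:
    """Return the mean wall-clock time of one infer(file_path) call."""
    start = time.perf_counter()
    for _ in range(repeats):
        infer(file_path)
    return (time.perf_counter() - start) / repeats


# Stand-in for a path-only inferencer; swap in any callable that takes a path.
elapsed = average_seconds(lambda p: Path(p).suffix, "document.pdf")
print(f"{elapsed * 1e6:.2f} microseconds per call")
```

When timing a model-backed inferencer, call it once before measuring so that one-time model loading is not counted in the average.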
Dependencies
- python-magic>=0.4.27: For magic number-based file detection
- magika>=1.0.1: Google's AI-powered file type detection
- pytest>=8.4.2: Testing framework
- loguru>=0.7.3: Logging (used in tests)
Requirements
- Python >= 3.8
License
This project is open source. See LICENSE file for details.
Contributing
Contributions are welcome! Please ensure:
- All tests pass: pytest tests/ -v
- Code follows the existing style
- New features include appropriate tests
- Documentation is updated
Acknowledgments
- python-magic for libmagic bindings
- Google Magika for AI-powered file type detection