A tool to determine the content type of a file with deep learning
Magika Python Package
Magika is a novel AI-powered file type detection tool that relies on recent advances in deep learning to provide accurate detection. Under the hood, Magika employs a custom, highly optimized Keras model that weighs only about 1MB and enables precise file identification within milliseconds, even when running on a single CPU.
Use Magika as a command line client or in your Python code!
Please check out Magika on GitHub for more information and documentation: https://github.com/google/magika.
[!WARNING] This README is about the soon-to-be released magika 0.6.0 (which will first be published as 0.6.0rc1 on PyPI to allow for testing). For older versions, browse the git repository at the latest stable release, here and here. See CHANGELOG.md for more details.
Installing Magika
Magika is available as magika on PyPI:
To install the most recent stable version:
$ pip install magika
If you intend to use Magika only as a command-line tool, you may want to use $ pipx install magika instead.
To install a specific, possibly unstable version published as a release candidate:
$ pip install magika==0.6.0rc1
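To check from Python which version ended up installed, and whether it is a release candidate, a small standard-library sketch could look like the following. The helper names installed_version and is_release_candidate are made up for this example; only importlib.metadata is a real API here.

```python
import re
from importlib.metadata import PackageNotFoundError, version
from typing import Optional


def installed_version(package: str = "magika") -> Optional[str]:
    """Return the installed version of `package`, or None if it is not installed."""
    try:
        return version(package)
    except PackageNotFoundError:
        return None


def is_release_candidate(version_string: str) -> bool:
    """Return True for versions ending in an `rcN` suffix, such as 0.6.0rc1."""
    return re.search(r"rc\d+$", version_string) is not None


found = installed_version()
if found is None:
    print("magika is not installed")
else:
    suffix = " (release candidate)" if is_release_candidate(found) else ""
    print(f"magika {found}{suffix}")
```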
Using Magika as a command-line tool
Starting from magika 0.6.0, the Python package ships the new CLI, written in Rust (which replaces the old one written in Python).
$ cd tests_data/basic && magika -r *
asm/code.asm: Assembly (code)
batch/simple.bat: DOS batch file (code)
c/code.c: C source (code)
css/code.css: CSS source (code)
csv/magika_test.csv: CSV document (code)
dockerfile/Dockerfile: Dockerfile (code)
docx/doc.docx: Microsoft Word 2007+ document (document)
epub/doc.epub: EPUB document (document)
epub/magika_test.epub: EPUB document (document)
flac/test.flac: FLAC audio bitstream data (audio)
handlebars/example.handlebars: Handlebars source (code)
html/doc.html: HTML document (code)
ini/doc.ini: INI configuration file (text)
javascript/code.js: JavaScript source (code)
jinja/example.j2: Jinja template (code)
jpeg/magika_test.jpg: JPEG image data (image)
json/doc.json: JSON document (code)
latex/sample.tex: LaTeX document (text)
makefile/simple.Makefile: Makefile source (code)
markdown/README.md: Markdown document (text)
[...]
$ magika ./tests_data/basic/python/code.py --json
[
  {
    "path": "./tests_data/basic/python/code.py",
    "result": {
      "status": "ok",
      "value": {
        "dl": {
          "description": "Python source",
          "extensions": [
            "py",
            "pyi"
          ],
          "group": "code",
          "is_text": true,
          "label": "python",
          "mime_type": "text/x-python"
        },
        "output": {
          "description": "Python source",
          "extensions": [
            "py",
            "pyi"
          ],
          "group": "code",
          "is_text": true,
          "label": "python",
          "mime_type": "text/x-python"
        },
        "score": 0.753000020980835
      }
    }
  }
]
$ cat doc.ini | magika -
-: INI configuration file (text)
$ magika --help
Determines the content type of files with deep-learning
Usage: magika [OPTIONS] [PATH]...
Arguments:
  [PATH]...
          List of paths to the files to analyze.
          Use a dash (-) to read from standard input (can only be used once).
Options:
  -r, --recursive
          Identifies files within directories instead of identifying the directory itself
      --no-dereference
          Identifies symbolic links as is instead of identifying their content by following them
      --colors
          Prints with colors regardless of terminal support
      --no-colors
          Prints without colors regardless of terminal support
  -s, --output-score
          Prints the prediction score in addition to the content type
  -i, --mime-type
          Prints the MIME type instead of the content type description
  -l, --label
          Prints a simple label instead of the content type description
      --json
          Prints in JSON format
      --jsonl
          Prints in JSONL format
      --format <CUSTOM>
          Prints using a custom format (use --help for details).
          The following placeholders are supported:
            %p  The file path
            %l  The unique label identifying the content type
            %d  The description of the content type
            %g  The group of the content type
            %m  The MIME type of the content type
            %e  Possible file extensions for the content type
            %s  The score of the content type for the file
            %S  The score of the content type for the file in percent
            %b  The model output if overruled (empty otherwise)
            %%  A literal %
  -h, --help
          Print help (see a summary with '-h')
  -V, --version
          Print version
Check the Rust CLI docs for more information.
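To make the --format placeholders more concrete, here is a small Python sketch that mimics the substitution for a subset of them on a record shaped like the JSON output shown earlier. It is not the CLI's implementation: the expand function and the RECORD layout are invented for illustration, and the score, extension, and overruled placeholders (%s, %S, %e, %b) are left out because their exact rendering is not described here.

```python
import re
from typing import Dict

# One record shaped like an entry of the `--json` output shown earlier.
RECORD = {
    "path": "./tests_data/basic/python/code.py",
    "output": {
        "description": "Python source",
        "group": "code",
        "label": "python",
        "mime_type": "text/x-python",
    },
}


def expand(template: str, record: Dict) -> str:
    """Expand a subset of the placeholders listed in `magika --help`."""
    out = record["output"]
    values = {
        "p": record["path"],
        "l": out["label"],
        "d": out["description"],
        "g": out["group"],
        "m": out["mime_type"],
        "%": "%",
    }
    # Placeholders this sketch does not know are left untouched (an assumption).
    return re.sub(r"%(.)", lambda m: values.get(m.group(1), m.group(0)), template)


print(expand("%p: %d [%m]", RECORD))
```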
Check the docs on Magika's output for more details about the output format.
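When the CLI is driven from a script, the JSON output is usually the easiest form to consume. The sketch below parses a trimmed copy of the --json output shown above and extracts the path, output label, and score. The summarize helper is a name chosen for this example, and skipping entries whose status is not "ok" is an assumption about how failures are reported.

```python
import json
from typing import List, Tuple

# Trimmed copy of the `magika --json` output shown above.
SAMPLE_OUTPUT = """
[
  {
    "path": "./tests_data/basic/python/code.py",
    "result": {
      "status": "ok",
      "value": {
        "dl": {"label": "python", "mime_type": "text/x-python"},
        "output": {"label": "python", "mime_type": "text/x-python"},
        "score": 0.753000020980835
      }
    }
  }
]
"""


def summarize(json_text: str) -> List[Tuple[str, str, float]]:
    """Return (path, output label, score) for every entry whose status is "ok"."""
    rows = []
    for entry in json.loads(json_text):
        result = entry["result"]
        if result["status"] != "ok":
            # Assumption: non-"ok" entries carry no "value" to read, so skip them.
            continue
        value = result["value"]
        rows.append((entry["path"], value["output"]["label"], value["score"]))
    return rows


for path, label, score in summarize(SAMPLE_OUTPUT):
    print(f"{path}: {label} ({score:.2f})")
```

To feed it real output, the text could come from running magika with the --json flag through subprocess.run with capture_output=True and text=True.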
Using Magika as a Python module
[!WARNING] The new API is very similar to the old one, but it ships with a number of improvements and introduces a few breaking changes. Updating existing clients should be fairly straightforward, and, where we could, we kept support for the old API and added deprecation warnings. See the CHANGELOG.md for the full list of changes and suggestions on how to address them.
>>> from magika import Magika
>>> m = Magika()
>>> res = m.identify_bytes(b"# Example\nThis is an example of markdown!")
>>> print(res.output.label)
markdown
API documentation
First, create a Magika instance: magika = Magika().
The Magika object exposes three methods:
- magika.identify_bytes(b"test"): takes as input a stream of bytes and predicts its content type.
- magika.identify_path(Path("test.txt")): takes as input one Path object and predicts its content type.
- magika.identify_paths([Path("test.txt"), Path("test2.txt")]): takes as input a list of Path objects and returns the predicted type for each of them.
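As a rough illustration of the batch variant, the sketch below maps a list of paths to their predicted labels with a single identify_paths call. The labels_by_path and main names are invented for this example. It assumes results come back in the same order as the input paths, and it does not handle files that fail to be identified.

```python
from pathlib import Path
from typing import Dict, Iterable, List


def labels_by_path(identifier, paths: Iterable[Path]) -> Dict[str, str]:
    """Map each path to its predicted label with one `identify_paths` call.

    `identifier` is expected to be a `Magika()` instance; any object with a
    compatible `identify_paths` method works, which keeps the helper testable.
    """
    path_list: List[Path] = list(paths)
    results = identifier.identify_paths(path_list)
    # Assumption: results come back in the same order as the input paths.
    return {str(p): str(r.output.label) for p, r in zip(path_list, results)}


def main() -> None:
    # Imported here so the helper above does not require magika at import time.
    from magika import Magika

    files = [p for p in Path(".").iterdir() if p.is_file()]
    for path, label in labels_by_path(Magika(), files).items():
        print(f"{path}: {label}")
```

Call main() in an environment where magika is installed to label the files in the current directory.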
If you are dealing with big files, the identify_path and identify_paths variants are generally better: their implementation seek()s around the file to extract the needed features, without loading the entire content in memory.
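The general idea behind seeking instead of loading can be sketched with the standard library alone. The example below reads a small block from the start and from the end of a file without touching the middle. It only illustrates the technique: which regions Magika actually reads and how large they are is not stated here, so read_head_and_tail and its block_size parameter are hypothetical.

```python
import tempfile
from pathlib import Path
from typing import Tuple


def read_head_and_tail(path: Path, block_size: int = 1024) -> Tuple[bytes, bytes]:
    """Read up to `block_size` bytes from the start and from the end of a file.

    Only these two slices are read; the middle of the file is never loaded.
    """
    size = path.stat().st_size
    with path.open("rb") as f:
        head = f.read(block_size)
        # Jump straight to the last block instead of reading through the file.
        f.seek(max(0, size - block_size))
        tail = f.read(block_size)
    return head, tail


# Demo on a temporary file of roughly 1 MB.
with tempfile.TemporaryDirectory() as tmp:
    sample = Path(tmp) / "big.bin"
    sample.write_bytes(b"A" * 8 + b"\0" * 1_000_000 + b"Z" * 8)
    head, tail = read_head_and_tail(sample, block_size=8)
    print(head, tail)
```

With identify_bytes, by contrast, the caller has already loaded the whole content into memory before Magika sees it.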
These APIs return an object of type MagikaResult, an absl::StatusOr-like wrapper around MagikaPrediction, which exposes the same information discussed in Magika's output documentation.
Here is what the main types look like:
class MagikaResult:
    path: Path
    status: Status
    prediction: MagikaPrediction
    [...]
class MagikaPrediction:
    dl: ContentTypeInfo
    output: ContentTypeInfo
    score: float
class ContentTypeInfo:
    label: ContentTypeLabel
    mime_type: str
    group: str
    description: str
    extensions: List[str]
    is_text: bool
class ContentTypeLabel(StrEnum):
    APK = "apk"
    BMP = "bmp"
    [...]
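A short sketch of how these fields might be read follows. It uses only the attributes listed in the types above, and it runs against stand-in objects so that it works without magika installed. The describe helper is a made-up name, and treating a mismatch between dl and output as "the model was overruled" is an assumption based on the %b placeholder in the CLI help.

```python
from types import SimpleNamespace


def describe(result) -> str:
    """Format one result using only the fields listed in the types above."""
    prediction = result.prediction
    line = f"{result.path}: {prediction.output.description} ({prediction.output.group})"
    line += f", score {prediction.score:.2f}"
    # Assumption: when `dl` and `output` disagree, the raw model prediction was overruled.
    if prediction.dl.label != prediction.output.label:
        line += f", model said {prediction.dl.label}"
    return line


# Stand-in object with the same attribute layout, so the sketch runs without magika.
info = SimpleNamespace(label="python", description="Python source", group="code")
fake_result = SimpleNamespace(
    path="code.py",
    prediction=SimpleNamespace(dl=info, output=info, score=0.753),
)
print(describe(fake_result))
```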
Development setup
magika uses uv as a project and dependency management tool.
- To install all the dependencies: $ cd python; uv sync.
- To run the test suite: $ cd python; uv run pytest tests -m "not slow". Check the GitHub action workflows for more information.
- We use the maturin backend to combine the Rust CLI with the Python codebase. To build: $ cd python; uv run ./scripts/build_python_package.py.
Citation
If you use this software for your research, please cite it as:
@misc{magika,
title={{Magika: AI-Powered Content-Type Detection}},
author={{Fratantonio, Yanick and Invernizzi, Luca and Farah, Loua and Kurt, Thomas and Zhang, Marina and Albertini, Ange and Galilee, Francois and Metitieri, Giancarlo and Cretin, Julien and Petit-Bianco, Alexandre and Tao, David and Bursztein, Elie}},
year={2024},
eprint={2409.13768},
archivePrefix={arXiv},
primaryClass={cs.CR},
url={https://arxiv.org/abs/2409.13768},
}