
A tool to determine the content type of a file with deep learning


Magika Python Package


Magika is a novel AI-powered file type detection tool that relies on recent advances in deep learning to provide accurate detection. Under the hood, Magika employs a custom, highly optimized Keras model that weighs only about 1MB and enables precise file identification within milliseconds, even when running on a single CPU.

Use Magika as a command line client or in your Python code!

Please check out Magika on GitHub for more information and documentation: https://github.com/google/magika.

[!WARNING] This README is about the soon-to-be released magika 0.6.0 (currently released as 0.6.0rc2 for testing). For older versions, browse the git repository at the latest stable release, here and here.

See CHANGELOG.md for more details.

Installing Magika

Magika is available as magika on PyPI.

To install the most recent stable version:

$ pip install magika

If you intend to use Magika only as a command-line tool, you may want to use $ pipx install magika instead.

To install a specific, possibly unstable version published as a release candidate:

$ pip install magika==0.6.0rc1

Using Magika as a command-line tool

Starting from magika 0.6.0, the Python package ships the new CLI, written in Rust, which replaces the old one written in Python.

$ cd tests_data/basic && magika -r *
asm/code.asm: Assembly (code)
batch/simple.bat: DOS batch file (code)
c/code.c: C source (code)
css/code.css: CSS source (code)
csv/magika_test.csv: CSV document (code)
dockerfile/Dockerfile: Dockerfile (code)
docx/doc.docx: Microsoft Word 2007+ document (document)
epub/doc.epub: EPUB document (document)
epub/magika_test.epub: EPUB document (document)
flac/test.flac: FLAC audio bitstream data (audio)
handlebars/example.handlebars: Handlebars source (code)
html/doc.html: HTML document (code)
ini/doc.ini: INI configuration file (text)
javascript/code.js: JavaScript source (code)
jinja/example.j2: Jinja template (code)
jpeg/magika_test.jpg: JPEG image data (image)
json/doc.json: JSON document (code)
latex/sample.tex: LaTeX document (text)
makefile/simple.Makefile: Makefile source (code)
markdown/README.md: Markdown document (text)
[...]
$ magika ./tests_data/basic/python/code.py --json
[
  {
    "path": "./tests_data/basic/python/code.py",
    "result": {
      "status": "ok",
      "value": {
        "dl": {
          "description": "Python source",
          "extensions": [
            "py",
            "pyi"
          ],
          "group": "code",
          "is_text": true,
          "label": "python",
          "mime_type": "text/x-python"
        },
        "output": {
          "description": "Python source",
          "extensions": [
            "py",
            "pyi"
          ],
          "group": "code",
          "is_text": true,
          "label": "python",
          "mime_type": "text/x-python"
        },
        "score": 0.753000020980835
      }
    }
  }
]
$ cat doc.ini | magika -
-: INI configuration file (text)
$ magika --help
Determines the content type of files with deep-learning

Usage: magika [OPTIONS] [PATH]...

Arguments:
  [PATH]...
          List of paths to the files to analyze.

          Use a dash (-) to read from standard input (can only be used once).

Options:
  -r, --recursive
          Identifies files within directories instead of identifying the directory itself

      --no-dereference
          Identifies symbolic links as is instead of identifying their content by following them

      --colors
          Prints with colors regardless of terminal support

      --no-colors
          Prints without colors regardless of terminal support

  -s, --output-score
          Prints the prediction score in addition to the content type

  -i, --mime-type
          Prints the MIME type instead of the content type description

  -l, --label
          Prints a simple label instead of the content type description

      --json
          Prints in JSON format

      --jsonl
          Prints in JSONL format

      --format <CUSTOM>
          Prints using a custom format (use --help for details).

          The following placeholders are supported:

            %p  The file path
            %l  The unique label identifying the content type
            %d  The description of the content type
            %g  The group of the content type
            %m  The MIME type of the content type
            %e  Possible file extensions for the content type
            %s  The score of the content type for the file
            %S  The score of the content type for the file in percent
            %b  The model output if overruled (empty otherwise)
            %%  A literal %

  -h, --help
          Print help (see a summary with '-h')

  -V, --version
          Print version

Check the Rust CLI docs for more information.

Check the docs on Magika's output for more details about the output format.
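
For scripting around the CLI, the --json output shown above can be consumed with Python's standard json module. The following is a minimal sketch that assumes the structure of the sample output (path, result.status, result.value.output.label, result.value.score). The helper name summarize_json is hypothetical, and this README does not show what a non-"ok" entry looks like, so such entries are simply reported without a label or score.

```python
import json
from typing import List, Optional, Tuple

# Trimmed copy of the sample `magika --json` output shown above.
SAMPLE = """
[
  {
    "path": "./tests_data/basic/python/code.py",
    "result": {
      "status": "ok",
      "value": {
        "output": {"label": "python", "mime_type": "text/x-python"},
        "score": 0.753000020980835
      }
    }
  }
]
"""


def summarize_json(text: str) -> List[Tuple[str, Optional[str], Optional[float]]]:
    """Return (path, output label, score) for each entry of a --json report."""
    rows = []
    for entry in json.loads(text):
        result = entry["result"]
        if result.get("status") == "ok":
            value = result["value"]
            rows.append((entry["path"], value["output"]["label"], value["score"]))
        else:
            # Assumption: an entry whose status is not "ok" carries no prediction.
            rows.append((entry["path"], None, None))
    return rows


for path, label, score in summarize_json(SAMPLE):
    print(path, label, score)
```

To process real output instead of the embedded sample, the text could be captured from the CLI, for example with subprocess.run(["magika", "--json", some_path], capture_output=True, text=True).stdout, where some_path is a placeholder for the file to analyze.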

Using Magika as a Python module

[!WARNING] The new API is very similar to the old one, but it ships with a number of improvements and introduces a few breaking changes. Updating existing clients should be fairly straightforward, and, where we could, we kept support for the old API and added deprecation warnings. See the CHANGELOG.md for the full list of changes and suggestions on how to fix them.

>>> from magika import Magika
>>> m = Magika()
>>> res = m.identify_bytes(b"# Example\nThis is an example of markdown!")
>>> print(res.output.label)
markdown

API documentation

First, create a Magika instance: magika = Magika().

The Magika object exposes three methods:

  • magika.identify_bytes(b"test"): takes as input a stream of bytes and predicts its content type.
  • magika.identify_path(Path("test.txt")): takes as input one Path object and predicts its content type.
  • magika.identify_paths([Path("test.txt"), Path("test2.txt")]): takes as input a list of Path objects and returns the predicted type for each of them.

If you are dealing with big files, the identify_path and identify_paths variants are generally better: their implementation seek()s around the file to extract the needed features, without loading the entire content in memory.
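
To illustrate why seeking scales to large files, here is a minimal sketch that reads only the first and last bytes of a file with seek(). It is not Magika's actual feature extraction: the function name read_head_and_tail and the 512-byte block size are assumptions chosen purely for illustration.

```python
import os
import tempfile
from pathlib import Path
from typing import Tuple


def read_head_and_tail(path: Path, block_size: int = 512) -> Tuple[bytes, bytes]:
    """Read up to block_size bytes from each end of a file without loading it all."""
    with open(path, "rb") as f:
        head = f.read(block_size)
        # Seeking to the end returns the file size in bytes.
        size = f.seek(0, os.SEEK_END)
        # Jump to where the tail starts (offset 0 for files smaller than a block).
        f.seek(max(0, size - block_size))
        tail = f.read(block_size)
    return head, tail


# Demo on a temporary file of roughly 100 KB: only 1024 bytes are ever read.
with tempfile.TemporaryDirectory() as tmp:
    sample = Path(tmp) / "big.bin"
    sample.write_bytes(b"A" * 1000 + b"B" * 100_000 + b"C" * 1000)
    head, tail = read_head_and_tail(sample)
    print(len(head), head[:1], len(tail), tail[-1:])
```

The amount of data read stays constant regardless of file size, which is the property that makes the path-based variants preferable for big files.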

These APIs return an object of type MagikaResult, an absl::StatusOr-like wrapper around MagikaPrediction, which exposes the same information discussed in the documentation on Magika's output.

Here is what the main types look like:

class MagikaResult:
    path: Path
    status: Status
    prediction: MagikaPrediction
    [...]
class MagikaPrediction:
    dl: ContentTypeInfo
    output: ContentTypeInfo
    score: float
class ContentTypeInfo:
    label: ContentTypeLabel
    mime_type: str
    group: str
    description: str
    extensions: List[str]
    is_text: bool
class ContentTypeLabel(StrEnum):
    APK = "apk"
    BMP = "bmp"
    [...]
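
For readers unfamiliar with the absl::StatusOr pattern, the sketch below shows the general idea in plain Python: a result carries either a value or a non-OK status, and callers check the status before reading the value. The classes SketchStatus and SketchResult are hypothetical stand-ins written for this example. They are not Magika's classes, and the real MagikaResult may expose different helpers.

```python
from dataclasses import dataclass
from enum import Enum
from typing import Optional


class SketchStatus(Enum):
    OK = "ok"
    FILE_NOT_FOUND = "file_not_found"  # hypothetical error status


@dataclass(frozen=True)
class SketchResult:
    """StatusOr-like wrapper: a status plus a value that is set only on success."""

    status: SketchStatus
    value: Optional[str] = None  # e.g. a predicted label

    @property
    def ok(self) -> bool:
        return self.status is SketchStatus.OK

    def unwrap(self) -> str:
        """Return the value, or raise if the result does not hold one."""
        if not self.ok or self.value is None:
            raise ValueError(f"no value available: status={self.status.value}")
        return self.value


def describe(result: SketchResult) -> str:
    # Check the status first, then read the value only when it is safe to do so.
    return result.unwrap() if result.ok else f"error: {result.status.value}"


print(describe(SketchResult(SketchStatus.OK, "markdown")))
print(describe(SketchResult(SketchStatus.FILE_NOT_FOUND)))
```

The benefit of this design is that a failed identification cannot be mistaken for a valid prediction: the caller has to go through the status check, or handle the exception, before using the value.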

Development setup

  • magika uses uv as a project and dependency management tool. To install all the dependencies: $ cd python; uv sync.
  • To run the test suite: $ cd python; uv run pytest tests -m "not slow". Check the GitHub Actions workflows for more information.
  • We use the maturin backend to combine the Rust CLI with the Python codebase. To build: $ cd python; uv run ./scripts/build_python_package.py.

Citation

If you use this software for your research, please cite it as:

@misc{magika,
      title={{Magika: AI-Powered Content-Type Detection}},
      author={{Fratantonio, Yanick and Invernizzi, Luca and Farah, Loua and Kurt, Thomas and Zhang, Marina and Albertini, Ange and Galilee, Francois and Metitieri, Giancarlo and Cretin, Julien and Petit-Bianco, Alexandre and Tao, David and Bursztein, Elie}},
      year={2024},
      eprint={2409.13768},
      archivePrefix={arXiv},
      primaryClass={cs.CR},
      url={https://arxiv.org/abs/2409.13768},
}

[!NOTE] The Magika paper was accepted at the IEEE/ACM International Conference on Software Engineering (ICSE) 2025!
