A tool to determine the content type of a file with deep learning

Magika Python Package

Magika is a novel AI-powered file type detection tool that relies on recent advances in deep learning to provide accurate detection. Under the hood, Magika employs a custom, highly optimized Keras model that weighs only about 1MB and enables precise file identification within milliseconds, even when running on a single CPU.

Use Magika as a command line client or in your Python code!

Please check out Magika on GitHub for more information and documentation: https://github.com/google/magika.

[!WARNING] This README is about the soon-to-be-released magika 0.6.0 (which will first be published as 0.6.0rc1 on PyPI to allow for testing). For older versions, browse the git repository at the latest stable release, here and here.

See CHANGELOG.md for more details.

Installing Magika

Magika is available as magika on PyPI:

To install the most recent stable version:

$ pip install magika

If you intend to use Magika only as a command-line tool, you may want to use $ pipx install magika instead.

To install a specific, possibly unstable version published as a release candidate:

$ pip install magika==0.6.0rc1
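
After installing, it can be handy to confirm which version is active in the current environment, for example to check that a release candidate such as 0.6.0rc1 was actually picked up. The following is a minimal sketch using only the Python standard library; the helper name installed_version is hypothetical and not part of Magika:

```python
from importlib.metadata import PackageNotFoundError, version
from typing import Optional


def installed_version(dist_name: str = "magika") -> Optional[str]:
    """Return the installed version of a distribution, or None if it is absent."""
    try:
        return version(dist_name)
    except PackageNotFoundError:
        # The distribution is not installed in this environment.
        return None


if __name__ == "__main__":
    found = installed_version()
    print(found if found is not None else "magika is not installed")
```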

Using Magika as a command-line tool

Starting from magika 0.6.0, the Python package ships the new CLI, written in Rust, which replaces the old one written in Python.

$ cd tests_data/basic && magika -r *
asm/code.asm: Assembly (code)
batch/simple.bat: DOS batch file (code)
c/code.c: C source (code)
css/code.css: CSS source (code)
csv/magika_test.csv: CSV document (code)
dockerfile/Dockerfile: Dockerfile (code)
docx/doc.docx: Microsoft Word 2007+ document (document)
epub/doc.epub: EPUB document (document)
epub/magika_test.epub: EPUB document (document)
flac/test.flac: FLAC audio bitstream data (audio)
handlebars/example.handlebars: Handlebars source (code)
html/doc.html: HTML document (code)
ini/doc.ini: INI configuration file (text)
javascript/code.js: JavaScript source (code)
jinja/example.j2: Jinja template (code)
jpeg/magika_test.jpg: JPEG image data (image)
json/doc.json: JSON document (code)
latex/sample.tex: LaTeX document (text)
makefile/simple.Makefile: Makefile source (code)
markdown/README.md: Markdown document (text)
[...]
$ magika ./tests_data/basic/python/code.py --json
[
  {
    "path": "./tests_data/basic/python/code.py",
    "result": {
      "status": "ok",
      "value": {
        "dl": {
          "description": "Python source",
          "extensions": [
            "py",
            "pyi"
          ],
          "group": "code",
          "is_text": true,
          "label": "python",
          "mime_type": "text/x-python"
        },
        "output": {
          "description": "Python source",
          "extensions": [
            "py",
            "pyi"
          ],
          "group": "code",
          "is_text": true,
          "label": "python",
          "mime_type": "text/x-python"
        },
        "score": 0.753000020980835
      }
    }
  }
]
$ cat doc.ini | magika -
-: INI configuration file (text)
$ magika --help
Determines the content type of files with deep-learning

Usage: magika [OPTIONS] [PATH]...

Arguments:
  [PATH]...
          List of paths to the files to analyze.

          Use a dash (-) to read from standard input (can only be used once).

Options:
  -r, --recursive
          Identifies files within directories instead of identifying the directory itself

      --no-dereference
          Identifies symbolic links as is instead of identifying their content by following them

      --colors
          Prints with colors regardless of terminal support

      --no-colors
          Prints without colors regardless of terminal support

  -s, --output-score
          Prints the prediction score in addition to the content type

  -i, --mime-type
          Prints the MIME type instead of the content type description

  -l, --label
          Prints a simple label instead of the content type description

      --json
          Prints in JSON format

      --jsonl
          Prints in JSONL format

      --format <CUSTOM>
          Prints using a custom format (use --help for details).

          The following placeholders are supported:

            %p  The file path
            %l  The unique label identifying the content type
            %d  The description of the content type
            %g  The group of the content type
            %m  The MIME type of the content type
            %e  Possible file extensions for the content type
            %s  The score of the content type for the file
            %S  The score of the content type for the file in percent
            %b  The model output if overruled (empty otherwise)
            %%  A literal %

  -h, --help
          Print help (see a summary with '-h')

  -V, --version
          Print version

Check the Rust CLI docs for more information.

Check the docs on Magika's output for more details about the output format.
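
The JSON output is convenient to consume from scripts. The sketch below parses a report shaped like the --json example above and extracts the path, final label, and score for each entry. The function name summarize_report is hypothetical, the sample is trimmed to the fields the function reads, and the handling of entries whose status is not "ok" is an assumption, since only the successful shape is shown in this README:

```python
import json
from typing import List, Optional, Tuple

Row = Tuple[str, Optional[str], Optional[float]]


def summarize_report(report_text: str) -> List[Row]:
    """Turn `magika --json` output into (path, label, score) rows."""
    rows: List[Row] = []
    for entry in json.loads(report_text):
        result = entry["result"]
        if result.get("status") != "ok":
            # Assumption: non-ok entries carry no usable prediction.
            rows.append((entry["path"], None, None))
            continue
        value = result["value"]
        rows.append((entry["path"], value["output"]["label"], value["score"]))
    return rows


# Trimmed copy of the --json example output shown above.
SAMPLE = """
[
  {
    "path": "./tests_data/basic/python/code.py",
    "result": {
      "status": "ok",
      "value": {
        "dl": {"label": "python"},
        "output": {"label": "python"},
        "score": 0.753000020980835
      }
    }
  }
]
"""

if __name__ == "__main__":
    for path, label, score in summarize_report(SAMPLE):
        print(path, label, score)
```

To feed it live output, one option is to capture the CLI's standard output with subprocess.run(["magika", "--json", some_path], capture_output=True, text=True) and pass the resulting stdout string to summarize_report.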

Using Magika as a Python module

[!WARNING] The new API is very similar to the old one, but it ships with a number of improvements and introduces a few breaking changes. Updating existing clients should be fairly straightforward, and, where we could, we kept support for the old API and added deprecation warnings. See CHANGELOG.md for the full list of changes and suggestions on how to fix them.

>>> from magika import Magika
>>> m = Magika()
>>> res = m.identify_bytes(b"# Example\nThis is an example of markdown!")
>>> print(res.output.label)
markdown

API documentation

First, create a Magika instance: magika = Magika().

The Magika object exposes three methods:

  • magika.identify_bytes(b"test"): takes as input a stream of bytes and predicts its content type.
  • magika.identify_path(Path("test.txt")): takes as input one Path object and predicts its content type.
  • magika.identify_paths([Path("test.txt"), Path("test2.txt")]): takes as input a list of Path objects and returns the predicted type for each of them.

If you are dealing with big files, the identify_path and identify_paths variants are generally better: their implementation seek()s around the file to extract the needed features, without loading the entire content in memory.
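
As an illustration of the path-based variants, the sketch below walks a directory, passes all regular files to identify_paths in one call, and groups the paths by predicted label. The helper names group_by_label and identify_tree are hypothetical. The sketch assumes every result carries a valid prediction; results with a non-OK status (for example, unreadable files) are not handled here.

```python
from collections import defaultdict
from pathlib import Path
from typing import Dict, Iterable, List


def group_by_label(results: Iterable) -> Dict[str, List[Path]]:
    """Group result objects by the string form of their predicted output label."""
    groups: Dict[str, List[Path]] = defaultdict(list)
    for res in results:
        groups[str(res.output.label)].append(res.path)
    return dict(groups)


def identify_tree(root: Path) -> Dict[str, List[Path]]:
    """Identify every regular file under root and group the paths by label."""
    # Imported here so that group_by_label stays usable without magika installed.
    from magika import Magika

    paths = sorted(p for p in root.rglob("*") if p.is_file())
    return group_by_label(Magika().identify_paths(paths))
```

Calling identify_tree(Path("tests_data/basic")) from the repository root would then return a dictionary mapping each label to the files identified with it.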

These APIs return an object of type MagikaResult, an absl::StatusOr-like wrapper around MagikaPrediction, which exposes the same information discussed in Magika's output documentation.

Here is what the main types look like:

class MagikaResult:
    path: Path
    status: Status
    prediction: MagikaPrediction
    [...]

class MagikaPrediction:
    dl: ContentTypeInfo
    output: ContentTypeInfo
    score: float

class ContentTypeInfo:
    label: ContentTypeLabel
    mime_type: str
    group: str
    description: str
    extensions: List[str]
    is_text: bool

class ContentTypeLabel(StrEnum):
    APK = "apk"
    BMP = "bmp"
    [...]
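
The dl and output fields of a prediction can differ: the CLI's %b placeholder is described as "the model output if overruled". Assuming dl holds the raw model prediction and output holds the final answer, a small helper can flag such cases. The name overruled_label is hypothetical and not part of the Magika API:

```python
from typing import Any, Optional


def overruled_label(prediction: Any) -> Optional[str]:
    """Return the raw model label if the final output differs from it, else None."""
    dl_label = str(prediction.dl.label)
    if dl_label == str(prediction.output.label):
        # The final output agrees with the model, so nothing was overruled.
        return None
    return dl_label
```

With a real result res, this would be called as overruled_label(res.prediction).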

Development setup

  • magika uses uv as a project and dependency management tool. To install all the dependencies: $ cd python; uv sync.
  • To run the test suite: $ cd python; uv run pytest tests -m "not slow". Check the GitHub Actions workflows for more information.
  • We use the maturin backend to combine the Rust CLI with the Python codebase. To build: $ cd python; uv run ./scripts/build_python_package.py.

Citation

If you use this software for your research, please cite it as:

@misc{magika,
      title={{Magika: AI-Powered Content-Type Detection}},
      author={{Fratantonio, Yanick and Invernizzi, Luca and Farah, Loua and Kurt, Thomas and Zhang, Marina and Albertini, Ange and Galilee, Francois and Metitieri, Giancarlo and Cretin, Julien and Petit-Bianco, Alexandre and Tao, David and Bursztein, Elie}},
      year={2024},
      eprint={2409.13768},
      archivePrefix={arXiv},
      primaryClass={cs.CR},
      url={https://arxiv.org/abs/2409.13768},
}

Download files

Download the file for your platform. If you're not sure which to choose, learn more about installing packages.

Source Distributions

No source distribution files available for this release. See tutorial on generating distribution archives.

Built Distributions

magika-0.6.0rc2-py3-none-win_amd64.whl (15.2 MB)

Uploaded Python 3 Windows x86-64

magika-0.6.0rc2-py3-none-manylinux_2_28_x86_64.whl (17.5 MB)

Uploaded Python 3 manylinux: glibc 2.28+ x86-64

magika-0.6.0rc2-py3-none-macosx_11_0_arm64.whl (15.1 MB)

Uploaded Python 3 macOS 11.0+ ARM64

File details

Details for the file magika-0.6.0rc2-py3-none-win_amd64.whl.

File metadata

  • Download URL: magika-0.6.0rc2-py3-none-win_amd64.whl
  • Size: 15.2 MB
  • Tags: Python 3, Windows x86-64
  • Uploaded using Trusted Publishing? No
  • Uploaded via: twine/5.1.1 CPython/3.12.5

File hashes

Hashes for magika-0.6.0rc2-py3-none-win_amd64.whl
Algorithm Hash digest
SHA256 cd4a6d23b81d99f346cb7ac362af969a25f02447e5153b31692bb88107116f3c
MD5 e6932ee76c202ef4c24fd1f650ca5a39
BLAKE2b-256 d5a09910d6b48de838bdab7e697f9abd0a96e0f6a0e9f94976eb2cb9466a8dd1

See more details on using hashes here.

File details

Details for the file magika-0.6.0rc2-py3-none-manylinux_2_28_x86_64.whl.

File hashes

Hashes for magika-0.6.0rc2-py3-none-manylinux_2_28_x86_64.whl
Algorithm Hash digest
SHA256 ec10064f147976b6d56e76b296d41576214bf3c078fcf6eda5f75a96dfb418a4
MD5 ee36fe99e4c89e5f66a2d13c9f113551
BLAKE2b-256 0ea0565c599bf4663e80ec07571328daaa32c49f220b6e5ecfcf9ca17c401e89

File details

Details for the file magika-0.6.0rc2-py3-none-macosx_11_0_arm64.whl.

File hashes

Hashes for magika-0.6.0rc2-py3-none-macosx_11_0_arm64.whl
Algorithm Hash digest
SHA256 5b63afaabbaca4e46a75e8787e6fb5ca41b24665ad388b5c24c2b0cd7f047500
MD5 fc031793cfd8a2c14fdd430e0d4e7590
BLAKE2b-256 49065470bd53d2c15bc182b89bbd1a004ca7d394a922dc9aafc5a4f08f9ab5bc
