
Pure Python implementation of magic file detection


puremagic is a pure Python module that identifies a file based on its magic numbers.


It is designed to be minimalistic and inherently cross-platform compatible. It is also designed to be a stand-in for python-magic: it incorporates the functions from_file(filename[, mime]) and from_string(string[, mime]). However, magic_file() and magic_string() are more powerful and also report confidence and duplicate matches.

It does NOT try to match files on non-magic strings. In other words, it will not search for a string within a certain window of bytes, as other tools might.
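The fixed-offset behaviour described above can be illustrated with a minimal sketch. The signature table and the identify helper below are hypothetical and are not puremagic's internals or data; the point is only that each signature is compared at one known offset, with no scanning across the file.

```python
from typing import List

# Hypothetical signature table: (offset, magic bytes, extension).
# These entries are illustrative only, not puremagic's actual data.
SIGNATURES = [
    (0, b"GIF87a", ".gif"),
    (0, b"GIF89a", ".gif"),
    (0, b"\x89PNG\r\n\x1a\n", ".png"),
    (4, b"ftyp", ".mp4"),
]


def identify(data: bytes) -> List[str]:
    """Return extensions whose magic bytes sit exactly at their fixed offset."""
    found = []
    for offset, magic, extension in SIGNATURES:
        # Compare only at the declared offset; never slide across the data.
        if data[offset:offset + len(magic)] == magic:
            found.append(extension)
    return found
```

With this sketch, identify(b"GIF89a...") reports a GIF, while the same bytes preceded by two junk bytes report nothing, because the signature is no longer at offset 0.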

Advantages over using a wrapper for ‘file’ or ‘libmagic’:

  • Faster

  • Lightweight

  • Cross platform compatible

  • No dependencies

Disadvantages:

  • Does not have as many file types

  • No multilingual comments

  • Duplications due to small or reused magic numbers

(Help fix the first two disadvantages by contributing!)

Compatibility

  • Python 3.7+

Continuous integration tests run on GitHub CI for the listed platforms.

Install from PyPI

$ pip install puremagic

On Linux environments, you may want to make it explicit that you are using python3:

$ python3 -m pip install puremagic

Install from source

In either a virtualenv or globally, simply run:

$ python setup.py install

Usage

“from_file” will return the most likely file extension. “magic_file” will give you every possible result it finds, as well as the confidence.

import puremagic

filename = "test/resources/images/test.gif"

ext = puremagic.from_file(filename)
# '.gif'

puremagic.magic_file(filename)
# [['.gif', 'image/gif', 'Graphics interchange format file (GIF87a)', 0.7],
#  ['.gif', '', 'GIF file', 0.5]]

“magic_file” returns every match, highest confidence first. Each match contains:

  • possible extension(s)

  • mime type

  • description

  • confidence (every header must match exactly to make the list; matches are then ordered by header length, so the longest and therefore most precise match comes first)
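The "longest header first" ordering can be sketched as follows. This only illustrates the idea and is not puremagic's actual scoring code; the Match tuple, the rank_matches function, and the byte signatures shown are hypothetical.

```python
from typing import List, NamedTuple


class Match(NamedTuple):
    """Hypothetical record for one signature that matched a file."""
    byte_match: bytes
    extension: str
    name: str


def rank_matches(matches: List[Match]) -> List[Match]:
    """Order matches so the longest (most specific) signature comes first."""
    return sorted(matches, key=lambda m: len(m.byte_match), reverse=True)


# Illustrative candidates; the byte signatures are assumptions.
candidates = [
    Match(b"GIF8", ".gif", "GIF file"),
    Match(b"GIF87a", ".gif", "Graphics interchange format file (GIF87a)"),
]
best = rank_matches(candidates)[0]
```

Here the six-byte signature outranks the four-byte one, which is also why a longer subset signature can outrank the base format it extends.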

If you already have a file open, or a raw byte string, you can also use:

  • from_string

  • from_stream

  • magic_string

  • magic_stream

with open("test/resources/video/test.mp4", "rb") as file:
    print(puremagic.magic_stream(file))

# [PureMagicWithConfidence(byte_match=b'ftypisom', offset=4, extension='.mp4', mime_type='video/mp4', name='MPEG-4 video', confidence=0.8),
#  PureMagicWithConfidence(byte_match=b'iso2avc1mp4', offset=20, extension='.mp4', mime_type='video/mp4', name='MP4 Video', confidence=0.8)]
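For raw bytes, a hedged sketch along the same lines: it assumes puremagic is installed, that from_string accepts the optional mime flag mentioned earlier, and that a lookup with no matching signature raises puremagic.PureError. The describe_bytes helper is a hypothetical name, not part of the library.

```python
def describe_bytes(data: bytes) -> str:
    """Return 'extension (mime)' for raw bytes, or 'unknown' if nothing matches."""
    # Third-party import kept local so this sketch loads even without puremagic.
    import puremagic

    try:
        extension = puremagic.from_string(data)
        mime_type = puremagic.from_string(data, mime=True)
    except puremagic.PureError:
        # Assumption: PureError signals that no signature matched.
        return "unknown"
    return f"{extension} ({mime_type})"
```

Wrapping the lookup this way may be convenient when unidentified input is expected, for example when scanning uploads, since the caller gets a plain string instead of an exception.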

Script

Usage

$ python -m puremagic [options] filename <filename2>...

Examples

$ python -m puremagic test/resources/images/test.gif
'test/resources/images/test.gif' : .gif

$ python -m puremagic -m test/resources/images/test.gif test/resources/audio/test.mp3
'test/resources/images/test.gif' : image/gif
'test/resources/audio/test.mp3' : audio/mpeg

imghdr replacement

If you are looking for a replacement for the standard library’s deprecated imghdr, you can use puremagic.what().

import puremagic

filename = "test/resources/images/test.gif"

ext = puremagic.what(filename)
# 'gif'

FAQ

The file type is actually X but it’s showing up as Y with higher confidence?

This can happen when the file’s signature happens to match a subset of a file standard. The subset signature is longer, because it contains both the base file type signature and the additional subset one, and it therefore reports with greater confidence.

You don’t have sliding offsets that could better detect plenty of common formats, why’s that?

Design choice, so it will be a lot faster and more accurate. Without more intelligent or deeper identification past a sliding offset I don’t feel comfortable including it as part of a ‘magic number’ library.

Your version isn’t as complete as I want it to be, where else should I look?

Look into python modules that wrap around libmagic or use something like Apache Tika.

Acknowledgements

Gary C. Kessler

For use of his File Signature Tables, available at: http://www.garykessler.net/library/file_sigs.html

Freedesktop.org

For use of their shared-mime-info file, available at: https://cgit.freedesktop.org/xdg/shared-mime-info/

License

MIT licensed; see LICENSE. Copyright (c) 2013-2024 Chris Griffith
