A utility to recursively map the structure of a file.

PolyFile


A utility to identify and map the semantic and syntactic structure of files, including polyglots, chimeras, and schizophrenic files. It has a pure-Python implementation of libmagic and can act as a drop-in replacement for the file command. However, unlike file, PolyFile can recursively identify embedded files, like binwalk.

PolyFile can be used in conjunction with its sister tool PolyTracker for Automated Lexical Annotation and Navigation of Parsers, a backronym devised solely for the purpose of collectively referring to the tools as The ALAN Parsers Project.

Quickstart

You can install the latest stable version of PolyFile from PyPI:

pip3 install polyfile

To install PolyFile from source, in the same directory as this README, run:

pip3 install .

This will automatically install the polyfile and polymerge executables in your path.

Important: Before installing from source, make sure Java is installed. Java is used to run the Kaitai Struct compiler, which compiles the file format definitions.

Usage

Running polyfile on a file with no arguments will mimic the behavior of file --keep-going:

$ polyfile png-polyglot.png
PNG image data, 256 x 144, 8-bit/color RGB, non-interlaced
Brainfu** Program
Malformed PDF
PDF document, version 1.3,  1 pages
ZIP end of central directory record Java JAR archive 
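
A file can legitimately match several formats at once because each format looks for its signature in a different place. The following is a minimal illustrative sketch of that idea, not PolyFile's matching algorithm: the function name `naive_signatures` and the three hard-coded checks are assumptions made for this example, whereas PolyFile itself evaluates the full libmagic definitions.

```python
# Hypothetical helper for illustration only; PolyFile does not work this way.
PNG_SIGNATURE = b"\x89PNG\r\n\x1a\n"  # PNG files start with these 8 bytes
PDF_MARKER = b"%PDF-"                 # PDF header marker
ZIP_EOCD_SIGNATURE = b"PK\x05\x06"    # ZIP end of central directory record

# A ZIP end-of-central-directory record is 22 bytes plus a comment of at most
# 65535 bytes, so it must start within the last 65557 bytes of the file.
ZIP_EOCD_MAX_DISTANCE = 22 + 65535


def naive_signatures(data: bytes) -> list[str]:
    """Return the MIME types whose signature appears where readers look for it."""
    found = []
    if data.startswith(PNG_SIGNATURE):
        found.append("image/png")
    # Many PDF readers tolerate the header anywhere in the first 1024 bytes.
    if PDF_MARKER in data[:1024]:
        found.append("application/pdf")
    # ZIP readers locate the archive by scanning backwards from the end.
    if ZIP_EOCD_SIGNATURE in data[-ZIP_EOCD_MAX_DISTANCE:]:
        found.append("application/zip")
    return found
```

Because the PNG check anchors at the start of the file, the PDF check tolerates leading bytes, and the ZIP check scans from the end, a single byte string can satisfy all three.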

To generate an interactive hex viewer for the file, use the --html option:

$ polyfile --html output.html png-polyglot.png
Found a file of type application/pdf at byte offset 0
Found a file of type application/x-brainfuck at byte offset 0
Found a file of type image/png at byte offset 0
Found a file of type application/zip at byte offset 0
Found a file of type application/java-archive at byte offset 0
Saved HTML output to output.html

Run polyfile --help for full usage instructions.

Interactive Debugger

PolyFile has an interactive debugger for both its file matching and its parsing. It can be used to debug a libmagic pattern definition, determine why a specific file fails to be classified as the expected MIME type, or step through a parser. You can run PolyFile with the debugger enabled using the -db option.

File Support

PolyFile has a cleanroom, pure Python implementation of the libmagic file classifier, and supports all 263 MIME types that it can identify.

It currently has support for parsing and semantically mapping the following formats:

For an example that exercises all of these file formats, run:

curl -v --silent https://www.sultanik.com/files/ESultanikResume.pdf | polyfile --html ESultanikResume.html -

Prior to version 0.3.0, PolyFile used the TrID database for file identification rather than the libmagic file definitions. This proved to be very slow (since TrID has many duplicate entries) and prone to false positives (since TrID's file definitions are much simpler than libmagic's). The original TrID matching code is still shipped with PolyFile and can be invoked programmatically, but it is not used by default.

Output Format

PolyFile has several options for outputting its results, specified by its --format option. For computer-readable output, PolyFile has an extension of the SBuD JSON format described in the documentation. Prior to version 0.5.0, this was the default output format of PolyFile. The default output format now mimics the behavior of the file command. To restore the original behavior, use the --format sbud option.
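
If the SBuD output needs to be consumed from a script, one option is to invoke the command-line tool and parse its output. The sketch below assumes that polyfile is on the PATH and that `--format sbud` writes the JSON document to standard output; the helper names `sbud_command` and `run_sbud` are hypothetical and are not part of PolyFile.

```python
import json
import subprocess


def sbud_command(path: str) -> list[str]:
    """Build the argument list for a polyfile run that requests SBuD output."""
    return ["polyfile", "--format", "sbud", path]


def run_sbud(path: str):
    """Run polyfile on `path` and return the parsed JSON.

    Assumption: the JSON is printed to stdout. check=True raises
    CalledProcessError if polyfile exits with a non-zero status.
    """
    completed = subprocess.run(
        sbud_command(path),
        capture_output=True,
        text=True,
        check=True,
    )
    return json.loads(completed.stdout)
```

Passing the arguments as a list rather than a single shell string avoids quoting problems with unusual file names.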

libmagic Implementation

PolyFile has a cleanroom implementation of libmagic (used in the file command). It can be invoked programmatically by running:

from polyfile.magic import MagicMatcher

with open("file_to_test", "rb") as f:
    # the default instance automatically loads all file definitions
    for match in MagicMatcher.DEFAULT_INSTANCE.match(f.read()):
        for mimetype in match.mimetypes:
            print(f"Matched MIME: {mimetype}")
        print(f"Match string: {match!s}")

To load a specific or custom file definition:

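from polyfile.magic import MagicMatcher

# paths to the libmagic definition files that should be loaded
list_of_paths_to_definitions = ["def1", "def2"]
matcher = MagicMatcher.parse(*list_of_paths_to_definitions)
with open("file_to_test", "rb") as f:
    for match in matcher.match(f.read()):
        ...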
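
Both the default and a custom matcher can be wrapped in a small helper when only the MIME types are of interest. This is a sketch that relies solely on the `match()` method and the `mimetypes` attribute shown above; the name `collect_mimetypes` is hypothetical, and the matcher is passed in as a parameter so that the helper works with either instance.

```python
def collect_mimetypes(data: bytes, matcher) -> list[str]:
    """Return each MIME type reported by `matcher` once, in first-seen order.

    `matcher` is assumed to expose match(data), yielding objects that have a
    `mimetypes` iterable, as in the snippets above.
    """
    seen: list[str] = []
    for match in matcher.match(data):
        for mimetype in match.mimetypes:
            if mimetype not in seen:
                seen.append(mimetype)
    return seen


# Possible usage with the default instance (requires PolyFile to be installed):
#
#     from polyfile.magic import MagicMatcher
#
#     with open("file_to_test", "rb") as f:
#         print(collect_mimetypes(f.read(), MagicMatcher.DEFAULT_INSTANCE))
```

A list is used instead of a set so that the order in which the matcher reported the types is preserved.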

Extending PolyFile

Instructions on extending PolyFile to support more file formats with new matchers and parsers are described in the documentation.

License and Acknowledgements

This research was developed by Trail of Bits with funding from the Defense Advanced Research Projects Agency (DARPA) under the SafeDocs program as a subcontractor to Galois. It is licensed under the Apache 2.0 license. © 2019, Trail of Bits.

Known Issues & Fixes

Python Reserved Keyword in Auto-generated Code

The Kaitai Struct compiler may generate Python code that uses class as a variable name (e.g., self.class = ...), which is invalid syntax since class is a reserved keyword in Python. This issue specifically affects the auto-generated polyfile/kaitai/parsers/openpgp_message.py file.

Automatic Fix: As of this version, polyfile-weave automatically patches this issue on import. The fix is applied transparently when you first import the package, ensuring it works out-of-the-box.

Manual Fix: If you need to manually apply the fix (e.g., for development or debugging), you can run the included fix_class_keyword.py script:

python fix_class_keyword.py

This will patch all occurrences of self.class to self.class_ in the affected file.
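
The kind of rewrite described here can be sketched with a regular expression. This is not the contents of fix_class_keyword.py, which is not reproduced in this README; the function names below are hypothetical and only illustrate renaming self.class to self.class_ without touching attributes that are already valid.

```python
import re
from pathlib import Path

# Match "self.class" only when "class" is the complete attribute name.
# The trailing \b prevents matches on "self.class_" and "self.classes".
_CLASS_ATTR = re.compile(r"\bself\.class\b")


def patch_class_keyword(source: str) -> str:
    """Rename every `self.class` attribute access to `self.class_`."""
    return _CLASS_ATTR.sub("self.class_", source)


def patch_file(path: Path) -> bool:
    """Patch a file in place; return True if its contents changed."""
    original = path.read_text(encoding="utf-8")
    patched = patch_class_keyword(original)
    if patched == original:
        return False
    path.write_text(patched, encoding="utf-8")
    return True
```

Because an already patched `self.class_` no longer matches the pattern, applying the function repeatedly leaves the file unchanged after the first run.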
