A PDF analysis toolkit. Scan a PDF with relevant YARA rules, visualize its inner tree-like data structure in living color (lots of colors), force decodes of suspicious font binaries, and more.

THE PDFALYZER

A PDF analysis tool for visualizing the inner tree-like data structure[^1] of a PDF in spectacularly large and colorful diagrams as well as scanning the binary streams embedded in the PDF for hidden potentially malicious content. The Pdfalyzer makes heavy use of YARA (via The Yaralyzer) for matching/extracting byte patterns. The Yaralyzer actually began its life as The Pdfalyzer's matching engine.

PyPI Users: This document renders a lot better on GitHub. Pictures, footnotes, etc.

Quick Start

pipx install pdfalyzer
pdfalyze heidegger_-_being_illmatic.pdf

What It Do

Generate in depth visualizations of a PDF's tree structure[^1] that give you a complete picture of all of the PDF's internal objects and the links between them. See the examples below to get an idea.
Scan for malicious content with all the PDF related YARA rules I could dig up as well as in-depth scans of the embedded compressed/encrypted binaries.
Forcibly decode suspicious byte patterns with many different character encodings. chardet is leveraged to attempt to guess the encoding, but no matter what chardet thinks, the results of forcing the bytes into each encoding will be displayed.
Usable as a library for your own PDF related code. All[^2] the inner PDF objects are guaranteed to be available in a searchable tree data structure.

If you're looking for one of these things this may be the tool for you.

What It Don't Do

This tool is mostly about examining a PDF's logical structure and assisting with the discovery of malicious content. As such it doesn't have much to offer as far as extracting text from PDFs, rendering PDFs[^3], writing new PDFs, or many of the more conventional things one might do with a portable document.

Did The World Really Need Another PDF Tool?

This tool was built to fill a gap in the PDF assessment landscape following my own recent experience trying to find malicious content in a PDF file. Didier Stevens's pdfid.py and pdf-parser.py are still the best game in town when it comes to PDF analysis tools, but they are lacking in the visualization department and don't give you much of a data model to write your own code around. Peepdf seemed promising but turned out to be in a buggy, out of date, and more or less unfixable state. And none of them offered much in the way of tooling for embedded binary analysis.

Thus I felt the world might be slightly improved if I strung together a couple of more stable/well known/actively maintained open source projects (AnyTree, PyPDF2, Rich, and YARA via The Yaralyzer) into this tool.

Installation

Installation with pipx[^4] is preferred though pip3 should also work.

pipx install pdfalyzer

See PyPDF2 installation notes about PyCryptodome if you plan to pdfalyze any files that use AES encryption.

For info on how to set up a dev environment, see the Contributing section at the end of this file.

Troubleshooting The Installation

If you used regular pip3 instead of pipx and you only want to use the CLI and don't need to import the python classes to your own code, you should try to install with pipx instead.
If you run into an issue about missing YARA try to install yara-python. If that doesn't work you may have to install the YARA executable separately.
If you encounter an error building the python cryptography package check your pip version (pip --version). If it's less than 22.0, upgrade pip with pip install --upgrade pip.
On Linux, if you encounter an error building wheel or cffi you may need to install some packages like a compiler for the Rust language or some SSL libraries. sudo apt-get install build-essential libssl-dev libffi-dev rustc may help.
While poetry.lock is checked into this repo the versions "required" there aren't really "required" so feel free to delete or downgrade if you need to.

Usage

If your python scripts setup is less than ideal and you can't get the pdfalyze command to work, python -m pdfalyzer should be an equivalent, more portable version of the same command.

Run pdfalyze --help to see usage instructions and the current list of options.

Note that The Pdfalyzer's output is extremely verbose if you don't limit the output sections (see ANALYSIS SELECTION in the --help). Almost all of the verbosity comes from the --stream option pulling things that could be (but are almost certainly not) malicious. To get everything except the stream option, use these flags:

pdfalyze lacan_buys_the_dip.pdf -d -t -r -f -y -c

Beyond that there's a few scripts in the repo that may be of interest.

Setting Command Line Options Permanently With A `.pdfalyzer` File

If you find yourself specifying the same options over and over you may be able to automate that with a dotenv setup. When you run pdfalyze on some PDF the tool will check for a file called .pdfalyzer, first in the current directory and then in the home directory. If it finds a file in either place it will load options from it. Documentation on the options that can be configured with these files lives in .pdfalyzer.example, which doubles as an example file you can copy into place and edit to your needs. Even if you don't configure your own .pdfalyzer file you may still glean some insight from reading the descriptions of the various variables in .pdfalyzer.example; there's a little more exposition there than in the output of pdfalyze -h.
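The lookup order described above (current directory first, then home directory) can be sketched in a few lines. This is a minimal illustration using only the standard library, not The Pdfalyzer's actual loading code: `find_config_file` and `parse_dotenv` are hypothetical helpers, and the real variable names are the ones documented in .pdfalyzer.example.

```python
from pathlib import Path
from typing import Dict, Optional

CONFIG_FILENAME = ".pdfalyzer"


def find_config_file(cwd: Path, home: Path) -> Optional[Path]:
    """Return the first .pdfalyzer file found, checking cwd before home."""
    for directory in (cwd, home):
        candidate = directory / CONFIG_FILENAME
        if candidate.is_file():
            return candidate
    return None


def parse_dotenv(text: str) -> Dict[str, str]:
    """Parse simple KEY=VALUE lines, skipping blank lines and # comments."""
    options = {}
    for line in text.splitlines():
        line = line.strip()
        if not line or line.startswith("#") or "=" not in line:
            continue
        key, _, value = line.partition("=")
        # Drop surrounding whitespace and any wrapping quote characters.
        options[key.strip()] = value.strip().strip("'\"")
    return options


if __name__ == "__main__":
    config_path = find_config_file(Path.cwd(), Path.home())
    if config_path is None:
        print("No .pdfalyzer file found")
    else:
        print(config_path, parse_dotenv(config_path.read_text()))
```

Because the current directory is checked first, a project-specific file shadows the one in your home directory in this sketch.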

Colors And Themes

Run pdfalyzer_show_color_theme to see the color theme employed.

As A Code Library

At its core The Pdfalyzer is taking PDF internal objects gathered by PyPDF2 and wrapping them in AnyTree's NodeMixin class. Given that things like searching the tree or accessing internal PDF properties will be done through those packages' code it may be helpful to review their documentation.

As far as The Pdfalyzer's unique functionality goes, Pdfalyzer is the class at the heart of the operation. It holds the PDF's logical tree as well as a few other data structures. Chief among these are the FontInfo class which pulls together various properties of a font strewn across 3 or 4 different PDF objects and the BinaryScanner class which lets you dig through the embedded streams' bytes looking for suspicious patterns.

Here's a short intro to how to access these objects:

from anytree.search import findall_by_attr
from pdfalyzer.pdfalyzer import Pdfalyzer

# do_stuff() and process() below are placeholders for your own functions.

# Load a PDF and parse its nodes into the tree.
pdfalyzer = Pdfalyzer("/path/to/the/evil_or_non_evil.pdf")
actual_pdf_tree = pdfalyzer.pdf_tree

# Find a PDF object by its ID in the PDF
node = pdfalyzer.find_node_by_idnum(44)
pdf_object = node.obj

# Use anytree's findall_by_attr to find nodes with a given property
page_nodes = findall_by_attr(pdfalyzer.pdf_tree, name='type', value='/Page')

# Do stuff with the fonts
for font in pdfalyzer.font_infos:
    do_stuff(font)

# Iterate over all stream objects
for node in pdfalyzer.stream_nodes():
    do_stuff(node.stream_data)

# Iterate over backtick quoted strings from a font binary and process them
font = pdfalyzer.font_infos[0]

for backtick_quoted_string in font.binary_scanner.extract_backtick_quoted_bytes():
    process(backtick_quoted_string)

Troubleshooting

This tool is by no means complete. It was built to handle a specific use case which encompassed a small fraction of the many and varied types of information that can show up in a PDF. While it has been tested on a decent number of large and very complicated PDFs (500-5,000 page manuals from Adobe itself) I'm sure there are a whole bunch of edge cases that will trip up the code.

If that does happen and you run into an issue using this tool on a particular PDF, it will most likely be an issue with relationships between objects within the PDF that are not meant to be parent/child in the tree structure made visible by this tool. There aren't many of these kinds of object references in any given file, but there's a whole galaxy of possibilities and each must be manually configured to prevent the tool from building an invalid tree. If you run into that kind of problem take a look at these list constants in the code:

NON_TREE_REFERENCES
INDETERMINATE_REF_KEYS

You might be able to easily fix your problem by adding the Adobe object's reference key to the appropriate list.

Alternatively, please open a GitHub issue with the compressed (.zip, .gz, whatever) PDF that is causing the problem attached and I'll take a look when I can. I will not take a look at any uncompressed PDFs due to the security risks, so make sure you zip it before you ship it.
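Compressing the file needs nothing beyond Python's standard library. A minimal sketch; `zip_pdf` is a hypothetical helper name and not part of The Pdfalyzer:

```python
import zipfile
from pathlib import Path


def zip_pdf(pdf_path: Path) -> Path:
    """Write pdf_path into a sibling .zip archive and return the archive's path."""
    archive_path = pdf_path.with_suffix(".zip")
    with zipfile.ZipFile(archive_path, "w", compression=zipfile.ZIP_DEFLATED) as archive:
        # arcname keeps only the file name so no local directory names end up in the archive.
        archive.write(pdf_path, arcname=pdf_path.name)
    return archive_path
```

Calling `zip_pdf(Path("problem.pdf"))` would leave a `problem.zip` next to the original file, ready to attach to an issue.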

Example Output

The Pdfalyzer can export visualizations to HTML, ANSI colored text, and SVG images using the file export functionality that comes with Rich. SVGs can be turned into PNG images with a tool like Inkscape or cairosvg (Inkscape works a lot better in our experience).

Basic Tree View

As you can see the suspicious /OpenAction relationship is highlighted bright red, as would be a couple of other sus PDF instructions like /JavaScript or /AcroForm if they exist in the PDF being pdfalyzed.

The dimmer (as in "harder to see") nodes[^5] marked with Non Child Reference give you a way to visualize the relationships between PDF objects that exist outside of the tree structure's parent/child relationships.

That's a pretty basic document. Here's the basic tree for a more complicated PDF containing an NMAP cheat sheet.

Rich Tree View

This image shows a more in-depth view of the PDF tree for the same document shown above. This tree (AKA the "rich" tree) has almost everything. It shows all PDF object properties, all relationships between objects, and sizable previews of any binary data streams embedded or encrypted in the document. Note that in addition to /OpenAction, the Adobe Type1 font binary is also red (Google's Project Zero regards any Adobe Type1 font as "mad sus").

And here's the rich tree for the same more complicated NMAP cheat sheet PDF linked instead of shown directly in the previous section.

Binary Analysis (And Lots Of It)

View the properties of the fonts in the PDF. Comes with a preview of the beginning and end of the font's raw binary data stream (at least if it's that kind of font).

Extract character mappings from ancient Adobe font formats: It's actually PyPDF2 doing the lifting here but we're happy to take the credit.

Search Internal Binary Data for Sus Content No Malware Scanner Will Catch[^6]: Things like, say, a hidden binary /F (PDF instruction meaning "URL") followed by a JS (I'll let you guess what "JS" stands for) and then a binary » character (AKA "the character the PDF specification uses to close a section of the PDF's logical structure"). Put all that together and it says that you're looking at a secret JavaScript instruction embedded in the encrypted part of a font binary. A secret instruction that causes the PDF renderer to pop out of its frame prematurely as it renders the font.
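To make the byte sequence described above concrete, here is a rough sketch using Python's `re` module on raw bytes. It is only an illustration, not how The Pdfalyzer does its matching (that goes through YARA). It assumes the » character appears as the single byte 0xBB (its Latin-1 encoding), and the 64 byte gap allowed between markers is an arbitrary choice.

```python
import re

# Assumptions: "»" is the single byte 0xBB, and up to 64 arbitrary bytes may
# sit between each marker. Both choices are illustrative.
SUS_PATTERN = re.compile(rb"/F.{0,64}?JS.{0,64}?\xbb", re.DOTALL)


def find_sus_sequences(stream: bytes):
    """Return (offset, matched_bytes) for every /F ... JS ... 0xBB sequence."""
    return [(match.start(), match.group()) for match in SUS_PATTERN.finditer(stream)]


if __name__ == "__main__":
    sample = b"\x00\x01/F\x02JS\x03\xbb\x04"
    for offset, matched in find_sus_sequences(sample):
        print(offset, matched)
```

`re.DOTALL` matters here because binary streams routinely contain newline bytes, which `.` would otherwise refuse to match.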

Extract And Decode Binary Patterns: Like, say, bytes between common regular expression markers that you might want to force a decode of in a lot of different encodings.
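The idea can be sketched with the standard library alone. This is not The Pdfalyzer's implementation (which matches via YARA and consults chardet); `extract_backtick_quoted`, `force_decode`, and the short encoding list are illustrative assumptions.

```python
import re
from typing import Dict, List

# Bytes enclosed in backticks; the negated class stops at the next backtick.
BACKTICK_QUOTED = re.compile(rb"`([^`]+)`")
ENCODINGS = ["ascii", "utf-8", "utf-16", "latin-1", "cp1252"]  # illustrative selection


def extract_backtick_quoted(data: bytes) -> List[bytes]:
    """Return the bytes found between each pair of backticks."""
    return BACKTICK_QUOTED.findall(data)


def force_decode(chunk: bytes, encodings=ENCODINGS) -> Dict[str, str]:
    """Decode chunk with every encoding, replacing undecodable bytes with U+FFFD."""
    return {encoding: chunk.decode(encoding, errors="replace") for encoding in encodings}


if __name__ == "__main__":
    for chunk in extract_backtick_quoted(b"\x00`caf\xe9`\x01`ok`"):
        print(chunk, force_decode(chunk))
```

Using `errors="replace"` means every encoding produces some output, so the results can be compared side by side instead of stopping at the first decoding error.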

See stats: When all is said and done you can see some stats that may help you figure out what the character encoding may or may not be for the bytes matched by those patterns:

Now There's Even A Fancy Table To Tell You What The `chardet` Library Would Rank As The Most Likely Encoding For A Chunk Of Binary Data

Behold the beauty:

PDF Resources

3rd Party Tools

Installing Didier Stevens's PDF Analysis Tools

Stevens's tools provide comprehensive info about the contents of a PDF, are guaranteed not to trigger the rendering of any malicious content (especially pdfid.py), and have been battle tested for well over a decade. It would probably be a good idea to analyze your PDF with his tools before you start working with this one.

If you're lazy and don't want to retrieve his tools yourself there's a simple bash script to download them from his github repo and place them in a tools/ subdirectory off the project root. Just run this:

scripts/install_didier_stevens_pdf_tools.sh

If there is a discrepancy between the output of his tools and this one you should assume his tool is correct and The Pdfalyzer is wrong until you conclusively prove otherwise.

Installing The `t1utils` Font Suite

t1utils is a suite of old but battle tested apps for manipulating old Adobe font formats. You don't need it unless you're dealing with an older Type 1 or Type 2 font binary but given that those have been very popular exploit vectors in the past few years it can be extremely helpful. One of the tools in the suite, t1disasm, is particularly useful because it decrypts and decompiles Adobe Type 1 font binaries into a more human readable string representation.

There's a script to help you install the suite if you need it:

scripts/install_t1utils.sh
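Once the suite is installed, t1disasm can also be driven from Python. A hedged sketch: the helper names and the `font.pfb` / `font.txt` file names are hypothetical, and it assumes t1disasm's `t1disasm [input [output]]` calling convention.

```python
import shutil
import subprocess
from pathlib import Path
from typing import List, Optional


def build_t1disasm_command(font_path: Path, output_path: Path) -> List[str]:
    """Build the argument list: the input font file followed by the output file."""
    return ["t1disasm", str(font_path), str(output_path)]


def disassemble_font(font_path: Path, output_path: Path) -> Optional[Path]:
    """Run t1disasm if it is on the PATH; return the output path, or None if it is missing."""
    if shutil.which("t1disasm") is None:
        return None
    # check=True raises CalledProcessError if t1disasm exits with an error.
    subprocess.run(build_t1disasm_command(font_path, output_path), check=True)
    return output_path


if __name__ == "__main__":
    print(build_t1disasm_command(Path("font.pfb"), Path("font.txt")))
```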

Documentation

Official Adobe Documentation

Official Adobe PDF 1.7 Specification - Indispensable map when navigating a PDF forest.
Adobe Type 1 Font Format Specification - Official spec for Adobe's original font description language and file format. Useful if you have suspicions about malicious fonts. Type1 seems to be the attack vector of choice recently which isn't so surprising when you consider that it's a 30 year old technology and the code that renders these fonts probably hasn't been extensively tested in decades because almost no one uses them anymore outside of people who want to use them as attack vectors.
Adobe CMap and CIDFont Files Specification - Official spec for the character mappings used by Type1 fonts / basically part of the overall Type1 font specification.
Adobe Type 2 Charstring Format - Describes the newer Type 2 font operators which are also used in some multiple-master Type 1 fonts.

Other Stuff

Didier Stevens's free book about malicious PDFs - The master of the malicious PDFs wrote a whole book about how to analyze them. It's an old book but the PDF 1.7 spec was standardized in 2008 so it's still relevant.
Analyzing Malicious PDFs Cheat Sheet - Like it says on the tin. If that link fails there's a copy here in the repo.
T1Utils Github Repo - Suite of tools for manipulating Type1 fonts.
t1disasm Manual - Probably the most useful part of the T1Utils suite because it can decompile encrypted ancient Adobe Type 1 fonts into something human readable.

Contributing

One easy way of contributing is to run the script to test against all the PDFs in ~/Documents and report any issues.

Beyond that see CONTRIBUTING.md.

Glossary

reference_key - string found in a PDF node that names the property (e.g. /BaseFont or /Subtype)
address - reference_key plus a hash key or numerical array index if that's how the reference works. e.g. if node A has a reference key /Resources pointing to a dict {'/Font2': [IndirectObject(55), IndirectObject(2)]} the address of IndirectObject(55) from node A would be /Resources[/Font2][0]
tree_address - like the address but starting at the root of the tree, all concatenated
relationship - any link between nodes created by addresses/reference keys
reference - any link from node A to other nodes (outward facing relationships for node A, basically)
non_tree_relationship - any link between nodes that is not considered a parent/child tree relationship
indeterminate node - any node whose place in the tree can only be determined when the whole tree has been scanned
link node - nodes like /Dest that just contain a pointer to another node
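The address entry above can be illustrated with a short sketch that walks a nested value and builds an address for every reference it finds. The `IndirectObject` class here is a simplified stand-in for a PDF indirect reference, and `addresses` is a hypothetical helper rather than The Pdfalyzer's own code.

```python
from dataclasses import dataclass
from typing import Any, Iterator, Tuple


@dataclass(frozen=True)
class IndirectObject:
    """Stand-in for a PDF indirect reference; only the object ID is modeled."""
    idnum: int


def addresses(reference_key: str, value: Any) -> Iterator[Tuple[str, IndirectObject]]:
    """Yield (address, reference) for every reference reachable from reference_key."""
    if isinstance(value, IndirectObject):
        yield reference_key, value
    elif isinstance(value, dict):
        # Dict keys are appended as hash keys, e.g. /Resources[/Font2]
        for key, item in value.items():
            yield from addresses(f"{reference_key}[{key}]", item)
    elif isinstance(value, (list, tuple)):
        # Array positions are appended as numerical indexes, e.g. [0]
        for index, item in enumerate(value):
            yield from addresses(f"{reference_key}[{index}]", item)


if __name__ == "__main__":
    resources = {"/Font2": [IndirectObject(55), IndirectObject(2)]}
    for address, reference in addresses("/Resources", resources):
        print(address, reference.idnum)
```

For the glossary's example this yields `/Resources[/Font2][0]` for `IndirectObject(55)` and `/Resources[/Font2][1]` for `IndirectObject(2)`.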

TODO

Highlight decodes done at chardet's behest
Highlight decodes with a lot of JavaScript keywords
deal with repetitive matches
https://github.com/1Project/Scanr/blob/master/emulator/emulator.py
https://github.com/mandiant/flare-floss

[^1]: The official Adobe PDF specification calls this tree the PDF's "logical structure", which is a good example of nomenclature that does not help those who see it understand anything about what is being described. I can forgive them given that they named this thing back in the early 90s, though it does show why picking good names for things at the beginning is so important.

[^2]: An exception will be raised if there's any issue placing a node while parsing or if there are any nodes not reachable from the root of the tree at the end of parsing. If there are no exceptions then all internal PDF objects are guaranteed to exist in the tree except in these two situations, when warnings will be printed: (1) /ObjStm (object stream) is a collection of objects in a single stream that will be unrolled into its component objects; (2) /XRef cross-reference stream objects, which hold the same references as the /Trailer, are hacked in as symlinks of the /Trailer.

[^3]: Given the nature of the PDFs this tool is meant to scan, anything resembling "rendering" the document is pointedly NOT offered.

[^4]: pipx is a tool that basically runs pip install for a python package but in such a way that the installed package's requirements are isolated from your system's python packages. If you don't feel like installing pipx then pip install should work fine as long as there are no conflicts between The Pdfalyzer's required packages and those on your system already. (If you aren't using other python based command line tools then your odds of a conflict are basically 0%.)

[^5]: Technically they are SymlinkNodes, a really nice feature of AnyTree.

[^6]: At least they weren't catching it as of September 2022.

Download files

Source Distribution

pdfalyzer-1.12.0.tar.gz (84.2 kB), uploaded Oct 16, 2022

Hashes for pdfalyzer-1.12.0.tar.gz
Algorithm	Hash digest
SHA256	`bf7f38001934df0824083f67866166c6614f37a4c9eea90dc6e615650888ab9a`
MD5	`6a6b06934aafab5373deaf5190314b0f`
BLAKE2b-256	`8c6f318fe282f07b05e8381f0f87265598585a4e9c7227287f8c975107669de1`

Built Distribution

pdfalyzer-1.12.0-py3-none-any.whl (100.9 kB), uploaded Oct 16, 2022, Python 3

Hashes for pdfalyzer-1.12.0-py3-none-any.whl
Algorithm	Hash digest
SHA256	`df6510ae43f69b33a950c10e62c62b9fb6b083406ab68222e77a3ae30d0cfccc`
MD5	`0f094cc77b851b000e1f9d84e3d3be71`
BLAKE2b-256	`41c097e902514ed1780e4af6303138f108c8fb4fb2f6c1d3af0dc9f619eac173`
