
minimal shell to investigate PDF files


pdfsh


pdfsh is a utility to investigate the PDF file structure in a shell-like interface. It allows one to "mount" a PDF file and use a simple shell-like interface to navigate inside the PDF file structurally.

Technically, pdfsh is a PDF processor (a PDF reader), not a viewer: it does not render page contents.

In pdfsh, similar to a file system, the PDF file is represented as a tree. All the nodes of the tree are PDF objects.

pdfsh has its own ISO 32000-2:2020 PDF-2.0 parser.

pdfsh uses the CCITT and LZW filter implementations and the PNG predictor implementation from pdfminer.six. To minimize dependencies, I copied these implementations directly into the pdfsh code, so there is no dependency on pdfminer.six.

pdfsh assumes it is run in an ANSI-capable terminal, as it uses ANSI terminal features and colors. If you observe strange behavior, make sure the terminal emulator it runs in is ANSI compatible.

Usage

pip install pdfsh

which installs a pdfsh executable into the path.

When pdfsh is run as pdfsh <pdf_file>, the shell interface is loaded with the document at the root of the structural tree. The root node has no name and is represented by a single /.

The pdfsh shell interface has commands like ls, cd and cat. Paths can be autocompleted.

pdfsh has a simple prompt: <filename>:<current_node> $. The current node is given as a path separated by / like a UNIX filesystem path.
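
To make the tree-and-path idea concrete, here is a minimal sketch in Python. It is not pdfsh's code: the functions resolve and prompt, and the toy doc tree, are hypothetical names used only for illustration, with plain dictionaries and lists standing in for PDF objects. Addressing list items by their index is also an assumption of this sketch.

```python
def resolve(root, path):
    """Walk a nested dict/list tree along a '/'-separated path."""
    node = root
    for part in path.split("/"):
        if not part:
            # Empty parts are skipped, so "/" resolves to the root itself.
            continue
        if isinstance(node, list):
            node = node[int(part)]  # list items are addressed by index
        else:
            node = node[part]       # dictionary entries are addressed by key
    return node


def prompt(filename, current_node):
    """Format a prompt in the <filename>:<current_node> $ shape."""
    return f"{filename}:{current_node} $"


# A toy stand-in for a document tree, not real pdfsh output.
doc = {
    "trailer": {
        "Root": {"Type": "Catalog", "Pages": {"Kids": [{"Type": "Page"}]}}
    }
}
```

With this toy tree, resolve(doc, "/trailer/Root/Pages/Kids/0") returns the innermost dictionary, much as a cd to that path would make it the current node.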

Tutorial

For an introduction to PDF and a tutorial using pdfsh, please see my blog post A Minimum Complete Tutorial of Portable Document Format (PDF) with pdfsh.

Notes

pdfsh supports cross-reference tables, cross-reference streams and hybrid-reference files. However, because pdfsh constructs the cross-reference table eagerly, it reads either the cross-reference table or the cross-reference stream in a given update section, not both. Thus, an object that is visible in the cross-reference table but not in the cross-reference stream cannot be found. For more information, see ISO 32000-2:2020, 7.5.8.4, "Compatibility with applications that do not support compressed reference streams".
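
The consequence of this limitation can be illustrated with a small Python sketch. This is not pdfsh's implementation: the function lookup and the section dictionaries are hypothetical, and the choice to prefer the stream when a section has both is an assumption made to match the behavior described above.

```python
def lookup(sections, obj_num):
    """Search update sections, newest first, for an object's byte offset.

    Each section is a dict that may hold a "table" and/or a "stream"
    mapping of object number -> byte offset. Only one of the two is
    consulted per section (assumed here: the stream, when present).
    """
    for section in sections:
        if "stream" in section:
            source = section["stream"]
        else:
            source = section.get("table", {})
        if obj_num in source:
            return source[obj_num]
    return None


# A toy hybrid-reference update section: objects 1 and 2 are listed only
# in the table, object 3 only in the stream. Offsets are made up.
hybrid = [{"table": {1: 15, 2: 120}, "stream": {3: 300}}]
```

Here lookup(hybrid, 3) finds the object, while lookup(hybrid, 1) returns None because the table is never consulted for that section.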

Changes

Version numbers are in <year>.<positive_integer> format. The <positive_integer> increases monotonically within a year and resets to 1 in a new year.
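
As an illustration of this scheme, here is a short Python sketch. The helpers parse_version and next_version are hypothetical and not part of pdfsh. The sketch assumes the number grows by one per release, although the scheme only requires it to increase (numbers may be skipped).

```python
def parse_version(text):
    """Split '<year>.<positive_integer>' into a comparable (year, number) tuple."""
    year, number = text.split(".")
    return int(year), int(number)


def next_version(current, release_year):
    """Return the version that would follow `current` for a release in `release_year`."""
    year, number = parse_version(current)
    if release_year == year:
        return f"{year}.{number + 1}"  # same year: the number increases
    return f"{release_year}.1"         # new year: the number resets to 1
```

Comparing the parsed tuples rather than the raw strings keeps the ordering correct once the number reaches two digits, for example 2024.10 sorts after 2024.4.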

2024.4

  • cross-reference streams support
  • object streams support
  • --version option added
  • migrated from setup.py to pyproject.toml

2024.3 is skipped

2024.2

  • first public release

2024.1

  • initial test release, not for public use

External Licenses

pdfminer.six

pdfminer.six: Copyright (c) 2004-2016 Yusuke Shinyama <yusuke at shinyama dot jp>

  • ccitt.py and lzw.py are part of pdfminer.six
  • utils.py contains one function (apply_png_predictor) taken from the file of the same name (utils.py) in pdfminer.six.

License

This program is free software: you can redistribute it and/or modify it under the terms of the GNU General Public License as published by the Free Software Foundation, either version 3 of the License, or (at your option) any later version.

This program is distributed in the hope that it will be useful, but WITHOUT ANY WARRANTY; without even the implied warranty of MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE. See the GNU General Public License for more details.

You should have received a copy of the GNU General Public License along with this program. If not, see https://www.gnu.org/licenses/.
