pdftitle

pdftitle is a small utility to extract the title of a PDF article.

If the filenames of your PDF articles do not reveal their content, you can use this utility to extract the titles and, if you want, rename the files accordingly. By default, this utility does not look at the metadata of a PDF file. It is particularly suited to PDF files of scientific articles.

pdftitle uses the pdfminer.six project to parse PDF documents, with its own implementation of the PDF device and PDF interpreter. The names of the variables and the calculations in the source code are very similar to how they are given in the PDF spec (http://www.adobe.com/content/dam/Adobe/en/devnet/acrobat/pdfs/pdf_reference_1-7.pdf).

Installation

pip install pdftitle

Usage

pdftitle -p <pdf-file> returns the title of the document if found.

$ pdftitle -p knuth65.pdf 
On the Translation of Languages from Left to Right
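If you want to call the command from a script, a minimal Python sketch could look like the following. It only uses the -p and -c flags described here; the helper names are hypothetical, and treating a nonzero exit status or empty output as "no title found" is an assumption, not documented behavior.

```python
import subprocess


def pdftitle_command(pdf_path, rename=False):
    """Build the argument list for the pdftitle command line tool."""
    cmd = ["pdftitle", "-p", str(pdf_path)]
    if rename:
        cmd.append("-c")  # rename the file to the extracted title
    return cmd


def extract_title(pdf_path):
    """Run pdftitle and return the printed title, or None on failure.

    Assumption: a nonzero exit status or empty output means no title was found.
    """
    result = subprocess.run(
        pdftitle_command(pdf_path), capture_output=True, text=True, check=False
    )
    if result.returncode != 0:
        return None
    return result.stdout.strip() or None
```

extract_title requires pdftitle to be installed and on the PATH.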

pdftitle -p <pdf-file> -c renames the document file to the title of the document, if found, with the non-ASCII characters removed. This command prints the new file name.

$ pdftitle -p knuth65.pdf -c
on_the_translation_of_languages_from_left_to_right.pdf
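The mapping from title to file name in the example above can be approximated as follows. This is a sketch only: title_to_filename is a hypothetical helper, and the lowercasing and underscore rules are inferred from the single example shown, so the actual rules in pdftitle may differ.

```python
import re


def title_to_filename(title, extension=".pdf"):
    """Turn a title into a file name similar to the one in the example."""
    # Drop every character outside ASCII, as described for the -c option.
    ascii_only = title.encode("ascii", errors="ignore").decode("ascii")
    # Assumption: each run of non-alphanumeric characters becomes one underscore.
    slug = re.sub(r"[^A-Za-z0-9]+", "_", ascii_only).strip("_").lower()
    return slug + extension


print(title_to_filename("On the Translation of Languages from Left to Right"))
```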

For debugging purposes, more info can be seen in verbose mode with -v (logging level INFO) or -vv (logging level DEBUG).

The program follows this procedure:

  1. If any of the --use-metadata options is given, the metadata stream (for dc:title) and/or the document information dictionary (for Title) are checked. If there is a metadata entry, it is used as the title and the document is not checked further. See the Metadata section for more information.

  2. Every text object on the first page (or on the page given with --page-number) of the PDF document is checked.

  3. If the font and font size are the same in consecutive text objects, their content is grouped as one larger text.

  4. The selected algorithm is applied to extract the title. See the Algorithms section for more information.
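Step 3 can be illustrated with a short Python sketch. The TextObject type and group_text_objects function are hypothetical and are not taken from the pdftitle source; joining the pieces without a separator is also an assumption.

```python
from itertools import groupby
from typing import NamedTuple


class TextObject(NamedTuple):
    font: str
    size: float
    text: str


def group_text_objects(objects):
    """Merge runs of consecutive objects that share the same font and size."""
    grouped = []
    # groupby only merges neighbours, so a later run with the same font
    # stays a separate group.
    for (font, size), run in groupby(objects, key=lambda o: (o.font, o.size)):
        grouped.append(TextObject(font, size, "".join(o.text for o in run)))
    return grouped


objects = [
    TextObject("Times-Bold", 18.0, "On the "),
    TextObject("Times-Bold", 18.0, "Translation"),
    TextObject("Times-Roman", 10.0, "Donald E. Knuth"),
]
for group in group_text_objects(objects):
    print(group.size, group.text)
```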

The assumption is that the title of the document is probably the text with the largest (or sometimes second largest, etc.) font size, possibly on the first page, and that it is the text closest to the top of the page.

One problem is that not all documents use a space character between words, so it is difficult to find word boundaries when spaces are missing. There is a recovery procedure for this, which may work.

A PDF may contain a character that does not exist in the font. In that case you will receive an error, and you can use the --replace-missing-char option to work around the problem.

Sometimes the found title has strange casing (for example, the first letter is lowercase but the last is uppercase). This can be corrected with the -t option.

The title may include a ligature (a single character/glyph used for multiple characters/glyphs). Starting with 0.12, the Latin ligatures defined in Unicode (ff, fi, fl, ffi, ffl, ft, st) are converted to individual characters (e.g. the fi ligature is changed to the f and i characters). This behavior can be disabled with --do-not-convert-ligatures. The ligatures of other languages defined in Unicode (Armenian and Hebrew) are not converted.
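As an illustration of this kind of conversion, the following Python sketch replaces the seven Latin ligature code points U+FB00 to U+FB06 with plain letters. The table and function name are hypothetical and not copied from pdftitle. Note that Unicode names U+FB05 LATIN SMALL LIGATURE LONG S T; it is mapped to "ft" here only to follow the list above.

```python
# Latin ligatures in the Unicode Alphabetic Presentation Forms block.
LATIN_LIGATURES = {
    "\ufb00": "ff",
    "\ufb01": "fi",
    "\ufb02": "fl",
    "\ufb03": "ffi",
    "\ufb04": "ffl",
    "\ufb05": "ft",  # assumption: follows the list above (Unicode: long s + t)
    "\ufb06": "st",
}
_LIGATURE_TABLE = str.maketrans(LATIN_LIGATURES)


def convert_ligatures(text):
    """Replace Latin ligature characters with their individual letters."""
    return text.translate(_LIGATURE_TABLE)


print(convert_ligatures("E\ufb03cient \ufb01le handling"))
```

Characters outside this table, such as the Armenian and Hebrew ligatures, pass through unchanged.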

The reason metadata is not used by default is that the title entry in the metadata of many documents does not contain the actual title (but an identifier etc.).

Algorithms

There are three algorithms at the moment:

  • original: finds the maximum font size, then finds the topmost (minimum Y) blocks with this font size and joins them.

  • max2: finds the maximum font size, then adds the block with the maximum font size first, then the block with the second largest font size, and continues adding blocks of either size until a block with a different font size is found. The block order is the natural order in the PDF; no x-y sorting is performed.

  • eliot: similar to original, but can merge blocks with an arbitrary number of font sizes, ordered by size. The block order is y first, then x. The font sizes to use are given with the --eliot-tfs option as indexes into the font sizes ordered from the largest to the smallest, so --eliot-tfs 0,1 means the largest and the second largest font sizes.

An algorithm is selected with the -a option.
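To make the descriptions more concrete, here is a simplified Python sketch of the idea behind original and of how the --eliot-tfs indexes select font sizes. The Block type and all function names are hypothetical, and the real implementations may differ in details such as how blocks are joined or which blocks count as topmost.

```python
from typing import NamedTuple


class Block(NamedTuple):
    text: str
    size: float
    x: float
    y: float  # assumption: a smaller y is closer to the top of the page


def title_original(blocks):
    """Join the topmost blocks that use the largest font size."""
    largest = max(b.size for b in blocks)
    candidates = [b for b in blocks if b.size == largest]
    top = min(b.y for b in candidates)
    return " ".join(b.text for b in candidates if b.y == top)


def sizes_for_indices(blocks, indices):
    """Map indexes such as (0, 1) to the largest and second largest sizes."""
    ordered = sorted({b.size for b in blocks}, reverse=True)
    return [ordered[i] for i in indices if i < len(ordered)]


def title_eliot_like(blocks, indices=(0, 1)):
    """Simplified: join all blocks with the selected sizes, ordered by y then x."""
    wanted = set(sizes_for_indices(blocks, indices))
    chosen = sorted((b for b in blocks if b.size in wanted), key=lambda b: (b.y, b.x))
    return " ".join(b.text for b in chosen)


blocks = [
    Block("A Study of", 24.0, 50, 100),
    Block("Title Extraction", 18.0, 50, 130),
    Block("Jane Doe", 12.0, 50, 160),
]
print(title_original(blocks))
print(title_eliot_like(blocks, (0, 1)))
```

With this sample data, the first function keeps only the 24-point line, while the second also includes the 18-point line, which is the case a two-size title needs.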

Metadata

PDF has two places to keep the title of a document in its metadata. The old method is the document information dictionary. The new method is a metadata stream. pdftitle supports both, with the --use-document-information-dictionary and --use-metadata-stream options. The --use-metadata (or -m) option enables both, giving priority to the new method, the metadata stream. These are not enabled by default because, in my experience, many documents have a document identifier in the metadata rather than the actual title.
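The priority rule can be written down as a tiny Python sketch. The function and its parameters are hypothetical; it assumes the two candidate titles have already been read from the PDF and only shows the order in which they are considered.

```python
def title_from_metadata(stream_title, info_title, use_stream=True, use_info=True):
    """Pick a metadata title, preferring the metadata stream (dc:title)."""
    if use_stream and stream_title:
        return stream_title
    # Fall back to the Title entry of the document information dictionary.
    if use_info and info_title:
        return info_title
    return None  # no usable metadata; the page content would be checked instead


print(title_from_metadata("Stream Title", "Info Title"))
```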

Logging

Since v0.12, pdftitle uses the standard Python logging module and, by default, prints to stderr at the INFO level (with -v) and the DEBUG level (with -vv).
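A comparable setup with the standard library looks roughly like this. It is a sketch, not the pdftitle code: the function names are hypothetical, and using WARNING when no -v flag is given is an assumption.

```python
import logging
import sys


def level_for_verbosity(count):
    """Translate the number of -v flags into a logging level."""
    if count >= 2:
        return logging.DEBUG
    if count == 1:
        return logging.INFO
    return logging.WARNING  # assumption: default level without -v


def configure_logging(count):
    """Send log records to stderr at the level implied by the -v count."""
    logging.basicConfig(
        level=level_for_verbosity(count),
        stream=sys.stderr,
        format="%(levelname)s: %(message)s",
    )
```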

Contributing

The best way to help development is to create an issue and discuss it there first.

Unless a change has already been discussed and decided, please do not create pull requests directly, as they can be difficult to integrate.

Contributors

The contributors of the merged pull requests are shown in GitHub's contributors page.

Some of the pull requests I could not merge but implemented fully or partially in different ways, so I would like to give them credit here:

Changelog

See CHANGELOG.md.

Development

See DEVELOPMENT.md.
