pdftitle is a small utility to extract the title of a PDF article.

Development Status
- 3 - Alpha
Environment
- Console
Intended Audience
- Science/Research
License
- OSI Approved :: GNU General Public License v3 (GPLv3)
Programming Language
- Python :: 3.6
Topic
- Utilities

Project description

pdftitle

When you have PDF articles whose filenames do not reveal their content, you can use this utility to extract the title and, if you want, rename the files. This utility does not look at the metadata of a PDF file, since the title in the metadata can be empty. It works for ~80% of the PDFs I have and is especially suited to PDF files of scientific articles.

Install with pip install pdftitle.

Running pdftitle -p <pdf-file> returns the estimated title of the document. Much more information is shown in verbose mode with -v.
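
Because the title is returned on the command line, renaming files can be scripted. The following is a minimal sketch, assuming pdftitle is installed and on the PATH and that it prints only the title to standard output. The helpers safe_filename and rename_by_title are hypothetical names used for illustration; they are not part of pdftitle.

```python
import re
import subprocess
from pathlib import Path


def safe_filename(title: str, max_length: int = 120) -> str:
    """Turn an extracted title into a conservative file name stem."""
    # Replace characters that are unsafe in file names on common platforms.
    cleaned = re.sub(r'[\\/:*?"<>|]+', " ", title)
    # Collapse runs of whitespace and trim both ends.
    cleaned = " ".join(cleaned.split())
    return cleaned[:max_length].rstrip()


def rename_by_title(pdf_path: Path) -> Path:
    """Rename pdf_path after the title printed by pdftitle (assumed on stdout)."""
    result = subprocess.run(
        ["pdftitle", "-p", str(pdf_path)],
        capture_output=True, text=True, check=True,
    )
    stem = safe_filename(result.stdout.strip())
    if not stem:
        return pdf_path  # nothing usable was extracted; keep the old name
    target = pdf_path.with_name(stem + ".pdf")
    return pdf_path.rename(target)
```

Since the heuristic does not work for every PDF, it is worth checking the extracted titles before renaming a whole directory.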

Currently, it uses the following heuristic:

Look at every text object on the first page of the PDF document
If the font and font size are the same in consecutive text objects, group their content as one
Find the groups with the maximum font size
If more than one group is found, select the one closest to the top of the page
The title is in this group

So the assumption is that the title of the document is the text with the largest font size on the first page and, if several texts share that size, the one closest to the top of the page.
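
The heuristic can be sketched in a few lines of Python. This is an illustration under stated assumptions, not the code used by pdftitle: the TextObject record and the guess_title function are hypothetical, and the vertical position is assumed to be stored as a distance from the top of the page (smaller means higher up).

```python
from itertools import groupby
from typing import List, NamedTuple, Optional


class TextObject(NamedTuple):
    text: str
    font: str
    size: float
    top: float  # assumed: distance from the top of the page


def guess_title(objects: List[TextObject]) -> Optional[str]:
    """Pick the largest-font group that is closest to the top of the page."""
    if not objects:
        return None
    # Merge consecutive objects that share the same font and font size.
    groups = []
    for (_font, size), run in groupby(objects, key=lambda o: (o.font, o.size)):
        members = list(run)
        text = "".join(o.text for o in members)
        groups.append((size, min(o.top for o in members), text))
    # Keep only the groups with the maximum font size.
    largest = max(size for size, _, _ in groups)
    candidates = [g for g in groups if g[0] == largest]
    # Among those, choose the group closest to the top of the page.
    _, _, title = min(candidates, key=lambda g: g[1])
    return title


page = [
    TextObject("Journal of Examples", "Times", 9, 20),
    TextObject("A Study of ", "Helvetica-Bold", 18, 80),
    TextObject("Title Extraction", "Helvetica-Bold", 18, 100),
    TextObject("Jane Doe", "Times", 11, 130),
    TextObject("1 Introduction", "Helvetica-Bold", 18, 300),
]
print(guess_title(page))  # A Study of Title Extraction
```

In this made-up page, the section heading uses the same font size as the title, so the tie is broken by position.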

One problem is that not all documents use a space character between words, so it is difficult to find word boundaries when no space is used.
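
One common way to recover word boundaries in such cases is to compare the horizontal gap between neighbouring glyphs with the font size. The sketch below illustrates that general idea only; it is not necessarily what pdftitle does. The Glyph tuple, the join_with_gaps function and the ratio of 0.2 are assumptions chosen for illustration.

```python
from typing import List, Tuple

Glyph = Tuple[str, float, float]  # (character, x_start, x_end), hypothetical layout


def join_with_gaps(glyphs: List[Glyph], font_size: float, ratio: float = 0.2) -> str:
    """Insert a space wherever the gap between glyphs exceeds ratio * font_size."""
    pieces = []
    prev_end = None
    for char, x_start, x_end in glyphs:
        # A gap wider than the threshold is treated as a word boundary.
        if prev_end is not None and x_start - prev_end > ratio * font_size:
            pieces.append(" ")
        pieces.append(char)
        prev_end = x_end
    return "".join(pieces)


glyphs = [("a", 0, 5), ("n", 5.5, 10.5), ("o", 14, 19), ("x", 19.5, 24.5)]
print(join_with_gaps(glyphs, font_size=10))  # an ox
```

The threshold is a trade-off: a small ratio splits words that merely have wide letter spacing, while a large one merges words that are set tightly.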

The following option can be specified on the command line:

--replace-missing-char: if a glyph (i.e. the look of the character a) cannot be mapped to a character symbol (i.e. the character a), an exception is normally raised. If you want the glyph to be replaced with something instead of raising an exception, specify the replacement here.
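
The behaviour described for --replace-missing-char can be pictured with the following sketch. It is only an illustration of the described semantics, not pdftitle's implementation: decode_glyphs, the integer glyph codes and the to_unicode dictionary are all hypothetical.

```python
from typing import Dict, Iterable, Optional


def decode_glyphs(
    codes: Iterable[int],
    to_unicode: Dict[int, str],
    replacement: Optional[str] = None,
) -> str:
    """Map glyph codes to characters, optionally replacing unmapped ones."""
    chars = []
    for code in codes:
        if code in to_unicode:
            chars.append(to_unicode[code])
        elif replacement is not None:
            # A replacement was given: use it instead of failing.
            chars.append(replacement)
        else:
            # No replacement given: fail, as in the default behaviour.
            raise KeyError(f"glyph code {code} has no character mapping")
    return "".join(chars)


mapping = {1: "P", 2: "D", 3: "F"}
print(decode_glyphs([1, 2, 9, 3], mapping, replacement="?"))  # PD?F
```

Without a replacement, the same call raises an exception at the first unmapped glyph.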

pdftitle uses the pdfminer.six project to parse PDF documents, with its own implementation of the PDF device and PDF interpreter. The names of the variables and the calculations in the source code are very similar to how they are given in the PDF spec (http://www.adobe.com/content/dam/Adobe/en/devnet/acrobat/pdfs/pdf_reference_1-7.pdf).

changes

0.5:

fixed install problem with 0.4
pdfminer version updated.

0.4:

Merged #e4bb0d6 to detect and remove duplicate spaces in the returned title. Contributed by Jakob Guldberg Aaes (https://github.com/jakob1379).

0.3:

Merged #f65ff4c and #f5c60c0 for identifying spaces when no space char is used. Contributed by Fabien Couthouis (https://github.com/Fabien-Couthouis).

0.2:

changed version string to major.minor format.
pdftitle can be used as a library in a project; use the get_title_from_io method
added chardet as a dependency
algorithm is changed but there are problems with finding the word boundaries

Release history

0.20 (Feb 17, 2025)
0.19 (Feb 17, 2025), yanked; reason: use 0.20 instead
0.18 (Feb 15, 2025)
0.17 (Dec 3, 2024)
0.16 (Nov 17, 2024)
0.15 (Oct 18, 2024)
0.14 (Oct 13, 2024)
0.13 (Oct 13, 2024)
0.12 (Oct 9, 2024)
0.11 (Aug 9, 2021)
0.10 (Aug 9, 2021)
0.9 (Apr 10, 2021)
0.8 (Oct 8, 2020)
0.7 (Apr 5, 2020)
0.6 (Apr 5, 2020)
0.5 (Dec 20, 2019), this version
0.4 (Nov 29, 2019)
0.3 (Mar 29, 2019)
0.2 (Mar 20, 2019)

Download files

Source Distribution

pdftitle-0.5.tar.gz (6.8 kB)

File metadata

Upload date: Dec 20, 2019
Size: 6.8 kB
Tags: Source
Uploaded using Trusted Publishing? No
Uploaded via: twine/3.1.1 pkginfo/1.5.0.1 requests/2.22.0 setuptools/42.0.2 requests-toolbelt/0.9.1 tqdm/4.40.2 CPython/3.7.5

File hashes

Hashes for pdftitle-0.5.tar.gz
Algorithm	Hash digest
SHA256	`201412810d331bbe8a60763e9586bafbf27ce823a87a3c3b1bd0f1ac2e1047db`
MD5	`fe17272193c25076c87543c540ea67ef`
BLAKE2b-256	`41c538a7260ea08d40008eb9b442f3c0d18efb9186f297d293d2828bfaa49b2b`
