Documentation for ruamel.pdfdouble

Project description

This package provides the pdfdbl command:

pdfdbl scan dir1 dir2

This walks the directories given as arguments and, for each PDF file found, creates a hash based on (in order):

- metadata if unique
- images if the number of images
- text

This assumes that pdfinfo, pdfimages and pdftotext from the poppler-utils package are available.
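
As a rough illustration of how output from the poppler tools can be turned into hashes, here is a minimal sketch. It is not the package's own code: the helper names (digest, tool_output, metadata_hash, text_hash) and the choice of SHA-256 are assumptions made for the example.

```python
import hashlib
import subprocess


def digest(data):
    """Return the hex SHA-256 digest of a bytes object (algorithm is an assumption)."""
    return hashlib.sha256(data).hexdigest()


def tool_output(args):
    """Run a poppler-utils command and return its raw standard output as bytes."""
    return subprocess.check_output(args)


def metadata_hash(pdf_path):
    # pdfinfo prints the document metadata (title, author, dates, page count, ...)
    return digest(tool_output(["pdfinfo", pdf_path]))


def text_hash(pdf_path):
    # Passing "-" as the output file makes pdftotext write the text to stdout
    return digest(tool_output(["pdftotext", pdf_path, "-"]))
```

Calling metadata_hash("some.pdf") or text_hash("some.pdf") requires the poppler tools to be on the PATH; digest itself works on any bytes.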

A “database” is built up in ~/.config/pdfdbl/pdf.lst against which further scans are tested.
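
The layout of pdf.lst is not described here. The sketch below assumes a simple line-oriented file with one "hash path" entry per line, which is only a guess at the format; the function names are hypothetical. A format like this would also be unable to store paths that contain newlines.

```python
import os

DB_PATH = os.path.expanduser("~/.config/pdfdbl/pdf.lst")


def load_db(path=DB_PATH):
    """Read 'hash<space>path' lines into a dict; a missing file gives an empty dict."""
    known = {}
    if not os.path.exists(path):
        return known
    with open(path) as fp:
        for line in fp:
            line = line.rstrip("\n")
            if not line:
                continue
            # Split on the first space only, so paths may contain spaces
            file_hash, pdf_path = line.split(" ", 1)
            known[file_hash] = pdf_path
    return known


def check_and_add(known, file_hash, pdf_path):
    """Return the path already stored for this hash, or None after recording it."""
    if file_hash in known:
        return known[file_hash]
    known[file_hash] = pdf_path
    return None


def save_db(known, path=DB_PATH):
    """Write the dict back as one 'hash path' line per entry (assumed layout)."""
    dirname = os.path.dirname(path)
    if dirname and not os.path.isdir(dirname):
        os.makedirs(dirname)
    with open(path, "w") as fp:
        for file_hash, pdf_path in sorted(known.items()):
            fp.write("{0} {1}\n".format(file_hash, pdf_path))
```

A scan would then call check_and_add for every hashed file and report a duplicate whenever it returns a path instead of None.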

Removing markings

In ruamel/pdfdouble/pdfdouble.py there are two methods that can be enhanced to filter out markings in the PDF that make otherwise virtually identical files produce different hashes.

For text, the method PdfData.filter_for_marking should be extended to remove any markings from the string that is its argument and return the result.
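
A text filter of this kind could look roughly like the following sketch. It is written as a standalone function rather than as the actual PdfData method, and the marking patterns are hypothetical examples; real ones depend on where the PDFs come from.

```python
import re

# Hypothetical marking patterns, one per kind of per-download stamp
MARKING_PATTERNS = [
    re.compile(r"^Downloaded by .* on .*$", re.MULTILINE),
    re.compile(r"^Licensed to .*$", re.MULTILINE),
]


def filter_for_marking(text):
    """Return text with every line that matches a marking pattern blanked out."""
    for pattern in MARKING_PATTERNS:
        # The matched line content is removed; its line break stays in place
        text = pattern.sub("", text)
    return text
```

With these patterns, two extractions that differ only in the "Downloaded by ..." line give the same filtered string and therefore the same text hash.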

For scanned images, the method PdfData.process_image_and_update needs to be enhanced, e.g. by cutting off the image's bottom and top X lines, and by removing any gray background text by setting all black pixels to white. This method needs to update the hash passed in by calling its .update() method with the filtered data.
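
The sketch below shows one way such an image filter might work. It is not the real method: the signature, the representation of an image as a list of grayscale rows, and the parameters crop_lines and gray_threshold are all assumptions. It also interprets the pixel step as whitening every pixel at or above a gray threshold, so that light background text disappears while dark content is kept.

```python
import hashlib

WHITE = 255


def process_image_and_update(rows, file_hash, crop_lines=2, gray_threshold=128):
    """Filter a grayscale image and feed the result into file_hash.

    rows is assumed to be a list of bytes/bytearray scan lines with values 0..255.
    """
    # Drop crop_lines scan lines from the top and from the bottom
    kept = rows[crop_lines:len(rows) - crop_lines]
    for row in kept:
        # Pixels at or above the threshold become white; darker pixels are kept
        filtered = bytearray(
            WHITE if px >= gray_threshold else px for px in bytearray(row)
        )
        file_hash.update(bytes(filtered))
```

A caller would create a hash object, for example hashlib.sha256(), pass it in for each image of a PDF, and read hexdigest() afterwards.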

Restrictions

The current “database” cannot handle paths that contain newlines.

This utility is currently Python 2.6/2.7 only.

Source Distribution

ruamel.pdfdouble-0.1.tar.gz (6.4 kB)

File hashes

Hashes for ruamel.pdfdouble-0.1.tar.gz
Algorithm Hash digest
SHA256 3918c12ee6a922b70c3e0755a0b83afd0ad3ae6e42cf73d125ac85fcfe23130b
MD5 09ae32a20a88c9f613332bfb318102b1
BLAKE2b-256 25ea8fcddf559e5d9f76dbfc850e6a027a5e6bec3d3a552262fcf4a8c9aca6ed
