archminer


archminer is a command-line tool that extracts text from a specific PDF, referred to as 'Marginados'. The text in the Marginados PDF was created through OCR, as the PDF is a scan. This tool tries to extract the text in reading order.

The tool makes assumptions specific to the Marginados PDF:

  • only certain pages are relevant
  • the text is laid out in two columns
  • each (relevant) page has a header line and a page number in the footer line
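
To illustrate what "reading order" means for a two-column page, here is a minimal sketch. It is not archminer's actual implementation: the `Word` class, the `reading_order` function, the split at the middle of the page, and the `line_tolerance` parameter are all assumptions made for this example.

```python
from dataclasses import dataclass


@dataclass
class Word:
    """Hypothetical OCR word box; y grows downward from the top of the page."""
    text: str
    x: float  # left edge
    y: float  # top edge


def _join(line):
    # Words on the same line are emitted left to right.
    return " ".join(w.text for w in sorted(line, key=lambda w: w.x))


def reading_order(words, page_width, line_tolerance=3.0):
    """Return text lines: the left column top to bottom, then the right column."""
    middle = page_width / 2  # assumption: the column gap sits at the page centre
    left = [w for w in words if w.x < middle]
    right = [w for w in words if w.x >= middle]
    lines = []
    for column in (left, right):
        current = []
        for word in sorted(column, key=lambda w: (w.y, w.x)):
            # Start a new line once a word sits clearly below the current line.
            if current and word.y - current[0].y > line_tolerance:
                lines.append(_join(current))
                current = []
            current.append(word)
        if current:
            lines.append(_join(current))
    return lines


words = [
    Word("right1", 320, 100),
    Word("left2", 50, 120),
    Word("left1a", 50, 100),
    Word("left1b", 100, 101),
    Word("right2", 320, 121),
]
lines = reading_order(words, page_width=600)
```

Sorting all words by vertical position alone would interleave the two columns, which is why the sketch splits the page into columns first.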

Installation

pip install archminer

Usage

As mentioned above, this tool has expectations about the input PDF.

The intended invocation looks like this:

archminer --from-page 6 --to-page 590 Marginados_all.pdf top-bottom remove-top-bottom

This reads pages 6 through 590 from the PDF, determines the coordinates of the top (header) and bottom (footer) lines, then removes those lines and writes the text contents of the pages in reading order to Marginados_all.txt (because we did not specify an output file).
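
The two steps named in the command (finding the header and footer coordinates, then dropping those lines) can be sketched as follows. This is an illustration under stated assumptions, not archminer's code: the `(y, text)` tuple representation, the function names `find_top_bottom` and `drop_top_bottom`, and the `tolerance` parameter are hypothetical.

```python
def find_top_bottom(page_lines):
    """Return the y coordinates of the topmost and bottommost line of one page.

    page_lines is a list of (y, text) tuples, with y growing downward.
    """
    ys = [y for y, _ in page_lines]
    return min(ys), max(ys)


def drop_top_bottom(page_lines, top, bottom, tolerance=2.0):
    """Keep only the lines that are not at the header or footer coordinate."""
    return [
        (y, text)
        for y, text in page_lines
        if abs(y - top) > tolerance and abs(y - bottom) > tolerance
    ]


# Placeholder page: a header line, two body lines, and a page number.
page = [
    (40.0, "HEADER"),
    (80.0, "first body line"),
    (95.0, "second body line"),
    (780.0, "17"),
]
top, bottom = find_top_bottom(page)
body = drop_top_bottom(page, top, bottom)
```

Keeping detection and removal as separate steps mirrors the two arguments in the command above, and lets the detected coordinates be inspected before any text is discarded.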

See archminer --help for the full usage.

License

archminer is distributed under the terms of the GPLv3 license or any later version.
