A Python CLI tool designed to assist with literature research by visualizing the occurrence of words in PDF documents in a microarray format.

PDFMicroarray

Overview

PDFMicroarray is a Python CLI tool designed to assist with literature research for scientific papers and books.

It extracts text from multiple sources within PDF documents, including:

  • Plain text
  • Text from images (through OCR)
  • Text from embedded diagrams (through page rendering and OCR)

and stores the extracted text in a designated output directory.

The processed text can then be analyzed using the Levenshtein distance to detect occurrences of specified words, including near matches such as OCR misreadings. These occurrences are then plotted in a microarray format.
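To illustrate how Levenshtein-based matching can find words despite small OCR errors, here is a minimal, dependency-free sketch. It is not the tool's actual implementation (which relies on thefuzz); the function names and the `max_distance` threshold are assumptions made for this example.

```python
def levenshtein(a: str, b: str) -> int:
    """Return the edit distance between a and b (insert, delete, substitute)."""
    if len(a) < len(b):
        a, b = b, a  # keep the shorter string in the inner loop
    previous = list(range(len(b) + 1))
    for i, ch_a in enumerate(a, start=1):
        current = [i]
        for j, ch_b in enumerate(b, start=1):
            cost = 0 if ch_a == ch_b else 1
            current.append(min(
                previous[j] + 1,         # deletion
                current[j - 1] + 1,      # insertion
                previous[j - 1] + cost,  # substitution
            ))
        previous = current
    return previous[-1]


def count_fuzzy_occurrences(text: str, word: str, max_distance: int = 1) -> int:
    """Count tokens within max_distance edits of word (case-insensitive).

    max_distance is a hypothetical threshold chosen for this sketch.
    """
    target = word.lower()
    tokens = (t.strip(".,;:!?()[]\"'") for t in text.lower().split())
    return sum(1 for t in tokens if t and levenshtein(t, target) <= max_distance)


sample = "The protien binds to protein; proteins vary."
print(count_fuzzy_occurrences(sample, "protein"))                  # strict threshold
print(count_fuzzy_occurrences(sample, "protein", max_distance=2))  # looser threshold
```

A larger threshold tolerates more OCR noise but also raises the risk of counting unrelated words, so the right value depends on the quality of the scans.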

Installation

Tesseract is required for this CLI tool. Please follow the installation instructions for your platform.

pip install pipx
pipx install pdf-microarray

Usage

mkdir processed
pdf-microarray process -i documents -o processed
pdf-microarray analyze -i processed -w words.txt -o data.csv
pdf-microarray plot -i data.csv -o plot.png

The words in words.txt should be separated by newlines. If multiple words are on the same line, they are treated as a group, and only occurrences of all words in the group are counted.
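The word-list convention can be sketched as follows. This is an illustration only: the helper names are hypothetical, matching here is exact rather than fuzzy, and the unit of text a group is checked against (page, segment, or whole document) is an assumption, since it is not stated above.

```python
def parse_word_groups(lines):
    """Turn the lines of a word list into groups, one group per non-empty line."""
    groups = []
    for line in lines:
        words = line.split()
        if words:  # skip blank lines
            groups.append(words)
    return groups


def group_occurs(text: str, group) -> bool:
    """True only if every word of the group appears in the text.

    Uses exact, case-insensitive token matching for simplicity.
    """
    tokens = set(text.lower().split())
    return all(word.lower() in tokens for word in group)


groups = parse_word_groups(["kinase\n", "\n", "membrane protein\n"])
print(groups)
print(group_occurs("the membrane contains a protein", groups[1]))
print(group_occurs("the membrane is thin", groups[1]))
```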

Example

Technical details

The library uses document segmentation and multithreading to speed up the extraction process, so that even large books in PDF form can be parsed within a few minutes.
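The general idea of segmenting a document and processing the segments on several threads can be sketched with the standard library. This is a simplified illustration, not the tool's code: `extract_segment` is a placeholder for real text extraction, and the segment size and worker count are arbitrary assumptions.

```python
from concurrent.futures import ThreadPoolExecutor


def split_into_segments(page_count: int, segment_size: int):
    """Split pages 0..page_count-1 into consecutive ranges of at most segment_size pages."""
    return [
        range(start, min(start + segment_size, page_count))
        for start in range(0, page_count, segment_size)
    ]


def extract_segment(pages):
    """Placeholder for real extraction; returns one string per page."""
    return [f"text of page {n}" for n in pages]


def extract_document(page_count: int, segment_size: int = 50, workers: int = 4):
    """Extract all segments concurrently and return page texts in page order."""
    segments = split_into_segments(page_count, segment_size)
    with ThreadPoolExecutor(max_workers=workers) as pool:
        # map() yields results in input order, so page order is preserved
        results = pool.map(extract_segment, segments)
        return [text for segment in results for text in segment]


texts = extract_document(120)
print(len(texts), texts[0], texts[-1])
```

Threads are a reasonable fit when most of the time is spent waiting on an external OCR process, because the Python interpreter is idle during that wait.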

The library uses PyMuPDF for PDF page rendering, pytesseract for OCR, and thefuzz to calculate the Levenshtein distance.

Contributing

Contributions to PDFMicroarray are welcome! Please feel free to fork the repository, make changes, and submit pull requests. For major changes, please open an issue first to discuss what you would like to change.

License

Distributed under the GNU General Public License v3.0. See LICENSE for more information.
