
A Python CLI tool designed to assist with literature research by visualizing the occurrence of words in PDF documents in a microarray format.


PDFMicroarray

Overview

PDFMicroarray is a Python CLI tool designed to assist with literature research for scientific papers and books.

It extracts text from multiple sources within PDF documents, including:

  • Plain text
  • Text from images (through OCR)
  • Text from embedded diagrams (through page rendering and OCR)

and stores the extracted text in a designated output directory.

The processed text can then be analyzed using the Levenshtein distance to detect occurrences of specified words, so that near matches such as OCR misreadings are also found. The occurrences are then plotted in a microarray format.
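To illustrate the idea of distance-based matching, here is a minimal sketch in plain Python. It is not the tool's actual implementation (which relies on thefuzz); the function names `levenshtein` and `count_fuzzy_occurrences` and the `max_distance` threshold are assumptions made for this example.

```python
import re


def levenshtein(a: str, b: str) -> int:
    """Return the minimum number of single-character edits turning a into b."""
    # Keep the shorter string in the inner loop to use less memory.
    if len(a) < len(b):
        a, b = b, a
    previous = list(range(len(b) + 1))
    for i, ch_a in enumerate(a, start=1):
        current = [i]
        for j, ch_b in enumerate(b, start=1):
            cost = 0 if ch_a == ch_b else 1
            current.append(min(
                previous[j] + 1,         # delete ch_a
                current[j - 1] + 1,      # insert ch_b
                previous[j - 1] + cost,  # substitute (or keep) the character
            ))
        previous = current
    return previous[-1]


def count_fuzzy_occurrences(text: str, word: str, max_distance: int = 1) -> int:
    """Count tokens in text within max_distance edits of word (hypothetical helper)."""
    target = word.lower()
    tokens = re.findall(r"\w+", text.lower())
    return sum(1 for token in tokens if levenshtein(token, target) <= max_distance)
```

With a threshold of 1, a plural such as "proteins" counts as an occurrence of "protein", while a token with two swapped letters needs a threshold of 2, because plain Levenshtein distance treats a transposition as two edits.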

Installation

pip install pipx
pipx install pdf-microarray

Usage

mkdir processed
pdf-microarray process -i documents -o processed
pdf-microarray analyze -i processed -w words.txt -o data.csv
pdf-microarray plot -i data.csv -o plot.png

The words in words.txt should be separated by newlines. If a line contains several words, that line is counted only where all of its words occur.
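The word-list convention can be sketched as follows. This is an illustration under stated assumptions, not the tool's code: `parse_word_groups` and `group_occurs` are hypothetical names, matching is exact rather than fuzzy for brevity, and the unit of text being checked (page or whole document) is left open.

```python
import re


def parse_word_groups(lines):
    """Turn the lines of a word list into groups of words, skipping blank lines."""
    groups = []
    for line in lines:
        words = line.split()
        if words:
            groups.append(words)
    return groups


def group_occurs(text, group):
    """Return True only if every word of the group appears in the text."""
    tokens = set(re.findall(r"\w+", text.lower()))
    return all(word.lower() in tokens for word in group)


# Each line is one entry; the second entry requires both words to be present.
groups = parse_word_groups(["kinase", "membrane protein", ""])
page = "A membrane-bound protein was isolated."
matches = [group_occurs(page, group) for group in groups]
```

Here `matches` is `[False, True]`: "kinase" is absent, while both "membrane" and "protein" appear in the sample text.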

Example

Technical details

The library uses document segmentation and multithreading to speed up the extraction process, so that even large books in PDF form can be parsed within a few minutes.
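The general pattern of segmenting a document and processing the segments in parallel can be sketched with the standard library. Everything here is an assumption for illustration: the function names, the segment size, the worker count, and the placeholder `extract_segment`, which stands in for the real text extraction and OCR.

```python
from concurrent.futures import ThreadPoolExecutor


def split_into_segments(page_count, segment_size):
    """Split page indices 0..page_count-1 into consecutive ranges."""
    return [
        range(start, min(start + segment_size, page_count))
        for start in range(0, page_count, segment_size)
    ]


def extract_segment(pages):
    """Placeholder for per-segment extraction; a real version would read the PDF."""
    return [f"text of page {page}" for page in pages]


def extract_document(page_count, segment_size=50, workers=4):
    """Process all segments in a thread pool and return page texts in page order."""
    segments = split_into_segments(page_count, segment_size)
    with ThreadPoolExecutor(max_workers=workers) as pool:
        # map() yields results in the order of the input segments.
        results = list(pool.map(extract_segment, segments))
    return [text for segment in results for text in segment]
```

Because `ThreadPoolExecutor.map` preserves input order, the page texts come back in document order even when segments finish at different times.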

The library uses PyMuPDF for PDF page rendering, pytesseract for OCR, and thefuzz to calculate the Levenshtein distance.

Contributing

Contributions to PDFMicroarray are welcome! Please feel free to fork the repository, make changes, and submit pull requests. For major changes, please open an issue first to discuss what you would like to change.

License

Distributed under the GNU General Public License v3.0. See LICENSE for more information.


