A Python CLI tool designed to assist with literature research by visualizing the occurrence of words in PDF documents in a microarray format.
PDFMicroarray
Overview
PDFMicroarray is a Python CLI tool designed to assist with literature research for scientific papers and books.
It extracts text from multiple sources within PDF documents, including:
- Plain text
- Text from images (through OCR)
- Text from embedded diagrams (through page rendering and OCR)
and stores the extracted text in a designated output directory.
The extracted text can then be analyzed using the Levenshtein distance to detect occurrences of specified words. These occurrences are presented graphically in a microarray format.
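To illustrate the matching idea, here is a minimal sketch of fuzzy word detection based on the Levenshtein distance. It is not PDFMicroarray's own code (the tool relies on thefuzz for this step); the function names `levenshtein` and `count_fuzzy_occurrences` and the `max_distance` threshold are assumptions made for illustration only.

```python
import re


def levenshtein(a: str, b: str) -> int:
    """Return the minimum number of single-character edits turning a into b."""
    if len(a) < len(b):
        a, b = b, a  # keep the inner row as short as possible
    previous = list(range(len(b) + 1))
    for i, char_a in enumerate(a, start=1):
        current = [i]
        for j, char_b in enumerate(b, start=1):
            cost = 0 if char_a == char_b else 1
            current.append(min(
                previous[j] + 1,         # deletion
                current[j - 1] + 1,      # insertion
                previous[j - 1] + cost,  # substitution
            ))
        previous = current
    return previous[-1]


def count_fuzzy_occurrences(text: str, word: str, max_distance: int = 1) -> int:
    """Count tokens in text within max_distance edits of word (case-insensitive)."""
    tokens = re.findall(r"\w+", text.lower())
    target = word.lower()
    return sum(1 for token in tokens if levenshtein(token, target) <= max_distance)
```

A tolerance of one edit lets an OCR misread such as `prote1n` still count as an occurrence of `protein`, which is the kind of error fuzzy matching is meant to absorb.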
Installation
Tesseract is required for this CLI tool. Please follow the installation instructions for your platform.
pip install pipx
pipx install pdf-microarray
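Before running the tool, it can be useful to confirm that Tesseract is reachable. The following is a small sketch, assuming the executable is named `tesseract` and is expected on `PATH`; the helper `tesseract_available` is hypothetical and not part of pdf-microarray.

```python
import shutil


def tesseract_available(executable: str = "tesseract") -> bool:
    """Return True if the given executable can be found on PATH."""
    return shutil.which(executable) is not None
```

If the function returns `False`, revisit the Tesseract installation instructions for your platform before processing documents.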
Usage
mkdir processed
pdf-microarray process -i documents -o processed
pdf-microarray analyze -i processed -w words.txt -o data.csv
pdf-microarray plot -i data.csv -o plot.png
The words in words.txt should be separated by newlines. If several words appear on the same line, the line counts as a match only when all of its words occur.
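One way to read this format: each line forms a group, and a group is considered present only when every word in it is found. The sketch below illustrates that reading with exact, case-insensitive matching for brevity. `parse_word_groups` and `group_occurs` are hypothetical helpers, not part of the pdf-microarray API, and splitting a line on whitespace is an assumption.

```python
import re


def parse_word_groups(content: str) -> list[list[str]]:
    """Split the file content into one group of words per non-empty line."""
    return [line.split() for line in content.splitlines() if line.strip()]


def group_occurs(text: str, group: list[str]) -> bool:
    """Return True only if every word of the group appears in the text."""
    tokens = set(re.findall(r"\w+", text.lower()))
    return all(word.lower() in tokens for word in group)


# Example words.txt content: one single-word line and one two-word line.
words_txt = "microarray\ngene expression\n"
groups = parse_word_groups(words_txt)

page = "Gene regulation was measured on a microarray."
results = {" ".join(group): group_occurs(page, group) for group in groups}
```

Here the line `gene expression` does not match the sample text, because `gene` occurs but `expression` does not.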
Example
Technical details
The library uses document segmentation and multithreading to speed up the extraction process, so that even large books in PDF form can be parsed within a few minutes.
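The general pattern of segmenting a document and processing the segments in parallel can be sketched as follows. This is an illustration under stated assumptions, not the library's implementation: the segment size, the worker count, and the function names are hypothetical, and `extract_segment` is a placeholder for the real per-page extraction and OCR work.

```python
from concurrent.futures import ThreadPoolExecutor


def split_into_segments(page_count: int, segment_size: int) -> list[range]:
    """Divide page indices 0..page_count-1 into consecutive segments."""
    return [
        range(start, min(start + segment_size, page_count))
        for start in range(0, page_count, segment_size)
    ]


def extract_segment(pages: range) -> list[str]:
    """Placeholder for the real per-page work (text extraction, OCR)."""
    return [f"text of page {index}" for index in pages]


def extract_document(page_count: int, segment_size: int = 10, workers: int = 4) -> list[str]:
    """Process segments concurrently and return page texts in page order."""
    segments = split_into_segments(page_count, segment_size)
    with ThreadPoolExecutor(max_workers=workers) as pool:
        # map() yields results in submission order, so page order is preserved.
        per_segment = list(pool.map(extract_segment, segments))
    return [text for segment in per_segment for text in segment]
```

Threads tend to pay off here when most of the time is spent waiting on work done outside the Python interpreter, such as an external OCR process.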
The library uses PyMuPDF for PDF page rendering, pytesseract for OCR, and thefuzz to calculate the Levenshtein distance.
Contributing
Contributions to PDFMicroarray are welcome! Please feel free to fork the repository, make changes, and submit pull requests. For major changes, please open an issue first to discuss what you would like to change.
License
Distributed under the GNU General Public License v3.0. See LICENSE for more information.