Search PDF files for specific words and generate frequency statistics.

PDF Word Counter

A tool to search PDF files for specific words and generate frequency statistics. Supports both PyPDF2 and pdfminer.six backends, with output to Astropy or Pandas formats.

Features

Case-sensitive/insensitive search
Unicode text normalization (NFC, NFD, NFKC, NFKD)
Parallel processing using threads
Output to Astropy tables or Pandas DataFrames
Select specific pages or page ranges
CLI and Python API
Filename parsing to extract metadata
Optional columns and sorting
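
The Unicode normalization feature matters because text extracted from PDFs can encode the same visible character in more than one way. The sketch below uses only Python's standard `unicodedata` module to illustrate the four forms listed above; it does not call this package, and how the package applies normalization internally is not documented here.

```python
import unicodedata

composed = "caf\u00e9"      # 'é' as a single code point (U+00E9)
decomposed = "cafe\u0301"   # 'e' followed by a combining acute accent (U+0301)

# The two strings render identically but compare unequal as raw text.
print(composed == decomposed)

# Normalizing both sides to the same form makes them comparable.
for form in ("NFC", "NFD", "NFKC", "NFKD"):
    same = unicodedata.normalize(form, composed) == unicodedata.normalize(form, decomposed)
    print(form, same)

# The compatibility forms (NFKC, NFKD) additionally fold characters such as
# the 'fi' ligature (U+FB01) into plain 'fi'; NFC and NFD leave it untouched.
print(unicodedata.normalize("NFKC", "\ufb01le"))
```

A search for "café" or "file" can therefore miss matches unless the search term and the extracted text are normalized to the same form.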

Installation

Basic installation:

pip install pdf-word-counter

Installation with optional dependencies:

pip install pdf-word-counter[miner]    # + pdfminer.six
pip install pdf-word-counter[astropy]  # + Astropy Tables
pip install pdf-word-counter[full]     # All extras

Command Line Usage

Basic Example

# Count words in all PDFs (case-insensitive) and print table
pdf-word-counter "AI,ML" papers/*.pdf --show
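
To make the idea behind the command above concrete, here is a minimal sketch of counting a comma-separated list of words in already-extracted text. The function `count_words` is hypothetical and not part of this package, and the whole-word matching via `\b` is an assumption: the tool may well count plain substrings instead.

```python
import re

def count_words(text, words, ignore_case=True):
    """Count whole-word occurrences of each search term in `text`."""
    flags = re.IGNORECASE if ignore_case else 0
    counts = {}
    for word in words:
        # \b anchors restrict matches to whole words; re.escape keeps
        # punctuation in the search term from being read as regex syntax.
        pattern = re.compile(r"\b" + re.escape(word) + r"\b", flags)
        counts[word] = len(pattern.findall(text))
    return counts

terms = "AI,ML".split(",")  # same comma-separated form as the CLI argument
sample = "AI and ML overlap; ai tools train ML models. Email is not AI."

print(count_words(sample, terms))
print(count_words(sample, terms, ignore_case=False))
```

With whole-word matching, the "ai" inside "train" and "Email" is not counted, which is usually what a short acronym search needs.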

Advanced Examples

# Search pages 1-5 with Unicode normalization
pdf-word-counter "café" doc.pdf --pages 1-5 --unicode --form NFKC

# Case-sensitive search with 4 parallel workers
pdf-word-counter "Python" docs/*.pdf --case --workers 4

# Case-sensitive search with pdfminer
pdf-word-counter "Python" docs/*.pdf --case --miner --outfile counts.csv

# Extract project names from filenames like "ProjectX_2023.pdf"
pdf-word-counter "algorithm" *.pdf --dsep "{'_': {'project': 0}}"

CLI Options

Option	Description
`words`	Comma-separated list of words to search for
`pdfs`	List of PDF files or glob patterns
`--case`	Case-sensitive search
`--pages`	Pages to include, e.g. `"1,3-5"`
`--miner`	Use `pdfminer.six` backend instead of `PyPDF2`
`--pprint N`	Log progress every N pages (PyPDF2 only)
`--unicode`	Normalize Unicode text before search
`--form FORM`	Unicode normalization form (`NFC`, `NFD`, `NFKC`, `NFKD`)
`--outfile FILE`	Save the output table to a file
`--show`	Print the table to stdout
`--sort COLUMNS`	Comma-separated list of columns to sort by
`--workers N`	Number of parallel workers (default: 1)
`--backend`	Output backend: `pandas` or `astropy`
`--ppdf N`	Log progress every N files (serial mode)
`--log-level`	Logging level: `DEBUG`, `INFO`, etc.
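
As an illustration of the `--pages` format shown in the table, the following sketch expands a specification such as `"1,3-5"` into individual page numbers. The helper `parse_pages` is hypothetical, and whether the tool treats page numbers as 1-based or accepts other syntax is an assumption not confirmed by this page.

```python
def parse_pages(spec):
    """Expand a page specification such as "1,3-5" into sorted page numbers."""
    pages = set()
    for part in spec.split(","):
        part = part.strip()
        if not part:
            continue  # tolerate stray commas
        if "-" in part:
            # A range like "3-5" includes both endpoints.
            start, end = (int(p) for p in part.split("-", 1))
            if start > end:
                raise ValueError(f"invalid range: {part!r}")
            pages.update(range(start, end + 1))
        else:
            pages.add(int(part))
    return sorted(pages)

print(parse_pages("1,3-5"))
```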

Filename Parsing Options

Option	Description
`--dsep`	Dictionary to extract metadata from filenames by splitting them. For example: `"{'_': {'project': 1}}"` will split filenames by underscores (`_`) and extract the second element (index 1) into a new column called `project`.
`--year`	Attempt to extract a 4-digit year from the filename and add it as a new column.
`--ext`	Keep the file extension in the `filename` column (by default, it is removed).
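
The filename options above can be pictured with a short sketch. The function `parse_filename` and its parameter names are hypothetical, and the details (for example, how a missing element or year is reported) are assumptions rather than the package's actual behavior.

```python
import ast
import re
from pathlib import Path

def parse_filename(path, dsep=None, year=False, keep_ext=False):
    """Build a metadata row from a filename (illustrative only)."""
    p = Path(path)
    row = {"filename": p.name if keep_ext else p.stem}
    # dsep maps a separator to {column_name: index_after_splitting}.
    for sep, fields in (dsep or {}).items():
        parts = p.stem.split(sep)
        for column, index in fields.items():
            row[column] = parts[index] if index < len(parts) else None
    if year:
        # First run of exactly four digits not adjacent to other digits.
        match = re.search(r"(?<!\d)(\d{4})(?!\d)", p.stem)
        row["year"] = int(match.group(1)) if match else None
    return row

# The --dsep value is a Python dict literal passed as a string.
dsep = ast.literal_eval("{'_': {'project': 0}}")
print(parse_filename("ProjectX_2023.pdf", dsep=dsep, year=True))
```

`ast.literal_eval` is used here because it parses Python literals without executing arbitrary code, which is a reasonable choice for a dictionary given on the command line.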

Output Column Toggles

Option	Description
`--nfile`	Omit the filename column
`--npages`	Omit the number-of-pages column

Python API

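from pdf_word_counter import search_pdfs

results = search_pdfs(
    ["document1.pdf", "document2.pdf"],  # PDF files to search
    ["word1", "word2"],                  # words to count
    output_file="results.csv",           # save the output table to this file
    show=True,                           # print the table
    sort_cols=["count"],                 # columns to sort by
    ignore_case=True,                    # case-insensitive search
    use_pdfminer=False,                  # use the PyPDF2 backend
    workers=4,                           # number of parallel workers
)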
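
The `workers` argument corresponds to the thread-based parallelism listed under Features. The sketch below shows that general pattern with the standard library's `ThreadPoolExecutor`, operating on in-memory strings instead of real PDFs. The helpers `count_terms` and `count_all` are hypothetical and are not how `search_pdfs` is necessarily implemented.

```python
from concurrent.futures import ThreadPoolExecutor
from functools import partial

def count_terms(text, words):
    """Count case-insensitive substring occurrences of each word."""
    lowered = text.lower()
    return {w: lowered.count(w.lower()) for w in words}

def count_all(documents, words, workers=4):
    """Apply count_terms to every document using a pool of worker threads."""
    names = list(documents)
    with ThreadPoolExecutor(max_workers=workers) as pool:
        # pool.map preserves input order, so results line up with names.
        results = pool.map(partial(count_terms, words=words),
                           (documents[n] for n in names))
        return dict(zip(names, results))

# Stand-ins for text already extracted from each PDF.
docs = {
    "document1.pdf": "word1 appears here, and word1 again.",
    "document2.pdf": "Only Word2 shows up in this one.",
}
print(count_all(docs, ["word1", "word2"]))
```

Threads mainly help when the per-file work spends time waiting on file I/O; for CPU-bound text extraction the speedup may be limited.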

License

MIT

How to Cite

This code was used in the research published as:

García-Benito, Rubén (2025). Beyond Universality: Cultural Diversity in Music and Its Implications for Sound Design and Sonification. Audio Mostly & ICAD Joint Conference (AM.ICAD 2025). Association for Computing Machinery, June 30 – July 4, 2025, Coimbra, Portugal.

The code version corresponding to this publication is archived and can be cited using the DOI:

If you use this software in your research or adaptations, please cite the above paper and this repository accordingly.

Release history

0.3.3 (this version): May 18, 2025
0.3.2: May 18, 2025
0.3.1: May 18, 2025
0.3.0: May 18, 2025

Download files

Source Distribution

pdf_word_counter-0.3.3.tar.gz (11.7 kB), uploaded May 18, 2025, Source

Built Distribution

pdf_word_counter-0.3.3-py3-none-any.whl (12.7 kB), uploaded May 18, 2025, Python 3

File details

Details for the file pdf_word_counter-0.3.3.tar.gz.

File metadata

Download URL: pdf_word_counter-0.3.3.tar.gz
Upload date: May 18, 2025
Size: 11.7 kB
Tags: Source
Uploaded using Trusted Publishing? No
Uploaded via: twine/6.1.0 CPython/3.9.22

File hashes

Hashes for pdf_word_counter-0.3.3.tar.gz
Algorithm	Hash digest
SHA256	`4fd217f9c5f2d7381e9654d1320462da6e7be3d8b01937fb7719fcef6a8252a8`
MD5	`0ed9291dce6a0fbfdebca9c481874d7e`
BLAKE2b-256	`12c1036fcaf1721d3bdc76b0af67297e8822edbabb5662762b1053b600a94fab`

File details

Details for the file pdf_word_counter-0.3.3-py3-none-any.whl.

File metadata

Download URL: pdf_word_counter-0.3.3-py3-none-any.whl
Upload date: May 18, 2025
Size: 12.7 kB
Tags: Python 3
Uploaded using Trusted Publishing? No
Uploaded via: twine/6.1.0 CPython/3.9.22

File hashes

Hashes for pdf_word_counter-0.3.3-py3-none-any.whl
Algorithm	Hash digest
SHA256	`52416734a8a5ebef32a87c6e209bc3508809f6b52ebffbf1db2e2efabf65bc7d`
MD5	`dd458b5dbada6fff1123e97054b0d213`
BLAKE2b-256	`c2959b0d57e3f6539aec3d0f86748d04d6d10b50eba3cf223a84651eabbbd237`
