PDF Word Counter
A tool to search PDF files for specific words and generate frequency statistics. Supports both PyPDF2 and pdfminer.six backends, with output to Astropy or Pandas formats.
Features
- Case-sensitive/insensitive search
- Unicode text normalization (NFC, NFD, NFKC, NFKD)
- Parallel processing using threads
- Output to Astropy tables or Pandas DataFrames
- Select specific pages or page ranges
- CLI and Python API
- Filename parsing to extract metadata
- Optional columns and sorting
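To illustrate the difference between case-sensitive and case-insensitive search, here is a minimal sketch of word counting on plain text. It is not the package's implementation: the function name `count_words` is hypothetical, and whole-word matching is an assumption made for this example.

```python
import re
from collections import Counter


def count_words(text, words, ignore_case=True):
    """Count whole-word occurrences of each target word in text.

    Hypothetical helper for illustration; not part of pdf-word-counter.
    """
    flags = re.IGNORECASE if ignore_case else 0
    counts = Counter()
    for word in words:
        # \b anchors keep "AI" from matching inside a longer word like "MAIL"
        pattern = re.compile(r"\b" + re.escape(word) + r"\b", flags)
        counts[word] = len(pattern.findall(text))
    return counts


sample = "AI and ML are related. Some ai systems use ml; MAIL does not count."
insensitive = count_words(sample, ["AI", "ML"])
sensitive = count_words(sample, ["AI", "ML"], ignore_case=False)
print(dict(insensitive))  # counts with case ignored
print(dict(sensitive))    # counts with exact case
```

With case ignored, the lowercase "ai" and "ml" are counted along with the uppercase forms. With exact case, only "AI" and "ML" match.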
Installation
Basic installation:
pip install pdf-word-counter
Installation with optional dependencies (install one extra, or use full for all of them):
pip install pdf-word-counter[miner] # + pdfminer.six
pip install pdf-word-counter[astropy] # + Astropy Tables
pip install pdf-word-counter[full] # All extras
Command Line Usage
Basic Example
# Count words in all PDFs (case-insensitive) and print table
pdf-word-counter "AI,ML" papers/*.pdf --show
Advanced Examples
# Search pages 1-5 with Unicode normalization
pdf-word-counter "café" doc.pdf --pages 1-5 --unicode --form NFKC
# Case-sensitive search with 4 parallel workers
pdf-word-counter "Python" docs/*.pdf --case --workers 4
# Case-sensitive search with pdfminer
pdf-word-counter "Python" docs/*.pdf --case --miner --outfile counts.csv
# Extract project names from filenames like "ProjectX_2023.pdf"
pdf-word-counter "algorithm" *.pdf --dsep "{'_': {'project': 0}}"
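The sketch below suggests why Unicode normalization can matter when searching for a word such as "café". It uses only the standard-library `unicodedata` module. The helper `count_normalized` is hypothetical and does not reflect how the tool applies normalization internally.

```python
import unicodedata

composed = "caf\u00e9"     # 'é' stored as one code point (U+00E9)
decomposed = "cafe\u0301"  # 'e' followed by a combining acute accent (U+0301)


def count_normalized(text, word, form="NFC"):
    """Normalize both text and search word to the same form, then count.

    Hypothetical helper for illustration only.
    """
    text_n = unicodedata.normalize(form, text)
    word_n = unicodedata.normalize(form, word)
    return text_n.count(word_n)


# Without normalization the two spellings of "café" do not match.
raw_hits = decomposed.count(composed)

# After NFC normalization both sides use the composed form.
nfc_hits = count_normalized(decomposed, composed, "NFC")

# NFKC also folds compatibility characters, e.g. the "fi" ligature (U+FB01).
ligature_nfc = count_normalized("\ufb01le", "file", "NFC")
ligature_nfkc = count_normalized("\ufb01le", "file", "NFKC")

print(raw_hits, nfc_hits, ligature_nfc, ligature_nfkc)
```

PDF text extraction often yields decomposed accents or ligatures, which is one reason to try the compatibility forms (NFKC, NFKD) when a search returns fewer hits than expected.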
CLI Options
| Option | Description |
|---|---|
| words | Comma-separated list of words to search for |
| pdfs | List of PDF files or glob patterns |
| --case | Case-sensitive search |
| --pages | Pages to include, e.g. "1,3-5" |
| --miner | Use pdfminer.six backend instead of PyPDF2 |
| --pprint N | Log progress every N pages (PyPDF2 only) |
| --unicode | Normalize Unicode text before search |
| --form FORM | Unicode normalization form (NFC, NFD, NFKC, NFKD) |
| --outfile FILE | Save the output table to a file |
| --show | Print the table to stdout |
| --sort COLUMNS | Comma-separated list of columns to sort by |
| --workers N | Number of parallel workers (default: 1) |
| --backend | Output backend: pandas or astropy |
| --ppdf N | Log progress every N files (serial mode) |
| --log-level | Logging level: DEBUG, INFO, etc. |
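A page specification such as "1,3-5" can be expanded into individual page numbers along the following lines. This is a sketch under stated assumptions, not the tool's parser: the function name `parse_pages` is hypothetical, and treating page numbers as 1-based with inclusive ranges is an assumption.

```python
def parse_pages(spec):
    """Expand a page spec such as "1,3-5" into a sorted list of page numbers.

    Hypothetical helper; assumes 1-based pages and inclusive ranges.
    """
    pages = set()
    for part in spec.split(","):
        part = part.strip()
        if not part:
            continue  # tolerate stray commas such as "1,,3"
        if "-" in part:
            start, end = (int(p) for p in part.split("-", 1))
            if start > end:
                raise ValueError(f"invalid page range: {part!r}")
            pages.update(range(start, end + 1))
        else:
            pages.add(int(part))
    return sorted(pages)


selected = parse_pages("1,3-5")
print(selected)
```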
Filename Parsing Options
| Option | Description |
|---|---|
| --dsep | Dictionary to extract metadata from filenames by splitting them. For example: "{'_': {'project': 1}}" will split filenames by underscores (_) and extract the second element (index 1) into a new column called project. |
| --year | Attempt to extract a 4-digit year from the filename and add it as a new column. |
| --ext | Keep the file extension in the filename column (by default, it is removed). |
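The splitting behaviour described for --dsep and --year could be sketched as follows. The function `parse_filename` and its parameters are hypothetical, and the year pattern (the first standalone run of four digits) is an assumption. The package's own parsing may differ.

```python
import ast
import re
from pathlib import Path


def parse_filename(path, dsep=None, year=False):
    """Extract metadata columns from a filename.

    Hypothetical helper. dsep maps a separator to {column_name: index}.
    """
    stem = Path(path).stem  # filename without directory or extension
    meta = {}
    for sep, fields in (dsep or {}).items():
        parts = stem.split(sep)
        for column, index in fields.items():
            # Out-of-range indices yield None instead of raising
            in_range = -len(parts) <= index < len(parts)
            meta[column] = parts[index] if in_range else None
    if year:
        # Assumption: a "year" is any standalone run of exactly four digits
        match = re.search(r"(?<!\d)\d{4}(?!\d)", stem)
        meta["year"] = int(match.group(0)) if match else None
    return meta


# The CLI receives the dictionary as a string; literal_eval parses it safely.
dsep = ast.literal_eval("{'_': {'project': 0}}")
info = parse_filename("ProjectX_2023.pdf", dsep, year=True)
print(info)
```

For a file named "ProjectX_2023.pdf", index 0 selects "ProjectX" and index 1 would select "2023".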
Output Column Toggles
| Option | Description |
|---|---|
| --nfile | Omit the filename column |
| --npages | Omit the number-of-pages column |
Python API
from pdf_word_counter import search_pdfs

results = search_pdfs(
    ["document1.pdf", "document2.pdf"],
    ["word1", "word2"],
    output_file="results.csv",
    show=True,
    sort_cols=["count"],
    ignore_case=True,
    use_pdfminer=False,
    workers=4,
)
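The `workers` argument relates to the thread-based parallelism listed under Features. As a rough illustration of that pattern, the sketch below distributes per-document counting across a thread pool using the standard library. The in-memory strings stand in for extracted PDF text, and `count_in_document` is a hypothetical helper. None of this reflects the internals of `search_pdfs`.

```python
from concurrent.futures import ThreadPoolExecutor


def count_in_document(item):
    """Count case-insensitive occurrences of "python" in one document.

    Hypothetical stand-in for per-file PDF extraction and counting.
    """
    name, text = item
    return name, text.lower().count("python")


# Stand-in data: filename -> already extracted text
documents = {
    "a.pdf": "Python and python",
    "b.pdf": "no match here",
    "c.pdf": "PYTHON",
}

with ThreadPoolExecutor(max_workers=4) as pool:
    # map() returns results in input order even though calls run concurrently
    results = list(pool.map(count_in_document, documents.items()))

print(results)
```

Threads suit this kind of workload when most of the time is spent reading files. For CPU-bound text extraction, the speed-up from threads in CPython may be limited.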
License
MIT
How to Cite
This code was used in the research published as:
García-Benito, Rubén (2025). Beyond Universality: Cultural Diversity in Music and Its Implications for Sound Design and Sonification. Audio Mostly & ICAD Joint Conference (AM.ICAD 2025). Association for Computing Machinery, June 30 – July 4, 2025, Coimbra, Portugal
The code version corresponding to this publication is archived and can be cited using the DOI:
If you use this software in your research or adaptations, please cite the above paper and this repository accordingly.