PDF Word Counter

A tool to search PDF files for specific words and generate frequency statistics. Supports both PyPDF2 and pdfminer.six backends, with output to Astropy or Pandas formats.

Features

  • Case-sensitive/insensitive search
  • Unicode text normalization (NFC, NFD, NFKC, NFKD)
  • Parallel processing using threads
  • Output to Astropy tables or Pandas DataFrames
  • Select specific pages or page ranges
  • CLI and Python API
  • Filename parsing to extract metadata
  • Optional columns and sorting
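The Unicode normalization feature matters because the same visible word can be encoded in more than one way inside a PDF. The following is a minimal sketch using only Python's standard `unicodedata` module; it is not taken from the package's source, and the helper name `normalize` is hypothetical.

```python
import unicodedata

# Two visually identical spellings of "café":
composed = "caf\u00e9"      # 'é' as a single code point (U+00E9)
decomposed = "cafe\u0301"   # 'e' followed by a combining acute accent (U+0301)


def normalize(text: str, form: str = "NFC") -> str:
    """Return text in the requested Unicode normalization form (hypothetical helper)."""
    return unicodedata.normalize(form, text)


# Without normalization the two strings do not compare equal.
raw_equal = composed == decomposed

# After normalizing both to the same form they do.
nfc_equal = normalize(composed, "NFC") == normalize(decomposed, "NFC")

# The compatibility forms (NFKC, NFKD) also fold ligatures such as U+FB01 ("fi" ligature).
ligature = normalize("\ufb01le", "NFKC")
```

Normalizing both the search word and the extracted text to the same form is what makes such variants match.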

Installation

Basic installation:

pip install pdf-word-counter

Installation with optional dependencies (quote the extras so that shells such as zsh do not expand the brackets):

pip install "pdf-word-counter[miner]"    # + pdfminer.six
pip install "pdf-word-counter[astropy]"  # + Astropy Tables
pip install "pdf-word-counter[full]"     # All extras

Command Line Usage

Basic Example

# Count words in all PDFs (case-insensitive) and print table
pdf-word-counter "AI,ML" papers/*.pdf --show
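To illustrate what counting a comma-separated word list means, here is a minimal sketch in plain Python. It is an assumption-laden illustration, not the package's implementation: the function name `count_words` is hypothetical, and it assumes whole-word matching, which this README does not specify.

```python
import re


def count_words(text, words, ignore_case=True):
    """Count whole-word occurrences of each word in text (hypothetical helper)."""
    flags = re.IGNORECASE if ignore_case else 0
    counts = {}
    for word in words:
        # \b restricts matches to whole words; this is an assumption of the sketch.
        pattern = re.compile(r"\b" + re.escape(word) + r"\b", flags)
        counts[word] = len(pattern.findall(text))
    return counts


# The CLI takes the words as one comma-separated string, e.g. "AI,ML".
words = "AI,ML".split(",")
sample = "AI and ml are related; ai is broader than ML."

insensitive = count_words(sample, words)                   # default: ignore case
sensitive = count_words(sample, words, ignore_case=False)  # analogous to --case
```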

Advanced Examples

# Search pages 1-5 with Unicode normalization
pdf-word-counter "café" doc.pdf --pages 1-5 --unicode --form NFKC

# Case-sensitive search with 4 parallel workers
pdf-word-counter "Python" docs/*.pdf --case --workers 4

# Case-sensitive search with pdfminer
pdf-word-counter "Python" docs/*.pdf --case --miner --outfile counts.csv

# Extract project names from filenames like "ProjectX_2023.pdf"
pdf-word-counter "algorithm" *.pdf --dsep "{'_': {'project': 0}}" 

CLI Options

Option          Description
words           Comma-separated list of words to search for
pdfs            List of PDF files or glob patterns
--case          Case-sensitive search
--pages         Pages to include, e.g. "1,3-5"
--miner         Use pdfminer.six backend instead of PyPDF2
--pprint N      Log progress every N pages (PyPDF2 only)
--unicode       Normalize Unicode text before search
--form FORM     Unicode normalization form (NFC, NFD, NFKC, NFKD)
--outfile FILE  Save the output table to a file
--show          Print the table to stdout
--sort COLUMNS  Comma-separated list of columns to sort by
--workers N     Number of parallel workers (default: 1)
--backend       Output backend: pandas or astropy
--ppdf N        Log progress every N files (serial mode)
--log-level     Logging level: DEBUG, INFO, etc.
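A page specification such as "1,3-5" combines single pages and ranges. The sketch below shows one way such a string could be expanded; the function name `parse_pages` is hypothetical, and the sketch assumes 1-based page numbers with inclusive ranges, which the option description suggests but does not state.

```python
def parse_pages(spec):
    """Expand a page spec like "1,3-5" into a sorted list of page numbers.

    Assumes 1-based pages and inclusive ranges (an assumption of this sketch).
    """
    pages = set()
    for part in spec.split(","):
        part = part.strip()
        if not part:
            continue  # tolerate stray commas
        if "-" in part:
            start, end = part.split("-", 1)
            pages.update(range(int(start), int(end) + 1))
        else:
            pages.add(int(part))
    return sorted(pages)


selected = parse_pages("1,3-5")
```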

Filename Parsing Options

Option  Description
--dsep  Dictionary used to extract metadata from filenames by splitting them. For example, "{'_': {'project': 1}}" splits filenames on underscores (_) and stores the second element (index 1) in a new column called project.
--year  Attempt to extract a 4-digit year from the filename and add it as a new column.
--ext   Keep the file extension in the filename column (by default, it is removed).
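The --dsep dictionary maps a separator to the columns taken from the split filename. The following sketch mimics that behavior for illustration only: the function name `parse_filename` is hypothetical, and the handling of missing elements and the integer type of the year are assumptions rather than documented behavior.

```python
import ast
import re
from pathlib import Path


def parse_filename(filename, dsep, year=False):
    """Extract metadata columns from a filename (hypothetical helper)."""
    stem = Path(filename).stem  # drop the directory and the extension
    meta = {}
    for sep, fields in dsep.items():
        parts = stem.split(sep)
        for column, index in fields.items():
            try:
                meta[column] = parts[index]
            except IndexError:
                meta[column] = None  # assumption: missing elements become None
    if year:
        # First run of exactly four digits that is not part of a longer number.
        match = re.search(r"(?<!\d)\d{4}(?!\d)", stem)
        meta["year"] = int(match.group()) if match else None
    return meta


# The CLI receives the dictionary as a string; ast.literal_eval parses it safely.
dsep = ast.literal_eval("{'_': {'project': 0}}")
meta = parse_filename("ProjectX_2023.pdf", dsep, year=True)
```

With index 0 the project column holds the first element of the split filename; index 1 would select the second.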

Output Column Toggles

Option    Description
--nfile   Omit the filename column
--npages  Omit the number-of-pages column

Python API

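from pdf_word_counter import search_pdfs

results = search_pdfs(
    ["document1.pdf", "document2.pdf"],  # PDF files to search
    ["word1", "word2"],                  # words to count
    output_file="results.csv",           # save the output table to this file
    show=True,                           # print the table
    sort_cols=["count"],                 # columns to sort by
    ignore_case=True,                    # case-insensitive search
    use_pdfminer=False,                  # use PyPDF2 rather than pdfminer.six
    workers=4,                           # number of parallel workers
)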

License

MIT

How to Cite

This code was used in the research published as:

García-Benito, Rubén (2025). Beyond Universality: Cultural Diversity in Music and Its Implications for Sound Design and Sonification. Audio Mostly & ICAD Joint Conference (AM.ICAD 2025). Association for Computing Machinery, June 30 – July 4, 2025, Coimbra, Portugal

The code version corresponding to this publication is archived and can be cited using the DOI:

DOI

If you use this software in your research or adaptations, please cite the above paper and this repository accordingly.
