Extract structured text from pdfs quickly. Adapted from https://github.com/VikParuchuri/pdftext

These details have not been verified by PyPI

Project links

Repository

Project description

PDFText

Text extraction like PyMuPDF, but without the AGPL license. PDFText extracts plain text or structured blocks and lines. It's built on pypdfium2, so it's fast, accurate, and Apache licensed.

Community

Discord is where we discuss future development.

Installation

You'll need Python 3.9+ first. Then run pip install pdftext.

Usage

Inspect the settings in pdftext/settings.py. You can override any settings with environment variables.

Plain text

This command will write out a text file with the extracted plain text.

pdftext PDF_PATH --out_path output.txt

PDF_PATH must be a single pdf file.
--out_path path to the output txt file. If not specified, will write to stdout.
--sort will attempt to sort in reading order if specified.
--keep_hyphens will keep hyphens in the output (they will be stripped and words joined otherwise)
--page_range specifies the pages to extract as a comma-separated list, for example 0,5-10,12.
--workers specifies the number of parallel workers to use
--flatten_pdf merges form fields into the PDF
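
To make the --page_range format concrete, here is a minimal sketch of how a spec like 0,5-10,12 could be expanded into page indices. The function name parse_page_range is hypothetical and not part of pdftext, and the sketch assumes ranges are inclusive and pages are 0-indexed.

```python
def parse_page_range(spec: str) -> list[int]:
    """Expand a spec such as "0,5-10,12" into a sorted list of unique page indices."""
    pages: set[int] = set()
    for part in spec.split(","):
        part = part.strip()
        if not part:
            continue  # tolerate stray commas
        if "-" in part:
            start_text, end_text = part.split("-", 1)
            start, end = int(start_text), int(end_text)
            if start > end:
                raise ValueError(f"invalid range: {part!r}")
            # Assumption: the end of a range is inclusive.
            pages.update(range(start, end + 1))
        else:
            pages.add(int(part))
    return sorted(pages)


print(parse_page_range("0,5-10,12"))
```

The result is a plain list of integers, which has the same shape as the page_range argument used in the programmatic examples.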

JSON

This command outputs structured blocks and lines with font and other information.

pdftext PDF_PATH --out_path output.json --json

PDF_PATH must be a single pdf file.
--out_path path to the output json file. If not specified, will write to stdout.
--json specifies json output
--sort will attempt to sort in reading order if specified.
--page_range specifies the pages to extract as a comma-separated list, for example 0,5-10,12.
--keep_chars will keep individual characters in the json output
--workers specifies the number of parallel workers to use
--flatten_pdf merges form fields into the PDF

The output will be a json list, with each item in the list corresponding to a single page in the input pdf (in order). Each page will include the following keys:

bbox - the page bbox, in [x1, y1, x2, y2] format
rotation - how much the page is rotated, in degrees (0, 90, 180, or 270)
page - the index of the page
blocks - the blocks that make up the text in the pdf. Approximately equal to a paragraph.
- bbox - the block bbox, in [x1, y1, x2, y2] format
- lines - the lines inside the block
  - bbox - the line bbox, in [x1, y1, x2, y2] format
  - spans - the individual text spans in the line (text spans have the same font/weight/etc)
    - text - the text in the span, encoded in utf-8
    - rotation - how much the span is rotated, in degrees
    - bbox - the span bbox, in [x1, y1, x2, y2] format
    - char_start_idx - the start index of the first span character in the pdf
    - char_end_idx - the end index of the last span character in the pdf
    - font - font info straight from the pdf (see the pdfium source code)
      - size - the size of the font used for the text
      - weight - font weight
      - name - font name, may be None
      - flags - font flags, in the format of the PDF spec 1.7 Section 5.7.1 Font Descriptor Flags

If the pdf is rotated, the bboxes will be relative to the rotated page (they're rotated after being extracted).
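
As a rough illustration of the nesting described above, this sketch walks pages, blocks, lines, and spans to rebuild plain text. The sample data is hand-written to mimic the documented keys and is not real pdftext output; page_to_text is a hypothetical helper name.

```python
# Hand-written sample that follows the documented page -> blocks -> lines -> spans
# layout. Only the keys needed for this example are filled in.
sample_pages = [
    {
        "bbox": [0, 0, 612, 792],
        "rotation": 0,
        "page": 0,
        "blocks": [
            {
                "bbox": [72, 72, 300, 100],
                "lines": [
                    {
                        "bbox": [72, 72, 300, 85],
                        "spans": [{"text": "Hello "}, {"text": "world"}],
                    },
                    {
                        "bbox": [72, 87, 300, 100],
                        "spans": [{"text": "second line"}],
                    },
                ],
            },
            {
                "bbox": [72, 120, 300, 133],
                "lines": [
                    {"bbox": [72, 120, 300, 133], "spans": [{"text": "New block"}]}
                ],
            },
        ],
    }
]


def page_to_text(page: dict) -> str:
    """Join spans into lines, lines with a newline, and blocks with a blank line."""
    block_texts = []
    for block in page["blocks"]:
        line_texts = []
        for line in block["lines"]:
            text = "".join(span["text"] for span in line["spans"])
            # Strip any trailing newline so line breaks are not doubled.
            line_texts.append(text.rstrip("\n"))
        block_texts.append("\n".join(line_texts))
    return "\n\n".join(block_texts)


for page in sample_pages:
    print(f"--- page {page['page']} ---")
    print(page_to_text(page))
```

With real output, the same loop should work on the list loaded from the JSON file, assuming the keys match the list above.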

Programmatic usage

Extract plain text:

from pdftext.extraction import plain_text_output

text = plain_text_output(PDF_PATH, sort=False, hyphens=False, page_range=[1,2,3]) # Optional arguments explained above

Extract structured blocks and lines:

from pdftext.extraction import dictionary_output

text = dictionary_output(PDF_PATH, sort=False, page_range=[1,2,3], keep_chars=False) # Optional arguments explained above

Extract text from table cells:

from pdftext.extraction import table_output

table_inputs = [
  # Each dictionary entry is a single page
  {
    "tables": [[5,10,10,20]], # Coordinates for tables on the page
    "img_size": [512, 512] # The size of the image the tables were detected in
  }
]
text = table_output(PDF_PATH, table_inputs, page_range=[1,2,3])
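
The img_size entry is there because the table coordinates come from a detection image rather than from the pdf page itself. A minimal sketch of the kind of rescaling this implies is below. It assumes plain linear scaling with a shared top-left origin, which may not match what pdftext does internally; scale_bbox and the page bbox values are hypothetical.

```python
def scale_bbox(bbox, img_size, page_bbox):
    """Map an [x1, y1, x2, y2] box from image pixels into page coordinates.

    Assumption: the image covers the whole page, so a linear scale per axis
    is enough.
    """
    img_w, img_h = img_size
    px1, py1, px2, py2 = page_bbox
    scale_x = (px2 - px1) / img_w
    scale_y = (py2 - py1) / img_h
    x1, y1, x2, y2 = bbox
    return [
        px1 + x1 * scale_x,
        py1 + y1 * scale_y,
        px1 + x2 * scale_x,
        py1 + y2 * scale_y,
    ]


# A table detected at [5, 10, 10, 20] in a 512x512 image, on a made-up
# page that is 1024x1024 units.
print(scale_bbox([5, 10, 10, 20], [512, 512], [0, 0, 1024, 1024]))
# -> [10.0, 20.0, 20.0, 40.0]
```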

If you want more customization, check out the pdftext.extraction._get_pages function for a starting point to dig deeper. pdftext is a pretty thin wrapper around pypdfium2, so you might want to look at the documentation for that as well.

Benchmarks

I benchmarked extraction speed and accuracy of pymupdf, pdfplumber, and pdftext. I chose pymupdf because it extracts blocks and lines. Pdfplumber extracts words and bboxes. I did not benchmark pypdf, even though it is a great library, because it doesn't provide individual character/line/block and bbox information.

Here are the scores, run on an M1 MacBook, without multiprocessing:

Library	Time (s per page)	Alignment Score (% accuracy vs pymupdf)
pymupdf	0.32	--
pdftext	1.36	97.78
pdfplumber	3.16	90.36

pdftext is approximately 2x slower than using pypdfium2 alone (if you were to extract all the same character information).

There are additional benchmarks for pypdfium2 and other tools here.

Methodology

I used a benchmark set of 200 pdfs extracted from Common Crawl, then processed by a team at Hugging Face.

For each library, I used a detailed extraction method, to pull out font information, as well as just the words. This ensured we were comparing similar performance numbers. I formatted the text similarly when extracting - newlines after lines, and double newlines after blocks. For pdfplumber, I could only do the newlines after lines, since it doesn't recognize blocks.

For the alignment score, I extracted the text, then used the rapidfuzz library to find the alignment percentage. I used the text extracted by pymupdf as the pseudo-ground truth.
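
To give a feel for what an alignment percentage measures, here is a small sketch that scores one extracted text against a reference. It uses difflib from the standard library as a stand-in for rapidfuzz so it runs without extra installs, so the numbers will not match the scores in the table. alignment_score is a hypothetical name.

```python
from difflib import SequenceMatcher


def alignment_score(candidate: str, reference: str) -> float:
    """Return a 0-100 similarity between candidate text and reference text.

    Stand-in metric: difflib's ratio, which is 2 * matches / total length.
    """
    matcher = SequenceMatcher(None, candidate, reference, autojunk=False)
    return 100.0 * matcher.ratio()


reference_text = "abcd"  # plays the role of the pymupdf pseudo-ground truth
print(alignment_score("abcd", reference_text))  # identical text -> 100.0
print(alignment_score("abxd", reference_text))  # one wrong character -> 75.0
```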

Running benchmarks

You can run the benchmarks yourself. To do so, you have to first install pdftext manually. The install assumes you have poetry and Python 3.9+ installed.

git clone https://github.com/VikParuchuri/pdftext.git
cd pdftext
poetry install
python benchmark.py # Will download the benchmark pdfs automatically

The benchmark script has a few options:

--max this controls the maximum number of pdfs to benchmark
--result_path a folder to save the results. A file called results.json will be created in the folder.
--pdftext_only skip running pdfplumber, which can be slow.

How it works

PDFText is a very light wrapper around pypdfium2. It first uses pypdfium2 to extract characters in order, along with font and other information. Then it uses a simple decision tree algorithm to group characters into lines and blocks. It does some simple postprocessing to clean up the text.
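
The grouping step can be pictured with a much simpler rule than the one pdftext uses. The sketch below starts a new line whenever the top of a character's bbox jumps by more than a fixed tolerance. This is a fixed-threshold heuristic for illustration only, not the decision tree, and the function name, input shape, and tolerance are all assumptions.

```python
def group_chars_into_lines(chars, y_tolerance=2.0):
    """Group characters (in extraction order) into lines of text.

    chars: list of dicts with "char" and "bbox" ([x1, y1, x2, y2]).
    A new line starts when the top edge moves more than y_tolerance
    away from the previous character's top edge.
    """
    lines = []
    current = []
    for ch in chars:
        if current:
            prev_top = current[-1]["bbox"][1]
            if abs(ch["bbox"][1] - prev_top) > y_tolerance:
                lines.append(current)
                current = []
        current.append(ch)
    if current:
        lines.append(current)
    return ["".join(c["char"] for c in line) for line in lines]


# Made-up characters: two on the first line, two on the second.
chars = [
    {"char": "H", "bbox": [0, 0, 5, 10]},
    {"char": "i", "bbox": [5, 0, 8, 10]},
    {"char": "y", "bbox": [0, 12, 5, 22]},
    {"char": "o", "bbox": [5, 12.5, 10, 22]},
]
print(group_chars_into_lines(chars))  # -> ['Hi', 'yo']
```

A learned decision tree can combine several such signals (gaps, font changes, position) instead of relying on one hand-picked threshold.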

Credits

This is built on some amazing open source work, including:

pypdfium2
scikit-learn
pypdf for very thorough and fair benchmarks

Thank you to the pymupdf devs for creating such a great library - I just wish it had a simpler license!

Project details

Release history

0.6.2.post2: May 19, 2025
0.6.2.post1 (yanked): May 18, 2025
0.6.2 (yanked, this version): May 18, 2025

Download files

Source Distribution

chunking_pdftext-0.6.2.tar.gz (22.1 kB)

Uploaded May 18, 2025 Source

Built Distribution

chunking_pdftext-0.6.2-py3-none-any.whl (24.2 kB)

Uploaded May 18, 2025 Python 3

File details

Details for the file chunking_pdftext-0.6.2.tar.gz.

File metadata

Download URL: chunking_pdftext-0.6.2.tar.gz
Upload date: May 18, 2025
Size: 22.1 kB
Tags: Source
Uploaded using Trusted Publishing? No
Uploaded via: twine/6.1.0 CPython/3.10.16

File hashes

Hashes for chunking_pdftext-0.6.2.tar.gz
Algorithm	Hash digest
SHA256	`507c4c54a45fde521a4b78eb6b9b21195b0b42154fa60d2f661a652bf50b291b`
MD5	`d99d6145d999394e594dd16f85b4df05`
BLAKE2b-256	`77884eb927f4be9e3391667874a45f39be5ce82bd38d3bad397b76d28e386f9d`

File details

Details for the file chunking_pdftext-0.6.2-py3-none-any.whl.

File metadata

Download URL: chunking_pdftext-0.6.2-py3-none-any.whl
Upload date: May 18, 2025
Size: 24.2 kB
Tags: Python 3
Uploaded using Trusted Publishing? No
Uploaded via: twine/6.1.0 CPython/3.10.16

File hashes

Hashes for chunking_pdftext-0.6.2-py3-none-any.whl
Algorithm	Hash digest
SHA256	`47f356608dfe2cb41cd40d4b1a202eba230631d72073eac7f68e6c84314a7cec`
MD5	`03d660a21fa67c052299fed080e23ed4`
BLAKE2b-256	`65cd3971205dc9fdb3be3969c1e588a8af435f226b7c85e0cbc8af2c3d2c7c0d`

chunking-pdftext 0.6.2