A PDF to text extraction pipeline component for spaCy.

spacypdfreader

Easy PDF to text to spaCy text extraction in Python.


Documentation: https://samedwardes.github.io/spacypdfreader/

Source code: https://github.com/SamEdwardes/spacypdfreader

PyPI: https://pypi.org/project/spacypdfreader/

spaCy universe: https://spacy.io/universe/project/spacypdfreader


spacypdfreader is a Python library for extracting text from PDF documents into spaCy Doc objects. When you use spacypdfreader, the spaCy Token and Doc objects are annotated with additional information about the PDF, such as the page each token came from.

The key features are:

  • PDF to spaCy Doc object: Convert a PDF document directly into a spaCy Doc object.
  • Custom spaCy attributes and methods:
    • token._.page_number
    • doc._.page_range
    • doc._.first_page
    • doc._.last_page
    • doc._.pdf_file_name
    • doc._.page(int)
  • Multiple parsers: Select between multiple built-in PDF to text parsers or bring your own PDF to text parser.
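
As a rough illustration of the "bring your own parser" idea, the sketch below wraps one parser in another that tidies up the extracted text. It assumes a parser is a plain callable that takes a PDF path and a 1-based page number and returns that page's text. That signature, and the names make_cleaning_parser and fake_parser, are assumptions made for this example, so check the documentation for the interface spacypdfreader actually expects.

```python
import re
from typing import Callable

# Assumed parser shape: (pdf_path, page_number, **kwargs) -> text of that page.
Parser = Callable[..., str]


def make_cleaning_parser(base_parser: Parser) -> Parser:
    """Wrap a parser so its output is dehyphenated and whitespace-normalised.

    Hypothetical helper for illustration; not part of spacypdfreader.
    """

    def parser(pdf_path: str, page_number: int, **kwargs) -> str:
        raw = base_parser(pdf_path, page_number, **kwargs)
        # Re-join words hyphenated across line breaks, then collapse whitespace.
        joined = re.sub(r"-\n(?=\w)", "", raw)
        return re.sub(r"\s+", " ", joined).strip()

    return parser


def fake_parser(pdf_path: str, page_number: int, **kwargs) -> str:
    """Stand-in for a real extractor so the sketch runs without a PDF."""
    return f"Page {page_number} of\n{pdf_path}:  extrac-\ntion   demo\n"


clean_parser = make_cleaning_parser(fake_parser)
print(clean_parser("example.pdf", 2))
```

Wrapping an existing parser like this keeps the extraction step and the clean-up step separate, so the same clean-up could be reused with a different extractor.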

Installation

Install spacypdfreader using pip:

pip install spacypdfreader

To install with the dependencies required for the pytesseract parser:

pip install 'spacypdfreader[pytesseract]'

Usage

import spacy

from spacypdfreader import pdf_reader

nlp = spacy.load("en_core_web_sm")
doc = pdf_reader("tests/data/test_pdf_01.pdf", nlp)

# Get the page number of any token.
print(doc[0]._.page_number)  # 1
print(doc[-1]._.page_number)  # 4

# Get page metadata about the PDF document.
print(doc._.pdf_file_name)  # "tests/data/test_pdf_01.pdf"
print(doc._.page_range)  # (1, 4)
print(doc._.first_page)  # 1
print(doc._.last_page)  # 4

# Get all of the text from a specific PDF page.
print(doc._.page(4))  # "able to display the destination page (unless..."
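
To make the attributes above concrete, here is a toy model of how per-page text can be turned into page-aware tokens. It is not spacypdfreader's implementation: it splits on whitespace instead of using a spaCy tokenizer, and PageToken, tokens_from_pages, page_range, and page_text are names invented for this sketch.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class PageToken:
    text: str
    page_number: int  # 1-based, mirroring token._.page_number


def tokens_from_pages(pages: list[str]) -> list[PageToken]:
    """Tag every whitespace-separated token with the page it came from."""
    return [
        PageToken(word, number)
        for number, page in enumerate(pages, start=1)
        for word in page.split()
    ]


def page_range(tokens: list[PageToken]) -> tuple[int, int]:
    """First and last page covered, analogous to doc._.page_range."""
    return tokens[0].page_number, tokens[-1].page_number


def page_text(tokens: list[PageToken], number: int) -> str:
    """Text of a single page, analogous to doc._.page(number)."""
    return " ".join(t.text for t in tokens if t.page_number == number)


pages = ["Hello from page one.", "Page two here.", "And the last page."]
tokens = tokens_from_pages(pages)
print(tokens[0].page_number, tokens[-1].page_number)  # 1 3
print(page_range(tokens))  # (1, 3)
print(page_text(tokens, 2))  # Page two here.
```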

What is spaCy?

spaCy is an open-source Python library for natural language processing (NLP). It can be used for tasks such as tokenization, part-of-speech tagging, and named entity recognition. For more information, check out the excellent documentation at https://spacy.io.

Implementation Notes

spaCyPDFreader behaves a little differently from a typical spaCy custom component. Typically, a spaCy component receives and returns a spacy.tokens.Doc object.

spaCyPDFreader breaks this convention because the text must first be extracted from the PDF. Instead, pdf_reader takes a path to a PDF file and a spacy.Language object as parameters and returns a spacy.tokens.Doc object. This gives you an easy way to extract text from PDF files, and because you pass in the spacy.Language object yourself, you can still use and customize all of the features spaCy has to offer.

Example of a "traditional" spaCy pipeline component, negspaCy:

import spacy
from negspacy.negation import Negex

nlp = spacy.load("en_core_web_sm")
nlp.add_pipe("negex", config={"ent_types": ["PERSON", "ORG"]})
doc = nlp("She does not like Steve Jobs but likes Apple products.")

Example of spaCyPDFreader usage:

import spacy

from spacypdfreader import pdf_reader

nlp = spacy.load("en_core_web_sm")

doc = pdf_reader("tests/data/test_pdf_01.pdf", nlp)

Note that nlp.add_pipe is not used to add spaCyPDFreader to the pipeline.
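
The difference between the two call patterns can be sketched without spaCy at all. In the toy model below, ToyDoc, ToyLanguage, word_count_component, toy_pdf_reader, and fake_extract are all hypothetical stand-ins: a conventional component maps a document to a document, while a reader in the style of pdf_reader starts from a file path, extracts the text, and only then hands it to the language object.

```python
from dataclasses import dataclass, field
from typing import Callable


@dataclass
class ToyDoc:
    """Stand-in for spacy.tokens.Doc."""
    text: str
    meta: dict = field(default_factory=dict)


@dataclass
class ToyLanguage:
    """Stand-in for spacy.Language: builds a doc, then runs each component."""
    components: list = field(default_factory=list)

    def __call__(self, text: str) -> ToyDoc:
        doc = ToyDoc(text)
        for component in self.components:
            doc = component(doc)
        return doc


def word_count_component(doc: ToyDoc) -> ToyDoc:
    """Conventional component: receives a doc and returns a doc."""
    doc.meta["n_words"] = len(doc.text.split())
    return doc


def toy_pdf_reader(path: str, nlp: ToyLanguage, extract: Callable[[str], str]) -> ToyDoc:
    """Reader pattern: path in, doc out; extraction happens before the pipeline."""
    doc = nlp(extract(path))
    doc.meta["pdf_file_name"] = path
    return doc


def fake_extract(path: str) -> str:
    """Pretend text extraction so the sketch runs without a PDF."""
    return "text extracted from a pdf"


nlp = ToyLanguage(components=[word_count_component])
doc = toy_pdf_reader("example.pdf", nlp, fake_extract)
print(doc.meta)  # {'n_words': 5, 'pdf_file_name': 'example.pdf'}
```

Because the reader accepts the language object as an argument, any components the caller has configured still run on the extracted text.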
