A PDF to text extraction pipeline component for spaCy.

spacypdfreader

Extract text from PDFs using spaCy and capture the page number as a spaCy extension.

Installation

pip install spacypdfreader

Usage

>>> import spacy
>>> from spacypdfreader import pdf_reader
>>>
>>> nlp = spacy.load("en_core_web_sm")
>>> doc = pdf_reader("tests/data/test_pdf_01.pdf", nlp)
Extracting text from 4 pdf pages... ━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━ 100% 0:00:00

Each token now has an additional extension, ._.page_number, that gives the PDF page number the token came from.

>>> for token in doc[0:3]:
...     print(f"Token: `{token}`, page number {token._.page_number}")
...
Token: `Test`, page number 1
Token: `PDF`, page number 1
Token: `01`, page number 1
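
Because every token carries its page number, tokens can be regrouped by page. The following is a minimal sketch, not part of spacypdfreader: `tokens_by_page` and `fake_token` are hypothetical names, and the stand-in tokens only mimic the `text` and `._.page_number` attributes so the example runs without spaCy or a PDF on disk.

```python
from collections import defaultdict
from types import SimpleNamespace


def tokens_by_page(doc):
    """Group token texts by the page number stored in token._.page_number."""
    pages = defaultdict(list)
    for token in doc:
        pages[token._.page_number].append(token.text)
    return dict(pages)


def fake_token(text, page_number):
    """Hypothetical stand-in for a spaCy token (assumption for illustration)."""
    return SimpleNamespace(text=text, _=SimpleNamespace(page_number=page_number))


# With the real library, pass the Doc returned by pdf_reader instead.
fake_doc = [fake_token("Test", 1), fake_token("PDF", 1), fake_token("Page", 2)]
print(tokens_by_page(fake_doc))
```

The function only iterates over tokens and reads two attributes, so it should work the same way on a Doc or on a slice of one.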

Implementation Notes

spaCyPDFreader behaves a little differently from a typical spaCy custom component. Typically, a spaCy component receives and returns a spacy.tokens.Doc object.

spaCyPDFreader breaks this convention because the text must first be extracted from the PDF. Instead, pdf_reader takes a path to a PDF file and a spacy.Language object as parameters and returns a spacy.tokens.Doc object. This gives users an easy way to extract text from PDF files while still letting them use and customize all of spaCy's features through the spacy.Language object they pass in.

Example of a "traditional" spaCy pipeline component, negspaCy:

>>> import spacy
>>> from negspacy.negation import Negex
>>> 
>>> nlp = spacy.load("en_core_web_sm")
>>> nlp.add_pipe("negex", config={"ent_types":["PERSON","ORG"]})
>>> 
>>> doc = nlp("She does not like Steve Jobs but likes Apple products.")

Example of spaCyPDFreader usage:

>>> import spacy
>>> from spacypdfreader import pdf_reader
>>>
>>> nlp = spacy.load("en_core_web_sm")
>>> doc = pdf_reader("tests/data/test_pdf_01.pdf", nlp)
Extracting text from 4 pdf pages... ━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━ 100% 0:00:00

Note that spaCyPDFreader does not use nlp.add_pipe.
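
The calling convention described above, where the caller hands the processing object to the reader instead of registering a component, can be illustrated without spaCy. The sketch below is not the library's implementation: `read_pages`, `page_texts`, and `tokenize` are hypothetical names, the page text is assumed to be extracted already, and `str.split` stands in for the `nlp` object.

```python
def read_pages(page_texts, tokenize):
    """Return (token, page_number) pairs, numbering pages from 1.

    page_texts: one string per page (assumed already extracted from a PDF).
    tokenize:   any callable that turns a string into tokens; it plays the
                role that the nlp object plays for pdf_reader.
    """
    tagged = []
    for page_number, text in enumerate(page_texts, start=1):
        for token in tokenize(text):
            tagged.append((token, page_number))
    return tagged


pages = ["Test PDF 01", "Second page"]
print(read_pages(pages, str.split))
```

Passing the tokenizer in as an argument keeps the caller in control of how text is processed, which is the same trade-off pdf_reader makes by accepting a spacy.Language object.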

API Reference

Functions

spacypdfreader.pdf_reader

Extract text from PDF files directly into a spacy.tokens.Doc object while capturing the page number of each token.

| Name | Type | Description |
| --- | --- | --- |
| pdf_path | str | Path to a PDF file. |
| nlp | spacy.Language | A spaCy Language object with a loaded pipeline, for example spacy.load("en_core_web_sm"). |
| RETURNS | spacy.tokens.Doc | A spaCy Doc object with the custom extension ._.page_number. |

Example

>>> import spacy
>>> from spacypdfreader import pdf_reader
>>>
>>> nlp = spacy.load("en_core_web_sm")
>>> doc = pdf_reader("tests/data/test_pdf_01.pdf", nlp)
Extracting text from 4 pdf pages... ━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━ 100% 0:00:00
>>> for token in doc[0:3]:
...     print(f"Token: `{token}`, page number {token._.page_number}")
...
Token: `Test`, page number 1
Token: `PDF`, page number 1
Token: `01`, page number 1

Extensions

spacypdfreader.pdf_reader returns a spacy.tokens.Doc object with the following custom extensions.

| Extension | Type | Description | Default |
| --- | --- | --- | --- |
| token._.page_number | int | The PDF page number from which the token was extracted. The first page is 1. | None |
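
Since the default is None, code that reads ._.page_number may want to handle tokens without a page number, presumably tokens that did not come from pdf_reader. A minimal sketch under that assumption follows; `page_span` and `fake_token` are hypothetical helpers, and the stand-in tokens replace real spaCy tokens so the example runs on its own.

```python
from types import SimpleNamespace


def page_span(tokens):
    """Return (first_page, last_page), or None if no token has a page number."""
    numbers = [t._.page_number for t in tokens if t._.page_number is not None]
    if not numbers:
        return None
    return min(numbers), max(numbers)


def fake_token(text, page_number):
    """Hypothetical stand-in for a spaCy token (assumption for illustration)."""
    return SimpleNamespace(text=text, _=SimpleNamespace(page_number=page_number))


mixed = [fake_token("a", 2), fake_token("b", None), fake_token("c", 5)]
print(page_span(mixed))
```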
