A PDF to text extraction pipeline component for spaCy.
spacypdfreader
Extract text from PDFs using spaCy and capture the page number as a spaCy extension.
Installation
pip install spacypdfreader
Usage
>>> import spacy
>>> from spacypdfreader import pdf_reader
>>>
>>> nlp = spacy.load("en_core_web_sm")
>>> doc = pdf_reader("tests/data/test_pdf_01.pdf", nlp)
Extracting text from 4 pdf pages... ━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━ 100% 0:00:00
Each token will now have an additional extension ._.page_number that indicates the PDF page number the token came from.
>>> for token in doc[0:3]:
...     print(f"Token: `{token}`, page number {token._.page_number}")
...
Token: `Test`, page number 1
Token: `PDF`, page number 1
Token: `01`, page number 1
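Because every token carries its page number, tokens can be regrouped by page after extraction. The following is a minimal sketch, not part of spacypdfreader: `group_tokens_by_page` and `fake_token` are hypothetical names, and the stand-in tokens only mimic the `token.text` and `token._.page_number` attributes so the snippet runs without a PDF or spaCy installed. The assumption is that the same function would accept the tokens of a `Doc` returned by `pdf_reader`.

```python
from collections import defaultdict
from types import SimpleNamespace


def group_tokens_by_page(tokens):
    """Map each page number to the list of token texts found on that page."""
    pages = defaultdict(list)
    for token in tokens:
        pages[token._.page_number].append(token.text)
    return dict(pages)


def fake_token(text, page_number):
    """Hypothetical stand-in exposing only `text` and `_.page_number`."""
    return SimpleNamespace(text=text, _=SimpleNamespace(page_number=page_number))


# Stand-in data; with the real library you would iterate over `doc` instead.
tokens = [fake_token("Test", 1), fake_token("PDF", 1), fake_token("02", 2)]
pages = group_tokens_by_page(tokens)
print(pages)
```

Keying the result on the page number keeps the helper independent of how the text was extracted.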
Implementation Notes
spaCyPDFreader behaves a little differently from a typical spaCy custom component. Typically, a spaCy component receives and returns a spacy.tokens.Doc object.
spaCyPDFreader breaks this convention because the text must first be extracted from the PDF. Instead, pdf_reader takes a path to a PDF file and a spacy.Language object as parameters and returns a spacy.tokens.Doc object. This gives users an easy way to extract text from PDF files while still letting them use and customize all of spaCy's features through the spacy.Language object they pass in.
Example of a "traditional" spaCy pipeline component, negspaCy:
>>> import spacy
>>> from negspacy.negation import Negex
>>>
>>> nlp = spacy.load("en_core_web_sm")
>>> nlp.add_pipe("negex", config={"ent_types":["PERSON","ORG"]})
>>>
>>> doc = nlp("She does not like Steve Jobs but likes Apple products.")
Example of spaCyPDFreader usage:
>>> import spacy
>>> from spacypdfreader import pdf_reader
>>>
>>> nlp = spacy.load("en_core_web_sm")
>>> doc = pdf_reader("tests/data/test_pdf_01.pdf", nlp)
Extracting text from 4 pdf pages... ━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━ 100% 0:00:00
Note that nlp.add_pipe is not used by spaCyPDFreader.
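The extract-first, tokenize-second flow described above can be illustrated with a small sketch. This is not the library's actual implementation: `PagedToken` and `tag_tokens_with_pages` are hypothetical names, the page texts are assumed to be already extracted, and plain whitespace splitting stands in for the spaCy pipeline. It only shows how page numbers, starting at 1, could be attached to tokens as each page is processed.

```python
from dataclasses import dataclass


@dataclass
class PagedToken:
    """Hypothetical token record: the token text plus its source page."""
    text: str
    page_number: int


def tag_tokens_with_pages(page_texts):
    """Split each page's text on whitespace and record the page of every token.

    Whitespace splitting is a stand-in for a real tokenizer; pages are
    numbered from 1 in the order they appear in `page_texts`.
    """
    tokens = []
    for page_number, text in enumerate(page_texts, start=1):
        for word in text.split():
            tokens.append(PagedToken(word, page_number))
    return tokens


# Assumed input: one string of already-extracted text per PDF page.
page_texts = ["Test PDF 01", "Second page"]
tokens = tag_tokens_with_pages(page_texts)
for token in tokens:
    print(token.text, token.page_number)
```

The input here is page text rather than an existing Doc, which is why a function of this shape cannot be registered as an ordinary Doc-in, Doc-out pipeline component.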
API Reference
Functions
spacypdfreader.pdf_reader
Extract text from PDF files directly into a spacy.tokens.Doc object while capturing the page number of each token.
Name | Type | Description |
---|---|---|
pdf_path | str | Path to a PDF file. |
nlp | spacy.Language | A spaCy Language object with a loaded pipeline. For example, spacy.load("en_core_web_sm"). |
RETURNS | spacy.tokens.Doc | A spaCy Doc object with the custom extension ._.page_number. |
Example
>>> import spacy
>>> from spacypdfreader import pdf_reader
>>>
>>> nlp = spacy.load("en_core_web_sm")
>>> doc = pdf_reader("tests/data/test_pdf_01.pdf", nlp)
Extracting text from 4 pdf pages... ━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━ 100% 0:00:00
>>> for token in doc[0:3]:
...     print(f"Token: `{token}`, page number {token._.page_number}")
...
Token: `Test`, page number 1
Token: `PDF`, page number 1
Token: `01`, page number 1
Extensions
When using spacypdfreader.pdf_reader, a spacy.tokens.Doc object with custom extensions is returned.
Extension | Type | Description | Default |
---|---|---|---|
token._.page_number | int | The PDF page number from which the token was extracted. The first page is 1. | None |
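Since the extension defaults to None, code that reads `._.page_number` may want to guard against tokens that have no page assigned. This is an assumption about tokens created outside `pdf_reader`; the source only states the default. A minimal sketch follows, with `page_range` and `fake_token` as hypothetical helpers and stand-in objects that mimic only the `_.page_number` attribute:

```python
from types import SimpleNamespace


def page_range(tokens):
    """Return (first_page, last_page) for the tokens, or None if no token has a page.

    Tokens whose page number is None (the extension default) are skipped.
    """
    numbers = [
        token._.page_number
        for token in tokens
        if token._.page_number is not None
    ]
    if not numbers:
        return None
    return min(numbers), max(numbers)


def fake_token(page_number):
    """Hypothetical stand-in exposing only `_.page_number`."""
    return SimpleNamespace(_=SimpleNamespace(page_number=page_number))


# Stand-in data; one token deliberately has no page number.
tokens = [fake_token(1), fake_token(None), fake_token(3)]
print(page_range(tokens))
```

Applied to a slice of a document, a helper like this reports which pages the slice covers.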