Information extraction and named-entity recognition for indexing PDFs
pdfner
Install NLP tools
- Download language-specific model data in spaCy
$ python -m spacy download en
- Download Stanford CoreNLP from https://stanfordnlp.github.io/CoreNLP/download.html and extract to {project root}/pdfner/tests/tools
Install OCRmyPDF
https://ocrmypdf.readthedocs.io/en/latest/installation.html
Installation
pip install pdfner
Usage
Processing a PDF
from typing import List
from pdfner import *
# Each page of the PDF is processed into its own NerDocument.
processed_pdf: List[NerDocument] = process_pdf('scanned.pdf', entities_detector=SpacyDetectEntities())
print(f"Extracted text: {processed_pdf[0].text}")
print(f"Detected entities: {processed_pdf[0].entities}")
Indexing with Elasticsearch
import simplejson as json
from elasticsearch import Elasticsearch
es = Elasticsearch()
# NerDocument implements a for_json method, so simplejson can serialize it directly.
doc: NerDocument
for doc in processed_pdf:
    res = es.index(index='pdfner', id=doc.id, body=json.dumps(doc, for_json=True))
    print(res['result'])
Indexing with Solr
import pysolr
# Collection "gettingstarted" auto created by: solr -c -e schemaless
solr = pysolr.Solr('http://localhost:8983/solr/gettingstarted', always_commit=True)
# encode() returns the NerDocument as a dict, which is the form pysolr requires.
solr.add([doc.encode() for doc in processed_pdf])
API
process_pdf
A function that converts a scanned PDF to a text-based PDF and applies the NER detector object to the extracted text. Returns a list of NerDocument objects, one per page.
- filepath: str - path to PDF file
- make_thumbnail: Optional[bool]=False - whether to create a thumbnail PNG for the first page
- cache_entities: Optional[bool]=False - whether to cache entities to the local filesystem
- parallelize_pages: Optional[bool]=True - whether to process multiple pages in parallel
- out_filepath: Optional[str]=None - optional location of resulting processed PDF
- entities_detector: AbstractDetectEntities - named argument for NER detector object (SpacyDetectEntities, CoreNlpDetectEntities)
- **kwargs - additional named arguments to attach to the returned NerDocument objects
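As a rough illustration of how the parameters above fit together, the sketch below wraps process_pdf in a small helper that spells out every documented option. The helper name process_scan, the "_ocr.pdf" suffix, and the metadata keys in the usage note are assumptions made for the example; only the parameter names come from the list above.

```python
from pathlib import Path


def process_scan(pdf_path: str, out_dir: str, **metadata):
    """Hypothetical helper: call process_pdf with each documented option set."""
    # Imported lazily so this module still loads where pdfner is not installed.
    from pdfner import process_pdf, SpacyDetectEntities

    # Assumed naming scheme for the processed file; adjust to taste.
    out_path = Path(out_dir) / (Path(pdf_path).stem + "_ocr.pdf")

    return process_pdf(
        pdf_path,
        make_thumbnail=True,        # thumbnail PNG for the first page
        cache_entities=True,        # cache entities on the local filesystem
        parallelize_pages=False,    # process pages one at a time
        out_filepath=str(out_path), # where the processed PDF is written
        entities_detector=SpacyDetectEntities(),
        **metadata,                 # extra named arguments attached to each NerDocument
    )
```

A call might look like process_scan('scanned.pdf', 'out', source='archive'), where source is an arbitrary extra named argument chosen for illustration.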
AbstractDetectEntities
Roll your own NER detector by subclassing AbstractDetectEntities and overriding detect_entities.
- detect_entities(text: str, **kwargs) - extracts entities from the input text and returns a list of NamedEntity objects
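A custom detector might look roughly like the sketch below, which tags e-mail addresses with a regular expression. To keep it runnable without pdfner, it defines stand-ins for AbstractDetectEntities and NamedEntity; the NamedEntity fields (text, label) and the EmailDetectEntities class are assumptions for illustration, since the real constructor is not documented here. With pdfner installed, the two stand-ins would be replaced by imports from the package.

```python
import re
from abc import ABC, abstractmethod
from dataclasses import dataclass
from typing import List


# Stand-in: the fields of pdfner's real NamedEntity are an assumption.
@dataclass
class NamedEntity:
    text: str
    label: str


# Stand-in mirroring the documented detect_entities(text, **kwargs) signature.
class AbstractDetectEntities(ABC):
    @abstractmethod
    def detect_entities(self, text: str, **kwargs) -> List[NamedEntity]:
        ...


class EmailDetectEntities(AbstractDetectEntities):
    """Hypothetical detector that reports e-mail addresses as entities."""

    _EMAIL = re.compile(r"[\w.+-]+@[\w-]+(?:\.[\w-]+)+")

    def detect_entities(self, text: str, **kwargs) -> List[NamedEntity]:
        # One NamedEntity per regex match, in order of appearance.
        return [NamedEntity(m.group(0), "EMAIL") for m in self._EMAIL.finditer(text)]


found = EmailDetectEntities().detect_entities(
    "Contact jane.doe@example.org or bob@example.com."
)
print([e.text for e in found])
```

An instance of such a subclass would then be passed to process_pdf through the entities_detector argument.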
NerDocument
A class representing a single page of a processed PDF.
Attributes
- id: str - auto-generated random UUID
- text: str - text extracted from PDF page
- page_number: int - PDF page number
- entities: List[str] - entities extracted from PDF text
- processed_location: str - location of processed PDF
- original_location: str - location of original PDF
- redacted_location: str - location of redacted PDF
- thumbnail_location: str - location of thumbnail PNG for first page of processed PDF
- **kwargs - additional named arguments to store with object
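Because each NerDocument carries both page_number and entities, the list returned by process_pdf can be folded into a simple entity-to-pages lookup. The sketch below shows one way to do that; entity_page_index is a hypothetical helper, and the SimpleNamespace objects are stand-ins for real NerDocument instances so the example runs without pdfner.

```python
from collections import defaultdict
from types import SimpleNamespace
from typing import Dict, Iterable, List


def entity_page_index(pages: Iterable) -> Dict[str, List[int]]:
    """Map each entity string to the page numbers on which it appears."""
    index = defaultdict(list)
    for page in pages:
        for entity in page.entities:
            key = str(entity)
            # Record each page once per entity, even if it occurs repeatedly.
            if page.page_number not in index[key]:
                index[key].append(page.page_number)
    return dict(index)


# Stand-in pages exposing only the two attributes the helper reads.
pages = [
    SimpleNamespace(page_number=1, entities=["Ada Lovelace", "London"]),
    SimpleNamespace(page_number=2, entities=["London", "London"]),
]
index = entity_page_index(pages)
print(index)
```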
Instance methods
- encode() - returns dict representation of object
- for_json() - for simplejson to serialize object to JSON
Class methods
- decode(d: Dict) - object_hook function for simplejson's loads function to deserialize JSON to a proper NerDocument object
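The encode, for_json, and decode methods together allow a round trip through JSON. The sketch below illustrates the pattern with a hypothetical stand-in class, PageRecord, that carries a few of the attributes listed above. It uses the standard library's json module so that it runs without extra dependencies; since that module has no for_json option, encode() is called explicitly. With simplejson and a real NerDocument, the equivalent calls would presumably be json.dumps(doc, for_json=True) and json.loads(payload, object_hook=NerDocument.decode).

```python
import json
from dataclasses import asdict, dataclass, field
from typing import Any, Dict, List


@dataclass
class PageRecord:
    """Hypothetical stand-in for NerDocument with a subset of its attributes."""

    id: str
    text: str
    page_number: int
    entities: List[str] = field(default_factory=list)

    def encode(self) -> Dict[str, Any]:
        # Plain dict representation of the object.
        return asdict(self)

    def for_json(self) -> Dict[str, Any]:
        # Hook that simplejson calls when for_json=True is passed to dumps.
        return self.encode()

    @classmethod
    def decode(cls, d: Dict[str, Any]) -> "PageRecord":
        # Used as object_hook: rebuilds the object from a decoded JSON dict.
        return cls(**d)


original = PageRecord(
    id="page-1",
    text="Ada Lovelace lived in London.",
    page_number=1,
    entities=["Ada Lovelace", "London"],
)
payload = json.dumps(original.encode())
restored = json.loads(payload, object_hook=PageRecord.decode)
print(restored == original)
```

The stand-in is flat on purpose: object_hook is invoked for every JSON object in the payload, so a class with nested dicts would need a decode that can tell them apart.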