Information extraction and named-entity recognition for indexing PDFs
pdfner
Install NLP tools
- Download language-specific model data in spaCy:
  $ python -m spacy download en
- Download Stanford CoreNLP from https://stanfordnlp.github.io/CoreNLP/download.html and extract it to {project root}/pdfner/tests/tools
Install OCRmyPDF
https://ocrmypdf.readthedocs.io/en/latest/installation.html
Installation
pip install pdfner
Usage
Processing a PDF
from typing import List
from pdfner import *
# Each page of the PDF is processed to an NerDocument.
processed_pdf: List[NerDocument] = process_pdf('scanned.pdf', entities_detector=SpacyDetectEntities())
print(f"Extracted text: {processed_pdf[0].text}")
print(f"Detected entities: {processed_pdf[0].entities}")
Indexing with Elasticsearch
import simplejson as json
from elasticsearch import Elasticsearch
es = Elasticsearch()
# NerDocument implements a for_json method, so simplejson can serialize it directly.
doc: NerDocument
for doc in processed_pdf:
    res = es.index(index='pdfner', id=doc.id, body=json.dumps(doc, for_json=True))
    print(res['result'])
Indexing with Solr
import pysolr
# Collection "gettingstarted" auto created by: solr -c -e schemaless
solr = pysolr.Solr('http://localhost:8983/solr/gettingstarted', always_commit=True)
# encode() returns the NerDocument as a dict, which is the form pysolr requires
solr.add([doc.encode() for doc in processed_pdf])
API
process_pdf
A function that converts a scanned PDF to a text-based PDF and applies the NER detector object to the text to extract entities. Returns a list of NerDocument objects.
- filepath: str - path to PDF file
- make_thumbnail: Optional[bool]=False - whether to create a thumbnail PNG for the first page
- cache_entities: Optional[bool]=False - whether to cache entities to the local filesystem
- parallelize_pages: Optional[bool]=True - whether to process multiple pages in parallel
- out_filepath: Optional[str]=None - optional location of resulting processed PDF
- entities_detector: AbstractDetectEntities - named argument for NER detector object (SpacyDetectEntities, CoreNlpDetectEntities)
- **kwargs - additional named arguments to attach to the returned NerDocument objects
AbstractDetectEntities
Roll your own NER detector by subclassing AbstractDetectEntities and overriding detect_entities.
- detect_entities(text: str, **kwargs) - extracts entities from the input text and returns a list of NamedEntity objects
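A minimal sketch of a custom detector, assuming the interface described above. The NamedEntity and AbstractDetectEntities definitions here are hypothetical stand-ins so the example runs without pdfner installed; the real classes may have different fields and constructor arguments. RegexDetectEntities and its EMAIL label are likewise illustrative names, not part of pdfner.

```python
import re
from abc import ABC, abstractmethod
from dataclasses import dataclass
from typing import List


@dataclass(frozen=True)
class NamedEntity:
    """Hypothetical stand-in; pdfner's NamedEntity may have other fields."""
    text: str
    label: str


class AbstractDetectEntities(ABC):
    """Hypothetical stand-in mirroring the interface described above."""

    @abstractmethod
    def detect_entities(self, text: str, **kwargs) -> List[NamedEntity]:
        ...


class RegexDetectEntities(AbstractDetectEntities):
    """Illustrative detector that tags e-mail addresses with a regex."""

    EMAIL = re.compile(r"[\w.+-]+@[\w-]+(?:\.[\w-]+)+")

    def detect_entities(self, text: str, **kwargs) -> List[NamedEntity]:
        # One NamedEntity per non-overlapping regex match, in document order.
        return [NamedEntity(m.group(0), "EMAIL") for m in self.EMAIL.finditer(text)]


detector = RegexDetectEntities()
found = detector.detect_entities("Contact alice@example.com or bob@example.org.")
print(found)
```

With the real library, an instance of such a subclass would presumably be passed to process_pdf through the entities_detector argument, in the same way as SpacyDetectEntities in the usage example.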
NerDocument
A class representing a single page of a processed PDF.
Attributes
- id: str - auto-generated random UUID
- text: str - text extracted from PDF page
- page_number: int - PDF page number
- entities: List[str] - entities extracted from PDF text
- processed_location: str - location of processed PDF
- original_location: str - location of original PDF
- redacted_location: str - location of redacted PDF
- thumbnail_location: str - location of thumbnail PNG for first page of processed PDF
- **kwargs - additional named arguments to store with object
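Because each page is a separate document, the page_number and entities attributes are enough to build a simple entity-to-pages lookup before sending anything to a search engine. The sketch below assumes entities is a list of strings, as listed above; Page is a hypothetical stand-in for NerDocument, and entity_page_index is an illustrative helper, not a pdfner function.

```python
from collections import defaultdict
from dataclasses import dataclass
from typing import Dict, List


@dataclass
class Page:
    """Hypothetical stand-in carrying two of the attributes listed above."""
    page_number: int
    entities: List[str]


def entity_page_index(pages: List[Page]) -> Dict[str, List[int]]:
    """Map each entity to the sorted page numbers it appears on."""
    index = defaultdict(set)
    for page in pages:
        for entity in page.entities:
            # A set drops repeats of the same entity on one page.
            index[entity].add(page.page_number)
    return {entity: sorted(nums) for entity, nums in index.items()}


pages = [
    Page(1, ["Acme Corp", "Jane Doe"]),
    Page(2, ["Jane Doe"]),
    Page(3, ["Acme Corp", "Acme Corp"]),
]
index = entity_page_index(pages)
print(index)
```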
Instance methods
- encode() - returns dict representation of object
- for_json() - for simplejson to serialize object to JSON
Class methods
- decode(d: Dict) - object_hook function for simplejson's loads function to deserialize JSON to a proper NerDocument object
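The encode, for_json, and decode methods together allow a JSON round trip. The sketch below illustrates the pattern with a hypothetical PageDocument stand-in that carries only a subset of the attributes; it is not the NerDocument implementation. To stay free of third-party dependencies it uses the standard-library json module rather than simplejson, so the for_json hook is wired in through the default argument instead of for_json=True.

```python
import json
import uuid
from dataclasses import asdict, dataclass, field
from typing import Dict, List


@dataclass
class PageDocument:
    """Hypothetical stand-in for NerDocument with a subset of its attributes."""
    text: str
    page_number: int
    entities: List[str] = field(default_factory=list)
    id: str = field(default_factory=lambda: str(uuid.uuid4()))

    def encode(self) -> Dict:
        # Plain dict representation of the object.
        return asdict(self)

    def for_json(self) -> Dict:
        # simplejson calls a method with this name when for_json=True is passed.
        return self.encode()

    @classmethod
    def decode(cls, d: Dict) -> "PageDocument":
        # Usable as an object_hook: rebuilds the object from its dict form.
        return cls(**d)


page = PageDocument(text="Acme Corp hired Jane Doe.", page_number=1,
                    entities=["Acme Corp", "Jane Doe"])

# Stdlib json has no for_json flag, so the hook is supplied via default=.
payload = json.dumps(page, default=lambda obj: obj.for_json())
restored = json.loads(payload, object_hook=PageDocument.decode)
print(restored == page)
```

One caveat of the object_hook approach: the hook is called for every JSON object in the payload, so it fits flat documents like this one; nested dicts would need a hook that recognizes which dicts to convert.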