Information extraction and named-entity recognition for indexing PDFs

pdfner

Install NLP tools

  1. Download the English model data for spaCy:
        $ python -m spacy download en

  2. Download Stanford CoreNLP from https://stanfordnlp.github.io/CoreNLP/download.html and extract it to {project root}/pdfner/tests/tools

Install OCRmyPDF

Follow the installation instructions at https://ocrmypdf.readthedocs.io/en/latest/installation.html

Installation

pip install pdfner

Usage

Processing a PDF

from typing import List
from pdfner import NerDocument, SpacyDetectEntities, process_pdf

# Each page of the PDF is processed into a NerDocument.
processed_pdf: List[NerDocument] = process_pdf('scanned.pdf', entities_detector=SpacyDetectEntities())
print(f"Extracted text: {processed_pdf[0].text}")
print(f"Detected entities: {processed_pdf[0].entities}")

Indexing with Elasticsearch

import simplejson as json
from elasticsearch import Elasticsearch
es = Elasticsearch()

# NerDocument implements for_json(), so simplejson can serialize it directly.
doc: NerDocument
for doc in processed_pdf:
    res = es.index(index='pdfner', id=doc.id, body=json.dumps(doc, for_json=True))
    print(res['result'])

Indexing with Solr

import pysolr
# Collection "gettingstarted" auto created by: solr -c -e schemaless
solr = pysolr.Solr('http://localhost:8983/solr/gettingstarted', always_commit=True)

# encode() returns each NerDocument as a dict, which is the format pysolr requires.
solr.add([doc.encode() for doc in processed_pdf])

API

process_pdf

Converts a scanned PDF to a text-based PDF, then applies the NER detector object to the extracted text. Returns a list of NerDocument objects, one per page.

  • filepath: str - path to PDF file
  • make_thumbnail: Optional[bool]=False - whether to create a thumbnail PNG for the first page
  • cache_entities: Optional[bool]=False - whether to cache entities to the local filesystem
  • parallelize_pages: Optional[bool]=True - whether to process multiple pages in parallel
  • out_filepath: Optional[str]=None - optional location of resulting processed PDF
  • entities_detector: AbstractDetectEntities - named argument for NER detector object (SpacyDetectEntities, CoreNlpDetectEntities)
  • **kwargs - additional named arguments to attach to the returned NerDocument objects

AbstractDetectEntities

Roll your own NER detector by subclassing AbstractDetectEntities and overriding detect_entities.

  • detect_entities(text: str, **kwargs) - extracts entities from the input text and returns a list of NamedEntity objects
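
As a rough illustration of the subclassing pattern, the sketch below defines a regex-based detector for e-mail addresses. So that it runs without pdfner installed, it declares its own stand-ins for AbstractDetectEntities and NamedEntity. Those stand-ins, the EmailDetectEntities class, and the text/label fields are assumptions made for this example and may not match the package's real definitions.

```python
import re
from abc import ABC, abstractmethod
from dataclasses import dataclass
from typing import List


# Stand-ins so the sketch runs on its own. With pdfner installed, import
# AbstractDetectEntities from the package instead; the real NamedEntity
# type may take different fields than the two assumed here.
@dataclass
class NamedEntity:
    text: str
    label: str


class AbstractDetectEntities(ABC):
    @abstractmethod
    def detect_entities(self, text: str, **kwargs) -> List[NamedEntity]:
        ...


class EmailDetectEntities(AbstractDetectEntities):
    """Hypothetical detector that tags e-mail addresses using a regular expression."""

    EMAIL = re.compile(r"[A-Za-z0-9._%+-]+@[A-Za-z0-9.-]+\.[A-Za-z]{2,}")

    def detect_entities(self, text: str, **kwargs) -> List[NamedEntity]:
        # One NamedEntity per non-overlapping match, in order of appearance.
        return [NamedEntity(m.group(0), "EMAIL") for m in self.EMAIL.finditer(text)]


found = EmailDetectEntities().detect_entities(
    "Contact jane.doe@example.org or bob@example.com."
)
print([entity.text for entity in found])
```

A detector written this way would presumably be passed to process_pdf through the entities_detector argument, in the same way as SpacyDetectEntities in the usage example.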

NerDocument

A class representing a single page of a processed PDF.

Attributes
  • id: str - auto-generated random UUID
  • text: str - text extracted from PDF page
  • page_number: int - PDF page number
  • entities: List[str] - entities extracted from PDF text
  • processed_location: str - location of processed PDF
  • original_location: str - location of original PDF
  • redacted_location: str - location of redacted PDF
  • thumbnail_location: str - location of thumbnail PNG for first page of processed PDF
  • **kwargs - additional named arguments to store with object
Instance methods
  • encode() - returns dict representation of object
  • for_json() - for simplejson to serialize object to JSON
Class methods
  • decode(d: Dict) - object_hook function for simplejson's loads function to deserialize JSON to a proper NerDocument object
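
The way encode, for_json and decode fit together can be sketched with a small stand-in class. PageRecord below is hypothetical and only mirrors the method names listed above; it is not the NerDocument implementation. The sketch uses the standard-library json module so that it runs without extra dependencies. With simplejson, passing for_json=True to dumps would call for_json() automatically, in place of the default= hook used here.

```python
import json
import uuid
from typing import Any, Dict, Optional


class PageRecord:
    """Hypothetical stand-in for NerDocument, reduced to a few fields."""

    def __init__(self, text: str, page_number: int,
                 id: Optional[str] = None, **kwargs: Any):
        self.id = id or str(uuid.uuid4())  # random UUID unless one is supplied
        self.text = text
        self.page_number = page_number
        self.extra = kwargs  # additional named arguments stored with the object

    def encode(self) -> Dict[str, Any]:
        # Flat dict representation, suitable for clients that expect dicts.
        return {"id": self.id, "text": self.text,
                "page_number": self.page_number, **self.extra}

    def for_json(self) -> Dict[str, Any]:
        # simplejson looks for this method when dumps(..., for_json=True) is used.
        return self.encode()

    @classmethod
    def decode(cls, d: Dict[str, Any]) -> "PageRecord":
        # Used as an object_hook: receives every decoded JSON object as a dict.
        return cls(**d)


page = PageRecord("Invoice 42", page_number=1, source="scanned.pdf")

# The standard-library json module has no for_json flag, so for_json() is
# routed through default=, which is called for objects json cannot serialize.
payload = json.dumps(page, default=lambda obj: obj.for_json())
restored = json.loads(payload, object_hook=PageRecord.decode)

print(restored.encode() == page.encode())
```

Because an object_hook is invoked for every JSON object, including nested ones, a decode function along these lines only works as written for flat records; nested structures would need a check before constructing the object.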
