
Extracting content from specific address books

Project description

historical-text-extraction (hte)

Package to extract text from historical documents. The package is written for personal use.

Installation

Install the current release from PyPI:

pip install hte

Install the development version from GitHub:

pip install git+ssh://git@github.com/eirikberger/hte.git

Note that an SSH key is necessary for this approach to work.
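After either install route, it can be useful to confirm which version ended up in the environment. The following is a minimal sketch using only the Python standard library; the helper name installed_version is made up for illustration, and the distribution name "hte" is taken from the pip command above.

```python
from importlib.metadata import PackageNotFoundError, version


def installed_version(dist_name="hte"):
    """Return the installed version string of a distribution, or None if absent.

    installed_version is a hypothetical helper, not part of hte.
    """
    try:
        return version(dist_name)
    except PackageNotFoundError:
        # The distribution is not installed in the current environment.
        return None
```

Calling installed_version() returns None when the package is missing, which usually means pip installed into a different environment than the one running Python.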

Using it

Import the package

from hte import digitize

The basic setup is the following:

# Create a Book instance
book = digitize.Book("data/finnmark_1968.pdf", "books")

# Call the methods on the instance
book.CreateFolderStructure()
book.PdfImport(page_info=False, from_page=21, to_page=263)
book.Split(multiple_columns=True)
book.RunOCR(type="splits", export_image=False)
book.CombineCleanGroup(ocr_grouping=True, group_type='norway')
book.RegexStructure("norway")
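When several books need the same treatment, the sequence above could be wrapped in a small function. This is only a sketch: process_book is a hypothetical helper, not part of hte, and it assumes that the same value is passed as group_type and to RegexStructure, as in the "norway" example. Only the method names and arguments shown above are used.

```python
def process_book(book, from_page, to_page, country="norway"):
    """Run the steps from the basic setup on an existing Book object.

    process_book is a hypothetical helper, not part of hte. It assumes the
    same country label is used for both grouping and regex structuring;
    only "norway" appears in the package description.
    """
    book.CreateFolderStructure()
    # The page range differs per book, so it is passed in by the caller.
    book.PdfImport(page_info=False, from_page=from_page, to_page=to_page)
    book.Split(multiple_columns=True)
    book.RunOCR(type="splits", export_image=False)
    book.CombineCleanGroup(ocr_grouping=True, group_type=country)
    book.RegexStructure(country)
    return book
```

With this in place, process_book(digitize.Book("data/finnmark_1968.pdf", "books"), 21, 263) would make the same calls as the basic setup.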

Make sure to install the correct language package for Tesseract.

# Check languages already installed
tesseract --list-langs

# Languages available for installation
apt-cache search tesseract-ocr

# Install the Norwegian language pack
sudo apt-get install tesseract-ocr-nor
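The same check can be done from Python before starting a long OCR run. The sketch below calls tesseract --list-langs through subprocess and looks for the Norwegian language code; parse_langs and installed_langs are hypothetical helper names, and the parsing assumes the usual output of one header line starting with "List of" followed by one language code per line.

```python
import shutil
import subprocess


def parse_langs(output):
    """Extract language codes from the text printed by `tesseract --list-langs`.

    Assumes a header line starting with "List of" followed by one code per line.
    """
    langs = []
    for line in output.splitlines():
        line = line.strip()
        # Skip blank lines and the header line.
        if not line or line.lower().startswith("list of"):
            continue
        langs.append(line)
    return langs


def installed_langs():
    """Return installed Tesseract languages, or an empty list if tesseract is not on PATH."""
    if shutil.which("tesseract") is None:
        return []
    result = subprocess.run(
        ["tesseract", "--list-langs"],
        capture_output=True,
        text=True,
        check=False,
    )
    # Some Tesseract versions print the list to stderr, so both streams are read.
    # Any warning lines on stderr would end up in the result as well.
    return parse_langs(result.stdout + "\n" + result.stderr)
```

A check such as "nor" in installed_langs() can then guard the call to RunOCR.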

