Extracting content from specific address books
historical-text-extraction (hte)
Package to extract text from historical documents. The package is written for personal use.
Installation
Install the current release from the PyPI repository:
pip install hte
Install the development version from GitHub:
pip install git+ssh://git@github.com/eirikberger/hte.git
Note that an SSH key registered with GitHub is necessary for this approach to work.
Using it
Import the package:
from hte import digitize
The basic setup is the following:
# Create a Book instance
book = digitize.Book("data/finnmark_1968.pdf", "books")
# Run the processing steps on the instance
book.CreateFolderStructure()
book.PdfImport(page_info=False, from_page=21, to_page=263)
book.Split(multiple_columns=True)
book.RunOCR(type="splits", export_image=False)
book.CombineCleanGroup(ocr_grouping=True, group_type='norway')
book.RegexStructure("norway")
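To process several books the same way, the steps above could be wrapped in a small helper. This is only a sketch: `run_pipeline` and its `country` parameter are hypothetical names that are not part of hte, and it assumes that the methods should be called in the order shown above and that the same keyword arguments suit other books. Only "norway" is known from this README to be a valid value.

```python
def run_pipeline(book, from_page, to_page, country="norway"):
    """Run the README's processing steps, in the README's order.

    `book` is expected to be a digitize.Book instance. `country` is passed
    both as `group_type` and to RegexStructure (an assumption that mirrors
    the "norway" example above).
    """
    book.CreateFolderStructure()
    book.PdfImport(page_info=False, from_page=from_page, to_page=to_page)
    book.Split(multiple_columns=True)
    book.RunOCR(type="splits", export_image=False)
    book.CombineCleanGroup(ocr_grouping=True, group_type=country)
    book.RegexStructure(country)
    return book


# Usage, equivalent to the basic setup above:
#   from hte import digitize
#   book = digitize.Book("data/finnmark_1968.pdf", "books")
#   run_pipeline(book, from_page=21, to_page=263)
```

Passing the Book object in, rather than constructing it inside the helper, keeps the sketch independent of how the Book is created.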
Make sure to install the correct language package for Tesseract.
# Check languages already installed:
tesseract --list-langs
# Languages available for installation
apt-cache search tesseract-ocr
# Install the Norwegian language pack
sudo apt-get install tesseract-ocr-nor
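The language check can also be done from Python before starting a long OCR run. The sketch below is not part of hte; `parse_langs` and `installed_langs` are hypothetical helper names. It assumes that `tesseract --list-langs` prints a header line starting with "List of available languages" followed by one language code per line, and that some Tesseract versions write this list to stderr instead of stdout.

```python
import shutil
import subprocess


def parse_langs(output: str) -> list[str]:
    """Extract language codes from the text printed by `tesseract --list-langs`."""
    lines = [line.strip() for line in output.splitlines() if line.strip()]
    # Drop the header line, e.g. "List of available languages (3):".
    return [l for l in lines if not l.startswith("List of available languages")]


def installed_langs() -> list[str]:
    """Ask the local Tesseract installation which languages it has."""
    if shutil.which("tesseract") is None:
        raise RuntimeError("tesseract was not found on PATH")
    result = subprocess.run(
        ["tesseract", "--list-langs"],
        capture_output=True,
        text=True,
        check=True,
    )
    # Fall back to stderr when stdout is empty (assumed older-version behaviour).
    return parse_langs(result.stdout or result.stderr)


# Usage:
#   if "nor" not in installed_langs():
#       raise SystemExit("Install tesseract-ocr-nor before running the OCR step")
```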