Extracting content from specific address books
historical-text-extraction (hte)
Package to extract text from historical documents. The package is written for personal use.
Installation
Install the current release from PyPI:
pip install hte
Install the development version from GitHub:
pip install git+ssh://git@github.com/eirikberger/hte.git
Note that an SSH key is necessary for this approach to work.
Using it
Import the package
from hte import digitize
The basic setup is the following:
# Create a Book instance
book = digitize.Book("data/finnmark_1968.pdf", "books")
# Run the processing methods on the instance, in order
book.CreateFolderStructure()
book.PdfImport(page_info=False, from_page=21, to_page=263)
book.Split(multiple_columns=True)
book.RunOCR(type="splits", export_image=False)
book.CombineCleanGroup(ocr_grouping=True, group_type='norway')
book.RegexStructure("norway")
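When several books need the same treatment, the steps above can be wrapped in a small helper. The following is only a sketch: the function name `digitize_book` and its parameters are hypothetical, and it assumes the methods behave as in the basic setup and that the same name ("norway") is passed to both `CombineCleanGroup` and `RegexStructure`. The Book class is passed in as an argument so the helper itself does not depend on how the package is imported.

```python
def digitize_book(book_cls, pdf_path, out_dir, from_page, to_page, group_type="norway"):
    """Run the steps from the basic setup on one PDF (hypothetical helper).

    book_cls is expected to be digitize.Book; the method names and keyword
    arguments are copied from the basic setup above, nothing else is assumed.
    """
    book = book_cls(str(pdf_path), out_dir)
    book.CreateFolderStructure()
    book.PdfImport(page_info=False, from_page=from_page, to_page=to_page)
    book.Split(multiple_columns=True)
    book.RunOCR(type="splits", export_image=False)
    book.CombineCleanGroup(ocr_grouping=True, group_type=group_type)
    book.RegexStructure(group_type)
    return book


# Example call, equivalent to the basic setup:
#   from hte import digitize
#   digitize_book(digitize.Book, "data/finnmark_1968.pdf", "books", 21, 263)
```

Keeping the page range as explicit arguments makes it harder to reuse the range of one book for another by accident.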
Make sure to install the correct language package for Tesseract.
# Check languages already installed:
tesseract --list-langs
# Languages available for installation
apt-cache search tesseract-ocr
# Install the Norwegian language pack
sudo apt-get install tesseract-ocr-nor
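The language check can also be done from Python before starting a long OCR run. This is a minimal sketch that is not part of hte: the function names `parse_langs` and `installed_langs` are hypothetical, and it assumes the `tesseract` executable is on the PATH and that the first line of its output is a header starting with "List of".

```python
import shutil
import subprocess


def parse_langs(output):
    """Extract language codes from the text printed by `tesseract --list-langs`."""
    lines = [line.strip() for line in output.splitlines() if line.strip()]
    # Skip the header line, e.g. "List of available languages (3):"
    return [line for line in lines if not line.lower().startswith("list of")]


def installed_langs():
    """Return the installed Tesseract languages, or [] if tesseract is not found."""
    exe = shutil.which("tesseract")
    if exe is None:
        return []
    result = subprocess.run(
        [exe, "--list-langs"], capture_output=True, text=True, check=False
    )
    # Assumption: some Tesseract versions write the list to stderr instead of stdout.
    return parse_langs(result.stdout or result.stderr)


if "nor" not in installed_langs():
    print("Norwegian language pack not found; see the install command above.")
```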
Extracting headers
Start by converting the XML files to JSON. The XML files can be created with the free software labelImg.
import os
os.chdir('/home/eirikb/Desktop')
header = Headers('train', 2022)
header.runbbxConverting()
Then convert the JSON file to CSV.
convertFromJson('/home/eirikb/Desktop/training_xml/json-bbox', 'xml')
Finally, read the content of the boxes.
ReadBoxes('json-bbox/xml.csv', 'hordaland', 'train', print_images=True)
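For orientation, labelImg saves annotations as Pascal VOC XML by default, with one `object` element per box containing a `name` and a `bndbox`. The sketch below shows how such a file can be read with the standard library. It is an illustration only, not how hte converts the files: the function `read_voc_boxes`, the sample annotation, and the label "header" are all hypothetical.

```python
import xml.etree.ElementTree as ET

# Hypothetical annotation in the Pascal VOC layout written by labelImg.
SAMPLE_XML = """
<annotation>
  <filename>page_021.png</filename>
  <object>
    <name>header</name>
    <bndbox><xmin>40</xmin><ymin>12</ymin><xmax>310</xmax><ymax>58</ymax></bndbox>
  </object>
</annotation>
"""


def read_voc_boxes(xml_text):
    """Return one dict per bounding box found in a Pascal VOC annotation."""
    root = ET.fromstring(xml_text)
    boxes = []
    for obj in root.iter("object"):
        bnd = obj.find("bndbox")
        box = {"label": obj.findtext("name")}
        for key in ("xmin", "ymin", "xmax", "ymax"):
            # Coordinates are pixel positions; go through float to tolerate "40.0".
            box[key] = int(float(bnd.findtext(key)))
        boxes.append(box)
    return boxes


print(read_voc_boxes(SAMPLE_XML))
```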
File details
Details for the file hte-0.0.22.tar.gz.
File metadata
- Download URL: hte-0.0.22.tar.gz
- Upload date:
- Size: 15.4 kB
- Tags: Source
- Uploaded using Trusted Publishing? No
- Uploaded via: twine/4.0.1 CPython/3.9.15
File hashes
Algorithm | Hash digest
---|---
SHA256 | 7cab202a5bd8ddf4a7463a9b59f746308a8a02d5e315a8a0a6fc36260f2d123c
MD5 | cec0fe46f1f592bc60e7053f176565ee
BLAKE2b-256 | b342b728597dbdc63ea67c51b19955fcaa34b5aa0a3ca344818ec371f8a0c910
File details
Details for the file hte-0.0.22-py3-none-any.whl.
File metadata
- Download URL: hte-0.0.22-py3-none-any.whl
- Upload date:
- Size: 15.1 kB
- Tags: Python 3
- Uploaded using Trusted Publishing? No
- Uploaded via: twine/4.0.1 CPython/3.9.15
File hashes
Algorithm | Hash digest
---|---
SHA256 | 7e85145641348175798748c4eef6b24a10e756c89131bc2aca4dc3ee03af97ee
MD5 | 940a848bdff8f00d0e6d3eb2ac921617
BLAKE2b-256 | 87e6fa69296a4e37054f021bcc6fd3303ac952a1755793843362508448f058c6