Extracting content from specific address books
Project description
historical-text-extraction (hte)
Package to extract text from historical documents. The package is written for personal use.
Installation
Install the current release from the PyPI repository:
pip install hte
Or install the development version from GitHub:
pip install git+ssh://git@github.com/eirikberger/hte.git
Note that this approach requires an SSH key set up for GitHub.
Using it
Import the package.
from hte import digitize
The basic setup is the following:
# Create a Book object
book = digitize.Book("data/finnmark_1968.pdf", "books")
# Run the processing methods on the object
book.CreateFolderStructure()
book.PdfImport(page_info=False, from_page=21, to_page=263)
book.Split(multiple_columns=True)
book.RunOCR(type="splits", export_image=False)
book.CombineCleanGroup(ocr_grouping=True, group_type='norway')
book.RegexStructure("norway")
Make sure to install the correct language package for Tesseract.
# Check languages already installed:
tesseract --list-langs
# Languages available for installation
apt-cache search tesseract-ocr
# Install the Norwegian language pack
sudo apt-get install tesseract-ocr-nor
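The language check above can also be done from Python before starting a long OCR run. The following is a minimal sketch and not part of hte: the helper names parse_tesseract_langs and installed_tesseract_langs are hypothetical, and the only thing it relies on is the tesseract --list-langs command shown above.

```python
import shutil
import subprocess


def parse_tesseract_langs(output: str) -> set:
    """Parse the text printed by `tesseract --list-langs` into language codes."""
    langs = set()
    for line in output.splitlines():
        line = line.strip()
        # Skip blank lines and the header line, which ends with a colon
        # (for example "List of available languages (3):").
        if not line or line.endswith(":"):
            continue
        langs.add(line)
    return langs


def installed_tesseract_langs() -> set:
    """Return installed language codes, or an empty set if Tesseract is missing."""
    if shutil.which("tesseract") is None:
        return set()
    result = subprocess.run(
        ["tesseract", "--list-langs"],
        capture_output=True,
        text=True,
        check=False,
    )
    # Some Tesseract versions print the list to stderr, so read both streams.
    return parse_tesseract_langs(result.stdout + "\n" + result.stderr)


if __name__ == "__main__":
    if "nor" not in installed_tesseract_langs():
        print("Norwegian language pack not found; install tesseract-ocr-nor.")
```

Keeping the parsing separate from the subprocess call makes the check easy to try on a pasted copy of the command output.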
Extracting headers
Start by converting XML files in the Pascal VOC format to JSON. These files can be created with the free software labelImg.
import os
import hte.headers as hteheaders
os.chdir('/home/eirikb/Desktop')
header = hteheaders.Headers('train', 2022)
header.runbbxConverting()
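To give an idea of what a Pascal VOC to JSON conversion involves, here is a minimal standard-library sketch. It is not the code behind runbbxConverting: the function name voc_xml_to_records, the output keys, and the sample file name and label are all assumptions made for illustration. Only the element names (annotation, filename, object, name, bndbox, xmin, ymin, xmax, ymax) come from the Pascal VOC format that labelImg writes.

```python
import json
import xml.etree.ElementTree as ET

BOX_KEYS = ("xmin", "ymin", "xmax", "ymax")


def voc_xml_to_records(xml_text: str) -> list:
    """Turn one Pascal VOC annotation into a list of bounding-box dicts."""
    root = ET.fromstring(xml_text)
    filename = root.findtext("filename", default="")
    records = []
    for obj in root.iter("object"):
        box = obj.find("bndbox")
        if box is None:
            continue  # skip objects that have no bounding box
        record = {"filename": filename, "label": obj.findtext("name", default="")}
        for key in BOX_KEYS:
            # Coordinates may be written as "34" or "34.0", so go through float.
            record[key] = int(float(box.findtext(key)))
        records.append(record)
    return records


# Hypothetical annotation; the file name and label are made up.
SAMPLE = """<annotation>
  <filename>page_021.jpg</filename>
  <object>
    <name>header</name>
    <bndbox><xmin>34</xmin><ymin>50</ymin><xmax>410</xmax><ymax>92</ymax></bndbox>
  </object>
</annotation>"""

if __name__ == "__main__":
    print(json.dumps(voc_xml_to_records(SAMPLE), indent=2))
```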
Then convert the JSON file to CSV.
hteheaders.convertFromJson('/home/eirikb/Desktop/training_xml/json-bbox', 'xml')
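The JSON-to-CSV step can be pictured with the sketch below. It does not reproduce convertFromJson: the record layout (a JSON list of objects with filename, label, and four box coordinates) and the function name json_boxes_to_csv are assumptions, and the real files produced by hte may be structured differently.

```python
import csv
import io
import json

# Assumed column layout; hte's actual CSV columns may differ.
FIELDS = ["filename", "label", "xmin", "ymin", "xmax", "ymax"]


def json_boxes_to_csv(json_text: str) -> str:
    """Convert a JSON list of bounding-box records to CSV text."""
    records = json.loads(json_text)
    buffer = io.StringIO()
    writer = csv.DictWriter(buffer, fieldnames=FIELDS, lineterminator="\n")
    writer.writeheader()
    writer.writerows(records)  # one CSV row per bounding box
    return buffer.getvalue()


if __name__ == "__main__":
    # Hypothetical record; the file name and label are made up.
    sample = json.dumps([
        {"filename": "page_021.jpg", "label": "header",
         "xmin": 34, "ymin": 50, "xmax": 410, "ymax": 92},
    ])
    print(json_boxes_to_csv(sample))
```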
Finally, read the content of the boxes.
hteheaders.ReadBoxes('json-bbox/xml.csv', 'hordaland', 'train', print_images=True)
Downloading more content
These functions create a list of relevant content from the API of the Norwegian National Library and then download a high-resolution PDF version.
import os
import hte.NB as NB
os.chdir('/home/eirikb/Desktop/')
ListOfBooks = NB.ListNB(1920, 1930, 'skatteligning', 'digitidsskrift')
# Keep only the first entry in the list
ListOfBooks = ListOfBooks[0:1]