Data pipelines for extraction, transformation and visualization of architectural visuals in Python.
VisArchPy
Data pipelines for extraction, transformation and visualization of architectural visuals in Python. It extracts images embedded in PDF files, collects relevant metadata, and extracts visual features using the DinoV2 model.
Main Features
Extraction pipelines
- Layout: pipeline for extracting metadata and visuals (images) from PDF files using a layout analysis. Layout analysis recursively checks elements in the PDF file and sorts them into images, text, and other elements.
- OCR: pipeline for extracting metadata and visuals from PDF files using OCR analysis. OCR analysis extracts images from PDF files using Tesseract OCR.
- LayoutOCR: pipeline for extracting metadata and visuals from PDF files that combines layout and OCR analysis.
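The recursive sorting step described for the Layout pipeline can be illustrated with a small, self-contained sketch. The `Element` class and the `sort_elements` function are hypothetical stand-ins invented for this example; they are not part of VisArchPy, and the real pipeline works on actual PDF layout objects.

```python
from dataclasses import dataclass, field


@dataclass
class Element:
    """Hypothetical stand-in for a PDF layout element (not a VisArchPy class)."""
    kind: str  # e.g. "image", "text", "container", "line"
    children: list["Element"] = field(default_factory=list)


def sort_elements(element, found=None):
    """Recursively walk an element tree and group its leaves by type."""
    if found is None:
        found = {"images": [], "texts": [], "others": []}
    if element.kind == "image":
        found["images"].append(element)
    elif element.kind == "text":
        found["texts"].append(element)
    elif element.children:
        # Containers (pages, groups) are descended into, not stored.
        for child in element.children:
            sort_elements(child, found)
    else:
        found["others"].append(element)
    return found


# A toy page: one text block, a nested group, and a top-level image.
page = Element("container", [
    Element("text"),
    Element("container", [Element("image"), Element("line")]),
    Element("image"),
])
result = sort_elements(page)
print({name: len(items) for name, items in result.items()})
# prints {'images': 2, 'texts': 1, 'others': 1}
```

The nested image is found because every container is searched before anything is classified as "other".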
Metadata Extraction
- Extraction of metadata of extracted images (document page, image size)
- Extraction of captions of images based on proximity to images and text-analysis using keywords.
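As a rough illustration of caption matching by proximity and keywords, the sketch below grows an image's bounding box by an offset in one direction and keeps the text boxes that fall inside it and start with a keyword. Everything here is an assumption made for the example: the `Box` class, the `find_captions` function, the coordinate system (y grows downwards), requiring both conditions at once, and any direction other than "down". VisArchPy's own matching logic may differ.

```python
from dataclasses import dataclass


@dataclass
class Box:
    """Axis-aligned box in page coordinates, y growing downwards (assumption)."""
    x0: float
    y0: float
    x1: float
    y1: float
    text: str = ""


def find_captions(image, text_boxes, offset, direction="down",
                  keywords=("figure", "caption", "figuur")):
    """Return text boxes near `image` whose text starts with a keyword."""
    # Grow the image box by `offset` on the side where a caption is expected.
    search = Box(image.x0, image.y0, image.x1, image.y1)
    if direction == "down":
        search.y1 += offset
    elif direction == "up":
        search.y0 -= offset
    elif direction == "right":
        search.x1 += offset
    elif direction == "left":
        search.x0 -= offset
    else:
        raise ValueError(f"unknown direction: {direction}")

    def overlaps(a, b):
        return a.x0 < b.x1 and b.x0 < a.x1 and a.y0 < b.y1 and b.y0 < a.y1

    return [
        box for box in text_boxes
        if overlaps(search, box)
        and box.text.strip().lower().startswith(tuple(keywords))
    ]


image = Box(100, 100, 300, 250)
texts = [
    Box(100, 255, 300, 270, "Figure 3: Ground floor plan"),          # near, keyword
    Box(100, 255, 300, 270, "The building was completed in 1931."),  # near, no keyword
    Box(100, 600, 300, 615, "Figure 4: Section"),                    # keyword, too far
]
captions = find_captions(image, texts, offset=12)
print([box.text for box in captions])
# prints ['Figure 3: Ground floor plan']
```

The offset is expressed in the same units as the boxes; converting from "mm" or "px" is left out of the sketch.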
Transformation utilities
- Dino: pipeline for transforming images into visual features using the self-supervised DinoV2 model.
Visualization utilities
- Viz: a utility to create a bounding box plot. This plot provides an overview of the shapes and sizes of images in a data set.
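To give an idea of what a bounding box plot can look like, here is a minimal sketch that overlays one rectangle per image, all centred on the same point, so that common shapes and sizes stand out. This is not the Viz utility itself: the function names are hypothetical, the centred layout is an assumption, and the plotting part assumes matplotlib is installed.

```python
def centered_boxes(sizes):
    """Turn (width, height) pairs into rectangles centred on the origin."""
    return [(-w / 2, -h / 2, w / 2, h / 2) for w, h in sizes]


def plot_boxes(sizes, path):
    """Draw all rectangles on one set of axes and save the figure to `path`."""
    # Imported here so that centered_boxes() works without matplotlib.
    import matplotlib
    matplotlib.use("Agg")  # render to a file, no display needed
    import matplotlib.pyplot as plt
    from matplotlib.patches import Rectangle

    fig, ax = plt.subplots()
    for x0, y0, x1, y1 in centered_boxes(sizes):
        ax.add_patch(Rectangle((x0, y0), x1 - x0, y1 - y0,
                               fill=False, alpha=0.5))
    ax.autoscale_view()
    ax.set_aspect("equal")
    ax.set_xlabel("width (px)")
    ax.set_ylabel("height (px)")
    fig.savefig(path)
    plt.close(fig)


# Example image sizes in pixels (made-up values for illustration).
sizes = [(200, 100), (640, 480), (120, 300)]
print(centered_boxes(sizes)[0])
# prints (-100.0, -50.0, 100.0, 50.0)
```

Calling `plot_boxes(sizes, "boxes.png")` would write the overlay to a PNG file.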
Requirements
- Python 3.10 or newer
- Tesseract v4.0 or newer
- PyTorch v2.1 or newer
Installation
After installing the requirements, install VisArchPy using pip:
pip install visarchpy
Installing from source
- Clone the repository.
git clone https://github.com/AiDAPT-A/VisArchPy.git
- Go to the root of the repository.
cd VisArchPy/
- Install the package using pip.
pip install .
Usage
VisArchPy provides a command line interface to access its functionality. If you want to use VisArchPy as a Python package, consult the documentation.
- To access the main CLI program:
visarch -h
- To access a particular pipeline:
visarch [PIPELINE] [SUBCOMMAND]
For example, to run the layout pipeline using a single PDF file, do the following:
visarch layout from-file <path-to-pdf-file> <path-output-directory>
Use visarch [PIPELINE] [SUBCOMMAND] -h for help.
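If several PDF files need processing, the `visarch layout from-file` command shown above can be driven from a short Python script. The sketch below is only one way to do this: the helper names are hypothetical, and giving each PDF its own output subdirectory is an assumption made to keep runs separate, not a requirement stated by VisArchPy.

```python
import subprocess
from pathlib import Path


def layout_command(pdf_path, output_dir):
    """Build the argument list for `visarch layout from-file`."""
    return ["visarch", "layout", "from-file", str(pdf_path), str(output_dir)]


def run_layout_on_folder(pdf_dir, output_root):
    """Run the layout pipeline once per PDF file found in `pdf_dir`."""
    for pdf_path in sorted(Path(pdf_dir).glob("*.pdf")):
        # One output subdirectory per PDF (assumption, see lead-in).
        output_dir = Path(output_root) / pdf_path.stem
        output_dir.mkdir(parents=True, exist_ok=True)
        # check=True raises CalledProcessError if visarch exits with an error.
        subprocess.run(layout_command(pdf_path, output_dir), check=True)


print(layout_command("plans.pdf", "results"))
# prints ['visarch', 'layout', 'from-file', 'plans.pdf', 'results']
```

Calling `run_layout_on_folder("pdfs", "results")` requires the `visarch` program to be installed and available on the PATH.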
Results:
Results from the data extraction pipelines (Layout, OCR, LayoutOCR) are saved to the output directory. Results are organized as follows:
00000/ # results directory
├── pdf-001 # directory where images are saved to. One per PDF file
├── 00000-metadata.csv # extracted metadata as CSV
├── 00000-metadata.json # extracted metadata as JSON
├── 00000-settings.json # settings used by pipeline
└── 00000.log # log file
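The files in a results directory can be read back with the Python standard library. The sketch below assumes that the file-name prefix always equals the name of the results directory, as in the listing above; the `load_results` function is a hypothetical helper, and no assumption is made about which columns or keys the metadata files contain.

```python
import csv
import json
from pathlib import Path


def load_results(results_dir):
    """Load metadata and settings from one results directory."""
    results_dir = Path(results_dir)
    run_id = results_dir.name  # e.g. "00000", assumed to be the file prefix

    csv_path = results_dir / f"{run_id}-metadata.csv"
    with open(csv_path, newline="", encoding="utf-8") as f:
        rows = list(csv.DictReader(f))  # one dict per CSV row
    with open(results_dir / f"{run_id}-metadata.json", encoding="utf-8") as f:
        metadata = json.load(f)
    with open(results_dir / f"{run_id}-settings.json", encoding="utf-8") as f:
        settings = json.load(f)

    # Each subdirectory holds the images extracted from one PDF file.
    image_dirs = sorted(p for p in results_dir.iterdir() if p.is_dir())
    return {
        "rows": rows,
        "metadata": metadata,
        "settings": settings,
        "image_dirs": image_dirs,
    }
```

For example, `load_results("output/00000")["image_dirs"]` would list the per-PDF image directories, where `output/00000` is a placeholder path.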
Settings
The pipeline's settings determine how visuals are extracted from PDF files. Settings must be passed as a JSON file on the CLI. The file must include all items listed below. The values shown below are the defaults.
{
    "layout": { # settings for layout analysis
        "caption": {
            "offset": [ # distance used to locate captions
                4,
                "mm"
            ],
            "direction": "down", # direction used to locate captions
            "keywords": [ # keywords used to find captions based on text analysis
                "figure",
                "caption",
                "figuur"
            ]
        },
        "image": { # images smaller than these dimensions will be ignored
            "width": 120,
            "height": 120
        }
    },
    "ocr": { # settings for OCR analysis
        "caption": {
            "offset": [
                50,
                "px"
            ],
            "direction": "down",
            "keywords": [
                "figure",
                "caption",
                "figuur"
            ]
        },
        "image": {
            "width": 120,
            "height": 120
        },
        "resolution": 250, # dpi to convert PDF pages to images before OCR
        "resize": 30000 # total pixels. Larger OCR inputs are downsized to this before OCR
    }
}
When no settings are passed to a pipeline, the defaults are used. To print the default settings to the terminal use:
visarch [PIPELINE] settings
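Standard JSON does not allow comments, so the `#` annotations in the listing above presumably serve as explanation only and should not appear in an actual settings file. One way to produce a valid file is to build the settings as a Python dictionary and write it with the `json` module. The sketch below copies the default values from the listing; the helper names `make_settings` and `save_settings` are hypothetical.

```python
import copy
import json

# Default values copied from the annotated listing in this section.
DEFAULT_SETTINGS = {
    "layout": {
        "caption": {
            "offset": [4, "mm"],
            "direction": "down",
            "keywords": ["figure", "caption", "figuur"],
        },
        "image": {"width": 120, "height": 120},
    },
    "ocr": {
        "caption": {
            "offset": [50, "px"],
            "direction": "down",
            "keywords": ["figure", "caption", "figuur"],
        },
        "image": {"width": 120, "height": 120},
        "resolution": 250,
        "resize": 30000,
    },
}


def make_settings(min_width, min_height):
    """Return a complete settings dict with a different minimum image size."""
    settings = copy.deepcopy(DEFAULT_SETTINGS)  # keep the defaults untouched
    for analysis in ("layout", "ocr"):
        settings[analysis]["image"] = {"width": min_width, "height": min_height}
    return settings


def save_settings(settings, path):
    """Write the settings as comment-free JSON."""
    with open(path, "w", encoding="utf-8") as f:
        json.dump(settings, f, indent=4)
```

Because all items must be present in the file, the sketch starts from the full set of defaults and changes only selected values. The option used to pass the file to a pipeline is not shown here; `visarch [PIPELINE] [SUBCOMMAND] -h` lists the available options.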
Citation
Please cite this software as follows:
Garcia Alvarez, M. G., Khademi, S., & Pohl, D. (2023). VisArchPy [Computer software]. https://github.com/AiDAPT-A/VisArchPy
Acknowledgements
- AeoLiS is supported by the Digital Competence Centre, Delft University of Technology.
- Research Data Services, Delft University of Technology, The Netherlands.