
Data pipelines for extraction, transformation and visualization of architectural visuals in Python.


License: MIT

VisArchPy

Data pipelines for extraction, transformation and visualization of architectural visuals in Python. VisArchPy extracts images embedded in PDF files, collects relevant metadata, and extracts visual features using the DinoV2 model. We aim to make this package an AI-powered tool with features for recognizing different types of architectural visuals (types of buildings, structures, etc.). The package is still in development, and we are working on adding more features and improving the existing ones. If you have any suggestions or questions, please open an issue in our GitHub repository: https://github.com/AiDAPT-A/VisArchPy/issues

Main Features

Extraction pipelines

  • Layout: pipeline for extracting metadata and visuals (images) from PDF files using a layout analysis. Layout analysis recursively checks elements in the PDF file and sorts them into images, text, and other elements.
  • OCR: pipeline for extracting metadata and visuals from PDF files using OCR analysis. OCR analysis extracts images from PDF files using Tesseract OCR.
  • LayoutOCR: pipeline for extracting metadata and visuals from PDF files that combines layout and OCR analysis.
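
The recursive sorting step described for the Layout pipeline can be illustrated with a minimal sketch. The `Element` class and `sort_elements` function below are hypothetical and are not part of VisArchPy; they only assume that a PDF page can be represented as a tree of elements, each tagged with a kind.

```python
from dataclasses import dataclass, field


@dataclass
class Element:
    """Hypothetical stand-in for a node in a PDF layout tree."""
    kind: str  # e.g. "image", "text", "container", "line"
    children: list = field(default_factory=list)


def sort_elements(element, buckets=None):
    """Walk the tree recursively and sort leaf elements into three buckets."""
    if buckets is None:
        buckets = {"images": [], "text": [], "other": []}
    if element.children:
        # Containers are not collected themselves; descend into their children.
        for child in element.children:
            sort_elements(child, buckets)
    elif element.kind == "image":
        buckets["images"].append(element)
    elif element.kind == "text":
        buckets["text"].append(element)
    else:
        buckets["other"].append(element)
    return buckets
```

Calling `sort_elements` on the root element of a page returns a dictionary with the images, text, and remaining elements found anywhere in the tree.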

Metadata Extraction

  • Extraction of metadata of extracted images (document page, image size)
  • Extraction of captions of images based on proximity to images and text-analysis using keywords.
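
As a rough illustration of caption matching by proximity and keywords, the sketch below assumes bounding boxes given as `(x0, y0, x1, y1)` tuples with `y` increasing downward, and a caption located below the image. All function names are hypothetical, and the way the two criteria are combined here (proximity first, keyword matches ranked first) is an assumption, not a description of VisArchPy's implementation.

```python
def is_near_below(image_box, text_box, offset):
    """True if the text box starts within `offset` below the image and overlaps it horizontally."""
    ix0, iy0, ix1, iy1 = image_box
    tx0, ty0, tx1, ty1 = text_box
    overlaps_horizontally = tx0 < ix1 and tx1 > ix0
    gap = ty0 - iy1  # vertical distance from image bottom to text top
    return overlaps_horizontally and 0 <= gap <= offset


def has_keyword(text, keywords):
    """Case-insensitive check for any caption keyword in the text."""
    lowered = text.lower()
    return any(keyword in lowered for keyword in keywords)


def find_captions(image_box, text_blocks, offset, keywords):
    """Return texts near the image, with keyword matches listed first.

    `text_blocks` is a list of (box, text) pairs.
    """
    near = [text for box, text in text_blocks
            if is_near_below(image_box, box, offset)]
    # sorted() is stable, so the original order is kept within each group.
    return sorted(near, key=lambda text: not has_keyword(text, keywords))
```

Text blocks that are far from the image are discarded even if they contain a keyword; other combinations of the two criteria are equally plausible.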

Transformation utilities

  • Dino: pipeline for transforming images into visual features using the self-supervised DinoV2 model.

Visualization utilities

  • Viz: a utility to create a bounding box plot. This plot provides an overview of the shapes and sizes of images in a data set.

    Example Bbox plot

Dependencies

Installation

After installing the dependencies, install VisArchPy using pip.

pip install visarchpy

Installing from source

  1. Clone the repository.

    git clone https://github.com/AiDAPT-A/VisArchPy.git
    
  2. Go to the root of the repository.

    cd VisArchPy/
    
  3. Install the package using pip.

    pip install .
    

Developers who intend to modify the source code can install additional dependencies for testing and documentation as follows.

  1. Go to the root directory of the repository, VisArchPy/

  2. Run:

pip install -e .[dev]

Usage

VisArchPy provides a command line interface to access its functionality. If you want to use VisArchPy as a Python package, consult the documentation.

  1. To access the CLI:

    visarch -h

  2. To access a particular pipeline:

    visarch [PIPELINE] [SUBCOMMAND]

For example, to run the layout pipeline using a single PDF file, do the following:

visarch layout from-file <path-to-pdf-file> <path-output-directory>

Use visarch [PIPELINE] [SUBCOMMAND] -h for help.
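
To call the CLI from a Python script, one option is to wrap the command shown above with the standard `subprocess` module. This is a minimal sketch, assuming the `visarch` executable is on the `PATH`; the function names are hypothetical and only the `layout from-file` invocation documented above is used.

```python
import subprocess


def layout_from_file_command(pdf_path, output_dir):
    """Build the argument list for `visarch layout from-file`."""
    return ["visarch", "layout", "from-file", str(pdf_path), str(output_dir)]


def run_layout(pdf_path, output_dir):
    """Run the layout pipeline on one PDF; raises CalledProcessError on failure."""
    command = layout_from_file_command(pdf_path, output_dir)
    return subprocess.run(command, check=True)
```

Passing the arguments as a list avoids shell quoting problems when paths contain spaces.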

Results

Results from the data extraction pipelines (Layout, OCR, LayoutOCR) are saved to the output directory. Results are organized as follows:

00000/  # results directory
├── pdf-001  # directory where images are saved to. One per PDF file
├── 00000-metadata.csv  # extracted metadata as CSV
├── 00000-metadata.json  # extracted metadata as JSON
├── 00000-settings.json  # settings used by pipeline
└── 00000.log  # log file
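
The files in a results directory can be located from Python using the naming pattern shown above. The helper below is a hypothetical sketch, not part of VisArchPy; it relies only on the file suffixes in the listing and makes no assumption about the contents of the metadata files.

```python
from pathlib import Path


def collect_results(results_dir):
    """Group the contents of a results directory by the naming pattern above."""
    root = Path(results_dir)
    return {
        "metadata_csv": sorted(root.glob("*-metadata.csv")),
        "metadata_json": sorted(root.glob("*-metadata.json")),
        "settings": sorted(root.glob("*-settings.json")),
        "logs": sorted(root.glob("*.log")),
        # One sub-directory of images per processed PDF file.
        "image_dirs": sorted(p for p in root.iterdir() if p.is_dir()),
    }
```

For the example tree above, `collect_results("00000")["image_dirs"]` would contain the path to `pdf-001`.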

Settings

The pipeline's settings determine how visual extraction from PDF files is performed. Settings must be passed as a JSON file on the CLI. Settings must include all items listed below. The values shown below are the defaults.

Available settings
{
    "layout": { # setting for layout analysis
        "caption": { 
            "offset": [ # distance used to locate captions
                4,
                "mm"
            ],
            "direction": "down", # direction used to locate captions
            "keywords": [  # keywords used to find captions based on text analysis
                "figure",
                "caption",
                "figuur"
            ]
        },
        "image": { # images smaller than these dimensions will be ignored
            "width": 120,
            "height": 120
        }
    },
    "ocr": {  # settings for OCR analysis
        "caption": {
            "offset": [
                50,
                "px"
            ],
            "direction": "down",
            "keywords": [
                "figure",
                "caption",
                "figuur"
            ]
        },
        "image": {
            "width": 120,
            "height": 120
        },
        "resolution": 250, # dpi to convert PDF pages to images before OCR
        "resize": 30000,  # total pixels. Larger OCR inputs are downsized to this before OCR
        "tesseract" : "--psm 1 --oem 3"  # tesseract options
    }
}
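
The listing above is annotated with `#` comments, which standard JSON parsers reject, so it cannot be saved to a file as-is. One way to produce a valid settings file is to build the same structure in Python and write it with the `json` module. This is a minimal sketch: the values are copied from the defaults above, while the names `DEFAULT_SETTINGS` and `write_settings` are hypothetical. How the file is passed to a pipeline is not shown here; see `visarch [PIPELINE] [SUBCOMMAND] -h`.

```python
import copy
import json

CAPTION_KEYWORDS = ["figure", "caption", "figuur"]

# Default values as documented in the listing above.
DEFAULT_SETTINGS = {
    "layout": {
        "caption": {
            "offset": [4, "mm"],
            "direction": "down",
            "keywords": list(CAPTION_KEYWORDS),
        },
        "image": {"width": 120, "height": 120},
    },
    "ocr": {
        "caption": {
            "offset": [50, "px"],
            "direction": "down",
            "keywords": list(CAPTION_KEYWORDS),
        },
        "image": {"width": 120, "height": 120},
        "resolution": 250,
        "resize": 30000,
        "tesseract": "--psm 1 --oem 3",
    },
}


def write_settings(path, settings=DEFAULT_SETTINGS):
    """Write a settings dictionary to `path` as comment-free JSON."""
    with open(path, "w", encoding="utf-8") as handle:
        json.dump(settings, handle, indent=4)


def customized(**ocr_overrides):
    """Return a copy of the defaults with selected OCR values replaced."""
    settings = copy.deepcopy(DEFAULT_SETTINGS)
    settings["ocr"].update(ocr_overrides)
    return settings
```

For example, `write_settings("settings.json", customized(resolution=300))` writes a complete settings file in which only the OCR resolution differs from the defaults; the file name is arbitrary.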


When no settings are passed to a pipeline, the defaults are used. To print the default settings to the terminal, use:

visarch [PIPELINE] settings

Citation

Please cite this software as follows:

Garcia Alvarez, M. G., Khademi, S., & Pohl, D. (2023). VisArchPy [Computer software]. https://github.com/AiDAPT-A/VisArchPy

Acknowledgements

  • AeoLiS is supported by the Digital Competence Centre, Delft University of Technology.
  • Research Data Services, Delft University of Technology, The Netherlands.
