
Extracting structured variables from image data


imgtovar

Imgtovar is a Python module, developed in collaboration with researchers, that extracts structured variables from image data. The pipeline consists of three steps: image extraction, cleaning, and prediction. The module was bootstrapped from the popular face analysis framework DeepFace.

It currently supports natural vs. man-made background analysis; chart recognition and identification; age, gender, race, and emotion prediction; and object detection covering a total of 100+ object classes.

Table of Contents

  1. Installation
  2. Variable extraction pipeline
  3. License
  4. Contact
  5. Acknowledgments

Installation

The easiest way to install imgtovar is from PyPI. This installs the library itself along with its prerequisites.

$ pip install imgtovar

Variable Extraction Pipeline

Here is an example of a full feature extraction pipeline:

commands - add later
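In the meantime, here is a minimal sketch of how the methods documented below fit together; the import path, directory names, and parameter defaults are assumptions rather than documented usage:

# Minimal sketch of the full pipeline; the import path and the
# paths below are assumptions, and parameters are left at defaults.
from imgtovar import ImgtoVar

# 1. Image extraction (PDF only for now)
ImgtoVar.extract("./documents", mode="PDF")
data = "./extract_output"

# 2. Cleaning, in the intended order: infographics, inverted images,
#    then color analysis to limit false positives in the last step
charts_df = ImgtoVar.detect_infographics(data)
inverted_df = ImgtoVar.detect_invertedImg(data)
hl_pairs_df = ImgtoVar.color_analysis(data)

# 3. Feature extraction
demography_df = ImgtoVar.face_analysis(data)
background_df = ImgtoVar.background_analysis(data)
objects_df = ImgtoVar.detect_objects(data, model="sub_open_images")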

Below is a quick explanation of each method.

:heavy_exclamation_mark: All code has detailed comments explaining the different parameters and their functions

Image extraction

Imgtovar allows the user to extract images from documents in case no image database is available.

ImgtoVar.extract(data, mode="PDF")

This function extracts all images from a document and stores them in a directory named: ./extract_output

The data parameter can be either a single file or a directory.

:warning: Currently supported modes: PDF only!

(back to top)

Cleaning

During initial development, cleaning was found to be a crucial step, especially when the images were acquired through automatic image extraction from documents.

Three methods are used in the cleaning process. It is best to filter out corrupted images and infographics before running the color analysis, in order to limit false positives in that last step. Here are the methods in the intended order:

1. detect_infographics

This method detects whether an image is an infographic and, if so, predicts its type. The detection stage has an F1-score of 97%, showing strong performance. The overall classifier accuracy of 87% reflects the performance of the identification stage.

charts_df = ImgtoVar.detect_infographics(data)

The function returns a DataFrame with the image file names and the predicted chart_type. Additionally, provided that the user agrees, infographics are moved to a new directory.

This method should be used before the color_analysis in order to reduce the false positives in that step.
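As a sketch of downstream use, the predictions can be tallied or filtered; the chart_type column comes from the description above, while the "No_chart" label for non-infographics is an assumption:

# Sketch: inspecting the detect_infographics output.
charts_df = ImgtoVar.detect_infographics("./extract_output")

# Tally the predicted chart types
print(charts_df["chart_type"].value_counts())

# Keep only images not predicted to be infographics;
# the "No_chart" label is an assumption, not a documented value.
photos_df = charts_df[charts_df["chart_type"] == "No_chart"]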

2. detect_invertedImg

Through trial and error, we found that image extraction from PDF files leaves a small percentage of images corrupted. Those images have inverted channel values and additional problems with contrast and lightness. To identify them, ImgtoVar provides a method with 93% accuracy in detecting corrupted images.

inverted_df = ImgtoVar.detect_invertedImg(data)

The method returns a DataFrame with the predicted status of the analysed images. Additionally, provided user approval, the images identified as inverted are moved to a new directory.

3. color_analysis

To filter out undesirable images, ImgtoVar provides a method for analysing the distribution of hue/lightness pairs across all pixels.

hl_pairs_df = ImgtoVar.color_analysis(data)

A pandas DataFrame is returned containing the image file name, the total number of H/L pairs found, and the proportion represented by the top 200 pairs (this number is adjustable). Additionally, provided that the user agrees, images identified as "Artificial" are moved to a new directory.

The filtering is based on that proportion: real photographs have a low proportion, while drawings, logos, or single-color images have a high one. By varying the threshold parameter, the user can make the filtering more or less aggressive.

[Figure: color analysis of the top 200 H/L pairs. A real photograph shows a dominant-pairs proportion of 3%; an artificial image shows 64%.]

This method is very effective at identifying real photographs, but can mistakenly label simpler images like infographics as undesirable.
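For instance, a sketch of tuning the filter, assuming the threshold acts as an upper bound on the dominant-pairs proportion (the exact semantics and default value are assumptions):

# Sketch: adjusting the aggressiveness of the color analysis filter.
# Threshold semantics are an assumption: images whose top-200-pair
# proportion exceeds the threshold are treated as "Artificial".
strict_df = ImgtoVar.color_analysis("./extract_output", threshold=0.3)   # flags more images
lenient_df = ImgtoVar.color_analysis("./extract_output", threshold=0.8)  # flags fewer images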

(back to top)

Feature extraction

Once we have clean data, we can begin structuring our image data by using the methods included in ImgtoVar.

This section outlines the three main methods behind feature extraction. Several pre-trained models are included with the library, but the methods also allow for integration with custom models.

Facial attributes analysis

The facial attribute analysis is based on the popular Python module DeepFace. ImgtoVar adds two important features.

First, the face detection function was reworked to return all detected faces in an image, so that the analysis runs on each of them.

Second, the apparent age classifier was replaced with a new custom model, as the original model included with DeepFace lacked training examples below 18 years old and had limited examples in the higher age groups, which led to poor performance on the test data. The new model classifies age into one of the following groups: child, young adult, adult, middle age, old, with 72% accuracy.

Here is a potential workflow:

backends = ['opencv', 'ssd', 'dlib', 'mtcnn', 'retinaface', 'mediapipe']

#facial analysis
demography_df = ImgtoVar.face_analysis(data, actions=("emotion", "age", "gender", "race"), detector_backend = backends[4])

The method returns a DataFrame with the image file names and the predicted label for each action specified. By default, retinaface is used as a backend and all actions are predicted. To see a comparison of the different backends you can refer to this demo created by the author of DeepFace.

As he writes in the DeepFace documentation: "RetinaFace and MTCNN seem to overperform in detection and alignment stages but they are much slower. If the speed of your pipeline is more important, then you should use opencv or ssd. On the other hand, if you consider the accuracy, then you should use retinaface or mtcnn."
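Because every detected face yields a prediction, the returned DataFrame can be aggregated per image; the column names used below are assumptions:

# Sketch: aggregating per-face predictions per image.
# Column names ("file", "age") are assumptions.
demography_df = ImgtoVar.face_analysis("./extract_output")

faces_per_image = demography_df.groupby("file").size()
age_shares = demography_df["age"].value_counts(normalize=True)
print(faces_per_image.head())
print(age_shares)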

Background analysis

The background analysis detects the context of an image, e.g. whether it depicts an urban skyline or a forest. The classifier has 93% accuracy in distinguishing natural from man-made image backgrounds.

background_df = ImgtoVar.background_analysis(data)

The method returns a DataFrame with the image file names and the predicted background. To train this classifier, a custom dataset was created.

Due to limitations of the training data, some nature examples contain small man-made structures; therefore, this classifier cannot be used on its own to filter out images in which no man-made objects exist.

For example, if an image shows a natural landscape with a small house in the middle, that will be classified as natural.

[Figure: example images with labels Natural, Natural, and Man-made.]

If the researcher wants to detect images with nothing man-made in them, the background_analysis method can be used in combination with the object_detection method to identify false positive cases.
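A minimal sketch of that combination, flagging "Natural" images that nonetheless contain detected man-made objects; the column names and the "Natural" label value are assumptions, while the object classes come from the OpenImages list below:

# Sketch: cross-checking background_analysis with detect_objects.
# Column names ("file", "background", "object") are assumptions.
background_df = ImgtoVar.background_analysis("./extract_output")
objects_df = ImgtoVar.detect_objects("./extract_output", model="sub_open_images")

man_made = {"Building", "Office_building", "Motorcycle", "Helicopter"}
flagged_files = objects_df[objects_df["object"].isin(man_made)]["file"].unique()

# "Natural" backgrounds that still contain man-made objects
false_positives = background_df[
    (background_df["background"] == "Natural")
    & (background_df["file"].isin(flagged_files))
]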

Object Detection

For object detection, ImgtoVar uses the YoloV5 family of models.

Two custom pre-trained models are included with the module, as well as all the models pre-trained on the COCO dataset that ship with YoloV5 itself.

The COCO dataset covers 80 classes, to which ImgtoVar adds 24 classes extracted from the OpenImages dataset and an additional 6 classes trained on a custom dataset. Finally, the module allows users to specify their own custom pre-trained weights and model architecture, as sketched below.
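A sketch of that custom-weights path, mirroring the signature of the COCO example below; the weights file name and label list are placeholders:

# Sketch: user-supplied YoloV5 weights.
# The file name and labels are placeholders, not shipped assets.
custom_od_df = ImgtoVar.detect_objects(
    data,
    model="custom",
    weights="my_yolov5_weights.pt",
    labels=["my_label_1", "my_label_2"],
)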

To use the COCO dataset:

coco_od_df = ImgtoVar.detect_objects(data, model="custom", weights="yolov5l.pt", labels=None)

The method returns a DataFrame with the image file name, the object detected, its position, and the confidence of the prediction.

Since for most researchers the existence of an object is more important than its exact coordinates, we report mAP at 0.5 IoU, which for the yolov5l.pt model is 67.3%.
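Given that emphasis on presence rather than position, the detections can be collapsed into a per-image indicator table; the column names are assumptions:

# Sketch: one row per image, one 0/1 column per detected object class.
# Column names ("file", "object") are assumptions.
presence = (
    coco_od_df.groupby(["file", "object"])
    .size()
    .unstack(fill_value=0)
    .gt(0)
    .astype(int)
)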

Here is a list of the labels included in the COCO dataset:

["person", "bicycle", "car", "motorcycle", "airplane", "bus", "train", "truck", "boat",
"traffic light", "fire hydrant", "stop sign", "parking meter", "bench", "bird", "cat",
"dog", "horse", "sheep", "cow", "elephant", "bear", "zebra", "giraffe", "backpack",
"umbrella", "handbag", "tie", "suitcase", "frisbee", "skis", "snowboard", "sports ball",
"kite", "baseball bat", "baseball glove", "skateboard", "surfboard", "tennis racket",
"bottle", "wine glass", "cup", "fork", "knife", "spoon", "bowl", "banana", "apple",
"sandwich", "orange", "broccoli", "carrot", "hot dog", "pizza", "donut", "cake", "chair",
"couch", "potted plant", "bed", "dining table", "toilet", "tv", "laptop", "mouse",
"remote", "keyboard", "cell phone", "microwave", "oven", "toaster", "sink",
"refrigerator", "book", "clock", "vase", "scissors", "teddy bear", "hair drier",
"toothbrush"]

To use the subset of the OpenImages dataset:

oi_od_df = ImgtoVar.detect_objects(data, model="sub_open_images")

The mAP at 0.5 IoU is 59%. This is comparable with scores from the OpenImages data challenge, where a much more complicated model achieved an overall score of 65% mAP at 0.5. The advantage of YoloV5 is its speed, which allows researchers to extract variables from larger datasets.

Here is a list of the labels included in the custom OpenImages dataset:

["Animal", "Tree", "Plant", "Flower", "Fruit", "Suitcase", "Motorcycle", "Helicopter",
"Sports_equipment", "Office_building", "Tool", "Medical_equipment", "Mug", "Sunglasses",
"Headphones", "Swimwear", "Suit", "Dress", "Shirt", "Desk", "Whiteboard", "Jeans",
"Helmet", "Building"]

The final dataset is a custom dataset, created in connection with the research this module is being applied to. It includes 6 labels connected to sustainability, such as wind turbines, solar panels, and oil pumps.

energy_od_df = ImgtoVar.detect_objects(data, model="c_energy")

The mAP at 0.5 IoU is 91%.

Finally, here is a list of the 6 labels included in this dataset:

["Crane", "Wind turbine", "farm equipment", "oil pumps", "plant chimney", "solar panels"]

(back to top)

License

Distributed under the GNU License. See LICENSE.txt for more information.

(back to top)

Contact

Dimitar Dimitrov
Email - dvdimitrov13@gmail.com
LinkedIn - https://www.linkedin.com/in/dimitarvalentindimitrov/

(back to top)

Acknowledgments

Special thanks go to my thesis advisor, Francesco Grossetti, who helped me develop and verify this work.

(back to top)
