Divide long web page screenshots into blocks so they can be fed to models with shorter context windows.


Introduction

This project splits a long screenshot of a web page into several parts based on the height of the text. The main idea is to find a low-variation region of the image and then find the split line within that region.

In the example image, the red lines are the split lines.

The output is a set of small but complete images of the web page, which can be used to generate web pages with Screen-to-code or to train models. More results can be found in the images directory.
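The low-variation idea described above can be illustrated with a small, dependency-free sketch. This is not the package's implementation: the function name `find_split_lines`, the use of per-row standard deviation as the "variation" measure, and the choice of the band's middle row as the split line are all assumptions made for illustration.

```python
from statistics import pstdev


def find_split_lines(rows, variation_threshold=0.5, height_threshold=3):
    """Return one candidate split height per low-variation band.

    rows: list of image rows, each a list of grayscale pixel values.
    A row counts as "flat" when the standard deviation of its pixels is at
    most variation_threshold (an assumed measure of variation).
    A band of at least height_threshold consecutive flat rows yields one
    split line at the middle of the band.
    """
    splits = []
    band_start = None
    # A trailing None acts as a sentinel so the last band gets closed.
    for y, row in enumerate(rows + [None]):
        flat = row is not None and pstdev(row) <= variation_threshold
        if flat and band_start is None:
            band_start = y  # a new flat band begins here
        elif not flat and band_start is not None:
            if y - band_start >= height_threshold:
                splits.append((band_start + y - 1) // 2)
            band_start = None
    return splits


# Toy "screenshot": busy rows stand in for text, uniform rows for whitespace.
text_row = [0, 255, 0, 255]
blank_row = [255, 255, 255, 255]
image = [text_row] * 2 + [blank_row] * 5 + [text_row] * 3 + [blank_row] * 2 + [text_row]

split_lines = find_split_lines(image)
print(split_lines)  # [4]
```

Only the five-row blank band is tall enough to produce a split line; the two-row band is ignored because it is shorter than `height_threshold`.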

Getting started

Install

pip install Web-page-Screenshot-Segmentation

Using in the command line

Obtain the heights of the split lines of the image

python -m Web_page_Screenshot_Segmentation.master -f "path/to/img"

The output looks like this: [6, 868, 1912, 2672, 3568, 4444, 5124, 6036, 7698]. It is the list of heights of the split lines in the image.
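A height list like the one above can be turned into row ranges for cropping. The sketch below is a minimal illustration, assuming the heights are pixel rows measured from the top of the image; `heights_to_segments` is a hypothetical helper, not part of the package, and whether the package keeps the first and last strips is not stated here.

```python
def heights_to_segments(heights, image_height):
    """Turn split-line heights into (top, bottom) row ranges covering the image."""
    # Keep only lines strictly inside the image, sorted and de-duplicated.
    bounds = sorted({h for h in heights if 0 < h < image_height})
    edges = [0] + bounds + [image_height]
    # Pair each edge with the next one to form consecutive segments.
    return list(zip(edges[:-1], edges[1:]))


segments = heights_to_segments([6, 868, 1912], image_height=2000)
print(segments)  # [(0, 6), (6, 868), (868, 1912), (1912, 2000)]
```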

To check the split lines on the image, use the following command:

python -m Web_page_Screenshot_Segmentation.master -f "path/to/img" -s True

The command then returns the path to the result image.

Draw the split lines on the image

python -m Web_page_Screenshot_Segmentation.drawer --image_file path/to/image.jpg --hl "[100,200]" --color "(0,255,0)"

Split the image

python -m Web_page_Screenshot_Segmentation.spliter --f path/to/image.jpg -ht "[233,456]"

The split images are saved at the path returned by the command.

For details, refer to the help information:
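The commands above pass the heights as a bracketed string such as "[233,456]". How the package parses that string internally is not shown here; the sketch below shows one safe way to turn such a string into a list of integers, using a hypothetical helper named `parse_heights`.

```python
import ast


def parse_heights(text):
    """Parse a string like "[233,456]" into a list of integer heights."""
    # literal_eval only accepts Python literals, so arbitrary code is rejected.
    value = ast.literal_eval(text)
    if not isinstance(value, (list, tuple)):
        raise ValueError(f"expected a list of heights, got {text!r}")
    for h in value:
        # bool is a subclass of int, so exclude it explicitly.
        if not isinstance(h, int) or isinstance(h, bool):
            raise ValueError(f"heights must be integers, got {h!r}")
    return list(value)


parsed = parse_heights("[233,456]")
print(parsed)  # [233, 456]
```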

python -m Web_page_Screenshot_Segmentation.master --help
python -m Web_page_Screenshot_Segmentation.drawer --help
python -m Web_page_Screenshot_Segmentation.spliter --help

Using from the Source Code

split_heights function

The split_heights function is used to split an image into several parts based on various thresholds. It takes the following parameters:

  • file_path: The path of the image file.
  • split: A boolean indicating whether to split the image.
  • height_threshold: The height threshold of the low variation region.
  • variation_threshold: The variation threshold of the low variation region.
  • color_threshold: The threshold of the color difference.
  • color_variation_threshold: The threshold of the color difference variation.
  • merge_threshold: The minimum distance allowed between two split lines.

The function returns a list of heights of the split lines if split is False, or the path of the split image if split is True.
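To make the role of merge_threshold more concrete, here is a minimal sketch of one way lines that sit too close together could be merged. It is an assumption for illustration only: `merge_close_lines` is a hypothetical helper, and the package may resolve close lines differently (for example by averaging them rather than keeping the first).

```python
def merge_close_lines(heights, merge_threshold):
    """Drop any line closer than merge_threshold to the previously kept line."""
    merged = []
    for h in sorted(heights):
        if merged and h - merged[-1] < merge_threshold:
            continue  # too close to the last kept line, so skip it
        merged.append(h)
    return merged


kept = merge_close_lines([6, 100, 400, 868, 900, 1912], merge_threshold=350)
print(kept)  # [6, 400, 868, 1912]
```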

Example usage

import Web_page_Screenshot_Segmentation
from Web_page_Screenshot_Segmentation.master import split_heights

# Split the image at 'path/to/image.jpg' into several parts
split_image_path = split_heights(
    file_path='path/to/image.jpg',
    split=True,
    height_threshold=102,
    variation_threshold=0.5,
    color_threshold=100,
    color_variation_threshold=15,
    merge_threshold=350
)

print(f"The split image is saved at {split_image_path}")

In this example, the image at 'path/to/image.jpg' is split into several parts based on the provided thresholds. Because split is True, the function returns the path where the split image is saved rather than the list of heights.

draw_line_from_file function

The draw_line_from_file function is used to draw lines on an image at specified heights. It takes the following parameters:

  • image_file: The path of the image file.
  • heights: A list of heights at which to draw the lines.
  • color: The color of the lines to be drawn. The default color is red, (0, 0, 255); note that this tuple implies BGR channel order rather than RGB.

The function reads the image from the provided file path, draws lines at the specified heights, and then saves the modified image to a new file. The new file is saved in the result directory with the same name as the original file, but with 'result' appended before the file extension.
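The naming rule described above can be sketched with pathlib. The exact layout is an assumption: this sketch places the result directory relative to the current working directory and appends 'result' to the file stem with no separator, which may differ from what the package actually does. `result_path` is a hypothetical helper.

```python
from pathlib import Path


def result_path(image_file, result_dir="result", suffix="result"):
    """Build an output path: same file name with suffix added before the extension."""
    src = Path(image_file)
    # src.stem is the name without extension; src.suffix is the extension with its dot.
    return Path(result_dir) / f"{src.stem}{suffix}{src.suffix}"


out = result_path("path/to/image.jpg")
print(out.as_posix())  # result/imageresult.jpg
```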

If the function cannot read the image file (for example, because the file path contains '.' or Chinese characters), it raises an exception.

Example usage

import Web_page_Screenshot_Segmentation
from Web_page_Screenshot_Segmentation.drawer import draw_line_from_file

# Draw lines on the image at 'path/to/image.jpg' at heights 100 and 200
result_image_path = draw_line_from_file(
    image_file='path/to/image.jpg',
    heights=[100, 200],
    color=(0, 255, 0)  # Draw the lines in green
)

print(f"The modified image is saved at {result_image_path}")

In this example, the image at 'path/to/image.jpg' is modified by drawing green lines at heights 100 and 200. The modified image is saved at the path returned by the function.

