A Python toolkit for end-to-end image analysis with cloud (Minio, S3) support.

These details have not been verified by PyPI

Project links

Homepage

License
- OSI Approved :: MIT License
Operating System
- OS Independent
Programming Language
- Python :: 3

Project description

VizScout

VizScout is a Python package for advanced image data analysis and correction. It provides utilities for detecting and analyzing image quality, identifying duplicates, checking for corrupt images, and generating exploratory data analysis (EDA) reports on datasets. The package is designed to handle datasets stored locally, in AWS S3, or MinIO for large-scale data processing.

Features

Image Data Quality Analysis: Evaluate the quality of images based on brightness, blur, and uniformity.
Duplicate Image Detection: Identify exact and near-duplicate images in a dataset.
Corruption Detection: Automatically detect corrupt images that cannot be read or processed.
Exploratory Data Analysis (EDA): Generate detailed reports on dataset-level and image-level statistics.
Support for Large Datasets: Efficient handling of large datasets using parallel processing and batch loading.

Installation

To install viz_scout, use pip:

pip install viz_scout

Alternatively, clone the repository and install manually:

git clone https://github.com/yourusername/viz_scout.git
cd viz_scout
pip install .

For Apple MacBook M1 (Apple silicon), install the scientific stack with conda first:

conda install scipy numpy matplotlib
conda install --channel=conda-forge scikit-learn

Then install the remaining libraries.

Dependencies

Python 3.6+
Pillow - for image handling
opencv-python - for image processing tasks like blur detection
numpy - for numerical operations
boto3 - for interacting with AWS S3 (if using S3 buckets)
minio - for interacting with MinIO (if using MinIO)

You can install the required dependencies using:

pip install -r requirements.txt
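Before running an analysis, it can help to confirm which of the packages listed above are importable. The following is a minimal sketch using only the standard library; the helper name `missing_dependencies` is hypothetical and not part of viz_scout. Note that Pillow and opencv-python are imported as `PIL` and `cv2`.

```python
from importlib.util import find_spec

# Distribution name on PyPI -> top-level module name used in "import ...".
IMPORT_NAMES = {
    "Pillow": "PIL",
    "opencv-python": "cv2",
    "numpy": "numpy",
    "boto3": "boto3",
    "minio": "minio",
}


def missing_dependencies(import_names=IMPORT_NAMES):
    """Return the distribution names whose import module cannot be found."""
    return [
        dist for dist, module in import_names.items()
        if find_spec(module) is None
    ]


if __name__ == "__main__":
    missing = missing_dependencies()
    print("Missing:", ", ".join(missing) if missing else "none")
```

boto3 and minio are only needed for S3 and MinIO datasets respectively, so a missing entry for either is not necessarily a problem for local analysis.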

Usage

1. Basic Example: Analyzing a Local Dataset

import json

from viz_scout import EDAReport


# Initialize the EDA report generator for a local dataset
eda = EDAReport(
    dataset_path="path/to/dataset",
    duplicate_check=True,
    blur_threshold=5
)


# Generate the report
report = eda.generate_report()


# Print the report
print(json.dumps(report, indent=4))


# Optionally save the report to a file
eda.save_report(report, output_path="dataset_eda_report.json")

2. Advanced Example: Analyzing an S3 or MinIO Dataset
To analyze datasets stored on AWS S3 or MinIO, pass the matching configuration dictionary (s3_config or minio_config):

import json

from viz_scout import EDAReport


# Example for S3
eda = EDAReport(
    dataset_path="s3://my-bucket/dataset/",
    s3_config={
        "access_key": "your-access-key",
        "secret_key": "your-secret-key",
        "region": "your-region",
    },
    duplicate_check=True,
    blur_threshold=5
)


# Generate the report
report = eda.generate_report()


# Print or save the report
print(json.dumps(report, indent=4))
eda.save_report(report, output_path="s3_eda_report.json")


########################################################


# Example for MinIO
eda = EDAReport(
    dataset_path="minio://my-bucket/dataset/",
    minio_config={
        "endpoint": "minio-server-url",
        "access_key": "your-access-key",
        "secret_key": "your-secret-key",
    },
    duplicate_check=True,
    blur_threshold=5
)


# Generate the report
report = eda.generate_report()


# Print or save the report
print(json.dumps(report, indent=4))
eda.save_report(report, output_path="minio_eda_report.json")
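Hardcoding credentials as in the examples above is fine for a demo but risky in shared code. The sketch below builds the `s3_config` dictionary from the standard AWS environment variables and splits a dataset path into its parts. Both helper names are hypothetical, and how viz_scout itself parses `s3://` and `minio://` paths is not documented here, so treat the path splitting as an illustration only.

```python
import os
from urllib.parse import urlparse


def s3_config_from_env(env=os.environ):
    """Build an s3_config dict from the standard AWS environment variables.

    Raises KeyError if any of the three variables is not set.
    """
    return {
        "access_key": env["AWS_ACCESS_KEY_ID"],
        "secret_key": env["AWS_SECRET_ACCESS_KEY"],
        "region": env["AWS_DEFAULT_REGION"],
    }


def split_dataset_path(dataset_path):
    """Split 's3://bucket/prefix/' into (scheme, bucket, prefix)."""
    parsed = urlparse(dataset_path)
    return parsed.scheme, parsed.netloc, parsed.path.lstrip("/")


if __name__ == "__main__":
    print(split_dataset_path("minio://my-bucket/dataset/"))
```

The resulting dictionary can then be passed as `s3_config=s3_config_from_env()` when constructing `EDAReport`.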

3. Parallel Processing and Batch Loading for Large Datasets
When processing large datasets with hundreds of thousands of images, you can set the batch size and the number of parallel workers for faster results:

import json

from viz_scout import EDAReport


eda = EDAReport(
    dataset_path="path/to/large/dataset",
    batch_size=200,  # Process in batches of 200 images at a time
    num_workers=8    # Use 8 parallel workers (threads)
)


# Generate the report
report = eda.generate_report()


# Print or save the report
print(json.dumps(report, indent=4))
eda.save_report(report, output_path="large_dataset_eda_report.json")

4. Generating EDA Plots
EDAPlots builds dataset-level plots that can be saved as PNG, PDF, or HTML:

from viz_scout import EDAPlots

plot_generator = EDAPlots(dataset_path="path/to/image/dataset")

save_dir = "path/to/save/plots"

# Distribution of image sizes, saved as PNG
img_size_distribution = plot_generator.get_image_size_distribution()
# img_size_distribution.plot()
img_size_distribution.save(
    save_dir=save_dir,
    file_name="img_size_distribution",
    file_format="png"
)

# Distribution of aspect ratios, saved as PDF
aspect_ratio_distribution = plot_generator.get_aspect_ratio_distribution()
aspect_ratio_distribution.save(
    save_dir=save_dir,
    file_name="aspect_ratio_distribution",
    file_format="pdf"
)

# Width vs. height correlation, saved as HTML
width_height_correlation = plot_generator.get_width_height_correlation()
width_height_correlation.save(
    save_dir=save_dir,
    file_name="width_height_correlation",
    file_format="html"
)

Key Functions and Methods

`EDAReport`: The main class for generating EDA reports on image datasets.

__init__(self, dataset_path, minio_config=None, s3_config=None, corrupt_check=True, blur_threshold=3, batch_size=100, num_workers=4)
- Initializes the EDA report generator.
- Supports local, S3, or MinIO datasets.
- Customizes corruption checks, blur thresholds, batch size, and parallel processing.

generate_report()
- Generates an EDA report containing both dataset-level and image-level statistics.
- Returns the report as a dictionary.

save_report(report, output_path)
- Saves the generated report as a JSON file at the specified output path.
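Because generate_report() returns a plain dictionary, the report can also be written with the standard library directly. The sketch below is an assumption about what such a save step might look like, not the package's implementation; the helper name `save_report_json` and the sample report keys are hypothetical.

```python
import json
import tempfile
from pathlib import Path


def save_report_json(report, output_path):
    """Write a report dictionary to a .json file and return the path."""
    path = Path(output_path)
    if path.suffix.lower() != ".json":
        raise ValueError(f"expected a .json output path, got {path.name!r}")
    path.parent.mkdir(parents=True, exist_ok=True)
    with path.open("w", encoding="utf-8") as fh:
        # default=str stringifies values json cannot encode (e.g. Path objects).
        json.dump(report, fh, indent=4, default=str)
    return path


if __name__ == "__main__":
    # Placeholder report; the real report structure is defined by viz_scout.
    sample = {"num_images": 3, "source": Path("path/to/dataset")}
    with tempfile.TemporaryDirectory() as tmp:
        print(save_report_json(sample, Path(tmp) / "reports" / "eda.json"))
```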

`ImageQualityAnalyzer`: A class for analyzing image quality based on brightness, blur, and uniformity.

brightness_score(image)
- Returns a score (0-10) indicating the brightness of the image.

blur_score(image)
- Returns a score (0-10) indicating the blur level of the image.

uniformity_score(image)
- Returns a score (0-10) indicating the uniformity of the image.
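The exact formulas behind these scores are not documented here. As a rough illustration of the kind of measurement involved, the sketch below computes a mean-brightness score on a 0-10 scale and the variance of a 4-neighbour Laplacian, a common sharpness proxy where low variance suggests blur. It works on a grayscale image given as nested lists of 0-255 values so that it runs without extra packages. Both function names are hypothetical, and how viz_scout maps such measurements to its own 0-10 scores is an open assumption.

```python
from statistics import fmean, pvariance


def mean_brightness_score(gray):
    """Mean intensity of an 8-bit grayscale image, rescaled to 0-10."""
    pixels = [p for row in gray for p in row]
    return 10.0 * fmean(pixels) / 255.0


def laplacian_variance(gray):
    """Variance of the 4-neighbour Laplacian over interior pixels.

    Sharp edges produce large responses; a flat or blurred image does not.
    The image must be at least 3x3 so that interior pixels exist.
    """
    height, width = len(gray), len(gray[0])
    responses = [
        gray[r - 1][c] + gray[r + 1][c] + gray[r][c - 1] + gray[r][c + 1]
        - 4 * gray[r][c]
        for r in range(1, height - 1)
        for c in range(1, width - 1)
    ]
    return pvariance(responses)


if __name__ == "__main__":
    flat = [[128] * 4 for _ in range(4)]
    checker = [[255 if (r + c) % 2 else 0 for c in range(4)] for r in range(4)]
    print(mean_brightness_score(flat))
    print(laplacian_variance(flat), laplacian_variance(checker))
```

In practice the same idea is usually applied with NumPy arrays and OpenCV rather than nested lists.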

`DuplicateDetector`: A class for detecting exact and near-duplicate images in a dataset.

get_exact_duplicates(images)
- Returns a list of exact duplicate images from the dataset.
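One common way to find exact duplicates is to group files by a cryptographic hash of their bytes. The sketch below shows that approach with the standard library; it is not necessarily how DuplicateDetector works internally, and the function name `find_exact_duplicates` is hypothetical.

```python
import hashlib
import tempfile
from collections import defaultdict
from pathlib import Path


def find_exact_duplicates(paths, chunk_size=1 << 20):
    """Group byte-identical files; return only groups with 2+ members."""
    groups = defaultdict(list)
    for path in paths:
        digest = hashlib.sha256()
        with open(path, "rb") as fh:
            # Read in chunks so large files are never fully held in memory.
            for chunk in iter(lambda: fh.read(chunk_size), b""):
                digest.update(chunk)
        groups[digest.hexdigest()].append(str(path))
    return [sorted(group) for group in groups.values() if len(group) > 1]


if __name__ == "__main__":
    with tempfile.TemporaryDirectory() as tmp:
        names = {"a.bin": b"same", "b.bin": b"same", "c.bin": b"different"}
        for name, payload in names.items():
            (Path(tmp) / name).write_bytes(payload)
        print(find_exact_duplicates(sorted(Path(tmp).iterdir())))
```

Byte-level hashing only catches identical files. An image that was resized or re-encoded has different bytes and would need a perceptual comparison to be flagged as a near-duplicate.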

`CorruptionDetector`: A class for detecting corrupt images in a dataset.

is_corrupt(image)
- Returns True if the image is corrupt or unreadable.
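What counts as "corrupt" depends on the file format. As a format-specific illustration, the sketch below validates a PNG byte stream by checking its signature and the CRC-32 stored in every chunk, using only the standard library. This is an assumption about one possible check, not the method CorruptionDetector uses; the names `png_is_corrupt`, `make_chunk`, and `tiny_png` are hypothetical. A general-purpose detector would typically rely on an imaging library to attempt a full decode.

```python
import struct
import zlib

PNG_SIGNATURE = b"\x89PNG\r\n\x1a\n"


def make_chunk(chunk_type: bytes, body: bytes) -> bytes:
    """Serialize one PNG chunk: length, type, body, CRC-32 of type + body."""
    crc = zlib.crc32(chunk_type + body) & 0xFFFFFFFF
    return struct.pack(">I", len(body)) + chunk_type + body + struct.pack(">I", crc)


def png_is_corrupt(data: bytes) -> bool:
    """Return True if the signature, a chunk boundary, or a chunk CRC is bad."""
    if not data.startswith(PNG_SIGNATURE):
        return True
    pos = len(PNG_SIGNATURE)
    while pos < len(data):
        if pos + 8 > len(data):
            return True  # truncated chunk header
        length, chunk_type = struct.unpack(">I4s", data[pos:pos + 8])
        end = pos + 8 + length + 4
        if end > len(data):
            return True  # truncated chunk body or CRC
        body = data[pos + 8:pos + 8 + length]
        (stored_crc,) = struct.unpack(">I", data[end - 4:end])
        if zlib.crc32(chunk_type + body) & 0xFFFFFFFF != stored_crc:
            return True  # payload does not match its checksum
        if chunk_type == b"IEND":
            return False  # reached the end marker with every CRC valid
        pos = end
    return True  # ran out of data without an IEND chunk


def tiny_png() -> bytes:
    """Build a valid 1x1 8-bit grayscale PNG for demonstration."""
    ihdr = struct.pack(">IIBBBBB", 1, 1, 8, 0, 0, 0, 0)
    idat = zlib.compress(b"\x00\x00")  # filter byte 0 + one black pixel
    return (PNG_SIGNATURE + make_chunk(b"IHDR", ihdr)
            + make_chunk(b"IDAT", idat) + make_chunk(b"IEND", b""))


if __name__ == "__main__":
    good = tiny_png()
    print(png_is_corrupt(good), png_is_corrupt(good[:-5]))
```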

Performance Optimizations

Parallel Processing: The package supports parallel processing via ThreadPoolExecutor to handle large datasets efficiently.
Batch Processing: Process images in batches to avoid memory overload.
Lazy Loading: Images are processed on-demand to minimize memory usage.
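The three ideas above combine naturally: a generator yields batches lazily, and each batch is mapped over a thread pool. The following is a minimal sketch of that pattern, not the package's internal code; the helper names are hypothetical, and the defaults mirror the documented batch_size=100 and num_workers=4.

```python
from concurrent.futures import ThreadPoolExecutor
from itertools import islice


def iter_batches(items, batch_size):
    """Yield successive lists of up to batch_size items, consuming lazily."""
    iterator = iter(items)
    while batch := list(islice(iterator, batch_size)):
        yield batch


def process_in_batches(items, func, batch_size=100, num_workers=4):
    """Apply func to every item, one batch at a time, using a thread pool.

    Only one batch of inputs is materialized at once; results keep input order.
    """
    results = []
    with ThreadPoolExecutor(max_workers=num_workers) as pool:
        for batch in iter_batches(items, batch_size):
            results.extend(pool.map(func, batch))
    return results


if __name__ == "__main__":
    squares = process_in_batches(range(10), lambda x: x * x, batch_size=3)
    print(squares)
```

Threads suit this workload when most of the time is spent on I/O, such as reading files or downloading objects from a bucket.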

Contributing

We welcome contributions to viz_scout! If you'd like to contribute, please follow these steps:

1. Fork the repository.
2. Create a new branch.
3. Make your changes.
4. Write tests to cover your changes (if applicable).
5. Submit a pull request.

License

This project is licensed under the MIT License. See the LICENSE file for more details.

Contact

For any questions or issues, please open an issue on GitHub or contact rohandhatbale@gmail.com.

Release history

- 0.2.7: Mar 10, 2025
- 0.2.6: Mar 7, 2025
- 0.2.5: Mar 7, 2025
- 0.2.4: Jan 22, 2025 (this version)
- 0.2.3: Jan 22, 2025
- 0.2.2: Dec 30, 2024
- 0.2.1: Dec 30, 2024
- 0.1.1: Dec 27, 2024
- 0.1.0: Dec 26, 2024

Download files

Source Distribution

viz_scout-0.2.4.tar.gz (15.6 kB)

Uploaded Jan 22, 2025 Source

Built Distribution

viz_scout-0.2.4-py3-none-any.whl (15.0 kB)

Uploaded Jan 22, 2025 Python 3

File details

Details for the file viz_scout-0.2.4.tar.gz.

File metadata

Download URL: viz_scout-0.2.4.tar.gz
Upload date: Jan 22, 2025
Size: 15.6 kB
Tags: Source
Uploaded using Trusted Publishing? No
Uploaded via: twine/6.0.1 CPython/3.9.0

File hashes

Hashes for viz_scout-0.2.4.tar.gz
Algorithm	Hash digest
SHA256	`f1b816e6fafc35e41d0abf47e9c9863aab8df7a1fea35719c102c4a09f7b902f`
MD5	`535e66f11a3cc1183826a71ce8aa4e47`
BLAKE2b-256	`6780be910ea605cce14a90b33074e9b5a2a543a3c04f6e5c6c47460b77e65127`

File details

Details for the file viz_scout-0.2.4-py3-none-any.whl.

File metadata

Download URL: viz_scout-0.2.4-py3-none-any.whl
Upload date: Jan 22, 2025
Size: 15.0 kB
Tags: Python 3
Uploaded using Trusted Publishing? No
Uploaded via: twine/6.0.1 CPython/3.9.0

File hashes

Hashes for viz_scout-0.2.4-py3-none-any.whl
Algorithm	Hash digest
SHA256	`49bd88549c00c47a6af48733ec90d1feddd006fd554205ec04dfcd989d6e6bd7`
MD5	`d96be5851dcd38e7d918d5a96330f797`
BLAKE2b-256	`2e4d2a6d22e9cdf7e427d8a4529e8978bebd2146dfe083fa4e1ce92420f62b75`

viz-scout 0.2.4
