
A Python toolkit for end-to-end image analysis with cloud (MinIO, S3) support.


viz_scout

viz_scout is a Python package for advanced image data analysis and correction. It provides utilities for detecting and analyzing image quality, identifying duplicates, checking for corrupt images, and generating exploratory data analysis (EDA) reports on datasets. The package is designed to handle datasets stored locally, in AWS S3, or in MinIO, and to scale to large-scale data processing.


Features

  • Image Data Quality Analysis: Evaluate the quality of images based on brightness, blur, and uniformity.
  • Duplicate Image Detection: Identify exact and near-duplicate images in a dataset.
  • Corruption Detection: Automatically detect corrupt images that cannot be read or processed.
  • Exploratory Data Analysis (EDA): Generate detailed reports on dataset-level and image-level statistics.
  • Support for Large Datasets: Efficient handling of large datasets using parallel processing and batch loading.

Installation

To install viz_scout, use pip:

pip install viz_scout

Alternatively, clone the repository and install manually:

git clone https://github.com/yourusername/viz_scout.git
cd viz_scout
pip install .

For Apple MacBook M1 (Apple Silicon), first install the scientific stack via conda:

conda install scipy numpy matplotlib
conda install --channel=conda-forge scikit-learn

Then install the remaining libraries with pip.

Dependencies

  • Python 3.6+
  • Pillow - for image handling
  • opencv-python - for image processing tasks like blur detection
  • numpy - for numerical operations
  • boto3 - for interacting with AWS S3 (if using S3 buckets)
  • minio - for interacting with MinIO (if using MinIO)

You can install the required dependencies using:

pip install -r requirements.txt

Usage

1. Basic Example: Analyzing a Local Dataset

import json

from viz_scout import EDAReport


# Initialize the EDA report generator for a local dataset
eda = EDAReport(
    dataset_path="path/to/dataset",
    corrupt_check=True,
    blur_threshold=5
)


# Generate the report
report = eda.generate_report()


# Print the report
print(json.dumps(report, indent=4))


# Optionally save the report to a file
eda.save_report(report, output_path="dataset_eda_report.json")

2. Advanced Example: Analyzing an S3 or MinIO Dataset
To analyze datasets stored in S3 or MinIO, provide the relevant configuration:

import json

from viz_scout import EDAReport


# Example for S3
eda = EDAReport(
    dataset_path="s3://my-bucket/dataset/",
    s3_config={
        "access_key": "your-access-key",
        "secret_key": "your-secret-key",
        "region": "your-region",
    },
    corrupt_check=True,
    blur_threshold=5
)


# Generate the report
report = eda.generate_report()


# Print or save the report
print(json.dumps(report, indent=4))
eda.save_report(report, output_path="s3_eda_report.json")


########################################################


# Example for MinIO
eda = EDAReport(
    dataset_path="minio://my-bucket/dataset/",
    minio_config={
        "endpoint": "minio-server-url",
        "access_key": "your-access-key",
        "secret_key": "your-secret-key",
    },
    corrupt_check=True,
    blur_threshold=5
)


# Generate the report
report = eda.generate_report()


# Print or save the report
print(json.dumps(report, indent=4))
eda.save_report(report, output_path="minio_eda_report.json")

3. Parallel Processing and Batch Loading for Large Datasets
When processing large datasets with hundreds of thousands of images, you can enable parallel processing for faster results:

import json

from viz_scout import EDAReport


eda = EDAReport(
    dataset_path="path/to/large/dataset",
    batch_size=200,  # Process in batches of 200 images at a time
    num_workers=8    # Use 8 parallel workers (threads)
)


# Generate the report
report = eda.generate_report()


# Print or save the report
print(json.dumps(report, indent=4))
eda.save_report(report, output_path="large_dataset_eda_report.json")

Key Functions and Methods

EDAReport: The main class for generating EDA reports on image datasets.

  • __init__(self, dataset_path, minio_config=None, s3_config=None, corrupt_check=True, blur_threshold=3, batch_size=100, num_workers=4)

    • Initializes the EDA report generator.
    • Supports local, S3, or MinIO datasets.
    • Customizes corruption checks, blur thresholds, batch size, and parallel processing.
  • generate_report()

    • Generates an EDA report containing both dataset-level and image-level statistics.
    • Returns the report as a dictionary.
  • save_report(report, output_path)

    • Saves the generated report to a JSON file at the specified output path (.json format).

ImageQualityAnalyzer: A class for analyzing image quality based on brightness, blur, and uniformity.

  • brightness_score(image)

    • Returns a score (0-10) indicating the brightness of the image.
  • blur_score(image)

    • Returns a score (0-10) indicating the blur level of the image.
  • uniformity_score(image)

    • Returns a score (0-10) indicating the uniformity of the image.
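To make the 0-10 scales above concrete, here is a minimal, self-contained sketch of how brightness and blur scoring can work, using only NumPy. The exact formulas (mapping mean intensity to 0-10, and squashing Laplacian variance onto 0-10 with an arbitrary constant) are illustrative assumptions, not the package's actual implementation:

```python
import numpy as np

def brightness_score(image: np.ndarray) -> float:
    """Map mean pixel intensity (0-255) onto a 0-10 scale."""
    return float(image.mean()) / 255.0 * 10.0

def blur_score(image: np.ndarray) -> float:
    """Variance of a 3x3 Laplacian response, squashed onto 0-10.

    Low variance means few sharp edges, i.e. a blurrier image.
    The squashing constant 100.0 is an arbitrary illustrative choice.
    """
    kernel = np.array([[0, 1, 0], [1, -4, 1], [0, 1, 0]], dtype=np.float64)
    img = image.astype(np.float64)
    # Valid 2-D convolution via sliding windows (no SciPy/OpenCV needed)
    windows = np.lib.stride_tricks.sliding_window_view(img, (3, 3))
    response = (windows * kernel).sum(axis=(-2, -1))
    variance = response.var()
    return 10.0 * variance / (variance + 100.0)

# A flat gray image: mid-range brightness, zero edge response
flat = np.full((32, 32), 128, dtype=np.uint8)
print(brightness_score(flat))  # ~5.0
print(blur_score(flat))        # 0.0 (no edges at all)
```

In practice the package lists opencv-python as a dependency for blur detection, so its internal computation likely uses cv2 rather than this pure-NumPy convolution.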

DuplicateDetector: A class for detecting exact and near-duplicate images in a dataset.

  • get_exact_duplicates(images)
    • Returns a list of exact duplicate images from the dataset.
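Exact-duplicate detection is typically done by hashing file contents. The following is a minimal sketch of that idea, independent of the package (the `get_exact_duplicates` signature here takes a name-to-bytes mapping for illustration; the package's own method operates on dataset images):

```python
import hashlib
from collections import defaultdict

def get_exact_duplicates(image_files: dict) -> list:
    """Group byte-identical images by the SHA-256 of their contents.

    `image_files` maps a filename to its raw bytes; returns groups
    (lists of filenames) that share identical content.
    """
    by_hash = defaultdict(list)
    for name, data in image_files.items():
        by_hash[hashlib.sha256(data).hexdigest()].append(name)
    return [names for names in by_hash.values() if len(names) > 1]

files = {
    "a.png": b"\x89PNG...same",
    "b.png": b"\x89PNG...same",   # byte-identical copy of a.png
    "c.png": b"\x89PNG...other",
}
print(get_exact_duplicates(files))  # [['a.png', 'b.png']]
```

Near-duplicate detection (also listed under Features) would instead compare perceptual hashes, which tolerate resizing and re-encoding; content hashing only catches byte-identical copies.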

CorruptionDetector: A class for detecting corrupt images in a dataset.

  • is_corrupt(image)
    • Returns True if the image is corrupt or unreadable.
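As a rough illustration of what "corrupt or unreadable" can mean, the sketch below checks only file signatures (magic bytes). This is a deliberately simplified stand-in: a real detector, and presumably this package, would attempt a full decode (for example via Pillow) and treat any decoding failure as corruption:

```python
def is_corrupt(data: bytes) -> bool:
    """Crude readability check: the file must be non-trivially sized
    and start with a known image signature. A real detector would
    fully decode the image instead of only inspecting its header.
    """
    signatures = (
        b"\x89PNG\r\n\x1a\n",   # PNG
        b"\xff\xd8\xff",        # JPEG
        b"GIF87a", b"GIF89a",   # GIF
        b"BM",                  # BMP
    )
    return len(data) < 16 or not data.startswith(signatures)

print(is_corrupt(b"\xff\xd8\xff" + b"\x00" * 100))  # False: looks like a JPEG
print(is_corrupt(b"not an image at all"))           # True
```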

Performance Optimizations

  • Parallel Processing: The package supports parallel processing via ThreadPoolExecutor to handle large datasets efficiently.
  • Batch Processing: Process images in batches to avoid memory overload.
  • Lazy Loading: Images are processed on-demand to minimize memory usage.
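The three optimizations above combine naturally: lazily slice the dataset into batches, then fan each batch out to a thread pool. A minimal sketch of that pattern with the standard library (the `analyze` stand-in and function names are illustrative, not the package's internals):

```python
from concurrent.futures import ThreadPoolExecutor
from itertools import islice

def batched(items, batch_size):
    """Lazily yield successive fixed-size batches from an iterable."""
    it = iter(items)
    while batch := list(islice(it, batch_size)):
        yield batch

def analyze(path: str) -> dict:
    # Stand-in for per-image work (loading, scoring, corruption checks, ...)
    return {"path": path, "ok": True}

def process_dataset(paths, batch_size=100, num_workers=4):
    results = []
    with ThreadPoolExecutor(max_workers=num_workers) as pool:
        for batch in batched(paths, batch_size):
            # Only one batch is materialized at a time, bounding memory use
            results.extend(pool.map(analyze, batch))
    return results

paths = [f"img_{i}.jpg" for i in range(250)]
report = process_dataset(paths, batch_size=100, num_workers=4)
print(len(report))  # 250
```

Threads suit this workload because image decoding in Pillow/OpenCV releases the GIL; the `batch_size` and `num_workers` parameters here mirror the `EDAReport` constructor arguments documented above.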

Contributing

We welcome contributions to viz_scout! If you'd like to contribute, please follow these steps:

1. Fork the repository.
2. Create a new branch.
3. Make your changes.
4. Write tests to cover your changes (if applicable).
5. Submit a pull request.

License

This project is licensed under the MIT License. See the LICENSE file for more details.


Contact

For any questions or issues, please open an issue on GitHub or contact [rohandhatbale@gmail.com].
