Content Accessibility Utility on AWS

Digital content stakeholders across industries aim to meet accessibility compliance standards efficiently. The “Content Accessibility Utility on AWS” offers a comprehensive solution for modernizing web content accessibility with state-of-the-art generative AI models, powered by Amazon Bedrock. It allows users to automatically audit and remediate WCAG 2.1 accessibility compliance issues. To get started, the solution offers a Python CLI and API. Current capabilities include batch processing for handling large volumes of content efficiently and usage tracking for detailed cost management. Support for other content types and modalities is planned.

Features

  • Convert PDF documents to accessible HTML
  • Preserve layout and visual appearance
  • Extract and embed images
  • Audit HTML for WCAG 2.1 accessibility compliance
  • Remediate common accessibility issues using Bedrock models
  • Advanced table remediation strategies
  • Support for single-page and multi-page output formats
  • Batch processing capabilities for large-scale document processing
  • Detailed usage tracking for Bedrock Data Automation (BDA) pages and Bedrock tokens
  • Cost analysis tools for resource usage monitoring
  • Streamlit sample web interface with usage visualization

Prerequisites

Before using the Content Accessibility Utility on AWS, ensure the following prerequisites are met:

  1. AWS Account: You need an AWS account with appropriate permissions.

  2. S3 Bucket: Create an S3 bucket for storing input files, intermediate results, and outputs.

    aws s3 mb s3://my-accessibility-bucket
    
  3. BDA Project: Set up an Amazon Bedrock Data Automation (BDA) project.

    aws bedrock-data-automation create-data-automation-project \
        --project-name my-accessibility-project \
        --standard-output-configuration '{"document": {"extraction": {"granularity": {"types": ["DOCUMENT", "PAGE", "ELEMENT"]},"boundingBox": {"state": "ENABLED"}},"generativeField": {"state": "DISABLED"},"outputFormat": {"textFormat": {"types": ["HTML"]},"additionalFileFormat": {"state": "ENABLED"}}}}'
    

    Note the projectArn from the output, as it will be required for processing.

  4. AWS CLI Configuration: Configure AWS credentials and default region.

    aws configure
    

Installation

# From PyPI
pip install content-accessibility-utility-on-aws

# From source
pip install .

Configuration

Environment Variables

Set the following environment variables to configure the tool:

export BDA_S3_BUCKET=my-accessibility-bucket
export BDA_PROJECT_ARN=arn:aws:bedrock:us-west-2:123456789012:project/my-accessibility-project

Optional environment variables:

  • AWS_PROFILE: Specify an AWS CLI profile to use.
  • CONTENT_ACCESSIBILITY_WORK_DIR: Directory for temporary files (default: system temp).
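
As a rough illustration of how these variables might be consumed, the sketch below reads them into a small settings object and fails early when a required one is missing. The `EnvSettings` class and `load_env_settings` function are hypothetical names used only for this example and are not part of the package; the ARN check only verifies the general six-field ARN shape.

```python
import os
from dataclasses import dataclass
from typing import Mapping, Optional


@dataclass(frozen=True)
class EnvSettings:
    """Hypothetical container for the documented environment variables."""
    s3_bucket: str
    project_arn: str
    aws_profile: Optional[str] = None
    work_dir: Optional[str] = None


def load_env_settings(env: Mapping[str, str] = os.environ) -> EnvSettings:
    """Collect the documented variables, failing early if a required one is missing."""
    required = ("BDA_S3_BUCKET", "BDA_PROJECT_ARN")
    missing = [name for name in required if not env.get(name)]
    if missing:
        raise RuntimeError("Missing required environment variables: " + ", ".join(missing))

    arn = env["BDA_PROJECT_ARN"]
    # ARNs have six colon-separated fields: arn:partition:service:region:account:resource
    parts = arn.split(":", 5)
    if len(parts) != 6 or parts[0] != "arn":
        raise ValueError(f"BDA_PROJECT_ARN does not look like an ARN: {arn!r}")

    return EnvSettings(
        s3_bucket=env["BDA_S3_BUCKET"],
        project_arn=arn,
        aws_profile=env.get("AWS_PROFILE"),  # optional
        work_dir=env.get("CONTENT_ACCESSIBILITY_WORK_DIR"),  # optional
    )
```

Passing a plain dictionary instead of `os.environ` makes the function easy to exercise in tests without touching the real environment.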

Example Configuration File

The tool supports configuration files for easier setup. Below is an example configuration file (my-config.yaml):

# PDF conversion settings
pdf:
  extract_images: true
  image_format: png
  embed_images: false
  single_file: true
  continuous: true
  embed_fonts: false
  exclude_images: false
  cleanup_bda_output: false

# Accessibility audit settings
audit:
  audit_accessibility: true
  min_severity: minor
  detailed_context: true
  skip_automated_checks: false
  issue_types: null  # Set to a list of specific issue types or null for all

# Remediation settings
remediate:
  max_issues: 100
  model_id: amazon.nova-lite-v1:0
  issue_types: null
  severity_threshold: minor
  report_format: json

# AWS settings
aws:
  # To use an existing BDA project:
  create_bda_project: false
  bda_project_arn: "arn:aws:bedrock:us-west-2:123456789012:project/my-accessibility-project"
  
  # OR to create a new BDA project:
  # create_bda_project: true
  # bda_project_name: "my-new-accessibility-project"
  
  s3_bucket: my-accessibility-bucket

Architecture

The package consists of four main modules working together to convert, audit, remediate, and batch process documents:

graph TD
    A[PDF2HTML] --> B[Convert PDFs to HTML]
    A --> C[Extract & Process Images]
    D[Audit] --> E[Check Accessibility Issues]
    F[Remediate] --> G[Fix Accessibility Problems]
    F --> H[Generate Remediation Reports]
    I[Batch] --> J[Orchestrate Large-scale Processing]
    I --> K[Track Jobs & Handle AWS Integration]
    
    A --> I
    D --> I
    F --> I

Core Packages

PDF2HTML

The PDF2HTML module handles conversion of PDF documents to HTML, including image extraction and processing.

graph TD
    A[PDF Source] --> B[PDF2HTML]
    B --> C[BDA Integration]
    B --> D[Image Processing]
    B --> E[HTML Generation]
    C --> F[HTML Output]
    D --> F
    E --> F

Key components:

  • Bedrock Data Automation (BDA) integration for PDF parsing
  • Image extraction and processing
  • HTML structure generation with preserved layout
  • Support for both single-page and multi-page output

Audit

The Audit module analyzes HTML for accessibility issues according to WCAG 2.1 guidelines.

graph TD
    A[HTML Input] --> B[Audit Module]
    B --> C[Document Checks]
    B --> D[Structure Checks]
    B --> E[Image Checks]
    B --> F[Table Checks]
    C --> G[Audit Report]
    D --> G
    E --> G
    F --> G

Key components:

  • Comprehensive accessibility checks
  • Issue severity classification
  • Detailed context information
  • Multiple report formats (HTML, JSON, text)
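
To make the idea of an image check concrete, here is a minimal sketch using only the Python standard library. It is not the package's implementation: the `MissingAltChecker` class, the `find_images_without_alt` function, the `missing-alt` issue type, and the severity assigned to it are all assumptions made for illustration. An empty `alt=""` is treated as acceptable because it is the conventional way to mark decorative images.

```python
from html.parser import HTMLParser


class MissingAltChecker(HTMLParser):
    """Collects <img> elements that have no alt attribute at all."""

    def __init__(self):
        super().__init__()
        self.issues = []

    def handle_starttag(self, tag, attrs):
        if tag != "img":
            return
        attr_map = dict(attrs)
        if "alt" not in attr_map:
            line, _column = self.getpos()
            self.issues.append({
                "type": "missing-alt",    # hypothetical issue type name
                "severity": "critical",   # hypothetical severity assignment
                "src": attr_map.get("src"),
                "line": line,
            })


def find_images_without_alt(html: str) -> list:
    """Return one issue dictionary per <img> that lacks an alt attribute."""
    checker = MissingAltChecker()
    checker.feed(html)
    checker.close()
    return checker.issues
```

Self-closing tags such as `<img ... />` are also covered, because `HTMLParser` routes them through `handle_starttag` by default.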

Remediate

The Remediate module fixes accessibility issues identified during audit.

graph TD
    A[HTML with Issues] --> B[Remediate Module]
    B --> C[AI Remediation Strategies]
    B --> D[Direct Fixes]
    C --> E[Remediated HTML]
    D --> E
    B --> F[Table Remediation]
    F --> G[Direct Table Fixes]
    F --> H[AI-Powered Table Fixes]
    G --> E
    H --> E

Key components:

  • AI-powered remediation using Bedrock models
  • Direct fixes for common issues
  • Advanced table structure remediation
  • Image accessibility enhancements
  • Remediation reporting
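
A "direct fix" is a deterministic change that needs no model call. As one hedged example, WCAG 2.1 expects the page language to be declared, so a direct fix could add a `lang` attribute to the `<html>` element when it is absent. The function name `add_lang_attribute` and the default language are assumptions for this sketch, and the regular expression handles only straightforward markup; it does not represent how the package performs its fixes.

```python
import re

# Matches an opening <html ...> tag that has no lang attribute yet.
_HTML_TAG_WITHOUT_LANG = re.compile(
    r"<html\b(?![^>]*\slang\s*=)([^>]*)>",
    re.IGNORECASE,
)


def add_lang_attribute(html: str, default_lang: str = "en") -> str:
    """Insert lang="<default_lang>" on the first <html> tag that lacks one."""
    def _replace(match):
        # Keep any attributes that were already on the tag.
        return f'<html lang="{default_lang}"{match.group(1)}>'

    return _HTML_TAG_WITHOUT_LANG.sub(_replace, html, count=1)
```

Documents that already declare a language are returned unchanged, so the fix is safe to apply repeatedly.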

Batch

The Batch module provides orchestration for processing documents at scale.

graph TD
    A[Document Source] --> B[Batch Module]
    B --> C[Job Management]
    B --> D[AWS Integration]
    B --> E[Processing Pipeline]
    C --> F[Status Tracking]
    D --> G[S3 & DynamoDB]
    E --> H[Lambda Integration]
    F --> I[Job Completion]
    G --> I
    H --> I

Key components:

  • AWS service integrations
  • Job tracking and status management
  • Asynchronous processing
  • Lambda function support

Command Line Interface

The package provides a command-line interface with several subcommands:

PDF to HTML Conversion

content-accessibility-utility-on-aws convert --input path/to/document.pdf --output output/directory

Options:

  • --single-file: Generate a single output file
  • --single-page: Combine all pages into a single HTML document
  • --multi-page: Keep pages as separate HTML files
  • --extract-images: Extract and include images from the PDF (default: True)
  • --image-format [png|jpg|webp]: Format for extracted images
  • --embed-images: Embed images as data URIs in HTML
  • --s3-bucket: Name of an existing S3 bucket to use
  • --bda-project-arn: ARN of an existing BDA project to use
  • --create-bda-project: Create a new BDA project if needed
  • --config: Path to configuration file
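
For readers unfamiliar with data URIs, the sketch below shows what embedding an image in HTML generally involves: the image bytes are Base64-encoded and placed directly in the `src` attribute. The helper names `to_data_uri` and `embedded_img_tag` are hypothetical and only illustrate the concept behind `--embed-images`, not the tool's internal code.

```python
import base64
import html

# MIME types for the image formats listed for --image-format.
MIME_TYPES = {"png": "image/png", "jpg": "image/jpeg", "webp": "image/webp"}


def to_data_uri(image_bytes: bytes, image_format: str = "png") -> str:
    """Encode raw image bytes as a data URI usable in an <img src>."""
    mime = MIME_TYPES[image_format]
    encoded = base64.b64encode(image_bytes).decode("ascii")
    return f"data:{mime};base64,{encoded}"


def embedded_img_tag(image_bytes: bytes, alt: str, image_format: str = "png") -> str:
    """Build an <img> tag whose image data is embedded in the markup."""
    src = to_data_uri(image_bytes, image_format)
    return f'<img src="{src}" alt="{html.escape(alt, quote=True)}">'
```

Embedding produces a self-contained HTML file at the cost of a larger document, since Base64 inflates the image data by roughly one third.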

Accessibility Audit

content-accessibility-utility-on-aws audit --input path/to/document.html --output accessibility-report.json --format json

For HTML report:

content-accessibility-utility-on-aws audit --input path/to/document.html --output accessibility-report.html --format html

Options:

  • --format, -f [json|html|text]: Output format for audit report
  • --checks: Comma-separated list of checks to run
  • --severity [minor|major|critical]: Minimum severity level to include in report
  • --detailed: Include detailed context information in report (default: True)
  • --summary-only: Only include summary information in report
  • --config: Path to configuration file
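
The severity option acts as a minimum threshold over the ordered levels minor, major, and critical. A minimal sketch of that filtering logic follows; the issue dictionaries and the `filter_by_severity` helper are assumptions for illustration, since the actual report schema is not described here.

```python
# Severity levels in ascending order, as listed for the --severity option.
SEVERITY_ORDER = {"minor": 0, "major": 1, "critical": 2}


def filter_by_severity(issues, min_severity="minor"):
    """Keep only the issues at or above the given minimum severity."""
    threshold = SEVERITY_ORDER[min_severity]
    return [
        issue for issue in issues
        if SEVERITY_ORDER[issue["severity"]] >= threshold
    ]


# Hypothetical issues, shaped only for this example.
example_issues = [
    {"type": "missing-alt", "severity": "critical"},
    {"type": "table-missing-headers", "severity": "major"},
    {"type": "redundant-title", "severity": "minor"},
]
major_and_up = filter_by_severity(example_issues, "major")
```

With `min_severity="major"`, the minor issue is dropped and the other two are kept in their original order.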

Remediation

content-accessibility-utility-on-aws remediate --input path/to/document.html --output remediated.html

Options:

  • --auto-fix: Automatically fix issues where possible
  • --max-issues: Maximum number of issues to remediate
  • --model-id: Bedrock model ID to use for remediation
  • --severity-threshold [minor|major|critical]: Minimum severity level to remediate
  • --audit-report: Path to audit report JSON file to use for remediation
  • --single-page: Combine all pages into a single HTML document
  • --multi-page: Keep pages as separate HTML files
  • --generate-report: Generate a remediation report after remediation (default: True)
  • --report-format [html|json|text]: Format for the remediation report
  • --config: Path to configuration file

Complete Processing

content-accessibility-utility-on-aws process --input path/to/document.pdf --output output/directory

This command runs the full workflow:

  1. Converts PDF to HTML
  2. Audits the HTML for accessibility issues
  3. Remediates the issues found
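
The three steps above can be sketched as a small orchestration function with optional stages. The stage callables passed in are placeholders, and `run_pipeline` is a hypothetical name; this only illustrates how skip flags could gate the audit and remediation steps, not how the `process` command is implemented.

```python
from typing import Any, Callable, Dict, Optional


def run_pipeline(
    pdf_path: str,
    convert: Callable[[str], str],
    audit: Callable[[str], Dict[str, Any]],
    remediate: Callable[[str, Optional[Dict[str, Any]]], str],
    skip_audit: bool = False,
    skip_remediation: bool = False,
) -> Dict[str, Any]:
    """Run convert, then audit, then remediate, honoring the skip flags."""
    result: Dict[str, Any] = {
        "html": convert(pdf_path),  # step 1 always runs
        "audit": None,
        "remediated_html": None,
    }
    if not skip_audit:
        result["audit"] = audit(result["html"])  # step 2
    if not skip_remediation:
        # Step 3. If the audit was skipped, the report is None and the
        # remediation stage is assumed to cope with that on its own.
        result["remediated_html"] = remediate(result["html"], result["audit"])
    return result
```

Injecting the stages as callables keeps the control flow testable with simple stubs, independent of any AWS calls.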

Options:

  • --skip-audit: Skip the audit step
  • --skip-remediation: Skip the remediation step
  • --audit-format [json|html|text]: Format for the audit report
  • --severity [minor|major|critical]: Minimum severity level for audit and remediation
  • --auto-fix: Automatically fix issues where possible
  • Plus all options available in the individual commands
  • --config: Path to configuration file

Use a configuration file

content-accessibility-utility-on-aws convert --config my-config.yaml --input document.pdf

Override config file settings with command-line arguments

content-accessibility-utility-on-aws audit --config my-config.yaml --severity major --input document.html
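
The override behavior can be pictured as a layered merge in which command-line values win over the configuration file. The sketch below assumes that built-in defaults form the lowest layer and that unset command-line options arrive as `None`; both assumptions, and the `merge_settings` helper itself, are illustrative rather than taken from the package.

```python
from typing import Any, Dict


def merge_settings(
    defaults: Dict[str, Any],
    file_config: Dict[str, Any],
    cli_args: Dict[str, Any],
) -> Dict[str, Any]:
    """Merge settings with precedence: CLI arguments > config file > defaults."""
    merged = dict(defaults)
    merged.update(file_config)
    # Only options the user actually passed (non-None) override earlier layers.
    merged.update({key: value for key, value in cli_args.items() if value is not None})
    return merged


# Mirrors the example command: the file says "minor", the CLI says "major".
settings = merge_settings(
    defaults={"min_severity": "minor", "detailed_context": False},
    file_config={"min_severity": "minor", "detailed_context": True},
    cli_args={"min_severity": "major", "detailed_context": None},
)
```

Here `settings["min_severity"]` ends up as `"major"` from the command line, while `detailed_context` keeps the value from the file.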

Common Options

These options are available for all commands:

  • --input, -i: Input file or directory path (required)
  • --output, -o: Output file or directory path (defaults to a path based on input name)
  • --debug: Enable debug logging
  • --quiet, -q: Only output reports, suppress other output
  • --config, -c: Path to configuration file
  • --profile: AWS profile name to use for credentials

Output Structure

Convert Command Output

output-directory/
├── extracted_html/              # Directory with HTML files
│   ├── document.html            # Combined HTML file (if --single-file)
│   ├── page-0.html              # Individual page files (if not --single-file)
│   ├── page-1.html
│   └── ...
└── images/                      # Directory with extracted images
    ├── image-0.png
    ├── image-1.png
    └── ...
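
When post-processing this layout, page files should be ordered numerically rather than alphabetically so that page-10.html does not sort before page-2.html. A minimal sketch under the directory structure shown above follows; `list_pages` is a hypothetical helper, not a function provided by the package.

```python
import re
from pathlib import Path
from typing import List

_PAGE_NAME = re.compile(r"page-(\d+)\.html")


def list_pages(output_dir: str) -> List[Path]:
    """Return the combined HTML file if present, else page files in page order."""
    html_dir = Path(output_dir) / "extracted_html"
    combined = html_dir / "document.html"
    if combined.exists():
        return [combined]

    numbered = []
    for path in html_dir.glob("page-*.html"):
        match = _PAGE_NAME.fullmatch(path.name)
        if match:  # ignore files that only resemble the page pattern
            numbered.append((int(match.group(1)), path))
    return [path for _number, path in sorted(numbered)]
```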

Process Command Output

output-directory/
├── html/                        # Directory with HTML files
├── images/                      # Directory with extracted images
├── audit_report.[json|html|txt] # Audit report
└── remediated_document.html     # Final remediated HTML file

Streamlit Sample Web Interface

A sample Streamlit web interface demonstrates the functionality of the Content Accessibility Utility on AWS. It allows users to upload documents, configure processing options, and view results interactively. To learn more about the Streamlit interface, refer to the Streamlit Guide.

Python API

The package provides a Python API for programmatic use:

Complete Processing Pipeline

from content_accessibility_with_aws.api import process_pdf_accessibility

# Process a PDF through the full pipeline
result = process_pdf_accessibility(
    pdf_path="document.pdf",
    output_dir="output/",
    conversion_options={
        "single_file": True,
        "image_format": "png"
    },
    audit_options={
        "severity_threshold": "minor",
        "detailed": True
    },
    remediation_options={
        "model_id": "amazon.nova-lite-v1:0",
        "auto_fix": True
    },
    perform_audit=True,
    perform_remediation=True
)

Individual Components

from content_accessibility_with_aws.api import (
    convert_pdf_to_html,
    audit_html_accessibility,
    remediate_html_accessibility
)

# Convert PDF to HTML
conversion_result = convert_pdf_to_html(
    pdf_path="document.pdf",
    output_dir="output/",
    options={
        "single_file": True,
        "image_format": "png"
    }
)

# Audit HTML for accessibility issues
audit_result = audit_html_accessibility(
    html_path="output/document.html",
    options={
        "severity_threshold": "minor",
        "detailed_context": True
    }
)

# Remediate accessibility issues
remediation_result = remediate_html_accessibility(
    html_path="output/document.html",
    audit_report=audit_result,
    options={
        "model_id": "amazon.nova-lite-v1:0",
        "auto_fix": True
    }
)

Batch Processing

from content_accessibility_with_aws.batch import (
    submit_batch_job,
    check_job_status,
    get_job_results
)

# Submit a batch job
job_id = submit_batch_job(
    input_bucket="my-bucket",
    input_key="documents/file.pdf",
    output_bucket="my-bucket",
    output_prefix="results/",
    process_options={
        "perform_audit": True,
        "perform_remediation": True
    }
)

# Check job status
status = check_job_status(job_id)

# Get job results when complete
if status["status"] == "COMPLETED":
    results = get_job_results(job_id)
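
Because batch jobs run asynchronously, a single status check is rarely enough. The sketch below wraps a status function in a polling loop with a doubling delay and an overall timeout. It assumes the status call returns a dictionary with a "status" key, as in the example above; the "FAILED" state, the `wait_for_job` helper, and its parameters are assumptions made for illustration.

```python
import time
from typing import Any, Callable, Dict

# "COMPLETED" appears in the example above; "FAILED" is an assumed terminal state.
TERMINAL_STATES = {"COMPLETED", "FAILED"}


def wait_for_job(
    job_id: str,
    check_status: Callable[[str], Dict[str, Any]],
    poll_seconds: float = 5.0,
    max_wait_seconds: float = 600.0,
    sleep: Callable[[float], None] = time.sleep,
) -> Dict[str, Any]:
    """Poll check_status until the job reaches a terminal state or times out."""
    waited = 0.0
    delay = poll_seconds
    while True:
        status = check_status(job_id)
        if status["status"] in TERMINAL_STATES:
            return status
        if waited >= max_wait_seconds:
            raise TimeoutError(f"Job {job_id} did not finish within {max_wait_seconds} seconds")
        sleep(delay)
        waited += delay
        delay = min(delay * 2, 60.0)  # back off, capped at one minute
```

In practice this could be called as `wait_for_job(job_id, check_job_status)`, with the results fetched only when the returned status is "COMPLETED".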

Requirements

  • Python 3.11+
  • AWS credentials for Bedrock Data Automation and Bedrock models
  • Appropriate IAM permissions for S3 and BDA services

AWS credentials can be supplied in any of the following ways:

  1. Configure the AWS CLI with aws configure
  2. Set environment variables (AWS_ACCESS_KEY_ID, AWS_SECRET_ACCESS_KEY)
  3. Specify a profile with the --profile option

License

Apache-2.0 License. See LICENSE for details.

Contributing

Contributions are welcome! Please see CONTRIBUTING.md for details on how to contribute to this project.

