NLP-based compliance analysis for EU AI Act Annex IV documents

annex4nlp

NLP-based compliance analysis for EU AI Act Annex IV documents.

This package uses natural language processing (spaCy and negspaCy) to check technical documentation in PDF form against EU AI Act Annex IV and GDPR requirements. It reports missing Annex IV sections, possible GDPR gaps, and contradictions within and across documents.

⚠️ Legal Disclaimer: This software is provided for informational and compliance assistance purposes only. It is not legal advice and should not be relied upon as such. Users are responsible for ensuring their documentation meets all applicable legal requirements and should consult with qualified legal professionals for compliance matters.

🔒 Data Protection: All processing occurs locally on your machine. No data leaves your system.

🚀 Quick Start

# Install the package
pip install annex4nlp

# Analyze a single PDF file
annex4nlp document.pdf

# Analyze multiple PDF files
annex4nlp doc1.pdf doc2.pdf doc3.pdf

# Hide informational messages (negated terms)
annex4nlp document.pdf --hide-info

✨ Features

  • 📄 PDF Text Extraction: Extract text from PDF documents using multiple libraries (PyPDF2, pdfplumber, PyMuPDF)
  • 🔍 Compliance Analysis: Analyze documents for missing Annex IV sections and compliance issues
  • ⚠️ Contradiction Detection: Detect contradictions within and across documents using NLP
  • 🔒 GDPR Compliance: Check for GDPR compliance issues in technical documentation
  • ⚡ Batch Processing: Efficient batch processing of multiple documents
  • 🖥️ CLI Interface: Command-line interface for easy integration
  • 🧠 Advanced NLP: Uses spaCy and negspaCy for intelligent analysis
  • 📊 Detailed Reporting: Console output with error/warning/info classification
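
The multi-library text extraction listed above could be arranged as a fallback chain. The following is a minimal sketch under that assumption, not the package's actual implementation: the function names (`extract_text`, `extract_with_pymupdf`, and so on) and the order in which the libraries are tried are hypothetical. Only the calls into PyMuPDF, pdfplumber, and PyPDF2 are those libraries' public APIs.

```python
from pathlib import Path
from typing import Callable, Iterable

Extractor = Callable[[Path], str]


def extract_with_pymupdf(path: Path) -> str:
    import fitz  # PyMuPDF; imported lazily so a missing library only fails this extractor

    with fitz.open(str(path)) as doc:
        return "\n".join(page.get_text() for page in doc)


def extract_with_pdfplumber(path: Path) -> str:
    import pdfplumber

    with pdfplumber.open(str(path)) as pdf:
        # extract_text() can return None for pages without a text layer
        return "\n".join(page.extract_text() or "" for page in pdf.pages)


def extract_with_pypdf2(path: Path) -> str:
    from PyPDF2 import PdfReader

    reader = PdfReader(str(path))
    return "\n".join(page.extract_text() or "" for page in reader.pages)


# Hypothetical default order; the package may try the libraries differently.
DEFAULT_EXTRACTORS = (extract_with_pymupdf, extract_with_pdfplumber, extract_with_pypdf2)


def extract_text(path: Path, extractors: Iterable[Extractor] = DEFAULT_EXTRACTORS) -> str:
    """Return the text from the first extractor that yields non-empty output."""
    errors = []
    for extractor in extractors:
        try:
            text = extractor(path)
        except Exception as exc:  # library not installed, unreadable file, ...
            errors.append(f"{extractor.__name__}: {exc}")
            continue
        if text.strip():
            return text
        errors.append(f"{extractor.__name__}: no text extracted")
    raise RuntimeError(f"Could not extract text from {path}: " + "; ".join(errors))
```

Passing the extractors as a parameter keeps the chain testable without real PDF files, and a scanned PDF with no text layer ends in a single error that lists what each library reported.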

Installation

pip install annex4nlp

📖 Usage

CLI Usage

# Analyze a single PDF file
annex4nlp document.pdf

# Analyze multiple PDF files
annex4nlp doc1.pdf doc2.pdf doc3.pdf

# Hide informational messages (negated terms)
annex4nlp document.pdf --hide-info

# Get help
annex4nlp --help

Python API

from annex4nlp import review_documents
from pathlib import Path

# Analyze multiple PDF files
pdf_files = [Path("doc1.pdf"), Path("doc2.pdf")]
issues = review_documents(pdf_files)

for issue in issues:
    print(f"{issue['type']}: {issue['message']}")
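
To present results grouped the way the CLI report does, the returned list can be bucketed by type. This is a small sketch that relies only on the `type` and `message` keys used above; the sample data and the type values `"error"`, `"warning"`, and `"info"` are assumptions for illustration, and `group_by_type` is a hypothetical helper, not part of the package.

```python
from collections import defaultdict

# Sample data shaped like the dictionaries above ("type" and "message" keys).
# The type values are assumed; check what review_documents actually returns.
sample_issues = [
    {"type": "error", "message": "Missing content for Annex IV section 4 (performance metrics)."},
    {"type": "warning", "message": "No mention of data deletion or subject access rights (check GDPR compliance)."},
    {"type": "info", "message": "Term 'personal data' negated on page 1."},
    {"type": "error", "message": "Missing content for Annex IV section 7 (standards applied)."},
]


def group_by_type(issues):
    """Return a dict mapping each issue type to the list of its messages."""
    grouped = defaultdict(list)
    for issue in issues:
        grouped[issue["type"]].append(issue["message"])
    return dict(grouped)


grouped = group_by_type(sample_issues)
for issue_type, messages in grouped.items():
    print(f"{issue_type.upper()} ({len(messages)}):")
    for number, message in enumerate(messages, start=1):
        print(f"  {number}. {message}")
```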

Single Document Analysis

from pathlib import Path

from annex4nlp import review_single_document

issues = review_single_document(Path("document.pdf"))

Text Analysis

from annex4nlp import analyze_text

text_content = "Your technical documentation text here..."
issues = analyze_text(text_content, "document_name")

API with Info Filtering

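from pathlib import Path

from annex4nlp import create_review_response, review_documents

# Collect the issues first, then build the response from them
issues = review_documents([Path("document.pdf")])

# Get all issues including info messages
response = create_review_response(issues, ["document.pdf"], hide_info=False)

# Filter out info messages
response = create_review_response(issues, ["document.pdf"], hide_info=True)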
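
The effect of `hide_info` can be pictured as a filter over the issue list. The sketch below is an assumption about the behavior, not the code behind `create_review_response`: `filter_issues` is a hypothetical helper, and it assumes info-level issues carry the type value `"info"`.

```python
def filter_issues(issues, hide_info=False):
    """Drop info-level issues when hide_info is True; otherwise return all issues."""
    if not hide_info:
        return list(issues)
    # Assumes info-level issues use the type value "info"
    return [issue for issue in issues if issue["type"] != "info"]


example_issues = [
    {"type": "error", "message": "Missing content for Annex IV section 6 (changes and versions)."},
    {"type": "info", "message": "Term 'authentication' negated on page 1."},
]

visible = filter_issues(example_issues, hide_info=True)
print(len(visible), "issue(s) shown after filtering")
```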

🔍 Analysis Capabilities

Annex IV Compliance

  • Section Validation: Checks for all required Annex IV sections (1-9)
  • High-risk Detection: Validates high-risk system declarations
  • Missing Elements: Identifies missing compliance elements
  • Content Analysis: Analyzes section content for completeness
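
As a rough picture of section validation, a checker can look for section-specific keywords and report the sections where none is found. This is a simplified sketch, not the package's method: `find_missing_sections` and `SECTION_KEYWORDS` are hypothetical names, plain substring matching stands in for the spaCy-based matching, and the map covers only the four section labels that appear in the example output (a full map would cover sections 1-9).

```python
import re

# Illustrative keyword map. The labels come from the tool's example output;
# using them as search keywords is an assumption made for this sketch.
SECTION_KEYWORDS = {
    4: ["performance metrics"],
    6: ["changes and versions"],
    7: ["standards applied"],
    8: ["compliance declaration"],
}


def find_missing_sections(text, keywords_by_section=SECTION_KEYWORDS):
    """Return the section numbers for which no keyword occurs in the text."""
    # Lowercase and collapse whitespace so line breaks inside a phrase do not hide it
    normalized = re.sub(r"\s+", " ", text.lower())
    missing = []
    for section in sorted(keywords_by_section):
        keywords = keywords_by_section[section]
        if not any(keyword in normalized for keyword in keywords):
            missing.append(section)
    return missing


sample_text = "Performance\nmetrics are reported in chapter 3. Standards applied: see appendix."
for section in find_missing_sections(sample_text):
    print(f"Missing content for Annex IV section {section}")
```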

GDPR Compliance

  • Personal Data: Analyzes how personal data is handled
  • Legal Basis: Verifies that a legal basis for data processing is stated
  • Data Subject Rights: Checks for mentions of data subject rights
  • Retention Periods: Validates data retention periods
  • Consent Management: Analyzes consent mechanisms
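
Two of these checks can be approximated with simple keyword heuristics. The sketch below is an assumption about how such checks might work, not the package's implementation: `gdpr_warnings` is a hypothetical function, the keyword lists are illustrative, and the warning texts are copied from the example output further down.

```python
def gdpr_warnings(text):
    """Return GDPR-related warnings based on simple keyword presence."""
    lowered = text.lower()
    warnings = []

    # Without any mention of personal data, these two checks do not apply
    if "personal data" not in lowered:
        return warnings

    # Illustrative keyword lists; a real checker would need broader coverage
    basis_terms = ("consent", "lawful basis", "legal basis")
    rights_terms = ("deletion", "erasure", "right of access", "subject access")

    if not any(term in lowered for term in basis_terms):
        warnings.append(
            "Personal data use without mention of consent or lawful basis (possible GDPR issue)."
        )
    if not any(term in lowered for term in rights_terms):
        warnings.append(
            "No mention of data deletion or subject access rights (check GDPR compliance)."
        )
    return warnings


for warning in gdpr_warnings("The system stores personal data of all users."):
    print(warning)
```

Substring checks like these cannot tell "we obtain consent" from "we do not obtain consent", which is where negation handling becomes necessary.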

Contradiction Detection

  • Internal Contradictions: Finds inconsistencies within single documents
  • Cross-document Issues: Detects contradictions between multiple documents
  • System Information: Identifies conflicts in system names and versions
  • Policy Inconsistencies: Finds conflicting policy statements
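
One concrete case is a system version that differs between documents, or within one. The sketch below illustrates that single check under stated assumptions: `find_version_conflicts` and the regular expression are hypothetical, and a regex stands in for the NLP-based detection the package describes.

```python
import re
from collections import defaultdict

# Matches phrases such as "version 1.2" or "Version v2.0.1" (illustrative pattern)
VERSION_PATTERN = re.compile(r"\bversion\s+v?(\d+(?:\.\d+)+)", re.IGNORECASE)


def find_version_conflicts(documents):
    """Take a dict of document name -> text and report differing version numbers."""
    sources = defaultdict(set)
    for name, text in documents.items():
        for version in VERSION_PATTERN.findall(text):
            sources[version].add(name)

    # Zero or one distinct version means there is nothing to contradict
    if len(sources) <= 1:
        return []

    details = "; ".join(
        f"{version} in {', '.join(sorted(names))}"
        for version, names in sorted(sources.items())
    )
    return [f"Conflicting system versions: {details}"]


docs = {
    "a.pdf": "System version 1.2 is deployed in production.",
    "b.pdf": "This document describes Version 1.3.",
}
for message in find_version_conflicts(docs):
    print(message)
```

Because versions are collected across all texts before comparing, the same function flags both a cross-document conflict and two different versions inside a single document.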

Advanced NLP Features

  • Negation Detection: Uses negspaCy for intelligent negation handling
  • Term Matching: Advanced term matching with spaCy
  • Semantic Analysis: Semantic analysis of compliance terms
  • Context Awareness: Context-aware analysis of technical documentation
  • Info Messages: Informational messages about negated terms (can be filtered with --hide-info)
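
The idea behind the info messages, a term that appears only in negated form, can be shown with a much simpler rule than negspaCy applies. The sketch below is a stand-in technique and not the negspaCy algorithm: it treats a term as negated when a cue word precedes it in the same sentence. `terms_only_negated` and `NEGATION_CUES` are hypothetical names.

```python
import re

# Small illustrative cue list; negspaCy uses a much richer set of patterns
NEGATION_CUES = {"no", "not", "never", "without"}


def split_sentences(text):
    """Naive sentence splitter: break after ., ! or ? followed by whitespace."""
    return [part.strip() for part in re.split(r"(?<=[.!?])\s+", text) if part.strip()]


def terms_only_negated(text, terms):
    """Return the terms that occur in the text and are negated in every sentence using them."""
    result = []
    sentences = split_sentences(text.lower())
    for term in terms:
        seen = 0
        negated = 0
        for sentence in sentences:
            # Only the first occurrence of the term in each sentence is examined
            position = sentence.find(term.lower())
            if position == -1:
                continue
            seen += 1
            preceding_words = set(re.findall(r"[a-z']+", sentence[:position]))
            if NEGATION_CUES & preceding_words:
                negated += 1
        if seen and seen == negated:
            result.append(term)
    return result


sample = (
    "The system does not collect personal data. "
    "Authentication is required. No authentication logs are kept."
)
print(terms_only_negated(sample, ["personal data", "authentication", "biometric data"]))
```

A term with at least one non-negated mention is not reported, which mirrors the "found only with negation" wording of the INFO category.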

Issue Types

The analysis categorizes issues into three types:

  • ❌ ERRORS: Critical compliance issues that need immediate attention

    • Missing Annex IV sections
    • Internal contradictions within documents
    • Cross-document contradictions
  • ⚠️ WARNINGS: Potential issues that should be reviewed

    • GDPR compliance concerns
    • Missing transparency elements
    • Incomplete policy statements
  • ℹ️ INFO: Informational messages about negated terms

    • Terms found only with negation (e.g., "does not collect personal data")
    • These may be intentional - use --hide-info to suppress

📦 Dependencies

  • typer[all]>=0.12 - CLI framework
  • spacy>=3.7.5 - Natural language processing
  • negspacy>=1.0.4 - Negation detection
  • PyPDF2>=3.0 - PDF text extraction
  • pdfplumber>=0.10 - PDF text extraction
  • PyMuPDF>=1.23 - PDF text extraction
  • nltk>=3.8 - Natural language toolkit
  • spacy-lookups-data>=1.0 - spaCy language data

📊 Example Output

Standard Output (with INFO messages)

============================================================
COMPLIANCE REVIEW RESULTS
============================================================

❌ ERRORS (4):
  1. [document.pdf] (Section 4) Missing content for Annex IV section 4 (performance metrics).
  2. [document.pdf] (Section 6) Missing content for Annex IV section 6 (changes and versions).
  3. [document.pdf] (Section 7) Missing content for Annex IV section 7 (standards applied).
  4. [document.pdf] (Section 8) Missing content for Annex IV section 8 (compliance declaration).

⚠️  WARNINGS (2):
  1. [document.pdf] Personal data use without mention of consent or lawful basis (possible GDPR issue).
  2. [document.pdf] No mention of data deletion or subject access rights (check GDPR compliance).

ℹ️  INFO (3):
  1. [document.pdf] Term 'personal data' negated on page 1.
  2. [document.pdf] Term 'post-market monitoring' negated on page 1.
  3. [document.pdf] Term 'authentication' negated on page 1.

     Note: These informational messages indicate terms found only with negation.
     This may be intentional - please verify if the negation is correct.
     Use --hide-info flag to suppress these messages.

Found 9 total issue(s): 4 errors, 2 warnings, 3 info

Output with --hide-info flag

============================================================
COMPLIANCE REVIEW RESULTS
============================================================

❌ ERRORS (4):
  1. [document.pdf] (Section 4) Missing content for Annex IV section 4 (performance metrics).
  2. [document.pdf] (Section 6) Missing content for Annex IV section 6 (changes and versions).
  3. [document.pdf] (Section 7) Missing content for Annex IV section 7 (standards applied).
  4. [document.pdf] (Section 8) Missing content for Annex IV section 8 (compliance declaration).

⚠️  WARNINGS (2):
  1. [document.pdf] Personal data use without mention of consent or lawful basis (possible GDPR issue).
  2. [document.pdf] No mention of data deletion or subject access rights (check GDPR compliance).

Found 6 total issue(s): 4 errors, 2 warnings
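
The closing summary line in both outputs follows from per-type counts. The sketch below reproduces that line from a list of issue dictionaries; `summary_line` is a hypothetical helper, and the type values `"error"`, `"warning"`, and `"info"` are assumed rather than confirmed by the package.

```python
from collections import Counter


def summary_line(issues, hide_info=False):
    """Build the 'Found N total issue(s)' line from a list of issue dicts."""
    counts = Counter(issue["type"] for issue in issues)
    parts = [f"{counts['error']} errors", f"{counts['warning']} warnings"]
    total = counts["error"] + counts["warning"]
    if not hide_info:
        # Info messages count toward the total only when they are shown
        parts.append(f"{counts['info']} info")
        total += counts["info"]
    return f"Found {total} total issue(s): " + ", ".join(parts)


# Same counts as the example output: 4 errors, 2 warnings, 3 info
demo_issues = (
    [{"type": "error", "message": "..."}] * 4
    + [{"type": "warning", "message": "..."}] * 2
    + [{"type": "info", "message": "..."}] * 3
)
print(summary_line(demo_issues))
print(summary_line(demo_issues, hide_info=True))
```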

📄 License

MIT License - see LICENSE file for details.

Third-Party Licenses

This package uses several third-party libraries. See THIRD_PARTY_LICENSES.md for the complete list of licenses.

Key dependencies:

  • spaCy: MIT License
  • negspaCy: MIT License
  • PyPDF2: BSD 3-Clause License
  • pdfplumber: MIT License
  • PyMuPDF: GNU Affero General Public License v3.0
  • NLTK: Apache License 2.0
  • Typer: MIT License
