NLP-based compliance analysis for EU AI Act Annex IV documents
annex4nlp
This package uses natural language processing (spaCy and negspaCy) to check technical documentation for compliance with EU AI Act Annex IV and GDPR requirements.
⚠️ Legal Disclaimer: This software is provided for informational and compliance assistance purposes only. It is not legal advice and should not be relied upon as such. Users are responsible for ensuring their documentation meets all applicable legal requirements and should consult with qualified legal professionals for compliance matters.
🔒 Data Protection: All processing occurs locally on your machine. No data leaves your system.
🚀 Quick Start
# Install the package
pip install annex4nlp
# Analyze a single PDF file
annex4nlp document.pdf
# Analyze multiple PDF files
annex4nlp doc1.pdf doc2.pdf doc3.pdf
# Hide informational messages (negated terms)
annex4nlp document.pdf --hide-info
✨ Features
- 📄 PDF Text Extraction: Extract text from PDF documents using multiple libraries (PyPDF2, pdfplumber, PyMuPDF)
- 🔍 Compliance Analysis: Analyze documents for missing Annex IV sections and compliance issues
- ⚠️ Contradiction Detection: Detect contradictions within and across documents using NLP
- 🔒 GDPR Compliance: Check for GDPR compliance issues in technical documentation
- ⚡ Batch Processing: Efficient batch processing of multiple documents
- 🖥️ CLI Interface: Command-line interface for easy integration
- 🧠 Advanced NLP: Uses spaCy and negspaCy for intelligent analysis
- 📊 Detailed Reporting: Console output with error/warning/info classification
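The extraction feature names three PDF libraries but does not say how they are combined. One plausible arrangement is a fallback chain that tries each backend in turn and keeps the first non-empty result. The sketch below is an illustration under that assumption, not the package's actual implementation; the function names and the backend order are hypothetical, and each backend is imported lazily so that a missing library simply falls through to the next one.

```python
from typing import Callable, Iterable, Optional

Extractor = Callable[[str], str]


def extract_pymupdf(path: str) -> str:
    import fitz  # PyMuPDF
    with fitz.open(path) as doc:
        return "\n".join(page.get_text() for page in doc)


def extract_pdfplumber(path: str) -> str:
    import pdfplumber
    with pdfplumber.open(path) as pdf:
        return "\n".join(page.extract_text() or "" for page in pdf.pages)


def extract_pypdf2(path: str) -> str:
    from PyPDF2 import PdfReader
    reader = PdfReader(path)
    return "\n".join(page.extract_text() or "" for page in reader.pages)


# Hypothetical default order; the real package may prefer a different one.
DEFAULT_EXTRACTORS = (extract_pymupdf, extract_pdfplumber, extract_pypdf2)


def extract_with_fallback(
    path: str, extractors: Iterable[Extractor] = DEFAULT_EXTRACTORS
) -> Optional[str]:
    """Return the first non-empty text any extractor produces, else None."""
    for extract in extractors:
        try:
            text = extract(path)
        except Exception:
            # Backend missing or unable to read this file: try the next one.
            continue
        if text and text.strip():
            return text
    return None
```

Passing the extractors as an argument keeps the chain easy to reorder and to exercise with stub functions.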
Installation
pip install annex4nlp
📖 Usage
CLI Usage
# Analyze a single PDF file
annex4nlp document.pdf
# Analyze multiple PDF files
annex4nlp doc1.pdf doc2.pdf doc3.pdf
# Hide informational messages (negated terms)
annex4nlp document.pdf --hide-info
# Get help
annex4nlp --help
Python API
from annex4nlp import review_documents
from pathlib import Path
# Analyze multiple PDF files
pdf_files = [Path("doc1.pdf"), Path("doc2.pdf")]
issues = review_documents(pdf_files)
for issue in issues:
    print(f"{issue['type']}: {issue['message']}")
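When the Python API runs inside a CI job, it can help to turn the issue list into a process exit code. The helper below is a minimal sketch: `exit_code` is a hypothetical name, and it assumes that each issue is a dict whose `type` value is `"error"` for critical findings, which this README does not spell out.

```python
from typing import Dict, List


def exit_code(issues: List[Dict[str, str]]) -> int:
    """Return 1 if any issue is classified as an error, otherwise 0."""
    # Assumption: issue["type"] holds a label such as "error", "warning" or "info".
    has_error = any(i.get("type", "").lower() == "error" for i in issues)
    return 1 if has_error else 0
```

A script could then finish with `sys.exit(exit_code(issues))` so that the job fails only on errors.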
Single Document Analysis
from pathlib import Path
from annex4nlp import review_single_document
issues = review_single_document(Path("document.pdf"))
Text Analysis
from annex4nlp import analyze_text
text_content = "Your technical documentation text here..."
issues = analyze_text(text_content, "document_name")
API with Info Filtering
from annex4nlp import create_review_response
# `issues` is a list returned by one of the review functions above
# Get all issues including info messages
response = create_review_response(issues, ["document.pdf"], hide_info=False)
# Filter out info messages
response = create_review_response(issues, ["document.pdf"], hide_info=True)
🔍 Analysis Capabilities
Annex IV Compliance
- Section Validation: Checks for all required Annex IV sections (1-9)
- High-risk Detection: Validates high-risk system declarations
- Missing Elements: Identifies missing compliance elements
- Content Analysis: Analyzes section content for completeness
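To make the section check concrete, here is a simplified sketch that looks for numbered headings 1 to 9 and reports the numbers that never appear. It is an assumption about how such a check could work, not the package's own logic; the regular expression and the `missing_sections` name are hypothetical.

```python
import re
from typing import List

REQUIRED_SECTIONS = range(1, 10)  # Annex IV sections 1-9

# Matches line starts such as "1. General description", "Section 4: ..." or "5) ...".
HEADING_RE = re.compile(
    r"^\s*(?:section\s+)?([1-9])[.:)]?\s+\S", re.IGNORECASE | re.MULTILINE
)


def missing_sections(text: str) -> List[int]:
    """Return the required section numbers with no matching heading in text."""
    found = {int(m.group(1)) for m in HEADING_RE.finditer(text)}
    return [n for n in REQUIRED_SECTIONS if n not in found]
```

A heading-only check like this says nothing about whether a section has adequate content, which is why the feature list treats content analysis as a separate step.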
GDPR Compliance
- Personal Data: Personal data handling analysis
- Legal Basis: Legal basis verification for data processing
- Data Subject Rights: Checking for data subject rights mentions
- Retention Periods: Validation of data retention periods
- Consent Management: Analysis of consent mechanisms
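The GDPR checks can be pictured as keyword heuristics: if a document mentions personal data but never mentions a lawful basis or data subject rights, raise a warning. The sketch below illustrates that idea only; the keyword lists and the `gdpr_warnings` name are assumptions, and the message wording is borrowed from the example output in this README.

```python
import re
from typing import List

BASIS_RE = re.compile(r"\b(consent|lawful basis|legal basis|legitimate interest)\b")
RIGHTS_RE = re.compile(r"\b(erasure|deletion|right of access|subject access)\b")


def gdpr_warnings(text: str) -> List[str]:
    """Return heuristic GDPR warnings for a document's text."""
    lowered = text.lower()
    warnings: List[str] = []
    if "personal data" not in lowered:
        return warnings  # nothing to check in this simplified sketch
    if not BASIS_RE.search(lowered):
        warnings.append(
            "Personal data use without mention of consent or lawful basis "
            "(possible GDPR issue)."
        )
    if not RIGHTS_RE.search(lowered):
        warnings.append(
            "No mention of data deletion or subject access rights "
            "(check GDPR compliance)."
        )
    return warnings
```

Plain keyword matching ignores negation, so "does not process personal data" would still trigger it; the negation handling described under Advanced NLP Features addresses that gap.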
Contradiction Detection
- Internal Contradictions: Finds inconsistencies within single documents
- Cross-document Issues: Detects contradictions between multiple documents
- System Information: Identifies conflicts in system names and versions
- Policy Inconsistencies: Finds conflicting policy statements
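As an illustration of the cross-document check on system versions, the following sketch collects every "version X.Y" mention per document and flags the set if more than one distinct version appears. The pattern and the `version_conflicts` name are hypothetical; the package's real contradiction detection is described as NLP-based and is likely more involved.

```python
import re
from collections import defaultdict
from typing import Dict, List, Set

VERSION_RE = re.compile(r"\bversion\s+(\d+(?:\.\d+)*)", re.IGNORECASE)


def version_conflicts(docs: Dict[str, str]) -> List[str]:
    """Report a conflict when documents mention different system versions."""
    seen: Dict[str, Set[str]] = defaultdict(set)  # version -> document names
    for name, text in docs.items():
        for version in VERSION_RE.findall(text):
            seen[version].add(name)
    if len(seen) <= 1:
        return []
    parts = [f"{v} ({', '.join(sorted(seen[v]))})" for v in sorted(seen)]
    return ["Conflicting system versions: " + "; ".join(parts)]
```

Because a single document can also contribute two different versions, the same function covers the internal-contradiction case when given one document.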
Advanced NLP Features
- Negation Detection: Uses negspaCy for intelligent negation handling
- Term Matching: Advanced term matching with spaCy
- Semantic Analysis: Semantic analysis of compliance terms
- Context Awareness: Context-aware analysis of technical documentation
- Info Messages: Informational messages about negated terms (can be filtered with --hide-info)
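The idea behind the negation-based info messages can be shown with a much simpler stand-in than negspaCy: look for a negation cue in the few tokens before a term, then report the term only if every occurrence is negated. This is a standard-library approximation for illustration, not the negspaCy algorithm; the cue list, the window size, and both function names are assumptions.

```python
import re
from typing import List

NEGATION_CUES = ("no", "not", "never", "without")


def is_negated(sentence: str, term: str, window: int = 4) -> bool:
    """True if a negation cue appears within `window` tokens before `term`."""
    tokens = re.findall(r"[a-z'-]+", sentence.lower())
    term_tokens = term.lower().split()
    n = len(term_tokens)
    for i in range(len(tokens) - n + 1):
        if tokens[i:i + n] == term_tokens:
            preceding = tokens[max(0, i - window):i]
            if any(cue in preceding for cue in NEGATION_CUES):
                return True
    return False


def only_negated(sentences: List[str], term: str) -> bool:
    """True if the term occurs at least once and every occurrence is negated."""
    hits = [s for s in sentences if term.lower() in s.lower()]
    return bool(hits) and all(is_negated(s, term) for s in hits)
```

A term that appears both negated and affirmed is not reported, which matches the "found only with negation" wording used for INFO messages.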
Issue Types
The analysis categorizes issues into three types:
- ❌ ERRORS: Critical compliance issues that need immediate attention
  - Missing Annex IV sections
  - Internal contradictions within documents
  - Cross-document contradictions
- ⚠️ WARNINGS: Potential issues that should be reviewed
  - GDPR compliance concerns
  - Missing transparency elements
  - Incomplete policy statements
- ℹ️ INFO: Informational messages about negated terms
  - Terms found only with negation (e.g., "does not collect personal data")
  - These may be intentional - use --hide-info to suppress
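The three categories map naturally onto a count-per-type summary like the last line of the console report. The sketch below reproduces that summary line from a list of issue dicts; it assumes the `type` values are `"error"`, `"warning"` and `"info"`, which is a guess based on the category names, and `summarize` is a hypothetical helper rather than part of the package.

```python
from collections import Counter
from typing import Dict, List


def summarize(issues: List[Dict[str, str]], hide_info: bool = False) -> str:
    """Build a one-line summary of issue counts by type."""
    counts = Counter(issue["type"].lower() for issue in issues)
    if hide_info:
        counts.pop("info", None)  # mirror the effect of --hide-info
    total = sum(counts.values())
    parts = [f"{counts['error']} errors", f"{counts['warning']} warnings"]
    if not hide_info:
        parts.append(f"{counts['info']} info")
    return f"Found {total} total issue(s): " + ", ".join(parts)
```

With four errors, two warnings and three info items, this yields the same totals as the example output: nine issues in full, six with info hidden.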
📦 Dependencies
- typer[all]>=0.12 - CLI framework
- spacy>=3.7.5 - Natural language processing
- negspacy>=1.0.4 - Negation detection
- PyPDF2>=3.0 - PDF text extraction
- pdfplumber>=0.10 - PDF text extraction
- PyMuPDF>=1.23 - PDF text extraction
- nltk>=3.8 - Natural language toolkit
- spacy-lookups-data>=1.0 - spaCy language data
📊 Example Output
Standard Output (with INFO messages)
============================================================
COMPLIANCE REVIEW RESULTS
============================================================
❌ ERRORS (4):
1. [document.pdf] (Section 4) Missing content for Annex IV section 4 (performance metrics).
2. [document.pdf] (Section 6) Missing content for Annex IV section 6 (changes and versions).
3. [document.pdf] (Section 7) Missing content for Annex IV section 7 (standards applied).
4. [document.pdf] (Section 8) Missing content for Annex IV section 8 (compliance declaration).
⚠️ WARNINGS (2):
1. [document.pdf] Personal data use without mention of consent or lawful basis (possible GDPR issue).
2. [document.pdf] No mention of data deletion or subject access rights (check GDPR compliance).
ℹ️ INFO (3):
1. [document.pdf] Term 'personal data' negated on page 1.
2. [document.pdf] Term 'post-market monitoring' negated on page 1.
3. [document.pdf] Term 'authentication' negated on page 1.
Note: These informational messages indicate terms found only with negation.
This may be intentional - please verify if the negation is correct.
Use --hide-info flag to suppress these messages.
Found 9 total issue(s): 4 errors, 2 warnings, 3 info
Output with --hide-info flag
============================================================
COMPLIANCE REVIEW RESULTS
============================================================
❌ ERRORS (4):
1. [document.pdf] (Section 4) Missing content for Annex IV section 4 (performance metrics).
2. [document.pdf] (Section 6) Missing content for Annex IV section 6 (changes and versions).
3. [document.pdf] (Section 7) Missing content for Annex IV section 7 (standards applied).
4. [document.pdf] (Section 8) Missing content for Annex IV section 8 (compliance declaration).
⚠️ WARNINGS (2):
1. [document.pdf] Personal data use without mention of consent or lawful basis (possible GDPR issue).
2. [document.pdf] No mention of data deletion or subject access rights (check GDPR compliance).
Found 6 total issue(s): 4 errors, 2 warnings
📄 License
MIT License - see LICENSE file for details.
Third-Party Licenses
This package uses several third-party libraries. See THIRD_PARTY_LICENSES.md for the complete list of licenses.
Key dependencies:
- spaCy: MIT License
- negspaCy: MIT License
- PyPDF2: BSD 3-Clause License
- pdfplumber: MIT License
- PyMuPDF: GNU Affero General Public License v3.0
- NLTK: Apache License 2.0
- Typer: MIT License
Download files
File details
Details for the file annex4nlp-1.0.1.tar.gz.
File metadata
- Download URL: annex4nlp-1.0.1.tar.gz
- Upload date:
- Size: 26.4 kB
- Tags: Source
- Uploaded using Trusted Publishing? No
- Uploaded via: twine/6.1.0 CPython/3.11.6
File hashes
| Algorithm | Hash digest |
|---|---|
| SHA256 | ca5cc5570000a37d080ec6a9aa40079041f95c83269f4cfc151f0b9208df1c15 |
| MD5 | 7308a80a797ab04dafd216c8953e22ea |
| BLAKE2b-256 | 08ca2b497f8d2f72bffd18d977ac235b4d3e3eca78e02f1c35bf0f1e9475a4d3 |
File details
Details for the file annex4nlp-1.0.1-py3-none-any.whl.
File metadata
- Download URL: annex4nlp-1.0.1-py3-none-any.whl
- Upload date:
- Size: 20.8 kB
- Tags: Python 3
- Uploaded using Trusted Publishing? No
- Uploaded via: twine/6.1.0 CPython/3.11.6
File hashes
| Algorithm | Hash digest |
|---|---|
| SHA256 | 4444d6a66452a438904fe24fe53d9e2c8b8a1894565cf4ed63428b5e85b1889a |
| MD5 | 45a03a1bc6d1808240b7901b497f826e |
| BLAKE2b-256 | 7cf0021e1fec2df0179fa61ca3705a6d1ccd8c54e847df0c323b70814e8f0616 |