
CLI and GUI tool for GDPR-compliant pseudonymization of French text documents using NLP-based entity detection and reversible mapping

Project description


GDPR Pseudonymizer


AI-Assisted Pseudonymization for French Documents with Human Verification

Transform sensitive French documents for safe AI analysis with local processing, mandatory human review, and GDPR compliance.


What's New in v2.1

  • Validate-once-per-entity - Accept or reject one entity occurrence and the decision applies to every occurrence with the same text in the document (productivity boost for repeated names)
  • Excel/CSV Format Support - Process .xlsx and .csv files with cell-aware pseudonymization for HR/compliance use cases (pip install gdpr-pseudonymizer[excel])
  • Neutral ID Pseudonym Theme - Counter-based identifiers (PERSON-001, LIEU-001, ORG-001) for formal/legal contexts (--theme neutral_id)
  • NER Accuracy Improvements - Expanded ORG detection patterns, POS-tag disambiguation for geography matching; LOCATION false-negative rate reduced from 27% to 13%
  • GUI Discoverability - F1 help dialog listing all keyboard shortcut groups, database path persistence across sessions, "Hide confirmed" toggle

Upgrade: pip install --upgrade "gdpr-pseudonymizer[gui,excel]"


Download - Standalone Executables (No Python Required)

Pre-built standalone executables are available for Windows, macOS, and Linux. No Python installation needed.

Download Latest Release

| Platform | File | Notes |
|---|---|---|
| Windows | gdpr-pseudonymizer-2.1.0-windows-setup.exe | Run the installer. Adds Start Menu shortcut. |
| macOS (Apple Silicon) | gdpr-pseudonymizer-2.1.0-macos-arm64.dmg | Open DMG, drag to Applications. |
| macOS (Intel) | gdpr-pseudonymizer-2.1.0-macos-x86_64.dmg | Open DMG, drag to Applications. |
| Linux | gdpr-pseudonymizer-2.1.0-linux.AppImage | chmod +x then run. |

Platform Notes

  • Windows: If SmartScreen shows "Windows protected your PC", click "More info" then "Run anyway". This appears because the executable is not yet code-signed. It is safe to run.
  • macOS: If Gatekeeper blocks the app, right-click the app and select "Open" (instead of double-clicking). This bypasses the unsigned app warning.
  • Linux: Make the AppImage executable first: chmod +x gdpr-pseudonymizer-*.AppImage. If it fails to start, install Qt dependencies: sudo apt-get install libegl1 libxkbcommon0.

Troubleshooting (Standalone)

  • Antivirus false positives (Windows): Windows Defender or Norton may flag PyInstaller-bundled apps. This is a known false positive. Add an exclusion for the install directory if needed.
  • Gatekeeper warnings (macOS): Right-click the app and select "Open" to bypass the warning for unsigned builds.
  • Slow first launch: The first launch may take longer (~10-15s) while the OS caches the application files. Subsequent launches will be faster.
  • Missing system libraries (Linux): Install libegl1 and libxkbcommon0 if the AppImage fails to start: sudo apt-get install -y libegl1 libxkbcommon0.

🎯 Overview

GDPR Pseudonymizer is a privacy-first CLI and GUI tool that combines AI efficiency with human accuracy to pseudonymize French text documents. Available as a command-line tool for developers, a desktop GUI application for non-technical users, and as standalone executables (no Python required). Unlike fully automatic tools or cloud services, we prioritize zero false negatives and legal defensibility through mandatory validation workflows.

Perfect for:

  • 🏛️ Privacy-conscious organizations needing GDPR-compliant AI analysis
  • 🎓 Academic researchers with ethics board requirements
  • ⚖️ Legal/HR teams requiring defensible pseudonymization
  • 🤖 LLM users who want to analyze confidential documents safely

✨ Key Features

🔒 Privacy-First Architecture

  • ✅ 100% local processing - Your data never leaves your machine
  • ✅ No cloud dependencies - Works completely offline after installation
  • ✅ Encrypted mapping tables - AES-256-SIV encryption with PBKDF2 key derivation (210K iterations), passphrase-protected reversible pseudonymization
  • ✅ Zero telemetry - No analytics, crash reporting, or external communication
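
As an illustration of the key-derivation step mentioned above, here is a minimal sketch using only the Python standard library. The 210,000 iteration count comes from the feature list; the hash function (SHA-256), the 16-byte random salt, and the 64-byte output length are assumptions made for this example and are not taken from the project's code.

```python
import hashlib
import os

PBKDF2_ITERATIONS = 210_000  # iteration count stated in the feature list
KEY_LENGTH_BYTES = 64        # assumption: 512 bits of key material


def derive_key(passphrase: str, salt: bytes) -> bytes:
    """Derive key material from a passphrase with PBKDF2-HMAC (SHA-256 assumed)."""
    return hashlib.pbkdf2_hmac(
        "sha256",
        passphrase.encode("utf-8"),
        salt,
        PBKDF2_ITERATIONS,
        dklen=KEY_LENGTH_BYTES,
    )


# Hypothetical usage: the salt would be stored alongside the encrypted database.
salt = os.urandom(16)
key = derive_key("correct horse battery staple", salt)
print(len(key))
```

The same passphrase and salt always produce the same key, which is what lets a later session reopen the mapping table; the passphrase itself never needs to be stored.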

🤝 AI + Human Verification

  • ✅ Hybrid detection - AI pre-detects ~60% of entities (NLP + regex + geography dictionary)
  • ✅ Mandatory validation - You review and confirm all entities (ensures 100% accuracy)
  • ✅ Fast validation UI - Rich CLI interface with keyboard shortcuts, <2 min per document
  • ✅ Smart workflow - Entity-by-type grouping (PERSON → ORG → LOCATION) with context display
  • ✅ Entity variant grouping - Related forms ("Marie Dubois", "Pr. Dubois", "Dubois") merged into one validation item with "Also appears as:" display
  • ✅ Batch actions - Confirm/reject multiple entities efficiently

📊 Batch Processing

  • ✅ Consistent pseudonyms - Same entity = same pseudonym across 10-100+ documents
  • ✅ Compositional matching - "Marie Dubois" → "Leia Organa", "Marie" alone → "Leia"
  • ✅ Smart name handling - Title stripping ("Dr. Marie Dubois" = "Marie Dubois"), compound names ("Jean-Pierre" treated as atomic)
  • ✅ Selective entity processing - --entity-types flag to filter by type (e.g., --entity-types PERSON,LOCATION)
  • ✅ 50%+ time savings vs manual redaction (AI pre-detection + validation)
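
The compositional matching and name handling described above can be sketched roughly as follows. This is an illustrative toy, not the project's implementation: the `ComponentMapper` class, the title list, and the rule that a lone token is treated as a first name are all assumptions made for the example.

```python
TITLES = {"dr", "dr.", "pr", "pr.", "m.", "mme", "mme."}  # assumed title list


class ComponentMapper:
    """Maps each name component to a pseudonym component, reusing earlier choices."""

    def __init__(self, first_names, last_names):
        self._pools = {"first": iter(first_names), "last": iter(last_names)}
        self._mapping = {}  # (kind, original component) -> pseudonym component

    def _map_component(self, kind, component):
        key = (kind, component)
        if key not in self._mapping:
            # Raises StopIteration if the pseudonym pool is exhausted.
            self._mapping[key] = next(self._pools[kind])
        return self._mapping[key]

    def pseudonymize(self, name):
        # Title stripping: "Dr. Marie Dubois" is handled like "Marie Dubois".
        parts = [p for p in name.split() if p.lower() not in TITLES]
        if not parts:
            return name
        # Splitting on whitespace keeps "Jean-Pierre" as one atomic component.
        first = self._map_component("first", parts[0])
        if len(parts) == 1:
            return first
        last = self._map_component("last", " ".join(parts[1:]))
        return f"{first} {last}"


mapper = ComponentMapper(["Leia", "Luke"], ["Organa", "Skywalker"])
for name in ["Marie Dubois", "Marie", "Dr. Marie Dubois", "Jean-Pierre Martin"]:
    print(name, "->", mapper.pseudonymize(name))
```

Because the mapping is stored per component, "Marie" on its own resolves to the same first-name pseudonym that was assigned when "Marie Dubois" was first seen.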

🎭 Themed Pseudonyms

  • ✅ Readable output - Star Wars, LOTR, generic French names, or neutral identifiers (PERSON-001, LIEU-001, ORG-001)
  • ✅ Maintains context - LLM analysis preserves 85% document utility (validated: 4.27/5.0)
  • ✅ Gender-aware - Auto-detects French first name gender from 945-name dictionary and assigns gender-matched pseudonyms (female names → female pseudonyms, male names → male pseudonyms)
  • ✅ Full entity support - PERSON, LOCATION, and ORGANIZATION pseudonyms for all themes
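
To make the counter-based neutral identifiers concrete, here is a small sketch. The prefixes follow the identifiers named in the release notes (PERSON-001, LIEU-001, ORG-001); the `NeutralIdAllocator` class and the entity-type keys are hypothetical names chosen for the example.

```python
from collections import defaultdict

# Assumed mapping from entity type to identifier prefix.
PREFIXES = {"PERSON": "PERSON", "LOCATION": "LIEU", "ORG": "ORG"}


class NeutralIdAllocator:
    """Hands out one zero-padded counter per entity type, stable per entity."""

    def __init__(self):
        self._counters = defaultdict(int)
        self._assigned = {}  # (entity type, text) -> identifier

    def pseudonym_for(self, entity_type, text):
        key = (entity_type, text)
        if key not in self._assigned:
            self._counters[entity_type] += 1
            number = self._counters[entity_type]
            self._assigned[key] = f"{PREFIXES[entity_type]}-{number:03d}"
        return self._assigned[key]


allocator = NeutralIdAllocator()
print(allocator.pseudonym_for("PERSON", "Marie Dubois"))
print(allocator.pseudonym_for("LOCATION", "Paris"))
print(allocator.pseudonym_for("PERSON", "Marie Dubois"))  # same entity, same ID
```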

🖥️ GUI Features (v2.0)

  • ✅ Visual entity validation - Color-coded entities by type (click to accept/reject), undo/redo support
  • ✅ Drag-and-drop document processing - Drop files onto the home screen to start processing
  • ✅ Batch processing with progress dashboard - Real-time progress, per-document validation, pause/cancel controls
  • ✅ Light/dark/high-contrast themes - Persistent theme preference with WCAG AA compliance
  • ✅ Full French UI - Complete French/English interface with live language switching
  • ✅ Keyboard-only operation - Full accessibility with keyboard navigation and screen reader support

🚀 Quick Start

Status: 🎉 v2.1.0 (March 2026) - GUI Polish, Excel/CSV Support & NER Accuracy

Getting Started

For non-technical users (no Python required): Download a standalone executable from the Download section above and run it directly.

For developers (PyPI):

# CLI only
pip install gdpr-pseudonymizer

# CLI + GUI (quotes keep shells such as zsh from expanding the brackets)
pip install "gdpr-pseudonymizer[gui]"

# CLI + Excel/CSV support
pip install "gdpr-pseudonymizer[excel]"

# All optional formats (PDF, DOCX, Excel)
pip install "gdpr-pseudonymizer[formats]"

What v2.1 Delivers

  • 🖥️ Desktop GUI - Visual entity validation with drag-and-drop, batch dashboard, and database management
  • 📦 Standalone executables - Windows .exe, macOS .dmg, Linux AppImage; no Python required
  • ♿ WCAG 2.1 AA accessibility - Keyboard navigation, screen reader, high contrast mode
  • 🌐 French UI - Complete FR/EN interface with live language switching
  • 🤖 AI-assisted detection - Hybrid NLP + regex detects ~60% of entities automatically
  • ✅ Mandatory human verification - You review and confirm all entities (ensures 100% accuracy)
  • 🔒 100% local processing - Your data never leaves your machine
  • 📄 PDF/DOCX support - Process PDF and DOCX files directly (optional extras)
  • 📊 Excel/CSV support - Process .xlsx and .csv files with cell-aware pseudonymization (optional extra: [excel])
  • 🆔 Neutral ID theme - Counter-based identifiers (PERSON-001, LIEU-001) for formal/legal contexts
  • 🎯 NER accuracy - LOCATION false-negative rate reduced from 27% to 13% via regex expansion & POS disambiguation

What v2.1 does NOT deliver:

  • ❌ Fully automatic "set and forget" processing
  • ❌ 85%+ AI accuracy (current: ~60% F1 with hybrid approach)
  • ❌ Optional validation mode (validation is mandatory)

Roadmap

v1.0 (MVP - Q1 2026): AI-assisted CLI with mandatory validation

v1.1 (Q1 2026): GDPR erasure, gender-aware pseudonyms, NER accuracy improvements, PDF/DOCX support, French docs

v2.0 (Q1 2026): Desktop GUI, standalone executables, WCAG AA accessibility, French UI, batch validation, core hardening

v2.1 (Q1 2026) - CURRENT RELEASE: GUI polish, Excel/CSV support, neutral ID theme, NER accuracy improvements, keyboard shortcuts help

v3.0 (2027+): NLP accuracy & automation

  • Fine-tuned French NER model (70-85% F1 target, up from ~60%)
  • Optional --no-validate flag for high-confidence workflows
  • Confidence-based auto-processing (85%+ F1 target)
  • Multi-language support (English, Spanish, German)

⚙️ Installation (Python / PyPI)

See Installation Guide for detailed platform-specific instructions.

Prerequisites

  • Python 3.10, 3.11, or 3.12 (validated in CI/CD; 3.13+ not yet tested)

Install from PyPI (Recommended)

pip install gdpr-pseudonymizer

# Verify installation
gdpr-pseudo --help

Note: The spaCy French model (~571MB) downloads automatically on first use. To pre-download it:

python -m spacy download fr_core_news_lg

Install from Source (Developer)

# Clone repository
git clone https://github.com/LioChanDaYo/RGPDpseudonymizer.git
cd RGPDpseudonymizer

# Install dependencies via Poetry
pip install "poetry>=1.7.0"
poetry install

# Verify installation
poetry run gdpr-pseudo --help

Note: The spaCy French model (~571MB) downloads automatically on first use. To pre-download it:

poetry run python -m spacy download fr_core_news_lg

Quick Test

# Test on sample document
echo "Marie Dubois travaille à Paris pour Acme SA." > test.txt
gdpr-pseudo process test.txt

# Or specify custom output file
gdpr-pseudo process test.txt -o output.txt

Expected output: "Leia Organa travaille à Coruscant pour Rebel Alliance."

Configuration File (Optional)

Generate a config template to customize default settings:

# Generate .gdpr-pseudo.yaml template in current directory
poetry run gdpr-pseudo config --init

# View current effective configuration
poetry run gdpr-pseudo config

Example .gdpr-pseudo.yaml:

database:
  path: mappings.db

pseudonymization:
  theme: star_wars    # neutral, star_wars, lotr, neutral_id
  model: spacy

batch:
  workers: 4          # 1-8 (use 1 for interactive validation)
  output_dir: null

logging:
  level: INFO

Note: Passphrase is never stored in config files (security). Use GDPR_PSEUDO_PASSPHRASE env var or interactive prompt. Minimum 12 characters required (NFR12).
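
The passphrase rule in the note above can be pictured with a short sketch. The environment variable name and the 12-character minimum come from the note; the `resolve_passphrase` function and its parameters are hypothetical and only illustrate the order "environment variable first, interactive prompt otherwise".

```python
import os
from getpass import getpass

MIN_PASSPHRASE_LENGTH = 12  # minimum length stated in the note above


def resolve_passphrase(environ=None, prompt=getpass):
    """Return the passphrase from the environment, else ask interactively."""
    environ = os.environ if environ is None else environ
    passphrase = environ.get("GDPR_PSEUDO_PASSPHRASE") or prompt("Passphrase: ")
    if len(passphrase) < MIN_PASSPHRASE_LENGTH:
        raise ValueError(
            f"Passphrase must be at least {MIN_PASSPHRASE_LENGTH} characters"
        )
    return passphrase


# Non-interactive example with an explicit environment mapping.
print(len(resolve_passphrase({"GDPR_PSEUDO_PASSPHRASE": "a-long-enough-passphrase"})))
```

Passing the environment and the prompt function as parameters keeps the sketch testable without touching the real environment or terminal.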


📖 Documentation

Documentation Site: https://liochandayo.github.io/RGPDpseudonymizer/



🌐 Language Support

The GUI and CLI are available in French and English, with live language switching in the GUI. When no language is set or detected, the GUI defaults to French and the CLI to English.

GUI Language Switching

Select your language in Settings > Appearance > Language. The change takes effect immediately; no restart is required.

CLI Language

# French help (default on French systems)
gdpr-pseudo --lang fr --help

# English help (default on non-French systems)
gdpr-pseudo --lang en --help

# Via environment variable
GDPR_PSEUDO_LANG=fr gdpr-pseudo --help

Language detection priority:

  1. --lang flag (explicit)
  2. GDPR_PSEUDO_LANG environment variable
  3. System locale auto-detection
  4. English (CLI default) / French (GUI default)
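
The priority order above can be sketched as a small resolver. The function name, its parameters, and the choice to skip unsupported values rather than raise an error are assumptions made for this example; only the ordering itself comes from the list.

```python
SUPPORTED_LANGUAGES = ("fr", "en")


def resolve_language(flag=None, env=None, system_locale=None, fallback="en"):
    """Pick a UI language: --lang flag, then env var, then locale, then fallback."""
    for candidate in (flag, env, system_locale):
        if candidate:
            # "fr_FR.UTF-8" and "fr-FR" both reduce to "fr".
            code = candidate.replace("-", "_").split("_")[0].lower()
            if code in SUPPORTED_LANGUAGES:
                return code
    return fallback  # "en" for the CLI; a GUI caller would pass "fr"


print(resolve_language(flag="fr", env="en"))
print(resolve_language(env="en", system_locale="fr_FR.UTF-8"))
print(resolve_language(system_locale="de_DE.UTF-8"))
```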

🔬 Technical Details

NLP Library Selection (Story 1.2 - Completed)

After comprehensive benchmarking on 25 French interview/business documents (1,737 annotated entities):

| Approach | F1 Score | Precision | Recall | Notes |
|---|---|---|---|---|
| spaCy only (fr_core_news_lg) | 29.5% | 27.0% | 32.7% | Story 1.2 baseline |
| Hybrid (spaCy + regex) | 59.97% | 48.17% | 79.45% | Story 5.3 |
| Hybrid + expanded patterns | 31.79% | 19.49% | 85.15% | Story 7.5 (current) |

Accuracy trajectory: spaCy-only baseline → hybrid approach with annotation cleanup, expanded regex patterns, and French geography dictionary doubled F1 score. Story 7.5 added 12 ORG pattern keywords, POS-tag disambiguation for geography matching, and 7 international locations, reducing the LOCATION false-negative rate from 27.42% to 12.90%.
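
For readers less familiar with the metric, F1 is the harmonic mean of precision and recall. The sketch below recomputes it from the precision and recall figures reported in the benchmark table; small differences from the reported F1 values are expected because the inputs are already rounded.

```python
def f1_score(precision: float, recall: float) -> float:
    """Harmonic mean of precision and recall (both given as fractions)."""
    if precision + recall == 0:
        return 0.0
    return 2 * precision * recall / (precision + recall)


# Precision and recall as reported in the benchmark table.
benchmarks = {
    "spaCy only": (0.270, 0.327),
    "Hybrid (spaCy + regex)": (0.4817, 0.7945),
    "Hybrid + expanded patterns": (0.1949, 0.8515),
}

for approach, (precision, recall) in benchmarks.items():
    print(f"{approach}: F1 = {f1_score(precision, recall):.1%}")
```

Because F1 penalizes imbalance, a high recall paired with low precision still yields a modest score.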

Approved Solution:

  • ✅ Hybrid approach (NLP + regex + geography dictionary + POS disambiguation)
  • ✅ Mandatory validation ensures 100% final accuracy
  • 📅 Fine-tuning deferred to v3.0 (70-85% F1 target, requires training data from v1.x/v2.x user validations)

See full analysis: docs/qa/ner-accuracy-report.md | Historical baseline: docs/nlp-benchmark-report.md

Validation Workflow (Story 1.7 - Complete)

The validation UI provides an intuitive keyboard-driven interface for reviewing detected entities:

Features:

  • ✅ Entity-by-type grouping - Review PERSON → ORG → LOCATION in logical order
  • ✅ Context display - See 10 words before/after each entity with highlighting
  • ✅ Confidence scores - Color-coded confidence from spaCy NER (green >80%, yellow 60-80%, red <60%)
  • ✅ Keyboard shortcuts - Single-key actions: [Space] Confirm, [R] Reject, [E] Modify, [A] Add, [C] Change pseudonym
  • ✅ Batch operations - Accept/reject all entities of a type at once (Shift+A/R) with entity count feedback
  • ✅ Context cycling indicator - Dot indicator (● ○ ○ ○ ○) shows current context position; [Press X to cycle] hint improves discoverability
  • ✅ Help overlay - Press [H] for full command reference
  • ✅ Performance - <2 minutes for typical 20-30 entity documents
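
Two of the features above, the confidence color bands and the 10-word context window, are easy to picture in code. The thresholds come from the list; the function names, the word-index interface, and the treatment of the exact 60% and 80% boundaries as yellow are assumptions for this sketch.

```python
def confidence_color(score: float) -> str:
    """Map a confidence score in [0, 1] to the color bands listed above."""
    if score > 0.80:
        return "green"
    if score >= 0.60:
        return "yellow"
    return "red"


def context_window(words, start, end, size=10):
    """Return up to `size` words before and after the entity at words[start:end]."""
    before = words[max(0, start - size):start]
    after = words[end:end + size]
    return before, after


words = "Hier Marie Dubois a rencontré le directeur à Paris".split()
before, after = context_window(words, 1, 3, size=2)
print(confidence_color(0.91), before, after)
```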

Workflow Steps:

  1. Summary screen (entity counts by type)
  2. Review entities by type with context
  3. Flag ambiguous entities for careful review
  4. Final confirmation with summary of changes
  5. Process document with validated entities

Deduplication Feature (Story 1.9): Duplicate entities grouped together - validate once, apply to all occurrences (66% time reduction for large docs)
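
The "validate once, apply to all occurrences" idea can be sketched as grouping detections by text and type, then applying a single decision per group. The tuple layout and function names below are hypothetical and chosen only for illustration.

```python
from collections import defaultdict


def group_occurrences(entities):
    """Group (text, entity_type, offset) detections by text and type."""
    groups = defaultdict(list)
    for text, entity_type, offset in entities:
        groups[(text, entity_type)].append(offset)
    return dict(groups)


def apply_decisions(groups, decisions):
    """Expand one accept/reject decision per group to every occurrence."""
    accepted = []
    for key, offsets in groups.items():
        if decisions.get(key, False):
            accepted.extend((offset, key[0]) for offset in offsets)
    return sorted(accepted)


detected = [
    ("Marie Dubois", "PERSON", 0),
    ("Paris", "LOCATION", 25),
    ("Marie Dubois", "PERSON", 60),
    ("Marie Dubois", "PERSON", 112),
]
groups = group_occurrences(detected)
decisions = {("Marie Dubois", "PERSON"): True, ("Paris", "LOCATION"): False}
accepted = apply_decisions(groups, decisions)
print(len(groups), "review items for", len(detected), "detections")
```

Here four detections collapse into two review items, which is where the time savings on large documents would come from.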

Entity Variant Grouping (Story 4.6): Related entity forms automatically merged into single validation items. "Marie Dubois", "Pr. Dubois", and "Dubois" appear as one item with "Also appears as:" showing variant forms. Prevents Union-Find transitive bridging for ambiguous surnames shared by different people.


Technology Stack

| Component | Technology | Version | Purpose |
|---|---|---|---|
| Runtime | Python | 3.10-3.12 | Validated in CI/CD (3.13+ not yet tested) |
| NLP Library | spaCy | 3.8.0 | French entity detection (fr_core_news_lg) |
| CLI Framework | Typer | 0.9+ | Command-line interface |
| Database | SQLite | 3.35+ | Local mapping table storage with WAL mode |
| Encryption | cryptography (AESSIV) | 44.0+ | AES-256-SIV encryption for sensitive fields (PBKDF2 key derivation, passphrase-protected) |
| ORM | SQLAlchemy | 2.0+ | Database abstraction and session management |
| Desktop GUI | PySide6 | 6.7+ | Desktop application (optional: pip install gdpr-pseudonymizer[gui]) |
| Validation UI | rich | 13.7+ | Interactive CLI entity review |
| Keyboard Input | readchar | 4.2+ | Single-keypress capture for validation UI |
| Testing | pytest | 7.4+ | Unit & integration testing |
| CI/CD | GitHub Actions | N/A | Automated testing (Windows/Mac/Linux) |
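
As a small illustration of the "SQLite with WAL mode" entry, the sketch below enables write-ahead logging with Python's built-in `sqlite3` module. The project itself goes through SQLAlchemy; the table layout and file name here are hypothetical.

```python
import os
import sqlite3
import tempfile


def open_mapping_db(path):
    """Open a SQLite database and switch it to write-ahead logging (WAL)."""
    conn = sqlite3.connect(path)
    # The PRAGMA returns the journal mode that is actually in effect.
    mode = conn.execute("PRAGMA journal_mode=WAL").fetchone()[0]
    return conn, mode


with tempfile.TemporaryDirectory() as tmp:
    conn, mode = open_mapping_db(os.path.join(tmp, "mappings.db"))
    # Hypothetical table, only to show a write going through.
    conn.execute("CREATE TABLE mapping (original TEXT, pseudonym TEXT)")
    conn.commit()
    conn.close()
    print(mode)
```

WAL lets readers proceed while a write is in progress, which matches the concurrent-read scenario listed in the integration tests.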

🤔 Why AI-Assisted Instead of Automatic?

Short answer: Privacy and compliance require human oversight.

Long answer:

  1. GDPR defensibility - Human verification provides legal audit trail
  2. Zero false negatives - AI misses entities, humans catch them (100% coverage)
  3. Current NLP limitations - French models on interview/business docs: 29.5% F1 out-of-box (hybrid approach reaches ~60%)
  4. Better than alternatives:
    • ✅ vs Manual redaction: 50%+ faster (AI pre-detection)
    • ✅ vs Cloud services: 100% local processing (no data leakage)
    • ✅ vs Fully automatic tools: 100% accuracy (human verification)

User Perspective:

"I WANT human review for compliance reasons. The AI saves me time by pre-flagging entities, but I control the final decision." - Compliance Officer


🎯 Use Cases

1. Research Ethics Compliance

Scenario: Academic researcher with 50 interview transcripts needing IRB approval

Without GDPR Pseudonymizer:

  • ❌ Manual redaction: 16-25 hours
  • ❌ Destroys document coherence for analysis
  • ❌ Error-prone (human fatigue)

With GDPR Pseudonymizer:

  • ✅ AI pre-detection: ~30 min processing
  • ✅ Human validation: ~90-100 min review (50 docs × <2 min each)
  • ✅ Total: 2-3 hours (85%+ time savings)
  • ✅ Audit trail for ethics board
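
The time-savings figure can be checked with a few lines of arithmetic. The inputs are the numbers quoted in this scenario, using 2 minutes per document as the upper bound for review time.

```python
DOCS = 50
AI_MINUTES = 30              # AI pre-detection for the whole batch
REVIEW_MINUTES_PER_DOC = 2   # upper bound for human validation per document
MANUAL_HOURS = (16, 25)      # manual redaction range quoted above

total_hours = (AI_MINUTES + DOCS * REVIEW_MINUTES_PER_DOC) / 60
savings = [1 - total_hours / manual for manual in MANUAL_HOURS]

print(f"Assisted workflow: {total_hours:.1f} h")
for manual, saved in zip(MANUAL_HOURS, savings):
    print(f"vs {manual} h manual: {saved:.0%} saved")
```

Even against the fastest manual estimate, the assisted workflow stays above the 85% savings mark.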

2. HR Document Analysis

Scenario: HR team analyzing employee feedback with ChatGPT

Without GDPR Pseudonymizer:

  • ❌ Can't use ChatGPT (GDPR violation - employee names exposed)
  • ❌ Manual analysis only (slow, limited insights)

With GDPR Pseudonymizer:

  • ✅ Pseudonymize locally (employee names → pseudonyms)
  • ✅ Send to ChatGPT safely (no personal data exposed)
  • ✅ Get AI insights while staying GDPR-compliant

3. Legal Document Preparation

Scenario: Law firm preparing case materials for AI legal research

Without GDPR Pseudonymizer:

  • ❌ Cloud pseudonymization service (third-party risk)
  • ❌ Manual redaction (expensive billable hours)

With GDPR Pseudonymizer:

  • ✅ 100% local processing (client confidentiality)
  • ✅ Human-verified accuracy (legal defensibility)
  • ✅ Reversible mappings (can de-pseudonymize if needed)

⚖️ GDPR Compliance

How GDPR Pseudonymizer Supports Compliance

| GDPR Requirement | Implementation |
|---|---|
| Art. 25 - Data Protection by Design | Local processing, no cloud dependencies, encrypted storage |
| Art. 30 - Processing Records | Comprehensive audit logs (Story 2.5): operations table tracks timestamp, files processed, entity count, model version, theme, success/failure, processing time; JSON/CSV export for compliance reporting |
| Art. 32 - Security Measures | AES-256-SIV encryption with PBKDF2 key derivation (210,000 iterations), passphrase-protected storage, column-level encryption for sensitive fields |
| Art. 35 - Privacy Impact Assessment | Transparent methodology, cite-able approach for DPIA documentation |
| Recital 26 - Pseudonymization | Consistent pseudonym mapping, reversibility with passphrase |
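
To give a feel for the Art. 30 audit trail, here is a sketch of an operations record with JSON and CSV export. The field list mirrors the fields named in the table; the `OperationRecord` class, the exact field names, and the export helpers are hypothetical, not the project's schema.

```python
import csv
import io
import json
from dataclasses import asdict, dataclass, fields


@dataclass
class OperationRecord:
    """One audit entry; field names are assumed from the table above."""
    timestamp: str
    files_processed: int
    entity_count: int
    model_version: str
    theme: str
    success: bool
    processing_time_s: float


def to_json(records):
    return json.dumps([asdict(r) for r in records], indent=2)


def to_csv(records):
    buffer = io.StringIO()
    writer = csv.DictWriter(
        buffer, fieldnames=[f.name for f in fields(OperationRecord)]
    )
    writer.writeheader()
    for record in records:
        writer.writerow(asdict(record))
    return buffer.getvalue()


records = [
    OperationRecord("2026-03-17T10:00:00Z", 1, 12, "fr_core_news_lg", "star_wars", True, 6.0)
]
print(to_json(records))
print(to_csv(records))
```

Note that the record holds only operational metadata, so the export itself carries no original names.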

What Pseudonymization Means (Legally)

According to GDPR Article 4(5):

"Pseudonymization means the processing of personal data in such a manner that the personal data can no longer be attributed to a specific data subject without the use of additional information, provided that such additional information is kept separately."

GDPR Pseudonymizer approach:

  • ✅ Personal data replaced: Names, locations, organizations → pseudonyms
  • ✅ Separate storage: Mapping table encrypted with passphrase (separate from documents)
  • ✅ Reversibility: Authorized users can de-pseudonymize with passphrase
  • ⚠️ Note: Pseudonymization reduces risk but does NOT make data anonymous

Recommendation: Consult your Data Protection Officer (DPO) for specific compliance guidance.


🛠️ Development Status

Epics 1-7 Complete - v2.1.0 (March 2026). GUI polish, Excel/CSV support, NER accuracy improvements.

  • ✅ Epic 1: Foundation & NLP Validation (9 stories) - spaCy integration, validation UI, hybrid detection, entity deduplication
  • ✅ Epic 2: Core Pseudonymization Engine (9 stories) - pseudonym libraries, encryption, audit logging, batch processing, GDPR 1:1 mapping
  • ✅ Epic 3: CLI Interface & Batch Processing (7 stories) - 8 CLI commands, progress reporting, config files, parallel batch, UX polish
  • ✅ Epic 4: Launch Readiness (8 stories) - LLM utility validation, cross-platform testing, documentation, NER accuracy suite, performance validation, beta feedback integration, codebase refactoring, launch preparation
  • ✅ Epic 5: Quick Wins & GDPR Compliance (7 stories) - GDPR Article 17 erasure, gender-aware pseudonyms, NER accuracy improvements (F1 29.74% → 59.97%), French documentation translation, PDF/DOCX support, CLI polish & benchmarks, v1.1 release
  • ✅ Epic 6: v2.0 Desktop GUI & Broader Accessibility (9 stories) - PySide6 desktop application, visual entity validation, batch GUI, i18n, WCAG AA, standalone executables
    • ✅ Story 6.1: UX Architecture & GUI Framework Selection
    • ✅ Story 6.2: GUI Application Foundation (main window, theming, home screen, settings, 77 GUI tests)
    • ✅ Story 6.3: Document Processing Workflow (passphrase dialog, processing worker, results screen, 45 new GUI tests)
    • ✅ Story 6.4: Visual Entity Validation Interface (entity editor, entity panel, validation state with undo/redo, 72 new GUI tests)
    • ✅ Story 6.5: Batch Processing & Configuration Management (batch screen, database management, settings enhancements, 40 new tests)
    • ✅ Story 6.6: Internationalization & French UI (dual-track i18n: Qt Linguist + gettext, 267 GUI strings, ~50 CLI strings, live language switching, 53 new tests)
    • ✅ Story 6.7: Accessibility (WCAG 2.1 Level AA) - keyboard navigation, screen reader support, high contrast mode, color-blind safe palette, DPI scaling, 33 accessibility tests
    • ✅ Story 6.7.1: Core Processing Hardening & Security - PII sanitization in error messages, typed exception handling, DRY refactoring, per-document entity type counts (DATA-001 fix), 26 new tests
    • ✅ Story 6.7.2: Database Background Threading - All DB operations on background threads (list, search, delete, export), cancel-and-replace strategy, debounced search, 38 new tests
    • ✅ Story 6.7.3: Batch Validation Workflow - Per-document entity validation in batch mode, Précédent/Suivant navigation, cancel with proper status display, 21 new tests
    • ✅ Story 6.8: Standalone Executables & Distribution - PyInstaller builds, NSIS installer (Windows), DMG (macOS), AppImage (Linux), CI workflow
    • ✅ Story 6.9: v2.0 Release Preparation - Version bump, CHANGELOG, documentation updates, release coordination
  • ✅ Epic 7: v2.1 GUI Polish, Excel/CSV & NER Accuracy (7 stories) - Validate-once-per-entity, keyboard shortcuts help, database path persistence, neutral ID theme, Excel/CSV format support, NER regex expansion & POS disambiguation, integration tests, v2.1 release
    • ✅ Story 7.1: Validation UX Improvements (validate-once-per-entity, hide confirmed toggle)
    • ✅ Story 7.2: GUI Discoverability (F1 shortcuts dialog, settings sync, database path persistence)
    • ✅ Story 7.3: Neutral ID Pseudonym Theme (counter-based identifiers)
    • ✅ Story 7.4: Excel & CSV Format Support (tabular document pipeline)
    • ✅ Story 7.5: NER Accuracy Regex Expansion & POS Disambiguation
    • ✅ Story 7.6: Quality & Compatibility Integration Tests
    • ✅ Story 7.7: v2.1 Release Preparation
  • Total: 60 stories, 1670+ tests, 86%+ coverage, all quality gates green

🤝 Contributing

We welcome contributions! See CONTRIBUTING.md for details on:

  • Bug reports and feature requests
  • Development setup and code quality requirements
  • PR process and commit message format

Please read our Code of Conduct before participating.


📧 Contact & Support

Project Lead: Lionel Deveaux - @LioChanDaYo



📜 License

This project is licensed under the MIT License.


🙏 Acknowledgments

Built with:

  • spaCy - Industrial-strength NLP library
  • Typer - Modern CLI framework
  • rich - Beautiful CLI formatting

Inspired by:

  • GDPR privacy-by-design principles
  • Academic research ethics requirements
  • Real-world need for safe AI document analysis

Methodology:

  • Developed using BMAD-METHOD™ framework
  • Interactive elicitation and multi-perspective validation

⚠️ Disclaimer

GDPR Pseudonymizer is a tool to assist with GDPR compliance. It does NOT provide legal advice.

Important notes:

  • ⚠️ Pseudonymization reduces risk but is NOT anonymization
  • ⚠️ You remain the data controller under GDPR
  • ⚠️ Consult your DPO or legal counsel for compliance guidance
  • ⚠️ Human validation is MANDATORY - do not skip review steps
  • ⚠️ Test thoroughly before production use

Current limitations:

  • AI detection: ~60% F1 baseline (not 85%+)
  • Validation required for ALL documents (not optional)
  • French documents only (English, Spanish, etc. in future versions)
  • Text-based formats: .txt, .md, .pdf, .docx, .xlsx, .csv (PDF/DOCX/Excel require optional extras: pip install gdpr-pseudonymizer[formats])
  • Excel formulas are read as cached display values; formula strings are not preserved in pseudonymized output
  • Binary .xls format (Excel 97-2003) is not supported; save as .xlsx first

🧪 Testing

Running Tests

The project includes comprehensive unit and integration tests covering the validation workflow, NLP detection, and core functionality.

Note for Windows users: Due to known spaCy access violations on Windows (spaCy issue #12659), Windows CI runs non-spaCy tests only. Full test suite runs on Linux/macOS.

Run all tests:

poetry run pytest -v

Run only unit tests:

poetry run pytest tests/unit/ -v

Run only integration tests:

poetry run pytest tests/integration/ -v

Run accuracy validation tests (requires spaCy model):

poetry run pytest tests/accuracy/ -v -m accuracy -s

Run performance & stability tests (requires spaCy model):

# All performance tests (stability, memory, startup, stress)
poetry run pytest tests/performance/ -v -s -p no:benchmark --timeout=600

# Benchmark tests only (pytest-benchmark)
poetry run pytest tests/performance/ --benchmark-only -v -s

Run with coverage report:

poetry run pytest --cov=gdpr_pseudonymizer --cov-report=term-missing --cov-report=html

Run validation workflow integration tests specifically:

poetry run pytest tests/integration/test_validation_workflow_integration.py -v

Run quality checks:

# Code formatting check
poetry run black --check gdpr_pseudonymizer tests

# Format code automatically
poetry run black gdpr_pseudonymizer tests

# Linting check
poetry run ruff check gdpr_pseudonymizer tests

# Type checking
poetry run mypy gdpr_pseudonymizer

Run Windows-safe tests only (excludes spaCy-dependent tests):

# Run non-spaCy unit tests (follows Windows CI pattern)
poetry run pytest tests/unit/test_benchmark_nlp.py tests/unit/test_config_manager.py tests/unit/test_data_models.py tests/unit/test_file_handler.py tests/unit/test_logger.py tests/unit/test_naive_processor.py tests/unit/test_name_dictionary.py tests/unit/test_process_command.py tests/unit/test_project_config.py tests/unit/test_regex_matcher.py tests/unit/test_validation_models.py tests/unit/test_validation_stub.py -v

# Run validation workflow integration tests (Windows-safe)
poetry run pytest tests/integration/test_validation_workflow_integration.py -v

Test Coverage

  • Unit tests: 1030+ tests covering validation models, UI components, encryption, database operations, audit logging, progress tracking, gender detection, context cycling indicator, i18n (GUI + CLI), and core logic
  • Integration tests: 90 tests for end-to-end workflows including validation (Story 2.0.1), encrypted database operations (Story 2.4), compositional logic, and hybrid detection
  • Accuracy tests: 22 tests validating NER accuracy against 25-document ground-truth corpus (Story 4.4)
  • Performance tests: 19 tests validating all NFR targets: single-document benchmarks (NFR1), entity-detection benchmarks, batch performance (NFR2), memory profiling (NFR4), startup time (NFR5), stability/error rate (NFR6), stress testing (Story 4.5)
  • Current coverage: 86%+ across all modules (100% for progress module, 91.41% for AuditRepository)
  • Total tests: 1670+ tests
  • CI/CD: Tests run on Python 3.10-3.12 across Windows, macOS, and Linux
  • Quality gates: All pass (Black, Ruff, mypy, pytest)

Key Integration Test Scenarios

The integration test suite covers:

Validation Workflow (19 tests):

  • ✅ Full workflow: entity detection → summary → review → confirmation
  • ✅ User actions: confirm (Space), reject (R), modify (E), add entity (A), change pseudonym (C), context cycling (X)
  • ✅ State transitions: PENDING → CONFIRMED/REJECTED/MODIFIED
  • ✅ Entity deduplication with grouped review
  • ✅ Edge cases: empty documents, large documents (320+ entities), Ctrl+C interruption, invalid input
  • ✅ Batch operations: Accept All Type (Shift+A), Reject All Type (Shift+R) with confirmation prompts
  • ✅ Mock user input: Full simulation of keyboard interactions and prompts
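
The state transitions and single-key actions exercised by these tests can be pictured as a tiny state machine. The states and key bindings come from the list above; the `EntityState` enum, the `apply_key` helper, and the rule that already-decided entities ignore further keys are assumptions for this sketch.

```python
from enum import Enum


class EntityState(Enum):
    PENDING = "pending"
    CONFIRMED = "confirmed"
    REJECTED = "rejected"
    MODIFIED = "modified"


# Single-key actions from the validation UI: Space, R, E.
KEY_ACTIONS = {
    " ": EntityState.CONFIRMED,
    "r": EntityState.REJECTED,
    "e": EntityState.MODIFIED,
}


def apply_key(state: EntityState, key: str) -> EntityState:
    """Return the next state for a key press; unknown keys leave it unchanged."""
    if state is not EntityState.PENDING:
        return state  # assumption: decided entities are not changed here
    return KEY_ACTIONS.get(key.lower(), state)


print(apply_key(EntityState.PENDING, " "))
print(apply_key(EntityState.PENDING, "R"))
print(apply_key(EntityState.PENDING, "?"))
```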

Encrypted Database (9 tests):

  • ✅ End-to-end workflow: init → open → save → query → close
  • ✅ Cross-session consistency: Same passphrase retrieves same data
  • ✅ Idempotency: Multiple queries return same results
  • ✅ Encrypted data at rest: Sensitive fields stored encrypted in SQLite
  • ✅ Compositional logic integration: Encrypted component queries
  • ✅ Repository integration: All repositories (mapping, audit, metadata) work with encrypted session
  • ✅ Concurrent reads: WAL mode enables multiple readers
  • ✅ Database indexes: Query performance optimization verified
  • ✅ Batch save rollback: Transaction integrity on errors

📊 Project Metrics (As of 2026-03-17)

| Metric | Value | Status |
|---|---|---|
| Development Progress | v2.1.0 | ✅ Epics 1-7 complete |
| Stories Complete | 60 (Epics 1-7) | ✅ All epics complete |
| LLM Utility (NFR10) | 4.27/5.0 (85.4%) | ✅ PASSED (threshold: 80%) |
| Installation Success (NFR3) | 87.5% (7/8 platforms) | ✅ PASSED (threshold: 85%) |
| First Pseudonymization (NFR14) | 100% within 30 min | ✅ PASSED (threshold: 80%) |
| Critical Bugs Found | 1 (Story 2.8) | ✅ RESOLVED - Epic 3 Unblocked |
| Test Corpus Size | 25 docs, 1,737 entities | ✅ Complete (post-cleanup) |
| NLP Accuracy (Baseline) | 29.5% F1 (spaCy only) | ✅ Measured (Story 1.2) |
| Hybrid Accuracy (NLP+Regex) | 59.97% F1 (+30.23pp vs baseline) | ✅ Story 5.3 Complete |
| Final Accuracy (AI+Human) | 100% (validated) | 🎯 By Design |
| Pseudonym Libraries | 3 themes (2,426 names + 240 locations + 588 orgs) | ✅ Stories 2.1, 3.0, 4.6 Complete |
| Compositional Matching | Operational (component reuse + title stripping + compound names) | ✅ Stories 2.2, 2.3 Complete |
| Batch Processing | Architecture validated (multiprocessing.Pool, 1.17x-2.5x speedup) | ✅ Story 2.7 Complete |
| Encrypted Storage | AES-256-SIV with passphrase protection (PBKDF2 210K iterations) | ✅ Story 2.4 Complete |
| Audit Logging | GDPR Article 30 compliance (operations table + JSON/CSV export) | ✅ Story 2.5 Complete |
| Validation UI | Operational with deduplication | ✅ Stories 1.7, 1.9 Complete |
| Validation Time | <2 min (20-30 entities), <5 min (100 entities) | ✅ Targets Met |
| Single-Doc Performance (NFR1) | ~6s mean for 3.5K words | ✅ PASSED (<30s threshold, 80% headroom) |
| Batch Performance (NFR2) | ~5 min for 50 docs | ✅ PASSED (<30min threshold, 83% headroom) |
| Memory Usage (NFR4) | ~1 GB Python-tracked peak | ✅ PASSED (<8GB threshold) |
| CLI Startup (NFR5) | 0.56s (help), 6.0s (cold start w/ model) | ✅ PASSED (<5s for CLI startup) |
| Error Rate (NFR6) | ~0% unexpected errors | ✅ PASSED (<10% threshold) |
| Test Coverage | 1670+ tests (incl. 393 GUI), 86%+ coverage | ✅ All Quality Checks Pass |
| Quality Gates | Ruff, mypy, pytest | ✅ All Pass (0 issues) |
| GUI/CLI Languages | French (default), English | 🌐 Live switching (Story 6.6) |
| Supported Document Languages | French | 🇫🇷 v1.0 only |
| Supported Formats | .txt, .md, .pdf, .docx, .xlsx, .csv | PDF/DOCX/Excel via optional extras |



Last Updated: 2026-03-17 (v2.1.0 - GUI polish, Excel/CSV support, neutral ID theme, NER accuracy improvements, 1670+ tests)

