Tool for anonymizing schema data in SQL queries.

Project description

SQL Query Anonymizer

A Python tool for anonymizing SQL queries while preserving their structure and maintaining the ability to de-anonymize them later. Perfect for query optimization workflows, security testing, or data sharing scenarios where you need to hide sensitive identifiers.

✨ Features

🔒 Anonymization: Replace table names, column names, and literals with generic placeholders
🔓 De-anonymization: Restore original identifiers from anonymized queries
💾 Persistent Mappings: Pickle-based storage maintains mappings across sessions
🎯 Structure Preservation: SQL syntax and query structure remain intact
📊 Table Alias Detection: Quantifies table aliases in SELECT statements
🔄 Roundtrip Guarantee: Perfect roundtrip anonymization ↔ de-anonymization
🖥️ CLI & Python API: Use via command-line or import in Python code
📁 File Processing: Process SQL files in batch
🧪 Comprehensive Tests: 73 passing tests with pytest

📦 Installation

# Clone the repository
git clone https://github.com/ncarsner/sql-query-anonymizer.git
cd sql-query-anonymizer

# Install dependencies (using uv)
uv sync

# Or using pip
pip install -e .

# After installation, the sql-anonymizer command will be available
sql-anonymizer --help

🚀 Quick Start

Command Line Interface

# Anonymize a SQL query
sql-anonymizer anonymize "SELECT name, email FROM users WHERE age > 25"
# Output: SELECT identifier_1 , identifier_2 FROM table_1 WHERE identifier_3 > literal_1

# De-anonymize back to original
sql-anonymizer deanonymize "SELECT identifier_1 , identifier_2 FROM table_1 WHERE identifier_3 > literal_1"
# Output: SELECT name , email FROM users WHERE age > 25

# Show current mappings
sql-anonymizer show-mappings

# Process SQL files
sql-anonymizer anonymize -f input.sql -o anonymized.sql

Python API

from src.sql_query_anonymizer.utils import Anonymizer

# Create an anonymizer instance with default mapping file
anonymizer = Anonymizer()

# Anonymize a query
original = "SELECT customer_id, name FROM customers WHERE age > 30"
anonymized = anonymizer.anonymize_query(original)
print(anonymized)  # SELECT identifier_1 , identifier_2 FROM table_1 WHERE identifier_3 > literal_1

# Save mappings to persist for later
anonymizer.save()

# Later session - create new anonymizer and load mappings
anonymizer2 = Anonymizer()
anonymizer2.load()

# De-anonymize back to original
restored = anonymizer2.de_anonymize_query(anonymized)
print(restored)  # SELECT customer_id , name FROM customers WHERE age > 30

# Or use context manager for automatic save/load
with Anonymizer() as anon:
    anonymized = anon.anonymize_query(original)
    # Mappings automatically saved on exit

📁 Project Structure

sql-query-anonymizer/
├── .github/
│   └── workflows/
│       ├── publish.yml           # Automated PyPI publishing
│       └── tests.yml             # CI/CD testing
├── src/
│   └── sql_query_anonymizer/
│       ├── __init__.py
│       ├── anonymize.py          # Anonymization preprocessing
│       ├── cli.py                # Command-line interface
│       ├── constants.py          # SQL keywords and configuration
│       ├── helper_utilities.py   # File I/O utilities
│       ├── tokenize.py           # SQL tokenization engine
│       └── utils.py              # Core Anonymizer class
├── tests/
│   ├── __init__.py
│   ├── test_anonymize.py         # Anonymization tests
│   ├── test_cli.py               # CLI tests
│   └── test_utils.py             # Core utilities tests
├── data/
│   ├── _raw/                     # Sample raw SQL files
│   ├── _anonymized/              # Anonymized outputs
│   ├── _optimized/               # Optimized queries
│   └── _deanonymized/            # De-anonymized outputs
├── .pre-commit-config.yaml       # Pre-commit hooks configuration
├── CLI_USAGE.md                  # CLI quick reference
├── pyproject.toml                # Project configuration
├── uv.lock                       # Dependency lock file
├── README.md
└── LICENSE

🎯 Use Cases

Query Optimization: Anonymize queries before sending to external optimization services
Security Testing: Share query structures without exposing sensitive database schema
Documentation: Create generic examples from real queries
Training: Generate training datasets with anonymized queries
Debugging: Share problematic queries with support teams without revealing confidential data

🔧 CLI Commands

Command	Description
`anonymize`	Anonymize SQL query from string or file
`deanonymize`	De-anonymize query back to original form
`show-mappings`	Display current mapping statistics
`clear-mappings`	Clear all stored mappings
`export-mappings`	Export mappings to another pickle file
`import-mappings`	Import mappings from a pickle file
`interactive`	Start interactive mode

Examples:

# Anonymize a query
sql-anonymizer anonymize "SELECT * FROM users"

# Show current mappings
sql-anonymizer show-mappings

# Export mappings for backup
sql-anonymizer export-mappings backup.pkl

# Use custom mapping file
sql-anonymizer -m custom.pkl anonymize "SELECT * FROM products"

💾 Persistent Storage

Mappings are stored as pickle files in ~/.sql_anonymizer/mappings.pkl and persist across sessions. This ensures:

Consistent anonymization: Same identifiers always map to the same placeholders
Session persistence: Mappings survive between CLI sessions
Backup and restore: Export/import mappings as needed
Context manager support: Automatic save/load using Python's with statement

The pickle format stores mappings, reverse mappings, and counters efficiently.

🧪 Testing

# Run all tests
uv run pytest

# Run with coverage
uv run pytest --cov=src/sql_query_anonymizer

# Run specific test file
uv run pytest tests/test_utils.py

# Run with verbose output
uv run pytest -v

📋 Requirements

Python 3.13+
pytest (for testing)
pytest-cov (for coverage reports)

🔍 How It Works

Tokenization: SQL query is parsed into tokens (keywords, identifiers, literals, operators)
Classification: Tokens are classified by type (table names, column names, literals, etc.)
Mapping: Each unique identifier is mapped to a generic placeholder
Replacement: Original identifiers are replaced with placeholders
Persistence: Mappings are saved for future de-anonymization
De-anonymization: Placeholders are mapped back to original identifiers

🎨 Example Transformations

-- Original
SELECT c.customer_id, c.first_name, o.order_date, SUM(o.amount)
FROM customers c
JOIN orders o ON c.customer_id = o.customer_id
WHERE c.status = 'active' AND o.date >= '2024-01-01'
GROUP BY c.customer_id, c.first_name, o.order_date;

-- Anonymized
SELECT c.identifier_1 , c.identifier_2 , o.identifier_3 , SUM ( o.identifier_4 )
FROM table_1 c
JOIN table_2 o ON c.identifier_1 = o.identifier_1
WHERE c.identifier_5 = literal_1 AND o.identifier_6 >= literal_2
GROUP BY c.identifier_1 , c.identifier_2 , o.identifier_3 ;

🛠️ Configuration

Custom mapping file location:

sql-anonymizer -m custom_mappings.pkl anonymize "SELECT * FROM users"

Disable auto-save:

sql-anonymizer --no-auto-save anonymize "SELECT * FROM users"

Python API with custom location:

# Use custom mapping file
anonymizer = Anonymizer(mapping_file="project_mappings.pkl")
anonymizer.load()  # Load existing mappings

# Or use absolute path
anonymizer = Anonymizer(mapping_file="/path/to/mappings.pkl")

🔍 Feature Status

✅ Completed

Core anonymization/de-anonymization engine
Pickle-based persistent mapping storage
Context manager support (with statement)
Command-line interface with multiple commands
File processing capabilities
Table alias detection and quantification
Comprehensive test suite (73 passing tests)
Export/import functionality

🚧 Future Enhancements

Support for additional SQL dialects (PostgreSQL, MySQL, etc.)
GUI interface
API server mode
Pattern-based anonymization rules
Integration with query optimization tools
Publish to PyPI

🧾 License

MIT License - See LICENSE for details

✍️ Author

Developed by Nicholas Carsner

Contributions, issues, and feature requests are welcome!

Project details

Release history Release notifications | RSS feed

This version

0.1.2

Dec 16, 2025

0.1.1

Dec 15, 2025

0.1.0

Dec 15, 2025

Download files

Download the file for your platform. If you're not sure which to choose, learn more about installing packages.

Source Distribution

sql_query_anonymizer-0.1.2.tar.gz (14.6 kB view details)

Uploaded Dec 16, 2025 Source

Built Distribution

If you're not sure about the file name format, learn more about wheel file names.

The dropdown lists show the available interpreters, ABIs, and platforms. Enable javascript to be able to filter the list of wheel files.

sql_query_anonymizer-0.1.2-py3-none-any.whl (17.1 kB view details)

Uploaded Dec 16, 2025 Python 3

File details

Details for the file sql_query_anonymizer-0.1.2.tar.gz.

File metadata

Download URL: sql_query_anonymizer-0.1.2.tar.gz
Upload date: Dec 16, 2025
Size: 14.6 kB
Tags: Source
Uploaded using Trusted Publishing? No
Uploaded via: uv/0.9.18 {"installer":{"name":"uv","version":"0.9.18","subcommand":["publish"]},"python":null,"implementation":{"name":null,"version":null},"distro":{"name":"Ubuntu","version":"24.04","id":"noble","libc":null},"system":{"name":null,"release":null},"cpu":null,"openssl_version":null,"setuptools_version":null,"rustc_version":null,"ci":true}

File hashes

Hashes for sql_query_anonymizer-0.1.2.tar.gz
Algorithm	Hash digest
SHA256	`f26e77aee4bff7b43a87352dc822965125d5c0dbb2af25e363db9c25979c36a0`
MD5	`68b1ee671d738aa585a478c5cad80597`
BLAKE2b-256	`91bc82df36ffd1524ecde777919ee5a9c9e1a82352baa5b6f3d5296885c80671`

See more details on using hashes here.

File details

Details for the file sql_query_anonymizer-0.1.2-py3-none-any.whl.

File metadata

Download URL: sql_query_anonymizer-0.1.2-py3-none-any.whl
Upload date: Dec 16, 2025
Size: 17.1 kB
Tags: Python 3
Uploaded using Trusted Publishing? No
Uploaded via: uv/0.9.18 {"installer":{"name":"uv","version":"0.9.18","subcommand":["publish"]},"python":null,"implementation":{"name":null,"version":null},"distro":{"name":"Ubuntu","version":"24.04","id":"noble","libc":null},"system":{"name":null,"release":null},"cpu":null,"openssl_version":null,"setuptools_version":null,"rustc_version":null,"ci":true}

File hashes

Hashes for sql_query_anonymizer-0.1.2-py3-none-any.whl
Algorithm	Hash digest
SHA256	`0eac17a08bf8529e33e59e0e2470bf571109215a555722d8f12a1e6215bf46f2`
MD5	`e3a3a39bba9a6ffebcc88de559b37e4c`
BLAKE2b-256	`b96dc249e77832bbc903906ed93a152a0e9b78d45259b6427e687a58c1339d42`

See more details on using hashes here.

sql-query-anonymizer 0.1.2

Navigation

Verified details

Maintainers

Meta

Unverified details

Meta

Project description

SQL Query Anonymizer

✨ Features

📦 Installation

🚀 Quick Start

Command Line Interface

Python API

📁 Project Structure

🎯 Use Cases

🔧 CLI Commands

💾 Persistent Storage

🧪 Testing

📋 Requirements

🔍 How It Works

🎨 Example Transformations

🛠️ Configuration

🔍 Feature Status

✅ Completed

🚧 Future Enhancements

🧾 License

✍️ Author

Project details

Verified details

Maintainers

Meta

Unverified details

Meta

Release history Release notifications | RSS feed

Download files

Source Distribution

Built Distribution

File details

File metadata

File hashes

File details

File metadata

File hashes