Skip to main content

Tool for anonymizing schema data in SQL queries.

Project description

SQL Query Anonymizer

A Python tool for anonymizing SQL queries while preserving their structure and maintaining the ability to de-anonymize them later. Perfect for query optimization workflows, security testing, or data sharing scenarios where you need to hide sensitive identifiers.

โœจ Features

  • ๐Ÿ”’ Anonymization: Replace table names, column names, and literals with generic placeholders
  • ๐Ÿ”“ De-anonymization: Restore original identifiers from anonymized queries
  • ๐Ÿ’พ Persistent Mappings: Pickle-based storage maintains mappings across sessions
  • ๐ŸŽฏ Structure Preservation: SQL syntax and query structure remain intact
  • ๐Ÿ“Š Table Alias Detection: Quantifies table aliases in SELECT statements
  • ๐Ÿ”„ Roundtrip Guarantee: Perfect roundtrip anonymization โ†” de-anonymization
  • ๐Ÿ–ฅ๏ธ CLI & Python API: Use via command-line or import in Python code
  • ๐Ÿ“ File Processing: Process SQL files in batch
  • ๐Ÿงช Comprehensive Tests: 73 passing tests with pytest

๐Ÿ“ฆ Installation

# Clone the repository
git clone https://github.com/ncarsner/sql-query-anonymizer.git
cd sql-query-anonymizer

# Install dependencies (using uv)
uv sync

# Or using pip
pip install -e .

# After installation, the sql-anonymizer command will be available
sql-anonymizer --help

๐Ÿš€ Quick Start

Command Line Interface

# Anonymize a SQL query
sql-anonymizer anonymize "SELECT name, email FROM users WHERE age > 25"
# Output: SELECT identifier_1 , identifier_2 FROM table_1 WHERE identifier_3 > literal_1

# De-anonymize back to original
sql-anonymizer deanonymize "SELECT identifier_1 , identifier_2 FROM table_1 WHERE identifier_3 > literal_1"
# Output: SELECT name , email FROM users WHERE age > 25

# Show current mappings
sql-anonymizer show-mappings

# Process SQL files
sql-anonymizer anonymize -f input.sql -o anonymized.sql

Python API

from src.sql_query_anonymizer.utils import Anonymizer

# Create an anonymizer instance with default mapping file
anonymizer = Anonymizer()

# Anonymize a query
original = "SELECT customer_id, name FROM customers WHERE age > 30"
anonymized = anonymizer.anonymize_query(original)
print(anonymized)  # SELECT identifier_1 , identifier_2 FROM table_1 WHERE identifier_3 > literal_1

# Save mappings to persist for later
anonymizer.save()

# Later session - create new anonymizer and load mappings
anonymizer2 = Anonymizer()
anonymizer2.load()

# De-anonymize back to original
restored = anonymizer2.de_anonymize_query(anonymized)
print(restored)  # SELECT customer_id , name FROM customers WHERE age > 30

# Or use context manager for automatic save/load
with Anonymizer() as anon:
    anonymized = anon.anonymize_query(original)
    # Mappings automatically saved on exit

๐Ÿ“ Project Structure

sql-query-anonymizer/
โ”œโ”€โ”€ src/
โ”‚   โ””โ”€โ”€ sql_query_anonymizer/
โ”‚       โ”œโ”€โ”€ __init__.py
โ”‚       โ”œโ”€โ”€ anonymize.py          # Anonymization preprocessing
โ”‚       โ”œโ”€โ”€ cli.py                # Command-line interface
โ”‚       โ”œโ”€โ”€ constants.py          # SQL keywords and configuration
โ”‚       โ”œโ”€โ”€ helper_utilities.py   # File I/O utilities
โ”‚       โ”œโ”€โ”€ tokenize.py           # SQL tokenization engine
โ”‚       โ””โ”€โ”€ utils.py              # Core Anonymizer class
โ”œโ”€โ”€ tests/
โ”‚   โ”œโ”€โ”€ test_anonymize.py         # Anonymization tests
โ”‚   โ”œโ”€โ”€ test_cli.py               # CLI tests
โ”‚   โ”œโ”€โ”€ test_tokenize.py          # Tokenization tests
โ”‚   โ””โ”€โ”€ test_utils.py             # Core utilities tests
โ”œโ”€โ”€ data/
โ”‚   โ”œโ”€โ”€ 0_raw/                    # Sample raw SQL files
โ”‚   โ”œโ”€โ”€ 1_anonymized/             # Anonymized outputs
โ”‚   โ”œโ”€โ”€ 2_optimized/              # Optimized queries
โ”‚   โ””โ”€โ”€ 3_deanonymized/           # De-anonymized outputs
โ”œโ”€โ”€ pyproject.toml                # Project configuration
โ”œโ”€โ”€ README.md
โ””โ”€โ”€ LICENSE

๐ŸŽฏ Use Cases

  1. Query Optimization: Anonymize queries before sending to external optimization services
  2. Security Testing: Share query structures without exposing sensitive database schema
  3. Documentation: Create generic examples from real queries
  4. Training: Generate training datasets with anonymized queries
  5. Debugging: Share problematic queries with support teams without revealing confidential data

๐Ÿ”ง CLI Commands

Command Description
anonymize Anonymize SQL query from string or file
deanonymize De-anonymize query back to original form
show-mappings Display current mapping statistics
clear-mappings Clear all stored mappings
export-mappings Export mappings to another pickle file
import-mappings Import mappings from a pickle file
interactive Start interactive mode

Examples:

# Anonymize a query
sql-anonymizer anonymize "SELECT * FROM users"

# Show current mappings
sql-anonymizer show-mappings

# Export mappings for backup
sql-anonymizer export-mappings backup.pkl

# Use custom mapping file
sql-anonymizer -m custom.pkl anonymize "SELECT * FROM products"

๐Ÿ’พ Persistent Storage

Mappings are stored as pickle files in ~/.sql_anonymizer/mappings.pkl and persist across sessions. This ensures:

  • Consistent anonymization: Same identifiers always map to the same placeholders
  • Session persistence: Mappings survive between CLI sessions
  • Backup and restore: Export/import mappings as needed
  • Context manager support: Automatic save/load using Python's with statement

The pickle format stores mappings, reverse mappings, and counters efficiently.

๐Ÿงช Testing

# Run all tests
uv run pytest

# Run with coverage
uv run pytest --cov=src/sql_query_anonymizer

# Run specific test file
uv run pytest tests/test_utils.py

# Run with verbose output
uv run pytest -v

๐Ÿ“‹ Requirements

  • Python 3.13+
  • pytest (for testing)
  • pytest-cov (for coverage reports)

๐Ÿ” How It Works

  1. Tokenization: SQL query is parsed into tokens (keywords, identifiers, literals, operators)
  2. Classification: Tokens are classified by type (table names, column names, literals, etc.)
  3. Mapping: Each unique identifier is mapped to a generic placeholder
  4. Replacement: Original identifiers are replaced with placeholders
  5. Persistence: Mappings are saved for future de-anonymization
  6. De-anonymization: Placeholders are mapped back to original identifiers

๐ŸŽจ Example Transformations

-- Original
SELECT c.customer_id, c.first_name, o.order_date, SUM(o.amount)
FROM customers c
JOIN orders o ON c.customer_id = o.customer_id
WHERE c.status = 'active' AND o.date >= '2024-01-01'
GROUP BY c.customer_id, c.first_name, o.order_date;

-- Anonymized
SELECT c.identifier_1 , c.identifier_2 , o.identifier_3 , SUM ( o.identifier_4 )
FROM table_1 c
JOIN table_2 o ON c.identifier_1 = o.identifier_1
WHERE c.identifier_5 = literal_1 AND o.identifier_6 >= literal_2
GROUP BY c.identifier_1 , c.identifier_2 , o.identifier_3 ;

๐Ÿ› ๏ธ Configuration

Custom mapping file location:

sql-anonymizer -m custom_mappings.pkl anonymize "SELECT * FROM users"

Disable auto-save:

sql-anonymizer --no-auto-save anonymize "SELECT * FROM users"

Python API with custom location:

# Use custom mapping file
anonymizer = Anonymizer(mapping_file="project_mappings.pkl")
anonymizer.load()  # Load existing mappings

# Or use absolute path
anonymizer = Anonymizer(mapping_file="/path/to/mappings.pkl")

๐Ÿ” Feature Status

โœ… Completed

  • Core anonymization/de-anonymization engine
  • Pickle-based persistent mapping storage
  • Context manager support (with statement)
  • Command-line interface with multiple commands
  • File processing capabilities
  • Table alias detection and quantification
  • Comprehensive test suite (73 passing tests)
  • Export/import functionality

๐Ÿšง Future Enhancements

  • Support for additional SQL dialects (PostgreSQL, MySQL, etc.)
  • GUI interface
  • API server mode
  • Pattern-based anonymization rules
  • Integration with query optimization tools
  • Publish to PyPI

๐Ÿงพ License

MIT License - See LICENSE for details

โœ๏ธ Author

Developed by Nicholas Carsner

Contributions, issues, and feature requests are welcome!

Project details


Download files

Download the file for your platform. If you're not sure which to choose, learn more about installing packages.

Source Distribution

sql_query_anonymizer-0.1.1.tar.gz (14.5 kB view details)

Uploaded Source

Built Distribution

If you're not sure about the file name format, learn more about wheel file names.

sql_query_anonymizer-0.1.1-py3-none-any.whl (17.0 kB view details)

Uploaded Python 3

File details

Details for the file sql_query_anonymizer-0.1.1.tar.gz.

File metadata

  • Download URL: sql_query_anonymizer-0.1.1.tar.gz
  • Upload date:
  • Size: 14.5 kB
  • Tags: Source
  • Uploaded using Trusted Publishing? No
  • Uploaded via: uv/0.9.17 {"installer":{"name":"uv","version":"0.9.17","subcommand":["publish"]},"python":null,"implementation":{"name":null,"version":null},"distro":{"name":"Ubuntu","version":"24.04","id":"noble","libc":null},"system":{"name":null,"release":null},"cpu":null,"openssl_version":null,"setuptools_version":null,"rustc_version":null,"ci":true}

File hashes

Hashes for sql_query_anonymizer-0.1.1.tar.gz
Algorithm Hash digest
SHA256 d0c85a2631d63542e91ba76df4b6c48aff80caf740f9007d06954f6cd06d21fe
MD5 aa6811c66cba0f8ff44fe1368d25ad3a
BLAKE2b-256 b7b3d800aaf2d144c66499054fcd3338cda7e4fc9636b68d390ba9b12525c774

See more details on using hashes here.

File details

Details for the file sql_query_anonymizer-0.1.1-py3-none-any.whl.

File metadata

  • Download URL: sql_query_anonymizer-0.1.1-py3-none-any.whl
  • Upload date:
  • Size: 17.0 kB
  • Tags: Python 3
  • Uploaded using Trusted Publishing? No
  • Uploaded via: uv/0.9.17 {"installer":{"name":"uv","version":"0.9.17","subcommand":["publish"]},"python":null,"implementation":{"name":null,"version":null},"distro":{"name":"Ubuntu","version":"24.04","id":"noble","libc":null},"system":{"name":null,"release":null},"cpu":null,"openssl_version":null,"setuptools_version":null,"rustc_version":null,"ci":true}

File hashes

Hashes for sql_query_anonymizer-0.1.1-py3-none-any.whl
Algorithm Hash digest
SHA256 ab596d8b28128477e94622ac91e7279e7e45ed83fdadde34caf5803bcacb5c3e
MD5 a4477125532c01d366d7034fa1fb96ee
BLAKE2b-256 08a58ea4369cb1d2af3fee0b67368f9d60ba7e2c7023335ee8370c948449874b

See more details on using hashes here.

Supported by

AWS Cloud computing and Security Sponsor Datadog Monitoring Depot Continuous Integration Fastly CDN Google Download Analytics Pingdom Monitoring Sentry Error logging StatusPage Status page