Skip to main content

Tool for anonymizing schema data in SQL queries.

Project description

SQL Query Anonymizer

A Python tool for anonymizing SQL queries while preserving their structure and maintaining the ability to de-anonymize them later. Perfect for query optimization workflows, security testing, or data sharing scenarios where you need to hide sensitive identifiers.

โœจ Features

  • ๐Ÿ”’ Anonymization: Replace table names, column names, and literals with generic placeholders
  • ๐Ÿ”“ De-anonymization: Restore original identifiers from anonymized queries
  • ๐Ÿ’พ Persistent Mappings: Pickle-based storage maintains mappings across sessions
  • ๐ŸŽฏ Structure Preservation: SQL syntax and query structure remain intact
  • ๐Ÿ“Š Table Alias Detection: Quantifies table aliases in SELECT statements
  • ๐Ÿ”„ Roundtrip Guarantee: Perfect roundtrip anonymization โ†” de-anonymization
  • ๐Ÿ–ฅ๏ธ CLI & Python API: Use via command-line or import in Python code
  • ๐Ÿ“ File Processing: Process SQL files in batch
  • ๐Ÿงช Comprehensive Tests: 73 passing tests with pytest

๐Ÿ“ฆ Installation

# Clone the repository
git clone https://github.com/ncarsner/sql-query-anonymizer.git
cd sql-query-anonymizer

# Install dependencies (using uv)
uv sync

# Or using pip
pip install -e .

# After installation, the sql-anonymizer command will be available
sql-anonymizer --help

๐Ÿš€ Quick Start

Command Line Interface

# Anonymize a SQL query
sql-anonymizer anonymize "SELECT name, email FROM users WHERE age > 25"
# Output: SELECT identifier_1 , identifier_2 FROM table_1 WHERE identifier_3 > literal_1

# De-anonymize back to original
sql-anonymizer deanonymize "SELECT identifier_1 , identifier_2 FROM table_1 WHERE identifier_3 > literal_1"
# Output: SELECT name , email FROM users WHERE age > 25

# Show current mappings
sql-anonymizer show-mappings

# Process SQL files
sql-anonymizer anonymize -f input.sql -o anonymized.sql

Python API

from src.sql_query_anonymizer.utils import Anonymizer

# Create an anonymizer instance with default mapping file
anonymizer = Anonymizer()

# Anonymize a query
original = "SELECT customer_id, name FROM customers WHERE age > 30"
anonymized = anonymizer.anonymize_query(original)
print(anonymized)  # SELECT identifier_1 , identifier_2 FROM table_1 WHERE identifier_3 > literal_1

# Save mappings to persist for later
anonymizer.save()

# Later session - create new anonymizer and load mappings
anonymizer2 = Anonymizer()
anonymizer2.load()

# De-anonymize back to original
restored = anonymizer2.de_anonymize_query(anonymized)
print(restored)  # SELECT customer_id , name FROM customers WHERE age > 30

# Or use context manager for automatic save/load
with Anonymizer() as anon:
    anonymized = anon.anonymize_query(original)
    # Mappings automatically saved on exit

๐Ÿ“ Project Structure

sql-query-anonymizer/
โ”œโ”€โ”€ src/
โ”‚   โ””โ”€โ”€ sql_query_anonymizer/
โ”‚       โ”œโ”€โ”€ __init__.py
โ”‚       โ”œโ”€โ”€ anonymize.py          # Anonymization preprocessing
โ”‚       โ”œโ”€โ”€ cli.py                # Command-line interface
โ”‚       โ”œโ”€โ”€ constants.py          # SQL keywords and configuration
โ”‚       โ”œโ”€โ”€ helper_utilities.py   # File I/O utilities
โ”‚       โ”œโ”€โ”€ tokenize.py           # SQL tokenization engine
โ”‚       โ””โ”€โ”€ utils.py              # Core Anonymizer class
โ”œโ”€โ”€ tests/
โ”‚   โ”œโ”€โ”€ test_anonymize.py         # Anonymization tests
โ”‚   โ”œโ”€โ”€ test_cli.py               # CLI tests
โ”‚   โ”œโ”€โ”€ test_tokenize.py          # Tokenization tests
โ”‚   โ””โ”€โ”€ test_utils.py             # Core utilities tests
โ”œโ”€โ”€ data/
โ”‚   โ”œโ”€โ”€ 0_raw/                    # Sample raw SQL files
โ”‚   โ”œโ”€โ”€ 1_anonymized/             # Anonymized outputs
โ”‚   โ”œโ”€โ”€ 2_optimized/              # Optimized queries
โ”‚   โ””โ”€โ”€ 3_deanonymized/           # De-anonymized outputs
โ”œโ”€โ”€ pyproject.toml                # Project configuration
โ”œโ”€โ”€ README.md
โ””โ”€โ”€ LICENSE

๐ŸŽฏ Use Cases

  1. Query Optimization: Anonymize queries before sending to external optimization services
  2. Security Testing: Share query structures without exposing sensitive database schema
  3. Documentation: Create generic examples from real queries
  4. Training: Generate training datasets with anonymized queries
  5. Debugging: Share problematic queries with support teams without revealing confidential data

๐Ÿ”ง CLI Commands

Command Description
anonymize Anonymize SQL query from string or file
deanonymize De-anonymize query back to original form
show-mappings Display current mapping statistics
clear-mappings Clear all stored mappings
export-mappings Export mappings to another pickle file
import-mappings Import mappings from a pickle file
interactive Start interactive mode

Examples:

# Anonymize a query
sql-anonymizer anonymize "SELECT * FROM users"

# Show current mappings
sql-anonymizer show-mappings

# Export mappings for backup
sql-anonymizer export-mappings backup.pkl

# Use custom mapping file
sql-anonymizer -m custom.pkl anonymize "SELECT * FROM products"

๐Ÿ’พ Persistent Storage

Mappings are stored as pickle files in ~/.sql_anonymizer/mappings.pkl and persist across sessions. This ensures:

  • Consistent anonymization: Same identifiers always map to the same placeholders
  • Session persistence: Mappings survive between CLI sessions
  • Backup and restore: Export/import mappings as needed
  • Context manager support: Automatic save/load using Python's with statement

The pickle format stores mappings, reverse mappings, and counters efficiently.

๐Ÿงช Testing

# Run all tests
uv run pytest

# Run with coverage
uv run pytest --cov=src/sql_query_anonymizer

# Run specific test file
uv run pytest tests/test_utils.py

# Run with verbose output
uv run pytest -v

๐Ÿ“‹ Requirements

  • Python 3.13+
  • pytest (for testing)
  • pytest-cov (for coverage reports)

๐Ÿ” How It Works

  1. Tokenization: SQL query is parsed into tokens (keywords, identifiers, literals, operators)
  2. Classification: Tokens are classified by type (table names, column names, literals, etc.)
  3. Mapping: Each unique identifier is mapped to a generic placeholder
  4. Replacement: Original identifiers are replaced with placeholders
  5. Persistence: Mappings are saved for future de-anonymization
  6. De-anonymization: Placeholders are mapped back to original identifiers

๐ŸŽจ Example Transformations

-- Original
SELECT c.customer_id, c.first_name, o.order_date, SUM(o.amount)
FROM customers c
JOIN orders o ON c.customer_id = o.customer_id
WHERE c.status = 'active' AND o.date >= '2024-01-01'
GROUP BY c.customer_id, c.first_name, o.order_date;

-- Anonymized
SELECT c.identifier_1 , c.identifier_2 , o.identifier_3 , SUM ( o.identifier_4 )
FROM table_1 c
JOIN table_2 o ON c.identifier_1 = o.identifier_1
WHERE c.identifier_5 = literal_1 AND o.identifier_6 >= literal_2
GROUP BY c.identifier_1 , c.identifier_2 , o.identifier_3 ;

๐Ÿ› ๏ธ Configuration

Custom mapping file location:

sql-anonymizer -m custom_mappings.pkl anonymize "SELECT * FROM users"

Disable auto-save:

sql-anonymizer --no-auto-save anonymize "SELECT * FROM users"

Python API with custom location:

# Use custom mapping file
anonymizer = Anonymizer(mapping_file="project_mappings.pkl")
anonymizer.load()  # Load existing mappings

# Or use absolute path
anonymizer = Anonymizer(mapping_file="/path/to/mappings.pkl")

๐Ÿ” Feature Status

โœ… Completed

  • Core anonymization/de-anonymization engine
  • Pickle-based persistent mapping storage
  • Context manager support (with statement)
  • Command-line interface with multiple commands
  • File processing capabilities
  • Table alias detection and quantification
  • Comprehensive test suite (73 passing tests)
  • Export/import functionality

๐Ÿšง Future Enhancements

  • Support for additional SQL dialects (PostgreSQL, MySQL, etc.)
  • GUI interface
  • API server mode
  • Pattern-based anonymization rules
  • Integration with query optimization tools
  • Publish to PyPI

๐Ÿงพ License

MIT License - See LICENSE for details

โœ๏ธ Author

Developed by Nicholas Carsner

Contributions, issues, and feature requests are welcome!

Project details


Download files

Download the file for your platform. If you're not sure which to choose, learn more about installing packages.

Source Distribution

sql_query_anonymizer-0.1.0.tar.gz (14.5 kB view details)

Uploaded Source

Built Distribution

If you're not sure about the file name format, learn more about wheel file names.

sql_query_anonymizer-0.1.0-py3-none-any.whl (17.0 kB view details)

Uploaded Python 3

File details

Details for the file sql_query_anonymizer-0.1.0.tar.gz.

File metadata

  • Download URL: sql_query_anonymizer-0.1.0.tar.gz
  • Upload date:
  • Size: 14.5 kB
  • Tags: Source
  • Uploaded using Trusted Publishing? No
  • Uploaded via: uv/0.9.17 {"installer":{"name":"uv","version":"0.9.17","subcommand":["publish"]},"python":null,"implementation":{"name":null,"version":null},"distro":{"name":"Ubuntu","version":"24.04","id":"noble","libc":null},"system":{"name":null,"release":null},"cpu":null,"openssl_version":null,"setuptools_version":null,"rustc_version":null,"ci":true}

File hashes

Hashes for sql_query_anonymizer-0.1.0.tar.gz
Algorithm Hash digest
SHA256 9375be4c63bdbf8e8ec30a680ebaac30ebb6d994edc3c9b60eeaae12ec6e4e84
MD5 ca7aa4c114395d1995dd811632e6f8d7
BLAKE2b-256 89489c35a097f9d972c86de8c4ab2789c7d1904f3badbf4d6c54d4cd697e1d62

See more details on using hashes here.

File details

Details for the file sql_query_anonymizer-0.1.0-py3-none-any.whl.

File metadata

  • Download URL: sql_query_anonymizer-0.1.0-py3-none-any.whl
  • Upload date:
  • Size: 17.0 kB
  • Tags: Python 3
  • Uploaded using Trusted Publishing? No
  • Uploaded via: uv/0.9.17 {"installer":{"name":"uv","version":"0.9.17","subcommand":["publish"]},"python":null,"implementation":{"name":null,"version":null},"distro":{"name":"Ubuntu","version":"24.04","id":"noble","libc":null},"system":{"name":null,"release":null},"cpu":null,"openssl_version":null,"setuptools_version":null,"rustc_version":null,"ci":true}

File hashes

Hashes for sql_query_anonymizer-0.1.0-py3-none-any.whl
Algorithm Hash digest
SHA256 5210122afc19e3e14cc344631f59df75b293c9c050ab2176c5b3c293a851e584
MD5 2f7a78eefd1897a80c3bd58b64bf3b48
BLAKE2b-256 5eed67eef79b9c52fca868aea58260e9b136ce97b4d1a29424feafd58f3486a4

See more details on using hashes here.

Supported by

AWS Cloud computing and Security Sponsor Datadog Monitoring Depot Continuous Integration Fastly CDN Google Download Analytics Pingdom Monitoring Sentry Error logging StatusPage Status page