Skip to main content

Tool for anonymizing schema data in SQL queries.

Project description

SQL Query Anonymizer

A Python tool for anonymizing SQL queries while preserving their structure and maintaining the ability to de-anonymize them later. Perfect for query optimization workflows, security testing, or data sharing scenarios where you need to hide sensitive identifiers.

โœจ Features

  • ๐Ÿ”’ Anonymization: Replace table names, column names, and literals with generic placeholders
  • ๐Ÿ”“ De-anonymization: Restore original identifiers from anonymized queries
  • ๐Ÿ’พ Persistent Mappings: Pickle-based storage maintains mappings across sessions
  • ๐ŸŽฏ Structure Preservation: SQL syntax and query structure remain intact
  • ๐Ÿ“Š Table Alias Detection: Quantifies table aliases in SELECT statements
  • ๐Ÿ”„ Roundtrip Guarantee: Perfect roundtrip anonymization โ†” de-anonymization
  • ๐Ÿ–ฅ๏ธ CLI & Python API: Use via command-line or import in Python code
  • ๐Ÿ“ File Processing: Process SQL files in batch
  • ๐Ÿงช Comprehensive Tests: 73 passing tests with pytest

๐Ÿ“ฆ Installation

# Clone the repository
git clone https://github.com/ncarsner/sql-query-anonymizer.git
cd sql-query-anonymizer

# Install dependencies (using uv)
uv sync

# Or using pip
pip install -e .

# After installation, the sql-anonymizer command will be available
sql-anonymizer --help

๐Ÿš€ Quick Start

Command Line Interface

# Anonymize a SQL query
sql-anonymizer anonymize "SELECT name, email FROM users WHERE age > 25"
# Output: SELECT identifier_1 , identifier_2 FROM table_1 WHERE identifier_3 > literal_1

# De-anonymize back to original
sql-anonymizer deanonymize "SELECT identifier_1 , identifier_2 FROM table_1 WHERE identifier_3 > literal_1"
# Output: SELECT name , email FROM users WHERE age > 25

# Show current mappings
sql-anonymizer show-mappings

# Process SQL files
sql-anonymizer anonymize -f input.sql -o anonymized.sql

Python API

from src.sql_query_anonymizer.utils import Anonymizer

# Create an anonymizer instance with default mapping file
anonymizer = Anonymizer()

# Anonymize a query
original = "SELECT customer_id, name FROM customers WHERE age > 30"
anonymized = anonymizer.anonymize_query(original)
print(anonymized)  # SELECT identifier_1 , identifier_2 FROM table_1 WHERE identifier_3 > literal_1

# Save mappings to persist for later
anonymizer.save()

# Later session - create new anonymizer and load mappings
anonymizer2 = Anonymizer()
anonymizer2.load()

# De-anonymize back to original
restored = anonymizer2.de_anonymize_query(anonymized)
print(restored)  # SELECT customer_id , name FROM customers WHERE age > 30

# Or use context manager for automatic save/load
with Anonymizer() as anon:
    anonymized = anon.anonymize_query(original)
    # Mappings automatically saved on exit

๐Ÿ“ Project Structure

sql-query-anonymizer/
โ”œโ”€โ”€ .github/
โ”‚   โ””โ”€โ”€ workflows/
โ”‚       โ”œโ”€โ”€ publish.yml           # Automated PyPI publishing
โ”‚       โ””โ”€โ”€ tests.yml             # CI/CD testing
โ”œโ”€โ”€ src/
โ”‚   โ””โ”€โ”€ sql_query_anonymizer/
โ”‚       โ”œโ”€โ”€ __init__.py
โ”‚       โ”œโ”€โ”€ anonymize.py          # Anonymization preprocessing
โ”‚       โ”œโ”€โ”€ cli.py                # Command-line interface
โ”‚       โ”œโ”€โ”€ constants.py          # SQL keywords and configuration
โ”‚       โ”œโ”€โ”€ helper_utilities.py   # File I/O utilities
โ”‚       โ”œโ”€โ”€ tokenize.py           # SQL tokenization engine
โ”‚       โ””โ”€โ”€ utils.py              # Core Anonymizer class
โ”œโ”€โ”€ tests/
โ”‚   โ”œโ”€โ”€ __init__.py
โ”‚   โ”œโ”€โ”€ test_anonymize.py         # Anonymization tests
โ”‚   โ”œโ”€โ”€ test_cli.py               # CLI tests
โ”‚   โ””โ”€โ”€ test_utils.py             # Core utilities tests
โ”œโ”€โ”€ data/
โ”‚   โ”œโ”€โ”€ _raw/                     # Sample raw SQL files
โ”‚   โ”œโ”€โ”€ _anonymized/              # Anonymized outputs
โ”‚   โ”œโ”€โ”€ _optimized/               # Optimized queries
โ”‚   โ””โ”€โ”€ _deanonymized/            # De-anonymized outputs
โ”œโ”€โ”€ .pre-commit-config.yaml       # Pre-commit hooks configuration
โ”œโ”€โ”€ CLI_USAGE.md                  # CLI quick reference
โ”œโ”€โ”€ pyproject.toml                # Project configuration
โ”œโ”€โ”€ uv.lock                       # Dependency lock file
โ”œโ”€โ”€ README.md
โ””โ”€โ”€ LICENSE

๐ŸŽฏ Use Cases

  1. Query Optimization: Anonymize queries before sending to external optimization services
  2. Security Testing: Share query structures without exposing sensitive database schema
  3. Documentation: Create generic examples from real queries
  4. Training: Generate training datasets with anonymized queries
  5. Debugging: Share problematic queries with support teams without revealing confidential data

๐Ÿ”ง CLI Commands

Command Description
anonymize Anonymize SQL query from string or file
deanonymize De-anonymize query back to original form
show-mappings Display current mapping statistics
clear-mappings Clear all stored mappings
export-mappings Export mappings to another pickle file
import-mappings Import mappings from a pickle file
interactive Start interactive mode

Examples:

# Anonymize a query
sql-anonymizer anonymize "SELECT * FROM users"

# Show current mappings
sql-anonymizer show-mappings

# Export mappings for backup
sql-anonymizer export-mappings backup.pkl

# Use custom mapping file
sql-anonymizer -m custom.pkl anonymize "SELECT * FROM products"

๐Ÿ’พ Persistent Storage

Mappings are stored as pickle files in ~/.sql_anonymizer/mappings.pkl and persist across sessions. This ensures:

  • Consistent anonymization: Same identifiers always map to the same placeholders
  • Session persistence: Mappings survive between CLI sessions
  • Backup and restore: Export/import mappings as needed
  • Context manager support: Automatic save/load using Python's with statement

The pickle format stores mappings, reverse mappings, and counters efficiently.

๐Ÿงช Testing

# Run all tests
uv run pytest

# Run with coverage
uv run pytest --cov=src/sql_query_anonymizer

# Run specific test file
uv run pytest tests/test_utils.py

# Run with verbose output
uv run pytest -v

๐Ÿ“‹ Requirements

  • Python 3.13+
  • pytest (for testing)
  • pytest-cov (for coverage reports)

๐Ÿ” How It Works

  1. Tokenization: SQL query is parsed into tokens (keywords, identifiers, literals, operators)
  2. Classification: Tokens are classified by type (table names, column names, literals, etc.)
  3. Mapping: Each unique identifier is mapped to a generic placeholder
  4. Replacement: Original identifiers are replaced with placeholders
  5. Persistence: Mappings are saved for future de-anonymization
  6. De-anonymization: Placeholders are mapped back to original identifiers

๐ŸŽจ Example Transformations

-- Original
SELECT c.customer_id, c.first_name, o.order_date, SUM(o.amount)
FROM customers c
JOIN orders o ON c.customer_id = o.customer_id
WHERE c.status = 'active' AND o.date >= '2024-01-01'
GROUP BY c.customer_id, c.first_name, o.order_date;

-- Anonymized
SELECT c.identifier_1 , c.identifier_2 , o.identifier_3 , SUM ( o.identifier_4 )
FROM table_1 c
JOIN table_2 o ON c.identifier_1 = o.identifier_1
WHERE c.identifier_5 = literal_1 AND o.identifier_6 >= literal_2
GROUP BY c.identifier_1 , c.identifier_2 , o.identifier_3 ;

๐Ÿ› ๏ธ Configuration

Custom mapping file location:

sql-anonymizer -m custom_mappings.pkl anonymize "SELECT * FROM users"

Disable auto-save:

sql-anonymizer --no-auto-save anonymize "SELECT * FROM users"

Python API with custom location:

# Use custom mapping file
anonymizer = Anonymizer(mapping_file="project_mappings.pkl")
anonymizer.load()  # Load existing mappings

# Or use absolute path
anonymizer = Anonymizer(mapping_file="/path/to/mappings.pkl")

๐Ÿ” Feature Status

โœ… Completed

  • Core anonymization/de-anonymization engine
  • Pickle-based persistent mapping storage
  • Context manager support (with statement)
  • Command-line interface with multiple commands
  • File processing capabilities
  • Table alias detection and quantification
  • Comprehensive test suite (73 passing tests)
  • Export/import functionality

๐Ÿšง Future Enhancements

  • Support for additional SQL dialects (PostgreSQL, MySQL, etc.)
  • GUI interface
  • API server mode
  • Pattern-based anonymization rules
  • Integration with query optimization tools
  • Publish to PyPI

๐Ÿงพ License

MIT License - See LICENSE for details

โœ๏ธ Author

Developed by Nicholas Carsner

Contributions, issues, and feature requests are welcome!

Project details


Download files

Download the file for your platform. If you're not sure which to choose, learn more about installing packages.

Source Distribution

sql_query_anonymizer-0.1.2.tar.gz (14.6 kB view details)

Uploaded Source

Built Distribution

If you're not sure about the file name format, learn more about wheel file names.

sql_query_anonymizer-0.1.2-py3-none-any.whl (17.1 kB view details)

Uploaded Python 3

File details

Details for the file sql_query_anonymizer-0.1.2.tar.gz.

File metadata

  • Download URL: sql_query_anonymizer-0.1.2.tar.gz
  • Upload date:
  • Size: 14.6 kB
  • Tags: Source
  • Uploaded using Trusted Publishing? No
  • Uploaded via: uv/0.9.18 {"installer":{"name":"uv","version":"0.9.18","subcommand":["publish"]},"python":null,"implementation":{"name":null,"version":null},"distro":{"name":"Ubuntu","version":"24.04","id":"noble","libc":null},"system":{"name":null,"release":null},"cpu":null,"openssl_version":null,"setuptools_version":null,"rustc_version":null,"ci":true}

File hashes

Hashes for sql_query_anonymizer-0.1.2.tar.gz
Algorithm Hash digest
SHA256 f26e77aee4bff7b43a87352dc822965125d5c0dbb2af25e363db9c25979c36a0
MD5 68b1ee671d738aa585a478c5cad80597
BLAKE2b-256 91bc82df36ffd1524ecde777919ee5a9c9e1a82352baa5b6f3d5296885c80671

See more details on using hashes here.

File details

Details for the file sql_query_anonymizer-0.1.2-py3-none-any.whl.

File metadata

  • Download URL: sql_query_anonymizer-0.1.2-py3-none-any.whl
  • Upload date:
  • Size: 17.1 kB
  • Tags: Python 3
  • Uploaded using Trusted Publishing? No
  • Uploaded via: uv/0.9.18 {"installer":{"name":"uv","version":"0.9.18","subcommand":["publish"]},"python":null,"implementation":{"name":null,"version":null},"distro":{"name":"Ubuntu","version":"24.04","id":"noble","libc":null},"system":{"name":null,"release":null},"cpu":null,"openssl_version":null,"setuptools_version":null,"rustc_version":null,"ci":true}

File hashes

Hashes for sql_query_anonymizer-0.1.2-py3-none-any.whl
Algorithm Hash digest
SHA256 0eac17a08bf8529e33e59e0e2470bf571109215a555722d8f12a1e6215bf46f2
MD5 e3a3a39bba9a6ffebcc88de559b37e4c
BLAKE2b-256 b96dc249e77832bbc903906ed93a152a0e9b78d45259b6427e687a58c1339d42

See more details on using hashes here.

Supported by

AWS Cloud computing and Security Sponsor Datadog Monitoring Depot Continuous Integration Fastly CDN Google Download Analytics Pingdom Monitoring Sentry Error logging StatusPage Status page