Tool for anonymizing schema data in SQL queries.
Project description
SQL Query Anonymizer
A Python tool for anonymizing SQL queries while preserving their structure and maintaining the ability to de-anonymize them later. Perfect for query optimization workflows, security testing, or data sharing scenarios where you need to hide sensitive identifiers.
โจ Features
- ๐ Anonymization: Replace table names, column names, and literals with generic placeholders
- ๐ De-anonymization: Restore original identifiers from anonymized queries
- ๐พ Persistent Mappings: Pickle-based storage maintains mappings across sessions
- ๐ฏ Structure Preservation: SQL syntax and query structure remain intact
- ๐ Table Alias Detection: Quantifies table aliases in SELECT statements
- ๐ Roundtrip Guarantee: Perfect roundtrip anonymization โ de-anonymization
- ๐ฅ๏ธ CLI & Python API: Use via command-line or import in Python code
- ๐ File Processing: Process SQL files in batch
- ๐งช Comprehensive Tests: 73 passing tests with pytest
๐ฆ Installation
# Clone the repository
git clone https://github.com/ncarsner/sql-query-anonymizer.git
cd sql-query-anonymizer
# Install dependencies (using uv)
uv sync
# Or using pip
pip install -e .
# After installation, the sql-anonymizer command will be available
sql-anonymizer --help
๐ Quick Start
Command Line Interface
# Anonymize a SQL query
sql-anonymizer anonymize "SELECT name, email FROM users WHERE age > 25"
# Output: SELECT identifier_1 , identifier_2 FROM table_1 WHERE identifier_3 > literal_1
# De-anonymize back to original
sql-anonymizer deanonymize "SELECT identifier_1 , identifier_2 FROM table_1 WHERE identifier_3 > literal_1"
# Output: SELECT name , email FROM users WHERE age > 25
# Show current mappings
sql-anonymizer show-mappings
# Process SQL files
sql-anonymizer anonymize -f input.sql -o anonymized.sql
Python API
from src.sql_query_anonymizer.utils import Anonymizer
# Create an anonymizer instance with default mapping file
anonymizer = Anonymizer()
# Anonymize a query
original = "SELECT customer_id, name FROM customers WHERE age > 30"
anonymized = anonymizer.anonymize_query(original)
print(anonymized) # SELECT identifier_1 , identifier_2 FROM table_1 WHERE identifier_3 > literal_1
# Save mappings to persist for later
anonymizer.save()
# Later session - create new anonymizer and load mappings
anonymizer2 = Anonymizer()
anonymizer2.load()
# De-anonymize back to original
restored = anonymizer2.de_anonymize_query(anonymized)
print(restored) # SELECT customer_id , name FROM customers WHERE age > 30
# Or use context manager for automatic save/load
with Anonymizer() as anon:
anonymized = anon.anonymize_query(original)
# Mappings automatically saved on exit
๐ Project Structure
sql-query-anonymizer/
โโโ .github/
โ โโโ workflows/
โ โโโ publish.yml # Automated PyPI publishing
โ โโโ tests.yml # CI/CD testing
โโโ src/
โ โโโ sql_query_anonymizer/
โ โโโ __init__.py
โ โโโ anonymize.py # Anonymization preprocessing
โ โโโ cli.py # Command-line interface
โ โโโ constants.py # SQL keywords and configuration
โ โโโ helper_utilities.py # File I/O utilities
โ โโโ tokenize.py # SQL tokenization engine
โ โโโ utils.py # Core Anonymizer class
โโโ tests/
โ โโโ __init__.py
โ โโโ test_anonymize.py # Anonymization tests
โ โโโ test_cli.py # CLI tests
โ โโโ test_utils.py # Core utilities tests
โโโ data/
โ โโโ _raw/ # Sample raw SQL files
โ โโโ _anonymized/ # Anonymized outputs
โ โโโ _optimized/ # Optimized queries
โ โโโ _deanonymized/ # De-anonymized outputs
โโโ .pre-commit-config.yaml # Pre-commit hooks configuration
โโโ CLI_USAGE.md # CLI quick reference
โโโ pyproject.toml # Project configuration
โโโ uv.lock # Dependency lock file
โโโ README.md
โโโ LICENSE
๐ฏ Use Cases
- Query Optimization: Anonymize queries before sending to external optimization services
- Security Testing: Share query structures without exposing sensitive database schema
- Documentation: Create generic examples from real queries
- Training: Generate training datasets with anonymized queries
- Debugging: Share problematic queries with support teams without revealing confidential data
๐ง CLI Commands
| Command | Description |
|---|---|
anonymize |
Anonymize SQL query from string or file |
deanonymize |
De-anonymize query back to original form |
show-mappings |
Display current mapping statistics |
clear-mappings |
Clear all stored mappings |
export-mappings |
Export mappings to another pickle file |
import-mappings |
Import mappings from a pickle file |
interactive |
Start interactive mode |
Examples:
# Anonymize a query
sql-anonymizer anonymize "SELECT * FROM users"
# Show current mappings
sql-anonymizer show-mappings
# Export mappings for backup
sql-anonymizer export-mappings backup.pkl
# Use custom mapping file
sql-anonymizer -m custom.pkl anonymize "SELECT * FROM products"
๐พ Persistent Storage
Mappings are stored as pickle files in ~/.sql_anonymizer/mappings.pkl and persist across sessions. This ensures:
- Consistent anonymization: Same identifiers always map to the same placeholders
- Session persistence: Mappings survive between CLI sessions
- Backup and restore: Export/import mappings as needed
- Context manager support: Automatic save/load using Python's
withstatement
The pickle format stores mappings, reverse mappings, and counters efficiently.
๐งช Testing
# Run all tests
uv run pytest
# Run with coverage
uv run pytest --cov=src/sql_query_anonymizer
# Run specific test file
uv run pytest tests/test_utils.py
# Run with verbose output
uv run pytest -v
๐ Requirements
- Python 3.13+
- pytest (for testing)
- pytest-cov (for coverage reports)
๐ How It Works
- Tokenization: SQL query is parsed into tokens (keywords, identifiers, literals, operators)
- Classification: Tokens are classified by type (table names, column names, literals, etc.)
- Mapping: Each unique identifier is mapped to a generic placeholder
- Replacement: Original identifiers are replaced with placeholders
- Persistence: Mappings are saved for future de-anonymization
- De-anonymization: Placeholders are mapped back to original identifiers
๐จ Example Transformations
-- Original
SELECT c.customer_id, c.first_name, o.order_date, SUM(o.amount)
FROM customers c
JOIN orders o ON c.customer_id = o.customer_id
WHERE c.status = 'active' AND o.date >= '2024-01-01'
GROUP BY c.customer_id, c.first_name, o.order_date;
-- Anonymized
SELECT c.identifier_1 , c.identifier_2 , o.identifier_3 , SUM ( o.identifier_4 )
FROM table_1 c
JOIN table_2 o ON c.identifier_1 = o.identifier_1
WHERE c.identifier_5 = literal_1 AND o.identifier_6 >= literal_2
GROUP BY c.identifier_1 , c.identifier_2 , o.identifier_3 ;
๐ ๏ธ Configuration
Custom mapping file location:
sql-anonymizer -m custom_mappings.pkl anonymize "SELECT * FROM users"
Disable auto-save:
sql-anonymizer --no-auto-save anonymize "SELECT * FROM users"
Python API with custom location:
# Use custom mapping file
anonymizer = Anonymizer(mapping_file="project_mappings.pkl")
anonymizer.load() # Load existing mappings
# Or use absolute path
anonymizer = Anonymizer(mapping_file="/path/to/mappings.pkl")
๐ Feature Status
โ Completed
- Core anonymization/de-anonymization engine
- Pickle-based persistent mapping storage
- Context manager support (
withstatement) - Command-line interface with multiple commands
- File processing capabilities
- Table alias detection and quantification
- Comprehensive test suite (73 passing tests)
- Export/import functionality
๐ง Future Enhancements
- Support for additional SQL dialects (PostgreSQL, MySQL, etc.)
- GUI interface
- API server mode
- Pattern-based anonymization rules
- Integration with query optimization tools
- Publish to PyPI
๐งพ License
MIT License - See LICENSE for details
โ๏ธ Author
Developed by Nicholas Carsner
Contributions, issues, and feature requests are welcome!
Project details
Release history Release notifications | RSS feed
Download files
Download the file for your platform. If you're not sure which to choose, learn more about installing packages.
Source Distribution
Built Distribution
Filter files by name, interpreter, ABI, and platform.
If you're not sure about the file name format, learn more about wheel file names.
Copy a direct link to the current filters
File details
Details for the file sql_query_anonymizer-0.1.2.tar.gz.
File metadata
- Download URL: sql_query_anonymizer-0.1.2.tar.gz
- Upload date:
- Size: 14.6 kB
- Tags: Source
- Uploaded using Trusted Publishing? No
- Uploaded via: uv/0.9.18 {"installer":{"name":"uv","version":"0.9.18","subcommand":["publish"]},"python":null,"implementation":{"name":null,"version":null},"distro":{"name":"Ubuntu","version":"24.04","id":"noble","libc":null},"system":{"name":null,"release":null},"cpu":null,"openssl_version":null,"setuptools_version":null,"rustc_version":null,"ci":true}
File hashes
| Algorithm | Hash digest | |
|---|---|---|
| SHA256 |
f26e77aee4bff7b43a87352dc822965125d5c0dbb2af25e363db9c25979c36a0
|
|
| MD5 |
68b1ee671d738aa585a478c5cad80597
|
|
| BLAKE2b-256 |
91bc82df36ffd1524ecde777919ee5a9c9e1a82352baa5b6f3d5296885c80671
|
File details
Details for the file sql_query_anonymizer-0.1.2-py3-none-any.whl.
File metadata
- Download URL: sql_query_anonymizer-0.1.2-py3-none-any.whl
- Upload date:
- Size: 17.1 kB
- Tags: Python 3
- Uploaded using Trusted Publishing? No
- Uploaded via: uv/0.9.18 {"installer":{"name":"uv","version":"0.9.18","subcommand":["publish"]},"python":null,"implementation":{"name":null,"version":null},"distro":{"name":"Ubuntu","version":"24.04","id":"noble","libc":null},"system":{"name":null,"release":null},"cpu":null,"openssl_version":null,"setuptools_version":null,"rustc_version":null,"ci":true}
File hashes
| Algorithm | Hash digest | |
|---|---|---|
| SHA256 |
0eac17a08bf8529e33e59e0e2470bf571109215a555722d8f12a1e6215bf46f2
|
|
| MD5 |
e3a3a39bba9a6ffebcc88de559b37e4c
|
|
| BLAKE2b-256 |
b96dc249e77832bbc903906ed93a152a0e9b78d45259b6427e687a58c1339d42
|