Skip to main content

A tool to parse and analyze the structure for Postgres sql queries.

Project description

This is a Python package that parses a given sql query, matches the column and tables within your given metastore, and analyzes the query to generate a list of referenced columns within the metastore.

Quick Start

$ pip install sqlanalyzer

Example Usage

1. Format a query to follow the ANSI standards for SQL:

>>> from sqlanalyzer import column_parser
>>> query = """SELECT api.name, acct.customer_tier_c, acct.name FROM api_requests_by_account api
... LEFT JOIN accounts 
... acct ON api.user_id = acct.customer_api_id
... """
>>> formatter = column_parser.Parser(query)
>>> formatted = formatter.format_query(query)
>>> print(formatted)
SELECT api.name,
       acct.customer_tier_c,
       acct.name
FROM api_requests_by_account api
LEFT JOIN accounts acct ON api.user_id = acct.customer_api_id

2. Separate CTE's and extract alias names and queries:

>>> query = """WITH a AS
...   (SELECT DISTINCT anonymous_id,
...                    user_id
...    FROM customer_data.segment_identifies
...    WHERE dt >= '2018-07-01'),
...      b AS
...   (SELECT id,
...           email,
...           created
...    FROM customer_data.accounts)
... SELECT a.*,
...        b.*
... FROM a
... LEFT JOIN b ON a.user_id = b.id
... WHERE context_campaign_name IS NOT NULL
... """
>>> formatter = column_parser.Parser(query)
>>> cte_query = formatter.parse_cte(query)
>>> cte_query
{'a': "SELECT DISTINCT anonymous_id,\n                   user_id\n   FROM customer_data.segment_identifies\n   WHERE dt >= '2018-07-01'", 'b': 'SELECT id,\n          email,\n          created\n   FROM customer_data.accounts', 'main_query': 'SELECT a.*,\n       b.*\nFROM a\nLEFT JOIN b ON a.user_id = b.id\nWHERE context_campaign_name IS NOT NULL\n'}
>>> cte_query.keys()
dict_keys(['a', 'b', 'main_query'])

3. Match table aliases with the actual database name:

>>> query = """SELECT *
... FROM api_requests.requests_by_account m
... INNER JOIN mapbox_customer_data.styles s ON m.metadata_version = s.id
... LEFT JOIN sfdc.users u ON m.csm = u.id
... """
>>> formatter = column_parser.Parser(query)
>>> formatted = formatter.format_query(query)
>>> table_alias_mapping = formatter.get_table_names(formatted.split('\n'))
>>> table_alias_mapping
{'m': 'api_requests.requests_by_account', 's': 'mapbox_customer_data.styles', 'u': 'sfdc.users'}

4. Analyze and parse complex query with subqueries, Common Table Expressions and a mix of the two types.

a) Parse multiple and deeply (3+ levels) nested subqueries:

>>> from sqlanalyzer import query_analyzer
>>> query = """SELECT *
... FROM
...   (SELECT a.*,
...           b.*
...    FROM
...      (SELECT DISTINCT anonymous_id,
...                       user_id
...       FROM customer_data.segment_identifies
...       WHERE dt >= '2018-07-01') a
...    LEFT JOIN
...      (SELECT id,
...              email,
...              created
...       FROM customer_data.accounts) b ON a.user_id = b.id
...    WHERE context_campaign_name IS NOT NULL )
... """
>>> analyzer = query_analyzer.Analyzer(query)
>>> analyzer.parse_query(query)
[{'level_1_main': 'SELECT * FROM no alias '}, 
{'level_2_main': 'SELECT a.*,        b.* WHERE context_campaign_name IS NOT NULL FROM a LEFT JOIN b ON a.user_id = b.id '}, 
{'a': "SELECT DISTINCT anonymous_id, user_id FROM customer_data.segment_identifies WHERE dt >= '2018-07-01'"}, 
{'b': 'SELECT id, email, created FROM customer_data.accounts'}]

b) Parse Common Table Expressions (CTE's):

>>> query = """WITH a AS
...   (SELECT DISTINCT anonymous_id,
...                    user_id
...    FROM customer_data.segment_identifies
...    WHERE dt >= '2018-07-01'),
...      b AS
...   (SELECT id,
...           email,
...           created
...    FROM customer_data.accounts)
... SELECT a.*,
...        b.*
... FROM a
... LEFT JOIN b ON a.user_id = b.id
... WHERE context_campaign_name IS NOT NULL
... """
>>> analyzer = query_analyzer.Analyzer(query)
>>> analyzer.parse_query(query)
[{'a': "SELECT DISTINCT anonymous_id,\n                   user_id\n   FROM customer_data.segment_identifies\n   WHERE dt >= '2018-07-01'"}, {'b': 'SELECT id,\n          email,\n          created\n   FROM customer_data.accounts'}, {'main_query': 'SELECT a.*,\n       b.*\nFROM a\nLEFT JOIN b ON a.user_id = b.id\nWHERE context_campaign_name IS NOT NULL'}]

c) Parse mixed type of nested queries and CTE's:

>>> query = """SELECT email,
...        COUNT(DISTINCT context_campaign_name)
... FROM
...   (WITH a AS
...      (SELECT DISTINCT anonymous_id,
...                       user_id
...       FROM customer_data.segment_identifies
...       WHERE dt >= '2018-07-01'),
...         b AS
...      (SELECT id,
...              email,
...              created
...       FROM customer_data.accounts) SELECT a.*,
...                                           b.*
...    FROM a
...    LEFT JOIN b ON a.user_id = b.id
...    WHERE context_campaign_name IS NOT NULL )
... WHERE user_id IN ('123',
...                   '234',
...                   '345')
... GROUP BY 1
... ORDER BY 2 DESC
... LIMIT 200
... """
>>> analyzer = query_analyzer.Analyzer(query)
>>> analyzer.parse_query(query)
[{'level_1_main': "SELECT email,        COUNT(DISTINCT context_campaign_name) WHERE user_id IN ('123',                   '234',                   '345') FROM no alias "}, {'no alias': [{'a': "SELECT DISTINCT anonymous_id,\n                   user_id\n   FROM customer_data.segment_identifies\n   WHERE dt >= '2018-07-01'"}, {'b': 'SELECT id,\n          email,\n          created\n   FROM customer_data.accounts'}, {'main_query': 'SELECT a.*,\n       b.*\nFROM a\nLEFT JOIN b ON a.user_id = b.id\nWHERE context_campaign_name IS NOT NULL'}]}]

Notes:

Upload instructions python3 -m pip install --user --upgrade setuptools wheel twine python3 setup.py sdist bdist_wheel twine check dist/* twine upload dist/*

Project details


Download files

Download the file for your platform. If you're not sure which to choose, learn more about installing packages.

Source Distribution

sqlanalyzer-0.4.8.tar.gz (9.1 kB view details)

Uploaded Source

Built Distribution

sqlanalyzer-0.4.8-py3-none-any.whl (13.5 kB view details)

Uploaded Python 3

File details

Details for the file sqlanalyzer-0.4.8.tar.gz.

File metadata

  • Download URL: sqlanalyzer-0.4.8.tar.gz
  • Upload date:
  • Size: 9.1 kB
  • Tags: Source
  • Uploaded using Trusted Publishing? No
  • Uploaded via: twine/3.2.0 pkginfo/1.5.0.1 requests/2.22.0 setuptools/49.2.0 requests-toolbelt/0.9.1 tqdm/4.36.1 CPython/3.7.4

File hashes

Hashes for sqlanalyzer-0.4.8.tar.gz
Algorithm Hash digest
SHA256 dc8abef7d6e141a7a9a158436d8120894a9baf32be2454fae5891965739c4069
MD5 3f71c9a968f722813e22938a78c43a07
BLAKE2b-256 1259d2d4bdb666ec549734c62d853900e4faec8fb4244e0f7ba0b2b76fed1bfb

See more details on using hashes here.

File details

Details for the file sqlanalyzer-0.4.8-py3-none-any.whl.

File metadata

  • Download URL: sqlanalyzer-0.4.8-py3-none-any.whl
  • Upload date:
  • Size: 13.5 kB
  • Tags: Python 3
  • Uploaded using Trusted Publishing? No
  • Uploaded via: twine/3.2.0 pkginfo/1.5.0.1 requests/2.22.0 setuptools/49.2.0 requests-toolbelt/0.9.1 tqdm/4.36.1 CPython/3.7.4

File hashes

Hashes for sqlanalyzer-0.4.8-py3-none-any.whl
Algorithm Hash digest
SHA256 5393a1430d5832d692ad17980de26f393b919e1b9f41df65d78ccc037810df12
MD5 c184326bde61aeacf444b8b6b899d4df
BLAKE2b-256 e806163e49359d0a4fd00565b6390dd0bc9a6fee98586580b69dc82e1e559fcf

See more details on using hashes here.

Supported by

AWS AWS Cloud computing and Security Sponsor Datadog Datadog Monitoring Fastly Fastly CDN Google Google Download Analytics Microsoft Microsoft PSF Sponsor Pingdom Pingdom Monitoring Sentry Sentry Error logging StatusPage StatusPage Status page