Skip to main content

A tool to parse and analyze the structure for Postgres sql queries.

Project description

This is a Python package that parses a given sql query, matches the column and tables within your given metastore, and analyzes the query to generate a list of referenced columns within the metastore.

Quick Start

$ pip install sqlanalyzer

Example Usage

1. Format a query to follow the ANSI standards for SQL:

>>> from sqlanalyzer import column_parser
>>> query = """SELECT api.name, acct.customer_tier_c, acct.name FROM api_requests_by_account api
... LEFT JOIN accounts 
... acct ON api.user_id = acct.customer_api_id
... """
>>> formatter = column_parser.Parser(query)
>>> formatted = formatter.format_query(query)
>>> print(formatted)
SELECT api.name,
       acct.customer_tier_c,
       acct.name
FROM api_requests_by_account api
LEFT JOIN accounts acct ON api.user_id = acct.customer_api_id

2. Separate CTE's and extract alias names and queries:

>>> query = """WITH a AS
...   (SELECT DISTINCT anonymous_id,
...                    user_id
...    FROM customer_data.segment_identifies
...    WHERE dt >= '2018-07-01'),
...      b AS
...   (SELECT id,
...           email,
...           created
...    FROM customer_data.accounts)
... SELECT a.*,
...        b.*
... FROM a
... LEFT JOIN b ON a.user_id = b.id
... WHERE context_campaign_name IS NOT NULL
... """
>>> formatter = column_parser.Parser(query)
>>> cte_query = formatter.parse_cte(query)
>>> cte_query
{'a': "SELECT DISTINCT anonymous_id,\n                   user_id\n   FROM customer_data.segment_identifies\n   WHERE dt >= '2018-07-01'",
'b': 'SELECT id,\n          email,\n          created\n   FROM customer_data.accounts', 
'main_query': 'SELECT a.*,\n       b.*\nFROM a\nLEFT JOIN b ON a.user_id = b.id\nWHERE context_campaign_name IS NOT NULL\n'}
>>> cte_query.keys()
dict_keys(['a', 'b', 'main_query'])

3. Match table aliases with the actual database name:

>>> query = """SELECT *
... FROM api_requests.requests_by_account m
... INNER JOIN mapbox_customer_data.styles s ON m.metadata_version = s.id
... LEFT JOIN sfdc.users u ON m.csm = u.id
... """
>>> formatter = column_parser.Parser(query)
>>> formatted = formatter.format_query(query)
>>> table_alias_mapping = formatter.get_table_names(formatted.split('\n'))
>>> table_alias_mapping
{'m': 'api_requests.requests_by_account', 
's': 'mapbox_customer_data.styles', 
'u': 'sfdc.users'}

4. Analyze and parse complex query with subqueries, Common Table Expressions and a mix of the two types.

a) Parse multiple and deeply (3+ levels) nested subqueries:

>>> from sqlanalyzer import query_analyzer
>>> query = """SELECT *
... FROM
...   (SELECT a.*,
...           b.*
...    FROM
...      (SELECT DISTINCT anonymous_id,
...                       user_id
...       FROM customer_data.segment_identifies
...       WHERE dt >= '2018-07-01') a
...    LEFT JOIN
...      (SELECT id,
...              email,
...              created
...       FROM customer_data.accounts) b ON a.user_id = b.id
...    WHERE context_campaign_name IS NOT NULL )
... """
>>> analyzer = query_analyzer.Analyzer(query)
>>> analyzer.parse_query(query)
[{'level_1_main': 'SELECT * FROM no alias '}, 
{'level_2_main': 'SELECT a.*,        b.* WHERE context_campaign_name IS NOT NULL FROM a LEFT JOIN b ON a.user_id = b.id '}, 
{'a': "SELECT DISTINCT anonymous_id, user_id FROM customer_data.segment_identifies WHERE dt >= '2018-07-01'"}, 
{'b': 'SELECT id, email, created FROM customer_data.accounts'}]

b) Parse Common Table Expressions (CTE's):

>>> query = """WITH a AS
...   (SELECT DISTINCT anonymous_id,
...                    user_id
...    FROM customer_data.segment_identifies
...    WHERE dt >= '2018-07-01'),
...      b AS
...   (SELECT id,
...           email,
...           created
...    FROM customer_data.accounts)
... SELECT a.*,
...        b.*
... FROM a
... LEFT JOIN b ON a.user_id = b.id
... WHERE context_campaign_name IS NOT NULL
... """
>>> analyzer = query_analyzer.Analyzer(query)
>>> analyzer.parse_query(query)
[{'a': "SELECT DISTINCT anonymous_id,\n                   user_id\n   FROM customer_data.segment_identifies\n   WHERE dt >= '2018-07-01'"}, 
{'b': 'SELECT id,\n          email,\n          created\n   FROM customer_data.accounts'}, 
{'main_query': 'SELECT a.*,\n       b.*\nFROM a\nLEFT JOIN b ON a.user_id = b.id\nWHERE context_campaign_name IS NOT NULL'}]

c) Parse mixed type of nested queries and CTE's:

>>> query = """SELECT email,
...        COUNT(DISTINCT context_campaign_name)
... FROM
...   (WITH a AS
...      (SELECT DISTINCT anonymous_id,
...                       user_id
...       FROM customer_data.segment_identifies
...       WHERE dt >= '2018-07-01'),
...         b AS
...      (SELECT id,
...              email,
...              created
...       FROM customer_data.accounts) SELECT a.*,
...                                           b.*
...    FROM a
...    LEFT JOIN b ON a.user_id = b.id
...    WHERE context_campaign_name IS NOT NULL )
... WHERE user_id IN ('123',
...                   '234',
...                   '345')
... GROUP BY 1
... ORDER BY 2 DESC
... LIMIT 200
... """
>>> analyzer = query_analyzer.Analyzer(query)
>>> analyzer.parse_query(query)
[{'level_1_main': "SELECT email,        COUNT(DISTINCT context_campaign_name) WHERE user_id IN ('123',                   '234',                   '345') FROM no alias "}, 
{'no alias': [{'a': "SELECT DISTINCT anonymous_id,\n                   user_id\n   FROM customer_data.segment_identifies\n   WHERE dt >= '2018-07-01'"}, 
{'b': 'SELECT id,\n          email,\n          created\n   FROM customer_data.accounts'}, 
{'main_query': 'SELECT a.*,\n       b.*\nFROM a\nLEFT JOIN b ON a.user_id = b.id\nWHERE context_campaign_name IS NOT NULL'}]}]

Notes:

Upload instructions python3 -m pip install --user --upgrade setuptools wheel twine python3 setup.py sdist bdist_wheel twine check dist/* twine upload dist/*

Project details


Download files

Download the file for your platform. If you're not sure which to choose, learn more about installing packages.

Source Distribution

sqlanalyzer-0.5.42.tar.gz (9.5 kB view details)

Uploaded Source

Built Distribution

sqlanalyzer-0.5.42-py3-none-any.whl (13.7 kB view details)

Uploaded Python 3

File details

Details for the file sqlanalyzer-0.5.42.tar.gz.

File metadata

  • Download URL: sqlanalyzer-0.5.42.tar.gz
  • Upload date:
  • Size: 9.5 kB
  • Tags: Source
  • Uploaded using Trusted Publishing? No
  • Uploaded via: twine/3.2.0 pkginfo/1.5.0.1 requests/2.22.0 setuptools/49.2.0 requests-toolbelt/0.9.1 tqdm/4.36.1 CPython/3.7.4

File hashes

Hashes for sqlanalyzer-0.5.42.tar.gz
Algorithm Hash digest
SHA256 594c938d93ba04a26b90e44f7171720b9d01802af80dd00ad4a22510dda2e6df
MD5 ad52a9a895928b5180b134c27a650ad2
BLAKE2b-256 5a63bc71e89c8ffa7057636b2108824ccfa9c6ff40b10b6bc58ce9fb6afe0444

See more details on using hashes here.

File details

Details for the file sqlanalyzer-0.5.42-py3-none-any.whl.

File metadata

  • Download URL: sqlanalyzer-0.5.42-py3-none-any.whl
  • Upload date:
  • Size: 13.7 kB
  • Tags: Python 3
  • Uploaded using Trusted Publishing? No
  • Uploaded via: twine/3.2.0 pkginfo/1.5.0.1 requests/2.22.0 setuptools/49.2.0 requests-toolbelt/0.9.1 tqdm/4.36.1 CPython/3.7.4

File hashes

Hashes for sqlanalyzer-0.5.42-py3-none-any.whl
Algorithm Hash digest
SHA256 7349db0d6fc62742479cd8e7f7b70f585dad4e101362d69979251cb1933de085
MD5 41baa1eb713b0a29f8e3ba931f4fa789
BLAKE2b-256 f59450e43d6dfe054d45d9bc02d09cd8a50ddbfb90b90a37f66db91421cefb19

See more details on using hashes here.

Supported by

AWS AWS Cloud computing and Security Sponsor Datadog Datadog Monitoring Fastly Fastly CDN Google Google Download Analytics Microsoft Microsoft PSF Sponsor Pingdom Pingdom Monitoring Sentry Sentry Error logging StatusPage StatusPage Status page