Skip to main content

More SQL Parsing! Parse SQL into JSON parse tree

Project description

More SQL Parsing!

PyPI Latest Release Build Status PyPI Downloads

Parse SQL into JSON so we can translate it for other datastores!

See changes

Objective

The objective is to convert SQL queries to JSON-izable parse trees. This originally targeted MySQL, but has grown to include other database engines. Please paste some SQL into a new issue if it does not work for you

Project Status

October 2024 - I continue to resolve issues as they are raised. There are over 1200 tests, that cover most SQL for most databases, with limited DML and UDF support, including:

  • inner queries,
  • with clauses,
  • window functions
  • create/drop/alter tables and views
  • insert/update/delete statements
  • create procedure and function statements (MySQL only)

Install

pip install mo-sql-parsing

Parsing SQL

>>> from mo_sql_parsing import parse
>>> parse("select count(1) from jobs")
{'select': {'value': {'count': 1}}, 'from': 'jobs'}

Each SQL query is parsed to an object: Each clause is assigned to an object property of the same name.

>>> parse("select a as hello, b as world from jobs")
{'select': [{'value': 'a', 'name': 'hello'}, {'value': 'b', 'name': 'world'}], 'from': 'jobs'}

The SELECT clause is an array of objects containing name and value properties.

SQL Flavours

There are a few parsing modes you may be interested in:

Double-quotes for literal strings

MySQL uses both double quotes and single quotes to declare literal strings. This is not ansi behaviour, but it is more forgiving for programmers coming from other languages. A specific parse function is provided:

result = parse_mysql(sql)

SQLServer Identifiers ([])

SQLServer uses square brackets to delimit identifiers. For example

SELECT [Timestamp] FROM [table]

which conflicts with BigQuery array constructor (eg [1, 2, 3, 4]). You may use the SqlServer flavour with

from mo_sql_parsing import parse_sqlserver as parse

NULL is None

The default output for this parser is to emit a null function {"null":{}} wherever NULL is encountered in the SQL. If you would like something different, you can replace nulls with None (or anything else for that matter):

result = parse(sql, null=None)

this has been implemented with a post-parse rewriting of the parse tree.

Normalized function call form

The default behaviour of the parser is to output function calls in simple_op format: The operator being a key in the object; {op: params}. This form can be difficult to work with because the object must be scanned for known operators, or possible optional arguments, or at least distinguished from a query object.

You can have the parser emit function calls in normal_op format

>>> from mo_sql_parsing import parse, normal_op
>>> parse("select trim(' ' from b+c)", calls=normal_op)

which produces calls in a normalized format

{"op": op, "args": args, "kwargs": kwargs}

here is the pretty-printed JSON from the example above:

{'select': {'value': {
    'op': 'trim', 
    'args': [{'op': 'add', 'args': ['b', 'c']}], 
    'kwargs': {'characters': {'literal': ' '}}
}}}

Generating SQL

You may also generate SQL from a given JSON document. This is done by the formatter, which is usually lagging the parser (Dec2023).

>>> from mo_sql_parsing import format
>>> format({"from":"test", "select":["a.b", "c"]})
'SELECT a.b, c FROM test'

Contributing

In the event that the parser is not working for you, you can help make this better but simply pasting your sql (or JSON) into a new issue. Extra points if you describe the problem. Even more points if you submit a PR with a test. If you also submit a fix, then you also have my gratitude.

Run Tests

See the tests directory for instructions running tests, or writing new ones.

More about implementation

SQL queries are translated to JSON objects: Each clause is assigned to an object property of the same name.

# SELECT * FROM dual WHERE a>b ORDER BY a+b
{
    "select": {"all_columns": {}} 
    "from": "dual", 
    "where": {"gt": ["a", "b"]}, 
    "orderby": {"value": {"add": ["a", "b"]}}
}

Expressions are also objects, but with only one property: The name of the operation, and the value holding (an array of) parameters for that operation.

{op: parameters}

and you can see this pattern in the previous example:

{"gt": ["a","b"]}

Array Programming

The mo-sql-parsing.scrub() method is used liberally throughout the code, and it "simplifies" the JSON. You may find this form a bit tedious to work with because the JSON property values can be values, lists of values, or missing. Please consider converting everything to arrays:

def listwrap(value):
    if value is None:
        return []
    elif isinstance(value, list)
        return value
    else:
        return [value]

then you may avoid all the is-it-a-list checks :

for select in listwrap(parsed_result.get('select')):
    do_something(select)

Version Changes, Features

Version 11

October 2024

The PIVOT clause has been promoted to top-level. Instead of being part of the joins found in the FROM clause, it is now a sibling to SELECT.

>>> from mo_sql_parsing import parse
>>> parse("SELECT * FROM table PIVOT (SUM(x) FOR y IN (1, 2, 3))")

now emits

{
    'select': {'all_columns': {}}, 
    'from': 'table', 
    'pivot': {'sum': 'x', 'for': 'y', 'in': [1, 2, 3]}
}

instead of

{
    'select': {'all_columns': {}}, 
    'from': [
        'table', 
        'pivot': {'sum': 'x', 'for': 'y', 'in': [1, 2, 3]}
    ]
}

Version 10

December 2023

SELECT * now emits an all_columns call instead of plain star (*).

>>> from mo_sql_parsing import parse
>>> parse("SELECT * FROM table")
{'select': {'all_columns': {}}, 'from': 'table'}

This works better with the except clause, and is more explicit when selecting all child properties.

>>> parse("SELECT a.* EXCEPT b FROM table")
>>> {"select": {"all_columns": "a", "except": "b"}, "from": "table"}

You may get the original behaviour by staying with version 9, or by using all_columns="*":

>>> parse("SELECT * FROM table", all_columns="*")
{'select': "*", 'from': 'table'}

Version 9

November 2022

Output for COUNT(DISTINCT x) has changed from function composition

{"count": {"distinct": x}}

to named parameters

{"count": x, "distinct": true}

This was part of a bug fix issue142 - realizing distinct is just one parameter of many in an aggregate function. Specifically, using the calls=normal_op for clarity:

>>> from mo_sql_parsing import parse, normal_op
>>> parse("select count(distinct x)", calls=normal_op)

{'select': {'value': {
    'op': 'count', 
    'args': [x], 
    'kwargs': {'distinct': True}
}}}

Version 8.200+

September 2022

  • Added ALTER TABLE and COPY command parsing for Snowflake

Version 8

November 2021

  • Prefer BigQuery [] (create array) over SQLServer [] (identity)

  • Added basic DML (INSERT/UPDATE/DELETE)

  • flatter CREATE TABLE structures. The option list in column definition has been flattened:
    Old column format

      {"create table": {
          "columns": {
              "name": "name",
              "type": {"decimal": [2, 3]},
              "option": [
                  "not null",
                  "check": {"lt": [{"length": "name"}, 10]}
              ]
          }
      }}
    

    New column format

      {"create table": {
          "columns": {
              "name": "name", 
              "type": {"decimal": [2, 3]}
              "nullable": False,
              "check": {"lt": [{"length": "name"}, 10]} 
          }
      }}
    

Version 7

October 2021

  • changed error reporting; still terrible
  • upgraded mo-parsing library which forced version change

Version 6

October 2021

  • fixed SELECT DISTINCT parsing
  • added DISTINCT ON parsing

Version 5

August 2021

  • remove inline module mo-parsing
  • support CREATE TABLE, add SQL "flavours" emit {null:{}} for None

Version 4

November 2021

  • changed parse result of SELECT DISTINCT
  • simpler ORDER BY clause in window functions

Project details


Release history Release notifications | RSS feed

Download files

Download the file for your platform. If you're not sure which to choose, learn more about installing packages.

Source Distribution

mo_sql_parsing-11.697.25301.tar.gz (47.1 kB view details)

Uploaded Source

Built Distribution

If you're not sure about the file name format, learn more about wheel file names.

mo_sql_parsing-11.697.25301-py3-none-any.whl (44.7 kB view details)

Uploaded Python 3

File details

Details for the file mo_sql_parsing-11.697.25301.tar.gz.

File metadata

  • Download URL: mo_sql_parsing-11.697.25301.tar.gz
  • Upload date:
  • Size: 47.1 kB
  • Tags: Source
  • Uploaded using Trusted Publishing? No
  • Uploaded via: twine/6.1.0 CPython/3.13.1

File hashes

Hashes for mo_sql_parsing-11.697.25301.tar.gz
Algorithm Hash digest
SHA256 364b62673c623a6bd6e4ece933b0521de8be0357212019371cf658b943e8081d
MD5 91feb4abf7de9c8cdb69e5edf7fc6c1e
BLAKE2b-256 a99d993e90623dcbc2d5a2e8c533e799cad0fb66e474d9078e7c767dc268ac44

See more details on using hashes here.

File details

Details for the file mo_sql_parsing-11.697.25301-py3-none-any.whl.

File metadata

File hashes

Hashes for mo_sql_parsing-11.697.25301-py3-none-any.whl
Algorithm Hash digest
SHA256 c1755550c73a29aeda0e6483ee208f660dd3fb25dfad0658464a69c352842a3e
MD5 cde90b5260f76aa89661c538bd28c629
BLAKE2b-256 1516379ea6c2b8e0ef1d6f0bdde0c4c051b2db374c8f23db81bb8b136a0a199a

See more details on using hashes here.

Supported by

AWS Cloud computing and Security Sponsor Datadog Monitoring Depot Continuous Integration Fastly CDN Google Download Analytics Pingdom Monitoring Sentry Error logging StatusPage Status page