SAS Lexer

Ultra fast "correct" static context-aware parsing SAS code lexer.

Let me break it down for you:

  • How fast exactly?: On my MacBook M1 Pro 2021, I get a single-threaded throughput of ~180MB/s. That's about 10 million lines of real-world SAS code in 1-2 seconds! This is despite full Unicode support, context-awareness, and all the quirks of the SAS language.
  • What's the fuss with correctness & context-awareness?: SAS isn't just context-sensitive; it's environment-sensitive. To get 100% correct lexing for all possible programs, you actually need to execute the code in a SAS session. Yes, you read that right—it's environment-sensitive lexing. No joke. See below for more details.
  • What do you mean by "parsing" lexer?: The term might be my invention, but due to the unique nature of the SAS language, the lexer has to handle tasks that typically fall under parsing.
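The throughput figure can be sanity-checked with a little arithmetic. The sketch below assumes 1 MB means 10^6 bytes and treats the "1-2 seconds" range as exact bounds; the helper name is made up for illustration and is not part of the package.

```python
def implied_bytes_per_line(throughput_mb_s: float, lines: int, seconds: float) -> float:
    """Average line length in bytes implied by a throughput, a line count and a duration."""
    # Assumption: 1 MB = 1_000_000 bytes.
    total_bytes = throughput_mb_s * 1_000_000 * seconds
    return total_bytes / lines


# 10 million lines at ~180 MB/s, lexed in 1 s and in 2 s.
for seconds in (1.0, 2.0):
    avg = implied_bytes_per_line(180, 10_000_000, seconds)
    print(f"{seconds:.0f} s -> about {avg:.0f} bytes per line")
```

Under these assumptions the quoted numbers correspond to an average line length of roughly 18 to 36 bytes.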

Lexer Features

  • Correctness: handles all known SAS language quirks and the rarest of edge cases to be as accurate as possible without executing the code. This includes a small number of heuristics, which should work for 99.99999% of cases.
  • Unicode Support: full support for Unicode characters in SAS code.
  • Parses Literals: supports and parses all numeric and string literals in SAS, including scientific notation, hex and decimal notation. And yes, it supports the Character Constants in Hexadecimal Notation, thanks for asking!
  • Ridiculously Fast: leverages cutting-edge techniques for performance. Even the Python version still clocks in at 1-2 million lines per second on a single thread.
  • Error detection & recovery: a number of coding errors are detected, reported and sometimes recovered from. See error.rs for the full list.
  • Test coverage: with 2000+ meticulously manually crafted test cases, the lexer has a very high level of confidence in correctness.
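For readers unfamiliar with the hexadecimal constants mentioned above, here is a minimal Python sketch of what such literals denote. It is an illustration only, not the lexer's implementation: the function names are hypothetical, only the plain forms are handled (a quoted run of hex digit pairs followed by x, and a numeric constant that starts with a digit and ends with x), and the character constant is returned as raw bytes because the resulting text depends on the session encoding.

```python
import re

# Plain forms only: 'HHHH...'x or "HHHH..."x, and digits-first hex followed by x.
_HEX_CHAR = re.compile(r"^(['\"])((?:[0-9A-Fa-f]{2})*)\1[xX]$")
_HEX_NUM = re.compile(r"^([0-9][0-9A-Fa-f]*)[xX]$")


def decode_hex_char_constant(literal: str) -> bytes:
    """Return the raw bytes denoted by a character constant in hex notation."""
    match = _HEX_CHAR.match(literal)
    if match is None:
        raise ValueError(f"not a hex character constant: {literal!r}")
    return bytes.fromhex(match.group(2))


def decode_hex_numeric_constant(literal: str) -> int:
    """Return the integer denoted by a numeric constant in hex notation."""
    match = _HEX_NUM.match(literal)
    if match is None:
        raise ValueError(f"not a hex numeric constant: {literal!r}")
    return int(match.group(1), 16)


print(decode_hex_char_constant("'534153'x"))  # b'SAS'
print(decode_hex_numeric_constant("0FFx"))    # 255
```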

Available in two flavors:

  • Rust Crate: A high-performance Rust crate for efficient SAS language lexing.
  • Python Bindings: Easy-to-use Python package with bindings for seamless integration with Python projects.

Heuristics, limitations and known deviations from the SAS engine

The key limitation is that the lexer is static, meaning it does not execute the code. One can write SAS code that is impossible to tokenize statically the same way the SAS scanner would. Hence the need for some heuristics. However, you're unlikely to run into these limitations in practice.

  • The lexer supports files up to 4GB in size. For those of you with 5GB SAS programs, well, I am sorry...
  • String expressions and literals in macro text expressions are lexed as in open code, although SAS lexes them as plain text, verbatim, and only later interprets them at the call site. E.g. %let v='01jan87'd; will lex '01jan87'd as a DateLiteral token instead of MacroString.
  • Parentheses following a macro identifier are always assumed to be part of the macro call, as the lexer is not environment-aware. See below for more details.
  • Trailing whitespace is insignificant in macro strings, but is not stripped by the lexer in all contexts. For example, %mcall( arg value ) will have a MacroString token with the text arg value .
  • Numeric formats are not lexed as a separate token, as they are indistinguishable from numeric literals and/or column references and require context to interpret.
  • A SAS session skips the entire macro definition (including the body) on pretty much any error. For example, %macro $bad will cause whatever follows, up to %mend, to be skipped. The lexer does not do this; it tries to recover and continue lexing.
  • Lexer recovery sometimes goes beyond what the SAS engine does. For instance, both SAS and this lexer will recover from a missing = in %let a 1; but SAS will not recover from a missing ) in %macro a(a=1; , while this lexer will.
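Two of the limitations above are easy to work around in calling code. The following is a hedged sketch with hypothetical helper names, none of which come from the package. It assumes "4GB" means 2**32 bytes (the exact boundary the lexer enforces may differ) and that a downstream consumer is free to drop trailing whitespace from MacroString text, since it is insignificant.

```python
import os

# Assumption: the "4GB" limit is taken to be 2**32 bytes; the real boundary may differ.
ASSUMED_MAX_SOURCE_BYTES = 2**32


def fits_assumed_limit(num_bytes: int, limit: int = ASSUMED_MAX_SOURCE_BYTES) -> bool:
    """Check a source size against the assumed limit before handing it to the lexer."""
    return 0 <= num_bytes < limit


def file_fits_assumed_limit(path: str) -> bool:
    """Same check, using the on-disk size of a file."""
    return fits_assumed_limit(os.path.getsize(path))


def normalize_macro_string(text: str) -> str:
    """Drop trailing whitespace that the lexer leaves in MacroString text."""
    return text.rstrip()


print(fits_assumed_limit(5 * 1024**3))            # a 5GB program does not fit
print(repr(normalize_macro_string("arg value ")))
```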

Keyword Token Types

SAS has thousands of keywords, and none of them are reserved. All fans of columns named when, rejoice: you can finally execute SQL that looks like this: select case when when = 42 then then else else end from table!

Thus the selection of keywords that are lexed as a dedicated token type vs. as an identifier is somewhat arbitrary and based on personal experience of writing parsers for SAS code.

Getting Started

Installation

You can add the Rust crate as a dependency via Cargo:

cargo add sas-lexer

For Python, install the package using pip:

pip install sas-lexer

Usage (Rust)

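use sas_lexer::{lex_program, LexResult, TokenIdx};

fn main() {
    let source = "data mydata; set mydataset; run;";

    // Lex the whole program; only the token buffer is needed here.
    let LexResult { buffer, .. } = lex_program(&source).unwrap();

    // Collect the index of every token in the buffer.
    let tokens: Vec<TokenIdx> = buffer.iter_tokens().collect();

    // Print the raw source text lookup result for each token.
    for token in tokens {
        println!("{:?}", buffer.get_token_raw_text(token, &source));
    }
}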

Crate Features

  • macro_sep: Enables a special virtual MacroSep token that is emitted between open code and macro statements when there is no "natural" separator, or when a semicolon is missing between two macro statements (a coding error). A downstream parser may use it as a reliable terminating token for dynamic open code and thus avoid lookaheads. Dynamic means that the statement has macro statements in it, like data %if cond %then %do; t1 %end; %else %do; t2 %end;;
  • serde: Enables serialization and deserialization of the ResolvedTokenInfo struct using the serde library. For an example of usage, see the Python bindings crate sas-lexer-py.
  • opti_stats: Enables some additional statistics during lexing, used for performance tuning. Not intended for general use.

Usage (Python)

from sas_lexer import lex_program_from_str

tokens, errors, str_lit_buf = lex_program_from_str(
    "data mydata; set mydataset; run;"
)

for token in tokens:
    print(token)

Let's talk about SAS

Whether it is because the Dragon Book had not been published when the language was conceived, or due to the deep and unwavering love of its users, the SAS language allows for almost anything, except perhaps brewing your coffee in the morning. Although, I wouldn't be surprised if that turned out to be another undocumented feature.

If you think I am exaggerating, read on.

THIS SECTION IS WIP. PLANNED CONTENT:

  • Integer literals with inline comments
  • Fun with macro mnemonics and "null" strings in expressions
  • Statements inside macro/function call arguments, string expressions and comments
  • Total ambiguity of numeric formats
  • Environment-dependent lexing: parenthesis following macro identifier
  • Macro call arguments starting with =
  • Context-aware masking of ',' in macro call arguments and discrepancies between sister functions
  • %sysfunc/%syscall function aware lexing
  • String literals that "hide" semicolon from macro but are not string literals
  • Star comments that sometimes disable macro processing and sometimes not

Motivation

Why build a modern lexer specifically for the SAS language? Mostly for fun! SAS is possibly the most complicated programming language for static parsing in the world. I have worked with it for many years as part of my day job, which eventually included a transpiler from SAS to PySpark. I wanted to see how fast a complex context-aware lexer can theoretically be, and SAS seemed like a perfect candidate for this experiment.

License

This project is licensed under the AGPL-3.0. If you are interested in using the lexer for commercial purposes, please reach out to me for further discussion.

Contributing

We welcome contributions in the form of issues, feature requests, and feedback! However, due to licensing complexities, we are not currently accepting pull requests. Please feel free to open an issue for any proposals or suggestions.

Acknowledgments

  • The lexer is inspired by the Carbon language parser, particularly as described in the talk "Modernizing Compiler Design for Carbon Toolchain" by Chandler Carruth at CppNow 2023. You can find the talk here.
  • Cargo benchmark and an end-2-end test use SAS code from the SAS Enlighten Apply GitHub repository, which is licensed under Apache-2.0. The code is included in the tests directory without modifications.
  • The Python package utilizes the amazing msgspec library for (de)serialization, which is licensed under BSD-3-Clause.
