Ultra fast correct context-aware "parsing" SAS code lexer

SAS Lexer

Ultra fast "correct" static context-aware parsing SAS code lexer.

Let me break it down for you:

  • How fast exactly?: On my MacBook M1 Pro 2021, I get a single-threaded throughput of ~180MB/s. That's about 10 million lines of real-world SAS code in 1-2 seconds! This is despite full Unicode support, context-awareness, and all the quirks of the SAS language.
  • What's the fuss with correctness & context-awareness?: SAS isn't just context-sensitive; it's environment-sensitive. To get 100% correct lexing for all possible programs, you actually need to execute the code in a SAS session. Yes, you read that right—it's environment-sensitive lexing. No joke. See below for more details.
  • What do you mean by "parsing" lexer?: The term might be my invention, but due to the unique nature of the SAS language, the lexer has to handle tasks that typically fall under parsing.
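As a quick sanity check on the throughput numbers above, here is a back-of-the-envelope calculation of the average line length they imply. It assumes "MB" means 10^6 bytes; the function name is just for illustration.

```python
# Back-of-the-envelope check of the throughput figures quoted above.
# Assumption: "MB" means 10**6 bytes.
THROUGHPUT_BYTES_PER_S = 180 * 1_000_000  # ~180 MB/s, as quoted
LINES = 10_000_000                        # 10 million lines, as quoted


def implied_bytes_per_line(seconds: float) -> float:
    """Average line length implied by lexing LINES lines in `seconds`."""
    return THROUGHPUT_BYTES_PER_S * seconds / LINES


low = implied_bytes_per_line(1.0)   # the whole corpus lexed in 1 second
high = implied_bytes_per_line(2.0)  # the whole corpus lexed in 2 seconds
print(f"{low:.0f}-{high:.0f} bytes per line on average")
```

In other words, the two figures are consistent with each other if the average line is roughly 18 to 36 bytes long.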

Lexer Features

  • Correctness: handles all known SAS language quirks and the rarest of edge cases to be as accurate as possible without executing the code. This includes a small number of heuristics, which should work for 99.99999% of cases.
  • Unicode Support: full support for Unicode characters in SAS code.
  • Parses Literals: supports and parses all numeric and string literals in SAS, including scientific notation, hex and decimal notation. And yes, it supports the Character Constants in Hexadecimal Notation, thanks for asking!
  • Ridiculously Fast: leverages cutting-edge techniques for performance. Even the Python version still clocks in at 1-2 million lines per second on a single thread.
  • Error detection & recovery: a number of coding errors are detected, reported and sometimes recovered from. See error.rs for the full list.
  • Test coverage: 2000+ meticulously hand-crafted test cases give a very high level of confidence in correctness.
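To give a feel for what "parsing literals" involves, here is a minimal Python sketch of the literal forms mentioned above: decimal and scientific notation, hexadecimal numeric constants such as 0FFx, and character constants in hexadecimal notation such as '534153'x. This is not the lexer's implementation. The helper names parse_numeric and parse_char_hex are made up for illustration, and the sketch covers only a subset of what SAS accepts.

```python
import re

# A hex numeric constant starts with a decimal digit and ends with X.
_HEX_NUM = re.compile(r"^[0-9][0-9A-Fa-f]*[xX]$")
# Unsigned decimal or scientific notation (subset of what SAS accepts).
_DEC_NUM = re.compile(r"^(\d+\.?\d*|\.\d+)([eE][+-]?\d+)?$")
# A quoted run of hex digit pairs followed by X.
_CHAR_HEX = re.compile(r"""^(['"])((?:[0-9A-Fa-f]{2})*)\1[xX]$""")


def parse_numeric(text: str) -> float:
    """Parse a decimal, scientific or hex numeric literal (hypothetical helper)."""
    if _HEX_NUM.match(text):
        # Drop the trailing X and read the rest as base 16.
        return float(int(text[:-1], 16))
    if _DEC_NUM.match(text):
        return float(text)
    raise ValueError(f"not a numeric literal: {text!r}")


def parse_char_hex(text: str) -> bytes:
    """Decode a character constant in hex notation to raw bytes.

    Bytes are returned as-is because turning them into text depends
    on the session encoding, which a static tool cannot know.
    """
    match = _CHAR_HEX.match(text)
    if match is None:
        raise ValueError(f"not a hex character constant: {text!r}")
    return bytes.fromhex(match.group(2))


print(parse_numeric("1.5e3"))       # scientific notation
print(parse_numeric("0FFx"))        # hex numeric constant
print(parse_char_hex("'534153'x"))  # character constant in hex notation
```

Note that the hex check has to come first: 1e3x is a hex constant, while 1e3 is scientific notation.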

Available in two flavors:

  • Rust Crate: A high-performance Rust crate for efficient SAS language lexing.
  • Python Bindings: Easy-to-use Python package with bindings for seamless integration with Python projects.

Heuristics, limitations and known deviations from the SAS engine

The key limitation is that the lexer is static, meaning it does not execute the code. It is possible to write SAS code that cannot be statically tokenized the same way the SAS scanner would tokenize it. Hence the need for some heuristics. However, you're unlikely to run into these limitations in practice.

  • The lexer supports files up to 4GB in size. For those of you with 5GB SAS programs, well, I am sorry...
  • String expressions and literals in macro text expressions are lexed as in open code, although SAS lexes them as plain text, verbatim, and interprets them later at the call site. E.g. %let v='01jan87'd; will lex '01jan87'd as a DateLiteral token instead of MacroString.
  • Parentheses following a macro identifier are always assumed to be part of the macro call, as the lexer is not environment-aware. See below for more details.
  • Trailing whitespace is insignificant in macro strings, but is not stripped by the lexer in all contexts. For example, %mcall( arg value ) will have a MacroString token with the text "arg value " (note the trailing space).
  • Numeric formats are not lexed as a separate token, as they are indistinguishable from numeric literals and/or column references and require context to interpret.
  • A SAS session skips the entire macro definition (including the body) on pretty much any error. For example, %macro $bad will cause whatever follows up to %mend to be skipped. The lexer does not do this, and will try to recover and continue lexing.
  • Lexer recovery sometimes goes beyond what the SAS engine does. For instance, both SAS and this lexer will recover from a missing = in %let a 1; but SAS will not recover from a missing ) in %macro a(a=1; while this lexer will.
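If you feed the lexer files of unknown origin, a pre-flight size check is cheap. The sketch below is an assumption on my part, not part of the package: it reads "4GB" as "byte offsets must fit in 32 bits", and the names MAX_SOURCE_BYTES, fits_lexer_limit and file_fits_lexer_limit are hypothetical.

```python
import os

# Assumption: the 4GB limit means byte offsets must fit in 32 bits.
# The exact threshold is defined by the lexer itself, not by this sketch.
MAX_SOURCE_BYTES = 2**32 - 1


def fits_lexer_limit(size_in_bytes: int) -> bool:
    """Return True if a source of this size is within the assumed limit."""
    return 0 <= size_in_bytes <= MAX_SOURCE_BYTES


def file_fits_lexer_limit(path: str) -> bool:
    """Check a file on disk without reading it into memory."""
    return fits_lexer_limit(os.path.getsize(path))


print(fits_lexer_limit(50 * 1024))   # a typical program
print(fits_lexer_limit(5 * 2**30))   # the unlucky 5GB program
```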

Keyword Token Types

SAS has thousands of keywords, and none of them are reserved. Fans of columns named when, rejoice: you can finally execute SQL that looks like this: select case when when = 42 then then else else end from table!

Thus the selection of keywords that are lexed as a dedicated token type vs. as an identifier is somewhat arbitrary and based on personal experience of writing parsers for SAS code.

Getting Started

Installation

You can add the Rust crate as a dependency via Cargo:

cargo add sas-lexer

For Python, install the package using pip:

pip install sas-lexer

Usage (Rust)

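use sas_lexer::{lex_program, LexResult, TokenIdx};

fn main() {
    let source = "data mydata; set mydataset; run;";

    // Lex the whole program; `unwrap` panics if lexing returns an error.
    // Only the token buffer is used here, other result fields are ignored.
    let LexResult { buffer, .. } = lex_program(&source).unwrap();

    // Collect the indices of all tokens in the buffer.
    let tokens: Vec<TokenIdx> = buffer.into_iter().collect();

    // Look up and print the raw source text of each token.
    for token in tokens {
        println!("{:?}", buffer.get_token_raw_text(token, &source));
    }
}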

Crate Features

  • macro_sep: Enables a special virtual MacroSep token that is emitted between open code and macro statements when there is no "natural" separator, or when a semicolon is missing between two macro statements (a coding error). A downstream parser may use it as a reliable terminating token for dynamic open code and thus avoid lookaheads. Here, "dynamic" means that the statement contains macro statements, as in data %if cond %then %do; t1 %end; %else %do; t2 %end;;
  • serde: Enables serialization and deserialization of the ResolvedTokenInfo struct using the serde library. For an example of usage, see the Python bindings crate sas-lexer-py.
  • opti_stats: Enables some additional statistics during lexing, used for performance tuning. Not intended for general use.
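To illustrate why a virtual separator helps a downstream parser, here is a small Python sketch that splits a token stream into statements by treating a MacroSep token exactly like a semicolon, with no lookahead. Everything in it is hypothetical: the (type, text) tuple shape, the type names other than MacroSep, and the hand-written token stream, which is not actual lexer output. Where exactly MacroSep is emitted is decided by the crate.

```python
from typing import Iterable, List, Tuple

# Hypothetical token shape: (token type name, raw text).
Token = Tuple[str, str]

# Token types that end a statement. "Semi" is a made-up name for ";".
TERMINATORS = {"Semi", "MacroSep"}


def split_statements(tokens: Iterable[Token]) -> List[List[Token]]:
    """Group tokens into statements, closing one at every terminator."""
    statements: List[List[Token]] = []
    current: List[Token] = []
    for token in tokens:
        if token[0] in TERMINATORS:
            # A real or virtual separator closes the current statement.
            if current:
                statements.append(current)
                current = []
        else:
            current.append(token)
    if current:
        # Trailing tokens without a terminator still form a statement.
        statements.append(current)
    return statements


# Hand-written stream for: %let a=1 %let b=2;
# (the semicolon after the first statement is missing, a coding error)
stream = [
    ("MacroKeyword", "%let"), ("Text", "a"), ("Assign", "="), ("Text", "1"),
    ("MacroSep", ""),  # virtual token: it has no source text of its own
    ("MacroKeyword", "%let"), ("Text", "b"), ("Assign", "="), ("Text", "2"),
    ("Semi", ";"),
]

statements = split_statements(stream)
for statement in statements:
    print(" ".join(text for _, text in statement))
```

Without the virtual token, the parser would have to peek ahead for the second %let to realize that the first statement had ended.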

Usage (Python)

from sas_lexer import lex_program_from_str

tokens, errors, str_lit_buf = lex_program_from_str(
    "data mydata; set mydataset; run;"
)

for token in tokens:
    print(token)

Let's talk about SAS

Whether it is because the Dragon Book had not been published when the language was conceived, or due to the deep and unwavering love of its users, the SAS language allows for almost anything, except perhaps brewing your coffee in the morning. Although, I wouldn't be surprised if that turned out to be another undocumented feature.

If you think I am exaggerating, read on.

THIS SECTION IS WIP. PLANNED CONTENT:

  • Integer literals with inline comments
  • Fun with macro mnemonics and "null" strings in expressions
  • Statements inside macro/function call arguments, string expressions and comments
  • Total ambiguity of numeric formats
  • Environment-dependent lexing: parentheses following a macro identifier
  • Macro call arguments starting with =
  • Context-aware masking of ',' in macro call arguments and discrepancies between sister functions
  • %sysfunc/%syscall function aware lexing
  • String literals that "hide" semicolon from macro but are not string literals
  • Star comments that sometimes disable macro processing and sometimes not

Motivation

Why build a modern lexer specifically for the SAS language? Mostly for fun! SAS is possibly the most complicated programming language for static parsing in the world. I have worked with it for many years as part of my day job, which eventually included a transpiler from SAS to PySpark. I wanted to see how fast a complex context-aware lexer can theoretically be, and SAS seemed like a perfect candidate for this experiment.

License

This project is licensed under the AGPL-3.0. If you are interested in using the lexer for commercial purposes, please reach out to me for further discussion.

Contributing

We welcome contributions in the form of issues, feature requests, and feedback! However, due to licensing complexities, we are not currently accepting pull requests. Please feel free to open an issue for any proposals or suggestions.

Acknowledgments

  • The lexer is inspired by the Carbon language parser, particularly as described in the talk "Modernizing Compiler Design for Carbon Toolchain" by Chandler Carruth at CppNow 2023. You can find the talk here.
  • The Cargo benchmark and an end-to-end test use SAS code from the SAS Enlighten Apply GitHub repository, which is licensed under Apache-2.0. The code is included in the tests directory without modifications.
  • The Python package utilizes the amazing msgspec library for (de)serialization, which is licensed under BSD-3-Clause.
