Python-based statistical scripting language with Jupyter notebook support
StatLang
An open-source, Python-based statistical scripting language
Write and run statistical scripts with full syntax highlighting and a Python backend.
Overview
StatLang provides an open-source environment for statistical analysis by offering:
- Expressive scripting syntax for data manipulation and analysis
- Python backend for execution and performance
- Jupyter notebook support with a StatLang kernel
- VS Code extension with syntax highlighting and execution
- Cross-platform compatibility (Windows, macOS, Linux)
- Open source and free to use
What Makes StatLang Special?
- AI Integration: Built-in PROC LANGUAGE with LLM capabilities for intelligent data analysis
- Complete ML Pipeline: From data exploration to model deployment using familiar, concise syntax
- Deep Learning: PyTorch-powered DNN training, NLP, computer vision (including object detection), and reinforcement learning
- Modern SQL: PROC SQL powered by DuckDB for high-performance data querying
- Robust language features: Macro system, format system, and 38+ statistical/ML procedures
- Rich Visualizations: Professional output formatting with TITLE statements and structured results
Features
Core Interpreter
- DATA step with MERGE, ARRAY, RETAIN, DO loops (iterative/while/until), FIRST./LAST., LAG/DIF
- INFILE/FILE I/O, INPUT parsing, and PUT output
- DATALINES/CARDS for inline data
- Subsetting IF and conditional IF/THEN/ELSE
- Row-by-row and vectorised execution paths
- Python pandas/numpy backend for performance
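To make the BY-group and lag features concrete, the sketch below mimics in plain Python what FIRST./LAST. flags and LAG/DIF values conventionally mean in a SAS-style DATA step. It assumes StatLang follows those conventional semantics (LAG carries the previous row's value regardless of group; DIF is the current value minus LAG). The function name `annotate_rows` and the sample rows are hypothetical; this is an illustration, not StatLang's implementation.

```python
from typing import Any


def annotate_rows(rows: list[dict[str, Any]], by: str, var: str) -> list[dict[str, Any]]:
    """Add first/last BY-group flags plus lag and dif of `var` to rows sorted by `by`."""
    out = []
    prev_value = None  # value of `var` on the previous row (None stands for missing)
    for i, row in enumerate(rows):
        # A row is first/last in its group when the neighbouring row has a different key.
        is_first = i == 0 or rows[i - 1][by] != row[by]
        is_last = i == len(rows) - 1 or rows[i + 1][by] != row[by]
        lag = prev_value
        dif = None if lag is None else row[var] - lag
        out.append({**row, "first": is_first, "last": is_last, "lag": lag, "dif": dif})
        prev_value = row[var]
    return out


# Hypothetical sample data, already sorted by the BY variable.
rows = [
    {"dept": "Eng", "salary": 75000},
    {"dept": "Eng", "salary": 80000},
    {"dept": "Sales", "salary": 45000},
]
annotated = annotate_rows(rows, by="dept", var="salary")
for r in annotated:
    print(r)
```

The input must already be sorted by the BY variable, which is why such steps are usually preceded by PROC SORT.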
Macro System
- %MACRO/%MEND definitions with parameter lists
- %LET, %PUT, &var substitution
- %IF/%THEN/%ELSE, %DO/%END control flow
- %INCLUDE file injection (recursive with depth limit)
- %SYSEVALF arithmetic, %SYSFUNC (30+ built-in functions)
- %GLOBAL/%LOCAL scoping
- System variables: &SYSDATE9, &SYSLAST, &SYSCC, &SYSJOBID
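As a rough illustration of what %LET and &var substitution do, the following Python sketch collects %LET assignments and textually replaces &name references before the code would be executed. It is a simplified model written for this README, not StatLang's macro engine: the function name `resolve_macro_vars` is hypothetical, and scoping, %IF/%DO, and %SYSFUNC are deliberately ignored.

```python
import re


def resolve_macro_vars(source: str) -> str:
    """Collect %LET assignments, then replace &name references with their values."""
    symbols: dict[str, str] = {}
    body_lines = []
    for line in source.splitlines():
        # A line of the form "%LET name = value;" defines a macro variable.
        match = re.match(r"\s*%let\s+(\w+)\s*=\s*(.*?)\s*;\s*$", line, flags=re.IGNORECASE)
        if match:
            symbols[match.group(1).lower()] = match.group(2)
        else:
            body_lines.append(line)

    def substitute(m: re.Match) -> str:
        # Leave unknown references untouched rather than guessing a value.
        return symbols.get(m.group(1).lower(), m.group(0))

    return re.sub(r"&(\w+)", substitute, "\n".join(body_lines))


source = """%LET target = spend;
%LET features = age income;
model &target = &features;"""
resolved = resolve_macro_vars(source)
print(resolved)  # model spend = age income;
```

The point of the sketch is that macro variables are resolved as text before any statement runs, which is what makes %LET-driven configuration possible.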
Model Store and Pipeline
- In-memory model store with optional pickle persistence
- Save, load, list, and delete trained models across procedures
- run_pipeline() for end-to-end .statlang file execution
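The idea of an in-memory store with optional pickle persistence can be sketched as follows. This is a minimal illustration under stated assumptions, not the code in model_store.py: the class name `ModelStore`, its method names, and the one-file-per-model `.pkl` layout are all hypothetical.

```python
import pickle
import tempfile
from pathlib import Path
from typing import Any, Optional


class ModelStore:
    """Hypothetical in-memory model registry with optional pickle persistence."""

    def __init__(self, persist_dir: Optional[Path] = None) -> None:
        self._models: dict[str, Any] = {}
        self._persist_dir = persist_dir
        if persist_dir is not None:
            persist_dir.mkdir(parents=True, exist_ok=True)

    def save(self, name: str, model: Any) -> None:
        self._models[name] = model
        if self._persist_dir is not None:
            with open(self._persist_dir / f"{name}.pkl", "wb") as fh:
                pickle.dump(model, fh)

    def load(self, name: str) -> Any:
        # Fall back to disk when the model is not in memory yet.
        if name not in self._models and self._persist_dir is not None:
            path = self._persist_dir / f"{name}.pkl"
            if path.exists():
                with open(path, "rb") as fh:
                    self._models[name] = pickle.load(fh)
        return self._models[name]  # raises KeyError if the model is unknown

    def list_models(self) -> list[str]:
        return sorted(self._models)

    def delete(self, name: str) -> None:
        self._models.pop(name, None)
        if self._persist_dir is not None:
            (self._persist_dir / f"{name}.pkl").unlink(missing_ok=True)


# A second store instance pointed at the same directory can reload the model.
with tempfile.TemporaryDirectory() as tmp:
    store = ModelStore(Path(tmp))
    store.save("shape_detector", {"weights": [0.1, 0.2]})
    reloaded = ModelStore(Path(tmp)).load("shape_detector")
    print(reloaded)
```

Because unpickling can execute arbitrary code, persisted model files of this kind should only be loaded from trusted locations.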
Jupyter Notebook Support
- StatLang kernel for Jupyter notebooks
- Interactive statistical programming in notebook environment
- Rich output display with formatted tables
- Dataset visualisation and exploration
VS Code Extension
- Syntax highlighting for .statlang files
- Code snippets for common statistical analysis patterns
- File execution directly from VS Code
- Notebook support for interactive analysis
Supported Procedures
Statistical Procedures
| Procedure | Description |
|---|---|
| PROC MEANS | Descriptive statistics with CLASS variables and OUTPUT |
| PROC FREQ | Frequency tables and cross-tabulations |
| PROC SORT | Data sorting with ascending/descending order |
| PROC PRINT | Data display and formatting |
| PROC REG | Linear regression with MODEL, OUTPUT, and SCORE |
| PROC UNIVARIATE | Detailed univariate analysis with distribution diagnostics |
| PROC CORR | Correlation analysis (Pearson, Spearman) |
| PROC FACTOR | Principal component and factor analysis |
| PROC CLUSTER | Clustering methods (k-means, hierarchical) |
| PROC NPAR1WAY | Nonparametric tests (Mann-Whitney, Kruskal-Wallis) |
| PROC TTEST | T-tests (independent and paired) |
| PROC LOGIT | Logistic regression |
| PROC TIMESERIES | Time series analysis and seasonal decomposition |
| PROC SURVEYSELECT | Random sampling (SRS, SAMPRATE/N, OUTALL) |
| PROC GLM | General Linear Models via statsmodels (Type III ANOVA) |
| PROC ANOVA | Balanced Analysis of Variance |
| PROC GENMOD | Generalised Linear Models (Gaussian, Binomial, Poisson, Gamma) |
| PROC MIXED | Mixed / multilevel models (random intercepts & slopes) |
| PROC ROBUSTREG | Robust regression (M-estimation via RLM) |
| PROC LIFEREG | Parametric survival (Weibull, Log-Normal, Log-Logistic AFT) |
| PROC PHREG | Cox proportional hazards regression |
| PROC DISCRIM | Discriminant analysis (LDA / QDA) |
| PROC PRINCOMP | Principal Component Analysis with StandardScaler |
Machine Learning Procedures
| Procedure | Description |
|---|---|
| PROC TREE | Decision trees for classification and regression |
| PROC FOREST | Random forests for ensemble learning |
| PROC BOOST | Gradient boosting |
| PROC DNN | PyTorch feedforward neural networks (classification & regression) |
| PROC NLP | HuggingFace NLP (sentiment, classification, NER, summarisation) |
| PROC CVISION | Image classification (ResNet, VGG) and Faster R-CNN object detection |
| PROC RL | Tabular Q-learning for reinforcement learning |
| PROC LLM | Text generation, fill-mask, and QA via HuggingFace |
Data Management Procedures
| Procedure | Description |
|---|---|
| PROC TRANSPOSE | Reshape data (wide / long) with BY group support |
| PROC APPEND | Concatenate datasets with FORCE option |
| PROC DATASETS | Delete, rename, and list datasets |
| PROC EXPORT | Export to CSV, Excel, JSON, Parquet |
| PROC IMPORT | Import from CSV, Excel, JSON, Parquet |
| PROC SQL | SQL query processing with DuckDB backend |
| PROC LANGUAGE | LLM-powered text generation, Q&A, and data analysis |
Installation
Python Package
# Core statistical procedures
pip install statlang
# With deep learning (PROC DNN, PROC CVISION, PROC RL)
pip install statlang[dl]
# With NLP (PROC NLP, PROC LLM)
pip install statlang[nlp]
# With DuckDB SQL engine (PROC SQL)
pip install statlang[sql]
# With Jupyter notebook support
pip install statlang[notebook]
# Everything
pip install statlang[all]
Jupyter Kernel Installation
# Install the StatLang kernel
python -m statlang.kernel install
# List available kernels
jupyter kernelspec list
VS Code Extension
- Install from VS Code Marketplace: "StatLang" by RyanBlakeStory
- Or install from source (see Development section)
Quick Start
1. Interactive Python Usage
from statlang import StatLangInterpreter
# Create interpreter
interpreter = StatLangInterpreter()
# Create sample data using StatLang syntax
interpreter.run_code('''
data work.employees;
input employee_id name $ department $ salary;
datalines;
1 Alice Engineering 75000
2 Bob Marketing 55000
3 Carol Engineering 80000
4 David Sales 45000
;
run;
''')
# Run statistical analysis
interpreter.run_code('''
proc means data=work.employees;
class department;
var salary;
run;
''')
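For readers unfamiliar with SAS-style procedures, the grouped summary requested above (salary statistics per department) corresponds roughly to the plain-Python computation below. It is only an illustration of the idea behind CLASS and VAR; the statistics StatLang prints by default and its output layout are not shown here, and the variable names are chosen for this example.

```python
from collections import defaultdict
from statistics import mean

# Same rows as the DATALINES block above: id, name, department, salary.
employees = [
    (1, "Alice", "Engineering", 75000),
    (2, "Bob", "Marketing", 55000),
    (3, "Carol", "Engineering", 80000),
    (4, "David", "Sales", 45000),
]

# CLASS department: group the analysis variable by department.
by_department = defaultdict(list)
for _id, _name, department, salary in employees:
    by_department[department].append(salary)

# VAR salary: compute a few descriptive statistics per group.
summary = {
    dept: {"n": len(vals), "mean": mean(vals), "min": min(vals), "max": max(vals)}
    for dept, vals in sorted(by_department.items())
}

for dept, stats in summary.items():
    print(dept, stats)
```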
2. Macro-Powered Pipeline
%LET target = spend;
%LET features = age income;
%macro train_and_evaluate(depvar, indepvars);
proc reg data=work.train;
model &depvar = &indepvars;
output out=work.results p=predicted r=residuals;
run;
proc means data=work.results mean;
var residuals;
run;
%mend;
%train_and_evaluate(&target, &features);
3. Object Detection (Deep Learning)
/* Generate synthetic training data */
proc cvision mode=generate_samples out=annotations
n_train=30 n_test=10 img_size=128 seed=42;
run;
/* Fine-tune Faster R-CNN */
proc cvision data=train_annot mode=train_detect
model_name=shape_detector epochs=5 lr=0.005;
image image_path;
run;
/* Score new images with the trained model */
proc cvision data=test_images mode=serve out=detections
model_name=shape_detector confidence=0.5;
image image_path;
run;
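The confidence=0.5 option in the serving step is assumed here to act as a minimum score threshold on predicted boxes, which is the usual convention for object detectors. Under that assumption, the filtering can be sketched in Python as below; the `Detection` class, the `filter_by_confidence` helper, and the sample values are hypothetical and do not come from StatLang.

```python
from dataclasses import dataclass


@dataclass
class Detection:
    label: str
    score: float  # model confidence in [0, 1]
    box: tuple[float, float, float, float]  # (x_min, y_min, x_max, y_max) in pixels


def filter_by_confidence(detections: list[Detection], threshold: float) -> list[Detection]:
    """Keep only detections whose score is at least `threshold`."""
    return [d for d in detections if d.score >= threshold]


# Made-up raw predictions for a single 128x128 image.
raw = [
    Detection("circle", 0.91, (10.0, 12.0, 50.0, 52.0)),
    Detection("square", 0.42, (60.0, 60.0, 100.0, 100.0)),
    Detection("triangle", 0.50, (20.0, 70.0, 55.0, 110.0)),
]
kept = filter_by_confidence(raw, threshold=0.5)
print([d.label for d in kept])  # ['circle', 'triangle']
```

Raising the threshold trades recall for precision: fewer boxes are returned, but the ones that remain are more likely to be correct.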
4. Jupyter Notebook Usage
- Install the StatLang kernel:
python -m statlang.kernel install
- Create a new Jupyter notebook (.ipynb)
- Select "statlang" as the kernel
- Write StatLang code in cells and execute
5. VS Code Usage
- Install the StatLang extension from the marketplace
- Create a new file with the .statlang extension
- Write your StatLang code
- Use Ctrl+Shift+P > "StatLang: Run File" to execute
6. Command Line Usage
# Run StatLang code from file
python -m statlang.cli run example.statlang
# Interactive mode
python -m statlang.cli interactive
Examples & Demos
ML Regression Project
ML Project Demo - A comprehensive machine learning workflow:
- Synthetic dataset creation with 30 observations
- PROC UNIVARIATE for distribution analysis
- PROC SURVEYSELECT for train/test splitting (70/30)
- PROC REG with MODEL, OUTPUT, and SCORE statements
- Macro-based reusable analysis functions
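The 70/30 split in this demo amounts to simple random sampling without replacement. A minimal Python sketch of that idea follows; it illustrates the sampling concept rather than what PROC SURVEYSELECT does internally, and the function name `split_train_test` and the seed value are hypothetical.

```python
import random


def split_train_test(n_obs: int, rate: float, seed: int) -> tuple[list[int], list[int]]:
    """Simple random sample without replacement: `rate` of rows to train, the rest to test."""
    rng = random.Random(seed)  # seeded generator so the split is reproducible
    n_train = round(n_obs * rate)
    train = sorted(rng.sample(range(n_obs), n_train))
    train_set = set(train)
    test = [i for i in range(n_obs) if i not in train_set]
    return train, test


# 30 observations at a 70% sampling rate give 21 training and 9 test rows.
train_idx, test_idx = split_train_test(n_obs=30, rate=0.7, seed=42)
print(len(train_idx), len(test_idx))  # 21 9
```

Fixing the seed keeps the split stable across runs, which matters when comparing models trained on the same partition.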
Object Detection Walkthrough
Object Detection Pipeline - End-to-end computer vision:
- Synthetic shape data generation with bounding-box annotations
- Faster R-CNN fine-tuning with PROC CVISION
- Model store persistence and serving
- Composable %MACRO pipeline with %LET-driven configuration
Comprehensive Walkthrough
StatLang Walkthrough - Complete feature demonstration:
- All statistical procedures with examples
- Macro system demonstrations
- Format system usage
- Advanced data manipulation techniques
Project Structure
StatLang/
├── stat_lang/ # Core Python package
│ ├── __init__.py
│ ├── interpreter.py # Main interpreter
│ ├── cli.py # Command line interface
│ ├── pipeline.py # End-to-end pipeline runner
│ ├── kernel/ # Jupyter kernel implementation
│ │ ├── statlang_kernel.py
│ │ └── install.py
│ ├── parser/ # Syntax parsers
│ │ ├── data_step_parser.py # DATA step (MERGE, ARRAY, DO, etc.)
│ │ ├── proc_parser.py # Generic PROC option scanner
│ │ └── macro_parser.py
│ ├── procs/ # 38+ procedure implementations
│ │ ├── proc_means.py # Statistical procs
│ │ ├── proc_reg.py
│ │ ├── proc_glm.py
│ │ ├── proc_dnn.py # Deep learning procs
│ │ ├── proc_cvision.py # Computer vision / object detection
│ │ ├── proc_export.py # Data management procs
│ │ └── ...
│ └── utils/
│ ├── expression_evaluator.py
│ ├── macro_processor.py # Macro engine
│ ├── model_store.py # In-memory + pickle model store
│ ├── data_utils.py
│ └── libname_manager.py
├── tests/ # Test suite (55+ tests)
├── examples/ # Example notebooks & scripts
├── vscode-extension/ # VS Code extension
├── media/ # Logo and icons
├── pyproject.toml # Package config & dependencies
└── README.md
Development
Setup Development Environment
git clone https://github.com/Stryve-Analytics/StatLang.git
cd StatLang
pip install -e ".[dev]"
Running Tests
# Run the full test suite
pytest
# With verbose output
pytest -v --tb=short
Linting & Type Checking
# Lint
ruff check stat_lang tests --select E,F,I --ignore E501
# Type check
mypy stat_lang tests
Contributing
We welcome contributions! Please see CONTRIBUTING.md for guidelines.
Areas for Contribution
- Additional statistical procedures
- Macro functionality enhancements
- Performance optimisations
- VS Code extension features
- Documentation and examples
License
MIT License - see LICENSE for details.